2009/12/29
2009/12/27
2009/11/26
2009/11/22
2009/11/20
2009/11/15
Friday 4 December 2009
8:30-9:15 | Registration |
9:15-9:30 | Opening |
9:30-10:30 | Invited lecture: Eva Hajičová From Prague Structuralism to Treebank Annotation |
10:30-11:00 | Coffee break |
11:00-12:30 | Session A Chair: Koenraad De Smedt |
11:00-11:30 | Federico Sangati and Chiara Mazza An English Dependency Treebank à la Tesnière |
11:30-12:00 | Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen and Tapio Salakoski Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank |
12:00-12:30 | Markus Dickinson and Marwa Ragheb Dependency Annotation for Learner Corpora |
12:30-14:00 | Lunch |
14:00-15:30 | Session B Chair: Adam Przepiórkowski |
14:00-14:30 | Jörg Tiedemann and Gideon Kotzé Building a Large Machine-Aligned Parallel Treebank |
14:30-15:00 | Marie Mikulová and Jan Štěpánek Annotation Quality Checking and Its Implications for Design of Treebank (in Building the Prague Czech-English Dependency Treebank) |
15:00-15:30 | Alina Wróblewska and Anette Frank Cross-Lingual Projection of LFG F-Structures: Building an F-Structure Bank for Polish |
15:30-16:00 | Coffee break |
16:00-17:30 | Poster session |
Eduard Bejček, Pavel Straňák and Jan Hajič Finalising Multiword Annotations in PDT | |
Kristýna Čermáková, Lucie Mladová, Eva Fučíková and Kateřina Veselá Annotation of Selected Non-dependency Relations in a Dependency Treebank | |
Barbara McGillivray Selectional Preferences from a Latin Treebank | |
Helge Dyvik, Paul Meurer, Victoria Rosén and Koenraad De Smedt Linguistically Motivated Parallel Parsebanks |
Saturday 5 December 2009
9:30-10:30 | Invited lecture: Roberto Busa SJ From Punched Cards to Treebanks: 60 Years of Computational Linguistics |
10:30-11:00 | Coffee break |
11:00-12:30 | Session C Chair: Anette Frank |
11:00-11:30 | David Bamman, Francesco Mambrini and Gregory Crane An Ownership Model of Annotation: The Ancient Greek Dependency Treebank |
11:30-12:00 | Johan Bos, Cristina Bosco and Alessandro Mazzei Converting a Dependency Treebank to a Categorial Grammar Treebank for Italian |
12:00-12:30 | Torsten Marek, Gerold Schneider and Martin Volk A Declarative Formalism for Constituent-to-Dependency Conversion |
12:30-14:00 | Lunch |
14:00-15:30 | Session D Chair: Victoria Rosén |
14:00-14:30 | Seth Kulick and Ann Bies Treebank Analysis and Search Using an Extracted Tree Grammar |
14:30-15:00 | Adam Przepiórkowski TEI P5 as an XML Standard for Treebank Encoding |
15:00-15:30 | Ines Rehbein, Josef Ruppenhofer and Jonas Sunde MaJo - A Toolkit for Supervised Word Sense Disambiguation and Active Learning |
15:30-16:00 | Coffee break |
16:00-17:30 | Session E Chair: Charles J. Fillmore |
16:00-16:30 | Karin Harbusch and Gerard Kempen Clausal Coordinate Ellipsis and its Varieties in Spoken German: A Study with the TüBa-D/S Treebank of the VERBMOBIL Corpus |
16:30-17:00 | Jana Šindlerová and Ondřej Bojar Towards English-Czech Parallel Valency Lexicon via Treebank Examples |
17:00-17:30 | António Branco, Sara Silveira, Sérgio Castro, Mariana Avelãs, Clara Pinto and Francisco Costa Dynamic Propbanking with Deep Linguistic Grammars |
17:30-17:45 | Closing session |
2009/11/11
2009/11/10
2009/11/01
2009/10/31
2009/10/28
2009/10/23
2009/10/20
2009/10/19
2009/10/18
2009/10/11
2009/10/09
2009/10/03
2009/10/01
2009/09/28
2009/09/26
2009/09/23
2009/08/31
2009/08/21
2009/08/12
2009/08/07
2009/07/30
SVM
general: http://www.support-vector-machines.org/
LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (C++)
SVMLIGHT: http://svmlight.joachims.org/ (C++)
BIOJAVA: http://www.biojava.org/ (The package org.biojava.stats.svm contains SVM classification and regression.) (JAVA)
Package pt.tumba.ngram.svm: http://tcatng.sourceforge.net/javadocs/pt/tumba/ngram/svm/package-summary.html (JAVA)
The following packages either implement SVM by themselves or wrap some SVM packages written in C/C++.
RapidMiner: http://rapid-i.com/
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
MALLET: http://mallet.cs.umass.edu/
MINORTHIRD: http://minorthird.sourceforge.net/
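To sanity-check any of these packages, here is a minimal training/prediction sketch. It uses scikit-learn's SVC (which wraps LIBSVM) rather than any of the listed packages' native APIs, and the toy data is invented:

```python
# Minimal linear SVM sketch using scikit-learn's SVC, which wraps LIBSVM.
# Toy problem: two linearly separable classes in 2D.
from sklearn.svm import SVC

X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y_train = [0, 0, 1, 1]          # class depends only on the first feature

clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

X_test = [[0.1, 0.5], [0.9, 0.5]]
print(clf.predict(X_test).tolist())   # -> [0, 1]
```

The same call pattern (fit on vectors + labels, then predict) applies whichever wrapper from the list above is used.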
2009/07/28
2009/07/21
Pascal Challenge
Large Scale Hierarchical Text classification
Web site: http://lshtc.iit.demokritos.gr/
Email: lshtc_info@iit.demokritos.gr
We are pleased to announce the launch of the Large Scale Hierarchical Text classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a
hierarchical text classification competition using large datasets based on the ODP Web directory data (www.dmoz.org).
Hierarchies are becoming ever more popular for the organization of text
documents, particularly on the Web. Web directories are an example. Along with their widespread use, comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data. The reasons for this are the simultaneous increase in the number of classes and their hierarchical organization. The latter leads to a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for the learning methods.
The challenge will consist of four tasks with partially overlapping data. Information regarding the tasks and the challenge rules can be found at the challenge Web site, under the "Tasks, Rules and Guidelines" link.
We plan a two-stage evaluation of the participating methods: one measuring classification performance and one measuring computational performance. It is important to measure both, as they are interdependent. The results will be included in a final report about the challenge, and we also aim at organizing a special NIPS'09 workshop.
In order to register for the challenge and gain access to the datasets,
please create a new account at the challenge Web site.
Key dates:
Start of testing: July 10, 2009.
End of testing, submission of executables and short papers: September 29, 2009.
End of scalability test and announcement of results: October 25, 2009.
NIPS'09 workshop (subject to approval): December 11-12, 2009
Organisers:
Eric Gaussier, LIG, Grenoble, France
George Paliouras, NCSR "Demokritos", Athens, Greece
Aris Kosmopoulos, NCSR "Demokritos", Athens, Greece
Sujeevan Aseervatham, LIG, Grenoble & Yakaz, Paris, France
2009/07/13
2009/07/09
2009/07/03
2009/07/02
Sanity test
training:
1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l1
1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 c c c DT_C RBR_C _ _ 1 1 NMOD_C NMOD_C _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3
testing:
1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3
1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3
system output:
1 d d _ CD_D _ _ _ 0 0 NMOD_D _ Y D _
2 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3
1 a a _ CD_A _ _ _ 0 0 NMOD_A _ Y A _
2 b b _ DT_B _ _ _ 1 1 NMOD_B _ _ _ l1
3 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3
results:
correct = 2 wrong = 1
as I expected.
5 hours to execute the English data set.
I found double spaces in the training set; that is why the input and output files have different numbers of lines.
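The sanity-test comparison can be sketched as follows. This is a hypothetical re-implementation, not the official CoNLL scorer, and it assumes the argument label being scored is the last column of each token line:

```python
# Minimal sketch of the sanity-test scorer: compare the last column
# (the argument label) of gold and system CoNLL-2009-style token lines.
# Hypothetical helper, not the official evaluation script.

def score(gold_lines, sys_lines):
    correct = wrong = 0
    for g, s in zip(gold_lines, sys_lines):
        g_cols, s_cols = g.split(), s.split()
        if not g_cols:                      # blank sentence separator
            continue
        if g_cols[-1] == "_":               # no argument label on this token
            continue
        if g_cols[-1] == s_cols[-1]:
            correct += 1
        else:
            wrong += 1
    return correct, wrong

gold = ["1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _",
        "2 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3",
        "",
        "1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _",
        "2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l2",
        "3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3"]
sysout = ["1 d d _ CD_D _ _ _ 0 0 NMOD_D _ Y D _",
          "2 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3",
          "",
          "1 a a _ CD_A _ _ _ 0 0 NMOD_A _ Y A _",
          "2 b b _ DT_B _ _ _ 1 1 NMOD_B _ _ _ l1",
          "3 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3"]

print(score(gold, sysout))   # -> (2, 1)
```

On the testing/system-output pair above this reproduces the "correct = 2 wrong = 1" result (l3 matches twice, l2 vs l1 is the one error).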
2009/07/01
2009/06/24
2009/06/23
2009/06/19
2009/06/16
Postgraduate skills development summer school, analysis
Workshops:
Thesis writing process (sciences)
-> I learned that I should write something every week.
Time management & overcoming procrastination
-> nothing new.
Get your head around the PG experience (SCS)
-> lots of good recommendations to try;
starting to put them into practice from today.
Viva preparation
-> good recommendations for preparing the transfer report.
2009/06/15
2009/06/12
2009/06/11
programming
loading training data
loading testing data
loading testing data2
removing labels
mutate relations
relabeling
writing output.txt
correct = 13772 wrong = 93
program finished properly
melody > pwd
/shared/teaching/CSLL/4thYrProjects/SRL_task/HectorCoNLL2009Software
2009/06/10
2009/06/09
summer school trinity
Please select your options for both dates below
* One asterisk indicates this session is suitable for students in the earlier stages of their degree.
** Two asterisks indicate this session is suitable for students in the later stages of their degree.
15th June | |||
---|---|---|---|
09:30 - 10:00 | Registration Outside the JM.Synge Theatre (previously Walton Lecture Theatre) | ||
10:00 - 10:30 | Opening Session Professor Carol O’Sullivan, Dean of Graduate Studies - JM.Synge Theatre | ||
10:30 - 11:00 | Meet & Greet Session | ||
11:00 - 13:00 | Session A - Career Planning for Postgraduate Students* Room 4050A | Session B - Thesis Writing Process for Arts/Humanities** Room 4050B | Session B - Thesis Writing Process for Sciences** Room 5039 | Session C - Planning a Thesis Using Word Room 1013 | Session D - Presentation Skills for Postgraduate Students Room 5025 | Session E - Systematic Approaches to Literature Reviewing* Room 5052 - Fully Booked |
13:00 - 14:00 | Lunch | ||
14:00 - 16:00 | Session A - Job Hunting Essentials for Postgraduates | Session B - In your own words: citing with confidence and avoiding plagiarism* Room 4050B | Session C - Preparing an Article for Publication (Sciences) Room 5039 | Session C - Preparing an Article for Publication (Arts/Humanities)** Room 3126 | Session D - Effective Presentations Using PowerPoint Room 1013 | Session E - EndNote for Beginners* Berkeley Library |
16:15 - 17:00 | General Session - Life as a Postgrad - Q & A Discussion Panel with Postgraduate Advisory Service - JM.Synge Theatre Submit your question in advance here | ||
17:00 | Reception Drink - Pavillion | ||
16th June | |||
10:00 - 12:00 | Session A - Time Management & Overcoming Procrastination* Room 4050A - Fully Booked | Session B - Creating Your Own Research/Writing Support Group Room 4050B | Session C - Moved to afternoon | Session D - Getting Your Head Around Your PG Experience Room 5025 | Session E - EndNote for Beginners* Berkeley Library - Fully Booked |
12:00 - 13:30 | Lunchtime Reception & Exhibition | ||
13:30 - 15:30 | Session A - Time Management & Overcoming Procrastination* Room 4050A | Session B - Developing Critical Arguments** Room 4050B | Session C - Creating Effective Conference Posters IS Services Room, Pearse St - Fully Booked | Session D - Viva Preparation** Room 5025 | Session E - Copyright and Intellectual Property for Research Room 5052 | Session F - An Insider's Guide to Getting Published in Research Journals** Room 5039 |
15:45 - 16:30 | General session - "Motivation, Critical Thinking and Decision Making" by Dr. Kevin Thomas, School of Psychology - JM.Synge Theatre | ||
16:30 - 16:45 | Closing Session - JM.Synge Theatre | ||
17:00 | Evening Function - GSU Reception at GSU Common Room |
Please print the confirmation page you receive when you select the Register Button below.
A confirmation email will be sent to you within 3 days.
2009/06/08
reading slides about srl : www.denizyuret.com/ref/yih/SRL-Tutorial-hlt-naacl-06.pdf
2pm to 4pm
example of srl
attach labels (syntactic and semantic)
shared task -> they defined the labels
corpora:
* PropBank
* VerbNet
* FrameNet
explain pruning
argument identification
labeling
predicate word: select the sense of the word
data formats:
1- constituent structure trees (2005)
2- dependency structures (2009)
SYSTEMS:
- tree distance -> hector
- tree kernel -> liliana
- graph matching
- conventional: maximum entropy, conditional random fields
next week:
presentation on svm & knn
FrameNet is a dictionary
PropBank is a corpus
everyone should see the work of the others
OUTPUT of the directed studies:
c++ system
presentation
report
produce new ideas
homework:
basic statistics:
how often A0 appears
make a better descriptive list of what has been done.
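The first homework statistic can be sketched like this. The sample lines are invented, and I am assuming the CoNLL-2009 layout where the predicate-argument (APRED) columns start at the 15th column (index 14, 0-based):

```python
# Sketch of the homework statistic: how often each argument label
# (A0, A1, ...) appears in a CoNLL-2009-style file. Assumes APRED
# labels live in columns 15 onward (0-based index 14), one per predicate.
from collections import Counter

def label_counts(lines):
    counts = Counter()
    for line in lines:
        cols = line.split("\t") if "\t" in line else line.split()
        for label in cols[14:]:             # one column per predicate
            if label != "_":
                counts[label] += 1
    return counts

sample = ["1 John John John NNP NNP _ _ 2 2 SBJ SBJ _ _ A0",
          "2 eats eat eat VBZ VBZ _ _ 0 0 ROOT ROOT Y eat.01 _",
          "3 apples apple apple NNS NNS _ _ 2 2 OBJ OBJ _ _ A1"]
print(label_counts(sample))   # -> Counter({'A0': 1, 'A1': 1})
```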
2009/06/05
Fwd: Materials for SRL Directed Studies
If you take a look at
www.cs.tcd.ie/Martin.Emms/SRL_Module
you'll find some web-pages relating to this. In particular in the
section 'Details', there's an evolving set of links to materials. You could
profitably take a look at the links from the first 3 sections before Monday.
# Semantic Role Labeling overview
# A digest of SRL 'shared-tasks' to date
# 'semantic' corpora and lexica
see you Monday
Martin
2009/06/04
2009/06/03
2009/05/28
2009/05/27
2009/05/26
2009/05/25
Information Extraction: Algorithms and Prospects, at an information retrieval summer school in Padova
You were registered at our conference on EUROPEAN SUMMER SCHOOL IN INFORMATION RETRIEVAL 2009.
2009/05/24
2009/05/23
2009/05/22
Towards Emotional Sensitivity in Human-Computer Interaction
2009/05/21
SRL 1.0.
2009/05/20
2009/05/19
2009/05/18
2009/05/15
2009/04/26
2009/04/22
2009/04/21
2009/04/20
reading group
2009/04/19
Designing the SRL system,
but it will not be able to predict semantic relations
2009/04/16
2009/04/14
2009/04/13
2009/04/10
2009/04/08
Fwd: [SIGIR2009-Poster] Your Paper #169
We regret to inform you that your poster submission
Title: Estimating performance of text classification
has not been accepted for the SIGIR 2009 Poster Track.
The review process was extremely selective and many submissions could
not be accepted for the final program. Out of the 256 poster submissions,
the program committee selected only 86 posters, an acceptance rate of about
34%.
The reviews for your submission are included below. Each poster was
reviewed by at least three reviewers. Final poster decisions were made
by the poster co-chairs.
The conference program and registration details will be available on
the conference website shortly at:
We hope to see you in Boston in July. If you plan on attending the
conference, then it is important to note that all visitors to the US require
a visa or a visa waiver. The starting point is the page
https://esta.cbp.dhs.gov/esta/esta.html,
through which citizens of many countries can obtain a visa waiver.
Thank you again for submitting your poster to SIGIR2009-Poster.
Best regards,
The SIGIR2009-Poster Program Chairs
Jimmy Lin and Don Metzler
------------- Review from Reviewer 1 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 3
Originality of work (1-5, accept threshold=3) : 3
Quality of work (1-5, accept threshold=3) : 3
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 3
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 2
Confidence in review (1-4) : 2
-- Comments to the author(s):
This paper takes a look at the relationship between the number of
classes and accuracy regarding classification into multiple classes.
Note that some of the equations came out as gibberish (in Adobe Reader
7) - i.e. some of the text in the last para of Section 2.
-- Summary:
This seems like a small work toward looking at text classifier
performance. For me, it seemed more like a core dump of bits of
information. I would recommend that the authors try to make it clear
what the real contribution of this paper is in the future.
---------- End of Review from Reviewer 1 ----------
------------- Review from Reviewer 2 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of work (1-5, accept threshold=3) : 3
Quality of work (1-5, accept threshold=3) : 2
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 2
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 2
Confidence in review (1-4) : 4
-- Comments to the author(s):
The paper studied the relationship between the number of classes and
the classification accuracy. The problems of the paper are listed
below.
(1) For multi-class classification, the numbers of samples in
different categories are often very imbalanced, which can have big
effects on the classification accuracy. However, this important factor
is ignored in the paper.
(2) In Figure 1, it is not clear to judge that naïve Bayes performed
better than kNN.
(3) The presentation is not good. Section 2 needs to be well
re-organized and greatly polished. The English needs much improvement.
-- Summary:
There is a big technical problem in the paper, and the presentation is bad.
---------- End of Review from Reviewer 2 ----------
------------- Review from Reviewer 3 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 3
Originality of work (1-5, accept threshold=3) : 1
Quality of work (1-5, accept threshold=3) : 1
Adequacy of Citations (1-5, accept threshold=3) : 1
Quality of presentation (1-5, accept threshold=3) : 1
Impact of Ideas or Results (1-5, accept threshold=3) : 1
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 1
Confidence in review (1-4) : 6
-- Comments to the author(s):
The poster analyses the relation between the expected accuracy of
classifiers and the number of classes.
It describes an incremental algorithm for estimating the accuracy of
classifiers for a given classification problem.
Some experiments are performed on a small dataset.
The paper is not understandable in its present form and should be
rewritten. Definitions should be provided (e.g. epistasis or synergy
of a split), the algorithm should be carefully described.
-- Summary:
Paper should be completely rewritten. The present version cannot be understood.
---------- End of Review from Reviewer 3 ----------
2009/04/07
reading group
Labeling" on Monday the 6th of April (from 4:00 to 5:00)
2009/04/06
2009/04/05
2009/04/04
CLUSTERING BY TREE DISTANCE FOR PARSE TREE NORMALISATION:
Written by Martin Emms.
Notes by Hector Franco.
0 Abstract
Potential application: transformation of interrogative to indicative sentences -> a step in question answering. | A tree distance is proposed -> find a pattern tree that summarizes the cluster.
1 Introduction
Previous work:
Question-answering with tree-distance.
1 take a parse-structure from a question, 2 match it up to parse-structures of candidate answers.
Normalization: change passive structures to active structures: interrogative to indicative.
Popular parser: Collins probabilistic. Trained on Penn Treebank.
Trees are not assigned in accordance with any finite grammar.
…
Simple transformations -> induced mentally (very boring).
Method described:
Parse structures can be hierarchically clustered by tree-distance, and a kind of centroid tree for a chosen cluster can be generated which exemplifies typical traits of trees within the cluster.
2 Tree distance
Concepts:
Source and target trees
Preserve left to right order and ancestry.
Descendant.
(No sense summarizing this; just see the original.)
2.1. question answering by tree distance.
Answers ranked according to the tree-distance from the questions.
QATD : question answering by tree distance.
Additional methods: query-expansion, query-type identification, named entity recognition.
Syntactic structures group items semantically related.
Syntactic structures might encode or represent a great deal that is not semantic in any sense.
Variants of tree distance:
Sub-tree: the cost of the least-cost mapping from a sub-tree of the source.
Sub-traversal: the least-cost mapping from a sub-traversal of the left-to-right post-order traversal of the source.
Structural weights: weights according to the syntactic structure.
Wild cards: can have zero-cost matching. ???????????
Lexical emphasis: leaf nodes have weights which are scaled up in comparison to nodes which are internal to the tree.
String distance: if source and target are encoded as strings, the string distance coincides with the tree distance. ??????
Results:
Tree distance which uses sub-trees, weights, wild-cards and lexical emphasis is better than sub-string distance, and each parameter improves it.
???????
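The tree-distance idea can be written down naively. This is my own exponential-time sketch with unit costs, not the efficient dynamic-programming algorithm the paper relies on, and the toy trees are invented:

```python
# Naive tree edit distance sketch (unit costs): minimum number of node
# deletions, insertions and relabellings between two ordered trees,
# preserving left-to-right order and ancestry. Exponential-time
# recursion over forests; fine only for tiny trees.
from functools import lru_cache

# a tree is (label, (child, child, ...)); a forest is a tuple of trees
def size(forest):
    return sum(1 + size(t[1]) for t in forest)

@lru_cache(maxsize=None)
def forest_dist(f, g):
    if not f:
        return size(g)        # insert everything remaining in g
    if not g:
        return size(f)        # delete everything remaining in f
    (l, fc), fr = f[-1], f[:-1]
    (m, gc), gr = g[-1], g[:-1]
    return min(forest_dist(fr + fc, g) + 1,                 # delete l
               forest_dist(f, gr + gc) + 1,                 # insert m
               forest_dist(fr, gr) + forest_dist(fc, gc)
                   + (l != m))                               # (re)label

def tree_dist(t1, t2):
    return forest_dist((t1,), (t2,))

a = ("S", (("NP", ()), ("VP", (("V", ()),))))
b = ("S", (("NP", ()), ("VP", (("V", ()), ("NP", ())))))
print(tree_dist(a, a))   # -> 0
print(tree_dist(a, b))   # -> 1  (one extra NP node inserted)
```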
3 Clustering by tree distance
Used the agglomerative clustering algorithm: pick the pair of clusters with minimal distance and merge them into a single one.
Agglomerative coefficient: measure of overall quality.
S(q): cluster of q
Merge_dist(q): intercluster distance.
Agglomerative coefficient AC = merge_dist/Df; 1 is the best (range 0 to 1).
Giving different weights gives better results (head/complement/adjunct/…).
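The merge loop above can be sketched as follows. This is a generic single-linkage toy on 1-D points, invented for illustration, not the paper's tree-distance clustering:

```python
# Naive agglomerative clustering sketch: repeatedly pick the pair of
# clusters at minimal distance and merge them into one. Single linkage
# on toy 1-D points; swap in a tree distance for the paper's setting.

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]
    def linkage(a, b):                      # single linkage
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > n_clusters:
        # find the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

print(agglomerate([1, 2, 10, 11, 50], 3))  # -> [[1, 2], [10, 11], [50]]
```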
4 Deriving a pattern tree from a cluster
How to find the centre point of a cluster (the one with minimal distance to the others).
Distance is Euclidean or cosine.
New function: align_outcome(node i, param b)
b=0 matched perfectly, b=1 substituted, b=2 deleted.
Used to derive an alignment summary tree, align_sum(c)
Final step:
Deletion nodes are deleted
Substitution nodes become wild-card trees.
5 Conclusions and future work
Adaptations of tree distance improve question answering, and cluster quality.
2009/04/03
MULTEXT-East Version 4: multilingual morphosyntactic specifications
Friday talk
multilingual morphosyntactic
POS
Determine ambiguity class
saw – NN, saw – VBD
I saw, a saw (to see / a handsaw)
Popular taggers:
TnT
TreeTagger (decision tree)
TBL: transformation-based tagging
Tag Sets
- Brown
- CLAWS
- PTB.
MSD Morphosyntactic Descriptors
|POS| < |MSD|
Basic Language Resource Kit:
1 specification
2 Lexicon
3 parallel corpus
The talk presents work in progress on the fourth version of the multilingual language resources originating in the MULTEXT and MULTEXT-East projects in the '90s. The resources are focused on language technology oriented morphosyntactic descriptions of languages, i.e. on providing features and tagsets useful for word-level tagging of corpora, what is commonly known as part-of-speech tagging. But unlike English, where »part-of-speech« tagsets number around 50, most other (inflectional, agglutinating) languages have much richer word-level morphosyntactic structures; the tagset for Slovene, for example, has almost 2,000 different tags. The MULTEXT-East resources comprise morphosyntactic specifications, defining the features and their tagsets, lexica, and annotated corpora. Version 3 (2004) is the last released version, with the resources being freely available for research from http://nl.ijs.si/me/ and having been downloaded by over 200 registered users, mostly from universities and research institutions. The talk introduces the XML structure of the specifications in Version 4, to contain data for over 13 languages. We discuss the characteristics of the languages covered, the use of the Text Encoding Initiative Guidelines as the encoding scheme and XSLT in transforming the specifications into other formats. An application of this framework is then given, namely the JOS language resources for Slovene, http://nl.ijs.si/jos/, which provide a manually validated morphosyntactically annotated reference corpus for the language. Finally, the methodology of adding new languages to the specifications is presented.
2009/04/02
evaluation
I was talking with Baoli, and we think it might be a good idea
to start preparing an evaluation for this year.
I was thinking that it would be good to try this question answering task:
http://celct.isti.cnr.it/ResPubliQA/index.php?page=Pages/documents.php&option=newTrackSetup
I will try to find you tomorrow to talk about it
Evaluation forums
Evaluation Forums on http://www.clef-campaign.org/
Cross-Language Evaluation Forum | |
Text Retrieval Conference | |
NII-NACSIS Test Collection for IR Systems | |
INitiative for the Evaluation of XML Retrieval | |
FIRE | Forum for Information Retrieval Evaluation |
Research Programmes
European Network of Excellence in Human Languages Technologies | |
Translingual Information Detection, Extraction and Summarization (DARPA) |
Resources
Evaluations and Language Resources Distribution Association | |
Linguistic Data Consortium | |
United Nations Official Documents Online - consists of 800,000 searchable parallel documents in Arabic, Chinese, English, French, Russian and Spanish, primarily in PDF and / or MS Word format. Resource Shelf summarizes the main features and other UN resources. | |
Free retrieval system from NIST, complete with (simplistic) German and French stemmers | |
SDA data | Training collection used for the TREC6-8 CLIR tracks (password-protected. Please consult the CLEF administration.) |
online dictionaries for 100s of languages | |
(Systran-powered) online machine translation from Altavista | |
On-line translation tools | |
Selected References on Evaluation
(for CLEF papers, see our Website under Publications)
2009/04/01
Web Information Retrieval: Spam Detection and Named Entity Recognition
Spam detection
Linguistic features
Ads: web spam
Fool search engines: for the ranking itself
Challenge:
Complexity
Scale
Co-adaptation.
Blog spam: blogs of hidden links
Attractive keywords
Linguistic analysis
Light-weight linguistic analysis
AIRWeb workshop
Attributes for ML.
Lexical diversity
Syntactical entropy
Labels
- Hosts
- Documents
String distance metrics
Name variations complicate the t…
Permutations, abbreviations, spelling mistakes, declensions
Edit distance metrics:
Levenshtein
Bag distance
Needleman-Wunsch
Smith-Waterman
Smith-Waterman with affine gaps
Common character-level n-grams
q-grams, positional q-grams, skip-grams
longest common substring (LCS)
string distance
Jaro
Jaro-Winkler
JWM
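Of the edit distance metrics listed, Levenshtein is the simplest to sketch; the example names below are invented:

```python
# Levenshtein edit distance: minimal number of insertions, deletions
# and substitutions turning one string into another. Handles the name
# variations mentioned above (e.g. spelling mistakes).

def levenshtein(a, b):
    prev = list(range(len(b) + 1))          # distance from "" to b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("waterman", "watermar"))  # -> 1
print(levenshtein("kitten", "sitting"))     # -> 3
```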
2009/03/31
directed study module
Aim of this directed study module
In various areas of what might be termed intelligent language processing, the question of categorising units and of ranking units for similarity arise.
In Information Extraction, units have to be recognised within sentences that can be seen as filling roles in database entries in some domain you are particularly interested in (business events, terrorist incidents). So units have to be found and given categories such as company, date, weapon etc. Sometimes systems for such categorisations are called Named Entity Recognisers
In Question Answering, given a natural language question (e.g. what command will copy a directory?), an answering natural language sentence is sought within a corpus of natural language text (such as a manual: command cp Dir1 Dir2 will copy the directory Dir1 to location Dir2). Roughly this can be seen as categorising the corpus sentences into the 2 categories of being an answer and not being an answer. More realistically it can be seen as ranking the corpus sentences by their likelihood of being an answer to the question, and this ranking might take the form of computing a similarity between the answer and the question. Often within an overall question-answering system there are sub-systems whose concern is the categorisation of phrases. So units within the answer-corpus might be sought and categorised (such as location, person etc.), and similar or identical categories might be assigned to questions (where was ...: location, who ...: person) and might be used to influence the selection or ranking of answers.
The techniques that could be (and have been) used for such categorisation and similarity-ranking tasks might be classified along the following two dimensions
- Structural
- whether the method makes reference to linguistic structures or not (for example methods referring pretty much to just word-count statistics do not make reference to linguistic structures)
- Machine-Learning
- whether the method is hand-engineered, or instead is a method which uses training data to define the parameters and behaviour of a classifier.
The aim of this directed study module will be to look at techniques for categorisation and similarity ranking which (i) do make reference to linguistic structures and (ii) do make use of machine-learning methods, whilst making comparisons with techniques which do not share these characteristics.
--------------
Possible Plan
Within the area of techniques which do make reference to linguistic structure, and are not hand-engineered, the main contenders are
- Tree Distance
- defining a 'distance' between two trees T1 and T2, basically representing what is left out and relabelled in an ancestry- and linearity-preserving mapping between T1 and T2. For identical trees, the distance is 0. [PRtY04,ZSW+05,KM05,Emm06a,Emm06b]
- Tree Kernels
- defining a so-called 'kernel' function between two trees T1 and T2, which effectively represents a vector product on images of the trees projected into a space, each dimension of which represents a distinct substructure. 'Kernel' functions are often thought of as 'similarity' functions: for identical trees, the 'similarity' is large, and for trees with no shared substructures, it is 0. [ZAR03,ZL03,QMMB07,MGCB05,BM07]
- Support Vector Machine Classification
- using training data on which a kernel function is defined and which fall into 2 categories, a hyperplane is effectively found, defined by a normal vector w and an origin offset b, dividing the space of points x according to whether {\bf w} \bullet {\bf x} + b > 0 or {\bf w} \bullet {\bf x} + b < 0.
Typically w is not directly found, but instead the quantity {\bf w} \bullet {\bf x} + b is expressed by a summation involving the kernel function and a finite subset of the training points, the so-called support vectors
- k-NN classification
- assuming a distance function which can be calculated between a test item and any member of a set of classified training data, rank the training data in order of increasing distance from the test item. In 1-NN classification, return as the category of the test item the category of its nearest neighbour. In k-NN classification, find the k nearest neighbours, and treat this as a panel of votes for various categories, finally applying some weighted voting scheme to derive a winning category from the panel.
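The k-NN scheme just described can be sketched as follows, with plain (unweighted) majority voting and an invented toy 1-D distance:

```python
# k-NN classification sketch: rank training items by distance to the
# test item, take the k nearest as a panel, and return the majority
# vote. Unweighted voting; any distance function can be plugged in.
from collections import Counter

def knn_classify(train, test_item, k, dist):
    # train is a list of (item, category) pairs
    neighbours = sorted(train, key=lambda ic: dist(ic[0], test_item))[:k]
    votes = Counter(cat for _, cat in neighbours)
    return votes.most_common(1)[0][0]

train = [(1.0, "low"), (1.5, "low"), (2.0, "low"),
         (8.0, "high"), (9.0, "high")]
print(knn_classify(train, 1.2, 3, lambda a, b: abs(a - b)))  # -> low
print(knn_classify(train, 8.5, 3, lambda a, b: abs(a - b)))  # -> high
```

Substituting a tree distance for the toy distance gives exactly the tree-distance + k-NN combination the module is about.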
I would propose a plan in which first materials are studied to make sure that the above 4 concepts are reasonably well understood, followed by particular papers in which various combinations of these concepts have been applied to categorisation and ranking tasks.
Here's a first (and still unfinished) draft of the papers/materials to be looked at:
- k-NN and distance functions
WHAT TO READ? STILL DECIDING
- SVMs and tree kernels
WHAT TO READ? STILL DECIDING
- Question Categorisation
- ZL03
- Dell Zhang and Wee Sun Lee.
Question classification using support vector machines.
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.
- QMMB07
- Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili.
Advanced structural representations for question classification and answer re-ranking.
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.
- Question Answering and Text Entailment
- PRtY04
- Vasin Punyakanok, Dan Roth, and Wen-tau Yih.
Natural language inference via dependency tree mapping: An application to question answering.
Computational Linguistics, 2004.
- KM05
- Milen Kouylekov and Bernardo Magnini.
Recognizing textual entailment with tree edit distance algorithms.
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.
- Emm06b
- Martin Emms.
Variants of tree similarity in a question answering task.
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.
- Relation Extraction
- ZAR03
- Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.
Kernel methods for relation extraction.
Journal of Machine Learning Research, 3:1083-1106, 2003.
- ZSW+05
- Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan.
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering.
In IJCNLP, 2005.
- MGCB05
- Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili.
Hierarchical semantic role labeling.
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.
- BM07
- Stephan Bloehdorn and Alessandro Moschitti.
Combined syntactic and semantic kernels for text classification.
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, April 2007.
------------------
Bibliography
- BM07
- Stephan Bloehdorn and Alessandro Moschitti.
Combined syntactic and semantic kernels for text classification.
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, April 2007.
- Emm06a
- Martin Emms.
Clustering by tree distance for parse tree normalisation.
In Proceedings of NLUCS 2006, pages 91-100, 2006.
- Emm06b
- Martin Emms.
Variants of tree similarity in a question answering task.
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.
- KM05
- Milen Kouylekov and Bernardo Magnini.
Recognizing textual entailment with tree edit distance algorithms.
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.
- MGCB05
- Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili.
Hierarchical semantic role labeling.
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.
- PRtY04
- Vasin Punyakanok, Dan Roth, and Wen-tau Yih.
Natural language inference via dependency tree mapping: An application to question answering.
Computational Linguistics, 2004.
- QMMB07
- Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili.
Advanced structural representations for question classification and answer re-ranking.
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.
- ZAR03
- Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.
Kernel methods for relation extraction.
Journal of Machine Learning Research, 3:1083-1106, 2003.
- ZL03
- Dell Zhang and Wee Sun Lee.
Question classification using support vector machines.
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.
- ZSW+05
- Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan.
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering.
In IJCNLP, 2005.
Martin Emms 2009-02-09