2009/12/29

there was an error in the code, so the experiments have to be re-run

2009/12/27

a few days ago I programmed the system so that nodes which are arguments are marked, and forced these to be paired;
I launched four processes, of which 3 have died; tomorrow I will have the results of the last process

2009/11/26

the annual report has been submitted

2009/11/22

reading emails
and changing the report content

2009/11/20

Trial evaluation defense:

attendees:
Liliana
Stephan
Rami
and Martin

2009/11/15


The report is almost finished.

There is an important conference in Milan.





Friday 4 December 2009



8:30-9:15 Registration
9:15-9:30 Opening
9:30-10:30 Invited lecture: Eva Hajičová
From Prague Structuralism to Treebank Annotation
10:30-11:00 Coffee break
11:00-12:30 Session A
Chair: Koenraad De Smedt
11:00-11:30 Federico Sangati and Chiara Mazza
An English Dependency Treebank à la Tesnière
11:30-12:00 Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen and Tapio Salakoski
Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank
12:00-12:30 Markus Dickinson and Marwa Ragheb
Dependency Annotation for Learner Corpora
12:30-14:00 Lunch
14:00-15:30 Session B
Chair: Adam Przepiórkowski
14:00-14:30 Jörg Tiedemann and Gideon Kotzé
Building a Large Machine-Aligned Parallel Treebank
14:30-15:00 Marie Mikulová and Jan Štĕpánek
Annotation Quality Checking and Its Implications for Design of Treebank (in Building the Prague Czech-English Dependency Treebank)
15:00-15:30 Alina Wróblewska and Anette Frank
Cross-Lingual Projection of LFG F-Structures: Building an F-Structure Bank for Polish
15:30-16:00 Coffee break
16:00-17:30 Poster session
Eduard Bejček, Pavel Straňák and Jan Hajič
Finalising Multiword Annotations in PDT
Kristýna Čermáková, Lucie Mladová, Eva Fučíková and Kateřina Veselá
Annotation of Selected Non-dependency Relations in a Dependency Treebank
Barbara McGillivray
Selectional Preferences from a Latin Treebank
Helge Dyvik, Paul Meurer, Victoria Rosén and Koenraad De Smedt
Linguistically Motivated Parallel Parsebanks




Saturday 5 December 2009



9:30-10:30 Invited lecture: Roberto Busa SJ
From Punched Cards to Treebanks: 60 Years of Computational Linguistics
10:30-11:00 Coffee break
11:00-12:30 Session C
Chair: Anette Frank
11:00-11:30 David Bamman, Francesco Mambrini and Gregory Crane
An Ownership Model of Annotation: The Ancient Greek Dependency Treebank
11:30-12:00 Johan Bos, Cristina Bosco and Alessandro Mazzei
Converting a Dependency Treebank to a Categorial Grammar Treebank for Italian
12:00-12:30 Torsten Marek, Gerold Schneider and Martin Volk
A Declarative Formalism for Constituent-to-Dependency Conversion
12:30-14:00 Lunch
14:00-15:30 Session D
Chair: Victoria Rosén
14:00-14:30 Seth Kulick and Ann Bies
Treebank Analysis and Search Using an Extracted Tree Grammar
14:30-15:00 Adam Przepiórkowski
TEI P5 as an XML Standard for Treebank Encoding
15:00-15:30 Ines Rehbein, Josef Ruppenhofer and Jonas Sunde
MaJo - A Toolkit for Supervised Word Sense Disambiguation and Active Learning
15:30-16:00 Coffee break
16:00-17:30 Session E
Chair: Charles J. Fillmore
16:00-16:30 Karin Harbusch and Gerard Kempen
Clausal Coordinate Ellipsis and its Varieties in Spoken German: A Study with the TüBa-D/S Treebank of the VERBMOBIL Corpus
16:30-17:00 Jana Šindlerová and Ondřej Bojar
Towards English-Czech Parallel Valency Lexicon via Treebank Examples
17:00-17:30 António Branco, Sara Silveira, Sérgio Castro, Mariana Avelãs, Clara Pinto and Francisco Costa
Dynamic Propbanking with Deep Linguistic Grammars
17:30-17:45 Closing session

2009/11/11

reading week

2009/11/10

still writing the report

2009/11/01

Semantic RL approaches section written.

2009/10/31

the report was not accepted for the five remaining credits.

I am writing the evaluation

2009/10/28

polishing the SRL report
C++ classes (2nd week)

2009/10/23

read semEval-2010 Task 10

2009/10/20

some bugs fixed in the code.

the experiments need to be run again

2009/10/19

writing the software report

2009/10/18

debugging code and running experiments
yesterday I spent 3 hours programming and 3 hours debugging,
I have added a genetic algorithm to the program so that it searches for the best parameters for measuring the distance between the nodes of a tree.
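The idea can be sketched with a minimal genetic algorithm; the fitness function, population size and mutation rate below are illustrative placeholders, not the values or the cost model of the real system.

```python
import random

def evolve(fitness, n_params, pop_size=20, generations=50,
           mutation_rate=0.2, seed=0):
    """Evolve a weight vector in [0, 1]^n_params that maximises `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_params)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]        # truncation selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_params)    # one-point crossover
            child = a[:cut] + b[cut:]
            for i in range(n_params):           # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.random()
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# toy fitness: recover a known target weight vector (made up for the sketch)
target = [0.1, 0.5, 0.9]
best = evolve(lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target)),
              n_params=3)
```

In the real setup the fitness would be the accuracy of the tree-distance labeller under a given weight vector.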

2009/10/11

I have rewritten the poster references by hand; I hope they come out right this way

2009/10/09

writing the poster,
it seems that all the references I wanted to use are proceedings, so I have to look up the papers

2009/10/03

fixing errors in the sanity test report
modifying the source code

2009/10/01

I have sent the draft of the internal report to Martin
and filled in the wiki with my work

2009/09/28

40 minutes of English
still writing the abstract and the sanity report

2009/09/26

still developing the poster's abstract

2009/09/23

memory allocation bug fixed.
AtomicCost -> needs to be freed
debugging memory allocation in SRL

2009/08/31

PADOVA-ESSIR09

2009/08/21

preparing evaluation for next October,
describing work.

2009/08/12

LaTeX course 3h

2009/08/07

running SRL experiment
training and testing -> same data set.
I want to show that there are only errors when there are multiple trees at the same distance.

the distance must always be 0
and the number of trees at that distance must be more than 1 if the label is wrong
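That expectation can be checked with a toy nearest-neighbour sketch; the strings and Hamming distance below stand in for trees and tree distance and are illustrative only. When the training set is also the test set, every error should occur at distance 0 with more than one item tied at that distance.

```python
def self_test_errors(train, dist):
    """Classify each training item against the whole training set with 1-NN
    and record errors together with the distance and the tie count."""
    errors = []
    for tree, label in train:
        scored = [(dist(tree, t2), l2) for t2, l2 in train]
        dmin = min(d for d, _ in scored)
        ties = [l for d, l in scored if d == dmin]
        if ties[0] != label:                 # arbitrary tie-breaking
            errors.append((tree, dmin, len(ties)))
    return errors

# toy stand-in: identical "trees" with conflicting labels force a tie at 0
train = [("ab", "l1"), ("ab", "l2"), ("de", "l3")]
hamming = lambda a, b: sum(x != y for x, y in zip(a, b))
errors = self_test_errors(train, hamming)
```

Here the only error is the item whose duplicate carries a different label, at distance 0 with two trees tied.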

2009/07/30

Pascal challenge aborted

SVM:

general: http://www.support-vector-machines.org/

LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (C++)
SVMLIGHT: http://svmlight.joachims.org/ (C++)

BIOJAVA: http://www.biojava.org/ (The package org.biojava.stats.svm contains SVM classification and regression.) (JAVA)

Package pt.tumba.ngram.svm: http://tcatng.sourceforge.net/javadocs/pt/tumba/ngram/svm/package-summary.html (JAVA)

The following packages either implement SVM by themselves or wrap some SVM packages written in C/C++.

RapidMiner: http://rapid-i.com/
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
MALLET: http://mallet.cs.umass.edu/
MINORTHIRD: http://minorthird.sourceforge.net/

2009/07/28

started programming the software for the Pascal challenge

2009/07/21

booking a flight to Padova

Pascal Challenge on
Large Scale Hierarchical Text classification

Web site: http://lshtc.iit.demokritos.gr/
Email: lshtc_info@iit.demokritos.gr

We are pleased to announce the launch of the Large Scale Hierarchical Text classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a
hierarchical text classification competition using large datasets based on the ODP Web directory data (www.dmoz.org).

Hierarchies are becoming ever more popular for the organization of text
documents, particularly on the Web. Web directories are an example. Along with their widespread use, comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data. The reasons for this are the simultaneous increase in the number of classes and their hierarchical organization. The latter leads to a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for the learning methods.

The challenge will consist of four tasks with partially overlapping data. Information regarding the tasks and the challenge rules can be found on the challenge Web site, under the "Tasks, Rules and Guidelines" link.

We plan a two-stage evaluation of the participating methods: one measuring classification performance and one computational performance. It is important to measure both, as they are dependent. The results will be included in a final report about the challenge and we also aim at organizing a special NIPS'09 workshop.

In order to register for the challenge and gain access to the datasets,
please create a new account on the challenge Web site.

Key dates:
Start of testing: July 10, 2009.
End of testing, submission of executables and short papers: September 29, 2009.
End of scalability test and announcement of results: October 25, 2009.
NIPS'09 workshop (subject to approval): December 11-12, 2009

Organisers:
Eric Gaussier, LIG, Grenoble, France
George Paliouras, NCSR "Demokritos", Athens, Greece
Aris Kosmopoulos, NCSR "Demokritos", Athens, Greece
Sujeevan Aseervatham, LIG, Grenoble & Yakaz, Paris, France

2009/07/09

writing the SRL report

2009/07/03

correct = 9922 wrong = 3943
program finish properly

2009/07/02

Sanity test:

training:

1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l1

1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 c c c DT_C RBR_C _ _ 1 1 NMOD_C NMOD_C _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

testing:

1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

system output:

1 d d _ CD_D _ _ _ 0 0 NMOD_D _ Y D _
2 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3

1 a a _ CD_A _ _ _ 0 0 NMOD_A _ Y A _
2 b b _ DT_B _ _ _ 1 1 NMOD_B _ _ _ l1
3 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3

results:
correct = 2 wrong = 1
as I expected

5 hours to execute the English data set.

I found double spaces in the training set; that is why the input and output files have different numbers of lines.
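A quick scan like the following (a throwaway sketch, not part of the actual system) would have flagged those double spaces before the run:

```python
def double_space_lines(lines):
    """Return 1-based indices of lines containing a run of two or more
    spaces, which would break a naive split(' ') into CoNLL columns."""
    return [i for i, line in enumerate(lines, 1) if "  " in line]

sample = [
    "1 a a a CD_A",
    "2 b  b b DT_B",   # the double space after "b" shifts every later column
]
```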

2009/07/01

it seems the software is working
if so, it will take 8h to produce results

2009/06/23

fixing errors in my slides
+ english class

2009/06/19

studied English 40 min
since Wednesday I have been preparing a presentation about tree distance and kNN for next Monday

2009/06/16

postgraduate skills development summer school, analysis

a very positive experience

Workshops:
Thesis writing process, sciences
-> I learned that I should write something every week.

Time management & overcoming procrastination
-> nothing new.

Get your head around the PG experience, SCS
-> lots of good recommendations to try.
starting to put them into practice from today.

Viva preparation
-> good recommendations for preparing the transfer report.

2009/06/15

Postgraduate skills development, summer school
first day.

thesis writing process for sciences.

semantic role labeling system meeting

2009/06/12

reading about grammar induction,
and leaving it for a future time

2009/06/11

I should have 25 credits of 30 already.
+ 5 credits coming soon from SRL directed studies

Research methods presentation

programming

melody > ./hh_conll_dist_tester ../Data/CoNLL2009-ST-English-development.txt ../Data/CoNLL2009-ST-English-development.txt
loading training data
loading testing data
loading testing data2
removing labels
mutate relations
relabeling
writing output.txt
correct = 13772 wrong = 93
program finish properly
melody > pwd
/shared/teaching/CSLL/4thYrProjects/SRL_task/HectorCoNLL2009Software

2009/06/10

debugging code
developing slides for research methods

2009/06/09

summer school trinity

Please select your options for both dates below

* One asterisk indicates this session is suitable for students in the earlier stages of their degree.

** Two asterisks indicate this session is suitable for students in the later stages of their degree.

15th June

09:30 - 10:00

Registration

Outside the JM.Synge Theatre (previously Walton Lecture Theatre)

10:00 - 10:30

Opening Session

Professor Carol O’Sullivan, Dean of Graduate Studies - JM.Synge Theatre

10:30 - 11:00

Meet & Greet Session

11:00 - 13:00

Session A - Career Planning for Postgraduate Students* Room 4050A

Session B - Thesis Writing Process for Arts/Humanities** Room 4050B

Session B - Thesis Writing Process for Sciences** Room 5039

Session C - Planning a Thesis Using Word Room 1013

Session D - Presentation Skills for Postgraduate Students Room 5025

Session E - Systematic Approaches to Literature Reviewing* Room 5052 - Fully Booked

13:00 - 14:00

Lunch

14:00 - 16:00

Session A - Job Hunting Essentials for Postgraduate Students** Room 4050A

Session B - In your own words: citing with confidence and avoiding plagiarism* Room 4050B

Session C - Preparing an Article for Publication (Sciences) Room 5039

Session C - Preparing an Article for Publication (Arts/Humanities)** Room 3126

Session D - Effective Presentations Using PowerPoint Room 1013

Session E - EndNote for Beginners* Berkeley Library

16:15 - 17:00

General Session - Life as a Postgrad - Q & A Discussion Panel with Postgraduate Advisory Service - JM.Synge Theatre

17:00

Reception Drink - Pavilion

16th June

10:00 - 12:00

Session A - Time Management & Overcoming Procrastination* Room 4050A - Fully Booked

Session B - Creating Your Own Research/Writing Support Group Room 4050B

Session C - Moved to afternoon

Session D - Getting Your Head Around Your PG Experience Room 5025

Session E - EndNote for Beginners* Berkeley Library - Fully Booked

12:00 - 13:30

Lunchtime Reception & Exhibition

13:30 - 15:30

Session A - Time Management & Overcoming Procrastination* Room 4050A

Session B - Developing Critical Arguments** Room 4050B

Session C - Creating Effective Conference Posters IS Services Room, Pearse St - Fully Booked

Session D - Viva Preparation** Room 5025

Session E - Copyright and Intellectual Property for Research Room 5052

Session F - An Insider's Guide to Getting Published in Research Journals** Room 5039

15:45 - 16:30

General session - "Motivation, Critical Thinking and Decision Making" by Dr. Kevin Thomas, School of Psychology - JM.Synge Theatre

16:30 - 16:45

Closing Session - JM.Synge Theatre

17:00

Evening Function - GSU Reception at GSU Common Room


programming

1h English lesson

preparing research methods presentation

2009/06/08

during the morning:
reading slides about SRL: www.denizyuret.com/ref/yih/SRL-Tutorial-hlt-naacl-06.pdf

2pm to 4pm
example of SRL
attach labels (syntactic and semantic)
shared task -> they defined the labels

corpora:
*PropBank
*VerbNet
*FrameNet

explain pruning
argument identification
labeling

predicate word: select the sense of the word

data formats:
1- constituent structure tree (2005)
2- dependency structure (2009)




SYSTEMS:
  • tree distance -> hector
  • tree kernel -> liliana
  • graph matching
  • conventional: maximum entropy, conditional random fields
error analysis -> Gerard.




next week:
presentation on SVM & kNN


FrameNet is a dictionary
PropBank is a corpus


everyone should see the work of the others

OUTPUT of the directed studies:
c++ system
presentation
report
produce new ideas


homework:
basic statistics:
how often A0 appears
make a better descriptive list of what has been done.
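The label-counting homework can be sketched like this; the column layout below is an assumption for illustration (the rows and the argument-column index are made up, not taken from the real data).

```python
from collections import Counter

def role_counts(rows, first_arg_col):
    """Count non-empty semantic role labels in the argument columns
    of tab-separated CoNLL-style rows; `first_arg_col` is 0-based."""
    counts = Counter()
    for row in rows:
        for label in row.split("\t")[first_arg_col:]:
            if label != "_":
                counts[label] += 1
    return counts

# tiny made-up fragment with a single argument column at index 3
rows = [
    "1\tJohn\tNNP\tA0",
    "2\topened\tVBD\t_",
    "3\tthe\tDT\t_",
    "4\tdoor\tNN\tA1",
]
counts = role_counts(rows, first_arg_col=3)
```

Running the same counter over the full training file, with the real column offset, would give the A0 frequency asked for.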

2009/06/05

rewrote the research methods proposal
talked with Martin
did some bureaucracy

Fwd: Materials for SRL Directed Studies




If you take a look at


  www.cs.tcd.ie/Martin.Emms/SRL_Module


you'll find some web-pages relating to this. In particular in the
section 'Details', there's an evolving set of links to materials. You could
profitably take a look at the links from the first 3 sections before Monday.

# Semantic Role Labeling overview
# A digest of SRL 'shared-tasks' to date
# 'semantic' corpora and lexica

see you Monday

Martin

2009/06/04

1 Spanish lecture, the last one
research methods project done, + personal report

tomorrow, work on SRL

2009/06/03

statistics exam

2009/05/28

research methods + 1 seminar on dynamic grammar
1 Spanish class
Los niños de Rusia

2009/05/27

2009/05/26

writing a recommendation letter as part of the research methods project proposal
studying statistics

2009/05/25

Information Extraction: Algorithms and Prospects in a
Retrieval Context:


SRL

some relations have multiple labels

information retrieval summer school padova

We inform you that on 25 May 2009 at 16.44.47

you were registered for the EUROPEAN SUMMER SCHOOL IN INFORMATION RETRIEVAL 2009.
studying statistics

the examples about paired comparisons seem wrong :s
1h studying English
conditionals
writing Research methods project proposal

2009/05/24

Describing SRL system

2009/05/23

reading "Exploring Multilingual Semantic Role Labeling"
studying English
unit 1

20 min

2009/05/22

writing a Semantic Role Labeling introduction for the Research Methods project proposal

Towards Emotional Sensitivity in Human-Computer Interaction

Friday seminar with Professor Dr. Elisabeth André


HCI
emotional states
SSI
bio-signal data
The Wild Divine
emotional speech




2009/05/21

SRL 1.0.

Commenting the Semantic Role Labeling System code.
removing testing points
describing the code in the upcoming presentation


analysis:
15 days to understand the code (26-4 to 11-5)
11 days to expand the system (11-5 to 21-5)
total: 26 days (looks like quite a lot :s)



research methods, attending lecture
Spanish lecture 1h

2009/05/20

FINALLY THE SRL SYSTEM COMPILES AND WORKS

2009/05/19

editing tests for the code

2009/05/18

debugging the code

2009/05/15

fixing the code of the semantic role labeling system

2009/05/11

developing SRL system
writing code


how to get a tree node from a sub-tree node

2009/04/26

reading c++ code tree-distance

2009/04/23

2h research methods + discussion about the project
1 Spanish class, 1h
applied for a TCHPC account


http://www.tchpc.tcd.ie/support/training/#index1h2

2009/04/22

preparing Spanish lecture
reading dependency tree code.


tree-distance:
/sharde/teaching/CSLL/4th  /ZhangShasha/lib/
Bayesian Learning tutorial

2009/04/21

studied English 1h
bureaucracy.

2009/04/20

reading group

A New Approach to the study of translationese: Machine-learning the difference between original and translated text

MT text categorization
translationese: a dialect

translation:
less lexically dense
more repetition of linguistic features


unigram, bigram, trigram -> window size of n words.

lemma -> root of the word

SVM -> has a capacity for feature selection.

majority voting
recall maximization (at least 1 vote).

pronouns & adverbial forms are the most important.



concepts:
comparable corpus: same topic.
human comparison: same performance as the machine.
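The n-gram features mentioned in these notes are just contiguous token windows; a minimal sketch:

```python
def ngrams(tokens, n):
    """All contiguous windows of n tokens (n=1 unigrams, n=2 bigrams, ...)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the text was translated".split()
bigrams = ngrams(tokens, 2)
```

In the paper's setup, counts of such windows become the feature vectors fed to the SVM.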

2009/04/19

Designing the SRL system:

it will be able to label nodes and to label semantic relations,
but it will not be able to predict semantic relations

2009/04/16

1 spanish lecture
2 h research methods

2009/04/14

chatting about how to design the SRL system.
2 implementations

2009/04/13

reviewing English lessons

Chapter 13: Hidden Markov Models

Introduction to Machine Learning.


urns example
reading mail

2009/04/10

understanding the SRL corpus

2009/04/08

Fwd: [SIGIR2009-Poster] Your Paper #169

We regret to inform you that your poster submission

Title: Estimating performance of text classification

has not been accepted for the SIGIR 2009 Poster Track.

The review process was extremely selective and many submissions could
not be accepted for the final program. Out of the 256 poster submissions,
the program committee selected only 86 posters, an acceptance rate of about
34%.

The reviews for your submission are included below. Each poster was
reviewed by at least three reviewers. Final poster decisions were made
by the poster co-chairs.

The conference program and registration details will be available on
the conference website shortly at:

http://www.sigir2009.org/

We hope to see you in Boston in July. If you plan on attending the
conference, then it is important to note that all visitors to the US require
a visa or a visa waiver. The starting point is the page
https://esta.cbp.dhs.gov/esta/esta.html,
through which citizens of many countries can obtain a visa waiver.

Thank you again for submitting your poster to SIGIR2009-Poster.

Best regards,

The SIGIR2009-Poster Program Chairs
Jimmy Lin and Don Metzler
------------- Review from Reviewer 1 -------------
Relevance to SIGIR (1-5, accept threshold=3)  : 3
Originality of work (1-5, accept threshold=3)  : 3
Quality of work (1-5, accept threshold=3)     : 3
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 3
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold)       : 1
Recommendation (1-6)                          : 2
Confidence in review (1-4)                    : 2

-- Comments to the author(s):
This paper takes a look at the relationship between the number of
classes and accuracy regarding classfication in multiple classes.

Note that some of the equations came out as gibberish (in Adobe Reader
7) - i.e. some of the text in the last para of Section 2.
-- Summary:
This seems like a small work toward looking at text classifier
performance.  For me, it seemed more like a core dump of bits of
information.  I would recommend that the authors try to make it clear
what the real contribution of this paper is in the future.
---------- End of Review from Reviewer 1 ----------
------------- Review from Reviewer 2 -------------
Relevance to SIGIR (1-5, accept threshold=3)  : 4
Originality of work (1-5, accept threshold=3)  : 3
Quality of work (1-5, accept threshold=3)     : 2
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 2
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold)       : 1
Recommendation (1-6)                          : 2
Confidence in review (1-4)                    : 4

-- Comments to the author(s):
The paper studied the relationship between the number of classes and
the classification accuracy. The problems of the paper are listed
below.

(1) For multi-class classification, the numbers of samples in
different categories are often very imbalanced, which can have big
effects on the classification accuracy. However, this important factor
is ignored in the paper.

(2) In Figure 1, it is not clear to judge that naïve Bayes performed
better than kNN.

(3) The presentation is not good. Section 2 needs to be well
re-organized and greatly polished. The English needs much improvement.


-- Summary:
There is a big technical problem in the paper, and the presentation is bad.
---------- End of Review from Reviewer 2 ----------
------------- Review from Reviewer 3 -------------
Relevance to SIGIR (1-5, accept threshold=3)  : 3
Originality of work (1-5, accept threshold=3)  : 1
Quality of work (1-5, accept threshold=3)     : 1
Adequacy of Citations (1-5, accept threshold=3) : 1
Quality of presentation (1-5, accept threshold=3) : 1
Impact of Ideas or Results (1-5, accept threshold=3) : 1
Impact of Resources (1-5, no threshold)       : 1
Recommendation (1-6)                          : 1
Confidence in review (1-4)                    : 6

-- Comments to the author(s):
The poster analyses the relation between the expected accuracy of
classifiers and  the number of classes.

It describes an incremental algorithm for estimating the accuracy of
classifiers for a given classification problem.

Some experiments are performed on a small dataset.

The paper is not understandable in its present form and should be
rewritten. Definitions should be provided (e.g. epistasis or synergy
of a split), the algorithm should be carefully described.


-- Summary:
Paper should be completely rewritten. The present version cannot be understood.
---------- End of Review from Reviewer 3 ----------


2009/04/07

reading 50 emails.
two-hour meeting of the semantic role labeling research group
studying English

reading group

EACL2009 paper "Semi-Supervised Semantic Role
Labeling" on Monday the 6th of April (from 4:00 to 5:00)

2009/04/06

Introduction to teaching and supporting learning for postgraduates who teach, 8:30 to 4:00pm



Teaching theory
Teaching plans
Teaching Delivery
Reflection on Teaching

2009/04/05

2009/04/04

Borderline-SMOTE reading group

going to uni

directed studies 10% complete

CLUSTERING BY TREE DISTANCE FOR PARSE TREE NORMALISATION:

Written by Martin Emms.

Notes by Hector Franco.

0 Abstract

Potential application: transformation of interrogative into indicative sentences, a step in question answering. A tree distance is proposed to find a pattern tree that summarizes each cluster.

1 Introduction

Previous work:

Question-answering with tree-distance.

1 take a parse-structure from a question, 2 match it up to parse-structures of candidate answers.

Normalization: change passive structures to active structures: interrogative to indicative.

Popular parser: Collins probabilistic. Trained on Penn Treebank.

Trees are not assigned in accordance with any finite grammar.

Simple transformations -> induced manually (very boring).

Method described:

Parse structures can be hierarchically clustered by tree-distance, and a kind of centroid tree for a chosen cluster can be generated which exemplifies typical traits of the trees within the cluster.

 

2 Tree distance

Concepts:

Source and target trees

Preserve left to right order and ancestry.

Descendant.

(No sense summarizing; just look at the original.)

2.1. question answering by tree distance.

Answers ranked according to the tree-distance from the questions.

QATD : question answering by tree distance.

Additional methods: query-expansion, query-type identification, named entity recognition.

Syntactic structures group items semantically related.

Syntactic structures might encode or represent a great deal that is not semantic in any sense.

 

Variants of tree distance:

Sub-tree: the cost of the least-cost mapping from a sub-tree of the source.

Sub-traversal: the cost of the least-cost mapping from a sub-traversal of the left-to-right post-order traversal of the source.

Structural weights: weights assigned according to the syntactic structure.

Wild cards: can match with zero cost. ???????????

Lexical emphasis: leaf nodes have weights which are scaled up in comparison to nodes internal to the tree.

String distance: if the source and target are coded as strings, the string distance coincides with the tree distance. ??????

Results:

Tree distance using sub-trees, weights, wild-cards and lexical emphasis is better than sub-string distance, and each parameter improves it.
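The string-distance baseline here is ordinary edit distance (tree distance generalises it by also respecting ancestry). A standard Levenshtein sketch with unit costs:

```python
def edit_distance(s, t):
    """Classic Levenshtein distance (unit insert/delete/substitute costs),
    computed row by row in O(len(s) * len(t))."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute / match
        prev = cur
    return prev[-1]
```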

3 Clustering by tree distance

Uses the agglomerative clustering algorithm: pick the pair of clusters with minimal distance and merge them into a single one.

Agglomerative coefficient: a measure of overall quality.

S(q): cluster of q

Merge_dist(q): inter-cluster distance.

Agglomerative coefficient AC = merge_dist/Df. 1 is the best (range 0 to 1).

Giving different weights gives better results (head/complement/adjunct/…).
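The merge-closest-pair procedure described here can be sketched as follows; single linkage between clusters is a simplifying assumption, and the one-dimensional points stand in for parse trees under tree distance:

```python
def agglomerate(items, dist, k):
    """Naive agglomerative clustering: repeatedly merge the closest pair
    of clusters (single linkage) until only k clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return clusters

# toy data: two obvious groups on the real line
pts = [0.0, 0.1, 0.2, 5.0, 5.1]
out = agglomerate(pts, lambda a, b: abs(a - b), k=2)
```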

4 Deriving a pattern tree from a cluster

How to find the centre point of a cluster (the one with minimal distance to the others).

Distance is Euclidean or cosine.

New function: align_outcome(node i, param b)

b=0: matched perfectly, b=1: substituted, b=2: deleted.

Used to derive an alignment summary tree, align_sum(c)

 

Final step:

Deletion nodes are deleted.

Substitution nodes become wild-card trees.

5 Conclusions and future work

Adaptations of tree distance improve question answering, and cluster quality.

 --finish--

2009/04/03

MULTEXT-East Version 4: multilingual morphosyntactic specifications

Friday talk

multilingual morphosyntactic

POS

Determine ambiguity class

saw – NN, saw – VBD

I saw, a saw (to see / a handsaw)

Popular taggers:

TnT

TreeTagger (decision tree)

TBL: transformation-based tagging.

Tag Sets

  1. Brown
  2. CLAWS
  3. PTB.

MSD Morphosyntactic Descriptors

|POS| < |MSD|

Basic Language Resource Kit:

http://nl.ijs.si/ME/

1 specification

2 Lexicon

3 parallel corpus

The talk presents work in progress on the fourth version of the multilingual language resources originating in the MULTEXT and MULTEXT-East projects in the '90s. The resources are focused on language technology oriented morphosyntactic descriptions of languages, i.e. on providing features and tagsets useful for word-level tagging of corpora, what is commonly known as part-of-speech tagging. But unlike English, where »part-of-speech« tagsets number around 50, most other (inflectional, agglutinating) languages have much richer word-level morphosyntactic structures; the tagset for Slovene, for example, has almost 2,000 different tags. The MULTEXT-East resources comprise morphosyntactic specifications, defining the features and their tagsets, lexica, and annotated corpora. Version 3 (2004) is the last released version, with the resources being freely available for research from http://nl.ijs.si/me/ and having been downloaded by over 200 registered users, mostly from universities and research institutions. The talk introduces the XML structure of the specifications in Version 4, to contain data for over 13 languages. We discuss the characteristics of the languages covered, the use of the Text Encoding Initiative Guidelines as the encoding scheme and XSLT in transforming the specifications into other formats. An application of this framework is then given, namely the JOS language resources for Slovene, http://nl.ijs.si/jos/, which provide a manually validated morphosyntactically annotated reference corpus for the language. Finally, the methodology of adding new languages to the specifications is presented.

demonstration 1h: C++ natural language processing

2009/04/02

evaluation

http://www.clef-campaign.org/

several hours searching for a suitable evaluation
seting up a lab note book
several hours
atending research methods

1h teaching Spanish
+
1h talk on artificial life and the selfish gene

evaluation

Dear Martin,

I was talking with Baoli, and we think that it may be a good idea
to start preparing an evaluation for this year.

I was thinking it would be good to try this question answering task:

http://celct.isti.cnr.it/ResPubliQA/index.php?page=Pages/documents.php&option=newTrackSetup


I will try to find you tomorrow to talk about it.

Evaluation forums

Evaluation Forums on http://www.clef-campaign.org/

CLEF  

Cross-Language Evaluation Forum

TREC

Text Retrieval Conference

NTCIR

NII-NACSIS Test Collection for IR Systems

INEX

INitiative for the Evaluation of XML Retrieval

FIRE

Forum for Information Retrieval Evaluation

Research Programmes

ELSNET

European Network of Excellence in Human Languages Technologies

TIDES

Translingual Information Detection, Extraction and Summarization (DARPA)

Resources

ELRA/ELDA

Evaluations and Language Resources Distribution Association

LDC

Linguistic Data Consortium

ODS

United Nations Official Documents Online - consists of 800,000 searchable parallel documents in Arabic, Chinese, English, French, Russian and Spanish, primarily in PDF and/or MS Word format. Resource Shelf summarizes the main features and other UN resources.

Prise

Free retrieval system from NIST, complete with (simplistic) German and French stemmers

SDA data
German/French/Italian 1988-90
Please consult the README.txt file.

Training collection used for the TREC6-8 CLIR tracks (password-protected. Please consult the CLEF administration.)

Web of Online-Dictionaries

online dictionaries for 100s of languages

Altavista Babelfish

(Systran-powered) online machine translation from Altavista

Google

On-line translation tools

FreeTranslation

On-line translation tools

InterTran

On-line translation tools

Reverso Online

On-line translation tools

Selected References on Evaluation 
(for CLEF papers, see our Website under Publications)

2009/04/01

preparing spanish lecture





directed studies

reading papers on question answering.

c++ projects

1 hour demonstrating

Six challenging projects for the students

https://www.cs.tcd.ie/Martin.Emms/NLP/projects_08_09.pdf

Web Information Retrieval: Spam Detection and Named Entity Recognition

Spam detection

Linguistic features

Ads: web spam

Full search engines: for the ranking itself

Challenge:

Complexity

Scale

Co-adaptation.

 

 

 

Blog spam: blogs of hidden links

Attractive keywords

Linguistic analysis

Light-weight linguistic analysis

AIRWeb workshop (Adversarial Information Retrieval on the Web)

Attributes for ML.

Lexical diversity

Syntactical entropy

Labels

- Hosts

- Documents

 

String distance metrics

 

Name variations complicate the t…

Permutations, abbreviations, spelling mistakes, declensions

 

Edit distance metrics:

Levenshtein

Bag distance

Needleman-Wunsch

Smith-Waterman

Smith-Waterman with affine gaps.
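Sketches of the first two metrics above: the classic dynamic-programming Levenshtein distance, and bag distance as its cheap order-ignoring lower bound. These are standard textbook formulations, not tied to any particular paper.

```python
from collections import Counter

def levenshtein(s, t):
    """Edit distance with unit-cost insert/delete/substitute, one DP row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # delete cs
                           cur[j - 1] + 1,            # insert ct
                           prev[j - 1] + (cs != ct))) # substitute (0 if equal)
        prev = cur
    return prev[-1]

def bag_distance(s, t):
    """Compare character multisets, ignoring order: a fast lower bound on Levenshtein."""
    cs, ct = Counter(s), Counter(t)
    return max(sum((cs - ct).values()), sum((ct - cs).values()))

print(levenshtein("kitten", "sitting"))   # 3
print(bag_distance("kitten", "sitting"))  # 3
```

Bag distance never exceeds Levenshtein, so it is useful for cheaply filtering candidate pairs before running the full DP.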

Common character-level n-grams

q-grams, positional q-grams, skip-grams
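A sketch of the gram-based approach: extract gram sets and compare them with an overlap coefficient (Dice here; Jaccard would do equally well). The skip-gram variant shown, pairs of characters one apart, is just one assumed instance of the idea.

```python
def qgrams(s, q=2, positional=False):
    """Character q-grams of s; the positional variant pairs each gram with its offset."""
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    return set(enumerate(grams)) if positional else set(grams)

def skipgrams(s):
    """Character pairs skipping one position, e.g. 'abc' -> {'ac'} (one assumed variant)."""
    return {s[i] + s[i + 2] for i in range(len(s) - 2)}

def dice(a, b):
    """Dice coefficient over two gram sets: 2|A∩B| / (|A| + |B|)."""
    return 2 * len(a & b) / (len(a) + len(b))

print(dice(qgrams("jaro"), qgrams("jarow")))
```

Gram-set similarity is robust to small local errors: a single typo perturbs only the q grams that overlap it.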

 

longest common substring (LCS)
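The longest common substring (contiguous, unlike the longest common subsequence) has a standard quadratic DP; a minimal sketch:

```python
def lcs_substring(s, t):
    """Longest common contiguous substring via dynamic programming.

    cur[j] = length of the common suffix of s[:i] and t[:j]; track the best."""
    best, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return s[best_end - best:best_end]

print(lcs_substring("question answering", "answer ranking"))  # 'answer'
```

Its length, normalised by the shorter string's length, gives a simple similarity score.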

 

string distance

Jaro

Jaro-Winkler

jwm 
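A sketch of the Jaro and Jaro-Winkler similarities, using the usual constants (matching window of half the longer string minus one, prefix bonus capped at 4 characters, scaling factor p = 0.1):

```python
def jaro(s, t):
    """Jaro similarity: matches within a sliding window, penalised for transpositions."""
    if s == t:
        return 1.0
    window = max(len(s), len(t)) // 2 - 1
    s_match, t_used = [], [False] * len(t)
    for i, c in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_used[j] and t[j] == c:
                t_used[j] = True
                s_match.append(c)
                break
    m = len(s_match)
    if m == 0:
        return 0.0
    t_match = [c for c, used in zip(t, t_used) if used]
    transpositions = sum(a != b for a, b in zip(s_match, t_match)) / 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def jaro_winkler(s, t, p=0.1):
    """Boost Jaro by the length of the shared prefix (capped at 4), scaled by p."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost encodes the observation that name variants tend to agree at the start, which is why Jaro-Winkler is popular for name matching.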

2009/03/31

clustering by tree distance for parse tree normalisation

directed study module

Aim of this directed study module

In various areas of what might be termed intelligent language processing, the questions of categorising units and of ranking units for similarity arise.

In Information Extraction, units have to be recognised in sentences that can be seen as filling roles in database entries in some domain you are particularly interested in (business events, terrorist incidents). So units have to be found and given categories such as company, date, weapon etc. Sometimes systems for such categorisations are called Named Entity Recognisers.

In Question Answering, given a natural language question (eg. what command will copy a directory?), an answering natural language sentence is sought within a corpus of natural language text (such as a manual: command cp Dir1 Dir2 will copy the directory Dir1 to location Dir2). Roughly this can be seen as categorising the corpus sentences into the 2 categories of being an answer and not being an answer. More realistically it can be seen as ranking the corpus sentences for their likelihood of being an answer to the question, and this ranking might take the form of computing a similarity between the answer and the question. Often within an overall question-answering system there are sub-systems whose concern is the categorisation of phrases. So units within the answer corpus might be sought and categorised (such as location, person etc), and similar or identical categories might be assigned to questions (where was ...: location; who ...: person) and might be used to influence the selection or ranking of answers.

The techniques which could be (and have been) used for such categorisation and similarity-ranking tasks might be classified along the following two dimensions

Structural
whether the method makes reference to linguistic structures or not (for example methods referring pretty much to just word-count statistics do not make reference to linguistic structures)
Machine-Learning
whether the method is hand-engineered, or instead is a method which uses training data to define the parameters and behaviour of a classifier.

The aim of this directed study module will be to look at techniques for categorisation and similarity ranking which (i) do make reference to linguistic structures and (ii) do make use of machine-learning methods, whilst making comparisons with techniques which do not share these characteristics.

--------------


Possible Plan

Within the area of techniques which do make reference to linguistic structure, and are not hand-engineered, the main contenders are

Tree Distance
defining a 'distance' $d(S,T)$ between two trees $S$ and $T$, basically representing what is left-out and relabelled in an ancestry and linearity preserving mapping between $S$ and $T$. For identical trees, the distance $d(T,T)$ is 0. [PRtY04,ZSW+05,KM05,Emm06a,Emm06b]
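As a concrete illustration, here is a minimal sketch of one of the simpler members of this family, a Selkow-style top-down edit distance over trees encoded as nested tuples `(label, child, ...)`; it is not necessarily the exact variant used in the cited papers. Relabelling a node costs 1, and deleting or inserting a node drags its whole subtree along.

```python
from functools import lru_cache

def size(t):
    """Number of nodes in a nested-tuple tree (label, child, ...)."""
    return 1 + sum(size(c) for c in t[1:])

@lru_cache(maxsize=None)
def dist(s, t):
    """Selkow-style top-down tree edit distance (a sketch, one simple variant)."""
    cost = int(s[0] != t[0])                 # relabel the root if labels differ
    ss, ts = s[1:], t[1:]
    # sequence edit distance over the two child forests
    d = [[0] * (len(ts) + 1) for _ in range(len(ss) + 1)]
    for i in range(1, len(ss) + 1):
        d[i][0] = d[i - 1][0] + size(ss[i - 1])
    for j in range(1, len(ts) + 1):
        d[0][j] = d[0][j - 1] + size(ts[j - 1])
    for i in range(1, len(ss) + 1):
        for j in range(1, len(ts) + 1):
            d[i][j] = min(d[i - 1][j] + size(ss[i - 1]),        # delete subtree
                          d[i][j - 1] + size(ts[j - 1]),        # insert subtree
                          d[i - 1][j - 1] + dist(ss[i - 1], ts[j - 1]))
    return cost + d[-1][-1]

S = ("S", ("NP", ("Det",), ("N",)), ("VP", ("V",)))
print(dist(S, S))   # 0 for identical trees, as required of a distance
```

The full Zhang-Shasha algorithm allows more general ancestry-preserving mappings, but the recursion above already exhibits the key property $d(T,T)=0$.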

Tree Kernels
defining a so-called 'kernel' function $k(S,T)$ between two trees $S$ and $T$, which effectively represents a vector product $\sigma(S) \bullet \sigma(T)$ on images of the trees projected into a space, each dimension of which represents a distinct substructure. 'kernel' functions are often thought of as 'similarity' functions: for identical trees, the 'similarity' $k(T,T)$ is large, and for trees with no shared substructures, $k(S,T)$ is 0. [ZAR03,ZL03,QMMB07,MGCB05,BM07]
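A minimal subtree-counting kernel in the spirit of the Collins-Duffy parse-tree kernel can make this concrete; the code below is a sketch, not the exact kernel of the cited papers. `C(n1, n2)` counts common fragments rooted at a node pair, and `K` sums it over all pairs.

```python
# Trees as nested tuples (label, child, ...).

def nodes(t):
    """All nodes of a nested-tuple tree, root first."""
    yield t
    for c in t[1:]:
        yield from nodes(c)

def C(n1, n2):
    """Number of common tree fragments rooted at n1 and n2."""
    # the productions must match: same label, same sequence of child labels
    if n1[0] != n2[0] or [c[0] for c in n1[1:]] != [c[0] for c in n2[1:]]:
        return 0
    prod = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        prod *= 1 + C(c1, c2)   # each child: stop here, or extend the fragment
    return prod

def kernel(s, t):
    return sum(C(a, b) for a in nodes(s) for b in nodes(t))

S = ("S", ("NP",), ("VP",))
print(kernel(S, S))              # large for identical trees
print(kernel(S, ("X", ("Y",))))  # 0: no shared substructures
```

This implicitly computes a dot product in a space with one dimension per distinct fragment, without ever enumerating that (exponentially large) space.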

Support Vector Machine Classification
using training data on which a kernel function $k$ is defined and which fall into 2 categories, a hyperplane is effectively found, defined by a normal vector ${\bf w}$ and an origin offset $b$, dividing the space of points according to


${\bf w} \bullet {\bf x} + b > 0 \mbox{\ \ \ or\ \ \ } {\bf w} \bullet {\bf x} + b < 0$

Typically ${\bf w}$ is not directly found, but instead the quantity ${\bf w} \bullet {\bf x} + b$ is expressed by a summation involving the kernel function $k$ and a finite subset of training points, the so-called support vectors.
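A toy illustration of that decision rule, $f({\bf x}) = \sum_i \alpha_i y_i k({\bf x}_i, {\bf x}) + b$: the support vectors, labels, alphas and offset below are made-up values, and the kernel is linear for readability.

```python
def linear_k(x, z):
    """Linear kernel: plain dot product."""
    return sum(a * b for a, b in zip(x, z))

# (support vector x_i, label y_i, weight alpha_i) -- toy values, not learned
support = [((1.0, 1.0), +1, 0.5),
           ((-1.0, -1.0), -1, 0.5)]
b = 0.0

def decision(x, k=linear_k):
    """f(x) = sum_i alpha_i * y_i * k(x_i, x) + b; classify by its sign."""
    return sum(alpha * y * k(xi, x) for xi, y, alpha in support) + b

print(decision((2.0, 0.0)))   # positive: the point falls on the +1 side
```

The point of the kernelised form is that swapping `linear_k` for a tree kernel turns the same classifier into one over parse trees, with ${\bf w}$ never represented explicitly.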

k-NN classification
assuming a distance function $d$ which can be calculated between a test item and any member of a set of classified training data, rank the training data in order of increasing distance from the test item. In 1-NN classification, return as the category of the test item the category of its nearest neighbour. In k-NN classification, find the $k$ nearest neighbours, and treat this as a panel of $k$ votes for various categories, finally applying some weighted voting scheme to derive a winning category from the panel.
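A minimal sketch of the k-NN scheme, with a toy 1-D distance; inverse-distance weighting is just one of the possible voting schemes mentioned above.

```python
from collections import Counter

def knn(test, training, d, k=3):
    """Classify test via its k nearest training items under distance d,
    with inverse-distance-weighted voting (one common weighting scheme)."""
    ranked = sorted(training, key=lambda ex: d(test, ex[0]))[:k]
    votes = Counter()
    for x, label in ranked:
        votes[label] += 1.0 / (1.0 + d(test, x))   # closer neighbours vote harder
    return votes.most_common(1)[0][0]

# toy 1-D data with absolute difference as the distance; k=1 gives 1-NN
training = [(1.0, "A"), (1.5, "A"), (8.0, "B"), (9.0, "B")]
print(knn(2.0, training, lambda a, b: abs(a - b), k=3))  # 'A'
```

Replacing the lambda with a tree distance $d(S,T)$ turns this directly into the tree-distance classifiers discussed above, since k-NN only ever calls $d$.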

I would propose a plan in which first materials are studied to make sure that the above 4 concepts are reasonably well understood, followed by particular papers in which various combinations of these concepts have been applied to categorisation and ranking tasks.

Here's a first (and still unfinished) draft of the papers/materials to be looked at

k-NN and distance functions

WHAT TO READ ? STILL DECIDING

SVMs and tree kernels

WHAT TO READ ? STILL DECIDING

Question Categorisation

ZL03
Dell Zhang and Wee Sun Lee. 
Question classification using support vector machines. 
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.

QMMB07
Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili. 
Advanced structural representations for question classification and answer re-ranking. 
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.

Question Answering and Text Entailment

PRtY04
Vasin Punyakanok, Dan Roth, and Wen tau Yih. 
Natural language inference via dependency tree mapping: An application to question answering. 
Computational Linguistics, 2004.

KM05
Milen Kouylekov and Bernardo Magnini. 
Recognizing textual entailment with tree edit distance algorithms. 
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.

Emm06b
Martin Emms. 
Variants of tree similarity in a question answering task. 
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.

Relation Extraction

ZAR03
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 
Kernel methods for relation extraction. 
Journal of Machine Learning Research, 3:1083-1106, 2003.

ZSW+05
Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. 
In IJCNLP, 2005.

MGCB05
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili. 
Hierarchical semantic role labeling. 
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.

BM07
Stephan Bloehdorn and Alessandro Moschitti. 
Combined syntactic and semantic kernels for text classification. 
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, APR 2007.

------------------

Bibliography

BM07
Stephan Bloehdorn and Alessandro Moschitti. 
Combined syntactic and semantic kernels for text classification. 
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, APR 2007.

Emm06a
Martin Emms. 
Clustering by tree distance for parse tree normalisation. 
In Proceedings of NLUCS 2006, pages 91-100, 2006.

Emm06b
Martin Emms. 
Variants of tree similarity in a question answering task. 
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.

KM05
Milen Kouylekov and Bernardo Magnini. 
Recognizing textual entailment with tree edit distance algorithms. 
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.

MGCB05
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili. 
Hierarchical semantic role labeling. 
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.

PRtY04
Vasin Punyakanok, Dan Roth, and Wen tau Yih. 
Natural language inference via dependency tree mapping: An application to question answering. 
Computational Linguistics, 2004.

QMMB07
Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili. 
Advanced structural representations for question classification and answer re-ranking. 
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.

ZAR03
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 
Kernel methods for relation extraction. 
Journal of Machine Learning Research, 3:1083-1106, 2003.

ZL03
Dell Zhang and Wee Sun Lee. 
Question classification using support vector machines. 
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval, pages 26-32, New York, NY, USA, 2003. ACM.

ZSW+05
Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. 
In IJCNLP, 2005.



Martin Emms 2009-02-09