Research diary: April 2009

2009/04/26

reading the C++ tree-distance code

2009/04/23

2h research methods + discussion about the project

1 Spanish class, 1h

Apply for a TCHPC account

http://www.tchpc.tcd.ie/support/training/#index1h2

2009/04/22

preparing Spanish lecture

reading dependency tree code.

tree-distance:

/sharde/teaching/CSLL/4th /ZhangShasha/lib/

Bayesian Learning tutorial

2009/04/21

Study English 1h

bureaucracy.

2009/04/20

reading group

A New Approach to the study of translationese: Machine-learning the difference between original and translated text

MT text categorization

translationese: dialect

translation:

less lexically dense

order repeat linguistic features

unigram, bigram, trigram -> window of n words.
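
A minimal sketch of the n-gram idea in the note above: slide a window of n words over the token list. The function name `word_ngrams` and the example sentence are made up for illustration; they are not from the paper.

```python
from typing import List, Tuple


def word_ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens, left to right."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical example sentence, split on whitespace.
tokens = "the cat sat on the mat".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
trigrams = word_ngrams(tokens, 3)
```

A text shorter than n words simply yields an empty list.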

lemma -> root of the word

SVM -> has a built-in capacity for feature selection.

majority voting

recall maximization (at least 1 vote).
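
The two ways of combining classifier votes noted above can be sketched as follows. This is only my reading of the notes: `majority_vote` and `any_vote` are hypothetical names, and the votes are made-up booleans (True = "translated").

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Positive only if more than half of the classifiers vote positive."""
    return sum(votes) * 2 > len(votes)


def any_vote(votes: Sequence[bool]) -> bool:
    """Recall-maximizing rule: a single positive vote is enough."""
    return any(votes)


# Hypothetical outputs of three classifiers for one document.
votes = [True, False, False]
by_majority = majority_vote(votes)   # one vote out of three is not a majority
by_any = any_vote(votes)             # one vote is enough here
```

The second rule trades precision for recall: it can only add positives compared with the majority rule.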

pronouns & adverbial forms are the most important features.

concept:

comparable corpus: same topic.

human comparison: same performance as the machine.

2009/04/19

Designing the SRL system:

it will be able to label nodes and to label semantic relations,
but it will not be able to predict semantic relations.

2009/04/16

1 Spanish lecture

2 h research methods

2009/04/14

chatting about how to design the SRL system.
2 implementations

2009/04/13

recall English lessons

13 Hidden Markov models

Introduction to machine learning.

urns example

reading mail

2009/04/10

understanding the SRL corpus

2009/04/08

Fwd: [SIGIR2009-Poster] Your Paper #169

We regret to inform you that your poster submission

Title: Estimating performance of text classification

has not been accepted for the SIGIR 2009 Poster Track.

The review process was extremely selective and many submissions could
not be accepted for the final program. Out of the 256 poster submissions,
the program committee selected only 86 posters, an acceptance rate of about
34%.

The reviews for your submission are included below. Each poster was
reviewed by at least three reviewers. Final poster decisions were made
by the poster co-chairs.

The conference program and registration details will be available on
the conference website shortly at:

http://www.sigir2009.org/

We hope to see you in Boston in July. If you plan on attending the
conference, then it is important to note that all visitors to the US require
a visa or a visa waiver. The starting point is the page
https://esta.cbp.dhs.gov/esta/esta.html,
through which citizens of many countries can obtain a visa waiver.

Thank you again for submitting your poster to SIGIR2009-Poster.

Best regards,

The SIGIR2009-Poster Program Chairs
Jimmy Lin and Don Metzler
------------- Review from Reviewer 1 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 3
Originality of work (1-5, accept threshold=3) : 3
Quality of work (1-5, accept threshold=3) : 3
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 3
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 2
Confidence in review (1-4) : 2

-- Comments to the author(s):
This paper takes a look at the relationship between the number of
classes and accuracy regarding classfication in multiple classes.

Note that some of the equations came out as gibberish (in Adobe Reader
7) - i.e. some of the text in the last para of Section 2.
-- Summary:
This seems like a small work toward looking at text classifier
performance. For me, it seemed more like a core dump of bits of
information. I would recommend that the authors try to make it clear
what the real contribution of this paper is in the future.
---------- End of Review from Reviewer 1 ----------
------------- Review from Reviewer 2 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 4
Originality of work (1-5, accept threshold=3) : 3
Quality of work (1-5, accept threshold=3) : 2
Adequacy of Citations (1-5, accept threshold=3) : 3
Quality of presentation (1-5, accept threshold=3) : 2
Impact of Ideas or Results (1-5, accept threshold=3) : 2
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 2
Confidence in review (1-4) : 4

-- Comments to the author(s):
The paper studied the relationship between the number of classes and
the classification accuracy. The problems of the paper are listed
below.

(1) For multi-class classification, the numbers of samples in
different categories are often very imbalanced, which can have big
effects on the classification accuracy. However, this important factor
is ignored in the paper.

(2) In Figure 1, it is not clear to judge that naïve Bayes performed
better than kNN.

(3) The presentation is not good. Section 2 needs to be well
re-organized and greatly polished. The English needs much improvement.

-- Summary:
There is a big technical problem in the paper, and the presentation is bad.
---------- End of Review from Reviewer 2 ----------
------------- Review from Reviewer 3 -------------
Relevance to SIGIR (1-5, accept threshold=3) : 3
Originality of work (1-5, accept threshold=3) : 1
Quality of work (1-5, accept threshold=3) : 1
Adequacy of Citations (1-5, accept threshold=3) : 1
Quality of presentation (1-5, accept threshold=3) : 1
Impact of Ideas or Results (1-5, accept threshold=3) : 1
Impact of Resources (1-5, no threshold) : 1
Recommendation (1-6) : 1
Confidence in review (1-4) : 6

-- Comments to the author(s):
The poster analyses the relation between the expected accuracy of
classifiers and the number of classes.

It describes an incremental algorithm for estimating the accuracy of
classifiers for a given classification problem.

Some experiments are performed on a small dataset.

The paper is not understandable in its present form and should be
rewritten. Definitions should be provided (e.g. epistasis or synergy
of a split), the algorithm should be carefully described.

-- Summary:
Paper should be completely rewritten. The present version cannot be understood.
---------- End of Review from Reviewer 3 ----------

////////////////////////////////////////////////////
Powered by ConfMaster.net
///////////////////////////////////////////////////

2009/04/07

reading 50 emails.

two hours meeting on semantic role labeling research group

study English

reading group

EACL2009 paper "Semi-Supervised Semantic Role
Labeling" on Monday the 6th of April (from 4:00 to 5:00)

2009/04/06

Introduction to teaching and supporting learning for postgraduates who teach, 8:30am to 4:00pm

Teaching theory
Teaching plans
Teaching Delivery
Reflection on Teaching

2009/04/05

reading variants of tree similarity in a question answering

feature selection

A Comparative Study On Feature Selection In Text2


2009/04/04

Borderline-SMOTE reading group

Borderline-SMOTE


going to uni


directed studies 10% complete

CLUSTERING BY TREE DISTANCE FOR PARSE TREE NORMALISATION:

Written by Martin Emms.

Notes by Hector Franco.

0 Abstract

Potential application: transformation of interrogative to indicative sentences -> a step in question answering. A tree distance is proposed -> find a pattern tree that summarizes the cluster.

1 Introduction

Previous work:

Question-answering with tree-distance.

1. take a parse-structure from a question; 2. match it up to parse-structures of candidate answers.

Normalization: change passive structures to active structures: interrogative to indicative.

Popular parser: Collins probabilistic parser, trained on the Penn Treebank.

Trees are not assigned in accordance with any finite grammar.

…

Simple transformations -> induced mentally, by hand. (Very boring)

Method described:

Parse structures can be hierarchically clustered by tree-distance, and a kind of centroid tree can be generated for a chosen cluster, which exemplifies typical traits of the trees within the cluster.

2 Tree distance

Concepts:

Source and target trees

Preserve left to right order and ancestry.

Descendant.

(No sense summarizing this; just look at the original.)

2.1 Question answering by tree distance

Answers ranked according to the tree-distance from the questions.

QATD : question answering by tree distance.

Additional methods: query-expansion, query-type identification, named entity recognition.

Syntactic structures group items semantically related.

Syntactic structures might encode or represent a great deal that is not semantic in any sense.

Variants of tree distance:

Sub-tree: the cost of the least cost mapping from a sub-tree of the source.

Sub-traversal: the least cost mapping from a sub-traversal of the left-to-right post-order traversal of the source.

Structural weights: weights according to the syntactic structure.

Wild cards: can have zero cost matching, ???????????

Lexical Emphasis: leaf nodes have weights which are scaled up in comparison to nodes which are internal to the tree.

String Distance: if code source and target the string distance coincides with tree distance. ??????

Results:

Tree distance using sub-trees, weights, wild cards and lexical emphasis is better than sub-string distance, and each parameter improves it.

???????

3 Clustering by tree distance

Uses the agglomerative clustering algorithm: repeatedly pick the pair of clusters with minimal distance and merge them into a single one.
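
A rough sketch of that merge loop, to fix the idea. Assumptions that are mine and not from the paper: single-link inter-cluster distance (closest pair of members), plain numbers standing in for parse trees, and absolute difference standing in for tree distance. The name `agglomerate` is hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

Merge = Tuple[frozenset, frozenset, float]


def agglomerate(items: Sequence, dist: Callable) -> List[Merge]:
    """Merge the closest pair of clusters until a single cluster remains.

    Clusters are frozensets of item indices. Returns the merge history as
    (cluster_a, cluster_b, distance) triples, in the order the merges happen.
    """
    clusters = [frozenset([i]) for i in range(len(items))]
    history: List[Merge] = []

    def linkage(a: frozenset, b: frozenset) -> float:
        # Single link (an assumption): distance of the closest pair of members.
        return min(dist(items[i], items[j]) for i in a for j in b)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        history.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return history


# Stand-in data: numbers instead of parse trees.
points = [1.0, 1.5, 5.0, 5.2, 9.0]
history = agglomerate(points, lambda x, y: abs(x - y))
```

With n items the history always has n - 1 merges; swapping in a real tree distance only requires passing a different `dist`.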

Agglomerative coefficient: measure of overall quality.

S(q) cluster of q

merge_dist(q): inter-cluster distance.

Agglomerative coefficient AC: based on merge_dist(q)/Df; ranges from 0 to 1, with 1 the best.
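
My notes only abbreviate the formula. The usual textbook definition of the agglomerative coefficient (Kaufman and Rousseeuw) is the average over all items q of 1 - merge_dist(q)/Df, where merge_dist(q) is the distance at which q is first merged and Df is the distance of the final merge. Assuming the paper follows that definition, a small sketch (function name and numbers are made up):

```python
from typing import Sequence


def agglomerative_coefficient(first_merge_dists: Sequence[float],
                              final_dist: float) -> float:
    """Average over items q of 1 - merge_dist(q) / Df."""
    if final_dist <= 0:
        raise ValueError("final merge distance must be positive")
    total = sum(1.0 - d / final_dist for d in first_merge_dists)
    return total / len(first_merge_dists)


# Hypothetical first-merge distances for five items; final merge at 4.0.
merge_dist = [0.2, 0.2, 0.5, 0.5, 4.0]
ac = agglomerative_coefficient(merge_dist, 4.0)
```

Items that merge early relative to the final merge push AC towards 1, which matches "1 the best".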

Giving different weights gives better results (head/complement/adjunct/…).

4 Deriving a pattern tree from a cluster

How to find the centre point of a cluster (the one with minimal distance to the others).
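
Read literally, "the one with minimal distance to the others" is a medoid. A minimal sketch under that reading, with numbers standing in for trees; this is not the paper's pattern-tree construction, and the name `medoid` is mine.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def medoid(cluster: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return the member whose summed distance to all members is smallest."""
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))


# Stand-in data: numbers instead of parse trees, |a - b| instead of tree distance.
centre = medoid([1.0, 2.0, 3.0, 4.0, 10.0], lambda a, b: abs(a - b))
```

Unlike a mean, a medoid only needs pairwise distances, so it works with any tree distance.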

Distance is Euclidean or cosine.

New function: align_outcome(node i, param b)

b=0 matched perfectly, b=1 substituted, b=2 deleted.

Used to derive an alignment summary tree, align_sum(c)

Final step:

Deletion nodes are deleted

Substitution nodes become wild-card trees.

5 Conclusions and future work

Adaptations of tree distance improve question answering, and cluster quality.

--finish--

2009/04/03

MULTEXT-East Version 4: multilingual morphosyntactic specifications

Friday talk

multilingual morphosyntactic

POS

Determine ambiguity class

saw – NN, saw – VBD

I saw, a saw (Spanish: ver "to see" / serrucho "handsaw")
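
A small sketch of what "determine ambiguity class" means for the "saw" example: collect, for each word form, the set of tags it can take. The tags NN and VBD are my assumption for the noun and past-tense verb readings (Penn Treebank style); the toy corpus and the function name are made up.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple


def ambiguity_classes(tagged: Iterable[Tuple[str, str]]) -> Dict[str, FrozenSet[str]]:
    """Map each lower-cased word form to the set of tags it was seen with."""
    seen = defaultdict(set)
    for word, tag in tagged:
        seen[word.lower()].add(tag)
    return {word: frozenset(tags) for word, tags in seen.items()}


# Toy tagged corpus (hypothetical): "I saw a saw".
corpus = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("saw", "NN")]
classes = ambiguity_classes(corpus)
```

Words with a single-tag class are unambiguous; the tagger only has to choose among the tags in the class for the rest.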

Popular taggers:

TnT

TreeTagger (decision tree)

TBL: transformation-based tagging.

Tag Sets

Brown
CLAWS
PTB.

MSD Morphosyntactic Descriptors

|POS| < |MSD|

Basic Language Resource Kit:

http://nl.ijs.si/ME/

1 specification

2 Lexicon

3 Parallel corpus

The talk presents work in progress on the fourth version of the multilingual language resources originating in the MULTEXT and MULTEXT-East projects in the '90s. The resources are focused on language technology oriented morphosyntactic descriptions of languages, i.e. on providing features and tagsets useful for word-level tagging of corpora, what is commonly known as part-of-speech tagging. But unlike English, where »part-of-speech« tagsets number around 50, most other (inflectional, agglutinating) languages have much richer word-level morphosyntactic structures; the tagset for Slovene, for example, has almost 2,000 different tags. The MULTEXT-East resources comprise morphosyntactic specifications, defining the features and their tagsets, lexica, and annotated corpora. Version 3 (2004) is the last released version, with the resources being freely available for research from http://nl.ijs.si/me/ and having been downloaded by over 200 registered users, mostly from universities and research institutions. The talk introduces the XML structure of the specifications in Version 4, to contain data for over 13 languages. We discuss the characteristics of the languages covered, the use of the Text Encoding Initiative Guidelines as the encoding scheme and XSLT in transforming the specifications into other formats. An application of this framework is then given, namely the JOS language resources for Slovene, http://nl.ijs.si/jos/, which provide a manually validated morphosyntactically annotated reference corpus for the language. Finally, the methodology of adding new languages to the specifications is presented.

demonstration 1h: C++ natural language processing

2009/04/02

evaluation

http://www.clef-campaign.org/

several hours searching for a suitable evaluation

setting up a lab notebook

several hours

attending research methods

1h teaching Spanish

1h talk on artificial life and The Selfish Gene

evaluation

Dear Martin.

I was talking with Baoli, and we think that maybe will be a good idea
to start to prepare a evaluation for this year.

I was thinking that will be good to try this question answering task:

http://celct.isti.cnr.it/ResPubliQA/index.php?page=Pages/documents.php&option=newTrackSetup

I will try to find you tomorrow to talk about it

Evaluation forums

Evaluation Forums on http://www.clef-campaign.org/

CLEF	Cross-Language Evaluation Forum
TREC	Text Retrieval Conference
NTCIR	NII-NACSIS Test Collection for IR Systems
INEX	INitiative for the Evaluation of XML Retrieval
FIRE	Forum for Information Retrieval Evaluation

Research Programmes

ELSNET	European Network of Excellence in Human Languages Technologies
TIDES	Translingual Information Detection, Extraction and Summarization (DARPA)

Resources

ELRA/ELDA	Evaluations and Language Resources Distribution Association
LDC	Linguistic Data Consortium
ODS	United Nations Official Documents Online - consists of 800,000 searchable parallel documents in Arabic, Chinese, English, French, Russian and Spanish, primarily in PDF and / or MS Word format. Resource Shelf summarizes the main features and other UN resources.
Prise	Free retrieval system from NIST, complete with (simplistic) German and French stemmers
SDA data German/French/Italian 1988-90 (please consult the README.txt file)	Training collection used for the TREC6-8 CLIR tracks (password-protected; please consult the CLEF administration)
Web of Online-Dictionaries	online dictionaries for 100s of languages
Altavista Babelfish	(Systran-powered) online machine translation from Altavista
Google	On-line translation tools
FreeTranslation	On-line translation tools
InterTran	On-line translation tools
Reverso Online	On-line translation tools


2009/04/01

preparing spanish lecture

directed studies

reading papers on question answering.

C++ projects

1 hour demonstrating

Six challenging projects for the students

https://www.cs.tcd.ie/Martin.Emms/NLP/projects_08_09.pdf

Web Information Retrieval: Spam Detection and Named Entity Recognition

Spam detection

Linguistic features

Ads: web spam

Fool search engines: for the ranking itself

Challenge:

Complexity

Scale

Co-adaptation.

Blog spam: blogs with hidden links

Attractive keywords

Linguistic analysis

Light-weight linguistic analysis

AIRWeb workshop

Attributes for ML.

Lexical diversity

Syntactical entropy
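
How these two attributes might be computed, as a guess: I take lexical diversity to be the type-token ratio and syntactical entropy to be the Shannon entropy of the POS-tag distribution. The talk's exact definitions are not in my notes, so both are assumptions, and the example tokens and tags are made up.

```python
import math
from collections import Counter
from typing import Sequence


def lexical_diversity(tokens: Sequence[str]) -> float:
    """Type-token ratio: distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy, in bits, of the distribution of labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical spam-like snippet and its (made-up) POS tags.
tokens = "buy cheap pills buy cheap pills now".split()
tags = ["VB", "JJ", "NNS", "VB", "JJ", "NNS", "RB"]
diversity = lexical_diversity(tokens)
tag_entropy = entropy(tags)
```

Repetitive, keyword-stuffed text should score low on both, which is presumably why they are useful as ML attributes.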

Labels

- Hosts

- Documents

String distance metrics

Name variations complicate the t…

Permutations, abbreviations, spelling mistakes, declensions

Edit distance metrics:

Levenshtein

Bag distance

Needleman-Wunsch

Smith-Waterman

Smith-Waterman with affine gaps.
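
For reference, a minimal sketch of the first metric in this list, Levenshtein distance, with unit cost for insertion, deletion and substitution (the standard textbook dynamic programme, not code from the talk):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


d = levenshtein("kitten", "sitting")
```

Needleman-Wunsch and Smith-Waterman follow the same table-filling scheme with different scores and boundary conditions.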

Common character-level n-grams

q-grams, positional q-grams, skip-grams

longest common substring LCS

string distance

Jaro

Jaro-Winkler

jwm
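
A sketch of Jaro and Jaro-Winkler similarity as they are usually defined: matching window of max(len)/2 - 1, half the number of out-of-order matches counted as transpositions, and Winkler's prefix bonus with scaling factor 0.1 over at most 4 characters. The parameter names `p` and `max_prefix` are mine.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        # Look for an unused equal character of t within the window around i.
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


sim_jaro = jaro("MARTHA", "MARHTA")
sim_jw = jaro_winkler("MARTHA", "MARHTA")
```

Both are similarities rather than distances; 1 minus the value gives a distance-like score for name matching.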
