2009/03/31

directed study module

Aim of this directed study module

In various areas of what might be termed intelligent language processing, the questions of categorising units and of ranking units for similarity arise.

In Information Extraction, units have to be recognised in sentences that can be seen as filling roles in database entries in some domain of particular interest (business events, terrorist incidents). So units have to be found and given categories such as company, date, weapon, etc. Systems for such categorisations are sometimes called Named Entity Recognisers.

In Question Answering, given a natural language question (e.g. what command will copy a directory?), an answering natural language sentence is sought within a corpus of natural language text (such as a manual: the command cp Dir1 Dir2 will copy the directory Dir1 to location Dir2). Roughly this can be seen as categorising the corpus sentences into the 2 categories of being an answer and not being an answer. More realistically it can be seen as ranking the corpus sentences for their likelihood of being an answer to the question, and this ranking might take the form of computing a similarity between the answer and the question. Often within an overall question-answering system there are sub-systems whose concern is the categorisation of phrases. So units within the answer-corpus might be sought and categorised (such as location, person, etc.), and similar or identical categories might be assigned to questions (where was ...: location, who ...: person) and might be used to influence the selection or ranking of answers.

The techniques which could be (and have been) used for such categorisation and similarity-ranking tasks might be classified along the following two dimensions:

Structural
whether the method makes reference to linguistic structures or not (for example, methods referring to little more than word-count statistics make no reference to linguistic structures)
Machine-Learning
whether the method is hand-engineered, or instead is a method which uses training data to define the parameters and behaviour of a classifier.

The aim of this directed study module will be to look at techniques for categorisation and similarity ranking which (i) do make reference to linguistic structures and (ii) do make use of machine-learning methods, whilst making comparisons with techniques which do not share these characteristics.

--------------


Possible Plan

Within the area of techniques which do make reference to linguistic structure, and are not hand-engineered, the main contenders are:

Tree Distance
defining a 'distance' $d(S,T)$ between two trees $S$ and $T$, basically representing what is left out and relabelled in an ancestry- and linearity-preserving mapping between $S$ and $T$. For identical trees, the distance $d(T,T)$ is 0. [PRtY04,ZSW+05,KM05,Emm06a,Emm06b]
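To make the idea concrete, here is a minimal Python sketch of one simple member of this family: a top-down (Selkow-style) distance in which relabelling a node costs 1 and deleting or inserting a whole subtree costs its size. The general algorithms in the cited papers allow deletions and insertions anywhere in the tree; the `(label, children)` tree encoding here is just an assumption for illustration.

```python
# A minimal sketch of a tree distance, under the simplifying assumption
# (Selkow-style, top-down) that deletion/insertion applies to whole subtrees.
# Trees are encoded as (label, [children]) -- an assumption for illustration.

def size(t):
    """Number of nodes in tree t."""
    label, children = t
    return 1 + sum(size(c) for c in children)

def tree_dist(s, t):
    """Relabelling a root costs 1; children are aligned by forest_dist."""
    (ls, cs), (lt, ct) = s, t
    return (0 if ls == lt else 1) + forest_dist(cs, ct)

def forest_dist(fs, ft):
    """String edit distance over the two child sequences: deleting or
    inserting a child costs the size of its whole subtree; substitution
    costs the (recursive) tree distance between the two children."""
    m, n = len(fs), len(ft)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + size(fs[i - 1])
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + size(ft[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + size(fs[i - 1]),
                          D[i][j - 1] + size(ft[j - 1]),
                          D[i - 1][j - 1] + tree_dist(fs[i - 1], ft[j - 1]))
    return D[m][n]
```

For identical trees the distance is 0, as required; between `('S',[('NP',[]),('VP',[])])` and `('S',[('NP',[]),('VP',[('V',[])])])` it is 1, the cost of inserting the V node.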

Tree Kernels
defining a so-called 'kernel' function $k(S,T)$ between two trees $S$ and $T$, which effectively represents a vector product $\sigma(S) \bullet \sigma(T)$ on images of the trees projected into a space, each dimension of which represents a distinct substructure. 'Kernel' functions are often thought of as 'similarity' functions: for identical trees, the 'similarity' $k(T,T)$ is large, and for trees with no shared substructures, $k(S,T)$ is 0. [ZAR03,ZL03,QMMB07,MGCB05,BM07]
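As an illustration, the following sketch implements one widely used instance, the subset-tree kernel of Collins and Duffy, in which $k(S,T)$ sums, over all pairs of nodes from the two trees, a count of the shared fragments rooted at those nodes. The `(label, children)` encoding and the decay parameter `lam` are illustrative assumptions, not taken from any one of the cited papers.

```python
# A minimal sketch of the subset-tree kernel of Collins and Duffy:
# k(S,T) counts shared production-rooted fragments of the two trees.
# Trees are encoded as (label, [children]) -- an assumption for illustration.

def production(n):
    """Root label plus the sequence of child labels."""
    label, children = n
    return (label, tuple(c[0] for c in children))

def nodes(t):
    """All nodes of tree t, root first."""
    yield t
    for c in t[1]:
        yield from nodes(c)

def C(n1, n2, lam=1.0):
    """Number (decayed by lam) of shared fragments rooted at n1 and n2."""
    if production(n1) != production(n2):
        return 0.0
    result = lam
    for c1, c2 in zip(n1[1], n2[1]):
        result *= 1 + C(c1, c2, lam)
    return result

def tree_kernel(s, t, lam=1.0):
    """k(S,T): sum C over all pairs of nodes from the two trees."""
    return sum(C(n1, n2, lam) for n1 in nodes(s) for n2 in nodes(t))
```

As the definition requires, $k(T,T)$ is large for identical trees, and $k(S,T)$ is 0 for trees sharing no substructure.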

Support Vector Machine Classification
using training data on which a kernel function $k$ is defined and which are in 2 categories, a hyperplane is effectively found, defined by a normal vector w and an origin offset $b$, dividing the space of points according to


${\bf w} \bullet {\bf x} + b > 0 \mbox{\ \ \ versus\ \ \ } {\bf w} \bullet {\bf x} + b < 0$

Typically ${\bf w}$ is not directly found; instead the quantity ${\bf w} \bullet {\bf x} + b$ is expressed by a summation involving the kernel function $k$ and a finite subset of the training points, the so-called support vectors.
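This decision function can be sketched directly. Assuming support vectors `svs`, multipliers `alphas`, labels in $\{-1,+1\}$ and offset `b` have already been produced by training (all names here are illustrative), the classifier only ever touches points through the kernel $k$:

```python
# A sketch of the kernelised decision function: w.x + b is never computed
# via an explicit w, but as a kernel sum over the support vectors.
# The names (svs, alphas, labels, b) are illustrative; in practice they
# come out of SVM training on the 2-category data.

def dot(u, v):
    """Ordinary vector product, used here as a (linear) kernel."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_decision(x, svs, alphas, labels, b, k):
    """w.x + b expressed as sum_i alpha_i y_i k(sv_i, x) + b."""
    return sum(a * y * k(sv, x)
               for sv, a, y in zip(svs, alphas, labels)) + b

def svm_classify(x, svs, alphas, labels, b, k):
    """The category is the side of the hyperplane the point falls on."""
    return 1 if svm_decision(x, svs, alphas, labels, b, k) >= 0 else -1
```

Because only $k$ appears, swapping in a tree kernel turns the same classifier into one over trees rather than over explicit feature vectors.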

k-NN classification
assuming a distance function $d$ which can be calculated between a test item and any member of a set of classified training data, rank the training data in order of increasing distance from the test item. In 1-NN classification, return as the category of the test item the category of its nearest neighbour. In k-NN classification, find the $k$ nearest neighbours, and treat this as a panel of $k$ votes for various categories, finally applying some weighted voting scheme to derive a winning category from the panel.
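The whole scheme fits in a few lines; the sketch below uses a distance-weighted vote, where the $1/(1+d)$ weighting is just one possible choice:

```python
# A sketch of k-NN classification with distance-weighted voting; the
# 1/(1+d) vote weight is one possible weighting scheme among many.
from collections import defaultdict

def knn_classify(test_item, training_data, d, k=3):
    """training_data is a list of (item, category) pairs; d is any
    distance function defined between test and training items."""
    # rank the training data by increasing distance, keep the k nearest
    neighbours = sorted(training_data,
                        key=lambda pair: d(test_item, pair[0]))[:k]
    # each neighbour votes for its own category, closer = heavier vote
    votes = defaultdict(float)
    for item, cat in neighbours:
        votes[cat] += 1.0 / (1.0 + d(test_item, item))
    return max(votes, key=votes.get)
```

With `k=1` this reduces to 1-NN: the single nearest neighbour's category always wins. As with the SVM, only $d$ touches the items themselves, so a tree distance makes this a classifier over trees.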

I would propose a plan in which first materials are studied to make sure that the above 4 concepts are reasonably well understood, followed by particular papers in which various combinations of these concepts have been applied to categorisation and ranking tasks.

Here's a first (and still unfinished) draft of the papers/materials to be looked at:

k-NN and distance functions

WHAT TO READ ? STILL DECIDING

SVMs and tree kernels

WHAT TO READ ? STILL DECIDING

Question Categorisation

ZL03
Dell Zhang and Wee Sun Lee. 
Question classification using support vector machines. 
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.

QMMB07
Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili. 
Advanced structural representations for question classification and answer re-ranking. 
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.

Question Answering and Text Entailment

PRtY04
Vasin Punyakanok, Dan Roth, and Wen tau Yih. 
Natural language inference via dependency tree mapping: An application to question answering. 
Computational Linguistics, 2004.

KM05
Milen Kouylekov and Bernardo Magnini. 
Recognizing textual entailment with tree edit distance algorithms. 
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.

Emm06b
Martin Emms. 
Variants of tree similarity in a question answering task. 
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.

Relation Extraction

ZAR03
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 
Kernel methods for relation extraction. 
Journal of Machine Learning Research, 3:1083-1106, 2003.

ZSW+05
Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. 
In IJCNLP, 2005.

MGCB05
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili. 
Hierarchical semantic role labeling. 
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.

BM07
Stephan Bloehdorn and Alessandro Moschitti. 
Combined syntactic and semantic kernels for text classification. 
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, APR 2007.

------------------

Bibliography

BM07
Stephan Bloehdorn and Alessandro Moschitti. 
Combined syntactic and semantic kernels for text classification. 
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, APR 2007.

Emm06a
Martin Emms. 
Clustering by tree distance for parse tree normalisation. 
In Proceedings of NLUCS 2006, pages 91-100, 2006.

Emm06b
Martin Emms. 
Variants of tree similarity in a question answering task. 
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.

KM05
Milen Kouylekov and Bernardo Magnini. 
Recognizing textual entailment with tree edit distance algorithms. 
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.

MGCB05
Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili. 
Hierarchical semantic role labeling. 
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.

PRtY04
Vasin Punyakanok, Dan Roth, and Wen tau Yih. 
Natural language inference via dependency tree mapping: An application to question answering. 
Computational Linguistics, 2004.

QMMB07
Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili. 
Advanced structural representations for question classification and answer re-ranking. 
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.

ZAR03
Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 
Kernel methods for relation extraction. 
Journal of Machine Learning Research, 3:1083-1106, 2003.

ZL03
Dell Zhang and Wee Sun Lee. 
Question classification using support vector machines. 
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.

ZSW+05
Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan. 
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering. 
In IJCNLP, 2005.



Martin Emms 2009-02-09


