Aim of this directed study module
In various areas of what might be termed intelligent language processing, the questions of categorising units and of ranking units for similarity arise.
In Information Extraction, units have to be recognised within sentences that can be seen as filling roles in database entries in some domain of particular interest (business events, terrorist incidents). So units have to be found and given categories such as company, date, weapon etc. Systems for such categorisations are sometimes called Named Entity Recognisers.
In Question Answering, given a natural language question (e.g. what command will copy a directory?), an answering natural language sentence is sought within a corpus of natural language text (such as a manual: command cp Dir1 Dir2 will copy the directory Dir1 to location Dir2). Roughly this can be seen as categorising the corpus sentences into the 2 categories of being an answer and not being an answer. More realistically it can be seen as ranking the corpus sentences for their likelihood of being an answer to the question, and this ranking might take the form of computing a similarity between the answer and the question. Often within an overall question-answering system there are sub-systems whose concern is the categorisation of phrases. So units within the answer-corpus might be sought and categorised (such as location, person etc), and similar or identical categories might be assigned to questions (where was ...: location, who ...: person) and might be used to influence the selection or ranking of answers.
The techniques which could be (and have been) used for such categorisation and similarity-ranking tasks might be classified along the following two dimensions:
- Structural
- whether the method makes reference to linguistic structures or not (for example methods referring pretty much to just word-count statistics do not make reference to linguistic structures)
- Machine-Learning
- whether the method is hand-engineered, or instead is a method which uses training data to define the parameters and behaviour of a classifier.
The aim of this directed study module will be to look at techniques for categorisation and similarity ranking which (i) do make reference to linguistic structures and (ii) do make use of machine-learning methods, whilst making comparisons with techniques which do not share these characteristics.
--------------
Possible Plan
Within the area of techniques which do make reference to linguistic structure, and are not hand-engineered, the main contenders are
- Tree Distance
- defining a 'distance' d(T1, T2) between two trees T1 and T2, basically representing what is left out and relabelled in an ancestry- and linearity-preserving mapping between T1 and T2. For identical trees, the distance is 0. [PRtY04,ZSW+05,KM05,Emm06a,Emm06b]
- Tree Kernels
- defining a so-called 'kernel' function K(T1, T2) between two trees T1 and T2, which effectively represents a vector product on images of the trees projected into a space, each dimension of which represents a distinct substructure. Kernel functions are often thought of as 'similarity' functions: for identical trees, the 'similarity' is large, and for trees with no shared substructures, it is 0. [ZAR03,ZL03,QMMB07,MGCB05,BM07]
- Support Vector Machine Classification
- using training data on which a kernel function is defined and which fall into 2 categories, a hyperplane is effectively found, defined by a normal vector w and an origin offset b, dividing the space of points x according to whether
w · x + b > 0 or w · x + b < 0
Typically w is not directly found, but instead the quantity w · x + b is expressed by a summation involving the kernel function and a finite subset of the training points, the so-called support vectors
- k-NN classification
- assuming a distance function which can be calculated between a test item and any member of a set of classified training data, rank the training data in order of increasing distance from the test item. In 1-NN classification, return as the category of the test item the category of its nearest neighbour. In k-NN classification, find the k nearest neighbours, and treat these as a panel of votes for various categories, finally applying some (possibly weighted) voting scheme to derive a winning category from the panel.
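To make the tree-distance idea above concrete, here is a minimal Python sketch (an illustration of this note's assumptions, not code from any of the cited papers), with trees represented as (label, children) tuples and unit costs for deletion, insertion and relabelling. It uses the naive recursion on forests; practical systems would use a dynamic programme such as Zhang-Shasha instead.

```python
# Minimal ordered-tree edit distance sketch, unit costs.
# A tree is a (label, children) tuple; a forest is a list of trees.

def size(tree):
    label, children = tree
    return 1 + sum(size(c) for c in children)

def forest_dist(f1, f2):
    if not f1 and not f2:
        return 0
    if not f1:
        return sum(size(t) for t in f2)   # insert everything remaining in f2
    if not f2:
        return sum(size(t) for t in f1)   # delete everything remaining in f1
    (l1, c1), rest1 = f1[0], f1[1:]
    (l2, c2), rest2 = f2[0], f2[1:]
    return min(
        1 + forest_dist(c1 + rest1, f2),   # delete root of first tree
        1 + forest_dist(f1, c2 + rest2),   # insert root of second tree
        (0 if l1 == l2 else 1)             # match (or relabel) the two roots,
        + forest_dist(c1, c2)              # which forces children to map to children
        + forest_dist(rest1, rest2),       # and right-siblings to right-siblings
    )

def tree_dist(t1, t2):
    return forest_dist([t1], [t2])
```

The min over delete/insert/match is exactly what makes the resulting mapping ancestry- and linearity-preserving: a matched pair of roots can only pair its descendants with the other root's descendants.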
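The tree-kernel idea can be sketched in the same representation: the following is a simplified, unlambda'd version of the Collins-Duffy style subtree-counting kernel (again an illustrative sketch, not the exact formulation of any cited paper), which sums, over all pairs of nodes, the number of common subtrees rooted at those nodes.

```python
# Simplified subtree-counting tree kernel on (label, children) tuples.

def nodes(tree):
    label, children = tree
    yield tree
    for c in children:
        yield from nodes(c)

def common_subtrees(n1, n2):
    """Number of common subtrees rooted at the node pair (n1, n2)."""
    l1, c1 = n1
    l2, c2 = n2
    # the 'productions' must match: same label, same child-label sequence
    if l1 != l2 or [c[0] for c in c1] != [c[0] for c in c2]:
        return 0
    if not c1:                      # matching leaves contribute one subtree
        return 1
    prod = 1
    for a, b in zip(c1, c2):        # each child may stop here or expand further
        prod *= 1 + common_subtrees(a, b)
    return prod

def tree_kernel(t1, t2):
    return sum(common_subtrees(n1, n2)
               for n1 in nodes(t1) for n2 in nodes(t2))
```

This makes the vector-product reading visible: each dimension of the implicit space is one possible subtree, and the kernel counts co-occurrences without ever enumerating that (huge) space.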
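For the support vector machine, the point that w is never formed directly can be shown in a few lines: the decision quantity w · x + b is computed as a kernel summation over the support vectors (the variable names here are illustrative, not taken from any particular SVM package).

```python
# Kernelised SVM decision rule: f(x) = sum_i alpha_i * y_i * K(s_i, x) + b,
# summed over support vectors s_i with multipliers alpha_i and labels y_i.

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    return sum(a * y * kernel(s, x)
               for s, a, y in zip(support_vectors, alphas, labels)) + b

def svm_classify(x, support_vectors, alphas, labels, b, kernel):
    return 1 if svm_decision(x, support_vectors, alphas, labels, b, kernel) >= 0 else -1

# With a linear kernel this is literally w.x + b; with a tree kernel, x and
# the support vectors are trees, and no explicit feature vector ever exists.
linear_kernel = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
```

This is what lets tree kernels plug into SVMs: the learner only ever needs kernel values between pairs of items, never coordinates of the items themselves.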
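And the k-NN scheme just described, with the simplest unweighted voting, can be sketched as:

```python
from collections import Counter

def knn_classify(test_item, training, dist, k):
    """training is a list of (item, category) pairs; dist is any distance
    function between items (e.g. a tree distance)."""
    # rank training data by increasing distance and keep the k nearest
    neighbours = sorted(training, key=lambda pair: dist(test_item, pair[0]))[:k]
    # unweighted voting: each neighbour casts one vote for its category
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to nearest-neighbour classification; a weighted scheme would replace the Counter with votes scaled by, say, inverse distance.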
I would propose a plan in which first materials are studied to make sure that the above 4 concepts are reasonably well understood, followed by particular papers in which various combinations of these concepts have been applied to categorisation and ranking tasks.
Here's a first (and still unfinished) draft of the papers/materials to be looked at:
- k-NN and distance functions
WHAT TO READ ? STILL DECIDING
- SVMs and tree kernels
WHAT TO READ ? STILL DECIDING
- Question Categorisation
- ZL03
- Dell Zhang and Wee Sun Lee.
Question classification using support vector machines.
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.
- QMMB07
- Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili.
Advanced structural representations for question classification and answer re-ranking.
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.
- Question Answering and Text Entailment
- PRtY04
- Vasin Punyakanok, Dan Roth, and Wen tau Yih.
Natural language inference via dependency tree mapping: An application to question answering.
Computational Linguistics, 2004.
- KM05
- Milen Kouylekov and Bernardo Magnini.
Recognizing textual entailment with tree edit distance algorithms.
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.
- Emm06b
- Martin Emms.
Variants of tree similarity in a question answering task.
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.
- Relation Extraction
- ZAR03
- Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.
Kernel methods for relation extraction.
Journal of Machine Learning Research, 3:1083-1106, 2003.
- ZSW+05
- Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan.
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering.
In IJCNLP, 2005.
- MGCB05
- Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili.
Hierarchical semantic role labeling.
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.
- BM07
- Stephan Bloehdorn and Alessandro Moschitti.
Combined syntactic and semantic kernels for text classification.
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, April 2007.
------------------
Bibliography
- BM07
- Stephan Bloehdorn and Alessandro Moschitti.
Combined syntactic and semantic kernels for text classification.
In Gianni Amati, Claudio Carpineto, and Gianni Romano, editors, Advances in Information Retrieval - Proceedings of the 29th European Conference on Information Retrieval (ECIR 2007), 2-5 April 2007, Rome, Italy, volume 4425 of Lecture Notes in Computer Science, pages 307-318. Springer, April 2007.
- Emm06a
- Martin Emms.
Clustering by tree distance for parse tree normalisation.
In Proceedings of NLUCS 2006, pages 91-100, 2006.
- Emm06b
- Martin Emms.
Variants of tree similarity in a question answering task.
In Proceedings of the Workshop on Linguistic Distances, held in conjunction with COLING 2006, pages 100-108, Sydney, Australia, July 2006. Association for Computational Linguistics.
- KM05
- Milen Kouylekov and Bernardo Magnini.
Recognizing textual entailment with tree edit distance algorithms.
In Ido Dagan, Oren Glickman, and Bernardo Magnini, editors, Pascal Challenges Workshop on Recognising Textual Entailment, 2005.
- MGCB05
- Alessandro Moschitti, Ana-Maria Giuglea, Bonaventura Coppola, and Roberto Basili.
Hierarchical semantic role labeling.
In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL 2005 shared task), 2005.
- PRtY04
- Vasin Punyakanok, Dan Roth, and Wen tau Yih.
Natural language inference via dependency tree mapping: An application to question answering.
Computational Linguistics, 2004.
- QMMB07
- Silvia Quarteroni, Alessandro Moschitti, Suresh Manandhar, and Roberto Basili.
Advanced structural representations for question classification and answer re-ranking.
In Advances in Information Retrieval, proceedings of ECIR 2007. Springer, 2007.
- ZAR03
- Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella.
Kernel methods for relation extraction.
Journal of Machine Learning Research, 3:1083-1106, 2003.
- ZL03
- Dell Zhang and Wee Sun Lee.
Question classification using support vector machines.
In SIGIR '03: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, pages 26-32, New York, NY, USA, 2003. ACM.
- ZSW+05
- Min Zhang, Jian Su, Danmei Wang, Guodong Zhou, and Chew Lim Tan.
Discovering relations between named entities from a large raw corpus using tree similarity-based clustering.
In IJCNLP, 2005.
Martin Emms 2009-02-09