2009/04/03

MULTEXT-East Version 4: multilingual morphosyntactic specifications

Friday talk


 

multilingual morphosyntactic

POS

Determine ambiguity class

saw – NN (noun), saw – VBD (verb, past tense)


 


 

"I saw" (Spanish: ver, "to see") vs. "a saw" (Spanish: serrucho, "handsaw")
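The idea of an ambiguity class can be sketched in a few lines of Python. This is only a minimal illustration, not something shown in the talk: the toy corpus, the Penn Treebank-style tags, and the function name `ambiguity_classes` are all assumptions made for the example.

```python
from collections import defaultdict

# Toy tagged corpus (hypothetical data, Penn Treebank-style tags).
TAGGED = [
    ("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("saw", "NN"),
    ("the", "DT"), ("saw", "NN"), ("cuts", "VBZ"),
]


def ambiguity_classes(tagged_tokens):
    """Map each lower-cased word form to the sorted tuple of tags seen with it."""
    seen = defaultdict(set)
    for word, tag in tagged_tokens:
        seen[word.lower()].add(tag)
    return {word: tuple(sorted(tags)) for word, tags in seen.items()}


classes = ambiguity_classes(TAGGED)
print(classes["saw"])  # ('NN', 'VBD')
print(classes["the"])  # ('DT',)
```

A word such as "saw" with more than one tag in its class is the case a tagger has to resolve from context; "the" has a single-tag class and needs no disambiguation.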


 

Popular taggers:

TnT

TreeTagger (decision tree)

TBL (transformation-based learning) tagging


 


 

Tag Sets

  1. Brown
  2. CLAWS
  3. PTB (Penn Treebank)


 

MSD: Morphosyntactic Descriptors

|POS| < |MSD|
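To make the inequality concrete, here is a rough sketch of how a positional MSD string could be decoded into features. It assumes the MULTEXT-East convention that the first character gives the category and each following position encodes one attribute, with "-" for a non-applicable attribute. The attribute table below is a heavily simplified, illustrative one for nouns only, not the actual specification, and `decode_msd` and `pos_of` are hypothetical helper names.

```python
# Illustrative, heavily simplified attribute table for nouns only;
# the real specifications define many more categories and attributes.
NOUN_ATTRIBUTES = [
    ("Type", {"c": "common", "p": "proper"}),
    ("Gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("Number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("Case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "l": "locative", "i": "instrumental"}),
]
CATEGORIES = {"N": ("Noun", NOUN_ATTRIBUTES)}


def decode_msd(msd):
    """Expand a positional MSD string into a feature dictionary."""
    category, attributes = CATEGORIES[msd[0]]
    features = {"Category": category}
    for (name, values), code in zip(attributes, msd[1:]):
        if code != "-":  # assumed marker for a non-applicable attribute
            features[name] = values[code]
    return features


def pos_of(msd):
    """The coarse part of speech is the first character of the MSD."""
    return msd[0]


msds = ["Ncmsn", "Ncmsg", "Ncfpa", "Npfsn"]
print(decode_msd("Ncmsn"))
print(len({pos_of(m) for m in msds}), "POS vs", len(set(msds)), "MSDs")
```

All four descriptors collapse to the single part of speech N, which is the sense in which the set of POS tags is smaller than the set of MSDs.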


 


 

Basic Language Resource Kit:

http://nl.ijs.si/ME/


 

1 Specification

2 Lexicon

3 Parallel corpus


 


 


 


 

The talk presents work in progress on the fourth version of the multilingual language resources originating in the MULTEXT and MULTEXT-East projects of the 1990s. The resources focus on morphosyntactic descriptions of languages oriented towards language technology, i.e. on providing features and tagsets useful for word-level tagging of corpora, which is commonly known as part-of-speech tagging. Unlike English, where "part-of-speech" tagsets contain around 50 tags, most other (inflectional, agglutinating) languages have much richer word-level morphosyntactic structures; the tagset for Slovene, for example, has almost 2,000 different tags. The MULTEXT-East resources comprise morphosyntactic specifications, which define the features and their tagsets, as well as lexica and annotated corpora. Version 3 (2004) is the last released version; its resources are freely available for research from http://nl.ijs.si/me/ and have been downloaded by over 200 registered users, mostly from universities and research institutions. The talk introduces the XML structure of the specifications in Version 4, which is to contain data for over 13 languages. We discuss the characteristics of the languages covered, the use of the Text Encoding Initiative Guidelines as the encoding scheme, and the use of XSLT to transform the specifications into other formats. An application of this framework is then given, namely the JOS language resources for Slovene, http://nl.ijs.si/jos/, which provide a manually validated, morphosyntactically annotated reference corpus for the language. Finally, the methodology for adding new languages to the specifications is presented.


 


 
