Research diary: July 2009

2009/07/30

Pascal Challenge aborted

SVM

general: http://www.support-vector-machines.org/

LIBSVM: http://www.csie.ntu.edu.tw/~cjlin/libsvm/ (C++)
SVMLIGHT: http://svmlight.joachims.org/ (C++)

BIOJAVA: http://www.biojava.org/ (The package org.biojava.stats.svm contains SVM classification and regression.) (JAVA)

Package pt.tumba.ngram.svm: http://tcatng.sourceforge.net/javadocs/pt/tumba/ngram/svm/package-summary.html (JAVA)

The following packages either implement SVM by themselves or wrap some SVM packages written in C/C++.

RapidMiner: http://rapid-i.com/
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
MALLET: http://mallet.cs.umass.edu/
MINORTHIRD: http://minorthird.sourceforge.net/

2009/07/28

started programming the software for the Pascal Challenge

2009/07/21

booking flight to Padova

Pascal Challenge

Pascal Challenge on Large Scale Hierarchical Text Classification

Web site: http://lshtc.iit.demokritos.gr/
Email: lshtc_info@iit.demokritos.gr

We are pleased to announce the launch of the Large Scale Hierarchical Text Classification (LSHTC) Pascal Challenge. The LSHTC Challenge is a hierarchical text classification competition using large datasets based on the ODP Web directory data (www.dmoz.org).

Hierarchies are becoming ever more popular for the organization of text documents, particularly on the Web. Web directories are an example. Along with their widespread use comes the need for automated classification of new documents to the categories in the hierarchy. As the size of the hierarchy grows and the number of documents to be classified increases, a number of interesting machine learning problems arise. In particular, it is one of the rare situations where data sparsity remains an issue despite the vastness of available data. The reasons for this are the simultaneous increase in the number of classes and their hierarchical organization. The latter leads to a very high imbalance between the classes at different levels of the hierarchy. Additionally, the statistical dependence of the classes poses challenges and opportunities for the learning methods.

The challenge will consist of four tasks with partially overlapping data. Information regarding the tasks and the challenge rules can be found at the challenge Web site, under the "Tasks, Rules and Guidelines" link.

We plan a two-stage evaluation of the participating methods: one measuring classification performance and one computational performance. It is important to measure both, as they are dependent. The results will be included in a final report about the challenge and we also aim at organizing a special NIPS'09 workshop.

In order to register for the challenge and gain access to the datasets, please create a new account at the challenge Web site.

Key dates:
Start of testing: July 10, 2009.
End of testing, submission of executables and short papers: September 29, 2009.
End of scalability test and announcement of results: October 25, 2009.
NIPS'09 workshop (subject to approval): December 11-12, 2009

Organisers:
Eric Gaussier, LIG, Grenoble, France
George Paliouras, NCSR "Demokritos", Athens, Greece
Aris Kosmopoulos, NCSR "Demokritos", Athens, Greece
Sujeevan Aseervatham, LIG, Grenoble & Yakaz, Paris, France

2009/07/13

The Proposition Bank: An Annotated Corpus of Semantic Roles

describes how PropBank was built

2009/07/09

writing the SRL report

2009/07/03

correct = 9922 wrong = 3943
program finished properly
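
For reference, the two counts above can be turned into a single accuracy figure. This is only a minimal sketch: the function name `accuracy` is mine and is not part of the software being tested.

```python
def accuracy(correct: int, wrong: int) -> float:
    """Fraction of correct decisions among all scored decisions."""
    total = correct + wrong
    if total == 0:
        raise ValueError("no scored items")
    return correct / total


# Counts taken from the run above.
print(f"{accuracy(9922, 3943):.4f}")
```

With these counts the result is roughly 0.716, i.e. about 71.6% of the scored labels were correct.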

2009/07/02

Sanity test

training:

1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l1

1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 c c c DT_C RBR_C _ _ 1 1 NMOD_C NMOD_C _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

testing:

1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

system output:

1 d d _ CD_D _ _ _ 0 0 NMOD_D _ Y D _
2 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3

1 a a _ CD_A _ _ _ 0 0 NMOD_A _ Y A _
2 b b _ DT_B _ _ _ 1 1 NMOD_B _ _ _ l1
3 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3

results:
correct = 2 wrong = 1
as I expected
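
The counts above can be reproduced with a small scoring sketch. It is not the actual evaluation code. It assumes that columns are whitespace-separated, that sentences are separated by blank lines, and that only rows whose last column in the testing data is not "_" are scored (my reading of the data). The names `parse_sentences` and `score` are hypothetical.

```python
def parse_sentences(text):
    """Split CoNLL-style text into sentences, each a list of column lists."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if fields:
            current.append(fields)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def score(gold_text, system_text):
    """Count matching and mismatching labels in the last column."""
    correct = wrong = 0
    pairs = zip(parse_sentences(gold_text), parse_sentences(system_text))
    for gold_sent, sys_sent in pairs:
        for gold_row, sys_row in zip(gold_sent, sys_sent):
            gold_label = gold_row[-1]
            if gold_label == "_":
                continue  # assumption: unlabeled rows are not scored
            if sys_row[-1] == gold_label:
                correct += 1
            else:
                wrong += 1
    return correct, wrong


# Testing data and system output copied from the sanity test above.
GOLD = """\
1 d d d CD_D CD_D _ _ 0 0 NMOD_D NMOD_D Y D _
2 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3

1 a a a CD_A CD_A _ _ 0 0 NMOD_A NMOD_A Y A _
2 b b b DT_B RBR_B _ _ 1 1 NMOD_B NMOD_B _ _ l2
3 e e e DT_E RBR_E _ _ 1 1 NMOD_E NMOD_E _ _ l3
"""

SYSTEM = """\
1 d d _ CD_D _ _ _ 0 0 NMOD_D _ Y D _
2 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3

1 a a _ CD_A _ _ _ 0 0 NMOD_A _ Y A _
2 b b _ DT_B _ _ _ 1 1 NMOD_B _ _ _ l1
3 e e _ DT_E _ _ _ 1 1 NMOD_E _ _ _ l3
"""

print(score(GOLD, SYSTEM))  # (2, 1)
```

The single error is the row for "b": the testing data labels it l2, while the system outputs l1, the label it saw for "b" in training.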

5 hours to run on the English data set.

I found double spaces in the training set; that is why the input and output files have different numbers of lines.
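
A quick way to locate such lines is sketched below. It only finds and collapses runs of spaces; how exactly the double spaces led to the line-count mismatch is not covered here, and the helper names `find_double_spaces` and `collapse_spaces` are hypothetical.

```python
import re

# Two or more consecutive space characters.
MULTI_SPACE = re.compile(r" {2,}")


def find_double_spaces(lines):
    """Return 1-based numbers of lines containing a run of two or more spaces."""
    return [i for i, line in enumerate(lines, start=1) if MULTI_SPACE.search(line)]


def collapse_spaces(line):
    """Replace each run of spaces with a single space."""
    return MULTI_SPACE.sub(" ", line)


# Made-up sample lines for illustration, not taken from the real training set.
sample = ["1 a a a CD_A", "2 b  b b DT_B", ""]
print(find_double_spaces(sample))  # [2]
print(collapse_spaces(sample[1]))  # 2 b b b DT_B
```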

2009/07/01

it seems the software is working
if so, it will take 8 hours to produce results
