2009/02/28

The program for question answering with kNN seems to be done.

2009/02/27

How does noise affect generalization?

statistical noise
physical noise: variation in the target
injecting artificial noise: jitter


noise in the target: danger of overfitting
noise in the inputs: limits the accuracy of generalization.
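
A minimal sketch of the jitter idea, just to make it concrete (the function name and parameters are my own, not from any of these papers): each time a training sample is presented to the learner, perturb its inputs with small zero-mean Gaussian noise.

// Toy sketch of jitter: add small zero-mean Gaussian noise to the inputs
// of a training sample each time it is presented to the learner.
// sigma controls how much the samples are smeared; it is a free parameter.
#include <random>
#include <vector>

std::vector<double> jitter(const std::vector<double>& x, double sigma,
                           std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, sigma);
    std::vector<double> noisy(x);
    for (double& v : noisy) v += noise(rng);
    return noisy;
}
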
Genetic Kernel Support Vector Machine: description and evaluation
GP: genetic programming

1 hour
documents

2009/02/26

Today: Microsoft interview.

2009/02/25

Talk with Martin about C++.
2CSLL3 lecture

2009/02/24

Preprocessing samples before giving them to the learning machines

There is a large body of work on preprocessing samples before giving them to the learning machines:

·         Remove noise

o   Algorithms for detecting noisy samples based on kNN algorithms.

·         Add noise

o   Adding a small amount of noise produces better performance in neural networks (and maybe also in other algorithms).

·         Re-structure the dimensionality and the distance metric.

o   Mahalanobis distance.

o   Scaling the data: it gives an improvement with SVMs.

o   Kernels: increase dimensions.

o   Genetic kernel (GK SVM)

o   Removing features:

§  removing dimensions (feature selection)

·         information gain (the best; see the sketch after this list)

·         mutual information

·         χ² statistic (second best)

·         term strength

§  principal component analysis.

§  neighborhood component analysis

·         Re-sampling:

o   Under-sampling:

§  Randomly

§  Inconsistent data

§  Duplicate data

§  Removing noise (bis)

o   Over-sampling:

§  Randomly

§  SMOTE (see the sketch after this list)

§  Borderline-SMOTE 1

§  Borderline-SMOTE 2

§  Adding noise (bis)

§  Give more weight to hard samples.

·         Windowed data:

o   In some cases context information increases the accuracy.
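
A rough sketch of the core SMOTE step mentioned above (helper names and the brute-force neighbour search are my own simplifications, not from the paper): a synthetic minority-class sample is built by interpolating between a real minority sample and one of its k nearest neighbours inside the minority class.

// Core SMOTE step: interpolate between a minority sample and one of its
// k nearest minority-class neighbours to create a synthetic sample.
#include <algorithm>
#include <random>
#include <vector>

using Sample = std::vector<double>;

double squaredDistance(const Sample& a, const Sample& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

Sample smoteSample(const std::vector<Sample>& minority, size_t i, size_t k,
                   std::mt19937& rng) {
    // Rank the other minority samples by distance to minority[i].
    std::vector<size_t> idx;
    for (size_t j = 0; j < minority.size(); ++j)
        if (j != i) idx.push_back(j);
    if (idx.empty()) return minority[i];    // nothing to interpolate with
    std::sort(idx.begin(), idx.end(), [&](size_t a, size_t b) {
        return squaredDistance(minority[i], minority[a]) <
               squaredDistance(minority[i], minority[b]);
    });
    k = std::min(k, idx.size());

    // Pick one of the k nearest neighbours and interpolate at a random
    // point on the segment between the two samples.
    std::uniform_int_distribution<size_t> pick(0, k - 1);
    std::uniform_real_distribution<double> gap(0.0, 1.0);
    const Sample& nn = minority[idx[pick(rng)]];
    double g = gap(rng);

    Sample synthetic(minority[i].size());
    for (size_t d = 0; d < synthetic.size(); ++d)
        synthetic[d] = minority[i][d] + g * (nn[d] - minority[i][d]);
    return synthetic;
}
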

Split features: in text categorization, words can be split using morphological features (Morfessor).
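
For the feature-selection point in the list above, a small sketch of information gain for a binary term in binary text categorization; the counts-based interface is my own simplification and assumes all four counts are non-zero.

// Information gain for feature selection:
// IG(t) = H(C) - [ P(t) H(C|t) + P(!t) H(C|!t) ].
#include <cmath>

double entropy(double p) {                  // binary entropy H(p)
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// n11: docs containing the term and in the class, n10: containing the term
// but outside the class, n01: in the class without the term, n00: neither.
double informationGain(double n11, double n10, double n01, double n00) {
    double n = n11 + n10 + n01 + n00;
    double pClass = (n11 + n01) / n;                  // P(class)
    double pTerm  = (n11 + n10) / n;                  // P(term present)
    double hGivenTerm   = entropy(n11 / (n11 + n10)); // H(C | term)
    double hGivenNoTerm = entropy(n01 / (n01 + n00)); // H(C | no term)
    return entropy(pClass)
         - (pTerm * hGivenTerm + (1.0 - pTerm) * hGivenNoTerm);
}
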

Meeting with Martin.

Long explanation of the kNN implementation algorithm.
Dudani voting method -> the weight of each neighbour depends on the distance of the farthest point.

The size of k (in kNN) can change if there are several samples at the same distance.
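
A sketch of Dudani's distance-weighted vote as I understand it (names are mine, not from our implementation): each of the k neighbours gets the weight (d_k - d_i) / (d_k - d_1), where d_1 is the distance of the nearest neighbour and d_k that of the farthest one, so the nearest neighbour votes with weight 1 and the farthest with weight 0.

// Dudani's distance-weighted kNN vote over the k retrieved neighbours.
#include <map>
#include <vector>

struct Neighbour {
    int label;
    double distance;    // neighbours assumed sorted by increasing distance
};

int dudaniVote(const std::vector<Neighbour>& nn) {
    double d1 = nn.front().distance;
    double dk = nn.back().distance;

    std::map<int, double> votes;
    for (const Neighbour& n : nn) {
        // Weight shrinks linearly from 1 (nearest) to 0 (farthest).
        double w = (dk == d1) ? 1.0 : (dk - n.distance) / (dk - d1);
        votes[n.label] += w;
    }

    int best = nn.front().label;
    double bestVotes = -1.0;
    for (const auto& v : votes)
        if (v.second > bestVotes) { bestVotes = v.second; best = v.first; }
    return best;
}
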

Our dataset is based on tree distances (there are no vectors).
It is about question answering: the original data are questions, and the labels describe what each question is looking for.

Everything is implemented in C++.

The QBank Manager class recovers the original question.





SIGIR poster submitted.

Stephan Schlogl rewrote the text.

2009/02/22

project proposal

Distance in learning machines

Today I wrote the skeleton of the project proposal.

A study of distance-based machine learning algorithms

(phd thesis)

It looks like a literature review.

Distance Metric Learning for Large Margin Nearest Neighbor Classification

key words:
Mahalanobis distance: not dependent on the scale of measurement.
Many researchers -> kNN classification can be significantly improved by using a distance metric learned from labeled examples.
The approach is similar to SVM.
Large margin nearest neighbor (LMNN) classification.

linear transformation that optimizes knn

Cost function: penalizes large distances between each input and its target neighbors, and penalizes small distances between each input and all other inputs that do not share the same label.
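
As I understand it from the paper, the cost function is roughly (my notation, not copied from the paper):

\[
\varepsilon(L) = \sum_{i,\; j \rightsquigarrow i} \|L(x_i - x_j)\|^2
 + c \sum_{i,\; j \rightsquigarrow i} \sum_{l} (1 - y_{il})
 \bigl[\, 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \,\bigr]_+
\]

where \(j \rightsquigarrow i\) means that \(x_j\) is a target neighbor of \(x_i\), \(y_{il} = 1\) iff \(x_i\) and \(x_l\) share the same label, and \([z]_+ = \max(z, 0)\) is the hinge. The first term pulls target neighbors closer; the second pushes away differently labeled points that invade the margin.
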

Matlab implementation available online.

related work

semidefinite programming

neighborhood component analysis.
relevant component analysis.

PCA

Kernelizing the algorithm will improve it.

Distance Metric Learning, with Application to Clustering with Side-Information




Not read yet.
Book: Foundations of Statistical Natural Language Processing.
Read up to page 55.

2009/02/21

kNN with a kernel transformation --> changing the way distance is measured
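
A tiny sketch of what this means in practice (the RBF kernel is just an example choice; names are mine): the squared distance in the kernel's feature space can be computed from kernel evaluations alone, so kNN can work in that space without ever building the mapping explicitly.

// Kernel-induced distance: ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y).
#include <cmath>
#include <vector>

double rbfKernel(const std::vector<double>& a, const std::vector<double>& b,
                 double gamma) {
    double sq = 0.0;
    for (size_t i = 0; i < a.size(); ++i) sq += (a[i] - b[i]) * (a[i] - b[i]);
    return std::exp(-gamma * sq);
}

double kernelSquaredDistance(const std::vector<double>& x,
                             const std::vector<double>& y, double gamma) {
    return rbfKernel(x, x, gamma) - 2.0 * rbfKernel(x, y, gamma)
         + rbfKernel(y, y, gamma);
}
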

Kernelizing linear classifiers.
Writing: Estimating performance of text classification.
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

papers:

Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

ACM

I got enrolled in the ACM.

Blog

Today I set up my research blog.

2009/02/20

CNGL meeting

2009/02/19

CNGL meeting