2009/02/28

The program for question answering with kNN seems to be done.

2009/02/27

How does noise affect generalization?

statistical noise
physical noise: variation in the target
injecting artificial noise: jitter


noise in the target: danger of overfitting
noise in the inputs: limits the accuracy of generalization.
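
A minimal sketch of the jitter idea, just to make it concrete (the function name and parameters are my own, not from any of these papers): each time a training sample is presented to the learner, perturb its inputs with small zero-mean Gaussian noise.

// Toy sketch of jitter: add small zero-mean Gaussian noise to the inputs
// of a training sample each time it is presented to the learner.
// sigma controls how much the samples are smeared; it is a free parameter.
#include <random>
#include <vector>

std::vector<double> jitter(const std::vector<double>& x, double sigma,
                           std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, sigma);
    std::vector<double> noisy(x);
    for (double& v : noisy) v += noise(rng);
    return noisy;
}
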
Genetic Kernel Support Vector Machine: description and evaluation
GP: genetic programming

1 hour
documents

2009/02/26

Today: Microsoft interview.

2009/02/25

Talk with Martin about C++.
2CSLL3 lecture

2009/02/24

Preprocessing samples before giving them to the learning machines

There is a large body of work on preprocessing samples before giving them to the learning machines:

·         Remove noise

o   Algorithms for detecting noisy samples based on kNN algorithms.

·         Add noise

o   Adding a small amount of noise produces better performance in neural networks (and maybe also in other algorithms).

·         Re-structure the dimensionality and the distance metric.

o   Mahalanobis distance.

o   Scaling the data: it gives an improvement with SVMs.

o   Kernels: increase dimensions.

o   Genetic kernel (GK SVM)

o   Removing features:

§  removing dimensions (feature selection)

·         information gain (the best; see the sketch after this list)

·         mutual information

·         χ² statistic (second best)

·         term strength

§  principal component analysis.

§  neighborhood component analysis

·         Re-sampling:

o   Under-sampling:

§  Randomly

§  Inconsistent data

§  Duplicate data

§  Removing noise (bis)

o   Over-sampling:

§  Randomly

§  SMOTE (see the sketch after this list)

§  Borderline-SMOTE 1

§  Borderline-SMOTE 2

§  Adding noise (bis)

§  Give more weight to hard samples.

·         Windowed data:

o   In some cases context information increases the accuracy.
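
A rough sketch of the core SMOTE step mentioned above (helper names and the brute-force neighbour search are my own simplifications, not from the paper): a synthetic minority-class sample is built by interpolating between a real minority sample and one of its k nearest neighbours inside the minority class.

// Core SMOTE step: interpolate between a minority sample and one of its
// k nearest minority-class neighbours to create a synthetic sample.
#include <algorithm>
#include <random>
#include <vector>

using Sample = std::vector<double>;

double squaredDistance(const Sample& a, const Sample& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) d += (a[i] - b[i]) * (a[i] - b[i]);
    return d;
}

Sample smoteSample(const std::vector<Sample>& minority, size_t i, size_t k,
                   std::mt19937& rng) {
    // Rank the other minority samples by distance to minority[i].
    std::vector<size_t> idx;
    for (size_t j = 0; j < minority.size(); ++j)
        if (j != i) idx.push_back(j);
    if (idx.empty()) return minority[i];    // nothing to interpolate with
    std::sort(idx.begin(), idx.end(), [&](size_t a, size_t b) {
        return squaredDistance(minority[i], minority[a]) <
               squaredDistance(minority[i], minority[b]);
    });
    k = std::min(k, idx.size());

    // Pick one of the k nearest neighbours and interpolate at a random
    // point on the segment between the two samples.
    std::uniform_int_distribution<size_t> pick(0, k - 1);
    std::uniform_real_distribution<double> gap(0.0, 1.0);
    const Sample& nn = minority[idx[pick(rng)]];
    double g = gap(rng);

    Sample synthetic(minority[i].size());
    for (size_t d = 0; d < synthetic.size(); ++d)
        synthetic[d] = minority[i][d] + g * (nn[d] - minority[i][d]);
    return synthetic;
}
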

Split features: in text categorization, words can be split using morphological features (Morfessor).
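
For the feature-selection point in the list above, a small sketch of information gain for a binary term in binary text categorization; the counts-based interface is my own simplification and assumes all four counts are non-zero.

// Information gain for feature selection:
// IG(t) = H(C) - [ P(t) H(C|t) + P(!t) H(C|!t) ].
#include <cmath>

double entropy(double p) {                  // binary entropy H(p)
    if (p <= 0.0 || p >= 1.0) return 0.0;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// n11: docs containing the term and in the class, n10: containing the term
// but outside the class, n01: in the class without the term, n00: neither.
double informationGain(double n11, double n10, double n01, double n00) {
    double n = n11 + n10 + n01 + n00;
    double pClass = (n11 + n01) / n;                  // P(class)
    double pTerm  = (n11 + n10) / n;                  // P(term present)
    double hGivenTerm   = entropy(n11 / (n11 + n10)); // H(C | term)
    double hGivenNoTerm = entropy(n01 / (n01 + n00)); // H(C | no term)
    return entropy(pClass)
         - (pTerm * hGivenTerm + (1.0 - pTerm) * hGivenNoTerm);
}
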

Meeting with Martin.

Long explanation of the kNN implementation algorithm.
Dudani voting method -> the weight of each neighbour depends on the distance of the farthest point.

The size of k (in kNN) can change if there are several samples at the same distance.
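
A sketch of Dudani's distance-weighted vote as I understand it (names are mine, not from our implementation): each of the k neighbours gets the weight (d_k - d_i) / (d_k - d_1), where d_1 is the distance of the nearest neighbour and d_k that of the farthest one, so the nearest neighbour votes with weight 1 and the farthest with weight 0.

// Dudani's distance-weighted kNN vote over the k retrieved neighbours.
#include <map>
#include <vector>

struct Neighbour {
    int label;
    double distance;    // neighbours assumed sorted by increasing distance
};

int dudaniVote(const std::vector<Neighbour>& nn) {
    double d1 = nn.front().distance;
    double dk = nn.back().distance;

    std::map<int, double> votes;
    for (const Neighbour& n : nn) {
        // Weight shrinks linearly from 1 (nearest) to 0 (farthest).
        double w = (dk == d1) ? 1.0 : (dk - n.distance) / (dk - d1);
        votes[n.label] += w;
    }

    int best = nn.front().label;
    double bestVotes = -1.0;
    for (const auto& v : votes)
        if (v.second > bestVotes) { bestVotes = v.second; best = v.first; }
    return best;
}
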

Our dataset is based on tree distances (there are no vectors).
It is about question answering: the original data are questions, and the labels describe what each question is looking for.

Everything is implemented in C++.

The QBank Manager class recovers the original question.





SIGIR poster submitted.

Stephan Schlogl rewrote the text.

2009/02/22

project proposal

Distance in learning machines

Today I wrote the skeleton of the project proposal.

A study of distance-based machine learning algorithms

(phd thesis)

It looks like a literature review.

Distance Metric Learning for Large Margin Nearest Neighbor Classification

key words:
Mahalanobis distance: not dependent on the scale of measurement.
Many researchers -> kNN classification can be significantly improved by using a distance metric learned from labeled examples.
The approach is similar to SVM.
Large margin nearest neighbor (LMNN) classification.

linear transformation that optimizes knn

Cost function: penalizes large distances between each input and its target neighbors, and penalizes small distances between each input and all other inputs that do not share the same label.
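
As I understand it from the paper, the cost function is roughly (my notation, not copied from the paper):

\[
\varepsilon(L) = \sum_{i,\; j \rightsquigarrow i} \|L(x_i - x_j)\|^2
 + c \sum_{i,\; j \rightsquigarrow i} \sum_{l} (1 - y_{il})
 \bigl[\, 1 + \|L(x_i - x_j)\|^2 - \|L(x_i - x_l)\|^2 \,\bigr]_+
\]

where \(j \rightsquigarrow i\) means that \(x_j\) is a target neighbor of \(x_i\), \(y_{il} = 1\) iff \(x_i\) and \(x_l\) share the same label, and \([z]_+ = \max(z, 0)\) is the hinge. The first term pulls target neighbors closer; the second pushes away differently labeled points that invade the margin.
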

Matlab implementation available online.

related work

semidefinite programming

neighborhood component analysis.
relevant component analysis.

PCA

Kernelizing the algorithm will improve it.

Distance Metric Learning, with Application to Clustering with Side-Information




Not read yet.
Book: Foundations of Statistical Natural Language Processing.
Read up to page 55.

2009/02/21

kNN with a kernel transformation --> changing the way distance is measured
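
A tiny sketch of what this means in practice (the RBF kernel is just an example choice; names are mine): the squared distance in the kernel's feature space can be computed from kernel evaluations alone, so kNN can work in that space without ever building the mapping explicitly.

// Kernel-induced distance: ||phi(x) - phi(y)||^2 = k(x,x) - 2 k(x,y) + k(y,y).
#include <cmath>
#include <vector>

double rbfKernel(const std::vector<double>& a, const std::vector<double>& b,
                 double gamma) {
    double sq = 0.0;
    for (size_t i = 0; i < a.size(); ++i) sq += (a[i] - b[i]) * (a[i] - b[i]);
    return std::exp(-gamma * sq);
}

double kernelSquaredDistance(const std::vector<double>& x,
                             const std::vector<double>& y, double gamma) {
    return rbfKernel(x, x, gamma) - 2.0 * rbfKernel(x, y, gamma)
         + rbfKernel(y, y, gamma);
}
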

Kernelizing linear classifiers.
Writing: Estimating performance of text classification.
Learning from imbalanced data sets with boosting and data generation: the DataBoost-IM approach

papers:

Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning

ACM

I got enrolled in the ACM.

Blog

Today I set up my research blog.

2009/02/20

CNGL meeting

2009/02/19

CNGL meeting