Research diary: preprocessing samples before giving them to the learning machines

2009/02/24

A large body of work exists on preprocessing samples before giving them to the learning machines:

· Remove noise

o Algorithms for detecting noisy samples, based on k-NN.

· Add noise

o A small amount of noise gives better performance in neural networks (and maybe also in other algorithms).

· Re-structure the dimensionality and distance metric.

o Mahalanobis distance.

o Scaling the data: it gives an improvement in SVMs.

o Kernels: increase dimensions.

o Genetic kernel (GK SVM)

o Removing features:

§ removing dimensions (feature selection)

· information gain (the best)

· mutual information

· χ² (chi-squared) statistic (second best)

· term strength

§ principal component analysis.

§ neighborhood component analysis

· Re-sampling:

o Under-sampling:

§ Randomly

§ Removing inconsistent data

§ Removing duplicate data

§ Removing noise (bis)

o Over-sampling:

§ Randomly

§ SMOTE

§ Borderline-SMOTE1

§ Borderline-SMOTE2

§ Adding noise (bis)

§ Give more weight to hard samples.

· Windowed data:

o In some cases context information increases the accuracy.
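
Note on the SMOTE entry under over-sampling: the core idea can be sketched in a few lines. This is only a minimal illustration in plain Python, not a reference implementation. The function name smote, its parameters and the list-of-tuples input format are assumptions made for this note. Each synthetic sample is a random point on the segment between a minority sample and one of its k nearest minority neighbours.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Return n_new synthetic samples interpolated between minority samples.

    minority: list of equal-length tuples of floats (assumed input format).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base sample, itself excluded.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # A random point on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


if __name__ == "__main__":
    minority_class = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
    for point in smote(minority_class, n_new=3, k=2):
        print(point)
```

Because every new point lies between two existing minority samples, the synthetic data stays inside the region already covered by the minority class. The Borderline variants differ mainly in which base samples they pick (those close to the class border).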

Split features: in text categorization, words can be split into morphological units (morphs) and these used as features, e.g. with Morfessor.
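
To make the idea of split features concrete, here is a toy sketch. It does not use Morfessor and does not reproduce its algorithm (Morfessor learns the segmentation from data). Instead it strips suffixes from a small hand-written list; the list and the function names split_word and split_features are hypothetical and only serve as illustration.

```python
# Hypothetical suffix list, for illustration only. A real morphological
# segmenter learns its units from a corpus instead of using a fixed list.
SUFFIXES = ("ing", "ed", "ly", "s")


def split_word(word, suffixes=SUFFIXES, min_stem=3):
    """Split a word into [stem, "+suffix"] when a known suffix matches."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def split_features(tokens):
    """Replace each token by its parts, giving a list of sub-word features."""
    features = []
    for token in tokens:
        features.extend(split_word(token.lower()))
    return features


if __name__ == "__main__":
    print(split_features(["Walking", "dogs", "run"]))
```

With features like these, "walking" and "walked" share the feature "walk", so the classifier sees them as related even if one of the two forms is rare in the training set.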
