2009/02/24

Preprocessing samples before giving them to the learning machines

A large body of work exists on preprocessing samples before giving them to the learning machines:

·         Remove noise

o   Algorithms for detecting noisy samples based on k-NN algorithms.

·         Add noise

o   Adding a small amount of noise to the samples can improve the performance of neural networks (and maybe of other algorithms as well).

·         Re-structure the dimensionality and distance metric.

o   Mahalanobis distance.

o   Scaling the data: it gives an improvement in SVMs.

o   Kernels: increase the number of dimensions.

o   Genetic kernel (GK SVM)

o   Removing features:

§  removing dimensions (feature selection)

·         information gain (the best)

·         mutual information

·         χ² (chi-squared) statistic (second best)

·         term strength

§  principal component analysis.

§  neighborhood component analysis

·         Re-sampling:

o   Under-sampling:

§  Randomly

§  Inconsistent data

§  Duplicate data

§  Removing noise (as above)

o   Over-sampling:

§  Randomly

§  SMOTE

§  Borderline-SMOTE1

§  Borderline-SMOTE2

§  Adding noise (as above)

§  Give more weight to hard samples.

·         Windowed data:

o   In some cases context information increases the accuracy.
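Two of the re-sampling ideas listed above can be sketched in a few lines of plain Python: removing noisy samples with a k-NN vote, and SMOTE-style over-sampling by interpolating between minority neighbours. This is only a rough sketch under my own assumptions: the function names (`knn_indices`, `remove_noise`, `smote`), the toy data and the values of `k` are made up for illustration, and the distance is plain Euclidean.

```python
import math
import random
from collections import Counter


def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean), excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def remove_noise(points, labels, k=3):
    """Keep only samples whose label matches the majority label of their k neighbours."""
    keep = []
    for i in range(len(points)):
        votes = Counter(labels[j] for j in knn_indices(points, i, k))
        if votes.most_common(1)[0][0] == labels[i]:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


def smote(minority, n_new, k=2, rng=None):
    """Create n_new synthetic samples, each lying on the segment between a
    minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(knn_indices(minority, i, k))
        t = rng.random()  # position along the segment
        synthetic.append(
            tuple(a + t * (b - a) for a, b in zip(minority[i], minority[j]))
        )
    return synthetic


# Toy data (hypothetical): the last point is labelled "b" but sits among the "a" samples.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6), (0.5, 0.5)]
labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]

clean_points, clean_labels = remove_noise(points, labels, k=3)
new_samples = smote([(5, 5), (5, 6), (6, 5), (6, 6)], n_new=3)
```

On this toy data the mislabelled point is the only one dropped, and the synthetic samples stay inside the region spanned by the minority class. Real implementations usually add more care (ties in the vote, categorical features, borderline variants).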

Split features: in text categorization, words can be split into smaller units using their morphology, for example with Morfessor.
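Staying with text categorization, the χ² feature-selection score from the list above could be computed roughly as follows. It uses the usual 2x2 contingency table between "document contains the term" and "document belongs to the class". The function name `chi2_term_class`, the tiny corpus and the class names are assumptions made up for this sketch.

```python
def chi2_term_class(docs, labels, term, target):
    """Chi-squared statistic between presence of `term` and membership in class `target`.

    a: term present, in class      b: term present, not in class
    c: term absent, in class       d: term absent, not in class
    """
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # A zero denominator means the term or the class never varies: score it 0.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


# Hypothetical toy corpus: each document is a set of words.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]

score_goal = chi2_term_class(docs, labels, "goal", "sport")    # 4.0
score_match = chi2_term_class(docs, labels, "match", "sport")  # 0.0

# Rank the vocabulary by score; low-scoring terms are candidates for removal.
vocabulary = sorted(set().union(*docs))
ranking = sorted(
    vocabulary,
    key=lambda t: chi2_term_class(docs, labels, t, "sport"),
    reverse=True,
)
```

Here "goal" appears only in the sport documents and gets the highest possible score for four documents, while "match" appears equally in both classes and scores zero.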
