2009/04/01

Web Information Retrieval: Spam Detection and Named Entity Recognition

Spam detection

Linguistic features

Adds: web spam

Full search engines: for ranking itself

Challenge:

Complexity

Scale

Co-adaptation.

 

 

 

Blog spam: blogs with hidden links

Attractive keywords

Linguistic analysis

Light-weight linguistic analysis

AIRWeb workshop (Adversarial Information Retrieval on the Web)

Attributes for ML.

Lexical diversity

Syntactical entropy
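
The two attributes above could be computed roughly as follows. This is a minimal sketch, not the method from the talk: it assumes lexical diversity means the type-token ratio, and that syntactical entropy means the Shannon entropy of a sequence of syntactic labels (for example part-of-speech tags) produced by some tagger that is not shown here. The function names are hypothetical.

```python
import math
from collections import Counter


def lexical_diversity(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def shannon_entropy(symbols):
    """Shannon entropy, in bits, of the empirical distribution of symbols.

    Assumption: for 'syntactical entropy' the symbols would be syntactic
    labels such as part-of-speech tags; any hashable items work here.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    # Repetitive text scores low on lexical diversity.
    print(lexical_diversity("buy cheap pills buy cheap pills".split()))
    # Hypothetical tag sequence; a real one would come from a tagger.
    print(shannon_entropy(["N", "V", "N", "V"]))
```

Both values can then be used as numeric features for a classifier, alongside whatever other attributes are available.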

Labels

- Hosts

- Documents

 

String distance metrics

 

Name variations complicate the t…

Permutations, abbreviations, spelling mistakes, declensions

 

Edit distance metrics:

Levenshtein

Bag distance

Needleman-Wunsch

Smith-Waterman

Smith-Waterman with affine gaps
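
To make the first two metrics in this list concrete, here is a small sketch of Levenshtein distance (the standard dynamic program with unit costs) and bag distance (compared as multisets of characters, which gives a cheap lower bound on Levenshtein). The function names are my own, and the unit costs are an assumption; the alignment-based metrics (Needleman-Wunsch, Smith-Waterman) use a similar table but with configurable scores.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (unit costs assumed)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def bag_distance(a, b):
    """Larger of the two multiset differences between the character bags.
    Never exceeds the Levenshtein distance, so it can serve as a filter."""
    bag_a, bag_b = Counter(a), Counter(b)
    only_a = sum((bag_a - bag_b).values())
    only_b = sum((bag_b - bag_a).values())
    return max(only_a, only_b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))   # 3
    print(bag_distance("kitten", "sitting"))  # 3
```

Because bag distance ignores character order, it runs in linear time and can be used to discard obviously different name pairs before running the quadratic edit distance.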

Common character-level n-grams

q-grams, positional q-grams, skip-grams
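
A rough sketch of the three n-gram variants and one way to compare two strings with them. The notes do not say how the common n-grams are scored, so the Dice coefficient over sets of q-grams used here is an assumption, as are the function names and the definition of skip-grams as character pairs with a bounded gap.

```python
def qgrams(s, q=2):
    """All contiguous substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def positional_qgrams(s, q=2):
    """q-grams paired with the position where they start."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def skip_grams(s, max_skip=1):
    """Character pairs with at most max_skip characters skipped between them."""
    grams = []
    for i in range(len(s)):
        for gap in range(max_skip + 1):
            j = i + 1 + gap
            if j < len(s):
                grams.append(s[i] + s[j])
    return grams


def dice_similarity(a, b, q=2):
    """Dice coefficient over the sets of q-grams of a and b (assumed scoring)."""
    grams_a, grams_b = set(qgrams(a, q)), set(qgrams(b, q))
    if not grams_a and not grams_b:
        return 1.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


if __name__ == "__main__":
    print(qgrams("abcd"))
    print(positional_qgrams("abcd"))
    print(skip_grams("abc"))
    print(dice_similarity("night", "nacht"))
```

Positional q-grams let a matcher require that shared q-grams occur at nearby offsets, while skip-grams tolerate a dropped or inserted character between the two letters of a pair.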

 

Longest common substring (LCS)
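
A minimal sketch of finding the longest common substring with the usual dynamic program over suffix match lengths. The function name is hypothetical, and how the result is turned into a score (for example, dividing its length by the length of the shorter name) is left open because the notes do not say.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b.
    If several have the same length, the first one found in a is returned."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match in a
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                current[j] = previous[j - 1] + 1
                if current[j] > best_len:
                    best_len = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_len:best_end]


if __name__ == "__main__":
    # Permuted name parts still share a long substring.
    print(longest_common_substring("John Smith", "Smith, John"))
```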

 

String distance

Jaro

Jaro-Winkler

jwm 
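
For reference, a sketch of the Jaro and Jaro-Winkler measures as they are commonly defined. Both return a similarity between 0 and 1, so a distance would be one minus the value. The prefix weight of 0.1 and the maximum prefix length of 4 are the usual defaults, assumed here rather than taken from the talk; the function names are my own.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1, s2):
        if c1 != c2 or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost reflects the observation that spelling variations in names tend to occur toward the end of the string rather than at the start.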
