Spam detection
Linguistic features
Ads: web spam
Full search engines: for ranking itself
Challenge:
Complexity
Scale
Co-adaptation.
Blog spam: blogs with hidden links
Attractive keywords
Linguistic analysis
Light-weight linguistic analysis
AIRWeb workshop
Attributes for ML.
Lexical diversity
Syntactical entropy
Labels
- Hosts
- Documents
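The two attributes named above can be sketched in a few lines. This is only an illustration under assumptions: lexical diversity is taken to be the type-token ratio, and "syntactical entropy" is read as the Shannon entropy of a part-of-speech tag sequence. The notes do not define either attribute, and the function names are hypothetical. Tagging is assumed to happen elsewhere; the entropy function just takes a list of symbols.

```python
import math
from collections import Counter


def lexical_diversity(tokens):
    """Type-token ratio: distinct tokens / total tokens (assumed definition)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols.

    Passing a sequence of POS tags gives one possible reading of
    "syntactical entropy"; that reading is an assumption, not from the notes.
    """
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Keyword-stuffed text tends to repeat the same few tokens, so a low type-token ratio is one plausible spam signal. Each value would be computed per document (or aggregated per host) and fed to the classifier as a feature.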
String distance metrics
Name variations complicate the t…
Permutations, abbreviations, spelling mistakes, declensions
Edit distance metrics:
Levenshtein
Bag distance
Needleman-Wunsch
Smith-Waterman
Smith-waterman with affine gaps.
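A minimal sketch of three of the metrics listed above: Levenshtein distance, bag distance, and a Smith-Waterman local alignment score. The Smith-Waterman scoring values (match=2, mismatch=-1, gap=-1) are illustrative assumptions, and the gap penalty is linear rather than affine to keep the example short.

```python
from collections import Counter


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def bag_distance(a, b):
    """Cheap lower bound on Levenshtein: compares character multisets only."""
    ca, cb = Counter(a), Counter(b)
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (linear gaps; scoring values are assumed)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = max(0,
                        prev[j - 1] + (match if ca == cb else mismatch),
                        prev[j] + gap,
                        curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Bag distance never exceeds the Levenshtein distance, so it can serve as a fast filter before running the quadratic dynamic program on a candidate pair.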
Common character-level n-grams
q-grams, positional q-grams, skip-grams
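As a rough illustration of the q-gram idea, the sketch below extracts padded character q-grams and compares two strings by the Jaccard overlap of their q-gram sets. The padding character, the choice of Jaccard (rather than, say, Dice or a count-based distance), and the function names are assumptions for the example.

```python
def qgrams(s, q=2, pad="#"):
    """Character q-grams of s, padded so that edge characters count fully."""
    padded = pad * (q - 1) + s + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def positional_qgrams(s, q=2, pad="#"):
    """q-grams paired with their start position in the padded string."""
    return list(enumerate(qgrams(s, q, pad)))


def qgram_jaccard(a, b, q=2):
    """Jaccard similarity of the two q-gram sets (1.0 means identical sets)."""
    ga, gb = set(qgrams(a, q)), set(qgrams(b, q))
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)
```

For example, "smith" and "smyth" share four of eight distinct bigrams under this padding, so their similarity is 0.5. Positional q-grams additionally allow a match to be rejected when the shared gram sits too far apart in the two strings.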
longest common substring (LCS)
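The longest common substring can be found with a standard dynamic program over suffix match lengths. This is a sketch only; the function name is hypothetical, and ties are resolved in favour of the first longest match found in the first string.

```python
def longest_common_substring(a, b):
    """Return the longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common suffix ending at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]
```

A similarity score can then be derived, for instance by dividing the length of the result by the length of the shorter input; that normalisation is one option among several.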
string distance
Jaro
Jaro-Winkler
jwm
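A minimal sketch of Jaro and Jaro-Winkler similarity follows. The prefix scale of 0.1 and the maximum prefix length of 4 are the commonly used defaults and are assumed here; the notes do not give parameter values.

```python
def jaro(a, b):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if equal and within this window of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a)
            + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    sim = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:max_prefix], b[:max_prefix]):
        if x != y:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)
```

On the classic pair "MARTHA" / "MARHTA" this gives a Jaro similarity of about 0.944 and a Jaro-Winkler similarity of about 0.961, reflecting the shared three-character prefix.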