This is a reading note from 'Building Machine Learning Systems with Python' (p. 59).
Train_data = [
    'This is a toy post about machine learning. Actually, it contains not much interesting stuff.',
    'Imaging databases provide storage capabilities.',
    'Most imaging databases save images permanently.',
    'Imaging databases store data.',
    'Imaging databases store data. Imaging databases store data. Imaging databases store data.',
    'Does imaging databases store data?']
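For orientation, here is a minimal sketch of the plain bag-of-words vectorization of this data, before any of the tricks below (standard scikit-learn API; the exact feature count depends on the tokenizer):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)            # keep every word that occurs at least once
X_train = vectorizer.fit_transform(Train_data)    # sparse matrix: one row per post, one column per word
num_samples, num_features = X_train.shape
print(num_samples, num_features)                  # 6 posts, num_features distinct words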
There are a few tricks to know when extracting text features with sklearn.feature_extraction.text:
1. Normalization: if we want to look at word frequencies instead of raw counts, it is enough to normalize each count vector to unit length:
import scipy as sp
import scipy.linalg              # sp.linalg is not always imported automatically
def normalize(a):
    return a / sp.linalg.norm(a)   # scale the raw count vector to unit length
As a result, the normalized vectors of Train_data[3] and Train_data[4] become identical, because the latter is just the former repeated three times; see the sketch below.
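A minimal sketch of that effect on the toy data above (re-using Train_data and the normalize function; np.allclose is only used for the comparison):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(Train_data)
v3 = X.getrow(3).toarray().ravel()     # "Imaging databases store data."
v4 = X.getrow(4).toarray().ravel()     # the same post repeated three times
print(np.allclose(normalize(v3), normalize(v4)))   # True: after scaling to unit length they coincide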
2. Removing less important words: words such as "most" appear very often in all sorts of contexts; such words are called "stop words". Since they carry little information, the best option is to remove all words that are so frequent that they do not help to distinguish between texts.
e.g. vectorizer = CountVectorizer(min_df=1, stop_words='english')
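A quick check that the stop-word filter really drops such words from the vocabulary (using Train_data from above; in newer scikit-learn versions get_feature_names is called get_feature_names_out):

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1, stop_words='english')
vectorizer.fit(Train_data)
print('most' in vectorizer.get_feature_names())   # False: 'most' is on the built-in English stop word list
print(len(vectorizer.get_stop_words()))           # size of that built-in list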
3. [Most important!] Stemming: we currently count similar words in different variants as different words, for instance 'imaging' and 'images'. It would make sense to count them together, since they refer to the same concept. That is why we need NLTK.
import nltk.stem
s=nltk.stem.SnowballStemmer('english')
s.stem('imaging') #u'imag'
s.stem('image') #u'imag'
s.stem('imagination') #u'imagin'
Then, we extend the vectorizer with NLTK's stemmer.
We need to stem the posts before feeding them into CountVectorizer. The class provides several hooks with which we can customize the preprocessing and tokenization stages: the preprocessor and tokenizer can be set as constructor parameters. However, we do not want to place the stemmer into either of them, because we would then have to do the tokenization and normalization ourselves. Instead, we override the build_analyzer method as follows.
import nltk.stem
from sklearn.feature_extraction.text import CountVectorizer

english_stemmer = nltk.stem.SnowballStemmer('english')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        # stem every token produced by the standard analyzer
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))

vectorizer = StemmedCountVectorizer(min_df=1, stop_words='english')
new_post = ["imaging databases provides storage capabilities and store image."]
print(vectorizer.fit_transform(new_post).toarray())   # We got [[1 1 2 1 1 1]]
print(vectorizer.get_feature_names())   # We got [u'capabl', u'databas', u'imag', u'provid', u'storag', u'store']
# We now have one feature fewer, because "images" and "imaging" collapsed to the stem 'imag'.
super() in Class Inheritance
A typical use for calling a cooperative superclass method is:
class C(B):
    def meth(self, arg):
        super(C, self).meth(arg)
TfidfVectorizer inherits from CountVectorizer and adds TF-IDF weighting on top of the counts. In the same way, we can subclass TfidfVectorizer to get a stemmed TF-IDF vectorizer.
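Following the same pattern as StemmedCountVectorizer above, a stemmed TF-IDF vectorizer is only a few lines (a sketch; it re-uses the english_stemmer defined earlier):

from sklearn.feature_extraction.text import TfidfVectorizer

class StemmedTfidfVectorizer(TfidfVectorizer):
    def build_analyzer(self):
        analyzer = super(StemmedTfidfVectorizer, self).build_analyzer()
        # stem every token coming out of TfidfVectorizer's standard analyzer
        return lambda doc: (english_stemmer.stem(w) for w in analyzer(doc))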
4. Drawbacks
Here is our current text preprocessing phase (a code sketch of the whole pipeline follows the list):
1. tokenizing the text
2. throwing away words that occur way too often to be of any help in detecting relevant posts
3. throwing away words that occur so seldom that there is only a small chance that they occur in future posts
4. counting the remaining words
5. calculating TF-IDF values from the counts, considering the whole text corpus
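Put together, the whole phase is a single fit_transform call on the stemmed TF-IDF vectorizer defined above (a sketch on the toy data; with min_df=1 step 3 drops nothing in such a tiny corpus):

vectorizer = StemmedTfidfVectorizer(min_df=1, stop_words='english')
X_train = vectorizer.fit_transform(Train_data)                # steps 1-5: tokenize, filter, count, TF-IDF weight
new_post_vec = vectorizer.transform(['imaging databases store data'])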
But the drawbacks of this bag-of-words approach are also obvious:
1. It does not capture word order or relations between words; for example, 'Car hits wall' and 'Wall hits car' get exactly the same feature vector.
2. It does not capture negations correctly; for example, 'I will eat ice cream' and 'I will not eat ice cream' look very similar.
3. It totally fails with misspelled words: although it is clear to us that 'database' and 'databas' convey the same meaning, the vectorizer treats them as completely different features.
The first two drawbacks can be mitigated by moving from single words to n-grams (bigrams or trigrams), as sketched below.
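For instance, word bigrams already tell the two word orders apart; ngram_range is the standard scikit-learn parameter for this:

from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))     # unigrams plus word bigrams
X = bigram_vectorizer.fit_transform(['Car hits wall', 'Wall hits car'])
print(bigram_vectorizer.get_feature_names())
# ['car', 'car hits', 'hits', 'hits car', 'hits wall', 'wall', 'wall hits']
print(X.toarray())   # the two posts now get different feature vectors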