dirty cat: ngram
When you're working with scikit-learn you'll often need to deal with categorical data, and how you encode that data really matters. In this series of videos we'll explore the dirty-cat library while we try to deal with categorical data.
Notes
Let's change the analyzer and ngram_range parameters.
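from sklearn.feature_extraction.text import CountVectorizer

# Count character n-grams of length 2 to 4 instead of whole words
cv = CountVectorizer(analyzer='char', ngram_range=(2, 4))
cv.fit(ml_df['employee_position_title'])
# Shape is (number of rows, number of distinct n-grams)
cv.transform(ml_df['employee_position_title']).shape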
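If you don't have ml_df at hand, the sketch below shows what these two parameters do on a tiny, made-up list of job titles. The titles and the names cv_demo, analyzer and grams are assumptions for illustration only; they are not part of the dataset used in the video.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job titles, made up for illustration (not taken from ml_df)
titles = ["Police Officer", "Police Oficer", "Bus Operator"]

cv_demo = CountVectorizer(analyzer='char', ngram_range=(2, 4))

# build_analyzer returns the function that turns one string into n-grams.
# Text is lowercased first because lowercase=True is the default.
analyzer = cv_demo.build_analyzer()
grams = analyzer("Bus")
print(grams)

# One row per title, one column per distinct character n-gram
X = cv_demo.fit_transform(titles)
print(X.shape)
```

Because the pieces are short runs of characters rather than whole words, a misspelled title such as "Police Oficer" still shares most of its columns with the correct spelling.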
You can also inspect the vocabulary.
cv.vocabulary_
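On a fitted CountVectorizer the vocabulary is stored in the vocabulary_ attribute, a dictionary that maps each n-gram to its column index. Here is a minimal sketch of how you might use it, again on made-up titles rather than the real dataset (titles, cv_demo and shared are hypothetical names).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job titles, made up for illustration (one is misspelled)
titles = ["Police Officer", "Police Oficer"]

cv_demo = CountVectorizer(analyzer='char', ngram_range=(2, 4))
X = cv_demo.fit_transform(titles)

# vocabulary_ maps each n-gram to its column index in the count matrix
vocab = cv_demo.vocabulary_
col = vocab["po"]
print(col, X[:, col].toarray().ravel())

# Collect the n-grams that occur in both spellings
counts = X.toarray()
shared = [g for g, j in vocab.items() if counts[0, j] > 0 and counts[1, j] > 0]
print(len(shared), "of", len(vocab), "n-grams are shared")
```

The shared n-grams are what let a model treat the two spellings as similar, which is the reason for switching to character n-grams when the categories are dirty.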