Data preprocessing

We see items, which are obviously not words, such as 00 and 000. Maybe we should ignore items that contain only digits. However, 0d and 0t are also not words. We also see items as __, so maybe we should only allow items that consist only of letters. The posts contain names such as andrew as well. We can filter names with the Names corpus from NLTK we just worked with. Of course, with every filtering we apply, we have to make sure that we don't lose information. Finally, we see words that are very similar, such as try and trying, and word and words.

We have two basic strategies to deal words from the same root--stemming and lemmatization. Stemming is the more quick and dirty type approach. It involves chopping, if necessary, off letters, for example, 'words' becomes 'word' after stemming. The result of stemming doesn't have to be a valid word. Lemmatizing, on the other hand, is slower but more accurate. Lemmatizing performs a dictionary lookup and guarantees to return a valid word unless we start with a non-valid word. Recall that we have implemented both stemming and lemmatization using NLTK in a previous section.

Let's reuse the code from the previous section to get the 500 words with highest counts, but this time, we will apply filtering:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.datasets import fetch_20newsgroups
>>> from nltk.corpus import names
>>> from nltk.stem import WordNetLemmatizer

>>> def letters_only(astr):
return astr.isalpha()

>>> cv = CountVectorizer(stop_words="english", max_features=500)
>>> groups = fetch_20newsgroups()
>>> cleaned = []
>>> all_names = set(names.words())
>>> lemmatizer = WordNetLemmatizer()

>>> for post in groups.data:
cleaned.append(' '.join([lemmatizer.lemmatize(word.lower()
for word in post.split()
if letters_only(word)
and word not in all_names]))

>>> transformed = cv.fit_transform(cleaned)
>>> print(cv.get_feature_names())

We are able to obtain the following features:

    ['able', 'accept', 'access', 'according', 'act', 'action', 'actually', 'add', 'address', 'ago', 'agree', 'algorithm', 'allow', 'american', 'anonymous', 'answer', 'anybody', 'apple', 'application', 'apr', 'arab', 'area', 'argument', 'armenian', 'article', 'ask', 'asked', 'assume', 'atheist', 'attack', 'attempt', 'available', 'away', 'bad', 'based', 'basic', 'belief', 'believe', 'best', 'better', 'bible', 'big', 'bike', 'bit', 'black', 'board', 'body', 'book', 'box', 'build', 'bus', 'business', 'buy', 'ca', 'california', 'called', 'came', 'car', 'card', 'care', 'carry', 'case', 'cause', 'center', 'certain', 'certainly', 'chance', 'change', 'check', 'child', 'chip', 'christian', 'church', 'city', 'claim', 'clear', 'clipper', 'code', 'college', 'color', 'come', 'coming', 'command', 'comment', 'common', 'communication', 'company', 'computer', 'computing', 'consider', 'considered', 'contact', 'control', 'controller', 'copy', 'correct', 'cost', 'country', 'couple', 'course', 'cover', 'create', 'crime', 'current', 'cut', 'data', 'day', 'db', 'deal', 'death', 'department', 'design', 'device', 'did', 'difference', 'different', 'discussion', 'disk', 'display', 'division', 'dod', 'doe', 'doing', 'drive', 'driver', 'drug', 'early', 'earth', 'easy', 'effect', 'email', 'encryption', 'end', 'engineering', 'entry', 'error', 'especially', 'event', 'evidence', 'exactly', 'example', 'expect', 'experience', 'explain', 'face', 'fact', 'faq', 'far', 'fast', 'federal', 'feel', 'figure', 'file', 'final', 'following', 'food', 'force', 'form', 'free', 'friend', 'ftp', 'function', 'game', 'general', 'getting', 'given', 'gmt', 'goal', 'god', 'going', 'good', 'got', 'government', 'graphic', 'great', 'greek', 'ground', 'group', 'guess', 'gun', 'guy', 'ha', 'hand', 'hard', 'hardware', 'having', 'head', 'health', 'hear', 'heard', 'hell', 'help', 'high', 'history', 'hit', 'hockey', 'hold', 'home', 'hope', 'house', 'human', 'ibm', 'idea', 'image', 'important', 'include', 'includes', 'including', 'individual', 'info', 'information', 'instead', 'institute', 'interested', 'interesting', 'international', 'internet', 'israeli', 'issue', 'jew', 'jewish', 'job', 'just', 'key', 'kill', 'killed', 'kind', 'know', 'known', 'la', 'large', 'later', 'law', 'le', 'lead', 'league', 'left', 'let', 'level', 'life', 'light', 'like', 'likely', 'line', 'list', 'little', 'live', 'local', 'long', 'longer', 'look', 'looking', 'lost', 'lot', 'love', 'low', 'machine', 'mail', 'main', 'major', 'make', 'making', 'man', 'manager', 'matter', 'maybe', 'mean', 'medical', 'member', 'memory', 'men', 'message', 'method', 'military', 'million', 'mind', 'mode', 'model', 'money', 'monitor', 'month', 'moral', 'mouse', 'muslim', 'na', 'nasa', 'national', 'near', 'need', 'needed', 'network', 'new', 'news', 'nice', 'north', 'note', 'number', 'offer', 'office', 'old', 'open', 'opinion', 'order', 'original', 'output', 'package', 'particular', 'past', 'pay', 'pc', 'people', 'period', 'person', 'personal', 'phone', 'place', 'play', 'player', 'point', 'police', 'policy', 'political', 'position', 'possible', 'post', 'posted', 'posting', 'power', 'president', 'press', 'pretty', 'previous', 'price', 'private', 'probably', 'problem', 'product', 'program', 'project', 'provide', 'public', 'purpose', 'question', 'quite', 'radio', 'rate', 'read', 'reading', 'real', 'really', 'reason', 'recently', 'reference', 'religion', 'religious', 'remember', 'reply', 'report', 'research', 'response', 'rest', 'result', 'return', 'right', 'road', 'rule', 'run', 'running', 'russian', 'said', 'sale', 'san', 'save', 'saw', 'say', 'saying', 'school', 'science', 'screen', 'scsi', 'second', 'section', 'security', 'seen', 'sell', 'send', 'sense', 'sent', 'serial', 'server', 'service', 'set', 'shall', 'short', 'shot', 'similar', 'simple', 'simply', 'single', 'site', 'situation', 'size', 'small', 'software', 'sort', 'sound', 'source', 'space', 'special', 'specific', 'speed', 'standard', 'start', 'started', 'state', 'statement', 'stop', 'strong', 'study', 'stuff', 'subject', 'sun', 'support', 'sure', 'taken', 'taking', 'talk', 'talking', 'tape', 'tax', 'team', 'technical', 'technology', 'tell', 'term', 'test', 'texas', 'text', 'thanks', 'thing', 'think', 'thinking', 'thought', 'time', 'tin', 'today', 'told', 'took', 'total', 'tried', 'true', 'truth', 'try', 'trying', 'turkish', 'turn', 'type', 'understand', 'unit', 'united', 'university', 'unix', 'unless', 'usa', 'use', 'used', 'user', 'using', 'usually', 'value', 'various', 'version', 'video', 'view', 'wa', 'want', 'wanted', 'war', 'water', 'way', 'weapon', 'week', 'went', 'western', 'white', 'widget', 'willing', 'win', 'window', 'woman', 'word', 'work', 'working', 'world', 'write', 'written', 'wrong', 'year', 'york', 'young']

This list seems to be much cleaner. We may also decide to only use nouns or another part of speech as an alternative.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset