We start by keeping only the words made up entirely of letters, which removes pure numbers such as 00 and 000 as well as letter-number combinations such as b8f. The filter function is defined as follows:
>>> def is_letter_only(word):
...     for char in word:
...         if not char.isalpha():
...             return False
...     return True
...
>>> data_cleaned = []
>>> for doc in groups.data:
...     doc_cleaned = ' '.join(word for word in doc.split()
...                            if is_letter_only(word))
...     data_cleaned.append(doc_cleaned)
This produces data_cleaned, a list holding one cleaned string for each document in the newsgroups data.
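To see what the filter keeps and drops, here is a minimal sketch run on two made-up sentences instead of the newsgroups data. The names clean_doc, sample_docs, and sample_cleaned are hypothetical and used only for illustration. The sketch also calls str.isalpha() on the whole word, which gives the same result as the character-by-character loop for any non-empty word.

```python
def is_letter_only(word):
    # str.isalpha() is True only if every character is a letter
    # (and the string is non-empty), matching the loop-based filter
    # for the non-empty tokens that str.split() produces.
    return word.isalpha()


def clean_doc(doc):
    # Hypothetical helper: keep only the letter-only tokens of one document.
    return ' '.join(word for word in doc.split() if is_letter_only(word))


# Made-up sample documents, not taken from the newsgroups data.
sample_docs = [
    "price was 000 for item b8f last year",
    "The car, model b8f, cost 000 dollars.",
]

sample_cleaned = [clean_doc(doc) for doc in sample_docs]
for doc in sample_cleaned:
    print(doc)
```

Note that the second sentence shrinks to "The model cost": because the text is split on whitespace only, tokens with attached punctuation such as "car," and "dollars." are not letter-only and are dropped along with the numbers.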