Natural language preprocessing with Gensim and NLTK - why is it tokenizing down to letters instead of words?

I’ve posted before about my project to map some texts related to an online controversy using natural language processing, and someone pointed out that what I should really be doing is unsupervised topic modeling. I’m working on making that happen, and I keep running into the same problem: all the documentation I can find suggests that Gensim with NLTK support is the best way to do this, but when I preprocess my documents into tokens following the common tutorials, everything gets reduced to individual letters rather than words. Here’s some code:

from nltk.corpus import stopwords
from pprint import pprint
import re
import gensim

# preprocess the reviews for deeper machine learning
# using NLTK, remove stop words from the two corpuses
stop_words = stopwords.words("english")
stop_words.extend(["from", "subject", "re", "edu", "use"])

# TODO: figure out why the preprocessing to remove single quotes and newlines is instead separating out individual letters

#negative_minus_newline = [re.sub('\s+', ' ', sent) for sent in v_neg_string]
#positive_minus_newline = [re.sub('\s+', ' ', sent) for sent in v_pos_string]

#negative_minus_single_quotes = [re.sub("\'", "", sent) for sent in v_neg_string]
#positive_minus_single_quotes = [re.sub("\'", "", sent) for sent in v_pos_string]

#print(negative_minus_single_quotes)
#pprint(positive_minus_single_quotes[:1])

In [32]:

v_pos_no_breaks = ','.join(partially_processed_v_pos)
v_neg_no_breaks = ','.join(partially_processed_v_neg)

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and strips punctuation;
        # deacc=True additionally removes accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

positive_gensim_pre = list(sent_to_words(v_pos_no_breaks))
negative_gensim_pre = list(sent_to_words(v_neg_no_breaks))

#positive_gensim_pre = [i for i in positive_gensim_pre if i]
#negative_gensim_pre = [i for i in negative_gensim_pre if i]

In [35]:

tokens_pos = [list(gensim.utils.tokenize(v_pos_no_breaks))]
tokens_neg = [list(gensim.utils.tokenize(v_neg_no_breaks))]

bi_positive = gensim.models.phrases.Phrases(tokens_pos, min_count=1, threshold=2)
bi_furious = gensim.models.phrases.Phrases(tokens_neg, min_count=1, threshold=2)

The commented-out preprocessing after the TODO seems to be where things go wrong. The last bit, where I prepare the tokens for bigrams, prints out a list of word tokens:

(The following is the output from running the end of the previous snippet in a Jupyter cell.)

> ',
>   'Just',
>   'Perfect',
>   'Everything',
>   'is',
>   'perfect',
>   'I',
>   'have',
>   'just',
>   'finished',
>   'it',
>   'and',
>   'it',
>   'was',
>   'a',
>   'memorable',
>   'game',
>   'Thanks',
>   'Naughty',
>   'Dog',

But running the following and printing the results somehow yields lists of letters.

# find words that commonly occur together in twos and threes
positive_bigram = gensim.models.Phrases(positive_gensim_pre, min_count=5, threshold=100)  # higher threshold, fewer phrases
negative_bigram = gensim.models.Phrases(negative_gensim_pre, min_count=5, threshold=100)
positive_trigram = gensim.models.Phrases(positive_bigram[positive_gensim_pre], threshold=100)

What sort of rookie mistake am I making here? I’ve started over several times and it’s really making me feel stuck. Any help would be appreciated.

Say I have this:

# dummy Phrases model trained on a corpus of tokenized documents
my_phraser = gensim.models.phrases.Phrases(docs)

If I do this, I get letters:

>>> sent = "I am ok"
>>> my_phraser[sent]
['I', ' ', 'a', 'm', ' ', 'o', 'k']

If I pass a tokenized sentence, I get a list of words:

>>> sent = ["I", "am" ,"ok"]
>>> my_phraser[sent]
['I', 'am', 'ok']

Maybe this is what you wanted to achieve.
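
Applied to your pipeline, that would mean skipping the ','.join(...) step and passing the already-tokenized reviews straight to Phrases. A minimal sketch, reusing the variable names from your post with a made-up placeholder list standing in for partially_processed_v_pos:

import gensim

# placeholder for your partially_processed_v_pos list of review strings
partially_processed_v_pos = [
    "Just Perfect. Everything is perfect, I have just finished it and it was a memorable game. Thanks Naughty Dog.",
    "Another positive review would go here.",
]

def sent_to_words(sentences):
    for sentence in sentences:
        # one list of word tokens per review, not per character
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# a list of token lists -- no ','.join(), so nothing gets iterated letter by letter
positive_gensim_pre = list(sent_to_words(partially_processed_v_pos))

# Phrases now sees lists of words, so it can learn word bigrams
positive_bigram = gensim.models.phrases.Phrases(positive_gensim_pre, min_count=1, threshold=2)
print(positive_bigram[positive_gensim_pre[0]])

The same idea applies to the trigram step: feed the second Phrases model a corpus of token lists (e.g. positive_bigram[positive_gensim_pre]), not the bigram model object itself.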


…oh. OH! That… Makes sense. Thank you.