Natural language preprocessing with Gensim and NLTK - why is it tokenizing down to letters instead of words?

I’ve posted before about my project to map some texts related to an online controversy using natural language processing, and someone pointed out that what I should really be doing is unsupervised topic modeling. I’m working on making that happen, and I keep running into the same problem: all the documentation I can find suggests that Gensim with NLTK support is the best way to do this, but when I preprocess my documents into tokens following the common tutorials, everything gets reduced to individual letters rather than words. Here’s some code:

from nltk.corpus import stopwords
from pprint import pprint
import re
import gensim

# preprocess the reviews for deeper machine learning
# using NLTK, remove stop words from the two corpuses
stop_words = stopwords.words("english")
stop_words.extend(["from", "subject", "re", "edu", "use"])

# TODO: figure out why the preprocessing to remove single quotes and newlines is instead separating out individual letters

#negative_minus_newline = [re.sub('\s+', ' ', sent) for sent in v_neg_string]
#positive_minus_newline = [re.sub('\s+', ' ', sent) for sent in v_pos_string]

#negative_minus_single_quotes = [re.sub("\'", "", sent) for sent in v_neg_string]
#positive_minus_single_quotes = [re.sub("\'", "", sent) for sent in v_pos_string]

#print(negative_minus_single_quotes)
#pprint(positive_minus_single_quotes[:1])

In [32]:

v_pos_no_breaks = ','.join(partially_processed_v_pos)
v_neg_no_breaks = ','.join(partially_processed_v_neg)

def sent_to_words(sentences):
    for sentence in sentences:
        # simple_preprocess lowercases and strips punctuation;
        # deacc=True additionally removes accent marks
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

positive_gensim_pre = list(sent_to_words(v_pos_no_breaks))
negative_gensim_pre = list(sent_to_words(v_neg_no_breaks))

#positive_gensim_pre = [i for i in positive_gensim_pre if i]
#negative_gensim_pre = [i for i in negative_gensim_pre if i]

In [35]:

tokens_pos = [list(gensim.utils.tokenize(v_pos_no_breaks))]
tokens_neg = [list(gensim.utils.tokenize(v_neg_no_breaks))]

bi_positive = gensim.models.phrases.Phrases(tokens_pos, min_count=1, threshold=2)
bi_furious = gensim.models.phrases.Phrases(tokens_neg, min_count=1, threshold=2)

The commented-out preprocessing after the TODO seems to be where things go wrong. The last bit, where I prepare the tokens for bigrams, prints out a list of word tokens:

(The following is the output from running the end of the previous snippet in a Jupyter cell.)

> ',
>   'Just',
>   'Perfect',
>   'Everything',
>   'is',
>   'perfect',
>   'I',
>   'have',
>   'just',
>   'finished',
>   'it',
>   'and',
>   'it',
>   'was',
>   'a',
>   'memorable',
>   'game',
>   'Thanks',
>   'Naughty',
>   'Dog',

But running the following and printing the results somehow yields lists of letters.

# find words that commonly occur together in twos and threes
positive_bigram = gensim.models.Phrases(positive_gensim_pre, min_count=5, threshold=100)  # higher threshold, fewer phrases
negative_bigram = gensim.models.Phrases(negative_gensim_pre, min_count=5, threshold=100)
positive_trigram = gensim.models.Phrases(positive_bigram[positive_gensim_pre], threshold=100)

What sort of rookie mistake am I making here? I’ve started over several times and it’s really making me feel stuck. Any help would be appreciated.

Say I have this:

# dummy Phrases model trained on a corpus of tokenized documents
my_phraser = gensim.models.phrases.Phrases(docs)

If I do this, I get letters:

>>> sent = "I am ok"
>>> my_phraser[sent]
['I', ' ', 'a', 'm', ' ', 'o', 'k']

If I pass a tokenized sentence, I get a list of words:

>>> sent = ["I", "am" ,"ok"]
>>> my_phraser[sent]
['I', 'am', 'ok']

Maybe this is what you wanted to achieve.
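
Applied to your pipeline, that would mean skipping the ','.join(...) step and passing the already-tokenized reviews straight to Phrases. A minimal sketch, reusing the variable names from your post with a made-up placeholder list standing in for partially_processed_v_pos:

import gensim

# placeholder for your partially_processed_v_pos list of review strings
partially_processed_v_pos = [
    "Just Perfect. Everything is perfect, I have just finished it and it was a memorable game. Thanks Naughty Dog.",
    "Another positive review would go here.",
]

def sent_to_words(sentences):
    for sentence in sentences:
        # one list of word tokens per review, not per character
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# a list of token lists -- no ','.join(), so nothing gets iterated letter by letter
positive_gensim_pre = list(sent_to_words(partially_processed_v_pos))

# Phrases now sees lists of words, so it can learn word bigrams
positive_bigram = gensim.models.phrases.Phrases(positive_gensim_pre, min_count=1, threshold=2)
print(positive_bigram[positive_gensim_pre[0]])

The same idea applies to the trigram step: feed the second Phrases model a corpus of token lists (e.g. positive_bigram[positive_gensim_pre]), not the bigram model object itself.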


…oh. OH! That… Makes sense. Thank you.