I’ve posted before about my project to map some texts related to an online controversy using natural language processing, and someone pointed out that what I should really be doing is unsupervised topic modeling. I’m working on making that happen, and I keep running into a problem. All the documentation I can find indicates that Gensim with NLTK support is the best way to do this, but when I preprocess my documents into tokens following common tutorials, the preprocessing reduces everything to individual letters rather than words. Here’s some code:
> # preprocess the reviews for deeper machine learning
> # using NLTK, remove stop words from the two corpuses
> stop_words = stopwords.words("english")
> stop_words.extend(["from","subject","re","edu","use"])
>
> # TODO: figure out why the preprocessing to remove single quotes and newlines is instead separating out individual letters
> #negative_minus_newline = [re.sub(r'\s+', ' ', sent) for sent in v_neg_string]
> #positive_minus_newline = [re.sub(r'\s+', ' ', sent) for sent in v_pos_string]
> #negative_minus_single_quotes = [re.sub(r"'", "", sent) for sent in v_neg_string]
> #positive_minus_single_quotes = [re.sub(r"'", "", sent) for sent in v_pos_string]
> #print(negative_minus_single_quotes)
> #pprint(positive_minus_single_quotes[:1])
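I suspect the commented-out comprehensions are tripping over string iteration itself. Here’s a dependency-free sketch of what `for sent in v_neg_string` would do if `v_neg_string` is one plain string rather than a list of sentences (the sample text is made up):

```python
import re

# if the "corpus" is one big string, the comprehension iterates it
# character by character, so re.sub runs on single letters
sample = "it was a memorable game"
cleaned = [re.sub(r"\s+", " ", sent) for sent in sample]
print(cleaned[:5])  # ['i', 't', ' ', 'w', 'a']
```

That would explain why the cleanup step “separates out individual letters” — each `sent` is a single character, not a sentence.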
> v_pos_no_breaks = ','.join(partially_processed_v_pos)
> v_neg_no_breaks = ','.join(partially_processed_v_neg)
>
> def sent_to_words(sentences):
>     for sentence in sentences:
>         yield gensim.utils.simple_preprocess(str(sentence), deacc=True)  # deacc=True removes punctuation
>
> positive_gensim_pre = list(sent_to_words(v_pos_no_breaks))
> negative_gensim_pre = list(sent_to_words(v_neg_no_breaks))
> #positive_gensim_pre = [i for i in positive_gensim_pre if i]
> #negative_gensim_pre = [i for i in negative_gensim_pre if i]
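To narrow down whether the `','.join(...)` step is handing `sent_to_words` the wrong shape, here’s a gensim-free sketch (sample data is made up):

```python
partially_processed = ["Just perfect", "Everything is perfect"]

# join collapses the list into ONE flat string
no_breaks = ",".join(partially_processed)
print(no_breaks)  # Just perfect,Everything is perfect

# so the generator's `for sentence in sentences` walks characters,
# handing the tokenizer one letter at a time
first_items = [item for item in no_breaks][:4]
print(first_items)  # ['J', 'u', 's', 't']
```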
> tokens_pos = [list(gensim.utils.tokenize(v_pos_no_breaks))]
> tokens_neg = [list(gensim.utils.tokenize(v_neg_no_breaks))]
> bi_positive = gensim.models.phrases.Phrases(tokens_pos, min_count=1, threshold=2)
> bi_furious = gensim.models.phrases.Phrases(tokens_neg, min_count=1, threshold=2)
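My understanding is that `Phrases` wants an iterable of token lists (one list per sentence), which is why I wrap the tokenizer output in an outer list above. A quick gensim-free stand-in for the shape I believe it expects (sample tokens are made up):

```python
# stand-in for list(gensim.utils.tokenize(...)): one flat list of word tokens
tokens = "just finished it and it was a memorable game".split()

# Phrases expects [[token, token, ...], ...]; the outer list makes one "sentence"
sentences = [tokens]
print(len(sentences), len(sentences[0]))  # 1 9
```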
The commented-out cleanup after the TODO seems to be where the problem starts. The last bit above, where I prepare tokens for the bigram models, does print out a list of word tokens as expected:
(following is output from running the end of the previous snippet inside a Jupyter cell)
> ',
> 'Just',
> 'Perfect',
> 'Everything',
> 'is',
> 'perfect',
> 'I',
> 'have',
> 'just',
> 'finished',
> 'it',
> 'and',
> 'it',
> 'was',
> 'a',
> 'memorable',
> 'game',
> 'Thanks',
> 'Naughty',
> 'Dog',
But running the `simple_preprocess` version and printing the results somehow yields lists of letters instead.
> # find words that commonly occur together in twos and threes
> positive_bigram = gensim.models.Phrases(positive_gensim_pre, min_count=5, threshold=100)  # higher threshold, fewer phrases
> negative_bigram = gensim.models.Phrases(negative_gensim_pre, min_count=5, threshold=100)
> positive_trigram = gensim.models.Phrases(positive_bigram[positive_gensim_pre], threshold=100)
What sort of rookie mistake am I making here? I’ve started over several times and it’s really making me feel stuck. Any help would be appreciated.