Natural language processing: what's the ideal workflow for training a topic classification model?

So, I’ve been hanging around here for a while - I really got on the “gotta learn to be a full stack developer” wagon, and I still want to do that to maximize my job opportunities - but because I chose to self-teach instead of taking courses, I’ve been focusing on personal projects that can become part of a portfolio, and I’m exceptionally interested in doing what is essentially analytic journalism using natural language analysis. (I have a Ph.D. in a field that works with texts, sometimes with quantitative methods.)

Essentially, I’ve used data-gathering scripts to collect reviews of a single controversial video game that got review bombed on Metacritic this summer, plus four “control” games that also have substantial numbers of Metacritic reviews and share traits like being PlayStation 4 games from similar studios. I’ve separated each game’s corpus of reviews into its own set of linked reviews in a Neo4j causal cluster database instance, so that if I want to examine the central, controversial game, I can pull the text of every MF review from the central database. I did this and created a truly massive file (because of all of the disingenuous reviews of this game, which make it an interesting research topic), and then another, not quite as massive but still sizable, file for all the reviews not about that game. (I found out that, for spaCy at least, even the smaller corpus was too large to process; I may limit the corpus to one game and some randomly selected reviews. That’s not my primary challenge here.)

So, what I want to do is train a text classification model on the corpus of reviews of the games that didn’t get review bombed, and then use that model to identify what’s different about the one that did. Is that a reasonable goal? If so, I’m trying to figure out exactly how you build the training data for such a model - how large should it be? Tutorials, regardless of library, tend to show just a few sentences, but those are obviously toy examples.

I’ve read up on some of the theory of this; I know general statistics better than linguistic statistics, but I’m working on that. I’ve done some text preparation using TextBlob, spaCy, NLTK itself, and TensorFlow with Keras. As far as I can tell, there’s not a huge advantage to any of them, except that TextBlob is probably not sophisticated enough for my work.

I’d love any tips folks with NLP experience have to offer. If text classification isn’t the exact right thing to compare these two corpuses, what is? I could make quite a point with just a word cloud and some graphs, but…

Last question: is there an NLP library that can determine arbitrary word correlations? I know even TextBlob can look up the words that tend to be on a vector with, say, “graphics,” but what if I want a model to find which words tend to be closest together (like, if “graphics” and “console” occur together more than any other word pair, I want to discover that without necessarily knowing either word to begin with; is that possible?)

From what I can understand, you scraped reviews, right? But these reviews aren’t labelled, are they? For any classification model we need labels, as you might already know.

But since we do not have any labelled data, you might instead try an unsupervised algorithm, i.e., clustering. Clustering itself is a huge domain, so perhaps your goals can be re-phrased as follows:

  1. Cluster the reviews into two groups: (i) review bombs, (ii) legit reviews
  2. Once you have those clusters, you can analyse them to find text patterns
  3. Can you scale your clustering to work on massive datasets?
  4. Measure the quality of your clustering.
  5. Contrast between classical clustering algorithms and state-of-the-art algorithms.

Most of the popular clustering algorithms are available in scikit-learn.
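If it helps, here is a minimal sketch of step 1 with scikit-learn, using TF-IDF features plus k-means. The `reviews` list is a placeholder for your own scraped texts, and the vectorizer settings are assumptions you would tune on the real corpus:

```python
# Minimal sketch: cluster raw review texts into two groups with TF-IDF + k-means.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

reviews = [
    "Amazing graphics and a gripping story, loved every minute.",
    "Boring gameplay, not worth the price at all.",
    "The soundtrack and visuals are stunning on PS4.",
    "Total trash, the developers ruined the franchise.",
]  # placeholder: substitute your scraped review texts

# Turn each review into a sparse TF-IDF vector, dropping English stop words.
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.8)
X = vectorizer.fit_transform(reviews)

# Ask for two clusters, hoping they roughly separate review bombs from legit reviews.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Inspect the highest-weighted terms per cluster to see what each group talks about.
terms = vectorizer.get_feature_names_out()
for i, center in enumerate(kmeans.cluster_centers_):
    top = np.argsort(center)[::-1][:10]
    print(f"Cluster {i}:", [terms[j] for j in top])
```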

The current state of the art in NLP is Transformers, AFAIK. Read the paper “Attention Is All You Need”. The Transformer is a type of neural network, and there are many variants of it now, like BERT, GPT-3, etc. These can be applied to practically everything NLP-related, so you will have to find your way into them sooner or later.
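If you want a quick feel for what working with these is like, here is a tiny sketch using Hugging Face's `transformers` pipeline API; the default sentiment model it downloads is just an off-the-shelf placeholder, not something tuned for game reviews:

```python
# Minimal sketch: score reviews with a pretrained Transformer via the Hugging Face pipeline API.
# The default sentiment-analysis checkpoint is an off-the-shelf assumption, not tuned for this task.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

reviews = [
    "The graphics and story are incredible, easily game of the year.",
    "Absolute garbage, refunded after ten minutes, 0/10.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)
```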

That being said, there are many ways of approaching this. For instance, you could train an anomaly detection model and treat review bombs as anomalies, etc…
The right answer often lies in understanding more about the data, IMO.
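As a rough illustration of that anomaly-detection idea (the TF-IDF features and IsolationForest detector are just one possible choice, and the review strings are made-up placeholders):

```python
# Rough sketch of the anomaly-detection framing: fit a detector on reviews of the
# "control" games, then flag unusual-looking reviews of the targeted game.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import IsolationForest

control_reviews = ["Solid gameplay and a moving story.", "Fun, polished, but a bit short."]  # placeholder
target_reviews = ["Ruined by the agenda, never buying again!!!", "A heartbreaking masterpiece."]  # placeholder

vectorizer = TfidfVectorizer(stop_words="english")
X_control = vectorizer.fit_transform(control_reviews)
X_target = vectorizer.transform(target_reviews)

# Fit only on the "normal" corpus; -1 means the review looks anomalous relative to it.
detector = IsolationForest(random_state=42).fit(X_control)
print(detector.predict(X_target))
```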

You can also try Gensim ("topic modelling for humans").
For your word-correlation question, I suspect this needs a word vector model like GloVe, Word2Vec, or FastText.
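To make that concrete for the "which words sit closest together" question, here is a minimal sketch with Gensim's Word2Vec (Gensim 4.x API); `tokenized_reviews` is a placeholder for your own pre-tokenized reviews, and the hyperparameters are defaults to tune:

```python
# Minimal sketch: learn word vectors from your own reviews with Gensim's Word2Vec,
# then ask which words are nearest neighbours in the learned vector space.
from gensim.models import Word2Vec

tokenized_reviews = [
    ["the", "graphics", "look", "amazing", "on", "this", "console"],
    ["terrible", "story", "but", "the", "graphics", "are", "great"],
]  # placeholder: each review already lowercased and split into tokens

model = Word2Vec(sentences=tokenized_reviews, vector_size=100, window=5, min_count=1, workers=4)

# Nearest neighbours of a seed word you already care about...
print(model.wv.most_similar("graphics", topn=5))

# ...or scan the learned vocabulary when you don't have a seed word in mind.
for word in model.wv.index_to_key[:20]:
    print(word, model.wv.most_similar(word, topn=1))
```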

But perhaps something as simple as text shingling, as in Chapter 3 of MMDS (http://mmds.org), is enough for your needs.
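For reference, shingling itself is only a few lines: represent each review as its set of character k-grams and compare sets with Jaccard similarity. The example reviews below are made up; the point is that near-duplicate, copy-pasted texts score close to 1.0:

```python
# Tiny sketch of k-shingling (MMDS ch. 3) with Jaccard similarity, handy for
# spotting near-duplicate (copy-pasted) review-bomb texts.

def shingles(text: str, k: int = 5) -> set:
    """Return the set of all length-k character substrings of the normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

r1 = "This game is a disgrace, do not buy it."
r2 = "This game is a disgrace. Do NOT buy it!"
r3 = "A heartbreaking, beautifully acted sequel."

print(jaccard(shingles(r1), shingles(r2)))  # high: near-duplicates
print(jaccard(shingles(r1), shingles(r3)))  # low: unrelated reviews
```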

If I made any mistakes, I am happy to correct myself. Thanks!


Hi there! This is really helpful in getting things into perspective. To answer your questions and clarify:

> From what I can understand, you scraped reviews, right? But these reviews aren’t labelled, are they? For any classification model we need labels, as you might already know.
>
> But since we do not have any labelled data, you might instead try an unsupervised algorithm, i.e., clustering.

Correct, I probably should have articulated this better. I’m looking for something that can do an unsupervised labeling of the reviews I expect to be more “normal.” Is that what you mean by clustering, exactly?

> The current state of the art in NLP is Transformers, AFAIK. Read the paper “Attention Is All You Need”. The Transformer is a type of neural network, and there are many variants of it now, like BERT, GPT-3, etc. These can be applied to practically everything NLP-related, so you will have to find your way into them sooner or later.

Yeah, I was told on another thread where I asked for advice on this that Transformers were the solution - I’m already in touch with Optimus Prime :wink:

I’ve taken a look at Gensim, and that leads to an issue I’m not sure how to overcome yet: so far, only TensorFlow’s text classification model hasn’t crashed when I plug all my review datasets in, and I’m not clear on how one could combine several similar corpuses that have each been analyzed separately, or even if that’s possible. Gensim specifically crashed on the “small” dataset of “normal” reviews.

By clustering, I mean things like k-means or Latent Semantic Analysis.
You can learn more about clustering here: Clustering  |  Machine Learning  |  Google for Developers

A successful clustering will automatically give you two clusters: one with legit reviews, one with review bombs. You can find many tutorials online that teach you how to cluster text data.

If you can share a subset of your reviews (properly formatted pandas csv or something easily readable) perhaps I and other forum members can take a look. Make sure it contains a healthy subset of both bomb and legit reviews.

> I’ve taken a look at Gensim, and that leads to an issue I’m not sure how to overcome yet: so far, only TensorFlow’s text classification model hasn’t crashed when I plug all my review datasets in, and I’m not clear on how one could combine several similar corpuses that have each been analyzed separately, or even if that’s possible. Gensim specifically crashed on the “small” dataset of “normal” reviews.

Dealing with massive datasets is a domain of its own.
The book I suggested (MMDS) has several solutions for dealing with massive datasets. You might give it a read. :slight_smile:

TensorFlow works because it takes data in batches, i.e., mini-batch gradient descent. It only bites off as much as it can chew, meaning only as much as fits into memory (you might have set a batch size yourself). TensorFlow will let you use your entire dataset; it may take a very long time, but it will run. You just need to make sure:
TF model size + batch size <= available memory
Other algorithms may try to load everything into memory at once, because they probably can’t do much without it, and hence crash. So, basically, any classification model that can take data in batches will be able to use the entire dataset.
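As a rough sketch of what that batching looks like in practice with tf.data (the file name, batch size, and vocabulary size are assumptions to adapt to your own setup):

```python
# Rough sketch: stream reviews from disk in batches with tf.data so the whole corpus
# never needs to fit in memory. "normal_reviews.txt" is a hypothetical file with one review per line.
import tensorflow as tf

BATCH_SIZE = 32

dataset = tf.data.TextLineDataset("normal_reviews.txt").batch(BATCH_SIZE)

# Build the vocabulary (and IDF weights) by streaming over the batches.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20_000, output_mode="tf_idf")
vectorizer.adapt(dataset)

# Downstream models consume the vectorized batches one at a time.
for batch in dataset.take(1):
    print(vectorizer(batch).shape)  # (batch_size, vocabulary_size)
```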
I hope this was helpful.

Thanks, that’s very helpful! I’d be happy to share my data, but the question of “bomb reviews” versus not is something I feel I’ve been unclear on, and it needs clarification (I’m reading the links in the thread too, to try to answer this myself): I do not have a set of all bad-faith reviews. What I have is all the reviews for several different games. One of these games (The Last of Us Part II) was aggressively targeted for review bombing. My basic hypothesis is that these reviews, which are not all negative but are far more negative than those of the other games (which were released for the same platform, are targeted at similar audiences, etc.) by a margin of roughly 90-10 positive for the others versus 30-70 positive for TLOU2, will show a higher rate of specifically derogatory references to people’s identities (race, sexuality, and gender identity in particular, because of the plot of the game). It’s sort of a common-sense thing, but as someone who works at the margins of game journalism, no one seems to believe me when I simply assert that the reaction to this game involved a lot of bigotry, so I picked this as a starting project for my natural language work.

Anyway, the point is, I can provide a set of reviews that are for TLOU2 and likely negative, and a set of reviews that are for other games and likely positive (and there are far more TLOU2 reviews, too, such that selecting randomly from that set will almost certainly yield a negative review, and is unlikely to from the other set). But without rewriting my script to capture the score each user gave to the game (which I have yet to fully figure out how to do because of Metacritic’s DOM, and no, there’s no API or maintained package for this), I can’t just pull negative reviews.

One thing I guess I could do is build a corpus of early reviews (could be just TLOU2, since this was a bigger problem for it than for most other games, though it has affected others), as in reviews posted prior to the release date - these reviews have to be in bad faith because the reviewers couldn’t have played the game. This could then be contrasted with later reviews, or with reviews of a different game.

Again, I appreciate being able to talk through this with anyone.

Your goal lines up with “Topic Modelling”.
Topic modelling analyzes documents to learn meaningful patterns of words.

If a topic modelling algorithm finds topics dominated by hate speech or bigotry, you have empirical support for your hypothesis.
One of the most widely used topic modelling algorithms is Latent Dirichlet Allocation (LDA). You might try it out on your dataset; it could be a good starting point.
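A minimal Gensim LDA sketch, assuming the reviews are already tokenized with stop words removed (the placeholder documents and num_topics are only illustrative):

```python
# Minimal sketch: Latent Dirichlet Allocation with Gensim on pre-tokenized reviews.
from gensim import corpora
from gensim.models import LdaModel

tokenized_reviews = [
    ["graphics", "amazing", "story", "beautiful"],
    ["agenda", "ruined", "franchise", "boycott"],
    ["gameplay", "smooth", "graphics", "console"],
]  # placeholder: each review tokenized, stop words removed

dictionary = corpora.Dictionary(tokenized_reviews)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_reviews]

lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=42)

# Each topic is a weighted word list; look for topics dominated by derogatory terms.
for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```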

If you want to go full-blown research mode, which might be too much work for a project like this, you can try recent papers like

I have not seen much work on “analysing review bombing in video game reviews” on Google Scholar, which is why I cannot say for certain that what I suggested will work; this is unknown territory. So you are venturing into the unknown, which is very exciting and also terrifying. xD

One slightly related paper did not use any computational methods.

Since your project is somewhat related to hate speech, you might find some useful things in this domain.

Sorry for the late reply.
I had been busy brooding in quarantine solitude.

Yeah, to be clear I’m transitioning out of an academic career where I researched online hate speech. I don’t want that to be my professional focus but this is one of my interests and it makes it easy to contextualize the coding stuff that’s new to me.