So, I’ve been hanging around here for a while. I got on the “gotta learn to be a full stack developer” wagon and I still want to follow through on it to maximize my job opportunities, but because I chose to self-teach instead of taking courses, I’ve been focusing on personal projects that can become part of a portfolio. I’m especially interested in doing what amounts to analytic journalism using natural language analysis. (I have a Ph.D. in a field that works with texts, sometimes with quantitative methods.)
Essentially, I’ve used data-gathering scripts to collect reviews of a single controversial video game that got review bombed on Metacritic this summer, plus four “control” games that also have substantial numbers of Metacritic reviews and share traits like being PlayStation 4 titles from similar studios. Each game’s corpus of reviews lives as its own set of linked reviews in a Neo4j causal cluster instance, so if I want to examine the central, controversial game, I can pull the text of every last review of it from the database. I did this and created one truly massive file (thanks to all the disingenuous reviews of this game, which is exactly what makes it an interesting research topic), and then a second, not quite as massive but still sizable file for all the reviews of the other games. (I found that, for spaCy at least, even the smaller corpus was too large to process as a single document; I may limit the corpus to one game and a random sample of reviews. That’s not my primary challenge here.)
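For context, the kind of thing I have in mind is roughly the sketch below: pull each game’s reviews out of Neo4j and stream them through spaCy one review at a time instead of concatenating everything into one giant document that blows past `nlp.max_length`. The node labels, relationship, and `text` property here are just placeholders, not my actual schema.

```python
# Illustrative only: labels, relationship, and property names are placeholders.
import spacy
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
nlp = spacy.load("en_core_web_sm")

def fetch_review_texts(game_title):
    """Return the text of every review linked to one game."""
    with driver.session() as session:
        result = session.run(
            "MATCH (r:Review)-[:REVIEWS]->(g:Game {title: $title}) "
            "RETURN r.text AS text",
            title=game_title,
        )
        return [record["text"] for record in result]

# Stream reviews through spaCy individually rather than as one huge string.
for doc in nlp.pipe(fetch_review_texts("The Controversial Game"), batch_size=100):
    pass  # tokenize / lemmatize / count per review here
```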
So, what I want to do is train a text classification model on the corpus of reviews for the games that didn’t get review bombed, and then use that model to identify what’s different about the one that did. Is that a reasonable goal? If so, I’m trying to figure out exactly how you build the training set for a model like that: how large does it need to be? Tutorials, regardless of library, tend to show just a few sentences, but those are obviously toy examples.
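To make the question concrete, here is one possible framing I’ve been toying with (not necessarily the right one): train a binary classifier to tell the two corpora apart and then look at which terms it leans on hardest. This assumes `bombed_reviews` and `control_reviews` are plain Python lists of review strings; the scikit-learn choices are just for illustration.

```python
# One possible framing: a classifier trained to separate the two corpora,
# whose strongest weights hint at what distinguishes the review-bombed game.
# bombed_reviews and control_reviews are assumed lists of review strings.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = bombed_reviews + control_reviews
labels = [1] * len(bombed_reviews) + [0] * len(control_reviews)

vectorizer = TfidfVectorizer(stop_words="english", min_df=5)
X = vectorizer.fit_transform(texts)

clf = LogisticRegression(max_iter=1000)
clf.fit(X, labels)

# Terms pushed hardest toward the review-bombed class.
terms = vectorizer.get_feature_names_out()
top = np.argsort(clf.coef_[0])[-25:]
print([terms[i] for i in reversed(top)])
```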
I’ve read up on some of the theory behind this; I know general statistics better than linguistic statistics, but I’m working on that. I’ve done some text preparation using TextBlob, spaCy, NLTK itself, and TensorFlow with Keras. As far as I can tell, there’s no huge advantage to any one of them, except that TextBlob is probably not sophisticated enough for my purposes.
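The preparation I’ve been doing is roughly along these lines (a simplified sketch with spaCy, not my exact pipeline): lowercase lemmas, with stop words and non-alphabetic tokens dropped.

```python
# Simplified sketch of the kind of prep I've been doing, not my exact pipeline.
import spacy

nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    """Return lowercase lemmas, dropping stop words and non-alphabetic tokens."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop]

print(preprocess("The graphics were gorgeous but the story felt rushed."))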
I’d love any tips folks with NLP experience have to offer. If text classification isn’t quite the right way to compare these two corpora, what is? I could make quite a point with just a word cloud and some graphs, but…
Last question: is there an NLP library that can determine arbitrary word correlations? I know even TextBlob can look up the words that tend to appear alongside, say, “graphics,” but what if I want a model to find which words tend to occur closest together overall (e.g., if “graphics” and “console” co-occur more than any other word pair, I want to discover that without knowing either word ahead of time)? Is that possible?
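If nothing does that out of the box, I suppose I could brute-force it by counting how often word pairs land in the same small window, with no seed word required. Purely a sketch of what I mean, assuming the reviews are already tokenized into lists of lowercase strings:

```python
# Sketch of "arbitrary word correlations": count how often word pairs
# co-occur within a sliding window, with no seed word required.
from collections import Counter
from itertools import combinations

def top_cooccurring_pairs(token_lists, window=5, n=20):
    """token_lists: an iterable of tokenized reviews (lists of lowercase strings)."""
    pair_counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - 1):
            # Pairs in overlapping windows get counted more than once;
            # good enough for a rough ranking.
            for pair in combinations(sorted(set(tokens[i:i + window])), 2):
                pair_counts[pair] += 1
    return pair_counts.most_common(n)

reviews = [
    ["the", "graphics", "push", "this", "console", "hard"],
    ["great", "graphics", "for", "an", "aging", "console"],
]
print(top_cooccurring_pairs(reviews, window=4))
```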