How can I match documents to a complex dataset? Is machine learning a good fit?

Hi,

I have a large dataset and I need to match some documents to it. At the moment I have a function with several if/else statements that tries to cover all the various ways a document could match the dataset, but I am only achieving a 70% match rate. I’m trying to work out whether machine learning or neural network models are the correct approach to get a better match rate, or whether there is another way that I am unfamiliar with. Below is a typical example of the ways these can match:

Document:
Title: “”
Sub-Title: “”
Category: “”
Type: “”

Dataset:
Title: “”
Sub-Title: “”
Category: “”
Type: “”

Here, the document ‘Title’ may match the dataset’s ‘Title’ or ‘Sub-Title’ exactly. However, it could also be that the document ‘Title’ is contained in the dataset’s ‘Title’ without matching it exactly. Alternatively, the document ‘Title’ may be a combination of the dataset’s ‘Title’ and ‘Sub-Title’, merged in some way, such as separated by a comma.

Finally, the document ‘Title’ may match many ‘Title’ values in the dataset, in which case the ‘Sub-Title’ is also needed to find the correct match.

Apologies for the long-winded question. I’m hoping someone can point me in the direction of a tutorial to help me train a model so I can attempt to get to a 90-99% match rate.

Thanks

There are a number of document classification methods out there.

The “simplest” method may be K-nearest neighbors (KNN), which compares each document with the most similar entries in the dataset. It is more or less a scaled-up version of your if-else statements for matching those documents.
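As a rough sketch of the nearest-neighbor idea applied to the fields in the question, the snippet below scores every dataset record against a document and returns the closest ones. Everything here is an assumption for illustration: the `title` / `sub_title` dictionary keys, the helper names, the 0.9 containment score, and the choice of Python's standard-library `difflib.SequenceMatcher` as the similarity measure. It is not a tuned solution, just a starting point to compare against the existing if/else function.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so trivial differences don't matter."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Fuzzy similarity in [0, 1] from the standard library's SequenceMatcher."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def title_score(doc_title, record):
    """Best score over the ways a document title might map onto a record."""
    title, sub = record["title"], record["sub_title"]
    # Title alone, sub-title alone, or the two merged with a comma or a space.
    candidates = [title, sub, f"{title}, {sub}", f"{title} {sub}"]
    best = max(similarity(doc_title, c) for c in candidates if c)
    # Containment: the document title appears inside the dataset title.
    if normalize(doc_title) and normalize(doc_title) in normalize(title):
        best = max(best, 0.9)  # 0.9 is an arbitrary illustrative value
    return best


def nearest_records(doc, dataset, k=3):
    """Return the k closest records; the sub-title breaks ties between titles."""
    def key(record):
        return (
            round(title_score(doc["title"], record), 2),
            similarity(doc["sub_title"], record["sub_title"]),
        )
    return sorted(dataset, key=key, reverse=True)[:k]
```

Calling `nearest_records(doc, dataset, k=1)` gives a single best guess, while a larger `k` gives a shortlist that can be checked by hand or by stricter rules.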

A more sophisticated method you may consider is tf-idf, or term frequency-inverse document frequency. This looks at the contents of each document and weights every word by how often it occurs in that document and how rare it is across all the documents. So if two documents both talk a lot about dogs, and most other documents do not, they are probably very similar.

The tf-idf method is well understood and widely implemented, so I would suggest trying it first and comparing its results against your if-else statements. Of course, this assumes you have the actual text of the document.
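To make the idea concrete, here is a minimal pure-Python sketch of tf-idf weighting followed by cosine similarity, used to find the dataset entry closest to a document. The function names, the whitespace tokenizer, and the smoothed idf formula are choices made for this illustration; in practice a library implementation such as scikit-learn's `TfidfVectorizer` would normally be used instead.

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lower-case and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Return one {term: weight} dict per text."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: in how many texts each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vectors.append({
            term: (count / len(tokens)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_record(doc_text, record_texts):
    """Return (index, score) of the record text most similar to doc_text."""
    vectors = tfidf_vectors(record_texts + [doc_text])
    doc_vec, record_vecs = vectors[-1], vectors[:-1]
    scores = [cosine(doc_vec, rv) for rv in record_vecs]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For the fields in the question, one option is to join ‘Title’ and ‘Sub-Title’ into a single string per record before passing it in, so that merged titles still share words with the right entry.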

Side note: with the higher-level information you have curated, you may be able to combine that information with what you gain from tf-idf to be more sophisticated. In other words, you can score each candidate using both tf-idf and your own if-else statements, then take a weighted average of the two to give you a classification.
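A weighted average like that could be sketched as follows. This assumes both scores have already been scaled to the range 0 to 1; the function names, the default weight of 0.6, and the 0.5 acceptance threshold are placeholders for illustration and would need tuning against documents whose correct match is known.

```python
def combined_score(rule_score, tfidf_score, rule_weight=0.6):
    """Weighted average of two scores that both lie in [0, 1]."""
    if not 0.0 <= rule_weight <= 1.0:
        raise ValueError("rule_weight must be between 0 and 1")
    return rule_weight * rule_score + (1.0 - rule_weight) * tfidf_score


def pick_match(candidates, rule_weight=0.6, threshold=0.5):
    """Pick the best record id from (record_id, rule_score, tfidf_score) tuples.

    Returns None when nothing reaches the threshold, so weak matches can be
    sent for manual review instead of being accepted. The default weight and
    threshold are arbitrary illustrative values.
    """
    scored = [
        (combined_score(rule, tfidf, rule_weight), record_id)
        for record_id, rule, tfidf in candidates
    ]
    if not scored:
        return None
    best_score, best_id = max(scored, key=lambda pair: pair[0])
    return best_id if best_score >= threshold else None
```

Returning `None` below the threshold is a design choice: it keeps uncertain matches out of the results rather than forcing a guess.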

Good luck!
