Faster text matching option than difflib?

I’ve got two sets of a few thousand documents each, and I need to match them against each other to find duplicated text.

Some Context: There is no need to match documents within set A or within set B, only between A and B. A document in A could take, let’s say, 20% of its text from one document in B and 80% from another. Documents in B will generally be much longer than documents in A, so most matches from A will be a subset of the text of a document in B. The matches may not be one monolithic paragraph or section; they could be spread out in bits of 2-3 sentences across the document in B.

What I’ve Done So Far: Right now I’m using SequenceMatcher from difflib. The strange part is that a lot of matches don’t show up unless I set autojunk=False. I’m not sure why that is. Maybe it’s because these documents were originally saved as Word files, which insert some random formatting (??). I’m extracting the text from MS Word using Mammoth and then comparing the plain text.
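A possible explanation for the autojunk behaviour, offered as a sketch rather than a diagnosis of these particular documents: when the second sequence is at least 200 items long, SequenceMatcher's automatic junk heuristic stops using any item that makes up more than about 1% of that sequence as a starting point for matches. If the comparison runs character by character (an assumption, since the post doesn't say), nearly every character in a long document crosses that threshold. The strings below are made up purely to show the effect.

```python
from difflib import SequenceMatcher

# Toy data (hypothetical): a short snippet that occurs verbatim in a longer text.
a = "rat sat on the hat"
b = "the cat sat on the mat and the rat sat on the hat. " * 10  # 510 characters

def longest_match_size(a, b, autojunk):
    """Length of the longest common block SequenceMatcher finds between a and b."""
    sm = SequenceMatcher(None, a, b, autojunk=autojunk)
    return sm.find_longest_match(0, len(a), 0, len(b)).size

# Every character of b occurs more than 1% of the time, so with the
# heuristic on there is nothing left to anchor a match on.
print(longest_match_size(a, b, autojunk=True))   # 0
print(longest_match_size(a, b, autojunk=False))  # 18
```

If this is what is happening, comparing lists of words or sentences instead of raw characters might reduce the effect, since individual words are much less likely than individual characters to exceed the 1% threshold.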

Anyway, setting autojunk=False really slows the process down, so I was wondering if there’s a faster option out there. The end goal is to have the code “mark out” the portions taken from documents in A inside the relevant documents in B, let’s say by wrapping them in a SPAN tag or something.
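One direction that might be worth trying, sketched here as an alternative technique and not as a drop-in replacement for difflib: index every run of N consecutive words ("shingles") from the documents in A, then scan each document in B once and mark the stretches whose shingles appear in that index. All names below (`build_index`, `matched_spans`, `mark`, the `match` CSS class, and N = 5) are hypothetical choices for illustration. This only catches exact word-for-word reuse, ignoring case and punctuation.

```python
import re
from collections import defaultdict
from html import escape

N = 5  # shingle length in words; an assumed value that would need tuning

def tokenize(text):
    """Return (lowercased word, start offset, end offset) for each word."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def build_index(docs_a, n=N):
    """Map each n-word shingle to the ids of the A documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs_a.items():
        words = [w for w, _, _ in tokenize(text)]
        for i in range(len(words) - n + 1):
            index[tuple(words[i:i + n])].add(doc_id)
    return index

def matched_spans(text_b, index, n=N):
    """Merged (start, end) character ranges of text_b whose shingles occur in A."""
    toks = tokenize(text_b)
    spans = []
    for i in range(len(toks) - n + 1):
        key = tuple(w for w, _, _ in toks[i:i + n])
        if key in index:
            start, end = toks[i][1], toks[i + n - 1][2]
            if spans and start <= spans[-1][1]:
                # Overlaps the previous hit: extend it instead of adding a new span.
                spans[-1] = (spans[-1][0], max(spans[-1][1], end))
            else:
                spans.append((start, end))
    return spans

def mark(text_b, spans):
    """Wrap each matched range in a SPAN tag, HTML-escaping the text."""
    out, pos = [], 0
    for start, end in spans:
        out.append(escape(text_b[pos:start]))
        out.append('<span class="match">' + escape(text_b[start:end]) + "</span>")
        pos = end
    out.append(escape(text_b[pos:]))
    return "".join(out)
```

Building the index takes one pass over A and each B document is scanned once, so the cost grows roughly with the total amount of text instead of with the number of A-B pairs. The sets stored in the index could also be used to report which A document each marked span came from.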

Can you share the code you have currently? It might help to see your approach.
