How to search through many text files

irenem · February 3, 2022, 3:00pm

Help! The university I work for has several projects and we’re not doing this efficiently at all. I am technically-inclined, but not comfortable programming (yet, as this website here is amazing and amazingly helpful!).

Where should I look next for setting up a painless way of searching through text files? We have a ton of small PDFs, they’re scanned and OCRed, and professors sometimes need to look for a word or phrase within them. For now, we use Google Drive, but we’re running out of space even in the paid version. The files are spread among three Google Drive accounts and people are getting frustrated because they don’t see all the relevant documents. I’d like to come up with an easier, more elegant solution, and I can learn from here how to create it, but I do need some guidance on which way to go… what will help me do this? Ideally, this is a service we put on Amazon cloud or similar (so that professors working from home can access it without going through a VPN to our main campus), low cost or free, and not too-too crazy to set up…

Thanks!!!
Irene.

brandon_wallace · February 4, 2022, 9:31pm

Welcome to the forum, Irene!

It sounds like you need a database. You would be able to store and query relevant information about each file, such as the date, subject, title, department, and more. Look into SQLite or, if you need something more heavy-duty, PostgreSQL.
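As a rough illustration of the database idea, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `documents`, the helper functions, and the `location` and `body` columns (a link to the PDF and its exported OCR text) are assumptions made up for this example, not something prescribed above. The search is a plain case-insensitive substring match, which may be enough for a small collection but is not a full search engine.

```python
import sqlite3


def build_catalog(conn):
    """Create a hypothetical table holding metadata and OCR text per PDF."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id         INTEGER PRIMARY KEY,
            title      TEXT NOT NULL,
            subject    TEXT,
            department TEXT,
            doc_date   TEXT,  -- ISO 8601 date string, e.g. 2022-02-03
            location   TEXT,  -- assumption: URL or path where the PDF lives
            body       TEXT   -- assumption: OCR text exported from the PDF
        )
        """
    )


def add_document(conn, title, subject, department, doc_date, location, body):
    """Insert one document row using bound parameters."""
    conn.execute(
        "INSERT INTO documents "
        "(title, subject, department, doc_date, location, body) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (title, subject, department, doc_date, location, body),
    )


def find_phrase(conn, phrase, department=None):
    """Return (title, location) pairs whose text contains the phrase.

    Matching is a case-insensitive substring test (ASCII letters only).
    """
    sql = (
        "SELECT title, location FROM documents "
        "WHERE instr(lower(body), lower(?)) > 0"
    )
    params = [phrase]
    if department is not None:
        sql += " AND department = ?"
        params.append(department)
    sql += " ORDER BY title"
    return conn.execute(sql, params).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # use a file path for a real catalog
    build_catalog(conn)
    add_document(
        conn, "Lab notes", "Botany", "Biology", "2021-05-01",
        "https://example.org/lab-notes.pdf",
        "Observations on photosynthesis in ferns.",
    )
    print(find_phrase(conn, "Photosynthesis"))
```

Storing only a link to each PDF, rather than the file itself, keeps the database small; the files can stay in whatever storage is cheapest.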

irenem · February 4, 2022, 9:58pm

Thanks!
I have some familiarity with SQL… but how do I store PDFs in the database? Export the text? Or hm maybe store a URL to them in a different storage?

bradtaniguchi · February 4, 2022, 10:48pm

If you’re running out of space, you’re probably not on the highest-tier option of Google Drive/Google Workspace, which has unlimited space. Not only would this let your professors keep using a tool they already use (Google Drive), it would also let them use it for more than just PDF searching.

I’d look into what GSuite settings/options are available for them.

The issue is that the ideal solution would be something like a search engine for files… which is exactly what Google Drive is. You would be trying to build from scratch something a full team at Google works on. On top of that, you’d need somewhere to host and store all the files, and somewhere to execute the queries.

If you wanted to build this, you’re theoretically looking at what is essentially a search engine that goes over all the parsed text. As you probably know, Google has decades of experience building search. You’d not only be trying to replicate their capabilities, but also their indexes, to get anywhere near the same accuracy and speed.
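To make the "search engine over the parsed text" idea concrete: the usual core data structure is an inverted index, a map from each word to the documents that contain it. The sketch below is a toy version in plain Python. The function names and the sample file names are made up for illustration, and it only answers "which files contain all of these words", with none of the ranking, phrase matching, or typo tolerance a real engine provides.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each word to the set of document names that contain it.

    `docs` is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's documents, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    # Hypothetical sample data standing in for OCR output.
    docs = {
        "a.txt": "Cell biology notes",
        "b.txt": "Notes on cell towers",
    }
    index = build_index(docs)
    print(sorted(search(index, "cell notes")))
```

Building the index once and looking words up in it is what makes search fast; scanning every file on every query gets slow as the collection grows.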

Finally, even a simple solution to the problem of “finding a file” comes down to queries. It’s possible your professors can’t find their files because they aren’t sure what the files are named. No search engine can help you find something when you aren’t sure what it is named. The only fix is more organization and effort to categorize your data (which requires a human, or an AI trained on labeled data).

So in general, this is the sort of problem that is difficult even for Google to solve. Building a custom solution is possible, but odds are it won’t be as good as the dedicated one from Google. The human factor also still leaves some wiggle room for a bad experience, even with a “perfect” product.

If you’re working at a school, I’d talk with the Google Workspace vendor about upgrades and solutions for your plan. Google itself has a number of vendors that work with it to “enhance” its products, including Google Drive. Your school’s IT department should be aware of what sort of service contract it has with Google for the Google Workspace service.

I think Google gives discounts to schools for its services, and either way I consider it money well spent, especially because students and professors can usually leverage these tools for their projects and classes.

