Hey guys, I’ve recently found myself working in an office doing data entry. I’ve automated a good deal of my workflow, but there is one specific problem I’d like to solve. I need to write a small program for it, and I need the community’s help.
I’ve got text that I’ve extracted from a PDF. Mixed in with a bunch of irrelevant things, I’ll get some data in this exact format:
Dom Anzalone (212) 915-6316 VP, Regional Branch Manager something.person@cmail.com
Peter Workman (212) 254-5900 President petey@notrealperson.com
I want to extract that data for the entire document. I COULD potentially use the fact that each of these records is preceded by ____ and followed by either another ____ or a line break. For example:
____ Neal Shapiro Chief Executive Officer and President Shapppynam(at)pajamas(dot)org ____
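A minimal Python sketch of that ____ idea, assuming every record really does start with four underscores and ends at the next ____ or at a line break. The function name `split_records` is made up for illustration, and the pattern is only a first guess that would need checking against the real extracted text.

```python
import re

# Capture whatever sits between a ____ marker and the next ____,
# line break, or end of the text. Lines without a leading ____ are skipped.
RECORD_RE = re.compile(r"____[ \t]*(.+?)[ \t]*(?=____|\n|$)")

def split_records(text):
    """Return the contact chunks found after each ____ marker."""
    return RECORD_RE.findall(text)
```

Calling `split_records(extracted_text)` on the whole document would give a list of strings, one per contact, with the irrelevant lines left out.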
But even then I’m not exactly sure how I would crawl for such information, and there are a couple of secondary things I’d like to do automatically as well.
The final product should be a CSV file or an Excel spreadsheet, such that each TYPE of information ends up in its own column within its respective contact row. Example:
First Name, Last Name, Title, Phone, Email
Neal, Shapiro, Chief Executive Officer and President, , Shapppynam(at)pajamas(dot)org
So you see there are a couple of hurdles. For example, what should the program do when it doesn’t find a phone number or an email address? Ideally it would leave that field blank, right? There are lots of moving parts, and I was just hoping to be pointed in the right direction. I think if I can successfully finish this I might finally understand regex! (This is a dream of mine.)
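Here is a rough Python sketch of the blank-field handling, using only the standard `re` and `csv` modules. It rests on several assumptions that may not hold for the real data: phone numbers always look like (212) 915-6316, emails contain a real @ and ., and every name is exactly two words (first and last). The names `parse_contact` and `contacts_to_csv` are hypothetical.

```python
import csv
import io
import re

HEADER = ["First Name", "Last Name", "Title", "Phone", "Email"]

# Assumed formats: "(ddd) ddd-dddd" for phones, a simple pattern for emails.
PHONE_RE = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def parse_contact(line):
    """Split one contact line into five columns; missing pieces become ''."""
    phone_match = PHONE_RE.search(line)
    email_match = EMAIL_RE.search(line)
    phone = phone_match.group() if phone_match else ""
    email = email_match.group() if email_match else ""

    # Cut the phone and email out so only "First Last Title..." remains.
    rest = line
    for piece in (phone, email):
        if piece:
            rest = rest.replace(piece, " ", 1)

    words = rest.split()
    # Assumption: the name is exactly the first two words.
    first = words[0] if words else ""
    last = words[1] if len(words) > 1 else ""
    title = " ".join(words[2:])
    return [first, last, title, phone, email]

def contacts_to_csv(lines):
    """Build CSV text with a header row and one row per contact line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(HEADER)
    for line in lines:
        writer.writerow(parse_contact(line))
    return buffer.getvalue()
```

Because a failed search simply yields an empty string, a contact with no phone number ends up with an empty Phone column instead of shifting the other fields. The `csv` module also quotes titles that contain commas, such as "VP, Regional Branch Manager", so they stay in one column.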
Finally, because the site now considers me a new user (I haven’t been on in a couple of months and my account got scrubbed), I can only post two “links”, so please imagine my (at)s and (dot)s are actually @ and .
Thanks for any help!
-Z