Working on a mini data extraction program, I suck at regex, hoping the community can help

Hey guys, I’ve recently found myself working in an office doing data entry. I’ve automated a good deal of my workflow, but there is one specific problem I’d like to solve: I need to make a little program, and I need the community’s help.

I’ve got text that I’ve extracted from a PDF. Mixed in with a bunch of irrelevant things, I’ll get some data in this exact format:

Dom Anzalone (212) 915-6316 VP, Regional Branch Manager

Peter Workman (212) 254-5900 President

I want to extract that data for the entire document. I COULD potentially use the fact that all of this information is preceded by ____ and followed by either another ____ or a line break, example

____ Neal Shapiro Chief Executive Officer and President Shapppynam(at)pajamas(dot)org ____

But even then I’m not exactly sure how I would crawl for such information, and there are a couple of secondary things I’d like to do automatically as well.

The final product should be a CSV file or an Excel spreadsheet, such that each TYPE of information ends up in its own column within its respective contact row. Example:

First Name, Last name, Title, Phone, Email
Neal, Shapiro, Chief Executive Officer and President, , Shapppynam(at)pajamas(dot)org

So you see there are a couple of hurdles, like what the program does when it doesn’t find a phone number or an email address. Ideally it leaves that piece blank, right? Lots of moving parts, and I was just hoping to be pointed in the right direction. I think if I can successfully finish this I might finally understand regex! (this is a dream of mine)

Finally, because it’s now considering me a new user (I haven’t been on in a couple of months and my account got scrubbed), I can only post two “links”, so imagine my (at)s and (dot)s are actually @ and . :stuck_out_tongue:
Thanks for any help!


I love regex, but as they say, “when you try to solve your problem with regex, you now have 2 problems”. :stuck_out_tongue:

If (and that’s a really big if) your input is the same as you state:

Dom Anzalone (212) 915-6316 VP, Regional Branch Manager

Here’s one way to do it:

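var contactInfo = 'Dom Anzalone (212) 915-6316 VP, Regional Branch Manager'
  // mark the start of the phone number: the first " (" becomes "::("
  .replace(/ \(/, '::(')
  // mark the start of the title: the first digit, space, word character
  .replace(/(\d) (\w)/, '$1::$2')
  // mark the start of the email, if there is one
  .replace(/ ([^@ ]+@)/, '::$1');

var contactArr = contactInfo.split('::');

// fields missing from the end (no email here) come back as undefined
var contact = {
  name: contactArr[0],
  phone: contactArr[1],
  title: contactArr[2],
  email: contactArr[3]
};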

This isn’t the only way.

If you can control the input, you should add a delimiter (in my case, ::) before processing. Then, you only need to do a split and don’t need any regex.
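As a rough sketch of the delimiter idea, assuming something upstream already writes every field separated by :: and leaves an empty slot for anything missing (parseDelimited is just a made-up name for illustration):

```javascript
// Hypothetical pre-delimited line: the '::' markers are assumed to be
// added before this code ever sees the text. The trailing '::' leaves
// an empty slot for the missing email.
const line = 'Dom Anzalone::(212) 915-6316::VP, Regional Branch Manager::';

function parseDelimited(text) {
  // split() keeps empty strings between consecutive delimiters, so a
  // missing field stays in its own position. Fields missing from the
  // end come back undefined, so default each one to ''.
  const [name = '', phone = '', title = '', email = ''] = text.split('::');
  return { name, phone, title, email };
}

const contact = parseDelimited(line);
```

Because empty slots are preserved, a contact with no phone (name::::title::email) still lands the title and email in the right keys.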

This looks extremely helpful, way easier than what I was thinking, but there are a couple of snags I have to work around. For instance, while I’m splitting the string, is there a way to put an “if” in there, so that if one of these values isn’t found it’s just indexed as a blank member of the array, instead of (what I assume will happen) the next piece of the string being assigned to the wrong index? It looks like using this system I can probably write a function to make a grid in which every 4 indexes are a row. I was imagining 4 different arrays (Name, Phone, Title, Email): I’d loop through each contact chunk, split it into its parts, and assign each bit to its respective array. Alright, this should get me moving, thanks. Any further help would be amazing.

That’s the real trick. The more structured your input is, the easier it will be. Otherwise, you’re back to a series of regex .match statements.

Which leads to the unfortunate “what happens if it’s just a name?” :frowning:
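For what it’s worth, here is a rough sketch of the “series of .match statements” route, which also copes with the name-only case. It leans on some loud assumptions: phones always look like (212) 915-6316, emails are plain user@host.tld, the name is exactly the first two words, and records are separated by ____ or line breaks. The function and constant names are made up for illustration.

```javascript
// Assumed patterns, based only on the samples in this thread.
const PHONE_RE = /\(\d{3}\) \d{3}-\d{4}/;
const EMAIL_RE = /[^\s@]+@[^\s@]+\.[^\s@]+/;

// Assumption: records are separated by "____" markers or line breaks.
// Irrelevant lines from the PDF would still need to be filtered out.
function splitChunks(text) {
  return text.split(/____|\r?\n/).map(s => s.trim()).filter(Boolean);
}

function extractContact(chunk) {
  let rest = chunk.trim();

  // Pull a field out if it is present and remove it from the text;
  // return '' when it is missing so that column stays blank.
  function take(re) {
    const m = rest.match(re);
    if (!m) return '';
    rest = (rest.slice(0, m.index) + ' ' + rest.slice(m.index + m[0].length))
      .replace(/\s+/g, ' ')
      .trim();
    return m[0];
  }

  const phone = take(PHONE_RE);
  const email = take(EMAIL_RE);

  // Assumption: the first two words are the name; whatever is left
  // after that is treated as the title.
  const words = rest === '' ? [] : rest.split(' ');
  const [firstName = '', lastName = '', ...titleWords] = words;
  return { firstName, lastName, title: titleWords.join(' '), phone, email };
}
```

Each field is looked up on its own, so a missing phone or email just comes back as an empty string instead of shifting the other values. Three-word names or middle initials would break the two-word assumption and need special handling.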

But, if you have code or a program that pulls just those pieces of data out of your source file, you should be able to either add a delimiter before processing or, as you suggested, just plug the data into the right object key and skip all the above code…
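Once each contact is an object, writing the CSV is mostly a join. A minimal sketch, assuming each object uses the hypothetical keys firstName, lastName, title, phone and email, with the column order taken from the example header earlier in the thread. One thing to watch: a title like "VP, Regional Branch Manager" contains a comma, so it has to be quoted or it will spill into the next column.

```javascript
// Hypothetical key names; the column order follows the example header.
const COLUMNS = ['firstName', 'lastName', 'title', 'phone', 'email'];
const HEADER = 'First Name,Last Name,Title,Phone,Email';

// Quote a value when it contains a comma, a quote, or a line break,
// doubling any embedded quotes (the usual CSV convention).
// Missing values (undefined or null) become an empty field.
function csvField(value) {
  const s = String(value ?? '');
  return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function toCsv(contacts) {
  const rows = contacts.map(c => COLUMNS.map(k => csvField(c[k])).join(','));
  return [HEADER, ...rows].join('\n');
}
```

The resulting string can be saved as a .csv file and opened in Excel; how it gets written to disk depends on where the script runs.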

Good luck.