Regex unique match ignore case

Snowf · March 27, 2020, 4:44pm

Hi,
I’m new to regex.
I have the regex below but want to know how to set it to only find unique matches. At the moment it finds both uppercase and lowercase emails. This is for a DLP system to stop data leaving a company.

[A-z0-9]{1,}[._-]{0,1}[A-z0-9]{1,}[@][A-z_-0-9]{1,}[.][A-z_-0-9]{1,}[.]{0,1}[A-z_-0-9]{1,}[.]{0,1}[A-z_-0-9]{0,}

Thanks
Snowf

napolimatiase · March 27, 2020, 4:50pm

Hi. For lowercase letters: [a-z]
You can test your regex here

bbsmooth · March 27, 2020, 5:15pm

I’m not sure I understand your question, so I’ll give the answer I think you are asking and if it isn’t correct you can explain in more detail.

The domain part of the email address (@…) is not case sensitive. For example, @gmail.com, @GMAIL.COM, and @Gmail.Com are all perfectly fine. So you’ll want the regex to look for both upper/lower case there. Technically, the part before the @ is case sensitive (according to the RFC) but for all intents and purposes it is treated as case insensitive by almost every mail server. So you’ll want to look for both upper/lower case there too. Moral of the story: if you are trying to capture email addresses then the regex should be looking for both lower and upper case letters everywhere.

The regex itself doesn’t keep track of the previous emails it has found so it has no way of knowing whether it already found a similar email address but with different capitalization. So if your question is about eliminating all the email addresses that are the same but have different capitalization, my suggestion would be to convert each email address the regex captures to all lowercase and then do whatever you want with it.
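As a rough sketch of that idea in JS (the pattern below is a simplified stand-in made up for illustration, not the regex from the original post, and extractLowercasedEmails is a hypothetical helper name):

```javascript
// Simplified, hypothetical email pattern for illustration only.
// It is NOT the pattern from the original post and is not RFC-complete.
const emailPattern = /[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/g;

// Find every address in the text, then lowercase each captured match.
function extractLowercasedEmails(text) {
  return Array.from(text.matchAll(emailPattern), (m) => m[0].toLowerCase());
}

const sample = "To: some.one@gmail.com, Cc: Some.One@gmail.Com";
const emails = extractLowercasedEmails(sample);
console.log(emails); // both entries come out as "some.one@gmail.com"
```

The regex still matches both spellings. It is the toLowerCase() call afterwards that makes them comparable.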

Snowf · March 27, 2020, 6:54pm

Yes, at the moment the regex is picking up both:
some.one@gmail.com
Some.One@gmail.Com
Or any variation of cases in the email
We want unique matches and want to ignore the case for the email.
The regex is supposed to look for PII leaving the company.
Thanks

bbsmooth · March 27, 2020, 7:00pm

As I suggested, you can convert the email address captured by the regex to all lowercase letters, then Some.One@gmail.com will be treated as some.one@gmail.com. Or I suppose you could convert the string you are searching for email addresses in to all lowercase before you apply the regex.

What you want to do won’t be solved by fixing the regex. The regex doesn’t know what previous variations of the email address you have already found. You need to implement a way to store the email addresses you have found so that when you find a new email address (which you have converted to all lowercase) then you can cross check it against the store to make sure you haven’t already found the email address.
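If you do have somewhere to run code, the store could be as simple as a Set. This is only a minimal sketch: seenEmails and isNewEmail are hypothetical names, and it assumes the addresses have already been pulled out of the text by the regex.

```javascript
// Hypothetical in-memory store of addresses that were already found.
const seenEmails = new Set();

// Returns true only the first time an address is seen, ignoring case.
function isNewEmail(address) {
  const normalized = address.toLowerCase();
  if (seenEmails.has(normalized)) {
    return false; // a differently capitalized copy was already recorded
  }
  seenEmails.add(normalized);
  return true;
}

console.log(isNewEmail("some.one@gmail.com")); // true
console.log(isNewEmail("Some.One@gmail.Com")); // false
```

The cross check happens in the Set lookup, not in the regex, which is why this can’t be expressed as a pattern change alone.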

Snowf · March 27, 2020, 9:38pm

Thank you for your help… sounds complicated. I will look into what you suggested.

Snowf · March 29, 2020, 10:03pm

That is going to be a problem as the DLP solution only takes regex queries. I will not be able to implement a way to store the email addresses.
Is there no way to modify the regex to only pick unique matches, ignoring case?

Thanks

bbsmooth · March 29, 2020, 10:21pm

I think I might be at an impasse here. You might need to provide further details on just what exactly you are trying to do and how you are currently doing it. I’m assuming that you are scanning some sort of text content (perhaps email headers) and pulling out email addresses using the regex, but maybe you are doing something else. Maybe someone else here is more familiar with what you are trying to do and has more insight on how to do it.

I still don’t think the regex is your problem here, but until I know more I don’t think there is anything more that I can add.

Good luck.

Snowf · March 30, 2020, 2:50pm

Thanks bbsmooth

So the product is Symantec DLP. It is checking to see if PII credit card numbers are being sent to external email addresses. The DLP config is limited but there is an option to report on only unique matches. The problem is that both the uppercase email address and the lowercase email address are triggering two incidents.
I’m really new to this and still do not have a good understanding. If I change the regex from and/or to just or, this might work. That is, for [A-z0-9]{1,} I believe this is saying UC letters and/or LC letters and/or numbers with one occurrence. If I change all of them to or, could that work, or am I talking nonsense? Sorry.

Snowf · March 30, 2020, 3:19pm

Just had another thought. Can I convert all text to a single case via the regex?

bbsmooth · March 30, 2020, 5:45pm

Sorry, this still doesn’t give me enough information (I know nothing about this product or how to configure it). And really, this is probably not the place to be asking this question, you should be asking Symantec. I would find it hard to believe that they haven’t anticipated this issue and don’t have a solution for you.

If you are a little weak with regex then you should find a good online resource that teaches it and get yourself up to speed. FCC has a whole section devoted to regex, so you can start there.

To answer your other question, the regex itself is for just finding matches. It doesn’t actually alter anything. In JS, if you wanted to use a regex to convert all uppercase to lowercase you could do something like:

const lcString = origString.replace(/[A-Z]/g, x => x.toLowerCase());

The regex is the first argument to the replace() function and a helper function which actually does the conversion is the second argument. The point being that the actual conversion is done using another function, it isn’t done by the regex.
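The same goes for case-insensitive matching. A small sketch in JS, using a made-up literal pattern just for illustration (whether your DLP product accepts an equivalent flag is something I don’t know, so treat that part as an assumption to check with the vendor):

```javascript
const text = "Some.One@gmail.Com";

// The i flag lets a lowercase pattern match text in any capitalization...
const match = text.match(/some\.one@gmail\.com/i);

// ...but the matched text keeps its original casing.
console.log(match[0]); // "Some.One@gmail.Com"

// Normalizing is still a separate step done by a string function.
console.log(match[0].toLowerCase()); // "some.one@gmail.com"
```

So a flag can change what the regex matches, but it never changes what the matched text looks like.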