Hi there! I am working on a function which tells you how many times a word appears in a text. I have written a regex which accounts for non-word characters which can be present in a text due to sentence grammar or user errors.
If word is “dogs” and text is “Dogs are lovely animals and dogs make great pets. Dogs are sensitive to human’s feelings and dogs are very loyal to their owners”, the regex matches 4 times as it should.
However, if the text is “Dogs, dogs, dogs, dogs”, it only matches twice and I am trying to get my head around why. Matches are missed when one instance of the word is followed directly by another instance, with only non-word characters in between.
Here is my code:
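let word = "dogs";
// Four alternatives: the word alone, at the start, in the middle, or at the end of the text.
const regex = new RegExp(`(^${word}$|^${word}\\W+|\\W+${word}\\W+|\\W+${word}$)`, "gi");
let text = "dogs, dogs, dogs, dogs";
console.log(text.match(regex)); // the array of matches (null if there are none)
console.log(text.match(regex).length);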
Note: I used the RegExp constructor since the word will come from user input.
Sure. Basically, if the user types dogsdogs, I do not want this to be a match. I had to write a fairly long regex to account for this and allow other options even if they are grammatically incorrect or have punctuation issues.
So after doing some testing (you should try that) with the different parts of the RegEx and using a neat trick which somehow isn’t covered a lot, I figured out two things.
First: RegEx quantifiers are greedy by default, so you need to add “?” after a quantifier like + or * to tell it to look for the smallest match → \\W+?
Second: “match” does not find overlapping matches; a character that is already part of one match cannot be part of the next. So because your pattern requires each “dogs” to bring its own preceding/following non-word character, it will only find one match in “dogs.dogs”.
You are getting two matches because after greedily taking all non-word characters, starting from the left, it matches:
“[dogs, ]dogs[, dogs, ]dogs” and skips “dogs” twice, because the greedy matches didn’t leave any non-word characters next to those two for a match of their own. Making the quantifiers non-greedy fixes this particular text, but it would still run into an issue with “dogs.dogs”: without overlap you need at least two non-word characters in between words.
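To make the consumed characters visible, here is a small sketch that lists each match with its start index. It relies on String.prototype.matchAll; the helper name showMatches is made up for illustration, and the two patterns are the greedy one from the question and the same pattern with lazy quantifiers.

```javascript
// Hypothetical helper: list every match together with its start index.
function showMatches(pattern, text) {
  return [...text.matchAll(pattern)].map((m) => `${m.index}:${JSON.stringify(m[0])}`);
}

const word = "dogs";
// Greedy pattern from the question.
const greedy = new RegExp(`(^${word}$|^${word}\\W+|\\W+${word}\\W+|\\W+${word}$)`, "gi");
// Same pattern with lazy quantifiers (\W+? instead of \W+).
const lazy = new RegExp(`(^${word}$|^${word}\\W+?|\\W+?${word}\\W+?|\\W+?${word}$)`, "gi");

const greedyCommas = showMatches(greedy, "dogs, dogs, dogs, dogs");
const lazyCommas = showMatches(lazy, "dogs, dogs, dogs, dogs");
const lazyDot = showMatches(lazy, "dogs.dogs");

console.log(greedyCommas); // 2 matches: each one swallows the ", " on both sides
console.log(lazyCommas);   // 4 matches: every word gets one separator character
console.log(lazyDot);      // 1 match: the single "." cannot belong to both words
```

The last line is the overlap problem in isolation: the only separator is used up by the first match, so the second “dogs” has nothing in front of it to satisfy the pattern.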
I hope this helps - for testing I just opened the JS console of my browser and tried out different RegEx and texts… Without knowing the non-greedy symbol it’s of course very hard… but that is where I started to notice the overlap thing.
I’d think you need another approach or another regex method for this. Though I am no expert in the field, so I don’t know which would do it…
Technically the issue is that you are matching the non-word characters. But you don’t want to match those, you only want to match “dogs”, so you gotta work out how to do that. Easiest, but probably not super efficient, would be to split the text on non-word chars and then check all words.
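A minimal sketch of that split-and-check idea, assuming the comparison should be case-insensitive (as the “i” flag in the question suggests) and that the word itself contains only word characters. The name countBySplitting is hypothetical.

```javascript
// Hypothetical helper: count whole-word occurrences by splitting on non-word characters.
function countBySplitting(word, text) {
  const target = word.toLowerCase();
  return text
    .split(/\W+/) // any run of non-word characters separates two words
    .filter((w) => w.toLowerCase() === target)
    .length;
}

const commaCount = countBySplitting("dogs", "Dogs, dogs, dogs, dogs");
const dotCount = countBySplitting("dogs", "dogs.dogs");
const gluedCount = countBySplitting("dogs", "dogsdogs");

console.log(commaCount); // 4
console.log(dotCount);   // 2
console.log(gluedCount); // 0, because "dogsdogs" is a single word
```

No regex is built from user input here, so there is nothing to escape, at the cost of walking over every word in the text.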
@Jagaya Thank you very much for this! You’ve highlighted some very insightful points. I’ll see what I can do with this code and post if I can sort it out
So, as you suggested, I split the text into words. I then joined them with two whitespace characters and changed the regex to non-greedy so it recognises all the matches. Now it’s working. Cheers for the advice. This is my revised code:
let word = "dogs";
const regex = new RegExp(`(^${word}$|^${word}\\W+?|\\W+?${word}\\W+?|\\W+?${word}$)`, "gi");
let text = "dogs, dogs, dogs, dogs";
// Split on every non-word character, then rejoin with two spaces
// so that each word has a separator of its own on both sides.
const textArr = text.split(/\W/);
const newText = textArr.join("  ");
console.log(newText.match(regex));
console.log(newText.match(regex).length);
Perhaps this is not the cleanest way but it will do for now.
I just came across the lookahead assertion x(?=y) in RegEx, which matches the x (and only the x) if it is followed by y → the opposite is x(?!y), which matches x only if it is not followed by y.
This might make this task easier.
Edit: The other direction of a lookahead is a lookbehind (“behind” being left of the word), written “(?<=y)x” → matches x if preceded by y.
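A quick sketch of how these assertions behave on the problem text “dogs.dogs”. Only standard JavaScript regex syntax is used; the variable names are just for illustration. Note how the positive forms need an actual character to be there, so they fail at the very start or end of the string, while the negative form does not.

```javascript
const sample = "dogs.dogs";

// Positive lookahead: "dogs" must be followed by a non-word character (not consumed).
const followedByNonWord = sample.match(/dogs(?=\W)/g);
// Negative lookahead: "dogs" must NOT be followed by a word character; end of string is fine.
const notFollowedByWord = sample.match(/dogs(?!\w)/g);
// Positive lookbehind: "dogs" must be preceded by a non-word character.
const precededByNonWord = sample.match(/(?<=\W)dogs/g);

console.log(followedByNonWord); // only the first "dogs" (the second sits at the end)
console.log(notFollowedByWord); // both
console.log(precededByNonWord); // only the second "dogs" (the first sits at the start)
```

In every case the matched text is just “dogs”, so the “.” stays available for the neighbouring word.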
I thought this would make the expression much simpler, and because I got curious and tested it, I ended up with this: /(?<!\w)dogs(?!\w)/gi
Now it matches “dogs” only when it is neither preceded nor followed by a word character. That of course rules out a “dogs” glued to another “dogs”, and it also means the word is matched at the beginning and end of the string.
You might need to adjust it for your specific needs, as I didn’t exactly remember them when writing this - but it most certainly is short and simple.
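For a word that comes from user input, the same pattern could be built with the RegExp constructor, roughly like this. It is a sketch, not a definitive version: countWord and escapeRegExp are hypothetical names, and escaping the input is my assumption (it was not part of the original code) so that characters such as “.” or “+” in the word are taken literally.

```javascript
// Hypothetical helper: escape characters that have a special meaning in a regex.
function escapeRegExp(input) {
  return input.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: count occurrences of `word` that do not touch other word characters.
function countWord(word, text) {
  const regex = new RegExp(`(?<!\\w)${escapeRegExp(word)}(?!\\w)`, "gi");
  const matches = text.match(regex); // null when nothing matches
  return matches === null ? 0 : matches.length;
}

console.log(countWord("dogs", "Dogs, dogs, dogs, dogs")); // 4
console.log(countWord("dogs", "dogs.dogs"));              // 2
console.log(countWord("dogs", "dogsdogs"));               // 0
```

The null check matters because match returns null rather than an empty array when there are no matches, so calling .length directly on the result would throw for a text without the word.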