Regex in JS code

A while ago I did a challenge on Codewars. My own attempt passed, but I was struck by how clever and simple one of the other solutions was. Unfortunately, I do not understand one part of it, the regex. Here is the whole solution:

function duplicateCount(text){
  return (text.toLowerCase().split('').sort().join('').match(/([^])\1+/g) || []).length;
}

This was the description:

Count the number of Duplicates
Write a function that will return the count of distinct case-insensitive alphabetic characters and numeric digits that occur more than once in the input string. The input string can be assumed to contain only alphabets (both uppercase and lowercase) and numeric digits.

Example
“abcde” → 0 # no characters repeats more than once
“aabbcde” → 2 # 'a' and 'b'
“aabBcde” → 2 # 'a' occurs twice and 'b' twice ('b' and 'B')

If you have some time, would you mind explaining what this part of the solution, (/([^])\1+/g) || []), is about? I know some regex, and I have worked through the JS course on freeCodeCamp, but this is beyond what I know so far and I did not grasp the concept. So far I have encountered “^” only in combination with other characters inside brackets. I also do not get the || [] part. Why is it not inside the forward slashes?

function duplicateCount(text){
  return (text.toLowerCase().split('').sort().join('').match(/([^])\1+/g) || []).length;
}

So

function duplicateCount(text){
  const lowercaseText = text.toLowerCase();
  const individualChars = lowercaseText.split("");
  const orderedChars = individualChars.sort();
  const orderedText = orderedChars.join("");
  
  let repeatedCharMatches = orderedText.match(/([^])\1+/g);
  if (repeatedCharMatches === null) {
    repeatedCharMatches = [];
  }

  return repeatedCharMatches.length;
}

The regex says:

  • in a capturing group (())…
  • match the start of input (the [^], this is a bit tricky because normally the ^ in square brackets is negation – eg [^ab] would mean “not the characters a or b”, but on its own, it means “start of input”)…
  • then one or more (+)…
  • of whatever was in that capturing group (the \1, which means “capturing group 1”).
  • do it globally (for whole string, not just first match) (g)

The match function returns either null (no matches) or an array containing each match.

So for “abc”

  • start of input is a. Only one a, no match
  • next attempt (doing it globally, so moves on)
  • start of input is b. Only one b, no match
  • next attempt
  • start of input is c. Only one c, no match

So "abc".match returns null. Null has no length, so set the variable to an empty array.

For “aabbcc”

  • start of input is a. Two a, put “aa” into array
  • next attempt
  • start of input is b. Two b, put “bb” into array
  • next attempt
  • start of input is c. Two c, put “cc” into array

So "aabbcc".match returns ["aa", "bb", "cc"]. Array has length, length is 3, three repeated groups.
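
The two results above can be checked directly in a console. This is only a small sketch, and the variable names are made up for the illustration:

```javascript
// No character is immediately followed by itself, so match() returns null.
const noRepeats = "abc".match(/([^])\1+/g);
console.log(noRepeats); // null

// Each run of a repeated character becomes one entry in the array.
const repeats = "aabbcc".match(/([^])\1+/g);
console.log(repeats);        // ["aa", "bb", "cc"]
console.log(repeats.length); // 3
```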

The || [] is not part of the regex, it’s JavaScript’s logical OR operator.

If the value to the left of the OR is falsy, the expression evaluates to the value on the right.

So if the return value of running match on the string is null (i.e. falsy), you get the value to the right of the OR, here an empty array.

This allows the author of the code to run length on the result without it erroring out. They could have split it across two lines (do the sorting and matching, then either return 0 if null or length if not null), but to keep it on a single line they’ve used that trick.
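
To see why the fallback is needed, here is a small sketch. The helper names countWithFallback and countWithNullCheck are made up for this illustration; the second one is the two-step version described above.

```javascript
const pattern = /([^])\1+/g;

// "abc".match(pattern) is null, and reading null.length would throw a TypeError.
// null is falsy, so `null || []` evaluates to the empty array instead.
function countWithFallback(sortedText) {
  return (sortedText.match(pattern) || []).length;
}

// Same result without the trick: check for null explicitly.
function countWithNullCheck(sortedText) {
  const matches = sortedText.match(pattern);
  return matches === null ? 0 : matches.length;
}

console.log(countWithFallback("abc"));     // 0
console.log(countWithFallback("aabbcc"));  // 3
console.log(countWithNullCheck("abc"));    // 0
console.log(countWithNullCheck("aabbcc")); // 3
```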

Thank you for your answer. Things are a bit clearer to me now but I still have some questions.

I found this in MDN when I looked into capturing groups of “\n” type: “Where “n” is a positive integer. A back reference to the last substring matching the n parenthetical in the regular expression (counting left parentheses).”

They also showed one example, but I still do not know why or where I would want to use this kind of back reference. Why does it need to be used in this case?

Also, you mentioned that [^] represents the start of the input. So I am curious how it loops through the string to find the groups of identical characters, and what the actual start is. Is it one character, two, or …? Why does it not stop at a in the aabbcc example you gave? How does it know which characters to group together?

Could you please explain in more detail how the matching process works in this case? I get everything else except this.

That’s the magic of the global (“g”) flag, it doesn’t stop at the first match, it checks the entire string.

This one ([^]) is actually “any character, including newlines”.

So if you didn’t need to worry about new lines you could do?

/(.)\1+/

I don’t think I realized that [^] matches any character including new lines. Thanks for pointing that out.

I think so. And I didn’t know either, it just didn’t sound right that would mean start of input if it was inside the square brackets so I went to check.
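
A quick sketch of the difference between the two patterns. The sample string is made up, and the third line uses the s (dotAll) flag, which is a separate option not mentioned above:

```javascript
const text = "a\n\nb"; // two consecutive newlines in the middle

// The dot does not match line terminators, so the repeated "\n" is missed.
const withDot = text.match(/(.)\1+/g);
console.log(withDot); // null

// [^] matches any character at all, newlines included.
const withEmptyNegation = text.match(/([^])\1+/g);
console.log(withEmptyNegation); // ["\n\n"]

// With the s flag the dot matches newlines as well.
const withDotAll = text.match(/(.)\1+/gs);
console.log(withDotAll); // ["\n\n"]
```

Since the challenge input only contains letters and digits, either pattern gives the same count there.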

The capturing group matches one character. You want to then match any subsequent character that is the same, and you can’t do that unless you use something that knows what that character is

aabbcc: first character is a, second character is a, match for the pattern. Pattern is “aa”. Remaining character to look at are “bbcc”. First character is b, etc
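
To make the role of the capturing group visible, here is a small sketch using exec, which returns the captured text alongside the whole match. The variable names are only for illustration:

```javascript
const result = /([^])\1+/.exec("aabbcc");

console.log(result[0]);    // "aa" (the whole match)
console.log(result[1]);    // "a"  (what group 1 captured, which \1 refers back to)
console.log(result.index); // 0    (where the match started)

// Without the back reference the pattern cannot require "the same character again":
const anyThenAny = "ab".match(/([^])[^]+/g);
console.log(anyThenAny); // ["ab"] (any character followed by any characters)

const anyThenSame = "ab".match(/([^])\1+/g);
console.log(anyThenSame); // null ("a" is not followed by another "a")
```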

So just to be clear,

(case of “aabbcc”)

  1. [^] matches one character that is a
  2. then it matches aa due to \1 because the first capture group is “a”
  3. then it moves onto the next unique character from string because [^] matches any character so it goes onto b and this is all due to g that specifies that it should match all unique characters

Did I get it right? Also, why does it not work for “abc”? The + sign at the end matches one or more characters, so why don’t I get a match for just one character? I am also assuming that [^] matches only unique characters, is that true? I came to this assumption from the following reasoning: if it matched every character and I were working with “aaabbb”, would it not match aaa and then also aa when it progressed to the next character?

The last thing I would like to ask where did you get this explanation? I would love to get my hands on tool that would explain Regex to me.

I didn’t know either, and I think I was looking at a different regex implementation or description of functionality (which I now can’t find), where it was described as start of entry, but practically that seems to have meant the same thing (it would be whatever the first character was in the match).

But yeah JS seems to (uniquely?) treat it as a negated character class of nothing, so it’s basically “not nothing”

Edit: haha, I see where it was from, it was MDN, where there is a little info box on the section for character classes saying “the ^ character may also indicate beginning of input” and I’ve inferred from there

No, so ([^]) is matching one character, then one or more of the same thing as that match. So “abc” cannot result in any matches: “a” is not followed by any more "a"s, “b” is not followed by any more "b"s, “c” is not followed by any more "c"s.

Oh, so in the process of matching it eliminates the option that it already went through? In the case of aabbccaa after the first matching it would work only with bbcc also?

That’s how regex works – it goes left to right, discarding everything to the left (with caveats – eg here it temporarily stores the character/s to the left for that match)

“aabbccaa” would give aa, bb, cc, aa

It doesn’t care what the character is and it doesn’t “remember” that first match of “aa”. Once it gets that first pattern match, it just asks all over again whether there is any character followed by one or more of the same character.
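
As a sketch of that left-to-right behaviour, matchAll reports where each match starts. The names found and runs are made up for the illustration:

```javascript
// Collect [start index, matched text] for every match.
const found = [..."aabbccaa".matchAll(/([^])\1+/g)].map((m) => [m.index, m[0]]);
console.log(found); // [[0, "aa"], [2, "bb"], [4, "cc"], [6, "aa"]]

// The + is greedy, so a run of three is consumed as a single match.
// The search then resumes after that match, so no overlapping "aa" is reported.
const runs = "aaabbb".match(/([^])\1+/g);
console.log(runs); // ["aaa", "bbb"]
```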

Oh, I think I got it. So it seems there is no complex memory management going on here, as I had thought.

“No, so ([^]) is matching one character, then one or more of the same thing as that match. So “abc” cannot result in any matches: “a” is not followed by any more "a"s, “b” is not followed by any more "b"s, “c” is not followed by any more "c"s.” → Just to be sure I am getting it right: for a match there must be at least two occurrences of the same character, because the captured character has to be followed by at least one more copy of itself. Matching a character only against itself would not make sense here. But if I did want to match runs of a single known character, could I use /[character]+/ig?

Yes, you can do that: /a+/gi would match every run of one or more "a"s in a string, in either case, so "aabbccaa".match(/a+/gi) would produce ["aa", "aa"]

However, you don’t know in advance what the character is, which is why you use /([^])\1+/ instead. It works the same way, except that instead of a literal “a” it matches any character followed by one or more copies of that same character.
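
A small side-by-side sketch of the two patterns. The variable names are hypothetical:

```javascript
// A literal pattern only finds the character you wrote into it.
const literalRuns = "aabbccaa".match(/a+/gi);
console.log(literalRuns); // ["aa", "aa"]

// a+ also accepts a single "a", because + means "one or more".
const singleAs = "abca".match(/a+/gi);
console.log(singleAs); // ["a", "a"]

// The back reference version needs at least two of the same character in a row...
const noPairs = "abca".match(/([^])\1+/g);
console.log(noPairs); // null

// ...but it works for whatever character happens to be repeated.
const anyRuns = "aabbccaa".match(/([^])\1+/g);
console.log(anyRuns); // ["aa", "bb", "cc", "aa"]
```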

Note that I should have said you can’t actually get a string like "aabbccaa", because in the code the string is sorted: you can only possibly get “aaaabbcc” (the last few comments are just hypotheticals).

Note also that the i is redundant: the string has to already have been lowercased for the code to work, because if you don’t do that, then sort will cause this issue:

"cAbaDdA".split("").sort().join("")
// produces "AADabcd"

Whereas

"cAbaDdA".toLowerCase().split("").sort().join("")
// produces "aaabcdd"
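
To show the effect on the final count, here is a sketch using one of the examples from the challenge description. The helper name countSortedRuns is made up; it is the original one-liner without the toLowerCase call:

```javascript
// Sorts the characters and counts runs of repeated characters,
// but does NOT lowercase first (that is left to the caller).
function countSortedRuns(text) {
  return (text.split("").sort().join("").match(/([^])\1+/g) || []).length;
}

// Without lowercasing, "B" sorts before the lowercase letters ("Baabcde"),
// so "b" and "B" never end up next to each other.
console.log(countSortedRuns("aabBcde")); // 1

// Lowercased first, the sorted string is "aabbcde".
console.log(countSortedRuns("aabBcde".toLowerCase())); // 2
```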

Edit: also need to say that the reason trickery is required (the capture group + the reference to it) is that it’s quite difficult to match things you don’t know in advance. Regex works best matching literal strings – that’s what it’s designed to do very well, and that’s what’s easiest to write and to make understandable. Once you go outside that, you can often get what you want, but at the cost of the regex becoming progressively more like absolute gibberish
