regExp - help with understanding <.*?> in string

Hi!

I would like to understand this challenge a bit better.

What is not sitting right with me is that, in my head, the regExp <.*?> should be broken down like this:
( <) : literal match
( . ) : match the first character after ‘<’ ( ==> matches ‘h’ )
(*) : match the preceding character ( h ) 0 or more times ( ==> matches ‘h’ )
(?) : match lazily (not really clear about this one)
(>) : literal match

I would be very grateful if someone could give a more detailed explanation of this challenge and topic. I also have some related questions:

How come <.*> matches the entire string?
Why doesn't it work to switch the places of * and ? ==> <.*?> != <.?*>

Thanks!

Hello @MSPa1nT ,

Here is a good link that explains it:

I usually use the alternative approach myself in regular expressions

* doesn’t care what was matched by the . before it.
.* means repeating the . match zero or more times, not repeating whatever character the . matched zero or more times.
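
To see the difference in code, here is a small sketch. The sample strings and variable names are my own, not from the challenge. The second pattern uses a capture group and a backreference, which is roughly what "repeat the character that . matched" would look like if you actually wanted it:

```javascript
// .* repeats the wildcard itself, so each repetition may match a different character.
const anyChars = /<.*>/;
// (.)\1* captures one character, then repeats that same character via the backreference \1.
const sameChar = /<(.)\1*>/;

const wildcardHit = "<h1>".match(anyChars); // 'h' and '1' are each matched by a fresh .
const repeatMiss = "<h1>".match(sameChar);  // fails: '1' is not a repeat of 'h'
const repeatHit = "<hh>".match(sameChar);   // works: 'h' followed by another 'h'

console.log(wildcardHit[0]); // "<h1>"
console.log(repeatMiss);     // null
console.log(repeatHit[0]);   // "<hh>"
```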

Once you understand the above, consider the initial challenge code:

let text = "<h1>Winter is coming</h1>";
let myRegex = /<.*>/; // Change this line
let result = text.match(myRegex);

If we disregard the greedy/lazy matching rules for a moment, you’ll notice that myRegex is able to match both "<h1>" and "<h1>Winter is coming</h1>", because both start with "<", end with ">", and have zero or more . matches in between.

Matching "<h1>" is called a lazy match, as lazy logic says: “Found an answer and that’s enough! Couldn’t be bothered to check any further!”

Matching "<h1>Winter is coming</h1>" is called a greedy match, as greedy logic says: “Found an answer, but I want to keep going and see if I can match a longer string!”
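
You can check both behaviours side by side. This is just a quick sketch using the challenge string; the variable names are mine:

```javascript
const heading = "<h1>Winter is coming</h1>";

// Greedy: .* grabs as much as it can, then backs off to the LAST ">".
const greedyMatch = heading.match(/<.*>/);
// Lazy: .*? grabs as little as it can and stops at the FIRST ">".
const lazyMatch = heading.match(/<.*?>/);

console.log(greedyMatch[0]); // "<h1>Winter is coming</h1>"
console.log(lazyMatch[0]);   // "<h1>"
```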

Simply put, the meaning of each regex character is given by both its form and its *surrounding characters*. In <.*?>, the ? switches * from greedy to lazy; whereas in <.?*>, the ? means repeating the preceding . match zero or one time, and the * that follows has nothing left to repeat, so JavaScript rejects the pattern with a syntax error.
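
Here is a rough sketch of the three cases. I build the last pattern with new RegExp from a string only so that the error can be caught at runtime; written as a literal, the script would not even parse. The variable names and the extra test strings are my own:

```javascript
const sample = "<h1>Winter is coming</h1>";

// ? right after * : makes the * lazy.
const lazyStar = sample.match(/<.*?>/);

// ? right after . : the . becomes optional (zero or one character between < and >).
const optionalDot = /<.?>/;
const matchesEmptyTag = optionalDot.test("<>");    // zero characters in between
const matchesOneChar = optionalDot.test("<a>");    // one character in between
const matchesTwoChars = optionalDot.test("<h1>");  // two characters: too many

// * right after .? : there is nothing left for * to repeat, so the pattern is invalid.
let errorName = null;
try {
  new RegExp("<.?*>");
} catch (err) {
  errorName = err.name;
}

console.log(lazyStar[0]);     // "<h1>"
console.log(matchesEmptyTag); // true
console.log(matchesOneChar);  // true
console.log(matchesTwoChars); // false
console.log(errorName);       // "SyntaxError"
```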

Also, use regex101 (build, test, and debug regex) to see a detailed analysis of what each expression means.

Thanks a lot for the detailed explanation, I think I understand.

If you have time and are able to see this, then maybe you could help me understand the following challenge on lookaheads as well:

I don’t quite understand the second lookahead grouping: (?=\w*\d{2})

Also, is this challenge meant to convey how grouping works as well, or have I forgotten about it / missed it in a previous lesson?

Think of lookahead/lookbehind assertions as similar to JS if statements, and remember that the characters checked by the condition aren’t part of the match.
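
As a rough illustration, read (?=\w*\d{2}) as: "from this position, after any number of word characters, there must be two digits in a row". The full pattern below is a made-up example and not necessarily the one from the challenge, and the test strings are mine:

```javascript
// Hypothetical pattern: two conditions that must both hold at the same position.
//   (?=\w{6})     -> at least six word characters ahead
//   (?=\w*\d{2})  -> somewhere ahead, after any word characters, two consecutive digits
const checkPass = /(?=\w{6})(?=\w*\d{2})/;

const longWithDigits = checkPass.test("abc123");  // six characters, contains "12"
const longNoDigits = checkPass.test("astronaut"); // long enough, but no digits
const shortWithDigits = checkPass.test("ab12");   // has digits, but too short

// Lookaheads only check, they don't consume: the matched text itself is empty.
const matchedText = "abc123".match(checkPass)[0];

console.log(longWithDigits);  // true
console.log(longNoDigits);    // false
console.log(shortWithDigits); // false
console.log(JSON.stringify(matchedText)); // ""
```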

Also use the regex101 link I gave in my previous post.

What I have understood: .* = everything

The regExp /<.*?>/ says:

Return ANYTHING (.*) that stands between “<” and “>”, but only the SHORTEST version (?):

( < ) : the match starts with “<”
( .* ) : matches everything, with any length (wildcard (.), zero or more times (*))
( ?> ) : but match it lazily (? means find the SHORTEST match) and with “>” at the end, which gives “h1>” and not “h1>Winter is coming</h1>”.

Note for editors
In this exercise it wasn’t really clear what the point is, because I could write it more simply as /<h1>/ or /<h.>/.

Wouldn’t it be more comprehensible if we want to return both tags, <h1> and </h1>?
In this case, the regExp /<.*?>/g (with the global flag g) will return [ '<h1>', '</h1>' ].
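
A quick sketch to confirm this, using the challenge string (the variable names are mine):

```javascript
const html = "<h1>Winter is coming</h1>";

// Without g: match() stops after the first lazy match.
const firstTag = html.match(/<.*?>/);
// With g: match() returns every non-overlapping lazy match.
const allTags = html.match(/<.*?>/g);

console.log(firstTag[0]); // "<h1>"
console.log(allTags);     // [ '<h1>', '</h1>' ]
```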