Need help creating a RegEx that captures a String between tags

Hey, I need a RegEx that captures a String between tags the following way:

<a>hello<b>world<c>whatup</c>ineed</b>help</a>

In this case I would only want whatup to be captured, because it is the only String that has its opening and closing tags adjacent to it.

<a><a>hello<b><b>world</b></b><c>whatup</a></c>yo</a>

In this case only the word world should be captured.

It should work for any possible tag and any amount of tags. The String should also be allowed to contain the characters <, > and /. Only if they are arranged like a closing tag </.*> should the capture group end.

I just started with RegEx yesterday and can’t really seem to figure this one out. I have tried several things so far, and currently I am back at <(.*)>(.*)</\1>.

This only kind of separates the content the way I want it to:

<h1><h2>Sanjay has no watch</h2></h1><par>So wait for a while</par>

becomes (when grabbing only group 2)

<h2>Sanjay has no watch</h2>

and

So wait for a while

I have at least two problems with this:

  1. It does not account for multiple tags.
    <h1><h2>Sanjay has no watch</h2></h1> becomes <h2>Sanjay has no watch</h2>

  2. It does not account for closing tags if there is no appropriate opening tag.
    <h1>had<h1>public</h1515></h1> becomes had<h1>public</h1515> (it should not return anything in this case)
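For anyone who wants to reproduce this, here is a minimal Java sketch of the current attempt. The class and method names (GreedyAttempt, group2) are made up for illustration; only the pattern itself comes from the post. Both (.*) groups are greedy, so group 1 only settles on the tag name after backtracking, and group 2 then stretches as far as it can to the last matching closing tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that wraps the pattern from the post.
final class GreedyAttempt {
    // Same pattern as above; the backslash is doubled for the Java string literal.
    private static final Pattern CURRENT = Pattern.compile("<(.*)>(.*)</\\1>");

    /** Returns group 2 of every match, in order of appearance. */
    static List<String> group2(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = CURRENT.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }
}
```

Calling GreedyAttempt.group2 on the <h1><h2>... line gives the two pieces shown above, and on the </h1515> line it gives the single unwanted piece from problem 2.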

I tried solving this with negative lookaheads and lookbehinds, but struggled with that, especially because I don’t know how to account for tags of any length when the * quantifier is not allowed in lookbehinds. Also, they didn’t quite behave the way I expected them to. I also don’t know if I am even going in the right direction.

I would really love to solve this on my own, so ideally I don’t want a solution, just someone to point me in the right direction.

I want to implement this in an HTML crawler I wrote in Java. The Strings that are going to be captured will be multiline paragraphs and headings. Getting matched by this RegEx is only one part of a process that the strings go through in order to get stripped down and sorted correctly.

I am interested in this because of its educational value; I could think of easier ways of doing it. But I happened to come across RegEx yesterday, and now it bothers the hell out of me that I can’t create a RegEx that fulfills this specific purpose. Whether or not I end up using it, I just want to solve this puzzle.

Aaagh, what @camperextraordinaire says: don’t parse HTML with regex, it’s a Very Very Bad Idea. Almost anything you’re using as a crawler should let you plug in, or already have available, some method to easily grab text, either via a specific parser or by XPath or whatever.

Edit: the educational value of parsing some HTML with regex is close to nil because it has a million and one caveats which need to be covered and browsers are extremely liberal in what they accept as valid HTML, meaning you need to cover all those variations as well.

The easiest solution would be to run it in a browser window, construct a DOM tree out of the elements, and then keep only those whose innerHTML is the same as their textContent.
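Since the crawler is in Java, here is a rough analogue of that idea rather than the browser version itself. There is no innerHTML outside a browser, so this sketch swaps in the JDK's built-in XML parser and keeps elements that have no child elements, which is roughly the same condition. It assumes the input is well-formed XML with a single root element, so mismatched tags like the ones in the question are rejected instead of skipped. The class and method names are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: collects the text of elements that contain no other elements.
final class LeafElementText {
    static List<String> leafTexts(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            // Assumption: malformed input is simply reported to the caller.
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        List<String> result = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*"); // every element, in document order
        for (int i = 0; i < all.getLength(); i++) {
            Element element = (Element) all.item(i);
            if (hasNoChildElements(element)) {
                result.add(element.getTextContent());
            }
        }
        return result;
    }

    // True when none of the direct children is an element node.
    private static boolean hasNoChildElements(Element element) {
        for (Node child = element.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                return false;
            }
        }
        return true;
    }
}
```

For the first example from the question, LeafElementText.leafTexts returns a list containing only whatup. Real-world HTML is usually not well-formed XML, so a crawler would need an HTML-aware parser in place of the XML one used here.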

I am really more interested in having a RegEx that does the job than in actually getting the job done. If there are disadvantages to using a regex to parse HTML, I am not going to use it for that. I would still love to be able to write a regex that fits the aforementioned criteria.

Point in right direction:

  1. You want a 1st capture group that looks like an opening tag: /<(\w+)>/
  2. The closing tag for this group will look like /<\/\1>/
  3. Now you need to include your string, which can contain anything except what looks like a tag
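Spoiler for anyone who would rather work it out alone: one possible way to combine the three steps in Java is sketched below. It is only a sketch under some assumptions, not the definitive answer. It treats a "tag" as <name> or </name> with a \w+ name and no attributes, it uses a negative lookahead in front of each character for step 3, and the class and method names are made up. Loose <, > and / characters are still allowed inside the text as long as they don't form something tag-like.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built from the three steps above.
final class InnermostTagText {
    // <(\w+)>              step 1: opening tag, name captured in group 1
    // ((?:(?!</?\w+>).)*)  step 3: any characters, but stop before anything
    //                      that looks like <name> or </name>
    // </\1>                step 2: closing tag with the same name as group 1
    // DOTALL lets '.' cross line breaks, since the paragraphs are multiline.
    private static final Pattern INNERMOST = Pattern.compile(
            "<(\\w+)>((?:(?!</?\\w+>).)*)</\\1>", Pattern.DOTALL);

    /** Returns the text of every tag pair that directly encloses plain text. */
    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = INNERMOST.matcher(input);
        while (m.find()) {
            result.add(m.group(2));
        }
        return result;
    }
}
```

With the examples from the question, InnermostTagText.extract gives only whatup for the first string, only world for the second, and an empty list for the </h1515> case. Note that an empty pair such as <b></b> also matches and yields an empty string, which may or may not be what you want.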

As many pointed out, this is by no means a production-level solution, more like a my-controlled-scraper-that-no-one-will-use-except-me solution :)