Extract html tags

sabiut · July 21, 2019, 1:13am

I want to use a regular expression to print out the HTML tags without their attributes.

For example:

import re
html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
tags = re.findall(r'<[^>]+>', html)
for a in tags:
    print(a)

The output is:

<h1>
</h1>
<p>
<span class="time">
</span>
</p>

But I just want the tags, not the attributes:

<h1>
</h1>
<p>
<span >
</span>
</p>

kylec · July 21, 2019, 4:33am

You could probably do this with a single regular expression on html, but alternatively you could just process a inside the for loop. Tag attributes should follow a pattern like attribute="some_value" (unless the HTML is non-standard), so find them and replace them all with an empty string using re.sub():

for a in tags:
    # strip each attribute="value" pair from the current tag
    b = re.sub(r'\s?\w+=\"[\w\d]+\"', '', a)
    print(b)
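
As a self-contained sketch of the same idea, the version below widens the attribute pattern a little so it also covers hyphenated attribute names and single-quoted or hyphenated values. The helper name strip_quoted_attributes and the pattern itself are assumptions made for illustration. Unquoted values and boolean attributes are not handled.

```python
import re

# Assumed pattern: whitespace, an attribute name, "=", then a quoted value.
ATTR = re.compile(r'''\s+[\w:-]+\s*=\s*(?:"[^"]*"|'[^']*')''')

def strip_quoted_attributes(tag):
    """Return the tag with every quoted attribute removed (hypothetical helper)."""
    return ATTR.sub('', tag)

html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
tags = re.findall(r'<[^>]+>', html)

# Apply the helper to every tag found in the string.
stripped = [strip_quoted_attributes(t) for t in tags]
for t in stripped:
    print(t)
```

With the sample string this should print each tag on its own line with class="time" removed from the span.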

camperextraordinaire · July 21, 2019, 4:51am

I don’t recommend using regex to parse HTML, because if there are incomplete tags inside of the valid tags, it can become a nightmare. That being said, you could do something like the following:

import re
html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
tags = re.findall(r'<[^>]+>', html)
for a in tags:
    print(re.sub(r'(<\w+)[^>]*(>)', r'\1\2', a))

Displays the following:

<h1>
</h1>
<p>
<span>
</span>
</p>
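
If regex turns out to be too fragile, one alternative is the standard library's html.parser module, which hands you the tag name and the attributes separately. This is only a rough sketch; the class name TagCollector is made up for the example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records every start and end tag by name, ignoring attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; it is deliberately dropped here.
        self.tags.append(f'<{tag}>')

    def handle_endtag(self, tag):
        self.tags.append(f'</{tag}>')

html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
collector = TagCollector()
collector.feed(html)
collector.close()

for t in collector.tags:
    print(t)
```

Note that the parser lowercases tag names, so the output may differ in case from the original markup.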

sabiut · July 21, 2019, 4:19pm

Thanks. How do I print out only the tags that have no attributes?

<h1></h1>
<p></p>
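
One possible way to get that output, sketched with the standard library's html.parser rather than regex: remember for each start tag whether it had attributes, and emit the pair when its end tag arrives. The class name PlainTagPairs is hypothetical, and the sketch assumes the HTML is properly nested.

```python
from html.parser import HTMLParser

class PlainTagPairs(HTMLParser):
    """Collects '<tag></tag>' for elements whose start tag has no attributes."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag name, had_attributes)
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, bool(attrs)))

    def handle_endtag(self, tag):
        # Assumes well-formed nesting: an end tag closes the most recent start tag.
        if self.open_tags and self.open_tags[-1][0] == tag:
            _, had_attributes = self.open_tags.pop()
            if not had_attributes:
                self.pairs.append(f'<{tag}></{tag}>')

html = '<h1>Hi</h1><p>test <span class="time">test</span></p>'
parser = PlainTagPairs()
parser.feed(html)
parser.close()

for pair in parser.pairs:
    print(pair)
```

Pairs are emitted in the order the elements close, so nested elements appear before their parents.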
