remove all inline styles and (most) classes from an HTML string

Question

I'll start from the end:
In my C# program, I have a string containing HTML, and I'd like to remove all inline style attributes (style="..") and all classes beginning with 'abc' from the elements in this string.
I'm willing to use regular expressions for this, even though some people complain about it :).

(An explanation, for those wishing to berate me for parsing HTML strings:
I'm forced to use a less-than-optimal web control for my project. The control is designed to be used server-side (i.e. with postbacks and all that stuff), but I'm required to use it in AJAX calls.
That means I have to configure it in code, call its Render() method, which gives me the HTML string, and pass that string to the client side, where it's inserted into the DOM at the appropriate place. Unfortunately, I wasn't able to find a configuration of the control that stops it from rendering itself with these useless styles and classes, so I'm forced to remove them by hand. Please don't hate me.)

Bohemian · Accepted Answer · 2013-12-26 23:46:19Z

10

Try this:

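// using System.Text.RegularExpressions;
// 'html' is the string returned by Render().
// Remove every double-quoted inline style attribute.
string withoutStyles = new Regex("style=\"[^\"]*\"").Replace(html, "");
// Remove a class starting with "abc" from each class attribute (at most one per attribute per pass).
string cleaned = new Regex("(?<=class=\")([^\"]*)\\babc\\w*\\b([^\"]*)(?=\")").Replace(withoutStyles, "$1$2");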
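The second pattern above consumes the whole class value in one match, so a single pass strips at most one 'abc' class per attribute. A minimal sketch of one way around that, assuming double-quoted attributes only: rewrite each class value with a match evaluator and filter the names. The names HtmlCleaner and StripStylesAndAbcClasses are hypothetical and not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class HtmlCleaner
{
    // Hypothetical helper: handles double-quoted attributes only.
    public static string StripStylesAndAbcClasses(string html)
    {
        // Drop each style="..." attribute together with the whitespace before it.
        string withoutStyles = Regex.Replace(html, "\\s+style=\"[^\"]*\"", "");

        // For each class="..." value, keep only the names that do not start with "abc".
        return Regex.Replace(withoutStyles, "(?<=\\sclass=\")[^\"]*(?=\")", m =>
            string.Join(" ", m.Value
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries) // null separator splits on whitespace
                .Where(name => !name.StartsWith("abc", StringComparison.Ordinal))));
    }
}
```

For example, StripStylesAndAbcClasses("<div style=\"color:red\" class=\"abcFoo keep abc-bar\">x</div>") returns "<div class=\"keep\">x</div>". An attribute whose classes are all removed is left behind as class="". This is still a regex over markup, so single-quoted or unquoted attributes pass through untouched.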


1 Comment

Erçin Dedeoğlu Over a year ago

Not working for me; the source and the result are the same. No effect.

J. Ed · Accepted Answer · 2014-01-01 09:53:27Z

For anyone interested: I solved this without using regular expressions.
Instead, I used XDocument to parse the HTML:

// using System.Linq; using System.Xml.Linq;
private string MakeHtmlGood(string html)
{
    var xmlDoc = XDocument.Parse(html);

    // Remove all inline styles
    xmlDoc.Descendants().Attributes("style").Remove();

    // Remove all classes inserted by 3rd party, without removing our own lovely classes
    foreach (var node in xmlDoc.Descendants())
    {
        var classAttribute = node.Attributes("class").SingleOrDefault();
        if (classAttribute == null)
        {
            continue;
        }

        var classesThatShouldStay = classAttribute.Value.Split(' ').Where(className => !className.StartsWith("abc"));
        classAttribute.SetValue(string.Join(" ", classesThatShouldStay));
    }

    return xmlDoc.ToString();
}

"MakeHtmlGood": I got a great laugh out of that one. Thanks for the humor.
I get an error: There are multiple root elements. Line 1, position 126.
You have to wrap the HTML in a dummy root element for this to work, and the HTML has to be perfectly well-formed or it won't work at all. HtmlAgilityPack can parse bad HTML (99.99% of the HTML on the web!).
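The dummy-root workaround mentioned in the comment can be sketched roughly as follows. This is only an illustration, not the answer's code: the class name FragmentCleaner, the method name RemoveInlineStyles, and the wrapper element name "root" are all hypothetical, and the input still has to be well-formed XML (an unclosed <br> would throw).

```csharp
using System.Linq;
using System.Xml.Linq;

static class FragmentCleaner
{
    // Hypothetical helper: accepts a fragment with several top-level elements.
    public static string RemoveInlineStyles(string htmlFragment)
    {
        // Wrap the fragment in a dummy root so the parser sees a single root element.
        XElement root = XElement.Parse(
            "<root>" + htmlFragment + "</root>", LoadOptions.PreserveWhitespace);

        // Remove all inline styles, as in the answer above.
        root.Descendants().Attributes("style").Remove();

        // Serialize only the children, which drops the dummy wrapper again.
        return string.Concat(
            root.Nodes().Select(n => n.ToString(SaveOptions.DisableFormatting)));
    }
}
```

For example, RemoveInlineStyles("<p style=\"a\">x</p><p>y</p>") returns "<p>x</p><p>y</p>", an input that would fail in XDocument.Parse with the "multiple root elements" error.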
