
I've seen the humorous threads and read the warnings, and I know that you don't parse HTML with regex. Don't worry... I'm not planning on trying it.

BUT... that leads me to ask: how are HTML parsers coded (including the built-in functions of programming languages, like DOM parsers and PHP's strip_tags)? What mechanism do they employ to parse the (sometimes malformed) markup?

I found the source of one coded in JavaScript, and it actually uses regex to do the job:

    // Regular Expressions for parsing tags and attributes
    var startTag = /^<(\w+)((?:\s+\w+(?:\s*=\s*(?:(?:"[^"]*")|(?:'[^']*')|[^>\s]+))?)*)\s*(\/?)>/,
        endTag = /^<\/(\w+)[^>]*>/,
        attr = /(\w+)(?:\s*=\s*(?:(?:"((?:\\.|[^"])*)")|(?:'((?:\\.|[^'])*)')|([^>\s]+)))?/g;

Do they all do this? Is there a conventional, standard way to code an HTML parser?

  • Right after I read your question, that thread you linked popped into my head :) Commented Feb 18, 2011 at 7:49
  • You can't parse a non-regular language with regexes. You sure can use them to extract information from known regular subsets of the language (which is what they are doing), but that's about it. Commented Feb 18, 2011 at 8:11
  • @NullUserException: Don’t be ridiculous. I thought everybody knew better than this by now. Commented Feb 18, 2011 at 10:57
  • @tchrist: It sure doesn't look like it Commented Feb 18, 2011 at 14:56
  • @tchrist: Please get practical and provide an example of such a parser that works with PCRE and is implemented in PHP. Or did you only create hot air? Commented Dec 13, 2011 at 8:16

1 Answer

I do not know that that style is a “normal” way to do things. It is better than most I’ve seen, but it’s still too close to what I refer to as a “naïve” approach in this answer. For one thing, it doesn’t account for HTML comments getting in the way of things. There are also some legal but less obvious matters, such as entities, that it isn’t dealing with. But it’s HTML comments where most such approaches fall down.

A more natural way is to use a lexer to peel off tokens, more like what is shown in this answer’s script, then assemble those tokens meaningfully. The lexer would be able to know about the HTML comments easily enough.
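To make the lexer idea concrete, here is a minimal sketch in JavaScript, the language of the snippet in the question. It is not taken from any real parser: the function name `tokenize`, the token shapes, and the two tag patterns are all assumptions made for illustration, and it ignores doctypes, CDATA, entities, and `<script>`/`<style>` content. The point is only that checking for `<!--` before trying the tag patterns keeps markup inside comments from being mistaken for tags.

```javascript
// Hypothetical patterns; the sticky flag (y) makes each one match only at lastIndex.
const END_TAG = /<\/([A-Za-z][\w-]*)[^>]*>/y;
// Quoted attribute values may contain ">", so they are consumed as whole units.
const START_TAG = /<([A-Za-z][\w-]*)((?:"[^"]*"|'[^']*'|[^>"'])*)>/y;

function matchAt(re, text, pos) {
  re.lastIndex = pos;
  return re.exec(text);
}

// Splits markup into comment, startTag, endTag and text tokens (illustrative shapes).
function tokenize(html) {
  const tokens = [];
  let pos = 0;
  while (pos < html.length) {
    if (html.startsWith("<!--", pos)) {
      // A comment runs to the first "-->"; an unterminated one swallows the rest.
      const end = html.indexOf("-->", pos + 4);
      const stop = end === -1 ? html.length : end;
      tokens.push({ type: "comment", value: html.slice(pos + 4, stop) });
      pos = end === -1 ? html.length : end + 3;
      continue;
    }
    let m;
    if ((m = matchAt(END_TAG, html, pos))) {
      tokens.push({ type: "endTag", name: m[1].toLowerCase() });
      pos += m[0].length;
    } else if ((m = matchAt(START_TAG, html, pos))) {
      tokens.push({
        type: "startTag",
        name: m[1].toLowerCase(),
        rawAttrs: m[2].trim(), // attributes are left unparsed in this sketch
      });
      pos += m[0].length;
    } else {
      // Text: everything up to the next "<", including a lone "<" that opens nothing.
      const next = html.indexOf("<", pos + 1);
      const stop = next === -1 ? html.length : next;
      tokens.push({ type: "text", value: html.slice(pos, stop) });
      pos = stop;
    }
  }
  return tokens;
}

const tokens = tokenize('<p class="a>b">Hi<!-- <b>not a tag</b> --></p>');
console.log(tokens.map((t) => t.type).join(","));
```

A later stage would then walk this token list to build a tree, and that is where recovery rules for unclosed or misnested tags would live. The regexes here only recognise individual tokens; they never try to match nested structure.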

You could approach this with a full grammar, such as the one shown here for parsing an RFC 5322 mail address. That is the sort of approach I take in the second, “wizardly” solution in this answer. But even that is only a complete grammar for well-formed HTML, and I’m only interested in a few different sorts of tags. Those I define fully, but I don’t define valid fields for tags I’m unconcerned with.
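As a rough illustration of the "grammar for well-formed markup only" limitation, the following JavaScript sketch parses a deliberately tiny, hypothetical grammar by recursive descent. It is not the regex-based grammar from the linked answer, and the name `parseWellFormed` and the node shapes are invented here. Attributes, comments, and void elements are left out so that the two grammar rules stay visible.

```javascript
// Grammar (a hypothetical, tiny well-formed subset):
//   content := (text | element)*
//   element := "<" name ">" content "</" name ">"
function parseWellFormed(src) {
  let pos = 0;

  function parseContent() {
    const nodes = [];
    // Content ends at the end of input or at any end tag.
    while (pos < src.length && !src.startsWith("</", pos)) {
      if (src[pos] === "<") {
        nodes.push(parseElement());
      } else {
        const next = src.indexOf("<", pos);
        const stop = next === -1 ? src.length : next;
        nodes.push({ text: src.slice(pos, stop) });
        pos = stop;
      }
    }
    return nodes;
  }

  function parseElement() {
    const open = /<([A-Za-z][\w-]*)>/y; // sticky: must match exactly at pos
    open.lastIndex = pos;
    const m = open.exec(src);
    if (!m) throw new SyntaxError("expected start tag at offset " + pos);
    pos += m[0].length;
    const children = parseContent();
    // The grammar demands the matching end tag right here; anything else is an error.
    const close = "</" + m[1] + ">";
    if (!src.startsWith(close, pos)) {
      throw new SyntaxError("expected " + close + " at offset " + pos);
    }
    pos += close.length;
    return { tag: m[1], children };
  }

  const nodes = parseContent();
  if (pos < src.length) {
    throw new SyntaxError("unexpected end tag at offset " + pos);
  }
  return nodes;
}

console.log(JSON.stringify(parseWellFormed("<ul><li>one</li><li>two</li></ul>")));

try {
  parseWellFormed("<b><i>x</b></i>"); // misnested, so the grammar rejects it
} catch (e) {
  console.log(e.message);
}
```

The second call fails because a strict grammar has no rule for misnested tags. A parser meant for the malformed markup mentioned in the question would need explicit recovery rules on top of something like this.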


2 Comments

Very nice answer. But how does (for example) the DOM library work internally? What does it do?
@Peter: the DOM library is free software; you can just read its source code if you really want to find out.
