4

I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.

# Remove scripts, styles, tags, entities, and extraneous spaces:
scriptRx = re.compile(r"<script.*?/script>", re.I)
styleRx = re.compile(r"<style.*?/style>", re.I)
tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx = re.compile(r"\s{2,}")
....
text = scriptRx.sub(" ", text)
text = styleRx.sub(" ", text)
....

Thanks!

5 Comments
  • I am pretty sure a decent (x)html parser (or a little hand-made parser) out-performs regex. Commented Jul 18, 2010 at 21:09
  • It looks like you are calling .sub() quite a few times; if "text" is large, it's going to be a lot more efficient to do what you need in one regex. In your question you didn't clarify which regex is slow: did you mean that all of them combined are slow, or is there an individual one that is particularly slow? Commented Jul 18, 2010 at 21:37
  • @Bart: Any reason to think doing a complete parse will be faster than regex? Any reason to think a hand-made parser will outperform a fine-tuned and optimized regex library? Commented Jul 18, 2010 at 22:12
  • ...and for the same reason, I am not looking for accuracy, but speed. Commented Jul 18, 2010 at 22:17
  • @Abhi: A regular expression search may recurse a lot more than a parser, especially if you use (as you have above) a lot of variable-width wildcard expressions like .*? A simple parser might just make one pass over the string, for example. If it relied mostly on built-in string functions, it could be very fast. Commented Jul 19, 2010 at 2:37

6 Answers

8

First, use an HTML parser built for this, like BeautifulSoup:

http://www.crummy.com/software/BeautifulSoup/
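
For the text-extraction part, a minimal sketch with the BeautifulSoup 4 API might look like this (the filename is a placeholder and error handling is omitted):

from bs4 import BeautifulSoup

# A sketch, not from the original answer: drop <script>/<style> and pull out
# the visible text. "page.html" is a placeholder filename.
with open("page.html", encoding="utf-8", errors="ignore") as f:
    soup = BeautifulSoup(f, "html.parser")

for element in soup(["script", "style"]):
    element.decompose()  # remove the tag together with its contents

text = " ".join(soup.get_text().split())  # collapse runs of whitespace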

Then, you can identify remaining particular slow spots with the profiler:

http://docs.python.org/library/profile.html
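
A quick way to do that, assuming the cleanup code is wrapped in a function (clean and text here are placeholders for your own function and data):

import cProfile, pstats

cProfile.run("clean(text)", "regex.prof")  # profile one cleanup pass
pstats.Stats("regex.prof").sort_stats("cumulative").print_stats(10)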

And for learning about regular expressions, I've found Mastering Regular Expressions very valuable, no matter what the programming language:

http://oreilly.com/catalog/9781565922570

Also:

How can I debug a regular expression in python?

Given the clarified use-case, the above is not what you want for this task. My alternate recommendation would be: Speeding up regular expressions in Python


12 Comments

lxml might be faster than BeautifulSoup; you should try both.
Obviously, doing a complete parse can never be faster than substituting a few regular expressions, unless there is some super-duper HTML parsing library. (I tried BS; it's an order of magnitude slower than the regexes I am using.)
The one downside to BeautifulSoup is its speed: it's very slow.
@Abhi: well, it depends on what you mean by "parsing" I suppose. Many parsers (and BS may be among them) use regular expressions to identify the tags and attributes, and those would be slow for the same reason all regular expressions are slow. But one could conceivably write a parser as a finite state machine, which goes through one character at a time and alters its state based on the character and current state, and that would be a lot faster. (Quite complicated to code for HTML though)
I'd say David Zaslavsky is spot on: if you just do simple removal of portions of HTML, write a character-by-character parse. A lot of the complexity a parser would deal with (tags/nesting) is nonexistent here, since you don't care to remember the tags, just to remove them; a rough sketch of that kind of single pass follows.
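
For illustration, a single pass over the string using only built-in string methods might look like this sketch (not from the thread; it drops ordinary tags and the bodies of script/style blocks, and ignores HTML comments and entities):

def strip_tags(html):
    out = []
    i = 0
    lower = html.lower()
    while True:
        lt = html.find("<", i)
        if lt == -1:
            out.append(html[i:])
            break
        out.append(html[i:lt])
        if lower.startswith("<script", lt) or lower.startswith("<style", lt):
            # jump straight past the matching closing tag
            end = "</script>" if lower.startswith("<script", lt) else "</style>"
            close = lower.find(end, lt)
            i = len(html) if close == -1 else close + len(end)
        else:
            # ordinary tag: skip to its closing angle bracket
            gt = html.find(">", lt)
            i = len(html) if gt == -1 else gt + 1
    return " ".join("".join(out).split())
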
5

You're processing each file five times, so the first thing you should do (as Paul Sanwald said) is try to reduce that number by combining your regexes together. I would also avoid using reluctant quantifiers, which are designed for convenience at the expense of efficiency. Consider this regex:

<script.*?</script> 

Each time the . goes to consume another character, it first has to make sure </script> won't match at that spot. It's almost like doing a negative lookahead at every position:

<script(?:(?!</script>).)*</script> 

But we know there's no point doing the lookahead if the next character is anything but <, and we can tailor the regex accordingly:

<script[^<]*(?:<(?!/script>)[^<]*)*</script> 

When I test them in RegexBuddy with this target string:

<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script> 

...the reluctant regex takes 173 steps to make the match, while the tailored regex takes only 28.
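
Whether those step counts translate into wall-clock time in Python's re module is easy to check with timeit (a sketch; numbers will vary by machine and input):

import re, timeit

snippet = """<script type="text/javascript">var imagePath='http://sstatic.net/stackoverflow/img/';</script>"""

lazy = re.compile(r"<script.*?</script>", re.I)
unrolled = re.compile(r"<script[^<]*(?:<(?!/script>)[^<]*)*</script>", re.I)

for name, rx in (("lazy", lazy), ("unrolled", unrolled)):
    print(name, timeit.timeit(lambda: rx.sub(" ", snippet), number=100000))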

Combining your first three regexes into one yields this beast:

<(?:(script|style)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>) 

You might want to zap the <HEAD> element while you're at it (i.e., (script|style|head)).

I don't know what you're doing with the fourth regex, for character entities--are you just deleting those, too? I'm guessing the fifth regex has to be run separately, since some of the whitespace it's cleaning up is generated by the earlier steps. But try it with the first three regexes combined and see how much difference it makes. That should tell you if it's worth going forward with this approach.
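
Dropped into the original script, the combined version might look like this (a sketch; it folds the first three patterns, plus head, into one pass and keeps the question's entity and whitespace passes unchanged):

import re

blockRx = re.compile(
    r"<(?:(script|style|head)[^<]*(?:<(?!/\1)[^<]*)*</\1>|[!/]?[a-zA-Z-]+[^<>]*>)",
    re.I)
entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
spacesRx = re.compile(r"\s{2,}")

def clean(text):
    text = blockRx.sub(" ", text)      # scripts, styles, head, and ordinary tags in one pass
    text = entitiesRx.sub(" ", text)   # character entities
    return spacesRx.sub(" ", text)     # collapse whitespace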

1 Comment

Thanks for the insight about reluctant quantifiers. Profiling did show that these are expensive! I will modify my regex.
1

One thing you can do is combine the script/style regexes using backreferences. Here's some sample data:

$ cat sample
<script>some stuff</script>
<html>whatever
</html>
<style>some other stuff</style>

Using Perl:

perl -ne 'if (/<(script|style)>.*?<\/\1>/) { print $1; }' sample

It will match either script or style. I second the recommendation for "Mastering Regular Expressions"; it's an excellent book.
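
The same backreference trick carries over to Python's re module (a sketch; re.S lets the match span lines):

import re

blockRx = re.compile(r"<(script|style)\b[^>]*>.*?</\1>", re.I | re.S)

sample = "<script>some stuff</script> <html>whatever </html> <style>some other stuff</style>"
print(blockRx.sub(" ", sample))  # both blocks removed in a single pass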


1

If your use-case is indeed to parse a few things out of each of millions of documents, then my answer above won't help. I recommend some heuristics, like running a couple of "straight text" regexes on them to begin with, such as plain /script/ and /style/, to throw things out quickly if you can. In fact, do you really need the end-tag check at all? Isn't <style good enough? Leave validation to someone else. If the quick checks succeed, then put the rest into a single regex, like /<script|<style|\s{2,}|etc.../, so the text doesn't have to be scanned once for each regex.
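
As a rough sketch of that idea, reusing the compiled patterns from the question and plain substring checks as the cheap first pass:

def clean(text):
    # cheap substring checks before paying for the heavier regex passes;
    # scriptRx, styleRx, tagsRx, entitiesRx, spacesRx are the patterns from the question
    if "<script" in text or "<SCRIPT" in text:
        text = scriptRx.sub(" ", text)
    if "<style" in text or "<STYLE" in text:
        text = styleRx.sub(" ", text)
    text = tagsRx.sub(" ", text)
    text = entitiesRx.sub(" ", text)
    return spacesRx.sub(" ", text)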


0

The suggestion to use an HTML parser is a good one, since it'll quite possibly be faster than regular expressions. But I'm not sure BeautifulSoup is the right tool for the job, since it constructs a parse tree from the entire file and stores the whole thing in memory. For a terabyte of HTML, you'd need an obscene amount of RAM to do that ;-) I'd suggest you look at HTMLParser, which is written at a lower level than BeautifulSoup, but I believe it's a stream parser, so it will only load a bit of the text at a time.
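
A sketch along those lines with the standard library (html.parser in Python 3; the module was called HTMLParser in Python 2, and the filename is a placeholder):

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text while skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

parser = TextExtractor()
# feed() can be called repeatedly with chunks instead of reading the whole file
parser.feed(open("page.html", encoding="utf-8", errors="ignore").read())
text = " ".join(" ".join(parser.parts).split())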

3 Comments

Sorry, I meant 50 million HTML files. It's not one big piece of text. Anyway, HTMLParser didn't work: This is HTML-in-the-wild. Is there a clean way to extract all the text from BeautifulSoup's parse tree?
Ah, gotcha. I don't know all that much about BeautifulSoup, but I'd imagine it might be possible to extract the text. Did you try just converting it to a string? i.e. soup = BeautifulSoup(...) then str(soup) (I don't know if that's it, but that'd be my first guess if I were working on your project)
Thanks for getting back :) But I typed that comment in haste. BS is an order of magnitude slower than regex substitutions, so extracting text from its parse is moot :P
0

I would use a simple program with regular Python string partition, something like this; note that it has only been tested against a single example file containing a style tag:

## simple filtering when no hierarchical tags sit inside other discarded tags
start_tags = ('<style', '<script')
end_tags = ('</style>', '</script>')

out = open('cleaned.html', 'w')
end_tag = ''  # closing tag we are waiting for; '' when not inside a discarded block

for line in open('giant.html'):
    line = ' '.join(line.split())
    if end_tag:
        if end_tag in line:
            # end of the discarded block: keep whatever follows the closing tag
            _, tag, end = line.partition(end_tag)
            if end.strip():
                out.write(end + '\n')
            end_tag = ''
        continue  ## discard the rest of the line if no end tag was found in it
    found = [index for index, start_tag in enumerate(start_tags) if start_tag in line]
    for index in found:
        start, tag, end = line.partition(start_tags[index])
        if start.strip():
            out.write(start + '\n')  # keep the text before the discarded tag
        # drop everything up to the closing angle bracket of the start tag
        tag, _, end = end.partition('>')
        # check whether the closing tag is already on the same line
        if end_tags[index] in end:
            _, tag, end = end.partition(end_tags[index])
            if end.strip():
                out.write(end + '\n')
            end_tag = ''  # end tag found, reset
        else:
            end_tag = end_tags[index]  # no end tag on the same line
    if not found:
        out.write(line + '\n')
out.close()

