I need to quickly extract text from HTML files. I am using the following regular expressions instead of a full-fledged parser since I need to be fast rather than accurate (I have more than a terabyte of text). The profiler shows that most of the time in my script is spent in the re.sub procedure. What are good ways of speeding up my process? I can implement some portions in C, but I wonder whether that will help given that the time is spent inside re.sub, which I think would be efficiently implemented.

    import re

    # Remove scripts, styles, tags, entities, and extraneous spaces.
    # re.S lets "." match newlines, so multi-line script/style blocks are removed too.
    scriptRx = re.compile(r"<script.*?/script>", re.I | re.S)
    styleRx = re.compile(r"<style.*?/style>", re.I | re.S)
    tagsRx = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
    entitiesRx = re.compile(r"&[0-9a-zA-Z]+;")
    spacesRx = re.compile(r"\s{2,}")
    ....
    text = scriptRx.sub(" ", text)
    text = styleRx.sub(" ", text)
    ....

Thanks!
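For anyone who wants to reproduce the profile, the pipeline above can be sketched as a small self-contained script. This is only an approximation of my real code: the function name `strip_html`, the sample string, the `re.S` flag on the script/style patterns, and the assumption that the elided steps apply the remaining patterns in the order they are defined are all illustrative guesses.

```python
import re

# Patterns as in the question. re.S is an assumption added here so that
# "." also matches newlines inside multi-line <script>/<style> blocks.
SCRIPT_RX = re.compile(r"<script.*?/script>", re.I | re.S)
STYLE_RX = re.compile(r"<style.*?/style>", re.I | re.S)
TAGS_RX = re.compile(r"<[!/]?[a-zA-Z-]+[^<>]*>")
ENTITIES_RX = re.compile(r"&[0-9a-zA-Z]+;")
SPACES_RX = re.compile(r"\s{2,}")

# Assumed order: the order in which the patterns are defined.
PIPELINE = (SCRIPT_RX, STYLE_RX, TAGS_RX, ENTITIES_RX, SPACES_RX)


def strip_html(text):
    """Hypothetical helper: replace every match of each pattern with a space."""
    for rx in PIPELINE:
        text = rx.sub(" ", text)
    return text


if __name__ == "__main__":
    sample = "<html><script>var x = 1;\n</script><p>Hello &amp; <b>world</b></p></html>"
    print(repr(strip_html(sample)))
```

Each pattern makes its own full pass over the text, so a document is scanned five times; that is where the `re.sub` time in the profile comes from.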