How can I use Python to extract information from an HTML document?

Question

I need Python to extract some data from an HTML file.

The code I am using at the moment is below:

    import urllib  # Python 2; in Python 3 use urllib.request.urlopen

    # The URL must be passed as a quoted string
    recent = urllib.urlopen("http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/match?id=46057240")
    recentsource = recent.read()

I now need it to print a list of the gamer tags that appear in that page's table for the other team.

How can I do this?

Thanks

use beautifulsoup: crummy.com/software/BeautifulSoup

– Casimir et Hippolyte, Commented Sep 28, 2014 at 2:01

Electron · Accepted Answer · 2014-09-28 02:13:18Z

Look at the Beautiful Soup module, which is a wonderful HTML and XML parser.

If you do not want to or cannot install it, you can download the source code and keep the package in the same directory as your program.

To do so, download and extract the code from the website, then copy the "bs4" directory into the same folder as your Python script.

Then, put this in the beginning of your code:

    from bs4 import BeautifulSoup
    # or
    from bs4 import BeautifulSoup as bs
    # To type bs instead of BeautifulSoup every single time you use it

You can learn how to use it from other Stack Overflow questions or from the documentation.
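If you cannot install or bundle Beautiful Soup at all, the standard library's html.parser can do a rough version of the same job. The following is only a minimal Python 3 sketch of that substitute, not Beautiful Soup's API. The class name CellCollector and the sample markup are made up for illustration, and the real page's table layout may differ.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:   # rows holding only <th> cells stay empty and are skipped
                self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical markup, loosely modelled on a roster table.
SAMPLE_HTML = """
<table>
  <tr><th>Role</th><th>Username</th></tr>
  <tr><td><b>Leader</b></td><td>player_one</td></tr>
  <tr><td>Member</td><td>player_two</td></tr>
</table>
"""

parser = CellCollector()
parser.feed(SAMPLE_HTML)
parser.close()

# The second column holds the username in this made-up layout.
usernames = [row[1] for row in parser.rows]
print(usernames)
```

With Beautiful Soup available, the cell extraction collapses to roughly [td.get_text(strip=True) for td in BeautifulSoup(html, "html.parser").find_all("td")], which is the main reason to prefer it.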

Avinash Babu · Accepted Answer · 2014-09-28 02:14:20Z

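You can use html2text for this job, or you can use nltk.

Sample code:

    import nltk
    from urllib import urlopen  # Python 2

    url = "http://any-url"
    html = urlopen(url).read()
    # clean_html works in NLTK 2.x only; in NLTK 3 it raises NotImplementedError
    # and points to BeautifulSoup's get_text() instead
    raw = nltk.clean_html(html)
    print(raw)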
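If neither library is an option, the same kind of tag stripping can be approximated with the standard library. This is a rough Python 3 sketch, not what html2text or nltk do internally. TextExtractor and html_to_text are hypothetical names, and the sample string is invented.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep text nodes and drop tags, skipping <script> and <style> bodies."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs defaults to True, so entities such as &amp; are decoded
        super().__init__()
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Join text nodes with spaces and collapse whitespace runs.
        # Limitation: text split by inline tags (<b>bo</b>ld) gains a space.
        return " ".join(" ".join(self._parts).split())


def html_to_text(html):
    """Return the visible text of an HTML string as a single line."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h3>Team A</h3><p>Leader: player_one &amp; co.</p></body></html>"
)
print(html_to_text(sample))
```

This only flattens the page to text. Picking out specific table cells still needs a parser that keeps track of the document structure.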

PaulMcG · Accepted Answer · 2014-09-28 03:45:03Z

pyparsing has some helpful constructs for pulling data from HTML pages, and the results tend to be self-structuring and self-naming (if you set up the parser/scanner correctly). Here is a pyparsing solution for this particular web page:

    from pyparsing import *

    # for stripping HTML tags
    anyTag,anyClose = makeHTMLTags(Word(alphas,alphanums+":_"))
    commonHTMLEntity.setParseAction(replaceHTMLEntity)
    stripHTML = lambda tokens: (commonHTMLEntity | Suppress(anyTag | anyClose) ).transformString(''.join(tokens))

    # make pyparsing expressions for HTML opening and closing tags
    # (suppress all from results, as there is no interesting content in the tags or their attributes)
    h3,h3End = map(Suppress,makeHTMLTags("h3"))
    table,tableEnd = map(Suppress,makeHTMLTags("table"))
    tr,trEnd = map(Suppress,makeHTMLTags("tr"))
    th,thEnd = map(Suppress,makeHTMLTags("th"))
    td,tdEnd = map(Suppress,makeHTMLTags("td"))

    # nothing interesting in column headings - parse them, but suppress the results
    colHeading = Suppress(th + SkipTo(thEnd) + thEnd)

    # simple routine for defining data cells, with optional results name
    colData = lambda name='' : td + SkipTo(tdEnd)(name) + tdEnd

    playerListing = Group(tr + colData() + colData() +
                          colData("username") +
                          colData().setParseAction(stripHTML)("role") +
                          colData("networkID") +
                          trEnd)

    teamListing = (h3 + ungroup(SkipTo("Match Players" + h3End, failOn=h3))("name") + "Match Players" + h3End +
                   table + tr + colHeading*5 + trEnd +
                   Group(OneOrMore(playerListing))("players"))

    for team in teamListing.searchString(recentsource):
        # use this to print out names and structures of results
        #print team.dump()

        print "Team:", team.name
        for player in team.players:
            print "- %s: %s (%s)" % (player.role, player.username, player.networkID)
            # or like this
            # print "- %(role)s: %(username)s (%(networkID)s)" % player
        print

Prints:

    Team: Team CrYpToN Gaming EU
    - Leader: CrYpToN_Crossy (CrYpToN_Crossy)
    - Captain: Juddanorty (CrYpToN_Judd)
    - Member: BLaZe_Elfy (CrYpToN_Elfy)

    Team: eXCeL™
    - Leader: Caaahil (Caaahil)
    - Member: eSportsmanship (eSportsmanship)
    - Member: KillBoy-NL (iClown-x)
