I need Python to extract some data from an HTML file.

The code I am using at the moment is below:

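    import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead

    # the URL must be passed as a quoted string
    recent = urllib.urlopen("http://gamebattles.majorleaguegaming.com/ps4/call-of-duty-ghosts/team/TeamCrYpToNGamingEU/match?id=46057240")
    recentsource = recent.read()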

I now need it to print a list of the gamer tags that appear in that page's table for the other team.

How can I do this?

Thanks

3 Answers

2

Look at the Beautiful Soup module, which is a wonderful HTML parser.

If you do not want to or can't install it, you can download the source code and put the package in the same directory as your program.

To do so, download and extract the code from the website, and copy the "bs4" directory into the same folder as your Python script.

Then, put this at the beginning of your code:

    from bs4 import BeautifulSoup
    # or
    from bs4 import BeautifulSoup as bs  # to type bs instead of BeautifulSoup every single time you use it

You can learn how to use it from other Stack Overflow questions, or look at the documentation.
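As a rough illustration of what that could look like, here is a minimal sketch in Python 3. The markup in `SAMPLE_HTML` is hypothetical: it assumes each team has an `<h3>` heading ending in "Match Players" followed by a table whose third column holds the username, which may not match the real page exactly. The function name `gamer_tags_by_team` is also just an illustrative choice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an <h3> heading per team, followed by a table
# with five columns. The real page may be laid out differently.
SAMPLE_HTML = """
<h3>Team Alpha Match Players</h3>
<table>
  <tr><th>#</th><th>Status</th><th>Username</th><th>Role</th><th>Network ID</th></tr>
  <tr><td>1</td><td>ok</td><td>alice</td><td>Leader</td><td>alice_psn</td></tr>
  <tr><td>2</td><td>ok</td><td>bob</td><td>Member</td><td>bob_psn</td></tr>
</table>
<h3>Team Beta Match Players</h3>
<table>
  <tr><th>#</th><th>Status</th><th>Username</th><th>Role</th><th>Network ID</th></tr>
  <tr><td>1</td><td>ok</td><td>carol</td><td>Leader</td><td>carol_psn</td></tr>
</table>
"""

USERNAME_COLUMN = 2  # assumption: usernames sit in the third column


def gamer_tags_by_team(html):
    """Return {team name: [username, ...]} for each 'Match Players' table."""
    soup = BeautifulSoup(html, "html.parser")
    teams = {}
    for heading in soup.find_all("h3"):
        title = heading.get_text(strip=True)
        if not title.endswith("Match Players"):
            continue
        table = heading.find_next("table")
        if table is None:
            continue
        tags = []
        for row in table.find_all("tr"):
            cells = row.find_all("td")
            # header rows use <th>, so they yield no <td> cells and are skipped
            if len(cells) > USERNAME_COLUMN:
                tags.append(cells[USERNAME_COLUMN].get_text(strip=True))
        team_name = title[: -len("Match Players")].strip()
        teams[team_name] = tags
    return teams


for team_name, tags in gamer_tags_by_team(SAMPLE_HTML).items():
    print(team_name, tags)
```

To try it on the real page, pass the downloaded HTML (`recentsource` in the question) instead of `SAMPLE_HTML`, and adjust the heading test and column index to whatever the page actually contains.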


0

You can use html2text for this job, or you can use nltk.

Sample code:

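    import nltk
    from urllib import urlopen  # Python 2

    url = "http://any-url"
    html = urlopen(url).read()
    # note: clean_html works in NLTK 2.x only; NLTK 3 raises NotImplementedError
    # and points to BeautifulSoup's get_text() instead
    raw = nltk.clean_html(html)
    print(raw)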
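As a standard-library alternative that avoids third-party packages, the same kind of tag stripping can be sketched with `html.parser` in Python 3. This is a different technique from the one above, not a reimplementation of it, and the names `TextExtractor` and `strip_tags` are illustrative only.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep only non-blank text that is outside script/style elements
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h3>Team Alpha</h3><p>alice &amp; bob</p></body></html>"
)
print(strip_tags(sample))
```

This only flattens the page to plain text; picking out a specific table column still needs a parser that keeps the document structure.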


0

pyparsing has some helpful constructs for pulling data from HTML pages, and the results tend to be self-structuring and self-naming (if you set up the parser/scanner correctly). Here is a pyparsing solution for this particular web page:

    from pyparsing import *

    # for stripping HTML tags
    anyTag,anyClose = makeHTMLTags(Word(alphas,alphanums+":_"))
    commonHTMLEntity.setParseAction(replaceHTMLEntity)
    stripHTML = lambda tokens: (commonHTMLEntity | Suppress(anyTag | anyClose) ).transformString(''.join(tokens))

    # make pyparsing expressions for HTML opening and closing tags
    # (suppress all from results, as there is no interesting content in the tags or their attributes)
    h3,h3End = map(Suppress,makeHTMLTags("h3"))
    table,tableEnd = map(Suppress,makeHTMLTags("table"))
    tr,trEnd = map(Suppress,makeHTMLTags("tr"))
    th,thEnd = map(Suppress,makeHTMLTags("th"))
    td,tdEnd = map(Suppress,makeHTMLTags("td"))

    # nothing interesting in column headings - parse them, but suppress the results
    colHeading = Suppress(th + SkipTo(thEnd) + thEnd)

    # simple routine for defining data cells, with optional results name
    colData = lambda name='' : td + SkipTo(tdEnd)(name) + tdEnd

    playerListing = Group(tr + colData() + colData() +
                          colData("username") +
                          colData().setParseAction(stripHTML)("role") +
                          colData("networkID") +
                          trEnd)

    teamListing = (h3 + ungroup(SkipTo("Match Players" + h3End, failOn=h3))("name") + "Match Players" + h3End +
                   table + tr + colHeading*5 + trEnd +
                   Group(OneOrMore(playerListing))("players"))

    for team in teamListing.searchString(recentsource):
        # use this to print out names and structures of results
        #print team.dump()
        print "Team:", team.name
        for player in team.players:
            print "- %s: %s (%s)" % (player.role, player.username, player.networkID)
            # or like this
            # print "- %(role)s: %(username)s (%(networkID)s)" % player
        print

Prints:

    Team: Team CrYpToN Gaming EU
    - Leader: CrYpToN_Crossy (CrYpToN_Crossy)
    - Captain: Juddanorty (CrYpToN_Judd)
    - Member: BLaZe_Elfy (CrYpToN_Elfy)

    Team: eXCeL™
    - Leader: Caaahil (Caaahil)
    - Member: eSportsmanship (eSportsmanship)
    - Member: KillBoy-NL (iClown-x)
