
I use the code below to read tables from websites. With the first example everything works as expected. With the second example (the commented-out variables) I only get the first column. I can't find the reason for it. Can somebody help here?

It would also be nice to have a simple way to create nicer output of the tables.

import urllib2
import pprint
from bs4 import BeautifulSoup

URL = 'http://www.proplanta.de/Markt-und-Preis/MATIF-Raps/'
TABLENR = 36
#URL = 'http://www1.chineseshipping.com.cn/en/indices/ccfinew.jsp'
#TABLENR = 4

req = urllib2.Request(URL, headers={'User-Agent': "My Browser"})
con = urllib2.urlopen(req)
html = con.read()

soup = BeautifulSoup(html)
tables = soup.find_all('table')

data = []
rows = tables[TABLENR].find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [ele.text.strip() for ele in cols]
    data.append([ele for ele in cols if ele])  # Get rid of empty values

pprint.pprint(data)
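For the second part (nicer output), padding each column to its widest cell already goes a long way. This is only a rough sketch on top of the data list built above; the print_table name and the pipe separator are arbitrary:

def print_table(data):
    # Print a list of rows (lists of strings) as aligned, pipe-separated columns.
    rows = [r for r in data if r]  # skip rows that came back empty
    if not rows:
        return
    ncols = max(len(r) for r in rows)
    widths = [max((len(r[i]) if i < len(r) else 0) for r in rows) for i in range(ncols)]
    for r in rows:
        print(" | ".join((r[i] if i < len(r) else "").ljust(widths[i]) for i in range(ncols)))

print_table(data)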
4 Comments
  • In your second example (I didn't check the first), the data in the other columns is generated by JavaScript. Commented Mar 17, 2016 at 14:27
  • OK, that explains the issue. Any suggestion on how I can read the table? Commented Mar 17, 2016 at 14:36
  • I think the standard solution is to use Selenium, PhantomJS, Ghostery or some other JavaScript engine or "robot browser". I don't know much about any of them, but I keep hearing those described as straightforward solutions to scraping JS content. But even better, maybe you can access the site's API directly. If you're lucky, it'll return nicely formatted JSON or XML. Commented Mar 17, 2016 at 14:39
  • @robvoi Yep, you're lucky. The API returns JSONP data :) Commented Mar 17, 2016 at 14:47

2 Answers


You could use the API instead. Much cleaner (even if my code might not be).

import requests
import json

url = "http://index.chineseshipping.com.cn/servlet/ccfiGetContrast?SpecifiedDate=&jc="
jsonp = requests.get(url)
# The endpoint returns JSONP (callback(...)); strip the wrapper before parsing the JSON.
table_data = json.loads(jsonp.text.encode("utf-8").split("(")[1].split(")")[0])
# SCRAPE RESPONSIBLY. WE DON'T WANT TO DDOS SOME POOR WEBSITE
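The split-based unwrapping works, but it breaks if the JSON payload itself contains a parenthesis. A slightly more defensive variant of the same idea, using a regex (the regex and variable names are my own; the endpoint is the one above):

import re
import json
import requests

url = "http://index.chineseshipping.com.cn/servlet/ccfiGetContrast?SpecifiedDate=&jc="
resp = requests.get(url)
# JSONP looks like callback({...}); take everything between the first '(' and the last ')'.
match = re.search(r"\((.*)\)", resp.text, re.DOTALL)
table_data = json.loads(match.group(1)) if match else None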

1 Comment

This works. Thanks a lot!

The webpage which is not working uses JavaScript. JavaScript is used to create dynamic content, which it does by altering the DOM (Document Object Model). The browser receives the data and then runs the JavaScript that changes it (in your case, the table data gets filled in). When you fetch the page with urllib2, you receive the content but nothing runs the JavaScript on it. By using Selenium we let a real browser do that work, and then we read the complete, rendered page.

from selenium import webdriver
from bs4 import BeautifulSoup

webpage = webdriver.Firefox()
webpage.get('http://www1.chineseshipping.com.cn/en/indices/ccfinew.jsp')
html = webpage.page_source
soup = BeautifulSoup(html)
tables = soup.find_all('table')
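If the table is still empty when page_source is read, the JavaScript probably has not finished yet. An explicit wait before parsing helps; in this sketch the 10-second timeout and the 'table' locator are my own choices:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
try:
    driver.get('http://www1.chineseshipping.com.cn/en/indices/ccfinew.jsp')
    # Wait up to 10 seconds for at least one <table> to appear in the DOM.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'table')))
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    tables = soup.find_all('table')
finally:
    driver.quit()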

1 Comment

This works. Thanks a lot!
