Python parsing a table from webpage

Question

I'm trying to parse the table in this link into a structured data type, e.g. a DataFrame or JSON. However, none of the approaches I tried worked, including requests and pandas.read_html.

Finally I found it was because the HTML obtained from the webpage did not contain the information in the table. For example, the string "贵广转债" is clearly present in the table body but absent from the page source (Ctrl+F gives no match). However, the string is present when you right-click the cell and choose Inspect.

It seems that if I can get the information in the Inspect -> Elements panel then I may be able to parse the table out. How can I do this?

These days a lot of content is updated via JavaScript, so the HTML that loads the page (view source) and the HTML that gets updated and rendered (inspect) will not always be the same. A number of JavaScript events can modify the page content, and those changes won't be reflected in the page source.
– Chris Doyle, Commented Jul 1, 2019 at 10:27
@DecaK first the OP needs to identify if this is indeed the case. When I have coded such things, which have JavaScript triggers or on-click events that modify the page, I have used Selenium for such activities.
– Chris Doyle, Commented Jul 1, 2019 at 10:33
@ChrisDoyle Yes, this table is the price data of a bunch of actively traded convertible bonds. They should be updated frequently.
– Vim, Commented Jul 1, 2019 at 10:59

abdusco · Accepted Answer · 2019-07-01 10:43:08Z

For dynamic pages that load data with AJAX requests, try monitoring the Network tab in Developer Tools (F12) and find the request you need.

Here the ticker data is requested from https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934

    POST https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
    Accept: application/json, text/javascript, */*; q=0.01
    Accept-Language: en-US,de-DE;q=0.7,en;q=0.3
    Referer: https://www.jisilu.cn/data/cbnew/
    X-Requested-With: XMLHttpRequest
    Connection: keep-alive
    Cookie: kbzw__Session=7n47d42nc28n259v722k8onhq5; kbz_newcookie=1
    Cache-Control: max-age=0
    Content-Type: application/x-www-form-urlencoded; charset=UTF-8

    fprice=&tprice=&volume=&svolume=&premium_rt=&ytm_rt=&rating_cd=&is_search=N&btype=&listed=Y&industry=&bond_ids=&rp=50&page=1

    <> 2019-07-01T013533.200.json
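The number at the end of that URL looks like a millisecond Unix timestamp: 1561977181934 corresponds to 2019-07-01 10:33 UTC, shortly before this answer was posted. Assuming that is what the `___t` part carries (a guess, not something confirmed here, and the server may not check it at all), a fresh URL could be built along these lines. `build_list_url` and `ms_to_utc` are hypothetical helper names.

```python
import time
from datetime import datetime, timezone

BASE_URL = "https://www.jisilu.cn/data/cbnew/cb_list/"


def build_list_url(now_ms=None):
    """Build the list URL; the ___t value is ASSUMED to be a millisecond timestamp."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return f"{BASE_URL}?___jsl=LST___t={now_ms}"


def ms_to_utc(ms):
    """Interpret a millisecond Unix timestamp as a UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# Reproduces the URL captured in the Network tab.
print(build_list_url(1561977181934))
# Shows what moment the captured value would correspond to.
print(ms_to_utc(1561977181934).strftime("%Y-%m-%d %H:%M"))
```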

You can then use the requests library or any other HTTP client to fetch the JSON (remember to supply headers/cookies if necessary), then use the JSON however you like.
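If the endpoint rejects a bare request, the headers seen in the Network tab can be sent along. Below is a minimal sketch that only builds the POST request, using the standard library so it runs without extra packages. `build_request` is a hypothetical helper, reading `rp` as rows per page and `page` as the page number is an assumption based on the captured form body, and the session cookie is left out because it is specific to one browser session.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934"

# Headers copied from the captured request (Cookie intentionally omitted).
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Referer": "https://www.jisilu.cn/data/cbnew/",
    "X-Requested-With": "XMLHttpRequest",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
}


def build_request(page=1, rows_per_page=50):
    """Build (but do not send) the POST request for one page of results."""
    # Field order mirrors the captured form body.
    form = {
        "fprice": "", "tprice": "", "volume": "", "svolume": "",
        "premium_rt": "", "ytm_rt": "", "rating_cd": "", "is_search": "N",
        "btype": "", "listed": "Y", "industry": "", "bond_ids": "",
        "rp": str(rows_per_page),  # ASSUMPTION: rows per page
        "page": str(page),         # ASSUMPTION: 1-based page number
    }
    body = urlencode(form).encode("ascii")
    return Request(URL, data=body, headers=HEADERS, method="POST")


req = build_request(page=1)
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

To actually send it, `urllib.request.urlopen(req)` returns a response whose body can be passed to `json.load`. The same `HEADERS` dict can also be passed as `headers=` to `requests.post`.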

Python

Using this information, you can use the requests library as follows:

    import requests

    if __name__ == '__main__':
        data = {
            'fprice': '',
            'tprice': '',
            'volume': '',
            'svolume': '',
            'premium_rt': '',
            'ytm_rt': '',
            'rating_cd': '',
            'is_search': 'N',
            'btype': '',
            'listed': 'Y',
            'industry': '',
            'bond_ids': '',
            'rp': '50',
            'page': '',
        }
        res = requests.post('https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934', data=data)
        res.raise_for_status()
        data = res.json()
        print(data)

which gives you a very large JSON object:

{'page': 1, 'rows': [{'id': '110052', 'cell': {'bond_id': '110052', 'bond_nm': '贵广转债', 'stock_id': 'sh600996', 'stock_nm': '贵广 ... and goes on much longer
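Since the question asks for a structured data type, here is a rough sketch of flattening that response into one record per table row. It assumes every entry in `rows` carries its fields under a `cell` key, as in the snippet above. The sample payload contains only the three fields that are fully visible there. `rows_to_records` and `records_to_csv` are hypothetical helper names.

```python
import csv
import io


def rows_to_records(payload):
    """Return one flat dict per table row, taken from each row's 'cell' mapping."""
    return [row["cell"] for row in payload.get("rows", [])]


def records_to_csv(records):
    """Serialize the records as CSV text, with columns in first-seen order."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Trimmed-down sample shaped like the response shown above.
sample = {
    "page": 1,
    "rows": [
        {
            "id": "110052",
            "cell": {"bond_id": "110052", "bond_nm": "贵广转债", "stock_id": "sh600996"},
        }
    ],
}

records = rows_to_records(sample)
print(records_to_csv(records))
```

With pandas installed, `pandas.DataFrame(records)` should turn the same list of dicts into a DataFrame.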

Software · Accepted Answer · 2019-07-01 10:35:41Z

For scraping webpages that are updated or loaded dynamically, I would recommend using Selenium with Python. It loads the page in a browser and lets you interact with it programmatically from there.
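A minimal sketch of that approach, under a few assumptions: Chrome and a matching driver are available, the table rows match the CSS selector `table tbody tr` (a guess, not checked against the real page), and ten seconds is long enough for the data to load. `fetch_rendered_html`, `parse_table_rows` and `TableRowParser` are hypothetical names. The Selenium imports sit inside the function so the parsing part also runs where Selenium is not installed.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table_rows(html):
    """Return the table rows found in html as lists of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows


def fetch_rendered_html(url, wait_css="table tbody tr", timeout=10):
    """Load url in Chrome and return the DOM after JavaScript has filled it in."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one row exists (selector is an ASSUMPTION).
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


# Offline demonstration on a hand-written fragment.
demo = "<table><tbody><tr><td>110052</td><td>贵广转债</td></tr></tbody></table>"
print(parse_table_rows(demo))
```

In practice the two would be chained as `parse_table_rows(fetch_rendered_html(page_url))`, where `page_url` is the page from the question. The rendered HTML can also be handed to `pandas.read_html` (wrapped in `io.StringIO` on recent pandas versions).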
