2

I'm trying to parse the table in this link into a structured data type such as a DataFrame or JSON. However, none of the approaches I tried worked, including requests and pandas.read_html.

I eventually found that this is because the HTML returned for the page does not contain the table's data. For example, the string "贵广转债" is clearly visible in the table body but is absent from the page source (Ctrl+F finds no match). However, the string is present when you right-click the cell and choose Inspect.


It seems that if I can get the information shown in the Inspect -> Elements panel, I should be able to parse the table. How can I do this?

5
  • A lot of content these days is updated via JavaScript, so the HTML that loads the page (View Source) and the HTML that gets updated and rendered (Inspect) will not always be the same. A number of JavaScript events can modify the page content, and those changes are not reflected in the page source. Commented Jul 1, 2019 at 10:27
  • @ChrisDoyle So what is your solution? Commented Jul 1, 2019 at 10:31
  • 1
    @DecaK First, the OP needs to confirm that this is indeed the case. When I have coded things with JavaScript triggers or on-click events that modify the page, I have used Selenium. Commented Jul 1, 2019 at 10:33
  • @ChrisDoyle Yes, this table holds the price data of a number of actively traded convertible bonds, so it should be updated frequently. Commented Jul 1, 2019 at 10:59
  • Well, you can use Selenium or PyQt. Commented Jul 1, 2019 at 19:01

2 Answers

3

For dynamic pages that load data with AJAX requests, open the Network tab in Developer Tools (F12) and find the request that fetches the data you need.

Here the ticker data is requested from https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934

    POST https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
    Accept: application/json, text/javascript, */*; q=0.01
    Accept-Language: en-US,de-DE;q=0.7,en;q=0.3
    Referer: https://www.jisilu.cn/data/cbnew/
    X-Requested-With: XMLHttpRequest
    Connection: keep-alive
    Cookie: kbzw__Session=7n47d42nc28n259v722k8onhq5; kbz_newcookie=1
    Cache-Control: max-age=0
    Content-Type: application/x-www-form-urlencoded; charset=UTF-8

    fprice=&tprice=&volume=&svolume=&premium_rt=&ytm_rt=&rating_cd=&is_search=N&btype=&listed=Y&industry=&bond_ids=&rp=50&page=1

    <> 2019-07-01T013533.200.json

You can then use the requests library or any other HTTP client to fetch the JSON (remember to supply headers/cookies if necessary), then use the JSON however you like.
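A minimal sketch of supplying those headers with requests, assuming the endpoint checks headers such as Referer or X-Requested-With (this is not confirmed). The header values and form fields are copied from the captured request above; the helper names `build_form` and `fetch_page` are hypothetical, and reading `rp` as "rows per page" is a guess. The requests import sits inside the function so the definitions can be inspected without network access.

```python
URL = "https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934"

# Headers copied from the captured request. Which of them the server
# actually requires is an assumption to verify by trial.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) "
                  "Gecko/20100101 Firefox/68.0",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Referer": "https://www.jisilu.cn/data/cbnew/",
    "X-Requested-With": "XMLHttpRequest",
}


def build_form(page=1, rows_per_page=50):
    """Return the form fields seen in the captured POST body.

    Assumption: `rp` means rows per page and `page` is the page number.
    """
    return {
        "fprice": "", "tprice": "", "volume": "", "svolume": "",
        "premium_rt": "", "ytm_rt": "", "rating_cd": "",
        "is_search": "N", "btype": "", "listed": "Y",
        "industry": "", "bond_ids": "",
        "rp": str(rows_per_page), "page": str(page),
    }


def fetch_page(page=1):
    """POST the form with browser-like headers and return the decoded JSON."""
    import requests  # third-party; imported lazily

    with requests.Session() as session:
        session.headers.update(HEADERS)
        res = session.post(URL, data=build_form(page), timeout=10)
        res.raise_for_status()
        return res.json()
```

The session keeps any cookies the server sets during the exchange, but it does not send the Cookie header from the capture; if the endpoint turns out to need one, it would have to be obtained from a fresh browser session.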

Python

Using this information, you can call the endpoint with the requests library as follows:

    import requests

    if __name__ == '__main__':
        # Form fields copied from the captured POST body.
        payload = {
            'fprice': '',
            'tprice': '',
            'volume': '',
            'svolume': '',
            'premium_rt': '',
            'ytm_rt': '',
            'rating_cd': '',
            'is_search': 'N',
            'btype': '',
            'listed': 'Y',
            'industry': '',
            'bond_ids': '',
            'rp': '50',
            'page': '1',  # the captured request sent page=1
        }
        res = requests.post(
            'https://www.jisilu.cn/data/cbnew/cb_list/?___jsl=LST___t=1561977181934',
            data=payload,
        )
        res.raise_for_status()  # fail loudly on HTTP errors
        data = res.json()
        print(data)

which gives you a very large JSON object:

{'page': 1, 'rows': [{'id': '110052', 'cell': {'bond_id': '110052', 'bond_nm': '贵广转债', 'stock_id': 'sh600996', 'stock_nm': '贵广 ... and goes on much longer 
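To get from that response to something tabular, one option is to pull the `cell` dictionary out of each row. The sketch below assumes every entry in `rows` has the `{'id': ..., 'cell': {...}}` shape visible in the output above; the helper name `rows_to_records` is hypothetical, and the sample contains only the fields that are fully visible in that output.

```python
def rows_to_records(payload):
    """Flatten {'rows': [{'id': ..., 'cell': {...}}, ...]} into a list of dicts.

    Assumption: each row carries its column values under the 'cell' key.
    """
    return [dict(row["cell"]) for row in payload.get("rows", [])]


# Truncated sample in the shape of the response shown above.
sample = {
    "page": 1,
    "rows": [
        {
            "id": "110052",
            "cell": {
                "bond_id": "110052",
                "bond_nm": "贵广转债",
                "stock_id": "sh600996",
            },
        },
    ],
}

records = rows_to_records(sample)
print(records[0]["bond_nm"])
```

A list of dicts like this can be passed straight to `pandas.DataFrame(records)` if a DataFrame is the goal, or written out with `json.dump`.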

2

When scraping a page whose content is loaded or updated dynamically, I would recommend using Selenium with Python. It loads the page in a real browser and lets you interact with it programmatically from there.
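A rough sketch of that approach, under several assumptions: Firefox and its driver are installed, the table rows match the CSS selector `table tbody tr` (a guess about the page's markup), and the page contains a single, non-nested table. `rendered_html` returns the DOM after JavaScript has run, which is what the Elements panel shows; `parse_table` is a small standard-library parser used here only to keep the example self-contained. Both helper names are hypothetical.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return table rows as lists of cell strings."""
    parser = TableParser()
    parser.feed(html)
    return parser.rows


def rendered_html(url, wait_css="table tbody tr", timeout=15):
    """Load `url` in Firefox, wait for rows to appear, return the rendered DOM."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Block until at least one element matching the selector exists.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Usage would look like `rows = parse_table(rendered_html(page_url))`, where `page_url` is the address of the page that shows the table. If pandas is available, passing the rendered HTML to `pandas.read_html` wrapped in `io.StringIO` is likely a shorter route than the hand-written parser.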
