
I have a task that I think my Python skills fall slightly short of. I have oil production data scraped from an open data website and I am looking to turn it into JSON format.

At present, with some help, I have run through the HTML table tags to build a list of headers and the row data. What I am struggling with is nesting some of this data within one JSON record.

Right now I can produce the JSON either by row or per header: a header with all of its column data under it, then the next header with its data, and so on (as below).

If possible I would like either (a) the headers with row 1's data as one record, then the headers again with row 2's data as the next record, and so on; or (b), which will make more sense if you look at the table source, one record per oil field, where a field might have multiple rows for different year/month production values.

In that case I would like to capture all of that in one JSON record: the multiple year/month rows, should they exist, would be nested inside the single record.

Essentially I think it needs a few loops over the distinct HTML table cells, which I think is achievable but slightly outside my Python capabilities. I hope this makes sense; I've tried to be quite descriptive.

JSON by Row

[ "\u00c5SGARD", "2017", "8", "0.19441", "0.81545", "0.26954", "0.00000", "1.27940", "0.07432", "43765" ] 

Per header (every data column under that field)

{ "Field (Discovery)": [cell1, cell2, cell3, cell4, ...], "Year": [cell1, cell2, cell3, cell4, ...], "Month": [cell1, cell2, cell3, cell4, ...], "Oil - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...], "Gas - saleable div[bill Sm3]": [cell1, cell2, cell3, cell4, ...], "NGL - saleable div[mill Sm3]": [...], "Condensate - saleable div[mill Sm3]": [...], "Oil equivalents - saleable div[mill Sm3]": [...], "Water - wellbores div[mill Sm3]": [...], "NPDID information carrier": [...] }

Desired per record (year and month concatenated into one field)

{
  "Field (Discovery)": "asset1",
  "Oil - saleable": [
    {"yearmonth": "201701", "unit": "mmboe", "value": "1234.456"},
    {"yearmonth": "201702", "unit": "mmboe", "value": "124.46"}
  ],
  "Gas - saleable": [
    {"yearmonth": "201701", "unit": "bill Sm3", "value": "1234.456"},
    {"yearmonth": "201702", "unit": "mill Sm3", "value": "14.456"}
  ],
  "NGL - saleable": [
    {"yearmonth": "201704", "unit": "mill Sm3", "value": "1.456"},
    {"yearmonth": "201706", "unit": "bill Sm3", "value": "14.456"}
  ],
  "Condensate - saleable": [
    {"yearmonth": "201701", "unit": "mill Sm3", "value": "23.60"},
    {"yearmonth": "201608", "unit": "mill Sm3", "value": "4.4"}
  ],
  "NPDID information carrier": "43765"
}
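For what it's worth, a minimal sketch of the grouping step that would produce a record shaped like this (the field names, values and units below are made up, and only one production column is shown):

```python
from collections import OrderedDict

# Hypothetical flat rows: (field, year, month, oil_value, npdid)
rows = [
    ("ASGARD", "2017", "1", "0.19441", "43765"),
    ("ASGARD", "2017", "2", "0.20112", "43765"),
    ("EKOFISK", "2017", "1", "0.31000", "43546"),
]

records = OrderedDict()
for field, year, month, oil, npdid in rows:
    # First time we see a field, create its record skeleton
    rec = records.setdefault(field, {
        "Field (Discovery)": field,
        "Oil - saleable": [],
        "NPDID information carrier": npdid,
    })
    # Every row for that field appends one yearmonth entry
    rec["Oil - saleable"].append({
        "yearmonth": year + month.zfill(2),  # "2017" + "01" -> "201701"
        "unit": "mill Sm3",
        "value": oil,
    })
```

After this, `list(records.values())` is a list of one nested record per field, ready for `json.dumps`.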

Current Python script

import requests
from bs4 import BeautifulSoup
import json
import boto3
import botocore
from datetime import datetime
from collections import OrderedDict

starttime = datetime.now()

# Agent detail to prevent scraping bot detection
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
header = {'User-Agent': user_agent}

# Webpage connection
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false"
        "&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en")
r = requests.get(html, headers=header)
c = r.content
soup = BeautifulSoup(c, "html.parser")

rows = soup.findAll('td', {'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr',
                                     'a81cr', 'a85cr', 'a89cr', 'a93c', 'a97cr']})
headers = soup.findAll('td', {'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c',
                                        'a40c', 'a44c', 'a48c', 'a52c', 'a56c']})
headers_list = [item.getText('div') for item in headers]
rows_list = [item.getText('div') for item in rows]

# Chunk the flat cell list into rows of 10 columns
final = [rows_list[item:item + 10] for item in range(0, len(rows_list), 10)]

# Collect every column's cells under its header
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])

result = json.dumps(row_header, indent=4, sort_keys=True, ensure_ascii=False)

# Write the JSON string directly; json.dump(result, ...) would double-encode it
with open('data.txt', 'wt') as outfile:
    outfile.write(result)

# Time
runtime = datetime.now() - starttime
print(runtime)

1 Answer

Without seeing the HTML it is hard to tell, but I am guessing your table would be better parsed as described in "python BeautifulSoup parsing table". That way you can be sure the headers and values match up.
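That approach boils down to walking the table row by row and zipping each row's cells against the header list. A minimal sketch, using a tiny stand-in table (the real page will have different markup and classes):

```python
from bs4 import BeautifulSoup

# Stand-in table; the real factpages markup uses styled <td> cells instead
html = """
<table>
  <tr><th>Field (Discovery)</th><th>Year</th><th>Month</th></tr>
  <tr><td>ASGARD</td><td>2017</td><td>8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

records = []
for tr in soup.find_all("tr")[1:]:              # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))   # pair header with cell value
```

Because header and cell come from the same row position, they can never drift out of alignment the way two independent `findAll` calls can.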

I think it would be a disservice to your education to give the answer. So I hope the following helps:

If we have a list of items (rows!) [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)] and we want to turn them into a dictionary {'a': [1, 2, 3], 'b': [1, 2, 3]}

then we could just build a dictionary:

d = {}
for v1, v2 in items:
    d.setdefault(v1, []).append(v2)

setdefault here is just a shortcut: it inserts an empty list the first time a key is seen, so we can always append without checking whether the key exists.

After this, d.items() will contain the grouped values. The next step would then be to incorporate the headers. Perhaps start with just the one column, Oil – saleable, and build from there.
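Applied to your data, the same setdefault pattern groups one production column by field name. A sketch with made-up rows and an assumed unit (adapt the tuple shape to whatever your parsing step produces):

```python
# Assumed flat row data: (field, yearmonth, oil_value) — values are invented
rows = [
    ("ASGARD", "201701", "0.19441"),
    ("ASGARD", "201702", "0.20112"),
    ("EKOFISK", "201701", "0.31000"),
]

d = {}
for field, yearmonth, value in rows:
    # Same trick as above, but appending a small dict instead of a bare value
    d.setdefault(field, []).append(
        {"yearmonth": yearmonth, "unit": "mill Sm3", "value": value}
    )
```

Here d["ASGARD"] ends up holding two yearmonth entries, which is exactly the nesting your desired record needs for each "... - saleable" key.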


3 Comments

Thanks for the help; I would like to understand how to get it working myself, so I appreciate the assistance. I'll do a bit of studying later to see if I can get some dictionaries implemented.
If it helps, feel free to ask for clarifying questions here :)
I've got a little more understanding of nested dictionaries now, but I'm not entirely sure if I'm getting slightly lost. I have code producing essentially what I want for one row; it just needs to loop over each row and loop internally to nest the multiple rows per 'fieldname' record based on the 'yearmonth'.
