I have a task that I think my Python skills fall slightly short of. I have oil production data scraped from an open data website and I am looking to turn it into JSON.
At present, with some help, I have run through the HTML table tags to build a list of headers and the row data. What I am struggling with is nesting some of this data within one JSON record.
Right now I can get the JSON either by row or per header: one header with all of its column data under it, then the next header with its data, and so on (as below).
If possible I would like one of two things. The first option is a record that pairs the headers with row 1's data, then another record with the headers and row 2's data, then the headers and row 3, and so on. The second option, which will make sense if you look at the table source, would be a record per oil field; a field might have multiple rows for different year/month production values. In that case I would like to capture all of that in one JSON record per field, with the multiple year/month rows, should they exist, nested inside it.
Essentially I think it needs a few loops over the distinct HTML table cells on top of the Python I already have, which should be achievable but is slightly outside my capabilities. I hope this makes sense; I have tried to be quite descriptive.
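To make the first option concrete, here is a tiny standalone mock-up of the record-per-row shape I am after (the headers and rows here are made up, not my real scraped data); pairing each header with the matching cell in a row seems to be the idea:

    import json

    # Made-up sample values just to show the shape; in the real script these
    # come from the scraped table cells.
    headers_list = ["Field (Discovery)", "Year", "Month", "Oil - saleable [mill Sm3]"]
    final = [
        ["ÅSGARD", "2017", "8", "0.19441"],
        ["ÅSGARD", "2017", "9", "0.20100"],
    ]

    # One JSON record per table row: pair each header with that row's cell.
    records = [dict(zip(headers_list, row)) for row in final]

    print(json.dumps(records, indent=4, ensure_ascii=False))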
JSON by Row
[ "\u00c5SGARD", "2017", "8", "0.19441", "0.81545", "0.26954", "0.00000", "1.27940", "0.07432", "43765" ] Per Headers (every data column under these fields)
[ "Field (Discovery)":[cell1,cell2,cell3,cell4 etc] "Year":[cell1,cell2,cell3,cell4 etc], "Month":[cell1,cell2,cell3,cell4 etc], "Oil - saleable div[mill Sm3]":[cell1,cell2,cell3,cell4 etc], "Gas - saleable div[bill Sm3]":[cell1,cell2,cell3,cell4 etc], "NGL - saleable div[mill Sm3]", "Condensate - saleable div[mill Sm3]", "Oil equivalents - saleable div[mill Sm3]", "Water - wellbores div[mill Sm3]", "NPDID information carrier" ] Desired per record (year:month fields concatenated)
{ " Field (Discovery)":"asset1" , " Oil – saleable": [ { "yearmonth":"201701","unit":"mmboe","value":"1234.456"}, { "yearmonth ":"201702","unit":"mmboe","value":"124.46"}], "Gas - saleable": [ {"yearmonth":"201701","unit":" bill Sm3","value":"1234.456"}, {" yearmonth ":"201702","unit":"mill Sm3","value":"14.456"}], "NGL - saleable ": [ {"yearmonth":"201704","unit":"mill Sm3","value":"1.456"}, {" yearmonth ":"201706","unit":" bill Sm3","value":"14.456"}], "Condensate – saleable": [ {"yearmonth":"201701","unit":" mill Sm3","value":"23.60"}, {"yearmonth ":"201608","unit":"mill Sm3","value":"4.4"}], "NPDID information carrier":"43765" } current python script
import requests
from bs4 import BeautifulSoup
import json
import boto3
import botocore
from datetime import datetime
from collections import OrderedDict

starttime = datetime.now()

# Agent detail to prevent scraping bot detection
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
header = {'User-Agent': user_agent}

# Webpage connection
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false"
        "&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en")
r = requests.get(html, headers=header)
c = r.content
soup = BeautifulSoup(c, "html.parser")

rows = soup.findAll('td', {'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr',
                                     'a81cr', 'a85cr', 'a89cr', 'a93c', 'a97cr']})
headers = soup.findAll('td', {'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c',
                                        'a40c', 'a44c', 'a48c', 'a52c', 'a56c']})

headers_list = [item.getText('div') for item in headers]
rows_list = [item.getText('div') for item in rows]

# Split the flat list of cells into rows of 10 columns
final = [rows_list[item:item + 10] for item in range(0, len(rows_list), 10)]

# Per-header dict: every column's cells collected under its header
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])

result = json.dumps(row_header, indent=4, sort_keys=True, ensure_ascii=False)

# json dump (write the already-serialised string rather than dumping it again)
with open('data.txt', 'wt') as outfile:
    outfile.write(result)

# Time
runtime = datetime.now() - starttime
print(runtime)
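For the second option (one record per oil field with the year/month values nested), I imagine something along these lines could be appended after the script above. It reuses the headers_list and final variables, the output filename is just something I made up, and the unit string is simply whatever sits inside the square brackets of each header text, so treat it as a rough sketch I have not fully tested:

    import re  # json and OrderedDict are already imported at the top of the script

    # One record per field (first column), keeping insertion order.
    fields = OrderedDict()
    for row in final:
        # row[0] = field name, row[1] = year, row[2] = month, row[-1] = NPDID
        record = fields.setdefault(row[0], OrderedDict([(headers_list[0], row[0])]))
        yearmonth = row[1] + row[2].zfill(2)          # e.g. "2017" + "8" -> "201708"
        # Columns 3..-2 are the measurement columns; nest each month's value
        # under its header, pulling the unit out of the "[...]" in the header.
        for header_text, value in zip(headers_list[3:-1], row[3:-1]):
            unit_match = re.search(r'\[(.*?)\]', header_text)
            record.setdefault(header_text, []).append({
                "yearmonth": yearmonth,
                "unit": unit_match.group(1) if unit_match else "",
                "value": value,
            })
        record[headers_list[-1]] = row[-1]            # NPDID information carrier

    # Separate output file just for this sketch
    with open('data_by_field.json', 'wt') as outfile:
        json.dump(list(fields.values()), outfile, indent=4, ensure_ascii=False)

Does grouping on the field name in the first column like this look like a sensible way to build the per-field record, or is there a cleaner approach?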