
I have a task that I think my Python skills fall slightly short of. I have oil production data scraped from an open data website and I am looking to turn it into JSON format.

At present, with some help, I have run through the HTML table tags to build a list of headers and the row data. What I am struggling with is nesting some of this data within one JSON record.

Right now I can produce the JSON either by row or per header: a header with all of its column data under it, then the next header with its data, and so on (as below).

If possible I would like either (a) the headers with row 1's data as one record, then the headers again with row 2's data as the next record, and so on; or (b), which will make more sense if you look at the table source, one record per oil field, where a field might have multiple rows for different year/month production values.

In that case I would like to capture all of that in one JSON record: the multiple year/month rows, should they exist, would be nested inside the single record.

Essentially I think it needs a few loops over the distinct HTML table cells, which I think is achievable but slightly outside my Python capabilities. I hope this makes sense; I've tried to be quite descriptive.

JSON by Row

[ "\u00c5SGARD", "2017", "8", "0.19441", "0.81545", "0.26954", "0.00000", "1.27940", "0.07432", "43765" ] 

Per header (every data column under that field)

{ "Field (Discovery)": [cell1, cell2, cell3, cell4, ...], "Year": [cell1, cell2, cell3, cell4, ...], "Month": [cell1, cell2, cell3, cell4, ...], "Oil - saleable div[mill Sm3]": [cell1, cell2, cell3, cell4, ...], "Gas - saleable div[bill Sm3]": [cell1, cell2, cell3, cell4, ...], "NGL - saleable div[mill Sm3]": [...], "Condensate - saleable div[mill Sm3]": [...], "Oil equivalents - saleable div[mill Sm3]": [...], "Water - wellbores div[mill Sm3]": [...], "NPDID information carrier": [...] }

Desired per record (year and month concatenated into one field)

{
  "Field (Discovery)": "asset1",
  "Oil - saleable": [
    {"yearmonth": "201701", "unit": "mmboe", "value": "1234.456"},
    {"yearmonth": "201702", "unit": "mmboe", "value": "124.46"}
  ],
  "Gas - saleable": [
    {"yearmonth": "201701", "unit": "bill Sm3", "value": "1234.456"},
    {"yearmonth": "201702", "unit": "mill Sm3", "value": "14.456"}
  ],
  "NGL - saleable": [
    {"yearmonth": "201704", "unit": "mill Sm3", "value": "1.456"},
    {"yearmonth": "201706", "unit": "bill Sm3", "value": "14.456"}
  ],
  "Condensate - saleable": [
    {"yearmonth": "201701", "unit": "mill Sm3", "value": "23.60"},
    {"yearmonth": "201608", "unit": "mill Sm3", "value": "4.4"}
  ],
  "NPDID information carrier": "43765"
}
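For what it's worth, a minimal sketch of the grouping step that would produce a record shaped like this (the field names, values and units below are made up, and only one production column is shown):

```python
from collections import OrderedDict

# Hypothetical flat rows: (field, year, month, oil_value, npdid)
rows = [
    ("ASGARD", "2017", "1", "0.19441", "43765"),
    ("ASGARD", "2017", "2", "0.20112", "43765"),
    ("EKOFISK", "2017", "1", "0.31000", "43546"),
]

records = OrderedDict()
for field, year, month, oil, npdid in rows:
    # First time we see a field, create its record skeleton
    rec = records.setdefault(field, {
        "Field (Discovery)": field,
        "Oil - saleable": [],
        "NPDID information carrier": npdid,
    })
    # Every row for that field appends one yearmonth entry
    rec["Oil - saleable"].append({
        "yearmonth": year + month.zfill(2),  # "2017" + "01" -> "201701"
        "unit": "mill Sm3",
        "value": oil,
    })
```

After this, `list(records.values())` is a list of one nested record per field, ready for `json.dumps`.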

Current Python script

import requests
from bs4 import BeautifulSoup
import json
import boto3
import botocore
from datetime import datetime
from collections import OrderedDict

starttime = datetime.now()

# Agent detail to prevent scraping bot detection
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
header = {'User-Agent': user_agent}

# Webpage connection
html = ("http://factpages.npd.no/ReportServer?/FactPages/TableView/"
        "field_production_monthly&rs:Command=Render&rc:Toolbar=false"
        "&rc:Parameters=f&Top100=True&IpAddress=108.171.128.174&CultureCode=en")
r = requests.get(html, headers=header)
c = r.content
soup = BeautifulSoup(c, "html.parser")

rows = soup.findAll('td', {'class': ['a61cl', 'a65cr', 'a69cr', 'a73cr', 'a77cr',
                                     'a81cr', 'a85cr', 'a89cr', 'a93c', 'a97cr']})
headers = soup.findAll('td', {'class': ['a20c', 'a24c', 'a28c', 'a32c', 'a36c',
                                        'a40c', 'a44c', 'a48c', 'a52c', 'a56c']})
headers_list = [item.getText('div') for item in headers]
rows_list = [item.getText('div') for item in rows]

# Chunk the flat cell list into rows of 10 columns
final = [rows_list[item:item + 10] for item in range(0, len(rows_list), 10)]

# Collect every column's cells under its header
row_header = {}
for item in final:
    for indices in range(0, 10):
        if headers_list[indices] not in row_header:
            row_header[headers_list[indices]] = [item[indices]]
        else:
            row_header[headers_list[indices]].append(item[indices])

result = json.dumps(row_header, indent=4, sort_keys=True, ensure_ascii=False)

# Write the JSON string directly; json.dump(result, ...) would double-encode it
with open('data.txt', 'wt') as outfile:
    outfile.write(result)

# Time
runtime = datetime.now() - starttime
print(runtime)

1 Answer

Without seeing the HTML it is hard to tell, but I am guessing your table would be better parsed as described in "python BeautifulSoup parsing table". That way you can be sure the headers and values match up.
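That approach boils down to walking the table row by row and zipping each row's cells against the header list. A minimal sketch, using a tiny stand-in table (the real page will have different markup and classes):

```python
from bs4 import BeautifulSoup

# Stand-in table; the real factpages markup uses styled <td> cells instead
html = """
<table>
  <tr><th>Field (Discovery)</th><th>Year</th><th>Month</th></tr>
  <tr><td>ASGARD</td><td>2017</td><td>8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

records = []
for tr in soup.find_all("tr")[1:]:              # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))   # pair header with cell value
```

Because header and cell come from the same row position, they can never drift out of alignment the way two independent `findAll` calls can.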

I think it would be a disservice to your education to give the answer. So I hope the following helps:

If we have a list of items (rows!) [('a', 1), ('a', 2), ('a', 3), ('b', 1), ('b', 2), ('b', 3)] and we want to turn them into a dictionary {'a': [1, 2, 3], 'b': [1, 2, 3]}

then we could just build a dictionary:

d = {}
for v1, v2 in items:
    d.setdefault(v1, []).append(v2)

setdefault here is just a shortcut: it inserts an empty list the first time a key is seen, so we can always append without checking whether the key exists.

After this, d.items() will contain the grouped values. The next step would then be to incorporate the headers. Perhaps start with just the one column, Oil – saleable, and build from there.
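Applied to your data, the same setdefault pattern groups one production column by field name. A sketch with made-up rows and an assumed unit (adapt the tuple shape to whatever your parsing step produces):

```python
# Assumed flat row data: (field, yearmonth, oil_value) — values are invented
rows = [
    ("ASGARD", "201701", "0.19441"),
    ("ASGARD", "201702", "0.20112"),
    ("EKOFISK", "201701", "0.31000"),
]

d = {}
for field, yearmonth, value in rows:
    # Same trick as above, but appending a small dict instead of a bare value
    d.setdefault(field, []).append(
        {"yearmonth": yearmonth, "unit": "mill Sm3", "value": value}
    )
```

Here d["ASGARD"] ends up holding two yearmonth entries, which is exactly the nesting your desired record needs for each "... - saleable" key.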


3 Comments

Thanks for the help; I would like to understand how to get it working myself, so I appreciate the assistance. I'll do a bit of studying later to see if I can get some dictionaries implemented.
If it helps, feel free to ask for clarifying questions here :)
I've got a little more understanding of nested dictionaries now, but I'm not entirely sure if I'm getting slightly lost. I have code producing essentially what I want for one row; it just needs to loop over each row and loop internally to nest the multiple rows per 'fieldname' record based on the 'yearmonth'.
