I am learning to process what I consider a complex JSON structure and am trying to load it into a dataframe. I want a single record for each outcome id. Here is the sample JSON structure:

    {
      "_id": 12345,
      "reports": [
        {
          "body": "\n***\nGeneral report text.",
          "outcome": {
            "comments": [],
            "id": "1",
            "status": {
              "failed": { "both": 0, "human": 0, "auto": 0, "total": 0 },
              "open": { "both": 0, "human": 0, "auto": 0, "total": 0 },
              "passed": { "both": 0, "human": 0, "auto": 1, "total": 1 },
              "code": { "_input": 0, "_output": 0, "canceled": 0 },
              "total": 1
            }
          },
          "type": "outcome"
        },
        {
          "body": "\n***\nGeneral report text.",
          "outcome": {
            "comments": [],
            "id": "2",
            "status": {
              "failed": { "both": 0, "human": 0, "auto": 0, "total": 0 },
              "open": { "both": 0, "human": 0, "auto": 0, "total": 0 },
              "passed": { "both": 0, "human": 0, "auto": 1, "total": 1 },
              "code": { "_input": 0, "_output": 0, "canceled": 0 },
              "total": 1
            }
          },
          "type": "outcome"
        }
      ]
    }

The desired format of the dataframe is:

    report_id | outcome.id | body | outcome.comments | status.failed.both | status.failed.human | status.failed.auto | ...

and so on for the remaining status fields.

I would like all of the statuses in a single record (one per outcome.id), but I am not sure how to normalize everything into a single record.

I've tried this, but it does not give me the desired dataframe structure:

    df_reports = pd.json_normalize(
        data,
        record_path=['reports', 'outcome'],
        meta=[
            ['reports', 'body'],
            ['outcome', 'comment'],
            ['outcome', 'id'],
            ['outcome', 'status'],
        ],
    )
Comments

  • You imply you have multiple report id values, yet the JSON you show only has a single report id value. Is your parsed JSON actually a list of dictionaries? Commented Mar 20 at 14:51
  • Sorry. I have corrected it. There is a single report id, but multiple outcomes (id). I would like a single record for each outcome. Also, I only gave a snippet of the full JSON. There are other parts. The reports list is the core part I am trying to parse into a dataframe. Commented Mar 20 at 14:57
  • Maybe you should show what you have in df_reports after json_normalize. You could also show a dataframe with the expected result. It may need to be reformatted after using json_normalize. Commented Mar 20 at 15:17
  • You have occurrences of "\n..." in your string rendering it invalid JSON. I have to assume that what you are showing is not JSON but rather JSON that has already been parsed into a dictionary. Then it is no longer a "json structure". Commented Mar 20 at 16:17

2 Answers

Using record_path=['reports'] does roughly what you want. The only exception is that it doesn't capture the report_id column. You can add a line to provide that column.

    import pandas as pd
    import json

    # This is a file containing the JSON from the question.
    # If you already have the JSON loaded into the variable 'data'
    # you don't need this step
    with open('test1165.json', 'rb') as f:
        data = json.load(f)

    report_df = pd.json_normalize(data, record_path=['reports'])
    report_df.insert(0, 'report_id', data['_id'])
    print(report_df.to_string())

Printing report_df gives the following:

       report_id                         body     type  outcome.comments  outcome.id  outcome.status.failed.both  outcome.status.failed.human  outcome.status.failed.auto  outcome.status.failed.total  outcome.status.open.both  outcome.status.open.human  outcome.status.open.auto  outcome.status.open.total  outcome.status.passed.both  outcome.status.passed.human  outcome.status.passed.auto  outcome.status.passed.total  outcome.status.code._input  outcome.status.code._output  outcome.status.code.canceled  outcome.status.total
    0      12345  \n***\nGeneral report text.  outcome                []           1                           0                            0                           0                            0                         0                          0                         0                          0                           0                            0                           1                            1                           0                            0                             0                     1
    1      12345  \n***\nGeneral report text.  outcome                []           2                           0                            0                           0                            0                         0                          0                         0                          0                           0                            0                           1                            1                           0                            0                             0                     1
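The question asks for status columns named like status.failed.both, whereas json_normalize yields outcome.status.failed.both. One possible way to close that gap is to rename the columns afterwards. The following is a minimal sketch, not part of the original answer: it assumes the data is already parsed into a dict, uses a trimmed-down stand-in for the question's JSON, and the prefix-stripping rule is an illustrative choice.

```python
import pandas as pd

# Trimmed-down stand-in for the question's data (assumption: already a dict).
data = {
    "_id": 12345,
    "reports": [
        {
            "body": "General report text.",
            "outcome": {
                "comments": [],
                "id": "1",
                "status": {"failed": {"both": 0, "auto": 0}, "total": 1},
            },
            "type": "outcome",
        },
        {
            "body": "General report text.",
            "outcome": {
                "comments": [],
                "id": "2",
                "status": {"failed": {"both": 0, "auto": 0}, "total": 1},
            },
            "type": "outcome",
        },
    ],
}

report_df = pd.json_normalize(data, record_path=["reports"])
report_df.insert(0, "report_id", data["_id"])

# Strip the leading "outcome." only from the status columns, so that
# "outcome.status.failed.both" becomes "status.failed.both" while
# "outcome.id" and "outcome.comments" keep their names.
prefix = "outcome."
report_df = report_df.rename(
    columns=lambda c: c[len(prefix):] if c.startswith("outcome.status.") else c
)

print(report_df.columns.tolist())
```

Passing a function to rename keeps the rule in one place, so it applies to however many status fields the real data contains.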
1 Comment

It seems you can also get _id with meta=, as in pd.json_normalize(data, record_path=["reports"], meta=["_id"]), but it will end up as the last column.

Use record_path='reports' and meta=['_id'], then rearrange and rename the columns to get the desired dataframe.

This approach also works if the JSON data contains a list of multiple reports.

    import json
    import pandas as pd

    with open("data.json") as f:
        data = json.load(f)

    df = pd.json_normalize(data, record_path='reports', meta=['_id'])

    # rearrange and rename columns
    df.insert(0, 'report_id', df.pop('_id'))
    df.insert(1, 'outcome.id', df.pop('outcome.id'))

    print(df.to_string())
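To illustrate the multiple-reports case mentioned above, here is a small sketch that passes a list of report dictionaries to the same call. The second _id value (67890) and the shortened records are made up for illustration. They are not from the question.

```python
import pandas as pd

# Hypothetical input: a list of top-level documents, each with its own "_id".
data = [
    {
        "_id": 12345,
        "reports": [
            {"body": "a", "outcome": {"id": "1", "status": {"total": 1}}},
            {"body": "b", "outcome": {"id": "2", "status": {"total": 1}}},
        ],
    },
    {
        "_id": 67890,
        "reports": [
            {"body": "c", "outcome": {"id": "1", "status": {"total": 3}}},
        ],
    },
]

# Each entry under "reports" becomes one row; meta repeats the parent "_id"
# on every row that came from that parent.
df = pd.json_normalize(data, record_path="reports", meta=["_id"])

# Move the identifying columns to the front, as in the answer above.
df.insert(0, "report_id", df.pop("_id"))
df.insert(1, "outcome.id", df.pop("outcome.id"))

print(df.to_string())
```

With this input the result should have three rows, two carrying report_id 12345 and one carrying 67890, so each outcome stays tied to the report it came from.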