
I am trying to import JSON data, including nested data, into a Pandas DataFrame. This is illustrated below using a nice little public dataset that I found (the GitHub API used in the code). Let's focus on the 'owner' key for the purpose of this illustration. Note that I'm explicitly casting the data to str. If you print the response data from response.content, it has proper JSON syntax with double quotes.

However, pd.read_json() seems to convert the nested JSON under 'owner' into a single-quoted string. Am I doing something wrong here, or should this be raised as a dev issue for read_json()? I see that a different issue relating to single/double quotes has been fixed in the past on pandas-dev.

>>> import pandas as pd
>>> import requests
>>> response = requests.get(url='https://api.github.com/users/mralexgray/repos')
>>> df = pd.read_json(response.content, orient='records', dtype=str)
>>> df['owner'].iloc[1]
"{'login': 'mralexgray', 'id': 262517, 'node_id': 'MDQ6VXNlcjI2MjUxNw==', 'avatar_url': 'https://avatars.githubusercontent.com/u/262517?v=4',..."
>>> type(df['owner'].iloc[1])
str
>>> response.content
b'[{"id":6104546,"node_id":"MDEwOlJlcG9zaXRvcnk2MTA0NTQ2","name":"-REPONAME","full_name":"mralexgray/-REPONAME","private":false,"owner":{"login":"mralexgray","id":262517,"node_id":"MDQ6VXNlcjI2MjUxNw==","avatar_url":"https://avatars.githubusercontent.com/u/262517?v=4"...

The only observation I make here is that pd.read_json() may be importing the nested JSON data as a Python dict before it is cast to str.

I'm running Python 3.8.10 with Pandas 1.2.4.
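
A minimal check of that hypothesis, without requests (a sketch only; it just mimics what I suspect read_json does internally before the dtype=str cast):

>>> import json
>>> parsed = json.loads('{"login": "mralexgray", "id": 262517}')  # parse to a dict first
>>> type(parsed)
<class 'dict'>
>>> str(parsed)  # then cast to str, as dtype=str would
"{'login': 'mralexgray', 'id': 262517}"

This reproduces exactly the single-quoted output I see in df['owner'].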

5 Comments

  • I don't understand the question. What output were you expecting instead, exactly? Commented Jul 12, 2021 at 8:47
  • It seems like you were perhaps expecting to see e.g. {"foo": "bar"} instead of {'foo': 'bar'}. Here's a hint: what happens when you try just {"foo": "bar"} by itself at the Python interpreter prompt? Another hint: what happens when you try {"foo": "bar"} == {'foo': 'bar'}? Or "foo" == 'foo'? Commented Jul 12, 2021 at 8:49
  • Yes, when the variable is a dict, as with {"foo": "bar"} and {'foo': 'bar'}, they are equal, and the same goes for the individual strings 'foo' and "foo". However, my issue is with the content of the JSON string: "{'foo': 'bar'}" is not the same as '{"foo": "bar"}'. This becomes a problem for me later when I use json_normalize() and get json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes. Commented Jul 12, 2021 at 9:51
  • Yes, those are not equal strings. But right now you don't have a string in your DataFrame; you have a dict that represents the parsed JSON. If this somehow causes a problem using json_normalize(), then you're going to have to show us a minimal, complete example. As is, what you show us doesn't illustrate a problem, even if you say there is one. Commented Jul 12, 2021 at 9:55
  • Single quotes are an issue because I expected pd.read_json(dtype=str) to import the JSON string as it is (i.e., with double quotes). Yes, json_normalize() becomes problematic, but my focus here was to question whether the way pd.read_json() works is correct. In my opinion it should cast directly to str without first importing the data as a dict; I tried casting to str because I expected the double quotes to remain, as in the original JSON string. Commented Jul 18, 2021 at 22:59

1 Answer


pd.read_json() may be importing the nested JSON data as a Python dict before it is cast to str.

You are totally right. The dtype=str is applied after the JSON has been loaded into Python data structures by the JsonReader class. What you get is the string representation of a dict, not the raw JSON.

import json
d = {"key": "value"}

>>> d
{'key': 'value'}
>>> str(d)  # dtype=str
"{'key': 'value'}"
>>> json.dumps(d)
'{"key": "value"}'
>>> json.loads(json.dumps(d))
{'key': 'value'}

Another example:

d = {"key1": "value1", "key2": {"key21": "value21"}} l = [d] * 3 
>>> df = pd.read_json(json.dumps(l)); print(df) key1 key2 0 value1 {'key21': 'value21'} 1 value1 {'key21': 'value21'} 2 value1 {'key21': 'value21'} >>> type(df.iloc[0, 1]) dict >>> df = pd.read_json(json.dumps(l), dtype={'key2': str}); print(df) key1 key2 0 value1 {'key21': 'value21'} # same display output 1 value1 {'key21': 'value21'} 2 value1 {'key21': 'value21'} >>> type(df.iloc[0, 1]) str # but not the same type 

To get double quotes, use pd.read_json without dtype and apply json.dumps to dict columns:

df = pd.read_json(json.dumps(l))
df['key2'] = df['key2'].apply(json.dumps)

>>> df
     key1                  key2
0  value1  {"key21": "value21"}
1  value1  {"key21": "value21"}
2  value1  {"key21": "value21"}
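
If the eventual goal is json_normalize() (as mentioned in the question's comments), it may be simpler to skip read_json entirely and parse the JSON yourself before handing it to pandas. A sketch, reusing the same l as above (or a parsed API response such as requests.get(url).json()):

>>> import pandas as pd
>>> flat = pd.json_normalize(l)  # flattens nested dicts into dotted columns
>>> flat
     key1 key2.key21
0  value1    value21
1  value1    value21
2  value1    value21

This sidesteps the single/double-quote question altogether, because the nested objects never pass through a str cast.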

2 Comments

Thanks for the answer. Also, is this behaviour correct for pd.read_json()? I would have imagined it should not be converting multiple times, thereby essentially destroying information (i.e., the double quotes). Would you say this should be changed in some future release?
JSON is not a native data type for Python; Python has to translate the whole file into base types (str, dict, list, ...) first to make sense of the data. For Python, single or double quotes are not significant; they are just a visual representation, and serialization (pickle) is what matters. Besides, if your string contains a single quote, Python prints it with double quotes to enclose the string (see the quick illustration below). I think this behaviour will never be changed.
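
A quick illustration of that last point, using only the standard library (the values are made up for the example):

>>> import json
>>> d = json.loads('{"key": "value"}')
>>> d                  # Python's repr of a dict uses single quotes
{'key': 'value'}
>>> "it's"             # but switches to double quotes when the string contains one
"it's"
>>> json.dumps(d)      # only json.dumps gives you valid JSON back
'{"key": "value"}'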
