2

I have the giant dataset as the following. Most of the columns are in "Json format" but in pandas object datatype. I would like to convert them all into dicts and be able to fetch the values of each key.

 id body created_at user source symbols entities conversation likes 0 10000001 𝑛𝑓𝑙π‘₯π‘Žπ‘›π‘‘ sina are the stocks i hate to trade... 2012-10-15 18:54:55+00:00 {'id': 117974, 'username': 'choptrade', 'name'... {'id': 1, 'title': 'Stocktwits', 'url': 'https... [{'id': 2839, 'symbol': 'NFLX', 'title': 'NetF... {'sentiment': None} None None 1 10000002 An Introduction to Financial Spread Betting ht... 2012-10-15 18:55:00+00:00 {'id': 179410, 'username': 'KennysElliottWaves... {'id': 2, 'title': 'Twitter', 'url': 'http://t... [{'id': 667, 'symbol': 'EURUSD', 'title': 'Eur... {'sentiment': None} None None 2 10000003 Published $TXT estimate on Estimize: +0.05 EPS... 2012-10-15 18:55:00+00:00 {'id': 12134, 'username': '1nvestor', 'name': ... {'id': 2, 'title': 'Twitter', 'url': 'http://t... [{'id': 7506, 'symbol': 'TXT', 'title': 'Textr... {'sentiment': None} None None 3 10000005 $VRNG holding lot of shares under 3. Hold or s... 2012-10-15 18:55:06+00:00 {'id': 146417, 'username': 'CryptoMafia', 'nam... {'id': 1, 'title': 'Stocktwits', 'url': 'https... [{'id': 9214, 'symbol': 'XSPA', 'title': 'Xpre... {'sentiment': None} {'parent_message_id': 10000005, 'in_reply_to_m... None 4 10000006 We have a huge tail on oil. If we close near c... 2012-10-15 18:55:07+00:00 {'id': 86846, 'username': 'johnkicklighter', '... {'id': 1, 'title': 'Stocktwits', 'url': 'https... [{'id': 643, 'symbol': 'CL_F', 'title': 'Light... {'chart': {'thumb': 'https://charts.stocktwits... {'parent_message_id': 10000006, 'in_reply_to_m... None ... ... ... ... ... ... ... ... ... ... 173592731 459260899 $BBIG I really thought we would get a PR this ... 2022-05-13 12:39:08+00:00 {'id': 3231219, 'username': 'DMaCATO', 'name':... {'id': 2269, 'title': 'StockTwits Web', 'url':... [{'id': 16144, 'symbol': 'BBIG', 'title': 'Vin... {'sentiment': None} None None 173592732 459260904 $LUNA Absolutely iconic in the story of why 40... 2022-05-13 12:39:09+00:00 {'id': 5647436, 'username': 'Julieblue', 'name... {'id': 2269, 'title': 'StockTwits Web', 'url':... [{'id': 2563, 'symbol': 'LUNA', 'title': 'Luna... {'sentiment': {'basic': 'Bearish'}} None None 173592733 459260906 $TWTR get these people on the ropes and drop b... 2022-05-13 12:39:09+00:00 {'id': 4253849, 'username': 'Trump2036', 'name... {'id': 2269, 'title': 'StockTwits Web', 'url':... [{'id': 7872, 'symbol': 'TWTR', 'title': 'TWIT... {'sentiment': None} None None 173592734 459260907 $SPY I want the shorts to cry today 2022-05-13 12:39:09+00:00 {'id': 1129723, 'username': 'labibaouf', 'name... {'id': 2269, 'title': 'StockTwits Web', 'url':... [{'id': 7271, 'symbol': 'SPY', 'title': 'SPDR ... {'sentiment': {'basic': 'Bullish'}} None None 173592735 459260909 π‘‡π‘Šπ‘‡π‘…πΊπ‘œπ‘œπ‘‘π‘“π‘œπ‘ŸTSLA but bad for democracy and... 2022-05-13 12:39:09+00:00 {'id': 727683, 'username': 'Iwillplay', 'name'... {'id': 2095, 'title': 'StockTwits For Android ... [{'id': 7872, 'symbol': 'TWTR', 'title': 'TWIT... {'sentiment': None} None None 

my current apporach is to use the pandas apply:

import ast def dict_convert(x): try: return ast.literal_eval(x) except: return np.nan df["col"] = df.col.apply(lambda x: dict_convert(x)) 

However this would take so much time and it is not computational efficient. I have 64 cores so I would like to have some vertorrized method. I also tried the Pandarallel module it ended up being taking significantly more time than the base pandas apply.

pd.json_normalize does not seem to work either. I think it is due to the instances being in object data type.

pd.json_normalize(df.col.iloc[:100].values) out: 0 1 2 3 4 ... 95 96 97 98 99 100 rows Γ— 0 columns 
7
  • Try pyspark or cuDF Commented May 22, 2022 at 4:29
  • By using the same apply function? Commented May 22, 2022 at 4:30
  • since u have 64 cores so when you used pandarallel it was using 64 workers, right? Commented May 22, 2022 at 4:39
  • 1
    Thanks for the suggestions admed but I have two problems: 1. The instances are in object format instead of dict therefore I dont think json normalize will work at the first place. 2. My cluster node does not have access to internet. Do you have any other suggestion? Commented May 22, 2022 at 5:07
  • 2
    Check this answer, this might be helpful for you parse a string column that looks like JSON/dic Commented May 22, 2022 at 6:24

0

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.