Problem extracting words from dataframe

Question

I have the following dataset which is a .json file:

and I would like to get the first word of every string inside lista_asm, for example: jmp, push, uncomisd, etc.

This is what I am doing:

dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0] for i in x])

but it gives me back the following error message:

   3589         else:
   3590             values = self.astype(object).values
-> 3591             mapped = lib.map_infer(values, f, convert=convert_dtype)
   3592
   3593         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()

<ipython-input-18-5506b5721bf1> in <lambda>(x)
----> 1 dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:[i.split()[0].strip() for i in x])

IndexError: list index out of range

I don't understand what is wrong. Can somebody please help me?

[EDIT] Trying the code:

dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x:x[0].split(" ",2)[0])

and adding :

df = dataFrame[["opcodes", "semantic"]].copy()
df

I get:

What I would like to get is a list like [push, mov, ...], for every row.

It seems that when I do x[0] it does not return the first element of the list, but returns the opening bracket, which is weird. Am I doing something wrong that I don't see?

My objective is to pre-process this dataset in order to feed features to my model, but I am having a hard time doing so.

Shahriyar Mammadli · Accepted Answer · 2020-11-12 11:56:30Z

Your question is a bit confusing. As far as I understand from your examples, for each sample of lista_asm you want to extract the very first word of each string.

The thing you are doing wrong is treating the string as a list. That is, ['uncomisd xmm2, xmm2', 'jp 0x40', ...] is treated as a string by Python, not as a list.

Thus, you need to extract the individual strings first; then you can take the first word of each of these strings.
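A minimal sketch of why this matters, using a made-up cell value (the real data is not shown here, so `cell` is only an assumption about its shape). Indexing or iterating over a string yields single characters, so `x[0]` is the opening bracket, and calling `split()` on a space character returns an empty list, which is where the IndexError comes from.

```python
# Hypothetical cell value: a list that was stored as text, not as a Python list.
cell = "['uncomisd xmm2, xmm2', 'jp 0x40']"

# Indexing a string returns a character, not a list element.
first = cell[0]

# Iterating over the string visits one character at a time.
chars = list(cell)[:4]

# A whitespace character splits into an empty list, so [0] raises IndexError.
try:
    opcodes = [ch.split()[0] for ch in cell]
    error = None
except IndexError as exc:
    error = str(exc)

print(first)   # [
print(chars)   # ['[', "'", 'u', 'n']
print(error)   # list index out of range
```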

To achieve that, you can use a regular expression to find all the strings that are inside of quotes '...'.

import pandas as pd
import re

# Read the file into a dataframe
dataFrame = pd.read_json("dataset.json", lines=True)

# First extract the strings, then take the first word of each string
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(lambda x: [i.split()[0] for i in re.findall("'([^']*)'", x)])
print(dataFrame)

Or, in a more modular form:

import pandas as pd
import re

# Function to extract the first word from each quoted string
def extractFirstWord(text):
    listOfWords = re.findall("'([^']*)'", text)
    return [i.split()[0] for i in listOfWords]

# Read the file into a dataframe
dataFrame = pd.read_json("dataset.json", lines=True)
dataFrame['opcodes'] = dataFrame['lista_asm'].apply(extractFirstWord)
print(dataFrame)
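To try the same idea without the original file, here is a small self-contained sketch. The two rows in `sample` are made up and only stand in for dataset.json, and `first_words` is a hypothetical helper name. It also skips empty quoted strings, since `split()` returns an empty list for them and `[0]` would fail.

```python
import re

import pandas as pd

# Made-up rows standing in for the real dataset.json, which is not shown here.
sample = pd.DataFrame({
    "lista_asm": [
        "['jmp 0x10', 'push rbp', 'uncomisd xmm2, xmm2']",
        "['mov rax, rbx', 'ret']",
    ]
})


def first_words(text):
    """Return the first word of every single-quoted string inside text."""
    quoted = re.findall(r"'([^']*)'", text)
    # Skip empty or whitespace-only strings, where split() returns [].
    return [s.split()[0] for s in quoted if s.split()]


sample["opcodes"] = sample["lista_asm"].apply(first_words)
print(sample["opcodes"].tolist())
# [['jmp', 'push', 'uncomisd'], ['mov', 'ret']]
```

Each row of the new column then holds a list of opcodes, one per instruction string in that row.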

The result:

Thank you for your answer. I have tried your suggestion, but I still get only the opening bracket.
– J.D., Commented Nov 11, 2020 at 11:43
Can you please provide the data by uploading it somewhere? At least some samples.
– Shahriyar Mammadli, Commented Nov 11, 2020 at 12:07
Are you sure you are checking the right thing? Because it works on my machine. Check the output and use the exact code I provided.
– Shahriyar Mammadli, Commented Nov 11, 2020 at 17:30
You are right. I have tried to rewrite the code from the beginning, and your code works. I don't know why this code is giving me so many problems; I probably have a lot to learn.
– J.D., Commented Nov 11, 2020 at 18:56

Noah Weber · Accepted Answer · 2020-11-08 13:28:36Z

The problem is that you say "apply(lambda x:[i.split()[0] for i in x])"

As soon as you call apply, x is your list. So you can write the following: "apply(lambda x:x[0].split(" ", 2)[0])"

This means: take the first element of the list, then split it on " " at most twice (so into at most three parts), and then take the first word (part) with the last [0].
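This works as described only when each cell holds an actual Python list. A minimal sketch under that assumption: the `cell` value below is made up, and `ast.literal_eval` is used here (it is not part of the code above) to turn the text into a real list before indexing it.

```python
import ast

# Hypothetical cell value stored as text, as in the question.
cell = "['push rbp', 'mov rbp, rsp', 'ret']"

# ast.literal_eval parses the text of a Python literal into the real object.
instructions = ast.literal_eval(cell)

# With a real list, [0] is the first instruction string.
first_opcode = instructions[0].split(" ", 2)[0]

# A list comprehension applies the same split to every instruction.
all_opcodes = [s.split(" ", 2)[0] for s in instructions]

print(first_opcode)  # push
print(all_opcodes)   # ['push', 'mov', 'ret']
```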

Thank you for the answer. But this returns just the first word of the first string in the list. I would like to get the first word of each string in the list, for each row of the dataframe. (Sorry, I did not specify this clearly in the question.)
– J.D., Commented Nov 8, 2020 at 13:37
I see. Can you just add a list comprehension to my code? That should be it.
– Noah Weber, Commented Nov 8, 2020 at 13:38
Sorry, my bad, I misread the dataframe before. With your code, it returns only the opening bracket of the list, so '['. I had put another column next to it and misread it.
– J.D., Commented Nov 8, 2020 at 14:02
I will edit the question so it is clearer.
– J.D., Commented Nov 8, 2020 at 14:10
