Get the position of a substring in a column of DataFrame using regex

Question

I'd like to break up a string into Pandas DataFrame columns using a regex.

Sample csv data [Updated]:

    Data;Code;Temp;....
    12 364 OPR 4 67474;;33;...
    893 73 GDP hdj 747;;34;...
    hr 777 hr9 GDP;;30;...
    463 7g 448 OPR;;28;...

Desired situation: [Updated]

    Data           | Code        | Temp | ...
    ------------------------------------------------
    12 364         | OPR 4 67474 | 33   | ...
    893 73         | GDP hdj 747 | 34   | ...
    hr 777 hr9 GDP | NaN         | 30   | ...
    463 7g 448 OPR | NaN         | 28   | ...

regex:

code = re.compile('\sOPR.?[^$]|\sGDP.?[^$]')

I only need to split if OPR or GDP is not at the end of the string. I was looking for a way to split based on the match position, something like match.start().
I tried df['data'].str.contains(code, regex=True) and df['data'] = df['data'].str.extract(code, expand=True), and str.find only seems to work with a string, not with an re.Pattern. I can't get it to work.

I'm pretty new with Pandas, so please bear with me.

Do you want to split one column of a Pandas DataFrame into two columns?
– Rusty, Commented Dec 15, 2018 at 11:30
Is NaN in the second column and hr 777 hr9 GDP in the first what you want? I don't get the rule.
– Zydnar, Commented Dec 15, 2018 at 11:31
In a regex you can quantify a specific match, e.g. \d{3}. Match groups can also be helpful.
– Zydnar, Commented Dec 15, 2018 at 11:38
@Rusty: Yes, that's what I want. Please see the desired situation.
– John Doe, Commented Dec 15, 2018 at 11:54
@Zydnar: I only need to split if OPR or GDP is not at the end of the string. That's the case for rows 0 and 1. For rows 2 and 3 it's at the end, so there's no need to split.
– John Doe, Commented Dec 15, 2018 at 11:56

Chris Doyle · Accepted Answer · 2018-12-15 13:58:07Z

I am fairly new to Python, so someone might be able to comment if this is not a good approach. My line of thinking was to take the input and process it line by line: drop the trailing semicolon, since it is not in your output, then use a regex to split the line on a whitespace character only if it is followed by OPR or GDP and that code is not at the end of the line. If the split yields only one item, append NaN to the list to fill the second column. Finally, print the result with formatting.

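    import re

    data_string = """12 364 OPR 4 67474;
    893 73 GDP hdj 747;
    hr 777 hr9 GDP;
    463 7g 448 OPR;
    """

    data_list = data_string.splitlines()
    for data in data_list:
        # data[:-1] drops the trailing semicolon.
        # Split on whitespace only when OPR/GDP follows and at least one more
        # character comes after it ([^$] is any character except a literal $).
        data_split = re.split(r"\s(?=(?:GDP|OPR)[^$])", data[:-1])
        if len(data_split) == 1:
            # nothing to split: fill the second column
            data_split.append("NaN")
        print("%-20s|%-20s" % tuple(data_split))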

OUTPUT

    12 364              |OPR 4 67474
    893 73              |GDP hdj 747
    hr 777 hr9 GDP      |NaN
    463 7g 448 OPR      |NaN
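The question asks about splitting by match position. The same rule can be sketched with re.search and match.start(). This is only an illustration: the helper name split_at_code is hypothetical, and the pattern uses `.` in place of `[^$]` on the assumption that any following character counts as "not at the end".

```python
import re

# Whitespace followed by OPR/GDP and at least one more character,
# i.e. the code is not the last thing in the string.
CODE = re.compile(r"\s(?=(?:GDP|OPR).)")

def split_at_code(text):
    """Return (data, code); code is None when there is nothing to split."""
    match = CODE.search(text)
    if match is None:
        return text, None
    # match.start() is the index of the separating whitespace,
    # match.end() is the index where the code begins.
    return text[:match.start()], text[match.end():]

rows = ["12 364 OPR 4 67474", "893 73 GDP hdj 747", "hr 777 hr9 GDP", "463 7g 448 OPR"]
results = [split_at_code(row) for row in rows]
for data, code in results:
    print("%-20s|%-20s" % (data, code if code is not None else "NaN"))
```

Slicing at the match position leaves out the separating whitespace, so no extra strip is needed.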

Updated in light of question edit and comments

Based on your update to the question and the comments, you could try the code below. I suggest you test it, check for edge cases, and add validation or conditional checks before performing updates.

    import pandas as pd
    import re

    source_data = {'data': ['12 364 OPR 4 67474', '893 73 GDP hdj 747', 'hr 777 hr9 GDP', '463 7g 448 OPR'],
                   'code': [None, None, None, None],
                   'Temp': [33, 34, 30, 28]
                   }
    df = pd.DataFrame.from_dict(source_data)
    print("Original df:")
    print(df, "\n")

    row_iter = df.iterrows()
    for index, row in row_iter:
        data = df.at[index, 'data']
        # split only when OPR/GDP is followed by at least one more character
        data_split = re.split(r"\s(?=(?:GDP|OPR)[^$])", data)
        if len(data_split) == 2:
            df.at[index, 'data'] = data_split[0]
            df.at[index, 'code'] = data_split[1]

    print("Updated df:")
    print(df)

OUTPUT

    Original df:
                     data  code  Temp
    0  12 364 OPR 4 67474  None    33
    1  893 73 GDP hdj 747  None    34
    2      hr 777 hr9 GDP  None    30
    3      463 7g 448 OPR  None    28

    Updated df:
                 data         code  Temp
    0          12 364  OPR 4 67474    33
    1          893 73  GDP hdj 747    34
    2  hr 777 hr9 GDP         None    30
    3  463 7g 448 OPR         None    28
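For a large DataFrame, the row loop above could probably be replaced by a single vectorized call. The following is a minimal sketch, assuming Series.str.split treats a multi-character pattern as a regular expression (its documented default) and again using `.` in place of `[^$]`. The column names mirror the example above.

```python
import pandas as pd

df = pd.DataFrame({
    "data": ["12 364 OPR 4 67474", "893 73 GDP hdj 747", "hr 777 hr9 GDP", "463 7g 448 OPR"],
    "Temp": [33, 34, 30, 28],
})

# Split at most once, on whitespace that precedes OPR/GDP plus one more character.
parts = df["data"].str.split(r"\s(?=(?:GDP|OPR).)", n=1, expand=True)

# Guarantee two columns even if no row was split at all.
parts = parts.reindex(columns=[0, 1])

df["data"] = parts[0]
df["code"] = parts[1]  # missing value where no split happened
print(df)
```

Rows that end in OPR or GDP keep their full text in data and get a missing value in code.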

Sorry for the misunderstanding, but there are more columns. I just created a simple example. I read the csv using pd.read_csv(fileName, delimiter=';')
Don't the two answers you already have give you enough of an example or idea of how to tackle your problem, such that you could have closed the question and gone away to work on a solution?
I understand the samples, but not how to implement this using Pandas DataFrames instead of a list. I probably don't see the connection, sorry.
Your question was how to break up a string using a regex. That's what both answers do. If you now have a question about how to turn a list into a DataFrame, that's a different question and should be asked as a new question. However, have you tried having a look here: http://pbpython.com/pandas-list-dict.html
I think it's clear that it's related to Pandas and DataFrames, as mentioned in my question (and the tags). I also posted some Pandas-related code I've already tried. I'm not trying to read data from a list or dict into a DataFrame. The data is imported using pandas.read_csv. So I already have a DataFrame and want to break up the string from one column into two columns.

Rusty · Accepted Answer · 2018-12-15 11:26:06Z

First, check whether the data has GDP or OPR at the end. If not, you can use a regex with groups to get the desired items. Anything in round brackets () is a group. I have named the groups using the (?P<name>...) syntax; naming is optional.

    import re

    data = ["12 364 OPR 4 67474;", "893 73 GDP hdj 747;", "hr 777 hr9 GDP;", "463 7g 448 OPR;"]

    for item in data:
        # first check if it ends with GDP; or OPR;
        # (the group makes $ apply to both alternatives)
        if re.search(r"(?:GDP|OPR);$", item):
            # as you specified, it needs to be ignored
            print(item)
        else:
            # now you can split into two parts - I am splitting into three,
            # but you can use them however you like
            split_match_obj = re.search(r"(?P<Data>.+)(?P<Value>OPR|GDP)(?P<Code>.+)", item)
            print(split_match_obj["Data"], split_match_obj["Value"], split_match_obj["Code"])
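The named-group idea above carries over to a DataFrame column through Series.str.extract, which turns named groups into column names. The sketch below is an illustration under assumptions: the pattern is adapted so that the code keeps its OPR/GDP prefix, and the sample frame has no trailing semicolons because read_csv with delimiter=';' would already have removed them.

```python
import pandas as pd

df = pd.DataFrame({"Data": ["12 364 OPR 4 67474", "893 73 GDP hdj 747",
                            "hr 777 hr9 GDP", "463 7g 448 OPR"]})

# Named groups become column names; rows that do not match yield NaN.
# The trailing .+ requires text after OPR/GDP, so codes at the end do not match.
pattern = r"(?P<Data>.+?)\s(?P<Code>(?:OPR|GDP).+)"
extracted = df["Data"].str.extract(pattern)

df["Code"] = extracted["Code"]
# Keep the original text where nothing was extracted.
df["Data"] = extracted["Data"].fillna(df["Data"])
print(df)
```

Because non-matching rows come back as NaN, the fillna step restores their original text and leaves Code missing for them.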

How can I use this with Pandas DataFrames? The sample is only 4 lines, but the actual data is in a large DataFrame imported from a csv file.

Vaishali · Accepted Answer · 2018-12-15 14:48:12Z

Let's say this is your dataframe,

                     Data  Temp
    0  12 364 OPR 4 67474    33
    1  893 73 GDP hdj 747    34
    2      hr 777 hr9 GDP    30
    3      463 7g 448 OPR    28

You can use extract with multiple capture groups based on condition

    ends_with_code = df['Data'].str.endswith(('OPR', 'GDP'))

    # rows where the code is followed by more text: split into Data and Code
    df1 = df.loc[~ends_with_code, 'Data'].str.extract(r'(.*)([A-Z]{3} .*)')
    # rows ending in OPR or GDP: keep everything in Data, leave Code empty
    df2 = df.loc[ends_with_code, 'Data'].str.extract(r'(.*(?:OPR|GDP)$)(.*)')

    df[['Data', 'Code']] = pd.concat([df1, df2])

                 Data  Temp         Code
    0          12 364    33  OPR 4 67474
    1          893 73    34  GDP hdj 747
    2  hr 777 hr9 GDP    30
    3  463 7g 448 OPR    28
