I am fairly new to Python, so someone may be able to comment if this is not a good approach. My line of thinking was to process the input line by line. First, drop the trailing semicolon, since it does not appear in your output. Then use a regex to split the line on a space character, but only if that space is followed by either OPR or GDP and the keyword is not at the end of the line. If the split yields only one item, append NaN to the list to fill the second column. Finally, print the result with formatting.
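    import re

    data_string = """12 364 OPR 4 67474;
    893 73 GDP hdj 747;
    hr 777 hr9 GDP;
    463 7g 448 OPR;
    """

    data_list = data_string.splitlines()
    for data in data_list:
        # Strip any indentation and the trailing semicolon, then split on the
        # space before GDP/OPR. (?!$) skips a keyword that ends the line;
        # maxsplit=1 guarantees at most two parts.
        line = data.strip().rstrip(";")
        data_split = re.split(r"\s(?=(?:GDP|OPR)(?!$))", line, maxsplit=1)
        if len(data_split) == 1:
            data_split.append("NaN")  # nothing to split off: fill the second column
        print("%-20s|%-20s" % tuple(data_split))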
OUTPUT
    12 364              |OPR 4 67474
    893 73              |GDP hdj 747
    hr 777 hr9 GDP      |NaN
    463 7g 448 OPR      |NaN
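In case it helps, the splitting rule described above can be wrapped in a small helper so it is easier to test on its own. This is only a sketch under my own assumptions: the names `SPLIT_RE` and `split_record` are made up for illustration, and it returns `None` rather than the string "NaN" when there is nothing to split off. It uses the negative lookahead `(?!$)` to express "not at the end of the line"; inside a character class such as `[^$]`, the `$` is a literal dollar sign rather than an end-of-line anchor.

```python
import re

# Split on the whitespace that precedes GDP or OPR, unless the keyword
# is the last thing on the line. (?!$) is a negative lookahead for end of string.
SPLIT_RE = re.compile(r"\s(?=(?:GDP|OPR)(?!$))")


def split_record(line):
    """Return (data, code); code is None when there is nothing to split off."""
    # rstrip(";") drops the trailing semicolon if there is one.
    parts = SPLIT_RE.split(line.rstrip(";"), maxsplit=1)
    if len(parts) == 1:
        return parts[0], None
    return parts[0], parts[1]


for raw in ["12 364 OPR 4 67474;", "hr 777 hr9 GDP;"]:
    print(split_record(raw))
```

Limiting the split with `maxsplit=1` means a line containing both keywords still yields exactly two parts, split at the first keyword.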
Updated in light of question edit and comments
Based on your update to the question and the comments, you could try the code below. I suggest testing it against any edge cases and adding validation or conditional checks before performing the updates.
    import pandas as pd
    import re

    source_data = {
        'data': ['12 364 OPR 4 67474', '893 73 GDP hdj 747', 'hr 777 hr9 GDP', '463 7g 448 OPR'],
        'code': [None, None, None, None],
        'Temp': [33, 34, 30, 28],
    }
    df = pd.DataFrame.from_dict(source_data)
    print("Original df:")
    print(df, "\n")

    for index, _ in df.iterrows():
        data = df.at[index, 'data']
        # Split on the space before GDP/OPR unless the keyword ends the string.
        data_split = re.split(r"\s(?=(?:GDP|OPR)(?!$))", data, maxsplit=1)
        if len(data_split) == 2:
            # Only rows that actually split are updated; the rest stay as they are.
            df.at[index, 'data'] = data_split[0]
            df.at[index, 'code'] = data_split[1]

    print("Updated df:")
    print(df)
OUTPUT
    Original df:
                     data  code  Temp
    0  12 364 OPR 4 67474  None    33
    1  893 73 GDP hdj 747  None    34
    2      hr 777 hr9 GDP  None    30
    3      463 7g 448 OPR  None    28

    Updated df:
                 data         code  Temp
    0          12 364  OPR 4 67474    33
    1          893 73  GDP hdj 747    34
    2  hr 777 hr9 GDP         None    30
    3  463 7g 448 OPR         None    28
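If the DataFrame is large, looping over rows can be slow, so a vectorized variant might be worth trying. The sketch below is an assumption on my part rather than something tested against your real data: it uses `Series.str.extract` with two capture groups, and it requires whitespace after the keyword, which is slightly stricter than the split-based rule. Rows where the keyword is missing or ends the string do not match and are left unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "data": ["12 364 OPR 4 67474", "893 73 GDP hdj 747",
             "hr 777 hr9 GDP", "463 7g 448 OPR"],
    "code": [None, None, None, None],
    "Temp": [33, 34, 30, 28],
})

# Group 0: text before the keyword. Group 1: the keyword plus what follows it.
# The lazy .*? makes the match stop at the first keyword on the line.
parts = df["data"].str.extract(r"^(.*?)\s((?:GDP|OPR)\s.+)$")

# Rows that did not match have NaN in both groups; update only the matches.
matched = parts[0].notna()
df.loc[matched, "code"] = parts.loc[matched, 1]
df.loc[matched, "data"] = parts.loc[matched, 0]

print(df)
```

The boolean mask `matched` keeps the unmatched rows untouched, which mirrors the `len(data_split)==2` check in the loop version.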
NaN in 2nd column and hr 777 hr9 GDP in 1st is desired? I don't get the rule.
\d{3} Match groups can also be helpful.
OPR or GDP is not at the end of the string. That's the case for 0 and 1. For 2 and 3 it's at the end and there's no need to split.