0

I have tried multiple methods that get me to a point close to but not exactly where I want to be with the final output. I am trying to first create a few columns that contain a specific within the raw dataframe based on it's position, afterwards I am trying to make a particular row the header row and skip all the rows that were above it.

Raw input:

 | NA | NA_1 | NA_2 | NA_3 | 0 | 12-Month Percent Change | NaN | NaN | NaN | 1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN | 2 | Item: Food | NaN | NaN | NaN | 3 | Year | Jan | Feb | Mar | 4 | 2010 | -0.4 | -0.2 | 0.2 | 5 | 2011 | 1.8 | 2.3 | 2.9 | 

Code used:

df1['View Description'] = df1.iat[0,0] df1['Series ID'] = df1.iat[1,1] df1['Series Name'] = df1.iat[2,1] df1 

Resulted to:

 NA NA.1 NA.2 NA.3 NA.4 NA.5 NA.6 NA.7 View Description Series ID Series Name 0 12-Month Percent Change NaN NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food 1 Series Id: CUUR0000SAF1 NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food 2 Item: Food NaN NaN NaN NaN NaN NaN 12-Month Percent Change CUUR0000SAF1 Food 3 Year Jan Feb Mar Apr May Jun Jul 12-Month Percent Change CUUR0000SAF1 Food 4 2010 -0.4 -0.2 0.2 0.5 0.7 0.7 0.9 12-Month Percent Change CUUR0000SAF1 Food 5 2011 1.8 2.3 2.9 3.2 3.5 3.7 4.2 12-Month Percent Change CUUR0000SAF1 Food 6 2012 4.4 3.9 3.3 3.1 2.8 2.7 2.3 12-Month Percent Change CUUR0000SAF1 Food 7 2013 1.6 1.6 1.5 1.5 1.4 1.4 1.4 12-Month Percent Change CUUR0000SAF1 Food 

Last thing is I want to make the header the row 3 and remove all the rows above it. BUT still keep the three columns at the end: 1) View Description, Series ID, Series Name.

Any suggestions with an efficient way that this can be done as next I want to scale it up with a for loop or something that would do this process for x10 files.

Thanks in advance!

1 Answer 1

1

Here's a way to do what I believe your question is asking:

# Parse and store the first 3 values in column 0 so that we can use them # as values for 3 new columns later. new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()] # Transpose so that we can use set_index() to replace the index # (the columns from the original df1) to ['Item: Food', NaN, NaN, NaN], # then transpose back so that the new index becomes the columns. df1 = df1.T.set_index(3).T # Use reset_index() to replace the index with a fresh range # index (0, 1, 2, ...) so we can use iloc() to discard the # first 3 unwanted rows, then call reset_index() again. df1 = df1.reset_index(drop=True).iloc[3:].reset_index(drop=True) # Get rid of vestigial name for columns. df1.columns.names = [None] # Add the three new columns set to the values saved earlier. df1[['View Description', 'Series ID', 'Series Name']] = new_columns 

Here is full test case (with the above annotated code compressed into fewer lines):

import pandas as pd s = [ ' | NA | NA_1 | NA_2 | NA_3 |', '0 | 12-Month Percent Change | NaN | NaN | NaN |', '1 | Series Id: CUUR0000SAF1 | NaN | NaN | NaN |', '2 | Item: Food | NaN | NaN | NaN |', '3 | Year | Jan | Feb | Mar |', '4 | 2010 | -0.4 | -0.2 | 0.2 |', '5 | 2011 | 1.8 | 2.3 | 2.9 |'] df1 = pd.DataFrame( [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]], columns = [x.strip() for x in s[0].split('|')[1:-1]], ) print(df1) new_columns = [x.split(':')[-1].strip() for x in df1.iloc[0:3,0].to_list()] df1 = df1.T.set_index(3).T.reset_index(drop=True).iloc[3:].reset_index(drop=True) df1.columns.names = [None] df1[['View Description', 'Series ID', 'Series Name']] = new_columns print(df1) 

Output:

 NA NA_1 NA_2 NA_3 0 12-Month Percent Change NaN NaN NaN 1 Series Id: CUUR0000SAF1 NaN NaN NaN 2 Item: Food NaN NaN NaN 3 Year Jan Feb Mar 4 2010 -0.4 -0.2 0.2 5 2011 1.8 2.3 2.9 Year Jan Feb Mar View Description Series ID Series Name 0 2010 -0.4 -0.2 0.2 12-Month Percent Change CUUR0000SAF1 Food 1 2011 1.8 2.3 2.9 12-Month Percent Change CUUR0000SAF1 Food 

UPDATE: This is code that allows us to configure (1) the cell coordinates of each of 3 cells to be used for new column values (new_col_coords) and (2) the header_row above which rows are discarded:

import pandas as pd s = [ ' | NA | NA_1 | NA_2 | NA_3 |', '0 | 12-Month Percent Change | NaN | NaN | NaN |', '91 | To be discarded | NaN | NaN | NaN |', '1 | Series Id: CUUR0000SAF1 | Abc | NaN | NaN |', '92 | To be discarded | NaN | NaN | NaN |', '93 | To be discarded | NaN | NaN | NaN |', '94 | To be discarded | NaN | NaN | NaN |', '2 | Item: Food | Xyz | NaN | NaN |', '95 | To be discarded | NaN | NaN | NaN |', '96 | To be discarded | NaN | NaN | NaN |', '97 | To be discarded | NaN | NaN | NaN |', '98 | To be discarded | NaN | NaN | NaN |', '3 | Year | Jan | Feb | Mar |', '4 | 2010 | -0.4 | -0.2 | 0.2 |', '5 | 2011 | 1.8 | 2.3 | 2.9 |'] df1 = pd.DataFrame( [[x.strip() for x in y.split('|')[1:-1]] for y in s[1:]], columns = [x.strip() for x in s[0].split('|')[1:-1]], ) print(df1) # parse and store the 3 values at specified coordinates so that we can use them as values for 3 new columns later new_col_coords = [[0,0], [2,1], [6,1]] new_columns = [x.split(':')[-1].strip() for x in [df1.iloc[i, j] for i, j in new_col_coords]] header_row = 11 # Here's how to do everything that follows in one line of code: #df1 = df1.T.set_index(header_row).T.reset_index(drop=True).iloc[header_row:].reset_index(drop=True) # Transpose so that we can use set_index() to change the index to ['Item: Food', NaN, NaN, NaN], then transpose back so that index becomes the columns df1 = df1.T.set_index(header_row).T # Use reset_index() to replace the index with a fresh range index (0, 1, 2, ...) so we can use iloc() to discard the unwanted rows above header_row, then call reset_index() again df1 = df1.reset_index(drop=True).iloc[header_row:].reset_index(drop=True) # Get rid of vestigial name for columns df1.columns.names = [None] # Add the three new columns set to the values saved earlier df1[['View Description', 'Series ID', 'Series Name']] = new_columns print(df1) 

Output:

 NA NA_1 NA_2 NA_3 0 12-Month Percent Change NaN NaN NaN 1 To be discarded NaN NaN NaN 2 Series Id: CUUR0000SAF1 Abc NaN NaN 3 To be discarded NaN NaN NaN 4 To be discarded NaN NaN NaN 5 To be discarded NaN NaN NaN 6 Item: Food Xyz NaN NaN 7 To be discarded NaN NaN NaN 8 To be discarded NaN NaN NaN 9 To be discarded NaN NaN NaN 10 To be discarded NaN NaN NaN 11 Year Jan Feb Mar 12 2010 -0.4 -0.2 0.2 13 2011 1.8 2.3 2.9 Year Jan Feb Mar View Description Series ID Series Name 0 2010 -0.4 -0.2 0.2 12-Month Percent Change Abc Xyz 1 2011 1.8 2.3 2.9 12-Month Percent Change Abc Xyz 
Sign up to request clarification or add additional context in comments.

8 Comments

@Alexchristof20 Did this answer your question? Do you need any additional help with your question?
After testing specifically your code from the "test case" it works just as fine. The issue though rose when the raw input that initially included in my question is not quite the actual raw input. I tried to simplify it so that I would solve that afterwards. And based on your solution I am having trouble on making those changes. Specifically, the position of the 3 values that I want to save and create new variables with are not from 0:3 but instead one is [0:0], [2,1] and [6,1]. Any additional suggestions would be much appreciated!
So these are like 12-Month Percent Change, Series Id: CUUR0000SAF1 and Item: Food in your original example, but instead of being in rows 0:3 of column 0, they are in [0:0], [2,1] and [6,1]? (By the way, are you sure it's not [0,0], [2,0] and [6,0]?) And do you now want to make row 7 the header (instead of row 3) and remove all the rows above it?
Yes the indexes that I mentioned are indeed accurate. The header will be row 11 which is same if the code reflects row 7, I will adjust it. (It's 11 and not 7 because there are blank rows in between, not sure if that matters).
Take a look at my answer, updated to do what the follow-on question in your comment asks. All set?
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.