
I know that with pandas you can use the CSV writer in "append" mode to add new rows to a file, but I'm wondering: is there a way to add a new column to an existing file without having to first load the file like:

df = pd.read_csv/excel/parquet("the_file.csv") 

The reason I ask is that I sometimes deal with huge datasets, and loading them into memory is expensive when all I'd like to do is add one column to the file.

As an example, I have a huge dataset stored already. I load one column from that dataset and perform a calculation on it, which gives me another column of data. Now I'd like to add that new column (same number of rows and everything) to the file, without first importing the whole thing. Is that possible?

Here's a reproducible example if needed. I'm using this on much larger datasets, but the premise would be exactly the same regardless:

from sklearn.datasets import make_classification
from pandas import DataFrame, read_csv

# Make a fake binary classification dataset
X, y = make_classification(n_samples=100, n_features=10, n_informative=5, n_classes=2)

# Turn it into a dataframe
df = DataFrame(X, columns=['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10'])
df['label'] = y

# Save the file
df.to_csv("./the_file.csv", index=False)

# Now, load one column from that file
main_col = read_csv("./the_file.csv", usecols=["col1"])

# Perform some random calculation to get a new column
main_col['new_col'] = main_col['col1'] / 2

Now, how can you add main_col['new_col'] to ./the_file.csv, without first importing the entire file, adding the column, then resaving?

  • You need to give specifics on the format of your file and the calculations involved. Provide input/output examples. Commented Oct 17, 2021 at 7:53
  • None of that matters. For ease, we can assume the stored file is in .CSV format. I read in one column of that file with the command main_col = pd.read_csv("./the_file.csv", usecols=['main_col']). Now, for argument's sake, let's say I just make a direct copy of that column with added_col = main_col. How can I add added_col to the .CSV file without importing the entire file just to save it again? The calculations/format of the data are irrelevant to the question I'm asking. Commented Oct 17, 2021 at 7:59
  • All of that matters absolutely. Answering your question requires parsing the lines. It is impossible to read a CSV column by column, only line by line. Commented Oct 17, 2021 at 8:04
  • 1
    I'm not super familiar with Parquet but based on quick scanning of documentation, it could reduce the amount of information you have to rewrite rather drastically, compared to simple row-based formats. Another option might be to use a database, where the engine transparently takes care of any internal reorganization when you change the schema. Commented Oct 17, 2021 at 8:31
  • 1
    @MattWilson chunks actually works, i had same problem. When we apply chunk size, it provides file reader object to loop.i I supposed to read a csv & load to db. Due to memory constraints my 1.2 GB file was crashing my Docker. So i applied this technique. Here you can see github.com/rakeshkaswan/… line number 65. Commented Oct 17, 2021 at 8:50
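For reference, here's a minimal sketch of the columnar idea from the Parquet comment above. It assumes pyarrow is installed and that a Parquet copy of the dataset exists as a hypothetical the_file.parquet; with a columnar format, a single column can be pulled off disk without loading the rest:

import pyarrow.parquet as pq

# Read just one column; the other columns are never loaded into memory
table = pq.read_table("the_file.parquet", columns=["col1"])
main_col = table.to_pandas()

Adding a new column still means writing out a new file, but it can be done row group by row group rather than reparsing every line of text.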

1 Answer


Working with some of the feedback I received in the comments, here is my hacky workaround to this problem. It's not efficient, but it shows what I want to accomplish. I'll look into the chunksize suggestion from @RakeshKumar too (see the sketch at the end):

# Idea 1
# Start a new file. The columns are known, so this is fine, though not very efficient
import csv
from pandas import read_csv

columns = ['col1','col2','col3','col4','col5','col6','col7','col8','col9','col10','label','added_col']

with open('new_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(columns)

    # A skip-rows counter
    skipRows = 0

    while True:
        # Read in one data row from the file, keeping the header line for column names
        df = read_csv("./the_file.csv", nrows=1, skiprows=range(1, skipRows + 1), header=0)

        # Once we've run out of rows in the first file, stop the loop
        if df.empty:
            break

        # Perform your calculation
        df['added_col'] = df['col10'] / 2

        # Write the new row to the new file
        writer.writerow(df.iloc[0, :])

        # Do the next line
        skipRows += 1

So in effect, we're only reading one line at a time from the first file and appending to the new file; when we're done, we can just delete the first file. It's not efficient (each read_csv call re-scans the file from the top, so the whole pass is quadratic), but it keeps the memory load down when using giant datasets!
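On the chunksize suggestion: here's a minimal sketch of the same idea using pandas' built-in chunked reader, which keeps memory bounded without re-scanning the file on every row (the file names, chunk size, and calculation are just carried over from the example above):

from pandas import read_csv

# Stream the original file in chunks, add the new column to each chunk,
# and append the augmented rows to the new file as we go
first = True
for chunk in read_csv("./the_file.csv", chunksize=10000):
    chunk['added_col'] = chunk['col10'] / 2
    # Write the header only once, then append
    chunk.to_csv("new_file.csv", mode='w' if first else 'a', header=first, index=False)
    first = False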
