I am trying to find a way to shuffle the lines of a large CSV file in Python and then split it into multiple CSV files (assigning a number of rows to each file), but I can't work out how to shuffle the large dataset while keeping the header in each CSV. Any pointers would help a lot.

Here's the code I found useful for splitting a csv file:

```python
number_of_rows = 100

def write_splitted_csvs(part, lines):
    with open('mycsvhere.csv' + str(part) + '.csv', 'w') as f_out:
        f_out.write(header)
        f_out.writelines(lines)

with open("mycsvhere.csv", "r") as f:
    count = 0
    header = f.readline()
    lines = []
    for line in f:
        count += 1
        lines.append(line)
        if count % number_of_rows == 0:
            write_splitted_csvs(count // number_of_rows, lines)
            lines = []
    if len(lines) > 0:
        write_splitted_csvs((count // number_of_rows) + 1, lines)
```

If anyone knows how to shuffle the data across these split CSVs, that would help a lot! Thank you very much.
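For a file that fits in memory, one possible approach (a sketch, not from the thread; the toy input file stands in for the real CSV) is to read the data lines, shuffle them with `random.shuffle`, and then split, writing the header into every part:

```python
import random

# toy input file (assumption: stands in for the OP's real "mycsvhere.csv")
with open("mycsvhere.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * i}\n" for i in range(10))

number_of_rows = 4

with open("mycsvhere.csv", "r") as f:
    header = f.readline()   # keep the header separate from the data rows
    lines = f.readlines()

random.shuffle(lines)       # in-place shuffle of the data rows only

for part, start in enumerate(range(0, len(lines), number_of_rows), start=1):
    with open(f"mycsvhere_{part}.csv", "w") as f_out:
        f_out.write(header)  # every split file gets the header
        f_out.writelines(lines[start:start + number_of_rows])
```

This keeps the whole file in memory once, which is fine for files in the tens of MB.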

2 Comments

  • You can use pandas to shuffle, and also to write it to CSV. Commented Feb 9, 2022 at 16:59
  • Does this answer your question? Shuffle DataFrame rows Commented Feb 9, 2022 at 16:59

2 Answers

2

I would suggest using Pandas if possible.

Shuffle the rows and reset the index:

```python
import pandas as pd

df = pd.read_csv('mycsvhere.csv')
# sample(frac=1) returns a shuffled copy, so assign it back
df = df.sample(frac=1).reset_index(drop=True)
```

Then you can split the dataframe into a list of smaller dataframes:

```python
number_of_rows = 100
sub_dfs = [df[i:i + number_of_rows] for i in range(0, df.shape[0], number_of_rows)]
```

Then if you want to save the csvs locally:

```python
for idx, sub_df in enumerate(sub_dfs):
    sub_df.to_csv(f'csv_{idx}.csv', index=False)
```
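Putting the three steps together, a runnable sketch (the toy dataframe stands in for `pd.read_csv("mycsvhere.csv")`, and `random_state` is pinned only for reproducibility):

```python
import pandas as pd

# toy frame standing in for pd.read_csv("mycsvhere.csv")
df = pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]})

# shuffle all rows, then discard the old index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

number_of_rows = 4
sub_dfs = [df.iloc[i:i + number_of_rows] for i in range(0, len(df), number_of_rows)]

for idx, sub_df in enumerate(sub_dfs):
    # to_csv writes the header line by default, so every part keeps it
    sub_df.to_csv(f"csv_{idx}.csv", index=False)
```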

4 Comments

Your answer is way better than mine ^^ But the OP's custom code doesn't match the mentioned "large" data, where custom implementations are usually written to be optimized :)
Can the OP define "large csv" please ^^ - but agreed, if it's multi-GB CSV files you may want to use another method.
Hello, I meant a data file that could have thousands of rows, around 40 MB rather than GBs :) so the pandas solution works great!! Thank you very much
Great to hear - could you please mark this as the correct answer if it solved your query :)
1

There are three needs here:

  • Shuffle your dataset
  • Split your dataset
  • Formatting

For the first two steps, there are some nice tools in sklearn. You can try the stratified shuffle splitter (Sklearn SSS). You did not mention the stratified part, but you may need it without knowing it yet ;)

For the last part, formatting, it is all up to you. Check pandas' to_csv() function, where you can specify your headers; you can (and need to) specify the headers in the data object (DataFrame) as well. Nothing hard here - just spend a bit of time specifying what you want, then implement it easily :)

Side comment: you can drop pandas - depending on what "big" means for you, pandas is not great on big data.
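If the file really is too big for memory, one pandas-free approach (a sketch, not from the thread; the toy `big.csv` stands in for the real file) is to record each row's byte offset, shuffle only the offsets, and seek to each one when writing:

```python
import random

# toy input file (assumption: stands in for the real large CSV)
with open("big.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * i}\n" for i in range(10))

# first pass: collect the byte offset of every data row
with open("big.csv", "rb") as f:
    header = f.readline()
    offsets = []
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

random.shuffle(offsets)  # only the offsets live in memory, not the rows

# second pass: write rows in shuffled order, header first
with open("big.csv", "rb") as f_in, open("shuffled.csv", "wb") as f_out:
    f_out.write(header)
    for pos in offsets:
        f_in.seek(pos)
        f_out.write(f_in.readline())
```

The random seeks make this slower than an in-memory shuffle, but memory use stays proportional to the row count, not the file size.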

Comments
