I am trying to find a way to shuffle the lines of a large CSV file in Python and then split it into multiple CSV files (assigning a number of rows to each file), but I can't work out how to shuffle the large dataset while keeping the header in each CSV. Any pointers would help a lot.

Here's the code I found useful for splitting a csv file:

```python
number_of_rows = 100

def write_splitted_csvs(part, lines):
    with open('mycsvhere.csv' + str(part) + '.csv', 'w') as f_out:
        f_out.write(header)
        f_out.writelines(lines)

with open("mycsvhere.csv", "r") as f:
    count = 0
    header = f.readline()
    lines = []
    for line in f:
        count += 1
        lines.append(line)
        if count % number_of_rows == 0:
            write_splitted_csvs(count // number_of_rows, lines)
            lines = []
    if len(lines) > 0:
        write_splitted_csvs((count // number_of_rows) + 1, lines)
```

If anyone knows how to shuffle the data across these split CSVs, that would help a lot! Thank you very much.
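For a file that fits in memory, one possible approach (a sketch, not from the thread; the toy input file stands in for the real CSV) is to read the data lines, shuffle them with `random.shuffle`, and then split, writing the header into every part:

```python
import random

# toy input file (assumption: stands in for the OP's real "mycsvhere.csv")
with open("mycsvhere.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * i}\n" for i in range(10))

number_of_rows = 4

with open("mycsvhere.csv", "r") as f:
    header = f.readline()   # keep the header separate from the data rows
    lines = f.readlines()

random.shuffle(lines)       # in-place shuffle of the data rows only

for part, start in enumerate(range(0, len(lines), number_of_rows), start=1):
    with open(f"mycsvhere_{part}.csv", "w") as f_out:
        f_out.write(header)  # every split file gets the header
        f_out.writelines(lines[start:start + number_of_rows])
```

This keeps the whole file in memory once, which is fine for files in the tens of MB.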

2 Comments

  • You can use pandas to shuffle, and also to write it to CSV. Commented Feb 9, 2022 at 16:59
  • Does this answer your question? Shuffle DataFrame rows Commented Feb 9, 2022 at 16:59

2 Answers

2

I would suggest using Pandas if possible.

Shuffle the rows and reset the index:

```python
import pandas as pd

df = pd.read_csv('mycsvhere.csv')
# sample(frac=1) returns a shuffled copy, so assign it back
df = df.sample(frac=1).reset_index(drop=True)
```

Then you can split the dataframe into a list of smaller dataframes:

```python
number_of_rows = 100
sub_dfs = [df[i:i + number_of_rows] for i in range(0, df.shape[0], number_of_rows)]
```

Then if you want to save the csvs locally:

```python
for idx, sub_df in enumerate(sub_dfs):
    sub_df.to_csv(f'csv_{idx}.csv', index=False)
```
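Putting the three steps together, a runnable sketch (the toy dataframe stands in for `pd.read_csv("mycsvhere.csv")`, and `random_state` is pinned only for reproducibility):

```python
import pandas as pd

# toy frame standing in for pd.read_csv("mycsvhere.csv")
df = pd.DataFrame({"id": range(10), "value": [i * i for i in range(10)]})

# shuffle all rows, then discard the old index
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

number_of_rows = 4
sub_dfs = [df.iloc[i:i + number_of_rows] for i in range(0, len(df), number_of_rows)]

for idx, sub_df in enumerate(sub_dfs):
    # to_csv writes the header line by default, so every part keeps it
    sub_df.to_csv(f"csv_{idx}.csv", index=False)
```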

4 Comments

Your answer is way better than mine ^^ But the OP's custom code doesn't match the mentioned "large" data, where custom implementations are usually written to be optimized :)
Can the OP define "large csv" please ^^ - but agreed, if it's multi-GB CSV files you may want to use another method.
Hello, I meant a data file that could have thousands of rows, around 40 MB rather than GBs :) so the pandas solution works great!! Thank you very much
Great to hear - could you please mark this as the correct answer if it solved your query :)
1

There are three needs here:

  • Shuffle your dataset
  • Split your dataset
  • Formatting

For the first two steps, there are some nice tools in sklearn. You can try the stratified shuffle splitter (Sklearn SSS). You did not mention the stratified part, but you may need it without knowing it yet ;)

For the last part, formatting, it is all up to you. Check pandas' to_csv() function, where you can specify your headers; you can (and need to) specify the headers in the data object (DataFrame) as well. Nothing hard here - just spend a bit of time specifying what you want, then implement it easily :)

Side comment: you can drop pandas - depending on what "big" means for you, pandas is not great on big data.
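If the file really is too big for memory, one pandas-free approach (a sketch, not from the thread; the toy `big.csv` stands in for the real file) is to record each row's byte offset, shuffle only the offsets, and seek to each one when writing:

```python
import random

# toy input file (assumption: stands in for the real large CSV)
with open("big.csv", "w") as f:
    f.write("id,value\n")
    f.writelines(f"{i},{i * i}\n" for i in range(10))

# first pass: collect the byte offset of every data row
with open("big.csv", "rb") as f:
    header = f.readline()
    offsets = []
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)

random.shuffle(offsets)  # only the offsets live in memory, not the rows

# second pass: write rows in shuffled order, header first
with open("big.csv", "rb") as f_in, open("shuffled.csv", "wb") as f_out:
    f_out.write(header)
    for pos in offsets:
        f_in.seek(pos)
        f_out.write(f_in.readline())
```

The random seeks make this slower than an in-memory shuffle, but memory use stays proportional to the row count, not the file size.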

Comments
