
I have perfectly working code, but when I run it on a large CSV file (around 2 GB) the complete execution takes about 15-20 minutes. Is there a way I could optimise the code below so it takes less time to finish execution and thus improves performance?

from csv import reader, writer
import pandas as pd

path = r"data.csv"
data = pd.read_csv(path, header=None)
last_column = data.iloc[:, -1]
arr = [i+1 for i in range(len(last_column)-1) if (last_column[i] == 1 and last_column[i+1] == 0)]

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open(path, 'r') as read_obj:
    csv_reader = reader(read_obj)
    rows = list(csv_reader)
    for j in arr:
        # Channel 1-7
        ch_0_6_init = [int(rows[j][k]) for k in range(1, 8)]
        bin_num = ''.join([str(x) for x in ch_0_6_init])
        dec_num = int(f'{bin_num}', 2)
        ch_0_6.append(dec_num)
        ch_0_6_init = []

        # Channel 8-15
        ch_7_14_init = [int(rows[j][k]) for k in range(8, 16)]
        bin_num = ''.join([str(x) for x in ch_7_14_init])
        dec_num = int(f'{bin_num}', 2)
        ch_7_14.append(dec_num)
        ch_7_14_init = []

        # Channel 16-22
        ch_16_22_init = [int(rows[j][k]) for k in range(16, 23)]
        bin_num = ''.join([str(x) for x in ch_16_22_init])
        dec_num = int(f'{bin_num}', 2)
        ch_16_22.append(dec_num)
        ch_16_22_init = []

Sample Data:

0.0114,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,1
0.0112,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0
0.0115,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1
0.0117,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,0
0.0118,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,1,1,0,1,0,0,0,1

The goal is to join the binary digits of the chosen channels and interpret them as a decimal number.
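For instance, with the first sample row above, channels 1-7 are 0,1,0,0,0,0,0; joined they form the binary string '0100000', which is 32 in decimal:

row = ['0.0114', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '0', '0', '0', '1']
bin_num = ''.join(row[1:8])   # '0100000'
dec_num = int(bin_num, 2)     # 32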

  • Could you add a few rows of sample data to test with, and a short explanation of what you are trying to achieve? Commented Jan 3, 2022 at 14:43
  • For one thing, you should figure out a way to avoid loading the entire ginormous CSV file twice. You're reading it with both pandas and the standard csv reader! Commented Jan 3, 2022 at 14:49
  • @JaredSmith Thanks for the suggestion. Which is better for reducing execution time, pandas or the standard csv reader? Commented Jan 3, 2022 at 15:09
  • @Pressing_Keys_24_7 IDK, but I would bet serious money that the difference between the two will be utterly dwarfed by simply not doing both; for a file that large you're talking about potentially reading 4 GB into memory at once, presumably on a user device. Use whichever one makes processing the data clearer/easier, and then, if it's still too slow, worry about which one is faster. Commented Jan 3, 2022 at 15:11

1 Answer


Using just the csv module, you could try the following kind of approach:

from csv import reader, writer

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open('data.csv', 'r') as f_input:
    csv_input = reader(f_input)
    last_row = ['0']
    for row in csv_input:
        if last_row[-1] == '1' and row[-1] == '0':
            ch_0_6.append(int(''.join(row[1:8]), 2))
            ch_7_14.append(int(''.join(row[8:16]), 2))
            ch_16_22.append(int(''.join(row[16:23]), 2))
        last_row = row

print(ch_0_6)
print(ch_7_14)
print(ch_16_22)

For your example data this would display:

[32, 46]
[1, 145]
[104, 104]

As noted in the comments, your original approach reads the whole file into memory twice, and the first pass is only used to determine which rows to parse. That can be done while reading, by keeping track of the previous row in the loop; this alone should give a significant speed-up.

The conversion of the binary digits into a decimal number is also a bit more efficient, since the csv fields are already '0'/'1' strings and can be joined directly.
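To make that concrete, here is a minimal before/after sketch (assuming, as in the sample data, that every channel field is exactly the string '0' or '1'):

row = ['0.0112', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '0', '0', '0', '0']

# Original: convert each field to int, back to str, join, then parse as base 2
digits = [int(row[k]) for k in range(1, 8)]
dec_original = int(''.join(str(x) for x in digits), 2)   # 32

# Streamlined: the csv fields are already '0'/'1' strings, so join them directly
dec_direct = int(''.join(row[1:8]), 2)                   # 32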

This approach would also work on much larger file sizes.
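If you would rather stay with pandas, a vectorised version along these lines is another option. This is only a sketch: it assumes the same layout (a timestamp column followed by 23 binary channel columns), pack_bits is just an illustrative helper name, and unlike the streaming approach it loads the whole file into memory at once:

import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', header=None)

last = data.iloc[:, -1].to_numpy()
# Select the rows that immediately follow a 1 -> 0 transition in the last column
idx = np.flatnonzero((last[:-1] == 1) & (last[1:] == 0)) + 1
selected = data.iloc[idx]

def pack_bits(block):
    # Interpret each row of 0/1 values as a big-endian binary number
    bits = block.to_numpy(dtype=np.int64)
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return (bits @ weights).tolist()

ch_0_6 = pack_bits(selected.iloc[:, 1:8])
ch_7_14 = pack_bits(selected.iloc[:, 8:16])
ch_16_22 = pack_bits(selected.iloc[:, 16:23])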
