
I have perfectly working code, but when I run it on a large CSV file (around 2 GB) the complete execution takes about 15-20 minutes. Is there a way I could optimise the code below so it takes less time to finish execution and thus improves performance?

from csv import reader, writer
import pandas as pd

path = r"data.csv"
data = pd.read_csv(path, header=None)
last_column = data.iloc[:, -1]
arr = [i+1 for i in range(len(last_column)-1) if (last_column[i] == 1 and last_column[i+1] == 0)]

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open(path, 'r') as read_obj:
    csv_reader = reader(read_obj)
    rows = list(csv_reader)
    for j in arr:
        # Channel 1-7
        ch_0_6_init = [int(rows[j][k]) for k in range(1, 8)]
        bin_num = ''.join([str(x) for x in ch_0_6_init])
        dec_num = int(f'{bin_num}', 2)
        ch_0_6.append(dec_num)
        ch_0_6_init = []

        # Channel 8-15
        ch_7_14_init = [int(rows[j][k]) for k in range(8, 16)]
        bin_num = ''.join([str(x) for x in ch_7_14_init])
        dec_num = int(f'{bin_num}', 2)
        ch_7_14.append(dec_num)
        ch_7_14_init = []

        # Channel 16-22
        ch_16_22_init = [int(rows[j][k]) for k in range(16, 23)]
        bin_num = ''.join([str(x) for x in ch_16_22_init])
        dec_num = int(f'{bin_num}', 2)
        ch_16_22.append(dec_num)
        ch_16_22_init = []

Sample Data:

0.0114,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,1
0.0112,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,0,1,0,0,0,0
0.0115,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,1
0.0117,0,1,0,1,1,1,0,1,0,0,1,0,0,0,1,1,1,0,1,0,0,0,0
0.0118,0,1,0,0,1,1,0,0,0,1,0,1,0,0,1,1,1,0,1,0,0,0,1

The goal is to join the binary digits of the chosen channels and interpret them as a decimal number.
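For instance, with the first sample row above, channels 1-7 are 0,1,0,0,0,0,0; joined they form the binary string '0100000', which is 32 in decimal:

row = ['0.0114', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '0', '0', '0', '1']
bin_num = ''.join(row[1:8])   # '0100000'
dec_num = int(bin_num, 2)     # 32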

  • Could you add a few rows of sample data to test with, and a short explanation of what you are trying to achieve? Commented Jan 3, 2022 at 14:43
  • For one thing, you should figure out a way to avoid loading the entire ginormous CSV file twice. You're reading it with both pandas and the standard csv reader! Commented Jan 3, 2022 at 14:49
  • @JaredSmith Thanks for the suggestion. Which is better for reducing execution time, pandas or the standard csv reader? Commented Jan 3, 2022 at 15:09
  • @Pressing_Keys_24_7 IDK, but I would bet serious money that the difference between the two will be utterly dwarfed by simply not doing both; for a file that large you're talking about potentially reading 4 GB into memory at once, presumably on a user device. Use whichever one makes processing the data clearer/easier, and then, if it's still too slow, worry about which one is faster. Commented Jan 3, 2022 at 15:11

1 Answer


Using just the csv module, you could try the following kind of approach:

from csv import reader, writer

ch_0_6 = []
ch_7_14 = []
ch_16_22 = []

with open('data.csv', 'r') as f_input:
    csv_input = reader(f_input)
    last_row = ['0']
    for row in csv_input:
        if last_row[-1] == '1' and row[-1] == '0':
            ch_0_6.append(int(''.join(row[1:8]), 2))
            ch_7_14.append(int(''.join(row[8:16]), 2))
            ch_16_22.append(int(''.join(row[16:23]), 2))
        last_row = row

print(ch_0_6)
print(ch_7_14)
print(ch_16_22)

For your example data this would display:

[32, 46]
[1, 145]
[104, 104]

As noted in the comments, your original approach reads the whole file into memory twice, and the first pass is only used to determine which rows to parse. That can be done while reading, by keeping track of the previous row in the loop; this alone should give a significant speed-up.

The conversion of the binary digits into a decimal number is also a bit more efficient, since the csv fields are already '0'/'1' strings and can be joined directly.
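To make that concrete, here is a minimal before/after sketch (assuming, as in the sample data, that every channel field is exactly the string '0' or '1'):

row = ['0.0112', '0', '1', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '0', '1', '1', '1', '0', '1', '0', '0', '0', '0']

# Original: convert each field to int, back to str, join, then parse as base 2
digits = [int(row[k]) for k in range(1, 8)]
dec_original = int(''.join(str(x) for x in digits), 2)   # 32

# Streamlined: the csv fields are already '0'/'1' strings, so join them directly
dec_direct = int(''.join(row[1:8]), 2)                   # 32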

This approach would also work on much larger file sizes.
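If you would rather stay with pandas, a vectorised version along these lines is another option. This is only a sketch: it assumes the same layout (a timestamp column followed by 23 binary channel columns), pack_bits is just an illustrative helper name, and unlike the streaming approach it loads the whole file into memory at once:

import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', header=None)

last = data.iloc[:, -1].to_numpy()
# Select the rows that immediately follow a 1 -> 0 transition in the last column
idx = np.flatnonzero((last[:-1] == 1) & (last[1:] == 0)) + 1
selected = data.iloc[idx]

def pack_bits(block):
    # Interpret each row of 0/1 values as a big-endian binary number
    bits = block.to_numpy(dtype=np.int64)
    weights = 1 << np.arange(bits.shape[1] - 1, -1, -1)
    return (bits @ weights).tolist()

ch_0_6 = pack_bits(selected.iloc[:, 1:8])
ch_7_14 = pack_bits(selected.iloc[:, 8:16])
ch_16_22 = pack_bits(selected.iloc[:, 16:23])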
