Processing a CSV file with more than 700K rows of data
I have a .csv file (around 400 MB in size) that contains 700K rows of structured data. The table structure is:
    +----+----------+-------+-----------+--------+--------+
    | Id | Category | List1 | List2     | Value1 | Value2 |
    +----+----------+-------+-----------+--------+--------+
    | 1  | 1        | A,B,C | Cat1,Cat2 | 100    | 5      |
    | 2  | 1        | D,F   | Cat1,Cat4 | 120    | 4      |
    | 3  | 2        | E,A   | Cat3      | 140    | 2      |
    | 4  | 2        | E,A   | NULL      | 110    | 3      |
    | 5  | 3        | B     | Cat2      | 100    | 6      |
    +----+----------+-------+-----------+--------+--------+

For each row (CurrentRow) of the table I want to calculate CurrentRow.Value3 = SUM(Value1)/SUM(Value2) over all other rows (OtherRow) in the table where the following conditions are met:
    Condition1 = CurrentRow.Category != OtherRow.Category
    Condition2 = CurrentRow.List1 intersects OtherRow.List1
    Condition3 = CurrentRow.List2 intersects OtherRow.List2 or CurrentRow.List2 = NULL or OtherRow.List2 = NULL

Also, I want to list the Ids of the rows which were involved in the calculation of Value3.
Example for first row:
Condition1: the first row has category value 1. As a result, rows 3-5 meet this condition because their category values are not equal to 1.
Condition2: column "List1" has values "A,B,C", which intersect with "E,A" (row #3), "E,A" (row #4) and "B" (row #5).
Condition3: column "List2" has values "Cat1,Cat2", which intersect with "Cat1,Cat4" (row #2) and "Cat2" (row #5); row #4 also qualifies because its List2 value is NULL.
As a result, we take rows #4 and #5, as they both meet all three conditions.
Value3 = (110+100)/(3+6) = 210/9 = 23.33
Ids = "4,5"
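To make the expected behaviour unambiguous, here is a minimal brute-force sketch of the three conditions in plain Python, run on the sample table. The helper names (`to_set`, `matches`, `value3_and_ids`) are made up for illustration, NULL is represented as `None`, and a NULL List1 is assumed to intersect with nothing. This is only a reference for checking results on small inputs, not something that would scale to 700K rows.

```python
# Sample rows: (Id, Category, List1, List2, Value1, Value2); None stands for NULL.
ROWS = [
    (1, 1, "A,B,C", "Cat1,Cat2", 100, 5),
    (2, 1, "D,F", "Cat1,Cat4", 120, 4),
    (3, 2, "E,A", "Cat3", 140, 2),
    (4, 2, "E,A", None, 110, 3),
    (5, 3, "B", "Cat2", 100, 6),
]


def to_set(cell):
    """Split a comma-separated cell into a set; NULL becomes an empty set."""
    return set() if cell is None else set(cell.split(","))


def matches(current, other):
    """Return True if `other` satisfies Condition1-3 relative to `current`."""
    _, cat_c, list1_c, list2_c, _, _ = current
    _, cat_o, list1_o, list2_o, _, _ = other
    cond1 = cat_c != cat_o
    cond2 = bool(to_set(list1_c) & to_set(list1_o))
    cond3 = (
        list2_c is None
        or list2_o is None
        or bool(to_set(list2_c) & to_set(list2_o))
    )
    return cond1 and cond2 and cond3


def value3_and_ids(current, rows):
    """Compute (Value3, Ids) for one row; (None, None) if nothing qualifies."""
    # Condition1 already excludes the row itself (same category).
    selected = [r for r in rows if matches(current, r)]
    value2_sum = sum(r[5] for r in selected)
    if not selected or value2_sum == 0:
        return None, None
    value1_sum = sum(r[4] for r in selected)
    return value1_sum / value2_sum, ",".join(str(r[0]) for r in selected)


for row in ROWS:
    print(row[0], value3_and_ids(row, ROWS))
```

For the first row this selects rows #4 and #5, matching the worked example above.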
The result for the table above would be:
    +----+----------+-------+-----------+--------+--------+--------+------+
    | Id | Category | List1 | List2     | Value1 | Value2 | Value3 | Ids  |
    +----+----------+-------+-----------+--------+--------+--------+------+
    | 1  | 1        | A,B,C | Cat1,Cat2 | 100    | 5      | 23.33  | 4,5  |
    | 2  | 1        | D,F   | Cat1,Cat4 | 120    | 4      | NULL   | NULL |
    | 3  | 2        | E,A   | Cat3      | 140    | 2      | NULL   | NULL |
    | 4  | 2        | E,A   | NULL      | 110    | 3      | 20     | 1    |
    | 5  | 3        | B     | Cat2      | 100    | 6      | 20     | 1    |
    +----+----------+-------+-----------+--------+--------+--------+------+

I have tried to do it using pandas in Python:
    import pandas as pd

    data = pd.read_pickle('data.df')

    for _index, _record in data.iterrows():
        category = _record["Category"]
        list1 = _record["List1"]
        list2 = _record["List2"]

        # Rows that are meant to satisfy all three conditions for the current row
        candidates = data.loc[
            (data['Category'] != category)
            & (pd.Series(list1).isin(data["List1"]).any())
            & (
                (pd.Series(list2).isin(data["List2"]).any())
                | (data["List2"][0] == 'NULL')
                | (list2[0] == 'NULL')
            )
        ]

        value1_sum = candidates["Value1"].sum()
        value2_sum = candidates["Value2"].sum()
        ids = candidates[['Id']].to_numpy()

        if value2_sum > 0:
            data.loc[_index, "Value3"] = value1_sum / value2_sum
            data.loc[_index, "Ids"] = ids

    data.to_csv('result.csv')

It works for a small number of rows, but takes forever on 700K rows. Is there any way to optimize this algorithm?