edited body

edited May 10, 2019 at 5:04


Condition1 = CurrentRow.Category != OtherRow.Category
Condition2 = CurrentRow.List1 intersects OtherRow.List1
Condition3 = CurrentRow.List2 intersects OtherRow.List2 or CurrentRow.List2 = NULL or OtherRow.List2 = NULL

added 12 characters in body


edited May 9, 2019 at 13:28

mbax2ak3


For each row (CurrentRow) of the table I want to calculate:
CurrentRow.Value3 = SUM(Value1)/SUM(Value2) of all other rows (OtherRow) in the table where the following conditions are met:

Example for first row:
Condition1: the first row has category value 1. As a result, rows 3-5 meet condition #1 because their category values are not equal to 1.
Condition2: column "List1" has values "A,B,C", which intersect with values "E,A" (row #3), "E,A" (row #4), "B" (row #5).
Condition3: column "List2" has values "Cat1,Cat2", which intersect with values "Cat1,Cat4" (row #2) and "Cat2" (row #5). We also take row #4 because its "List2" value is NULL.

As a result, we take rows #4 and #5 as they both meet all conditions.
Value3 = (110+100)/(3+6) = 210/9 = 23.33
Ids = "4,5"

I have tried to do it using pandas in Python:


asked May 9, 2019 at 13:22

mbax2ak3


Processing csv file with more than 700K rows of data

I have a .csv file (around 400MB in size) which contains 700K rows of structured data. Table structure is:

+----+----------+-------+-----------+--------+--------+
| Id | Category | List1 | List2     | Value1 | Value2 |
+----+----------+-------+-----------+--------+--------+
| 1  | 1        | A,B,C | Cat1,Cat2 | 100    | 5      |
| 2  | 1        | D,F   | Cat1,Cat4 | 120    | 4      |
| 3  | 2        | E,A   | Cat3      | 140    | 2      |
| 4  | 2        | E,A   | NULL      | 110    | 3      |
| 5  | 3        | B     | Cat2      | 100    | 6      |
+----+----------+-------+-----------+--------+--------+

For each row (CurrentRow) of the table I want to calculate CurrentRow.Value3 = SUM(Value1)/SUM(Value2) over all other rows (OtherRow) in the table where the following conditions are met:

Condition1 = CurrentRow.Category != OtherRow.Category
Condition2 = CurrentRow.List1 intersects OtherRow.List1
Condition3 = CurrentRow.List2 intersects OtherRow.List2 or CurrentRow.List2 = NULL or OtherRow.List2 = NULL
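To make the three conditions concrete, here is a minimal sketch of them as a plain-Python predicate. It assumes the list columns are comma-separated strings as shown in the table, that a missing value is the literal string "NULL", and that the NULL test in Condition3 applies to List2 on both rows (as the May 10 edit states). The names `parse_list` and `matches` are made up for illustration.

```python
from typing import Optional, Set


def parse_list(cell: Optional[str]) -> Optional[Set[str]]:
    """Turn 'A,B,C' into {'A', 'B', 'C'}; treat None or 'NULL' as missing."""
    if cell is None or cell == "NULL":
        return None
    return set(cell.split(","))


def matches(current: dict, other: dict) -> bool:
    """Return True if `other` satisfies all three conditions for `current`."""
    # Condition1: the categories must differ.
    cond1 = current["Category"] != other["Category"]

    # Condition2: the List1 values must share at least one element.
    cur_l1, oth_l1 = parse_list(current["List1"]), parse_list(other["List1"])
    cond2 = cur_l1 is not None and oth_l1 is not None and bool(cur_l1 & oth_l1)

    # Condition3: List2 values intersect, or either side is NULL.
    cur_l2, oth_l2 = parse_list(current["List2"]), parse_list(other["List2"])
    cond3 = cur_l2 is None or oth_l2 is None or bool(cur_l2 & oth_l2)

    return cond1 and cond2 and cond3
```

Because Condition1 requires different categories, a row can never match itself, so no separate "other row" check is needed in this sketch.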

Also, I want to list the Ids of the rows that were involved in the calculation of Value3.

Example for first row:
Condition1: the first row has category value 1. As a result, rows 3-5 meet this condition because their category values are not equal to 1.
Condition2: column "List1" has values "A,B,C", which intersect with values "E,A" (row #3), "E,A" (row #4), "B" (row #5).
Condition3: column "List2" has values "Cat1,Cat2", which intersect with values "Cat1,Cat4" (row #2) and "Cat2" (row #5). We also take row #4 because its "List2" value is NULL.
As a result, we take rows #4 and #5 as they both meet all conditions.
Value3 = (110+100)/(3+6) = 210/9 = 23.33
Ids = "4,5"

The result for table above would be:

+----+----------+-------+-----------+--------+--------+--------+------+
| Id | Category | List1 | List2     | Value1 | Value2 | Value3 | Ids  |
+----+----------+-------+-----------+--------+--------+--------+------+
| 1  | 1        | A,B,C | Cat1,Cat2 | 100    | 5      | 23.33  | 4,5  |
| 2  | 1        | D,F   | Cat1,Cat4 | 120    | 4      | NULL   | NULL |
| 3  | 2        | E,A   | Cat3      | 140    | 2      | NULL   | NULL |
| 4  | 2        | E,A   | NULL      | 110    | 3      | 20     | 1    |
| 5  | 3        | B     | Cat2      | 100    | 6      | 20     | 1    |
+----+----------+-------+-----------+--------+--------+--------+------+
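As a cross-check of the expected result, the table can be reproduced with a brute-force reference computation in plain Python. This is only a sketch for the five sample rows, not a solution for 700K rows: it compares every pair of rows, assumes the list columns are comma-separated strings with "NULL" marking a missing value, and rounds Value3 to two decimals. The names `to_set` and `compute` are hypothetical.

```python
rows = [
    {"Id": 1, "Category": 1, "List1": "A,B,C", "List2": "Cat1,Cat2", "Value1": 100, "Value2": 5},
    {"Id": 2, "Category": 1, "List1": "D,F", "List2": "Cat1,Cat4", "Value1": 120, "Value2": 4},
    {"Id": 3, "Category": 2, "List1": "E,A", "List2": "Cat3", "Value1": 140, "Value2": 2},
    {"Id": 4, "Category": 2, "List1": "E,A", "List2": "NULL", "Value1": 110, "Value2": 3},
    {"Id": 5, "Category": 3, "List1": "B", "List2": "Cat2", "Value1": 100, "Value2": 6},
]


def to_set(cell):
    """Split a comma-separated cell into a set; 'NULL' becomes None."""
    return None if cell == "NULL" else set(cell.split(","))


def compute(table):
    """Return {Id: (Value3, Ids)} using an O(n^2) scan over all row pairs."""
    results = {}
    for cur in table:
        cur_l1, cur_l2 = to_set(cur["List1"]), to_set(cur["List2"])
        picked = []
        for other in table:
            # Condition1: different category (this also skips the row itself).
            if other["Category"] == cur["Category"]:
                continue
            oth_l1, oth_l2 = to_set(other["List1"]), to_set(other["List2"])
            # Condition2: List1 values must intersect.
            if cur_l1 is None or oth_l1 is None or not (cur_l1 & oth_l1):
                continue
            # Condition3: List2 values intersect unless either side is NULL.
            if cur_l2 is not None and oth_l2 is not None and not (cur_l2 & oth_l2):
                continue
            picked.append(other)
        value2_sum = sum(r["Value2"] for r in picked)
        if value2_sum > 0:
            value3 = round(sum(r["Value1"] for r in picked) / value2_sum, 2)
            ids = ",".join(str(r["Id"]) for r in picked)
        else:
            value3, ids = None, None
        results[cur["Id"]] = (value3, ids)
    return results


results = compute(rows)
```

For the sample data, `results[1]` is `(23.33, "4,5")`, rows 2 and 3 get `(None, None)`, and rows 4 and 5 both get `(20.0, "1")`, which agrees with the table above.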

I have tried to do it using pandas in Python:

    import pandas as pd

    # List1 and List2 are assumed to hold Python lists; a missing List2 is ['NULL']
    data = pd.read_pickle('data.df')

    for _index, _record in data.iterrows():
        category = _record["Category"]
        list1 = _record["List1"]
        list2 = _record["List2"]

        # Row-wise masks: every other row is tested against the current row
        candidates = data.loc[
            (data["Category"] != category)
            & data["List1"].apply(lambda other: bool(set(list1) & set(other)))
            & (data["List2"].apply(lambda other: bool(set(list2) & set(other)))
               | (data["List2"].str[0] == 'NULL')
               | (list2[0] == 'NULL'))
        ]

        value1_sum = candidates["Value1"].sum()
        value2_sum = candidates["Value2"].sum()
        if value2_sum > 0:
            data.loc[_index, "Value3"] = value1_sum / value2_sum
            # Store the matching Ids as a comma-separated string, e.g. "4,5"
            data.loc[_index, "Ids"] = ",".join(candidates["Id"].astype(str))

    data.to_csv('result.csv')

It works for a small number of rows, but takes forever on 700K rows. Is there a way to optimize this algorithm?
