8
$\begingroup$

I have a csv file, size around 50Mb. The data is 986132 by 3 matrix of the following form.

data={{a1,b1,1},{a1,b2,1},{a1,b3,1},{a1,b4,1},{a2,b1,1},{a2,b5,1},{a2,b6,1},{a3,b1,1},{a3,b2,1},{a3,b6,1},{a3,b7,1},...{a3012,b1423,1}}

The third column is always 1 and there is no pattern between a's and b's.

In the matrix form of data, I focus on the second column (b's) and want to remove all the rows that has b$i$ that appears less than a certain number. Say it's 10.

So, for example, if b10, b31, b412 each appears less than 10 times in data, I want remove all the rows that contain b11, b31 and b412.

How can I do this? I've tried to convert it to a pivot table and sort it by the number of each b, but my computer can't process it maybe because of the memory limitation..

$\endgroup$

5 Answers 5

5
$\begingroup$

Make some fake data:

SeedRandom[10]; numOccurrences = 5; data = Table[{RandomInteger[{0, 100}], RandomInteger[{1, 100}], 1}, 100]; 

Group the data by the value of b, then select only those values of b for which at least numOccurrences instances are present, then take the values of the resulting association and flatten them back into the desired shape:

Select[Length[#] >= numOccurrences &]@ GroupBy[#[[2]] &]@ data; Values[%]~Flatten~1 (* Out: {{83, 1, 1}, {33, 1, 1}, {27, 1, 1}, {12, 1, 1}, {74, 1, 1}} *) 
$\endgroup$
3
$\begingroup$

To generate data (with a[i] instead of ai but the principle is unchanged):

SeedRandom[2020]; n = 20; indexA = RandomInteger[{1, 4}, n]; indexB = RandomInteger[{1, 4}, n]; data = Table[{a[indexA[[i]]], b[indexB[[i]]], 1}, {i, n}] (* {{a[1], b[4], 1}, {a[1], b[2], 1}, {a[1], b[2], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[3], b[2], 1}, {a[4], b[3], 1}, {a[4], b[1], 1}, {a[3], b[3], 1}, {a[1], b[3], 1}, {a[4], b[3], 1}, {a[4], b[1], 1}, {a[4], b[4], 1}, {a[4], b[4], 1}, {a[2], b[2], 1}, {a[1], b[4], 1}, {a[2], b[1], 1}, {a[4], b[3], 1}, {a[2], b[4], 1}} *) 

Then, store variablesn that appear less than min times in the second colum, and select rows whose second value in not in the list of bad indices badB:

min = 5; badB = Select[Tally[data[[All, 2]]], #[[2]] < min &][[All, 1]] Select[data, MemberQ[badB, #[[2]]] == False &] (* {{a[1], b[4], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[2], b[4], 1}} *) 
$\endgroup$
2
$\begingroup$
ClearAll[pick] pick = Pick[#, Developer`ToPackedArray @ UnitStep[(Counts[#[[All, 2]]] /@ #[[All, 2]]) - #2], 1] &; 

Using data from MarcoB's answer:

SeedRandom[10]; data = Table[{RandomInteger[{0, 100}], RandomInteger[{1, 100}], 1}, 100]; pick[data, 5] 
 {{83, 1, 1}, {33, 1, 1}, {27, 1, 1}, {12, 1, 1}, {74, 1, 1}} 

Note: This approach preserves the ordering of kept rows.

$\endgroup$
1
$\begingroup$
list = {{10, 10, 1}, {7, 9, 1}, {6, 6, 1}, {10, 1, 1}, {10, 6, 1}, {7, 9, 1}, {6, 9, 1}, {7, 5, 1}, {0, 8, 1}, {6, 2, 1}}; 

Using GatherBy

Join @@ Select[Length[#] >= 2 &] @ GatherBy[list, #[[2]] &] 

{{7, 9, 1}, {7, 9, 1}, {6, 9, 1}, {6, 6, 1}, {10, 6, 1}}

$\endgroup$
1
$\begingroup$
list = {{10, 10, 1}, {7, 9, 1}, {6, 6, 1}, {10, 1, 1}, {10, 6, 1}, {7, 9, 1}, {6, 9, 1}, {7, 5, 1}, {0, 8, 1}, {6, 2, 1}}; 

Using SortBy and SplitBy:

Cases[s : {_, __} :> Splice@s]@SplitBy[SortBy[list, #[[2]] &], #[[2]] &] (*{{6, 6, 1}, {10, 6, 1}, {6, 9, 1}, {7, 9, 1}, {7, 9, 1}}*) 
$\endgroup$

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.