Removing rows whose elements appear less than a certain number

Question

I have a csv file, size around 50Mb. The data is 986132 by 3 matrix of the following form.

data={{a1,b1,1},{a1,b2,1},{a1,b3,1},{a1,b4,1},{a2,b1,1},{a2,b5,1},{a2,b6,1},{a3,b1,1},{a3,b2,1},{a3,b6,1},{a3,b7,1},...{a3012,b1423,1}}

The third column is always 1 and there is no pattern between a's and b's.

In the matrix form of data, I focus on the second column (b's) and want to remove all the rows that has b$i$ that appears less than a certain number. Say it's 10.

So, for example, if b10, b31, b412 each appears less than 10 times in data, I want remove all the rows that contain b11, b31 and b412.

How can I do this? I've tried to convert it to a pivot table and sort it by the number of each b, but my computer can't process it maybe because of the memory limitation..

MarcoB · Accepted Answer · 2020-05-01 17:28:58Z

Make some fake data:

SeedRandom[10]; numOccurrences = 5; data = Table[{RandomInteger[{0, 100}], RandomInteger[{1, 100}], 1}, 100];

Group the data by the value of b, then select only those values of b for which at least numOccurrences instances are present, then take the values of the resulting association and flatten them back into the desired shape:

Select[Length[#] >= numOccurrences &]@ GroupBy[#[[2]] &]@ data; Values[%]~Flatten~1 (* Out: {{83, 1, 1}, {33, 1, 1}, {27, 1, 1}, {12, 1, 1}, {74, 1, 1}} *)

anderstood · Accepted Answer · 2020-05-01 17:12:08Z

To generate data (with a[i] instead of ai but the principle is unchanged):

SeedRandom[2020]; n = 20; indexA = RandomInteger[{1, 4}, n]; indexB = RandomInteger[{1, 4}, n]; data = Table[{a[indexA[[i]]], b[indexB[[i]]], 1}, {i, n}] (* {{a[1], b[4], 1}, {a[1], b[2], 1}, {a[1], b[2], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[3], b[2], 1}, {a[4], b[3], 1}, {a[4], b[1], 1}, {a[3], b[3], 1}, {a[1], b[3], 1}, {a[4], b[3], 1}, {a[4], b[1], 1}, {a[4], b[4], 1}, {a[4], b[4], 1}, {a[2], b[2], 1}, {a[1], b[4], 1}, {a[2], b[1], 1}, {a[4], b[3], 1}, {a[2], b[4], 1}} *)

Then, store variablesn that appear less than min times in the second colum, and select rows whose second value in not in the list of bad indices badB:

min = 5; badB = Select[Tally[data[[All, 2]]], #[[2]] < min &][[All, 1]] Select[data, MemberQ[badB, #[[2]]] == False &] (* {{a[1], b[4], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[4], b[4], 1}, {a[4], b[4], 1}, {a[1], b[4], 1}, {a[2], b[4], 1}} *)

kglr · Accepted Answer · 2020-05-02 12:20:11Z

ClearAll[pick] pick = Pick[#, Developer`ToPackedArray @ UnitStep[(Counts[#[[All, 2]]] /@ #[[All, 2]]) - #2], 1] &;

Using data from MarcoB's answer:

SeedRandom[10]; data = Table[{RandomInteger[{0, 100}], RandomInteger[{1, 100}], 1}, 100]; pick[data, 5]

 {{83, 1, 1}, {33, 1, 1}, {27, 1, 1}, {12, 1, 1}, {74, 1, 1}}

Note: This approach preserves the ordering of kept rows.

eldo · Accepted Answer · 2024-02-15 00:45:08Z

list = {{10, 10, 1}, {7, 9, 1}, {6, 6, 1}, {10, 1, 1}, {10, 6, 1}, {7, 9, 1}, {6, 9, 1}, {7, 5, 1}, {0, 8, 1}, {6, 2, 1}};

Using GatherBy

Join @@ Select[Length[#] >= 2 &] @ GatherBy[list, #[[2]] &]

{{7, 9, 1}, {7, 9, 1}, {6, 9, 1}, {6, 6, 1}, {10, 6, 1}}

E. Chan-López · Accepted Answer · 2024-02-15 02:32:09Z

list = {{10, 10, 1}, {7, 9, 1}, {6, 6, 1}, {10, 1, 1}, {10, 6, 1}, {7, 9, 1}, {6, 9, 1}, {7, 5, 1}, {0, 8, 1}, {6, 2, 1}};

Using SortBy and SplitBy:

Cases[s : {_, __} :> Splice@s]@SplitBy[SortBy[list, #[[2]] &], #[[2]] &] (*{{6, 6, 1}, {10, 6, 1}, {6, 9, 1}, {7, 9, 1}, {7, 9, 1}}*)

Stack Exchange Network

Removing rows whose elements appear less than a certain number

5 Answers 5

Hot Network Questions

Removing rows whose elements appear less than a certain number

5 Answers 5

Related

Hot Network Questions