
I have a large CSV file with data similar to this

    User ID  Group
    ABC      Group1
    DEF      Group2
    ABC      Group3
    GHI      Group4
    XYZ      Group2
    UVW      Group5
    XYZ      Group1
    ABC      Group1
    DEF      Group2

I need to group these items by counting how many times each (user ID, group) combination is repeated, getting a result like this:

    ABC Group1 -> 2
    ABC Group3 -> 1
    DEF Group2 -> 2
    GHI Group4 -> 1
    UVW Group5 -> 1
    XYZ Group2 -> 1
    XYZ Group1 -> 1

Are there any clustering algorithms that can do this?

  • What do you mean by 'clustering' algorithm? A parallelizable algorithm? Commented Aug 18, 2014 at 12:12
  • And how large is your CSV? Can you fit it in memory? Commented Aug 18, 2014 at 12:15
  • CSV file is of 30,000 lines. Might be more in some cases. Commented Aug 18, 2014 at 12:37
  • 30,000 lines should fit in memory nicely on a modern machine... Commented Aug 18, 2014 at 14:52

3 Answers


In your case I would do something like this if you don't want to store all the data in memory:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import com.google.common.collect.Multiset;
    import com.google.common.collect.TreeMultiset;

    public class Tester {

        public static Multiset<String> getMultisetFromCSV(String csvFileName, String lineDelimiter) throws IOException {
            Multiset<String> mapper = TreeMultiset.create();
            BufferedReader reader = null;
            try {
                reader = new BufferedReader(new FileReader(csvFileName));
                String[] currLineSplitted;
                while (reader.ready()) {
                    currLineSplitted = reader.readLine().split(lineDelimiter);
                    mapper.add(currLineSplitted[0] + "-" + currLineSplitted[1]);
                }
                return mapper;
            } finally {
                if (reader != null) {
                    reader.close();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            Multiset<String> set = getMultisetFromCSV("csv", ",");
            for (String key : set.elementSet()) {
                System.out.println(key + " : " + set.count(key));
            }
        }
    }

This way you can construct your map very easily. After that, for each key you can count the number of items associated with it using the count method.


1 Comment

This still keeps most of the data in memory, but for 30K records that won't be a problem. Also, I wouldn't call set.count(key) in the for loop; that makes the printing an O(n*log(n)) operation overall. Iterating with Multiset.Entry<String> entry : set.entrySet() and printing entry.getElement() + " : " + entry.getCount() is O(n).
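The entry-based iteration the comment suggests still requires Guava on the classpath. As a dependency-free sketch of the same idea, a stdlib TreeMap gives the same sorted counting plus a single O(n) pass over its entries (the class and variable names here are illustrative, not from the answer):

```java
import java.util.Map;
import java.util.TreeMap;

public class EntryIteration {
    public static void main(String[] args) {
        // Sample rows standing in for the parsed CSV lines.
        String[] rows = {
            "ABC Group1", "DEF Group2", "ABC Group3", "GHI Group4",
            "XYZ Group2", "UVW Group5", "XYZ Group1", "ABC Group1", "DEF Group2"
        };

        // Count occurrences; a TreeMap keeps keys sorted, like TreeMultiset would.
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            counts.merge(row, 1, Integer::sum);
        }

        // One pass over the entries (O(n)) instead of a count() lookup per key.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```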

A very simple solution would be to use Guava's TreeMultiset: http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/TreeMultiset.html.

Create a class UserGroup with fields userId and group and let it implement Comparable, by comparing first on userId and then on group.

Read in your CSV file, create a UserGroup per line, and add it to the Multiset.

To get the result, iterate over Multiset.entrySet() and print entry.getElement() and entry.getCount() for each entry.

If you get an Out-Of-Memory error and can't assign enough memory, you could use an external (merge) sort: https://code.google.com/p/externalsortinginjava/
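A minimal sketch of the UserGroup idea described above. To keep it dependency-free it uses a stdlib TreeMap<UserGroup, Integer> as a stand-in for Guava's TreeMultiset (a TreeMap orders and deduplicates keys purely via compareTo, so no equals/hashCode is needed); class and variable names are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

public class UserGroupDemo {

    static final class UserGroup implements Comparable<UserGroup> {
        final String userId;
        final String group;

        UserGroup(String userId, String group) {
            this.userId = userId;
            this.group = group;
        }

        // Compare first on userId, then on group, as the answer describes.
        @Override
        public int compareTo(UserGroup other) {
            int byUser = userId.compareTo(other.userId);
            return byUser != 0 ? byUser : group.compareTo(other.group);
        }

        @Override
        public String toString() {
            return userId + " " + group;
        }
    }

    public static void main(String[] args) {
        // Sample lines standing in for the CSV rows.
        String[] lines = {"ABC Group1", "DEF Group2", "ABC Group1", "XYZ Group2"};

        Map<UserGroup, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\\s+");
            counts.merge(new UserGroup(parts[0], parts[1]), 1, Integer::sum);
        }

        // Entries come out sorted by (userId, group).
        for (Map.Entry<UserGroup, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```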

1 Comment

The class UserGroup is not really necessary: you can use a String with userId + " " + group concatenated, provided that userId does not contain any spaces.

With Java 8 you can write something like:

    Map<String, Long> userGroup = Files.lines(csvFile, UTF_8)
            .skip(1)                                 // skip headers
            .map(s -> s.split("\\s+"))               // split on whitespace
            .map(array -> array[0] + " " + array[1]) // user + " " + group
            // collect into a TreeMap, for sorting;
            // the key is the user/group and the value the number of occurrences
            .collect(groupingBy(ug -> ug, TreeMap::new, counting()));

Note: this requires the following static imports: import static java.util.stream.Collectors.counting; and import static java.util.stream.Collectors.groupingBy;
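For context, a self-contained sketch of how the pipeline above might be wired up end to end; the temp-file setup, sample data, and class name are illustrative, not part of the answer:

```java
import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class StreamDemo {
    public static void main(String[] args) throws IOException {
        // Write a small sample file; in practice csvFile points at the real CSV.
        Path csvFile = Files.createTempFile("users", ".csv");
        Files.write(csvFile, Arrays.asList(
                "User ID Group",   // header row, skipped below
                "ABC Group1",
                "DEF Group2",
                "ABC Group1"), UTF_8);

        Map<String, Long> userGroup = Files.lines(csvFile, UTF_8)
                .skip(1)
                .map(s -> s.split("\\s+"))
                .map(array -> array[0] + " " + array[1])
                .collect(groupingBy(ug -> ug, TreeMap::new, counting()));

        // TreeMap iteration order is sorted by key.
        userGroup.forEach((k, v) -> System.out.println(k + " -> " + v));
        // prints:
        // ABC Group1 -> 2
        // DEF Group2 -> 1
    }
}
```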

