
I have a large CSV file with data similar to this

    User ID  Group
    ABC      Group1
    DEF      Group2
    ABC      Group3
    GHI      Group4
    XYZ      Group2
    UVW      Group5
    XYZ      Group1
    ABC      Group1
    DEF      Group2

I need to group these items by counting how many times each (user ID, group) combination is repeated, getting a result like this:

    ABC Group1 -> 2
    ABC Group3 -> 1
    DEF Group2 -> 2
    GHI Group4 -> 1
    UVW Group5 -> 1
    XYZ Group2 -> 1
    XYZ Group1 -> 1

Are there any clustering algorithms that can do this?

  • What do you mean by 'clustering' algorithm? A parallelizable algorithm? Commented Aug 18, 2014 at 12:12
  • And how large is your CSV? Can you fit it in memory? Commented Aug 18, 2014 at 12:15
  • CSV file is of 30,000 lines. Might be more in some cases. Commented Aug 18, 2014 at 12:37
  • 30,000 lines should fit in memory nicely on a modern machine... Commented Aug 18, 2014 at 14:52

3 Answers


In your case I would do something like this if you don't want to store all the data in memory:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    import com.google.common.collect.Multiset;
    import com.google.common.collect.TreeMultiset;

    public class Tester {

        public static Multiset<String> getMultisetFromCSV(String csvFileName, String lineDelimiter) throws IOException {
            Multiset<String> mapper = TreeMultiset.create();
            BufferedReader reader = null;
            try {
                reader = new BufferedReader(new FileReader(csvFileName));
                String[] currLineSplitted;
                while (reader.ready()) {
                    currLineSplitted = reader.readLine().split(lineDelimiter);
                    mapper.add(currLineSplitted[0] + "-" + currLineSplitted[1]);
                }
                return mapper;
            } finally {
                if (reader != null) {
                    reader.close();
                }
            }
        }

        public static void main(String[] args) throws IOException {
            Multiset<String> set = getMultisetFromCSV("csv", ",");
            for (String key : set.elementSet()) {
                System.out.println(key + " : " + set.count(key));
            }
        }
    }

This way you can construct your map very easily. After that, for each key you can count the number of items associated with it using the count method.


1 Comment

This still keeps most of the data in memory, but for 30K records that won't be a problem. Also, I wouldn't call set.count(key) in the for loop; that makes the printing an O(n*log(n)) operation overall. Iterating with Multiset.Entry<String> entry : set.entrySet() and printing entry.getElement() + " : " + entry.getCount() is O(n).
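The entry-based iteration the comment suggests still requires Guava on the classpath. As a dependency-free sketch of the same idea, a stdlib TreeMap gives the same sorted counting plus a single O(n) pass over its entries (the class and variable names here are illustrative, not from the answer):

```java
import java.util.Map;
import java.util.TreeMap;

public class EntryIteration {
    public static void main(String[] args) {
        // Sample rows standing in for the parsed CSV lines.
        String[] rows = {
            "ABC Group1", "DEF Group2", "ABC Group3", "GHI Group4",
            "XYZ Group2", "UVW Group5", "XYZ Group1", "ABC Group1", "DEF Group2"
        };

        // Count occurrences; a TreeMap keeps keys sorted, like TreeMultiset would.
        Map<String, Integer> counts = new TreeMap<>();
        for (String row : rows) {
            counts.merge(row, 1, Integer::sum);
        }

        // One pass over the entries (O(n)) instead of a count() lookup per key.
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```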

A very simple solution would be to use Guava's TreeMultiset: http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/collect/TreeMultiset.html.

Create a class UserGroup with fields userId and group and let it implement Comparable, by comparing first on userId and then on group.

Read in your CSV file, create a UserGroup per line, and add it to the Multiset.

To get the result, iterate over Multiset.entrySet() and print entry.getElement() and entry.getCount() for each entry.

If you get an Out-Of-Memory error and can't assign enough memory, you could use an external (merge) sort: https://code.google.com/p/externalsortinginjava/
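A minimal sketch of the UserGroup idea described above. To keep it dependency-free it uses a stdlib TreeMap<UserGroup, Integer> as a stand-in for Guava's TreeMultiset (a TreeMap orders and deduplicates keys purely via compareTo, so no equals/hashCode is needed); class and variable names are illustrative:

```java
import java.util.Map;
import java.util.TreeMap;

public class UserGroupDemo {

    static final class UserGroup implements Comparable<UserGroup> {
        final String userId;
        final String group;

        UserGroup(String userId, String group) {
            this.userId = userId;
            this.group = group;
        }

        // Compare first on userId, then on group, as the answer describes.
        @Override
        public int compareTo(UserGroup other) {
            int byUser = userId.compareTo(other.userId);
            return byUser != 0 ? byUser : group.compareTo(other.group);
        }

        @Override
        public String toString() {
            return userId + " " + group;
        }
    }

    public static void main(String[] args) {
        // Sample lines standing in for the CSV rows.
        String[] lines = {"ABC Group1", "DEF Group2", "ABC Group1", "XYZ Group2"};

        Map<UserGroup, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("\\s+");
            counts.merge(new UserGroup(parts[0], parts[1]), 1, Integer::sum);
        }

        // Entries come out sorted by (userId, group).
        for (Map.Entry<UserGroup, Integer> e : counts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```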

1 Comment

The class UserGroup is not really necessary: you can use a String with userId + " " + group concatenated, provided that userId does not contain any spaces.

With Java 8 you can write something like:

    Map<String, Long> userGroup = Files.lines(csvFile, UTF_8)
            .skip(1)                                 // skip headers
            .map(s -> s.split("\\s+"))               // split on whitespace
            .map(array -> array[0] + " " + array[1]) // user + " " + group
            // collect into a TreeMap, for sorting;
            // the key is the user/group and the value the number of occurrences
            .collect(groupingBy(ug -> ug, TreeMap::new, counting()));

Note: this requires the following static imports: import static java.util.stream.Collectors.counting; and import static java.util.stream.Collectors.groupingBy;
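For context, a self-contained sketch of how the pipeline above might be wired up end to end; the temp-file setup, sample data, and class name are illustrative, not part of the answer:

```java
import static java.nio.charset.StandardCharsets.UTF_8;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class StreamDemo {
    public static void main(String[] args) throws IOException {
        // Write a small sample file; in practice csvFile points at the real CSV.
        Path csvFile = Files.createTempFile("users", ".csv");
        Files.write(csvFile, Arrays.asList(
                "User ID Group",   // header row, skipped below
                "ABC Group1",
                "DEF Group2",
                "ABC Group1"), UTF_8);

        Map<String, Long> userGroup = Files.lines(csvFile, UTF_8)
                .skip(1)
                .map(s -> s.split("\\s+"))
                .map(array -> array[0] + " " + array[1])
                .collect(groupingBy(ug -> ug, TreeMap::new, counting()));

        // TreeMap iteration order is sorted by key.
        userGroup.forEach((k, v) -> System.out.println(k + " -> " + v));
        // prints:
        // ABC Group1 -> 2
        // DEF Group2 -> 1
    }
}
```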

