Grouping similar strings from a CSV file and finding the number of occurrences

Question

I need to group the columns of a CSV file. Given this input:

    User ID    Group
    ABC        Group1
    DEF        Group2
    ABC        Group3
    GHI        Group4
    XYZ        Group2
    UVW        Group5
    XYZ        Group1
    ABC        Group1
    DEF        Group2

The output should be:

    ABC Group1 -> 2
    ABC Group3 -> 1
    DEF Group2 -> 2
    GHI Group4 -> 1
    UVW Group5 -> 1
    XYZ Group2 -> 1
    XYZ Group1 -> 1

I also need to group the data per user ID. For example, for ABC: (Group1 occurs twice) / (total number of occurrences of ABC) + (Group3 occurs once) / (total number of occurrences of ABC), so ABC --> 2/3 + 1/3.

    ABC --> 2/3 + 1/3    (3 = number of occurrences of ABC)
    DEF --> 2/2
    GHI --> 1/1
    UVW --> 1/1
    XYZ --> 1/2 + 1/2

I got the first set of results using the Guava library:

    Multiset<String> set = TreeMultiset.create();
    BufferedReader reader = null;
    try {
        reader = new BufferedReader(new FileReader("test.csv"));
        String[] currLineSplitted;
        while (reader.ready()) {
            currLineSplitted = reader.readLine().split(",");
            set.add(currLineSplitted[0] + "," + currLineSplitted[1]);
        }
        for (String key : set.elementSet()) {
            System.out.println(key + " : " + set.count(key));
        }
    } finally {
        if (reader != null) {
            reader.close();
        }
    }

I am not sure how to get the second result by grouping.

Very unclear. What do all the numbers mean? What exactly do you want?
– stealthjong, Commented Aug 20, 2014 at 11:46
I don't get the second grouping, could you explain the syntax? What does XYZ-->1/2+1/2 mean? You wrote 2/2(no. of occurrences of ABC) so I guess (but that's not clear) that the second number is the number of occurrences, but what's the first? What does the number of occurrences refer to? Global occurrences or per group?
– Thomas, Commented Aug 20, 2014 at 11:46
A better explanation of the 2nd output would help to give you a solution.
– Deutro, Commented Aug 20, 2014 at 11:48
In ABC-->((Group1 occurs twice)/(total number of occurrences of ABC))+((Group3 occurs once)/(total number of occurrences of ABC)), so ABC-->2/3+1/3.
– Raj, Commented Aug 20, 2014 at 14:26

Community · Accepted Answer · 2017-05-23 12:21:22Z

You should use a map of maps (user ID to per-group counts) instead of a plain set. Something like this:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Map;
    import java.util.TreeMap;

    public class GroupCount {
        public static void main(String[] args) throws IOException {
            // userID -> (group -> number of occurrences); TreeMap keeps the keys sorted
            Map<String, Map<String, Integer>> supermap = new TreeMap<>();

            try (BufferedReader reader = new BufferedReader(new FileReader("test.csv"))) {
                String line;
                // Note: a header row, if present, is counted like any other row
                while ((line = reader.readLine()) != null) {
                    String[] currLineSplitted = line.split(",");
                    if (currLineSplitted.length < 2) {
                        continue; // skip blank or malformed lines
                    }
                    String userID = currLineSplitted[0].trim();
                    String group = currLineSplitted[1].trim();

                    Map<String, Integer> innermap = supermap.get(userID);
                    if (innermap == null) {
                        innermap = new TreeMap<>();
                        supermap.put(userID, innermap);
                    }
                    Integer count = innermap.get(group);
                    // counts start at 1 for the first occurrence
                    innermap.put(group, count == null ? 1 : count + 1);
                }
            }

            //===========first result=============
            for (Map.Entry<String, Map<String, Integer>> user : supermap.entrySet()) {
                for (Map.Entry<String, Integer> group : user.getValue().entrySet()) {
                    System.out.println(user.getKey() + " " + group.getKey() + " : " + group.getValue());
                }
            }

            //===========second result=============
            for (Map.Entry<String, Map<String, Integer>> user : supermap.entrySet()) {
                // the denominator is the total number of occurrences of this user ID,
                // not the number of distinct groups
                int total = 0;
                for (int count : user.getValue().values()) {
                    total += count;
                }
                StringBuilder sb = new StringBuilder();
                sb.append(user.getKey()).append("-->");
                String separator = "";
                for (int count : user.getValue().values()) {
                    sb.append(separator).append(count).append("/").append(total);
                    separator = "+";
                }
                System.out.println(sb.toString());
            }
        }
    }

EDIT: My mistake: the Integer counts must be created with 1 as the start value. The sorting of the entries can be implemented to suit your case, for example with your own Comparator.
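The counting and the per-user fractions can also be written more compactly. The following is only a sketch, assuming Java 8 or later so that `Map.computeIfAbsent` and `Map.merge` are available. The class name `GroupRatios` and the helpers `countByUserAndGroup` and `formatRatios` are made up for illustration, and the lines are passed in as a list instead of being read from test.csv so the example is self-contained.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the original answer.
class GroupRatios {

    // Builds userID -> (group -> occurrences); TreeMap keeps both levels sorted by key.
    static Map<String, Map<String, Integer>> countByUserAndGroup(List<String> lines) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 2) {
                continue; // skip blank or malformed lines
            }
            counts.computeIfAbsent(parts[0].trim(), k -> new TreeMap<>())
                  .merge(parts[1].trim(), 1, Integer::sum); // 1 on first sight, else +1
        }
        return counts;
    }

    // Formats one user as "ABC-->2/3+1/3": each group's count over the user's total.
    static String formatRatios(String userId, Map<String, Integer> groupCounts) {
        int total = 0;
        for (int count : groupCounts.values()) {
            total += count;
        }
        StringBuilder sb = new StringBuilder(userId).append("-->");
        String separator = "";
        for (int count : groupCounts.values()) {
            sb.append(separator).append(count).append('/').append(total);
            separator = "+";
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample rows from the question, without the header row.
        List<String> lines = Arrays.asList(
                "ABC,Group1", "DEF,Group2", "ABC,Group3", "GHI,Group4", "XYZ,Group2",
                "UVW,Group5", "XYZ,Group1", "ABC,Group1", "DEF,Group2");
        Map<String, Map<String, Integer>> counts = countByUserAndGroup(lines);
        for (Map.Entry<String, Map<String, Integer>> user : counts.entrySet()) {
            System.out.println(formatRatios(user.getKey(), user.getValue()));
        }
        // prints:
        // ABC-->2/3+1/3
        // DEF-->2/2
        // GHI-->1/1
        // UVW-->1/1
        // XYZ-->1/2+1/2
    }
}
```

Because the inner maps are sorted, the fractions for one user appear in group-name order (Group1 before Group3), which may differ from the order of the rows in the file.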

This is what I get from the above approach:

    XYZGroup1 : 0
    ABCGroup1 : 0
    DEFGroup2 : 0
    GHIGroup4 : 0
    UVWGroup5 : 0
    XYZ-->0/1
    ABC-->0/1
    DEF-->0/1
    GHI-->0/1
    UVW-->0/1
