I read data from a bunch of emails and count the frequency of each word. First, I construct two counters:

counters.stats = collections.defaultdict(dict)
counters.chi = collections.Counter()

The keys of stats are words. For each word, I construct a dict whose keys are email names and whose values are the frequency of that word in that email.
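For readers following along, the nested structure described above can be sketched as follows. This is only an illustration under assumptions: the `emails` mapping and its contents are hypothetical, and splitting on whitespace stands in for whatever tokenization the real code uses.

```python
import collections

# Hypothetical input: email name -> body text (not from the original post).
emails = {
    "email1": "the cat or the dog",
    "email2": "a cat",
}

# word -> {email name -> frequency of that word in that email}
stats = collections.defaultdict(dict)
for name, body in emails.items():
    # Count the words of one email, then file each count under its word.
    for word, freq in collections.Counter(body.split()).items():
        stats[word][name] = freq

print(dict(stats["cat"]))
```

Because stats is a defaultdict(dict), the inner dict for a word is created the first time that word is seen.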

The keys of chi are the same words as those in stats. I want to sort the keys in stats by the corresponding values in chi. The problem is fixed by:

def print_stats(counters):
    sorted_keys = sorted(counters.stats, key=counters.chi.get)
    # OrderedDict takes an iterable of (key, value) pairs.
    result = collections.OrderedDict((k, counters.stats[k]) for k in sorted_keys)
    for form, cat_to_stats in result.items():
        ...  # loop body not shown in the original post
  • Your specific question aside, if you need chi-square statistics, consider the scipy package: its scipy.stats module provides a chisquare function. Commented May 5, 2012 at 15:54

1 Answer

If I understand you correctly, this should do what you want:

>>> stats = {'a': {'email1':4, 'email2':3},
...          'the': {'email1':2, 'email3':4},
...          'or': {'email1':2, 'email3':1}}
>>> chi = {'a': 7, 'the':6, 'or':3}
>>> sorted(stats, key=chi.get)
['or', 'the', 'a']

Let me know if this works for you. Also, as Boud mentioned above, you should consider numpy/scipy, which would probably provide better performance -- and would definitely provide lots of built-in functionality.

Since you say this doesn't work (for reasons you haven't yet explained), here's a more general example of how to use the key argument. It shows that get works with Counter objects as well as standard dicts, and also how to pass your own function (here, a lambda) as the key:

>>> import collections
>>> stats = {'a': {'email1':4, 'email2':3},
...          'the': {'email1':2, 'email3':4},
...          'or': {'email1':2, 'email3':1}}
>>> wordlists = ([k] * sum(d.itervalues()) for k, d in stats.iteritems())
>>> chi = collections.Counter(word for seq in wordlists for word in seq)
>>> sorted(stats, key=chi.get)
['or', 'the', 'a']
>>> sorted(stats, key=lambda x: chi[x] + 3)
['or', 'the', 'a']
>>> sorted(stats, key=chi.get, reverse=True)
['a', 'the', 'or']
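The session above is Python 2 (itervalues and iteritems no longer exist in Python 3). A rough Python 3 equivalent might look like the sketch below. It builds chi directly from per-word totals rather than from repeated word lists, which is a simplification of mine rather than the answer's approach, and the tie-breaking key at the end is an added illustration.

```python
import collections

stats = {'a': {'email1': 4, 'email2': 3},
         'the': {'email1': 2, 'email3': 4},
         'or': {'email1': 2, 'email3': 1}}

# Total occurrences of each word across all emails.
chi = collections.Counter({word: sum(per_email.values())
                           for word, per_email in stats.items()})

# key= is called once per element; elements are ordered by what it returns.
ascending = sorted(stats, key=chi.get)
descending = sorted(stats, key=chi.get, reverse=True)

# A custom key: highest count first, ties broken alphabetically.
ranked = sorted(stats, key=lambda word: (-chi[word], word))

print(ascending)   # ['or', 'the', 'a']
print(descending)  # ['a', 'the', 'or']
print(ranked)      # ['a', 'the', 'or']
```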

I still don't completely understand what you're looking for, but perhaps you mean to get a sorted list of key, value tuples?

>>> sorted(stats.iteritems(), key=lambda x: chi[x[0]])
[('or', {'email3': 1, 'email1': 2}), ('the', {'email3': 4, 'email1': 2}), ('a', {'email2': 3, 'email1': 4})]

I would actually recommend splitting this up though:

>>> sorted_keys = sorted(stats, key=chi.get)
>>> [(k, stats[k]) for k in sorted_keys]
[('or', {'email3': 1, 'email1': 2}), ('the', {'email3': 4, 'email1': 2}), ('a', {'email2': 3, 'email1': 4})]

You said you want something sorted by the values in chi, but "with the same structure as stats." That's not possible because dictionaries don't have an order; the closest you can come is a sorted list of tuples, or an OrderedDict (in 2.7+).

>>> collections.OrderedDict((k, stats[k]) for k in sorted_keys)
OrderedDict([('or', {'email3': 1, 'email1': 2}), ('the', {'email3': 4, 'email1': 2}), ('a', {'email2': 3, 'email1': 4})])

If you have to frequently reorder the dictionary, this method is kind of pointless.
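The remark that dictionaries have no order describes the Python 2 used in this answer. Since Python 3.7, plain dicts are guaranteed to preserve insertion order, so building one from the sorted keys also keeps them in that order. The sketch below assumes Python 3.7 or later; sorted_items is a hypothetical helper name introduced only for illustration.

```python
import collections

stats = {'a': {'email1': 4, 'email2': 3},
         'the': {'email1': 2, 'email3': 4},
         'or': {'email1': 2, 'email3': 1}}
chi = {'a': 7, 'the': 6, 'or': 3}


def sorted_items(stats, chi):
    """Hypothetical helper: (word, per-email counts) pairs ordered by chi."""
    return [(word, stats[word]) for word in sorted(stats, key=chi.get)]


# An explicit ordered mapping, as in the answer.
ordered = collections.OrderedDict(sorted_items(stats, chi))

# On Python 3.7+, a plain dict built in the same order keeps that order too.
plain = dict(sorted_items(stats, chi))

print(list(ordered))  # ['or', 'the', 'a']
print(list(plain))    # ['or', 'the', 'a']
```

Neither mapping re-sorts itself when chi changes. If the ordering has to track changing counts, calling sorted_items again on demand is probably simpler than maintaining an ordered copy.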

7 Comments

An elegant solution, I must remark. Learnt something new today.
Do you mean I should sort 'chi' first? P.S. I cannot install numpy/scipy. Any suggestions? @senderle
Is get a built-in function? I cannot use it.
@user1325302, what do you mean? get is a built-in method that Counter objects inherit from dict. It does almost exactly what d[key] does, but returns None (or a default you supply) instead of raising a KeyError when the key is missing from a plain dict. When you say "I cannot use it", what do you mean?
@user1325302, you've said twice that it doesn't work, but haven't explained how it doesn't work. As you can see in the code above, it works perfectly. What happens when you try this? Edit your question so that I can actually answer it.
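The difference between get and plain indexing discussed in these comments can be checked directly. The sketch below assumes Python 3; the word 'unseen' and the word list are hypothetical. One detail worth noting: a Counter returns 0 when indexed with a missing key, whereas a plain dict raises KeyError, and get returns None for both.

```python
import collections

chi = collections.Counter({'a': 7, 'the': 6, 'or': 3})

missing_via_get = chi.get('unseen')    # None: get never raises
missing_via_index = chi['unseen']      # 0: Counter supplies a zero count
# dict(chi)['unseen'] would raise KeyError, since a plain dict has no default.

words = ['a', 'the', 'or', 'unseen']

# On Python 3, sorted(words, key=chi.get) would fail here, because None
# cannot be compared with integers. Supplying a default avoids that.
safe = sorted(words, key=lambda word: chi.get(word, 0))

print(missing_via_get, missing_via_index)
print(safe)  # ['unseen', 'or', 'the', 'a']
```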