1

I have a method that counts the frequency of words in a string. I am manually including some words that should be deleted. What I've found is that for short strings, 'the' is removed...for longer strings such as the one below, the method still prints 'the'. Any ideas on why this is and how to fix it?

def count_words(string) words = string.downcase.split(' ') delete_list = ['the'] delete_list.each do |del| words.delete_at(words.index(del)) end frequency = Hash.new(0) words.each do |word| frequency[word.downcase] += 1 end return frequency.sort_by {|k,v| v}.reverse end puts count_words('Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking politics have gotten out of control since gates left the building.. sales metrics often do not reflect the contributions of the role, which demonstrates that line management is out of touch of what the individual contributors role really does middle management does not care about the career of his/her directs, 90% of the time management competes directly with their people, or takes credit for their work lots of back stabbing going on Microsoft changes the organization or commitment or comp model, faster than the average deal cycle, making it next to near impossible to develop momentum in role or a rhythm of success execs promote themselves in years when they freeze employees merit increases only way to advance is to step on your peers/colleagues and take credit for work you had no impact on, beat your chest loud enough and you get "visibility" you need to advance visibility is not based on performance by enlarge, it is based on being in your manager\'s swim lane for advancement I have observed people get promoted in years when they did not meet their quota, nor did the earn the highest performance on the team, they kissed their way to the promotion Advice to Senior Management 1, get back to risk taking and teaming, less politics please, you are killing the company 2, set realistic commitments and stick to them for multiple years, stop changing the game faster than your people can react 3, stop over engineering commitments and over segmenting the company, people are not willing to collaborate or be corporate citizens 4, too many empty suits in middle management, keep flattening out the company and getting rid of middle managers that run reports all day, get back to a culture where managers also sell and drives wins 5, keep your word microsoft, you said stability, but you keep tinkering with the org too much for any changes to take affect A great Culture Limitless opportunities Supportive Management team who are passionate about people A company that really does want you to have a good work life balance and backs it up with policies that enable you to manage how and where you work. Cons Support resources are constrained Can be overly competitve and hard to get noticed Sales rewards are definitely prioritised and marketing cuts are always prioritised. Consumer organisation is still far from ideal. Advice to Senior Management Focus on getting the internal organisation simplified to improve performance and increase empowerment. Get some REAL consumer focus and invest for the long term Start connecting with people, focussing on telling stories rather than selling products.') 
3
  • Couple quick points; you don't need frequency[word.downcase] as words are already downcased. You should also use each_with_object instead of your words.each loop. See my answer. Commented Apr 10, 2013 at 2:00
  • This happens because you have more than one "the" in the array, so when you say delete_at(words.index(del)), it just deletes the first occurrence. To fix it, do what @meagar said. Commented Apr 10, 2013 at 2:08
  • 2
    When you create a question, don't put more data than is absolutely necessary. Your question is difficult to read simply because of all the visual noise from too much unnecessary text. Commented Apr 10, 2013 at 2:44

2 Answers 2

1

Just use words.delete("the"). All you need to do is give it the key.

The simpler version of your program would be:

def count_words(string) words = string.downcase.split(' ').each_with_object(Hash.new(0)) { |w,o| o[w] += 1 } delete_list = ['the'] delete_list.each { |del| words.delete(del) } frequency.sort_by {|k,v| v}.reverse end 
Sign up to request clarification or add additional context in comments.

Comments

1

This is a very common problem when analyzing web pages for SEO. Here's a quick version as I'd write it:

require 'pp' STOP_WORDS = %w[a and of the] def count_words(string) word_count = string .downcase .gsub(/[^a-z ]+/, '') .split .group_by{ |w| w } STOP_WORDS.each do |stop_word| word_count.delete(stop_word) end word_count .map{ |k,v| [k, v.size]} .sort_by{ |k, c| [-c, k] } end pp count_words(<<EOT) Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking politics have gotten out of control since gates left the building.. Start connecting with people, focussing on telling stories rather than selling products. EOT 

I purposely truncated the sample data for readability.

On that topic, you can use here-to ("<<") to improve the formatting of the code when you have to pass in a lot of text. An alternate is to insert an __END__ marker and put it all after it, then use the special IO object DATA to read that trailing block:

pp count_words(DATA.read) __END__ Pros great benefits fair compensation reasonable time off Cons middle management are empty suits, void of vision and very little risk taking politics have gotten out of control since gates left the building.. Start connecting with people, focussing on telling stories rather than selling products. 

In either case, the code outputs:

 [["of", 2], ["and", 1], ["are", 1], ["benefits", 1], ["buildingstart", 1], ["compensation", 1], ["connecting", 1], ["cons", 1], ["control", 1], ["empty", 1], ["fair", 1], ["focussing", 1], ["gates", 1], ["gotten", 1], ["great", 1], ["have", 1], ["left", 1], ["little", 1], ["management", 1], ["middle", 1], ["off", 1], ["on", 1], ["out", 1], ["people", 1], ["products", 1], ["pros", 1], ["rather", 1], ["reasonable", 1], ["risk", 1], ["selling", 1], ["since", 1], ["stories", 1], ["suits", 1], ["takingpolitics", 1], ["telling", 1], ["than", 1], ["time", 1], ["very", 1], ["vision", 1], ["void", 1], ["with", 1]] 

gsub(/[^a-z ]+/, '') strips anything that isn't a letter or a space. Enumerable's group_by is doing the heavy-lifting. Also, Enumerable's sort_by makes it easy to do the reversed sort by the count and word.

I use a hash instead of an array when removing stop words because iterating over the STOP_WORD list is usually faster than trying to iterate over the words in the corpus. A big corpus is very likely to have many more words in it than stop words.

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.