sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl This command makes the following:
- Substitute all non alphanumeric characters with a blank space.
- All line breaks are converted to spaces also.
- Reduces all multiple blank spaces to one blank space
- All spaces are now converted to line breaks. Each word in a line.
- Translates all words to lower case to avoid 'Hello' and 'hello' to be different words
- Sorts de text
- Counts and remove the equal lines
- Sorts reverse in order to count the most frequent words
- Add a line number to each word in order to know the word posotion in the whole
For example if I want to analize the first Linus Torvald message:
From: [email protected] (Linus Benedict Torvalds) Newsgroups: comp.os.minix Subject: What would you like to see most in minix? Summary: small poll for my new operating system Message-ID: [email protected] Date: 25 Aug 91 20:57:08 GMT Organization: University of Helsinki
Hello everybody out there using minix –
I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I’d like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).
I’ve currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I’ll get something practical within a few months, and I’d like to know what features most people would want. Any suggestions are welcome, but I won’t promise I’ll implement them 🙂
Linus ([email protected])
PS. Yes – it’s free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that’s all I have :-(.
I create a file named linus.txt, I paste the content and then I write in the console:
sed -e 's/[^[:alpha:]]/ /g' linus.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl The out put would be:
1 7 i 2 5 to 3 5 like 4 5 it 5 5 and 6 4 minix 7 4 a 8 3 torvalds 9 3 of 10 3 helsinki 11 3 fi 12 3 any 13 2 would 14 2 won 15 2 what 16 ... If you want to visualize only the first 20 words:
sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | head -n 20 Is important to note that the command tr 'A-Z' 'a-z' doesn't suport UTF-8 yet, so that in foreign languages the word APRÈS would be translated as aprÈs.
If you only want to search for the occurency of one word you can add a grep at the end:
sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\sword_to_search_for$" In a script called search_freq:
#!/bin/bash sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\s$1$" The script must be called:
search_freq word_to_search_for