edited Dec 31, 2016 at 16:30

sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl

This command does the following:

Replaces every non-alphabetic character (digits and punctuation included) with a space.
Converts all line breaks to spaces as well.
Squeezes each run of spaces into a single space.
Converts all spaces to line breaks, so that each word is on its own line.
Converts all words to lower case, so that 'Hello' and 'hello' are not counted as different words.
Sorts the words alphabetically.
Collapses identical lines and prefixes each one with its number of occurrences.
Sorts numerically in reverse order, so that the most frequent words come first.
Adds a line number to each word, showing its position in the overall ranking.
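The steps above can also be written one per line, which makes the pipeline easier to read and to modify. This is only a sketch of the same command: the function name word_freq is made up for illustration, and it reads the text from standard input instead of a named file.

```shell
#!/bin/bash
# word_freq (hypothetical name): reads text on stdin and prints
# "rank  count  word", most frequent word first.
word_freq() {
    sed -e 's/[^[:alpha:]]/ /g' |  # replace every non-letter with a space
    tr '\n' ' ' |                  # join all lines into one
    tr -s ' ' |                    # squeeze runs of spaces into one space
    tr ' ' '\n' |                  # put each word on its own line
    tr 'A-Z' 'a-z' |               # lowercase (ASCII letters only)
    sort |                         # bring identical words together
    uniq -c |                      # collapse duplicates, prefix with count
    sort -nr |                     # highest count first
    nl                             # number the lines (rank)
}
```

It could then be used as `word_freq < linus.txt`, or `word_freq < linus.txt | head -n 20` for the top of the list.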

For example, if I want to analyze Linus Torvalds' first message:

From: [email protected] (Linus Benedict Torvalds)
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Summary: small poll for my new operating system
Message-ID: [email protected]
Date: 25 Aug 91 20:57:08 GMT
Organization: University of Helsinki

Hello everybody out there using minix –

I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I’d like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I’ve currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I’ll get something practical within a few months, and I’d like to know what features most people would want. Any suggestions are welcome, but I won’t promise I’ll implement them 🙂

Linus ([email protected])

PS. Yes – it’s free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that’s all I have :-(.

I create a file named linus.txt, paste the content into it, and then run this in the console:

sed -e 's/[^[:alpha:]]/ /g' linus.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl

The output would be:

 1  7 i
 2  5 to
 3  5 like
 4  5 it
 5  5 and
 6  4 minix
 7  4 a
 8  3 torvalds
 9  3 of
10  3 helsinki
11  3 fi
12  3 any
13  2 would
14  2 won
15  2 what
16  ...

If you want to see only the 20 most frequent words:

sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | head -n 20

It is important to note that tr 'A-Z' 'a-z' only converts ASCII letters. The tr command (at least the GNU implementation) works byte by byte and doesn't support UTF-8 yet, so in other languages a word such as APRÈS would be converted to aprÈs.
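One possible workaround is to do the lowercasing with awk's tolower instead of tr. This is a sketch under an assumption: it only helps if your awk is locale-aware (GNU awk in a UTF-8 locale should be; other implementations such as mawk may still convert ASCII letters only). The function name to_lower is made up for illustration.

```shell
#!/bin/bash
# to_lower (hypothetical name): lowercases stdin line by line.
# ASCII letters are converted by any awk; accented letters are only
# converted if the awk implementation and the current locale support it.
to_lower() {
    awk '{ print tolower($0) }'
}
```

In the pipeline it would take the place of the tr 'A-Z' 'a-z' stage.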

If you only want to look up the frequency of a single word, you can add a grep at the end:

sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\sword_to_search_for$"

In a script called search_freq:

#!/bin/bash
sed -e 's/[^[:alpha:]]/ /g' text_to_analize.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\s$1$"

The script must be called like this:

 search_freq word_to_search_for
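A variant of the script that also takes the file name as an argument might look like the sketch below. The function name search_freq_in and its argument order are made up for illustration. It uses awk to compare the word column exactly instead of grep with \s, since \s is a GNU extension that may not be available in every grep.

```shell
#!/bin/bash
# search_freq_in (hypothetical name): usage: search_freq_in FILE WORD
# Prints the "rank  count  word" row for WORD, or nothing if WORD is absent.
search_freq_in() {
    local file="$1" word="$2"
    sed -e 's/[^[:alpha:]]/ /g' "$file" | tr '\n' ' ' | tr -s ' ' | tr ' ' '\n' |
        tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl |
        awk -v w="$word" '$3 == w'   # keep only the row whose third column is WORD
}
```

For example, `search_freq_in linus.txt minix` would print the rank and count of "minix" in linus.txt. The word has to be given in lower case, because the pipeline lowercases the text before counting.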

answered Dec 26, 2016 at 21:12

Roger Borrell