
I have a large JSON file that is all on one line, and I want to count the number of occurrences of a word in the file from the command line. How can I do that?

  • It is unclear whether the word should be matched in both keys and values of the JSON data, i.e. whether { "key": "the key" } should count the string key once or twice. Commented Apr 4, 2019 at 5:32
  • If your JSON is "on one line", try piping it through jq -r and redirecting output to a new file. Commented May 5, 2023 at 16:30

8 Answers

$ tr ' ' '\n' < FILE | grep WORD | wc -l 

Here, tr replaces spaces with newlines, grep filters the resulting lines for WORD, and wc counts the ones that remain.
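
For example, on a hypothetical two-line input (the principle is identical for one long line):

$ printf 'foo bar foo\nbar foo\n' | tr ' ' '\n' | grep foo | wc -l
3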

One can even drop the wc part by using the -c option of grep:

$ tr ' ' '\n' < FILE | grep -c WORD 

The -c option is defined by POSIX.

If it is not guaranteed that there are spaces between the words, you have to replace some other character that acts as a delimiter. For example, alternative tr stages are

tr '"' '\n' 

or

tr "'" '\n' 

if you want to replace double or single quotes. Of course, you can also use tr to replace multiple characters at once (think different kinds of whitespace and punctuation).
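
For instance, a sketch that treats common JSON punctuation plus the space character as delimiters in one go (file.json and WORD are placeholders):

$ tr '"{}[]:, ' '\n' < file.json | grep -c -x 'WORD'

Here -x makes grep count only lines that consist of WORD exactly.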

In case you need to count WORD but not prefixWORD, WORDsuffix or prefixWORDsuffix, you can enclose the WORD pattern in begin/end-of-line markers:

grep -c '^WORD$' 

In our context, this is equivalent to using word-boundary markers:

grep -c '\<WORD\>' 
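
A quick check of the difference, with GNU grep:

$ printf 'WORD\nprefixWORD\nWORDsuffix\n' | grep -c 'WORD'
3
$ printf 'WORD\nprefixWORD\nWORDsuffix\n' | grep -c '\<WORD\>'
1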
  • What if there are no spaces, i.e. the field name is surrounded by quotes? e.g. "field" Commented Sep 19, 2010 at 16:42
  • @mythz: Then you replace the quotes with newlines using tr. I'll update the answer. Commented Sep 19, 2010 at 16:45
  • This answer is incorrect in many ways. It is vague: you should explain how to come up with a tr command that does the job instead of suggesting examples that will not work in all situations. It will also match words that contain the word you are looking for. The grep -o '\<WORD\>' | wc -l solution is far superior. Commented Apr 9, 2011 at 2:28
  • @Sam, the question leaves it open whether a searched word should be matched as 'WORD' or as '\<WORD\>' - you can read it both ways. Even if you read it only in the second way, my answer would be incorrect in just one way. ;) And the grep -o solution is only superior if grep supports the -o option - which is not specified by POSIX ... And I don't think the use of tr is so exotic as to be called vague ... Commented May 6, 2011 at 21:01
  • @Kusalananda, well, it's still an occurrence. But if you don't want to count such substring matches, then please read the last paragraph of my answer and my previous comment here. Commented Apr 4, 2019 at 9:10

With GNU grep, this works: grep -o '\<WORD\>' | wc -l

-o prints each matched part of each line on a separate line.

\< asserts the start of a word and \> asserts the end of a word (similar to Perl's \b), so this ensures that you're not matching a string in the middle of a word.

For example,

$ python -c 'import this' | grep '\<one\>'
There should be one-- and preferably only one --obvious way to do it.
Namespaces are one honking great idea -- let's do more of those!
$ python -c 'import this' | grep -o '\<one\>'
one
one
one
$ python -c 'import this' | grep -o '\<one\>' | wc -l
3
  • Or just grep -wo WORD | wc -l Commented May 10, 2018 at 9:15

This unfortunately does not work with GNU grep.

grep -o -c WORD file 

If it works on your platform, it's an elegant and fairly intuitive solution; but the GNU folks are still thinking.
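You can see the issue with a quick test: with GNU grep, -c keeps counting matching lines even when -o is given, so two occurrences on one line count as one:

$ printf 'WORD WORD\n' | grep -o -c WORD
1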

  • My bad, the bug is still open: savannah.gnu.org/bugs/?33080 Commented Jan 13, 2016 at 10:24
  • Too bad, this would have been the most elegant. Commented Aug 31, 2016 at 12:04
  • This worked for me! Commented Mar 20, 2017 at 15:50
  • This is wrong. This counts the number of lines with the pattern WORD. The OP wants the total number of occurrences. Commented May 9, 2018 at 20:40
  • @PierreB That's why I'm saying GNU grep has a bug here. It's not clear from POSIX what the semantics of combining -c and -o should be, so this is currently not portable. Thanks for the comment; I have updated this answer. Commented May 10, 2018 at 6:38
sed -e 's/[^[:alpha:]]/ /g' text_to_analyze.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl 

This command does the following:

  1. Substitutes all non-alphabetic characters with a blank space.
  2. Converts all line breaks into spaces as well.
  3. Squeezes runs of multiple spaces down to a single space.
  4. Converts all spaces into line breaks, so that each word is on its own line.
  5. Translates all words to lower case, so that 'Hello' and 'hello' are not counted as different words.
  6. Sorts the text.
  7. Counts identical lines and collapses them into one (uniq -c).
  8. Sorts in reverse numeric order, so the most frequent words come first.
  9. Adds a line number to each word, giving its rank in the frequency list.

For example, if I want to analyze the first Linus Torvalds message:

From: [email protected] (Linus Benedict Torvalds)
Newsgroups: comp.os.minix
Subject: What would you like to see most in minix?
Summary: small poll for my new operating system
Message-ID: [email protected]
Date: 25 Aug 91 20:57:08 GMT
Organization: University of Helsinki

Hello everybody out there using minix –

I’m doing a (free) operating system (just a hobby, won’t be big and professional like gnu) for 386(486) AT clones. This has been brewing since april, and is starting to get ready. I’d like any feedback on things people like/dislike in minix, as my OS resembles it somewhat (same physical layout of the file-system (due to practical reasons) among other things).

I’ve currently ported bash(1.08) and gcc(1.40), and things seem to work. This implies that I’ll get something practical within a few months, and I’d like to know what features most people would want. Any suggestions are welcome, but I won’t promise I’ll implement them 🙂

Linus ([email protected])

PS. Yes – it’s free of any minix code, and it has a multi-threaded fs. It is NOT protable (uses 386 task switching etc), and it probably never will support anything other than AT-harddisks, as that’s all I have :-(.

I create a file named linus.txt, paste in the content, and then type in the console:

sed -e 's/[^[:alpha:]]/ /g' linus.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl 

The output would be:

 1  7  i
 2  5  to
 3  5  like
 4  5  it
 5  5  and
 6  4  minix
 7  4  a
 8  3  torvalds
 9  3  of
10  3  helsinki
11  3  fi
12  3  any
13  2  would
14  2  won
15  2  what
16  ...

If you want to visualize only the first 20 words:

sed -e 's/[^[:alpha:]]/ /g' text_to_analyze.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | head -n 20 

It is important to note that the command tr 'A-Z' 'a-z' doesn't support UTF-8 yet, so in other languages a word such as APRÈS would be translated as aprÈs.
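
If that matters for your data, one workaround (my own substitution, not part of the original recipe) is to do the lowercasing step with awk instead, whose tolower() honors the locale in GNU awk under a UTF-8 locale:

$ echo 'APRÈS' | awk '{ print tolower($0) }'
après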

If you only want to count the occurrences of a single word, you can add a grep at the end:

sed -e 's/[^[:alpha:]]/ /g' text_to_analyze.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\sword_to_search_for$" 

In a script called search_freq:

#!/bin/bash
sed -e 's/[^[:alpha:]]/ /g' text_to_analyze.txt | tr '\n' " " | tr -s " " | tr " " '\n'| tr 'A-Z' 'a-z' | sort | uniq -c | sort -nr | nl | grep "\s$1$"

The script is called like this:

 search_freq word_to_search_for 
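
For example, with the linus.txt contents saved as text_to_analyze.txt (the file name the script hardcodes) and the script made executable with chmod +x, searching for minix prints something like:

$ ./search_freq minix
     6       4 minix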
  • sed: -e expression #2, char 7: unterminated `s' command - also, this counts all words, right? But the OP asked only about a particular one. A bit of explanation would be nice, too. Commented Dec 26, 2016 at 21:19
  • Sorry, I had made a mistake. I have redone the command and added an explanation to the answer. In my opinion, from the question it's impossible to know whether he wants the occurrence count of only one word or a frequency table of all the words. But in case you want only one word, you can add a grep at the end. Commented Dec 27, 2016 at 9:29

Depending on whether you'd like to match the word in the keys or in the values of the JSON data, you are likely to want to extract only keys or only values from the data. Otherwise you may count some words too many times if they occur as both keys and values.

To extract all keys:

jq -r '..|objects|keys[]' <file.json 

This recursively tests whether the current value is an object and, if it is, extracts its keys. The output will be a list of keys, one per line.
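
For example, on a small nested document:

$ printf '{"a":{"b":1},"c":[{"d":2}]}' | jq -r '..|objects|keys[]'
a
c
b
d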

To extract all values:

jq -r '..|scalars' <file.json 

This works in a similar way, but has fewer steps.

You may then pipe the output of the above through grep -c 'PATTERN' (to match some pattern against the keys or values), or grep -c -w -F 'WORD' (to match a word in the keys or values), or grep -c -x -F 'WORD' (to match a complete key or value), or similar, to do your counting.
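
As a worked example, take the document from the first comment on the question:

$ printf '{ "key": "the key" }' | jq -r '..|scalars' | grep -c -w -F 'key'
1
$ printf '{ "key": "the key" }' | jq -r '..|objects|keys[]' | grep -c -w -F 'key'
1

Counting only values or only keys yields 1 each, whereas a naive count over the raw text would yield 2.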


I have JSON with something like this: "number":"OK","number":"OK" repeated multiple times on one line.

My simple "OK" counter:

sed "s|,|\n|g" response | grep -c OK


Using Raku (formerly known as Perl_6)

~$ curl https://www.gutenberg.org/cache/epub/5/pg5.txt > US_Constitution.txt 

THEN:

Below, grep followed by elems gives the count per "examined unit" of text: for slurp the unit is the entire file, for lines it is a line, and for words a word:

~$ raku -e 'slurp.grep(/ :i the /).elems.put;' US_Constitution.txt
1
~$ raku -e 'lines.grep(/ :i the /).elems.put;' US_Constitution.txt
443
~$ raku -e 'words.grep(/ :i the /).elems.put;' US_Constitution.txt
681

Below, match followed by elems gives the count of matches. The "examined unit" doesn't matter here, so slurp, lines, and words all return the same count:

~$ raku -e 'slurp.match(:global, / :i the /).elems.put;' US_Constitution.txt
681
~$ raku -e 'lines.match(:global, / :i the /).elems.put;' US_Constitution.txt
681
~$ raku -e 'words.match(:global, / :i the /).elems.put;' US_Constitution.txt
681

The regex can be improved to match only the free-standing word "the", as opposed to that three-character sequence appearing within other words such as "these" and "bathe". General word boundaries are denoted with either <|w> or <?wb>. Alternatively, you can be even more specific and use << to denote a left word boundary and/or >> to denote a right word boundary:

~$ raku -e 'slurp.match(:global, / :i <|w> the <|w> /).elems.put;' US_Constitution.txt
519
~$ raku -e 'slurp.match(:global, / :i <?wb> the <?wb> /).elems.put;' US_Constitution.txt
519
~$ raku -e 'slurp.match(:global, / :i << the >> /).elems.put;' US_Constitution.txt
519

# below, remove `:i` (:ignorecase flag, i.e. adverb):
~$ raku -e 'slurp.match(:global, / << the >> /).elems.put;' US_Constitution.txt
458

Edit: the foregoing is just a general overview on word-counting with Raku. If you need to analyze JSON files specifically you can use Raku's JSON::Tiny or JSON::Fast modules.
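
A minimal sketch of the module route (assuming JSON::Fast is installed, e.g. via zef install JSON::Fast, and file.json is a placeholder) that parses the file and prints the top-level keys, which you could then count as above:

~$ raku -MJSON::Fast -e 'put from-json(slurp).keys;' file.json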

https://docs.raku.org/routine/grep
https://docs.raku.org/type/Str#method_match
https://raku.org


I have used the awk command below to find the number of occurrences.

Example file:

cat file1

praveen ajay praveen ajay monkey praveen praveen boy praveen 

Command:

awk '{print gsub("praveen",$0)}' file1 | awk 'BEGIN{sum=0}{sum=sum+$1}END{print sum}' 

Output:

awk '{print gsub("praveen",$0)}' file1 | awk 'BEGIN{sum=0}{sum=sum+$1}END{print sum}'
5
  • Or just awk '{sum+=gsub("praveen","")} END {print sum+0}'. Commented Mar 18, 2019 at 19:54
  • Let me know why the downvote on my answer. Commented Mar 19, 2019 at 7:42
