6

I want to find where a word appears in a text file — as in the number of words into the text that a word occurs — for all instances of that word, but I'm not sure even where to start. I imagine I'll need a loop, and some combination of grep and wc.

As an example, here is an article about the iPhone 11:

On Tuesday, in a sign that Apple is paying attention to consumers who aren’t racing to buy more expensive phones, the company said the iPhone 11, its entry-level phone, would start at $700, compared with $750 for the comparable model last year.

Apple kept the starting prices of its more advanced models, the iPhone 11 Pro and iPhone 11 Pro Max, at $1,000 and $1,100. The company unveiled the new phones at a 90-minute press event at its Silicon Valley campus.

There are 81 words in the text.

jaireaux@macbook:~$ wc -w temp.txt
81 temp.txt

The word 'iPhone' appears three times.

jaireaux@macbook:~$ grep -o -i iphone temp.txt | wc -w
3

The output I want would be like this:

jaireaux@macbook:~$ whereword iPhone temp.txt
24
54
57

What would I do to get that output?

2
  • To get a robust answer you need to define a "word". In your example is aren't a word? Is $1,000 a word? If so then they don't fit the usual criteria that a word is a series of word-constituent characters and the word-constituent characters are letters, digits, and underscore (e.g. see -w in the GNU grep man page, linuxcommand.org/lc3_man_pages/grep1.html, and the meaning of \w in regexps for tools that accept such). If aren't isn't a word then does that mean aren and t are both words? Commented Feb 21, 2020 at 14:55
  • Here's an example of one difficulty in coming up with a foolproof solution - is ' part of a word or not? If you wanted to search for aren't then you'd want it to be part of a word but if you also wanted to find iPhone when my iPhone's broken appears in your text then you wouldn't want it to be part of a word. Lots of different conflicting possibilities to consider when trying to parse natural language! Commented Feb 21, 2020 at 18:26

9 Answers

7

Here's one way, using GNU tools:

$ tr ' ' '\n' < file | tr -d '[:punct:]' | grep . | grep -nFx iPhone
25:iPhone
54:iPhone
58:iPhone

The first tr replaces all spaces with newlines, and then the second deletes all punctuation (so that iPhone, can be found as a word). The grep . ensures that we skip any blank lines (we don't want to count those) and the grep -n appends the line number to the output. Then, the -F tells grep not to treat its input as a regular expression, and the -x that it should only find matches that span the entire line (so that job will not count as a match for jobs). Note that the numbers you gave in your question were off by one.

If you only want the numbers, you could add another step:

$ tr ' ' '\n' < file | tr -d '[:punct:]' | grep . | grep -nFx iPhone | cut -d: -f1
25
54
58

As has been pointed out in the comments, this will still have problems with "words" such as aren't or double-barreled. You can improve on that using:

tr '[[:space:][:punct:]]' '\n' < file | grep . | grep -nFx iPhone 
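The question asks for this in the shape of a whereword command. As a sketch (not part of the original answer, and using a hypothetical demo file), the improved pipeline can be wrapped into a shell function of that name:

```shell
# Sketch: the improved pipeline wrapped into the whereword function
# the question asks for (word = maximal run of non-space, non-punctuation).
whereword() {
    tr '[[:space:][:punct:]]' '\n' < "$2" | grep . | grep -nFx "$1" | cut -d: -f1
}

# Hypothetical two-line demo file standing in for the article:
printf '%s\n' 'the iPhone 11,' 'its iPhone' > demo.txt
whereword iPhone demo.txt    # -> 2 and 5 (word positions, one per line)
```

The function name and demo file are placeholders; point it at your real text file instead.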
23
  • 1
    A tr -d [:punct:] can be added, for the opposite reason as -x. "phone" (instead of iPhone) gives no match, but "phone," gives 29:phone, Commented Feb 20, 2020 at 19:01
  • 1
    Is aren’t a word? If yes then this answer won't work because it'll convert aren't to arent and so won't be able to find aren't in the input, if no then this answer still won't work because that'd mean aren and t are separate words but since the script converts aren't to arent it won't be able to find aren or t in the input. It has other issues too due to deleting punctuation rather than converting it to blanks and swapping the order of the trs (which would work for one interpretation of "word"). Commented Feb 21, 2020 at 14:28
  • 2
    To be clear - do not do tr -d '[:punct:]' as that concatenates strings that were separated by punctuation and so it'll create words that weren't actually present in your input while removing words that were present. Do tr '[[:space:][:punct:]]' '\n' < file instead - it's still not a perfect approach but it's an improvement assuming you do want to treat punctuation chars like the ’ in aren’t as not word-constituent. Commented Feb 21, 2020 at 15:44
  • 1
    @EdMorton dammit, you keep raising very valid points! :) Yes, you're right, but if you're going to do proper natural language processing, you can't use this sort of naive tool anyway. It would also choke on O'Reiley or double-barreled... Still, your tr is indeed an improvement worth making even if it's still flawed, it's less flawed than what I came up with. Thanks! Commented Feb 21, 2020 at 16:37
  • 1
    sigh. In my defense, I'm still working and it's 11pm... Nevertheless, I look forward to the day you get editing privileges here @EdMorton :) Commented Feb 21, 2020 at 23:03
3

Use the tr command to replace all whitespace by a single newline (using the squeeze option).

Pipe that to nl -ba, which numbers each line (and thus word) sequentially.

Pipe that to grep -F for the word you want. This will show the number and text for just those words.

awk would also do this in one process, but probably look more complex.
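A minimal sketch of the three steps just described (the sample file and its contents are hypothetical stand-ins for the question's temp.txt):

```shell
# Hypothetical sample file standing in for temp.txt from the question.
printf '%s\n' 'the iPhone 11, its entry-level phone' > temp.txt

tr -s '[:space:]' '\n' < temp.txt |   # one word per line, squeezing whitespace runs
    nl -ba |                          # number every line, i.e. every word
    grep -Fw iPhone                   # keep the number and text of matching words
```

This prints the word number alongside the word itself, e.g. "2 iPhone" for the sample line above.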

6
  • 1
    <FILE tr -s '[:space:]' '\n' | grep -nFi iphone Commented Feb 20, 2020 at 18:10
  • With grep -n, I get 88:Seq. With nl -ba, I get ` 88 Seq` in columns. Actually, the OP asked just for the word numbers. I would change the grep to awk '/^myWord$/ { print NR }' or to some variant of col. The significant part is using tr to serialise the words, though. Commented Feb 20, 2020 at 18:12
  • This would be much better if you actually showed the commands instead of just describing them. Commented Feb 20, 2020 at 18:25
  • I hadn't even thought of 'tr' but it makes this fairly simple. Thanks! Commented Feb 20, 2020 at 18:56
  • 2
    That's understandable, but remember that we don't answer for the person who asked. We answer for all the people who will read the question in the future, so even if the OP knows how to convert this into specific commands, the next person might not, so it's always better to at least show the commands. Commented Feb 20, 2020 at 22:07
2

An alternative with sed:

sed -e '/^$/d' -e 's/^[[:blank:]]*//g' < file | sed 's/[[:blank:]]/\n/g' | grep -ion "iphone" 

Output:

25:iPhone
54:iPhone
58:iPhone
1

I was experimenting (right now!) with something similar: a word count. That way, you see what the "words" look like:

]# cat iphone | tr -s [:space:] '\n' |sort|uniq -c|sort -n |grep phone
      1 phone,
      1 phones
      1 phones,
]# cat iphone | tr -d [:punct:] | tr -s [:space:] '\n' |sort|uniq -c|sort -n |grep phone
      1 phone
      2 phones

This trick(?) |sort|uniq -c|sort -n gives a good overview.

      2 Apple
      2 Pro
      2 a
      2 and
      2 company
      2 more
      2 phones
      2 to
      3 11
      3 iPhone
      3 its
      4 at
      6 the

This looks nice, but at the top:

      1 1000
      1 1100
      1 700
      1 750
      1 90minute

Dollar signs, commas and hyphens are gone... looks clean, at least.


A quick fix is defining some common interpuncts that will not appear in a (natural language) "word". And then use ^anchoring$, on one or both sides.

]# cat iphone | tr -d '.,;"!?' | tr -s [:space:] '\n' | grep -n phone
21:phones
30:phone
72:phones
]# cat iphone | tr -d '.,;"!?' | tr -s [:space:] '\n' | grep -n ^phone$
30:phone

And you can locate things like small-digit numbers:

]# cat iphone | tr -d '.,;"!?' | tr -s [:space:] '\n' | grep -n '1[012]'
27:11
56:11
60:11
64:$1000
66:$1100

tr|sed|grep (best simple solution)

This handles some cases (well, all the ones in this @*#! text ;)) and gives 81 words, like wc. There must be no leading spaces for the numbering to be correct. The naive (but not too naive) splitting is done by tr, then sed removes the trailing punctuation: here only comma and period. And then grep numbers and filters ad lib.

]# <iphone tr -s ' \t' '\n' | sed -E 's/(.+)[.,]/\1/' | grep -En '[\$-]|campus|i*[pP]hone$|entry'
25:iPhone
28:entry-level
29:phone
33:$700
36:$750
54:iPhone
58:iPhone
63:$1000
65:$1,100
74:90-minute
81:campus

The i*[pP]hone$ pattern does not find the plural form, and it would not work well with a trailing comma still attached, see above. The commas are gone here, except in the prices.

To separate "entry-level" you can just add the minus sign to tr's SET1.

I think this is a good example of each tool doing one natural step.
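For instance, the SET1 suggestion looks like this (a sketch on a hypothetical one-line sample):

```shell
# Adding '-' to tr's SET1 splits "entry-level" into two words.
printf '%s\n' 'its entry-level phone' > iphone
<iphone tr -s ' \t-' '\n' | grep -n .    # -> 1:its 2:entry 3:level 4:phone
```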

0

Create a function.

$ whereword(){ grep -ion "$1" -<<<$(egrep -o "[^[:blank:]]+" "$2"); }
$ whereword iPhone tmp.txt
25:iPhone
54:iPhone
58:iPhone
$ whereword "aren't" tmp.txt
14:aren't
0

A GNU awk alternative, splitting records on a single space or on the combination of a full stop and a newline:

awk 'BEGIN{RS=" |\\.\n"} $0~/iPhone/{print NR}' file1 
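A quick sanity check of the record-splitting idea on a hypothetical file1 (note that $0~/iPhone/ is a substring match, so it would also count words like iPhones):

```shell
# Hypothetical sample; each space-separated word becomes one awk record.
printf '%s' 'the iPhone is an iPhone' > file1
awk 'BEGIN{RS=" |\\.\n"} $0~/iPhone/{print NR}' file1    # -> 2 and 5
```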
0

To keep up with the idea that a word is what wc counts as a word:

A word is a non-zero-length sequence of characters delimited by white space.

We can divide the file into sequences of non-spaces in each line with grep -Eo '[^[:space:]]+' file, then remove the (still present) punctuation characters with tr -d '[:punct:]', and finally grep (case-insensitively) for the word of interest with grep -in 'phone':

$ grep -Eo '[^[:space:]]+' file | tr -d '[:punct:]' | grep -in 'phone'
20:phones
25:iPhone
29:phone
54:iPhone
58:iPhone
71:phones

Note that the removal of punctuation characters in this case doesn't change the line position of the words. The -i option selects both Phone and phone as shown.

For the case of the word iPhone:

$ grep -Eo '[^[:space:]]+' file | tr -d '[:punct:]' | grep -in 'iphone'
25:iPhone
54:iPhone
58:iPhone

That should be the correct word numbering (not 24, 54 and 57 as you wrote).

0

Using the most common interpretation of "word" for parsing English text, a word is a string of letters, digits, and/or underscore characters (i.e. what grep -w considers a word, and what \w matches as a "word constituent character" in tools that support it in regexps). By that definition aren’t is not a word, so:

$ cat tst.awk
BEGIN { FS="[^[:alnum:]_]+" }
{
    for (i=1; i<=NF; i++) {
        numWords++
        if ($i == tgt) {
            print numWords
        }
    }
}

$ awk -v tgt="iPhone" -f tst.awk file
26
57
61

$ awk -v tgt="aren’t" -f tst.awk file
$
$ awk -v tgt="aren" -f tst.awk file
14

or if aren’t is a word then:

$ cat tst.awk
BEGIN { FS="[^[:alnum:]_’]+" }
{
    for (i=1; i<=NF; i++) {
        numWords++
        if ($i == tgt) {
            print numWords
        }
    }
}

$ awk -v tgt="iPhone" -f tst.awk file
25
56
60

$ awk -v tgt="aren’t" -f tst.awk file
14
$ awk -v tgt="aren" -f tst.awk file
$

The right solution all depends on your definition of a "word". For example neither of the above consider $1,000 to be a word - idk if that's a problem or not for your application. If it is, here's a script that might be closer to your interpretation of a "word" (using GNU awk for FPAT):

$ cat tst.awk
BEGIN { FPAT = "([[:alpha:]]+[’'][[:alpha:]]+)|([$]?[0-9]+(,[0-9]+)*([.][0-9]+)?%?)|([[:alnum:]_]+)" }
{
    for (i=1; i<=NF; i++) {
        numWords++
        print numWords, "<" $i ">"
        if ($i == tgt) {
            print numWords
        }
    }
}

and here's the "words" it recognizes in your sample input:

$ awk -f tst.awk file
1 <On>
2 <Tuesday>
3 <in>
4 <a>
5 <sign>
6 <that>
7 <Apple>
8 <is>
9 <paying>
10 <attention>
11 <to>
12 <consumers>
13 <who>
14 <aren’t>
15 <racing>
16 <to>
17 <buy>
18 <more>
19 <expensive>
20 <phones>
21 <the>
22 <company>
23 <said>
24 <the>
25 <iPhone>
26 <11>
27 <its>
28 <entry>
29 <level>
30 <phone>
31 <would>
32 <start>
33 <at>
34 <$700>
35 <compared>
36 <with>
37 <$750>
38 <for>
39 <the>
40 <comparable>
41 <model>
42 <last>
43 <year>
44 <Apple>
45 <kept>
46 <the>
47 <starting>
48 <prices>
49 <of>
50 <its>
51 <more>
52 <advanced>
53 <models>
54 <the>
55 <iPhone>
56 <11>
57 <Pro>
58 <and>
59 <iPhone>
60 <11>
61 <Pro>
62 <Max>
63 <at>
64 <$1,000>
65 <and>
66 <$1,100>
67 <The>
68 <company>
69 <unveiled>
70 <the>
71 <new>
72 <phones>
73 <at>
74 <a>
75 <90>
76 <minute>
77 <press>
78 <event>
79 <at>
80 <its>
81 <Silicon>
82 <Valley>
83 <campus>
2
  • No, wc (and that is what the OP used to get the word count) define word (from man wc) as A word is a non-zero-length sequence of characters delimited by white space., so no: that's clearly not the same as \w. Commented Feb 21, 2020 at 21:25
  • @Isaac I didn't mention wc. I know the OP used it but the OP doesn't know how to solve his problem. Did you maybe mean to comment on the question or on a different answer? Commented Feb 21, 2020 at 23:15
0

[I wonder how you got those numbers -- if I select the text up to the 1st iPhone and pipe it to wc -w, I get 24. Up to the 2nd iPhone, I get 53, not 54. So they don't match, no matter in which direction I shift them.]

Assuming that a) the count should be 1-based, b) words are separated by spaces (using the same definition of "word" as wc -w), and c) GNU grep is used, this much simpler pipeline will do:

grep -Po '\S+' file | grep -n iPhone
25:iPhone
54:iPhone
58:iPhone

[that will also match iPhoney or XiPhone, but not iphone; if you want to make it match the whole word case insensitively, use ... | grep -nwi iPhone]

That's also more easily adaptable to a different definition of "word"; for instance, for word = a sequence of any characters except controls, spaces (separators) and punctuation:

grep -Po '[^\pC\pZ\pP]+' file | grep -n iPhone
26:iPhone
56:iPhone
60:iPhone

Or word = letters, marks, numbers and some symbols and punctuation like $, _ and ', plus the improperly used "right single quotation mark" (U+2019) instead of an apostrophe in aren’t:

grep -Po "[\pL\pM\pN'\x{2019}\$]+" file | grep -n iPhone
25:iPhone
55:iPhone
59:iPhone
1
  • This works (also): grep -Eo '\S+' file| grep -n '[Pp]hones*'. Btw. tr inserts a newline before the text starts, like your second example (from +1 to +2) Commented Feb 21, 2020 at 12:39
