39

I need to find files where a specific string appears twice or more.

For example, for three files:

File 1:

Hello World! 

File 2:

Hello World! Hello ! 

File 3:

Hello World! Hello Hello Again. 

--

I want to grep Hello and only get files 2 & 3.

  • @Melanie Shebel - not really sure what you are looking for. It may be good to know if multiple matches in the same line should be considered or not, for example. Commented Oct 21, 2015 at 16:26
  • I have some files that contain "calculation completed" once and some that contain "calculation completed" twice. I need to pull a list of the files that contain the string twice. The strings appear on separate lines. Commented Oct 23, 2015 at 4:43
  • Then all of the answers below will work. What more do you need? Commented Oct 23, 2015 at 6:50
  • @MelanieShebel ok. Adding a bounty is nice, even though I guess you could have asked a new question to have more control over the possible solutions and desired output. Commented Oct 23, 2015 at 12:54

8 Answers

40
+50

What about this:

grep -o -c Hello * | awk -F: '{if ($2 > 1){print $1}}' 

4 Comments

This will tell us which files contain at least 2 lines containing the word 'Hello'. What if a file has the line Hello Hello World? It won't get listed.
This should be ($2 > 1), or it will only print files with 3 or more hits.
@bstar55 Sorry if that was ambiguous. The way the files are designed this issue isn't going to be a problem.
This could use some explanation. Does the -o flag actually do anything here?
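On the points raised in the comments above: with -c, grep counts matching lines, so -o does not change the number. One way to count every occurrence instead is to let grep -o print each match on its own line and count those lines with wc -l. The following is only a sketch under that assumption; the function name files_with_min_occurrences and its arguments are made up for illustration, and -o is a GNU/BSD extension rather than POSIX.

```shell
# Hypothetical helper (not from the answer above): print each file in which
# PATTERN occurs at least MIN times, counting occurrences rather than lines.
files_with_min_occurrences() {
    pattern=$1
    min=$2
    shift 2
    for file in "$@"; do
        [ -f "$file" ] || continue      # skip directories and other non-files
        # grep -o prints every match on its own line; wc -l counts them.
        count=$(grep -o -e "$pattern" -- "$file" | wc -l)
        if [ "$count" -ge "$min" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

It could be called as files_with_min_occurrences Hello 2 * to list files containing Hello at least twice, whether or not the matches share a line.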
10

Since the question is tagged grep, here is a solution using only that utility and bash (no awk required):

#!/bin/bash
for file in *
do
    if [ "$(grep -c "Hello" "${file}")" -gt 1 ]
    then
        echo "${file}"
    fi
done

Can be a one-liner:

for file in *; do if [ "$(grep -c "Hello" "${file}")" -gt 1 ]; then echo "${file}"; fi; done 

### Explanation

  • You can modify the for file in * statement with whatever shell expansion you need to select the data files.
  • grep -c returns the number of lines that match the pattern; multiple matches on one line still count as a single matched line.
  • if [ ... -gt 1 ] tests whether more than one line in the file matched. If so:
  • echo ${file} prints the file name.
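The steps above can be wrapped in a small function, which makes the threshold adjustable and skips directories that the * expansion may pick up. This is a sketch only; the name files_with_min_matching_lines and the MIN argument are hypothetical and not part of the answer.

```shell
# Hypothetical helper: print each file with at least MIN lines matching PATTERN.
files_with_min_matching_lines() {
    pattern=$1
    min=$2
    shift 2
    for file in "$@"; do
        [ -f "$file" ] || continue      # grep -c on a directory only prints an error
        # grep -c prints the number of matching lines (0 when nothing matches).
        lines=$(grep -c -e "$pattern" -- "$file")
        if [ "$lines" -ge "$min" ]; then
            printf '%s\n' "$file"
        fi
    done
}
```

Because it relies on grep -c, a file whose only matches sit on a single line (for example "Hello Hello") is not listed when MIN is 2.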


3

This awk command prints the name of every file that has two or more lines containing Hello:

awk 'FNR==1 {if (a>1) print f;a=0} /Hello/ {a++} {f=FILENAME} END {if (a>1) print f}' *
file2
file3
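The same per-file line count can also be kept in an awk array keyed by FILENAME, which avoids tracking the previous file name by hand. This is a different technique from the one-liner above, shown only as a sketch; the wrapper name awk_files_with_two_lines is made up, and the sort is there because awk visits array keys in no particular order.

```shell
# Hypothetical wrapper: list files that have two or more lines containing Hello.
awk_files_with_two_lines() {
    # n[FILENAME] counts matching lines per file; the END block reports
    # every file whose count is above one.
    awk '/Hello/ { n[FILENAME]++ }
         END { for (f in n) if (n[f] > 1) print f }' "$@" | sort
}
```

The trade-off is that nothing is printed until all files have been read.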


2

What you need is a grep that can recognise patterns across line endings ("hello" followed by anything (possibly even line endings), followed by "hello")

As grep processes your files line by line, it is (by itself) not the right tool for the job - unless you manage to cram the whole file into one single line.

Now, that is easy, for example using the tr command, replacing line endings by spaces:

if cat "$file" | tr '\n' ' ' | grep -q 'hello.*hello'
then
    echo "$file matches"
fi

This is quite efficient, even on large files with many (say 100000) lines. Because grep -q exits as soon as it has found a match, there is no need to add --max-count=1. It doesn't matter whether the two hellos are on the same line or not.
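For use in a loop, the tr and grep pipeline can be put into a function whose exit status says whether the pattern occurs twice. A minimal sketch, assuming the pattern is a basic regular expression that is safe to repeat; the name occurs_twice is hypothetical.

```shell
# Hypothetical helper: succeed if PATTERN occurs at least twice anywhere in FILE.
occurs_twice() {
    pattern=$1
    file=$2
    # Join all lines into a single line, then look for two occurrences with
    # anything in between. The function returns the exit status of grep.
    tr '\n' ' ' < "$file" | grep -q -e "$pattern.*$pattern"
}
```

A loop such as for file in *; do occurs_twice hello "$file" && echo "$file matches"; done would then list the matching files.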


2

After reading your question, I think you also want to catch the case where hello appears twice on one line (you asked for files "where a specific string appears twice or more"), so I came up with this one-liner:

awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' * 
  • In the line above, p is the pattern you want to search for.
  • It prints the file name if the file contains the pattern two or more times, whether the matches are on the same line or on different lines.
  • As soon as two or more matches have been found, it prints the file name, stops processing the current file, and moves on to the next input file, if there is one. This helps when the files are big.

A little test:

kent$ head f*
==> f <==
hello hello world

==> f2 <==
hello

==> f3 <==
hello
hello

SK-Arch 22:27:00 /tmp/test
kent$ awk -v p="hello" 'FNR==1{x=0}{x+=gsub(p,p);if(x>1){print FILENAME;nextfile}}' f*
f
f3
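If the awk at hand does not support nextfile (an assumption about older implementations, not something the answer states), a flag can stand in for it: once a file has been reported, its remaining lines are skipped. The sketch below also uses "&" as the gsub replacement so that each match is replaced by itself; the function name files_with_two_occurrences is hypothetical.

```shell
# Hypothetical wrapper: print files in which PATTERN occurs two or more times.
files_with_two_occurrences() {
    pat=$1
    shift
    awk -v p="$pat" '
        FNR == 1 { x = 0; reported = 0 }        # new file: reset the state
        reported { next }                       # file already printed: skip line
        { x += gsub(p, "&") }                   # gsub returns the number of matches
        x > 1    { print FILENAME; reported = 1 }
    ' "$@"
}
```

Unlike the nextfile version, this still reads every line of every file, so it gives up the early exit on big files.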

1 Comment

Thanks @Kent ! In my specific example I'll never have the string twice in a row, but it's good to know.
1

Another way:

grep Hello * | cut -d: -f1 | uniq -d 

Grep for lines containing 'Hello'; keep only the file names; print only the duplicates.
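The same pipeline extends to "N or more matching lines" by swapping uniq -d for uniq -c and filtering on the count. A rough sketch, assuming file names contain neither colons nor whitespace; the function name and the MIN argument are made up for illustration.

```shell
# Hypothetical helper: print files with at least MIN lines matching PATTERN.
files_with_min_lines_uniq() {
    pattern=$1
    min=$2
    shift 2
    # The extra /dev/null operand makes grep prefix the file name even when
    # only one real file is given.
    grep -e "$pattern" -- "$@" /dev/null |
        cut -d: -f1 |
        uniq -c |
        awk -v min="$min" '$1 >= min { print $2 }'
}
```

For example, files_with_min_lines_uniq Hello 3 * would list only the files with three or more lines containing Hello.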

1 Comment

First time I've used the -d switch of the uniq command! Interesting!
0

Piping to a scripting language might be overkill, but it's oftentimes much easier than just using awk

grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2' 

So for your input, we get

$ grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts "#{file}: #{count}" if count&.to_i >= 2'
./2: 2
./3: 3

Or to omit the count

grep -rnc "Hello" . | ruby -ne 'file, count = $_.split(":"); puts file if count.to_i >= 2'


0
grep -c Hello * | grep -v ':[01]$' | sed 's/:[0-9]*$//'

