I am trying to find duplicates in a file and, once a match is found, mark the first match with a character or word at the end of the line.

For example, my file (test.html) contains the following entries:

host= alpha-sfserver1
host= alphacrest3
host= alphacrest4
host= alphactn1
host= alphactn2
host= alphactn3
host= alphactn4
down alphacrest4

I can find the duplicate using the following (I use $2 because the duplicate will always be in column 2):

awk '{if (++dup[$2] == 1) print $0;}' test.html 

This removes the last entry (down alphacrest4), but what I also want is to mark the remaining entry for that host with a word or character, for example:

host= alphacrest4 acked 

Any help is most welcome.

2
  • Strange html file... Commented Jun 3, 2013 at 15:43
  • just named it test.html. I should have called it test.txt for all the people who really care about its name. :-) Commented Jun 3, 2013 at 15:50

3 Answers 3

1

You need to process the file twice. In the first run you write the dupes into a file:

awk '{if (++dup[$2] == 1) print $2;}' test.html > dupes.txt 

The second run compares all lines against the file contents:

awk 'BEGIN { while ((getline var < "dupes.txt") > 0) { dup2[var] = 1 } }
{
    num = ++dup[$2]
    if (num == 1) {
        if (1 == dup2[$2]) print $0 " acked"; else print $0
    }
}' test.html
2
  • Hi Hauke, almost worked, just had to change the awk '{if (++dup[$2] == 1) print $2;}' test.html > dupes.txt to awk '{if (++dup[$2] == 2) print $2;}' test.html > dupes.txt Commented Jun 3, 2013 at 16:21
  • Sorry, I should have said thanks for your help and very quick reply. Commented Jun 3, 2013 at 16:30
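Putting the answer and the correction from the comment together (== 2 instead of == 1 in the first pass), the two-pass approach might look like the sketch below. The sample data is the file from the question. The output file name marked.txt and the array names seen and isdup are only illustrative and do not come from the thread.

```shell
# Recreate the sample file from the question (one entry per line).
printf '%s\n' 'host= alpha-sfserver1' 'host= alphacrest3' 'host= alphacrest4' \
    'host= alphactn1' 'host= alphactn2' 'host= alphactn3' 'host= alphactn4' \
    'down alphacrest4' > test.html

# Pass 1: write a column-2 value to dupes.txt the moment it is seen a second time.
awk '++seen[$2] == 2 { print $2 }' test.html > dupes.txt

# Pass 2: load the duplicates, then print only the first occurrence of each
# column-2 value, appending " acked" when that value is a known duplicate.
# marked.txt is a hypothetical output name used for this sketch.
awk 'BEGIN { while ((getline key < "dupes.txt") > 0) isdup[key] = 1 }
     ++seen[$2] == 1 { print $0 (($2 in isdup) ? " acked" : "") }' test.html > marked.txt

cat marked.txt
```

With the sample data this should print seven lines: the "down alphacrest4" line is dropped and the "host= alphacrest4" line ends in " acked".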
1

This would be much easier if we had the entire file. Are you only interested in lines beginning with host= or any of the 2nd fields? For a general solution, try this:

perl -e '@file=<>; foreach(map{/.+?\s+(.+)/;}@file){$dup{$_}++}; foreach(@file){ chomp; /.+?\s+(.+)/; if($dup{$1}>1 && not defined($p{$1})){ print "$_ acked\n"; $p{$1}++;} else{print "$_\n"} }' test.html 

The script above first reads the entire file and counts how often each entry occurs. It then prints every line, appending "acked" to the first occurrence of each duplicated entry.
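For comparison, the same read-everything-then-print idea can be sketched in awk. This is not the answer's script: it keys on column 2 ($2) as in the question, whereas the Perl regex keys on everything after the first whitespace. The array names and the output file all_lines_marked.txt are made up for illustration.

```shell
# Sample data from the question is fed to awk on standard input (here-document).
awk '
    # Remember every line and its column-2 value, and count each value.
    { line[NR] = $0; key[NR] = $2; count[$2]++ }
    END {
        for (i = 1; i <= NR; i++) {
            k = key[i]
            # Mark only the first line of a value that occurs more than once.
            if (count[k] > 1 && !(k in marked)) {
                print line[i] " acked"
                marked[k] = 1
            } else {
                print line[i]
            }
        }
    }
' > all_lines_marked.txt <<'EOF'
host= alpha-sfserver1
host= alphacrest3
host= alphacrest4
host= alphactn1
host= alphactn2
host= alphactn3
host= alphactn4
down alphacrest4
EOF

cat all_lines_marked.txt
```

Like the Perl version, this keeps the later duplicate line ("down alphacrest4") in the output and only tags the first occurrence.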

The whole thing is much simpler if we can assume you are only interested in lines starting with down X:

grep down test.html | awk '{print $2}' | perl -e 'while(<>){chomp; $dup{$_}++} open(A,"test.html"); while(<A>){ if(/host=\s+(.+)/ && defined($dup{$1})){ chomp; print "$_ acked\n"} else{print}}' 
1

This could help:

One-Liner:

awk 'NR==FNR{b[$2]++; next} $2 in b { if (b[$2]>1) { print $0" acked" ; delete b[$2]} else print $0}' inputFile inputFile 

Explanation:

awk '
NR==FNR {
    ## Loop through the file and check which line is repeated based on column 2
    b[$2]++
    ## Skip the rest of the actions until complete file is scanned
    next
}
## Once the scan is complete, look for second column in the array
$2 in b {
    ## If the count of the column is greater than 1 it means there is duplicate.
    if (b[$2]>1) {
        ## So print that line with "acked" marker
        print $0" acked"
        ## and delete the array so that it is not printed again
        delete b[$2]
    }
    ## If count is 1 it means there was no duplicate so print the line
    else print $0
}' inputFile inputFile
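To try this on the data from the question, something like the following should work. It assumes the sample entries are stored one per line under the name inputFile used in the answer. The output name result.txt and the array name count are only for this sketch.

```shell
# Sample data from the question, stored under the file name used in the answer.
printf '%s\n' 'host= alpha-sfserver1' 'host= alphacrest3' 'host= alphacrest4' \
    'host= alphactn1' 'host= alphactn2' 'host= alphactn3' 'host= alphactn4' \
    'down alphacrest4' > inputFile

# The same file is passed twice. NR==FNR is true only while the first copy is
# being read (counting pass); the second copy drives the printing pass.
awk 'NR==FNR { count[$2]++; next }
     $2 in count {
         if (count[$2] > 1) { print $0 " acked"; delete count[$2] }
         else print $0
     }' inputFile inputFile > result.txt

cat result.txt
```

Because the array entry is deleted after the first marked line, the later "down alphacrest4" line no longer matches `$2 in count` and is dropped, so the output has seven lines.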
