I have a file, a.txt, with a number on each line. I also have another file, b.txt, which also has a number on each line.
How could I check if all the lines in file a.txt are included in b.txt?
- Better add sample input/output ;) – Gilles Quénot, Dec 10, 2014 at 8:54
- Is checking the line number also required? – Малъ Скрылевъ, Dec 10, 2014 at 9:21
- Practically the same question: stackoverflow.com/questions/27376807/… – user3442743, Dec 10, 2014 at 9:29
- @user3442743: that question specified using sed or awk; this is more general. – user1071847, Nov 30, 2018 at 19:28
- Related: unix.stackexchange.com/questions/397747/… – Ciro Santilli OurBigBook.com, Jan 11, 2021 at 20:53
7 Answers
You can use comm for that.
If a.txt and b.txt are already sorted (lexically and ascending), you just need

    comm -23 a.txt b.txt

or maybe

    comm -23 a.txt b.txt | wc -l

If there is no output (or if wc -l returns "0"), then every line in a.txt was in b.txt (-2 suppresses output of lines that are only in b.txt, -3 suppresses output of lines that are in both files).
If the files are not sorted, you can use process substitution to pass a sorted output of each file to comm:

    comm -23 <(sort a.txt) <(sort b.txt)

The process substitution <(COMMAND) puts the output of COMMAND into a FIFO or a file in /dev/fd (depending on what is supported on the system). On the command line, <(COMMAND) is then replaced with the name of this file as part of command-line expansion.
This really does compare line by line, so if a number exists twice in a.txt but only once in b.txt, this will output the duplicate line from a.txt. If you do not care about duplicates, use sort -u FILE instead of sort FILE (or sort FILE | uniq in case your sort has no switch for unique sorting).
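As a rough sketch, the comm approach can be wrapped in a function that reports the result through its exit status. The function name all_lines_in is hypothetical, duplicates are ignored via sort -u, and a temporary file stands in for process substitution so that the sketch also runs in plain sh; it assumes mktemp is available on your system.

```shell
# all_lines_in A B: exit 0 if every distinct line of A also occurs in B.
# Hypothetical helper; assumes mktemp exists (it is common but not POSIX).
all_lines_in() {
    tmp=$(mktemp) || return 2
    # LC_ALL=C makes sort and comm agree on a plain byte-wise ordering.
    LC_ALL=C sort -u "$2" > "$tmp"
    # comm -23 keeps only lines unique to its first input (here: stdin, "-").
    missing=$(LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$tmp" | wc -l)
    rm -f "$tmp"
    # No line unique to A means A is included in B.
    [ "$missing" -eq 0 ]
}
```

It could be used as: all_lines_in a.txt b.txt && echo "a.txt is included in b.txt"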
You can use the diff command to compare two files.
Example usage

    $ seq 1 5 > a.txt
    $ seq 1 5 > b.txt
    $ diff a.txt b.txt
    $
    $ seq 1 6 > b.txt
    $ diff a.txt b.txt
    5a6
    > 6

EDIT
You can also try something like

    $ seq 1 5 > a.txt
    $ seq 1 5 > b.txt
    $ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
    files are same
    $ seq 1 6 > b.txt
    $ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
    files are not same

This echoes whether there is some difference or not. Hope it helps you.

If the numbers are unique (without repetitions in each file), you can concatenate the files, pipe the result to sort and then uniq, and check how many lines you have.
For example:

    >> cat a.txt
    1
    2
    8
    5
    >> cat b.txt
    1
    2
    5
    3
    8
    >> cat a.txt b.txt | sort | uniq | wc -l
    5

Since the result is the same as the number of lines in b.txt, the answer is yes!
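The counting idea above could be scripted roughly as follows. This is only a sketch: the function name included_by_count is hypothetical, and comparing against the number of distinct lines in b.txt (rather than its raw line count) is a small variation that drops the requirement that b.txt has no repetitions. It assumes each file ends with a newline, since cat would otherwise join the last line of one file with the first line of the next.

```shell
# included_by_count A B: exit 0 if adding A's lines to B creates no new distinct line.
# Hypothetical helper; assumes both files end with a trailing newline.
included_by_count() {
    # Number of distinct lines in A and B together.
    combined=$(cat "$1" "$2" | sort -u | wc -l)
    # Number of distinct lines in B alone.
    only_b=$(sort -u "$2" | wc -l)
    # Equal counts mean A contributed nothing that B did not already have.
    [ "$combined" -eq "$only_b" ]
}
```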
Try this:

    awk '
        NR==FNR{arr[$0]++;next}
        {print ($0 in arr) ? $0 " in both files" : $0 " *not* in both files"}
    ' b.txt a.txt

With diff:

    $ diff -a b.txt a.txt
    2c2
    < 3
    ---
    > 2
    6d5
    < 7

A Perl solution:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Compare;

    # arrays must be declared under "use strict"
    my (@atxt, @btxt);

    # read file a.txt
    open (my $fh, "<", "a.txt") or die $!;
    while (<$fh>){
        push @atxt, $_;
    }
    close($fh);

    # read file b.txt
    open (my $fh2, "<", "b.txt") or die $!;
    while (<$fh2>){
        push @btxt, $_;
    }
    close($fh2);

    my $lc = List::Compare->new(\@atxt, \@btxt);

    print $lc->get_intersection;   # lines found in both files
    print $lc->get_union;          # lines found in either file
    print $lc->get_unique;         # lines found only in a.txt
    print $lc->get_complement;     # lines found only in b.txt

There are many more options, check out the documentation: http://search.cpan.org/~jkeenan/List-Compare-0.39/lib/List/Compare.pm
A file containing another file would imply the whole content of a.txt being present within b.txt in the same order, including possible repetitions, whereas your final question:
How could I check if all the lines in file a.txt are included in b.txt?
implies that the order and repetitions are irrelevant. As a simple example:

    a.txt:
    5
    7
    3

    b.txt:
    9
    5
    3
    7

satisfies your quoted question, but not the one in the title.
Solving the quoted problem is significantly easier, provided that the container file is not huge (otherwise you'll run into memory problems with a straightforward approach like the one demonstrated below). A simple solution is to create a set of all the numbers contained in b.txt, then iterate through a.txt and return false as soon as an item is not found in that set. If this has not happened by the time you finish iterating over a.txt, return true.
This would look as follows in pseudo-code:

    ContentSet = {}
    for each element b of b.txt
        add b into ContentSet
    for each element a of a.txt
        if a is not in ContentSet then
            return false
    return true

This approach has the advantage that the first iteration gets rid of any repetitions in the container file, keeping the size of the set, and thus the search time, to a minimum. The second iteration also runs faster than a naive approach, provided the set has a good hash implementation, because checking whether a hash set contains a given object is an O(1) operation on average.
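One possible rendering of the pseudo-code as a shell function, using an awk associative array as the set. This is a sketch rather than a definitive implementation: the function name contains_all_lines is hypothetical, and it compares whole lines as strings, so "5" and "05" count as different entries.

```shell
# contains_all_lines A B: exit 0 if every line of A also occurs in B.
# Hypothetical helper; B is read first to build the set, then A is checked.
contains_all_lines() {
    awk '
        BEGIN { missing = 0 }
        # Lines of the first operand (B): add each one to the set.
        FILENAME == ARGV[1] { seen[$0]; next }
        # Lines of the second operand (A): stop at the first line not in the set.
        !($0 in seen) { missing = 1; exit }
        # Exit status 0 means "return true", 1 means "return false".
        END { exit missing }
    ' "$2" "$1"
}
```

Testing FILENAME instead of the more common NR==FNR idiom keeps the check correct when b.txt is empty.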