Check if all lines of a file are contained in another file

Question

I have a file, a.txt, with one number on each line. I also have another file, b.txt, which also has one number on each line.
How can I check whether all the lines in a.txt are included in b.txt?

Practically the same question stackoverflow.com/questions/27376807/… – user3442743, Commented Dec 10, 2014 at 9:29
@user3442743: that question specified using sed or awk; this is more general. – user1071847, Commented Nov 30, 2018 at 19:28

Adaephon · Accepted Answer · 2014-12-10 10:00:43Z

You can use comm for that.

If a.txt and b.txt are already sorted (lexically and ascending), you just need

comm -23 a.txt b.txt

or maybe

comm -23 a.txt b.txt | wc -l

If there is no output (or if wc -l returns "0"), then every line in a.txt was in b.txt (-2 suppresses output of lines that are only in b.txt, -3 suppresses output of lines that are in both files).

If the files are not sorted, you can use process substitution to pass a sorted output of each file to comm:

comm -23 <(sort a.txt) <(sort b.txt)

The process substitution <(COMMAND) puts the output of COMMAND into a FIFO or a file in /dev/fd (depending on what the system supports). On the command line, <(COMMAND) is then replaced with the name of this file as part of command-line expansion.

This really does compare lines, so if a number appears twice in a.txt but only once in b.txt, the duplicate line from a.txt will be output. If you do not care about duplicates, use sort -u FILE instead of sort FILE (or sort FILE | uniq if your sort has no switch for unique sorting).
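To turn the comm check into a yes/no answer for scripts, it can be wrapped in a small function. This is only a sketch: the name is_subset is made up for illustration, duplicates are ignored via sort -u, and a temporary file (created with mktemp, assumed to be available) stands in for process substitution so that it also runs under plain sh. Sorting and comparing are both done in the C locale so that comm sees the order it expects.

```shell
#!/bin/sh
# is_subset A B (hypothetical helper): succeed if every distinct line of A
# also appears in B. Duplicates are ignored because both inputs pass
# through sort -u.
is_subset() {
    _sorted_b=$(mktemp) || return 2
    LC_ALL=C sort -u "$2" > "$_sorted_b"
    # comm -23 keeps only the lines unique to its first input (here: A on stdin).
    _missing=$(LC_ALL=C sort -u "$1" | LC_ALL=C comm -23 - "$_sorted_b")
    rm -f "$_sorted_b"
    [ -z "$_missing" ]
}
```

It could then be used as: if is_subset a.txt b.txt; then echo "all lines found"; fi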

nu11p01n73R · Accepted Answer · 2014-12-10 10:51:28Z


You can use the diff command to compare two files

Example usage

$ seq 1 5 > a.txt
$ seq 1 5 > b.txt
$ diff a.txt b.txt
$
$ seq 1 6 > b.txt
$ diff a.txt b.txt
5a6
> 6

EDIT

You can also try something like

$ seq 1 5 > a.txt
$ seq 1 5 > b.txt
$ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
files are same
$ seq 1 6 > b.txt
$ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
files are not same
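The commands above only report whether the two files are identical. To read a subset answer out of diff, one option is to sort both inputs and look only at the lines diff marks with "<", which come from the first input. The sketch below assumes diff's default output format; the function name only_in_first is hypothetical, and mktemp is assumed to be available.

```shell
#!/bin/sh
# only_in_first A B (hypothetical helper): print the distinct lines that
# diff reports as present in A but absent from B. No output means every
# line of A was found in B.
only_in_first() {
    _sorted_b=$(mktemp) || return 2
    LC_ALL=C sort -u "$2" > "$_sorted_b"
    # In default diff output, lines taken from the first input start with "< ".
    LC_ALL=C sort -u "$1" | diff - "$_sorted_b" | sed -n 's/^< //p'
    rm -f "$_sorted_b"
}
```

With the a.txt and b.txt from the second example (1 to 5 and 1 to 6), only_in_first a.txt b.txt prints nothing, while swapping the arguments prints 6.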

edited Dec 10, 2014 at 10:51

answered Dec 10, 2014 at 8:58


6 Comments

mrtubis Over a year ago

But for that, the order of the numbers in both files has to be the same, right?

nu11p01n73R Over a year ago

@mrtubis Yeah, it needs to be, as diff compares each line with the corresponding line of the other file. You can sort the two files first to be safe.

Jim Over a year ago

@nu11p01n73R: Yes, that works for small files. But if the file has 2K lines, how can I tell from the diff that it is a subset?

nu11p01n73R Over a year ago

@Jim If the two files are the same, diff does not give any output, whereas if there is some difference, it is shown.

nu11p01n73R Over a year ago

@Jim I have edited my answer to echo if there is some difference or not. Hope it helps you


mrtubis · Accepted Answer · 2014-12-10 09:05:05Z

If the numbers are unique (no repetitions within each file), you can concatenate the files, pipe the result to sort and then uniq, and count how many lines you get.

For example:

>> cat a.txt
1
2
8
5
>> cat b.txt
1
2
5
3
8
>> cat a.txt b.txt | sort | uniq | wc -l
5

Since the result is the same as the number of lines in b.txt, the answer is yes!
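The counting idea can be scripted as below. This is a sketch of a slight variation rather than the exact pipeline above: it also deduplicates b.txt before counting, so repeated numbers do not skew the comparison, and it passes both files to sort directly instead of going through cat. The function name same_distinct_count is hypothetical.

```shell
#!/bin/sh
# same_distinct_count A B (hypothetical helper): succeed if adding A's
# lines to B's lines introduces no new distinct line, i.e. A is covered by B.
same_distinct_count() {
    # tr strips the padding that some wc implementations put before the number.
    _n_union=$(sort -u "$1" "$2" | wc -l | tr -d ' ')
    _n_b=$(sort -u "$2" | wc -l | tr -d ' ')
    [ "$_n_union" -eq "$_n_b" ]
}
```

For the example files, both counts are 5, so same_distinct_count a.txt b.txt succeeds.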

Gilles Quénot · Accepted Answer · 2014-12-10 16:14:51Z

Try this:

awk '
    NR==FNR{arr[$0]++;next}
    {print ($0 in arr) ? $0 " in both files" : $0 " *not* in both files"}
' b.txt a.txt

With diff:

 $ diff -a b.txt a.txt
2c2
< 3
---
> 2
6d5
< 7

Vijay · Accepted Answer · 2014-12-10 08:58:52Z


awk 'FNR==NR{b[$0];next} {if($0 in b){print $0" is present in b.txt"} else{print $0" is not present in b.txt"} }' b.txt a.txt
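The awk one-liners in this thread print a verdict for every line. When a single yes/no is wanted, the same array lookup can drive awk's exit status instead. The following is a sketch under a few assumptions: the function name all_lines_present is made up, and FILENAME == ARGV[1] is used in place of FNR==NR so that an empty b.txt is not mistaken for a.txt.

```shell
#!/bin/sh
# all_lines_present A B (hypothetical helper): succeed if every line of A
# occurs somewhere in B. B is read first and stored as array keys.
all_lines_present() {
    awk '
        FILENAME == ARGV[1] { seen[$0]; next }         # first file (B): remember each line
        !($0 in seen)       { missing = 1; exit }      # second file (A): stop at the first miss
        END                 { exit missing + 0 }       # status 0 only if nothing was missing
    ' "$2" "$1"
}
```

Because lookups are by key, repeated numbers in either file do not affect the result; only presence is checked.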

answered Dec 10, 2014 at 8:58


1 Comment

user3442743 Over a year ago

If there are multiple of the same number this may give misleading results.

Chankey Pathak · Accepted Answer · 2014-12-10 09:18:45Z

A Perl solution:

#!/usr/bin/perl
use strict;
use warnings;
use List::Compare;

# read file a.txt
my @atxt;
open (my $fh, "<", "a.txt") or die $!;
while (<$fh>){
    push @atxt, $_;
}
close($fh);

# read file b.txt
my @btxt;
open (my $fh2, "<", "b.txt") or die $!;
while (<$fh2>){
    push @btxt, $_;
}
close($fh2);

my $lc = List::Compare->new(\@atxt, \@btxt);

print $lc->get_intersection;   # lines found in both files
print $lc->get_union;          # lines found in either file
print $lc->get_unique;         # lines found only in a.txt
print $lc->get_complement;     # lines found only in b.txt

There are many more options, check out the documentation: http://search.cpan.org/~jkeenan/List-Compare-0.39/lib/List/Compare.pm

downhand · Accepted Answer · 2014-12-10 09:43:27Z

A file containing another file would imply the whole content of a.txt being present within b.txt in the same order, including possible repetitions, whereas your final question:

How can I check whether all the lines in a.txt are included in b.txt?

implies that the order and repetitions are irrelevant. As a simple example:

a.txt: 5 7 3
b.txt: 9 5 3 7

satisfies your quoted question, but not the one in the title.

Solving the quoted problem is significantly easier, provided that the container file is not huge (otherwise you'll run into memory problems with a straightforward approach like the one demonstrated below). A simple solution is to build a set of all the numbers contained in b.txt, then iterate through a.txt and return false as soon as an item is not found in that set. If that has not happened by the time you finish iterating over a.txt, return true.

This would look as follows in pseudo-code:

ContentSet = {}
for each element b of b.txt
    add b into ContentSet
for each element a of a.txt
    if a is not in ContentSet then
        return false
return true

This approach has two advantages. First, building the set removes any repetitions in the container file, which keeps the set small. Second, the second iteration runs faster than a naive line-by-line search, provided the set has a good hash implementation, because checking whether a hash set contains a given object is an O(1) operation on average.
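On the command line, a comparable membership test can be sketched with grep rather than a hand-built set. This swaps in a different tool for the pseudo-code: grep -F -x -f b.txt treats every line of b.txt as a fixed string that must match a whole line, and -v selects the lines of a.txt that match none of them. The function name contains_all_lines is hypothetical, and how grep stores the patterns internally is not assumed here.

```shell
#!/bin/sh
# contains_all_lines A B (hypothetical helper): succeed if every line of A
# equals some line of B. Returns 1 if a line is missing, 2 on a grep error.
contains_all_lines() {
    # -F fixed strings, -x whole-line match, -v invert, -q quiet, -f read patterns from B.
    grep -Fxvq -f "$2" "$1"
    case $? in
        0) return 1 ;;   # grep found a line of A that is not in B
        1) return 0 ;;   # no such line: everything in A is in B
        *) return 2 ;;   # grep itself failed (for example, unreadable file)
    esac
}
```

With the example above (a.txt holding 5, 7, 3 and b.txt holding 9, 5, 3, 7, one number per line), contains_all_lines a.txt b.txt succeeds, since order and repetitions are ignored.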
