I have a file a.txt with a number on each line, and another file b.txt, also with a number on each line.
How could I check if all the lines in file a.txt are included in b.txt?

7 Answers

8

You can use comm for that.

If a.txt and b.txt are already sorted (lexically and ascending), you just need

comm -23 a.txt b.txt 

or maybe

comm -23 a.txt b.txt | wc -l 

If there is no output (or if wc -l prints "0"), then every line in a.txt is also in b.txt. Here -2 suppresses lines that appear only in b.txt, and -3 suppresses lines that appear in both files.

If the files are not sorted, you can use process substitution to pass a sorted output of each file to comm:

comm -23 <(sort a.txt) <(sort b.txt) 

The process substitution <(COMMAND) connects the output of COMMAND to a FIFO or to a file in /dev/fd (depending on what the system supports). On the command line, <(COMMAND) is then replaced with the name of that file as part of command-line expansion.

This really does compare lines, so if a number appears twice in a.txt but only once in b.txt, the duplicate line from a.txt will be printed. If you do not care about duplicates, use sort -u FILE instead of sort FILE (or sort FILE | uniq if your sort has no option for unique sorting).
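
As a minimal sketch, the check above could be wrapped in a small function that reports its result through the exit status. The function name all_lines_in is hypothetical, duplicates are ignored via sort -u, and temporary files replace process substitution so that the sketch does not depend on bash:

```shell
# all_lines_in A B (hypothetical helper name):
# succeeds (exit 0) if every distinct line of A also occurs in B.
all_lines_in() {
    local tmpdir status
    tmpdir=$(mktemp -d) || return 2
    # Force the C locale so that sort and comm agree on the ordering.
    LC_ALL=C sort -u "$1" > "$tmpdir/a"
    LC_ALL=C sort -u "$2" > "$tmpdir/b"
    # comm -23 prints only the lines unique to the first file;
    # grep -q '' succeeds as soon as at least one such line exists.
    if LC_ALL=C comm -23 "$tmpdir/a" "$tmpdir/b" | grep -q ''; then
        status=1    # some line of A is missing from B
    else
        status=0    # nothing is unique to A
    fi
    rm -rf "$tmpdir"
    return "$status"
}
```

It could then be used as: all_lines_in a.txt b.txt && echo "every line of a.txt is in b.txt"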

3

You can use the diff command to compare two files

Example usage

    $ seq 1 5 > a.txt
    $ seq 1 5 > b.txt
    $ diff a.txt b.txt
    $
    $ seq 1 6 > b.txt
    $ diff a.txt b.txt
    5a6
    > 6

EDIT

You can also try something like

    $ seq 1 5 > a.txt
    $ seq 1 5 > b.txt
    $ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
    files are same
    $ seq 1 6 > b.txt
    $ diff a.txt b.txt > /dev/null && echo files are same || echo files are not same
    files are not same

6 Comments

But for that, the order of the numbers in both files has to be the same, right?
@mrtubis Yes, it does, because diff compares each line with the corresponding line of the other file. You can sort the two files first to be safe.
@nu11p01n73R: Yes, that works for small files. But if the file has 2K lines, how can I tell from the diff that it is a subset?
@Jim If the two files are the same, diff gives no output, whereas if there is any difference, it is shown.
@Jim I have edited my answer to echo if there is some difference or not. Hope it helps you
1

If the numbers are unique (no repetitions within each file), you can concatenate the files, pipe the result through sort and then uniq, and count the resulting lines.

For example:

    >> cat a.txt
    1
    2
    8
    5
    >> cat b.txt
    1
    2
    5
    3
    8
    >> cat a.txt b.txt | sort | uniq | wc -l
    5

Since the count is the same as the number of lines in b.txt, a.txt contributes no new lines, so the answer is yes!
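
The comparison of the two counts can also be automated. The following is only a sketch: the function name subset_by_count is hypothetical, and it compares against the number of distinct lines in b.txt rather than its raw line count, so it should also tolerate repeated lines. Passing both files to sort -u stands in for cat | sort | uniq:

```shell
# subset_by_count A B (hypothetical name):
# succeeds if merging A into B adds no new distinct lines.
subset_by_count() {
    local union_count b_count
    # Distinct lines in A and B together.
    union_count=$(sort -u "$1" "$2" | wc -l)
    # Distinct lines in B alone.
    b_count=$(sort -u "$2" | wc -l)
    # Equal counts mean A contributed nothing that B lacked.
    [ "$union_count" -eq "$b_count" ]
}
```

For example: subset_by_count a.txt b.txt && echo yes || echo no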

1

Try this:

    awk '
        NR==FNR{arr[$0]++;next}
        {print ($0 in arr) ? $0 " in both files" : $0 " *not* in both files"}
    ' b.txt a.txt

With:

    $ diff -a b.txt a.txt
    2c2
    < 3
    ---
    > 2
    6d5
    < 7

0
awk 'FNR==NR{b[$0];next} {if($0 in b){print $0" is present in b.txt"} else{print $0" is not present in b.txt"} }' b.txt a.txt 
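
If only a yes/no result is needed rather than a per-line report, the same awk idea can be reduced to an exit status. This is a sketch under assumptions: the function name all_present is hypothetical, and it tests FILENAME == ARGV[1] instead of FNR==NR so that an empty b.txt is not confused with a.txt:

```shell
# all_present A B (hypothetical name):
# succeeds if every line of A occurs somewhere in B.
all_present() {
    awk '
        # First file on the command line (B): remember each line.
        FILENAME == ARGV[1] { seen[$0]; next }
        # Second file (A): stop at the first line that B lacks.
        !($0 in seen)       { missing = 1; exit }
        # Exit status 0 if nothing was missing, 1 otherwise.
        END                 { exit missing + 0 }
    ' "$2" "$1"
}
```

Repeated lines in a.txt count as present as long as the line occurs at least once in b.txt.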

1 Comment

If there are multiple of the same number this may give misleading results.
0

A Perl solution:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use List::Compare;

    # read file a.txt into an array, one element per line
    my @atxt;
    open(my $fh, "<", "a.txt") or die $!;
    while (<$fh>) {
        push @atxt, $_;
    }
    close($fh);

    # read file b.txt into an array, one element per line
    my @btxt;
    open(my $fh2, "<", "b.txt") or die $!;
    while (<$fh2>) {
        push @btxt, $_;
    }
    close($fh2);

    my $lc = List::Compare->new(\@atxt, \@btxt);

    print $lc->get_intersection;   # lines found in both files
    print $lc->get_union;          # lines found in either file
    print $lc->get_unique;         # lines found only in a.txt
    print $lc->get_complement;     # lines found only in b.txt

There are many more options, check out the documentation: http://search.cpan.org/~jkeenan/List-Compare-0.39/lib/List/Compare.pm

0

A file containing another file would imply the whole content of a.txt being present within b.txt in the same order, including possible repetitions, whereas your final question:

How could I check if all the lines in file a.txt are included in b.txt?

implies that the order and repetitions are irrelevant. As a simple example:

    a.txt:
    5
    7
    3

    b.txt:
    9
    5
    3
    7

satisfies your quoted question, but not the one in the title.

Solving the quoted problem is significantly easier, provided the container file is not huge (otherwise a straightforward approach like the one I'll demonstrate below will run into memory problems). A simple solution is to build a set of all the numbers contained in b.txt, then iterate through a.txt and return false as soon as an item is not found in that set. If the iteration over a.txt finishes without that happening, return true.

This would look as follows in pseudo-code:

    ContentSet = {}
    for each element b of b.txt
        add b into ContentSet
    for each element a of a.txt
        if a is not in ContentSet then
            return false
    return true

This approach has two advantages. The first iteration gets rid of any repetitions in the container file, which keeps the set as small as possible. The second iteration runs faster than a naive line-by-line search, provided the set has a good hash implementation, because checking whether a hash set contains a given element is an O(1) operation on average.
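
In a shell, the same two-pass idea (load b.txt as a collection of lines, then test each line of a.txt against it) can be approximated with grep's fixed-string, whole-line matching. This is only a sketch: the function name contains_all is hypothetical, and how grep stores the patterns internally is an implementation detail, not necessarily a hash set:

```shell
# contains_all A B (hypothetical name):
# succeeds if every line of A is also a line of B.
contains_all() {
    local status=0
    # -F: treat patterns as fixed strings, -x: match whole lines only,
    # -f B: read the patterns from B, -v: select lines of A matching none,
    # -q: print nothing, just report through the exit status.
    grep -Fxvq -f "$2" "$1" || status=$?
    case "$status" in
        0) return 1 ;;   # grep selected a line of A that B lacks
        1) return 0 ;;   # no such line: A is contained in B
        *) return 2 ;;   # grep itself failed (e.g. unreadable file)
    esac
}
```

As with the pseudo-code, order and repetitions are ignored: only membership of each line matters.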
