1

I have two files: one is a very big file (249430 rows) and the other is much smaller (500 rows).

The first file has the following five columns plus some additional columns (the first five columns are the same as in the second file), such as:

#CHROM POS      ID        REF ALT QUAL INFO
chr2   32424454 rs4576493 T   G   pass ......
chr8   35578788 rs3686678 C   A   pass .........
chr8   35578788 rs3686678 C   CCG pass .........
chrx   35578788 rs3686678 C   CCG pass .........

The second file has only five columns, such as:

#CHROM POS      ID        REF ALT
chr2   32424454 rs4576493 T   G
chr8   35578788 rs3686678 C   CCG

I want to compare the second file with the first file on all five columns, then save only the rows that appear in both files (while keeping all the columns of file 1).

So the final file that I want looks like this:

#CHROM POS      ID        REF ALT QUAL INFO
chr2   32424454 rs4576493 T   G   pass ......
chr8   35578788 rs3686678 C   CCG pass .........

How can I do this in Unix, please? Thanks.

2
  • grep -F -f file2 file1? Commented Mar 2, 2023 at 23:26
  • @Cyrus that would match chr2 32424454 rs4576493 T C with chr2 32424454 rs4576493 T CCG , etc. Using grep instead of awk for these kinds of things almost always works for some sample input and then fails later on the real input. Commented Mar 3, 2023 at 14:51
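The substring problem described in the comment above can be reproduced with a minimal sketch. The two-line `file1` and one-line `file2` below are made-up test data (not the asker's real files), written to a temporary directory; the variable names `grep_hits` and `awk_hits` are only illustrative.

```shell
# Work in a throwaway directory so no real files are touched.
tmp=$(mktemp -d)

# file1: two records that differ only in ALT (G vs. GCC); made-up data.
printf '%s\n' \
  'chr2 32424454 rs4576493 T G pass x' \
  'chr2 32424454 rs4576493 T GCC pass y' > "$tmp/file1"

# file2: only the "T G" variant is wanted.
printf '%s\n' 'chr2 32424454 rs4576493 T G' > "$tmp/file2"

# Fixed-string substring matching: the pattern is also a prefix
# of the "T GCC" line, so both lines of file1 are counted.
grep_hits=$(grep -c -F -f "$tmp/file2" "$tmp/file1")

# Field-by-field comparison: only the exact five-column match is counted.
awk_hits=$(awk '{ key = $1 FS $2 FS $3 FS $4 FS $5 }
                NR==FNR { a[key]; next }
                key in a { n++ }
                END { print n+0 }' "$tmp/file2" "$tmp/file1")

echo "grep matched $grep_hits line(s), awk matched $awk_hits line(s)"
```

On this sample, grep reports one more match than awk, because it compares text rather than whole fields.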

2 Answers

1

Using any awk:

awk '
    { key = $1 FS $2 FS $3 FS $4 FS $5 }
    NR==FNR { a[key]; next }
    key in a
' file2 file1
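To see how this works, here is a small self-contained run on the sample rows from the question, assuming whitespace-separated fields. The temporary directory and the `result` file name are assumptions made for the demonstration. While reading `file2` (where `NR==FNR` is true) awk only stores each five-column key; while reading `file1` it prints the whole line if its key was stored.

```shell
tmp=$(mktemp -d)

# file1: the large file (sample rows from the question; the dots are
# the elided columns exactly as they appear there).
cat > "$tmp/file1" <<'EOF'
#CHROM POS ID REF ALT QUAL INFO
chr2 32424454 rs4576493 T G pass ......
chr8 35578788 rs3686678 C A pass .........
chr8 35578788 rs3686678 C CCG pass .........
chrx 35578788 rs3686678 C CCG pass .........
EOF

# file2: the small file holding the five key columns.
cat > "$tmp/file2" <<'EOF'
#CHROM POS ID REF ALT
chr2 32424454 rs4576493 T G
chr8 35578788 rs3686678 C CCG
EOF

awk '
    { key = $1 FS $2 FS $3 FS $4 FS $5 }   # build the 5-column key for every line
    NR==FNR { a[key]; next }               # first file read (file2): remember the key
    key in a                               # second file read (file1): print if key was seen
' "$tmp/file2" "$tmp/file1" > "$tmp/result"

cat "$tmp/result"
```

The header line is kept as well, because its first five fields are identical in both files. Only the 500 keys of the small file are held in memory, so the size of the big file should not matter much.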
0

Assuming that both files are TSV files, i.e., tab-delimited, you may use Miller (mlr; a tool specifically developed for processing structured data) to perform a relational INNER JOIN operation between the two data sets using the five fields that you mentioned:

$ mlr --tsv join -f firstfile -j '#CHROM,POS,ID,REF,ALT' secondfile
#CHROM POS      ID        REF ALT QUAL INFO
chr2   32424454 rs4576493 T   G   pass ......
chr8   35578788 rs3686678 C   CCG pass .........

If your data uses multiple spaces in place of single tabs, use --pprint in place of --tsv to signal that you want both input and output to be "pretty-printed". Use --p2t (or --ipprint and --otsv) to read the input in "pretty-printed" format but write the output as TSV.
