I have two lists of files with their md5sum checksums. The same files appear in both lists, but under different paths.

Example content of the first checksum file (server.list):

    2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz
    c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz/
    6e6bcd84f264233cf7c428c0cfdc0c03 tmp/fastq1_L002_R1_001.fastq.gz

Example content of the second checksum file (downloaded.list):

    2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz
    c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz
    6e6bcd84f264233cf7c428c0cfdc0c03 /home/projects/fastq1_L002_R1_001.fastq.gz

When I run the following command, I get this output:

    $ awk -F"/" 'FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' downloaded.list server.list
    fastq1_L001_R1_001.fastq.gz has a different md5sum
    fastq1_L001_R2_001.fastq.gz has a different md5sum
    fastq1_L002_R2_001.fastq.gz has a different md5sum

Why am I getting these messages when the first column is the same in both files? Can someone enlighten me on this issue?

Edit:

If I remove the path and leave only the file name, it works just fine.

Edit 2:

As pointed out, file paths can also take another form that does not start with /. In that case, I cannot use / as the field separator.

  • I created the two files using your sample data (including the paths), and ran your awk script. I (correctly) get no output. Commented Mar 2, 2022 at 22:07
  • @pmf thanks for the answer! I really wish to understand why it's showing the output... Do I have a bug? If so, where? Anyway, I think I can fairly assume that I don't have corrupted files, right? Commented Mar 2, 2022 at 22:33
  • Is the whitespace between the md5 and the file path the same between the two files? Using -F"/" makes it treat that as part of $1. Commented Mar 2, 2022 at 22:34
  • @RamonTCarmo Try the same as I did. Create new files with the data copied from here (not from your actual files), and see what happens. If there's no output, it must be with your actual files. Commented Mar 2, 2022 at 22:38
  • Thanks for your answers! As you guys noted, it was the whitespace! Just tried the code provided by @LMC and it worked just fine! I'm still learning how to use awk. Many thanks, you all! Have a nice week :) Commented Mar 2, 2022 at 22:59

2 Answers

Assumptions:

  • filename (sans path) and md5sum have to match
  • filenames may not be listed in the same order
  • filenames may not exist in both files

Sample data:

    $ head downloaded.list server.list
    ==> downloaded.list <==
    2c03ff18a643a1437ec0cf051b8b7b9d /home/projects/fastq1_L001_R1_001.fastq.gz  # match
    YYYYf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R5_911.fastq.gz  # different md5sum
    c430f587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R2_001.fastq.gz  # match
    MNOPf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R8_abc.fastq.gz  # filename does not exist in other file
    ABCDf587aba1aa9f4fdf69aeb4526621 /home/projects/fastq1_L001_R9_004.fastq.gz  # different filename but matching md5sum (vs last line of other file)

    ==> server.list <==
    2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz  # match
    c430f587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R2_001.fastq.gz  # match
    XXXXf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R5_911.fastq.gz  # different md5sum
    TUVWff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L999_R6_922.fastq.gz  # filename does not exist in other file
    ABCDf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R7_933.fastq.gz  # different filename but matching md5sum (vs last line of other file)

One awk idea to address white space issues as well as verifying filename matches:

    awk '
    # stick with default field delimiter of white space but ...
    { md5sum=$1
      n=split($2,arr,"/")                 # split 2nd field on "/" delimiter
      fname=arr[n]
      if (FNR==NR)
         filearray[fname]=md5sum
      else {
         if (fname in filearray && filearray[fname] == $1) next
         printf "%s has a different md5sum\n",fname
      }
    }
    ' downloaded.list server.list

This generates:

    fastq1_L001_R5_911.fastq.gz has a different md5sum
    fastq1_L999_R6_922.fastq.gz has a different md5sum
    fastq1_L001_R7_933.fastq.gz has a different md5sum
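
A possible variation on the script above, sketched under the same assumptions: it tells a filename that is absent from the first list apart from one whose checksum differs, and it also copes with relative paths such as tmp/... because the filename is taken to be whatever follows the last "/". The message wording, the temporary directory, the variable names, and the trimmed-down sample lines are illustrative choices of mine, and the sketch assumes paths contain no whitespace.

```shell
# Scratch copies of a few sample lines in a temporary directory (hypothetical setup).
tmpdir=$(mktemp -d)
printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz' \
  'YYYYf587aba1aa9f4fdf69aeb4526621  /home/projects/fastq1_L001_R5_911.fastq.gz' \
  > "$tmpdir/downloaded.list"
printf '%s\n' \
  '2c03ff18a643a1437ec0cf051b8b7b9d tmp/fastq1_L001_R1_001.fastq.gz' \
  'XXXXf587aba1aa9f4fdf69aeb4526621 /tmp/fastq1_L001_R5_911.fastq.gz' \
  'TUVWff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L999_R6_922.fastq.gz' \
  > "$tmpdir/server.list"

report=$(awk '
{
    fname = $2
    sub(/.*\//, "", fname)            # drop everything up to the last "/"
}
FNR == NR { sum[fname] = $1; next }   # first file: remember checksum per filename
!(fname in sum) { printf "%s is missing from the first list\n", fname; next }
# appending "" forces a string comparison of the two checksums
(sum[fname] "") != ($1 "") { printf "%s has a different md5sum\n", fname }
' "$tmpdir/downloaded.list" "$tmpdir/server.list")

printf '%s\n' "$report"
rm -rf "$tmpdir"
```

Because the fields are split on the default whitespace delimiter, one space versus two spaces after the checksum makes no difference here.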

1 Comment

Thanks for the explanation, @markp-fuso! Never occurred to me that the whitespace would be an issue! I'm still learning how to properly use awk.

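With -F"/", $1 is everything before the first /, so the whitespace after the checksum becomes part of the array key and causes a mismatch whenever it differs between the two files. Removing it:

    awk -F"/" '{gsub(/ /, "", $1)}; FNR==NR{filearray[$1]=$NF; next }!($1 in filearray){printf "%s has a different md5sum\n",$NF}' list1.txt list2.txt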
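
To see the effect, here is a small sketch that prints $1 between brackets for two hypothetical lines that differ only in the amount of whitespace after the checksum. The variable names and the two sample lines are my own, chosen for illustration.

```shell
# Hypothetical inputs: same checksum, one space vs. two spaces before the path.
one_space='2c03ff18a643a1437ec0cf051b8b7b9d /tmp/fastq1_L001_R1_001.fastq.gz'
two_spaces='2c03ff18a643a1437ec0cf051b8b7b9d  /home/projects/fastq1_L001_R1_001.fastq.gz'

# With -F"/", $1 is everything before the first "/", whitespace included.
key_a=$(printf '%s\n' "$one_space"  | awk -F"/" '{printf "[%s]", $1}')
key_b=$(printf '%s\n' "$two_spaces" | awk -F"/" '{printf "[%s]", $1}')

printf '%s\n%s\n' "$key_a" "$key_b"
```

The two bracketed keys differ in their trailing spaces, which is why the `$1 in filearray` lookup fails even though the checksums themselves are identical.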

1 Comment

@EdMorton thanks for pointing that out. Added a fix for that.
