How to find and count matching data for an ID column in two files?

Question

I've got two genetic datasets which have matching chromosome position IDs. I want to count how many times file 1's chromosome position IDs appear in file 2.

For example my data looks like:

File 1 (chromosome position is actually my 125th column, implied by the ...):

Gene  pval   ...  Chromosome position ID
ACE   0.002  ...  01:3290834_CT_C_1
NOS   0.01   ...  03:3304593_GA_G_1
BRCA  0.004  ...  06:6265733_GA_G_1
CYP3  0.34   ...  09:9433933_GA_G_1

File 2 (chromosome position is my first column):

Chromosome position ID  Gene  pval
01:1243933_GA_G_1       ACE   0.002
03:3304593_GA_G_1       NOS   0.01
06:6265733_GA_G_1       BRCA  0.004
09:9433933_GA_G_1       CYP3  0.34

I've found a lot of questions giving extraction of matching lines, and applied code based off those questions, but I just want to get the count of matching chromosome positions between 2 files.

Currently I'm using:

awk -F'|' 'NR==FNR{c[$125]++;next};c[$125]' file2.csv file1.txt > file3.txt
wc -l file1.txt
wc -l file3.txt

The line counts for files 1 and 3 don't match as I expected (I expect all of file 1 to be in file 2). To be sure what's going on, I need a way to count the matching rows in the chromosome position column. If I can find a way to code 'do all of file 1's chromosome positions appear in file 3?' that would be ideal, but even just a count works for now.

So the output would be a number: how many of file 1's chromosome positions (column $125) also appear in file 2's chromosome position column ($1).

I am using Linux.

Why are you using -F'|'? Your files don't even contain any |, at least none that you show. What defines a field in these files?
– terdon ♦, Commented Jan 31, 2020 at 11:15
Hi, thank you for this, I will remove it. I'm new to Linux and just trying to piece this command together based on what I'm finding online.
– DN1, Commented Jan 31, 2020 at 11:17
Ah, I see. The -F sets the field delimiter, so when you use -F'|', awk expects data separated into fields by a |, like foo|bar|baz. What separates the fields in your files? Is it tabs? Spaces? Something else?
– terdon ♦, Commented Jan 31, 2020 at 11:18
Some good example input there, but some example output would be useful too, to aid those trying to answer.
– steve, Commented Jan 31, 2020 at 11:21
Also, could you please edit your question and make the title match the content? Your title is asking for something completely different (replacing a character), but your question doesn't seem to involve any replacement at all.
– terdon ♦, Commented Jan 31, 2020 at 11:23

Paulo Tomé · Accepted Answer · 2020-01-31 11:37:09Z

2

A solution with awk, tail, sort, join and wc.

join <(awk -F '\t' '{print $125}' file1 | tail -n +2 | sort) <(awk -F '\t' '{print $1}' file2 | tail -n +2 | sort) | wc -l
3

Explanation.

This solution assumes that the columns are tab-separated. awk extracts the 125th column of file1 and the first column of file2. tail -n +2 drops the first (header) line of each extracted column. sort is mandatory because join expects sorted input. The resulting intersection is piped to wc -l, which prints its number of lines.
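The steps above can be tried on a small sample. This is only a sketch under stated assumptions: the two files are hypothetical tab-separated stand-ins for the question's data, and the ID sits in column 3 rather than column 125. It swaps join for comm, which also lists the file1 IDs that are absent from file2, and uses sort -u so that a repeated ID is counted once.

```shell
export LC_ALL=C   # keep sort and comm on the same collation order
tmp=$(mktemp -d)

# Hypothetical sample data: tab-separated, header on the first line.
# The ID is column 3 here (it would be column 125 in the real file1).
printf 'Gene\tpval\tID\nACE\t0.002\t01:3290834_CT_C_1\nNOS\t0.01\t03:3304593_GA_G_1\nBRCA\t0.004\t06:6265733_GA_G_1\nCYP3\t0.34\t09:9433933_GA_G_1\n' > "$tmp/file1"
printf 'ID\tGene\tpval\n01:1243933_GA_G_1\tACE\t0.002\n03:3304593_GA_G_1\tNOS\t0.01\n06:6265733_GA_G_1\tBRCA\t0.004\n09:9433933_GA_G_1\tCYP3\t0.34\n' > "$tmp/file2"

# Extract the ID column, skip the header, then sort and de-duplicate.
awk -F '\t' 'NR > 1 {print $3}' "$tmp/file1" | sort -u > "$tmp/ids1"
awk -F '\t' 'NR > 1 {print $1}' "$tmp/file2" | sort -u > "$tmp/ids2"

# comm -12: lines common to both lists; comm -23: lines only in the first.
matched=$(comm -12 "$tmp/ids1" "$tmp/ids2" | wc -l | tr -d ' ')
missing=$(comm -23 "$tmp/ids1" "$tmp/ids2")

echo "matched: $matched"
echo "missing from file2: $missing"
rm -rf "$tmp"
```

If the second line of output is empty, every file1 ID was found in file2.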

edited Jan 31, 2020 at 11:37

answered Jan 31, 2020 at 11:27

Paulo Tomé


Thank you, is the number 3 at the end meant to be a new file?
– DN1, Commented Jan 31, 2020 at 11:59
In the problem statement you said: "the output would be a number counting how many times chromosome position column $125 in file 1 has matches also with file 2 chromosome position column $1."
– Paulo Tomé, Commented Jan 31, 2020 at 12:07
To see the resulting intersecting set, remove | wc -l from the command.
– Paulo Tomé, Commented Jan 31, 2020 at 12:20
Thank you, this seems to work for me. Oddly it gives back 0 matches, which shouldn't be possible, but I'm wondering if this isn't a code problem anymore. One file is a text file and one is a CSV; would that be an issue?
– DN1, Commented Jan 31, 2020 at 13:00
If the files have different separators, that will be an issue. For the CSV file, change the awk separator to -F ','.
– Paulo Tomé, Commented Jan 31, 2020 at 13:03
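A sketch of what mixed separators could look like in practice, under assumptions the thread does not confirm: the file names, the two-column layout, and the Windows-style (CRLF) line endings in the text file are all hypothetical. A trailing carriage return on the last column is one possible reason for zero matches, so the sketch strips it before comparing.

```shell
export LC_ALL=C   # join expects input sorted in the current collation order
tmp=$(mktemp -d)

# Hypothetical inputs: file1.txt is tab-separated with CRLF line endings,
# file2.csv is comma-separated with plain LF line endings.
printf 'Gene\tID\r\nNOS\t03:3304593_GA_G_1\r\nBRCA\t06:6265733_GA_G_1\r\n' > "$tmp/file1.txt"
printf 'ID,Gene\n03:3304593_GA_G_1,NOS\n06:6265733_GA_G_1,BRCA\n' > "$tmp/file2.csv"

# One field separator per file; sub() removes a trailing carriage return if present.
awk -F '\t' 'NR > 1 {sub(/\r$/, "", $2); print $2}' "$tmp/file1.txt" | sort > "$tmp/ids1"
awk -F ','  'NR > 1 {sub(/\r$/, "", $1); print $1}' "$tmp/file2.csv" | sort > "$tmp/ids2"

count=$(join "$tmp/ids1" "$tmp/ids2" | wc -l | tr -d ' ')
echo "$count"
rm -rf "$tmp"
```

Without the sub() call, the IDs from file1.txt would carry an invisible carriage return and none of them would match in this sample.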


RudiC · Accepted Answer · 2020-01-31 11:21:05Z

1

You're close. Try

awk 'FNR == 1 {next}; FNR==NR {P[$125]; next} $1 in P {P[$1]++} END {for (p in P) print p, P[p]+0}' file[12]
03:3304593_GA_G_1 1
01:3290834_CT_C_1 0
09:9433933_GA_G_1 1
06:6265733_GA_G_1 1

Obviously, not all positions in file1 are found in file2.
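To turn the per-ID listing into a single answer to "are all file1 positions in file2?", the END block can tally found and missing IDs instead of printing each one. This is a sketch on hypothetical whitespace-separated sample files, with the ID in column 3 of file1 rather than column 125.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data, whitespace-separated, header on the first line.
printf 'Gene pval ID\nACE 0.002 01:3290834_CT_C_1\nNOS 0.01 03:3304593_GA_G_1\nBRCA 0.004 06:6265733_GA_G_1\nCYP3 0.34 09:9433933_GA_G_1\n' > "$tmp/file1"
printf 'ID Gene pval\n01:1243933_GA_G_1 ACE 0.002\n03:3304593_GA_G_1 NOS 0.01\n06:6265733_GA_G_1 BRCA 0.004\n09:9433933_GA_G_1 CYP3 0.34\n' > "$tmp/file2"

summary=$(awk '
    FNR == 1  { next }            # skip the header of each file
    FNR == NR { P[$3]; next }     # first file: remember every ID
    $1 in P   { P[$1]++ }         # second file: count hits for known IDs
    END {
        for (p in P) {
            if (P[p] > 0) found++
            else missing++
        }
        print found + 0, missing + 0   # "+ 0" forces a number even if unset
    }' "$tmp/file1" "$tmp/file2")

echo "$summary"   # first number: file1 IDs found in file2; second: not found
rm -rf "$tmp"
```

A second number of 0 would mean every file1 position appears in file2.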

answered Jan 31, 2020 at 11:21

RudiC


Thank you. Does file[12] stand for both file 1 and file 2? My actual file names are different, but I can rename them if it works like this.
– DN1, Commented Jan 31, 2020 at 11:59
file[12] is expanded by the shell to file1 file2, and awk processes the files in that order.
– RudiC, Commented Jan 31, 2020 at 13:52
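The expansion is easy to check with echo. A small sketch, using empty hypothetical files in a scratch directory:

```shell
tmp=$(mktemp -d)
touch "$tmp/file1" "$tmp/file2" "$tmp/file3"

# [12] matches exactly one character from the set {1, 2}, so file3 is excluded.
# The shell sorts the matches, which is why file1 comes before file2.
expanded=$(cd "$tmp" && echo file[12])
echo "$expanded"
rm -rf "$tmp"
```

With differently named files there is no need to rename anything: list the two names explicitly, in the order awk should read them.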
