[on hold] dedup ncbi segments #93

Draft
jameshadfield wants to merge 3 commits into master from james/dedup-ncbi-segments

@jameshadfield
Member

WIP - here for discussion with @joverlee521

The only "disagreements" (which I haven't yet resolved) are a handful of strains which have multiple sequences for (all) segments. So that's reassuring!

The phylo workflow hasn't yet been updated to use the new metadata format.

DAG is a bit simpler (before: above, after: below):

[image: workflow DAG, before (above) and after (below)]

@jameshadfield
Member Author

jameshadfield commented Oct 7, 2024

Here are the 3 (yes, only 3) strains which were dropped:

Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb2. Accessions: PP761255, PP761574. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pb1. Accessions: PP761260, PP761572. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment pa. Accessions: PP761262, PP761577. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ha. Accessions: PP761257, PP761548, PP761557, PP761576. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment np. Accessions: PP761261, PP761550, PP761553, PP761571. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment na. Accessions: PP761256, PP761552, PP761555, PP761578. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment mp. Accessions: PP761259, PP761551, PP761554, PP761573. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had multiple accessions for segment ns. Accessions: PP761258, PP761549, PP761556, PP761575. Skipping this segment.
Strain 'A/redheadduck/NorthCarolina/W24-83A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb2. Accessions: PP761569, PP766982. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pb1. Accessions: PP761570, PP766984. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment pa. Accessions: PP761563, PP766987. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ha. Accessions: PP761566, PP766985. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment np. Accessions: PP761567, PP766983. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment na. Accessions: PP761568, PP766981. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment mp. Accessions: PP761564, PP766980. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had multiple accessions for segment ns. Accessions: PP761565, PP766986. Skipping this segment.
Strain 'A/Canadagoose/NorthCarolina/W24-90A/2024' had zero or multiple accessions for all segments. Dropping this entire strain.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb2. Accessions: PP862906, PQ367318. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pb1. Accessions: PP862905, PQ367313. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment pa. Accessions: PP862901, PQ367316. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ha. Accessions: PP862902, PQ367314. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment np. Accessions: PP862907, PQ367312. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment na. Accessions: PP862903, PQ367315. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment mp. Accessions: PP862904, PQ367317. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had multiple accessions for segment ns. Accessions: PP862908, PQ367311. Skipping this segment.
Strain 'A/sanderling/Virginia/W24-190K/2024' had zero or multiple accessions for all segments. Dropping this entire strain.
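
For anyone reading along, the rule implied by the log output above can be sketched roughly as follows. This is not the code in this PR, just a minimal illustration: the function name `dedup_segments`, the `(strain, segment, accession)` input shape, and the `SEGMENTS` list are all assumptions made for the example. Only the log message wording is taken from the output above.

```python
from collections import defaultdict

# Segment names as they appear in the log output above.
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "mp", "ns"]


def dedup_segments(records):
    """Hypothetical helper: `records` is an iterable of
    (strain, segment, accession) tuples.

    Returns (kept, messages), where `kept` maps each retained strain to
    a {segment: accession} dict and `messages` lists the warnings.
    """
    # Group accessions by strain, then by segment.
    by_strain = defaultdict(lambda: defaultdict(list))
    for strain, segment, accession in records:
        by_strain[strain][segment].append(accession)

    kept = {}
    messages = []
    for strain, segments in by_strain.items():
        unique = {}
        for segment in SEGMENTS:
            accessions = sorted(segments.get(segment, []))
            if len(accessions) == 1:
                # Exactly one accession: the segment is unambiguous.
                unique[segment] = accessions[0]
            elif len(accessions) > 1:
                # Ambiguous segment: skip it rather than guess.
                messages.append(
                    f"Strain '{strain}' had multiple accessions for segment {segment}. "
                    f"Accessions: {', '.join(accessions)}. Skipping this segment."
                )
        if unique:
            kept[strain] = unique
        else:
            # No segment had exactly one accession: drop the whole strain.
            messages.append(
                f"Strain '{strain}' had zero or multiple accessions for all segments. "
                "Dropping this entire strain."
            )
    return kept, messages
```

Under this sketch a strain survives as long as at least one segment has exactly one accession, which would be consistent with only the three fully duplicated strains being dropped.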

@joverlee521 and I discussed this today. We're going to leave this PR open for the moment and explore NCBI's new API in #82, which promises to group segments together, and then compare those results to ours from this PR.

@jameshadfield changed the title from "James/dedup ncbi segments" to "[on hold] dedup ncbi segments" on Oct 20, 2024