Skip to content

[ingest] general curation fixes#267

Merged
jameshadfield merged 15 commits into
masterfrom
james/ingest-use-epi-isl-as-key
Dec 7, 2025
Merged

[ingest] general curation fixes#267
jameshadfield merged 15 commits into
masterfrom
james/ingest-use-epi-isl-as-key

Conversation

@jameshadfield
Copy link
Copy Markdown
Member

@jameshadfield jameshadfield commented Nov 26, 2025

This started out as an effort to move away from using strain names as the canonical ID for curation steps but turned into more of a grab-bag of fixes and changes. Please see the commit messages for full details. There's a few big-picture directions this PR (and subsequent PRs) aim for.

Each layer in the curation chain is intended to be self-contained where possible. For instance, any and all changes to lineages should happen in the ./scripts/standardize-lineage layer, including hardcoded annotations. This means we have more annotations TSVs, but errors/warnings printed by layers can be fixed at the source rather than later on. This allows us to address all the warnings in the curate logs and the majority of code-changes in this commit do just this, the end result being an empty curate log; this will eventually allow us to diff curated files to understand any curation changes.

While a motivating direction was to replace strain names with EPI ISLs, I ended up only doing this for the curate layers above ./scripts/standardize-strain-names. Specifically: lineage annotations & location annotations. Because strain names are so widely used in curation (prioritised strains, reference strains), titers, and phylo workflows (include / exclude files) we should invest in tooling which identifies when strain names change so that we can keep all these files in sync.

I made a start on curation docs, but they're very much a work in progress.

The next PR will build on this to address avian-flu focused curation changes.

@jameshadfield jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch 4 times, most recently from 096055e to 3cea5e7 Compare November 30, 2025 19:24
@jameshadfield

This comment was marked as outdated.

Comment thread ingest/rules/prepare_ndjson.smk Outdated
@jameshadfield jameshadfield changed the title [ingest] move towards using EPI ISL as the canonical ID [ingest] general curation fixes Dec 2, 2025
Comment thread ingest/defaults/lineages.tsv Outdated
Comment thread ingest/scripts/standardize-lineage
Comment thread ingest/rules/curate.smk
| ./scripts/parse-gisaid-location \
--location-field {params.gisaid_location_field:q} \
--strain-field {params.gisaid_strain_field:q} \
--annotations {input.location_annotations:q} \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, this has the downside of potentially getting overridden by geolocation rules downstream. I see you what you mean by needing to document the curation steps for geolocations...

Comment thread ingest/scripts/parse-gisaid-location
Comment thread ingest/rules/curate.smk
jameshadfield added a commit that referenced this pull request Dec 3, 2025
The '?' null value meant that our geographic rules weren't applied as
intended. For instance, we (Augur) have rules such as:

  1. `Asia/Bangladesh/Jashore/` →	`Asia/Bangladesh/Khulna/Jashore`
  2. `Asia/Bangladesh/Jashore/*` →	`Asia/Bangladesh/Khulna/*`

and our geographic fixes layer would produce values like
`Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to
this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore`
however the second rule would be run instead resulting in
`Asia/Bangladesh/Khulna/?`.

Originally I implemented a fix in Augur for this to allow geo-location
values with '?' to be appropriately parsed by `augur curate
apply-geolocation-rules` <nextstrain/augur#1929>
however on review we decided to go the other way and change the
seasonal-flu geographic null value to the empty string
<#267 (comment)>

**Summary of changes to (seasonal-flu) ingested data**

393,688 / 514,895 records changed

Value "?" → '' changes (very much as expected!):
* n = 4 (country)
* n = 102,408 (division)
* n = 389,522 (location)

Value "?" → valid geographic deme (via the rules doing their thing):
* n = 0 (country)
* n = 1,925 (division)
* n = 3,888 (location)

Value change from one deme to another (via the rules doing their thing):
* n = 280 (country)
* n = 8,904 (division)
* n = 0 (location)

**Some examples**

The example given above (`Asia/Bangladesh/Jashore/?`) results in 38
records having a location "?" → "Jashore"

97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
@jameshadfield jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch from 3cea5e7 to f042899 Compare December 3, 2025 01:23
Copy link
Copy Markdown
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest changes look good to me. I think fine to merge and continue improving the curation docs as necessary

Comment thread ingest/curation-docs.md Outdated
Comment thread ingest/curation-docs.md Outdated
This is useful when making changes to the curation steps. For instance,
assuming we have stored a good copy of the seasonal-flu.ndjson created
before we made changes we can diff the new one against this.

I plan to eventually write up a recommended workflow so that anyone can
confidently make code/curation changes, and this workflow will leverage
scripts such as this.
This forms part of a bigger direction to move away from strain names as
the canonical ID as they may change over time as our curation rules
change. The downside is that many of the strains are duplicated and thus
require multiple ISLs.
This starts the move towards applying annotations early and at the
relevant curate step rather than at the end. The benefits are that we
can provide warnings when unexpected situations arise safe in the
knowledge that there won't be a later step which corrects the record.

I've changed the format such that we now also update the vtype & subtype
which is expected to result in large number of changes to records. In
total 8,097 records are changed by this commit.

* n=8058 had their type & subtype changed from "tbd" to (e.g.) "a" &
  "h3n2". These changes were the only ones I expected when writing the
  code here

* n=14 had their subtype changed from "n/a" to either "h1n1pdm" or
  "h3n2". All of these records had raw info of "Subtype": "A",
  "Lineage": "" so these changes also make sense.

* n=25 had their subtype changed, e.g. "h3n2" → "h1n1". As an example, a
  record after curation with the previous code was:
  `{"vtype": "a", "subtype": "h3n2", "lineage": "h1n1pdm"}`
  and using this updated code is now:
  `{"vtype": "a", "subtype": "h1n1", "lineage": "h1n1pdm"}`
  These will be revisited in a subsequent commit as they flag up
  inconsistencies in the underlying annotations rather than the code
  changes introduced here.

In addition, annotations for 5 EPI ISLs are flagged as unnecessary
These were flagged as unnecessary by the changes in the previous commit
Recent changes to how we process lineage overrides (via hardcoded TSV)
also update the type & subtype accordingly. This flagged up 25 cases
where our previous curated NDJSON would have conflicting calls.

Working through an entry in the annotations tsv of "EPI_ISL_123
a/h1n1/h1n1pdm" as an example:

* The GISAID metadata indicated the strain was a/h3n2/h3n2
* Our previous approach overrode the lineage field only, so this example
  became a/h3n2/h1n1pdm
* Our new approach also updates the subtype, so the strain becomes
  a/h1n1/h1n1pdm

I looked at a handful of the HA segments using Nextclade's automatic
dataset-suggest functionality and in all cases the underlying GISAID
metadata was correct and our annotations were wrong. This commit thus
drops all 25 annotations.

Specifically, the new curated seasonal-flu NDJSON will have:
* n=16 lineages changed from "h1n1pdm" to "h3n2"
* n=5 lineages changed from "h3n2" to "h1n1pdm"
* n=4 records dropped as they are no longer classified as seasonal flu
  (n=3 h5n1 and n=1 h1n2, all of which were previously annotated as
  a/h1n1/h1n1pdm)
Makes searching through the log files much easier (e.g. with `grep`) and
also makes it easier to read through the log files as there's now a
clear mapping back to the part of the curate chain where the warning
came from.

This starts a pattern I'd like to roll out to all our Augur curate
scripts, but that's for another day.
The primary way data is analysed is by first subsetting to the subtype
or lineage of interest, and so setting these values for all records will
enable a suite of analyses which were previously not possible.

Previously (for A viruses only) we sometimes set the lineage to the
subtype (e.g. h3n2, h1n2) and not for others (e.g. h3n3, h5n1). We now
set the lineage to the subtype in all cases except for h1n1/pdm09
(subtype=h1n1, lineage=h1n1pdm09). This means we expand our previous
ignoring of the upstream Lineage="seasonal" to apply to any subtype
(previously we would ignore it for some subtypes and for others they
would be assigned "TBD" type/subtype/lineage).

Because the data is messy our subtypes & lineages are inherently messy.
This means we now have subtypes of "h3", "n1" etc. In practice, if you
wanted to (e.g.) fully analyse H3 strains then you would have to subset
by many different subtypes (e.g. h3, h3n1, h3n2...), and potentially
consider the type='A' records without any subtype assigned.

There are no changes to the output seasonal-flu ndjson, but the larger
curated ndjson has many changes. Beyond the updates to 87k records with
TBD type/subtype/lineage, we also change the lineage for many records
from 'n/a' to the subtype (see above).
The shifting of the `len(split_locations) < expected_num_locations`
earlier should speed things up a little as we're no longer unnecessarily
regex-matching on the strain name

By keeping the `Lineage` field unchanged (i.e. capital L) we keep this
field in the output NDJSON which allows better troubleshooting. This
will be dropped when we do the TSV fields subsetting.
In line with previous commits, this shifts the hardcoded annotation-
overrides to the relevant curate step which allows better user-facing
warnings.

While this allows us to drop the `apply-record-annotations` step
entirely, we keep it around (without any actual annotations) to allow
for hardcoded fixes in the future as needed.
The '?' null value meant that our geographic rules weren't applied as
intended. For instance, we (Augur) have rules such as:

  1. `Asia/Bangladesh/Jashore/` →	`Asia/Bangladesh/Khulna/Jashore`
  2. `Asia/Bangladesh/Jashore/*` →	`Asia/Bangladesh/Khulna/*`

and our geographic fixes layer would produce values like
`Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to
this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore`
however the second rule would be run instead resulting in
`Asia/Bangladesh/Khulna/?`.

Originally I implemented a fix in Augur for this to allow geo-location
values with '?' to be appropriately parsed by `augur curate
apply-geolocation-rules` <nextstrain/augur#1929>
however on review we decided to go the other way and change the
seasonal-flu geographic null value to the empty string
<#267 (comment)>

**Summary of changes to (seasonal-flu) ingested data**

393,688 / 514,895 records changed

Value "?" → '' changes (very much as expected!):
* n = 4 (country)
* n = 102,408 (division)
* n = 389,522 (location)

Value "?" → valid geographic deme (via the rules doing their thing):
* n = 0 (country)
* n = 1,925 (division)
* n = 3,888 (location)

Value change from one deme to another (via the rules doing their thing):
* n = 280 (country)
* n = 8,904 (division)
* n = 0 (location)

**Some examples**

The example given above (`Asia/Bangladesh/Jashore/?`) results in 38
records having a location "?" → "Jashore"

97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
Our parsing of gisaid Location fields expects 4 slash-separated fields,
however there are thousands of entries which have more than 4. Each of
these raises a warning however there's no way to remedy these records.
Instead of adding thousands of one-off annotations, this introduces
the concept of rules to apply to these Location fields.
The full influenza ingest now completes without any warnings.

Most changes are to non-seasonal-flu records, perhaps as expected. There
are only 168 record changes to seasonal-flu, all of which are
improvements.

As will be described in the upcomming docs commit, the process for
geographic curation is multi-layered and complex. For instance, the
`parse-gisaid-location` sets "Maricopa County AZ", which `augur curate
titlecase` chances to "Maricopa County Az", which we then change back to
"Maricopa County AZ" via a custom `geolocation_rules.tsv` entry. I'm
sure there are improvements to the process that we can make over time...
This flags up 79 duplicate (but conflicting) strain name fixes and 3
location fixes.
While the files are being fetched from fauna (and those files are
changing ~every week) it's not practical to fix the underlying issues.
We should re-enable these warnings once the files are in this repo and
we can fix them. (P.S. I don't want to keep the warnings around in the
logs as I want any warning shown to be fixable/fixed.)
@jameshadfield jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch from f042899 to c1958ac Compare December 7, 2025 22:04
@jameshadfield jameshadfield merged commit 70877f2 into master Dec 7, 2025
3 checks passed
@jameshadfield jameshadfield deleted the james/ingest-use-epi-isl-as-key branch December 7, 2025 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants