[ingest] general curation fixes by jameshadfield · Pull Request #267 · nextstrain/seasonal-flu

jameshadfield · 2025-11-26T03:58:33Z

This started out as an effort to move away from using strain names as the canonical ID for curation steps but turned into more of a grab-bag of fixes and changes. Please see the commit messages for full details. There's a few big-picture directions this PR (and subsequent PRs) aim for.

Each layer in the curation chain is intended to be self-contained where possible. For instance, any and all changes to lineages should happen in the ./scripts/standardize-lineage layer, including hardcoded annotations. This means we have more annotations TSVs, but errors/warnings printed by layers can be fixed at the source rather than later on. This allows us to address all the warnings in the curate logs and the majority of code-changes in this commit do just this, the end result being an empty curate log; this will eventually allow us to diff curated files to understand any curation changes.

While a motivating direction was to replace strain names with EPI ISLs, I ended up only doing this for the curate layers above ./scripts/standardize-strain-names. Specifically: lineage annotations & location annotations. Because strain names are so widely used in curation (prioritised strains, reference strains), titers, and phylo workflows (include / exclude files) we should invest in tooling which identifies when strain names change so that we can keep all these files in sync.

I made a start on curation docs, but they're very much a work in progress.

The next PR will build on this to address avian-flu focused curation changes.

joverlee521 · 2025-12-02T21:59:27Z

            | ./scripts/parse-gisaid-location \
                --location-field {params.gisaid_location_field:q} \
                --strain-field {params.gisaid_strain_field:q} \
+                --annotations {input.location_annotations:q} \


Ah, this has the downside of potentially getting overridden by geolocation rules downstream. I see you what you mean by needing to document the curation steps for geolocations...

The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"

joverlee521

Latest changes look good to me. I think fine to merge and continue improving the curation docs as necessary

This is useful when making changes to the curation steps. For instance, assuming we have stored a good copy of the seasonal-flu.ndjson created before we made changes we can diff the new one against this. I plan to eventually write up a recommended workflow so that anyone can confidently make code/curation changes, and this workflow will leverage scripts such as this.

This forms part of a bigger direction to move away from strain names as the canonical ID as they may change over time as our curation rules change. The downside is that many of the strains are duplicated and thus require multiple ISLs.

This starts the move towards applying annotations early and at the relevant curate step rather than at the end. The benefits are that we can provide warnings when unexpected situations arise safe in the knowledge that there won't be a later step which corrects the record. I've changed the format such that we now also update the vtype & subtype which is expected to result in large number of changes to records. In total 8,097 records are changed by this commit. * n=8058 had their type & subtype changed from "tbd" to (e.g.) "a" & "h3n2". These changes were the only ones I expected when writing the code here * n=14 had their subtype changed from "n/a" to either "h1n1pdm" or "h3n2". All of these records had raw info of "Subtype": "A", "Lineage": "" so these changes also make sense. * n=25 had their subtype changed, e.g. "h3n2" → "h1n1". As an example, a record after curation with the previous code was: `{"vtype": "a", "subtype": "h3n2", "lineage": "h1n1pdm"}` and using this updated code is now: `{"vtype": "a", "subtype": "h1n1", "lineage": "h1n1pdm"}` These will be revisited in a subsequent commit as they flag up inconsistencies in the underlying annotations rather than the code changes introduced here. In addition, annotations for 5 EPI ISLs are flagged as unnecessary

These were flagged as unnecessary by the changes in the previous commit

Recent changes to how we process lineage overrides (via hardcoded TSV) also update the type & subtype accordingly. This flagged up 25 cases where our previous curated NDJSON would have conflicting calls. Working through an entry in the annotations tsv of "EPI_ISL_123 a/h1n1/h1n1pdm" as an example: * The GISAID metadata indicated the strain was a/h3n2/h3n2 * Our previous approach overrode the lineage field only, so this example became a/h3n2/h1n1pdm * Our new approach also updates the subtype, so the strain becomes a/h1n1/h1n1pdm I looked at a handful of the HA segments using Nextclade's automatic dataset-suggest functionality and in all cases the underlying GISAID metadata was correct and our annotations were wrong. This commit thus drops all 25 annotations. Specifically, the new curated seasonal-flu NDJSON will have: * n=16 lineages changed from "h1n1pdm" to "h3n2" * n=5 lineages changed from "h3n2" to "h1n1pdm" * n=4 records dropped as they are no longer classified as seasonal flu (n=3 h5n1 and n=1 h1n2, all of which were previously annotated as a/h1n1/h1n1pdm)

Makes searching through the log files much easier (e.g. with `grep`) and also makes it easier to read through the log files as there's now a clear mapping back to the part of the curate chain where the warning came from. This starts a pattern I'd like to roll out to all our Augur curate scripts, but that's for another day.

The primary way data is analysed is by first subsetting to the subtype or lineage of interest, and so setting these values for all records will enable a suite of analyses which were previously not possible. Previously (for A viruses only) we sometimes set the lineage to the subtype (e.g. h3n2, h1n2) and not for others (e.g. h3n3, h5n1). We now set the lineage to the subtype in all cases except for h1n1/pdm09 (subtype=h1n1, lineage=h1n1pdm09). This means we expand our previous ignoring of the upstream Lineage="seasonal" to apply to any subtype (previously we would ignore it for some subtypes and for others they would be assigned "TBD" type/subtype/lineage). Because the data is messy our subtypes & lineages are inherently messy. This means we now have subtypes of "h3", "n1" etc. In practice, if you wanted to (e.g.) fully analyse H3 strains then you would have to subset by many different subtypes (e.g. h3, h3n1, h3n2...), and potentially consider the type='A' records without any subtype assigned. There are no changes to the output seasonal-flu ndjson, but the larger curated ndjson has many changes. Beyond the updates to 87k records with TBD type/subtype/lineage, we also change the lineage for many records from 'n/a' to the subtype (see above).

The shifting of the `len(split_locations) < expected_num_locations` earlier should speed things up a little as we're no longer unnecessarily regex-matching on the strain name By keeping the `Lineage` field unchanged (i.e. capital L) we keep this field in the output NDJSON which allows better troubleshooting. This will be dropped when we do the TSV fields subsetting.

In line with previous commits, this shifts the hardcoded annotation- overrides to the relevant curate step which allows better user-facing warnings. While this allows us to drop the `apply-record-annotations` step entirely, we keep it around (without any actual annotations) to allow for hardcoded fixes in the future as needed.

The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"

Our parsing of gisaid Location fields expects 4 slash-separated fields, however there are thousands of entries which have more than 4. Each of these raises a warning however there's no way to remedy these records. Instead of adding thousands of one-off annotations, this introduces the concept of rules to apply to these Location fields.

The full influenza ingest now completes without any warnings. Most changes are to non-seasonal-flu records, perhaps as expected. There are only 168 record changes to seasonal-flu, all of which are improvements. As will be described in the upcomming docs commit, the process for geographic curation is multi-layered and complex. For instance, the `parse-gisaid-location` sets "Maricopa County AZ", which `augur curate titlecase` chances to "Maricopa County Az", which we then change back to "Maricopa County AZ" via a custom `geolocation_rules.tsv` entry. I'm sure there are improvements to the process that we can make over time...

This flags up 79 duplicate (but conflicting) strain name fixes and 3 location fixes.

While the files are being fetched from fauna (and those files are changing ~every week) it's not practical to fix the underlying issues. We should re-enable these warnings once the files are in this repo and we can fix them. (P.S. I don't want to keep the warnings around in the logs as I want any warning shown to be fixable/fixed.)

jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch 4 times, most recently from 096055e to 3cea5e7 Compare November 30, 2025 19:24

This comment was marked as outdated.

Sign in to view

jameshadfield commented Dec 2, 2025

View reviewed changes

Comment thread ingest/rules/prepare_ndjson.smk Outdated

jameshadfield changed the title ~~[ingest] move towards using EPI ISL as the canonical ID~~ [ingest] general curation fixes Dec 2, 2025

joverlee521 reviewed Dec 2, 2025

View reviewed changes

jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch from 3cea5e7 to f042899 Compare December 3, 2025 01:23

joverlee521 mentioned this pull request Dec 3, 2025

[curate apply-geolocation-rules] treat '?' values akin to empty values nextstrain/augur#1929

Open

4 tasks

joverlee521 approved these changes Dec 5, 2025

View reviewed changes

Comment thread ingest/curation-docs.md Outdated

Comment thread ingest/curation-docs.md Outdated

jameshadfield added 15 commits December 8, 2025 11:03

[ingest] use ISL for hardcoded annotations

0809c4f

This forms part of a bigger direction to move away from strain names as the canonical ID as they may change over time as our curation rules change. The downside is that many of the strains are duplicated and thus require multiple ISLs.

[ingest] remove unnecessary lineage annotations

ee69df9

These were flagged as unnecessary by the changes in the previous commit

[ingest] flag duplicate strain/location fixes

91273c3

This flags up 79 duplicate (but conflicting) strain name fixes and 3 location fixes.

[ingest] start curation docs

c1958ac

jameshadfield force-pushed the james/ingest-use-epi-isl-as-key branch from f042899 to c1958ac Compare December 7, 2025 22:04

jameshadfield merged commit 70877f2 into master Dec 7, 2025
3 checks passed

jameshadfield deleted the james/ingest-use-epi-isl-as-key branch December 7, 2025 22:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ingest] general curation fixes#267

[ingest] general curation fixes#267
jameshadfield merged 15 commits into
masterfrom
james/ingest-use-epi-isl-as-key

jameshadfield commented Nov 26, 2025 •

edited

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

joverlee521 Dec 2, 2025

Uh oh!

Uh oh!

Uh oh!

joverlee521 left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jameshadfield commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

joverlee521 Dec 2, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

joverlee521 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

jameshadfield commented Nov 26, 2025 •

edited

Loading