[ingest] general curation fixes#267
Merged
Merged
Conversation
096055e to
3cea5e7
Compare
This comment was marked as outdated.
This comment was marked as outdated.
jameshadfield
commented
Dec 2, 2025
joverlee521
reviewed
Dec 2, 2025
| | ./scripts/parse-gisaid-location \ | ||
| --location-field {params.gisaid_location_field:q} \ | ||
| --strain-field {params.gisaid_strain_field:q} \ | ||
| --annotations {input.location_annotations:q} \ |
Contributor
There was a problem hiding this comment.
Ah, this has the downside of potentially getting overridden by geolocation rules downstream. I see you what you mean by needing to document the curation steps for geolocations...
jameshadfield
added a commit
that referenced
this pull request
Dec 3, 2025
The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
3cea5e7 to
f042899
Compare
4 tasks
joverlee521
approved these changes
Dec 5, 2025
Contributor
joverlee521
left a comment
There was a problem hiding this comment.
Latest changes look good to me. I think fine to merge and continue improving the curation docs as necessary
This is useful when making changes to the curation steps. For instance, assuming we have stored a good copy of the seasonal-flu.ndjson created before we made changes we can diff the new one against this. I plan to eventually write up a recommended workflow so that anyone can confidently make code/curation changes, and this workflow will leverage scripts such as this.
This forms part of a bigger direction to move away from strain names as the canonical ID as they may change over time as our curation rules change. The downside is that many of the strains are duplicated and thus require multiple ISLs.
This starts the move towards applying annotations early and at the
relevant curate step rather than at the end. The benefits are that we
can provide warnings when unexpected situations arise safe in the
knowledge that there won't be a later step which corrects the record.
I've changed the format such that we now also update the vtype & subtype
which is expected to result in large number of changes to records. In
total 8,097 records are changed by this commit.
* n=8058 had their type & subtype changed from "tbd" to (e.g.) "a" &
"h3n2". These changes were the only ones I expected when writing the
code here
* n=14 had their subtype changed from "n/a" to either "h1n1pdm" or
"h3n2". All of these records had raw info of "Subtype": "A",
"Lineage": "" so these changes also make sense.
* n=25 had their subtype changed, e.g. "h3n2" → "h1n1". As an example, a
record after curation with the previous code was:
`{"vtype": "a", "subtype": "h3n2", "lineage": "h1n1pdm"}`
and using this updated code is now:
`{"vtype": "a", "subtype": "h1n1", "lineage": "h1n1pdm"}`
These will be revisited in a subsequent commit as they flag up
inconsistencies in the underlying annotations rather than the code
changes introduced here.
In addition, annotations for 5 EPI ISLs are flagged as unnecessary
These were flagged as unnecessary by the changes in the previous commit
Recent changes to how we process lineage overrides (via hardcoded TSV) also update the type & subtype accordingly. This flagged up 25 cases where our previous curated NDJSON would have conflicting calls. Working through an entry in the annotations tsv of "EPI_ISL_123 a/h1n1/h1n1pdm" as an example: * The GISAID metadata indicated the strain was a/h3n2/h3n2 * Our previous approach overrode the lineage field only, so this example became a/h3n2/h1n1pdm * Our new approach also updates the subtype, so the strain becomes a/h1n1/h1n1pdm I looked at a handful of the HA segments using Nextclade's automatic dataset-suggest functionality and in all cases the underlying GISAID metadata was correct and our annotations were wrong. This commit thus drops all 25 annotations. Specifically, the new curated seasonal-flu NDJSON will have: * n=16 lineages changed from "h1n1pdm" to "h3n2" * n=5 lineages changed from "h3n2" to "h1n1pdm" * n=4 records dropped as they are no longer classified as seasonal flu (n=3 h5n1 and n=1 h1n2, all of which were previously annotated as a/h1n1/h1n1pdm)
Makes searching through the log files much easier (e.g. with `grep`) and also makes it easier to read through the log files as there's now a clear mapping back to the part of the curate chain where the warning came from. This starts a pattern I'd like to roll out to all our Augur curate scripts, but that's for another day.
The primary way data is analysed is by first subsetting to the subtype or lineage of interest, and so setting these values for all records will enable a suite of analyses which were previously not possible. Previously (for A viruses only) we sometimes set the lineage to the subtype (e.g. h3n2, h1n2) and not for others (e.g. h3n3, h5n1). We now set the lineage to the subtype in all cases except for h1n1/pdm09 (subtype=h1n1, lineage=h1n1pdm09). This means we expand our previous ignoring of the upstream Lineage="seasonal" to apply to any subtype (previously we would ignore it for some subtypes and for others they would be assigned "TBD" type/subtype/lineage). Because the data is messy our subtypes & lineages are inherently messy. This means we now have subtypes of "h3", "n1" etc. In practice, if you wanted to (e.g.) fully analyse H3 strains then you would have to subset by many different subtypes (e.g. h3, h3n1, h3n2...), and potentially consider the type='A' records without any subtype assigned. There are no changes to the output seasonal-flu ndjson, but the larger curated ndjson has many changes. Beyond the updates to 87k records with TBD type/subtype/lineage, we also change the lineage for many records from 'n/a' to the subtype (see above).
The shifting of the `len(split_locations) < expected_num_locations` earlier should speed things up a little as we're no longer unnecessarily regex-matching on the strain name By keeping the `Lineage` field unchanged (i.e. capital L) we keep this field in the output NDJSON which allows better troubleshooting. This will be dropped when we do the TSV fields subsetting.
In line with previous commits, this shifts the hardcoded annotation- overrides to the relevant curate step which allows better user-facing warnings. While this allows us to drop the `apply-record-annotations` step entirely, we keep it around (without any actual annotations) to allow for hardcoded fixes in the future as needed.
The '?' null value meant that our geographic rules weren't applied as intended. For instance, we (Augur) have rules such as: 1. `Asia/Bangladesh/Jashore/` → `Asia/Bangladesh/Khulna/Jashore` 2. `Asia/Bangladesh/Jashore/*` → `Asia/Bangladesh/Khulna/*` and our geographic fixes layer would produce values like `Asia/Bangladesh/Jashore/?`. We want the first rule to be applied to this, resulting in a corrected value of `Asia/Bangladesh/Khulna/Jashore` however the second rule would be run instead resulting in `Asia/Bangladesh/Khulna/?`. Originally I implemented a fix in Augur for this to allow geo-location values with '?' to be appropriately parsed by `augur curate apply-geolocation-rules` <nextstrain/augur#1929> however on review we decided to go the other way and change the seasonal-flu geographic null value to the empty string <#267 (comment)> **Summary of changes to (seasonal-flu) ingested data** 393,688 / 514,895 records changed Value "?" → '' changes (very much as expected!): * n = 4 (country) * n = 102,408 (division) * n = 389,522 (location) Value "?" → valid geographic deme (via the rules doing their thing): * n = 0 (country) * n = 1,925 (division) * n = 3,888 (location) Value change from one deme to another (via the rules doing their thing): * n = 280 (country) * n = 8,904 (division) * n = 0 (location) **Some examples** The example given above (`Asia/Bangladesh/Jashore/?`) results in 38 records having a location "?" → "Jashore" 97 records go from "Norway/Buskerud/?" → "Norway/Viken/Buskerud"
Our parsing of gisaid Location fields expects 4 slash-separated fields, however there are thousands of entries which have more than 4. Each of these raises a warning however there's no way to remedy these records. Instead of adding thousands of one-off annotations, this introduces the concept of rules to apply to these Location fields.
The full influenza ingest now completes without any warnings. Most changes are to non-seasonal-flu records, perhaps as expected. There are only 168 record changes to seasonal-flu, all of which are improvements. As will be described in the upcomming docs commit, the process for geographic curation is multi-layered and complex. For instance, the `parse-gisaid-location` sets "Maricopa County AZ", which `augur curate titlecase` chances to "Maricopa County Az", which we then change back to "Maricopa County AZ" via a custom `geolocation_rules.tsv` entry. I'm sure there are improvements to the process that we can make over time...
This flags up 79 duplicate (but conflicting) strain name fixes and 3 location fixes.
While the files are being fetched from fauna (and those files are changing ~every week) it's not practical to fix the underlying issues. We should re-enable these warnings once the files are in this repo and we can fix them. (P.S. I don't want to keep the warnings around in the logs as I want any warning shown to be fixable/fixed.)
f042899 to
c1958ac
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This started out as an effort to move away from using strain names as the canonical ID for curation steps but turned into more of a grab-bag of fixes and changes. Please see the commit messages for full details. There's a few big-picture directions this PR (and subsequent PRs) aim for.
Each layer in the curation chain is intended to be self-contained where possible. For instance, any and all changes to lineages should happen in the
./scripts/standardize-lineagelayer, including hardcoded annotations. This means we have more annotations TSVs, but errors/warnings printed by layers can be fixed at the source rather than later on. This allows us to address all the warnings in the curate logs and the majority of code-changes in this commit do just this, the end result being an empty curate log; this will eventually allow us to diff curated files to understand any curation changes.While a motivating direction was to replace strain names with EPI ISLs, I ended up only doing this for the curate layers above ./scripts/standardize-strain-names. Specifically: lineage annotations & location annotations. Because strain names are so widely used in curation (prioritised strains, reference strains), titers, and phylo workflows (include / exclude files) we should invest in tooling which identifies when strain names change so that we can keep all these files in sync.
I made a start on curation docs, but they're very much a work in progress.
The next PR will build on this to address avian-flu focused curation changes.