5 changes: 5 additions & 0 deletions CHANGES.md
@@ -2,13 +2,18 @@

## __NEXT__

### Features

* A helper function – `augur.subsample.get_parallelism` – has been added to optimize usage of `augur subsample` in Snakemake workflows. This is experimental and not yet part of the public API. [#1963][] (@victorlin)

### Bug fixes

* filter, merge: Fixed formatting of the error message shown when there are duplicate sequence ids. [#1954][] @victorlin
* filter: Adjusted the error message shown when there are missing weights to mention the option of updating values in metadata. [#1956][] @victorlin

[#1954]: https://github.com/nextstrain/augur/pull/1954
[#1956]: https://github.com/nextstrain/augur/pull/1956
[#1963]: https://github.com/nextstrain/augur/pull/1963

## 33.0.0 (26 January 2026)

36 changes: 36 additions & 0 deletions augur/subsample.py
@@ -233,6 +233,42 @@ def run(args: argparse.Namespace) -> None:
sample.remove_output_strains()


def get_parallelism(
    config_file: str,
    config_section: list[str] | None = None,
    limit: int | None = None
) -> int:
    """Compute the degree of parallelism (i.e., optimal value for ``--nthreads``).
Comment on lines +236 to +241

Member Author

@jameshadfield do you think this would still be feasible with the augur proximity integration in #1962?

Member

Can we chat about this tomorrow?

I haven't implemented the --nthreads stuff yet, so proximity calculations just run with 1 thread. That's not good: the function is designed to parallelize really well, and without parallelization it is unnecessarily slow.

Member Author

Sure!

    Inspects the subsample config file to return the degree of parallelism that
    should be used for ``--nthreads``. Higher values will underutilize
    resources, while lower values will underallocate resources and not fully use
    available parallelism.

    Parameters
    ----------
    config_file
        Path to the subsample config file.

    config_section
        Optional list of keys to navigate to a specific section of the config file.

    limit
        Optional upper bound for return value.

    Returns
    -------
    int
        Degree of parallelism.
    """
    schema_validator = load_json_schema("schema-subsample-config.json")
    config = _parse_config(config_file, config_section, schema_validator)
    if limit is None:
        return max(1, len(config["samples"]))
    else:
        return max(1, min(limit, len(config["samples"])))
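
The clamping behavior of `get_parallelism` can be sketched without the config parsing and schema validation. The function name `parallelism_from_config` and the sample names below are hypothetical, and the sketch assumes an already-parsed config whose `samples` entry is a mapping with one key per sample. It is an illustration of the arithmetic only, not the code in this pull request.

```python
from __future__ import annotations


def parallelism_from_config(config: dict, limit: int | None = None) -> int:
    """Hypothetical stand-in for the final step of get_parallelism.

    Assumes ``config`` is already parsed and validated, and that
    ``config["samples"]`` holds one entry per sample.
    """
    n_samples = len(config["samples"])
    if limit is None:
        # One unit of parallelism per sample, but never fewer than 1.
        return max(1, n_samples)
    # Cap at the caller's limit, but still never return fewer than 1.
    return max(1, min(limit, n_samples))


# Illustrative config with three made-up sample names.
example_config = {"samples": {"early": {}, "recent": {}, "background": {}}}

unbounded = parallelism_from_config(example_config)          # 3 samples, no cap
capped = parallelism_from_config(example_config, limit=2)    # capped by limit
empty = parallelism_from_config({"samples": {}}, limit=4)    # floor of 1 applies
```

The outer `max(1, ...)` means a limit of zero or an empty `samples` section still yields 1, so the result should always be usable as a thread count. In a Snakemake workflow, the returned value would presumably feed a rule's `threads` setting, with `limit` set to the cores the workflow is willing to give that rule.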


def get_referenced_files(
    config_file: str,
    config_section: Optional[List[str]] = None,