
Add new multithreaded TwoQubitPeepholeOptimization pass #13419

Merged
Cryoris merged 98 commits into Qiskit:main from mtreinish:two-qubit-peephole-parallel-pass
May 21, 2026

Conversation

@mtreinish
Member

@mtreinish mtreinish commented Nov 10, 2024

Summary

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it in place.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.
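
The reason the per-run work parallelizes cleanly can be illustrated with a minimal sketch. It is not the pass's implementation: `Run`, `synthesize`, and `process_runs` are hypothetical stand-ins, and it uses plain `std::thread::scope` rather than whatever threading machinery the pass actually uses. Each run is handled by a pure function, so worker threads need no locking, and the results come back in input order so substitutions can be applied sequentially afterwards.

```rust
use std::thread;

/// Hypothetical stand-in for a collected two-qubit run (a list of gate ids).
type Run = Vec<u32>;

/// Hypothetical stand-in for "compute the run's matrix, then synthesize it".
/// Returns `Some(new_len)` when a shorter replacement exists.
fn synthesize(run: &Run) -> Option<usize> {
    // Assumption for this sketch only: any run can be rewritten with 3 gates.
    if run.len() > 3 {
        Some(3)
    } else {
        None
    }
}

/// Process every run on up to `n_threads` scoped threads.
/// Runs are independent, so the workers share nothing mutable.
fn process_runs(runs: &[Run], n_threads: usize) -> Vec<Option<usize>> {
    let n = n_threads.max(1);
    // Ceiling division; at least 1 so `chunks` never receives 0.
    let chunk_size = ((runs.len() + n - 1) / n).max(1);
    thread::scope(|scope| {
        // Spawn all workers first, then join them in order.
        let handles: Vec<_> = runs
            .chunks(chunk_size)
            .map(|chunk| scope.spawn(move || chunk.iter().map(synthesize).collect::<Vec<_>>()))
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because the output preserves input order, the result of `process_runs` matches a sequential `runs.iter().map(synthesize)` regardless of the thread count.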

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers.

Right now this commit has a number of limitations. The largest are:

  • Only supports the target
  • It doesn't support the XX decomposer because it's not in Rust (the TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer are used)

This pass doesn't support using the unitary synthesis plugin interface, since
it's optimized to use Qiskit's built-in two qubit synthesis routines written in
Rust. The existing combination of ConsolidateBlocks and UnitarySynthesis
should be used instead if the plugin interface is necessary.

Details and comments

Fixes #12007
Fixes #11659

TODO:

@mtreinish mtreinish added the performance, Changelog: Added, Rust, and mod: transpiler labels Nov 10, 2024
@mtreinish mtreinish added this to the 2.0.0 milestone Nov 10, 2024
@coveralls

coveralls commented Nov 10, 2024

Coverage Report for CI Build 26225241457

Warning

Build has drifted: This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Quick fix: rebase this PR. Learn more →

Coverage decreased (-0.1%) to 87.488%

Details

  • Coverage decreased (-0.1%) from the base build.
  • Patch coverage: 34 uncovered changes across 4 files (307 of 341 lines covered, 90.03%).
  • 304 coverage regressions across 17 files.

Uncovered Changes

| File | Changed | Covered | % |
|---|---|---|---|
| crates/transpiler/src/passes/two_qubit_peephole.rs | 237 | 208 | 87.76% |
| crates/synthesis/src/two_qubit_decompose/basis_decomposer.rs | 8 | 6 | 75.0% |
| crates/transpiler/src/passes/unitary_synthesis/mod.rs | 53 | 51 | 96.23% |
| crates/pyext/src/lib.rs | 4 | 3 | 75.0% |

Coverage Regressions

304 previously-covered lines in 17 files lost coverage.

| Top 10 Files by Coverage Loss | Lines Losing Coverage | Coverage |
|---|---|---|
| crates/synthesis/src/two_qubit_decompose/basis_decomposer.rs | 66 | 84.49% |
| crates/circuit/src/circuit_drawer.rs | 56 | 95.88% |
| crates/bindgen/src/lib.rs | 45 | 68.48% |
| crates/circuit/src/classical/expr/expr.rs | 44 | 91.95% |
| crates/synthesis/src/two_qubit_decompose/controlled_u_decomposer.rs | 25 | 93.47% |
| qiskit/circuit/library/generalized_gates/linear_function.py | 15 | 85.25% |
| crates/transpiler/src/passes/commutative_optimization.rs | 13 | 96.75% |
| crates/bindgen-cli/src/main.rs | 12 | 0.0% |
| crates/bindgen/src/simple_ir.rs | 7 | 74.55% |
| crates/circuit/src/variable_mapper.rs | 6 | 63.51% |

Coverage Stats

Coverage Status
Relevant Lines: 123676
Covered Lines: 108202
Line Coverage: 87.49%
Coverage Strength: 961479.69 hits per line

💛 - Coveralls

This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it in place.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see Qiskit#11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers.

Right now this commit has a number of limitations, the largest are:

- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer,
  because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series,
but I'm not sure this is the behavior we want. We could say keep the
synthesis plugins for `UnitarySynthesis` only and then rely on our
built-in methods for physical optimization only. But this also seems less
than ideal because the plugin mechanism is how we support synthesizing
to custom basis gates, and also more advanced approximate synthesis
methods. Both of those are things we need to do as part of the synthesis
here.

Additionally, this is currently missing tests and documentation and while
running it manually "works" as in it returns a circuit that looks valid,
I've not done any validation yet. This also likely will need several
rounds of performance optimization and tuning. At this point this is
just a rough proof of concept and will need a lot of refinement along with
larger changes to Qiskit's Rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
Since Qiskit#13139 merged we have another two qubit decomposer available to
run in rust, the TwoQubitControlledUDecomposer. This commit updates the
new TwoQubitPeepholeOptimization to call this decomposer if the target
supports appropriate 2q gates.
Clippy is correctly warning that the size difference between the two
decomposer types in the TwoQubitDecomposer enum is large.
TwoQubitBasisDecomposer is 1640 bytes and TwoQubitControlledUDecomposer
is only 24 bytes. This means each ControlledU element is wasting
more than 1600 bytes. However, in this case that is acceptable in order to
avoid a layer of pointer indirection as these are stored temporarily
in a vec inside a thread to decompose a unitary. A trait would be more
natural for this to define a common interface between all the two qubit
decomposers but since we keep them instantiated for each edge in a Vec
they need to be sized and doing something like
`Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer`
instead of an enum) to get around this would have additional runtime
overhead. This also takes into account that TwoQubitControlledUDecomposer
is far less likely to be used in practice, as it only works with some targets
that have RZZ, RXX, RYY, or RZX gates on an edge which is less common.
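
The size trade-off described in this commit message can be reproduced with a small sketch. The struct and enum names below are hypothetical placeholders padded to the byte counts quoted above; they are not the real decomposer types. An enum is as large as its largest variant, so the inline form pays about 1640 bytes per element, while boxing the large variant shrinks the enum at the cost of a heap allocation and a pointer dereference.

```rust
use std::mem::size_of;

// Hypothetical stand-ins, sized like the figures quoted in the commit message.
#[allow(dead_code)]
struct BasisDecomposer([u8; 1640]);
#[allow(dead_code)]
struct ControlledUDecomposer([u8; 24]);

// Inline storage: every value occupies the size of the largest variant.
#[allow(dead_code)]
enum InlineDecomposer {
    Basis(BasisDecomposer),
    ControlledU(ControlledUDecomposer),
}

// Boxing the large variant keeps the enum small, but each access to the
// boxed payload goes through a pointer to a separate heap allocation.
#[allow(dead_code)]
enum BoxedDecomposer {
    Basis(Box<BasisDecomposer>),
    ControlledU(ControlledUDecomposer),
}

/// Returns (inline size, boxed size) in bytes.
fn decomposer_sizes() -> (usize, usize) {
    (size_of::<InlineDecomposer>(), size_of::<BoxedDecomposer>())
}
```

Clippy's `large_enum_variant` lint flags the inline form; the commit message argues the extra memory is acceptable here because the values are short-lived and avoiding indirection matters more.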
Also don't run scoring more than needed.
@ShellyGarion
Member

Copying here the comment from @t-imamichi: #13568 (comment)
and my reply: #13568 (comment)

I think this closes #13428. How about adding a test case of consecutive RZZ (RXX, and RYY) gates?

We should make sure that once PR #13568 and this PR are merged, we can efficiently transpile circuits into a basis with fractional RZZ gates.

@mtreinish
Member Author

I added support for using the ControlledUDecomposer to the new pass back in early December with this commit: 746758f although looking at that now with fresh eyes I need to check that the gate is continuous in the target, right now it only looks at the supported gate types.

@ShellyGarion ShellyGarion self-assigned this Feb 3, 2025
Collaborator

@Cryoris Cryoris left a comment

I'm not done yet, but I'm already submitting these comments below 🙂

Comment thread crates/transpiler/src/passes/two_qubit_peephole.rs Outdated
Comment on lines +199 to +200
original_2q_count,
1. - original_fidelity,
Collaborator

I assume this order gives the precedence in sorting -- shouldn't the fidelity be the main priority over the 2q count? Since this is done after routing (it is, right?) it seems like we should be optimizing for fidelity over anything else

Collaborator

Edit: Reading the rest, I assume this is because the output operations might not be in the target basis, which makes the fidelity calculation an estimate only. If that's what's going on, could we leave a comment explaining this?

Member Author

I discussed this in: #13419 (comment) I think we could change it to be fidelity first, but this was a change I made during the development to try and debug other issues. I think right now it's a good choice because if there is a controlled U equivalent entangling gate we will emit too many 1q gates (until #16036 is fixed) which will get simplified by Optimize1qGatesDecomposition but if error was the first heuristic we would miss an optimization opportunity. I think we can investigate switching it to be fidelity first after #16036 is fixed.

Member

#16036 is ready now :)
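
The precedence question in this thread comes down to lexicographic comparison: whichever field comes first in the tuple decides the winner, and later fields only break ties. A minimal sketch, assuming a hypothetical `Score` type whose fields mirror the `original_2q_count` and `1. - original_fidelity` entries in the reviewed snippet (lower is better for both):

```rust
/// Hypothetical candidate score; lower is better in both fields.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Score {
    two_qubit_count: usize,
    error: f64, // 1 - estimated fidelity
}

/// Two-qubit gate count decides first; error only breaks ties.
fn better_count_first(a: Score, b: Score) -> bool {
    (a.two_qubit_count, a.error) < (b.two_qubit_count, b.error)
}

/// Error decides first; two-qubit gate count only breaks ties.
fn better_fidelity_first(a: Score, b: Score) -> bool {
    (a.error, a.two_qubit_count) < (b.error, b.two_qubit_count)
}
```

With one candidate at 2 two-qubit gates and error 0.010 and another at 3 gates and error 0.008, the count-first ordering prefers the former and the fidelity-first ordering prefers the latter. The numbers are made up for illustration.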

mtreinish added 2 commits May 11, 2026 15:45
This avoids the second synchronization point during the parallel
portion of the pass function. While an AtomicBool shouldn't have much
synchronization overhead it wasn't strictly necessary. We do have the
O(n) overhead of iterating over the runs to determine if we changed
anything but this is the less common case that we don't make any
substitutions so taking the overhead isn't a huge deal.
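
The trade-off in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the pass's code: `worker`, `run_with_flag`, and `changed_by_scan` are hypothetical names, and one scoped thread per run is used only to keep the example short. The first function records "something changed" through a shared `AtomicBool`; the second derives the same answer afterwards with an O(n) scan of the per-run outcomes, so the workers share no state at all.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical per-run work: `Some(new_len)` when a replacement was found.
fn worker(run_len: usize) -> Option<usize> {
    if run_len > 3 {
        Some(3)
    } else {
        None
    }
}

/// Variant with a shared flag: every worker may write to `changed`.
fn run_with_flag(run_lens: &[usize]) -> (Vec<Option<usize>>, bool) {
    let changed = AtomicBool::new(false);
    let outcomes = thread::scope(|scope| {
        let handles: Vec<_> = run_lens
            .iter()
            .map(|&len| {
                let changed = &changed;
                scope.spawn(move || {
                    let outcome = worker(len);
                    if outcome.is_some() {
                        changed.store(true, Ordering::Relaxed);
                    }
                    outcome
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect::<Vec<_>>()
    });
    // All workers are joined by now, so the flag's final value is visible.
    (outcomes, changed.load(Ordering::Relaxed))
}

/// Variant without shared state: scan the collected outcomes once.
fn changed_by_scan(outcomes: &[Option<usize>]) -> bool {
    outcomes.iter().any(Option::is_some)
}
```

The scan stops at the first `Some`, so its full O(n) cost is only paid when no run was substituted, which the commit message describes as the less common case.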
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request May 13, 2026
In Qiskit#13419 the pass was updated to handle threads that need access
to Python objects to get matrices or definitions of custom Python gates.
This changes how the C API behaves when it is used from a Python
binding context. To run the C API reliably within a Python context the
pass needs to run while holding the GIL so that it can be reliably
released and reacquired where necessary in the Rust code.
@mtreinish mtreinish requested a review from ShellyGarion May 15, 2026 20:57
Collaborator

@Cryoris Cryoris left a comment

Still need to go through the tests and docs, but here's already a next chunk 🙂

Comment thread crates/transpiler/src/passes/two_qubit_peephole.rs Outdated
let mut original_2q_count: usize = 0;
let original_total_count: usize = node_indices.len();
let mut outside_target = false;
for node_index in node_indices {
Collaborator

Couldn't this use the fidelity_2q_sequence function we've defined further above?

Member Author

The reason I didn't comes down to the typing, the fidelity_2q_sequence expects the inputs in all the forms that are used internally by the unitary synthesis module (a vec of tuples with the gates, params, and qubits, target wrapped in a QpuConstraint, etc). While at this point we just have a vec of node indices in the dag of the original. I'd have to rewrite fidelity_2q_sequence to be more generic to be able to reuse it in this context without any overhead. I can do that if you think it's necessary

Comment thread test/python/transpiler/test_two_qubit_peephole.py Outdated
Collaborator

Do we (can we?) ensure somewhere in the tests that the pass will be called with multiprocessing enabled?

Member Author

This PR doesn't test it because we don't use multiprocessing on the pass as a standalone. In general we disable multiprocessing in the test suite because we are in a multiprocessing context already via stestr. We do have a special testing harness in place that runs the transpiler (which is the only place we use multiprocessing directly) with multiprocessing enabled. So #16136 should exercise it via that.

@Cryoris Cryoris added this pull request to the merge queue May 21, 2026
Merged via the queue into Qiskit:main with commit ef7004f May 21, 2026
27 checks passed
@github-project-automation github-project-automation Bot moved this from In development to Done in Qiskit 2.5 May 21, 2026
@mtreinish mtreinish deleted the two-qubit-peephole-parallel-pass branch May 21, 2026 13:40
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request May 21, 2026
Following on from Qiskit#13419 which added a new optimization pass
TwoQubitPeepholeOptimization which was designed to replace the
pair of ConsolidateBlocks and UnitarySynthesis for the optimization
stage after we have a physical circuit. That PR however did not update
the preset pass managers to concentrate the review on just adding the
new pass. This continues off from there by updating the preset pass
managers to use the new pass in optimization levels 2 and 3 replacing
those levels' optimization stage's previous usage of ConsolidateBlocks
and UnitarySynthesis to achieve the same goal. This should result in
both a runtime performance and transpilation quality improvement as the
new pass is both faster and should produce better fidelity circuits
than the previous peephole optimization.

The test updates made in this PR are needed because the peephole
optimization changes the transpilation output of various test
circuits. These were all verified to be valid outputs and in all cases
a "better" output than before. Specifically, for the tests updated
these were the changes in output and why they occurred:

* In the two tests in
  test.python.circuit.test_scheduled_circuit.TestScheduledCircuit
  the single CX gate in the output circuit was flipped from (0, 1) to
  (1, 0) because in the target the error rate for the (0, 1) direction
  was higher than the extra error cost of 3 sx gates (the rz gates have
  0 error).
* In test_unroll_only_if_not_gates_in_basis from
  test.python.transpiler.test_preset_passmanagers.TestPresetPassManager
  we no longer run ConsolidateBlocks in the optimization loop so we no
  longer need to add the 2 executions from the init and translation
  stages. The test is updated to count the new peephole pass which is
  the intent of the count check, to check the pass in the optimization
  loop.
* test_2q_circuit_5q_backend_v2 from
  test.python.transpiler.test_vf2_post_layout.TestVF2PostLayoutUndirected
  had the same cx gate flipping because the error rate in the original
  layout for the reverse direction was 0.000779905 vs 0.00163587 in the
  original direction. So the new pass was correctly flipping the cx gate
  resulting in a different circuit that vf2 couldn't place anywhere
  better. To fix this the test sets a fixed layout on worse qubits so
  that vf2 will have to place it somewhere better.
* For test_layout_tokyo_fully_connected_cx_4_3 from
  test.python.transpiler.test_preset_passmanagers.TestFinalLayouts the
  output circuit has a better estimated fidelity (although more gates in
  general). The transpiler output goes from an estimated fidelity of
  0.9526614226294913 before the new pass was used to an estimated fidelity
  of 0.961996188569715 after the new pass is used. This new circuit with a
  better fidelity has a different initial layout set now, so the test
  is updated to use the new layout.
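
The CX-flipping decisions in the list above amount to comparing estimated fidelities of two candidate sequences. A rough sketch, assuming the estimate is the product of `1 - error` over the gates in the sequence (this model and the function names are assumptions, not taken from the pass):

```rust
/// Estimated fidelity of a gate sequence, modeled as the product of
/// (1 - error) over its gates. The model is an assumption of this sketch.
fn estimated_fidelity(errors: &[f64]) -> f64 {
    errors.iter().map(|e| 1.0 - e).product()
}

/// True when the flipped CX plus its extra single-qubit gates has a better
/// estimated fidelity than the CX in the original direction.
fn prefer_flipped(original_error: f64, flipped_error: f64, extra_1q_errors: &[f64]) -> bool {
    let original = estimated_fidelity(&[original_error]);
    let flipped = estimated_fidelity(&[flipped_error]) * estimated_fidelity(extra_1q_errors);
    flipped > original
}
```

Using the two CX error rates quoted for test_2q_circuit_5q_backend_v2 (0.00163587 in the original direction, 0.000779905 in the reverse), `prefer_flipped` returns true when the three extra sx gates each have a hypothetical error of 1e-4, and false if each had a hypothetical error of 5e-4. The sx error values are invented for illustration; the source does not state them.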
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request May 27, 2026
This commit adds an option to the TwoQubitPeepholeOptimization transpiler
pass added in Qiskit#13419 to allow configuring the heuristic priority when
instantiating the pass. When the pass is making a decision on which
potential synthesis outcome out of multiple is the best or when to
replace a block with the best synthesis outcome there are three metrics
we look at, the estimated fidelity, the number of 2q gates, and the total
number of gates. The order of this comparison is inherently flexible and
prioritizes different aspects of the synthesis. This exposes a new
option to the pass to enable users to specify which aspect they want the
pass to prioritize.

Previously the pass was hard coded to prioritize 2q gate count, but as
was discussed in the PR review and the commit history of the PR branch
this isn't necessarily the ideal choice. Intuitively, assuming relatively
accurate error rates in the target the estimated fidelity should be the
first priority. This was changed to prioritize two qubit gate count
during development as a debugging step and left in place while we worked
on issues with the TwoQubitControlledUDecomposer's 1q component handling
(see Qiskit#16123). Now that the TwoQubitControlledUDecomposer issue has been
resolved we can explore switching the default, this new argument will
make it much easier to do side by side comparisons of the different
options. The intent is to enable updating the usage in the preset pass
managers (added in Qiskit#16136) without having to modify the pass's rust
code.
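
A configurable priority of this kind can be sketched as a key function that reorders the three metrics before a lexicographic comparison. Everything named here (`Metrics`, `Priority`, `sort_key`, `pick_best`) is hypothetical, as is the tie-break order of the two non-primary metrics; the actual option name and values exposed by the pass are not shown in this conversation.

```rust
use std::cmp::Ordering;

/// Hypothetical metrics for one synthesis outcome; lower is better for all.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Metrics {
    error: f64, // 1 - estimated fidelity
    two_qubit_count: usize,
    total_count: usize,
}

/// Hypothetical priority setting selecting which metric is compared first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Priority {
    Fidelity,
    TwoQubitCount,
    TotalCount,
}

/// Build a comparison key with the prioritized metric in front.
/// The order of the remaining two metrics is an assumption of this sketch.
fn sort_key(m: &Metrics, priority: Priority) -> [f64; 3] {
    let (error, two_q, total) = (m.error, m.two_qubit_count as f64, m.total_count as f64);
    match priority {
        Priority::Fidelity => [error, two_q, total],
        Priority::TwoQubitCount => [two_q, error, total],
        Priority::TotalCount => [total, error, two_q],
    }
}

/// Pick the candidate with the smallest key under the given priority.
fn pick_best(candidates: &[Metrics], priority: Priority) -> Option<&Metrics> {
    candidates.iter().min_by(|a, b| {
        sort_key(a, priority)
            .partial_cmp(&sort_key(b, priority))
            .unwrap_or(Ordering::Equal)
    })
}
```

Keeping the priority as a runtime value rather than a hard-coded tuple order is what allows side-by-side comparisons of the different orderings without recompiling.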

This does not include a release note as the new pass has not been
included in a release yet so there is no addition to a public API in
this PR.
mtreinish added a commit to mtreinish/qiskit-core that referenced this pull request May 28, 2026
chookity-pokk pushed a commit to qctrl/qiskit that referenced this pull request May 29, 2026
* Use TwoQubitPeepholeOptimization in preset pass managers

* Remove unused imports

Labels

Changelog: Added, mod: transpiler, performance, Rust

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Add parallel synthesis interface to default unitary synthesis plugin
ConsolidateBlocks does not have a good logic for heterogeneous gates

9 participants