Add new multithreaded TwoQubitPeepholeOptimization pass #13419
Coverage Report for CI Build 26225241457
Warning: Build has drifted. This PR's base is out of sync with its target branch, so coverage data may include unrelated changes.
Coverage decreased (-0.1%) to 87.488%.
Coverage Regressions: 304 previously-covered lines in 17 files lost coverage.
- Coveralls
This commit adds a new transpiler pass for physical optimization, TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks, ConsolidateBlocks, and UnitarySynthesis in the optimization stage for a default pass manager setup. The pass logically works the same way: it analyzes the dag to get a list of 2q runs, calculates the matrix of each run, and then synthesizes the matrix and substitutes it in place. The distinction is that this pass does all of this in a single pass and also parallelizes the matrix calculation and synthesis steps, because there is no data dependency there.

This new pass is not meant to fully replace the Collect2qBlocks, ConsolidateBlocks, or UnitarySynthesis passes as those also run in contexts where we don't have a physical circuit. This is meant instead to replace their usage in the optimization stage only. Accordingly this new pass also changes the logic on how we select the synthesis to use and when to make a substitution. Previously this logic was primarily done via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if the number of basis gates needed based on the weyl chamber coordinates was less than the number of 2q gates in the block (see Qiskit#11659 for discussion on this). Since this new pass skips the explicit consolidation stage we go ahead and try all the available synthesizers.

Right now this commit has a number of limitations, the largest are:
- Only supports the target
- It doesn't support any synthesizers besides the TwoQubitBasisDecomposer, because it's the only one in rust currently.

For plugin handling I left the logic as running the three pass series, but I'm not sure this is the behavior we want. We could say keep the synthesis plugins for `UnitarySynthesis` only and then rely on our built-in methods for physical optimization only. But this also seems less than ideal because the plugin mechanism is how we support synthesizing to custom basis gates, and also more advanced approximate synthesis methods. Both of those are things we need to do as part of the synthesis here.

Additionally, this is currently missing tests and documentation, and while running it manually "works" as in it returns a circuit that looks valid, I've not done any validation yet. This also likely will need several rounds of performance optimization and tuning. At this point this is just a rough proof of concept and will need a lot of refinement along with larger changes to Qiskit's rust code before this is ready to merge.

Fixes Qiskit#12007
Fixes Qiskit#11659
Since Qiskit#13139 merged we have another two qubit decomposer available to run in rust, the TwoQubitControlledUDecomposer. This commit updates the new TwoQubitPeepholeOptimization to call this decomposer if the target supports appropriate 2q gates.
Clippy is correctly warning that the size difference between the two decomposer types in the TwoQubitDecomposer enum is large. TwoQubitBasisDecomposer is 1640 bytes and TwoQubitControlledUDecomposer is only 24 bytes. This means each element of ControlledU is wasting > 1600 bytes. However, in this case that is acceptable in order to avoid a layer of pointer indirection, as these are stored temporarily in a vec inside a thread to decompose a unitary. A trait would be a more natural way to define a common interface between all the two qubit decomposers, but since we keep them instantiated for each edge in a Vec they need to be sized, and doing something like `Box<dyn TwoQubitDecomposer>` (assuming a trait `TwoQubitDecomposer` instead of an enum) to get around this would have additional runtime overhead. This is also considering that TwoQubitControlledUDecomposer is far less likely to be used in practice, as it only works with targets that have RZZ, RXX, RYY, or RZX gates on an edge, which is less common.
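To make the trade-off above concrete, here is a minimal sketch of the two layouts. The struct, enum, and trait names are hypothetical stand-ins (plain byte arrays sized to match the numbers quoted in the commit message), not the real decomposer types.

```rust
use std::mem::size_of;

// Hypothetical stand-ins: byte arrays sized like the figures quoted above.
// The real decomposers hold matrices and gate data, not raw bytes.
#[allow(dead_code)]
struct BasisLike([u8; 1640]);
#[allow(dead_code)]
struct ControlledULike([u8; 24]);

// Enum layout: every element is as large as the largest variant, so a
// ControlledU element carries more than 1600 unused bytes, but the data
// is stored inline with no pointer chasing.
#[allow(dead_code)]
enum DecomposerEnum {
    Basis(BasisLike),
    ControlledU(ControlledULike),
}

// Trait-object layout: each element is a fat pointer (data + vtable), and
// every call goes through a heap pointer and dynamic dispatch.
#[allow(dead_code)]
trait Decompose {
    fn name(&self) -> &'static str;
}

impl Decompose for BasisLike {
    fn name(&self) -> &'static str {
        "basis"
    }
}

impl Decompose for ControlledULike {
    fn name(&self) -> &'static str {
        "controlled-u"
    }
}

/// Returns (size of one enum element, size of one boxed trait object).
fn element_sizes() -> (usize, usize) {
    (size_of::<DecomposerEnum>(), size_of::<Box<dyn Decompose>>())
}
```

The enum element is at least as large as its biggest variant, while the boxed trait object is only two pointers wide; the commit accepts the wasted space to avoid the indirection.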
Also don't run scoring more than needed.
Copying here the comment from @t-imamichi in #13568 (comment):
We should make sure that, after PR #13568 and this PR are merged, we can efficiently transpile circuits into a basis with fractional RZZ gates.
I added support for using the |
Cryoris
left a comment
I'm not done yet, but I'm already submitting these comments below 🙂
    original_2q_count,
    1. - original_fidelity,
I assume this order gives the precedence in sorting -- shouldn't the fidelity be the main priority over the 2q count? Since this is done after routing (it is, right?) it seems like we should be optimizing for fidelity over anything else
Edit: Reading the rest, I assume this is because the output operations might not be in the target basis, which makes the fidelity calculation an estimate only. If that's what's going on, could we leave a comment explaining this?
I discussed this in #13419 (comment). I think we could change it to be fidelity first, but this was a change I made during development to try and debug other issues. I think right now it's a good choice because if there is a controlled-U equivalent entangling gate we will emit too many 1q gates (until #16036 is fixed), which will get simplified by Optimize1qGatesDecomposition, but if error was the first heuristic we would miss an optimization opportunity. I think we can investigate switching it to be fidelity first after #16036 is fixed.
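As an illustration of the precedence being discussed, here is a small sketch of a lexicographic comparison in which the 2q gate count is checked before the error. The `Score` struct and `is_better` helper are hypothetical names for this example; the pass itself builds its key from the values shown in the snippet under review.

```rust
// Hypothetical score for one candidate. Derived PartialOrd compares the
// fields top to bottom, so the field order is the comparison precedence.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
struct Score {
    two_qubit_count: usize,
    error: f64, // 1 - estimated fidelity
    total_count: usize,
}

/// A candidate replaces the original only if it compares strictly lower.
fn is_better(candidate: Score, original: Score) -> bool {
    candidate < original
}
```

With this ordering a candidate with fewer 2q gates wins even when its estimated error is temporarily higher because of extra 1q gates, which is the situation described above.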
This avoids the second synchronization point during the parallel portion of the pass function. While an AtomicBool shouldn't have much synchronization overhead it wasn't strictly necessary. We do have the O(n) overhead of iterating over the runs to determine if we changed anything but this is the less common case that we don't make any substitutions so taking the overhead isn't a huge deal.
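A rough sketch of the idea in this commit, under stated assumptions: `synthesize` is a hypothetical stand-in for the per-run work, runs are represented only by their gate counts, and plain `std::thread::scope` is used in place of whatever thread pool the pass actually uses. The point is that the "did anything change" answer comes from one serial scan over the collected results, not from a shared flag written inside the workers.

```rust
use std::thread;

// Hypothetical stand-in for synthesizing one run. It returns Some(new_len)
// only when the replacement is shorter than the original run.
fn synthesize(run_len: usize) -> Option<usize> {
    let candidate = run_len.min(3); // pretend any run fits in 3 gates
    (candidate < run_len).then_some(candidate)
}

/// Processes runs in parallel, then checks serially whether any changed.
fn process_runs(runs: &[usize], num_threads: usize) -> (Vec<Option<usize>>, bool) {
    let threads = num_threads.max(1);
    // Ceiling division, clamped to 1 because chunks(0) would panic.
    let chunk_size = ((runs.len() + threads - 1) / threads).max(1);
    let results: Vec<Option<usize>> = thread::scope(|s| {
        // Spawn one worker per chunk; no shared mutable state is needed.
        let handles: Vec<_> = runs
            .chunks(chunk_size)
            .map(|chunk| {
                s.spawn(move || chunk.iter().map(|&r| synthesize(r)).collect::<Vec<_>>())
            })
            .collect();
        // Joining in spawn order keeps results aligned with the input runs.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    // O(n) scan after the parallel section replaces the AtomicBool.
    let changed = results.iter().any(Option::is_some);
    (results, changed)
}
```

The scan short-circuits on the first substitution, so the full O(n) cost is only paid when nothing was replaced.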
In Qiskit#13419 the pass has been updated to handle threads that need access to Python objects to get matrices or definitions of custom Python gates. This changes the story for the C API when using the C API in a python binding context. To run the C API reliably within a Python context the pass needs to be in a context with the GIL so it can be reliably released and reacquired where necessary in the rust code.
Cryoris
left a comment
Still need to go through the tests and docs, but here's already a next chunk 🙂
    let mut original_2q_count: usize = 0;
    let original_total_count: usize = node_indices.len();
    let mut outside_target = false;
    for node_index in node_indices {
Couldn't this use the fidelity_2q_sequence function we've defined further above?
The reason I didn't comes down to the typing. fidelity_2q_sequence expects its inputs in the forms that are used internally by the unitary synthesis module (a vec of tuples with the gates, params, and qubits, the target wrapped in a QpuConstraint, etc.), while at this point we just have a vec of node indices in the original dag. I'd have to rewrite fidelity_2q_sequence to be more generic to be able to reuse it in this context without any overhead. I can do that if you think it's necessary.
Co-authored-by: Julien Gacon <gaconju@gmail.com>
Do we (can we?) ensure in the tests somewhere that the pass will be called with multiprocessing enabled?
This PR doesn't test it because we don't use multiprocessing on the pass as a standalone. In general we disable multiprocessing in the test suite because we are in a multiprocessing context already via stestr. We do have a special testing harness in place that runs the transpiler (which is the only place we use multiprocessing directly) with multiprocessing enabled. So #16136 should exercise it via that.
This follows on from Qiskit#13419, which added a new optimization pass, TwoQubitPeepholeOptimization, designed to replace the pair of ConsolidateBlocks and UnitarySynthesis for the optimization stage after we have a physical circuit. That PR, however, did not update the preset pass managers, in order to concentrate the review on just adding the new pass. This commit continues from there by updating the preset pass managers to use the new pass in optimization levels 2 and 3, replacing those levels' previous usage of ConsolidateBlocks and UnitarySynthesis in the optimization stage to achieve the same goal. This should result in both a runtime performance and a transpilation quality improvement, as the new pass is both faster and should produce better fidelity circuits than the previous peephole optimization.

The test updates made in this PR are needed because the peephole optimization changes the transpilation output of various test circuits. These were all verified to be valid outputs and in all cases a "better" output than before. Specifically, for the tests updated these were the changes in output and why they occurred:

* In the two tests in test.python.circuit.test_scheduled_circuit.TestScheduledCircuit, the single CX gate in the output circuit was flipped from (0, 1) to (1, 0) because in the target the error rate for the (0, 1) direction was higher than the extra error cost of 3 sx gates (the rz gates have 0 error).
* In test_unroll_only_if_not_gates_in_basis from test.python.transpiler.test_preset_passmanagers.TestPresetPassManager we no longer run ConsolidateBlocks in the optimization loop, so we no longer need to add the 2 executions from the init and translation stages. The test is updated to count the new peephole pass, which is the intent of the count check: to check the pass in the optimization loop.
* test_2q_circuit_5q_backend_v2 from test.python.transpiler.test_vf2_post_layout.TestVF2PostLayoutUndirected had the same cx gate flipping because the error rate in the original layout for the reverse direction was 0.000779905 vs 0.00163587 in the original direction. So the new pass was correctly flipping the cx gate, resulting in a different circuit that vf2 couldn't place anywhere better. To fix this the test sets a fixed layout on worse qubits so that vf2 will have to place it somewhere better.
* For test_layout_tokyo_fully_connected_cx_4_3 from test.python.transpiler.test_preset_passmanagers.TestFinalLayouts the output circuit has a better estimated fidelity (although more gates in general). The transpiler output goes from an estimated fidelity of 0.9526614226294913 before the new pass was used to an estimated fidelity of 0.961996188569715 after the new pass is used. This new circuit with a better fidelity has a different initial layout set now, so the test is updated to use the new layout.
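The CX flips described above can be illustrated with a small sketch. It assumes the estimated fidelity of a sequence is the product of (1 - error) over its gates, which is a common model but is not confirmed here as the exact formula the pass uses. The function names and the 1q error values in the usage note are hypothetical; only the two 2q error rates come from the commit message.

```rust
/// Estimated fidelity of a gate sequence under the assumed product model.
fn estimated_fidelity(errors: &[f64]) -> f64 {
    errors.iter().map(|e| 1.0 - e).product()
}

/// True when the reversed 2q gate plus its extra 1q gates has a higher
/// estimated fidelity than the 2q gate in its original direction.
fn prefer_reversed(direct_2q_err: f64, reversed_2q_err: f64, extra_1q_errs: &[f64]) -> bool {
    let direct = estimated_fidelity(&[direct_2q_err]);
    let reversed = (1.0 - reversed_2q_err) * estimated_fidelity(extra_1q_errs);
    reversed > direct
}
```

With the error rates quoted for test_2q_circuit_5q_backend_v2 (0.00163587 original, 0.000779905 reversed) and three extra 1q gates at an assumed error of 0.0002 each, `prefer_reversed` returns true; at an assumed 0.001 each the extra gates outweigh the gain and it returns false.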
This commit adds an option to the TwoQubitPeepholeOptimization transpiler pass added in Qiskit#13419 to allow configuring the heuristic priority when instantiating the pass. When the pass decides which of multiple potential synthesis outcomes is the best, or when to replace a block with the best synthesis outcome, there are three metrics we look at: the estimated fidelity, the number of 2q gates, and the total number of gates. The order of this comparison is inherently flexible and prioritizes different aspects of the synthesis. This exposes a new option to the pass to enable users to specify which aspect they want the pass to prioritize.

Previously the pass was hard coded to prioritize 2q gate count, but as was discussed in the PR review and the commit history of the PR branch this isn't necessarily the ideal choice. Intuitively, assuming relatively accurate error rates in the target, the estimated fidelity should be the first priority. This was changed to prioritize two qubit gate count during development as a debugging step and left in place while we worked on issues with the TwoQubitControlledUDecomposer's 1q component handling (see Qiskit#16123). Now that the TwoQubitControlledUDecomposer issue has been resolved we can explore switching the default, and this new argument will make it much easier to do side by side comparisons of the different options. The intent is to enable updating the usage in the preset pass managers (added in Qiskit#16136) without having to modify the pass's rust code.

This does not include a release note as the new pass has not been included in a release yet, so there is no addition to a public API in this PR.
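A minimal sketch of what a configurable priority could look like. The `Priority` enum, its variant names, the `Metrics` struct, and `pick_best` are all hypothetical; the commit message does not give the real option name or its accepted values, so this only illustrates reordering the same three metrics.

```rust
use std::cmp::Ordering;

// Hypothetical option: which metric is compared first.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Priority {
    Fidelity,
    TwoQubitCount,
}

// The three metrics named in the commit message, for one candidate.
#[derive(Debug, Clone, Copy)]
struct Metrics {
    error: f64, // 1 - estimated fidelity
    two_qubit_count: usize,
    total_count: usize,
}

/// Builds a lexicographic key whose first element is the prioritized metric.
fn sort_key(m: Metrics, priority: Priority) -> (f64, f64, f64) {
    let (two_q, total) = (m.two_qubit_count as f64, m.total_count as f64);
    match priority {
        Priority::Fidelity => (m.error, two_q, total),
        Priority::TwoQubitCount => (two_q, m.error, total),
    }
}

/// Index of the candidate with the lowest key, or None if there are none.
fn pick_best(candidates: &[Metrics], priority: Priority) -> Option<usize> {
    candidates
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            sort_key(**a, priority)
                .partial_cmp(&sort_key(**b, priority))
                .unwrap_or(Ordering::Equal)
        })
        .map(|(index, _)| index)
}
```

The same candidate list can then be scored under both settings, which is the kind of side by side comparison the new argument is meant to make easy.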
* Use TwoQubitPeepholeOptimization in preset pass managers
* Remove unused imports
Summary
This commit adds a new transpiler pass for physical optimization,
TwoQubitPeepholeOptimization. This replaces the use of Collect2qBlocks,
ConsolidateBlocks, and UnitarySynthesis in the optimization stage for
a default pass manager setup. The pass logically works the same way
where it analyzes the dag to get a list of 2q runs, calculates the matrix
of each run, and then synthesizes the matrix and substitutes it inplace.
The distinction this pass makes though is it does this all in a single
pass and also parallelizes the matrix calculation and synthesis steps
because there is no data dependency there.
This new pass is not meant to fully replace the Collect2qBlocks,
ConsolidateBlocks, or UnitarySynthesis passes as those also run in
contexts where we don't have a physical circuit. This is meant instead
to replace their usage in the optimization stage only. Accordingly this
new pass also changes the logic on how we select the synthesis to use
and when to make a substitution. Previously this logic was primarily done
via the ConsolidateBlocks pass by only consolidating to a UnitaryGate if
the number of basis gates needed based on the weyl chamber coordinates
was less than the number of 2q gates in the block (see #11659 for
discussion on this). Since this new pass skips the explicit
consolidation stage we go ahead and try all the available synthesizers.
Right now this commit has a number of limitations, the largest are:
TwoQubitBasisDecomposer and TwoQubitControlledUDecomposer are used)
This pass doesn't support using the unitary synthesis plugin interface, since
it's optimized to use Qiskit's built-in two qubit synthesis routines written in
Rust. The existing combination of ConsolidateBlocks and UnitarySynthesis
should be used instead if the plugin interface is necessary.
Details and comments
Fixes #12007
Fixes #11659
TODO: