Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transpiler yields less optimal circuits on larger topologies #10160

Open
nonhermitian opened this issue May 25, 2023 · 11 comments
Open

Transpiler yields less optimal circuits on larger topologies #10160

nonhermitian opened this issue May 25, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@nonhermitian
Copy link
Contributor

Environment

  • Qiskit Terra version: latest (and likely previous)
  • Python version:
  • Operating system:

I have empirically noticed for some time that compiling circuits against smaller device topologies, e.g. a 16Q fake IBM backend vs. larger ones, yields improved output circuits, as defined by lower swap counts / added 2Q gates. Intuitively this does not make much sense as the smaller systems can be viewed as sub-graphs of the larger ones, and therefore the same optimal solutions found on the smaller graphs should be found within the larger topologies.

This is quite problematic because we are in a situation where we have large numbers of qubits, but the number of effectively usable qubits is less than the full device; exactly where this issue pops up.

What is happening?

To see this, I have plotted the number of added CX gates when compiling an all-ones BV circuits, an Efficient SU2 ansatz with circular entanglement, and QFT circuits as a function of size against various backends. Here 250 transpilations were performed using optimization_level=3. Bars show one standard deviation of the resulting distribution of added 2Q gates. It is clear to see that smaller device topology yields better circuits, or at least does no worse.

funny_business 001

How can we reproduce the issue?

Repeat the above, or use other circuits.

What should happen?

An optimal circuit found on a smaller machine should be discoverable on a larger device since they are sub-graphs.

Any suggestions?

That better solutions are directly linked to topology size, my guess is this is an issue with search space size, where somehow the transpiler is not finding better results in the larger space. @ashsaki has looked into this a bit with respect to the initial layout.

@nonhermitian nonhermitian added the bug Something isn't working label May 25, 2023
@jakelishman
Copy link
Member

With these plots, did you have an initial_layout set? If not, I'm curious what the results are if the initial layout on a 127q device is set to match the topology of the initial_layout found by the transpiler for a 16q device (or whatever). Knowing that might help determine whether we should focus efforts on smarter ways to search the initial_layout space as part of the layout passes, or smarter ways to solve the routing problem afterwards.

@nonhermitian
Copy link
Contributor Author

No initial layout was set.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue May 26, 2023
This commit builds on the VF2PartialLayout pass which was an experiment
available as an external plugin here:

https://github.com/mtreinish/vf2_partial_layout

That pass used the vf2 algorithm in rustworkx to find the deepest
partial interaction graph of a circuit which is isomorphic with the
coupling graph and uses that mapping to apply an initial layout. The
issue with the performance of that pass was the selection of the qubits
outside the partial interaction graph. Selecting the mapping for those
qubits is similar to the same heuristic layout that SabreLayout is
trying to solve, just for a subset of qubits. In VF2PartialLayout a
simple nearest neighbor based approach was used for selecting qubits
from the coupling graph for any virtual qubits outside the partial
layout. In practice this ended up performing worse than SabreLayout.

To address the shortcomings of that pass this commit combines the
partial layout selection from that external plugin with SabreLayout.
The sabre layout algorithm starts by randomly selecting a layout and
then progressively working forwards and backwards across the circuit
and swap mapping it to find the permutation caused by inserted swaps.
Those permutations are then used to modify the random layout and
eventual an initial layout that minimizes the number of swaps needed is
selected. With this commit instead of using a completely random layout
for all the initial guesses this starts a single trial with the partial
layout found in the same way as VF2PartialLayout. Then the remaining
qubits are selected at random and the Sabrelayout algorithm is run in
the same manner as before. This hopefully should improve the quality
of the results because we're starting from a partial layout that
doesn't require swaps for those qubits.

A similar (almost identical approach) was tried in Qiskit#9174 except instead
of seeding a single trial with the partial layout it used the partial
layout for all the the trials. In that case the results were not
generally better and the results were mixed. At the time my guess was
that using the partial layout constrained the search space too much and
was inducing more swaps to be needed. However, looking at the details in
issue Qiskit#10160 this adapts Qiskit#9174 to see if doing the partial layout in a
more limited manner has any impact there.
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue May 26, 2023
This commit builds on the VF2PartialLayout pass which was an experiment
available as an external plugin here:

https://github.com/mtreinish/vf2_partial_layout

That pass used the vf2 algorithm in rustworkx to find the deepest
partial interaction graph of a circuit which is isomorphic with the
coupling graph and uses that mapping to apply an initial layout. The
issue with the performance of that pass was the selection of the qubits
outside the partial interaction graph. Selecting the mapping for those
qubits is similar to the same heuristic layout that SabreLayout is
trying to solve, just for a subset of qubits. In VF2PartialLayout a
simple nearest neighbor based approach was used for selecting qubits
from the coupling graph for any virtual qubits outside the partial
layout. In practice this ended up performing worse than SabreLayout.

To address the shortcomings of that pass this commit combines the
partial layout selection from that external plugin with SabreLayout.
The sabre layout algorithm starts by randomly selecting a layout and
then progressively working forwards and backwards across the circuit
and swap mapping it to find the permutation caused by inserted swaps.
Those permutations are then used to modify the random layout and
eventual an initial layout that minimizes the number of swaps needed is
selected. With this commit instead of using a completely random layout
for all the initial guesses this starts a single trial with the partial
layout found in the same way as VF2PartialLayout. Then the remaining
qubits are selected at random and the Sabrelayout algorithm is run in
the same manner as before. This hopefully should improve the quality
of the results because we're starting from a partial layout that
doesn't require swaps for those qubits.

A similar (almost identical approach) was tried in Qiskit#9174 except instead
of seeding a single trial with the partial layout it used the partial
layout for all the the trials. In that case the results were not
generally better and the results were mixed. At the time my guess was
that using the partial layout constrained the search space too much and
was inducing more swaps to be needed. However, looking at the details in
issue Qiskit#10160 this adapts Qiskit#9174 to see if doing the partial layout in a
more limited manner has any impact there.
@nonhermitian
Copy link
Contributor Author

Following up on this a bit, I looked at comparing results vs PyTket for the case of circular entanglement. Because Tket is deterministic, but the Qiskit methods are not, I transpiled circuits from different width from 2->20 qubits 50 times, for 27, 127, and 433Q devices, and looked at the mean and spread of the results vs Tket.

On 27Q, Qiskit on average performs quite well verse Tket, and in the vast majority of cases the min value found after 50 trials was better than the deterministic Tket result. Every transpilation for Qiskit is much faster than Qiskit, so in this case the timings to do 50x the circuits is the same as 1 circuit in Tket

image

However, because of this issue, as the topology grows, the mean number of addtional gates due to swaps increases, and it becomes harder to out-perform Tket.

image

image

What we see is that, for identical circuits, the mean number of additional gates increases with topology size, and on average Tket begins to pull ahead. Basically, as the number of gates, and mean values increase, we are having a harder nad harder time of finding a better routing because we need to hit the tail end of a distribution that is moving further away. So on average, for this circuit at least, using the Qiskit transpiler is no better than Tket (but is faster). For 433Q, Tket is by far better on average in most cases. So in situations like the Runtime where this repeated transpilation is not carried out, the end user is getting subpar performance relative to other tools, and it becomes more and more imperative that the user transpiles locally. There is of course a cost to transpiling multiple times, and the cost grows with topology size for Qiskit, Tket seems to remain relatively constant

image

@jakelishman
Copy link
Member

Can you share the circuits and scripts that you're benchmarking here? Something seems odd about our transpile times being hundreds of seconds, because I've successfully routed circuits of 1,000+q in a fraction of that time.

@jakelishman
Copy link
Member

Fwiw, I have a pretty good guess at what's going on here: our random sampling of start locations on entry to SabreLayout is much less likely to be something decent when we've got more qubits in the map (more chance of qubits starting miles away from each other), which means we spend most of our forwards/backwards passes of routing just bringing the "initial" qubits closer together, before we even start on the actual meat of polishing that guess.

Our heuristic for choosing initial layouts to try at the moment is a uniform random sampling over the "n_total permute n_virtual" mappings, with no regard to initial distance, symmetries of the coupling map, or anything, whereas tket has deterministic routines to choose something decent, but are generally much slower (and in part is why I'm really wondering about your compile times).

@alexanderivrii
Copy link
Contributor

Jake, wouldn't this be mitigated by Sabre's forward-backward approach of updating layouts?

@jakelishman
Copy link
Member

My strong hunch is that it'll be somewhat mitigated, but not sufficiently - we only do a finite number of forward/backward passes, and there's nothing that stops Sabre routing towards the end from routing gates "away" from the "centre" of the populated qubits, which would need more passes to correct. Much of the back-and-forth is meant to help with settling on the best initial permutation from within a given set of initial physical qubits, but in the cases of a coupling map that's much larger than the number of virtual qubits, we have to use up some of our back-and-forth passes just trying to get a decent combination from the initial set, before we can start worrying about the best permutation within that.

There's also no guarantee that Sabre routing will bring all the qubits "close" to each at the start/end - only those that interact nearby, and that potentially stymies routing's choices later down the line, since one swap may only ever affect one gate, whereas if all the qubits are packed into a dense subspace of the coupling graph, there's the opportunity for one swap to help several gates in the lookahead layers.

This is all super heuristic and a bit of me just speaking from my intuition about how Sabre achieves its results rather than a formal study, but it sounds pretty plausible to me.

@mtreinish
Copy link
Member

I've been playing some more with #10169 and I'm not sure a "better" initial layout guess necessarily leads to better results. That PR sets a layout trial to use a partial isomorphic subgraph (it's build traversing the DAG and returns the mapping from the deepest node in the DAG) and then if there are extra qubits needed for a full layout (it's not always the case, for example for a circular entanglement EfficientSU2 a partial isomorphic subgraph is full width it just doesn't include the last cx to make it circular) it adds the nearest neighbors first. In testing the PR so far with a 16 qubit EfficientSU2 it rarely selects the starting partial layout as the one with the least amount of SWAPs. When it does select the partial layout it's not always the best possible result (based on other attempts which had lower 2q gate counts). This specific example might be a bad test for the approach in #10169 because the partial layout is the full circuit width and not accounting for the last interaction, I still need to do more exploration (also the code in the PR could have bugs, it's still a WIP). When I was testing the standalone PartialVF2Layout pass in https://github.com/mtreinish/vf2_partial_layout and also trying the hybrid with sabre approach last year in #9174 what I found is that the search space ends up being too constrained limiting the starting point to partial mappings, or at least that was my intuition at the time, and the early test results with #10169 seem to indicate the same thing.

I'm wondering if there is more of a middle ground we can influence the layout selected towards a better result but without constraining the search space too much. For #10169 the difference between that and my earlier attempt in #9174 is that the newer PR only uses one partial layout trial with nearest neighbors for the free qubits and the other trials are all full random. While the earlier attempt used a partial layout for all trials random permutations of the free qubits outside the partial layout. The other thing I want to try in this vain is instead of using a partial VF2 layout as the starting point is to try and use DenseLayout too and see if we just give sabre one trial as a starting point with the densest subgraph in the coupling map and see how that performs relative to full random and also the partial isomorphic case.

That all being said I'm wondering though if maybe it makes more sense in the meantime to just dial up the heuristic effort dials on SabreLayout for optimization level 3. So instead of running 20 layout trials, 4 layout iterations (back and forth routing calls), and 20 routing trials we do something like like multiply it by 5 and do 100 layout trials, 20 layout iterations, and 100 routing trials. We probably have sufficient time budget for O3 to spend a bit more time on layout with fully random trials, especially given the pass's performance has improved since we first set the number of trials we run.

@mtreinish
Copy link
Member

I ran a quick test to see what the impact of extended set size is on the output here. After doing more testing with different types of input layout seeding (using various forms of VF2PartialLayout, DenseLayout, and ordering by betweenness centrality) I was finding none made a difference. But also that the swap counts had wide swings between different seed trials. My thinking is that the extra degree of freedom in the larger coupling map is also impacting the routing search in addition to the layout and we're just not searching deeply enough in the lookahead portion of the heuristic to factor against going the "wrong way" with swap insertion. To test this I ran a transpile() of:

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi/2] * len(qc.parameters), inplace=True)
qc.measure_all()

against IBM Seattle with opt level 3 and a fixed seed. Then modified the sabre code to take the extended set size from an environment variable. The results are:

extended_set

The best case result found on IBM Guadalupe in my previous testing of this circuit was 168 2q gates which matches the graph when the extended set size was 72. There was also no measurable impact to the runtime because of this change as even with the larger sizes the runtime of transpile() was always < 5 sec and realistically was bound by the optimization loop time, which become a function of the output size from routing. I'm thinking to solve this the answer is to use a larger extended set size (in the short term while working a more thorough lookahead heuristic) and also increase the trial counts for level 3.

mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Aug 8, 2023
This commit increases the heuristic effort for sabre layout and routing.
This is made through 2 changes, the first is the depth of the internal
lookahead heuristic used in sabre swap has been increased from 20 to 72.
This is just a stop-gap for a potential reworking of the lookahead
heuristic. In local testing for larger backends and deeper circuits in
this is showing better output results than the current default of 20
without any runtime impact. The other aspect is that the trial counts
for each optimization level > 0 is increased 5x. This has a runtime cost,
but it the performance of the rust sabre implementation is fast enough
that even running the algorithm with more trials it is not a bottleneck
for typical compilation.

Related to Qiskit#10160
mtreinish added a commit to mtreinish/qiskit-core that referenced this issue Aug 8, 2023
This commit increases the heuristic effort for sabre layout and routing.
This is made through 2 changes, the first is the depth of the internal
lookahead heuristic used in sabre swap has been increased from 20 to 72.
This is just a stop-gap for a potential reworking of the lookahead
heuristic. In local testing for larger backends and deeper circuits in
this is showing better output results than the current default of 20
without any runtime impact. The other aspect is that the trial counts
for each optimization level > 0 is increased 5x. This has a runtime cost,
but it the performance of the rust sabre implementation is fast enough
that even running the algorithm with more trials it is not a bottleneck
for typical compilation.

Related to Qiskit#10160
@alexanderivrii
Copy link
Contributor

If we are specifically interested in transpiling all-ones BV circuits, then we can do a much job by implementing the CX-logic of these circuits adhering to linear-nearest-neighbor connectivity.

More precisely, the two circuits below are equivalent (while the second circuit has a perfect layout when the connectivity map of the device contains a line of the right size):

     ┌───┐      ░                           ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────░─┤ H ├┤M├────────────
     ├───┤      ░   │                       ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░───┼────■──────────────────░─┤ H ├─╫─┤M├─────────
     ├───┤      ░   │    │                  ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░───┼────┼────■─────────────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░   │    │    │             ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───┼────┼────┼────■────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░   │    │    │    │        ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░───┼────┼────┼────┼────■───░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░ ┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐ ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─┤ X ├┤ X ├┤ X ├┤ X ├┤ X ├─░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░ └───┘└───┘└───┘└───┘└───┘ ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                    0  1  2  3  4 
     ┌───┐      ░                                               ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────────────────────■───░─┤ H ├┤M├────────────
     ├───┤      ░ ┌─┴─┐                                   ┌─┴─┐ ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░─┤ X ├──■─────────────────────────────■──┤ X ├─░─┤ H ├─╫─┤M├─────────
     ├───┤      ░ └───┘┌─┴─┐                         ┌─┴─┐└───┘ ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░──────┤ X ├──■───────────────────■──┤ X ├──────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░      └───┘┌─┴─┐               ┌─┴─┐└───┘      ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───────────┤ X ├──■─────────■──┤ X ├───────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░           └───┘┌─┴─┐     ┌─┴─┐└───┘           ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░────────────────┤ X ├──■──┤ X ├────────────────░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░                └───┘┌─┴─┐└───┘                ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─────────────────────┤ X ├─────────────────────░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░                     └───┘                     ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                                        0  1  2  3  4 

@nonhermitian
Copy link
Contributor Author

BV is just an example that requires lots of swaps, if done in the canonical manner. It can be rewritten using resets as a 2Q problem. The issue here is that the swap mappers seem to be having a harder time on circuit requiring swaps as the search space grows

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants