Transpiler yields less optimal circuits on larger topologies #10160

nonhermitian · 2023-05-25T10:21:37Z

Environment

Qiskit Terra version: latest (and likely previous)
Python version:
Operating system:

I have empirically noticed for some time that compiling circuits against smaller device topologies, e.g. a 16Q fake IBM backend vs. larger ones, yields improved output circuits, as defined by lower swap counts / added 2Q gates. Intuitively this does not make much sense as the smaller systems can be viewed as sub-graphs of the larger ones, and therefore the same optimal solutions found on the smaller graphs should be found within the larger topologies.

This is quite problematic because we are in a situation where we have large numbers of qubits, but the number of effectively usable qubits is less than the full device; exactly where this issue pops up.

What is happening?

To see this, I have plotted the number of added CX gates when compiling an all-ones BV circuits, an Efficient SU2 ansatz with circular entanglement, and QFT circuits as a function of size against various backends. Here 250 transpilations were performed using optimization_level=3. Bars show one standard deviation of the resulting distribution of added 2Q gates. It is clear to see that smaller device topology yields better circuits, or at least does no worse.

How can we reproduce the issue?

Repeat the above, or use other circuits.

What should happen?

An optimal circuit found on a smaller machine should be discoverable on a larger device since they are sub-graphs.

Any suggestions?

That better solutions are directly linked to topology size, my guess is this is an issue with search space size, where somehow the transpiler is not finding better results in the larger space. @ashsaki has looked into this a bit with respect to the initial layout.

The text was updated successfully, but these errors were encountered:

jakelishman · 2023-05-25T10:42:37Z

With these plots, did you have an initial_layout set? If not, I'm curious what the results are if the initial layout on a 127q device is set to match the topology of the initial_layout found by the transpiler for a 16q device (or whatever). Knowing that might help determine whether we should focus efforts on smarter ways to search the initial_layout space as part of the layout passes, or smarter ways to solve the routing problem afterwards.

nonhermitian · 2023-05-25T10:54:55Z

No initial layout was set.

This commit builds on the VF2PartialLayout pass which was an experiment available as an external plugin here: https://github.com/mtreinish/vf2_partial_layout That pass used the vf2 algorithm in rustworkx to find the deepest partial interaction graph of a circuit which is isomorphic with the coupling graph and uses that mapping to apply an initial layout. The issue with the performance of that pass was the selection of the qubits outside the partial interaction graph. Selecting the mapping for those qubits is similar to the same heuristic layout that SabreLayout is trying to solve, just for a subset of qubits. In VF2PartialLayout a simple nearest neighbor based approach was used for selecting qubits from the coupling graph for any virtual qubits outside the partial layout. In practice this ended up performing worse than SabreLayout. To address the shortcomings of that pass this commit combines the partial layout selection from that external plugin with SabreLayout. The sabre layout algorithm starts by randomly selecting a layout and then progressively working forwards and backwards across the circuit and swap mapping it to find the permutation caused by inserted swaps. Those permutations are then used to modify the random layout and eventual an initial layout that minimizes the number of swaps needed is selected. With this commit instead of using a completely random layout for all the initial guesses this starts a single trial with the partial layout found in the same way as VF2PartialLayout. Then the remaining qubits are selected at random and the Sabrelayout algorithm is run in the same manner as before. This hopefully should improve the quality of the results because we're starting from a partial layout that doesn't require swaps for those qubits. A similar (almost identical approach) was tried in Qiskit#9174 except instead of seeding a single trial with the partial layout it used the partial layout for all the the trials. In that case the results were not generally better and the results were mixed. At the time my guess was that using the partial layout constrained the search space too much and was inducing more swaps to be needed. However, looking at the details in issue Qiskit#10160 this adapts Qiskit#9174 to see if doing the partial layout in a more limited manner has any impact there.

nonhermitian · 2023-07-26T16:05:50Z

Following up on this a bit, I looked at comparing results vs PyTket for the case of circular entanglement. Because Tket is deterministic, but the Qiskit methods are not, I transpiled circuits from different width from 2->20 qubits 50 times, for 27, 127, and 433Q devices, and looked at the mean and spread of the results vs Tket.

On 27Q, Qiskit on average performs quite well verse Tket, and in the vast majority of cases the min value found after 50 trials was better than the deterministic Tket result. Every transpilation for Qiskit is much faster than Qiskit, so in this case the timings to do 50x the circuits is the same as 1 circuit in Tket

However, because of this issue, as the topology grows, the mean number of addtional gates due to swaps increases, and it becomes harder to out-perform Tket.

What we see is that, for identical circuits, the mean number of additional gates increases with topology size, and on average Tket begins to pull ahead. Basically, as the number of gates, and mean values increase, we are having a harder nad harder time of finding a better routing because we need to hit the tail end of a distribution that is moving further away. So on average, for this circuit at least, using the Qiskit transpiler is no better than Tket (but is faster). For 433Q, Tket is by far better on average in most cases. So in situations like the Runtime where this repeated transpilation is not carried out, the end user is getting subpar performance relative to other tools, and it becomes more and more imperative that the user transpiles locally. There is of course a cost to transpiling multiple times, and the cost grows with topology size for Qiskit, Tket seems to remain relatively constant

jakelishman · 2023-07-26T16:30:00Z

Can you share the circuits and scripts that you're benchmarking here? Something seems odd about our transpile times being hundreds of seconds, because I've successfully routed circuits of 1,000+q in a fraction of that time.

jakelishman · 2023-07-26T16:37:45Z

Fwiw, I have a pretty good guess at what's going on here: our random sampling of start locations on entry to SabreLayout is much less likely to be something decent when we've got more qubits in the map (more chance of qubits starting miles away from each other), which means we spend most of our forwards/backwards passes of routing just bringing the "initial" qubits closer together, before we even start on the actual meat of polishing that guess.

Our heuristic for choosing initial layouts to try at the moment is a uniform random sampling over the "n_total permute n_virtual" mappings, with no regard to initial distance, symmetries of the coupling map, or anything, whereas tket has deterministic routines to choose something decent, but are generally much slower (and in part is why I'm really wondering about your compile times).

alexanderivrii · 2023-07-27T09:00:21Z

Jake, wouldn't this be mitigated by Sabre's forward-backward approach of updating layouts?

jakelishman · 2023-07-27T09:55:00Z

My strong hunch is that it'll be somewhat mitigated, but not sufficiently - we only do a finite number of forward/backward passes, and there's nothing that stops Sabre routing towards the end from routing gates "away" from the "centre" of the populated qubits, which would need more passes to correct. Much of the back-and-forth is meant to help with settling on the best initial permutation from within a given set of initial physical qubits, but in the cases of a coupling map that's much larger than the number of virtual qubits, we have to use up some of our back-and-forth passes just trying to get a decent combination from the initial set, before we can start worrying about the best permutation within that.

There's also no guarantee that Sabre routing will bring all the qubits "close" to each at the start/end - only those that interact nearby, and that potentially stymies routing's choices later down the line, since one swap may only ever affect one gate, whereas if all the qubits are packed into a dense subspace of the coupling graph, there's the opportunity for one swap to help several gates in the lookahead layers.

This is all super heuristic and a bit of me just speaking from my intuition about how Sabre achieves its results rather than a formal study, but it sounds pretty plausible to me.

mtreinish · 2023-07-30T14:42:56Z

I've been playing some more with #10169 and I'm not sure a "better" initial layout guess necessarily leads to better results. That PR sets a layout trial to use a partial isomorphic subgraph (it's build traversing the DAG and returns the mapping from the deepest node in the DAG) and then if there are extra qubits needed for a full layout (it's not always the case, for example for a circular entanglement EfficientSU2 a partial isomorphic subgraph is full width it just doesn't include the last cx to make it circular) it adds the nearest neighbors first. In testing the PR so far with a 16 qubit EfficientSU2 it rarely selects the starting partial layout as the one with the least amount of SWAPs. When it does select the partial layout it's not always the best possible result (based on other attempts which had lower 2q gate counts). This specific example might be a bad test for the approach in #10169 because the partial layout is the full circuit width and not accounting for the last interaction, I still need to do more exploration (also the code in the PR could have bugs, it's still a WIP). When I was testing the standalone PartialVF2Layout pass in https://github.com/mtreinish/vf2_partial_layout and also trying the hybrid with sabre approach last year in #9174 what I found is that the search space ends up being too constrained limiting the starting point to partial mappings, or at least that was my intuition at the time, and the early test results with #10169 seem to indicate the same thing.

I'm wondering if there is more of a middle ground we can influence the layout selected towards a better result but without constraining the search space too much. For #10169 the difference between that and my earlier attempt in #9174 is that the newer PR only uses one partial layout trial with nearest neighbors for the free qubits and the other trials are all full random. While the earlier attempt used a partial layout for all trials random permutations of the free qubits outside the partial layout. The other thing I want to try in this vain is instead of using a partial VF2 layout as the starting point is to try and use DenseLayout too and see if we just give sabre one trial as a starting point with the densest subgraph in the coupling map and see how that performs relative to full random and also the partial isomorphic case.

That all being said I'm wondering though if maybe it makes more sense in the meantime to just dial up the heuristic effort dials on SabreLayout for optimization level 3. So instead of running 20 layout trials, 4 layout iterations (back and forth routing calls), and 20 routing trials we do something like like multiply it by 5 and do 100 layout trials, 20 layout iterations, and 100 routing trials. We probably have sufficient time budget for O3 to spend a bit more time on layout with fully random trials, especially given the pass's performance has improved since we first set the number of trials we run.

mtreinish · 2023-08-08T11:57:24Z

I ran a quick test to see what the impact of extended set size is on the output here. After doing more testing with different types of input layout seeding (using various forms of VF2PartialLayout, DenseLayout, and ordering by betweenness centrality) I was finding none made a difference. But also that the swap counts had wide swings between different seed trials. My thinking is that the extra degree of freedom in the larger coupling map is also impacting the routing search in addition to the layout and we're just not searching deeply enough in the lookahead portion of the heuristic to factor against going the "wrong way" with swap insertion. To test this I ran a transpile() of:

qc = EfficientSU2(16, entanglement='circular', reps=6, flatten=True)
qc.assign_parameters([math.pi/2] * len(qc.parameters), inplace=True)
qc.measure_all()

against IBM Seattle with opt level 3 and a fixed seed. Then modified the sabre code to take the extended set size from an environment variable. The results are:

The best case result found on IBM Guadalupe in my previous testing of this circuit was 168 2q gates which matches the graph when the extended set size was 72. There was also no measurable impact to the runtime because of this change as even with the larger sizes the runtime of transpile() was always < 5 sec and realistically was bound by the optimization loop time, which become a function of the output size from routing. I'm thinking to solve this the answer is to use a larger extended set size (in the short term while working a more thorough lookahead heuristic) and also increase the trial counts for level 3.

This commit increases the heuristic effort for sabre layout and routing. This is made through 2 changes, the first is the depth of the internal lookahead heuristic used in sabre swap has been increased from 20 to 72. This is just a stop-gap for a potential reworking of the lookahead heuristic. In local testing for larger backends and deeper circuits in this is showing better output results than the current default of 20 without any runtime impact. The other aspect is that the trial counts for each optimization level > 0 is increased 5x. This has a runtime cost, but it the performance of the rust sabre implementation is fast enough that even running the algorithm with more trials it is not a bottleneck for typical compilation. Related to Qiskit#10160

alexanderivrii · 2023-08-20T08:03:35Z

If we are specifically interested in transpiling all-ones BV circuits, then we can do a much job by implementing the CX-logic of these circuits adhering to linear-nearest-neighbor connectivity.

More precisely, the two circuits below are equivalent (while the second circuit has a perfect layout when the connectivity map of the device contains a line of the right size):

     ┌───┐      ░                           ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────░─┤ H ├┤M├────────────
     ├───┤      ░   │                       ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░───┼────■──────────────────░─┤ H ├─╫─┤M├─────────
     ├───┤      ░   │    │                  ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░───┼────┼────■─────────────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░   │    │    │             ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───┼────┼────┼────■────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░   │    │    │    │        ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░───┼────┼────┼────┼────■───░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░ ┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐┌─┴─┐ ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─┤ X ├┤ X ├┤ X ├┤ X ├┤ X ├─░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░ └───┘└───┘└───┘└───┘└───┘ ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                    0  1  2  3  4 
     ┌───┐      ░                                               ░ ┌───┐┌─┐            
q_0: ┤ H ├──────░───■───────────────────────────────────────■───░─┤ H ├┤M├────────────
     ├───┤      ░ ┌─┴─┐                                   ┌─┴─┐ ░ ├───┤└╥┘┌─┐         
q_1: ┤ H ├──────░─┤ X ├──■─────────────────────────────■──┤ X ├─░─┤ H ├─╫─┤M├─────────
     ├───┤      ░ └───┘┌─┴─┐                         ┌─┴─┐└───┘ ░ ├───┤ ║ └╥┘┌─┐      
q_2: ┤ H ├──────░──────┤ X ├──■───────────────────■──┤ X ├──────░─┤ H ├─╫──╫─┤M├──────
     ├───┤      ░      └───┘┌─┴─┐               ┌─┴─┐└───┘      ░ ├───┤ ║  ║ └╥┘┌─┐   
q_3: ┤ H ├──────░───────────┤ X ├──■─────────■──┤ X ├───────────░─┤ H ├─╫──╫──╫─┤M├───
     ├───┤      ░           └───┘┌─┴─┐     ┌─┴─┐└───┘           ░ ├───┤ ║  ║  ║ └╥┘┌─┐
q_4: ┤ H ├──────░────────────────┤ X ├──■──┤ X ├────────────────░─┤ H ├─╫──╫──╫──╫─┤M├
     ├───┤┌───┐ ░                └───┘┌─┴─┐└───┘                ░ └───┘ ║  ║  ║  ║ └╥┘
q_5: ┤ X ├┤ H ├─░─────────────────────┤ X ├─────────────────────░───────╫──╫──╫──╫──╫─
     └───┘└───┘ ░                     └───┘                     ░       ║  ║  ║  ║  ║ 
c: 5/═══════════════════════════════════════════════════════════════════╩══╩══╩══╩══╩═
                                                                        0  1  2  3  4

nonhermitian · 2023-08-20T10:39:09Z

BV is just an example that requires lots of swaps, if done in the canonical manner. It can be rewritten using resets as a 2Q problem. The issue here is that the swap mappers seem to be having a harder time on circuit requiring swaps as the search space grows

nonhermitian added the bug Something isn't working label May 25, 2023

mtreinish mentioned this issue May 26, 2023

[WIP] Use VF2 to find a partial layout for seeding a SabreLayout trial #10169

Closed

4 tasks

mtreinish mentioned this issue Aug 8, 2023

Increase heuristic effort for sabre #10588

Closed

alexanderivrii mentioned this issue Sep 15, 2023

Starting layout analysis pass for sabre #10829

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Transpiler yields less optimal circuits on larger topologies #10160

Transpiler yields less optimal circuits on larger topologies #10160

nonhermitian commented May 25, 2023

jakelishman commented May 25, 2023

nonhermitian commented May 25, 2023

nonhermitian commented Jul 26, 2023

jakelishman commented Jul 26, 2023

jakelishman commented Jul 26, 2023

alexanderivrii commented Jul 27, 2023

jakelishman commented Jul 27, 2023

mtreinish commented Jul 30, 2023

mtreinish commented Aug 8, 2023

alexanderivrii commented Aug 20, 2023

nonhermitian commented Aug 20, 2023

Transpiler yields less optimal circuits on larger topologies #10160

Transpiler yields less optimal circuits on larger topologies #10160

Comments

nonhermitian commented May 25, 2023

Environment

What is happening?

How can we reproduce the issue?

What should happen?

Any suggestions?

jakelishman commented May 25, 2023

nonhermitian commented May 25, 2023

nonhermitian commented Jul 26, 2023

jakelishman commented Jul 26, 2023

jakelishman commented Jul 26, 2023

alexanderivrii commented Jul 27, 2023

jakelishman commented Jul 27, 2023

mtreinish commented Jul 30, 2023

mtreinish commented Aug 8, 2023

alexanderivrii commented Aug 20, 2023

nonhermitian commented Aug 20, 2023