Improve I2C Flush (Update) Speed #94

careyk007 · 2019-10-03T21:35:31Z

This increases the performance of the I2C interface and possibly the SPI interface as well, but I don't have a SPI-based SSD1306 for verification. The main idea comes from the fact that the SSD1306 has support for writing to a subset of the display memory buffer as if it were contiguous. This means that if you only have a small window within the display to update, you don't have to update the entire display.

This is achieved by keeping track of the locations of the pixels that have changed since the last call to flush or reset. The only values we need to track are the max and min values for the x and y coordinates of each pixel. Once a new call to flush happens (fast_flush in this pull request), the elements of the display buffer on the micro-controller that are within the minimum bounding rectangle of changed pixels are copied to a secondary buffer, the SSD1306 is commanded to work with the window subset of the minimum bounding rectangle, and the secondary buffer is sent to the SSD1306.

I have been developing on the STM32F411-DISCOVERY board since my Blue Pills are still on their way across the ocean so I can't speak to the results with the Blue Pill, but I have seen some truly impressive results with the Discovery board. If you are only updating half of the display size, the fast_flush results in a 28.5% faster update. If you are only updating a quarter of the display, that jumps up to a 61.9% faster update.

The downside of this approach is that the micro-controller does have to do some additional computation while copying the changed pixels into a secondary buffer. This means that as the size of the minimum bounding rectangle approaches the display size, the fast_flush approach actually becomes slower than just updating the entire display. When the update area is greater than 75% of the display area, fast_flush calls flush to do a full display update.

Perhaps the biggest performance jump may come from changing the draw function to take an iterator as a parameter instead of a slice, but I'm not sure how that would work and if that would introduce any breaking changes.

Improvements

Faster screen updates when using an I2C DisplayInterface
- The smaller the area of the screen you are writing to, the faster the update

Problems

Need to double the buffer size
- This would only be for 128x68 displays since smaller displays could reuse space in the buffer
- This may also be negated by using iterators instead of slices
Faster screen update times might not be a good thing
- Since the current flush call updates the entire screen, the function always takes the same amount of time to return
- Maybe some users are using this consistent return time as a key part of their program?

therealprof · 2019-10-04T00:03:41Z

Very nice!

A few observations from my side:

I don't think it is actually necessary do keep a second buffer. It should be sufficient if the windowing is done while sending out the buffer.
You're doing a lot of divisions in your code, those should be avoided if possible; the compiler is not smart enough to avoid them in some cases and will emit emulated divisions on some MCUs which are abysmally slow and bloaty; ideally the thresholding and separate function should be completely irrelevant and provide close to the same speed.
It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

Re: Faster screen update times might not be a good thing

I don't think any sane person would ever rely on the timing of an I2C bus for their application; there are so many things which can go sideways here that it's completely unusable.

careyk007 · 2019-10-04T00:56:34Z

Thanks for the feedback!

You're doing a lot of divisions in your code, those should be avoided if possible

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

I definitely agree. My first thought was to use Rust's powerful iterators, but I haven't been having much success with this approach. My other thought was to modify DisplayProperties::draw() and DisplayInterface::send_data() to take the bounds of the update window and apply the filter while sending the data. Both approaches would probably eliminate the need for a secondary buffer, which would be awesome.

therealprof · 2019-10-04T01:09:38Z

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

Divisions of unsigned integers by powers of 2 are unproblematic. The same for signed integers can sometimes be problematic. It's more the odd divisors I'm worried about...

I also tried iterators a while ago without much success. It's probably best to add the windowing directly to lowlevel send functions.

careyk007 · 2019-10-04T03:20:49Z

I have moved the windowing into I2cInterface and SpiInterface. I decided to add a new send_bounded_data() method instead of re-writing send_data(), which also required adding a bounded_draw() method to DisplayProperties.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations. As far as performance boost goes, updating a quarter of the display is now 68.5% faster than the original flush (increased from 61.9% previously), updating half of the display is now 37.5% faster (increased from 28.5% previously), and updating the full display is actually 0.5% faster (as opposed to a 10% hit previously).

therealprof · 2019-10-04T06:33:57Z

Great job, sounds fantastic. @jamwaffles do you have time to give it a try? I'm a bit tied up at the moment.

jamwaffles · 2019-10-04T08:23:17Z

Thanks for this PR! The perf numbers are looking good. I'll test and review it more thoroughly when I have some time to get my Blue Pill out.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations.

To satisfy my own curiosity, Rust is smart enough to convert all these to lsrs (shift right) instructions for thumbv7. See this example on Godbolt. I think it's also converted in debug mode, but has some additional bounds checks thrown in.

I don't think any sane person would ever rely on the timing of an I2C bus for their application

And if they do, it's not our problem 😂. Regular screen updates should be handled by interrupts or something IMO, so I2C timing isn't much to worry about from the driver's side.

therealprof · 2019-10-04T08:30:59Z

@jamwaffles

If you divide by a non-power of two on thumbv6m you'll get this:

example::divide:
        push    {r7, lr}
        subs    r0, r0, r1
        movs    r1, #7
        bl      __aeabi_uidiv
        pop     {r7, pc}

on thumbv7m it's a good deal better:

example::divide:
        subs    r0, r0, r1
        movw    r1, #18725
        movt    r1, #9362
        umull   r1, r2, r0, r1
        subs    r0, r0, r2
        add.w   r0, r2, r0, lsr #1
        lsrs    r0, r0, #2
        bx      lr

As I said: Divides by power of 2 should be okay for unsigned variables on any platform, anything else depends...

jamwaffles

I haven't run it on a physical display yet so this first review is just looking at the code. A few style nits but otherwise ok.

Can I ask you to run cargo fmt as well? I've opened #95 to check formatting in CI which will break this PR's build.

src/mode/graphics.rs

jamwaffles · 2019-10-04T12:20:01Z

Tested on a physical I2C device and it looks to be behaving as expected. The photo below is two demos run back to back with the clear/in it calls removed in the second. You can see the partial update working so yay!

jamwaffles · 2019-10-04T12:21:46Z

I'll test on an SPI device this evening.

It would be good to:

Use fast_flush() on both SPI and I2C devices and
Integrate the fast_flush() logic into flush() as @therealprof mentioned - it keeps the API cleaner with only one method then.

careyk007 · 2019-10-04T15:31:37Z

Thanks a bunch for the feedback. I have made the style changes and run cargo fmt. I have kept fast_flush() and flush() separate for now for easier testing, but will integrate the fast_flush() logic into flush() once SPI performance has been benchmarked.

I have been testing in release mode, but ran my tests again in debug mode and got similar performance boosts. For those interested, the performance boost of simply changing from debug to release is about 70%.

jamwaffles · 2019-10-06T10:50:28Z

I'm seeing some odd behaviour with SPI displays:

I'll do a bit of digging with this SPI display and see where I get.

The image_i2c example runs much faster to the eye when using fast_flush(), however it'd be great to get some numbers - how are you benchmarking this PR?

jamwaffles · 2019-10-06T10:52:15Z

It also just occurred to me that we should make sure fast_flush() works in rotations other than Rotate0. Has this been tested yet?

jamwaffles · 2019-10-06T11:33:39Z

The examples use the following pattern to initialise and clear the display:

disp.reset(&mut rst, &mut delay).unwrap();
disp.init().unwrap();
disp.fast_flush().unwrap();

Because no pixels have been set in the buffer at this point, fast_flush() is a noop. To maintain existing behaviour, I think fast_flush() should clear the display if no pixels have been set yet. This could be changed in the future to introduce a clear() method, but for now I'd like to keep the existing behaviour.

Also, here's another example of the odd SPI behaviour. I ported the image_i2c example to use the SPI interface:

careyk007 · 2019-10-08T01:56:31Z

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

I've put together a gist with the logic I've been using for benchmarking. For some reason, I haven't been able to get my machine to compile for the Blue Pill, so there's no guarantee that the gist will compile, but the program logic should be sound. If you'd like, I can share the code I'm using to test on my STM32F4 Discovery board.

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

As far as fast_flush defaulting to a clear when no pixels have changed, I feel like this might result in slower performance if the user calls a double fast_flush by accident. Would an alternative be to have the init method call clear so that the first call to fast_flush does indeed clear the entire screen, but subsequent calls will operate as fast as possible?

jamwaffles · 2019-10-08T12:56:24Z

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

Yep, that's fixed it! The artifacts are from the power-on state of this display - it seems to just leave some pixels randomly turned on. Using a whole-screen flush as in master clears the whole display which is why this issue hasn't cropped up until this PR.

Would an alternative be to have the init method call clear

Yeah, I think this is a good idea - the artifacts mentioned above would be cleared as before then. I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO. For example, this is me testing rotations 3 times:

I've put together a gist with the logic I've been using for benchmarking.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers 😁

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough. This is the image_i2c example running over SPI with a rotation of Rotate270 (ignore the artifacts - I had just powered the display on):

careyk007 · 2019-10-09T19:42:08Z

I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO.

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers grin

No problem! Unfortunately I don't have an oscilloscope or logic analyzer either, so I'm relying on the "super precise" method of hooking up an ESP32 to the GPIO pin and measuring the time between edges. I probably should have mentioned this in my initial comment.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough.

That is quite strange. I'll try to get my hands on a SPI display so that I can do local testing and figure out what's going on.

careyk007 · 2019-10-10T23:48:01Z

Out of curiosity, what opt-level are you testing at? I ran into an issue with another crate that had bad SPI writes if the opt-level wasn't set to 1, 2, or 'z'. I don't think this is the issue since you are actually getting data transferred, but I figured it wouldn't hurt to ask.

jamwaffles · 2019-10-11T08:52:27Z

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Mint. I'll do another test run later :)

I've just been running it with cargo run --example <example> on a Blue Pill over OpenOCD. I'll try running with --release and make sure it still works.

jamwaffles · 2019-10-08T12:58:39Z

src/mode/graphics.rs

+            DisplayRotation::Rotate90 | DisplayRotation::Rotate270 => {
+                ((self.max_x | 7).min(width), (self.max_y + 1).min(height))
+            }
+        };

        self.min_x = width - 1;
        self.max_x = 0;
        self.min_y = width - 1;


Should this be height?

I'm pretty sure it is correct as it is since the width and height take into account the display's orientation. Just for fun, I tried swapping width and height in line 169 and got a really cool (but unfortunately incorrect) vertically mirrored display.

jamwaffles · 2019-10-11T19:45:16Z

Pahaha please ignore my dumb ass, the rotation is working fine - the examples set an offset to center the Rust logo in the center of the screen horizontally. This of course changes when rotated by 90/270 degrees so instead of translating by Point::new(0, 32), the fix is to simply translate by Point::new(32, 0) in the demo code - sorry about that!

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

I think the examples and doc examples can have the initial call to display.flush().unwrap(), etc removed in this case. Aside from this last cleanup step I think we're good to merge unless @therealprof has some extra comments.

careyk007 · 2019-10-11T20:08:14Z

Awesome, glad you got it sorted!

Just to be clear, should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed? And should I also replace flush with fast_flush at this point?

therealprof

I haven't verified on hardware myself but intention and code look good to me.

jamwaffles · 2019-10-12T08:46:11Z

should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed?

Yes please

And should I also replace flush with fast_flush at this point?

Ah yes please. I forgot about that

careyk007 · 2019-10-13T03:34:44Z

Done and done!

jamwaffles · 2019-10-14T15:25:19Z

Sweet! I'll do a final review/physical device test as soon as I can.

jamwaffles

Did a final test pass on both I2C and SPI devices which both seem to work fine. Code looks good too so I'll get this merged. Thanks for your time and hard work!

jamwaffles · 2019-10-14T21:33:31Z

Released in 0.3.0-alpha.2. Thanks!

careyk007 added 2 commits October 3, 2019 07:06

Added fast_flush

8d4b1f4

Adding bounded data transfers

95f69a5

Formating and style updates

97299b9

jamwaffles requested changes Oct 4, 2019

View reviewed changes

Merge branch 'master' into master

821f848

Fix SPI error and add support for screen rotation

6e9f924

Added clear on initialization

05c38af

jamwaffles requested changes Oct 11, 2019

View reviewed changes

therealprof previously approved these changes Oct 11, 2019

View reviewed changes

Rename fast_flush and update examples

d1bd7c7

careyk007 dismissed therealprof’s stale review via d1bd7c7 October 13, 2019 03:33

jamwaffles approved these changes Oct 14, 2019

View reviewed changes

Merge branch 'master' into master

b9d2a9c

jamwaffles merged commit 2d0a445 into rust-embedded-community:master Oct 14, 2019

jamwaffles mentioned this pull request Oct 29, 2019

simulator speedup embedded-graphics/embedded-graphics#190

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve I2C Flush (Update) Speed #94

Improve I2C Flush (Update) Speed #94

careyk007 commented Oct 3, 2019

therealprof commented Oct 4, 2019

careyk007 commented Oct 4, 2019

therealprof commented Oct 4, 2019

careyk007 commented Oct 4, 2019

therealprof commented Oct 4, 2019

jamwaffles commented Oct 4, 2019

therealprof commented Oct 4, 2019

jamwaffles left a comment

jamwaffles commented Oct 4, 2019

jamwaffles commented Oct 4, 2019

careyk007 commented Oct 4, 2019

jamwaffles commented Oct 6, 2019

jamwaffles commented Oct 6, 2019

jamwaffles commented Oct 6, 2019

careyk007 commented Oct 8, 2019

jamwaffles commented Oct 8, 2019

careyk007 commented Oct 9, 2019

careyk007 commented Oct 10, 2019

jamwaffles commented Oct 11, 2019

jamwaffles Oct 8, 2019

careyk007 Oct 11, 2019

jamwaffles commented Oct 11, 2019

careyk007 commented Oct 11, 2019

therealprof left a comment

jamwaffles commented Oct 12, 2019

careyk007 commented Oct 13, 2019

jamwaffles commented Oct 14, 2019

jamwaffles left a comment

jamwaffles commented Oct 14, 2019

Improve I2C Flush (Update) Speed #94

Improve I2C Flush (Update) Speed #94

Conversation

careyk007 commented Oct 3, 2019

Improvements

Problems

therealprof commented Oct 4, 2019

careyk007 commented Oct 4, 2019

therealprof commented Oct 4, 2019

careyk007 commented Oct 4, 2019

therealprof commented Oct 4, 2019

jamwaffles commented Oct 4, 2019

therealprof commented Oct 4, 2019

jamwaffles left a comment

Choose a reason for hiding this comment

jamwaffles commented Oct 4, 2019

jamwaffles commented Oct 4, 2019

careyk007 commented Oct 4, 2019

jamwaffles commented Oct 6, 2019

jamwaffles commented Oct 6, 2019

jamwaffles commented Oct 6, 2019

careyk007 commented Oct 8, 2019

jamwaffles commented Oct 8, 2019

careyk007 commented Oct 9, 2019

careyk007 commented Oct 10, 2019

jamwaffles commented Oct 11, 2019

jamwaffles Oct 8, 2019

Choose a reason for hiding this comment

careyk007 Oct 11, 2019

Choose a reason for hiding this comment

jamwaffles commented Oct 11, 2019

careyk007 commented Oct 11, 2019

therealprof left a comment

Choose a reason for hiding this comment

jamwaffles commented Oct 12, 2019

careyk007 commented Oct 13, 2019

jamwaffles commented Oct 14, 2019

jamwaffles left a comment

Choose a reason for hiding this comment

jamwaffles commented Oct 14, 2019