Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve I2C Flush (Update) Speed #94

Merged
merged 8 commits into from
Oct 14, 2019
Merged

Improve I2C Flush (Update) Speed #94

merged 8 commits into from
Oct 14, 2019

Conversation

careyk007
Copy link
Contributor

This increases the performance of the I2C interface and possibly the SPI interface as well, but I don't have a SPI-based SSD1306 for verification. The main idea comes from the fact that the SSD1306 has support for writing to a subset of the display memory buffer as if it were contiguous. This means that if you only have a small window within the display to update, you don't have to update the entire display.

This is achieved by keeping track of the locations of the pixels that have changed since the last call to flush or reset. The only values we need to track are the max and min values for the x and y coordinates of each pixel. Once a new call to flush happens (fast_flush in this pull request), the elements of the display buffer on the micro-controller that are within the minimum bounding rectangle of changed pixels are copied to a secondary buffer, the SSD1306 is commanded to work with the window subset of the minimum bounding rectangle, and the secondary buffer is sent to the SSD1306.

I have been developing on the STM32F411-DISCOVERY board since my Blue Pills are still on their way across the ocean so I can't speak to the results with the Blue Pill, but I have seen some truly impressive results with the Discovery board. If you are only updating half of the display size, the fast_flush results in a 28.5% faster update. If you are only updating a quarter of the display, that jumps up to a 61.9% faster update.

The downside of this approach is that the micro-controller does have to do some additional computation while copying the changed pixels into a secondary buffer. This means that as the size of the minimum bounding rectangle approaches the display size, the fast_flush approach actually becomes slower than just updating the entire display. When the update area is greater than 75% of the display area, fast_flush calls flush to do a full display update.

Perhaps the biggest performance jump may come from changing the draw function to take an iterator as a parameter instead of a slice, but I'm not sure how that would work and if that would introduce any breaking changes.

Improvements

  • Faster screen updates when using an I2C DisplayInterface
    • The smaller the area of the screen you are writing to, the faster the update

Problems

  • Need to double the buffer size
    • This would only be for 128x68 displays since smaller displays could reuse space in the buffer
    • This may also be negated by using iterators instead of slices
  • Faster screen update times might not be a good thing
    • Since the current flush call updates the entire screen, the function always takes the same amount of time to return
    • Maybe some users are using this consistent return time as a key part of their program?

@therealprof
Copy link
Collaborator

Very nice!

A few observations from my side:

  • I don't think it is actually necessary do keep a second buffer. It should be sufficient if the windowing is done while sending out the buffer.
  • You're doing a lot of divisions in your code, those should be avoided if possible; the compiler is not smart enough to avoid them in some cases and will emit emulated divisions on some MCUs which are abysmally slow and bloaty; ideally the thresholding and separate function should be completely irrelevant and provide close to the same speed.
  • It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

Re: Faster screen update times might not be a good thing

I don't think any sane person would ever rely on the timing of an I2C bus for their application; there are so many things which can go sideways here that it's completely unusable.

@careyk007
Copy link
Contributor Author

Thanks for the feedback!

You're doing a lot of divisions in your code, those should be avoided if possible

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

It would be really cool if this could be simplified so it only comes with negligible overhead so it could always be used as the standard flush function

I definitely agree. My first thought was to use Rust's powerful iterators, but I haven't been having much success with this approach. My other thought was to modify DisplayProperties::draw() and DisplayInterface::send_data() to take the bounds of the update window and apply the filter while sending the data. Both approaches would probably eliminate the need for a secondary buffer, which would be awesome.

@therealprof
Copy link
Collaborator

Great catch. It looks like most of the divisions are by powers of 2, so I am working on replacing those with bit shift operations.

Divisions of unsigned integers by powers of 2 are unproblematic. The same for signed integers can sometimes be problematic. It's more the odd divisors I'm worried about...

I also tried iterators a while ago without much success. It's probably best to add the windowing directly to lowlevel send functions.

@careyk007
Copy link
Contributor Author

I have moved the windowing into I2cInterface and SpiInterface. I decided to add a new send_bounded_data() method instead of re-writing send_data(), which also required adding a bounded_draw() method to DisplayProperties.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations. As far as performance boost goes, updating a quarter of the display is now 68.5% faster than the original flush (increased from 61.9% previously), updating half of the display is now 37.5% faster (increased from 28.5% previously), and updating the full display is actually 0.5% faster (as opposed to a 10% hit previously).

@therealprof
Copy link
Collaborator

Great job, sounds fantastic. @jamwaffles do you have time to give it a try? I'm a bit tied up at the moment.

@jamwaffles
Copy link
Collaborator

Thanks for this PR! The perf numbers are looking good. I'll test and review it more thoroughly when I have some time to get my Blue Pill out.

This approach has removed most of the divisions, except for 2 initial divide-by-8 calculations.

To satisfy my own curiosity, Rust is smart enough to convert all these to lsrs (shift right) instructions for thumbv7. See this example on Godbolt. I think it's also converted in debug mode, but has some additional bounds checks thrown in.

I don't think any sane person would ever rely on the timing of an I2C bus for their application

And if they do, it's not our problem 😂. Regular screen updates should be handled by interrupts or something IMO, so I2C timing isn't much to worry about from the driver's side.

@therealprof
Copy link
Collaborator

@jamwaffles

If you divide by a non-power of two on thumbv6m you'll get this:

example::divide:
        push    {r7, lr}
        subs    r0, r0, r1
        movs    r1, #7
        bl      __aeabi_uidiv
        pop     {r7, pc}

on thumbv7m it's a good deal better:

example::divide:
        subs    r0, r0, r1
        movw    r1, #18725
        movt    r1, #9362
        umull   r1, r2, r0, r1
        subs    r0, r0, r2
        add.w   r0, r2, r0, lsr #1
        lsrs    r0, r0, #2
        bx      lr

As I said: Divides by power of 2 should be okay for unsigned variables on any platform, anything else depends...

Copy link
Collaborator

@jamwaffles jamwaffles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't run it on a physical display yet so this first review is just looking at the code. A few style nits but otherwise ok.

Can I ask you to run cargo fmt as well? I've opened #95 to check formatting in CI which will break this PR's build.

src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
src/mode/graphics.rs Outdated Show resolved Hide resolved
@jamwaffles
Copy link
Collaborator

Tested on a physical I2C device and it looks to be behaving as expected. The photo below is two demos run back to back with the clear/in it calls removed in the second. You can see the partial update working so yay!

IMG_20191004_131717

@jamwaffles
Copy link
Collaborator

I'll test on an SPI device this evening.

It would be good to:

  1. Use fast_flush() on both SPI and I2C devices and
  2. Integrate the fast_flush() logic into flush() as @therealprof mentioned - it keeps the API cleaner with only one method then.

@careyk007
Copy link
Contributor Author

Thanks a bunch for the feedback. I have made the style changes and run cargo fmt. I have kept fast_flush() and flush() separate for now for easier testing, but will integrate the fast_flush() logic into flush() once SPI performance has been benchmarked.

I have been testing in release mode, but ran my tests again in debug mode and got similar performance boosts. For those interested, the performance boost of simply changing from debug to release is about 70%.

@jamwaffles
Copy link
Collaborator

I'm seeing some odd behaviour with SPI displays:

image

I'll do a bit of digging with this SPI display and see where I get.

The image_i2c example runs much faster to the eye when using fast_flush(), however it'd be great to get some numbers - how are you benchmarking this PR?

@jamwaffles
Copy link
Collaborator

It also just occurred to me that we should make sure fast_flush() works in rotations other than Rotate0. Has this been tested yet?

@jamwaffles
Copy link
Collaborator

The examples use the following pattern to initialise and clear the display:

disp.reset(&mut rst, &mut delay).unwrap();
disp.init().unwrap();
disp.fast_flush().unwrap();

Because no pixels have been set in the buffer at this point, fast_flush() is a noop. To maintain existing behaviour, I think fast_flush() should clear the display if no pixels have been set yet. This could be changed in the future to introduce a clear() method, but for now I'd like to keep the existing behaviour.


Also, here's another example of the odd SPI behaviour. I ported the image_i2c example to use the SPI interface:

image

@careyk007
Copy link
Contributor Author

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

I've put together a gist with the logic I've been using for benchmarking. For some reason, I haven't been able to get my machine to compile for the Blue Pill, so there's no guarantee that the gist will compile, but the program logic should be sound. If you'd like, I can share the code I'm using to test on my STM32F4 Discovery board.

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

As far as fast_flush defaulting to a clear when no pixels have changed, I feel like this might result in slower performance if the user calls a double fast_flush by accident. Would an alternative be to have the init method call clear so that the first call to fast_flush does indeed clear the entire screen, but subsequent calls will operate as fast as possible?

@jamwaffles
Copy link
Collaborator

I found an error in my SPI implementation where I had a ..= instead of a .., which would explain the offset from your last image. I'm not entirely convinced that it would explain the artifacts from your first image though.

Yep, that's fixed it! The artifacts are from the power-on state of this display - it seems to just leave some pixels randomly turned on. Using a whole-screen flush as in master clears the whole display which is why this issue hasn't cropped up until this PR.

Would an alternative be to have the init method call clear

Yeah, I think this is a good idea - the artifacts mentioned above would be cleared as before then. I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO. For example, this is me testing rotations 3 times:

image

I've put together a gist with the logic I've been using for benchmarking.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers 😁

I hadn't thought about support for rotated displays, but I've added support in my latest commit. Please let me know if there is a more Rusty way of writing that logic.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough. This is the image_i2c example running over SPI with a rotation of Rotate270 (ignore the artifacts - I had just powered the display on):

image

@careyk007
Copy link
Contributor Author

I also noticed that the display isn't cleared on restarting the chip, so clear on init is a definite requirement IMO.

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Thanks for putting that up in a gist. It looks like you're checking timings with an oscilloscope or logic analyser hooked up to a pin? Unfortunately I don't have either of those bits of kit so I'll trust your I2C numbers grin

No problem! Unfortunately I don't have an oscilloscope or logic analyzer either, so I'm relying on the "super precise" method of hooking up an ESP32 to the GPIO pin and measuring the time between edges. I probably should have mentioned this in my initial comment.

I'll do another review pass, but rotation results aren't quite right for SPI. I2C works fine strangely enough.

That is quite strange. I'll try to get my hands on a SPI display so that I can do local testing and figure out what's going on.

@careyk007
Copy link
Contributor Author

Out of curiosity, what opt-level are you testing at? I ran into an issue with another crate that had bad SPI writes if the opt-level wasn't set to 1, 2, or 'z'. I don't think this is the issue since you are actually getting data transferred, but I figured it wouldn't hurt to ask.

@jamwaffles
Copy link
Collaborator

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

Mint. I'll do another test run later :)

I've just been running it with cargo run --example <example> on a Blue Pill over OpenOCD. I'll try running with --release and make sure it still works.

DisplayRotation::Rotate90 | DisplayRotation::Rotate270 => {
((self.max_x | 7).min(width), (self.max_y + 1).min(height))
}
};

self.min_x = width - 1;
self.max_x = 0;
self.min_y = width - 1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be height?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure it is correct as it is since the width and height take into account the display's orientation. Just for fun, I tried swapping width and height in line 169 and got a really cool (but unfortunately incorrect) vertically mirrored display.

@jamwaffles
Copy link
Collaborator

Pahaha please ignore my dumb ass, the rotation is working fine - the examples set an offset to center the Rust logo in the center of the screen horizontally. This of course changes when rotated by 90/270 degrees so instead of translating by Point::new(0, 32), the fix is to simply translate by Point::new(32, 0) in the demo code - sorry about that!

I've added a call to clear within the init function, this should get rid of any residual display artifacts at start.

I think the examples and doc examples can have the initial call to display.flush().unwrap(), etc removed in this case. Aside from this last cleanup step I think we're good to merge unless @therealprof has some extra comments.

@careyk007
Copy link
Contributor Author

Awesome, glad you got it sorted!

Just to be clear, should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed? And should I also replace flush with fast_flush at this point?

therealprof
therealprof previously approved these changes Oct 11, 2019
Copy link
Collaborator

@therealprof therealprof left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't verified on hardware myself but intention and code look good to me.

@jamwaffles
Copy link
Collaborator

should I create a new commit with the initial calls to display.flush().unwrap(), etc. in the examples and docs removed?

Yes please

And should I also replace flush with fast_flush at this point?

Ah yes please. I forgot about that

@careyk007
Copy link
Contributor Author

Done and done!

@jamwaffles
Copy link
Collaborator

Sweet! I'll do a final review/physical device test as soon as I can.

Copy link
Collaborator

@jamwaffles jamwaffles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a final test pass on both I2C and SPI devices which both seem to work fine. Code looks good too so I'll get this merged. Thanks for your time and hard work!

@jamwaffles jamwaffles merged commit 2d0a445 into rust-embedded-community:master Oct 14, 2019
@jamwaffles
Copy link
Collaborator

Released in 0.3.0-alpha.2. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants