Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use on-heap keys in AVX-512 XTS implementation rather than copy to stack #1550

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

pittma
Copy link
Contributor

@pittma pittma commented Apr 25, 2024

This patch uses the keys that come in as parameters rather than copying them to the stack.


There is also some bytecode gen in this file that I couldn't tell whether it was being used anymore, of course when I run tests locally all is well. If this bombs CI I'll bring it back in.

@pittma pittma requested a review from a team as a code owner April 25, 2024 18:16
@andrewhop andrewhop requested a review from nebeid April 30, 2024 17:22
@pittma
Copy link
Contributor Author

pittma commented Apr 30, 2024

Looks like some of those failures are on old versions of clang/gcc. I'll put the bytecode gen stuff back in :/

@pittma
Copy link
Contributor Author

pittma commented May 7, 2024

Looks like some of those failures are on old versions of clang/gcc. I'll put the bytecode gen stuff back in :/

While this may be the case, the failure at hand was a FIPS/delocate parsing error. I'll wait and see how far this FIPS fix gets us.

@nebeid
Copy link
Contributor

nebeid commented Oct 9, 2024

@pittma, would you mind please rebasing this work so we can see if the errors persist. Sorry for the delay in following up.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.50%. Comparing base (9ff8458) to head (e0c4d12).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1550      +/-   ##
==========================================
- Coverage   78.51%   78.50%   -0.02%     
==========================================
  Files         585      585              
  Lines       99681    99681              
  Branches    14271    14270       -1     
==========================================
- Hits        78265    78253      -12     
- Misses      20782    20790       +8     
- Partials      634      638       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -3126,125 +3038,6 @@
___
}

# Bits 7 & 4 contain the src1 register's MSB in inverted form
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this perl handling removed? I believe that the replacement of the .byte instructions with vaesenc|dec|enclast|declast| and vpclmulqdq in the Perl output is causing the failures in gcc 7 and below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I was hoping this was vestigial! Before I removed the copy code I did a bunch of clean up, for this bit I was really just crossing my fingers that it wasn't needed. I'll add it back in shortly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pittma , you're still planning on adding it in, right?

@pittma
Copy link
Contributor Author

pittma commented Nov 4, 2024

Happy Monday @nebeid! I've got some Monday vibes for you this morning!

@pittma , you're still planning on adding it in, right?

Yes, BUT—

We may need to hold off on this patch for now, but I don't want to close it until I know definitively. I suspect we know now why this implementation (which was ported to perlasm from Intel's ISA-L crypto library before I joined this team) did this copy to the stack. I've been working intermittently to port this patch to OpenSSL as well, and OpenSSL tests 64-byte block sizes which were not included in the speed tests we did with AWS-LC. With the heap-based implementation, we see a throughput regression with those block sizes but none of the others.

What's different:

The vmov's, when we're working with the key rounds on the stack, are aligned while when we DO NOT copy, they're unaligned (vmovqa vs vmovqu), this change is causing a pretty significant difference in cache misses, at least that's what we think is happening at the moment. The original implementation may have been trading space for time.

Next Steps:

I'm still collecting data on this, and it may all be specious anyway because some XTS tests are failing on the OpenSSL patch currently, but as soon as I know more I'll update this PR with the specifics and we can decide together what we want to do.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants