Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDDS-10744. Standardize byte array conversion to String for LiveFileMetaData in RocksDB. #6580

Merged
merged 4 commits into from
Apr 29, 2024

Conversation

fapifta
Copy link
Contributor

@fapifta fapifta commented Apr 23, 2024

What changes were proposed in this pull request?

LiveFileMetaData class in RocksDB has three methods that are returning a byte[] which we convert to String after any call.
These methods are:

  • columnFamilyName()
  • smallestKey()
  • largestKey()

We use 3 different conversion to String for the returned byte arrays.
For largestKey and smallestKey we use FixedLengthStringCodec.bytes2String and new String(byte[], UTF_8)
For columnFamilyName we use org.apache.hadoop.hdds.StringUtils.bytes2String, new String(byte[], UTF_8), and org.bouncycastle.util.Strings.fromByteArray.

From these methods, FixedLengthStringCodec throws an exception if the conversion can not be done, and it uses ISO_8859_1 as the charset for the conversion, while the rest uses UTF_8 charset for the conversion, and replaces the characters that UTF-8 can not represent.

Based on how and where we use these it seems to be safe to settle on UTF-8 as the target charset, and use StringUtils.bytes2String from our own utilities which uses the String constructor as of now by the way.

Removing org.bouncycastle.util.Strings usage is also beneficial for crypto compliance related development.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10744

How was this patch tested?

CI

@fapifta fapifta added code-cleanup Changes that aim to make code better, without changing functionality. crypto-compliance labels Apr 23, 2024
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fapifta the same conversion appears in another class, should we remove that, too?

Could use Ozone's StringUtils, too:

public static String bytes2String(byte[] bytes) {
return bytes2String(bytes, 0, bytes.length);
}

public static String bytes2String(byte[] bytes, int offset, int length) {
return new String(bytes, offset, length, UTF8);
}

@fapifta
Copy link
Contributor Author

fapifta commented Apr 23, 2024

Hmm.. thank you for spotting this @adoroszlai, my IDE made fun of me... I searched for all occurance of bouncycastle and it seems it does not show just the first n results...

On the other hand, I spotted a few other things that I might fix together to standardize on one method for this conversion. I will post a second version shortly.

@fapifta fapifta changed the title HDDS-10744. Remove org.bouncycastle.util.Strings usage from RocksDBStoreMetrics. HDDS-10744. Standardize byte array conversion to String for LiveFileMetaData in RocksDB. Apr 23, 2024
… null then the first condition evaluates to true, if it is not null, we do not need the not null check in StringUtils.isNotEmpty, and we can just ensure that the length of the token is not 0 which is done with the condition.
Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fapifta for updating the patch.

Copy link
Contributor

@Galsza Galsza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fapifta Thanks for the change it's looking good to me after Attila's recommendations.

@szetszwo
Copy link
Contributor

Removing org.bouncycastle.util.Strings usage is also beneficial for crypto compliance related development.

Agree.

... FixedLengthStringCodec throws an exception if the conversion can not be done, and it uses ISO_8859_1 as the charset for the conversion, while the rest uses UTF_8 charset for the conversion ...

If we change the charset, how the new code read the existing DB?

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fapifta for updating the patch, LGTM.

@fapifta
Copy link
Contributor Author

fapifta commented Apr 26, 2024

Thank you @adoroszlai for the review, let's see if @szetszwo is also ok with this patch, or at least give him a chance to react for some time.
So what I did, I just reverted the unrelated formatting change, and the change in compactionIfNeeded, so that it is using the old way with ISO-8859-1 conversion.

In the meantime I have added HDDS-10762 as I have some performance concerns, also I kind of have technical concerns around this, as in an other place the same smallestKey and largetKey result is converted using UTF-8 conversion, and this was the original state also, so we have two code parts that handles this same data differently... But it is out of my knowledge area, and I can not spend more time on to understand this as of now.

With that this patch should preserve the old behaviour, using the same formats and charsets for the conversion, but via the same method call instead of using different utilities at different places.

@adoroszlai adoroszlai merged commit fdd2037 into apache:master Apr 29, 2024
39 checks passed
@adoroszlai
Copy link
Contributor

Thanks @fapifta for the patch, @Galsza, @szetszwo for the review.

jojochuang pushed a commit to jojochuang/ozone that referenced this pull request May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
code-cleanup Changes that aim to make code better, without changing functionality. crypto-compliance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants