Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing performance opportunities #234

Open
marxin opened this issue Jul 17, 2024 · 2 comments
Open

Parsing performance opportunities #234

marxin opened this issue Jul 17, 2024 · 2 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@marxin
Copy link
Contributor

marxin commented Jul 17, 2024

The following great optimization hints were destined as a part of the discussion in mstange/samply#290:

I couldn't resist looking at your profile and had some ideas for optimizations.

These are all about RPM parsing inside the rpm crate:

  • rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);
  • rpm::rpm::package::PackageMetadata::get_file_entries collects a lot of metadata that you don't need; you only need the paths and the links. There's already a method called get_file_paths to get just the paths; maybe the rpm crate can add another one to get just the links, e.g. get_file_links.
  • IndexData::StringArray uses nom's complete::take_till to find the nul byte. It could instead use the memchr crate for SIMD-accelerated nul byte finding, and then it can do the slicing without nom.
  • This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);
  • String::from_utf8_lossy is called for all strings in all header entries, even for entry types that you don't look at. And in then all these strings get converted into unix paths anyway, which don't need to be utf-8. Furthermore, the string bytes are copied twice: First from the file into a temporary Vec, and then from that Vec into the the header entries. I think one could rework header parsing as follows, with some breaking changes for the API of the rpm crate: During Header::parse, only read the bytes from the input and store them in a Vec that becomes a permanent part of the header. Also parse the IndexHeader and the list of IndexEntrys, and sanity-check the size of the remaining bytes given the sizes expected by the entries. But don't do any entry-type specific parsing here. Only once somebody calls Header::get_entry_data_as_string_array or one of its friends, get the slice for the entry data from the Vec that's stored in the header. At this point the bytes can be converted into the expected format. For the string array cases, I would just compute byte slices into the header Vec and not convert those slices to strings (don't copy, don't utf-8 validate). Then, when get_file_paths makes the paths, it can just call Path::new(OsStr::from_bytes(byte_slice)) (at least if the target is unix), which would let it skip the utf-8 parsing.
@marxin
Copy link
Contributor Author

marxin commented Jul 17, 2024

  • rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);

Good catch! I've just addressed that in #235.

@dralley dralley added enhancement New feature or request good first issue Good for newcomers labels Jul 17, 2024
@marxin
Copy link
Contributor Author

marxin commented Jul 17, 2024

  • This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);

Yes, that really helps and it's addressed in #237.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants