Parsing performance opportunities #234

marxin · 2024-07-17T18:26:52Z

The following great optimization hints were destined as a part of the discussion in mstange/samply#290:

I couldn't resist looking at your profile and had some ideas for optimizations.

These are all about RPM parsing inside the rpm crate:

rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);

rpm::rpm::package::PackageMetadata::get_file_entries collects a lot of metadata that you don't need; you only need the paths and the links. There's already a method called get_file_paths to get just the paths; maybe the rpm crate can add another one to get just the links, e.g. get_file_links.

IndexData::StringArray uses nom's complete::take_till to find the nul byte. It could instead use the memchr crate for SIMD-accelerated nul byte finding, and then it can do the slicing without nom.

This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);

String::from_utf8_lossy is called for all strings in all header entries, even for entry types that you don't look at. And in then all these strings get converted into unix paths anyway, which don't need to be utf-8. Furthermore, the string bytes are copied twice: First from the file into a temporary Vec, and then from that Vec into the the header entries. I think one could rework header parsing as follows, with some breaking changes for the API of the rpm crate: During Header::parse, only read the bytes from the input and store them in a Vec that becomes a permanent part of the header. Also parse the IndexHeader and the list of IndexEntrys, and sanity-check the size of the remaining bytes given the sizes expected by the entries. But don't do any entry-type specific parsing here. Only once somebody calls Header::get_entry_data_as_string_array or one of its friends, get the slice for the entry data from the Vec that's stored in the header. At this point the bytes can be converted into the expected format. For the string array cases, I would just compute byte slices into the header Vec and not convert those slices to strings (don't copy, don't utf-8 validate). Then, when get_file_paths makes the paths, it can just call Path::new(OsStr::from_bytes(byte_slice)) (at least if the target is unix), which would let it skip the utf-8 parsing.

The text was updated successfully, but these errors were encountered:

marxin · 2024-07-17T18:32:06Z

rpm::rpm::headers::header::Header::parse_signature is calling .push() for every single signature byte; IndexData::Bin handling should be rewritten to avoid nom. It can just get the slice and put it into bin with one call to extend_from_slice. For example like this: let bin_bytes = remaining.get(..num_items as usize).ok_or_else(|| Error::Nom("Insufficient bytes for IndexData::Bin entry".into()))?; bin.extend_from_slice(bin_bytes);

Good catch! I've just addressed that in #235.

marxin · 2024-07-17T19:00:18Z

This line inside get_file_paths does two allocations: acc.push(PathBuf::from(dir).join(basename));. It might be more efficient to do let mut path = PathBuf::from(dir); path.push(basename); acc.push(path);

Yes, that really helps and it's addressed in #237.

dralley added enhancement New feature or request good first issue Good for newcomers labels Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Parsing performance opportunities #234

Parsing performance opportunities #234

marxin commented Jul 17, 2024

marxin commented Jul 17, 2024

marxin commented Jul 17, 2024

Parsing performance opportunities #234

Parsing performance opportunities #234

Comments

marxin commented Jul 17, 2024

marxin commented Jul 17, 2024

marxin commented Jul 17, 2024