I'm encountering an issue when trying to train a dictionary from a small number of samples; it affects all of the training functions: from_continuous, from_files, and from_samples. A code example demonstrating the issue is provided below.
The documentation for from_continuous says, "Train a dictionary from a big continuous chunk of data." While this behavior might be intentional for from_continuous, it isn't explicitly stated for from_files, which is the function I'm primarily interested in.
My use case involves training a dictionary on a dataset of one million to a few hundred million items, each between a few hundred bytes and a few dozen kilobytes in size. However, writing all the data to a single file and training on that fails because of the problem described above, and padding the input with dummy empty files to get training to run produces a very poor dictionary.
I am seeking guidance on the recommended approach for my use case. Would it be advisable to separate the data into several smaller files? If so, what would be an appropriate size for these files to ensure successful training without compromising the quality of the resulting dictionary? Your assistance in resolving this matter is greatly appreciated.
#[test]
fn train_dict_fail_for_small_size() {
    const SAMPLE_LENGTH: usize = 10;

    // Failure: only 2 samples of 10 bytes each.
    let samples = [[0u8; SAMPLE_LENGTH]; 2];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert_eq!(
        format!("{ret_val:?}"),
        r#"Err(Custom { kind: Other, error: "Src size is incorrect" })"#
    );

    // Success: 10 samples of 10 bytes each.
    let samples = [[0u8; SAMPLE_LENGTH]; 10];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert!(ret_val.is_ok());
}
from_continuous needs the data to be continuous in memory, not necessarily in a single file. This is the API that the C library uses.
from_files is a convenience method that builds the continuous chunk of memory from the given files.
from_samples is similar to from_files but directly takes a list of samples - it still needs to internally copy everything into a large continuous chunk to actually train.
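For illustration, here is a rough sketch of that relationship (not the crate's actual internal code): the samples are laid out back to back in one buffer, and their individual lengths are passed alongside it so the trainer knows where each sample ends.

// Rough sketch only; the real implementation may differ, but the training
// input it builds is the same idea.
fn train_like_from_samples(samples: &[Vec<u8>], max_size: usize) -> std::io::Result<Vec<u8>> {
    // One contiguous buffer holding every sample back to back...
    let data: Vec<u8> = samples.iter().flatten().copied().collect();
    // ...plus the length of each sample, marking the boundaries for the trainer.
    let sizes: Vec<usize> = samples.iter().map(|s| s.len()).collect();
    zstd::dict::from_continuous(&data, &sizes, max_size)
}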
Note that there are limits to the input data that zstd can use: if the entire input data (the sum of all files) is too small, it will reject the request. If it's too big, it will most likely ignore the data beyond some amount. I'm not entirely sure where to find the exact cutoff values.
As for the splitting, you should have each sample, or each file, represent a typical "message" or "item" you would try to compress in real conditions.
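To make that concrete for the use case above, here is a minimal sketch, assuming the dataset is available as an in-memory collection of items (the names `items`, `target`, and `train_dictionary` are made up for illustration; only from_samples is part of the zstd crate's API). Each item is passed as its own sample, and with a very large dataset you would train on a subset rather than on everything.

// Minimal sketch, assuming `items` holds the individual messages/items you
// will later compress one at a time.
fn train_dictionary(items: &[Vec<u8>], max_dict_size: usize) -> std::io::Result<Vec<u8>> {
    // With hundreds of millions of items, training on everything is unnecessary
    // and runs into zstd's input-size limits, so take an evenly spaced subset.
    let target = 100_000usize;
    let step = (items.len() / target).max(1);
    let subset: Vec<&Vec<u8>> = items.iter().step_by(step).collect();
    zstd::dict::from_samples(&subset, max_dict_size)
}

The same idea applies to from_files: write each item (or a manageable subset of items, one per file) to its own file instead of concatenating everything into a single file.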
Thank you for the answer.
I hadn't considered that there might be limitations like this. After becoming aware of them, a quick search turned up the following issue, which says the dictionary training input is limited to 2 GB: facebook/zstd#3111