I'm encountering an issue when trying to train a dictionary from a small number of samples; it affects all of the training functions: from_continuous, from_files, and from_samples. A code example demonstrating the issue is provided below.
The documentation for from_continuous says, "Train a dictionary from a big continuous chunk of data." While this behavior might be intentional for from_continuous, it isn't explicitly stated for from_files, which is the function I'm primarily interested in.
My use case involves training a dictionary on a dataset of one million to a few hundred million items, each between a few hundred bytes and a few dozen kilobytes in size. However, writing all the data to a single file and training on that fails because of the problem described above, and padding the input with dummy empty files to get training to run produces a very poor dictionary.
I am seeking guidance on the recommended approach for my use case. Would it be advisable to separate the data into several smaller files? If so, what would be an appropriate size for these files to ensure successful training without compromising the quality of the resulting dictionary? Your assistance in resolving this matter is greatly appreciated.
#[test]
fn train_dict_fail_for_small_size() {
    const SAMPLE_LENGTH: usize = 10;

    // Failure: only 2 samples of 10 bytes each.
    let samples = [[0u8; SAMPLE_LENGTH]; 2];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert_eq!(
        format!("{ret_val:?}"),
        r#"Err(Custom { kind: Other, error: "Src size is incorrect" })"#
    );

    // Success: 10 samples of 10 bytes each.
    let samples = [[0u8; SAMPLE_LENGTH]; 10];
    let ret_val = zstd::dict::from_samples(&samples, 1000);
    assert!(ret_val.is_ok());
}
from_continuous needs the data to be continuous in memory, not necessarily in a single file. This is the API that the C library uses.
from_files is a convenience method that builds the continuous chunk of memory from the given files.
from_samples is similar to from_files but directly takes a list of samples - it still needs to internally copy everything into a large continuous chunk to actually train.
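For illustration, here is a rough sketch of that relationship (not the crate's actual internal code): the samples are laid out back to back in one buffer, and their individual lengths are passed alongside it so the trainer knows where each sample ends.

// Rough sketch only; the real implementation may differ, but the training
// input it builds is the same idea.
fn train_like_from_samples(samples: &[Vec<u8>], max_size: usize) -> std::io::Result<Vec<u8>> {
    // One contiguous buffer holding every sample back to back...
    let data: Vec<u8> = samples.iter().flatten().copied().collect();
    // ...plus the length of each sample, marking the boundaries for the trainer.
    let sizes: Vec<usize> = samples.iter().map(|s| s.len()).collect();
    zstd::dict::from_continuous(&data, &sizes, max_size)
}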
Note that there are limits to the input data that zstd can use: if the entire input data (the sum of all files) is too small, it will reject the request. If it's too big, it will most likely ignore the data beyond some amount. I'm not entirely sure where to find the exact cutoff values.
As for the splitting, you should have each sample, or each file, represent a typical "message" or "item" you would try to compress in real conditions.
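To make that concrete for the use case above, here is a minimal sketch, assuming the dataset is available as an in-memory collection of items (the names `items`, `target`, and `train_dictionary` are made up for illustration; only from_samples is part of the zstd crate's API). Each item is passed as its own sample, and with a very large dataset you would train on a subset rather than on everything.

// Minimal sketch, assuming `items` holds the individual messages/items you
// will later compress one at a time.
fn train_dictionary(items: &[Vec<u8>], max_dict_size: usize) -> std::io::Result<Vec<u8>> {
    // With hundreds of millions of items, training on everything is unnecessary
    // and runs into zstd's input-size limits, so take an evenly spaced subset.
    let target = 100_000usize;
    let step = (items.len() / target).max(1);
    let subset: Vec<&Vec<u8>> = items.iter().step_by(step).collect();
    zstd::dict::from_samples(&subset, max_dict_size)
}

The same idea applies to from_files: write each item (or a manageable subset of items, one per file) to its own file instead of concatenating everything into a single file.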
Thank you for the answer.
I hadn't considered that there might be limitations like this. After becoming aware of them, a quick search turned up the following issue, which says the dictionary training input is limited to 2 GB: facebook/zstd#3111