Releases: google/sentencepiece
v0.2.0
Major changes
N/A
New features
- [ALL] Added the SentencePieceNormalizer class in C++/Python. It provides almost the same functionality as spm_normalize.
- [ALL] Added the SentencePieceProcessor::Normalize method in C++/Python.
- [ALL] Added functionality to override the normalization spec before processing.
Bug fixes & minor changes
- Introduce better support of using external abseil and protobuf #869
- Build universal binary in OSX release package #892
- Add the set_min_log_level function to the Python wrapper to change the log level. #893
- Uses the log-sum-exp technique when computing marginal probabilities in n-best tokenization to avoid underflow.
- Support Python 3.12 #932
- Improves the thread utilization in batch encoding/decoding.
- Fix nasty bug in BPE position encoding.
- Fix bugs in the handling of duplicated bigrams
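
The log-sum-exp fix mentioned above can be illustrated with a minimal sketch. This is not the library's implementation: the `logsumexp` helper and the log-probability values are hypothetical, chosen only to show why summing probabilities directly underflows while working in log space does not.

```python
import math

def logsumexp(log_values):
    """Return log(sum(exp(v))) computed stably by factoring out the maximum."""
    m = max(log_values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in log_values))

# Hypothetical log-probabilities of three candidate segmentations.
log_probs = [-1000.0, -1001.0, -1002.0]

# Naive summation: each exp() underflows to 0.0, so the total is lost.
naive_total = sum(math.exp(v) for v in log_probs)

# Stable summation: the normalizer stays finite in log space.
log_total = logsumexp(log_probs)

# Marginal probability of each candidate, normalized over the candidates.
marginals = [math.exp(v - log_total) for v in log_probs]
```

Here `naive_total` is exactly zero, so dividing by it to normalize would fail, whereas `marginals` sums to one.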
v0.2.0pre1
Major changes
N/A
New features
- [ALL] Added the SentencePieceNormalizer class in C++/Python. It provides almost the same functionality as spm_normalize.
- [ALL] Added the SentencePieceProcessor::Normalize method in C++/Python.
- [ALL] Added functionality to override the normalization spec before processing.
Bug fixes & minor changes
- Introduce better support of using external abseil and protobuf #869
- Build universal binary in OSX release package #892
- Add the set_min_log_level function to the Python wrapper to change the log level. #893
- Uses the log-sum-exp technique when computing marginal probabilities in n-best tokenization to avoid underflow.
- Support Python 3.12 #932
- Improves the thread utilization in batch encoding/decoding.
- Fix nasty bug in BPE position encoding.
- Fix bugs in the handling of duplicated bigrams
v0.1.99
Major changes
N/A
New features
N/A
Bug fixes & minor changes
v0.1.99pre1
v0.1.99 pre release for testing.
v0.1.98
Major changes
- Python 3.11 support (wheel packages for python 3.11 are available)
- Includes the full sources in the Python source package to reduce pip install problems.
- Improves the algorithm to initialize unigram seed vocabulary. Coverage is improved.
New features
- [ALL] Added the feature to train the model with pre-tokenization boundary constraints (--pretokenization_delimiter flag).
Bug fixes & minor changes
- [ALL] Makes the error message more descriptive.
- [ALL] Fixes the crash when std::random_device fails.
- [ALL] Fixes the build error on Raspberry Pi related to atomic operations.
- [ALL] Fixes the minor bugs in nbest enumeration
- [ALL] Fixes the build error when using the external protobuf library.
- [ALL] Fixes the build error on a big-endian machine.
- [Windows] Use /MD build flag instead of /MT.
v0.1.97
Major changes
- Migrated the C++ version from C++11 to C++17.
- Migrated the CI environment from Travis CI to GitHub Actions.
- Started using cibuildwheel to build PyPI wheel packages.
New features
- [ALL] Support differential privacy while training. https://aclanthology.org/2022.findings-acl.171.pdf
- [ALL] Introduced APIs that return the struct of ImmutableSentencePieceText, which encodes string-token, id, and utf-8 byte offsets at once. New API is available both from C++ and Python.
- [ALL] Allow tab ‘\t’ to be included in user defined symbols.
- [ALL] Added NFKD normalization rule. NFKD rule is provided as a TSV file.
- [ALL] Added option to emit unknown symbol instead of raw symbol.
- [Python]: Batch encode/decode requests are performed in native multi-threads.
- [Python]: Supports passing a custom log stream during training.
- [Python]: Adds module-level version variable: spm.__version__
- [Python]: Creates wheel package of Mac universal binary.
Bug fixes & minor changes
- Uses the efficient encoding algorithm by default. Removed the functionality to switch the Viterbi tokenization algorithm.
- Make the output of Encode and the 1-best result of NBestEncode the same.
- Use std::string_view as much as possible.
- [Python] Removed pip packages for the ppc64le and s390x architectures, as cibuildwheel doesn’t support them.
v0.1.96
Updates
- Improves the performance of unigram training
- Updated the nfkc normalization with the latest ICU module.
- Stop handling zero-width-joiner string as whitespace.
New features
- added new sampling algorithm without replacement.
- added API for new sampling and perplexity calculation.
- added allow_whitespace_only_pieces mode.
v0.1.95
v0.1.94
Updates
- Added the SetRandomGeneratorSeed function to set the seed value for the random generator. This makes sampling reproducible.
- Validate the range of the vocab id in Python module.
- Change the directory arrangement of python module.
- Added protobuf python module.
Bug fixes
- Support building the Python wheel from the source package.
v0.1.93
Bug fix
- Fixed the regression bug around the flag --minloglevel
- Fixed minor bugs.
Updates
- Used manylinux2014 to build pypi packages
- Support arm64, ppc64le, s390x architectures in pypi packages
- Support Python 3.9
Removed
- Discontinued tf-sentencepiece.
- Dropped support for Python 2.x and Python 3.4.