Releases · google/sentencepiece

19 Feb 16:08

taku910

v0.2.0

17d7580

v0.2.0 Latest

Major changes

N/A

New features

[ALL] Added the SentencePieceNormalizer class in C++/Python. It provides nearly the same functionality as spm_normalize. (Python sample, C++ sample)
[ALL] Added the SentencePieceProcessor::Normalize method in C++/Python. (Python sample, C++ sample)
[ALL] Added the ability to override the normalization spec before processing. (Python sample)

Bug fixes & minor changes

Introduce better support of using external abseil and protobuf #869
Build universal binary in OSX release package #892
Add the set_min_log_level function to python to change the loglevel from the python wrapper. #893
Uses the log-sum-exp technique when computing marginal probabilities of n-best tokenization to avoid underflow.
Support Python 3.12 #932
Improves the thread utilization in batch encoding/decoding.
Fix nasty bug in BPE position encoding.
Fix bugs in the handling of duplicated bigrams
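
The underflow fix mentioned above relies on the log-sum-exp technique. The following is a minimal sketch of that technique in plain Python, not the library's actual implementation; the function name `logsumexp` and the sample log-probabilities are assumptions made for illustration.

```python
import math


def logsumexp(log_values):
    """Return log(sum(exp(v))) without underflowing for very negative inputs."""
    m = max(log_values)
    if m == -math.inf:
        # Every term is exp(-inf) == 0, so the sum is 0 and its log is -inf.
        return -math.inf
    # Subtracting the maximum keeps the largest exponent at exp(0) == 1.
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# Hypothetical log-probabilities of three n-best candidates.
nbest_log_probs = [-1000.0, -1001.0, -1002.5]

# Naively, math.exp(-1000.0) underflows to 0.0, so the plain sum would be 0.0
# and taking its log would fail. The shifted form stays finite.
log_z = logsumexp(nbest_log_probs)
marginals = [math.exp(lp - log_z) for lp in nbest_log_probs]
```

The resulting `marginals` sum to 1 (up to rounding) even though each individual probability is too small to represent as a float.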

Assets 60


16 Jan 06:37

taku910

v0.2.0pre1

de1747b

v0.2.0pre1 Pre-release

Major changes

N/A

New features

[ALL] Added the SentencePieceNormalizer class in C++/Python. It provides nearly the same functionality as spm_normalize. (Python sample, C++ sample)
[ALL] Added the SentencePieceProcessor::Normalize method in C++/Python. (Python sample, C++ sample)
[ALL] Added the ability to override the normalization spec before processing. (Python sample)

Bug fixes & minor changes

Introduce better support of using external abseil and protobuf #869
Build universal binary in OSX release package #892
Add the set_min_log_level function to python to change the loglevel from the python wrapper. #893
Uses the log-sum-exp technique when computing marginal probabilities of n-best tokenization to avoid underflow.
Support Python 3.12 #932
Improves the thread utilization in batch encoding/decoding.
Fix nasty bug in BPE position encoding.
Fix bugs in the handling of duplicated bigrams

Assets 60

02 May 03:20

taku910

v0.1.99

3863f76

v0.1.99

Major changes

N/A

New features

N/A

Bug fixes & minor changes

[ALL] Fixes the NaN issues in unigram model training: #851
[ALL] Fixes the bug in unigram loss computation: #628
[ALL] Fixes the minor bug in BPE token extraction algorithm: #318
[ALL] Increase the number of maximum threads from 128 to 1024. #857

Assets 51

28 Apr 22:54

taku910

v0.1.99pre1

25b64fc

v0.1.99pre1 Pre-release

v0.1.99 pre release for testing.

Assets 51

12 Apr 08:47

taku910

v0.1.98

518c57c

v0.1.98

Major changes

Python 3.11 support (wheel packages for python 3.11 are available)
Includes the full sources in the Python source package to reduce pip install problems.
Improves the algorithm to initialize unigram seed vocabulary. Coverage is improved.

New features

[ALL] Added the --pretokenization_delimiter flag to train the model with pre-tokenization boundary constraints.

Bug fixes & minor changes

[ALL] Makes the error message more descriptive.
[ALL] Fixes the crash when std::random_device fails.
[ALL] Fixes the build error on Raspberry Pi related to atomic operations.
[ALL] Fixes the minor bugs in nbest enumeration
[ALL] Fixes the build error when using the external protobuf library.
[ALL] Fixes the build error on a big-endian machine.
[Windows] Use /MD build flag instead of /MT.

Assets 51

06 Aug 16:03

taku910

v0.1.97

58f256c

v0.1.97

Major changes

Migrated the C++ version from C++11 to C++17.
Migrated the CI environment from Travis-CI to GitHub Actions.
Started using cibuildwheel to build PyPI wheel packages.

New features

[ALL] Support differential privacy while training. https://aclanthology.org/2022.findings-acl.171.pdf
[ALL] Introduced APIs that return the struct of ImmutableSentencePieceText, which encodes string-token, id, and utf-8 byte offsets at once. New API is available both from C++ and Python.
[ALL] Allow tab ‘\t’ to be included in user defined symbols.
[ALL] Added NFKD normalization rule. NFKD rule is provided as a TSV file.
[ALL] Added option to emit unknown symbol instead of raw symbol.
[Python]: Batch encode/decode requests are performed in native multi-threads.
[Python]: Supports passing a custom log stream during training.
[Python]: Adds module-level version variable: spm.__version__
[Python]: Creates wheel package of Mac universal binary.
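
To illustrate what the NFKD rule listed above does to text, here is a small sketch using Python's standard `unicodedata` module. This is an assumption-laden illustration of Unicode NFKD in general: it does not use SentencePiece or its TSV rule file, and the library's rule may differ from the standard library in edge cases. The helper name `nfkd` is hypothetical.

```python
import unicodedata


def nfkd(text):
    """Apply Unicode NFKD (compatibility decomposition) to text."""
    return unicodedata.normalize("NFKD", text)


# The ligature U+FB01 decomposes into the two letters "f" and "i".
ligature = nfkd("\ufb01")

# Precomposed U+00E9 decomposes into "e" plus the combining acute accent U+0301.
accented = nfkd("\u00e9")

# Fullwidth Latin letters map to their ASCII counterparts.
fullwidth = nfkd("\uff2b\uff21")
```

Unlike NFKC, NFKD leaves characters decomposed, so an accented letter becomes a base letter followed by a combining mark.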

Bug fixes & minor changes

Uses the efficient encoding algorithm by default. Removed the functionality to switch the Viterbi tokenization algorithm.
Make the output of Encode and 1-best from NBestEncode same.
Use std::string_view as much as possible.
[Python] Removed pip packages for the ppc64le and s390x architectures, as cibuildwheel doesn’t support them.

Assets 43

17 Jun 16:55

taku910

v0.1.96

d8711f5

v0.1.96

Updates

Improves the performance of unigram training
Updated the nfkc normalization with the latest ICU module.
Stop handling zero-width-joiner string as whitespace.

New features

Added a new algorithm for sampling without replacement.
Added an API for the new sampling and for perplexity calculation.
Added the allow_whitespace_only_pieces mode.
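
As a general illustration of sampling without replacement from a scored candidate list, the sketch below uses the Gumbel-top-k trick in plain Python. The release notes do not say which algorithm the library uses, so this is only one common technique, shown under that assumption; the function name and the candidate scores are hypothetical.

```python
import math
import random


def sample_without_replacement(log_probs, k, rng):
    """Pick k distinct indices, favoring high log-probabilities (Gumbel-top-k)."""
    keys = []
    for index, log_prob in enumerate(log_probs):
        u = rng.random()
        while u == 0.0:
            # rng.random() may return exactly 0.0, which would break log(u).
            u = rng.random()
        # Gumbel(0, 1) noise; adding it to the log-probability perturbs the score.
        gumbel = -math.log(-math.log(u))
        keys.append((log_prob + gumbel, index))
    # Keeping the k largest perturbed scores yields k distinct indices.
    keys.sort(reverse=True)
    return [index for _, index in keys[:k]]


# Hypothetical log-probabilities for five candidate segmentations.
candidate_log_probs = [math.log(p) for p in (0.4, 0.3, 0.15, 0.1, 0.05)]
picked = sample_without_replacement(candidate_log_probs, 3, random.Random(0))
```

Because each index receives exactly one perturbed score, no candidate can be selected twice, which is what distinguishes this from repeated independent draws.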

Assets 46

10 Jan 06:02

taku910

v0.1.95

0e6dfbf

v0.1.95

Updates

Supports building sentencepiece with the external (official) abseil library.
Upgraded protobuf to 3.14.0.
Changed the type of input_sentence_size from int32 to uint64.

Assets 43

24 Oct 02:01

taku910

v0.1.94

8336bbd

v0.1.94

Updates

Added the SetRandomGeneratorSeed function to set the seed value for the random generator. This makes sampling reproducible.
Validate the range of the vocab id in Python module.
Change the directory arrangement of python module.
Added protobuf python module.

Bug fixes

Supports building the Python wheel from the source package.

Assets 43

14 Oct 04:38

taku910

v0.1.93

0b5bd12

v0.1.93

Bug fix

Fixed the regression bug around the flag --minloglevel
Fixed minor bugs.

Updates

Used manylinux2014 to build pypi packages
Support arm64, ppc64le, s390x architectures in pypi packages
Support Python 3.9

Removed

Stopped tf-sentencepiece.
Stopped the support of Python 2.x and Python 3.4

Assets 43
