
Rework text encoder #161

Merged
merged 22 commits into 0.3 on Dec 20, 2023
Conversation

chengchingwen
Owner

This PR reworks the text encoder interface. The previous TransformerTextEncoder, BertTextEncoder, GPT2TextEncoder, and T5TextEncoder are unified into TrfTextEncoder. TrfTextEncoder has multiple fields that can modify the encode/decode process:

  1. annotate (defaults to TextEncoders.annotate_strings): Annotates the input for the tokenizer, e.g. a String is treated as a single sentence, not a single word.
  2. process: The preprocess function applied to the tokenization result, e.g. adding special end-of-sentence tokens, computing the attention mask...
  3. onehot (defaults to TextEncoders.lookup_first): Applies one-hot encoding to the preprocess result; the default behavior takes the first element of the preprocess result and one-hot encodes it.
  4. decode (defaults to identity): The function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.
  5. textprocess (defaults to TextEncodeBase.join_text): The function that joins the decoded result into complete sentence(s).

A new API, decode_text, is also provided to simplify text generation. This design allows us to unify the behavioral differences among the old <X>TextEncoders and to extract the text decoder directly from the huggingface tokenizer file.
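To make the staged design concrete, here is a minimal sketch (in Python for brevity and self-containment) of how the five fields compose during encode and how decode_text runs them in reverse. Every name below (SketchEncoder, tokenize, add_specials, the toy vocabulary) is an illustrative stand-in for this comment, not the actual Transformers.jl API; only the field names and their roles come from the description above.

```python
def annotate_sentence(text):
    # `annotate` step: tag a plain string as one sentence
    # (cf. TextEncoders.annotate_strings in the description above)
    return ("sentence", text)

def tokenize(annotated):
    # toy whitespace tokenizer standing in for the real one
    _kind, text = annotated
    return text.split()

def add_specials(tokens):
    # `process` step: append a special end-of-sentence token
    return tokens + ["</s>"]

def lookup_first(vocab, tokens):
    # `onehot` step: map each token to its id (unknowns -> 0);
    # a real implementation would produce one-hot arrays
    return [vocab.get(t, 0) for t in tokens]

class SketchEncoder:
    """Illustrative model of the unified encoder's configurable fields."""

    def __init__(self, vocab, annotate=annotate_sentence, process=add_specials,
                 onehot=lookup_first, decode=lambda s: s,
                 textprocess=lambda pieces: " ".join(pieces)):
        self.vocab = vocab
        self.inv = {i: t for t, i in vocab.items()}
        self.annotate, self.process = annotate, process
        self.onehot, self.decode, self.textprocess = onehot, decode, textprocess

    def encode(self, text):
        # annotate -> tokenize -> process -> onehot
        tokens = self.process(tokenize(self.annotate(text)))
        return self.onehot(self.vocab, tokens)

    def decode_text(self, ids):
        # the decode_text idea: id -> token string (`decode`),
        # then join into sentence(s) (`textprocess`)
        pieces = [self.decode(self.inv.get(i, "<unk>")) for i in ids]
        return self.textprocess(pieces)

vocab = {"hello": 1, "world": 2, "</s>": 3}
enc = SketchEncoder(vocab)
ids = enc.encode("hello world")    # -> [1, 2, 3]
text = enc.decode_text(ids)        # -> "hello world </s>"
```

Swapping any field (e.g. a byte-level `decode` for a GPT-2-style vocabulary) changes one stage without touching the rest, which is the point of unifying the old per-model encoders.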


codecov bot commented Dec 6, 2023

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (91a3fe0) 46.52% compared to head (a9bcc53) 60.89%.

Files Patch % Lines
src/textencoders/TextEncoders.jl 66.66% 14 Missing ⚠️
src/huggingface/tokenizer/fast_tkr.jl 88.18% 13 Missing ⚠️
src/textencoders/gpt_textencoder.jl 37.50% 5 Missing ⚠️
src/textencoders/utils.jl 84.21% 3 Missing ⚠️
src/huggingface/tokenizer/tokenizer.jl 60.00% 2 Missing ⚠️
src/tokenizer/unigram/unigram.jl 81.81% 2 Missing ⚠️
src/textencoders/t5_textencoder.jl 75.00% 1 Missing ⚠️
src/tokenizer/unigram/tokenization.jl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              0.3     #161       +/-   ##
===========================================
+ Coverage   46.52%   60.89%   +14.37%     
===========================================
  Files          85       85               
  Lines        4400     4547      +147     
===========================================
+ Hits         2047     2769      +722     
+ Misses       2353     1778      -575     


@chengchingwen chengchingwen merged commit 297e9c3 into 0.3 Dec 20, 2023
8 checks passed