
Rework text encoder #161

Merged
merged 22 commits into 0.3 on Dec 20, 2023
Conversation

chengchingwen
Owner

This PR reworks the text encoder interface. The previous TransformerTextEncoder, BertTextEncoder, GPT2TextEncoder, and T5TextEncoder are unified into TrfTextEncoder. TrfTextEncoder has multiple fields that can modify the encode/decode process:

  1. annotate (defaults to TextEncoders.annotate_strings): Annotates the input for the tokenizer, e.g. a String is treated as a single sentence, not a single word.
  2. process: The preprocess function applied to the tokenization result, e.g. adding special end-of-sentence tokens, computing the attention mask...
  3. onehot (defaults to TextEncoders.lookup_first): Applies one-hot encoding to the preprocess result; the default behavior takes the first element of the preprocess result and one-hot encodes it.
  4. decode (defaults to identity): The function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.
  5. textprocess (defaults to TextEncodeBase.join_text): The function that joins the decoded result into complete sentence(s).

A new API, decode_text, is also provided to simplify text generation. This design allows us to unify the behavioral differences among the old <X>TextEncoders and to extract the text decoder directly from the huggingface tokenizer file.
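To make the staged design concrete, here is a minimal sketch (in Python for brevity and self-containment) of how the five fields compose during encode and how decode_text runs them in reverse. Every name below (SketchEncoder, tokenize, add_specials, the toy vocabulary) is an illustrative stand-in for this comment, not the actual Transformers.jl API; only the field names and their roles come from the description above.

```python
def annotate_sentence(text):
    # `annotate` step: tag a plain string as one sentence
    # (cf. TextEncoders.annotate_strings in the description above)
    return ("sentence", text)

def tokenize(annotated):
    # toy whitespace tokenizer standing in for the real one
    _kind, text = annotated
    return text.split()

def add_specials(tokens):
    # `process` step: append a special end-of-sentence token
    return tokens + ["</s>"]

def lookup_first(vocab, tokens):
    # `onehot` step: map each token to its id (unknowns -> 0);
    # a real implementation would produce one-hot arrays
    return [vocab.get(t, 0) for t in tokens]

class SketchEncoder:
    """Illustrative model of the unified encoder's configurable fields."""

    def __init__(self, vocab, annotate=annotate_sentence, process=add_specials,
                 onehot=lookup_first, decode=lambda s: s,
                 textprocess=lambda pieces: " ".join(pieces)):
        self.vocab = vocab
        self.inv = {i: t for t, i in vocab.items()}
        self.annotate, self.process = annotate, process
        self.onehot, self.decode, self.textprocess = onehot, decode, textprocess

    def encode(self, text):
        # annotate -> tokenize -> process -> onehot
        tokens = self.process(tokenize(self.annotate(text)))
        return self.onehot(self.vocab, tokens)

    def decode_text(self, ids):
        # the decode_text idea: id -> token string (`decode`),
        # then join into sentence(s) (`textprocess`)
        pieces = [self.decode(self.inv.get(i, "<unk>")) for i in ids]
        return self.textprocess(pieces)

vocab = {"hello": 1, "world": 2, "</s>": 3}
enc = SketchEncoder(vocab)
ids = enc.encode("hello world")    # -> [1, 2, 3]
text = enc.decode_text(ids)        # -> "hello world </s>"
```

Swapping any field (e.g. a byte-level `decode` for a GPT-2-style vocabulary) changes one stage without touching the rest, which is the point of unifying the old per-model encoders.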


codecov bot commented Dec 6, 2023

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (91a3fe0) 46.52% compared to head (a9bcc53) 60.89%.

Files Patch % Lines
src/textencoders/TextEncoders.jl 66.66% 14 Missing ⚠️
src/huggingface/tokenizer/fast_tkr.jl 88.18% 13 Missing ⚠️
src/textencoders/gpt_textencoder.jl 37.50% 5 Missing ⚠️
src/textencoders/utils.jl 84.21% 3 Missing ⚠️
src/huggingface/tokenizer/tokenizer.jl 60.00% 2 Missing ⚠️
src/tokenizer/unigram/unigram.jl 81.81% 2 Missing ⚠️
src/textencoders/t5_textencoder.jl 75.00% 1 Missing ⚠️
src/tokenizer/unigram/tokenization.jl 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##              0.3     #161       +/-   ##
===========================================
+ Coverage   46.52%   60.89%   +14.37%     
===========================================
  Files          85       85               
  Lines        4400     4547      +147     
===========================================
+ Hits         2047     2769      +722     
+ Misses       2353     1778      -575     


@chengchingwen chengchingwen merged commit 297e9c3 into 0.3 Dec 20, 2023
8 checks passed