Rework text encoder #161
Merged
Conversation
Codecov Report
Additional details and impacted files

@@           Coverage Diff            @@
##              0.3     #161        +/- ##
===========================================
+ Coverage   46.52%   60.89%    +14.37%
===========================================
  Files          85       85
  Lines        4400     4547       +147
===========================================
+ Hits         2047     2769       +722
+ Misses       2353     1778       -575

☔ View full report in Codecov by Sentry.
chengchingwen force-pushed the tkr branch 4 times, most recently from e2fce6d to cb1e8b1 on December 20, 2023 08:37
This PR reworks the text encoder interface. The previous `TransformerTextEncoder`, `BertTextEncoder`, `GPT2TextEncoder`, and `T5TextEncoder` are unified into `TrfTextEncoder`. `TrfTextEncoder` has multiple fields that modify the encode/decode process:

- `annotate` (default `TextEncoders.annotate_strings`): annotates the input for the tokenizer, e.g. a `String` is treated as a single sentence, not a single word.
- `process`: the preprocess function applied to the tokenization result, e.g. adding the special end-of-sentence token, computing the attention mask, etc.
- `onehot` (default `TextEncoders.lookup_first`): applies one-hot encoding to the preprocess result; the default behavior takes the first element of the preprocess result and one-hot encodes it.
- `decode` (default `identity`): the function that converts each token id back to a string. This can be used to handle tokenizers that use a different vocabulary, such as GPT-2's byte-level vocabulary.
- `textprocess` (default `TextEncodeBase.join_text`): the function that joins the `decode`d results into complete sentence(s).

A new API, `decode_text`, is also provided to simplify text generation (see the sketch below). These designs allow us to unify the behavioral differences between the old `<X>TextEncoder`s and to extract the text decoder directly from the huggingface tokenizer file.
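For illustration, here is a minimal sketch of how the reworked encoder might be used for encoding and text generation. The loader call (`HuggingFace.load_tokenizer`) and the `input.token` field name are assumptions made for this example and are not specified in this PR; `encode`, `decode`, and `decode_text` follow the description above.

```julia
using Transformers
using Transformers.TextEncoders   # provides encode / decode and the new decode_text
using Transformers.HuggingFace    # assumed source of pretrained tokenizers

# Hypothetical loading call; the returned encoder is now a TrfTextEncoder
# rather than the old GPT2TextEncoder.
textenc = HuggingFace.load_tokenizer("gpt2")

# Encode: annotate -> tokenize -> process -> onehot
input = encode(textenc, "Hello, world!")   # assumed to return a named tuple with a one-hot `token` field

# Decode: map each token id back to a string via the `decode` field
# (e.g. reversing GPT-2's byte-level vocabulary).
tokens = decode(textenc, input.token)

# decode_text additionally runs `textprocess` (default TextEncodeBase.join_text)
# to join the decoded tokens into complete sentence(s).
text = decode_text(textenc, input.token)
```

In this design, `decode` recovers the per-token strings, while `decode_text` composes `decode` with `textprocess` to produce readable text, which is what a generation loop typically needs.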