Inconsistent region numbers #10
Comments
Hey @gtzheng
One region in the query ends up overlapping with two universe regions, so you tokenize a single region and get back two tokens. One thing I like to do as a sanity check is to use Let me know if you have any other thoughts!
I see. Thanks for the reply!
Yeah, one option is just to return the first region it overlaps. I like the idea of returning the single one it overlaps with the most, however. In terms of signal and biological meaning, though, I'm not sure which is better...
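To make the options above concrete, here is a minimal sketch in plain Python. It does not use this project's tokenizer API; the `Region` tuple, `overlap_bp`, `tokenize`, and the `policy` names are all hypothetical, and coordinates are assumed to be half-open `(chrom, start, end)`.

```python
from typing import List, Tuple

# Hypothetical region representation: (chrom, start, end), half-open.
Region = Tuple[str, int, int]


def overlap_bp(a: Region, b: Region) -> int:
    """Number of base pairs shared by two regions (0 if they do not overlap)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def tokenize(query: Region, universe: List[Region], policy: str = "all") -> List[int]:
    """Return indices of universe regions hit by the query.

    policy="all"   -> every overlapping universe region (can inflate counts)
    policy="first" -> only the first overlapping region, in universe order
    policy="max"   -> only the region sharing the most base pairs
    """
    hits = [(i, overlap_bp(query, u)) for i, u in enumerate(universe)]
    hits = [(i, bp) for i, bp in hits if bp > 0]
    if not hits:
        return []
    if policy == "all":
        return [i for i, _ in hits]
    if policy == "first":
        return [hits[0][0]]
    if policy == "max":
        return [max(hits, key=lambda h: h[1])[0]]
    raise ValueError(f"unknown policy: {policy}")


# Made-up example: one query region spanning two adjacent universe regions.
universe = [("chr1", 100, 200), ("chr1", 200, 300), ("chr1", 500, 600)]
query = ("chr1", 150, 280)

tokens_all = tokenize(query, universe, "all")
tokens_first = tokenize(query, universe, "first")
tokens_max = tokenize(query, universe, "max")
```

With `"all"`, the single query yields two tokens, which is the inflation described in this issue; `"first"` and `"max"` each keep the count at one but can pick different regions.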
Perhaps make it an option for users to choose. Would it be possible to get how much two regions overlap? If so, we can use that information for soft tokenization.
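As a rough illustration of what soft tokenization could look like, the sketch below weights each overlapping universe region by its share of the total overlapped base pairs. This is only one possible weighting, not something the library is known to provide; `overlap_bp` and `soft_tokenize` are hypothetical helper names, and regions are assumed to be half-open `(chrom, start, end)` tuples.

```python
from typing import Dict, List, Tuple

# Hypothetical region representation: (chrom, start, end), half-open.
Region = Tuple[str, int, int]


def overlap_bp(a: Region, b: Region) -> int:
    """Number of base pairs shared by two regions (0 if they do not overlap)."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def soft_tokenize(query: Region, universe: List[Region]) -> Dict[int, float]:
    """Map each overlapping universe index to a weight; weights sum to 1."""
    overlaps = {}
    for i, u in enumerate(universe):
        bp = overlap_bp(query, u)
        if bp > 0:
            overlaps[i] = bp
    total = sum(overlaps.values())
    if total == 0:
        return {}  # the query hits nothing in the universe
    return {i: bp / total for i, bp in overlaps.items()}


# Made-up example: the query shares 50 bp with region 0 and 80 bp with region 1.
universe = [("chr1", 100, 200), ("chr1", 200, 300), ("chr1", 500, 600)]
weights = soft_tokenize(("chr1", 150, 280), universe)
```

Because the weights for one query always sum to 1, each query region contributes one unit of total mass no matter how many universe regions it touches.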
Here is the code I used for tokenization:
The numbers of regions after tokenization seem inflated.