Removed redundant and potentially error-causing validation for single doc OpenAI embedding #3819

Hase-U · 2023-04-30T05:27:37Z

The check on the line below should really compare the length of the text in tokens:
https://github.com/hwchase17/langchain/blob/18ec22fe56049aaea446406daab6d66d172dd48f/langchain/embeddings/openai.py#L210

But realistically, there is no need to call something like len(encode(text)) here; we can simply use self._get_len_safe_embeddings by default.
All langchain users will then need to install tiktoken, but it seems natural to require tiktoken when using OpenAI's embeddings anyway.

So the differences from the current behavior are:

- tiktoken needs to be installed.
- All docs are tokenized on the client side before being sent to OpenAI.
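To make the idea concrete, here is a minimal sketch of what a length-safe path does in principle: tokenize the text, split the tokens into chunks that fit the context length, embed each chunk, and combine the results. Everything in it is an assumption for illustration: `toy_tokenize` stands in for tiktoken, `toy_embed` stands in for the OpenAI API call, and the length-weighted average is one plausible way to combine chunk vectors. It is not a copy of `_get_len_safe_embeddings`.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], ctx_length: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most ctx_length tokens."""
    return [tokens[i:i + ctx_length] for i in range(0, len(tokens), ctx_length)]


def len_safe_embedding(
    text: str,
    tokenize: Callable[[str], List[str]],
    embed_tokens: Callable[[List[str]], List[float]],
    ctx_length: int,
) -> List[float]:
    """Embed text of any length by embedding token chunks and averaging them.

    The average is weighted by chunk length (an assumption of this sketch),
    so a short trailing chunk counts for less than a full one.
    """
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("cannot embed empty text")
    chunks = chunk_tokens(tokens, ctx_length)
    vectors = [embed_tokens(chunk) for chunk in chunks]
    weights = [len(chunk) for chunk in chunks]
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(vec[d] * w for vec, w in zip(vectors, weights)) / total
        for d in range(dim)
    ]


# Hypothetical stand-ins: a whitespace tokenizer instead of tiktoken,
# and a toy two-dimensional "embedding" instead of the OpenAI API.
def toy_tokenize(text: str) -> List[str]:
    return text.split()


def toy_embed(tokens: List[str]) -> List[float]:
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


text = "a bb ccc dddd eeeee"
# The text has 19 characters but only 5 tokens, so a check on len(text)
# and a check on the token count can disagree.
vector = len_safe_embedding(text, toy_tokenize, toy_embed, ctx_length=2)
```

Because the chunked path also handles text that fits in a single chunk, routing every input through it removes the need for a separate length check up front.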

Hase-U · 2023-04-30T05:31:23Z

Also, this PR makes _get_len_safe_embeddings the default path in all cases.
So please note that #3778 needs to be merged first.

Hase-U · 2023-04-30T05:37:27Z

#3811

Also, as pointed out in the issue above, embeddings/openai.py imports tiktoken differently from the rest of langchain, so I adjusted it to match.
Since this is only a matter of code style, I don't think anything has changed functionally.
The attribute error seems to have some other cause.

shawnesquivel

I reached the same conclusion on my fork. #3811 has more documentation on why this is the right fix.

shawnesquivel · 2023-05-04T21:49:59Z

Was your OpenAIEmbeddings model parameter OK with the default (text-embedding-ada-002)? Personally, I had to change model to "gpt2" as per #3811.

File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/tiktoken/registry.py", line 60, in get_encoding
    raise ValueError(f"Unknown encoding {encoding_name}")
ValueError: Unknown encoding text-embedding-ada-002

https://github.com/hwchase17/langchain/blob/master/langchain/embeddings/openai.py#L107
Change:
model: str = "gpt2"
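For context on the error above: `tiktoken.get_encoding` expects an encoding name such as "gpt2" or "cl100k_base", while `tiktoken.encoding_for_model` maps a model name such as text-embedding-ada-002 to its encoding. The sketch below shows one way to resolve the encoding from the model name rather than changing the default model. The helper `resolve_encoding` and its `fallback` and `tiktoken_module` parameters are hypothetical and not part of langchain; the injectable module exists only so the logic can be exercised without tiktoken installed.

```python
def resolve_encoding(model_name, fallback="cl100k_base", tiktoken_module=None):
    """Return a tiktoken encoding for a model name.

    Hypothetical helper: tries the model-name lookup first and falls back
    to a named encoding if tiktoken does not recognise the model.
    """
    if tiktoken_module is None:
        try:
            # Imported lazily so the rest of the code works without tiktoken.
            import tiktoken as tiktoken_module
        except ImportError as exc:
            raise ImportError(
                "tiktoken is required for token counting; "
                "install it with `pip install tiktoken`."
            ) from exc
    try:
        # Takes a *model* name, e.g. "text-embedding-ada-002".
        return tiktoken_module.encoding_for_model(model_name)
    except KeyError:
        # Takes an *encoding* name, e.g. "cl100k_base" or "gpt2".
        return tiktoken_module.get_encoding(fallback)
```

With tiktoken installed, `resolve_encoding("text-embedding-ada-002")` would go through `encoding_for_model`, so the model name is never handed to `get_encoding`.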

leo-gan · 2023-09-13T01:34:38Z

@Hase-U Hi, could you please resolve the merge conflicts? After that, ping me and I'll push this PR for review. Thanks!

efriis · 2023-11-07T04:42:01Z

Closing because the PR wouldn't line up with the current directory structure of the library (would need to be in /libs/langchain/langchain instead of /langchain). Feel free to reopen against the current head if it's still relevant!

Hase-U and others added 15 commits February 11, 2023 17:32

update openai embeddings to calculate based on token size

a36b7c6

fix batch process of openai embedding to avoid errors in token size limit

54331d6

Merge pull request #1 from Hase-U/openai_safe_embedding

71e6dc9

fix batch process of openai embedding to avoid errors in token

restore chunk_size to original value

7e77853

Merge pull request #2 from Hase-U/fix_test

973afec

restore chunk_size to original value

Merge branch 'fork_master' into fetch__fork_master

eed7746

Merge pull request #3 from Hase-U/fetch__fork_master

c305398

Fetch fork master

returned batch process since it was removed by mistake

f73cfbc

add hasattr check

79f8ee1

Merge pull request #4 from Hase-U/fix_no_attr

58f3a33

Fix no attr

remove backwards compatibility code following the fork origin rep policy

f9dd9e6

resolve conflicts

c26d03c

Merge pull request #5 from Hase-U/upstrem-merge

d14d62c

Upstrem merge

remove redundant and possible error check

f56922e

revert unnecessary document updates

114a72b

Hase-U added 2 commits May 2, 2023 10:15

resolve conflicts

7e032bd

tiktoken.model.encoding_for_model to tiktoken.encoding_for_model

cd24ed6

shawnesquivel approved these changes May 4, 2023

shawnesquivel mentioned this pull request May 15, 2023

Tiktoken import bug? #3811

Closed

dosubot bot added the "Ɑ: embeddings" and "🤖:improvement" labels Jul 14, 2023

efriis closed this Nov 7, 2023
