[Bug]: After enabling full-text search (or adding a function to the schema), upsert still requires a value for the function's output field; otherwise it fails with Unexpected error: [upsert_rows], 'text_sparse_emb'. #37021

Closed
zhuwenxing opened this issue Oct 21, 2024 · 5 comments
Labels: feature/full text search, kind/bug, triage/accepted

Comments
@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-346510e-20241021
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):2.5.0rc101
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

self = <test_full_text_search.TestUpsertWithFullTextSearchNegative object at 0x1300cb7c0>, tokenizer = 'default', nullable = False

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("nullable", [False, True])
    @pytest.mark.parametrize("tokenizer", ["default"])
    def test_upsert_with_full_text_search(self, tokenizer, nullable):
        """
        target: test full text search
        method: 1. enable full text search and insert data with varchar
                2. search with text
                3. verify the result
        expected: full text search successfully and result is correct
        """
        if nullable:
            pytest.xfail(reason="nullable field not support yet")
    
        tokenizer_params = {
            "tokenizer": tokenizer,
        }
        dim = 128
        fields = [
            FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
            FieldSchema(
                name="word",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
                is_partition_key=True,
            ),
            FieldSchema(
                name="sentence",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="paragraph",
                dtype=DataType.VARCHAR,
                max_length=65535,
                nullable=nullable,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(
                name="text",
                dtype=DataType.VARCHAR,
                max_length=65535,
                enable_tokenizer=True,
                tokenizer_params=tokenizer_params,
            ),
            FieldSchema(name="emb", dtype=DataType.FLOAT_VECTOR, dim=dim),
            FieldSchema(name="text_sparse_emb", dtype=DataType.SPARSE_FLOAT_VECTOR),
        ]
        schema = CollectionSchema(fields=fields, description="test collection")
        bm25_function = Function(
            name="text_bm25_emb",
            function_type=FunctionType.BM25,
            input_field_names=["text"],
            output_field_names=["text_sparse_emb"],
            params={},
        )
        schema.add_function(bm25_function)
        data_size = 5000
        collection_w = self.init_collection_wrap(
            name=cf.gen_unique_str(prefix), schema=schema
        )
        fake = fake_en
        language = "en"
        if tokenizer == "jieba":
            fake = fake_zh
            language = "zh"
    
        if nullable:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower() if random.random() < 0.5 else None,
                    "sentence": fake.sentence().lower() if random.random() < 0.5 else None,
                    "paragraph": fake.paragraph().lower() if random.random() < 0.5 else None,
                    "text": fake.text().lower(),  # function input should not be None
                    "emb": [random.random() for _ in range(dim)],
                }
                for i in range(data_size // 2, data_size)
            ]
        else:
            data = [
                {
                    "id": i,
                    "word": fake.word().lower(),
                    "sentence": fake.sentence().lower(),
                    "paragraph": fake.paragraph().lower(),
                    "text": fake.text().lower(),
                    "emb": [random.random() for _ in range(dim)],
                }
                for i in range(data_size)
            ]
        df = pd.DataFrame(data)
        log.info(f"dataframe\n{df}")
        batch_size = 5000
        for i in range(0, len(df), batch_size):
            collection_w.insert(
                data[i: i + batch_size]
                if i + batch_size < len(df)
                else data[i: len(df)]
            )
            collection_w.flush()
        collection_w.create_index(
            "emb",
            {"index_type": "HNSW", "metric_type": "L2", "params": {"M": 16, "efConstruction": 500}},
        )
        collection_w.create_index(
            "text_sparse_emb",
            {
                "index_type": "SPARSE_INVERTED_INDEX",
                "metric_type": "BM25",
                "params": {
                    "drop_ratio_build": 0.3,
                    "bm25_k1": 1.5,
                    "bm25_b": 0.75,
                }
            }
        )
        collection_w.create_index("text", {"index_type": "INVERTED"})
        collection_w.load()
        num_entities = collection_w.num_entities
        res, _ = collection_w.query(
            expr="",
            output_fields=["count(*)"]
        )
        count = res[0]["count(*)"]
        assert len(data) == num_entities
        assert len(data) == count
    
        # upsert in half of the data
        upsert_data = [
            {
                "id": i,
                "word": fake.word().lower(),
                "sentence": fake.sentence().lower(),
                "paragraph": fake.paragraph().lower(),
                "text": fake.text().lower(),
                "emb": [random.random() for _ in range(dim)],
            }
            for i in range(data_size // 2)
        ]
        upsert_data += data[data_size // 2:]
        for i in range(0, len(upsert_data), batch_size):
>           collection_w.upsert(
                upsert_data[i: i + batch_size]
                if i + batch_size < len(upsert_data)
                else upsert_data[i: len(upsert_data)]
            )

testcases/test_full_text_search.py:1383: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
utils/wrapper.py:33: in inner_wrapper
    res, result = func(*args, **kwargs)
base/collection_wrapper.py:338: in upsert
    check_result = ResponseChecker(res, func_name, check_task, check_items, check, **kwargs).run()
check/func_check.py:34: in run
    result = self.assert_succ(self.succ, True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <check.func_check.ResponseChecker object at 0x1300563d0>, actual = False, expect = True

    def assert_succ(self, actual, expect):
>       assert actual is expect, f"Response of API {self.func_name} expect {expect}, but got {actual}"
E       AssertionError: Response of API upsert expect True, but got False

check/func_check.py:116: AssertionError
------------------------------------------------------------------------------------------------------- Captured log setup --------------------------------------------------------------------------------------------------------
[2024-10-21 15:32:36 - INFO - ci_test]: [setup_class] Start setup class... (client_base.py:41)
[2024-10-21 15:32:36 - INFO - ci_test]: *********************************** setup *********************************** (client_base.py:47)
[2024-10-21 15:32:36 - INFO - ci_test]: pymilvus version: 2.5.0rc101 (client_base.py:48)
[2024-10-21 15:32:36 - INFO - ci_test]: [setup_method] Start setup test case test_upsert_with_full_text_search. (client_base.py:49)
-------------------------------------------------------------------------------------------------------- Captured log call --------------------------------------------------------------------------------------------------------
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Connections.has_connection] args: ['default'], kwargs: {} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : False  (api_request.py:37)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Connections.connect] args: ['default', '', '', 'default', ''], kwargs: {'host': '10.104.20.97', 'port': 19530} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:32:36 - INFO - ci_test]: server version: 346510e-dev (client_base.py:166)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['full_text_search_collection_38VQKkPZ', {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': ......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)
[2024-10-21 15:32:36 - DEBUG - ci_test]: (api_response) : <Collection>:
-------------
<name>: full_text_search_collection_38VQKkPZ
<description>: test collection
<schema>: {'auto_id': False, 'description': 'test collection', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'word', 'de......  (api_request.py:37)
[2024-10-21 15:32:37 - INFO - ci_test]: dataframe
        id        word                                       sentence                                          paragraph                                               text                                                emb
0        0   scientist  interesting many boy oil be opportunity when.  carry form develop. cost rest dream energy. ch...  image front name provide should work. dark add...  [0.03767875334182125, 0.3989433904957238, 0.10...
1        1        plan      tax leave continue man trouble including.                  ability possible growth shoulder.  remember mrs couple hundred those. future phon...  [0.1468846081354882, 0.8416706721768779, 0.261...
2        2  population           positive later fight once knowledge.  hand hot attorney scene hard and. environmenta...  financial wife no stay listen top. very book r...  [0.31837460695144915, 0.31364729506342315, 0.8...
3        3       smile   partner kitchen floor until anything strong.  ball deal large concern fact institution note....  fly management step relationship officer. atte...  [0.30731171101897714, 0.5574160371688782, 0.87...
4        4   attention                               with black drop.  message too voice night. can control pm politi...  those keep lawyer there article these. somethi...  [0.6528922817343126, 0.8972341979186024, 0.373...
...    ...         ...                                            ...                                                ...                                                ...                                                ...
4995  4995    marriage       develop site live cut threat commercial.       defense may perform. my glass town although.  likely approach answer range. camera carry cat...  [0.25508427017015445, 0.6554567409136012, 0.32...
4996  4996        line                address vote scene eye medical.  society officer decision she bag step. bank st...  eye military dream will protect pass term pare...  [0.6205571146207843, 0.31696026115760145, 0.07...
4997  4997       hotel                           technology hold him.  social debate whether almost same different. f...  human cut table benefit set deep. ability ask ...  [0.23176432966854688, 0.1439607792449047, 0.85...
4998  4998        hair                      color husband over eight.  base safe speak against. call particularly str...  marriage unit office age pm. address under con...  [0.9673645756355516, 0.4313184485314254, 0.842...
4999  4999        wait     listen yes purpose none several walk song.  area across show require until police. easy lo...  strategy animal management base fast. radio dr...  [0.34513015065353203, 0.4114163562913943, 0.01...

[5000 rows x 6 columns] (test_full_text_search.py:1333)
[2024-10-21 15:32:37 - DEBUG - ci_test]: (api_request)  : [Collection.insert] args: [[{'id': 0, 'word': 'scientist', 'sentence': 'interesting many boy oil be opportunity when.', 'paragraph': 'carry form develop. cost rest dream energy. church fill their heart. our personal all.', 'text': 'image front name provide should work. dark address several. determine see magazine hear.\nsens......, kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:32:38 - DEBUG - ci_test]: (api_response) : (insert count: 5000, delete count: 0, upsert count: 0, timestamp: 453376988575957001, success count: 5000, err count: 0  (api_request.py:37)
[2024-10-21 15:32:38 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:32:41 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:32:41 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['emb', {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 16, 'efConstruction': 500}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:33:43 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:33:43 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['text_sparse_emb', {'index_type': 'SPARSE_INVERTED_INDEX', 'metric_type': 'BM25', 'params': {'drop_ratio_build': 0.3, 'bm25_k1': 1.5, 'bm25_b': 0.75}}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:33:59 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:33:59 - DEBUG - ci_test]: (api_request)  : [Collection.create_index] args: ['text', {'index_type': 'INVERTED'}, 1200], kwargs: {'index_name': ''} (api_request.py:62)
[2024-10-21 15:34:38 - DEBUG - ci_test]: (api_response) : Status(code=0, message=)  (api_request.py:37)
[2024-10-21 15:34:38 - DEBUG - ci_test]: (api_request)  : [Collection.load] args: [None, 1, 180], kwargs: {} (api_request.py:62)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_response) : None  (api_request.py:37)
[2024-10-21 15:34:42 - DEBUG - ci_test]: (api_request)  : [Collection.query] args: ['', ['count(*)'], None, 180], kwargs: {} (api_request.py:62)
[2024-10-21 15:34:43 - DEBUG - ci_test]: (api_response) : data: ["{'count(*)': 5000}"]   (api_request.py:37)
[2024-10-21 15:34:43 - DEBUG - ci_test]: (api_request)  : [Collection.upsert] args: [[{'id': 0, 'word': 'military', 'sentence': 'various couple role structure leader.', 'paragraph': 'long voice our. community bit writer usually camera.', 'text': 'particularly onto market claim. possible above charge admit.\nrequire quality too push few past. weight stage here compare.', 'emb': [0.0......, kwargs: {} (api_request.py:62)
[2024-10-21 15:34:43 - ERROR - pymilvus.decorators]: Unexpected error: [upsert_rows], 'text_sparse_emb', <Time: {'RPC start': '2024-10-21 15:34:43.746189', 'Exception': '2024-10-21 15:34:43.746378'}> (decorators.py:158)
[2024-10-21 15:34:43 - ERROR - ci_test]: Traceback (most recent call last):
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 176, in handler
    return func(self, *args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 118, in handler
    raise e from e
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 715, in upsert_rows
    request = self._prepare_row_upsert_request(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 696, in _prepare_row_upsert_request
    return Prepare.row_upsert_param(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 604, in row_upsert_param
    return cls._parse_upsert_row_request(request, fields_info, enable_dynamic, entities)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/client/prepare.py", line 520, in _parse_upsert_row_request
    field_info, field_data = field_info_map[key], fields_data[key]
KeyError: 'text_sparse_emb'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 32, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/Users/zilliz/workspace/milvus/tests/python_client/utils/api_request.py", line 63, in api_request
    return func(*arg, **kwargs)
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 635, in upsert
    res = conn.upsert_rows(
  File "/Users/zilliz/opt/anaconda3/envs/full_text_search/lib/python3.8/site-packages/pymilvus/decorators.py", line 159, in handler
    raise MilvusException(message=f"Unexpected error, message=<{e!s}>") from e
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=Unexpected error, message=<'text_sparse_emb'>)>
 (api_request.py:45)
[2024-10-21 15:34:43 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=Unexpected error, message=<'text_sparse_emb'>)> (api_request.py:46)

Expected Behavior

Upsert should behave the same as insert: a field that is the output of a function (here, text_sparse_emb) should not need a value in the upsert data.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

zhuwenxing added the kind/bug and needs-triage labels Oct 21, 2024
@zhuwenxing
Contributor Author

/assign @zhengbuqian

yanliang567 added the triage/accepted label and removed the needs-triage label Oct 22, 2024
yanliang567 added this to the 2.5.0 milestone Oct 22, 2024
@xiaofan-luan
Collaborator

nice catch! @zhuwenxing

@zhengbuqian
Contributor

The error was introduced in milvus-io/pymilvus#2303, where I tried to simplify the logic that checks whether the insert/request data matches the schema. There were 2 lines I forgot to update, and CI in both pymilvus and milvus didn't catch that.

Updated in milvus-io/pymilvus#2309.
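
For readers following along, here is a minimal sketch of the kind of row parsing involved. It is not the pymilvus implementation: `rows_to_columns` and its parameters are hypothetical names used only for illustration. It assumes the client pivots row dicts into per-field columns, and shows why a field produced by a function has to be skipped rather than looked up in the user's rows.

```python
from typing import Any, Dict, List, Set


def rows_to_columns(
    rows: List[Dict[str, Any]],
    field_names: List[str],
    function_output_fields: Set[str],
) -> Dict[str, List[Any]]:
    """Pivot row dicts into per-field columns, skipping function outputs.

    Hypothetical helper for illustration only; not a pymilvus API.
    """
    # Fields produced by a function (e.g. BM25) are filled in by Milvus,
    # so they are not expected in the rows supplied by the user.
    expected = [n for n in field_names if n not in function_output_fields]
    columns: Dict[str, List[Any]] = {name: [] for name in expected}
    for row in rows:
        for name in expected:
            # Looking up a function output field here would raise KeyError,
            # the same symptom as in the traceback in this issue.
            columns[name].append(row[name])
    return columns


fields = ["id", "text", "emb", "text_sparse_emb"]
rows = [{"id": 0, "text": "hello world", "emb": [0.1, 0.2]}]

# With the output field declared, the row is accepted without it.
columns = rows_to_columns(rows, fields, {"text_sparse_emb"})
```

Passing an empty set as `function_output_fields` makes the same call raise `KeyError: 'text_sparse_emb'`, which mirrors the failure reported for upsert.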

sre-ci-robot pushed a commit to milvus-io/pymilvus that referenced this issue Oct 24, 2024
issue: milvus-io/milvus#37022 and
milvus-io/milvus#37021

list of empty string is erroneously treated as sparse vector

adding a check to explicitly require that, if the input data will be
seen as sparse vector, its data type, if not scipy format, should be
list of list or list of dict.

Signed-off-by: Buqian Zheng <[email protected]>
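
The check described in the commit message can be sketched roughly as below. This is an assumption-laden illustration, not the code from the commit: `is_sparse_rows_candidate` is a hypothetical name, and the scipy sparse formats mentioned in the message are deliberately left out.

```python
from typing import Any


def is_sparse_rows_candidate(data: Any) -> bool:
    """Return True only for a list whose entries are all lists or dicts.

    Hypothetical helper for illustration only; scipy formats are not handled.
    """
    if not isinstance(data, list):
        return False
    # A list of strings (even empty strings) is text data, not sparse vectors.
    return all(isinstance(row, (list, dict)) for row in data)


text_column = ["", ""]                  # list of empty strings
dict_rows = [{1: 0.5, 7: 0.25}, {}]     # list of dict
list_rows = [[(1, 0.5)], [(3, 0.1)]]    # list of list

results = (
    is_sparse_rows_candidate(text_column),
    is_sparse_rows_candidate(dict_rows),
    is_sparse_rows_candidate(list_rows),
)
```

Here the list of empty strings is rejected, while the list-of-dict and list-of-list inputs pass the type check.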
@zhengbuqian
Contributor

/assign @zhuwenxing
/unassign
please verify, thanks!

@zhuwenxing
Contributor Author

verified and fixed in 2.5.0rc104
