
[Bug]: DescribeIndex performance is poor if the collection has lots of scalar indexes and large number of segments #29313

Closed
1 task done
yhmo opened this issue Dec 19, 2023 · 9 comments
Assignees
Labels
kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@yhmo
Contributor

yhmo commented Dec 19, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.2.14
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Create a collection with dozens of scalar fields and create an index on each field.
Insert 60M entities into the collection, generating 1000+ segments.
Calling describe_index() then takes 10+ seconds.

Expected Behavior

No response

Steps To Reproduce

Set dataCoord.segment.maxSize=4 and start a Milvus standalone.
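For reference, a minimal milvus.yaml override (a sketch only, assuming the default config layout; the value is in MB, so 4 forces very small segments):

# milvus.yaml (excerpt) -- sketch assuming the default config layout
dataCoord:
  segment:
    maxSize: 4   # in MB; kept tiny on purpose to generate many segments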
With the script below, describe_index takes 3+ seconds.

import random
import time

from pymilvus import (
    connections,
    FieldSchema, CollectionSchema, DataType,
    Collection,
    utility,
)

connections.connect(host='localhost', port='19530')
print(utility.get_server_version())

collection_name = "test"
dim = 128
metric_type = "L2"

# create collection
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=dim),
]

# add 50 scalar INT64 fields
for i in range(50):
    fields.append(FieldSchema(name=f"f{i}", dtype=DataType.INT64))

schema = CollectionSchema(fields)

if utility.has_collection(collection_name):
    utility.drop_collection(collection_name)

collection = Collection(name=collection_name, schema=schema)
print(f"Collection '{collection_name}' created")

index_params = {
    'metric_type': metric_type,
    'index_type': "IVF_FLAT",
    'params': {"nlist": 32},
}
collection.create_index(field_name="vector", index_params=index_params)
for i in range(50):
    start = time.time()
    collection.create_index(field_name=f"f{i}", index_params={})
    print(f"create index time cost: {time.time()-start} seconds")

print("index created")

def describe_index():
    print("\n\n=======================describe index===========================")
    start = time.time()
    for index in collection.indexes:
        print(index.to_dict())
    print(f"describe index time cost: {time.time()-start} seconds")

describe_index()

batch_count = 10000
data = [
    [[random.random() for _ in range(dim)] for _ in range(batch_count)],  # vectors
]
# one column of values per scalar field f0..f49
for i in range(50):
    data.append(list(range(batch_count)))

for i in range(200):
    print("insert", i)
    collection.insert(data=data)

collection.flush()
print("collection.num_entities = ", collection.num_entities)

while True:
    describe_index()
    time.sleep(10)


Milvus Log

No response

Anything else?

No response

@yhmo added the kind/bug and needs-triage labels on Dec 19, 2023
@xiaofan-luan
Collaborator

/assign @xiaocai2333

Please help with this.

@xiaofan-luan
Collaborator

Each describe operation should take less than 10 ms.

@xiaocai2333
Contributor

xiaocai2333 commented Jan 3, 2024

In version 2.2, DescribeIndex requires multiple RPC calls, and to report accurate status these calls cannot be avoided:

  • GetFlushedSegments: get the IDs of all flushed segments.
  • ListSegmentsInfo: get the flushed segments' info to compute the total row count.
  • GetRecoveryInfoV2: get the segment view to compute the indexed row count.

In version 2.3, IndexCoord was merged into DataCoord, so these RPC calls are no longer needed.
Even so, it costs 60+ ms with 1800 segments:

--- Growing: 0, Sealed: 0, Flushed: 1800
--- Total Segments: 1800, row count: 2000000
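For a client-side view of the same per-index bookkeeping, here is a minimal pymilvus sketch (an illustration only; it assumes the "test" collection created by the script in the issue body still exists). index_building_progress reports the indexed-rows/total-rows numbers that DescribeIndex has to aggregate across segments:

# Sketch only: assumes the "test" collection from the reproduction script exists.
import time

from pymilvus import Collection, connections, utility

connections.connect(host="localhost", port="19530")
collection = Collection("test")

for idx in collection.indexes:
    start = time.time()
    progress = utility.index_building_progress("test", index_name=idx.index_name)
    elapsed_ms = (time.time() - start) * 1000
    print(f"{idx.field_name}: {progress['indexed_rows']}/{progress['total_rows']} rows, "
          f"{elapsed_ms:.1f} ms")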


stale bot commented Feb 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

The stale bot added the stale (indicates no updates for 30 days) label on Feb 2, 2024
@xiaocai2333
Contributor

/keep it


stale bot commented Mar 3, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

The stale bot added the stale (indicates no updates for 30 days) label on Mar 3, 2024
sre-ci-robot pushed a commit that referenced this issue Mar 3, 2024
@yanliang567
Contributor

keep it active

@xiaocai2333
Contributor

In versions 2.3.12 and 2.4, it has been reduced to about 10 ms.
Version 2.2.x will not be further optimized.
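To double-check the ~10 ms figure against a 2.3.12+/2.4 deployment, a rough averaging loop over the reproduction collection (a sketch; it assumes the connection and "test" collection from the script above):

# Sketch only: assumes a 2.3.12+/2.4 server and the "test" collection from the script above.
import time

from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("test")

runs = 20
start = time.time()
for _ in range(runs):
    _ = collection.indexes  # each access fetches the index descriptions from the server
avg_ms = (time.time() - start) * 1000 / runs
print(f"average describe latency over {runs} runs: {avg_ms:.1f} ms")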
/close

@sre-ci-robot
Contributor

@xiaocai2333: Closing this issue.

In response to this:

In versions 2.3.12 and 2.4, it has been reduced to about 10 ms.
Version 2.2.x will not be further optimized.
/close

