
[Bug]: Failed to search: node offline[node=-1]: channel not available when streamingDeltaForwardPolicy is Direct #36887

Open
1 task done
ThreadDao opened this issue Oct 15, 2024 · 3 comments
Assignees
Labels
deletion-opt kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@ThreadDao
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4-20241010-eaa94875-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Milvus cluster

Deploy Milvus with the following config:

  config:
    dataCoord:
      enableActiveStandby: true
      segment:
        expansionRate: 1.15
        maxSize: 2048
        sealProportion: 0.12
    dataNode:
      compaction:
        levelZeroBatchMemoryRatio: 0.5
    indexCoord:
      enableActiveStandby: true
    log:
      level: debug
    minio:
      accessKeyID: miniozong
      bucketName: bucket-zong
      rootPath: compact_2
      secretAccessKey: miniozong
    queryCoord:
      enableActiveStandby: true
    queryNode:
      levelZeroForwardPolicy: RemoteLoad
      streamingDeltaForwardPolicy: Direct
    quotaAndLimits:
      dml:
        deleteRate:
          max: 0.5
        enabled: false
        insertRate:
          max: 8
        upsertRate:
          max: 8
      growingSegmentsSizeProtection:
        enabled: false
        highWaterLevel: 0.2
        lowWaterLevel: 0.1
      limitWriting:
        memProtection:
          dataNodeMemoryHighWaterLevel: 0.85
          dataNodeMemoryLowWaterLevel: 0.75
          queryNodeMemoryHighWaterLevel: 0.85
          queryNodeMemoryLowWaterLevel: 0.75
      limits:
        complexDeleteLimitEnable: true
    rootCoord:
      enableActiveStandby: true
    trace:
      exporter: jaeger
      jaeger:
        url: http://tempo-distributor.tempo:14268/api/traces"
      sampleFraction: 1

Test steps:

  1. There is a collection with an int64 pk field and a vector field. The collection has 100M entities.
  2. When the deletes start, search requests fail:
[2024-10-15 10:48:02,882 - ERROR - fouram]: RPC error: [search], <MilvusException: (code=503, message=fail to search on QueryNode 23: distribution is not servcieable: channel not available[channel=compact-opt-100m-2-rootcoord-dml_0_453128445902192997v0])>, <Time:{'RPC start': '2024-10-15 10:47:45.846202', 'RPC error': '2024-10-15 10:48:02.882154'}> (decorators.py:147)
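
(The search side of the workload is not part of the repro script below; the following is a minimal, hedged sketch of the kind of search call that hit this error. The vector field name, dimension, and search params are assumptions, not values from the actual test.)

import random

from pymilvus import Collection, connections

connections.connect(host="xxx")                  # same placeholder host as the delete script
collection = Collection(name="fouram_3QEsE82U")

dim = 128                                        # assumed vector dimension
nq_vectors = [[random.random() for _ in range(dim)]]

# While deletes were being forwarded with streamingDeltaForwardPolicy=Direct,
# a search like this returned code=503 "channel not available".
res = collection.search(
    data=nq_vectors,
    anns_field="float_vector",                          # assumed vector field name
    param={"metric_type": "L2", "params": {"ef": 64}},  # assumed search params
    limit=10,
)
print(res)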

client delete log:

[2024-10-15 18:46:51,711 - INFO - ci_test]: start to delete [0, ..., 15999] with length 16000 (tmp.py:40)
[2024-10-15 18:46:51,825 - INFO - ci_test]: delete cost 0.11316943168640137 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
[2024-10-15 18:46:52,716 - INFO - ci_test]: start to delete [16000, ..., 31999] with length 16000 (tmp.py:40)
...
[2024-10-15 18:51:55,817 - INFO - ci_test]: delete cost 0.11813139915466309 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
[2024-10-15 18:51:56,703 - INFO - ci_test]: start to delete [4864000, ..., 4879999] with length 16000 (tmp.py:40)
[2024-10-15 18:51:56,825 - INFO - ci_test]: delete cost 0.12181949615478516 with res (insert count: 0, delete count: 16000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)

Expected Behavior

No response

Steps To Reproduce

- https://argo-workflows.zilliz.cc/archived-workflows/qa/64dab658-11fc-4a63-ac02-8770c303363f?nodeId=compact-opt-delete-100m-6b
- delete script:

import logging
import time

from pymilvus import Collection, connections

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ci_test")

def get_ids(start, end, batch):
    # Yield consecutive id batches from [start, end); yield None once the range is exhausted.
    while True:
        if start >= end:
            yield None
            return
        batch = min(batch, end - start)
        ids = list(range(start, start + batch))
        start += len(ids)
        yield ids

def delete_with_rate(_host, _name, _start, _end, _batch, pk="id"):
    # Issue at most one delete batch per second to keep a roughly constant delete rate.
    connections.connect(host=_host)
    c = Collection(name=_name)
    for ids in get_ids(_start, _end, _batch):
        if ids is None:
            break
        log.info(f"start to delete [{ids[0]}, ..., {ids[-1]}] with length {len(ids)}")
        start_time = time.time()
        delete_res = c.delete(expr=f"{pk} in {ids}")
        cost = time.time() - start_time
        log.info(f"delete cost {cost} with res {delete_res}")
        if cost < 1:
            time.sleep(1 - cost)

if __name__ == '__main__':
    host = "xxx"
    name = "fouram_3QEsE82U"
    delete_with_rate(host, name, 0, 50000000, _batch=16000)


### Milvus Log

pods:

compact-opt-100m-2-milvus-datanode-74b5c7854b-xxcdl 1/1 Running 0 3h53m 10.104.14.7 4am-node18
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-9xtfj 1/1 Running 0 3h52m 10.104.4.36 4am-node11
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-qb26s 1/1 Running 0 3h53m 10.104.17.2 4am-node23
compact-opt-100m-2-milvus-indexnode-6cd9b49f5-zp5bj 1/1 Running 0 3h51m 10.104.1.234 4am-node10
compact-opt-100m-2-milvus-mixcoord-8f9875d6d-khsb4 1/1 Running 0 3h53m 10.104.4.33 4am-node11
compact-opt-100m-2-milvus-proxy-5bd9875bb4-tkrzw 1/1 Running 0 3h53m 10.104.9.107 4am-node14
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-8dz69 1/1 Running 0 3h52m 10.104.20.48 4am-node22
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-dqwzf 1/1 Running 0 3h49m 10.104.23.93 4am-node27
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-hs6hn 1/1 Running 0 3h53m 10.104.24.147 4am-node29
compact-opt-100m-2-milvus-querynode-0-7488f76b9b-p5dcv 1/1 Running 0 3h50m 10.104.30.192 4am-node38


### Anything else?

_No response_
@ThreadDao ThreadDao added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 15, 2024
@ThreadDao ThreadDao added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Oct 15, 2024
@ThreadDao ThreadDao added this to the 2.4.14 milestone Oct 15, 2024
@yanliang567
Contributor

/unassign

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 16, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Oct 16, 2024
Related to milvus-io#36887

Forwarding a delete to an L0 segment can return an error and mark the
L0 segment offline, causing the delegator to become unserviceable.

Signed-off-by: Congqi Xia <[email protected]>
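
(A rough sketch in plain Python, not Milvus code and with illustrative names only, of why marking an L0 segment offline makes the delegator's distribution unserviceable, which is exactly the "channel not available" error reported above.)

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Segment:
    segment_id: int
    online: bool = True

@dataclass
class ShardDistribution:
    # The delegator's view of the segments serving one DML channel.
    channel: str
    segments: Dict[int, Segment] = field(default_factory=dict)

    def mark_offline(self, segment_id: int) -> None:
        # What effectively happened when forwarding a delete to an L0 segment failed.
        self.segments[segment_id].online = False

    def serviceable(self) -> bool:
        # Searches are rejected with "channel not available" unless every
        # segment in the distribution is online.
        return all(seg.online for seg in self.segments.values())

dist = ShardDistribution(channel="rootcoord-dml_0", segments={1: Segment(1), 2: Segment(2)})
dist.mark_offline(2)
assert not dist.serviceable()   # -> searches on this channel fail
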
sre-ci-robot pushed a commit that referenced this issue Oct 16, 2024
Related to #36887

Forwarding a delete to an L0 segment can return an error and mark the
L0 segment offline, causing the delegator to become unserviceable.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 16, 2024
Related to milvus-io#36887

Forwarding a delete to an L0 segment can return an error and mark the
L0 segment offline, causing the delegator to become unserviceable.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 17, 2024
Cherry-pick from master
pr: #36899

Related to #36887

Forwarding a delete to an L0 segment can return an error and mark the
L0 segment offline, causing the delegator to become unserviceable.

Signed-off-by: Congqi Xia <[email protected]>
@ThreadDao
Contributor Author

The search failure has been fixed, but the delegator and two querynodes OOMed:

compact-opt-100m-2-milvus-querynode-1-68b66fcf88-22v7w            1/1     Running     2 (3h54m ago)   26h     10.104.23.52    4am-node27   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-g9rdb            1/1     Running     1 (3h53m ago)   26h     10.104.34.82    4am-node37   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-vxdnw            1/1     Running     2 (3h34m ago)   26h     10.104.30.132   4am-node38   <none>           <none>
compact-opt-100m-2-milvus-querynode-1-68b66fcf88-zbqhq            1/1     Running     3 (3h52m ago)   26h     10.104.24.50    4am-node29   <none>           <none>
[2024-10-18 14:39:08,390 - INFO - ci_test]: start to delete [4880000, ..., 4939999] with length 60000 (tmp.py:40)
...
[2024-10-18 14:44:47,257 - INFO - ci_test]: delete cost 0.43274760246276855 with res (insert count: 0, delete count: 60000, upsert count: 0, timestamp: 0, success count: 0, err count: 0 (tmp.py:44)
[2024-10-18 14:44:47,830 - INFO - ci_test]: start to delete [25100000, ..., 25159999] with length 60000 (tmp.py:40)
RPC error: [delete], <MilvusException: (code=9, message=quota exceeded[reason=memory quota exceeded, please allocate more resources])>, <Time:{'RPC start': '2024-10-18 14:44:47.846284', 'RPC error': '2024-10-18 14:44:47.913232'}>
Traceback (most recent call last):
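
(A hedged client-side sketch, not part of the original test: the delete loop could back off when the proxy answers with the quota error, code=9, instead of aborting.)

import time

from pymilvus import Collection, MilvusException

def delete_with_backoff(c: Collection, expr: str, max_retries: int = 5) -> None:
    # Retry with exponential backoff when the server returns
    # "quota exceeded" (code=9), as in the RPC error above.
    for attempt in range(max_retries):
        try:
            c.delete(expr=expr)
            return
        except MilvusException as e:
            if e.code != 9:
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError(f"delete still rate-limited after {max_retries} retries")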

@xiaofan-luan
Collaborator

It seems this hit the quota limit, so it shouldn't have OOMed?

Did the OOM happen after the "memory quota exceeded" log?

congqixia added a commit to congqixia/milvus that referenced this issue Oct 22, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Oct 22, 2024
sre-ci-robot pushed a commit that referenced this issue Oct 22, 2024
Rewritten based on master pr
pr: #37043

Related to #36887

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 22, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Oct 23, 2024
Related to milvus-io#36887

DirectForward streaming delete can cause memory usage to explode when
the number of segments is large. This PR adds a batching delete API and
uses it for the direct forward implementation.

Signed-off-by: Congqi Xia <[email protected]>
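
(A conceptual sketch of the batching idea in plain Python; the real change is in the Milvus query node and the names here are hypothetical. The point is that peak in-flight forward work is bounded by the batch size instead of the total number of target segments.)

from typing import Callable, Sequence

def forward_delete_direct(pks: Sequence[int],
                          segment_ids: Sequence[int],
                          forward_batch: Callable[[Sequence[int], Sequence[int]], None],
                          batch_size: int = 128) -> None:
    # Before the fix: forward state was built for every segment up front,
    # so memory grew with len(segment_ids).
    # After the fix: the same delete records are forwarded to segments in
    # fixed-size batches, so at most batch_size targets are handled at once.
    for i in range(0, len(segment_ids), batch_size):
        forward_batch(segment_ids[i:i + batch_size], pks)
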
sre-ci-robot pushed a commit that referenced this issue Oct 24, 2024
Related to #36887

DirectForward streaming delete can cause memory usage to explode when
the number of segments is large. This PR adds a batching delete API and
uses it for the direct forward implementation.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 24, 2024
Related to milvus-io#36887

DirectForward streaming delete can cause memory usage to explode when
the number of segments is large. This PR adds a batching delete API and
uses it for the direct forward implementation.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 25, 2024
…#37107)

Cherry pick from master
pr: #37076
Related #36887

DirectForward streaming delete can cause memory usage to explode when
the number of segments is large. This PR adds a batching delete API and
uses it for the direct forward implementation.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 28, 2024
Related to milvus-io#36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

Signed-off-by: Congqi Xia <[email protected]>
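
(The resource-check idea, sketched in Python with psutil; the actual check is implemented in the Go query node with its own estimation and watermarks. The 0.85 watermark below just mirrors queryNodeMemoryHighWaterLevel from the config in this issue.)

import psutil

MEMORY_HIGH_WATERMARK = 0.85  # mirrors queryNodeMemoryHighWaterLevel above

def check_memory_for_delta_load(estimated_bytes: int) -> None:
    # Refuse to start loading delta logs if the load would push memory
    # above the high watermark, instead of loading and OOM-quitting.
    vm = psutil.virtual_memory()
    if vm.used + estimated_bytes > vm.total * MEMORY_HIGH_WATERMARK:
        raise MemoryError("not enough memory to load delta logs, retry later")

def load_delta_logs(delta_logs, estimated_bytes: int) -> None:
    check_memory_for_delta_load(estimated_bytes)
    # ... actual deltalog loading would happen here ...
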
congqixia added a commit to congqixia/milvus that referenced this issue Oct 28, 2024
Related to milvus-io#36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
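
(The gist of the pool change, sketched in Python; the actual fix is in Go, where a per-request pool leaked goroutines. A shared, lazily created singleton pool both avoids the leak and caps delete concurrency.)

from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

def apply_deletes_per_request_pool(tasks):
    # Before: a fresh pool per request whose workers are never torn down,
    # the Python analogue of the leaked goroutines described above.
    pool = ThreadPoolExecutor(max_workers=8)
    return [f.result() for f in [pool.submit(t) for t in tasks]]

@lru_cache(maxsize=1)
def get_delete_pool() -> ThreadPoolExecutor:
    # After: a single shared pool, created once on first use.
    return ThreadPoolExecutor(max_workers=8)

def apply_deletes_shared_pool(tasks):
    pool = get_delete_pool()
    return [f.result() for f in [pool.submit(t) for t in tasks]]
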
congqixia added a commit to congqixia/milvus that referenced this issue Oct 28, 2024
Related to milvus-io#36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 29, 2024
Related to #36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 29, 2024
Related to #36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

---------

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 29, 2024
Related to milvus-io#36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

---------

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 29, 2024
…#37220)

Related to milvus-io#36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 29, 2024

Cherry-pick from master
pr: #37220
Related to #36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 29, 2024
) (#37233)

Cherry pick from master
pr: #37220

Related to #36887

Previously, creating a new pool per request could cause goroutine
leakage. This PR changes the behavior to use a singleton delete pool,
which also provides better concurrency control over delete memory usage.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 29, 2024
Related to milvus-io#36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

---------

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 30, 2024
Cherry pick from master
pr: #37195

Related to #36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

---------

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 30, 2024
Cherry pick from master
pr: #37195
Related to #36887

The `LoadDeltaLogs` API did not check memory usage. When the system is
under high delete load pressure, this could result in an OOM exit.

This PR adds a resource check for `LoadDeltaLogs` actions and separates
the internal deltalog loading function from the public one.

---------

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 30, 2024
Related to milvus-io#36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 30, 2024
Related to milvus-io#36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 30, 2024
Cherry pick from master
pr: #37305 
Related to #36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 30, 2024
Related to #36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
congqixia added a commit to congqixia/milvus that referenced this issue Oct 30, 2024
Related to milvus-io#36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
xiaofan-luan pushed a commit to xiaofan-luan/milvus that referenced this issue Oct 30, 2024
Related to milvus-io#36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
xiaofan-luan pushed a commit to xiaofan-luan/milvus that referenced this issue Oct 30, 2024
Related to milvus-io#36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>
sre-ci-robot pushed a commit that referenced this issue Oct 31, 2024
Cherry pick from master
pr: #37305
Related to #36887

The logic that removes non-hit pk delete records does not work because
`insert_record_.contain` does not work due to a logic problem.

Signed-off-by: Congqi Xia <[email protected]>