Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(NODE-3688): make handshake errors retryable #3165

Merged
merged 21 commits into from
Apr 22, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
15 changes: 13 additions & 2 deletions src/cmap/connect.ts
Original file line number Diff line number Diff line change
Expand Up @@ -10,11 +10,14 @@ import { LEGACY_HELLO_COMMAND } from '../constants';
import {
AnyError,
MongoCompatibilityError,
MongoError,
MongoErrorLabel,
MongoInvalidArgumentError,
MongoNetworkError,
MongoNetworkTimeoutError,
MongoRuntimeError,
MongoServerError
MongoServerError,
needsRetryableWriteLabel
} from '../error';
import { Callback, ClientMetadata, HostAddress, makeClientMetadata, ns } from '../utils';
import { AuthContext, AuthProvider } from './auth/auth_provider';
Expand Down Expand Up @@ -182,7 +185,15 @@ function performInitialHandshake(
);
}
provider.auth(authContext, err => {
if (err) return callback(err);
if (err) {
if (err instanceof MongoError) {
dariakp marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Contributor

@dariakp dariakp Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for missing this the first time around, but after discussing this with the team, this check right now is too broad (e.g., if we get a non-retryable error we shouldn't add the label):

  1. the best solution would be to change this line to if (needsRetryableWriteLabel(err))
  2. this helper function is defined in error.ts and already does all the checks we'd expect, unfortunately it has a small bug in it right now which we'll want to fix to make this work correctly: the maxWireVersion check happens before the MongoNetworkError check, so basically we want to combine the maxWireVersion check via || with the check on L765 in that file, and change the block to return false (if it already has a label, it doesn't need one)
  3. additionally, I noticed that there is a lack of spec tests to cover the cases where the handshake should not be retried, so I'll be filing a ticket to add more unified tests for those cases,
  4. but in the meanwhile, for every manually implemented retry case in the test files, can we add an alternate case for when the failpoint error sent is not retryable? We can keep them in each respective spec file with a comment referencing future spec work in that drivers ticket, I don't think it would add much scope since we can more or less copy paste the existing tests and just change the error code and the corresponding assertion

Update: DRIVERS ticket filed here: https://jira.mongodb.org/browse/DRIVERS-2247

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been updated to use needsRetryableWriteLabel and that code has been refactored to not take the maxWireVersion into account anymore, since the configure fail point error does not automatically add the label in 4.4+ servers. So the code basically just looks to see if the label is there or not and ignores the wire version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I LGTM-ed but this looks outstanding still
@durran What're your thoughts on point 4? Copying the existing tests and changing the error code seems like it could "just work" 🤞 Maybe we merge this and open a follow up PR for the sake of ease of review, thoughts?

err.addErrorLabel(MongoErrorLabel.HandshakeError);
if (needsRetryableWriteLabel(err, response.maxWireVersion)) {
err.addErrorLabel(MongoErrorLabel.RetryableWriteError);
}
}
return callback(err);
}
callback(undefined, conn);
});

Expand Down
26 changes: 16 additions & 10 deletions src/error.ts
Original file line number Diff line number Diff line change
Expand Up @@ -88,7 +88,8 @@ export const MongoErrorLabel = Object.freeze({
RetryableWriteError: 'RetryableWriteError',
TransientTransactionError: 'TransientTransactionError',
UnknownTransactionCommitResult: 'UnknownTransactionCommitResult',
ResumableChangeStreamError: 'ResumableChangeStreamError'
ResumableChangeStreamError: 'ResumableChangeStreamError',
HandshakeError: 'HandshakeError'
} as const);

/** @public */
Expand Down Expand Up @@ -744,21 +745,22 @@ const RETRYABLE_WRITE_ERROR_CODES = new Set<number>([
]);

export function needsRetryableWriteLabel(error: Error, maxWireVersion: number): boolean {
if (maxWireVersion >= 9) {
// 4.4+ servers attach their own retryable write error
return false;
}
// pre-4.4 server, then the driver adds an error label for every valid case
nbbeeken marked this conversation as resolved.
Show resolved Hide resolved
// execute operation will only inspect the label, code/message logic is handled here

if (error instanceof MongoNetworkError) {
return true;
}

if (error instanceof MongoError && error.hasErrorLabel(MongoErrorLabel.RetryableWriteError)) {
// Before 4.4 the error label can be one way of identifying retry
// so we can return true if we have the label, but fall back to code checking below
return true;
if (error instanceof MongoError) {
if (
(maxWireVersion >= 9 || error.hasErrorLabel(MongoErrorLabel.RetryableWriteError)) &&
!error.hasErrorLabel(MongoErrorLabel.HandshakeError)
) {
// If we already have the error label no need to add it again. 4.4+ servers add the label.
// In the case where we have a handshake error, need to fall down to the logic checking
// the codes.
return false;
}
}

if (error instanceof MongoWriteConcernError) {
Expand All @@ -782,6 +784,10 @@ export function needsRetryableWriteLabel(error: Error, maxWireVersion: number):
return false;
}

export function isRetryableWriteError(error: MongoError): boolean {
return error.hasErrorLabel(MongoErrorLabel.RetryableWriteError);
}

/** Determines whether an error is something the driver should attempt to retry */
export function isRetryableReadError(error: MongoError): boolean {
const hasRetryableErrorCode =
Expand Down
4 changes: 2 additions & 2 deletions src/operations/execute_operation.ts
Original file line number Diff line number Diff line change
@@ -1,10 +1,10 @@
import type { Document } from '../bson';
import {
isRetryableReadError,
isRetryableWriteError,
MongoCompatibilityError,
MONGODB_ERROR_CODES,
MongoError,
MongoErrorLabel,
MongoExpiredSessionError,
MongoNetworkError,
MongoRuntimeError,
Expand Down Expand Up @@ -195,7 +195,7 @@ function executeWithServerSelection<TResult>(
);
}

if (isWriteOperation && !originalError.hasErrorLabel(MongoErrorLabel.RetryableWriteError)) {
if (isWriteOperation && !isRetryableWriteError(originalError)) {
return callback(originalError);
}

Expand Down
12 changes: 8 additions & 4 deletions test/integration/retryable-reads/retryable_reads.spec.test.js
Original file line number Diff line number Diff line change
@@ -1,11 +1,11 @@
'use strict';

const path = require('path');
const { TestRunnerContext, generateTopologyTests } = require('../../tools/spec-runner');
const { loadSpecTests } = require('../../spec');
const { runUnifiedSuite } = require('../../tools/unified-spec-runner/runner');

describe('Retryable Reads', function () {
describe('Retryable Reads (legacy)', function () {
const testContext = new TestRunnerContext();
const testSuites = loadSpecTests('retryable-reads');
const testSuites = loadSpecTests(path.join('retryable-reads', 'legacy'));

after(() => testContext.teardown());
before(function () {
Expand All @@ -28,3 +28,7 @@ describe('Retryable Reads', function () {
);
});
});

describe('Retryable Reads (unified)', function () {
runUnifiedSuite(loadSpecTests(path.join('retryable-reads', 'unified')));
});
Original file line number Diff line number Diff line change
@@ -1,8 +1,10 @@
import { expect } from 'chai';
import * as path from 'path';

import type { Collection, Db, MongoClient } from '../../../src';
import { loadSpecTests } from '../../spec';
import { legacyRunOnToRunOnRequirement } from '../../tools/spec-runner';
import { runUnifiedSuite } from '../../tools/unified-spec-runner/runner';
import { isAnyRequirementSatisfied } from '../../tools/unified-spec-runner/unified-utils';

interface RetryableWriteTestContext {
Expand Down Expand Up @@ -196,3 +198,7 @@ async function turnOffFailPoint(client, name) {
mode: 'off'
});
}

describe('Retryable Writes (unified)', function () {
runUnifiedSuite(loadSpecTests(path.join('retryable-writes', 'unified')));
});
97 changes: 83 additions & 14 deletions test/spec/retryable-reads/README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,10 @@ Retryable Reads Tests
Introduction
============

The YAML and JSON files in this directory tree are platform-independent tests
that drivers can use to prove their conformance to the Retryable Reads spec.
The YAML and JSON files in the ``legacy`` and ``unified`` sub-directories are platform-independent tests
that drivers can use to prove their conformance to the Retryable Reads spec. Tests in the
``unified`` directory are written using the `Unified Test Format <../../unified-test-format/unified-test-format.rst>`_.
Tests in the ``legacy`` directory are written using the format described below.

Prose tests, which are not easily expressed in YAML, are also presented
in this file. Those tests will need to be manually implemented by each driver.
Expand Down Expand Up @@ -79,29 +81,49 @@ Each YAML file has the following keys:
the default is all topologies (i.e. ``["single", "replicaset", "sharded",
"load-balanced"]``).

- ``serverless``: Optional string. Whether or not the test should be run on
serverless instances imitating sharded clusters. Valid values are "require",
"forbid", and "allow". If "require", the test MUST only be run on serverless
instances. If "forbid", the test MUST NOT be run on serverless instances. If
omitted or "allow", this option has no effect.

The test runner MUST be informed whether or not serverless is being used in
order to determine if this requirement is met (e.g. through an environment
variable or configuration option). Since the serverless proxy imitates a
mongos, the runner is not capable of determining this by issuing a server
command such as ``buildInfo`` or ``hello``.

- ``database_name`` and ``collection_name``: Optional. The database and
collection to use for testing.

- ``bucket_name``: Optional. The GridFS bucket name to use for testing.

- ``data``: The data that should exist in the collection(s) under test before
each test run. This will typically be an array of documents to be inserted
into the collection under test (i.e. ``collection_name``); however, this field
may also be an object mapping collection names to arrays of documents to be
inserted into the specified collection.

- ``tests``: An array of tests that are to be run independently of each other.
Each test will have some or all of the following fields:

- ``description``: The name of the test.

- ``clientOptions``: Optional, parameters to pass to MongoClient().

- ``useMultipleMongoses`` (optional): If ``true``, the MongoClient for this
test should be initialized with multiple mongos seed addresses. If ``false``
or omitted, only a single mongos address should be specified. This field has
no effect for non-sharded topologies.

- ``useMultipleMongoses`` (optional): If ``true``, and the topology type is
``Sharded``, the MongoClient for this test should be initialized with multiple
mongos seed addresses. If ``false`` or omitted, only a single mongos address
should be specified.

If ``true``, and the topology type is ``LoadBalanced``, the MongoClient for
this test should be initialized with the URI of the load balancer fronting
multiple servers. If ``false`` or omitted, the MongoClient for this test
should be initialized with the URI of the load balancer fronting a single
server.

``useMultipleMongoses`` only affects ``Sharded`` and ``LoadBalanced`` topologies.

- ``skipReason``: Optional, string describing why this test should be skipped.

- ``failPoint``: Optional, a server fail point to enable, expressed as the
Expand All @@ -120,10 +142,10 @@ Each YAML file has the following keys:
- ``result``: Optional. The return value from the operation, if any. This
field may be a scalar (e.g. in the case of a count), a single document, or
an array of documents in the case of a multi-document read.

- ``error``: Optional. If ``true``, the test should expect an error or
exception.

- ``expectations``: Optional list of command-started events.

GridFS Tests
Expand All @@ -147,7 +169,8 @@ data.


.. _GridFSBucket spec: https://github.com/mongodb/specifications/blob/master/source/gridfs/gridfs-spec.rst#configurable-gridfsbucket-class



Speeding Up Tests
-----------------

Expand All @@ -165,9 +188,53 @@ Optional Enumeration Commands
A driver only needs to test the optional enumeration commands it has chosen to
implement (e.g. ``Database.listCollectionNames()``).

PoolClearedError Retryability Test
==================================

This test will be used to ensure drivers properly retry after encountering PoolClearedErrors.
It MUST be implemented by any driver that implements the CMAP specification.
This test requires MongoDB 4.2.9+ for ``blockConnection`` support in the failpoint.

1. Create a client with maxPoolSize=1 and retryReads=true. If testing against a
sharded deployment, be sure to connect to only a single mongos.

2. Enable the following failpoint::

{
configureFailPoint: "failCommand",
mode: { times: 1 },
data: {
failCommands: ["find"],
errorCode: 91,
blockConnection: true,
blockTimeMS: 1000
}
}

3. Start two threads and attempt to perform a ``findOne`` simultaneously on both.

4. Verify that both ``findOne`` attempts succeed.

5. Via CMAP monitoring, assert that the first check out succeeds.

6. Via CMAP monitoring, assert that a PoolClearedEvent is then emitted.

7. Via CMAP monitoring, assert that the second check out then fails due to a
connection error.

8. Via Command Monitoring, assert that exactly three ``find`` CommandStartedEvents
were observed in total.

9. Disable the failpoint.


Changelog
=========

:2022-01-10: Create legacy and unified subdirectories for new unified tests

:2021-08-27: Clarify behavior of ``useMultipleMongoses`` for ``LoadBalanced`` topologies.

:2019-03-19: Add top-level ``runOn`` field to denote server version and/or
topology requirements requirements for the test file. Removes the
``minServerVersion`` and ``topology`` top-level fields, which are
Expand All @@ -177,4 +244,6 @@ Changelog

:2020-09-16: Suggest lowering heartbeatFrequencyMS in addition to minHeartbeatFrequencyMS.

:2021-04-23: Add ``load-balanced`` to test topology requirements.
:2021-03-23: Add prose test for retrying PoolClearedErrors

:2021-04-29: Add ``load-balanced`` to test topology requirements.
Loading