chore(telemetry): use importlibs instead of pkg_resources (#3783)

pkg_resources has known performance issues: pypa/setuptools#926. This PR replaces pkg_resources with importlib.metadata and uses this module to retrieve package names and versions. A further optimization was made to the importlib implementation which parses package metadata: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1. Benchmarks for this third optimization are also shown in the table below: | benchmark | test case | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) | |---------------------------|---------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:| | ddtracerun-auto_telemetry | pkg_resources (1.x branch) | 15 | 326 | 13 | 274 | 52 | 19.0 | | ddtracerun-auto_telemetry | importlib | 15 | 285 | 5 | 270 | 15 | 5.6 | | ddtracerun-auto_telemetry | importlib with partial parsing | 15 | 285 | 10 | 269 | 16 | 5.9 | | ddtracerun-auto_telemetry | importlib | 30 | 377 | 5 | 350 | 27 | 7.7 | | ddtracerun-auto_telemetry | importlib with partial parsing | 30 | 362 | 7 | 350 | 12 | 3.4 | | ddtracerun-auto_telemetry | importlib | 45 | 381 | 24 | 348 | 31 | 8.9 | | ddtracerun-auto_telemetry | importlib with partial parsing | 45 | 363 | 9 | 350 | 23 | 6.3 | | ddtracerun-auto_telemetry | importlib | 313 | 1050 | 79 | 991 | 59 | 5.9 | | ddtracerun-auto_telemetry | importlib with partial parsing | 313 | 911 | 28 | 905 | 6 | 0.6 | | benchmark | test case | Number of Packages | mean (ms) | std (ms) | baseline (ms) | overhead (ms) | overhead (%) | |:---------------------------------:|---------------------------|--------------------|:---------:|:--------:|:-------------:|:-------------:|:------------:| | ddtracerun-auto_tracing_telemetry | pkg_resources (1.x) | 15 | 324 | 8 | 274 | 50 | 18.2 | | ddtracerun-auto_tracing_telemetry | importlib | 15 | 293 | 11 | 272 | 21 | 8.3 | | ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 15 | 291 | 12 | 272 | 19 | 6.9 | | ddtracerun-auto_tracing_telemetry | importlib | 30 | 373 | 11 | 351 | 22 | 6.28 | | ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 30 | 367 | 13 | 354 | 13 | 3.6 | | ddtracerun-auto_tracing_telemetry | importlib | 45 | 376 | 8 | 355 | 21 | 5.9 | | ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 45 | 364 | 9 | 352 | 22 | 6.5 | | ddtracerun-auto_tracing_telemetry | importlib | 313 | 1010 | 80 | 960 | 50 | 5.2 | | ddtracerun-auto_tracing_telemetry | importlib with partial parsing | 313 | 910 | 20 | 873 | 37 | 4.2 | Note: redis, requests and urllib3 were included in test cases with 30 and 45 packages. These packages were patched by `ddtrace-run` and this increased the baseline by ~74ms but the overhead of telemetry observed remained consistent. The case with 313 packages patched gevent, pylons, SQLAlchemy, requests, flask, grpc, cassandra, botocore, and urllib3. This was to simulate the overhead of telemetry in a real world application with telemetry enabled. Findings from benchmarking sending telemetry events with different number of packages installed, patching integrations, and/or enabling tracing: - Using importlib instead of pkg_resources reduced the overhead of telemetry in half (~50ms to ~19ms) - The number of packages does not appear to correlate with the overhead of telemetry - the benchmarks might've been too noisy to measure the difference accurately. - creating a custom parser to retrieve package names and versions from PKG-INFO and METADATA files lead to notable performance gains with a large number of packages. - the difference appears to be within a standard deviation so more testing is required to accurately measure the difference. - Iterating on this approach might lead to better results: https://github.com/DataDog/dd-trace-py/compare/munir/benchmark-importlib...munir/tests-importlib-metadata-custom-parsing?expand=1 - These performance gains seem to be minor. It might not be work developing and maintaining a metadata parser. ## Checklist - [x] Library documentation is updated. - [x] [Corp site](https://github.com/DataDog/documentation/) documentation is updated (link to the PR). ## Reviewer Checklist - [ ] Title is accurate. - [ ] Description motivates each change. - [ ] No unnecessary changes were introduced in this PR. - [ ] PR cannot be broken up into smaller PRs. - [ ] Avoid breaking [API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces) changes unless absolutely necessary. - [ ] Tests provided or description of manual testing performed is included in the code or PR. - [ ] Release note has been added for fixes and features, or else `changelog/no-changelog` label added. - [ ] All relevant GitHub issues are correctly linked. - [ ] Backports are identified and tagged with Mergifyio. - [ ] Add to milestone.
DataDog · Jun 22, 2022 · db54098 · db54098
1 parent 33b9ac3
commit db54098
Show file tree

Hide file tree

Showing 7 changed files with 83 additions and 7 deletions.
diff --git a/ddtrace/internal/packages.py b/ddtrace/internal/packages.py
@@ -0,0 +1,34 @@
+import typing
+
+
+Distribution = typing.NamedTuple("Distribution", [("name", str), ("version", str), ("path", str)])
+
+_DISTRIBUTIONS = None  # type: typing.Optional[typing.Set[Distribution]]
+
+
+def get_distributions():
+    # type: () -> typing.Set[Distribution]
+    """returns the name and version of all distributions in a python path"""
+    global _DISTRIBUTIONS
+    if _DISTRIBUTIONS is not None:
+        return _DISTRIBUTIONS
+
+    try:
+        import importlib.metadata as importlib_metadata
+    except ImportError:
+        import importlib_metadata  # type: ignore[no-redef]
+
+    pkgs = set()
+    for dist in importlib_metadata.distributions():
+        # Get the root path of all files in a distribution
+        path = str(dist.locate_file(""))
+        # PKG-INFO and/or METADATA files are parsed when dist.metadata is accessed
+        # Optimization: we should avoid accessing dist.metadata more than once
+        metadata = dist.metadata
+        name = metadata["name"]
+        version = metadata["version"]
+        if name and version:
+            pkgs.add(Distribution(path=path, name=name, version=version))
+
+    _DISTRIBUTIONS = pkgs
+    return _DISTRIBUTIONS
diff --git a/ddtrace/internal/telemetry/data.py b/ddtrace/internal/telemetry/data.py
@@ -1,10 +1,12 @@
 import platform
 import sys
 from typing import Dict
+from typing import List
 from typing import Tuple
 
 from ddtrace.internal.compat import PY3
 from ddtrace.internal.constants import DEFAULT_SERVICE_NAME
+from ddtrace.internal.packages import get_distributions
 from ddtrace.internal.runtime.container import get_container_info
 from ddtrace.internal.utils.cache import cached
 
@@ -58,6 +60,13 @@ def _get_application(key):
     }
 
 
+def get_dependencies():
+    # type: () -> List[Dict[str, str]]
+    """Returns a unique list of the names and versions of all installed packages"""
+    dependencies = {(dist.name, dist.version) for dist in get_distributions()}
+    return [{"name": name, "version": version} for name, version in dependencies]
+
+
 def get_application(service, version, env):
     """Creates a dictionary to store application data using ddtrace configurations and the System-Specific module"""
     # We cache the application dict to reduce overhead since service, version, or env configurations

diff --git a/ddtrace/internal/telemetry/writer.py b/ddtrace/internal/telemetry/writer.py
@@ -18,6 +18,7 @@
 from ..service import ServiceStatus
 from ..utils.time import StopWatch
 from .data import get_application
+from .data import get_dependencies
 from .data import get_host_info
 
 
@@ -187,12 +188,8 @@ def app_started_event(self):
         if self._forked:
             # app-started events should only be sent by the main process
             return
-        # pkg_resources import is inlined for performance reasons
-        # This import is an expensive operation
-        import pkg_resources
-
         payload = {
-            "dependencies": [{"name": pkg.project_name, "version": pkg.version} for pkg in pkg_resources.working_set],
+            "dependencies": get_dependencies(),
             "integrations": self._flush_integrations_queue(),
             "configurations": [],
         }

diff --git a/setup.py b/setup.py
@@ -247,6 +247,7 @@ def get_exts_for(name):
         "attrs>=19.2.0",
         "six>=1.12.0",
         "typing_extensions; python_version<'3.8'",
+        "importlib_metadata; python_version<'3.8'",
     ]
     + bytecode,
     extras_require={

diff --git a/tests/internal/test_packages.py b/tests/internal/test_packages.py
@@ -0,0 +1,26 @@
+import os
+
+from ddtrace.internal.packages import get_distributions
+
+
+def test_get_distributions():
+    """use pkg_resources to validate package names and versions returned by get_distributions()"""
+    import pkg_resources
+
+    pkg_resources_ws = {pkg.project_name for pkg in pkg_resources.working_set}
+
+    importlib_pkgs = set()
+    for pkg in get_distributions():
+        assert pkg.name
+        assert pkg.version
+        assert os.path.exists(pkg.path)
+        # The package name in typing_extensions-4.x.x.dist-info/METADATA is set to `typing_extensions`
+        # this is inconsistent with the package name found in pkg_resources. The block below corrects this.
+        # The correct package name is typing-extensions.
+        if pkg.name == "typing_extensions" and "typing-extensions" in pkg_resources_ws:
+            importlib_pkgs.add("typing-extensions")
+        else:
+            importlib_pkgs.add(pkg.name)
+
+    # assert that pkg_resources and importlib.metadata return the same packages
+    assert pkg_resources_ws == importlib_pkgs
diff --git a/tests/tracer/telemetry/test_data.py b/tests/tracer/telemetry/test_data.py
@@ -5,11 +5,13 @@
 import pytest
 
 from ddtrace.internal.compat import PY3
+from ddtrace.internal.packages import get_distributions
 from ddtrace.internal.runtime.container import CGroupInfo
 from ddtrace.internal.telemetry.data import _format_version_info
 from ddtrace.internal.telemetry.data import _get_container_id
 from ddtrace.internal.telemetry.data import _get_os_version
 from ddtrace.internal.telemetry.data import get_application
+from ddtrace.internal.telemetry.data import get_dependencies
 from ddtrace.internal.telemetry.data import get_host_info
 from ddtrace.internal.telemetry.data import get_hostname
 from ddtrace.internal.telemetry.data import get_version
@@ -143,3 +145,10 @@ def test_get_container_id_when_container_does_not_exists():
     with mock.patch("ddtrace.internal.telemetry.data.get_container_info") as gci:
         gci.return_value = None
         assert _get_container_id() == ""
+
+
+def test_get_dependencies():
+    """asserts that get_dependencies and get_distributions return the same packages"""
+    pkgs_as_dicts = {(dep["name"], dep["version"]) for dep in get_dependencies()}
+    pkgs_as_distributions = {(dist.name, dist.version) for dist in get_distributions()}
+    assert pkgs_as_dicts == pkgs_as_distributions
diff --git a/tests/tracer/telemetry/test_writer.py b/tests/tracer/telemetry/test_writer.py
@@ -2,10 +2,10 @@
 
 import httpretty
 import mock
-import pkg_resources
 import pytest
 
 from ddtrace.internal.telemetry.data import get_application
+from ddtrace.internal.telemetry.data import get_dependencies
 from ddtrace.internal.telemetry.data import get_host_info
 from ddtrace.internal.telemetry.writer import TelemetryWriter
 from ddtrace.internal.telemetry.writer import get_runtime_id
@@ -100,7 +100,7 @@ def test_app_started_event(mock_time, mock_send_request, telemetry_writer):
     assert headers["DD-Telemetry-Request-Type"] == "app-started"
     # validate request body
     payload = {
-        "dependencies": [{"name": pkg.project_name, "version": pkg.version} for pkg in pkg_resources.working_set],
+        "dependencies": get_dependencies(),
         "integrations": [
             {
                 "name": "integration-t",