
SPEC 11: API telemetry #1

Open
guenp opened this issue May 3, 2024 · 8 comments


guenp commented May 3, 2024

Brainstorm and discuss a SPEC to establish how to add instrumentation and telemetry to Scientific Python projects to gain insights into usage patterns. Currently, Python projects typically don't have any direct insight into how users interact with their library, what common errors they run into, or which APIs are used most (or least) frequently. The goal of this SPEC is to design a way to collect usage logs from users in a transparent, ethical, and efficient way, with minimal impact on the user experience, in order to provide project maintainers with useful metrics and insights into the performance, usability, and common user errors of the components (modules, functions, etc.) in their library.

@jarrodmillman jarrodmillman transferred this issue from scientific-python/scientific-python-hugo-theme May 3, 2024
@jarrodmillman jarrodmillman transferred this issue from scientific-python/specs May 3, 2024

drammock commented May 3, 2024

I'll drop a link here to popylar, which I think only gives stats on overall module usage (i.e., number of times imported), not granular detail about which classes or functions are called. But @arokem is going to be at the summit and might have some thoughts on more granular metrics, so it's probably worth cornering him for a chat about it!

@guenp guenp self-assigned this May 3, 2024
@betatim betatim self-assigned this May 6, 2024
@lagru lagru self-assigned this May 6, 2024

Carreau commented May 7, 2024

I'll link to scientific-python/summit-2023#17; we might want to pull out the notes from SciPy about 10 years ago. See also https://github.com/Carreau/consent_broker, which I started working on some time ago, and a related discussion in pyOpenSci/software-peer-review#183.


betatim commented May 13, 2024

Different but related: https://github.com/betatim/kamal - a tool you run locally over a code base to get statistics on which parts of a particular library's API are being used. The idea is that people can run it themselves and report the stats they get to somewhere (central?). The collected stats are fairly easy to look at, the goal being that you can convince yourself that no unwanted information is being shared. The original use case I had in mind was organisations that have private code bases but want to help a project learn which parts of its API are being used. Another idea could be running this in your project's CI and reporting back to some central place (e.g. scikit-learn and pandas run it in their CI and report back to matplotlib or numpy which parts of their API they use).
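
As a rough illustration of what this kind of static analysis can look like (this is not kamal's code; the target library name and output format are assumptions for the example), Python's ast module is enough to resolve import aliases and count attribute references across a code base:

```python
"""Minimal sketch of static API-usage counting over a code base (illustrative only).

Counts references such as ``np.linspace`` for an assumed target library.
"""
import ast
import collections
import pathlib
import sys

TARGET = "numpy"  # assumed target library for this example


def count_api_usage(root: pathlib.Path) -> collections.Counter:
    counts = collections.Counter()
    for path in root.rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that can't be parsed
        aliases = set()  # local names bound to the target library, e.g. ``np``
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    if alias.name.split(".")[0] == TARGET:
                        aliases.add(alias.asname or alias.name.split(".")[0])
            elif isinstance(node, ast.ImportFrom) and node.module and node.module.split(".")[0] == TARGET:
                for alias in node.names:
                    counts[f"{node.module}.{alias.name}"] += 1
        for node in ast.walk(tree):
            # Count attribute accesses on an alias, e.g. ``np.linspace``.
            if (
                isinstance(node, ast.Attribute)
                and isinstance(node.value, ast.Name)
                and node.value.id in aliases
            ):
                counts[f"{TARGET}.{node.attr}"] += 1
    return counts


if __name__ == "__main__":
    for name, n in count_api_usage(pathlib.Path(sys.argv[1])).most_common(20):
        print(f"{n:6d}  {name}")
```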


guenp commented Jun 4, 2024

Summarizing today's discussions during the summit and dinner with @drammock @seberg @betatim @Carreau @stefanv et al:

  • To get user consent for tracking data, a nice solution would be to require users to explicitly run a command like python -c "import tracer; tracer.enable()" or similar. This command can be described in the installation instructions in the scipy or scikit-learn docs, and users are encouraged to turn it on to support the developers. We discussed various other options as well (a message on stdout during first import, a pop-up in a Jupyter notebook or Spyder, etc.).
  • To gather the telemetry data, our main options are to (1) track usage at runtime (e.g. with a decorator) or (2) analyze code statically, e.g. with kamal or pyinstrument. The downside of (1) is that it slows down the actual code or can cause confusing bugs at runtime; the downside of (2) is that someone (or some process) needs to trigger the analysis. An idea from @betatim is to use sampling (à la py-spy) to gather the info intermittently and minimize the performance impact. @seberg suggests writing a very simple "counter" decorator that doesn't inspect any arguments; this would be very fast to do in C. We're not yet 100% decided whether (1) or (2) is the best solution, but we are leaning towards (2).
  • We have a quick-and-dirty prototype of runtime-based telemetry that uses the importlib machinery to dynamically add a decorator to a given API, as suggested by @eriknw during SciPy last year. I added the code to this gist. It's not performance-optimized - the wrapper uses inspect and serializes input args - but it's a basic demo of what this could look like. (A rough sketch of the same idea follows after this list.)
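
As an illustration of the runtime approach (this is not the gist's code: the function names and output file are made up for the example, and it deliberately skips the inspect-based argument serialization), an argument-free counter decorator applied via the importlib machinery could look roughly like this:

```python
"""Minimal sketch of runtime telemetry via an argument-free counting decorator.

Illustrative only: it wraps the public module-level functions of a target
module and records call counts locally at interpreter exit.
"""
import atexit
import collections
import functools
import importlib
import inspect
import json

CALL_COUNTS = collections.Counter()


def counted(qualname, func):
    """Wrap ``func`` so each call increments a counter; no arguments are inspected."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        CALL_COUNTS[qualname] += 1
        return func(*args, **kwargs)
    return wrapper


def instrument(module_name):
    """Import ``module_name`` and replace its public functions with counted wrappers."""
    module = importlib.import_module(module_name)
    for name, obj in list(vars(module).items()):
        if not name.startswith("_") and inspect.isfunction(obj):
            setattr(module, name, counted(f"{module_name}.{name}", obj))
    return module


def _dump_counts():
    # Counts are only written locally; any upload would be a separate, opt-in step.
    with open("telemetry_counts.json", "w", encoding="utf-8") as f:
        json.dump(dict(CALL_COUNTS), f, indent=2)


atexit.register(_dump_counts)

if __name__ == "__main__":
    stats = instrument("statistics")  # instrument a stdlib module for demonstration
    stats.mean([1, 2, 3])
    stats.median([1, 2, 3])
    print(dict(CALL_COUNTS))
```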


drammock commented Jun 4, 2024

Great summary @guenp. Also briefly discussed was how to incentivize opt-in; some ideas were:

  • bundle it with something useful folks already use (like a linter) --- only works for static analysis
  • detect anti-patterns / common inefficiencies in the code and suggest refactorings (I think it was @lagru who had this idea)
  • output a list of DOIs/citations for the packages you use in that script (could work for static or dynamic analysis)


lagru commented Jun 4, 2024

The idea about combining the tool with linting was initially brought up by @rossbar. :D


guenp commented Jun 5, 2024

Summary from chat with @crazy4pi314

  • Important to get explicit consent for tracking
  • Like the idea of making a standalone telemetry tool that is useful on its own
  • For air-gapped/non-networked usage/devices, have some way to collect locally and upload later (see the sketch after this list)
  • Command line tool can take care of everything else (telemetry locally)
  • GDPR issues if we track session IDs
  • Create an extension for IDEs that runs the tracker locally and has a checkbox for uploading to the cloud. The extension itself handles the networking for uploading. It also installs the telemetry tool if it's not already installed, and creates an environment that is kept isolated from the project itself.
  • This could go in a .devcontainer
  • Gamification! Make extension show some silly stuff. "Your code is in the top 100 for usage of this function"
  • Another idea for incentive: web hosted dashboard with stats (per user? per package)
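
As a rough sketch of the "collect locally, upload later" pattern (the spool path and endpoint URL below are placeholders, not an agreed-on design), events could be appended to a local JSONL file and only sent when the user explicitly opts in to an upload:

```python
"""Minimal sketch of local collection with deferred, opt-in upload (illustrative only)."""
import json
import pathlib
import time
import urllib.request

SPOOL = pathlib.Path.home() / ".cache" / "sp-telemetry" / "events.jsonl"
UPLOAD_URL = "https://telemetry.example.invalid/v1/events"  # placeholder endpoint


def record(event: dict) -> None:
    """Append one event to the local spool; works offline and on air-gapped machines."""
    SPOOL.parent.mkdir(parents=True, exist_ok=True)
    with SPOOL.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"ts": time.time(), **event}) + "\n")


def upload() -> None:
    """Send the spooled events only when explicitly invoked; clear the spool on success."""
    if not SPOOL.exists():
        return
    payload = SPOOL.read_text(encoding="utf-8").encode("utf-8")
    req = urllib.request.Request(
        UPLOAD_URL, data=payload, headers={"Content-Type": "application/x-ndjson"}
    )
    with urllib.request.urlopen(req) as resp:
        if resp.status == 200:
            SPOOL.unlink()


if __name__ == "__main__":
    record({"api": "numpy.linspace", "count": 3})
    # upload() is never called automatically; an IDE extension checkbox or a CLI
    # command could invoke it once the user has opted in and is back online.
```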


guenp commented Jun 6, 2024

Chat with @eriknw and @betatim:

@guenp guenp changed the title SPEC: API observability → SPEC 11: API observability Jun 6, 2024
@guenp guenp changed the title SPEC 11: API observability → SPEC 11: API telemetry Jun 6, 2024