Jupyter Telemetry Enhancement Proposal #41

jaipreet-s · 2019-07-05T23:46:21Z

Contains two accompanying files:

- Press Release
- Technical proposal

westurner · 2019-07-06T00:48:27Z

A couple thoughts:

- Pluggable persistence would likely eventually be an objective
- Should folks use this event bus / messaging system for non-Jupyter application message persistence? Or "this is for logging structured metrics for Jupyter and extensions only"?

@choldgraf (@mybinder) and I were just talking about how to profile BinderHub container launches
https://twitter.com/westurner/status/1142175356880900102 :

https://binderhub.readthedocs.io/en/latest/overview.html#a-diagram-of-the-binderhub-architecture

But there's nothing that can easily profile all of the layers of the distributed stack for a given container launch request (when the image is already cached)? Maybe @sysdig?
https://kubernetes.io/docs/tasks/debug-application-cluster/resource-usage-monitoring/#sysdig

Sysdig pulls together data from system calls, Kubernetes events, Prometheus metrics, statsD, JMX, and more into a single pane that gives you a comprehensive picture of your environment.

JSON with a JSON Schema should be easy enough to integrate with a tool like sysdig, for example.

Presumably there'd be sinks for the supported persistence backends. Would there be a standard interface for reviewing telemetry events and quantitative metrics from within Notebook or JupyterLab; or would users be expected to also configure Grafana / ELK / Loki / Splunk / Sentry?

I'm not at all familiar with the Wikimedia or Mozilla telemetry systems; so, this is a JSON message store with input validation?


betatim · 2019-07-06T07:40:05Z

30-telemetry/proposal.md

+`app` object. They should use the core eventlogging library directly, and admins
+should be able to configure it as they would a standalone application.
+
+#### Authenticated routing service


This would be a "parallel universe" to https://github.com/jupyter/enhancement-proposals/pull/41/files#diff-5c74b6c64dfb44b841261c64623c9c6eR140, right? As in, as an extension I could send events to either of these and they'd end up in the same sinks?

jaipreet-s · 2019-07-10

I'm going to defer to @yuvipanda on the JupyterHub functionality 😃

betatim · 2019-07-06T07:49:14Z

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks. From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

yuvipanda · 2019-07-06T07:56:51Z

I wrote up https://github.com/jupyterlab/jupyterlab-telemetry/blob/master/design.md earlier which has informed a lot of choices in this, and has a ton of background material as well. Would recommend reading :)

betatim · 2019-07-06T08:03:16Z

I've read it previously and now but I don't think it answers my questions.

westurner · 2019-07-06T10:40:19Z

Does GDPR apply to anonymous unique IDs? Is ``hash(IP, datetime)`` considered to be personally identifiable information? How could I look that up given your username? I shouldn't assume that there's only one user behind an IP (and so I shouldn't disclose everything for a given IP to whoever claims that it's theirs). With a one-way hash of (IP, datetime, [entropy]), it's difficult or impossible to look up that information given just someone's IP address. In order to profile BinderHub launches from initial request through to instance launch, do I need to include a username if I have a per-launch-request unique identifier? AFAIU, log retention for lawful purposes supersedes, at least in the United States.
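A minimal sketch of the one-way hash idea above, assuming a per-deployment secret plays the role of the `[entropy]` term and is never written to the logs. The function name `launch_request_id`, the `|` separator, and the choice of HMAC-SHA256 are illustrative assumptions, not something the proposal specifies.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timezone

# Hypothetical per-deployment secret standing in for the "[entropy]" term.
# If it leaks, IDs can be brute-forced from the (small) space of IP addresses.
SECRET = secrets.token_bytes(32)


def launch_request_id(ip: str, when: datetime, secret: bytes = SECRET) -> str:
    """Return an opaque, one-way ID derived from (IP, datetime, secret)."""
    message = f"{ip}|{when.isoformat()}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


# Example: the same request always maps to the same ID, so events from the
# different layers of one launch can be correlated without storing the IP.
when = datetime(2019, 7, 6, 10, 40, 19, tzinfo=timezone.utc)
event_id = launch_request_id("203.0.113.7", when)
```

Whether such an ID still counts as personal data is a legal question that this sketch does not answer.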


yuvipanda · 2019-07-07T05:18:36Z

@betatim:

I've read it previously and now but I don't think it answers my questions.

Apologies, that wasn't directed at you - just a general comment to those who might not have seen it yet.

betatim · 2019-07-07T06:59:51Z

One more thing I forgot to write down last time: I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).
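A minimal sketch of that idea, assuming the server-side router is the only place where the `source` and `trusted` fields are written. The function names and field names here are hypothetical and do not come from the proposal.

```python
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def stamp_source(event: Event, source: str, trusted: bool) -> Event:
    """Return a copy of `event` with server-assigned provenance fields.

    Anything the client sent under these keys is overwritten, so a
    frontend cannot mark its own events as trusted.
    """
    stamped = dict(event)
    stamped["source"] = source
    stamped["trusted"] = trusted
    return stamped


def audit_trail(events: Iterable[Event]) -> List[Event]:
    """Keep only events that a trusted component vouched for."""
    return [e for e in events if e.get("trusted") is True]


# A frontend event that tries to claim trust, and a server-side event.
frontend_event = stamp_source(
    {"event_name": "cell_executed", "trusted": True},
    source="frontend",
    trusted=False,
)
server_event = stamp_source(
    {"event_name": "kernel_started"}, source="server", trusted=True
)
trail = audit_trail([frontend_event, server_event])
```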

yuvipanda · 2019-07-07T19:29:36Z

30-telemetry/proposal.md

+
+#### Open Questions
+
+1. Is this work done on the standalone jupyter-server implementation or on the classic jupyter/notebook?


I think it should be in the jupyter/notebook package for now, since I think that's going to see active use for the next 3-5 years.

jaipreet-s · 2019-07-10

Yeah, we discussed this, but this doc isn't updated. We do need to have some plan for porting these changes into jupyter_server. @Zsailer, since you are also close to jupyter_server, WDYT?

Update: Just committed a change to re-word this open question.

Zsailer · 2019-07-10

Yeah, there's some effort to port PRs from notebook to jupyter_server in jupyter-server/jupyter_server#53.

I agree with Yuvi. This should go into notebook and be ported to/mirrored in jupyter_server. I think we're going to be stuck with constantly syncing/porting PRs for a while.

westurner · 2019-07-09T05:34:49Z

I think adding a field to the messages that lets someone looking at the logs later tell if this message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this via having a "source" attribute that is added by a trusted component. I think for audit purposes anything that frontend sends is "useless" because that could have been tampered with by the user (I think).

Re: components self-identifying as "trusted"

Private key integrity may be the most challenging part of this. A JS app running in a browser (with the obfuscated or unobfuscated source available) does not have a secure enclave within which to store a cryptographic key to be used for signing messages. A JS or Python component would need to generate message signing keys which are then somehow approved as trusted.

CSRF mitigations like per-request token generation may negatively affect performance because there can be a shortage of randomness (entropy).
https://github.com/OWASP/CheatSheetSeries/blob/master/cheatsheets/Cross-Site_Request_Forgery_Prevention_Cheat_Sheet.md#csrf-defense-recommendations-summary

There's already the Jupyter auth token; though that's not per-component and AFAIU is not designed to be used as a message signing key.

westurner · 2019-07-09T07:05:38Z

HMAC ("hash-based message authentication code") tokens are one way to mitigate the risk of CSRF (a different thing submitting a message as a trusted thing)
https://en.wikipedia.org/wiki/HMAC

Because JSON message key orderings are not necessarily stable (the key order may be different if an attribute is deleted and then inserted again later, for example), the cryptographic hash or signature varies unless the message is canonicalized first. json.dumps(sort_keys=True) is basically a message canonicalization algorithm.
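A minimal sketch of canonicalize-then-HMAC using only the standard library. The helper names and the demo key are assumptions for illustration. Note that `sort_keys=True` only stabilizes key order within Python; a cross-language scheme (for example the JSON Canonicalization Scheme, RFC 8785) also has to pin down number and string formatting.

```python
import hashlib
import hmac
import json


def canonicalize(message: dict) -> bytes:
    """Serialize with sorted keys and fixed separators so equal dicts give equal bytes."""
    return json.dumps(
        message, sort_keys=True, separators=(",", ":"), ensure_ascii=True
    ).encode("utf-8")


def sign(message: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over the canonical form of `message`."""
    return hmac.new(key, canonicalize(message), hashlib.sha256).hexdigest()


def verify(message: dict, key: bytes, tag: str) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(sign(message, key), tag)


key = b"pre-shared-demo-key"  # hypothetical; a real key must not live in source code
a = {"event_name": "cell_executed", "seq": 1}
b = {"seq": 1, "event_name": "cell_executed"}  # same content, different key order
tag = sign(a, key)
```

Here `verify(b, key, tag)` succeeds even though `b` was built with a different key order, while any change to a value invalidates the tag.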

Linked Data Signatures have (URIs for) signature suites, message canonicalization algorithms, and message digest algorithms. This makes things future proof in that instead of saying this is jupyter_telemetry_message_format v2, you specify the proof type (which defines a canonicalizationAlgorithm, digestAlgorithm, and proofAlgorithm)
https://w3c-dvcg.github.io/ld-signatures/#terminology

{
  "@context": "https://w3id.org/identity/v1",
  "title": "Hello World!",
  "proof": {
    "type": "RsaSignature2018",
    "creator": "https://example.com/i/pat/keys/5",
    "created": "2017-09-23T20:21:34Z",
    "domain": "example.org",
    "nonce": "2bbgh3dgjg2302d-d2b3gi423d42",
    "proofValue": "eyJ0eXAiOiJK...gFWFOEjXk"
  }
}

https://w3c-dvcg.github.io/ld-signatures/#signature-suites :

{
  "id": "https://w3id.org/security#RsaSignature2018",
  "type": "SignatureSuite",
  "canonicalizationAlgorithm": "https://w3id.org/security#GCA2015",
  "digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
  "proofAlgorithm": "https://www.ietf.org/assignments/jws-parameters#RSASSA-PSS"
}

https://web-payments.org/vocabs/security#LinkedDataSignature2015 :

{
  "@context": ["https://w3id.org/security/v1", "http://json-ld.org/contexts/person.jsonld"],
  "@type": "Person",
  "name": "Manu Sporny",
  "homepage": "http://manu.sporny.org/",
  "signature": {
    "@type": "LinkedDataSignature2015",
    "creator": "http://manu.sporny.org/keys/5",
    "created": "2015-09-23T20:21:34Z",
    "signatureValue": "OGQzNGVkMzVmMmQ3ODIyOWM32MzQzNmExMgoYzI4ZDY3NjI4NTIyZTk="
  }
}

"JSON-LD Signatures with JSON Web Signatures"
https://github.com/WebOfTrustInfo/ld-signatures-python/blob/master/jld_signatures.py
"An implementation of the Linked Data Signatures specification for JSON-LD. Works in the browser and node.js."
https://github.com/digitalbazaar/jsonld-signatures/#examples
https://github.com/WebOfTrustInfo/ld-signatures-js
- Which signature suite is recommended changes over time and will change in the future.
  In order to future-proof, ld-signatures has URIs for standard signature suites:
  https://github.com/digitalbazaar/jsonld-signatures/tree/master/lib/suites
  - EcdsaKoblitzSignature2016.js
  - Ed25519Signature2018.js
  - GraphSignature2012.js
  - JwsLinkedDataSignature.js (JSON Web Signatures (JWS))
  - LinkedDataProof.js
  - LinkedDataSignature.js
  - LinkedDataSignature2015.js
  - RsaSignature2018.js

HMACs use symmetric keys (a pre-shared key); cryptographic signatures use asymmetric keys (public and private keys). In either case, if a key is kept in code and/or RAM, it's really not that secret.
https://gist.github.com/westurner/4345987bb29fca700f52163c339a270f#gistcomment-2822602

... What's a good way for a component to indicate that it's trusted?


jaipreet-s · 2019-07-10T20:02:33Z

One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks. From the later parts of the proposal this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks.

Hi @betatim ,
Thanks for the feedback!

The router fundamentally decouples event publishers from event consumers. For example, without the router, if an event sink's interface is updated or an event sink is replaced, each event publisher needs to be updated to use the new interface. With the router, this is not an issue: publishers still talk to the router, and event sinks can be added or dropped via the telemetry_event_sinks configuration.

In addition, the router abstracts common functionality that would otherwise have to be implemented by each event sink, such as the items listed in the Core Event Router section:

- Schema validation
- A mechanism for adding metadata fields
- Dropping events that are not whitelisted in a given deployment
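The decoupling described above can be sketched roughly as follows. This is not the jupyter_telemetry API: the class `EventRouter`, its methods, and the required-fields check (a stand-in for full JSON Schema validation) are hypothetical, chosen only to show publishers talking to one router while sinks are swapped behind it.

```python
from typing import Any, Callable, Dict, List, Optional, Set

Event = Dict[str, Any]
Sink = Callable[[Event], None]


class EventRouter:
    """Hypothetical router: allow-list, minimal validation, metadata, fan-out."""

    def __init__(
        self,
        allowed_events: Set[str],
        sinks: List[Sink],
        metadata: Optional[Event] = None,
    ) -> None:
        self.allowed_events = allowed_events
        self.sinks = sinks
        self.metadata = dict(metadata or {})
        self.required_fields: Dict[str, Set[str]] = {}

    def register_schema(self, event_name: str, required: Set[str]) -> None:
        # Stand-in for registering a JSON Schema: only required keys are tracked.
        self.required_fields[event_name] = set(required)

    def publish(self, event_name: str, payload: Event) -> bool:
        """Deliver to every sink; return False if the event is not allowed."""
        if event_name not in self.allowed_events:
            return False  # dropped: not whitelisted in this deployment
        missing = self.required_fields.get(event_name, set()) - set(payload)
        if missing:
            raise ValueError(f"{event_name} is missing fields: {sorted(missing)}")
        # Router-assigned fields go last so a publisher cannot override them.
        event = {**payload, **self.metadata, "event_name": event_name}
        for sink in self.sinks:
            sink(event)
        return True


received: List[Event] = []
router = EventRouter(
    allowed_events={"cell_executed"},
    sinks=[received.append],  # any callable can act as a sink
    metadata={"deployment": "demo"},
)
router.register_schema("cell_executed", {"notebook_path"})
delivered = router.publish("cell_executed", {"notebook_path": "a.ipynb"})
dropped = router.publish("notebook_scrolled", {"notebook_path": "a.ipynb"})
```

Replacing `received.append` with a different callable changes where events go without touching any publisher code.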

jaipreet-s · 2019-07-10T20:10:03Z

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

jaipreet-s · 2019-08-01T23:16:30Z

Hi @betatim and @westurner - sorry for being late to get back re: components self-identifying as "trusted"

These are all good points. The current implementation of the event publisher interface makes it possible for publishers to do this themselves, and also for consumers to validate trust/integrity on their end.

That said, we should consider offering ways to make this easier to do for publishers. jupyter/telemetry#21 has a few ideas on how to provide this functionality.

Zsailer · 2019-08-07T20:48:23Z

Another thing I wonder about is if it is a good idea to try and address audit trails and privacy preserving opt-in in one proposal? Audit trail related stuff is by definition about "the user has no choices and can't be trusted" whereas user privacy respecting approaches give all the power to the users.

In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT?

@betatim and @jaipreet-s

Yes, this proposal is trying to communicate that we're injecting telemetry across various "layers" of the Jupyter stack (i.e. Kernel, Server, Lab, Hub, etc.). We want everyone to be aware of these changes without fear that "Jupyter is secretly collecting data about users". We'll provide tools for admins to inform users that data is being collected. And, like @jaipreet-s said, we'll likely provide UI in JupyterLab that allows users to have some control over event collection.

We could remove the technical design plans for "consent" from this proposal and make that a separate discussion if necessary, but I don't think we should remove the language that we care about user privacy and awareness.

betatim · 2019-08-08T05:43:00Z

I think my main point was that I'd avoid talking about user choice and audit trails in the same part of the document because they have such different requirements. They can't be reconciled, but that is fine as they are two very different things :)

Zsailer · 2019-08-08T17:32:07Z

I'd avoid talking about user choice and audit trails in the same part of the document

That makes sense—these are really two different experiences/environments. Maybe we should split that bit into two different paragraphs (assuming that you're talking about the press-release document right now).

- One paragraph about environments where the user is offering consent for an admin/extension developer to collect data.
- Another paragraph talking about strictly controlled environments where auditing is required. In this case, Jupyter provides tools that make it easy for the environment admin to inform users that auditing is happening.

In both cases, we're communicating that Jupyter's stance is that administrators should be transparent with users.

adpatter · 2021-07-22T12:10:17Z

This is an example of a potential use case:

Our telemetry project, ETC JupyterLab Telemetry Extension, captures user interactions and logs these messages to a specified handler. The ETC JupyterLab Telemetry Example repo gives an example of the service provided by the extension being consumed and the events being logged to console.log.

Presently, we are capturing several user interactions with the Notebook:

- Active Cell Changed
- Cell Added
- Cell Executed
- Cell Removed
- Notebook Opened
- Notebook Saved
- Notebook Scrolled

For each event, a list of cells relevant to the event is captured as well. This is described here. The messages include a list of relevant cells and the present state of the Notebook. Cell contents that have been seen before are replaced with a cell 'ID' in order to save storage space, while still allowing the state of the Notebook to be reconstructed at a later time. I point that out because there might be use cases where multiple schemas could be registered for a single event.
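A rough sketch of how replacing already-seen cell contents with an ID could work, together with the matching reconstruction step. This is not the ETC extension's actual implementation: the functions `compact_cells` and `restore_cells`, and the assumption that a cell counts as "seen" when its `source` is unchanged for the same `id`, are hypothetical.

```python
from typing import Any, Dict, List

Cell = Dict[str, Any]


def compact_cells(cells: List[Cell], seen: Dict[str, str]) -> List[Cell]:
    """Replace cells whose source was already sent with an ID-only stub."""
    compacted = []
    for cell in cells:
        cell_id, source = cell["id"], cell["source"]
        if seen.get(cell_id) == source:
            compacted.append({"id": cell_id})  # contents already logged
        else:
            seen[cell_id] = source
            compacted.append({"id": cell_id, "source": source})
    return compacted


def restore_cells(compacted: List[Cell], store: Dict[str, str]) -> List[Cell]:
    """Rebuild full cells from a message, using contents from earlier messages."""
    restored = []
    for cell in compacted:
        if "source" in cell:
            store[cell["id"]] = cell["source"]
        restored.append({"id": cell["id"], "source": store[cell["id"]]})
    return restored


seen: Dict[str, str] = {}
notebook = [{"id": "a", "source": "x = 1"}, {"id": "b", "source": "print(x)"}]
first = compact_cells(notebook, seen)   # full contents on first sight
second = compact_cells(notebook, seen)  # ID-only stubs afterwards

store: Dict[str, str] = {}
replayed = [restore_cells(message, store) for message in (first, second)]
```

Messages must be replayed in order (for example by `seq`) for the reconstruction to work, since a stub only makes sense after the message that carried the full contents.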

This JSON schema matches the event messages:

{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "type": "object",
  "properties": {
    "event_name": {
      "type": "string"
    },
    "cells": {
      "type": "array",
      "items": [
        {
          "type": "object",
          "properties": {
            "id": {
              "type": "string"
            },
            "index": {
              "type": "integer"
            }
          },
          "required": [
            "id",
            "index"
          ]
        }
      ]
    },
    "notebook": {
      "type": "object",
      "properties": {
        "metadata": {
          "type": "object",
          "properties": {
            "kernelspec": {
              "type": "object",
              "properties": {
                "display_name": {
                  "type": "string"
                },
                "language": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                }
              },
              "required": [
                "display_name",
                "language",
                "name"
              ]
            },
            "language_info": {
              "type": "object",
              "properties": {
                "codemirror_mode": {
                  "type": "object",
                  "properties": {
                    "name": {
                      "type": "string"
                    },
                    "version": {
                      "type": "integer"
                    }
                  },
                  "required": [
                    "name",
                    "version"
                  ]
                },
                "file_extension": {
                  "type": "string"
                },
                "mimetype": {
                  "type": "string"
                },
                "name": {
                  "type": "string"
                },
                "nbconvert_exporter": {
                  "type": "string"
                },
                "pygments_lexer": {
                  "type": "string"
                },
                "version": {
                  "type": "string"
                }
              },
              "required": [
                "codemirror_mode",
                "file_extension",
                "mimetype",
                "name",
                "nbconvert_exporter",
                "pygments_lexer",
                "version"
              ]
            }
          },
          "required": [
            "kernelspec",
            "language_info"
          ]
        },
        "nbformat_minor": {
          "type": "integer"
        },
        "nbformat": {
          "type": "integer"
        },
        "cells": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "cell_type": {
                "type": "string"
              },
              "source": {
                "type": "string"
              },
              "metadata": {
                "type": "object",
                "properties": {
                  "trusted": {
                    "type": "boolean"
                  }
                },
                "required": [
                  "trusted"
                ]
              },
              "execution_count": {
                "type": "null"
              },
              "outputs": {
                "type": "array",
                "items": {}
              },
              "id": {
                "type": "string"
              }
            },
            "required": [
              "id"
            ]
          }
        }
      },
      "required": [
        "metadata",
        "nbformat_minor",
        "nbformat",
        "cells"
      ]
    },
    "seq": {
      "type": "integer"
    },
    "notebook_path": {
      "type": "string"
    },
    "user_id": {
      "type": "string"
    }
  },
  "required": [
    "event_name",
    "cells",
    "notebook",
    "seq",
    "notebook_path",
    "user_id"
  ]
}

Please let me know if anyone has any questions regarding our use case.

jaipreet-s · 2021-12-02T18:36:33Z

Hi @Zsailer - Do you think we can close this PR now? It hasn't had active discussion for a while now :) Thanks!

Zsailer · 2023-04-25T16:52:46Z

I'm going to close this enhancement proposal, as it has been mostly implemented anyways.

For folks reading this in the future, the work evolved and now resides in jupyter_events.

I think it's still worth opening a new JEP that describes the Jupyter Event System and documents how other projects should leverage this work going forward. In many follow-on discussions, we are aiming to put jupyter_events in many layers of the Jupyter stack. A JEP would help define best practices.

Further, we're in the process of creating a schema.jupyter.org subdomain where all Jupyter Event JSON schemas should be published.

Jupyter Telemetry Enhancement Proposal

84fdaa7

jaipreet-s mentioned this pull request Jul 5, 2019

Final review of JEP jupyter/telemetry#5

Closed


jaipreet-s and others added 2 commits July 10, 2019 12:04

Update 30-telemetry/proposal.md

f105ea1

Co-Authored-By: Tim Head <[email protected]>

Fix a few syntax errors

0800381

Update open question about classic vs juptyer_server

ac1678d

jaipreet-s mentioned this pull request Aug 1, 2019

Trusted events jupyter/telemetry#21

Open

move GDPR bit to appropriate place in user consent paragraph

1fc6415

adpatter mentioned this pull request Jul 22, 2021

The purpose of this Issue is just to give an example use case. jupyter/telemetry#65

Closed

Zsailer mentioned this pull request Mar 17, 2022

Meeting Notes 2022 jupyter-server/team-compass#15

Closed

Zsailer mentioned this pull request Aug 29, 2022

Emit events from the Contents Service jupyter-server/jupyter_server#954

Merged

Zsailer closed this Apr 25, 2023
