Jupyter Telemetry Enhancement Proposal #41
A couple thoughts:
@choldgraf (@mybinder) and I were just talking about how to profile BinderHub container launches
JSON with a JSON Schema should be easy enough to integrate with a tool like sysdig, for example. Presumably there'd be sinks for the supported persistence backends. Would there be a standard interface for reviewing telemetry events and quantitative metrics from within Notebook or JupyterLab; or would users be expected to also configure Grafana / ELK / Loki / Splunk / Sentry?
I'm not at all familiar with the Wikimedia or Mozilla telemetry systems;
`app` object. They should use the core eventlogging library directly, and admins
should be able to configure it as they would a standalone application.

#### Authenticated routing service
This would be a "parallel universe" to https://github.com/jupyter/enhancement-proposals/pull/41/files#diff-5c74b6c64dfb44b841261c64623c9c6eR140 right? As in as an extension I could send events to either of these and they'd end up in the same sinks?
I'm going to defer to @yuvipanda on the JupyterHub functionality 😃
One thing that wasn't clear to me at the start of reading the JEP and was even less clear at the end: why have a router that is part of Jupyter instead of having the event sources talk directly to the event sinks? From the later parts of the proposal, this is proposed for frontend extensions. Server extensions could obviously also send stuff directly to the event sinks. Another thing I wonder about is whether it is a good idea to try to address audit trails and privacy-preserving opt-in in one proposal. Audit trail related stuff is by definition about "the user has no choices and can't be trusted", whereas user privacy respecting approaches give all the power to the users.
I wrote up https://github.com/jupyterlab/jupyterlab-telemetry/blob/master/design.md earlier which has informed a lot of choices in this, and has a ton of background material as well. Would recommend reading :) |
I've read it previously and now but I don't think it answers my questions. |
Does GDPR apply to anonymous unique IDs? Is `hash(IP, datetime)`
considered to be personally identifiable information? How could I look that
up given your username? I shouldn't assume that there's only one user
behind an IP (and so I shouldn't disclose everything for a given IP to
whoever claims that's theirs). With a one-way hash of (IP, datetime,
[entropy]) it's difficult to impossible to look up that information given
just someone's IP address.
In order to profile BinderHub launches from initial request through to
instance launch, do I need to include a username if I have a
per-launch-request unique identifier?
AFAIU, log retention for lawful purposes supersedes; at least in the United
States.
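A minimal sketch of the one-way hash of (IP, datetime, [entropy]) mentioned above, using only the Python standard library. The function name `anonymous_id`, the field encoding, and the example IP are assumptions for illustration, not part of the proposal, and the sketch does not settle whether such an ID counts as personal data.

```python
import hashlib
import secrets
from datetime import datetime, timezone


def anonymous_id(ip: str, when: datetime, entropy: bytes) -> str:
    """Return a SHA-256 hex digest over (ip, timestamp, entropy)."""
    digest = hashlib.sha256()
    for part in (ip.encode("utf-8"), when.isoformat().encode("utf-8"), entropy):
        # Length-prefix each field so that field boundaries are unambiguous.
        digest.update(len(part).to_bytes(4, "big"))
        digest.update(part)
    return digest.hexdigest()


# Hypothetical launch request. The entropy is random per request, so the ID
# cannot be recomputed later by someone who only knows the IP address.
entropy = secrets.token_bytes(16)
when = datetime(2019, 7, 6, 12, 0, tzinfo=timezone.utc)
launch_id = anonymous_id("203.0.113.7", when, entropy)
```

The same `launch_id` could then be attached to every event of one launch request, which is one way to correlate a launch from request to running instance without a username.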
Apologies, that wasn't directed at you - just a general comment to those who might not have seen it yet. |
One more thing I forgot to write down last time: I think adding a field to the messages that lets someone looking at the logs later tell whether a message was sent from a trusted or untrusted component would be super useful. This field would have to be added by a trusted component (the "router" or some other server-side component) to avoid clients faking it. The use case would be that only "trusted" messages can be part of any audit trail. Or maybe we can deal with this by having a "source" attribute that is added by a trusted component. I think for audit purposes anything the frontend sends is "useless" because it could have been tampered with by the user.
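One way to picture the "source" idea is the sketch below, in which a server-side component overwrites any client-supplied values before forwarding an event. The field names `source` and `trusted`, the `TRUSTED_SOURCES` set, and the `stamp_source` function are all hypothetical and do not come from the proposal.

```python
import copy

# Hypothetical set of components that the deployment considers trusted.
TRUSTED_SOURCES = {"server", "hub"}


def stamp_source(event: dict, source: str) -> dict:
    """Return a copy of the event with server-assigned provenance fields.

    Any "source" or "trusted" value supplied by the sender is discarded,
    so a client cannot mark its own messages as trusted.
    """
    stamped = copy.deepcopy(event)
    stamped["source"] = source
    stamped["trusted"] = source in TRUSTED_SOURCES
    return stamped


# A frontend message that tries to pass itself off as a server event.
forged = {"event_name": "cell_run", "source": "server", "trusted": True}
stamped = stamp_source(forged, "frontend")
```

Here `source` is decided by the component that received the message (for example, from the endpoint it arrived on), never by the message body.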
30-telemetry/proposal.md
Outdated
#### Open Questions

1. Is this work done on the standalone jupyter-server implementation or on the classic jupyter/notebook?
I think it should be in the jupyter/notebook package for now, since I think that's going to see active use for the next 3-5 years.
Yea we discussed this but this doc isn't updated. We do need to have some plan for porting these changes into jupyter_server. @Zsailer since you are also close to the jupyter_server WDYT?
Update: Just committed a change to re-word this open question.
Yeah, there's some effort to port PRs from notebook to jupyter_server in jupyter-server/jupyter_server#53.
I agree with Yuvi. This should go into notebook and be ported to/mirrored in jupyter_server. I think we're going to be stuck with constantly syncing/porting PRs for a while.
Re: components self-identifying as "trusted": Private key integrity may be the most challenging part of this. A JS app running in a browser (with the obfuscated or unobfuscated source available) does not have a secure enclave within which to store a cryptographic key to be used for signing messages. A JS or Python component would need to generate message signing keys which are then somehow approved as trusted. CSRF mitigations like per-request token generation may negatively affect performance if the system runs short of random bytes (entropy). There's already the Jupyter auth token, though that's not per-component and AFAIU is not designed to be used as a message signing key.
HMAC ("hash-based message authentication code") tokens are one way to mitigate the risk of CSRF (a different thing submitting a message as a trusted thing).
Because JSON message key orderings are not necessarily stable (the key order may be different if an attribute is deleted and then inserted again later, for example), the cryptographic hash or signature varies unless the message is canonicalized first.
Linked Data Signatures have (URIs for) signature suites, message canonicalization algorithms, and message digest algorithms. This makes things future-proof: instead of saying "this is jupyter_telemetry_message_format v2", you specify the proof type (which defines a canonicalizationAlgorithm, digestAlgorithm, and proofAlgorithm):
{
"@context": "https://w3id.org/identity/v1",
"title": "Hello World!",
"proof": {
"type": "RsaSignature2018",
"creator": "https://example.com/i/pat/keys/5",
"created": "2017-09-23T20:21:34Z",
"domain": "example.org",
"nonce": "2bbgh3dgjg2302d-d2b3gi423d42",
"proofValue": "eyJ0eXAiOiJK...gFWFOEjXk"
}
}
https://w3c-dvcg.github.io/ld-signatures/#signature-suites :
{
"id": "https://w3id.org/security#RsaSignature2018",
"type": "SignatureSuite",
"canonicalizationAlgorithm": "https://w3id.org/security#GCA2015",
"digestAlgorithm": "https://www.ietf.org/assignments/jwa-parameters#SHA256",
"proofAlgorithm": "https://www.ietf.org/assignments/jws-parameters#RSASSA-PSS"
}
https://web-payments.org/vocabs/security#LinkedDataSignature2015 :
{
"@context": ["https://w3id.org/security/v1", "http://json-ld.org/contexts/person.jsonld"],
"@type": "Person",
"name": "Manu Sporny",
"homepage": "http://manu.sporny.org/",
"signature": {
"@type": "LinkedDataSignature2015",
"creator": "http://manu.sporny.org/keys/5",
"created": "2015-09-23T20:21:34Z",
"signatureValue": "OGQzNGVkMzVmMmQ3ODIyOWM32MzQzNmExMgoYzI4ZDY3NjI4NTIyZTk="
}
}
HMACs use symmetric keys (pre-shared key), ... What's a good way for a component to indicate that it's trusted? |
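To make the key-ordering point concrete, here is a minimal sketch of HMAC signing over a canonicalized JSON message, assuming a pre-shared key. Sorting keys with `json.dumps` is a simplified stand-in for a real canonicalization algorithm such as the one a Linked Data signature suite names; the key and the event fields are made up for illustration.

```python
import hashlib
import hmac
import json


def canonical_bytes(message: dict) -> bytes:
    # Sorted keys and fixed separators yield the same bytes for the same
    # content, regardless of the order in which keys were inserted.
    text = json.dumps(message, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def sign(message: dict, key: bytes) -> str:
    return hmac.new(key, canonical_bytes(message), hashlib.sha256).hexdigest()


def verify(message: dict, key: bytes, tag: str) -> bool:
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(sign(message, key), tag)


key = b"pre-shared-demo-key"  # hypothetical key, for illustration only
original = {"event": "launch", "seq": 1}
reordered = {"seq": 1, "event": "launch"}
tag = sign(original, key)
```

With this, `verify(reordered, key, tag)` succeeds even though the key order differs, while any change to a value fails verification. Because the key is symmetric, anyone who can verify can also sign, which is why this does not help for code running in a browser.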
Co-Authored-By: Tim Head <[email protected]>
Hi @betatim, the router fundamentally decouples event publishers from event consumers. For example, without the router, if an event sink's interface is updated or an event sink is replaced, each event publisher will need to be updated to use the new interface. With the router, this is not an issue, since publishers still talk to the router and event sinks can be added or dropped via the telemetry_event_sinks configuration. In addition, the router abstracts common functionality that would otherwise have to be implemented by each event sink, such as the items listed in the Core Event Router section.
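The decoupling argument can be sketched in a few lines. The `EventRouter` class and its `sinks` argument below are hypothetical stand-ins for the proposed router and its telemetry_event_sinks configuration, not the actual implementation.

```python
from typing import Callable, List

# In this sketch, a sink is simply a callable that accepts one event dict.
Sink = Callable[[dict], None]


class EventRouter:
    """Fan each published event out to every configured sink."""

    def __init__(self, sinks: List[Sink]) -> None:
        self._sinks = list(sinks)

    def publish(self, event: dict) -> None:
        for sink in self._sinks:
            # Hand each sink its own copy so sinks cannot affect one another.
            sink(dict(event))


received_a: List[dict] = []
received_b: List[dict] = []

# Publishers only ever call router.publish(); they never reference a sink.
router = EventRouter([received_a.append])
router.publish({"event_name": "open_notebook"})

# Adding a sink is a configuration change; the publishing call is unchanged.
router = EventRouter([received_a.append, received_b.append])
router.publish({"event_name": "save_notebook"})
```

Shared concerns such as schema validation would sit inside `publish`, so they are written once rather than per sink.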
In terms of user privacy and transparency, this proposal is limited to making it clear to users what events are being collected, as well as having some kind of Opt-In in the JupyterLab UI. I'd be fine with having more nuanced proposals around audit trails and privacy preserving opt-in as a separate proposal. @Zsailer WDYT? |
Hi @betatim and @westurner, sorry for being late to get back. Re: components self-identifying as "trusted": these are all good points. The current implementation of the event publisher interface makes it possible for publishers to do this themselves, and also for consumers to validate the trust/integrity at their end. That said, we should consider offering ways to make this easier for publishers. jupyter/telemetry#21 has a few ideas on how to provide this functionality.
@betatim and @jaipreet-s Yes, this proposal is trying to communicate that we're injecting telemetry across various "layers" of the Jupyter stack (i.e. Kernel, Server, Lab, Hub, etc.). We want everyone to be aware of these changes without fear that "Jupyter is secretly collecting data about users". We'll provide tools for admins to inform users that data is being collected. And, like @jaipreet-s said, we'll likely provide UI in JupyterLab that allows users to have some control over event collection. We could remove the technical design plans for "consent" from this proposal and make that a separate discussion if necessary, but I don't think we should remove the language that we care about user privacy and awareness. |
I think my main point was that I'd avoid talking about user choice and audit trails in the same part of the document because they have such different requirements. They can't be reconciled, but that is fine as they are two very different things :)
That makes sense—these are really two different experiences/environments. Maybe we should split that bit into two different paragraphs (assuming that you're talking about the press-release document right now).
In both cases, we're communicating that Jupyter's stance is that administrators should be transparent with users. |
This is an example of a potential use case: Our telemetry project, ETC JupyterLab Telemetry Extension, captures user interactions and logs these messages to a specified handler. The ETC JupyterLab Telemetry Example repo gives an example of the service provided by the extension being consumed and the events being logged to Presently, we are capturing several user interactions with the Notebook:
For each event, a list of cells relevant to the event is captured as well. This is described here. The messages include a list of relevant cells and the present state of the Notebook. Cell contents that have been seen before get replaced with a cell 'ID' in order to save storage space, which allows for the state of the Notebook to be reconstructed at a later time. The reason I point that out is that there might be use cases where multiple schemas could be registered for a single event. This JSON schema matches the event messages:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"event_name": {
"type": "string"
},
"cells": {
"type": "array",
"items": [
{
"type": "object",
"properties": {
"id": {
"type": "string"
},
"index": {
"type": "integer"
}
},
"required": [
"id",
"index"
]
}
]
},
"notebook": {
"type": "object",
"properties": {
"metadata": {
"type": "object",
"properties": {
"kernelspec": {
"type": "object",
"properties": {
"display_name": {
"type": "string"
},
"language": {
"type": "string"
},
"name": {
"type": "string"
}
},
"required": [
"display_name",
"language",
"name"
]
},
"language_info": {
"type": "object",
"properties": {
"codemirror_mode": {
"type": "object",
"properties": {
"name": {
"type": "string"
},
"version": {
"type": "integer"
}
},
"required": [
"name",
"version"
]
},
"file_extension": {
"type": "string"
},
"mimetype": {
"type": "string"
},
"name": {
"type": "string"
},
"nbconvert_exporter": {
"type": "string"
},
"pygments_lexer": {
"type": "string"
},
"version": {
"type": "string"
}
},
"required": [
"codemirror_mode",
"file_extension",
"mimetype",
"name",
"nbconvert_exporter",
"pygments_lexer",
"version"
]
}
},
"required": [
"kernelspec",
"language_info"
]
},
"nbformat_minor": {
"type": "integer"
},
"nbformat": {
"type": "integer"
},
"cells": {
"type": "array",
"items": {
"type": "object",
"properties": {
"cell_type": {
"type": "string"
},
"source": {
"type": "string"
},
"metadata": {
"type": "object",
"properties": {
"trusted": {
"type": "boolean"
}
},
"required": [
"trusted"
]
},
"execution_count": {
"type": "null"
},
"outputs": {
"type": "array",
"items": {}
},
"id": {
"type": "string"
}
},
"required": [
"id"
]
}
}
},
"required": [
"metadata",
"nbformat_minor",
"nbformat",
"cells"
]
},
"seq": {
"type": "integer"
},
"notebook_path": {
"type": "string"
},
"user_id": {
"type": "string"
}
},
"required": [
"event_name",
"cells",
"notebook",
"seq",
"notebook_path",
"user_id"
]
}
Please let me know if anyone has any questions regarding our use case.
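The storage-saving step described above (replacing cell contents that have been seen before with just the cell ID) might look roughly like the sketch below. The functions `compact_cells` and `restore_cells` are hypothetical and are not taken from the ETC extension; they only rely on the schema requiring nothing but `id` for each notebook cell.

```python
from typing import Dict, List


def compact_cells(cells: List[dict], seen: Dict[str, str]) -> List[dict]:
    """Drop the source of cells whose content was already sent under that ID."""
    compacted = []
    for cell in cells:
        cell_id, source = cell["id"], cell["source"]
        if seen.get(cell_id) == source:
            compacted.append({"id": cell_id})  # unchanged: send the ID only
        else:
            seen[cell_id] = source  # new or edited: send the full content
            compacted.append({"id": cell_id, "source": source})
    return compacted


def restore_cells(cells: List[dict], store: Dict[str, str]) -> List[dict]:
    """Rebuild full cells on the consumer side from messages read in order."""
    restored = []
    for cell in cells:
        if "source" in cell:
            store[cell["id"]] = cell["source"]
        restored.append({"id": cell["id"], "source": store[cell["id"]]})
    return restored


seen: Dict[str, str] = {}
store: Dict[str, str] = {}
first = compact_cells([{"id": "a", "source": "x = 1"}], seen)
second = compact_cells(
    [{"id": "a", "source": "x = 1"}, {"id": "b", "source": "x + 1"}], seen
)
restored_first = restore_cells(first, store)
restored_second = restore_cells(second, store)
```

Reconstruction depends on the consumer processing messages in order, which is presumably what the `seq` field in the schema supports.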
Hi @Zsailer - Do you think we can close this PR now? It hasn't had active discussion for a while now :) Thanks! |
I'm going to close this enhancement proposal, as it has been mostly implemented anyways. For folks reading this in the future, the work evolved and now resides in jupyter_events. I think it's still worth opening a new JEP that describes the Jupyter Event System and documents how other projects should leverage this work going forward. In many follow-on discussions, we are aiming to put jupyter_events in many layers of the Jupyter stack. A JEP would help define best practices. Further, we're in the process of creating a schema.jupyter.org subdomain where all Jupyter Event JSON schemas should be published. |
Contains two accompanying files
cc @yuvipanda @Zsailer