-
Notifications
You must be signed in to change notification settings - Fork 21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add redactionPolicies field to Jupyter Event schemas #2
Conversation
cc @kiendang. This is a pretty major refactor, so I'd love to get your eyes on this. I'd like to merge and iterate at a high velocity, so that we can get this into Jupyter Server before the 2.0 final release. Since this library isn't likely used by anyone else at this point, I think that's okay (most folks are probably using telemetry). A couple of things to note...I found that filtering on "categories" is a pretty challenging thing to maintain and confusing for users in practice. So I've replaced most of that logic with a simpler flow. I've added a required key, Also, I was seeing nasty memory leaks in from jsonschema when we didn't cache the validators for validated schemas. Those have been address in this PR. See the top comment for all the changes I made. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is exciting to see. I do think the redaction stuff complicates this significantly and, given the subjective nature of what is considered "sensitive" and by whom, I wonder if this should be configurable outside the schemas. The initial thinking (although I'm not sure I fully understand the object relationships) would be to let the Operator redact specific properties of event definitions.
Of course, permitting operators to determine redacted data can affect potential logic that wants to act on field data that is now redacted. I suppose this issue exists with the current policy mechanism anyway.
We should also nail down the redaction policy labels and their ownership.
docs/pages/configure.md
Outdated
# Explicitly list the types of events | ||
# to record and what properties or what categories | ||
# of data to begin collecting. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment doesn't seem to fit the code below it, probably due to the switch from categories to redaction policies and the fact the default behavior has also changed.
docs/pages/redaction_policies.md
Outdated
@@ -0,0 +1,5 @@ | |||
# Redacting Sensitive Data | |||
|
|||
Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce | |
Jupyter Events might possibly include sensitive data, specifically personally identifiable information (PII). To reduce |
docs/pages/schemas.md
Outdated
Jupyter Event Schemas must be valid [JSON schema](https://json-schema.org/) and can be written in valid | ||
YAML or JSON. Every schema is validated against Jupyter Event's "meta"-JSON schema, [here](). | ||
|
||
At a minimum, valid Jupyter Event schema requires have the following keys: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At a minimum, valid Jupyter Event schema requires have the following keys: | |
At a minimum, valid Jupyter Event schema requires the following keys: |
docs/pages/schemas.md
Outdated
|
||
- `$id` : a URI to identify (and possibly locate) the schema. | ||
- `version` : the schema version. | ||
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The last phrase doesn't parse for me. Take the suggested change with a grain of salt...
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data. | |
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting this event does not reveal any personal data. |
docs/pages/schemas.md
Outdated
Each property should have the following attributes: | ||
|
||
- `title` : name of the property | ||
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed. | |
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if any of its `redactionPolicies` labels are listed in the event logger's `redactedPolicies` set. |
) | ||
self._add(schema) | ||
|
||
def get(self, id: str, version: int) -> EventSchema: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is id
the name of the schema? if so, could we call these parameters name
or schema_name
? Using id
has a connotation of an identifier like an integer or UUID.
docs/pages/schemas.md
Outdated
## Redaction Policies | ||
|
||
Each property can be labelled with `redactionPolicies` field. This makes it easier to | ||
filter properties based on a category. We recommend that schema authors use valid |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
filter properties based on a category. We recommend that schema authors use valid | |
filter out properties based on a redaction policy. We recommend that schema authors use valid |
docs/pages/schemas.md
Outdated
- `category.jupyter.org/unrestricted` | ||
- `category.jupyter.org/user-identifier` | ||
- `category.jupyter.org/user-identifiable-information` | ||
- `category.jupyter.org/action-timestamp` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. If different event providers have their own definition of what is PII (including their own labels, but even perhaps not), how does an Operator:
a) determine what set of labels to add to the redactedPolicies
property on the event logger?
b) change a given property's redaction criteria because they happen to deem the current settings inadequate?
jupyter_events/schema_registry.py
Outdated
|
||
def __init__(self, schemas: dict = None, redacted_policies: list = None): | ||
self._schemas = schemas or {} | ||
self._redacted_policies = redacted_policies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I'm not following the relationship between EventLogger
, EventSchema
, and SchemaRegistry
.
Are any of these singletons? I figured SchemaRegistry
would be a singleton but now I suspect it's single-valued relative to an EventLogger
and there's an EventLogger
per what? Per event definition (e.g., content.file-saved) or event class (e.g., content)?
I'm assuming EventSchema
is a per event definition thing.
jupyter_events/schema_registry.py
Outdated
schema.validate(data) | ||
|
||
def process_event(self, id: str, version: int, data: dict) -> None: | ||
"""Validate and event and enforce an redaction policies (in place). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"""Validate and event and enforce an redaction policies (in place). | |
"""Validate an event and enforce its redaction policies (in place). |
@Zsailer - I just submitted a first review and noticed that you made a few commits during that time - so I apologize for any overlap. I think seeing an object relationship diagram of sorts would be super helpful. While poking around the web at PII definitions and event logging systems I found this Zendesk feature request - which implies the subjective nature of PII. |
Thank you for the review, @kevin-bates! This is great! Kevin and I chatted about this PR yesterday and had a good discussion about the redaction policies logic. We put together the following plan:
The thing that's challenging with "redactionPolicies" here is that there are instances where you'd want one event handler to redact data, while another handler does not. For example, in the Jupyter Server's Event system/bus, redacting data might break client-side extensions that depend on having all the data. If we can create a way for the data to never leave our system (or at least encrypt it in transit), we should be able to send non-redacted data to the event bus for these extensions while handlers that direct data outside of the jupyter server (e.g. to file) get redacted version of the data. |
Thanks @Zsailer. For completeness, I think another challenge is the subjective nature of what is sensitive. Although we don't have to deal with explicitly personal information as part of our inherent framework, it still can be encountered. For example, |
I almost wonder if we should replace |
When actually redacting the data, I agree that we need some way for people to apply their own policies to make this system useful. Otherwise, this system will become too restrictive for anyone to use. We want something flexible enough for people to use, while remaining data conscious and safe for users. |
Thanks @Zsailer and @kevin-bates, I have been monitoring this PR as well. I have a couple of thoughts here and am happy to be involved in later discussions.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good stuff, just a few comments.
Since, iirc, the redaction policy stuff will be separately implemented, should the title and description be revised? Or is the idea that this will be the vehicle for that implementation?
Thanks, Kevin!
Yeah, sorry for the lack of clarity here. I've moved all my work to #4. That's where I think we should continue review on the current work. At a later time, I'll move the redaction stuff back to this PR, because there was a lot of good comments around the redaction stuff in this PR thread. I didn't want to lose that history. For now, I'll keep the title as-is as leave the PR in draft state. I'll move the redaction policy stuff here once #4 is finished and merged. |
Highlights
redacted_policies
, to configure which properties should be plucked/redacted from the outgoing events.Public API changes
EventLogger
object uses a new trait, a list, calledredactedPolicies
. Any policies listed in this trait will be removed from emitted eventsEventLogger
emits all events unless they are explicitly redacted. This is a major difference than the previous version, where the EventLogger redacted all events unless they were explicitly listed.redactedPolicies="all"