Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add redactionPolicies field to Jupyter Event schemas #2

Merged
merged 22 commits into from
Aug 11, 2022

Conversation

Zsailer
Copy link
Member

@Zsailer Zsailer commented Jul 14, 2022

Highlights

  • Replaces the required field "categories" with "redactionPolicies" for all schemas to explicitly redact fields in a event that shouldn't be emitted.
  • A new required field on all properties called "unredactedPolicies" to explicity state what fields are emitted.
  • A new setting/trait (list) in the EventLogger, redacted_policies, to configure which properties should be plucked/redacted from the outgoing events.

Public API changes

  • "categories" is no longer used for redacting sensitive information. Instead, the explicit key "redactionPolicies" is used to remove fields from an event.
    • To redact these fields, the EventLogger object uses a new trait, a list, called redactedPolicies. Any policies listed in this trait will be removed from emitted events
    • By default, the EventLogger emits all events unless they are explicitly redacted. This is a major difference than the previous version, where the EventLogger redacted all events unless they were explicitly listed.
    • You can prevent all events from being emitted by setting redactedPolicies="all"
  • "allowed_schemas", "allowed_categories", and "allowed_properties", have been removed from the EventLogger.
    • All registered event types will be emitted.
    • To redact any data, use the "redactedPolicies" field
    • The expectation is that consumers will use filters in the logging handlers to listen for specific events.
  • "redactionPolicies" field is required on every schema and (nested) property object in a Jupyter event schema to pass validation.

@Zsailer Zsailer marked this pull request as draft July 14, 2022 20:26
@blink1073 blink1073 added the enhancement New feature or request label Jul 14, 2022
@Zsailer
Copy link
Member Author

Zsailer commented Jul 21, 2022

cc @kiendang. This is a pretty major refactor, so I'd love to get your eyes on this. I'd like to merge and iterate at a high velocity, so that we can get this into Jupyter Server before the 2.0 final release. Since this library isn't likely used by anyone else at this point, I think that's okay (most folks are probably using telemetry).

A couple of things to note...I found that filtering on "categories" is a pretty challenging thing to maintain and confusing for users in practice. So I've replaced most of that logic with a simpler flow. I've added a required key, redactionPolicies, to every schema property. Then added a new List configurable trait on the EventLogger called redacted_policies which is a list of redaction policies to remove.

Also, I was seeing nasty memory leaks in from jsonschema when we didn't cache the validators for validated schemas. Those have been address in this PR. See the top comment for all the changes I made.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is exciting to see. I do think the redaction stuff complicates this significantly and, given the subjective nature of what is considered "sensitive" and by whom, I wonder if this should be configurable outside the schemas. The initial thinking (although I'm not sure I fully understand the object relationships) would be to let the Operator redact specific properties of event definitions.

Of course, permitting operators to determine redacted data can affect potential logic that wants to act on field data that is now redacted. I suppose this issue exists with the current policy mechanism anyway.

We should also nail down the redaction policy labels and their ownership.

Thanks for getting this going @Zsailer and @kiendang!

Comment on lines 28 to 30
# Explicitly list the types of events
# to record and what properties or what categories
# of data to begin collecting.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment doesn't seem to fit the code below it, probably due to the switch from categories to redaction policies and the fact the default behavior has also changed.

@@ -0,0 +1,5 @@
# Redacting Sensitive Data

Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Jupyter Events might possible include sensitive data, specifically personally identifiable information (PII). To reduce
Jupyter Events might possibly include sensitive data, specifically personally identifiable information (PII). To reduce

Jupyter Event Schemas must be valid [JSON schema](https://json-schema.org/) and can be written in valid
YAML or JSON. Every schema is validated against Jupyter Event's "meta"-JSON schema, [here]().

At a minimum, valid Jupyter Event schema requires have the following keys:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
At a minimum, valid Jupyter Event schema requires have the following keys:
At a minimum, valid Jupyter Event schema requires the following keys:


- `$id` : a URI to identify (and possibly locate) the schema.
- `version` : the schema version.
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last phrase doesn't parse for me. Take the suggested change with a grain of salt...

Suggested change
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting that this event happen does not reveal any person data.
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this event. The main logger can be configured to redact any events or event properties that might contain sensitive information. Set this value to `"unrestricted"` if emitting this event does not reveal any personal data.

Each property should have the following attributes:

- `title` : name of the property
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if the policy is not allowed.
- `redactionPolicies`: a list of labels representing the personal data sensitivity of this property. This field will be redacted from the emitted event if any of its `redactionPolicies` labels are listed in the event logger's `redactedPolicies` set.

)
self._add(schema)

def get(self, id: str, version: int) -> EventSchema:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is id the name of the schema? if so, could we call these parameters name or schema_name? Using id has a connotation of an identifier like an integer or UUID.

## Redaction Policies

Each property can be labelled with `redactionPolicies` field. This makes it easier to
filter properties based on a category. We recommend that schema authors use valid
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

Suggested change
filter properties based on a category. We recommend that schema authors use valid
filter out properties based on a redaction policy. We recommend that schema authors use valid

- `category.jupyter.org/unrestricted`
- `category.jupyter.org/user-identifier`
- `category.jupyter.org/user-identifiable-information`
- `category.jupyter.org/action-timestamp`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. If different event providers have their own definition of what is PII (including their own labels, but even perhaps not), how does an Operator:
a) determine what set of labels to add to the redactedPolicies property on the event logger?
b) change a given property's redaction criteria because they happen to deem the current settings inadequate?


def __init__(self, schemas: dict = None, redacted_policies: list = None):
self._schemas = schemas or {}
self._redacted_policies = redacted_policies
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm not following the relationship between EventLogger, EventSchema, and SchemaRegistry.

Are any of these singletons? I figured SchemaRegistry would be a singleton but now I suspect it's single-valued relative to an EventLogger and there's an EventLogger per what? Per event definition (e.g., content.file-saved) or event class (e.g., content)?

I'm assuming EventSchema is a per event definition thing.

schema.validate(data)

def process_event(self, id: str, version: int, data: dict) -> None:
"""Validate and event and enforce an redaction policies (in place).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"""Validate and event and enforce an redaction policies (in place).
"""Validate an event and enforce its redaction policies (in place).

@kevin-bates
Copy link
Member

@Zsailer - I just submitted a first review and noticed that you made a few commits during that time - so I apologize for any overlap.

I think seeing an object relationship diagram of sorts would be super helpful.

While poking around the web at PII definitions and event logging systems I found this Zendesk feature request - which implies the subjective nature of PII.

@Zsailer
Copy link
Member Author

Zsailer commented Jul 22, 2022

Thank you for the review, @kevin-bates! This is great!

Kevin and I chatted about this PR yesterday and had a good discussion about the redaction policies logic. We put together the following plan:

  • Split out the redaction policy and redaction enforcement into a separate PR here
  • Open a PR in Jupyter Server with some basic events (without redaction).
    • This will help us determine how to properly enable redaction policy
  • Iterate on this PR based on the needs of Jupyter Server

The thing that's challenging with "redactionPolicies" here is that there are instances where you'd want one event handler to redact data, while another handler does not. For example, in the Jupyter Server's Event system/bus, redacting data might break client-side extensions that depend on having all the data. If we can create a way for the data to never leave our system (or at least encrypt it in transit), we should be able to send non-redacted data to the event bus for these extensions while handlers that direct data outside of the jupyter server (e.g. to file) get redacted version of the data.

@kevin-bates
Copy link
Member

Thanks @Zsailer.

For completeness, I think another challenge is the subjective nature of what is sensitive. Although we don't have to deal with explicitly personal information as part of our inherent framework, it still can be encountered. For example, path may be fine in most installations, but could also contain PII in others, so it seems like we should expose the ability to control policies on various properties. Perhaps decoupling what is sensitive/redacted from the event schema may be a better fit.

@Zsailer
Copy link
Member Author

Zsailer commented Jul 22, 2022

I almost wonder if we should replace redactionPolicies with recommendedRedactionPolicies in the schema. I understand the reason for decoupling the redaction enforcement from the schema definition, but I still feel that there is some responsibility for the schema to label its properties with sensitivity labels. Maybe those labels aren't used for redaction enforcement, but they help consumers understand the risk of collecting the data in question.

@Zsailer
Copy link
Member Author

Zsailer commented Jul 22, 2022

When actually redacting the data, I agree that we need some way for people to apply their own policies to make this system useful. Otherwise, this system will become too restrictive for anyone to use. We want something flexible enough for people to use, while remaining data conscious and safe for users.

@kiendang
Copy link
Member

Thanks @Zsailer and @kevin-bates, I have been monitoring this PR as well. I have a couple of thoughts here and am happy to be involved in later discussions.

  • Agree with Zach that there might have already been solutions to this existing out there. Will be happy to take a look and report back what I find.
  • Earlier I had a much harder time understanding redaction policies compared to categories, so would be nice if we could go with a solution that not only covers all bases but is also intuitive, though clear documentation/examples help a lot here.
  • If I remember correctly Jupyter Telemetry aimed to be opt-in. If I understand correctly in this redaction approach everything by default is emitted and users have to actively set policies to filter out sensitive fields so is that in anyway conflicting with our commitment to opt-in?
  • Regarding the implementation I was working on a way to filter out categories more efficiently as part of a WIP to solve Handle category-filtered fields telemetry#61, will see if I can apply the same approach here and submit a PR later.

@Zsailer Zsailer changed the title Add new EventSchema and SchemaRegistry API Add redactionPolicies field to Jupyter Event schemas Aug 9, 2022
Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good stuff, just a few comments.

Since, iirc, the redaction policy stuff will be separately implemented, should the title and description be revised? Or is the idea that this will be the vehicle for that implementation?

docs/user_guide/defining-schema.md Outdated Show resolved Hide resolved
docs/user_guide/event-schemas.md Outdated Show resolved Hide resolved
docs/user_guide/event-schemas.md Outdated Show resolved Hide resolved
docs/user_guide/first-event.md Outdated Show resolved Hide resolved
jupyter_events/logger.py Outdated Show resolved Hide resolved
jupyter_events/logger.py Outdated Show resolved Hide resolved
jupyter_events/logger.py Outdated Show resolved Hide resolved
jupyter_events/logger.py Outdated Show resolved Hide resolved
@Zsailer
Copy link
Member Author

Zsailer commented Aug 10, 2022

Thanks, Kevin!

Since, iirc, the redaction policy stuff will be separately implemented, should the title and description be revised? Or is the idea that this will be the vehicle for that implementation?

Yeah, sorry for the lack of clarity here.

I've moved all my work to #4. That's where I think we should continue review on the current work. At a later time, I'll move the redaction stuff back to this PR, because there was a lot of good comments around the redaction stuff in this PR thread. I didn't want to lose that history.

For now, I'll keep the title as-is as leave the PR in draft state. I'll move the redaction policy stuff here once #4 is finished and merged.

@Zsailer Zsailer merged commit 88acd8e into jupyter:main Aug 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants