Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow adding custom values to telemetry masking #80

Closed
bwateratmsft opened this issue Nov 17, 2021 · 19 comments
Closed

Allow adding custom values to telemetry masking #80

bwateratmsft opened this issue Nov 17, 2021 · 19 comments

Comments

@bwateratmsft
Copy link

The vscode-azureextensionui package has a feature allowing callers to add some custom values to a list that ought to be masked out of telemetry events. You can see that here: https://github.com/microsoft/vscode-azuretools/blob/main/ui/src/masking.ts

Ideally all masking logic should occur here in vscode-extension-telemetry, nothing downstream. I think it is a useful feature in the somewhat uncommon cases where a telemetry caller knows something should be masked that the shared telemetry library couldn't possibly predict.

@lramos15
Copy link
Member

So is the idea that you pass a list of strings to replace with like "REDACTED". We of course don't want to leak PII but additional bloat to the module is also something we want to avoid

@bwateratmsft
Copy link
Author

Exactly. I think that as much masking logic as is reasonable should be here, since it is shared so widely. However, I'm sure there will always be cases where the masking logic would miss something, and if it's hyper-specific, it would be better for that to live in the caller's code rather than bloat here.

Could the parameter also accept RegExp's, in addition to strings?

@bwateratmsft
Copy link
Author

The litmus test for whether the code should be here or with the caller could basically be--is the thing being masked likely to only come from that extension, or is it reasonable to think it may come from all of them?

@lramos15
Copy link
Member

Yeah I almost want to say this feature in specific is out of scope as most people won't use masking and people who do are likely to have a custom telemetry service that wraps this module.

@bwateratmsft
Copy link
Author

That's fair, though I would say my motivation is chiefly to make the path of least resistance also the good path (i.e. the most privacy-respecting path possible). If it's very easy for consumers of this package to mask PII, they're more likely to do it.

@lramos15
Copy link
Member

lramos15 commented Apr 5, 2022

I think #93 fixes this.

@lramos15 lramos15 closed this as completed Apr 5, 2022
@bwateratmsft
Copy link
Author

Thanks @lramos15, @aeisenberg, I took a look at the PR and this is great! One question; is there a way to alter/add new replacements after the reporter is constructed?

@aeisenberg
Copy link
Contributor

No. That wasn't something I considered. It did not seem to fit in with the way the TelemetryRepoter is implemented. Do you think that's a feature users will want? For my use case, all replacements have been hard coded.

@lramos15
Copy link
Member

lramos15 commented Apr 5, 2022

@bwateratmsft Not at this time no. I think the general idea was that whatever you're trying to filter is most likely a static known set or at least known before you start sending telemetry. This design decision makes a lot of sense in my opinion because you wouldn't want to sends some telemetry unfiltered and then later add filters which would not retroactively apply to the previously sent events.

@bwateratmsft
Copy link
Author

Currently we are adding masks after-the-fact in our custom logic but the regexs we are adding are pretty narrowly-scoped (e.g. to the exact Azure subscription ID of the user). It's possible that more broadly-applicable, statically-defined regexs would cover our needs. I'll investigate.

@lramos15
Copy link
Member

lramos15 commented Apr 5, 2022

Currently we are adding masks after-the-fact in our custom logic but the regexs we are adding are pretty narrowly-scoped (e.g. to the exact Azure subscription ID of the user).

I mean the telemetry reporters don't have state really. In theory you could instantiate one per azure subscription and just cache telemetry until events are ready to be filtered. This feels a bit hacky though

@lramos15
Copy link
Member

lramos15 commented May 2, 2022

Hmm @aeisenberg I'm writing tests for this now and I think I completely blanked on this really only being a key mask and not a value one. So you replace entries with the key "foo" with a replacement string. This doesn't work for the use case of wanting to replace let's say some PII in a property because you would always just be redacting say the property "foo" and in that case why not just not send that property? I don't think this is what you need @bwateratmsft

@aeisenberg
Copy link
Contributor

Here, we redact the value of a key if that key matches a regex supplied by the user, or that key is deleted if no replacement value is provided.

https://github.com/microsoft/vscode-extension-telemetry/pull/93/files#diff-ba58f348990cdb73794c3b09904e7a51fed5a9373e2aebfcbe87046206b99591R43-R49

Is your suggestion that users may want to redact the key instead of the value? How would that work if multiple keys match the regex? And what happens to the value? Perhaps I'm misunderstanding what you are asking for.

@lramos15
Copy link
Member

lramos15 commented May 2, 2022

Is your suggestion that users may want to redact the key instead of the value? How would that work if multiple keys match the regex? And what happens to the value? Perhaps I'm misunderstanding what you are asking for.

My suggestion is that users want to redact values not keys. So lets say you want to make sure an IP doesn't end up in your telemetry then you could use an ipv4 regex such as (\b25[0-5]|\b2[0-4][0-9]|\b[01]?[0-9][0-9]?)(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)){3} and regardless of the keys it would replace it with the replacement string. Maybe an optional full replacement so {myProperty: "Failed to connect to 127.0.0.1"} would become {myProperty :"Failed to connect to redacted"} or full replacement {myProperty: "redacted"}. I understand that this isn't the current implementation and this is more what I had in mind at the time but believe I did not articulate it properly or catch it during my review. Reading back it is very clear this was your intent and makes sense for the data tags AI adds when you use the keys but most of the data that is being sent by the user is known keys with unknown values via the object {properties: {...}, measures: {...}} and therefore PII might enter there that they want to redact. The module does its best to redact this but of course every extension is different. Let me know your thoughts on this approach.

@aeisenberg
Copy link
Contributor

Thanks for the clarification. This was not the implementation I had in mind when I created the PR. I was only considering keys. I do see how it can be useful to ensure privacy.

It probably wouldn't be too complex to apply the same lookup and replacement to the values, but there would be some slightly different behaviour. If the lookup matches a key, then the entire value is replaced or removed. If the lookup matches a value, then the portion that is matched is replaced or removed (possibly with ****).

There could also be two separate parameters, one for key lookups and replacements, and one for values. Personally, I think it's simpler for users to have a single parameter to describe both. What do you think?

@lramos15
Copy link
Member

@aeisenberg Could your use case be done by just value redaction? My worry is that overloading the use case could be surprising. The current key redaction is surprising to me since keys are almost always known so I wonder if it's surprising to others.

@aeisenberg
Copy link
Contributor

I'm sure we could do something. The problem is that application insights adds a bunch of keys that we know we don't want, which we have no other way of controlling. So it is safer to delete those keys instead of relying on patterns. Eg- for ai.device.osPlatform we would need to figure out what all the potential values are and write a regex that matches them. It's not impossible, but it just seems more straight forward to delete the key.

Can you think of another way of achieving the same goal?

@lramos15
Copy link
Member

Can you think of another way of achieving the same goal?

Thanks for the feedback @aeisenberg . Maybe then just a boolean to apply the regex to the key vs value is best. We should probably default to value since the privacy case is more common. Would this work for you?

@aeisenberg
Copy link
Contributor

I'd be fine with that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants