First draft: explainer-new. #193

Merged (3 commits) on May 31, 2023. Changes from 2 commits.
128 changes: 3 additions & 125 deletions README.md
@@ -7,135 +7,13 @@ Status:

* The Sanitizer API is currently being incubated in the
[Sanitizer API](https://github.com/WICG/sanitizer-api) [WICG](https://wicg.io/),
with the goal of bringing this as a standard into the
[W3C WebAppSec Working Group](https://www.w3.org/2011/webappsec/).
* Early implementations are available in [select web browsers](#Implementations).
with the goal of bringing this to the [WHATWG](https://whatwg.org/).
* The API is not finalized and still subject to change.

Here you can find additional information:

* The [draft specification](https://wicg.github.io/sanitizer-api/).
* A list of [questions & answers](faq.md).
* [MDN Web Docs](https://developer.mozilla.org/en-US/docs/Web/API/HTML_Sanitizer_API).
* Implementation Status:
* [Mozilla position](https://github.com/mozilla/standards-positions/issues/106),
[Chrome Status](https://www.chromestatus.com/feature/5786893650231296),
[WebKit position](https://lists.webkit.org/pipermail/webkit-dev/2021-March/031738.html).
* [Can I use 'Sanitizer API'](https://caniuse.com/mdn-api_sanitizer)?
* [Web Platform Tests](https://wpt.fyi/results/sanitizer-api?label=experimental&label=master&aligned)
([test source](https://github.com/web-platform-tests/wpt/tree/master/sanitizer-api)).
* The [Sanitizer API Playground](https://sanitizer-api.dev) is an easy way to
play with the API, if it's enabled in your browser.
* An early [W3C TAG review](https://github.com/w3ctag/design-reviews/issues/619).
* The [original explainer](explainer.md) goes into more detail about why
we are proposing this as a new standard (rather than a library). The API
proposed there is a little outdated, however.

## Implementations

If you wish to try out early Sanitizer implementations, the
[FAQ](faq.md#can-i-use-the-sanitizer-in-my-app) has you covered:

> Firefox: Go to about:config, search for the dom.security.sanitizer.enabled flag and set it to true
>
> Chromium / Chrome: Start the browser with the --enable-blink-features=SanitizerAPI flag.

## Explainer

The core API of the Sanitizer is rather simple: Take arbitrary HTML, then
parse and modify it to remove script content. The goal is to allow safe handling
of user-supplied HTML, without danger of
[Cross-Site Scripting (XSS)](https://en.wikipedia.org/wiki/Cross-site_scripting).

The Sanitizer is safe by default, which means it has built-in rules about which
markup to keep or to discard. Developers can customize the Sanitizer to suit
the needs of their applications. But the sanitization rules cannot be relaxed
below a built-in, safe baseline configuration.

The core API of the Sanitizer is this:

Example:
```js
// Every webapp has to deal with untrusted input in some form. It could be
// data off the network; from query parameters; any user inputs; or
// (sometimes) even from one's own server. Here, we use the simplest form as
// an example and get data right out of a <textarea> element:
const untrusted_input = document.querySelector("textarea").textContent;

// For our example, the goal is to safely display untrusted_input on the page:
const element = document.querySelector("#targetelement");

// In most cases, we don't want the user input to contain any markup anyhow,
// in which case the easiest and best method is to just assign it to
// another .textContent:
// (Of course, element shouldn't be a <script> element.)
element.textContent = untrusted_input;

// But what if you want (some) markup? Then, sanitize it before use. Sanitizer
// is safe by default, so the default instance will already do the job. The
// result might be ugly, or contain curse words, but it won't contain script.
element.setHTML(untrusted_input); // Default sanitization.

// All of these values for untrusted_input would have had the same result:
// <em>Hello World!</em>
element.setHTML("<em>Hello World!</em>");
element.setHTML("<script src='https://example.org/'></script><em>Hello World!</em>");
element.setHTML("<em onlick='console.log(1)'>Hello World!</em>");
```

Oftentimes, applications have additional &mdash; often stricter &mdash;
requirements beyond just preventing script execution. For example, in a certain context
an application might want to allow formatted text, but no structural or other
complex markup. To accommodate this, the API allows for creation of multiple
`Sanitizer` instances, which can be customized on creation.

Example:
```js
// The generalized form of the `.setHTML` call takes an options bag with a
// `sanitizer` value:
const sanitizer = new Sanitizer();
element.setHTML(untrusted_input, {sanitizer: sanitizer});

// We must sanitize untrusted inputs, but we may want to restrict them further
// to meet other, related design goals. Here, we'll have a Sanitizer that
// allows for character-level formatting elements, plus the class= attribute
// on any element, but nothing else.
const for_display = new Sanitizer({
  allowElements: ['span', 'em', 'strong', 'b', 'i'],
  allowAttributes: {'class': ['*']}
});

const untrusted_example = "Well, <em class=nonchalant onclick='alert(\'General Kenobi\');'><a href='https://obiwan.example/home.php'>hello there<a>!"

// Well, <em class="nonchalant"><a href='https://obiwan.example/home.php'>hello there<a>!</em>
element.setHTML(untrusted_example, {sanitizer: sanitizer});
element.setHTML(untrusted_example); // Same, since it uses a default instance.

// Well, <em class="nonchalant">hello there!</em>
element.setHTML(untrusted_example, {sanitizer: for_display});
```

It is the overarching design goal of the Sanitizer API to be safe and simple
at the same time. Therefore the API is not only safe by default, but is also
perma-safe: The Sanitizer will enforce a baseline that does not allow script
execution, even if a developer has inadvertently configured script-ish
elements or attributes to be supported.

Example:
```js
const misconfigured = new Sanitizer({
  allowElements: ["s", "strike", "span", "script"],
  allowAttributes: {"class": ["*"], "style": ["span"], "onclick": ["*"]}
});

const untrusted_input = "<span onclick='2+2'>some</span><script>2+2</script>thing";
// <span>some</span>thing
element.setHTML(untrusted_input, {sanitizer: misconfigured});

// Sanitizer will refuse to insert script content, even if you (inadvertently)
// call it on an inappropriate context element, like <script>.
document.createElement("script").setHTML("console.log(1);"); // Throws.
```
The API is still being discussed. Please see the [explainer](explainer.md) for
our current thinking.

## Taking a Step Back: The Problem We're Solving

183 changes: 68 additions & 115 deletions explainer.md
@@ -54,119 +54,72 @@ a general purpose library. These should continue to be able to use whichever
library or mechanism they prefer. However, the library should play well with
other enforcement mechanisms.



## Proposal

Note: The proposal is being developed [here](https://wicg.github.io/sanitizer-api/).


We want to develop an API that learns from the
[DOMPurify](https://github.com/cure53/DOMPurify) library. In particular:

* The core API would be a single method which sanitizes a String and returns
a DocumentFragment.

* `sanitize(DOMString value)` => `DocumentFragment`

* Other input types (e.g. Document or DocumentFragment) can also be
supported.

* Other result types (e.g. String-to-String) can also be supported with
different methods. I.e., one method per supported output type.

* To support different use cases and to keep the API extensible, the
sanitization should be configurable via an options dictionary.

* The default (without configuration) should provide safety against script
execution.

* To make it easy to review and reason about sanitizer configs, there should
be sanitizer instances for a given configuration.

* DOMPurify supports per-call and a global "default" config. Global
configuration state can be awkward to use when different dependencies
have different ideas about what the global state should be. Likewise,
per-call configs can be error prone and hard to reason about, since every
call site might be a little different.

* There seem to be a handful of common use cases. There should be sensible
default options for each of these.

### Proposed API

The basic API would be `.sanitize(value)` to produce a DocumentFragment.
Sanitizers can be constructed with a dictionary of options.
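For concreteness, a minimal sketch of this earlier shape, assuming the constructor-with-options and `sanitize()` method described above; the `allowElements` key is illustrative only:

```js
// Sketch of the earlier proposal (illustrative config keys).
const untrusted = "<em onclick='alert(1)'>Hello</em><script>alert(2)</script>";
const target = document.querySelector("#target");

// Construct a Sanitizer instance with an options dictionary, then sanitize a
// string into a DocumentFragment and insert the result.
const sanitizer = new Sanitizer({ allowElements: ["em", "b", "span"] });
const fragment = sanitizer.sanitize(untrusted); // DocumentFragment
target.replaceChildren(fragment);
```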

## Proposed API

*Context: Various disagreements over API details led to a re-design of the
Sanitizer API. This presents a new API proposal (April '23, based on #192):*

There is a 2x2 set of methods that parse and filter the resulting node tree:
On one axis they differ in the parsing context and are analogous to `innerHTML`
and `DOMParser`'s `parseFromString()`; on the other axis, they differ
in whether they enforce an XSS-focused security baseline or not. These two
aspects pair well and yield:

- `Element.setHTML(string, {options})` - Parses `string` using `this` as
context element, like assigning to `innerHTML` would; applies a filter,
while enforcing an XSS-focused baseline; and finally replaces the children
of `this` with the results.

> **Collaborator:** I'm not sure we want to directly compare parsing to
> `innerHTML`, as there will be at least two major differences:
>
>   1. Support for declarative shadow roots.
>   2. No support for XML.
>
> **Collaborator:** While technically correct, I think it helps to draw a
> comparison. Even if it's not exactly 1:1, I am expecting many developers to
> switch from `innerHTML=` to `setHTML()`.
>
> **Collaborator:** This is summarizing the design for implementers, no? I
> don't mind it saying `innerHTML`, but it should be more accurate.
>
> **Collaborator:** We agree on a softer wording: "almost like", "similar to
> `innerHTML`", etc. :) Obviously, there should be a section that explains the
> differences more closely.
>
> **Collaborator (Author):** Done. Also added a note at the bottom to be more
> explicit.
>
> **Collaborator (Author):** One question on "No support for XML": I think we
> agree that `Document.parseHTML` should not create XML documents (unlike
> `DOMParser`). Is this also meant to apply to `setHTML` when called on an
> element in an existing XML document? (I don't mind either way, but would
> like to be clear.)
>
> **Collaborator:** I meant it for both.

- `Element.setHTMLUnsafe(string, {options})` - Like above, but it does not
enforce a baseline. E.g. if the filter is configured to allow a `<script>`
element, then a `<script>` would be inserted. If no filter is configured
then no filtering takes place.
- `Document.parseHTML(string, {options})` (static method) - Creates a new
Document instance, and parses `string` as its content, like
`DOMParser.parseFromString` would. Applies a filter, while enforcing an
XSS-focused baseline. Returns the document.
- `Document.parseHTMLUnsafe(string, {options})` (static method) - Like
above, but it will not enforce a baseline or apply any filter by default.

All variants take an options dictionary with a `filter` (naming TBD) key and a
filter configuration. The options dictionary can be easily extended to accept
whatever parsing parameters make sense.

The 'safe' methods may have some built-in anti-XSS behaviours that are not
expressible by the config, e.g. dropping `javascript:`-URLs in contexts
that navigate.

The 'unsafe' methods will not apply any filtering if no explicit config is
supplied.
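
To make the split concrete, here is a minimal sketch of the four methods, assuming the method names above and a `filter` options key (naming TBD); the filter config shape is illustrative only:

```js
// Illustrative only: method names per the proposal above, option and key names TBD.
const untrusted = "<em>hi</em><script>alert(1)</script>";
const filterConfig = { allowElements: ["em", "script"] };
const element = document.querySelector("#target");

// Safe variants enforce the XSS-focused baseline: even though the filter
// config lists <script>, the baseline removes it before the result is used.
element.setHTML(untrusted, { filter: filterConfig });
const safeDoc = Document.parseHTML(untrusted, { filter: filterConfig });

// Unsafe variants use the config as written, so <script> would be kept here.
// With no filter at all, no filtering takes place.
element.setHTMLUnsafe(untrusted, { filter: filterConfig });
const rawDoc = Document.parseHTMLUnsafe(untrusted);
```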

## Major differences to previously proposed APIs:

The currently proposed API differs in a number of aspects:

- Two sets of methods, `innerHTML`-like and `DOMParser`-like.
- Since the 'sanitizer' config can now be used in both safe and unsafe ways,
it's arguably no longer a sanitizer config but a filter config.
- Enforcement of a security baseline depends on the method. The filter/sanitizer
config can now be used differently, either in a guaranteed-secure way or in a
use-config-as-written way.

## Open questions:

- Defaults: If no filter is supplied, do the safe methods have any filtering
other than the baseline?

> **Collaborator:** Can we reference #188 here?
>
> **Collaborator (Author):** Done.

- Defaults: All of these are new methods without legacy usage. Would DSD
parsing default to `true`? (Probably. Decision lies with WHATWG.)

> **Collaborator:** I think we can consider this decided.
>
> **Collaborator (Author):** Removed. (I added this as one example to the note
> about differences between the new APIs and their existing counterparts.)

- Should the filter config be a separate object, or should it be a plain
  dictionary? (A sketch of both shapes follows this list.)
  - Reasons pro dictionary:
    - Simpler.
  - Reasons pro object:
    - Allows pre-processing the config and amortizing the cost over many calls.
    - Allows adding other useful config operations, like introspection.

> **Collaborator:** This seems correct. An object here would require one of:
>
>   1. Compelling performance numbers.
>   2. A compelling operation that would only work with a pre-processed dictionary.
>
> **Collaborator (Author):** Done. Adopted this wording.

- Exact filter options syntax. I'm assuming this will follow the discussion in
#181.
- Naming is TBD. Here I'm trying to follow the preferences expressed in the
recent 'sync' meeting.
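
As a rough illustration of the dictionary-vs-object question above, a sketch of what the two shapes could look like; the class name, option key, and config syntax are placeholders pending the discussions referenced in this list:

```js
// Placeholder names throughout; the exact syntax is an open question.
const element = document.querySelector("#target");
const untrusted = "<em onclick='alert(1)'>hi</em>";

// (a) Plain dictionary: simple, but the config is re-processed on every call.
element.setHTML(untrusted, { filter: { allowElements: ["em", "b"] } });

// (b) Pre-constructed object: the config is parsed and normalized once, so
// the cost is amortized across many calls, and the object could later expose
// introspection or other helpers.
const filter = new Sanitizer({ allowElements: ["em", "b"] });
element.setHTML(untrusted, { filter });
```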

## Examples
```js
// TODO
```
```
[
  Exposed=Window,
  SecureContext
] interface Sanitizer {
  constructor(optional SanitizerConfig config = {});
  DocumentFragment sanitize(DOMString input);

  DOMString sanitizeToString(DOMString input);
  readonly attribute SanitizerConfig creationOptions;
}
```

### Example usage

A simple web app wishes to take a string (say: a name) and display it on
the page:

```
const s = new Sanitizer();
const node = document.getElementById("...");

node.innerText = "";
node.appendChild(s.sanitize(user_supplied_value));
```


### Roadmap

* Sanitizer Specification 1.0
* Supports config-less sanitization;
* Supports customization of allowlists for elements and attributes;
* The core goal is the sanitization of any markup that can cause XSS.

* Sanitizer Specification 2.0
* Supports additional configuration options, possibly stemming from DOMPurify;
* Supports custom callbacks and hooks to fine-tune sanitization results.

## FAQ

### Who would use this and why?
* Web application developers who want to allow some - but not all - HTML. This could mean developers handling Wiki pages, message boards, crypto messengers, web mailers, etc.
* Developers of browser extensions who want to secure their applications against malicious user-controlled, or even site-controlled, HTML.
* Application developers who create Electron applications and comparable tools which interpret and display HTML and JavaScript.

### Wouldn’t this be just a niche feature?
* No, according to the statistics offered by the npm.js platform, libraries such as DOMPurify are downloaded over 200 thousand times every month. DOMPurify is furthermore used from within various CDN networks for which no metrics are available at this point.
* Besides web applications, sanitizer libraries are also used in Electron applications, browser extensions and other applications making use of a browser engine.

### But this can be done on the server, can’t it? Like in the “olden days”.
* While this is correct, server-side sanitizers have a terrible track record for being bypassed. Using them is conducive to Denial of Service on the server, and one simply cannot know about the browser’s quirks without being highly knowledgeable in this particular realm.
* As a golden rule, sanitization should happen where the sanitized result is used, so that the above-noted knowledge gaps can be mitigated and various risks can be averted.

### What are the key advantages of Sanitizing in the browser?
* *Minimalistic Approach:* Various libraries, such as DOMPurify, currently need to work around browser-specific quirks. This would no longer be necessary if the implementation were directly embedded in the browser.
* *Simplicity:* This approach does not aim to create any additional complexity or introduce new data types, labels or flags; it simply aims to provide an API that allows developers to take an untrusted string, remove anything that can lead to script execution or comparable harm, and return the sanitized result, again as a string (see also [#4](https://github.com/WICG/sanitizer-api/issues/4)).
* *Bandwidth:* Sanitizer libraries are “heavy”; embedding the functionality in the browser removes the need to pull them from a server and saves bandwidth.
* *Performance:* Sanitizing markup in C/C++ is faster than doing the same in JavaScript.
* *Reusability:* Once the browser exposes a sanitizer in the DOM, it can be reused for potentially upcoming [SafeHTML](https://lists.w3.org/Archives/Public/public-webappsec/2016Jan/0113.html) implementations, [Trusted Types](https://github.com/WICG/trusted-types), secure elements and, if configurable, even be repurposed for other changes in the user-controlled HTML, for instance in connection with URL rewriting, removal of annoying UI elements and CSS sanitization.

### What if someone wants to customize the sanitization rules?
* It should be trivial to implement basic configuration options that allow customization of the default allowlist and enable developers to remove, add or completely rewrite the allowed elements and/or attributes.
* The browser's clipboard sanitizer (discussed below) already ships an allowlist, so the only remaining task would be to make it configurable.
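
As a rough illustration of what such customization could look like (the exact option names and syntax are still under discussion, and mirror the examples earlier in this repository):

```js
// Hypothetical configuration shape; not final syntax. `element` and
// `untrustedInput` are assumed to exist as in the earlier examples.
const customized = new Sanitizer({
  allowElements: ["p", "em", "strong", "a"],
  allowAttributes: { "href": ["a"] }
});
element.setHTML(untrustedInput, { sanitizer: customized });
```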

### Isn’t building a sanitizer in the browser risky and difficult?
* No, it may appear so, but in fact browsers already feature at least one sanitizer, for instance the one handling HTML clipboard content that is copied and pasted across origins. The existing puzzle pieces only need to be put together in a slightly different way and then exposed in the DOM.
* If there are any risks connected to the new process, then they are not new but rather already concern the handling of the user-generated HTML presently processed by the in-browser sanitizers. Aside from configuration parsing, which should be a trivial problem to solve, no added risks can be envisioned.

### Wait, what does secure even mean in this context?
* Calling the process secure means that a developer can expect that XSS attacks caused by user-controlled HTML, SVG, and MathML are eradicated.
* The sanitizer would remove all elements that cause script execution from the string it receives and returns.