HAST

Hypertext Abstract Syntax Tree format.


HAST discloses HTML as an abstract syntax tree. Abstract means not all information is stored in this tree and an exact replica of the original document cannot be re-created. Syntax Tree means syntax is present in the tree, thus an exact syntactic document can be re-created.

The reason for introducing a new “virtual” DOM is primarily:

  • The DOM is very heavy to implement outside of the browser; a lean, stripped-down virtual DOM can be used everywhere
  • Most virtual DOMs do not focus on ease of use in transformations
  • Other virtual DOMs cannot represent the syntax of HTML in its entirety (think comments, document types, and character data)
  • Neither HTML nor virtual DOMs focus on positional information

HAST is a subset of Unist and implemented by rehype.

This document may not be released. See releases for released documents. The latest released version is 2.2.0.

List of Utilities

See the List of Unist Utilities for projects which work with HAST nodes too.

Related HTML Utilities

AST

Root

Root (Parent) houses all nodes.

interface Root <: Parent {
  type: "root";
}

Element

Element (Parent) represents an HTML Element. For example, a div. HAST Elements correspond to the HTML Element interface.

One element is special, and comes with another property: <template> with content. The contents of a template element are not exposed through its children, like other elements, but instead on a content property which houses a Root node.

<noscript> elements should house their tree in the same way as other elements, as if scripting was not enabled.

interface Element <: Parent {
  type: "element";
  tagName: string;
  properties: Properties;
  content: Root?;
}

For example, the following HTML:

<a href="http://alpha.com" class="bravo" download></a>

Yields:

{
  "type": "element",
  "tagName": "a",
  "properties": {
    "href": "http://alpha.com",
    "className": ["bravo"],
    "download": true
  },
  "children": []
}
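
As a rough illustration, the Element interface can be written down as TypeScript types. This is a minimal sketch: the `Hast…` type names are hypothetical and chosen for this example only, and comments and doctypes are left out for brevity. It shows the `<a>` element from the HTML example above, plus a `<template>` whose tree sits on `content` rather than in `children`.

```typescript
// Hypothetical type names for this sketch; the spec only defines the IDL above.
type HastPropertyValue =
  | string
  | number
  | boolean
  | Array<string | number>
  | null
  | undefined;

interface HastProperties {
  [name: string]: HastPropertyValue;
}

interface HastText {
  type: "text";
  value: string;
}

interface HastElement {
  type: "element";
  tagName: string;
  properties: HastProperties;
  children: Array<HastElement | HastText>;
  content?: HastRoot; // only meaningful on <template>
}

interface HastRoot {
  type: "root";
  children: Array<HastElement | HastText>;
}

// <a href="http://alpha.com" class="bravo" download></a>
const link: HastElement = {
  type: "element",
  tagName: "a",
  properties: {
    href: "http://alpha.com",
    className: ["bravo"],
    download: true,
  },
  children: [],
};

// <template>Delta</template>: the tree lives on `content`, not in `children`.
const template: HastElement = {
  type: "element",
  tagName: "template",
  properties: {},
  children: [],
  content: { type: "root", children: [{ type: "text", value: "Delta" }] },
};
```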

Properties

A dictionary of property names to property values. Most virtual DOMs require a disambiguation between attributes and properties. HAST does not and defers this to compilers.

interface Properties {}

Property names

Property names are keys on properties objects and reflect HTML, SVG, ARIA, XML, XMLNS, or XLink attribute names. Often, they have the same value as the corresponding attribute (for example, id is a property name reflecting the id attribute name), but there are some notable differences.

These rules aren’t simple. Use hastscript (or property-information directly) to help.

The following rules are used to disambiguate the names of attributes and their corresponding HAST property name. These rules are based on how ARIA is reflected in the DOM, and differ from how some (older) HTML attributes are reflected in the DOM.

  1. Any name referencing a combination of multiple words (such as “stroke miter limit”) becomes a camel-cased property name capitalising each word boundary. This includes combinations that are sometimes written as several words. For example, stroke-miterlimit becomes strokeMiterLimit, autocorrect becomes autoCorrect, and allowfullscreen becomes allowFullScreen.
  2. Any name that can be hyphenated becomes a camel-cased property name capitalising each boundary. For example, “read-only” becomes readOnly.
  3. Compound words that are not used with spaces or hyphens are treated as a normal word and the previous rules apply. For example, “placeholder”, “strikethrough”, and “playback” stay the same.
  4. Acronyms in names are treated as a normal word and the previous rules apply. For example, itemid becomes itemId and bgcolor becomes bgColor.

Exceptions

Some jargon is seen as one word even though it may not be seen as such by dictionaries. For example, nohref becomes noHref, playsinline becomes playsInline, and accept-charset becomes acceptCharset.

The HTML attributes class and for respectively become className and htmlFor in alignment with the DOM. No other attributes gain different names as properties, other than a change in casing.
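
Because word boundaries cannot be derived from an attribute name alone, a converter needs a lookup table. The following is a minimal sketch, not how property-information works internally: `knownPropertyNames` and `toPropertyName` are hypothetical names, the table holds only names mentioned in this document, and the fallback simply capitalises the letter after each hyphen.

```typescript
// Hypothetical table: only attribute names that appear in this document.
const knownPropertyNames: { [attribute: string]: string } = {
  class: "className",
  for: "htmlFor",
  "stroke-miterlimit": "strokeMiterLimit",
  autocorrect: "autoCorrect",
  allowfullscreen: "allowFullScreen",
  readonly: "readOnly",
  itemid: "itemId",
  bgcolor: "bgColor",
  nohref: "noHref",
  playsinline: "playsInline",
};

// Hypothetical helper: maps an attribute name to a HAST property name.
function toPropertyName(attribute: string): string {
  // Look the name up first, since word boundaries are not visible in the name.
  const known = Object.prototype.hasOwnProperty.call(knownPropertyNames, attribute)
    ? knownPropertyNames[attribute]
    : undefined;
  if (known !== undefined) {
    return known;
  }
  // Fallback: capitalise the letter after each hyphen (accept-charset -> acceptCharset).
  return attribute.replace(/-([a-z])/g, (_match: string, letter: string) =>
    letter.toUpperCase()
  );
}
```

Names that are absent from the table and contain no hyphen, such as id or placeholder, pass through unchanged. A real project would rely on the full data set in property-information instead of a hand-written table.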

Notes

The HAST rules for property names differ from how HTML is reflected in the DOM for the following attributes:

  • charoff becomes charOff (not chOff)
  • char stays char (does not become ch)
  • rel stays rel (does not become relList)
  • checked stays checked (does not become defaultChecked)
  • muted stays muted (does not become defaultMuted)
  • value stays value (does not become defaultValue)
  • selected stays selected (does not become defaultSelected)
  • allowfullscreen becomes allowFullScreen (not allowFullscreen)
  • hreflang becomes hrefLang (not hreflang)
  • autoplay becomes autoPlay (not autoplay)
  • autocomplete becomes autoComplete (not autocomplete)
  • autofocus becomes autoFocus (not autofocus)
  • enctype becomes encType (not enctype)
  • formenctype becomes formEncType (not formEnctype)
  • vspace becomes vSpace (not vspace)
  • hspace becomes hSpace (not hspace)
  • lowsrc becomes lowSrc (not lowsrc)

Property values

Property values should reflect the data type determined by their property name. For example, the following HTML <div hidden></div> contains a hidden (boolean) attribute, which is reflected as a hidden property name set to true (boolean) as value in HAST, and <input minlength="5">, which contains a minlength (valid integer) attribute, is reflected as a property minLength set to 5 (number) in HAST.

In JSON, the value null must be treated as if the property was not included. In JavaScript, both null and undefined must be similarly ignored.
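
In practice this means a consumer can filter such properties out before doing anything else. A minimal sketch, assuming a hypothetical helper named `presentPropertyNames`:

```typescript
// Hypothetical helper: returns the names of properties that count as present.
function presentPropertyNames(properties: { [name: string]: unknown }): string[] {
  return Object.keys(properties).filter((key) => {
    const value = properties[key];
    // Both null and undefined mean "treat as if the property was not included".
    return value !== null && value !== undefined;
  });
}

// `hidden` and `title` are ignored; `id` and `download` remain.
const presentNames = presentPropertyNames({
  id: "alpha",
  hidden: null,
  title: undefined,
  download: true,
});
```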

The DOM is strict in reflecting those properties; HAST is not. Where the DOM treats <div hidden=no></div> as having a true (boolean) value for the hidden attribute, and <img width="yes"> as having a 0 (number) value for the width attribute, HAST should reflect these values as 'no' and 'yes', respectively.

The reason for this is to allow plug-ins and utilities to inspect these non-standard values.

The DOM also specifies comma- and space-separated lists as attribute values. In HAST, these should be treated as ordered lists. For example, <div class="alpha bravo"></div> is represented as ['alpha', 'bravo'].

There’s no special format for style.
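
The value rules above can be sketched as a single conversion function. This is an illustration under assumptions, not a reference implementation: `ValueKind` and `toPropertyValue` are hypothetical names, the caller is assumed to already know the data type for a property, and treating a boolean attribute whose value repeats its own name (such as hidden="hidden") as true is an assumption taken from how HTML defines boolean attributes, not from this document.

```typescript
// Hypothetical classification of a property's data type.
type ValueKind = "boolean" | "number" | "spaceSeparated" | "commaSeparated" | "string";

// Hypothetical helper: converts a raw attribute value to a HAST property value.
function toPropertyValue(
  kind: ValueKind,
  name: string,
  raw: string
): string | number | boolean | string[] {
  switch (kind) {
    case "boolean":
      // `<div hidden>` yields true; a non-standard value such as "no" is kept as a string.
      return raw === "" || raw.toLowerCase() === name.toLowerCase() ? true : raw;
    case "number": {
      const trimmed = raw.trim();
      // Only plain decimal numbers are converted; "yes" stays the string "yes".
      return /^-?\d+(\.\d+)?$/.test(trimmed) ? Number(trimmed) : raw;
    }
    case "spaceSeparated":
      return raw.split(/\s+/).filter((item) => item !== "");
    case "commaSeparated":
      return raw
        .split(",")
        .map((item) => item.trim())
        .filter((item) => item !== "");
    default:
      return raw;
  }
}
```

For example, `toPropertyValue("number", "minLength", "5")` gives the number 5, while `toPropertyValue("number", "width", "yes")` keeps the string 'yes' so that plug-ins and utilities can still inspect it.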

Doctype

Doctype (Node) defines the type of the document.

interface Doctype <: Node {
  type: "doctype";
  name: string;
  public: string?;
  system: string?;
}

For example, the following HTML:

<!DOCTYPE html>

Yields:

{
  "type": "doctype",
  "name": "html",
  "public": null,
  "system": null
}
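
Going the other way, a Doctype node can be turned back into markup. The following is a minimal sketch with hypothetical names (`HastDoctype`, `doctypeToHtml`); it assumes the identifiers contain no double quotes and does no escaping.

```typescript
// Hypothetical type mirroring the Doctype interface above.
interface HastDoctype {
  type: "doctype";
  name: string;
  public?: string | null;
  system?: string | null;
}

// Hypothetical helper: serializes a Doctype node to an HTML doctype declaration.
function doctypeToHtml(node: HastDoctype): string {
  let result = "<!DOCTYPE " + node.name;
  if (node.public != null) {
    // A public identifier may be followed by a system identifier.
    result += ' PUBLIC "' + node.public + '"';
    if (node.system != null) {
      result += ' "' + node.system + '"';
    }
  } else if (node.system != null) {
    // A system identifier on its own needs the SYSTEM keyword.
    result += ' SYSTEM "' + node.system + '"';
  }
  return result + ">";
}
```

Passing the node from the example above (`name` set to "html", `public` and `system` set to null) returns `<!DOCTYPE html>`.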

Comment

Comment (Text) represents embedded information.

interface Comment <: Text {
  type: "comment";
}

For example, the following HTML:

<!--Charlie-->

Yields:

{
  "type": "comment",
  "value": "Charlie"
}

Text

TextNode (Text) represents everything that is text. Note that its type property is text, but it is different from the abstract Unist interface Text.

interface TextNode <: Text {
  type: "text";
}

For example, the following HTML:

<span>Foxtrot</span>

Yields:

{
  "type": "element",
  "tagName": "span",
  "properties": {},
  "children": [{
    "type": "text",
    "value": "Foxtrot"
  }]
}
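
Since text and comments are separate node types, a tree walk can tell them apart by the type property alone. A minimal sketch that collects the text of a subtree while skipping comments; the `Sketch…` types and `collectText` are hypothetical names for this example:

```typescript
// Hypothetical types for this sketch, covering only the nodes it needs.
interface SketchText {
  type: "text";
  value: string;
}

interface SketchComment {
  type: "comment";
  value: string;
}

interface SketchElement {
  type: "element";
  tagName: string;
  properties: { [name: string]: unknown };
  children: SketchNode[];
}

type SketchNode = SketchText | SketchComment | SketchElement;

// Hypothetical helper: concatenates all text in a subtree, ignoring comments.
function collectText(node: SketchNode): string {
  if (node.type === "text") {
    return node.value;
  }
  if (node.type === "comment") {
    return "";
  }
  // Elements contribute the text of their children, in document order.
  return node.children.map((child) => collectText(child)).join("");
}

// <span>Foxtrot<!--Charlie--></span>
const span: SketchElement = {
  type: "element",
  tagName: "span",
  properties: {},
  children: [
    { type: "text", value: "Foxtrot" },
    { type: "comment", value: "Charlie" },
  ],
};

const spanText = collectText(span); // "Foxtrot"
```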

Related

Contribute

hast is built by people just like you! Check out contribute.md for ways to get started.

This project has a Code of Conduct. By interacting with this repository, organisation, or community you agree to abide by its terms.

Want to chat with the community and contributors? Join us in Gitter!

Have an idea for a cool new utility or tool? That’s great! If you want feedback, help, or just to share it with the world you can do so by creating an issue in the syntax-tree/ideas repository!

Acknowledgments

The initial release of this project was authored by @wooorm.

Special thanks to @eush77 for their work, ideas, and incredibly valuable feedback!

Thanks to @kthjm, @KyleAMathews, @rhysd, @Rokt33r, @s1n, @Sarah-Seo, @sethvincent, and @simov for contributing commits since!

License

CC-BY-4.0 © Titus Wormer