The Stability AI Model Metadata Standard, or "SAI Model Spec", is a standard specification for the format and data of metadata keys in the header of .safetensors
AI model files.
It seeks to make the content of an AI model file quickly and easily identifiable, such that downstream users (inference engines and UIs) can understand the content of a file without the need for any tricky processing or analysis.
The data within should identify clearly (A) the type and function of a model, clearly enough that an inference engine can determine how to load it correctly, and (B) user-relevant information that a UI with model-selection capability should display to the end-user.
This is specification version 1.0.1.
Versions are (approximately) SemVer: (backward-breaking).(forward-breaking).(non-breaking-change)
.
Modern AI models are distributed via .safetensors
files (no longer 'pickle' files .ckpt
/.bin
/etc), and as such this format is specified to fit within the .safetensors
specification in such a way as to prevent any breakage on older implementations. You can read about the safetensors format here. Safetensors files have a standard JSON header on all files, with a __metadata__
key reserved for dynamic user-specified data. It is the perfect place for helpful metadata that prior to now has been underutilized. This Model Metadata Specification defines a set of keys to put within the __metadata__
header, all prefixed with modelspec.
.
The following is an example of what the header of a compliant file might look like:
{
"__metadata__": {
"modelspec.sai_model_spec": "1.0.0",
"modelspec.architecture": "stable-diffusion-xl-v1-base",
"modelspec.implementation": "sgm",
"modelspec.title": "Example Model Version 1.0",
"other_keys": "..."
},
"some.tensor.key": { ... }
}
Trainers should copy pre-existing keys from the source file where relevant (architecture, etc.), auto-generate keys (modelspec version, hash, data, etc.) where relevant, and make it easy for the user to specify any user-modifiable keys (title, description, etc.).
Tools that enable a user to use/browse/search/select models should:
- Use architecture and other technical keys to determine how to load and process the model, and warn the user if there is a failure (eg architecture is incorrectly specified)
- Having a mapping of architecture IDs to code paths is a good option. In the case of image generation models, there is often a python codepath
.yaml
config that can be associated with the ID.
- Having a mapping of architecture IDs to code paths is a good option. In the case of image generation models, there is often a python codepath
- Apply relevant keys where logical (eg
resolution
in image models should be applied as the default resolution for images made with that model) - Display keys to user where relevant (eg
title
andthumbnail
shown in model listings,description
should be accessible easily,usage_hint
should be readily visible, ...) - Allow the user to modify metadata when appropriate (if a user is running the model at home, they may wish to edit in their own description or usage hints, or replace the thumbnail with one more meaningful to them)
- If searching is possible within the relevant tool, this metadata should be provided as searchable content
Quick facts:
- All keys added per this spec must have the prefix
modelspec.
prepended to them, for example thetitle
key is written within the actual data-file asmodelspec.title
. - Keys not part of this spec may exist or not at end-users discretion, eg. to meet other standards or apply other data useful to more specific cases.
- Some keys are specific to the category of model, and are listed separately below.
This specification defines 3 categories of key: MUST, SHOULD, CAN
- "MUST" indicates that a key is required for a model to obey this format specification. If the key is not present, the model does not include this metadata. Downstream consumers (inferencers) are encouraged in such a case to assume the model predates this spec (eg Stable Diffusion v1 model). (Nothing technically prevents an inferencer from supporting a subset of keys, eg reading just the
title
, without obeying full spec beyond that if they so choose). - "SHOULD" indicates that a key is considered valuable and should be used, but if it is missing, that’s okay, and a downstream consumer should fill in a default or ignore it.
- "CAN" indicates a key is defined by the spec but is not expected in most models.
Name | Type | Description | Examples |
---|---|---|---|
sai_model_spec |
MUST | Mandatory identifier key, indicates the presence and version of this specification. Trainer tools that support the spec should automatically emit this key, set to the version they support. | 1.0.0 |
architecture |
MUST | The specific classifier of the model's architecture, must be unique between models that have different inferencing requirements (so for example SDV2-512 and SDv2-768-v are different enough that the distinction must be marked here, as the code must behave different to support it). Simple finetunes of a model do not require a unique class, as the inference code does not have to change to support it. See architecture ID listing below this table for specific examples. The / slash symbol is defined as an optional meaningful separator. For example, stable-diffusion-v1/lora indicates that it is a LoRA trained to be applied to a stable-diffusion-v1 model. Implementations are not required to parse this separator unless it is useful to them to process it. |
stable-diffusion-v1 , stable-diffusion-xl-v1-base , stable-diffusion-xl-v1-refiner , stable-diffusion-v1/lora , stable-diffusion-v1/textual-inversion , gpt-neo-x |
implementation |
MUST | A reliably static string that identifies the standard implementation codebase. The model's tensor keys generally will match variable names within the official codebase. This can be a named like sgm or a GitHub URL like https://github.com/Stability-AI/generative-models or any other string, as long as you do not change it between different models of the same format. |
sgm , diffusers |
title |
MUST | A title unique to the specific model. Generally for end-user training software, the user should provide this. If they do not, one can be provided as just eg the original file name or training run name. Inference UIs are encouraged to display this title to users in any model selector tools. | Stable Diffusion XL 1.0 Base , My Example LoRA |
description |
SHOULD | A user-friendly textual description of the model. This may describe what the model is trained on, what its capabilities are, or specific data like trigger words for a small SD finetunes. This field is permitted to contain very long sections of text, with paragraphs and etc. Inference UIs are encouraged to make this description visible-but-not-in-the-way to end users. Usage of markdown formatting is encouraged, and UIs are encouraged to format the markdown properly (displaying as plaintext is also acceptable where markdown is not possible). | Stable Diffusion XL is the next generation of Stable Diffusion, a 6.5B parameter model-ensemble that generates etc. etc. |
author |
SHOULD | The name or identity of the company or individual that created a model. Can even be a username or personal profile link. | Stability AI , MyCorp , John Doe , github.com/example |
date |
SHOULD | The precise date that a model was created or published, in any ISO-8601-compliant format. | 2023-07-16 , 2023‐07‐16T18:13:38Z |
hash_sha256 |
SHOULD | A hash of all tensor content (ie excluding the header section), with 0x prefix, all lowercase, and no byte-separator symbols. Other keys with the hash_ prefix followed by a different hash algorithm (eg hash_md5 ) are expected to obey the same format rules and implement the hash algorithm named within. Future versions of the spec may change which algorithm is encouraged as SHOULD . Inferencing engines are encouraged to validate that the hash matches after loading a file and warn the user if it does not match (ie possible file corruption). Model trainers/modifiers are strongly encouraged to calculate the hash and emit it correctly automatically whenever saving a model. This is not a MUST because hash algorithms may change with time, and the format should not be locked in to just one. |
0x123abc... |
implementation _version |
CAN | A minimum required version of the specified implementation codebase. This can be an actual version ID (eg 2.0.0 ) or a commit hash. |
2.0.0 , abc123 |
license |
CAN | If the model is under any form of license terms or restrictions, they should be clearly identified here. The model creator may at their own discretion (A) provide the name of the license, (B) provide a link to the license terms, or (C) emit the license terms in full in this slot. | CC-BY-SA-4.0 , CreativeML Open RAIL-M |
usage_hint |
CAN | Usage hint(s) for the model, where applicable. This field should be short, and just quickly describe bits of information a user might need while operating the model. Inference UIs are encouraged to make this information readily visible to the user when it is present. For example, a small SD finetune model would use this to list trigger words. | Trigger word: mypetcat , Always use <user> and <assistant> prefixes! |
thumbnail |
CAN | A (very small!) thumbnail icon in data-image format to be provided as a preview in inference UIs. Note that safetensors headers usually occupy a few hundreds of kilobytes, and don't get officially limited until 100 megabytes, so a small jpeg in data-image format does not significantly increase the size. 256x256 is a recommended size and aspect ratio (square). |
data:image/jpeg; base64,abc123… |
tags |
CAN | An optional user-specified comma-separated list of category/tag labels for a finetuned model. A model trained on a specific person would be named after that person, but might be tagged as simply Person . Model listings can use this key when present to organize and allow easy filtering by tag. When in doubt on what tag(s) to use, the model maker is encouraged to look at the category label on other similar models, or choose a new label if they're the first to make a model in a category. |
Person , Object , Style , Person,Celebrity,Video Game |
merged_from |
CAN | If the model was created by merging other models, you may provide a comma-separated list of the source models here. More details about merging or creation process may be included in the description key or in nonstandard keys. Models that do not provide this key are presumed to have been uniquely trained rather than merged. |
Stable Diffusion XL 1.0 Base, My Overburned Model, My New Model |
Name | Type | Description | Examples |
---|---|---|---|
resolution |
MUST | The base resolution an image generator is intended to work at, in (width)x(height) format. This does not need to account for aspect ratios. Future image generator models of a class that are able to handle any resolution may omit this key. Current generation Stable Diffusion models should always have this key. Note that adapter and component models can leave this off. |
512x512 , 1024x1024 |
trigger_phrase |
CAN | For image generation adapter models (eg LoRA) especially, if a model is trained to heavily require a phrase, it should be placed here. Inference UIs are welcomed to auto-emit this phrase into the prompt if it is present (encouraged to make this behavior optional to the user where possible). | mypetcat , mymodelnamehere |
prediction_type |
CAN | In Stable Diffusion, v or epsilon . Other model classes may have their own concepts that apply. |
v , epsilon |
timestep_range |
CAN | If a model is tuned on a sub-section of possible timesteps (Timestep-Expert Models), identify it here, in the format <min>,<max> . |
500,999 , 0,499 |
encoder_layer |
CAN | (Specialty) for "clip skip" in Stable Diffusion models, or similar practice in other models like it, this can be applied where relevant to identify that a non-standard layer of an encoder model should be used (so for example value 2 in an SD model indicates clip_skip=2 should be used). |
2 |
preprocessor |
CAN | (Specialty) for "ControlNet" or similar model-adapter types that require preprocessing, this is an indicator of the preprocessing type, as a simple text identifier. Should not identify exact tool (eg "MiDaS"), just the broad type (eg "depth"). | depth , canny |
is_negative_embedding |
CAN | (Specialty) for "Textual-Inversion" or similar input-embedding types that modify prompts, this can be true to indicate that the embedding is meant for Negative Prompts, or false to indicate it's meant for (Positive) Prompts. A UI implementation may use this key to apply embeddings correctly with less user-intervention. |
true , false |
unet_dtype |
CAN | (Specialty) for UNet based models that have special DType requirements (eg incompatible with fp16 but works with bf16) for inference, a comma-separated list of known-good types. Inference engines are recommended to ensure a compatible type is used when this is specified. | bf16,fp32 |
vae_dtype |
CAN | (Specialty) for latent models with a VAE that has special DType requirements (eg incompatible with fp16 but works with bf16) for inference, a comma-separated list of known-good types. Inference engines are recommended to ensure a compatible type is used when this is specified. | bf16,fp32 |
Name | Type | Description | Examples |
---|---|---|---|
data_format |
MUST | The format the data is in - needed due to the variety of specialty formats and quantization methods (often not accurately reflected in tensor data type, as eg there are different definitions of 4bit data). | fp16 , gptq-4bit , gptq-4bit-gr128 , bnb-nf4 |
format_type |
SHOULD | What type of format the model is intended to work in (writing stories vs question-and-answer chat vs coding). Should constrain to the enumerated examples unless a new format type has been created that is not yet listed. |
general (other), writing (stories), chat (Q&A), code , technical (emits special logical behavior, eg CoT) |
language |
CAN | The primary human language(s) the model is trained to understand, in standard language code format, as a comma-separated list. | en/US , en/US, en/UK , ja/JP, en/US |
format_template |
CAN | For formats where a specific template is trained in, it should be given here, as a string that identifies %%SYSTEM%% , %%USER%% , and %%AI%% . Some templates may exclude 'system' or add additional keys. Inferencing tools are encouraged to be lenient if the format does not match expectations. |
<system>%%SYSTEM%%<user>%%USER%%<assistant>%%AI%% , ### System:\n%%SYSTEM%%\n### Human:\n%%USER%%\n### Assistant:\n%%AI%% |
The following is a list of common Architecture ID values, both to serve as a reference for implementation, and as an example for other architecture IDs to be chosen by. This is not a complete list, just several examples.
- Stable Diffusion:
stable-diffusion-v1
, or changev1
to any of:v1
,v1-inpainting
v2-512
,v2-768-v
,v2-depth
,v2-inpainting
,v2-unclip-h
,v2-unclip-l
,xl-v1-base
,xl-v1-refiner
,xl-v1-edit
,xl-turbo-v1
,v3-medium
- Stable Video Diffusion:
stable-video-diffusion-img2vid-v1
, or changev1
to any of:v1
,v0_9
- Stable Diffusion Components:
stable-diffusion-xl-v1/vae
(changestable-diffusion-xl-v1
to the base model architecture) - Stable Diffusion Adapters:
stable-diffusion-v1/lora
,stable-diffusion-v1/textual-inversion
,stable-diffusion-v1/controlnet
,stable-diffusion-v1/control-lora
(changestable-diffusion-v1
to the base model architecture) - Stable Cascade:
stable-cascade-v1-stage-a
(orb
,c
, initial release also had variants of the formb-bf16
) - Language Models:
gpt-neo-x
(Note: relevant project leads for well-known model formats are welcomed to PR additions to this list)
This model metadata standard is published by Stability AI freely to the public in the interest of creating a shared standard format for the benefit of the community as a whole.