MultimodalModels
Multimodal models process both text and images. They are commonly used for image captioning, visual question answering, and other tasks that require joint text and image processing. Multimodal models are available in the @visheratin/web-ai/multimodal subpackage.
Image-to-text models generate text from images. They are commonly used for image captioning and visual question answering.
Image-to-text models have the type ModelType.Img2Text.
Using the model identifier:
import { MultimodalModel } from "@visheratin/web-ai/multimodal";
const result = await MultimodalModel.create("blip-base");
const model = result.model;
Using the model metadata:
import { Img2TextModel, Metadata } from "@visheratin/web-ai/multimodal";
const metadata: Metadata = {
modelPaths: new Map([
[
"image-encoder",
"https://web-ai-models.org/multimodal/blip-base/encoder-quant.onnx.gz",
],
[
"text-decoder",
"https://web-ai-models.org/multimodal/blip-base/decoder-quant.onnx.gz",
],
]),
outputNames: new Map<string, string>([
["image-encoder", "last_hidden_state"],
["text-decoder", "logits"],
]),
preprocessorPath:
"https://web-ai-models.org/multimodal/blip-base/preprocessor_config.json",
tokenizerPath:
"https://web-ai-models.org/multimodal/blip-base/tokenizer.json",
tokenizerParams: {
bosTokenID: 30522,
eosTokenID: 102,
padTokenID: 0,
},
};
const model = new Img2TextModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
Img2Text models extract the image features and then use them to generate text. One useful example of such processing is image captioning. You can also specify a prefix to set the beginning of the output text, or a question for visual question answering. The output for this type of model is a string:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg";
const output = await model.process(input, "The image shows");
console.log(output.text);
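Because the second argument of process() accepts either a caption prefix or a question, visual question answering uses the same call. A minimal sketch, reusing the model and input from the example above; the question text is only an illustration:
// Ask a question about the image instead of providing a caption prefix.
const question = "How many lanes does the road have?";
const answer = await model.process(input, question);
console.log(answer.text);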
Zero-shot classification models classify images against arbitrary text labels without any additional training. They are commonly used for image classification and image retrieval.
Zero-shot classification models have the type ModelType.ZeroShotClassification.
Using the model identifier:
import { MultimodalModel } from "@visheratin/web-ai/multimodal";
const result = await MultimodalModel.create("clip-base");
const model = result.model;
Using the model metadata:
import {
ZeroShotClassificationModel,
Metadata,
} from "@visheratin/web-ai/multimodal";
const metadata: Metadata = {
modelPaths: new Map([
[
"model",
"https://web-ai-models.org/multimodal/clip-base/model-quant.onnx.gz",
],
]),
preprocessorPath:
"https://web-ai-models.org/multimodal/clip-base/preprocessor_config.json",
tokenizerPath:
"https://web-ai-models.org/multimodal/clip-base/tokenizer.json",
tokenizerParams: {
bosTokenID: 49406,
eosTokenID: 49407,
padTokenID: 49407,
},
};
const model = new ZeroShotClassificationModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
ZeroShotClassification models output an array of predicted classes along with confidence scores in the range [0, 1], sorted by confidence in descending order. The output also includes feature vectors for the input image and the class texts. These vectors are useful for analyzing the similarity between the image and the classes. When calling the process() method, you must specify the image and the list of classes:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg";
const output = await model.process(input, ["road", "street", "car", "forest"]);
for (const item of output.results) {
  console.log(item.class, item.confidence);
}
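The feature vectors mentioned above can be compared directly to see how close the image is to each class. Below is a minimal sketch of a cosine-similarity check that reuses the output from the example above; the imageFeature and textFeatures property names are assumptions and should be verified against the actual output type of the model:
// Cosine similarity between two feature vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Assumed properties: output.imageFeature holds the image vector and
// output.textFeatures holds one vector per class, in the input order.
const classes = ["road", "street", "car", "forest"];
classes.forEach((cls, i) => {
  console.log(cls, cosine(output.imageFeature, output.textFeatures[i]));
});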