MultimodalModels
Multimodal models process both text and images. They are commonly used for image captioning, visual question answering, and other tasks that require joint text and image processing. Multimodal models are available in the @visheratin/web-ai/multimodal subpackage.
Image-to-text models generate text from images. They are commonly used for image captioning and visual question answering.
Image-to-text models have the type ModelType.Img2Text.
Using the model identifier:
import { MultimodalModel } from "@visheratin/web-ai/multimodal";
const result = await MultimodalModel.create("blip-base");
const model = result.model;
Using the model metadata:
import { Img2TextModel, Metadata } from "@visheratin/web-ai/multimodal";
const metadata: Metadata = {
modelPaths: new Map([
[
"image-encoder",
"https://web-ai-models.org/multimodal/blip-base/encoder-quant.onnx.gz",
],
[
"text-decoder",
"https://web-ai-models.org/multimodal/blip-base/decoder-quant.onnx.gz",
],
]),
outputNames: new Map<string, string>([
["image-encoder", "last_hidden_state"],
["text-decoder", "logits"],
]),
preprocessorPath:
"https://web-ai-models.org/multimodal/blip-base/preprocessor_config.json",
tokenizerPath:
"https://web-ai-models.org/multimodal/blip-base/tokenizer.json",
tokenizerParams: {
bosTokenID: 30522,
eosTokenID: 102,
padTokenID: 0,
},
};
const model = new Img2TextModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
Img2Text models extract the image features and then use them to generate text. One useful example of such processing is image captioning. You can also specify a prefix to set the beginning of the output text, or a question for visual question answering. The output for this type of model is a string:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg";
const output = await model.process(input, "The image shows");
console.log(output.text);
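Because the second argument of process() accepts either a caption prefix or a question, visual question answering uses the same call. A minimal sketch, reusing the model and input from the example above; the question text is only an illustration:
// Ask a question about the image instead of providing a caption prefix.
const question = "How many lanes does the road have?";
const answer = await model.process(input, question);
console.log(answer.text);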
Zero-shot classification models classify images against arbitrary text labels without any additional training. They are commonly used for image classification and image retrieval.
Zero-shot classification models have the type ModelType.ZeroShotClassification.
Using the model identifier:
import { MultimodalModel } from "@visheratin/web-ai/multimodal";
const result = await MultimodalModel.create("clip-base");
const model = result.model;
Using the model metadata:
import {
ZeroShotClassificationModel,
Metadata,
} from "@visheratin/web-ai/multimodal";
const metadata: Metadata = {
modelPaths: new Map([
[
"model",
"https://web-ai-models.org/multimodal/clip-base/model-quant.onnx.gz",
],
]),
preprocessorPath:
"https://web-ai-models.org/multimodal/clip-base/preprocessor_config.json",
tokenizerPath:
"https://web-ai-models.org/multimodal/clip-base/tokenizer.json",
tokenizerParams: {
bosTokenID: 49406,
eosTokenID: 49407,
padTokenID: 49407,
},
};
const model = new ZeroShotClassificationModel(metadata);
const elapsed = await model.init();
console.log(elapsed);
ZeroShotClassification models output an array of predicted classes along with confidence scores in the range [0, 1], sorted by confidence in descending order. The output also includes feature vectors for the input image and the class texts. These vectors are useful for analyzing the similarity between the image and the classes. When calling the process() method, you must specify the image and the list of classes:
const input = "https://upload.wikimedia.org/wikipedia/commons/thumb/8/81/Georgia5and120loop.jpg/640px-Georgia5and120loop.jpg";
const output = await model.process(input, ["road", "street", "car", "forest"]);
for (const item of output.results) {
  console.log(item.class, item.confidence);
}
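The feature vectors mentioned above can be compared directly to see how close the image is to each class. Below is a minimal sketch of a cosine-similarity check that reuses the output from the example above; the imageFeature and textFeatures property names are assumptions and should be verified against the actual output type of the model:
// Cosine similarity between two feature vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
// Assumed properties: output.imageFeature holds the image vector and
// output.textFeatures holds one vector per class, in the input order.
const classes = ["road", "street", "car", "forest"];
classes.forEach((cls, i) => {
  console.log(cls, cosine(output.imageFeature, output.textFeatures[i]));
});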