This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

feat: add jina embedding models #757

Merged
merged 6 commits on Jul 11, 2023
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -10,10 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html)

### Added

+- Add jina embeddings suite. ([#757](https://github.com/jina-ai/finetuner/pull/757))
+
+- Add `cos_sim` helper to finetuner. ([#757](https://github.com/jina-ai/finetuner/pull/757))
+
### Removed

### Changed

+- Finetuner now always installs torch and other dependencies. ([#757](https://github.com/jina-ai/finetuner/pull/757))
+
### Fixed

### Docs
4 changes: 1 addition & 3 deletions README.md
@@ -149,9 +149,7 @@ Make sure you have Python 3.8+ installed. Finetuner can be installed via `pip` b
pip install -U finetuner
```

-If you want to encode local data with the `finetuner.encode` function, you need to install
-`"finetuner[full]"`. This includes a number of additional dependencies, which are necessary for encoding: Torch,
-Torchvision and OpenCLIP:
+If you want to submit a fine-tuning job on the cloud, please use:

```bash
pip install "finetuner[full]"
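With the `[full]` extra now tied to cloud job submission rather than local encoding, a typical flow would look like the following. This is a minimal sketch assuming the existing `finetuner.login()` and `finetuner.fit()` client API; the CSV path is hypothetical:

```python
import finetuner

# Authenticate against Jina AI Cloud before submitting a run.
finetuner.login()

# Submit a fine-tuning run. 'my-train-data.csv' is a hypothetical
# dataset of training pairs; the backbone is one of the models
# added in this PR.
run = finetuner.fit(
    model='jina-embedding-s-en-v1',
    train_data='my-train-data.csv',
)
print(run.name)  # keep the name to monitor or retrieve the run later
```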
21 changes: 12 additions & 9 deletions docs/walkthrough/choose-backbone.md
@@ -45,15 +45,18 @@ to get a list of supported models:

````{tab} text-to-text
```bash
-Finetuner backbones: text-to-text
-┏━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
-┃ name                 ┃ task         ┃ output_dim ┃ architecture ┃ description                                                                  ┃
-┡━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
-│ bert-base-en         │ text-to-text │ 768        │ transformer  │ BERT model pre-trained on BookCorpus and English Wikipedia                   │
-│ bert-base-multi      │ text-to-text │ 768        │ transformer  │ BERT model pre-trained on multilingual Wikipedia                             │
-│ distiluse-base-multi │ text-to-text │ 512        │ transformer  │ Knowledge distilled version of the multilingual Universal Sentence Encoder   │
-│ sbert-base-en        │ text-to-text │ 768        │ transformer  │ Pretrained BERT, fine-tuned on MS Marco                                      │
-└──────────────────────┴──────────────┴────────────┴──────────────┴──────────────────────────────────────────────────────────────────────────────┘
+Finetuner backbones: text-to-text
+┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
+┃ name                   ┃ task         ┃ output_dim ┃ architecture ┃ description                                                           ┃
+┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
+│ jina-embedding-s-en-v1 │ text-to-text │ 512        │ transformer  │ Text embedding model trained using Linnaeus-Clean dataset by Jina AI │
+│ jina-embedding-b-en-v1 │ text-to-text │ 768        │ transformer  │ Text embedding model trained using Linnaeus-Clean dataset by Jina AI │
+│ jina-embedding-l-en-v1 │ text-to-text │ 1024       │ transformer  │ Text embedding model trained using Linnaeus-Clean dataset by Jina AI │
+│ bert-base-en           │ text-to-text │ 768        │ transformer  │ BERT model pre-trained on BookCorpus and English Wikipedia           │
+│ bert-base-multi        │ text-to-text │ 768        │ transformer  │ BERT model pre-trained on multilingual Wikipedia                     │
+│ distiluse-base-multi   │ text-to-text │ 512        │ transformer  │ Knowledge distilled version of the multilingual Sentence Encoder     │
+│ sbert-base-en          │ text-to-text │ 768        │ transformer  │ Pretrained BERT, fine-tuned on MS Marco                              │
+└────────────────────────┴──────────────┴────────────┴──────────────┴──────────────────────────────────────────────────────────────────────┘
```
````
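For reference, the listing above is produced by the client's model-listing helper; a minimal sketch, assuming the existing `finetuner.describe_models()` API:

```python
import finetuner

# Prints the rich table of supported text-to-text backbones shown
# above, now including the three jina-embedding models.
finetuner.describe_models(task='text-to-text')
```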
````{tab} image-to-image
12 changes: 11 additions & 1 deletion finetuner/__init__.py
@@ -4,6 +4,7 @@
from typing import TYPE_CHECKING, Any, Dict, List, Optional, TextIO, Union
from urllib.parse import urlparse

+import numpy as np
from _finetuner.runner.stubs import model as model_stub
from docarray import Document, DocumentArray # noqa F401

@@ -33,7 +34,6 @@
from finetuner.model import list_model_classes

if TYPE_CHECKING:
-    import numpy as np
    from _finetuner.models.inference import InferenceEngine

ft = Finetuner()
@@ -669,3 +669,13 @@ def encode(
        batch.embeddings = output.detach().cpu().numpy()

    return data if return_da else data.embeddings
+
+
+def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
+    """Cosine similarity between two vectors.
+
+    :param a: The first vector.
+    :param b: The second vector.
+    :return: Cosine similarity between two vectors.
+    """
+    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
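The new `cos_sim` helper operates on plain numpy vectors; a quick sanity check of the formula (example values chosen for illustration):

```python
import numpy as np

from finetuner import cos_sim

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

# dot(a, b) = 1, |a| = 1, |b| = sqrt(2), so the similarity is ~0.7071
print(cos_sim(a, b))
```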
2 changes: 1 addition & 1 deletion setup.cfg
@@ -1,5 +1,5 @@
[metadata]
-version = 0.7.8
+version = 0.7.9

[flake8]
# E501 is too long lines - ignore as black takes care of that
8 changes: 4 additions & 4 deletions setup.py
@@ -28,13 +28,13 @@
    setup_requires=['setuptools>=18.0', 'wheel'],
    install_requires=[
        'docarray[common]<0.30.0',
-        'trimesh==3.16.4',
-        'finetuner-stubs==0.13.7',
-        'jina-hubble-sdk==0.33.1',
+        'finetuner-stubs==0.13.9',
+        'finetuner-commons==0.13.9',
    ],
    extras_require={
        'full': [
-            'finetuner-commons==0.13.7',
+            'jina-hubble-sdk==0.33.1',
+            'trimesh==3.16.4',
        ],
        'test': [
            'black==23.3.0',