Testing the performance against Python, I found a potential bottleneck when evaluating the model to get the token embeddings.
Julia: Around 800 ms
using Transformers
using Transformers.TextEncoders
using Transformers.HuggingFace

textenc, model = hgf"sentence-transformers/all-MiniLM-L6-v2"
sentences = ["This is an example sentence", "Each sentence is converted"]
sentences_encoded = encode(textenc, sentences)  # tokenize, pad, and batch the sentences

using BenchmarkTools
@benchmark model(sentences_encoded)
BenchmarkTools.Trial: 7 samples with 1 evaluation.
Range (min … max): 763.074 ms … 847.289 ms ┊ GC (min … max): 1.60% … 3.46%
Time (median): 798.575 ms ┊ GC (median): 3.50%
Time (mean ± σ): 807.619 ms ± 34.539 ms ┊ GC (mean ± σ): 3.19% ± 0.70%
█ █ ██ █ ██
█▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁██ ▁
763 ms Histogram: frequency by time 847 ms <
Memory estimate: 195.70 MiB, allocs estimate: 4719116.
Python: Around 20 ms
from transformers import AutoTokenizer, AutoModel
import torch

sentences = ['This is an example sentence', 'Each sentence is converted']
textenc = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
encoded_input = textenc(sentences, padding=True, truncation=True, return_tensors='pt')

from timeit import timeit

def compute():
    # python is using a pointer but ...
    model(**encoded_input)

n = 100
result = timeit("compute()", setup='from __main__ import compute', number=n)
print("Total time : %.1f ms" % (1000 * (result / n)))
Total time : 21.5 ms
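
For what it's worth, the Python timing above keeps PyTorch's default gradient tracking enabled; a variant that disables it for inference (which would typically make the Python side even faster, so the gap is not explained by that) would look like this:

def compute_nograd():
    # Inference needs no autograd bookkeeping; disabling it usually
    # reduces both runtime and memory in PyTorch.
    with torch.no_grad():
        model(**encoded_input)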
I would expect roughly the same time, or maybe Julia being slightly faster, but it is almost 40x slower. Am I doing something wrong? Is there an explanation for this? Has anyone noticed this before?
I would appreciate any help, thank you!
The benchmark code seems correct. My initial guess is that we don't fully utilize multithreading in our implementation; I'd need to investigate further.
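
As a quick sanity check (this is only a guess at the cause, not a confirmed diagnosis), you can inspect how many threads Julia itself and the BLAS backend are using, since dense matrix multiplications dominate transformer inference:

using LinearAlgebra

Threads.nthreads()         # Julia threads; start with `julia -t auto` to use all cores
BLAS.get_num_threads()     # threads available to the BLAS backend for matmuls
BLAS.set_num_threads(Sys.CPU_THREADS)  # let BLAS use all logical cores

If BLAS is pinned to one thread while PyTorch uses all cores, that alone could account for a large part of the gap.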