
Performance issue #165

Open · AbrJA opened this issue Jan 5, 2024 · 1 comment


AbrJA commented Jan 5, 2024

While testing performance against Python, I found a potential bottleneck when evaluating the model to get the token embeddings.

Julia: Around 800 ms

using Transformers
using Transformers.TextEncoders
using Transformers.HuggingFace

textenc, model = hgf"sentence-transformers/all-MiniLM-L6-v2"
sentences = ["This is an example sentence", "Each sentence is converted"]
sentences_encoded = encode(textenc, sentences)  # tokenize and batch the sentences for the model

using BenchmarkTools

@benchmark model(sentences_encoded)

BenchmarkTools.Trial: 7 samples with 1 evaluation.
 Range (min … max):  763.074 ms … 847.289 ms  ┊ GC (min … max): 1.60% … 3.46%
 Time  (median):     798.575 ms               ┊ GC (median):    3.50%
 Time  (mean ± σ):   807.619 ms ±  34.539 ms  ┊ GC (mean ± σ):  3.19% ± 0.70%

  █    █                  ██                        █        ██  
  █▁▁▁▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█▁▁▁▁▁▁▁▁██ ▁
  763 ms           Histogram: frequency by time          847 ms <

 Memory estimate: 195.70 MiB, allocs estimate: 4719116.

Python: Around 20 ms

from transformers import AutoTokenizer, AutoModel
import torch

sentences = ['This is an example sentence', 'Each sentence is converted']

textenc = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

encoded_input = textenc(sentences, padding=True, truncation=True, return_tensors='pt')

from timeit import timeit

def compute():
    #python is using a pointer but ...
    model(**encoded_input)

n = 100
result = timeit("compute()", setup='from __main__ import compute', number=n)

print("Total time : %.1f ms" % (1000 * (result/n)))

Total time : 21.5 ms

I would expect roughly the same time, or perhaps Julia to be faster, but it's almost 40x slower. Am I doing something wrong? Is there an explanation for this? Has anyone seen this before?

I would appreciate any help, thank you!

chengchingwen (Owner) commented

The benchmark code seems correct. My initial guess is that we don't fully utilize multithreading in our implementation. I would need to do more investigation.
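
As a first check (just a rough diagnostic on my side, assuming a CPU run where the dense layers go through BLAS), you can verify how many threads Julia and BLAS actually see:

using LinearAlgebra

# Julia-level threads; start Julia with `julia -t auto` to use all cores.
println("Julia threads: ", Threads.nthreads())

# BLAS threads used by the matmuls; adjustable via BLAS.set_num_threads(n).
println("BLAS threads:  ", BLAS.get_num_threads())

If BLAS ends up pinned to a single thread, the CPU matmuls alone could account for a good part of the gap.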
