Add numpy alternative to FE using torchaudio #26339

ylacombe · 2023-09-22T11:15:27Z

What does this PR do?

Following on from #26182, which ported torchaudio.compliance.kaldi.fbank to numpy in audio_utils, this PR enables the numpy port in the existing Feature Extractors (AST and SpeechToText) that previously relied on torchaudio. It was discussed here.

This serves two purposes:

to give some examples of how to use audio_utils instead of torchaudio in future Feature Extractors
to make it possible to remove torchaudio altogether in the future.

A next step would be to port audio_utils to torch, which might be faster (cc @sanchit-gandhi), but this is still open to discussion. Is this really relevant, and would it really be faster?

cc @ArthurZucker and @sanchit-gandhi

ylacombe · 2023-09-22T11:16:32Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

-        fbank = ta_kaldi.fbank(
-            waveform,
-            htk_compat=True,
-            sample_frequency=self.sampling_rate,
-            use_energy=False,
-            window_type="hanning",
-            num_mel_bins=self.num_mel_bins,
-            dither=0.0,
-            frame_shift=10,
-        )


I also took the opportunity to remove some unnecessary parameters here
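
For readers unfamiliar with the parameters in the removed call, here is a minimal sketch of the framing arithmetic behind a Kaldi-style fbank. It assumes a 25 ms frame length and framing that keeps only whole frames, which to my knowledge are torchaudio's defaults (as is a 10 ms frame shift); `kaldi_frame_layout` is a hypothetical helper for illustration, not part of torchaudio or audio_utils.

```python
def kaldi_frame_layout(num_samples, sampling_rate=16000, frame_length_ms=25.0, frame_shift_ms=10.0):
    """Return (frame_length, frame_shift, num_frames), all lengths in samples.

    Assumption: only whole frames are kept, so trailing samples that do not
    fill a complete frame are dropped.
    """
    # Convert milliseconds to samples (division keeps the float arithmetic exact here).
    frame_length = int(sampling_rate * frame_length_ms / 1000)
    frame_shift = int(sampling_rate * frame_shift_ms / 1000)

    if num_samples < frame_length:
        # Not enough audio for even one frame.
        return frame_length, frame_shift, 0

    num_frames = 1 + (num_samples - frame_length) // frame_shift
    return frame_length, frame_shift, num_frames
```

Under these assumptions, one second of 16 kHz audio gives 400-sample frames, a 160-sample hop, and 98 frames.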

HuggingFaceDocBuilderDev · 2023-09-22T11:34:26Z

The documentation is not available anymore as the PR was closed or merged.

sanchit-gandhi

Personally think it's better to remove the torchaudio dependency entirely and align these two outliers with the rest of the numpy-based audio feature extractors! Especially since we'll probably support a torch version in audio_utils in an upcoming PR, so the speed diff will be recovered.

sanchit-gandhi · 2023-09-22T17:17:56Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

-            dither=0.0,
-            frame_shift=10,
-        )
+        if self.use_torchaudio:


I don't think there's a need to have the use_torchaudio argument. IMO we can execute the torchaudio code if_torchaudio_is_available (thus maintaining backwards comp), and the NumPy code otherwise

if_torchaudio_is_available():
    # do legacy code
else:
    # do numpy code

I'm also fine with removing the legacy torchaudio code altogether. I know this makes the feature extraction quite a bit slower, but I think that's an acceptable price for removing the extra dependency and bringing these models in line with the rest of the audio library.

Personally, I would favour this approach over supporting both methods for feature extraction (torchaudio and numpy). IMO having both methods convolutes the code quite a lot, which is something we want to avoid.

Fine with me to remove the previous code, but it won’t be performance-wise backward compatible 🫠

Let's go with the first option here then? Decorate with if_torchaudio_is_available?
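
The first option discussed in this thread (use the fast path when the optional package is installed, otherwise fall back) can be sketched as follows. This is only an illustration under stated assumptions: `is_package_available`, `fbank_fast`, `fbank_fallback` and `extract_fbank` are hypothetical names, and the two fbank functions are placeholders rather than real torchaudio or audio_utils calls.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package called `name` can be found."""
    return importlib.util.find_spec(name) is not None


def fbank_fast(waveform):
    # Placeholder for the legacy torchaudio path (hypothetical).
    return ("fast", len(waveform))


def fbank_fallback(waveform):
    # Placeholder for the numpy / audio_utils path (hypothetical).
    return ("fallback", len(waveform))


def extract_fbank(waveform, fast_package: str = "torchaudio"):
    """Use the fast backend when its package is installed, else fall back."""
    if is_package_available(fast_package):
        return fbank_fast(waveform)
    return fbank_fallback(waveform)
```

The appeal of this shape is that callers never pass a flag such as use_torchaudio: the choice follows from the environment, so existing torchaudio users keep the old speed.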

sanchit-gandhi · 2023-09-22T17:20:02Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

@@ -198,3 +235,16 @@ def __call__(
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs
+
+    def to_dict(self):


Is this method strictly necessary? If it is, shouldn't it go in the base FeatureExtractionMixin class? Rather than copying it out for every feature extractor?

which method do you mean? to_dict?

Yes it's about time we add this to the base feature class!
(it's necessary if we support the numpy part)

ArthurZucker

Looks good to me, aligned with @sanchit-gandhi on not adding the np support

ArthurZucker · 2023-09-26T09:04:39Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

@@ -198,3 +235,16 @@ def __call__(
            padded_inputs = padded_inputs.convert_to_tensors(return_tensors)

        return padded_inputs
+
+    def to_dict(self):


Yes it's about time we add this to the base feature class!
(it's necessary if we support the numpy part)

ArthurZucker · 2023-09-26T09:06:09Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

-            dither=0.0,
-            frame_shift=10,
-        )
+        if self.use_torchaudio:


Fine with me to remove the previous code, but it won’t be performance-wise backward compatible 🫠

ylacombe · 2023-09-26T09:25:57Z

Hey @ArthurZucker and @sanchit-gandhi, thanks for the review!

However, I'm not sure about what you meant here:

Looks good to me, aligned with @sanchit-gandhi on not adding the np support

And here:

I'm fine with adding a comment somewhere or a section in the doc so we don't lose the info on how to use numpy to get the same results as torchaudio, for future reference when we improve our numpy port!

@sanchit-gandhi seems to be in favor of removing torchaudio support to focus only on the numpy port here, whereas @ArthurZucker seems to be in favor of not adding the numpy support.

Maybe I misunderstood the comments here! Thanks for your help!

ArthurZucker · 2023-09-26T10:18:06Z

Sorry, I was confused! I agree that we should remove the old code, but I'm worried about the performance issue, since we had to reintroduce torch STFT for Whisper, for example. (Performance-wise backward compatible)

ylacombe · 2023-09-26T11:08:36Z

I've made a quick benchmark, on AST, with results here:

Basically, torchaudio is at least 19 times faster than the numpy port. If I haven't made any mistake in my benchmark, I'll be strongly in favor of keeping torchaudio compatibility.

WDYT @ArthurZucker and @sanchit-gandhi ? Can you also take a quick look at the benchmark code to make sure that my results are correct (or redirect me to an expert at HF haha) ?

For reference, here is the benchmark code:

from datasets import load_dataset
import pytest
from transformers import ASTFeatureExtractor

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
speech_samples = ds.sort("id").select(range(64))[:64]["audio"]
speech_samples = [x["array"] for x in speech_samples]


def torchaudio_unbatch():
    fe = ASTFeatureExtractor(use_torchaudio=True)
    
    for sample in speech_samples:
        input_features = fe(sample, padding=True, return_tensors="pt")

def np_unbatch():
    fe = ASTFeatureExtractor(use_torchaudio=False)
    
    for sample in speech_samples:
        input_features = fe(sample, padding=True, return_tensors="pt")

def torchaudio_batch_8():
    fe = ASTFeatureExtractor(use_torchaudio=True)
    
    for i in range(0,len(speech_samples),8):
        samples = speech_samples[i:i+8]
        input_features = fe(samples, padding=True, return_tensors="pt")

def np_batch_8():
    fe = ASTFeatureExtractor(use_torchaudio=False)
    
    for i in range(0,len(speech_samples),8):
        samples = speech_samples[i:i+8]
        input_features = fe(samples, padding=True, return_tensors="pt")

@pytest.mark.benchmark(
    min_rounds=5, disable_gc=True, warmup=False
)
def test_torchaudio_unbatch(benchmark):
    benchmark(torchaudio_unbatch)

@pytest.mark.benchmark(
    min_rounds=5, disable_gc=True, warmup=False
)
def test_torchaudio_batch_8(benchmark):
    benchmark(torchaudio_batch_8)


@pytest.mark.benchmark(
    min_rounds=5, disable_gc=True, warmup=False
)
def test_np_unbatch(benchmark):
    benchmark(np_unbatch)

@pytest.mark.benchmark(
    min_rounds=5, disable_gc=True, warmup=False
)
def test_np_batch_8(benchmark):
    benchmark(np_batch_8)
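
For a quick sanity check without pytest-benchmark, the same kind of comparison can be sketched with the standard library's timeit. The two functions below are toy stand-ins (assumptions, not the feature extractors themselves), and `speedup` is a hypothetical helper; in practice one would pass callables such as torchaudio_unbatch and np_unbatch from the benchmark above.

```python
import timeit


def speedup(baseline, candidate, number=200, repeat=5):
    """Return how many times faster `candidate` runs than `baseline`."""
    # Taking the minimum over repeats reduces noise from other processes.
    t_base = min(timeit.repeat(baseline, number=number, repeat=repeat))
    t_cand = min(timeit.repeat(candidate, number=number, repeat=repeat))
    return t_base / t_cand


data = list(range(2_000))


def loop_impl():
    # Toy workload: explicit Python loop.
    total = 0
    for x in data:
        total += x * x
    return total


def builtin_impl():
    # Toy workload: same result via the built-in sum.
    return sum(x * x for x in data)
```

Calling `speedup(loop_impl, builtin_impl)` returns a ratio; a value above 1 means the second callable was faster on that machine.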

ylacombe · 2023-09-26T11:10:37Z

For future reference, here is the same benchmark with Speech2TextFeatureExtractor:
Previous conclusions still hold:

ylacombe · 2023-09-26T11:12:07Z

It's also possible that we can optimize our audio_utils.py, WDYT?

sanchit-gandhi · 2023-09-27T17:25:40Z

Alright that's quite a significant difference - this probably requires overhauling the audio_utils file as you've suggested (use torch/torchaudio if available, or see where our numpy implementation is bottlenecked and try to improve it here).
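
One way to look for the bottleneck mentioned here is the standard library profiler. The sketch below is a generic helper (`profile_call` is a hypothetical name, not something in transformers); it could be pointed at a feature extractor call to see which functions dominate the cumulative time.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=5, **kwargs):
    """Run `fn` under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    # runcall enables profiling only for the duration of this call.
    result = profiler.runcall(fn, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For example, `profile_call(fe, sample, return_tensors="pt")` would profile one extraction, assuming `fe` and `sample` are defined as in the benchmark above.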

ylacombe

Hey @ArthurZucker and @sanchit-gandhi, thanks for your help here.

To sum it up, I removed the torchaudio dependency for both FEs, but they still use it if it is installed, to preserve speed.
I've also simulated torchaudio absence to make sure everything is in order.

I'm requesting your reviews again!

ylacombe · 2023-09-29T11:13:59Z

src/transformers/feature_extraction_utils.py

    def to_dict(self) -> Dict[str, Any]:
        """
-        Serializes this instance to a Python dictionary.
-
-        Returns:
-            `Dict[str, Any]`: Dictionary of all the attributes that make up this feature extractor instance.
+        Serializes this instance to a Python dictionary. Returns:
+            `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance.
        """
        output = copy.deepcopy(self.__dict__)
        output["feature_extractor_type"] = self.__class__.__name__
-
+        if "mel_filters" in output:
+            del output["mel_filters"]
+        if "window" in output:
+            del output["window"]
        return output

    @classmethod


As requested I've modified to_dict directly in feature_extraction_utils.py
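
The reason for dropping these two keys is that they hold arrays derived from the configuration, which should not end up in the serialized config. A minimal sketch of the idea with a toy class follows; `ToyFeatureExtractor` is hypothetical and uses plain lists where the real extractors would hold numpy arrays.

```python
import copy
import json


class ToyFeatureExtractor:
    def __init__(self, num_mel_bins=80, sampling_rate=16000):
        self.num_mel_bins = num_mel_bins
        self.sampling_rate = sampling_rate
        # Stand-ins for cached arrays that are rebuilt from the config in __init__.
        self.mel_filters = [[0.0] * num_mel_bins]
        self.window = [0.5] * 400

    def to_dict(self):
        output = copy.deepcopy(self.__dict__)
        output["feature_extractor_type"] = self.__class__.__name__
        # Derived attributes are dropped so only the configuration is serialized.
        for key in ("mel_filters", "window"):
            output.pop(key, None)
        return output


config_json = json.dumps(ToyFeatureExtractor().to_dict(), sort_keys=True)
```

Because the dropped attributes are recomputed in `__init__`, building a new instance from the saved keyword arguments restores them.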

ylacombe · 2023-09-29T11:15:49Z

...odels/audio_spectrogram_transformer/test_feature_extraction_audio_spectrogram_transformer.py

+@unittest.mock.patch(
+    "transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.is_speech_available",
+    lambda: False,
+)


This is how I simulate the absence of torchaudio in the test suite
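
As a self-contained illustration of the same trick, the sketch below patches an availability check on a toy namespace. `toy_module`, `active_backend` and `backend_without_speech` are hypothetical names; only `unittest.mock.patch.object` is real API.

```python
import types
from unittest import mock

# Hypothetical stand-in for the feature-extraction module and its availability check.
toy_module = types.SimpleNamespace(is_speech_available=lambda: True)


def active_backend() -> str:
    # The check is looked up at call time, so a patch on the namespace is visible here.
    return "torchaudio" if toy_module.is_speech_available() else "audio_utils"


def backend_without_speech() -> str:
    # Temporarily replace the check; the original is restored when the block exits.
    with mock.patch.object(toy_module, "is_speech_available", lambda: False):
        return active_backend()
```

Note that the patch in the diff targets the name inside the feature-extraction module, since mock replaces a name where it is looked up rather than where it is defined.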

ylacombe · 2023-09-29T11:16:25Z

...odels/audio_spectrogram_transformer/test_feature_extraction_audio_spectrogram_transformer.py

+    def test_using_audio_utils(self):
+        # Tests that it uses audio_utils instead of torchaudio
+        feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+
+        self.assertTrue(hasattr(feat_extract, "window"))
+        self.assertTrue(hasattr(feat_extract, "mel_filters"))


I'm also ensuring that we use the audio_utils package and torchaudio is indeed not used in this class

sanchit-gandhi

This on paper LGTM and nice job on getting the tests working. My only thought is that maybe we should overhaul audio_utils.py with these changes, rather than do the if/else in the feature extraction code? This way, all the is_xxx_available logic stays in audio_utils (which is fine if it gets complex, since most people won't interact with it), and the feature extraction code can stay simple

Open to either refactoring this PR to make this change, or merging this and doing it in a follow-up (along with #26119)

sanchit-gandhi · 2023-09-29T15:38:22Z

...ers/models/audio_spectrogram_transformer/feature_extraction_audio_spectrogram_transformer.py

+if is_speech_available():
+    import torchaudio.compliance.kaldi as ta_kaldi
+
+if is_torch_available():


In speech-to-text we bundle these imports into one:

if is_speech_available():
    import torchaudio.compliance.kaldi as ta_kaldi
    import torch

Should we do the same here since we can only use torch if torchaudio is available?

Actually, torch is also used here even when torchaudio isn't used. I can maybe refactor the code to change that, but I'm not sure it's worth the time, WDYT ?

ylacombe · 2023-10-02T08:01:03Z

Hey @sanchit-gandhi, thanks for the review here!

My only thought is that maybe we should overhaul audio_utils.py with these changes, rather than do the if/else in the feature extraction code?

We'd have to add an fbank method to audio_utils which would create mel_filters and window on-the-fly in that case, right? (With hindsight, it doesn't matter much, since creating mel_filters and window isn't the bottleneck here.)

In any case, I'd rather refactor that in another PR, which would maybe add the torch correspondence for every possible case in audio_utils
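
If such an fbank helper did build its window on the fly, memoizing the result would keep the cost negligible across calls. A minimal sketch, assuming a symmetric Hann window and a hypothetical `hann_window` helper (this is not the audio_utils implementation):

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def hann_window(length: int) -> tuple:
    """Symmetric Hann window, cached per length (returned as an immutable tuple)."""
    if length < 2:
        return (1.0,) * length
    return tuple(
        0.5 - 0.5 * math.cos(2 * math.pi * n / (length - 1)) for n in range(length)
    )
```

Repeated calls with the same length return the cached tuple, so only the first call pays for the computation.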

ArthurZucker

Thanks! Looks good to me.
Aligned with @sanchit-gandhi on profiling our numpy code sometime soon to see what our huge bottleneck is!

ArthurZucker · 2023-10-03T09:12:42Z

...odels/audio_spectrogram_transformer/test_feature_extraction_audio_spectrogram_transformer.py

+    "transformers.models.audio_spectrogram_transformer.feature_extraction_audio_spectrogram_transformer.is_speech_available",
+    lambda: False,
+)
+class ASTFeatureExtractionWithoutTorchaudioTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):


Very nice. I think you can either add # Copied from on top of some tests, or (not sure if it's possible) find a way to use parametrized to mock the absence of the package as a parameter to avoid code duplication. Not very important; # Copied from will be great.

Hey @ydshieh, I'd love to have your take on how to best manage this.

In a few words, here I try to simulate that a library is missing in ASTFeatureExtractionTest. The thing is that now, I had to create another class: ASTFeatureExtractionWithoutTorchaudioTest, which is a copy of the previous one with a unittest.mock.patch decorator to simulate the library absence.

I've looked over the internet to avoid test duplication, but without success. Do you have any take on how to parametrize the library absence ?

Thanks for your help!

Hello @ylacombe ! There is something like unittest.mock.Mock() but I never used it (yet) myself.

Search in tests/models/t5/test_modeling_t5.py

def test_fp16_fp32_conversion(self):
    r"""
    A test to check whether the argument `keep_in_fp32_modules` correctly does its job
    """
    orig_import = __import__
    accelerate_mock = unittest.mock.Mock()

    # mock import of accelerate
    def import_accelerate_mock(name, *args, **kwargs):
        if name == "accelerate":
            if accelerate_available:
                return accelerate_mock
            else:
                raise ImportError
        return orig_import(name, *args, **kwargs)

and let me know how you feel. I can take a look too (good to learn anyway)

Hey @ydshieh, thanks for the quick response and for this example! I didn't know about it! I'm not sure this is the right fit for this purpose though

The idea is really to run ASTFeatureExtractionTest twice, one without context and the other with the missing library context!

Sorry, I missed the above comment. So here there are 2 classes instead of 2 test methods.

Is the code in the new ASTFeatureExtractionWithoutTorchaudioTest identical to the original ASTFeatureExtractionTest? If so, maybe try making it a subclass of ASTFeatureExtractionTest but decorated with unittest.mock.patch or something similar?

Thanks for this!

Making a subclass works for me, I'll try it
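
The subclass idea can be sketched with toy classes as below. Everything here is hypothetical except the unittest and mock APIs; the point is that `mock.patch.object` used as a class decorator wraps every method whose name starts with "test", including inherited ones, so the whole suite re-runs with the check patched.

```python
import io
import types
import unittest
from unittest import mock

# Hypothetical stand-in for the module that exposes the availability check.
flags = types.SimpleNamespace(is_speech_available=lambda: True)


def double(values):
    if flags.is_speech_available():
        return [2 * v for v in values]  # stands in for the torchaudio path
    return [v + v for v in values]  # stands in for the numpy path


class DoubleTest(unittest.TestCase):
    expect_speech = True

    def test_flag(self):
        # Guards against a patch that silently fails to apply.
        self.assertEqual(flags.is_speech_available(), self.expect_speech)

    def test_output(self):
        self.assertEqual(double([1, 2]), [2, 4])


# The decorator patches the inherited test methods as well as any new ones.
@mock.patch.object(flags, "is_speech_available", lambda: False)
class DoubleWithoutSpeechTest(DoubleTest):
    expect_speech = False


def run_case(case) -> unittest.TestResult:
    # Run one TestCase class quietly and return its result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

One caveat of this approach: only methods matching the test prefix are wrapped, so code in setUp still sees the unpatched check.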

ArthurZucker · 2023-10-03T09:12:58Z

tests/models/speech_to_text/test_feature_extraction_speech_to_text.py

@@ -104,7 +103,213 @@ def _flatten(list_of_lists):
 @require_torch
 @require_torchaudio
 class Speech2TextFeatureExtractionTest(SequenceFeatureExtractionTestMixin, unittest.TestCase):


Missing some # Copied from as well here!

github-actions · 2023-11-05T09:04:28Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ylacombe · 2023-11-06T14:03:32Z

tests/models/speech_to_text/test_feature_extraction_speech_to_text.py

+# exact same tests than before, except that we simulate that torchaudio is not available
+@require_torch
+@unittest.mock.patch(
+    "transformers.models.speech_to_text.feature_extraction_speech_to_text.is_speech_available", lambda: False
+)
+class Speech2TextFeatureExtractionWithoutTorchaudioTest(Speech2TextFeatureExtractionTest):
+    def test_using_audio_utils(self):
+        # Tests that it uses audio_utils instead of torchaudio
+        feat_extract = self.feature_extraction_class(**self.feat_extract_tester.prepare_feat_extract_dict())
+
+        self.assertTrue(hasattr(feat_extract, "window"))
+        self.assertTrue(hasattr(feat_extract, "mel_filters"))


I just want to make sure that this inheritance works for you @ArthurZucker and/or @amyeroberts, following @ydshieh's suggestion!

As soon as I have approval, I'll merge!

Would be nice to assert that is_speech_available is False so we are sure the patch works 😄

(sometimes we have surprises ...)

self.mel_filters and self.window are not defined unless is_speech_available=False, but it's best to be on the safe side

ArthurZucker

Looks alright to me!

* add audio_utils usage in the FE of SpeechToText
* clean unecessary parameters of AudioSpectrogramTransformer FE
* add audio_utils usage in AST
* add serialization tests and function to FEs
* make style
* remove use_torchaudio and move to_dict to FE
* test audio_utils usage
* make style and fix import (remove torchaudio dependency import)
* fix torch dependency for jax and tensor tests
* fix typo
* clean tests with suggestions
* add lines to test if is_speech_availble is False

ylacombe added 5 commits September 22, 2023 10:46

add audio_utils usage in the FE of SpeechToText

f36232b

clean unecessary parameters of AudioSpectrogramTransformer FE

fbc40d3

add audio_utils usage in AST

5ad48e0

add serialization tests and function to FEs

73a4c06

make style

f06db42

ylacombe commented Sep 22, 2023

ylacombe requested review from ArthurZucker and sanchit-gandhi September 22, 2023 11:16

sanchit-gandhi reviewed Sep 22, 2023

ArthurZucker reviewed Sep 26, 2023

remove use_torchaudio and move to_dict to FE

608644b

ylacombe mentioned this pull request Sep 28, 2023

Add Seamless M4T model #25693

Merged

7 tasks

ylacombe added 2 commits September 29, 2023 11:03

test audio_utils usage

605ed14

make style and fix import (remove torchaudio dependency import)

b16b5a9

ylacombe commented Sep 29, 2023

ylacombe requested a review from ArthurZucker September 29, 2023 11:19

ylacombe added 2 commits September 29, 2023 11:25

fix torch dependency for jax and tensor tests

8af2313

fix typo

5e98476

sanchit-gandhi approved these changes Sep 29, 2023

ArthurZucker approved these changes Oct 3, 2023

Merge branch 'huggingface:main' into torchaudio-alternative

4aa7100

ylacombe and others added 2 commits November 6, 2023 13:53

Merge branch 'huggingface:main' into torchaudio-alternative

d2e2714

clean tests with suggestions

c18ee1a

ylacombe commented Nov 6, 2023

ArthurZucker approved these changes Nov 7, 2023

add lines to test if is_speech_availble is False

4dd6207

ylacombe merged commit be74b2e into huggingface:main Nov 8, 2023
21 checks passed

ylacombe deleted the torchaudio-alternative branch November 8, 2023 07:39

ylacombe mentioned this pull request Nov 9, 2023

remove failing tests and clean FE files #27414

Merged
