espnet/espnet_onnx: ONNX wrapper for ESPnet inference models

ESPnet without PyTorch!

A utility library to easily export, quantize, and optimize ESPnet models to the ONNX format. There is no need to install PyTorch or ESPnet on your machine if you already have the exported files!

espnet_onnx demo in Colab

A demonstration notebook is now available on Google Colab!

Install

  1. espnet_onnx can be installed with pip.
  2. If you want to export a pretrained model, you additionally need to install torch>=1.11.0, espnet, espnet_model_zoo, and onnx; see the sketch after this list. Note that onnx==1.12.0 might cause some errors. If you get an error during inference or export, please consider downgrading your onnx version.
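A minimal sketch of both installs (package names taken from the list above; pin versions as needed for your environment):

```
pip install espnet_onnx

# Only needed if you want to export pretrained models yourself:
pip install "torch>=1.11.0" espnet espnet_model_zoo onnx
```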

Install guide for developers

  1. Clone this repository.

```
git clone git@github.com:espnet/espnet_onnx.git
```

  2. Create a virtual environment.

```
cd tools
make venv export
```

  3. Activate the virtual environment and install torch if required.

```
. tools/venv/bin/activate
```

Please refer to the official installation guide of PyTorch.

```
pip install torch
```

  4. Clone the s3prl repository and install it with pip.

```
cd tools
git clone https://github.com/s3prl/s3prl
cd s3prl
pip install .
```

  5. Install warp_transducer to develop transducer models.

```
cd tools
git clone --single-branch --branch espnet_v1.1 https://github.com/b-flo/warp-transducer.git
cd warp-transducer
mkdir build
```

Please set WITH_OMP to ON or OFF.

```
cd build && cmake -DWITH_OMP="ON" .. && make
cd ../pytorch_binding && python3 -m pip install -e .
```

  6. If you want to work on optimization, you also need to build onnxruntime for development. Please clone the onnxruntime repository.
  7. Since espnet==202308 (the latest at the v0.2.0 release) requires protobuf<=3.20.1 while the latest onnx requires protobuf>=3.20.2, installation may fail. In this case, first install espnet==202308, then update protobuf to 3.20.3, and finally install the other libraries, as sketched below.
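One possible install order for the workaround above (a sketch; adjust the final set of packages to your environment):

```
pip install espnet==202308
pip install protobuf==3.20.3
pip install onnx espnet_onnx
```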

Usage

Export models

  1. espnet_onnx can export pretrained models published on espnet_model_zoo. By default, exported files are stored in ${HOME}/.cache/espnet_onnx/<tag_name>.

```
from espnet2.bin.asr_inference import Speech2Text
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()

# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('<tag_name>', quantize=True)

# export from trained model
speech2text = Speech2Text(args)
m.export(speech2text, '<tag_name>', quantize=True)
```

  2. You can export a pretrained model from a zipped file. The zipped file must contain meta.yaml.

```
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()
m.export_from_zip(
    'path/to/the/zipfile',
    tag_name='tag_name_for_zipped_model',
    quantize=True
)
```

  3. You can set some configurations for export. The available configurations are shown in the details for each model.

```
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()

# Set maximum sequence length to 3000
m.set_export_config(max_seq_len=3000)
m.export_from_zip(
    'path/to/the/zipfile',
    tag_name='tag_name_for_zipped_model',
)
```

  4. You can easily optimize your model by using the optimize option. If you want to fully optimize your model, you need to install the custom version of onnxruntime from here. Please read this document for more details.

```
from espnet_onnx.export import ASRModelExport

m = ASRModelExport()
m.export_from_zip(
    'path/to/the/zipfile',
    tag_name='tag_name_for_zipped_model',
    optimize=True,
    quantize=True
)
```

  5. You can also export a model from the command line.

```
python -m espnet_onnx.export \
  --model_type asr \
  --input ${path_to_zip} \
  --tag transformer_lm \
  --apply_optimize \
  --apply_quantize
```

Inference

  1. For inference, tag_name or model_dir is used to load the onnx files. tag_name has to be defined in tag_config.yaml.

```
import librosa
from espnet_onnx import Speech2Text

speech2text = Speech2Text(tag_name='<tag_name>')
# speech2text = Speech2Text(model_dir='path to the onnx directory')

y, sr = librosa.load('sample.wav', sr=16000)
nbest = speech2text(y)
```

  2. For streaming asr, you can use the StreamingSpeech2Text class. The speech length should be the same as StreamingSpeech2Text.hop_size.

```
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name)

# start streaming asr
stream_asr.start()
while streaming:
    wav = ...  # get the next chunk of audio here
    assert len(wav) == stream_asr.hop_size
    stream_text = stream_asr(wav)[0][0]

# You can get the non-streaming asr result with the end function
nbest = stream_asr.end()
```

You can also simulate the streaming model with your wav file using the simulate function. Passing True as the second argument prints the streaming text, as in the following code.

```
import librosa
from espnet_onnx import StreamingSpeech2Text

stream_asr = StreamingSpeech2Text(tag_name)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = stream_asr.simulate(y, True)
# Processing audio with 6 processes.
# Result at position 0 :
# Result at position 1 :
# Result at position 2 : this
# Result at position 3 : this is
# Result at position 4 : this is a
# Result at position 5 : this is a

print(nbest[0][0])
# 'this is a pen'
```

  3. If you installed the custom version of onnxruntime, you can run the optimized model for inference. You don't have to change any code from the above; if the model was optimized, espnet_onnx will automatically load the optimized version.
  4. You can use only the hubert model as your frontend.

```
from espnet_onnx.export import ASRModelExport

# export your model
tag_name = 'ESPnet pretrained model with hubert'
m = ASRModelExport()
m.export_from_pretrained(tag_name, optimize=True)

# load only the frontend model
from espnet_onnx.asr.frontend import Frontend
frontend = Frontend.get_frontend(tag_name)

# use the model in your application
import librosa
import numpy as np
y, sr = librosa.load('wav file')

# y: (B, T), y_len: (B,)
feats = frontend(y[None, :], np.array([len(y)]))
```

  5. If you installed torch in your environment, you can use the frontend in your training.

```
import librosa
import torch
from espnet_onnx.asr.frontend import TorchFrontend

# load the pretrained frontend model
frontend = TorchFrontend.get_frontend(tag_name)

# use the model while training
y, sr = librosa.load('wav file')

# You need to place your data on GPU,
# and specify the output shape as a tuple
y = torch.Tensor(y).unsqueeze(0).to('cuda')  # (1, wav_length)
output_shape = (batch_size, feat_length, feats_dims)
feats = frontend(y, y.size(1), output_shape)
```

Text2Speech inference

  1. You can export TTS models in the same way as ASR models.

```
from espnet2.bin.tts_inference import Text2Speech
from espnet_onnx.export import TTSModelExport

m = TTSModelExport()

# download with espnet_model_zoo and export from pretrained model
m.export_from_pretrained('<tag_name>', quantize=True)

# export from trained model
text2speech = Text2Speech(args)
m.export(text2speech, '<tag_name>', quantize=True)
```

  2. You can generate wav files simply by using the Text2Speech class.

```
from espnet_onnx import Text2Speech

tag_name = 'kan-bayashi/ljspeech_vits'
text2speech = Text2Speech(tag_name, use_quantized=True)

text = 'Hello world!'
output_dict = text2speech(text)  # inference with onnx model
wav = output_dict['wav']
```
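To listen to the result, you can write the waveform to a file, for example with soundfile. This is a minimal sketch; the 22050 Hz rate is an assumption based on LJSpeech, so check your model's config for the actual sample rate.

```
import soundfile as sf

# Sample rate is model-dependent; 22050 Hz is typical for LJSpeech models.
sf.write('output.wav', wav, 22050)
```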

How to use GPU on espnet_onnx

Install the dependency.

First, we need the onnxruntime-gpu library instead of onnxruntime. Please follow this article to select and install the correct version of onnxruntime-gpu, depending on your CUDA version.
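A minimal sketch of the swap (the exact onnxruntime-gpu version must match your CUDA setup, as the article above explains):

```
pip uninstall onnxruntime
pip install onnxruntime-gpu
```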

Inference on GPU

Now you can speed up inference with a GPU. All you need to do is select the correct providers and pass them to the Speech2Text or StreamingSpeech2Text instance. See this article for more information about providers.

```
import librosa
from espnet_onnx import Speech2Text

PROVIDERS = ['CUDAExecutionProvider']
tag_name = 'some_tag_name'

speech2text = Speech2Text(
    tag_name,
    providers=PROVIDERS
)
y, sr = librosa.load('path/to/wav', sr=16000)
nbest = speech2text(y)  # runs on GPU
```

Note that some quantized models are not supported for GPU computation. If you get an error with a quantized model, please try the non-quantized model.
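If you want inference to keep working when an operator or model cannot run on CUDA, you can list a CPU fallback after the CUDA provider. This is a standard onnxruntime pattern, assuming Speech2Text passes the provider list through to its onnxruntime sessions:

```
# Try CUDA first; onnxruntime falls back to CPU for unsupported cases.
PROVIDERS = ['CUDAExecutionProvider', 'CPUExecutionProvider']
speech2text = Speech2Text(tag_name, providers=PROVIDERS)
```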

Changes from ESPnet

To avoid the cache problem, I modified some scripts from the original ESPnet implementation.

  1. Added <blank> before <sos>.
  2. Give some torch.zeros() arrays to the model.
  3. Removed the first token in post-processing (removes the blank token).
  4. Replaced make_pad_mask with a new implementation that can be converted to the ONNX format (see the sketch after this list).
  5. Removed extend_pe() from the positional encoding module. The length of pe is 512 by default.
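For illustration, the usual reason make_pad_mask needs replacing is that loop-based masking does not trace to ONNX, while a broadcast comparison does. This is a minimal sketch of the idea, not the actual espnet_onnx code:

```
import torch

def make_pad_mask(lengths: torch.Tensor, max_len: int) -> torch.Tensor:
    """Return a (B, T) bool mask, True at padded positions.

    Built from a broadcast comparison so torch.onnx.export can trace it.
    """
    pos = torch.arange(max_len, device=lengths.device).unsqueeze(0)  # (1, T)
    return pos >= lengths.unsqueeze(1)  # (B, T)
```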

Supported Archs

ASR: Supported architecture for ASR

TTS: Supported architecture for TTS

Developer's Guide

ASR: Developer's Guide


Copyright (c) 2022 Masao Someki

Released under the MIT license.

Author

Masao Someki

contact: masao.someki@gmail.com