## Exporting Whisper with Beam Search

There are several ways to export Whisper with beam search (using Whisper tiny as an example).

### Option 1: from convert_to_onnx

```
# From source
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format

# From wheel
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format
```

### Option 2: end-to-end model from [Olive](https://github.com/microsoft/Olive/tree/main/examples/whisper)

Please follow the [README instructions](https://github.com/microsoft/Olive/tree/main/examples/whisper#prerequisites) in Olive.

### Option 3: from [Hugging Face Optimum](https://github.com/huggingface/optimum)

Run the following Python code to export:

```
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_name = "openai/whisper-large-v2"
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    export=True,
)
model.save_pretrained(model_name.split("/")[-1] + "-onnx")
```
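
To sanity check the export, here is a minimal sketch of loading the saved ONNX model back and transcribing audio through a Transformers pipeline. The local directory name follows the `save_pretrained` call above; the processor and pipeline wiring (and the example audio path) are illustrative assumptions, not part of the export step.

```
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor, pipeline

# Directory written by save_pretrained above (assumed)
model_dir = "whisper-large-v2-onnx"

# Load the exported ONNX model and the matching processor
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir)
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")

# Wire everything into a standard ASR pipeline (audio file is an example)
asr = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
)
print(asr("1272-141231-0002.mp3")["text"])
```
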
## Exporting + Optimizing + Quantizing Whisper with Beam Search

Here are some additional examples for exporting Whisper with beam search.

Export with Forced Decoder Input Ids
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids

# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --use_forced_decoder_ids
```

Export + Optimize for FP32
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32

# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp32
```

Export + Optimize for FP16 and GPU
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda

# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --optimize_onnx --precision fp16 --use_gpu --provider cuda
```

Export + Quantize for INT8
```
# From source:
$ python3 -m models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer

# From wheel:
$ python3 -m onnxruntime.transformers.models.whisper.convert_to_onnx -m openai/whisper-tiny --output whispertiny --use_external_data_format --precision int8 --quantize_embedding_layer
```
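
To run an exported model outside of the benchmark scripts, here is a minimal sketch that feeds the beam search model directly through `onnxruntime`. The model path and the input names (`input_features`, `max_length`, `num_beams`, and so on) are assumptions based on a default whisper-tiny export; inspect `session.get_inputs()` to confirm the exact signature of your model.

```
import numpy as np
import onnxruntime as ort
from transformers import WhisperProcessor

# Assumed output location from the export commands above
model_path = "whispertiny/whisper-tiny_beamsearch.onnx"
session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")

# Placeholder audio: 30 seconds of silence at 16 kHz; substitute real samples
audio = np.zeros(16000 * 30, dtype=np.float32)
features = processor(audio, sampling_rate=16000, return_tensors="np").input_features

outputs = session.run(
    None,
    {
        "input_features": features.astype(np.float32),
        "max_length": np.array([128], dtype=np.int32),
        "min_length": np.array([0], dtype=np.int32),
        "num_beams": np.array([2], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
    },
)

# outputs[0] has shape (batch, num_return_sequences, sequence_length)
print(processor.batch_decode(outputs[0][0], skip_special_tokens=True))
```
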

## Benchmark Whisper

Here are some examples of how you can benchmark Whisper across various end-to-end (E2E) implementations.

Note: In the examples below, `PyTorch` refers to running in PyTorch without `torch.compile` and `PyTorch 2.0` refers to running in PyTorch with `torch.compile`.

### Variants

1. PyTorch (without `torch.compile`), FP32
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-pt \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --precision fp32 \
    --device cpu
```

2. PyTorch 2.0 (with `torch.compile`), FP16
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-pt2 \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --precision fp16 \
    --device cuda
```

3. Optimum + ONNX Runtime, FP32, export via Optimum
```
python3 -m models.whisper.benchmark \
    --benchmark-type hf-ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --hf-ort-model-path ./whisper-large-v2-onnx/ \
    --precision fp32 \
    --device cpu
```

4. ONNX Runtime, FP32, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_beamsearch.onnx \
    --precision fp32 \
    --device cpu
```

5. ONNX Runtime, FP16, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large_all.onnx \
    --precision fp16 \
    --device cuda
```

6. ONNX Runtime, INT8, export via Olive or convert_to_onnx
```
python3 -m models.whisper.benchmark \
    --benchmark-type ort \
    --audio-path 1272-141231-0002.mp3 \
    --model-name openai/whisper-large-v2 \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
    --precision fp32 \
    --device cpu
```

You can profile a variant by adding the `--profile` flag.

### Benchmark All

You can use `benchmark_all.py` to benchmark across various platforms and automatically store the results in a CSV file. Here is an example.

```
python3 -m models.whisper.benchmark_all \
    --audio-path ./whisper-test-audios/ \
    --hf-ort-model-path ./whisper-large-v2-onnx/ \
    --ort-model-path ./wlarge-fp32/whisper-large-v2_all.onnx \
    --model-name openai/whisper-large-v2 \
    --precision fp32 \
    --device cpu
```

### Benchmarking on NVIDIA A100

Here is a benchmark for an MP3 file with 20.7s of audio.

#### FP16

| Engine       | Size     | Per-Token Latency | Real-Time Factor |
| ------------ | -------- | ----------------- | ---------------- |
| PyTorch      | Tiny     | 4.697 ms/token    | 0.004697         |
| PyTorch 2.0  | Tiny     | 3.406 ms/token    | 0.003406         |
| ONNX Runtime | Tiny     | 0.746 ms/token    | 0.000746         |
| PyTorch      | Medium   | 17.837 ms/token   | 0.017387         |
| PyTorch 2.0  | Medium   | 18.124 ms/token   | 0.018124         |
| ONNX Runtime | Medium   | 3.894 ms/token    | 0.003894         |
| PyTorch      | Large v2 | 23.470 ms/token   | 0.023470         |
| PyTorch 2.0  | Large v2 | 23.146 ms/token   | 0.023146         |
| ONNX Runtime | Large v2 | 6.262 ms/token    | 0.006262         |

#### FP32

| Engine       | Size     | Per-Token Latency | Real-Time Factor |
| ------------ | -------- | ----------------- | ---------------- |
| PyTorch      | Tiny     | 6.220 ms/token    | 0.006220         |
| PyTorch 2.0  | Tiny     | 3.944 ms/token    | 0.003944         |
| ONNX Runtime | Tiny     | 1.545 ms/token    | 0.001545         |
| PyTorch      | Medium   | 19.093 ms/token   | 0.019093         |
| PyTorch 2.0  | Medium   | 20.459 ms/token   | 0.020459         |
| ONNX Runtime | Medium   | 9.440 ms/token    | 0.009440         |
| PyTorch      | Large v2 | 25.844 ms/token   | 0.025844         |
| PyTorch 2.0  | Large v2 | 26.397 ms/token   | 0.026397         |
| ONNX Runtime | Large v2 | 7.492 ms/token    | 0.007492         |