segmentation fault while using onnxruntime==1.21.0 #24144
Here is the exception message that is issued. I am not seeing a segmentation fault on the main build:

unknown file: error: C++ exception with description "Load model from D:/dev/data/SegmentationFault_gh_24144/hf_Qwen2-7B-Instruct_model.onnx failed:Load model D:/dev/data/SegmentationFault_gh_24144/hf_Qwen2-7B-Instruct_model.onnx failed" thrown in the test body.
Hi @yuslepukhin, thanks for looking into it.
Hi @yuslepukhin, I'm using the model from the following location: https://huggingface.co/Qwen/Qwen2-7B-Instruct/tree/main
Please share exactly what you did, along with any console messages. Enable logging, share the output, and explain specifically what makes you think there is a segmentation fault. Please also fill out the issue template, including the version of your Linux OS, etc.
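For reference, a minimal sketch of the kind of verbose logging the maintainer is asking for, set at session creation. The helper name is hypothetical (not from the thread), and the import is guarded so the sketch degrades gracefully when onnxruntime is not installed:

```python
# Hypothetical sketch: turn on the most verbose ONNX Runtime logging so that
# diagnostics are emitted before a potential native crash.
try:
    import onnxruntime as ort
    HAVE_ORT = True
except ImportError:
    HAVE_ORT = False

def make_verbose_session_options():
    """Return SessionOptions with maximum log verbosity, or None when
    onnxruntime is unavailable in this environment."""
    if not HAVE_ORT:
        return None
    so = ort.SessionOptions()
    # 0 = VERBOSE, 1 = INFO, 2 = WARNING (default), 3 = ERROR, 4 = FATAL
    so.log_severity_level = 0
    return so
```

The returned options would then be passed as the `sess_options` argument of `InferenceSession`, so the load/initialization phase logs every step leading up to a crash.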
@yuslepukhin I'm trying to create a script which can reproduce the issue at your end. I'll try to share it with you soon.
Steps to reproduce:
Please let me know if you need any information from my side in this regard. Thanks.
I have followed the procedure and got the model. I produced a debug build from the tip of main.
Hi @yuslepukhin, were you able to run the complete script without any segmentation fault? If yes, can you please check the onnxruntime version?
While running the inference step, I was getting a segmentation fault:
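As a general diagnostic aid (a sketch, not something from the thread), the standard library's faulthandler module can be enabled before the session is created, so that a Python-level traceback is printed even when the crash happens inside a native library:

```python
import faulthandler

# Dump the Python traceback to stderr when the process receives SIGSEGV,
# SIGFPE, SIGABRT, or SIGBUS; this shows which Python call was in flight
# when a native library (such as onnxruntime) crashed.
faulthandler.enable()
```

This does not prevent the crash, but it usually pinpoints the exact call (e.g. the `InferenceSession` constructor) that triggered it.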
The bug reproduces with 1.21.0, but is not present with the latest code.

D:\dev\data\SegmentationFault_gh_24144$ pip list
certifi 2025.1.31
@yuslepukhin It is great that you are able to reproduce the issue. I think we should add this as a test case to avoid such a regression in the future. What do you say? Let me know if you want me to add it. If yes, can you please share some documentation or guide me on how to add and verify it?
As you mentioned, the problem was fixed in the maintenance release in this PR, in one of the optimizations. Feel free to add a test if you so desire. There is no documentation on how to add tests, but you can check any of the test files and see how they are written.
@yuslepukhin I'm unable to verify whether this has been fixed in the latest onnxruntime binary. I tried installing a nightly build (onnxruntime-1.22.0.dev20250321002), but I am still getting the segmentation fault.
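When testing a nightly, it is worth confirming which build is actually imported (e.g. by printing `onnxruntime.__version__` at runtime). The helper below is a hypothetical stdlib-only sketch for comparing such version strings, ignoring the `.dev` suffix that nightlies carry:

```python
def release_tuple(version):
    """Extract the numeric release prefix of a version string, ignoring
    dev/rc suffixes: "1.22.0.dev20250321002" -> (1, 22, 0)."""
    parts = []
    for p in version.split("."):
        if p.isdigit():
            parts.append(int(p))
        else:
            break
    return tuple(parts)

# A 1.22.0.dev nightly carries a newer release prefix than 1.21.0:
print(release_tuple("1.22.0.dev20250321002") > release_tuple("1.21.0"))  # → True
```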
Everything that is fixed in the maintenance release is fixed in main. You are welcome to debug it; it may be a new issue. I do not have a repro.
Any details? What platform?

Please add this information to the GH issue along with the hardware you are using. I do not have time allocated for this now, but I will get to it eventually.
--
Dmitri

From: Dmytro Varich
Sent: Tuesday, April 29, 2025 4:41
To: microsoft/onnxruntime
Subject: Re: [microsoft/onnxruntime] segmentation fault while using onnxruntime==1.21.0 (Issue #24144)
@yuslepukhin, Hi!
I am writing a bachelor's thesis in which I am developing a custom package for ROS 2. In this package, my node uses ONNX Runtime to run a segmentation model with CUDA (GPU acceleration).
I get a segmentation fault when trying to run inference using ONNX Runtime 1.21.1 with the CUDAExecutionProvider (CUDA version 12.2). The crash occurs during initialization of the InferenceSession.
Here is the relevant code snippet:
# imports at module level
import os
import onnxruntime as ort

if not os.path.exists(self.model_path):
    self.get_logger().error(f"Model file not found at {self.model_path}")
    return  # bail out instead of falling through to InferenceSession
self.session = ort.InferenceSession(
    self.model_path,
    providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
self.get_logger().info(f"ONNX model loaded from {self.model_path}")
Environment:
* ONNX Runtime: 1.21.1
* CUDA: 12.2
* OS: Ubuntu 22.04
* Python: 3.10
* Execution Providers: ['CUDAExecutionProvider', 'CPUExecutionProvider']
* Model: peoplesemsegnet_shuffleseg.onnx (Nvidia NGC Catalog: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tao/models/peoplesemsegnet/files?version=deployable_shuffleseg_unet_onnx_v1.0)
* Model Zip File: peoplesemsegnet_deployable_shuffleseg_unet_onnx.zip (https://github.com/user-attachments/files/19957865/peoplesemsegnet_deployable_shuffleseg_unet_onnx.zip)
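Given that the crash occurs with `CUDAExecutionProvider` in the providers list, one isolation step (a sketch of my own, with a hypothetical helper name; the import is guarded so it degrades when onnxruntime is absent) is to first check whether the installed build ships the CUDA provider at all:

```python
# Hypothetical sketch: check whether this onnxruntime build actually
# contains the CUDA execution provider before requesting it.
try:
    import onnxruntime as ort
except ImportError:
    ort = None

def cuda_available():
    """True/False when onnxruntime is installed, None otherwise."""
    if ort is None:
        return None
    return "CUDAExecutionProvider" in ort.get_available_providers()
```

If this reports False, the pip package in use is CPU-only and the session silently falls back to CPU; creating the session with only `['CPUExecutionProvider']` is then a second isolation step to tell a CUDA-specific crash apart from a general model-loading one.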
🤔 I am not sure whether the problem is that I'm using a model from NVIDIA that was originally designed for TensorRT (from Isaac ROS), or whether it's still an issue with the ONNX Runtime version, even though you said it would be fixed in the new release compared to 1.21.0.
Thank you in advance!
onnxruntime crashes with a segmentation fault when using version 1.21.0. It does not crash with the 1.20.1 release.
Steps to reproduce:
hf_Qwen2-7B-Instruct_model.onnx.gz
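The attached model is gzip-compressed, so it has to be decompressed before `InferenceSession` can load it. A minimal stdlib sketch (the helper name and paths are assumptions based on the attachment name):

```python
import gzip
import shutil

def gunzip(src, dst):
    """Stream-decompress a .gz file to dst, so a multi-GB model
    never has to sit fully in memory."""
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# Usage (paths assumed):
# gunzip("hf_Qwen2-7B-Instruct_model.onnx.gz", "hf_Qwen2-7B-Instruct_model.onnx")
```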