[Docs] EPcontext error handling #24471

Merged 1 commit · Apr 19, 2025
33 changes: 21 additions & 12 deletions docs/execution-providers/EP-Context-Design.md
@@ -199,7 +199,7 @@ ONNX Runtime EPs that support loading models with `EPContext` nodes should follo
- The EP should only partition the `EPContext` nodes where the `source` attribute matches the key required by the EP.
- The EP loads the cached context from the matched `EPContext` nodes.

- **Handling External Context Binaries (embed_mode = 0)**
When the `EPContext` cache model is generated with `embed_mode = 0`, the context binary is stored as a separate file alongside the ONNX model in the same folder.
- ONNX Runtime retrieves the relative path of the context binary file from the `ep_cache_context` attribute of the `EPContext` node.
- **For models loaded from a file path:**
@@ -208,10 +208,19 @@ ONNX Runtime EPs that support loading models with `EPContext` nodes should follo
- Since the EP cannot derive the model's folder path, the user must specify the session option `ep.context_file_path`.
- The EP uses `ep.context_file_path` to determine the folder path and combines it with the relative path to construct the full path to the context binary file.
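  The path resolution above can be sketched in standard C++. This is a minimal illustration, not ONNX Runtime code: the function name is hypothetical, and a real EP would read `ep.context_file_path` from the session options and `ep_cache_context` from the `EPContext` node attributes.

  ```cpp
  #include <cassert>
  #include <filesystem>
  #include <string>

  namespace fs = std::filesystem;

  // Hypothetical helper: combine the folder of the user-provided
  // `ep.context_file_path` with the relative path stored in the EPContext
  // node's `ep_cache_context` attribute to get the context binary's full path.
  std::string ResolveContextBinaryPath(const std::string& context_file_path,
                                       const std::string& ep_cache_context) {
    // The folder containing the context model anchors the relative path.
    fs::path folder = fs::path(context_file_path).parent_path();
    return (folder / ep_cache_context).lexically_normal().generic_string();
  }

  int main() {
    std::string full =
        ResolveContextBinaryPath("/models/model_ctx.onnx", "model_qnn.bin");
    assert(full == "/models/model_qnn.bin");
    return 0;
  }
  ```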

- **Support for Multiple Primary `EPContext` Nodes (`main_context = 1`)**
- The EP should support multiple primary `EPContext` nodes without any limitations.
- The EP must be capable of loading all EP context binary buffers/files specified in the `ep_cache_context` attributes of the `EPContext` nodes, deserializing them, managing the `ep_graphs`, and selecting the appropriate one for execution.
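  The bookkeeping described above can be sketched as a small registry. This is an illustrative data structure under assumed names (`EpGraph`, `EpGraphRegistry`), not ONNX Runtime API: one entry is registered per primary `EPContext` node, and the appropriate graph is selected by name at execution time.

  ```cpp
  #include <cassert>
  #include <map>
  #include <string>
  #include <vector>

  // Illustrative sketch: an EP keeping deserialized ep_graphs from every
  // primary EPContext node (main_context = 1), keyed by graph name.
  struct EpGraph {
    std::string name;
    std::vector<char> compiled_blob;  // deserialized from ep_cache_context
  };

  class EpGraphRegistry {
   public:
    // Called once per primary EPContext node while loading the model.
    void Register(EpGraph graph) { graphs_[graph.name] = std::move(graph); }

    // Called at execution time to pick the graph backing a partition.
    const EpGraph* Select(const std::string& name) const {
      auto it = graphs_.find(name);
      return it == graphs_.end() ? nullptr : &it->second;
    }

   private:
    std::map<std::string, EpGraph> graphs_;
  };

  int main() {
    EpGraphRegistry registry;
    registry.Register({"ep_graph1", {1, 2}});
    registry.Register({"ep_graph2", {3, 4}});
    assert(registry.Select("ep_graph2") != nullptr);
    assert(registry.Select("missing") == nullptr);
    return 0;
  }
  ```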

- **Error Handling During EP Context Binary Loading**

The EP or its backend SDK should be capable of detecting common failure scenarios (including but not limited to the following). In such cases, the EP should return a status with the `INVALID_GRAPH` status code:

- Detect mismatches between the driver version and the version required by the EP context binary; return an error if they are incompatible.
  - Detect mismatches between the runtime SDK version and the version used to generate the EP context binary; return an error if they are incompatible.
- Return an error if loading the EP context binary fails for any reason.
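  The checks above can be sketched as a validation routine. The status type and version fields here are illustrative assumptions; a real EP would return `onnxruntime::common::Status` with the `INVALID_GRAPH` code and query versions through its backend SDK.

  ```cpp
  #include <cassert>
  #include <string>

  // Minimal stand-ins for the EP's status type (illustrative only).
  enum class StatusCode { OK, INVALID_GRAPH };

  struct Status {
    StatusCode code;
    std::string message;
  };

  // Validate an EP context binary before loading it, per the rules above.
  Status ValidateContextBinary(int binary_sdk_version, int runtime_sdk_version,
                               int binary_driver_version, int driver_version) {
    if (binary_sdk_version != runtime_sdk_version) {
      return {StatusCode::INVALID_GRAPH,
              "EP context binary was generated with a different SDK version"};
    }
    if (binary_driver_version > driver_version) {
      return {StatusCode::INVALID_GRAPH,
              "Driver is older than required by the EP context binary"};
    }
    return {StatusCode::OK, ""};
  }

  int main() {
    assert(ValidateContextBinary(2, 2, 1, 1).code == StatusCode::OK);
    assert(ValidateContextBinary(2, 3, 1, 1).code == StatusCode::INVALID_GRAPH);
    return 0;
  }
  ```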


<p align="center"><img width="60%" src="../../images/EP_context_nodes_with_different_eps.png" alt="EP Context nodes with different EPs"/></p>

### Usage Scenario Code Examples
@@ -253,13 +262,13 @@ Creating a session from a memory buffer of the model causes the session to lose
session1.run(...);
```

## EPContext with Weight Sharing

### Weight Sharing in Onnx Domain
In ONNX, weight sharing refers to multiple ONNX models with external weights pointing to the same external weight file. These models use the same tensor names, allowing them to reference the same tensor data.
<p align="center"><img width="50%" src="../../images/Onnx_weight_sharing.png" alt="Weight sharing across Onnx models"/></p>

### Weight Sharing in EP Domain with EPContext
EP weight sharing is enabled using a pre-generated EP context binary/blob.
To do this, users must **generate the context binary offline** (Ahead Of Time).
- Some EPs require specific platforms, such as **Linux x86_64** and/or **Windows x86_64**. Please refer to the specific EP page for details.
@@ -272,15 +281,15 @@ The EP or backend SDK should be capable of converting and compiling the graph as
- When new graphs are compiled into the EP context, they should reuse existing weights if they are recognized as identical.
For example, in `[model_name]_[ep].bin`, `tensor1_1` from `ep_graph1` and `tensor2_1` from `ep_graph2` are identical and both point to the same data offset, `tensor_data1`.

### EPContext Model Generation with Weight Sharing Workflow
<p align="center"><img width="90%" src="../../images/EP_weight_sharing_workflow.png" alt="Weight sharing workflow"/></p>

Each ONNX Runtime session is associated with an ONNX model. Models that share weights are grouped into a model group, while ONNX Runtime sessions with common properties are organized into a session group. ONNX Runtime introduces two session options: `ep.share_ep_contexts` and `ep.stop_share_ep_contexts` to facilitate session grouping.
- All ONNX Runtime sessions within the session group should have `ep.share_ep_contexts` enabled.
- The final ONNX Runtime session uses `ep.stop_share_ep_contexts` to indicate that it is the last session in the group.
Note: A single ONNX model may contain multiple `EPContext` nodes, depending on the graph partitioning result. However, for simplicity, each model is shown with only one `EPContext` node here.

### Implementation Guidelines for EPContext Model Generation with Weight Sharing
- Shared Workspace Creation:
<br/> The first session creates a shared workspace (e.g., EP Singleton) to share resources with other sessions.
- EP Context Binary File Naming:
@@ -298,7 +307,7 @@ Note: A single ONNX model may contain multiple `EPContext` nodes, depending on t
<br/> For N source models that share weights, a total of N+1 files should be generated.
<br/> The generated files are `model1_ctx.onnx`, `...`, `modeln_ctx.onnx`, `[model1_name]_[ep].bin`.

#### User Code Example
```
Ort::SessionOptions so;

@@ -321,21 +330,21 @@ Note: A single ONNX model may contain multiple `EPContext` nodes, depending on t
Ort::Session session2(env, "model2.onnx", so);
```

#### General Tool for EPContext Model Generation with Weight Sharing
OnnxRuntime provides the [ep_weight_sharing_ctx_gen](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/ep_weight_sharing_ctx_gen) tool to automate the weight-sharing workflow. It is designed specifically for **weight sharing** scenarios and handles the entire `EPContext` model generation process.
Example command line:
```
./ep_weight_sharing_ctx_gen -e qnn -i "soc_model|60 htp_graph_finalization_optimization_mode|3" ./model1.onnx,./model2.onnx
```
It creates two Onnx models (`model1_ctx.onnx`, `model2_ctx.onnx`) and one QNN context binary file (`[model1_name]_[ep].bin`).

### Inference Sessions from EPContext Models with Weight Sharing
To use the dumped EPContext models with weight sharing enabled, ONNX Runtime inference sessions must have **resource sharing** activated. This is done by setting the session option:
```
ep.share_ep_contexts = 1
```

#### Implementation Guidelines for Inferencing from EPContext Models with Weight Sharing
- Create the first OnnxRuntime inference session
- Set session option: `ep.share_ep_contexts=1`.
- Load the `model1_ctx.onnx` model.
@@ -355,7 +364,7 @@ To use the dumped EPContext models with weight sharing enabled, ONNX Runtime inf
- To avoid issues during concurrent execution, it is recommended to **destroy the sessions in reverse order** (i.e., destroy the second session before the first session).
- This ensures proper resource management and prevents potential conflicts with shared resources.

#### User Code Example
```
Ort::SessionOptions so;
// enable ep.share_ep_contexts
3 changes: 2 additions & 1 deletion docs/execution-providers/QNN-ExecutionProvider.md
@@ -149,9 +149,10 @@ Alternatively to setting profiling_level at compile time, profiling can be enabl
|'1'|Enable the QNN HTP shared memory allocator. Requires libcdsprpc.so/dll to be available. [Code example](https://github.com/microsoft/onnxruntime/blob/544bdd60730270f49f6a5baafdff54065f626776/onnxruntime/test/shared_lib/test_inference.cc#L2262-L2354)|

### Run Options

|`"qnn.lora_config"`|Description|
|---|---|
|Config path|LoRAv2 config file path. The format of the config will be mentioned in the **LoraV2 support**.|

## Supported ONNX operators