fix infinite loop in profile picker and switch predictor-based routing to on by default, with a header to disable (#1929)
* fix infinite loop in profile picker when using latency routing with predictor-based scheduling off
* add fix in ProcessResults
* Fix type for lint
* Fix prefix cache not being ordered properly in profile picker, and set predictor scheduling to true instead of false when no flag is present
* Change predictor-based scheduling header to one that disables it, and make it on by default when deploying with latency-based routing
* Move slo profile handler into slo routing package
* Add slo aware handler
logger.V(logutil.DEBUG).Error(errutil.Error{Code: errutil.BadRequest, Msg: fmt.Sprintf("%v must be a float: %v", tpotSLOHeaderKey, err)}, "SLOAwareRouter: Error parsing TPOT SLO from header")
logger.V(logutil.DEBUG).Error(errutil.Error{Code: errutil.BadRequest, Msg: fmt.Sprintf("x-prediction-based-scheduling must be a bool: %v", err)}, "SLOAwareRouter: Error parsing PredictorBasedScheduling from header")
logger.V(logutil.DEBUG).Error(err, "error parsing predictorBasedScheduling from header; failed to choose scheduling profile: x-prediction-based-scheduling must be a bool")
return nil, fmt.Errorf("error parsing predictorBasedScheduling from header; failed to choose scheduling profile: x-prediction-based-scheduling must be a bool: %v", err)
site-src/guides/latency-based-predictor.md (+3 −3)
@@ -8,7 +8,7 @@ Latency-based routing is a feature of the Inference Gateway that enables intelli
The latency-based routing feature is implemented as a plugin for the Endpoint Picker (EPP). When a request is received, the plugin performs the following steps:
-1. **SLO Extraction**: The plugin extracts the TTFT and TPOT SLOs from the request headers (`x-slo-ttft-ms` and `x-slo-tpot-ms`). It also checks for the `x-prediction-based-scheduling` header to determine if latency-based routing should be used for this request.
+1. **SLO Extraction**: The plugin extracts the TTFT and TPOT SLOs from the request headers (`x-slo-ttft-ms` and `x-slo-tpot-ms`). It also checks for the `x-prediction-based-scheduling-off` header to determine if latency-based routing should be used for this request.
2. **Latency Prediction**: The plugin uses a latency predictor, deployed as a set of sidecar containers alongside the EPP, to predict the TTFT and TPOT for the request on each of the available model servers. The prediction is based on the current state of the server, including its KV cache utilization and the number of running and waiting requests.
@@ -22,7 +22,7 @@ The latency-based routing feature is implemented as a plugin for the Endpoint Pi
To use latency-based routing, you need to include the following headers in your inference requests:
-- `x-prediction-based-scheduling`: Set to `true` to enable latency-based routing for the request; setting this to `false` or omitting the header will use non-SLO routing, but will still use the latency data to train the predictor.
+- `x-prediction-based-scheduling-off`: Include this header to disable predictive routing for that specific request. If omitted, predictive routing is enabled by default.
- `x-slo-ttft-ms`: The Time to First Token SLO in milliseconds.
- `x-slo-tpot-ms`: The Time Per Output Token SLO in milliseconds (this is vLLM's equivalent of ITL; it is **not** NTPOT).
@@ -78,7 +78,7 @@ If you have a standard setup using the [Getting Started Guide](getting-start
```txt
export GW_IP=$(kubectl get gateway/inference-gateway -o jsonpath='{.status.addresses[0].value}'):80
```