Start trained model deployment API
Starts a new trained model deployment.
This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.
Request
POST _ml/trained_models/<model_id>/deployment/_start
Prerequisites
Requires the manage_ml cluster privilege. This privilege is included in the machine_learning_admin built-in role.
Description
Currently only PyTorch models are supported for deployment. Once deployed, the model can be used by the Inference processor in an ingest pipeline or directly in the Infer trained model API.
Scaling inference performance can be achieved by setting the parameters number_of_allocations and threads_per_allocation.
Increasing threads_per_allocation means more threads are used when an inference request is processed on a node. This can improve inference speed for certain models. It may also improve throughput.
Increasing number_of_allocations means more threads are used to process multiple inference requests in parallel, which improves throughput. Each model allocation uses a number of threads defined by threads_per_allocation.
Model allocations are distributed across machine learning nodes. All allocations assigned to a node share the same copy of the model in memory. To avoid thread oversubscription, which is detrimental to performance, model allocations are distributed in such a way that the total number of used threads does not surpass the node’s allocated processors.
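The thread arithmetic described above can be sketched in a few lines of Python. This is only an illustration of the stated constraint (allocations on a node times threads_per_allocation must not exceed the node's allocated processors). It is not the assignment logic Elasticsearch actually runs, and the function names are hypothetical.

```python
def total_threads(number_of_allocations: int, threads_per_allocation: int) -> int:
    """Threads used by a deployment across all machine learning nodes."""
    return number_of_allocations * threads_per_allocation


def max_allocations_on_node(threads_per_allocation: int, allocated_processors: int) -> int:
    """Largest number of allocations a node can host without oversubscribing threads."""
    return allocated_processors // threads_per_allocation


def fits_on_node(allocations_on_node: int, threads_per_allocation: int,
                 allocated_processors: int) -> bool:
    """True if the allocations placed on one node stay within its processor budget."""
    return allocations_on_node * threads_per_allocation <= allocated_processors
```

For example, under this simplified model a node with 8 allocated processors could host at most 4 allocations when threads_per_allocation is 2.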
Path parameters
<model_id> - (Required, string) The unique identifier of the trained model.
Query parameters
cache_size - (Optional, byte value) The inference cache size (in memory outside the JVM heap) per node for the model. The default value is the size of the model as reported by the model_size_bytes field in the Get trained models stats. To disable the cache, 0b can be provided.
number_of_allocations - (Optional, integer) The total number of allocations this model is assigned across machine learning nodes. Increasing this value generally increases the throughput. Defaults to 1.
queue_capacity - (Optional, integer) Controls how many inference requests are allowed in the queue at a time. Every machine learning node in the cluster where the model can be allocated has a queue of this size; when the number of requests exceeds the total value, new requests are rejected with a 429 error. Defaults to 1024. Max allowed value is 1000000.
threads_per_allocation - (Optional, integer) Sets the number of threads used by each model allocation during inference. Increasing this value generally increases the speed per inference request. The inference process is a compute-bound process; threads_per_allocation must not exceed the number of available allocated processors per node. Defaults to 1. Must be a power of 2. Max allowed value is 32.
timeout - (Optional, time) Controls the amount of time to wait for the model to deploy. Defaults to 20 seconds.
wait_for - (Optional, string) Specifies the allocation status to wait for before returning. Defaults to started. The value starting indicates deployment is starting but not yet on any node. The value started indicates the model has started on at least one node. The value fully_allocated indicates the deployment has started on all valid nodes.
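The constraints listed for these query parameters can be checked on the client before a request is sent. The following Python sketch builds the request path and query string using only the standard library. The helper name build_start_deployment_path is hypothetical, and the checks cover only the limits stated above (power of 2 and maximum of 32 for threads_per_allocation, maximum of 1000000 for queue_capacity, and the three wait_for values); the server remains the authority on what it accepts.

```python
from urllib.parse import quote, urlencode

VALID_WAIT_FOR = {"starting", "started", "fully_allocated"}


def build_start_deployment_path(model_id, *, cache_size=None, number_of_allocations=None,
                                queue_capacity=None, threads_per_allocation=None,
                                timeout=None, wait_for=None):
    """Return the path and query string for the start deployment request."""
    if threads_per_allocation is not None:
        t = threads_per_allocation
        # A positive power of 2 has exactly one bit set.
        if t < 1 or t > 32 or (t & (t - 1)) != 0:
            raise ValueError("threads_per_allocation must be a power of 2, at most 32")
    if queue_capacity is not None and queue_capacity > 1_000_000:
        raise ValueError("queue_capacity must not exceed 1000000")
    if wait_for is not None and wait_for not in VALID_WAIT_FOR:
        raise ValueError("wait_for must be one of: " + ", ".join(sorted(VALID_WAIT_FOR)))

    params = {
        "cache_size": cache_size,
        "number_of_allocations": number_of_allocations,
        "queue_capacity": queue_capacity,
        "threads_per_allocation": threads_per_allocation,
        "timeout": timeout,
        "wait_for": wait_for,
    }
    # Omit unset parameters so the server-side defaults apply.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = "_ml/trained_models/" + quote(model_id, safe="") + "/deployment/_start"
    return path + "?" + query if query else path
```

Leaving a parameter unset omits it from the query string, so the documented defaults (for example 1 allocation and a queue capacity of 1024) are left to the server.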
Examples
The following example starts a new deployment for the elastic__distilbert-base-uncased-finetuned-conll03-english trained model:
POST _ml/trained_models/elastic__distilbert-base-uncased-finetuned-conll03-english/deployment/_start?wait_for=started&timeout=1m
The API returns the following results:
{
  "assignment": {
    "task_parameters": {
      "model_id": "elastic__distilbert-base-uncased-finetuned-conll03-english",
      "model_bytes": 265632637,
      "threads_per_allocation": 1,
      "number_of_allocations": 1,
      "queue_capacity": 1024
    },
    "routing_table": {
      "uckeG3R8TLe2MMNBQ6AGrw": {
        "routing_state": "started",
        "reason": ""
      }
    },
    "assignment_state": "started",
    "start_time": "2022-11-02T11:50:34.766591Z"
  }
}
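A client will often want to confirm which nodes report the model as started. The Python sketch below reads the response shown above and lists the node IDs whose routing_state is started. The helper name started_nodes is hypothetical, and the sample is a trimmed copy of the response from this example rather than a complete schema.

```python
import json


def started_nodes(response: dict) -> list:
    """Return the node IDs in the routing table whose routing_state is 'started'."""
    routing_table = response.get("assignment", {}).get("routing_table", {})
    return [node_id for node_id, entry in routing_table.items()
            if entry.get("routing_state") == "started"]


# Trimmed copy of the example response above.
sample = json.loads("""
{
  "assignment": {
    "routing_table": {
      "uckeG3R8TLe2MMNBQ6AGrw": {"routing_state": "started", "reason": ""}
    },
    "assignment_state": "started",
    "start_time": "2022-11-02T11:50:34.766591Z"
  }
}
""")

nodes = started_nodes(sample)
```

With wait_for=started, the call returns once at least one node has started the model, so a check like this is mainly useful when the deployment may still be spreading to other nodes.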