Create an Elasticsearch inference endpoint | Elasticsearch API documentation
Path parameters
- The type of the inference task that the model will perform. Values are `rerank`, `sparse_embedding`, or `text_embedding`.
- The unique identifier of the inference endpoint. It must not match the `model_id`.
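Putting the path parameters together, a create request is a `PUT` to `_inference/<task_type>/<inference_id>`. For example (the endpoint name `my-e5-endpoint` is illustrative):

```console
PUT _inference/text_embedding/my-e5-endpoint
```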
Body (application/json)
- `chunking_settings` (object):
  - The maximum size of a chunk in words. This value cannot be higher than `300` or lower than `20` (for the `sentence` strategy) or `10` (for the `word` strategy).
  - The number of overlapping words for chunks. It is applicable only to a `word` chunking strategy. This value cannot be higher than half the `max_chunk_size` value.
  - The number of overlapping sentences for chunks. It is applicable only for a `sentence` chunking strategy. It can be either `1` or `0`.
  - The chunking strategy: `sentence` or `word`.
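As a sketch, the settings above map onto a `chunking_settings` object like the following (the field names `strategy`, `max_chunk_size`, and `sentence_overlap` are assumed to correspond to the descriptions above; verify against your Elasticsearch version):

```json
{
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
```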
- `service_settings` (object):
  - `adaptive_allocations` (object):
    - Turn on `adaptive_allocations`.
    - The maximum number of allocations to scale to. If set, it must be greater than or equal to `min_number_of_allocations`.
    - The minimum number of allocations to scale to. If set, it must be greater than or equal to 0. If not defined, the deployment scales to 0.
  - The deployment identifier for a trained model deployment. When `deployment_id` is used, the `model_id` is optional.
  - The name of the model to use for the inference task. It can be the ID of a built-in model (for example, `.multilingual-e5-small` for E5) or a text embedding model that was uploaded by using the Eland client.
  - The total number of allocations that are assigned to the model across machine learning nodes. Increasing this value generally increases the throughput. If adaptive allocations are enabled, do not set this value because it is set automatically.
  - The number of threads used by each model allocation during inference. This setting generally increases the speed per inference request. The inference process is a compute-bound process; `threads_per_allocation` must not exceed the number of available allocated processors per node. The value must be a power of 2. The maximum value is 32.
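A minimal sketch of a create request exercising these service settings, assuming the `elasticsearch` service with the built-in E5 model and adaptive allocations enabled (the endpoint name is illustrative, and the `num_threads` and `adaptive_allocations.enabled` field names are assumed to match the descriptions above):

```console
PUT _inference/text_embedding/my-e5-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".multilingual-e5-small",
    "num_threads": 1,
    "adaptive_allocations": {
      "enabled": true,
      "min_number_of_allocations": 1,
      "max_number_of_allocations": 4
    }
  }
}
```

Because adaptive allocations are enabled here, the total allocation count is omitted and managed automatically.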
- `task_settings` (object):
  - For a `rerank` task, return the document instead of only the index.
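For a rerank endpoint, the document-return behaviour described above can be sketched as a `task_settings` object. The `return_documents` field name and the `.rerank-v1` built-in model ID are assumptions for illustration:

```console
PUT _inference/rerank/my-rerank-endpoint
{
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".rerank-v1",
    "num_allocations": 1,
    "num_threads": 1
  },
  "task_settings": {
    "return_documents": true
  }
}
```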
Responses
- 200 application/json
  - `chunking_settings` (object):
    - The maximum size of a chunk in words. This value cannot be higher than `300` or lower than `20` (for the `sentence` strategy) or `10` (for the `word` strategy).
    - The number of overlapping words for chunks. It is applicable only to a `word` chunking strategy. This value cannot be higher than half the `max_chunk_size` value.
    - The number of overlapping sentences for chunks. It is applicable only for a `sentence` chunking strategy. It can be either `1` or `0`.
    - The chunking strategy: `sentence` or `word`.
  - The service type.
  - The inference ID.
  - The task type. Values are `sparse_embedding`, `text_embedding`, `rerank`, `completion`, or `chat_completion`.
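An illustrative 200 response echoing the created endpoint's configuration might look like the following (values are examples, not guaranteed output):

```json
{
  "inference_id": "my-e5-endpoint",
  "task_type": "text_embedding",
  "service": "elasticsearch",
  "service_settings": {
    "model_id": ".multilingual-e5-small",
    "num_allocations": 1,
    "num_threads": 1
  },
  "chunking_settings": {
    "strategy": "sentence",
    "max_chunk_size": 250,
    "sentence_overlap": 1
  }
}
```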