# Scale Launch
Simple, scalable, high-performance ML service deployment in Python.
## Example
Launch Usage
```python
import os
import time

from launch import LaunchClient
from launch import EndpointRequest
from pydantic import BaseModel
from rich import print


# Request/response schemas for the endpoint, defined with pydantic.
class MyRequestSchema(BaseModel):
    x: int
    y: str


class MyResponseSchema(BaseModel):
    __root__: int


def my_load_predict_fn(model):
    def returns_model_of_x_plus_len_of_y(x: int, y: str) -> int:
        """MyRequestSchema -> MyResponseSchema"""
        assert isinstance(x, int) and isinstance(y, str)
        return model(x) + len(y)

    return returns_model_of_x_plus_len_of_y


def my_load_model_fn():
    def my_model(x):
        return x * 2

    return my_model


BUNDLE_PARAMS = {
    "model_bundle_name": "test-bundle",
    "load_predict_fn": my_load_predict_fn,
    "load_model_fn": my_load_model_fn,
    "request_schema": MyRequestSchema,
    "response_schema": MyResponseSchema,
    "requirements": ["pytest==7.2.1", "numpy"],  # list your requirements here
    "pytorch_image_tag": "1.7.1-cuda11.0-cudnn8-runtime",
}

ENDPOINT_PARAMS = {
    "endpoint_name": "demo-endpoint",
    "model_bundle": "test-bundle",
    "cpus": 1,
    "min_workers": 0,
    "endpoint_type": "async",
    "update_if_exists": True,
    "labels": {
        "team": "MY_TEAM",
        "product": "launch",
    },
}


def predict_on_endpoint(request: MyRequestSchema) -> MyResponseSchema:
    # Wait for the endpoint to be ready before submitting a task.
    endpoint = client.get_model_endpoint(endpoint_name="demo-endpoint")
    while endpoint.status() != "READY":
        time.sleep(10)

    endpoint_request = EndpointRequest(args=request.dict(), return_pickled=False)

    # Async endpoints return a future; block on it for the result.
    future = endpoint.predict(request=endpoint_request)
    raw_response = future.get()

    response = MyResponseSchema.parse_raw(raw_response.result)
    return response


client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))

client.create_model_bundle_from_callable_v2(**BUNDLE_PARAMS)
endpoint = client.create_model_endpoint(**ENDPOINT_PARAMS)

request = MyRequestSchema(x=5, y="hello")
response = predict_on_endpoint(request)
print(response)
"""
MyResponseSchema(__root__=15)
"""
```
What's going on here:

- First, we use pydantic to define our request and response schemas, `MyRequestSchema` and `MyResponseSchema`. These schemas are used to generate the API documentation for our models.
- Next, we define the model and the `load_predict_fn`, which tells Launch how to load our model and how to make predictions with it. In this case, we return a function that adds the length of the string `y` to `model(x)`, where `model` doubles the integer `x` (see the local sanity check after this list).
- We then define the model bundle by specifying the `load_predict_fn`, the `request_schema`, and the `response_schema`. We also specify the `pytorch_image_tag`, which tells Launch which base image to run the bundle in. In this case, we're using a PyTorch image.
- Next, we create the model endpoint, which is the API that we'll use to make predictions. We specify the `model_bundle` that we created above, and the `endpoint_type`, which tells Launch whether to use a synchronous or asynchronous endpoint. In this case, we're using an asynchronous endpoint, which means that a prediction request returns immediately with a `future` object that we can use to get the result later.
- Finally, we make a prediction by calling `predict_on_endpoint` with a `MyRequestSchema` object. This function waits for the endpoint to be ready, submits a prediction request, waits for the result, and returns it.
Notice that we specified `min_workers=0`, meaning that the endpoint will scale down to zero workers when it's not being used.
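Scaling to zero saves cost, but the first request after an idle period hits a cold start, which is why `predict_on_endpoint` polls for `READY` before submitting. If you want the endpoint to also scale up under load, here is a hedged sketch of the endpoint parameters; note that `max_workers` and `per_worker` are assumptions not shown in the example above, so check the `create_model_endpoint` signature in your installed client version:

```python
# Hypothetical autoscaling configuration. max_workers and per_worker are
# assumed parameter names (not used in the example above); verify them
# against your client's create_model_endpoint signature before relying
# on this.
AUTOSCALING_ENDPOINT_PARAMS = {
    "endpoint_name": "demo-endpoint",
    "model_bundle": "test-bundle",
    "cpus": 1,
    "min_workers": 0,   # scale to zero when idle, as in the example
    "max_workers": 4,   # assumed: upper bound on workers under load
    "per_worker": 2,    # assumed: target concurrent requests per worker
    "endpoint_type": "async",
    "update_if_exists": True,
    "labels": {"team": "MY_TEAM", "product": "launch"},
}

endpoint = client.create_model_endpoint(**AUTOSCALING_ENDPOINT_PARAMS)
```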
## Installation
To use Scale Launch, first install it using `pip`:
Installation
```bash
pip install -U scale-launch
```
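To confirm the package installed correctly, a quick import check helps; note that the PyPI distribution is `scale-launch`, but the module you import is `launch`, as in the example above:

```python
# Verify the installation: the distribution is scale-launch on PyPI,
# but the module you import is launch.
import launch
from launch import LaunchClient, EndpointRequest

print(launch.__name__)  # -> "launch"
```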