# Scale Launch
Simple, scalable, high-performance ML service deployment in Python.
## Example
Launch Usage
```python
import os
import time

from launch import LaunchClient
from launch import EndpointRequest
from pydantic import BaseModel
from rich import print


# Request/response schemas for the endpoint, defined with pydantic.
class MyRequestSchema(BaseModel):
    x: int
    y: str


class MyResponseSchema(BaseModel):
    __root__: int


def my_load_predict_fn(model):
    def returns_model_of_x_plus_len_of_y(x: int, y: str) -> int:
        """MyRequestSchema -> MyResponseSchema"""
        assert isinstance(x, int) and isinstance(y, str)
        return model(x) + len(y)

    return returns_model_of_x_plus_len_of_y


def my_load_model_fn():
    def my_model(x):
        return x * 2

    return my_model


BUNDLE_PARAMS = {
    "model_bundle_name": "test-bundle",
    "load_predict_fn": my_load_predict_fn,
    "load_model_fn": my_load_model_fn,
    "request_schema": MyRequestSchema,
    "response_schema": MyResponseSchema,
    "requirements": ["pytest==7.2.1", "numpy"],  # list your requirements here
    "pytorch_image_tag": "1.7.1-cuda11.0-cudnn8-runtime",
}

ENDPOINT_PARAMS = {
    "endpoint_name": "demo-endpoint",
    "model_bundle": "test-bundle",
    "cpus": 1,
    "min_workers": 0,
    "endpoint_type": "async",
    "update_if_exists": True,
    "labels": {
        "team": "MY_TEAM",
        "product": "launch",
    },
}


def predict_on_endpoint(request: MyRequestSchema) -> MyResponseSchema:
    # Wait for the endpoint to be ready before submitting a task.
    endpoint = client.get_model_endpoint(endpoint_name="demo-endpoint")
    while endpoint.status() != "READY":
        time.sleep(10)

    endpoint_request = EndpointRequest(args=request.dict(), return_pickled=False)

    # Async endpoints return a future; block on it for the result.
    future = endpoint.predict(request=endpoint_request)
    raw_response = future.get()

    response = MyResponseSchema.parse_raw(raw_response.result)
    return response


client = LaunchClient(api_key=os.getenv("LAUNCH_API_KEY"))

client.create_model_bundle_from_callable_v2(**BUNDLE_PARAMS)
endpoint = client.create_model_endpoint(**ENDPOINT_PARAMS)

request = MyRequestSchema(x=5, y="hello")
response = predict_on_endpoint(request)
print(response)
"""
MyResponseSchema(__root__=15)
"""
```
What's going on here:

- First, we use pydantic to define our request and response schemas, `MyRequestSchema` and `MyResponseSchema`. These schemas are used to generate the API documentation for our models.
- Next, we define the model and the `load_predict_fn`, which tells Launch how to load our model and how to make predictions with it. In this case, we return a function that adds the length of the string `y` to `model(x)`, where `model` doubles the integer `x` (see the local sanity check after this list).
- We then define the model bundle by specifying the `load_predict_fn`, the `request_schema`, and the `response_schema`. We also specify the `pytorch_image_tag`, which tells Launch which base image to run the bundle in. In this case, we're using a PyTorch image.
- Next, we create the model endpoint, which is the API that we'll use to make predictions. We specify the `model_bundle` that we created above, and the `endpoint_type`, which tells Launch whether to use a synchronous or asynchronous endpoint. In this case, we're using an asynchronous endpoint, which means that a prediction request returns immediately with a `future` object that we can use to get the result later.
- Finally, we make a prediction by calling `predict_on_endpoint` with a `MyRequestSchema` object. This function waits for the endpoint to be ready, submits a prediction request, waits for the result, and returns it.
Notice that we specified `min_workers=0`, meaning that the endpoint will scale down to zero workers when it's not being used.
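Scaling to zero saves cost, but the first request after an idle period hits a cold start, which is why `predict_on_endpoint` polls for `READY` before submitting. If you want the endpoint to also scale up under load, here is a hedged sketch of the endpoint parameters; note that `max_workers` and `per_worker` are assumptions not shown in the example above, so check the `create_model_endpoint` signature in your installed client version:

```python
# Hypothetical autoscaling configuration. max_workers and per_worker are
# assumed parameter names (not used in the example above); verify them
# against your client's create_model_endpoint signature before relying
# on this.
AUTOSCALING_ENDPOINT_PARAMS = {
    "endpoint_name": "demo-endpoint",
    "model_bundle": "test-bundle",
    "cpus": 1,
    "min_workers": 0,   # scale to zero when idle, as in the example
    "max_workers": 4,   # assumed: upper bound on workers under load
    "per_worker": 2,    # assumed: target concurrent requests per worker
    "endpoint_type": "async",
    "update_if_exists": True,
    "labels": {"team": "MY_TEAM", "product": "launch"},
}

endpoint = client.create_model_endpoint(**AUTOSCALING_ENDPOINT_PARAMS)
```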
## Installation
To use Scale Launch, first install it using `pip`:
Installation
```bash
pip install -U scale-launch
```
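To confirm the package installed correctly, a quick import check helps; note that the PyPI distribution is `scale-launch`, but the module you import is `launch`, as in the example above:

```python
# Verify the installation: the distribution is scale-launch on PyPI,
# but the module you import is launch.
import launch
from launch import LaunchClient, EndpointRequest

print(launch.__name__)  # -> "launch"
```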