Serverless AI API

The nature of AI and LLM workloads on already-trained models lends itself naturally to a serverless-style architecture. As a framework for building and deploying serverless applications, Spin provides an interface for performing AI inference within Spin applications.

Using Serverless AI From Applications

Configuration

By default, a given component of a Spin application will not have access to any Serverless AI models. Access must be granted explicitly via the Spin application’s manifest (the spin.toml file). For example, an individual component in a Spin application could be given access to the codellama-instruct model by adding the following ai_models configuration inside the specific [component.(name)] section:

# -- snip --

[component.please-send-the-codes]
ai_models = ["codellama-instruct"]

# -- snip --

Spin supports models of the Llama architecture for inferencing, and the all-minilm-l6-v2 model for generating embeddings.
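A component may be granted several models at once, for example an inferencing model alongside the embeddings model. A hypothetical sketch (the component name is illustrative, not from the example above):

```toml
[component.chat-and-search]
ai_models = ["llama2-chat", "all-minilm-l6-v2"]
```

Access is per component: other components in the same application cannot use these models unless their own sections also list them.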

File Structure

By default, the Spin framework expects any already-trained model files (configured as per the previous section) to be downloaded by the user and made available inside a .spin/ai-models/ directory of a given application. Within the .spin/ai-models directory, models of the same architecture (e.g. llama) must be grouped under a directory with the same name as the architecture. Within an architecture directory, each individual model (e.g. llama2-chat, codellama-instruct) must be placed under a folder with the same name as the model. So for any given model, the files for the model are placed in the directory .spin/ai-models/<architecture>/<model>. For example:

code-generator-rs/.spin/ai-models/llama/codellama-instruct/safetensors
code-generator-rs/.spin/ai-models/llama/codellama-instruct/config.json
code-generator-rs/.spin/ai-models/llama/codellama-instruct/tokenizer.json

See the Serverless AI Tutorial documentation for more concrete examples of implementing the Fermyon Serverless AI API in your favorite language.

For embeddings models, it is expected that both a tokenizer.json and a model.safetensors file are located in a directory named after the model. For example, for the foo-bar-baz model, Spin will look for tokenizer.json and model.safetensors in the .spin/ai-models/foo-bar-baz directory.
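Because Spin resolves model files purely by this directory convention, a quick pre-flight check can catch misplaced files before running the application. A minimal sketch (the helper names and required-file lists are assumptions based on the layout described above, not part of the Spin CLI):

```python
from pathlib import Path

# Files expected for each kind of model (assumed from the layout above).
LLAMA_FILES = {"safetensors", "config.json", "tokenizer.json"}
EMBEDDING_FILES = {"model.safetensors", "tokenizer.json"}

def check_llama_model(app_root: str, model: str) -> list[str]:
    """Return the names of any missing files for a Llama-architecture model."""
    model_dir = Path(app_root) / ".spin" / "ai-models" / "llama" / model
    present = {p.name for p in model_dir.glob("*")} if model_dir.is_dir() else set()
    return sorted(LLAMA_FILES - present)

def check_embedding_model(app_root: str, model: str) -> list[str]:
    """Return the names of any missing files for an embeddings model."""
    model_dir = Path(app_root) / ".spin" / "ai-models" / model
    present = {p.name for p in model_dir.glob("*")} if model_dir.is_dir() else set()
    return sorted(EMBEDDING_FILES - present)
```

Running check_llama_model("code-generator-rs", "codellama-instruct") against the layout above returns an empty list; any missing file names are returned so they can be reported.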

Serverless AI Interface

The Spin SDK surfaces the Serverless AI interface to a variety of different languages. See the Language Support Overview to see if your specific language is supported.

The set of operations is common across all supporting language SDKs:

| Operation | Parameters | Returns | Behavior |
| --- | --- | --- | --- |
| infer | model (string), prompt (string) | string | Performs inference on the named model (e.g. llama2-chat, codellama-instruct, or other). The second parameter is the prompt. Returns the generated text as a string. |
| infer_with_options | model (string), prompt (string), params (list) | string | As infer, but takes a third parameter of inferencing options (listed below). Returns the generated text as a string. |
| generate-embeddings | model (string), prompt (list of strings) | two-dimensional array of float32 values | Generates embeddings on the named model (e.g. all-minilm-l6-v2) for each string in the list. Returns a two-dimensional array containing float32 values only, one row per input string. |

The params argument to infer_with_options is a mix of floats and unsigned integers, in this order:

- max-tokens (unsigned 32-bit integer): The maximum number of tokens to generate. Note: the backing implementation may return fewer tokens. Default: 100
- repeat-penalty (32-bit float): The amount the model should avoid repeating tokens. Default: 1.1
- repeat-penalty-last-n-token-count (unsigned 32-bit integer): The number of tokens the model should apply the repeat penalty to. Default: 64
- temperature (32-bit float): The randomness with which the next token is selected. Default: 0.8
- top-k (unsigned 32-bit integer): The number of possible next tokens the model will choose from. Default: 40
- top-p (32-bit float): The cumulative probability of the next tokens the model will choose from. Default: 0.9
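The option defaults above can be captured in a small options object so callers only override what they need. This is a language-neutral sketch of the defaults listed in the table, not the SDK's own type:

```python
from dataclasses import dataclass

@dataclass
class InferencingParams:
    """Default inferencing options, mirroring the defaults in the table above."""
    max_tokens: int = 100          # backing implementation may return fewer
    repeat_penalty: float = 1.1    # how strongly to avoid repeating tokens
    repeat_penalty_last_n_token_count: int = 64
    temperature: float = 0.8       # randomness of next-token selection
    top_k: int = 40                # number of candidate next tokens
    top_p: float = 0.9             # cumulative probability cutoff

# Override only what you need; the rest keep their defaults.
params = InferencingParams(max_tokens=400)
```

This mirrors the pattern the language SDKs use: pass explicit options to infer_with_options, or use plain infer to accept all defaults.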

The exact detail of calling these operations from your application depends on your language:

Want to go straight to the reference documentation? Find it here.

Rust

To use Serverless AI functions from Rust, the llm module of the Spin SDK provides the methods. The following snippet is from the Rust code generation example:

use spin_sdk::{
    http::{IntoResponse, Request, Response},
    llm,
};

// -- snip --

fn handle_code(req: Request) -> anyhow::Result<impl IntoResponse> {
    // -- snip --

    let result = llm::infer_with_options(
        llm::InferencingModel::CodellamaInstruct,
        &prompt,
        llm::InferencingParams {
            max_tokens: 400,
            repeat_penalty: 1.1,
            repeat_penalty_last_n_token_count: 64,
            temperature: 0.8,
            top_k: 40,
            top_p: 0.9,
        },
    )?;

    // -- snip --
}


TypeScript

To use Serverless AI functions from TypeScript, the @spinframework/spin-llm package provides two methods: infer and generateEmbeddings. For example:

import { AutoRouter } from 'itty-router';
import { InferencingModels, EmbeddingModels, infer, generateEmbeddings } from '@spinframework/spin-llm';

let router = AutoRouter();

router
    .get("/", () => {
        const prompt = "Tell me a joke.";
        let embeddings = generateEmbeddings(EmbeddingModels.AllMiniLmL6V2, ["someString"]);
        console.log(embeddings.embeddings);
        let result = infer(InferencingModels.Llama2Chat, prompt);

        return new Response(result.text);
    });

//@ts-ignore
addEventListener('fetch', async (event: FetchEvent) => {
    event.respondWith(router.fetch(event.request));
});


Python

To use Serverless AI functions from Python, the llm module of the Spin SDK provides the methods. For example:

from spin_sdk import http, llm
from spin_sdk.http import Request, Response

class IncomingHandler(http.IncomingHandler):
    def handle_request(self, request: Request) -> Response:
        prompt="You are a stand up comedy writer. Tell me a joke."
        result = llm.infer("llama2-chat", prompt)
        return Response(200,
                        {"content-type": "application/json"},
                        bytes(result.text, "utf-8"))
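The vectors returned by generate-embeddings are typically compared with cosine similarity, for example to rank documents against a query. A minimal, SDK-free sketch (the example vectors are made up; real all-minilm-l6-v2 embeddings have many more dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional embeddings.
query = [0.1, 0.2, 0.3]
doc = [0.1, 0.2, 0.3]
print(cosine_similarity(query, doc))  # identical vectors score ~1.0
```

In practice the rows of the two-dimensional array returned by generate-embeddings would be passed to a function like this, and the inputs with the highest similarity to the query kept.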


Go

In Go, Serverless AI functions are available in the github.com/fermyon/spin/sdk/go/v2/llm package. See Go Packages for reference documentation. For example:

package main

import (
    "fmt"
    "net/http"

    spinhttp "github.com/fermyon/spin/sdk/go/v2/http"
    "github.com/fermyon/spin/sdk/go/v2/llm"
)

func init() {
    spinhttp.Handle(func(w http.ResponseWriter, r *http.Request) {
        result, err := llm.Infer("llama2-chat", "What is a good prompt?", nil)
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        fmt.Printf("Prompt tokens:    %d\n", result.Usage.PromptTokenCount)
        fmt.Printf("Generated tokens: %d\n", result.Usage.GeneratedTokenCount)
        fmt.Fprintf(w, "%s\n", result.Text)

        embeddings, err := llm.GenerateEmbeddings("all-minilm-l6-v2", []string{"Hello world"})
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        fmt.Printf("Prompt Tokens: %d\n", embeddings.Usage.PromptTokenCount)
        fmt.Printf("%v\n", embeddings.Embeddings)
    })
}


Troubleshooting

Error “Local LLM operations are not supported in this version of Spin”

If you see “Local LLM operations are not supported in this version of Spin”, then your copy of Spin has been built without local LLM support.

The term “version” in the error message refers to how the software you are using built the Spin runtime, not to the numeric version of the runtime itself.

Most Spin builds support local LLMs as described above. However, the local inferencing engine built into Spin does not build on some platform combinations (for example, there are known problems with the aarch64/musl combination). This may cause some environments that embed Spin to disable the local LLM feature altogether. (For example, some versions of the containerd-spin-shim did this.) In such cases, you will see the error above.

In such cases, you can: