Using LlamaChatSession

To chat with a text generation model, you can use the LlamaChatSession class.

Here are usage examples of LlamaChatSession:

Simple Chatbot

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Specific Chat Wrapper

To learn more about chat wrappers, see the chat wrapper guide.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, GeneralChatWrapper} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    chatWrapper: new GeneralChatWrapper()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);


const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Response Streaming

You can see all the possible options of the prompt function here.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

process.stdout.write("AI: ");
const a1 = await session.prompt(q1, {
    onTextChunk(chunk: string) {
        process.stdout.write(chunk);
    }
});

To stream thought segments, see Stream Response Segments.

Repeat Penalty Customization

You can see all the possible options of the prompt function here.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, Token} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Write a poem about llamas";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    repeatPenalty: {
        lastTokens: 24,
        penalty: 1.12,
        penalizeNewLine: true,
        frequencyPenalty: 0.02,
        presencePenalty: 0.02,
        punishTokensFilter(tokens: Token[]) {
            return tokens.filter(token => {
                const text = model.detokenize([token]);

                // allow the model to repeat tokens
                // that contain the word "better"
                return !text.toLowerCase().includes("better");
            });
        }
    }
});
console.log("AI: " + a1);
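
The filtering logic inside punishTokensFilter can be tried in isolation. The following is a minimal sketch that swaps model.detokenize for a hypothetical lookup table (fakeVocabulary), so it runs without a model. The token IDs, the vocabulary, and the helper name keepPunishableTokens are assumptions made for illustration only and are not part of node-llama-cpp.

```typescript
// Hypothetical stand-in for a model vocabulary: token ID -> text
const fakeVocabulary: Record<number, string> = {
    1: "The",
    2: " better",
    3: " llama",
    4: " Better"
};

// Hypothetical replacement for `model.detokenize()`, for demonstration only
function fakeDetokenize(tokens: number[]): string {
    return tokens.map((token) => fakeVocabulary[token] ?? "").join("");
}

// Returns the tokens that may still be penalized for repetition.
// Tokens whose text contains `exemptWord` are removed from the list,
// so repeating them is not penalized.
function keepPunishableTokens(
    tokens: number[],
    detokenize: (tokens: number[]) => string,
    exemptWord: string
): number[] {
    const needle = exemptWord.toLowerCase();

    return tokens.filter((token) => {
        const text = detokenize([token]);
        return !text.toLowerCase().includes(needle);
    });
}

const punishable = keepPunishableTokens([1, 2, 3, 4], fakeDetokenize, "better");
console.log(punishable);
```

In this sketch, tokens 2 and 4 are dropped from the returned list because their text contains "better" in some letter case, which mirrors the intent of the comment in the example above.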

Custom Temperature

Setting the temperature option is useful for controlling the randomness of the model's responses.

A temperature of 0 (the default) will ensure the model response is always deterministic for a given prompt.

The randomness of the temperature can be controlled by the seed parameter. Setting a specific seed and a specific temperature will yield the same response every time for the same input.

You can see the description of the prompt function options here.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    temperature: 0.8,
    topK: 40,
    topP: 0.02,
    seed: 2462
});
console.log("AI: " + a1);
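
To make the relationship between temperature and seed more concrete, here is a minimal, library-independent sketch of temperature sampling. It is not how node-llama-cpp or llama.cpp sample internally (real samplers also apply options such as topK and topP); sampleIndex, sampleMany, and the mulberry32 generator are hypothetical helpers chosen only to illustrate the idea that a temperature of 0 is greedy and that a fixed seed makes random picks reproducible.

```typescript
// A small seeded pseudo-random generator (mulberry32),
// used here only so that the sketch is reproducible
function mulberry32(seed: number): () => number {
    let state = seed >>> 0;

    return () => {
        state = (state + 0x6D2B79F5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Picks an index from raw scores (logits).
// temperature === 0: always the highest score (greedy, deterministic)
// temperature > 0: sample from softmax(logits / temperature)
function sampleIndex(logits: number[], temperature: number, random: () => number): number {
    if (temperature === 0)
        return logits.indexOf(Math.max(...logits));

    const scaled = logits.map((logit) => logit / temperature);
    const maxScaled = Math.max(...scaled);
    const weights = scaled.map((value) => Math.exp(value - maxScaled));
    const total = weights.reduce((sum, weight) => sum + weight, 0);

    let threshold = random() * total;
    for (let i = 0; i < weights.length; i++) {
        threshold -= weights[i];
        if (threshold <= 0)
            return i;
    }

    return weights.length - 1;
}

// Hypothetical scores for three candidate tokens
const logits = [2.0, 1.5, 0.3];

function sampleMany(seed: number, temperature: number, count: number): number[] {
    const random = mulberry32(seed);
    return Array.from({length: count}, () => sampleIndex(logits, temperature, random));
}

console.log(sampleMany(2462, 0.8, 10));
console.log(sampleMany(2462, 0.8, 10)); // same seed and temperature: same picks as the line above
console.log(sampleMany(1, 0, 5)); // temperature 0: always index 0, the highest score
```

A higher temperature flattens the weights, so lower-scored candidates are picked more often; the seed only determines which random numbers are drawn.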

JSON Response

To learn more about grammars, see the grammar guide.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});
const grammar = await llama.getGrammarFor("json");


const q1 = 'Create a JSON that contains a message saying "hi there"';
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    grammar,
    maxTokens: context.contextSize
});
console.log("AI: " + a1);
console.log(JSON.parse(a1));


const q2 = 'Add another field to the JSON with the key being "author" ' +
    'and the value being "Llama"';
console.log("User: " + q2);

const a2 = await session.prompt(q2, {
    grammar,
    maxTokens: context.contextSize
});
console.log("AI: " + a2);
console.log(JSON.parse(a2));
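
JSON.parse throws on malformed input. One way this could happen is if the text being parsed was cut short, for example if generation stopped at a token limit before the JSON was complete; that failure mode is an assumption here, not something the example above demonstrates. A minimal sketch of a defensive wrapper follows, where tryParseJson and ParseResult are hypothetical names:

```typescript
type ParseResult =
    | {ok: true, value: unknown}
    | {ok: false, error: string};

// Parses JSON text without throwing; the caller decides how to handle a failure
function tryParseJson(text: string): ParseResult {
    try {
        return {ok: true, value: JSON.parse(text)};
    } catch (err) {
        return {ok: false, error: err instanceof Error ? err.message : String(err)};
    }
}

const completeResult = tryParseJson('{"message": "hi there"}');
const truncatedResult = tryParseJson('{"message": "hi th');

console.log(completeResult.ok, truncatedResult.ok);
```

With a wrapper like this, a failed parse can be logged or retried instead of crashing the process.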

JSON Response With a Schema

To learn more about the JSON schema grammar, see the grammar guide.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(
    fileURLToPath(import.meta.url)
);

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const grammar = await llama.createGrammarForJsonSchema({
    type: "object",
    properties: {
        positiveWordsInUserMessage: {
            type: "array",
            items: {
                type: "string"
            }
        },
        userMessagePositivityScoreFromOneToTen: {
            enum: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
        },
        nameOfUser: {
            oneOf: [{
                type: "null"
            }, {
                type: "string"
            }]
        }
    }
});

const prompt = "Hi there! I'm John. Nice to meet you!";

const res = await session.prompt(prompt, {grammar});
const parsedRes = grammar.parse(res);

console.log("User name:", parsedRes.nameOfUser);
console.log(
    "Positive words in user message:",
    parsedRes.positiveWordsInUserMessage
);
console.log(
    "User message positivity score:",
    parsedRes.userMessagePositivityScoreFromOneToTen
);

Function Calling

To learn more about using function calling, read the function calling guide.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession, defineChatSessionFunction} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const fruitPrices: Record<string, string> = {
    "apple": "$6",
    "banana": "$4"
};
const functions = {
    getFruitPrice: defineChatSessionFunction({
        description: "Get the price of a fruit",
        params: {
            type: "object",
            properties: {
                name: {
                    type: "string"
                }
            }
        },
        async handler(params) {
            const name = params.name.toLowerCase();
            if (Object.keys(fruitPrices).includes(name))
                return {
                    name: name,
                    price: fruitPrices[name]
                };

            return `Unrecognized fruit "${params.name}"`;
        }
    })
};


const q1 = "Is an apple more expensive than a banana?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {functions});
console.log("AI: " + a1);
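
The handler body is ordinary TypeScript, so its lookup logic can be exercised without loading a model. The sketch below copies that logic into a standalone function; lookUpFruitPrice and demoFruitPrices are hypothetical names, and wiring the function into defineChatSessionFunction is left exactly as in the example above.

```typescript
const demoFruitPrices: Record<string, string> = {
    "apple": "$6",
    "banana": "$4"
};

type FruitPriceResult = {name: string, price: string} | string;

// Same lookup as the handler above, extracted so it can be tested directly
function lookUpFruitPrice(params: {name: string}): FruitPriceResult {
    const name = params.name.toLowerCase();

    if (Object.keys(demoFruitPrices).includes(name))
        return {
            name,
            price: demoFruitPrices[name]
        };

    // the original casing is kept in the message so it matches what was asked for
    return `Unrecognized fruit "${params.name}"`;
}

console.log(lookUpFruitPrice({name: "Apple"}));
console.log(lookUpFruitPrice({name: "Kiwi"}));
```

Keeping the lookup in a plain function like this makes it possible to check edge cases, such as letter case or unknown fruits, before a model ever calls it.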

Customizing the System Prompt

What is a system prompt?

A system prompt is a text that guides the model towards the kind of responses we want it to generate.

It's recommended to explain to the model how to behave in certain situations you care about, and to tell it to not make up information if it doesn't know something.

Here is an example of how to customize the system prompt:

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence(),
    systemPrompt: "You are a helpful, respectful and honest botanist. " +
        "Always answer as helpfully as possible.\n" +

        // note the trailing space, so the two parts don't run together
        "If a question does not make any sense or is not factually coherent, " +
        "explain why instead of answering something incorrectly.\n" +

        "Attempt to include nature facts that you know in your answers.\n" +

        "If you don't know the answer to a question, " +
        "don't share false information."
});


const q1 = "What is the tallest tree in the world?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

Saving and Restoring a Chat Session

Save chat history

typescript

import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const chatHistory = session.getChatHistory();
await fs.writeFile("chatHistory.json", JSON.stringify(chatHistory), "utf8");

Restore chat history

typescript

import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// load the chat history saved by the previous example
const chatHistory = JSON.parse(await fs.readFile("chatHistory.json", "utf8"));
session.setChatHistory(chatHistory);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);
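
A file on disk may be missing, truncated, or edited by hand, so it can be worth validating it before passing its content to setChatHistory. The following is a minimal sketch of a save/load pair with a basic shape check. It uses the synchronous fs API to keep the sketch short, and StoredHistoryItem, saveHistory, and loadHistory are hypothetical names; the only assumption made about the history format is that each item is an object with a string type field.

```typescript
import {mkdtempSync, readFileSync, renameSync, writeFileSync} from "fs";
import {tmpdir} from "os";
import {join} from "path";

// Hypothetical minimal shape: this sketch only assumes that each
// history item is an object with a string `type` field
type StoredHistoryItem = {type: string, [key: string]: unknown};

function saveHistory(filePath: string, history: readonly StoredHistoryItem[]): void {
    // write to a temporary file and rename it afterwards, so an interrupted
    // write does not leave a truncated history file behind
    const tempPath = filePath + ".tmp";
    writeFileSync(tempPath, JSON.stringify(history), "utf8");
    renameSync(tempPath, filePath);
}

function loadHistory(filePath: string): StoredHistoryItem[] {
    const parsed: unknown = JSON.parse(readFileSync(filePath, "utf8"));

    if (!Array.isArray(parsed))
        throw new Error("Chat history file does not contain an array");

    for (const item of parsed) {
        if (item == null || typeof item !== "object" || typeof item.type !== "string")
            throw new Error("Chat history file contains an invalid item");
    }

    return parsed as StoredHistoryItem[];
}

// Placeholder data; a real history would come from `session.getChatHistory()`
const demoHistory: StoredHistoryItem[] = [
    {type: "user", text: "Hi there, how are you?"},
    {type: "model", response: ["I'm doing well!"]}
];

const demoFile = join(mkdtempSync(join(tmpdir(), "chat-history-")), "chatHistory.json");
saveHistory(demoFile, demoHistory);

const restoredHistory = loadHistory(demoFile);
console.log(restoredHistory.length);
```

In a real program, the validated array would be what gets passed to session.setChatHistory(), most likely typed with the library's own history item type rather than the placeholder type used here.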

Saving and restoring a context sequence evaluation state

You can also save and restore the context sequence evaluation state to avoid re-evaluating the chat history when you load it on a new context sequence.

Please note that context sequence state files can get very large (109MB for only 1K tokens). Using this feature is only recommended when the chat history is very long and you plan to load it often, or when the evaluation is too slow due to hardware limitations.

WARNING

When loading a context sequence state from a file, always ensure that the model used to create the context sequence is exactly the same as the one used to save the state file.

Loading a state file created from a different model can crash the process, thus you have to pass {acceptRisk: true} to the loadStateFromFile method to use it.

Use with caution.

Save chat history and context sequence state

typescript

import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const contextSequence = context.getSequence();
const session = new LlamaChatSession({contextSequence});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

const chatHistory = session.getChatHistory();
await Promise.all([
    contextSequence.saveStateToFile("state.bin"),
    fs.writeFile("chatHistory.json", JSON.stringify(chatHistory), "utf8")
]);

Restore chat history and context sequence state

typescript

import {fileURLToPath} from "url";
import path from "path";
import fs from "fs/promises";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const contextSequence = context.getSequence();
const session = new LlamaChatSession({contextSequence});

// the state file must have been created with exactly the same model
await contextSequence.loadStateFromFile("state.bin", {acceptRisk: true});
const chatHistory = JSON.parse(await fs.readFile("chatHistory.json", "utf8"));
session.setChatHistory(chatHistory);

const q2 = "Summarize what you said";
console.log("User: " + q2);

const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Prompt Without Updating Chat History

To prompt the model without keeping the interaction in the chat history, save the chat history before prompting and restore it afterwards.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// Save the initial chat history
const initialChatHistory = session.getChatHistory();

const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1);
console.log("AI: " + a1);

// Reset the chat history
session.setChatHistory(initialChatHistory);

const q2 = "Summarize what you said";
console.log("User: " + q2);

// This response will not be aware of the previous interaction
const a2 = await session.prompt(q2);
console.log("AI: " + a2);

Preload User Prompt

You can preload a user prompt onto the context sequence state to make the response start being generated sooner when the final prompt is given.

This won't speed up inference if you call the .prompt() function immediately after preloading the prompt, but can greatly improve initial response times if you preload a prompt before the user gives it.

You can call this function with an empty string to only preload the existing chat history onto the context sequence state.

NOTE

Preloading a long prompt can cause context shifts, so it's recommended to limit the maximum length of the prompt you preload.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

const prompt = "Hi there, how are you?";

console.log("Preloading prompt");
await session.preloadPrompt(prompt);

console.log("Prompt preloaded. Waiting 10 seconds");
await new Promise(resolve => setTimeout(resolve, 1000 * 10));

console.log("Generating response...");
process.stdout.write("AI: ");
const res = await session.prompt(prompt, {
    onTextChunk(text) {
        process.stdout.write(text);
    }
});

console.log("AI: " + res);

Complete User Prompt

You can try this feature in the example Electron app. Just type a prompt and see the completion generated by the model.

You can give the model an incomplete user prompt and have it generate a completion for it.

The advantage of doing that on the chat session is that it will use the chat history as context for the completion, and also use the existing context sequence state, so you don't have to create another context sequence for this.

NOTE

Generating a completion to a user prompt can incur context shifts, so it's recommended to limit the maximum number of tokens that are used for the prompt + completion.

INFO

Prompting the model while a prompt completion is in progress will automatically abort the prompt completion.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Give me a recipe for a cheesecake";
console.log("User: " + q1);

process.stdout.write("AI: ");
const a1 = await session.prompt(q1, {
    onTextChunk(text) {
        process.stdout.write(text);
    }
});
console.log("AI: " + a1);

const maxTokens = 100;
const partialPrompt = "Can I replace the cream cheese with ";

const maxCompletionTokens = maxTokens - model.tokenize(partialPrompt).length;
console.log("Partial prompt: " + partialPrompt);

process.stdout.write("Completion: ");
const promptCompletion = await session.completePrompt(partialPrompt, {
    maxTokens: maxCompletionTokens,
    onTextChunk(text) {
        process.stdout.write(text);
    }
});
console.log("\nPrompt completion: " + promptCompletion);
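
The token budget computed in the example above is a simple subtraction, but it can go negative when the partial prompt alone is longer than the overall limit. A minimal sketch of that arithmetic with a lower bound of 0 follows. getMaxCompletionTokens and countWords are hypothetical names, and countWords is only a stand-in for model.tokenize(text).length: it counts whitespace-separated words, which a real tokenizer will not match.

```typescript
type TokenCounter = (text: string) => number;

// Budget left for the completion after accounting for the prompt itself
function getMaxCompletionTokens(
    partialPrompt: string,
    maxTotalTokens: number,
    countTokens: TokenCounter
): number {
    const promptTokens = countTokens(partialPrompt);

    // never return a negative budget when the prompt alone exceeds the limit
    return Math.max(0, maxTotalTokens - promptTokens);
}

// Hypothetical stand-in for a tokenizer: counts whitespace-separated words
const countWords: TokenCounter = (text) => (
    text.trim().split(/\s+/).filter(Boolean).length
);

const completionBudget = getMaxCompletionTokens("Can I replace the cream cheese with ", 100, countWords);
const exhaustedBudget = getMaxCompletionTokens("Can I replace the cream cheese with ", 3, countWords);

console.log(completionBudget, exhaustedBudget);
```

When the budget comes out as 0, it is probably simplest to skip the completion call altogether rather than pass that value on.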

Prompt Completion Engine

If you want to complete a user prompt as the user types it in an input field, you need a more robust prompt completion engine that works well with partial prompts whose completion is frequently cancelled and restarted.

The prompt completion created with .createPromptCompletionEngine() allows you to trigger the completion of a prompt, while utilizing existing cache to avoid redundant inference and provide fast completions.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// ensure the model is fully loaded before continuing this demo
await session.preloadPrompt("");

const completionEngine = session.createPromptCompletionEngine({
    // 15 is used for demonstration only,
    // it's best to omit this option
    maxPreloadTokens: 15,
    // temperature: 0.8, // you can set custom generation options
    onGeneration(prompt, completion) {
        console.log(`Prompt: ${prompt} | Completion:${completion}`);
        // add custom code here that checks whether the current
        // input text is equal to `prompt`, and if it is,
        // use `completion` as the completion of the input text.
        // this callback will be called multiple times
        // as the completion is being generated.
    }
});

completionEngine.complete("Hi the");

await new Promise(resolve => setTimeout(resolve, 1500));

completionEngine.complete("Hi there");
await new Promise(resolve => setTimeout(resolve, 1500));

completionEngine.complete("Hi there! How");
await new Promise(resolve => setTimeout(resolve, 1500));

// get an existing completion from the cache
// and begin/continue generating a completion for it
const cachedCompletion = completionEngine.complete("Hi there! How");
console.log("Cached completion:", cachedCompletion);
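
The comment inside onGeneration asks for a check that the completion still belongs to the text currently in the input field. One possible shape for that check is sketched below. CompletionSuggestion is a hypothetical UI-side class, not part of node-llama-cpp; its handleGeneration method simply takes the same (prompt, completion) pair that onGeneration receives in the example above.

```typescript
// Hypothetical UI-side state holder; not part of node-llama-cpp
class CompletionSuggestion {
    private currentInput = "";
    private suggestion = "";

    // call this whenever the user edits the input field
    setInput(text: string): void {
        this.currentInput = text;
        this.suggestion = ""; // the old suggestion no longer matches the input
    }

    // takes the same `(prompt, completion)` pair that `onGeneration` receives
    handleGeneration(prompt: string, completion: string): void {
        // ignore completions generated for text the user has already changed
        if (prompt !== this.currentInput)
            return;

        this.suggestion = completion;
    }

    getSuggestion(): string {
        return this.suggestion;
    }
}

const suggestionState = new CompletionSuggestion();

suggestionState.setInput("Hi the");
suggestionState.handleGeneration("Hi the", "re");
console.log(suggestionState.getSuggestion());

suggestionState.setInput("Hi there");
suggestionState.handleGeneration("Hi the", "re! How"); // stale: ignored
console.log(suggestionState.getSuggestion());
```

Because the callback can fire many times while a completion grows, the latest matching completion overwrites the previous one, and completions for outdated input are dropped.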

Response Prefix

You can force the model response to start with a specific prefix, to make the model follow a certain direction in its response.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

const a1 = await session.prompt(q1, {
    responsePrefix: "The weather today is"
});
console.log("AI: " + a1);

Stop Response Generation

To stop the generation of the current response, without removing the existing partial generation from the chat history, you can use the stopOnAbortSignal option to configure what happens when the given signal is aborted.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const abortController = new AbortController();
const q1 = "Hi there, how are you?";
console.log("User: " + q1);

let response = "";

const a1 = await session.prompt(q1, {
    // stop the generation, instead of cancelling it
    stopOnAbortSignal: true,

    signal: abortController.signal,
    onTextChunk(chunk) {
        response += chunk;

        if (response.length >= 10)
            abortController.abort();
    }
});
console.log("AI: " + a1);
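
The difference between stopping and cancelling can be illustrated without a model. The sketch below is a simulation only: simulateGeneration is a hypothetical stand-in for a streaming generation loop and does not call node-llama-cpp, and modeling a cancellation as a thrown error is an assumption made for the sake of the illustration.

```typescript
type SimulatedOptions = {
    signal: AbortSignal,
    stopOnAbortSignal: boolean,
    onTextChunk(chunk: string): void
};

// Hypothetical stand-in for a streaming generation loop; it does not call the library
function simulateGeneration(chunks: string[], options: SimulatedOptions): string {
    let response = "";

    for (const chunk of chunks) {
        if (options.signal.aborted) {
            if (options.stopOnAbortSignal)
                return response; // stop: keep what was generated so far

            throw new Error("Generation aborted"); // cancel: the partial text is discarded
        }

        response += chunk;
        options.onTextChunk(chunk);
    }

    return response;
}

function runDemo(stopOnAbortSignal: boolean): string {
    const abortController = new AbortController();
    let streamed = "";

    return simulateGeneration(["I'm ", "doing ", "well, ", "thanks ", "for ", "asking!"], {
        signal: abortController.signal,
        stopOnAbortSignal,
        onTextChunk(chunk) {
            streamed += chunk;

            // abort once at least 10 characters have been streamed
            if (streamed.length >= 10)
                abortController.abort();
        }
    });
}

console.log(JSON.stringify(runDemo(true)));
```

In the simulation, runDemo(true) returns the text produced before the abort, while runDemo(false) throws, so the caller never receives the partial text.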

Stream Response Segments

The raw model response is automatically segmented into different types of segments. The main response is not segmented, but other kinds of sections, like thoughts (chain of thought) and comments (on relevant models, like gpt-oss), are segmented.

To stream response segments you can use the onResponseChunk option.

typescript

import {fileURLToPath} from "url";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const __dirname = path.dirname(fileURLToPath(import.meta.url));

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: path.join(__dirname, "models", "DeepSeek-R1-Distill-Qwen-14B.Q4_K_M.gguf")
});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Hi there, how are you?";
console.log("User: " + q1);

process.stdout.write("AI: ");
const a1 = await session.promptWithMeta(q1, {
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";
        const isCommentSegment = chunk.type === "segment" &&
            chunk.segmentType === "comment";

        if (chunk.type === "segment" && chunk.segmentStartTime != null)
            process.stdout.write(` [segment start: ${chunk.segmentType}] `);

        process.stdout.write(chunk.text);

        if (chunk.type === "segment" && chunk.segmentEndTime != null)
            process.stdout.write(` [segment end: ${chunk.segmentType}] `);
    }
});

const fullResponse = a1.response
    .map((item) => {
        if (typeof item === "string")
            return item;
        else if (item.type === "segment") {
            const isThoughtSegment = item.segmentType === "thought";
            const isCommentSegment = item.segmentType === "comment";
            let res = "";

            if (item.startTime != null)
                res += ` [segment start: ${item.segmentType}] `;

            res += item.text;

            if (item.endTime != null)
                res += ` [segment end: ${item.segmentType}] `;

            return res;
        }

        return "";
    })
    .join("");

console.log("Full response: " + fullResponse);
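
The response array handled above mixes plain strings with segment objects. If you would rather keep the main text and the segments apart instead of joining them into one string, something along these lines may work. The local types mirror only the fields used in the example above and are not the library's own types, and splitResponse and isSegment are hypothetical helper names.

```typescript
// Local types mirroring only the fields used in the example above
type SegmentItem = {type: "segment", segmentType: string, text: string};
type OtherItem = {type: string};
type ResponseItem = string | SegmentItem | OtherItem;

function isSegment(item: ResponseItem): item is SegmentItem {
    return typeof item !== "string" && item.type === "segment";
}

// Separates the main response text from segment text, grouped by segment type
function splitResponse(items: readonly ResponseItem[]): {
    mainText: string,
    segments: Record<string, string>
} {
    let mainText = "";
    const segments: Record<string, string> = {};

    for (const item of items) {
        if (typeof item === "string")
            mainText += item;
        else if (isSegment(item))
            segments[item.segmentType] = (segments[item.segmentType] ?? "") + item.text;
        // any other kind of item is ignored in this sketch
    }

    return {mainText, segments};
}

// Placeholder data shaped like the items handled in the example above
const demoItems: ResponseItem[] = [
    {type: "segment", segmentType: "thought", text: "The user is greeting me."},
    "Hi! ",
    {type: "unknownKind"},
    "I'm doing well."
];

const splitResult = splitResponse(demoItems);
console.log(splitResult.mainText);
console.log(splitResult.segments);
```

Grouping by segmentType keeps thoughts and comments separate, which makes it easy to show or hide each kind independently.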

Set Reasoning Budget

You can set a reasoning budget to limit the number of tokens a thinking model can spend on thought segments.

typescript

import {
    getLlama, LlamaChatSession, resolveModelFile, Token
} from "node-llama-cpp";

const modelPath = await resolveModelFile("hf:Qwen/Qwen3-14B-GGUF:Q4_K_M");

const llama = await getLlama();
const model = await llama.loadModel({modelPath});
const context = await model.createContext();
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});


const q1 = "Where do llamas come from?";
console.log("User: " + q1);

const maxThoughtTokens = 100;

let responseTokens = 0;
let thoughtTokens = 0;

process.stdout.write("AI: ");
const response = await session.prompt(q1, {
    budgets: {
        thoughtTokens: maxThoughtTokens
    },
    onResponseChunk(chunk) {
        const isThoughtSegment = chunk.type === "segment" &&
            chunk.segmentType === "thought";

        if (chunk.type === "segment" && chunk.segmentStartTime != null)
            process.stdout.write(` [segment start: ${chunk.segmentType}] `);

        process.stdout.write(chunk.text);

        if (chunk.type === "segment" && chunk.segmentEndTime != null)
            process.stdout.write(` [segment end: ${chunk.segmentType}] `);

        if (isThoughtSegment)
            thoughtTokens += chunk.tokens.length;
        else
            responseTokens += chunk.tokens.length;
    }
});

console.log("Response: " + response);

console.log("Response tokens: " + responseTokens);
console.log("Thought tokens: " + thoughtTokens);

Last edited 7 months ago
