Make a completion for a given list of chat messages.
NOTE: this function uses the chat template (if available) to format the chat messages. If no template is available, the default format (ChatML) is used. It can throw an error if the chat template is not compatible.
Chat messages
Output completion text (only the completion part)
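As an illustration, a minimal sketch of a chat completion call; the instance name `wllama`, the method name `createChatCompletion`, and the `nPredict` option are assumptions, not confirmed by this reference.

```ts
// `wllama` is an assumed, already-loaded instance; `createChatCompletion`
// and `nPredict` are illustrative names, not confirmed method/option names.
const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Tell me a short joke.' },
];
const completion = await wllama.createChatCompletion(messages, {
  nPredict: 128, // assumed option: maximum number of tokens to generate
});
// Only the newly generated text is returned, not the formatted prompt.
console.log(completion);
```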
Make a completion for a given text.
Input text
Output completion text (only the completion part)
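A similar sketch for plain text completion; the method name `createCompletion` and its options are assumed, not confirmed by this reference.

```ts
// Assumed names: `wllama` instance and `createCompletion(text, options)`.
const output = await wllama.createCompletion('Once upon a time', {
  nPredict: 64, // assumed option: number of tokens to generate
});
// Only the completion part is returned, without the input text.
console.log(output);
```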
Calculate the embedding vector for a given text. By default, BOS and EOS tokens are added automatically. Use the "skipBOS" and "skipEOS" options to disable this.
Input text
An embedding vector
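A hedged sketch of the embedding call; only the `skipBOS`/`skipEOS` option names come from this reference, while the instance and method names are assumptions.

```ts
// Assumed names: `wllama` instance and `createEmbedding(text, options)`.
const vector = await wllama.createEmbedding('Hello world', {
  skipBOS: true, // do not prepend the BOS token
  skipEOS: true, // do not append the EOS token
});
console.log(vector.length); // dimensionality of the embedding vector
```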
Apply the chat template to a list of messages.
List of messages
Whether to add the assistant prompt at the end
Optional
template: string (optional): a custom template; see the llama-server --chat-template argument for more details
Formatted chat
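A sketch of applying the chat template; the method name `formatChat` and the exact parameter shape (positional template string vs. options object) are assumptions.

```ts
// Assumed names: `wllama` instance and `formatChat(messages, addAssistant, template?)`.
const formatted = await wllama.formatChat(
  [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Hello!' },
  ],
  true, // add the assistant prompt at the end
  // a custom Jinja template string could be passed here, analogous to
  // llama-server's --chat-template argument (exact parameter shape assumed)
);
console.log(formatted); // the formatted chat as a single prompt string
```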
Get the Jinja chat template that comes with the model. It is only available if the original model (before converting to GGUF) has the template in tokenizer_config.json.
NOTE: This can only be used after loadModel is called.
The Jinja template, or null if there is no template in the GGUF.
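For example (the method name `getChatTemplate` is an assumption):

```ts
// Assumed names: `wllama` instance and `getChatTemplate()`.
const template = await wllama.getChatTemplate();
if (template === null) {
  // the GGUF carries no embedded Jinja template; the default format applies
} else {
  console.log(template);
}
```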
Get the token ID used by the decoder to start generating the output sequence (only usable for encoder-decoder architectures). In other words, the encoder uses the normal BOS token while the decoder uses this token.
NOTE: This can only be used after loadModel is called.
-1 if the model is not loaded.
Get the softmax-ed probabilities of the logits; can be used for custom sampling.
Get the top K tokens with the highest logit values. If topK == -1, all n_vocab logits are returned, but this is not recommended because it is slow.
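A sketch of using the top-K logits for custom sampling; the method name `getLogits` and the `{ token, p }` element shape are assumptions.

```ts
// Assumed names: `wllama` instance, `getLogits(topK)`, and the
// `{ token, p }` element shape of the returned array.
const top10 = await wllama.getLogits(10); // top 10 tokens after softmax
const best = top10[0]; // highest-probability token
console.log(best.token, best.p);
// Passing -1 would return all n_vocab entries, which is slow.
```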
Get model hyper-parameters and metadata
NOTE: This can only be used after loadModel is called.
ModelMetadata
Load a model from a given list of Blobs.
You can pass multiple buffers to the function (in case the model consists of multiple shards).
Can be either a list of Blobs (if you use a local file) or a Model object (if you use ModelManager)
LoadModelConfig
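A sketch of loading a local model; the method name `loadModel` appears in this reference, while the config field shown is an assumption.

```ts
// `wllama` is an assumed instance; `n_ctx` is an assumed LoadModelConfig field.
// One Blob per shard; a single-file model needs only one Blob.
const files: Blob[] = [
  /* e.g. files picked via an <input type="file"> element */
];
await wllama.loadModel(files, {
  n_ctx: 2048, // assumed field: context size
});
```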
Load a model from a given Hugging Face model ID and file path.
The HF model ID, for example: 'ggml-org/models'
The GGUF file path, for example: 'tinyllamas/stories15M-q4_0.gguf'
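The model ID and file path below are the example values from this reference; the instance and method names are assumptions.

```ts
// Assumed names: `wllama` instance and `loadModelFromHF(modelId, filePath)`.
await wllama.loadModelFromHF(
  'ggml-org/models',                 // HF model ID
  'tinyllamas/stories15M-q4_0.gguf', // GGUF file path within the repo
);
```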
Load a model from a given URL (or a list of URLs, in case the model is split into smaller files).
If the model was already downloaded (via downloadModel()), then we will use the cached model.
URL to the GGUF file. If the model is split, pass the URL to the first shard.
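A sketch of loading from a URL; the method name and the URL itself are illustrative assumptions.

```ts
// Assumed names: `wllama` instance and `loadModelFromUrl(url, config)`.
// For a split model, pass the URL of the first shard (or a list of shard URLs).
await wllama.loadModelFromUrl(
  'https://example.com/models/model-00001-of-00003.gguf', // hypothetical URL
  {}, // optional LoadModelConfig
);
```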
Look up whether a token exists in the vocab. Useful for searching for special tokens like "<|im_start|>". NOTE: It matches the whole token, so do not use it as a replacement for tokenize().
Token ID associated with the given piece. Returns -1 if the token cannot be found.
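For example (the method name `lookupToken` is an assumption; the special token string comes from this reference):

```ts
// Assumed names: `wllama` instance and `lookupToken(piece)`.
const imStart = await wllama.lookupToken('<|im_start|>');
if (imStart === -1) {
  // the token does not exist in the model's vocab
} else {
  console.log('token ID:', imStart);
}
```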
Create or reset the ctx_sampling
When re-initializing the ctx_sampling, you can re-import past tokens into the new context.
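A sketch of (re-)initializing sampling; the method name `samplingInit` and the config field names are assumptions.

```ts
// Assumed names: `wllama` instance, `samplingInit(config, pastTokens?)`,
// and the sampling config fields shown below.
const previousTokens: number[] = []; // tokens from an earlier generation, if any
await wllama.samplingInit(
  { temp: 0.7, top_k: 40, top_p: 0.9 }, // assumed sampling parameters
  previousTokens, // re-import past tokens into the new ctx_sampling
);
```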
Set options for the underlying llama_context.
Get debug info.