Constructors

Properties

cacheManager: CacheManager
modelManager: ModelManager

Methods

  • Get debug info

    Returns Promise<any>

  • Make a completion for a given list of chat messages.

    NOTE: this function uses the chat template (if available) to format the chat messages. If no template is available, it falls back to the default format (ChatML). It can throw an error if the chat template is not compatible.

    Parameters

    Returns Promise<string>

    Output completion text (only the completion part)
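
    A minimal sketch of using this method, assuming it is wllama's createChatCompletion (this page does not list method names), that a model is already loaded into an instance named wllama, and that the nPredict and sampling option names exist as in typical wllama usage (they are not documented on this page):

      const messages: WllamaChatMessage[] = [
        { role: 'system', content: 'You are a helpful assistant.' },
        { role: 'user', content: 'Explain WebAssembly in one sentence.' },
      ];
      // The messages are formatted with the model's chat template (or ChatML),
      // and only the newly generated completion text is returned.
      const reply = await wllama.createChatCompletion(messages, {
        nPredict: 128,                        // assumed option: generation limit
        sampling: { temp: 0.7, top_p: 0.9 },  // assumed option: SamplingConfig
      });
      console.log(reply);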

  • Make a completion for a given text.

    Parameters

    Returns Promise<string>

    Output completion text (only the completion part)
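
    A minimal sketch for plain text completion, assuming the method is wllama's createCompletion and a model is already loaded; the option names shown are assumptions, not documented on this page:

      const story = await wllama.createCompletion('Once upon a time,', {
        nPredict: 64,             // assumed option: max number of tokens to generate
        sampling: { temp: 0.5 },  // assumed option: SamplingConfig
      });
      console.log(story); // only the completion part; the prompt is not echoed back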

  • Calculate the embedding vector for a given text. By default, BOS and EOS tokens are added automatically; use the "skipBOS" and "skipEOS" options to disable this.

    Parameters

    • text: string

      Input text

    • options: { skipBOS?: boolean; skipEOS?: boolean } = {}

    Returns Promise<number[]>

    An embedding vector
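
    A minimal sketch, assuming the method is wllama's createEmbedding and that the model was loaded with embedding support (whether that requires a LoadModelConfig flag is an assumption, not stated on this page):

      const vec = await wllama.createEmbedding('The sky is blue today', {
        // BOS/EOS are added automatically; set these to true to skip them.
        skipBOS: false,
        skipEOS: false,
      });
      console.log(vec.length, 'dimensions'); // e.g. 384 or 768 depending on the model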

  • Run llama_decode()

    Parameters

    • tokens: number[]

      A list of tokens to be decoded

    • options: { skipLogits?: boolean }

    Returns Promise<{ nPast: number }>

    n_past (number of tokens so far in the sequence)
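
    A low-level sketch, assuming the methods are wllama's tokenize and decode and that a model is already loaded; this is only needed when driving generation manually instead of using the completion helpers above:

      const tokens = await wllama.tokenize('Hello, world');
      // Evaluate the prompt tokens; skipLogits can be set to true when the
      // logits are not needed (e.g. when only pre-filling the KV cache).
      const { nPast } = await wllama.decode(tokens, { skipLogits: false });
      console.log('tokens in the sequence so far:', nPast);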

  • Convert a list of tokens to text

    Parameters

    • tokens: number[]

    Returns Promise<Uint8Array>

    A Uint8Array, which may be an incomplete (unfinished) Unicode sequence

  • Calculate embeddings for a given list of tokens. The output vector is always normalized.

    Parameters

    • tokens: number[]

    Returns Promise<number[]>

    A list of numbers representing an embedding vector of N dimensions

  • Run llama_encode()

    Parameters

    • tokens: number[]

      A list of tokens to be encoded

    • Optional options: Record<never, never>

      Unused for now

    Returns Promise<{ nPast: number }>

    n_past (number of tokens so far in the sequence)

  • Unload the model and free all memory.

    Note: This function will NOT crash if the model is not yet loaded.

    Returns Promise<void>

  • Apply chat template to a list of messages

    Parameters

    • messages: WllamaChatMessage[]

      list of messages

    • addAssistant: boolean

      whether to add assistant prompt at the end

    • Optional template: string

      (optional) custom template, see llama-server --chat-template argument for more details

    Returns Promise<string>

    formatted chat
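
    A sketch of applying the chat template without generating anything, assuming the method is wllama's formatChat and a model is loaded; the { role, content } message shape is an assumption:

      const prompt = await wllama.formatChat(
        [{ role: 'user', content: 'Hi there!' }],
        true // append the assistant prefix so the model starts its reply
      );
      console.log(prompt); // formatted with the model's template (or a custom one if passed)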

  • Get the token ID associated with the BOS (beginning-of-sentence) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the Jinja chat template that comes with the model. It is only available if the original model (before conversion to GGUF) has the template in tokenizer_config.json

    NOTE: This can only be used after loadModel is called.

    Returns null | string

    The Jinja template, or null if there is no template in the GGUF

  • Get the token ID of the token used by the decoder to start generating the output sequence (only usable for encoder-decoder architectures). In other words, the encoder uses the normal BOS token and the decoder uses this token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the token ID associated with the EOS (end-of-sentence) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the token ID associated with the EOT (end-of-turn) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the softmax-ed probabilities of the logits; can be used for custom sampling.

    Parameters

    • topK: number = 40

      Get the top K tokens with the highest logit values. If topK == -1, all n_vocab logits are returned, but this is not recommended because it is slow.

    Returns Promise<{ p: number; token: number }[]>
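
    A sketch of custom sampling on top of the probabilities, assuming the method is wllama's getLogits and a prompt has already been decoded so logits exist; whether the result is sorted by probability is an assumption:

      const candidates = await wllama.getLogits(10); // top 10 tokens
      for (const { token, p } of candidates) {
        console.log(token, p.toFixed(4));
      }
      // e.g. a greedy pick of the most probable candidate:
      const next = candidates.reduce((a, b) => (b.p > a.p ? b : a)).token;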

  • Get the list of pieces in the vocab. NOTE: This function is slow and should only be called once.

    Returns Promise<Uint8Array[]>

    A list of Uint8Array; the nth element in the list corresponds to the nth token in the vocab

  • Check if the current model uses an encoder-decoder architecture

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if the model uses an encoder-decoder architecture.

  • Check if the model is loaded via loadModel()

    Returns boolean

  • Check if we're currently using the multi-thread build.

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if the multi-thread build is used.

  • Clear all tokens in KV cache

    Returns Promise<void>

  • Remove and shift some tokens from the KV cache: keep the first n_keep tokens, remove the next n_discard tokens, then shift the rest.

    Parameters

    • nKeep: number
    • nDiscard: number

    Returns Promise<void>
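
    A sketch of trimming the context, assuming the methods are wllama's kvRemove and kvClear; the numbers are purely illustrative:

      // Keep the first 32 tokens (e.g. a system prompt), discard the next 128,
      // and shift the remaining tokens down to fill the gap.
      await wllama.kvRemove(32, 128);

      // Or start from an empty context between unrelated prompts.
      await wllama.kvClear();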

  • Load a model from a given list of Blobs.

    You can pass multiple buffers into the function (in case the model contains multiple shards).

    Parameters

    • ggufBlobsOrModel: Blob[] | Model

      Either a list of Blobs (if you use local files) or a Model object (if you use ModelManager)

    • config: LoadModelConfig = {}

      LoadModelConfig

    Returns Promise<void>
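
    A sketch of loading local GGUF files from a browser file input, assuming the method is wllama's loadModel; the Wllama constructor arguments (paths to the WASM builds) and the n_ctx field are assumptions, not documented on this page:

      import { Wllama } from '@wllama/wllama';

      const wllama = new Wllama({
        // assumed shape: map of build name to WASM asset URL
        'single-thread/wllama.wasm': './esm/single-thread/wllama.wasm',
        'multi-thread/wllama.wasm': './esm/multi-thread/wllama.wasm',
      });

      const input = document.querySelector<HTMLInputElement>('#gguf-files')!;
      const blobs: Blob[] = Array.from(input.files ?? []); // all shards, in order
      await wllama.loadModel(blobs, { n_ctx: 2048 }); // n_ctx: assumed LoadModelConfig field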

  • Load model from a given Hugging Face model ID and file path.

    Parameters

    • modelId: string

      The HF model ID, for example: 'ggml-org/models'

    • filePath: string

      The GGUF file path, for example: 'tinyllamas/stories15M-q4_0.gguf'

    • config: LoadModelConfig & DownloadOptions & { useCache?: boolean } = {}

    Returns Promise<void>
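
    A sketch using the example model ID and file path given above, assuming the method is wllama's loadModelFromHF and that the wllama instance was constructed as in the previous sketch; progressCallback is an assumed DownloadOptions field:

      await wllama.loadModelFromHF(
        'ggml-org/models',
        'tinyllamas/stories15M-q4_0.gguf',
        {
          useCache: true, // reuse a previously downloaded copy if available
          progressCallback: ({ loaded, total }) =>
            console.log(`downloaded ${Math.round((100 * loaded) / total)}%`), // assumed callback shape
        }
      );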

  • Load a model from a given URL (or a list of URLs, in case the model is split into smaller files)

    • If the model has already been downloaded (via downloadModel()), the cached model is used
    • Otherwise, the model is downloaded from the internet

    Parameters

    • modelUrl: string | string[]

      URL of the GGUF file. If the model is split, pass the URL of the first shard.

    • config: LoadModelConfig & DownloadOptions & { useCache?: boolean } = {}

    Returns Promise<void>
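
    A sketch, assuming the method is wllama's loadModelFromUrl; the model URL is purely illustrative:

      await wllama.loadModelFromUrl(
        'https://example.com/models/my-model-q4_0.gguf', // hypothetical URL
        { useCache: true }
      );
      // For a split model, pass the URL of the first shard, or an array of all shard URLs.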

  • Look up whether a token exists in the vocab. Useful for finding special tokens like "<|im_start|>". NOTE: It matches the whole token, so do not use it as a replacement for tokenize()

    Parameters

    • piece: string

    Returns Promise<number>

    Token ID associated with the given piece. Returns -1 if the token cannot be found.
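
    A sketch, assuming the method is wllama's lookupToken and a model is loaded:

      const imStart = await wllama.lookupToken('<|im_start|>');
      if (imStart === -1) {
        console.log('this model has no <|im_start|> token');
      } else {
        console.log('<|im_start|> is token ID', imStart);
      }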

  • Must we add BOS token to the tokenized sequence?

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if BOS token must be added to the sequence

  • Must we add EOS token to the tokenized sequence?

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if EOS token must be added to the sequence

  • Accept and save new tokens to ctx_sampling

    Parameters

    • tokens: number[]

    Returns Promise<void>

  • Create or reset the ctx_sampling

    Parameters

    • config: SamplingConfig
    • pastTokens: number[] = []

      When re-initializing ctx_sampling, you can re-import past tokens into the new context

    Returns Promise<void>

  • Sample a new token (remember to call samplingInit() at least once before calling this function)

    Returns Promise<{ piece: Uint8Array; token: number }>

    The token ID and its detokenized value (which may be an incomplete Unicode sequence)
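
    A sketch of a manual generation loop tying the three sampling methods together with decode, assuming the methods are wllama's samplingInit, samplingSample, samplingAccept, decode, tokenize and getEOS, and that a model is loaded; the SamplingConfig fields shown are assumptions:

      const prompt = await wllama.tokenize('The capital of France is');
      await wllama.samplingInit({ temp: 0.8, top_k: 40 }); // assumed SamplingConfig fields
      await wllama.decode(prompt, {});

      const decoder = new TextDecoder();
      let text = '';
      for (let i = 0; i < 32; i++) {
        const { token, piece } = await wllama.samplingSample();
        if (token === wllama.getEOS()) break;  // stop at end-of-sentence
        await wllama.samplingAccept([token]);  // record the token in ctx_sampling
        await wllama.decode([token], {});      // feed it back to produce the next logits
        text += decoder.decode(piece, { stream: true }); // piece may be an incomplete UTF-8 chunk
      }
      console.log(text);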

  • Load a session from a file (virtual file system). TODO: add ability to download the file

    Parameters

    • filePath: string

    Returns Promise<void>

  • Save the session to a file (virtual file system). TODO: add ability to download the file

    Parameters

    • filePath: string

    Returns Promise<{ tokens: number[] }>

    List of tokens saved to the file
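
    A sketch of snapshotting and restoring the state after an expensive prompt, assuming the methods are wllama's sessionSave and sessionLoad; the virtual file system path is purely illustrative:

      const { tokens } = await wllama.sessionSave('/sessions/prompt.bin'); // hypothetical VFS path
      console.log('saved', tokens.length, 'tokens');

      // ...later, in the same instance, restore the saved state:
      await wllama.sessionLoad('/sessions/prompt.bin');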

  • Convert a given text to a list of tokens

    Parameters

    • text: string
    • special: boolean = true

      Whether special tokens should be parsed

    Returns Promise<number[]>

    List of token IDs
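
    A round-trip sketch with detokenize, assuming the methods are wllama's tokenize and detokenize and a model is loaded:

      const ids = await wllama.tokenize('<|im_start|>user\nHello<|im_end|>', true); // parse special tokens
      console.log(ids); // list of token IDs

      const bytes = await wllama.detokenize(ids);
      console.log(new TextDecoder().decode(bytes)); // back to (roughly) the original text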