Constructors

Properties

cacheManager: CacheManager
modelManager: ModelManager

Methods

  • Get debug info.

    Returns Promise<any>

  • Create a chat completion for the given chat messages.

    Parameters

    • options: {
          max_tokens?: number;
          messages: ChatCompletionMessage[];
          model?: string;
          response_format?: {
              json_schema?: { name: string; schema: unknown; strict?: boolean };
              type: "text" | "json_object" | "json_schema";
          };
          stream?: boolean;
          temperature?: number;
          tool_choice?: ChatCompletionToolChoice;
          tools?: ChatCompletionTool[];
          user?: string;
      } & SamplingParams & { stream?: false }

      OAI-compatible chat completion options

    Returns Promise<ChatCompletionResponse>

    OAI-compatible chat completion response. This overload covers stream=false and returns only the final result; with stream=true, an async iterator of completion chunks is returned instead.
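As a rough illustration of the options shape listed above, the sketch below builds a non-streaming request that asks for schema-constrained JSON output. The `Message` type, its `role` values, and the `buildJsonSchemaRequest` helper are assumptions made for this example; the real `ChatCompletionMessage` type is not shown in this listing and may differ.

```typescript
// Hypothetical local types: only the fields used here, mirroring the option
// shape listed above. The real ChatCompletionMessage type may carry more.
type Message = { role: "system" | "user" | "assistant"; content: string };

type ResponseFormat = {
  type: "text" | "json_object" | "json_schema";
  json_schema?: { name: string; schema: unknown; strict?: boolean };
};

type ChatOptions = {
  messages: Message[];
  max_tokens?: number;
  temperature?: number;
  response_format?: ResponseFormat;
  stream?: false;
};

// Build non-streaming options that request output matching a JSON schema.
function buildJsonSchemaRequest(
  question: string,
  schemaName: string,
  schema: unknown,
): ChatOptions {
  return {
    messages: [{ role: "user", content: question }],
    max_tokens: 256,
    temperature: 0,
    response_format: {
      type: "json_schema",
      json_schema: { name: schemaName, schema, strict: true },
    },
    stream: false,
  };
}

// Example request; the schema content is illustrative only.
const request = buildJsonSchemaRequest("Name one prime number.", "prime", {
  type: "object",
  properties: { value: { type: "integer" } },
  required: ["value"],
});
```

The resulting object would be passed as the `options` argument of the chat completion method; keeping `stream` set to `false` selects the overload that resolves to a single `ChatCompletionResponse`.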

  • Create a chat completion for the given chat messages.

    Parameters

    Returns Promise<AsyncIterable<ChatCompletionChunk>>

    Async iterator of OAI-compatible chat completion chunks. This overload covers stream=true; with stream=false, only the final result is returned.

  • Create a (raw) completion for the given text.

    Parameters

    • options: {
          best_of?: number;
          echo?: boolean;
          frequency_penalty?: number;
          logit_bias?: Record<string, number>;
          logprobs?: null | number;
          max_tokens?: number;
          model?: string;
          n?: number;
          presence_penalty?: number;
          prompt: string | string[];
          seed?: number;
          stop?: string | string[];
          stream?: boolean;
          suffix?: string;
          temperature?: number;
          top_p?: number;
          user?: string;
      } & SamplingParams & { stream?: false }

      OAI-compatible completion options

    Returns Promise<RawCompletionResponse>

    OAI-compatible completion response. This overload covers stream=false and returns only the final result; with stream=true, an async iterator of completion chunks is returned instead.

  • Create a (raw) completion for the given text.

    Parameters

    • options: {
          best_of?: number;
          echo?: boolean;
          frequency_penalty?: number;
          logit_bias?: Record<string, number>;
          logprobs?: null | number;
          max_tokens?: number;
          model?: string;
          n?: number;
          presence_penalty?: number;
          prompt: string | string[];
          seed?: number;
          stop?: string | string[];
          stream?: boolean;
          suffix?: string;
          temperature?: number;
          top_p?: number;
          user?: string;
      } & SamplingParams & StreamParams<RawCompletionChunk>

      OAI-compatible completion options

    Returns Promise<AsyncIterable<RawCompletionChunk>>

    Async iterator of OAI-compatible completion chunks. This overload covers stream=true; with stream=false, only the final result is returned.
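A minimal sketch of consuming the streaming overload's `AsyncIterable` follows. The chunk shape (`choices[0].text`) is an assumption based on the OpenAI text-completion convention, since `RawCompletionChunk` is not spelled out in this listing; `RawChunkLike`, `collectText`, and `fakeStream` are hypothetical names used only for illustration.

```typescript
// Assumed chunk shape, following the OpenAI text-completion convention.
// RawCompletionChunk itself is not shown in this listing.
interface RawChunkLike {
  choices: { text: string }[];
}

// Concatenate the text of every chunk, calling onText as each piece arrives.
async function collectText(
  chunks: AsyncIterable<RawChunkLike>,
  onText: (piece: string) => void = () => {},
): Promise<string> {
  let full = "";
  for await (const chunk of chunks) {
    // A chunk with no choices contributes no text.
    const piece = chunk.choices[0]?.text ?? "";
    if (piece.length > 0) {
      onText(piece);
      full += piece;
    }
  }
  return full;
}

// Stand-in for the iterable returned when stream is true.
async function* fakeStream(): AsyncGenerator<RawChunkLike> {
  yield { choices: [{ text: "Hello" }] };
  yield { choices: [] };
  yield { choices: [{ text: ", world" }] };
}
```

In real use, the iterable resolved by the streaming completion call would take the place of `fakeStream()`, and `onText` could append each piece to the UI as it arrives.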

  • Unload the model and free all memory.

    Note: This function will NOT crash if the model is not yet loaded.

    Returns Promise<void>

  • Get the token ID associated with the BOS (beginning of sentence) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the Jinja chat template that comes with the model. It is only available if the original model (before conversion to GGUF) has the template in tokenizer_config.json.

    NOTE: This can only be used after loadModel is called.

    Returns null | string

    The Jinja template, or null if there is no template in the GGUF.

  • Get the token ID that the decoder uses to start generating the output sequence (only usable for encoder-decoder architectures). In other words, the encoder uses the normal BOS token and the decoder uses this token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the token ID associated with the EOS (end of sentence) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.

  • Get the token ID associated with the EOT (end of turn) token.

    NOTE: This can only be used after loadModel is called.

    Returns number

    -1 if the model is not loaded.
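Because the token getters above signal an unloaded model with -1 rather than an exception, it can help to convert that sentinel into something harder to misuse. The helpers below are a sketch only: `tokenIdOrNull` and `requireTokenId` are hypothetical names, and they operate on the returned number so that no method names need to be assumed.

```typescript
// Convert the -1 sentinel into null so it cannot be mistaken for a real ID.
function tokenIdOrNull(id: number): number | null {
  return id === -1 ? null : id;
}

// Fail loudly when a special token is needed but unavailable.
function requireTokenId(id: number, name: string): number {
  const value = tokenIdOrNull(id);
  if (value === null) {
    throw new Error(`${name} token unavailable: is the model loaded?`);
  }
  return value;
}

// Illustrative values only; real IDs depend on the loaded model.
const eosId = requireTokenId(2, "EOS");
const eotId = tokenIdOrNull(-1);
```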

  • Get the number of threads used in the current context.

    NOTE: This can only be used after loadModel is called.

    Returns number

    number of threads

  • Check if the current model uses an encoder-decoder architecture.

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if the model uses an encoder-decoder architecture.

  • Check if the model is loaded via loadModel()

    Returns boolean

  • Check if the multi-thread build is currently in use.

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if multi-thread is used.

  • Check if WebGPU is supported by the current environment.

    Returns boolean

    true if WebGPU is supported

  • Check if a given token is an end-of-generation token (e.g. EOS, EOT, etc.)

    Parameters

    • token: number

      the token ID to be checked

    Returns boolean

    true if the token is EOS, EOT, or any other end-of-generation token
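One common use of such a check is cutting a generated token sequence at the first end-of-generation token. The sketch below takes the check as a plain predicate, so it does not depend on any particular method name; `truncateAtEndOfGeneration` and the token IDs are hypothetical.

```typescript
// Return the tokens that precede the first end-of-generation token.
// If no such token is present, the whole sequence is returned (as a copy).
function truncateAtEndOfGeneration(
  tokens: readonly number[],
  isEndOfGeneration: (token: number) => boolean,
): number[] {
  const end = tokens.findIndex((t) => isEndOfGeneration(t));
  return end === -1 ? [...tokens] : tokens.slice(0, end);
}

// Stand-in predicate; in real use this would delegate to the model's check.
const endIds = new Set([2, 32007]);
const kept = truncateAtEndOfGeneration([15, 27, 2, 99], (t) => endIds.has(t));
```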

  • Load a model from a given list of Blobs.

    You can pass multiple buffers into the function (in case the model contains multiple shards).

    Parameters

    • ggufBlobsOrModel: Blob[] | Model

      Can be either a list of Blobs (if you use local files) or a Model object (if you use ModelManager).

    • params: LoadModelParams = {}

      LoadModelParams

    Returns Promise<void>

  • Load a model from a given URL (or a list of URLs, in case the model is split into smaller files).

    • If the model has already been downloaded (via downloadModel()), the cached model is used.
    • Otherwise, the model is downloaded from the internet.

    Parameters

    Returns Promise<void>

  • Whether a BOS token must be added to the tokenized sequence.

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if BOS token must be added to the sequence

  • Whether an EOS token must be added to the tokenized sequence.

    NOTE: This can only be used after loadModel is called.

    Returns boolean

    true if EOS token must be added to the sequence
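To show how the two flags above might be applied, the following sketch wraps a token sequence with BOS and EOS according to a policy object. `SpecialTokenPolicy` and `withSpecialTokens` are hypothetical; the flags and IDs would come from the corresponding getters. Skipping a token that is already present is a design choice of this example, not documented library behavior.

```typescript
// Hypothetical bundle of the values returned by the getters described above.
interface SpecialTokenPolicy {
  addBos: boolean;
  bosId: number;
  addEos: boolean;
  eosId: number;
}

// Return a copy of tokens with BOS prepended and EOS appended when required.
// A token that is already in place is not added a second time.
function withSpecialTokens(
  tokens: readonly number[],
  policy: SpecialTokenPolicy,
): number[] {
  const out = [...tokens];
  if (policy.addBos && out[0] !== policy.bosId) {
    out.unshift(policy.bosId);
  }
  if (policy.addEos && out[out.length - 1] !== policy.eosId) {
    out.push(policy.eosId);
  }
  return out;
}

// Illustrative IDs only.
const wrapped = withSpecialTokens([5, 6], {
  addBos: true,
  bosId: 1,
  addEos: false,
  eosId: 2,
});
```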

  • Set compatibility options for Wllama.

    Parameters

    • compat: null | WllamaCompat | "default"

      Set to null to disable compatibility, or 'default' to use the default compat resources from CDN.

    • mode: "safari" | "firefox_safari" = 'safari'

      'safari' by default. If set to 'firefox_safari', compat mode is also enabled on Firefox, which significantly degrades performance but allows using WebGPU on Firefox.

    Returns void

  • Whether the currently loaded model supports a specific input modality (e.g. image or audio).

    Parameters

    • modality: "image" | "audio"

    Returns boolean
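Before sending a multimodal prompt, a caller may want to know which required modalities the loaded model lacks. The sketch below takes the support check as a predicate, so no method name is assumed; `missingModalities` and the stub predicate are hypothetical.

```typescript
type Modality = "image" | "audio";

// List the required modalities that the model does not support.
function missingModalities(
  required: readonly Modality[],
  supports: (modality: Modality) => boolean,
): Modality[] {
  return required.filter((m) => !supports(m));
}

// Stand-in for a model that accepts images but not audio.
const imageOnly = (m: Modality): boolean => m === "image";
const missing = missingModalities(["image", "audio"], imageOnly);
```

An empty result means every required modality is supported; otherwise the list can drive an error message or a fallback to text-only input.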

  • Get the libllama version string, e.g. "b6327-4d74393".

    Returns string

    version string embedded at build time.
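If the build number needs to be compared programmatically, the string can be split into its parts. The sketch below assumes the format `b<build>-<commit>` inferred from the single example above; that format is not stated as a guarantee in this listing, so the parser returns null for anything else. `parseLibllamaVersion` is a hypothetical helper.

```typescript
interface LibllamaVersion {
  build: number;
  commit: string;
}

// Parse strings shaped like "b6327-4d74393"; return null for other shapes.
function parseLibllamaVersion(version: string): LibllamaVersion | null {
  const match = /^b(\d+)-([0-9a-f]+)$/.exec(version);
  if (match === null) {
    return null;
  }
  return { build: Number(match[1]), commit: match[2] ?? "" };
}

const parsed = parseLibllamaVersion("b6327-4d74393");
```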