NativeApi

llama_reset_timings(SafeLLamaContextHandle)

public static void llama_reset_timings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

llama_print_system_info()

Print system information

public static IntPtr llama_print_system_info()

Returns

IntPtr
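Because the function returns an IntPtr rather than a managed string, the caller has to marshal the result. The following is a minimal sketch, assuming the pointer refers to a null-terminated C string. The class and method names are hypothetical, and a locally allocated string stands in for the native call so that the sketch runs without the native library.

```csharp
using System;
using System.Runtime.InteropServices;

public static class NativeStringDemo
{
    // Reads a null-terminated C string from a native pointer.
    // Marshal.PtrToStringAnsi returns null for IntPtr.Zero, so normalise to "".
    public static string ReadNativeString(IntPtr ptr)
    {
        return ptr == IntPtr.Zero
            ? string.Empty
            : (Marshal.PtrToStringAnsi(ptr) ?? string.Empty);
    }

    // Stand-in for a native call returning char*: allocates a C string,
    // reads it back through ReadNativeString, then frees the allocation.
    public static string RoundTrip(string text)
    {
        IntPtr ptr = Marshal.StringToHGlobalAnsi(text);
        try
        {
            return ReadNativeString(ptr);
        }
        finally
        {
            Marshal.FreeHGlobal(ptr);
        }
    }
}
```

With the real binding, the call would presumably look like `NativeStringDemo.ReadNativeString(NativeApi.llama_print_system_info())`. The sketch assumes the returned memory is owned by the native library, so the caller does not free it.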

llama_model_quantize(String, String, LLamaFtype, Int32)

public static int llama_model_quantize(string fname_inp, string fname_out, LLamaFtype ftype, int nthread)

Parameters

fname_inp String

fname_out String

ftype LLamaFtype

nthread Int32

Returns

Int32

llama_sample_repetition_penalty(SafeLLamaContextHandle, IntPtr, Int32[], UInt64, Single)

Applies the repetition penalty described in the CTRL paper (https://arxiv.org/abs/1909.05858), with a fix for negative logits.

public static void llama_sample_repetition_penalty(SafeLLamaContextHandle ctx, IntPtr candidates, Int32[] last_tokens, ulong last_tokens_size, float penalty)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

last_tokens Int32[]

last_tokens_size UInt64

penalty Single
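To illustrate what the "negative logit fix" means, here is a minimal managed sketch of the idea. It is not the native implementation: the flat `logits` array indexed by token id and the class name are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class RepetitionPenaltySketch
{
    // logits[i] is assumed to hold the logit of token id i (hypothetical layout).
    public static void Apply(float[] logits, int[] lastTokens, float penalty)
    {
        // Each recently seen token is penalised once, however often it occurred.
        var seen = new HashSet<int>(lastTokens);
        foreach (int token in seen)
        {
            if (token < 0 || token >= logits.Length) continue;

            // Dividing a negative logit by penalty > 1 would raise it,
            // so negative logits are multiplied instead.
            if (logits[token] <= 0f)
                logits[token] *= penalty;
            else
                logits[token] /= penalty;
        }
    }
}
```

With a penalty of 2, a logit of 2 becomes 1 and a logit of -2 becomes -4, so both tokens become less likely.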

llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle, IntPtr, Int32[], UInt64, Single, Single)

Applies the frequency and presence penalties described in the OpenAI API documentation (https://platform.openai.com/docs/api-reference/parameter-details).

public static void llama_sample_frequency_and_presence_penalties(SafeLLamaContextHandle ctx, IntPtr candidates, Int32[] last_tokens, ulong last_tokens_size, float alpha_frequency, float alpha_presence)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

last_tokens Int32[]

last_tokens_size UInt64

alpha_frequency Single

alpha_presence Single
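The two penalties can be sketched in managed code as follows. This is an illustration of the formula from the OpenAI documentation, not the native implementation; the flat `logits` array indexed by token id and the class name are assumptions.

```csharp
using System;
using System.Collections.Generic;

public static class FrequencyPresenceSketch
{
    // logits[i] is assumed to hold the logit of token id i (hypothetical layout).
    public static void Apply(float[] logits, int[] lastTokens,
                             float alphaFrequency, float alphaPresence)
    {
        // Count how often each token appears in the recent history.
        var counts = new Dictionary<int, int>();
        foreach (int token in lastTokens)
        {
            counts.TryGetValue(token, out int count);
            counts[token] = count + 1;
        }

        foreach (KeyValuePair<int, int> entry in counts)
        {
            if (entry.Key < 0 || entry.Key >= logits.Length) continue;

            // The frequency term scales with the count; the presence term
            // is a flat cost paid once the token has appeared at all.
            logits[entry.Key] -= entry.Value * alphaFrequency + alphaPresence;
        }
    }
}
```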

llama_sample_softmax(SafeLLamaContextHandle, IntPtr)

Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.

public static void llama_sample_softmax(SafeLLamaContextHandle ctx, IntPtr candidates)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray
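As a rough illustration of the sort-then-normalise step, the sketch below works on a plain managed array instead of a native LLamaTokenDataArray. The tuple layout and class name are assumptions; subtracting the maximum logit before exponentiating is a common numerical-stability measure.

```csharp
using System;

public static class SoftmaxSketch
{
    // Returns (token id, logit, probability) entries sorted by logit, descending.
    public static (int Id, float Logit, float P)[] Apply(float[] logits)
    {
        var items = new (int Id, float Logit, float P)[logits.Length];
        for (int i = 0; i < logits.Length; i++)
            items[i] = (i, logits[i], 0f);

        Array.Sort(items, (a, b) => b.Logit.CompareTo(a.Logit));
        if (items.Length == 0) return items;

        // After sorting, the first entry carries the largest logit.
        double max = items[0].Logit;
        double sum = 0.0;
        var exps = new double[items.Length];
        for (int i = 0; i < items.Length; i++)
        {
            exps[i] = Math.Exp(items[i].Logit - max);
            sum += exps[i];
        }

        for (int i = 0; i < items.Length; i++)
            items[i].P = (float)(exps[i] / sum);

        return items;
    }
}
```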

llama_sample_top_k(SafeLLamaContextHandle, IntPtr, Int32, UInt64)

Top-K sampling, as described in the paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

public static void llama_sample_top_k(SafeLLamaContextHandle ctx, IntPtr candidates, int k, ulong min_keep)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

k Int32

min_keep UInt64
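A minimal sketch of how k and min_keep might interact, assuming min_keep acts as a lower bound on the number of surviving candidates. It works on a plain array of logits indexed by token id rather than on the native candidate array, and the class name is hypothetical.

```csharp
using System;
using System.Linq;

public static class TopKSketch
{
    // Returns the ids of the tokens that survive top-k filtering,
    // ordered from highest to lowest logit.
    public static int[] Apply(float[] logits, int k, int minKeep)
    {
        // Never keep fewer than minKeep tokens, nor more than exist.
        k = Math.Max(k, minKeep);
        k = Math.Min(k, logits.Length);

        return Enumerable.Range(0, logits.Length)
            .OrderByDescending(i => logits[i])
            .Take(k)
            .ToArray();
    }
}
```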

llama_sample_top_p(SafeLLamaContextHandle, IntPtr, Single, UInt64)

Nucleus sampling, as described in the paper "The Curious Case of Neural Text Degeneration" (https://arxiv.org/abs/1904.09751).

public static void llama_sample_top_p(SafeLLamaContextHandle ctx, IntPtr candidates, float p, ulong min_keep)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

p Single

min_keep UInt64
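The nucleus idea can be sketched as follows: keep the shortest prefix of the sorted candidates whose cumulative probability reaches p, but never fewer than min_keep. This is a generic illustration of the technique and may differ in boundary handling from the native implementation; the input is assumed to be already normalised and sorted in descending order.

```csharp
using System;

public static class TopPSketch
{
    // sortedProbs: probabilities sorted in descending order and summing to 1.
    // Returns how many leading candidates to keep.
    public static int KeepCount(float[] sortedProbs, float p, int minKeep)
    {
        double cumulative = 0.0;
        for (int i = 0; i < sortedProbs.Length; i++)
        {
            cumulative += sortedProbs[i];

            // Stop once the nucleus holds probability mass p and the
            // minimum number of candidates has been reached.
            if (cumulative >= p && i + 1 >= minKeep)
                return i + 1;
        }
        return sortedProbs.Length;
    }
}
```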

llama_sample_tail_free(SafeLLamaContextHandle, IntPtr, Single, UInt64)

Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.

public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, IntPtr candidates, float z, ulong min_keep)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

z Single

min_keep UInt64

llama_sample_typical(SafeLLamaContextHandle, IntPtr, Single, UInt64)

Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.

public static void llama_sample_typical(SafeLLamaContextHandle ctx, IntPtr candidates, float p, ulong min_keep)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

p Single

min_keep UInt64

llama_sample_temperature(SafeLLamaContextHandle, IntPtr, Single)

public static void llama_sample_temperature(SafeLLamaContextHandle ctx, IntPtr candidates, float temp)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

temp Single

llama_sample_token_mirostat(SafeLLamaContextHandle, IntPtr, Single, Single, Int32, Single*)

Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.

public static int llama_sample_token_mirostat(SafeLLamaContextHandle ctx, IntPtr candidates, float tau, float eta, int m, Single* mu)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.

tau Single

The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.

eta Single

The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.

m Int32

The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.

mu Single*

Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.

Returns

Int32

llama_sample_token_mirostat_v2(SafeLLamaContextHandle, IntPtr, Single, Single, Single*)

Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.

public static int llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, IntPtr candidates, float tau, float eta, Single* mu)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.

tau Single

The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.

eta Single

The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.

mu Single*

Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.

Returns

Int32
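The roles of tau, eta, and mu described above can be sketched in managed code. This is an illustrative reading of the Mirostat 2.0 description, not the native implementation: candidates whose surprise (-log2 p) exceeds mu are dropped, and after a token is sampled, mu moves against the error between the observed surprise and tau. The class and method names are hypothetical, and the sampling step itself is left out so that the sketch stays deterministic.

```csharp
using System;

public static class MirostatV2Sketch
{
    // mu starts at twice the target cross-entropy.
    public static float InitialMu(float tau) => 2f * tau;

    // sortedProbs: probabilities in descending order.
    // Keeps the leading candidates whose surprise does not exceed mu,
    // and at least one candidate when any exist.
    public static int KeepCount(float[] sortedProbs, float mu)
    {
        if (sortedProbs.Length == 0) return 0;

        int keep = 0;
        while (keep < sortedProbs.Length && -Math.Log(sortedProbs[keep], 2) <= mu)
            keep++;

        return Math.Max(keep, 1);
    }

    // Moves mu towards the target: a token more surprising than tau lowers mu,
    // a less surprising one raises it. eta controls the step size.
    public static float UpdateMu(float mu, float tau, float eta, float observedSurprise)
    {
        return mu - eta * (observedSurprise - tau);
    }
}
```

Because mu is passed to the native functions as a pointer, the caller is expected to keep the same mu variable alive across successive sampling calls.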

llama_sample_token_greedy(SafeLLamaContextHandle, IntPtr)

Selects the token with the highest probability.

public static int llama_sample_token_greedy(SafeLLamaContextHandle ctx, IntPtr candidates)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

Returns

Int32

llama_sample_token(SafeLLamaContextHandle, IntPtr)

Randomly selects a token from the candidates based on their probabilities.

public static int llama_sample_token(SafeLLamaContextHandle ctx, IntPtr candidates)

Parameters

ctx SafeLLamaContextHandle

candidates IntPtr

Pointer to LLamaTokenDataArray

Returns

Int32

llama_empty_call()

public static bool llama_empty_call()

Returns

Boolean

llama_context_default_params()

public static LLamaContextParams llama_context_default_params()

Returns

LLamaContextParams

llama_mmap_supported()

public static bool llama_mmap_supported()

Returns

Boolean

llama_mlock_supported()

public static bool llama_mlock_supported()

Returns

Boolean

llama_init_from_file(String, LLamaContextParams)

Loads a ggml llama model from a file.
Allocates (almost) all memory needed for the model.
Returns NULL on failure.

public static IntPtr llama_init_from_file(string path_model, LLamaContextParams params_)

Parameters

path_model String

params_ LLamaContextParams

Returns

IntPtr

llama_init_backend()

Initializes the llama + ggml backend.
Call once at the start of the program.
This API is not considered well designed and is very likely to change.

public static void llama_init_backend()

llama_free(IntPtr)

Frees all allocated memory

public static void llama_free(IntPtr ctx)

Parameters

ctx IntPtr

llama_apply_lora_from_file(SafeLLamaContextHandle, String, String, Int32)

Applies a LoRA adapter to a loaded model.
path_base_model is the path to a higher-quality model to use as a base for the layers modified by the adapter. It can be NULL to use the currently loaded model.
The model must be reloaded before applying a new adapter; otherwise the new adapter is applied on top of the previous one.

public static int llama_apply_lora_from_file(SafeLLamaContextHandle ctx, string path_lora, string path_base_model, int n_threads)

Parameters

ctx SafeLLamaContextHandle

path_lora String

path_base_model String

n_threads Int32

Returns

Int32

Returns 0 on success

llama_get_kv_cache_token_count(SafeLLamaContextHandle)

Returns the number of tokens in the KV cache

public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_set_rng_seed(SafeLLamaContextHandle, Int32)

Sets the current rng seed.

public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, int seed)

Parameters

ctx SafeLLamaContextHandle

seed Int32

llama_get_state_size(SafeLLamaContextHandle)

Returns the maximum size in bytes of the state (rng, logits, embedding, and kv_cache). The actual size will often be smaller after compacting tokens.

public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

UInt64

llama_copy_state_data(SafeLLamaContextHandle, Byte[])

Copies the state to the specified destination buffer.
The destination must have enough memory allocated.
Returns the number of bytes copied.

public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte[] dest)

Parameters

ctx SafeLLamaContextHandle

dest Byte[]

Returns

UInt64

llama_set_state_data(SafeLLamaContextHandle, Byte[])

Sets the state by reading from the specified buffer.
Returns the number of bytes read.

public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte[] src)

Parameters

ctx SafeLLamaContextHandle

src Byte[]

Returns

UInt64

llama_load_session_file(SafeLLamaContextHandle, String, Int32[], UInt64, UInt64*)

Load session file

public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens_out, ulong n_token_capacity, UInt64* n_token_count_out)

Parameters

ctx SafeLLamaContextHandle

path_session String

tokens_out Int32[]

n_token_capacity UInt64

n_token_count_out UInt64*

Returns

Boolean

llama_save_session_file(SafeLLamaContextHandle, String, Int32[], UInt64)

Save session file

public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, Int32[] tokens, ulong n_token_count)

Parameters

ctx SafeLLamaContextHandle

path_session String

tokens Int32[]

n_token_count UInt64

Returns

Boolean

llama_eval(SafeLLamaContextHandle, Int32[], Int32, Int32, Int32)

Runs llama inference to obtain the logits and probabilities for the next token.
tokens and n_tokens describe the batch of new tokens to process.
n_past is the number of tokens to reuse from previous eval calls.

public static int llama_eval(SafeLLamaContextHandle ctx, Int32[] tokens, int n_tokens, int n_past, int n_threads)

Parameters

ctx SafeLLamaContextHandle

tokens Int32[]

n_tokens Int32

n_past Int32

n_threads Int32

Returns

Int32

Returns 0 on success

llama_eval_with_pointer(SafeLLamaContextHandle, Int32*, Int32, Int32, Int32)

public static int llama_eval_with_pointer(SafeLLamaContextHandle ctx, Int32* tokens, int n_tokens, int n_past, int n_threads)

Parameters

ctx SafeLLamaContextHandle

tokens Int32*

n_tokens Int32

n_past Int32

n_threads Int32

Returns

Int32

llama_tokenize(SafeLLamaContextHandle, String, Encoding, Int32[], Int32, Boolean)

Converts the provided text into tokens.
The tokens array must be large enough to hold the resulting tokens.
Returns the number of tokens on success, which is no more than n_max_tokens.
Returns a negative number on failure; its magnitude is the number of tokens that would have been returned.

public static int llama_tokenize(SafeLLamaContextHandle ctx, string text, Encoding encoding, Int32[] tokens, int n_max_tokens, bool add_bos)

Parameters

ctx SafeLLamaContextHandle

text String

encoding Encoding

tokens Int32[]

n_max_tokens Int32

add_bos Boolean

Returns

Int32
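The return-value contract described above suggests a resize-and-retry pattern. The sketch below shows it under stated assumptions: the `tokenize` delegate is a hypothetical stand-in for the native call (taking a buffer and its capacity), and `FakeTokenizer` is a toy that emits one token per character so the example runs without a model.

```csharp
using System;

public static class TokenizeBufferSketch
{
    // tokenize(buffer, capacity) is assumed to follow the documented contract:
    // it returns the token count on success, or the negated required count
    // when the buffer is too small.
    public static int[] TokenizeWithRetry(Func<int[], int, int> tokenize, int initialCapacity)
    {
        var buffer = new int[Math.Max(initialCapacity, 1)];
        int count = tokenize(buffer, buffer.Length);

        if (count < 0)
        {
            // The magnitude of the result is the capacity actually needed.
            buffer = new int[-count];
            count = tokenize(buffer, buffer.Length);
            if (count < 0)
                throw new InvalidOperationException("Tokenization failed after resizing the buffer.");
        }

        Array.Resize(ref buffer, count);
        return buffer;
    }

    // Hypothetical toy tokenizer: one token per character.
    public static Func<int[], int, int> FakeTokenizer(string text)
    {
        return (buffer, capacity) =>
        {
            if (text.Length > capacity) return -text.Length;
            for (int i = 0; i < text.Length; i++) buffer[i] = text[i];
            return text.Length;
        };
    }
}
```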

llama_tokenize_native(SafeLLamaContextHandle, SByte[], Int32[], Int32, Boolean)

public static int llama_tokenize_native(SafeLLamaContextHandle ctx, SByte[] text, Int32[] tokens, int n_max_tokens, bool add_bos)

Parameters

ctx SafeLLamaContextHandle

text SByte[]

tokens Int32[]

n_max_tokens Int32

add_bos Boolean

Returns

Int32

llama_n_vocab(SafeLLamaContextHandle)

public static int llama_n_vocab(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_n_ctx(SafeLLamaContextHandle)

public static int llama_n_ctx(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_n_embd(SafeLLamaContextHandle)

public static int llama_n_embd(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Int32

llama_get_logits(SafeLLamaContextHandle)

Token logits obtained from the last call to llama_eval().
The logits for the last token are stored in the last row.
The logits can be mutated to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab

public static Single* llama_get_logits(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Single*
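Given the row-major layout described above (n_tokens rows of n_vocab columns), the last token's logits start at offset (n_tokens - 1) * n_vocab. The sketch below demonstrates that indexing on a managed array that stands in for the native buffer; the class name is hypothetical, and copying the data out of the Single* pointer is left to the caller.

```csharp
using System;

public static class LogitsLayoutSketch
{
    // logits: flattened row-major buffer with nTokens rows and nVocab columns.
    // Returns a copy of the last row, i.e. the logits for the last token.
    public static float[] LastRow(float[] logits, int nTokens, int nVocab)
    {
        if (nTokens <= 0 || nVocab <= 0 || logits.Length < nTokens * nVocab)
            throw new ArgumentException("Buffer is smaller than nTokens * nVocab.");

        var row = new float[nVocab];
        Array.Copy(logits, (nTokens - 1) * nVocab, row, 0, nVocab);
        return row;
    }
}
```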

llama_get_embeddings(SafeLLamaContextHandle)

Get the embeddings for the input
shape: [n_embd] (1-dimensional)

public static Single* llama_get_embeddings(SafeLLamaContextHandle ctx)

Parameters

ctx SafeLLamaContextHandle

Returns

Single*

llama_token_to_str(SafeLLamaContextHandle, Int32)

Token Id -> String. Uses the vocabulary in the provided context

public static IntPtr llama_token_to_str(SafeLLamaContextHandle ctx, int token)

Parameters

ctx SafeLLamaContextHandle

token Int32

Returns

IntPtr

Pointer to a string.

llama_token_bos()

public static int llama_token_bos()

Returns

Int32

llama_token_eos()

public static int llama_token_eos()

Returns

Int32

llama_token_nl()

public static int llama_token_nl()

Returns

Int32