Namespace: LLama.Native
Direct translation of the llama.cpp API
public static class NativeApi
Inheritance Object → NativeApi
Mirostat 1.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static LLamaToken llama_sample_token_mirostat(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, int m, Single& mu)
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
m Int32
The number of tokens considered in the estimation of s_hat. This is an arbitrary value that is used to calculate s_hat, which in turn helps to calculate the value of k. In the paper, they use m = 100, but you can experiment with different values to see how it affects the performance of the algorithm.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
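A minimal sketch of one Mirostat 1.0 sampling step, assuming `ctx` is a valid SafeLLamaContextHandle and `candidates` is an LLamaTokenDataArrayNative built over the current logits; the tau, eta and m values are only the paper's suggested defaults.

```csharp
// Hedged sketch: sample one token with Mirostat 1.0.
// 'ctx' and 'candidates' are assumed to be prepared elsewhere.
float tau = 5.0f;                  // target cross-entropy (surprise)
float eta = 0.1f;                  // learning rate for mu
float mu = 2.0f * tau;             // mu is conventionally initialized to twice the target
LLamaToken token = NativeApi.llama_sample_token_mirostat(ctx, ref candidates, tau, eta, 100, ref mu);
// 'mu' is updated in place and should be carried over to the next sampling step.
```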
Mirostat 2.0 algorithm described in the paper https://arxiv.org/abs/2007.14966. Uses tokens instead of words.
public static LLamaToken llama_sample_token_mirostat_v2(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float tau, float eta, Single& mu)
candidates LLamaTokenDataArrayNative&
A vector of llama_token_data containing the candidate tokens, their probabilities (p), and log-odds (logit) for the current position in the generated text.
tau Single
The target cross-entropy (or surprise) value you want to achieve for the generated text. A higher value corresponds to more surprising or less predictable text, while a lower value corresponds to less surprising or more predictable text.
eta Single
The learning rate used to update mu based on the error between the target and observed surprisal of the sampled word. A larger learning rate will cause mu to be updated more quickly, while a smaller learning rate will result in slower updates.
mu Single&
Maximum cross-entropy. This value is initialized to be twice the target cross-entropy (2 * tau) and is updated in the algorithm based on the error between the target and observed surprisal.
Selects the token with the highest probability.
public static LLamaToken llama_sample_token_greedy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Randomly selects a token from the candidates based on their probabilities.
public static LLamaToken llama_sample_token(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
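A hedged sketch of a common sampling chain that ends in llama_sample_token; `ctx` and `candidates` (an LLamaTokenDataArrayNative over the current logits) are assumed to be prepared elsewhere, and the cut-off values are arbitrary examples.

```csharp
// Filter the candidates, then pick a token at random according to the remaining probabilities.
NativeApi.llama_sample_top_k(ctx, ref candidates, 40, 1);     // keep only the 40 most likely tokens
NativeApi.llama_sample_top_p(ctx, ref candidates, 0.95f, 1);  // nucleus sampling
NativeApi.llama_sample_temp(ctx, ref candidates, 0.8f);       // scale logits by temperature
LLamaToken next = NativeApi.llama_sample_token(ctx, ref candidates);
```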
Set the number of threads used for decoding
public static void llama_set_n_threads(SafeLLamaContextHandle ctx, uint n_threads, uint n_threads_batch)
n_threads UInt32
n_threads is the number of threads used for generation (single token)
n_threads_batch UInt32
n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)
Get the type of vocabulary used by this model
public static LLamaVocabType llama_vocab_type(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Get the type of RoPE (rotary position embedding) used by this model
public static LLamaRopeType llama_rope_type(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Create a new grammar from the given set of grammar rules
public static IntPtr llama_grammar_init(LLamaGrammarElement** rules, ulong n_rules, ulong start_rule_index)
rules LLamaGrammarElement**
n_rules UInt64
start_rule_index UInt64
Free all memory from the given SafeLLamaGrammarHandle
public static void llama_grammar_free(IntPtr grammar)
grammar IntPtr
Create a copy of an existing grammar instance
public static IntPtr llama_grammar_copy(SafeLLamaGrammarHandle grammar)
grammar SafeLLamaGrammarHandle
Apply constraints from grammar
public static void llama_sample_grammar(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, SafeLLamaGrammarHandle grammar)
candidates LLamaTokenDataArrayNative&
grammar SafeLLamaGrammarHandle
Accepts the sampled token into the grammar
public static void llama_grammar_accept_token(SafeLLamaContextHandle ctx, SafeLLamaGrammarHandle grammar, LLamaToken token)
grammar SafeLLamaGrammarHandle
token LLamaToken
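A hedged sketch of grammar-constrained sampling, assuming `ctx`, `candidates` and `grammar` have been prepared elsewhere: constrain the candidates, sample, then feed the chosen token back into the grammar state.

```csharp
// Apply the grammar constraints, sample a token, then advance the grammar.
NativeApi.llama_sample_grammar(ctx, ref candidates, grammar);   // suppress tokens the grammar forbids
LLamaToken token = NativeApi.llama_sample_token(ctx, ref candidates);
NativeApi.llama_grammar_accept_token(ctx, grammar, token);      // update the grammar state with the accepted token
```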
Sanity check for clip <-> llava embed size match
public static bool llava_validate_embed_size(SafeLLamaContextHandle ctxLlama, SafeLlavaModelHandle ctxClip)
ctxLlama SafeLLamaContextHandle
LLama Context
ctxClip SafeLlavaModelHandle
Llava Model
Boolean
True if validation succeeds (the clip and llama embedding sizes match)
Build an image embed from image file bytes
public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_bytes(SafeLlavaModelHandle ctx_clip, int n_threads, Byte[] image_bytes, int image_bytes_length)
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_bytes Byte[]
Binary image in jpeg format
image_bytes_length Int32
Byte length of the image
SafeLlavaImageEmbedHandle
SafeHandle to the Embeddings
Build an image embed from a path to an image filename
public static SafeLlavaImageEmbedHandle llava_image_embed_make_with_filename(SafeLlavaModelHandle ctx_clip, int n_threads, string image_path)
ctx_clip SafeLlavaModelHandle
SafeHandle to the Clip Model
n_threads Int32
Number of threads
image_path String
Path to the image file (jpeg) to generate embeddings from
SafeLlavaImageEmbedHandle
SafeHandle to the embeddings
Free an embedding made with llava_image_embed_make_*
public static void llava_image_embed_free(IntPtr embed)
embed IntPtr
Embeddings to release
Write the image represented by embed into the llama context with batch size n_batch, starting at context
pos n_past. On completion, n_past points to the next position in the context after the image embed.
public static bool llava_eval_image_embed(SafeLLamaContextHandle ctx_llama, SafeLlavaImageEmbedHandle embed, int n_batch, Int32& n_past)
ctx_llama SafeLLamaContextHandle
Llama Context
embed SafeLlavaImageEmbedHandle
Embedding handle
n_batch Int32
n_past Int32&
Boolean
True on success
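A hedged sketch of the full llava flow: check that the clip and llama embedding sizes match, build an embed from an image file, then write it into the context. `ctxLlama` and `ctxClip` are assumed to be already-loaded handles, the file name and thread count are placeholders, and `n_past` should hold the number of tokens already evaluated in the context.

```csharp
// Validate the model pairing, embed the image, and evaluate the embed into the context.
if (!NativeApi.llava_validate_embed_size(ctxLlama, ctxClip))
    throw new InvalidOperationException("clip and llama embedding sizes do not match");

using var embed = NativeApi.llava_image_embed_make_with_filename(ctxClip, 4, "image.jpg");
int n_past = 0;                                            // tokens already in the context (0 here)
NativeApi.llava_eval_image_embed(ctxLlama, embed, 512, ref n_past);
// n_past now points at the first position after the image embed.
```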
Returns 0 on success
public static uint llama_model_quantize(string fname_inp, string fname_out, LLamaModelQuantizeParams* param)
fname_inp String
fname_out String
param LLamaModelQuantizeParams*
UInt32
Returns 0 on success
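A hedged sketch of quantizing a GGUF file with the default quantization parameters; the file names are placeholders.

```csharp
// Quantize an f16 GGUF model using the library's default quantization settings.
unsafe
{
    LLamaModelQuantizeParams qparams = NativeApi.llama_model_quantize_default_params();
    uint result = NativeApi.llama_model_quantize("model-f16.gguf", "model-q4.gguf", &qparams);
    // A result of 0 indicates success.
}
```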
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
public static void llama_sample_repetition_penalties(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, LLamaToken* last_tokens, ulong last_tokens_size, float penalty_repeat, float penalty_freq, float penalty_present)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
last_tokens LLamaToken*
last_tokens_size UInt64
penalty_repeat Single
Repetition penalty described in CTRL academic paper https://arxiv.org/abs/1909.05858, with negative logit fix.
penalty_freq Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
penalty_present Single
Frequency and presence penalties described in OpenAI API https://platform.openai.com/docs/api-reference/parameter-details.
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_apply_guidance(SafeLLamaContextHandle ctx, Span<float> logits, ReadOnlySpan<float> logits_guidance, float scale)
logits Span<Single>
Logits extracted from the original generation context.
logits_guidance ReadOnlySpan<Single>
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
Apply classifier-free guidance to the logits as described in academic paper "Stay on topic with Classifier-Free Guidance" https://arxiv.org/abs/2306.17806
public static void llama_sample_apply_guidance(SafeLLamaContextHandle ctx, Single* logits, Single* logits_guidance, float scale)
logits Single*
Logits extracted from the original generation context.
logits_guidance Single*
Logits extracted from a separate context from the same model.
Other than a negative prompt at the beginning, it should have all generated and user input tokens copied from the main context.
scale Single
Guidance strength. 1.0f means no guidance. Higher values mean stronger guidance.
Sorts candidate tokens by their logits in descending order and calculates probabilities based on the logits.
public static void llama_sample_softmax(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
Top-K sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_k(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, int k, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
k Int32
min_keep UInt64
Nucleus sampling described in academic paper "The Curious Case of Neural Text Degeneration" https://arxiv.org/abs/1904.09751
public static void llama_sample_top_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
Minimum P sampling as described in https://github.com/ggerganov/llama.cpp/pull/3841
public static void llama_sample_min_p(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
Tail Free Sampling described in https://www.trentonbricken.com/Tail-Free-Sampling/.
public static void llama_sample_tail_free(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float z, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
z Single
min_keep UInt64
Locally Typical Sampling implementation described in the paper https://arxiv.org/abs/2202.00666.
public static void llama_sample_typical(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float p, ulong min_keep)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
p Single
min_keep UInt64
Dynamic temperature implementation described in the paper https://arxiv.org/abs/2309.02772.
public static void llama_sample_entropy(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float min_temp, float max_temp, float exponent_val)
candidates LLamaTokenDataArrayNative&
Pointer to LLamaTokenDataArray
min_temp Single
max_temp Single
exponent_val Single
Modify logits by temperature
public static void llama_sample_temp(SafeLLamaContextHandle ctx, LLamaTokenDataArrayNative& candidates, float temp)
candidates LLamaTokenDataArrayNative&
temp Single
Get the embeddings for the input
public static Span<float> llama_get_embeddings(SafeLLamaContextHandle ctx)
Apply chat template. Inspired by hf apply_chat_template() in Python.
Both "model" and "custom_template" are optional, but at least one is required. "custom_template" has higher precedence than "model"
NOTE: This function does not use a jinja parser. It only supports a pre-defined list of templates. See more: https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
public static int llama_chat_apply_template(SafeLlamaModelHandle model, Char* tmpl, LLamaChatMessage* chat, IntPtr n_msg, bool add_ass, Char* buf, int length)
model SafeLlamaModelHandle
tmpl Char*
A Jinja template to use for this chat. If this is nullptr, the model’s default chat template will be used instead.
chat LLamaChatMessage*
Pointer to a list of multiple llama_chat_message
n_msg IntPtr
Number of llama_chat_message in this chat
add_ass Boolean
Whether to end the prompt with the token(s) that indicate the start of an assistant message.
buf Char*
A buffer to hold the output formatted prompt. The recommended alloc size is 2 * (total number of characters of all messages)
length Int32
The size of the allocated buffer
Int32
The total number of bytes of the formatted prompt. If it is larger than the size of the buffer, you may need to re-allocate the buffer and then re-apply the template.
Get the "Beginning of sentence" token
public static LLamaToken llama_token_bos(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Get the "End of sentence" token
public static LLamaToken llama_token_eos(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Get the "new line" token
public static LLamaToken llama_token_nl(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Whether a BOS token should be added when tokenizing with this model. Returns -1 if unknown, 1 for true or 0 for false.
public static int llama_add_bos_token(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Whether an EOS token should be added when tokenizing with this model. Returns -1 if unknown, 1 for true or 0 for false.
public static int llama_add_eos_token(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
codellama infill tokens, Beginning of infill prefix
public static int llama_token_prefix(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
codellama infill tokens, Beginning of infill middle
public static int llama_token_middle(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
codellama infill tokens, Beginning of infill suffix
public static int llama_token_suffix(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
codellama infill tokens, End of infill middle
public static int llama_token_eot(SafeLlamaModelHandle model)
model SafeLlamaModelHandle
Print out timing information for this context
public static void llama_print_timings(SafeLLamaContextHandle ctx)
Reset all collected timing information for this context
public static void llama_reset_timings(SafeLLamaContextHandle ctx)
Print system information
public static IntPtr llama_print_system_info()
Convert a single token into text
public static int llama_token_to_piece(SafeLlamaModelHandle model, LLamaToken llamaToken, Span<byte> buffer)
model SafeLlamaModelHandle
llamaToken LLamaToken
buffer Span<Byte>
buffer to write string into
Int32
The number of bytes written, or, if the buffer is too small, a negative value indicating the length required
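A hedged sketch of the retry pattern for llama_token_to_piece: if the first buffer is too small, the negative return value tells how much space is needed. `model` and `token` are assumed to exist.

```csharp
// Convert a single token to its UTF-8 text, growing the buffer if the first try is too small.
Span<byte> buffer = stackalloc byte[16];
int written = NativeApi.llama_token_to_piece(model, token, buffer);
if (written < 0)
{
    buffer = new byte[-written];                             // negative result = required length
    written = NativeApi.llama_token_to_piece(model, token, buffer);
}
string piece = System.Text.Encoding.UTF8.GetString(buffer.Slice(0, written));
```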
Convert text into tokens
public static int llama_tokenize(SafeLlamaModelHandle model, Byte* text, int text_len, LLamaToken* tokens, int n_max_tokens, bool add_bos, bool special)
model SafeLlamaModelHandle
text Byte*
text_len Int32
tokens LLamaToken*
n_max_tokens Int32
add_bos Boolean
special Boolean
Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext. Does not insert a leading space.
Int32
Returns the number of tokens on success, no more than n_max_tokens.
Returns a negative number on failure - its absolute value is the number of tokens that would have been returned
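A hedged sketch of the two-pass tokenization pattern this return value enables: a first call with no output buffer reports the required count (as a negative number), then a second call fills the array. `model` is assumed to be a loaded SafeLlamaModelHandle, and passing a null/zero-length buffer to obtain the count is an assumption carried over from the underlying llama.cpp behaviour.

```csharp
// Tokenize a UTF-8 string in two passes: first measure, then fill.
unsafe
{
    byte[] text = System.Text.Encoding.UTF8.GetBytes("Hello, world");
    fixed (byte* textPtr = text)
    {
        int n = NativeApi.llama_tokenize(model, textPtr, text.Length, null, 0, true, false);
        var tokens = new LLamaToken[n < 0 ? -n : n];
        fixed (LLamaToken* tokenPtr = tokens)
            n = NativeApi.llama_tokenize(model, textPtr, text.Length, tokenPtr, tokens.Length, true, false);
    }
}
```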
Register a callback to receive llama log messages
public static void llama_log_set(LLamaLogCallback logCallback)
logCallback LLamaLogCallback
Clear the KV cache
public static void llama_kv_cache_clear(SafeLLamaContextHandle ctx)
Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
public static void llama_kv_cache_seq_rm(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1)
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
Copy all tokens that belong to the specified sequence to another sequence
Note that this does not allocate extra KV cache memory - it simply assigns the tokens to the new sequence
public static void llama_kv_cache_seq_cp(SafeLLamaContextHandle ctx, LLamaSeqId src, LLamaSeqId dest, LLamaPos p0, LLamaPos p1)
src LLamaSeqId
dest LLamaSeqId
p0 LLamaPos
p1 LLamaPos
Removes all tokens that do not belong to the specified sequence
public static void llama_kv_cache_seq_keep(SafeLLamaContextHandle ctx, LLamaSeqId seq)
seq LLamaSeqId
Adds relative position "delta" to all tokens that belong to the specified sequence and have positions in [p0, p1)
If the KV cache is RoPEd, the KV data is updated accordingly: lazily on the next llama_decode(), or explicitly with llama_kv_cache_update()
public static void llama_kv_cache_seq_add(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int delta)
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
delta Int32
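A hedged sketch of the "context shifting" pattern built from llama_kv_cache_seq_rm and llama_kv_cache_seq_add: drop a span of old tokens from sequence 0 and shift the remaining ones back to fill the gap. `ctx` is assumed valid, the counts are placeholders, and conversions from int to LLamaPos/LLamaSeqId are assumed to exist.

```csharp
// Free up room in the context by discarding old tokens and shifting the rest left.
LLamaSeqId seq = (LLamaSeqId)0;
int keep = 64;                                    // e.g. keep the system prompt
int discard = 256;                                // tokens to drop after that
NativeApi.llama_kv_cache_seq_rm(ctx, seq, (LLamaPos)keep, (LLamaPos)(keep + discard));
NativeApi.llama_kv_cache_seq_add(ctx, seq, (LLamaPos)(keep + discard), (LLamaPos)(-1), -discard);  // p1 = -1 means "to the end"
```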
Integer division of the positions by factor of d > 1
If the KV cache is RoPEd, the KV data is updated accordingly: lazily on the next llama_decode(), or explicitly with llama_kv_cache_update()
public static void llama_kv_cache_seq_div(SafeLLamaContextHandle ctx, LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int d)
seq LLamaSeqId
p0 LLamaPos
p1 LLamaPos
d Int32
Returns the largest position present in the KV cache for the specified sequence
public static LLamaPos llama_kv_cache_seq_pos_max(SafeLLamaContextHandle ctx, LLamaSeqId seq)
seq LLamaSeqId
Defragment the KV cache. This will be applied lazily on the next llama_decode(), or explicitly with llama_kv_cache_update()
public static LLamaPos llama_kv_cache_defrag(SafeLLamaContextHandle ctx)
Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
public static void llama_kv_cache_update(SafeLLamaContextHandle ctx)
Allocates a batch of tokens on the heap
Each token can be assigned up to n_seq_max sequence ids
The batch has to be freed with llama_batch_free()
If embd != 0, llama_batch.embd will be allocated with size of n_tokens * embd * sizeof(float)
Otherwise, llama_batch.token will be allocated to store n_tokens llama_token
The rest of the llama_batch members are allocated with size n_tokens
All members are left uninitialized
public static LLamaNativeBatch llama_batch_init(int n_tokens, int embd, int n_seq_max)
n_tokens Int32
embd Int32
n_seq_max Int32
Each token can be assigned up to n_seq_max sequence ids
Frees a batch of tokens allocated with llama_batch_init()
public static void llama_batch_free(LLamaNativeBatch batch)
batch LLamaNativeBatch
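A hedged sketch of the allocate/use/free pairing for batches; the sizes are placeholders.

```csharp
// Allocate a token-mode batch (embd == 0), use it, then release the native memory.
LLamaNativeBatch batch = NativeApi.llama_batch_init(512, 0, 1);   // 512 tokens, 1 sequence id per token
try
{
    // ... fill the batch (tokens, positions, sequence ids, logits flags) and decode it ...
}
finally
{
    NativeApi.llama_batch_free(batch);                             // required: batches must be freed with llama_batch_free()
}
```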
public static int llama_decode(SafeLLamaContextHandle ctx, LLamaNativeBatch batch)
batch LLamaNativeBatch
Int32
Positive return values do not mean a fatal error, but rather a warning. 0 = success; 1 = a KV slot could not be found for the batch (try reducing the batch size or increasing the context size); < 0 = error.
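A hedged sketch of handling that return value; `ctx` and `batch` are assumed to be prepared elsewhere.

```csharp
// Decode a batch and act on the documented return-value ranges.
int result = NativeApi.llama_decode(ctx, batch);
if (result == 0)
{
    // Success: logits for the requested tokens are now available.
}
else if (result > 0)
{
    // Warning (e.g. no KV slot found): try a smaller batch or a larger context.
}
else
{
    throw new InvalidOperationException($"llama_decode failed with code {result}");
}
```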
Create an empty KV cache view. (use only for debugging purposes)
public static LLamaKvCacheView llama_kv_cache_view_init(SafeLLamaContextHandle ctx, int n_max_seq)
n_max_seq Int32
Free a KV cache view. (use only for debugging purposes)
public static void llama_kv_cache_view_free(LLamaKvCacheView& view)
view LLamaKvCacheView&
Update the KV cache view structure with the current state of the KV cache. (use only for debugging purposes)
public static void llama_kv_cache_view_update(SafeLLamaContextHandle ctx, LLamaKvCacheView& view)
view LLamaKvCacheView&
Returns the number of tokens in the KV cache (slow, use only for debug)
If a KV cell has multiple sequences assigned to it, it will be counted multiple times
public static int llama_get_kv_cache_token_count(SafeLLamaContextHandle ctx)
Returns the number of used KV cells (i.e. have at least one sequence assigned to them)
public static int llama_get_kv_cache_used_cells(SafeLLamaContextHandle ctx)
Deterministically returns the entire sentence constructed by a beam search.
public static void llama_beam_search(SafeLLamaContextHandle ctx, LLamaBeamSearchCallback callback, IntPtr callback_data, ulong n_beams, int n_past, int n_predict, int n_threads)
ctx SafeLLamaContextHandle
Pointer to the llama_context.
callback LLamaBeamSearchCallback
Invoked for each iteration of the beam_search loop, passing in beams_state.
callback_data IntPtr
A pointer that is simply passed back to callback.
n_beams UInt64
Number of beams to use.
n_past Int32
Number of tokens already evaluated.
n_predict Int32
Maximum number of tokens to predict. EOS may occur earlier.
n_threads Int32
Number of threads.
A method that does nothing. This is a native method; calling it will force the llama native dependencies to be loaded.
public static void llama_empty_call()
Get the maximum number of devices supported by llama.cpp
public static long llama_max_devices()
Create a LLamaModelParams with default values
public static LLamaModelParams llama_model_default_params()
Create a LLamaContextParams with default values
public static LLamaContextParams llama_context_default_params()
Create a LLamaModelQuantizeParams with default values
public static LLamaModelQuantizeParams llama_model_quantize_default_params()
Check if memory mapping is supported
public static bool llama_supports_mmap()
Check if memory locking is supported
public static bool llama_supports_mlock()
Check if GPU offload is supported
public static bool llama_supports_gpu_offload()
Sets the current rng seed.
public static void llama_set_rng_seed(SafeLLamaContextHandle ctx, uint seed)
seed UInt32
Returns the maximum size in bytes of the state (rng, logits, embedding
and kv_cache) - will often be smaller after compacting tokens
public static ulong llama_get_state_size(SafeLLamaContextHandle ctx)
Copies the state to the specified destination address.
Destination needs to have allocated enough memory.
public static ulong llama_copy_state_data(SafeLLamaContextHandle ctx, Byte* dest)
dest Byte*
UInt64
the number of bytes copied
Set the state reading from the specified address
public static ulong llama_set_state_data(SafeLLamaContextHandle ctx, Byte* src)
src Byte*
UInt64
the number of bytes read
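A hedged sketch of snapshotting and restoring the context state with the three calls above; `ctx` is assumed to be a valid SafeLLamaContextHandle.

```csharp
// Snapshot the full context state into a managed buffer, then roll back to it later.
unsafe
{
    ulong size = NativeApi.llama_get_state_size(ctx);
    var state = new byte[size];
    fixed (byte* dst = state)
        NativeApi.llama_copy_state_data(ctx, dst);     // returns the number of bytes actually copied

    // ... later: restore the context to the snapshot ...
    fixed (byte* src = state)
        NativeApi.llama_set_state_data(ctx, src);
}
```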
Load session file
public static bool llama_load_session_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens_out, ulong n_token_capacity, UInt64& n_token_count_out)
path_session String
tokens_out LLamaToken[]
n_token_capacity UInt64
n_token_count_out UInt64&
Save session file
public static bool llama_save_session_file(SafeLLamaContextHandle ctx, string path_session, LLamaToken[] tokens, ulong n_token_count)
path_session String
tokens LLamaToken[]
n_token_count UInt64
Get the text representation of this token
public static Byte* llama_token_get_text(SafeLlamaModelHandle model, LLamaToken token)
model SafeLlamaModelHandle
token LLamaToken
Get the score of this token
public static float llama_token_get_score(SafeLlamaModelHandle model, LLamaToken token)
model SafeLlamaModelHandle
token LLamaToken
Get the type of this token
public static LLamaTokenType llama_token_get_type(SafeLlamaModelHandle model, LLamaToken token)
model SafeLlamaModelHandle
token LLamaToken
Get the size of the context window used by this context
public static uint llama_n_ctx(SafeLLamaContextHandle ctx)
Get the batch size for this context
public static uint llama_n_batch(SafeLLamaContextHandle ctx)
Token logits obtained from the last call to llama_decode
The logits for the last token are stored in the last row
Can be mutated in order to change the probabilities of the next token.
Rows: n_tokens
Cols: n_vocab
public static Single* llama_get_logits(SafeLLamaContextHandle ctx)
Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
public static Single* llama_get_logits_ith(SafeLLamaContextHandle ctx, int i)
i Int32
Get the embeddings for the ith sequence. Equivalent to: llama_get_embeddings(ctx) + i*n_embd
public static Single* llama_get_embeddings_ith(SafeLLamaContextHandle ctx, int i)
i Int32