
SafeLLamaContextHandle.cs 26 kB

April 2024 Binary Update (#662)

* Updated binaries, using [this build](https://github.com/SciSharp/LLamaSharp/actions/runs/8654672719/job/23733195669) for llama.cpp commit `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
  - Added all new functions.
  - Moved some functions (e.g. `SafeLlamaModelHandle` specific functions) into `SafeLlamaModelHandle.cs`.
  - Exposed tokens on `SafeLlamaModelHandle` and `LLamaWeights` through a `Tokens` property. As new special tokens are added in the future they can be added here.
  - Changed all token properties to return nullable tokens, to handle some models not having some tokens.
  - Fixed `DefaultSamplingPipeline` to handle models with no newline token.
* Moved native methods to more specific locations.
  - Context-specific things have been moved into `SafeLLamaContextHandle.cs` and made private - they're already exposed through C# properties and methods.
  - Checking that GPU layer count is zero if GPU offload is not supported.
  - Moved methods for creating default structs (`llama_model_quantize_default_params` and `llama_context_default_params`) into the relevant structs.
* Removed exception if `GpuLayerCount > 0` when GPU is not supported.
* Added low level wrapper methods for new per-sequence state load/save in `SafeLLamaContextHandle`.
* Added high level wrapper methods (save/load with `State` object or memory mapped file) in `LLamaContext`.
* Moved native methods for per-sequence state load/save into `SafeLLamaContextHandle`.
* Added update and defrag methods for KV cache in `SafeLLamaContextHandle`.
* Updated submodule to `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
* Passing the sequence ID when saving a single sequence state.
1 year ago
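The commit above adds low-level per-sequence state save/load to `SafeLLamaContextHandle` (`GetStateSize(LLamaSeqId)`, `GetState(byte*, ulong, LLamaSeqId)` and `SetState(byte*, LLamaSeqId)` in the file below). The following is a minimal sketch of how those methods might be called from user code; the `SequenceStateExample` helper is hypothetical, not part of the library, and assumes the project is compiled with unsafe code enabled.

```csharp
using LLama.Native;

// Hypothetical helper (not part of LLamaSharp) showing one way the per-sequence
// state methods in SafeLLamaContextHandle could be used to copy the KV cache of
// a single sequence out of a context and restore it later.
public static unsafe class SequenceStateExample
{
    public static byte[] SaveSequence(SafeLLamaContextHandle ctx, LLamaSeqId seq)
    {
        // Exact number of bytes needed for this sequence's state
        var size = ctx.GetStateSize(seq);
        var buffer = new byte[size];

        fixed (byte* ptr = buffer)
            ctx.GetState(ptr, size, seq);

        return buffer;
    }

    public static void LoadSequence(SafeLLamaContextHandle ctx, LLamaSeqId seq, byte[] state)
    {
        // Restore a previously saved state into the given sequence
        fixed (byte* ptr = state)
            ctx.SetState(ptr, seq);
    }
}
```

The high-level equivalents mentioned in the commit (saving to a `State` object or a memory mapped file) live in `LLamaContext` rather than in this handle.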
using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Text;
using LLama.Exceptions;

namespace LLama.Native
{
    /// <summary>
    /// A safe wrapper around a llama_context
    /// </summary>
    // ReSharper disable once ClassNeverInstantiated.Global (used implicitly in native API)
    public sealed class SafeLLamaContextHandle
        : SafeLLamaHandleBase
    {
        #region properties and fields
        /// <summary>
        /// Total number of tokens in vocabulary of this model
        /// </summary>
        public int VocabCount => ThrowIfDisposed().VocabCount;

        /// <summary>
        /// Get the vocabulary type used by the model this context is using
        /// </summary>
        public LLamaVocabType LLamaVocabType => ThrowIfDisposed().VocabType;

        /// <summary>
        /// Total number of tokens in the context
        /// </summary>
        public uint ContextSize => llama_n_ctx(this);

        /// <summary>
        /// Dimension of embedding vectors
        /// </summary>
        public int EmbeddingSize => ThrowIfDisposed().EmbeddingSize;

        /// <summary>
        /// Get the maximum batch size for this context
        /// </summary>
        public uint BatchSize => llama_n_batch(this);

        /// <summary>
        /// Get the physical maximum batch size for this context
        /// </summary>
        public uint UBatchSize => llama_n_ubatch(this);

        /// <summary>
        /// Get the model which this context is using
        /// </summary>
        public SafeLlamaModelHandle ModelHandle => ThrowIfDisposed();

        private SafeLlamaModelHandle? _model;
        #endregion

        #region construction/destruction
        /// <inheritdoc />
        protected override bool ReleaseHandle()
        {
            llama_free(handle);
            SetHandle(IntPtr.Zero);

            // Decrement refcount on model
            _model?.DangerousRelease();
            _model = null!;

            return true;
        }

        private SafeLlamaModelHandle ThrowIfDisposed()
        {
            if (IsClosed)
                throw new ObjectDisposedException("Cannot use this `SafeLLamaContextHandle` - it has been disposed");
            if (_model == null || _model.IsClosed)
                throw new ObjectDisposedException("Cannot use this `SafeLLamaContextHandle` - `SafeLlamaModelHandle` has been disposed");

            return _model!;
        }
        /// <summary>
        /// Create a new llama_context for the given model
        /// </summary>
        /// <param name="model"></param>
        /// <param name="lparams"></param>
        /// <returns></returns>
        /// <exception cref="RuntimeError"></exception>
        public static SafeLLamaContextHandle Create(SafeLlamaModelHandle model, LLamaContextParams lparams)
        {
            var ctx = llama_new_context_with_model(model, lparams);
            if (ctx == null)
                throw new RuntimeError("Failed to create context from model");

            // Increment the model reference count while this context exists.
            // DangerousAddRef throws if it fails, so there is no need to check "success"
            ctx._model = model;
            var success = false;
            ctx._model.DangerousAddRef(ref success);

            return ctx;
        }
        #endregion

        #region Native API
        static SafeLLamaContextHandle()
        {
            // This ensures that `NativeApi` has been loaded before calling the two native methods below
            NativeApi.llama_empty_call();
        }
        /// <summary>
        /// Create a new llama_context with the given model. **This should never be called directly! Always use SafeLLamaContextHandle.Create!**
        /// </summary>
        /// <param name="model"></param>
        /// <param name="params"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern SafeLLamaContextHandle llama_new_context_with_model(SafeLlamaModelHandle model, LLamaContextParams @params);

        /// <summary>
        /// Frees all allocated memory in the given llama_context
        /// </summary>
        /// <param name="ctx"></param>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern void llama_free(IntPtr ctx);

        /// <summary>
        /// Set a callback which can abort computation
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="abort_callback"></param>
        /// <param name="abort_callback_data"></param>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe void llama_set_abort_callback(SafeLLamaContextHandle ctx, GgmlAbortCallback abort_callback, void* abort_callback_data);

        /// <summary>
        /// If this returns true computation is cancelled
        /// </summary>
        /// <param name="data"></param>
        /// <returns></returns>
        private unsafe delegate bool GgmlAbortCallback(void* data);

        /// <summary>
        /// Process a batch of tokens with the model
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="batch"></param>
        /// <returns>A positive return value does not indicate a fatal error, but rather a warning:<br />
        /// - 0: success<br />
        /// - 1: could not find a KV slot for the batch (try reducing the size of the batch or increasing the context)<br />
        /// - &lt; 0: error<br />
        /// </returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern int llama_decode(SafeLLamaContextHandle ctx, LLamaNativeBatch batch);

        /// <summary>
        /// Set the number of threads used for decoding
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="n_threads">n_threads is the number of threads used for generation (single token)</param>
        /// <param name="n_threads_batch">n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)</param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern void llama_set_n_threads(SafeLLamaContextHandle ctx, uint n_threads, uint n_threads_batch);
        /// <summary>
        /// Token logits obtained from the last call to llama_decode.
        /// The logits for the last token are stored in the last row.
        /// Can be mutated in order to change the probabilities of the next token.<br />
        /// Rows: n_tokens<br />
        /// Cols: n_vocab
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe float* llama_get_logits(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="i"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe float* llama_get_logits_ith(SafeLLamaContextHandle ctx, int i);

        /// <summary>
        /// Get the size of the context window for the model for this context
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern uint llama_n_ctx(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Get the batch size for this context
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern uint llama_n_batch(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Get the ubatch size for this context
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern uint llama_n_ubatch(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Sets the current rng seed.
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="seed"></param>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern void llama_set_rng_seed(SafeLLamaContextHandle ctx, uint seed);

        /// <summary>
        /// Returns the maximum size in bytes of the state (rng, logits, embedding
        /// and kv_cache) - will often be smaller after compacting tokens
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern ulong llama_state_get_size(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Copies the state to the specified destination address.
        /// Destination needs to have allocated enough memory.
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="dest"></param>
        /// <returns>the number of bytes copied</returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe ulong llama_state_get_data(SafeLLamaContextHandle ctx, byte* dest);

        /// <summary>
        /// Set the state reading from the specified address
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="src"></param>
        /// <returns>the number of bytes read</returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe ulong llama_state_set_data(SafeLLamaContextHandle ctx, byte* src);

        /// <summary>
        /// Get the exact size needed to copy the KV cache of a single sequence
        /// </summary>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern nuint llama_state_seq_get_size(SafeLLamaContextHandle ctx, LLamaSeqId seq_id);

        /// <summary>
        /// Copy the KV cache of a single sequence into the specified buffer
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="dst"></param>
        /// <param name="seq_id"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe nuint llama_state_seq_get_data(SafeLLamaContextHandle ctx, byte* dst, LLamaSeqId seq_id);

        /// <summary>
        /// Copy the sequence data (originally copied with `llama_state_seq_get_data`) into the specified sequence
        /// </summary>
        /// <param name="ctx"></param>
        /// <param name="src"></param>
        /// <param name="dest_seq_id"></param>
        /// <returns>
        /// - Positive: Ok
        /// - Zero: Failed to load
        /// </returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern unsafe nuint llama_state_seq_set_data(SafeLLamaContextHandle ctx, byte* src, LLamaSeqId dest_seq_id);

        /// <summary>
        /// Defragment the KV cache. This will be applied:
        /// - lazily on next llama_decode()
        /// - explicitly with llama_kv_cache_update()
        /// </summary>
        /// <param name="ctx"></param>
        /// <returns></returns>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        private static extern void llama_kv_cache_defrag(SafeLLamaContextHandle ctx);

        /// <summary>
        /// Apply the KV cache updates (such as K-shifts, defragmentation, etc.)
        /// </summary>
        /// <param name="ctx"></param>
        [DllImport(NativeApi.libraryName, CallingConvention = CallingConvention.Cdecl)]
        public static extern void llama_kv_cache_update(SafeLLamaContextHandle ctx);
        #endregion
        /// <summary>
        /// Token logits obtained from the last call to llama_decode.
        /// The logits for the last token are stored in the last row.
        /// Can be mutated in order to change the probabilities of the next token.<br />
        /// Rows: n_tokens<br />
        /// Cols: n_vocab
        /// </summary>
        /// <returns></returns>
        public Span<float> GetLogits()
        {
            var model = ThrowIfDisposed();

            unsafe
            {
                var logits = llama_get_logits(this);
                return new Span<float>(logits, model.VocabCount);
            }
        }

        /// <summary>
        /// Logits for the ith token. Equivalent to: llama_get_logits(ctx) + i*n_vocab
        /// </summary>
        /// <param name="i"></param>
        /// <returns></returns>
        public Span<float> GetLogitsIth(int i)
        {
            var model = ThrowIfDisposed();

            unsafe
            {
                var logits = llama_get_logits_ith(this, i);
                return new Span<float>(logits, model.VocabCount);
            }
        }

        #region tokens
        /// <summary>
        /// Convert the given text into tokens
        /// </summary>
        /// <param name="text">The text to tokenize</param>
        /// <param name="add_bos">Whether the "BOS" token should be added</param>
        /// <param name="encoding">Encoding to use for the text</param>
        /// <param name="special">Allow tokenizing special and/or control tokens which otherwise are not exposed and treated as plaintext.</param>
        /// <returns></returns>
        /// <exception cref="RuntimeError"></exception>
        public LLamaToken[] Tokenize(string text, bool add_bos, bool special, Encoding encoding)
        {
            return ThrowIfDisposed().Tokenize(text, add_bos, special, encoding);
        }

        /// <summary>
        /// Convert a single llama token into bytes
        /// </summary>
        /// <param name="token">Token to decode</param>
        /// <param name="dest">A span to attempt to write into. If this is too small nothing will be written</param>
        /// <returns>The size of this token. **nothing will be written** if this is larger than `dest`</returns>
        public uint TokenToSpan(LLamaToken token, Span<byte> dest)
        {
            return ThrowIfDisposed().TokenToSpan(token, dest);
        }
        #endregion
        #region infer
        /// <summary>
        /// This object exists to ensure there is only ever 1 inference running at a time. This is a workaround for thread safety issues in llama.cpp itself.
        /// Most notably CUDA, which seems to use some global singleton resources and will crash if multiple inferences are run (even against different models).
        ///
        /// For more information see these issues:
        ///  - https://github.com/SciSharp/LLamaSharp/issues/596
        ///  - https://github.com/ggerganov/llama.cpp/issues/3960
        ///
        /// If these are ever resolved this lock can probably be removed.
        /// </summary>
        private static readonly object GlobalInferenceLock = new();

        /// <summary>
        /// Process a batch of tokens with the model
        /// </summary>
        /// <param name="batch"></param>
        /// <returns>A positive return value does not indicate a fatal error, but rather a warning:<br />
        /// - 0: success<br />
        /// - 1: could not find a KV slot for the batch (try reducing the size of the batch or increasing the context)<br />
        /// - &lt; 0: error<br />
        /// </returns>
        public DecodeResult Decode(LLamaBatch batch)
        {
            if (batch.TokenCount == 0)
                return DecodeResult.Ok;

            lock (GlobalInferenceLock)
                using (batch.ToNativeBatch(out var nb))
                    return (DecodeResult)llama_decode(this, nb);
        }

        /// <summary>
        /// Decode a set of tokens in batch-size chunks.
        /// </summary>
        /// <param name="tokens"></param>
        /// <param name="id"></param>
        /// <param name="batch"></param>
        /// <param name="n_past"></param>
        /// <returns>A tuple, containing the decode result and the number of tokens that have <b>not</b> been decoded yet.</returns>
        internal (DecodeResult, int) Decode(List<LLamaToken> tokens, LLamaSeqId id, LLamaBatch batch, ref int n_past)
        {
            if (tokens.Count == 0)
                return (DecodeResult.Ok, 0);

            var batchSize = checked((int)BatchSize);

            // Evaluate the prompt, in chunks smaller than the max batch size
            var n_left = tokens.Count;
            for (var i = 0; i < tokens.Count; i += batchSize)
            {
                var n_eval = tokens.Count - i;
                if (n_eval > batchSize)
                    n_eval = batchSize;

                batch.Clear();

                // Request logits only for the very last token of the input
                for (var j = 0; j < n_eval; j++)
                    batch.Add(tokens[i + j], n_past++, id, (i + j) == tokens.Count - 1);

                var returnCode = Decode(batch);
                if (returnCode != DecodeResult.Ok)
                    return (returnCode, n_left);

                n_left -= n_eval;
            }

            return (DecodeResult.Ok, 0);
        }
        #endregion
        #region state
        /// <summary>
        /// Get the size of the state, when saved as bytes
        /// </summary>
        public ulong GetStateSize()
        {
            return llama_state_get_size(this);
        }

        /// <summary>
        /// Get the size of the KV cache for a single sequence ID, when saved as bytes
        /// </summary>
        /// <param name="sequence"></param>
        /// <returns></returns>
        public ulong GetStateSize(LLamaSeqId sequence)
        {
            return llama_state_seq_get_size(this, sequence);
        }

        /// <summary>
        /// Get the raw state of this context, encoded as bytes. Data is written into the `dest` pointer.
        /// </summary>
        /// <param name="dest">Destination to write to</param>
        /// <param name="size">Number of bytes available to write to in dest (check required size with `GetStateSize()`)</param>
        /// <returns>The number of bytes written to dest</returns>
        /// <exception cref="ArgumentOutOfRangeException">Thrown if dest is too small</exception>
        public unsafe ulong GetState(byte* dest, ulong size)
        {
            var required = GetStateSize();
            if (size < required)
                throw new ArgumentOutOfRangeException(nameof(size), $"Allocated space is too small, {size} < {required}");

            unsafe
            {
                return llama_state_get_data(this, dest);
            }
        }

        /// <summary>
        /// Get the raw state of a single sequence from this context, encoded as bytes. Data is written into the `dest` pointer.
        /// </summary>
        /// <param name="dest">Destination to write to</param>
        /// <param name="size">Number of bytes available to write to in dest (check required size with `GetStateSize(LLamaSeqId)`)</param>
        /// <param name="sequence">The sequence to get state data for</param>
        /// <returns>The number of bytes written to dest</returns>
        public unsafe ulong GetState(byte* dest, ulong size, LLamaSeqId sequence)
        {
            var required = GetStateSize(sequence);
            if (size < required)
                throw new ArgumentOutOfRangeException(nameof(size), $"Allocated space is too small, {size} < {required}");

            return llama_state_seq_get_data(this, dest, sequence);
        }

        /// <summary>
        /// Set the raw state of this context
        /// </summary>
        /// <param name="src">The pointer to read the state from</param>
        /// <returns>Number of bytes read from the src pointer</returns>
        public unsafe ulong SetState(byte* src)
        {
            return llama_state_set_data(this, src);
        }

        /// <summary>
        /// Set the raw state of a single sequence
        /// </summary>
        /// <param name="src">The pointer to read the state from</param>
        /// <param name="sequence">Sequence ID to set</param>
        /// <returns>Number of bytes read from the src pointer</returns>
        public unsafe ulong SetState(byte* src, LLamaSeqId sequence)
        {
            return llama_state_seq_set_data(this, src, sequence);
        }
        #endregion
        /// <summary>
        /// Set the RNG seed
        /// </summary>
        /// <param name="seed"></param>
        public void SetSeed(uint seed)
        {
            llama_set_rng_seed(this, seed);
        }

        /// <summary>
        /// Set the number of threads used for decoding
        /// </summary>
        /// <param name="threads">n_threads is the number of threads used for generation (single token)</param>
        /// <param name="threadsBatch">n_threads_batch is the number of threads used for prompt and batch processing (multiple tokens)</param>
        public void SetThreads(uint threads, uint threadsBatch)
        {
            llama_set_n_threads(this, threads, threadsBatch);
        }

        #region KV Cache Management
        /// <summary>
        /// Apply KV cache updates (such as K-shifts, defragmentation, etc.)
        /// </summary>
        public void KvCacheUpdate()
        {
            llama_kv_cache_update(this);
        }

        /// <summary>
        /// Defragment the KV cache. This will be applied:
        /// - lazily on next llama_decode()
        /// - explicitly with llama_kv_cache_update()
        /// </summary>
        /// <returns></returns>
        public void KvCacheDefrag()
        {
            llama_kv_cache_defrag(this);
        }

        /// <summary>
        /// Get a new KV cache view that can be used to debug the KV cache
        /// </summary>
        /// <param name="maxSequences"></param>
        /// <returns></returns>
        public LLamaKvCacheViewSafeHandle KvCacheGetDebugView(int maxSequences = 4)
        {
            return LLamaKvCacheViewSafeHandle.Allocate(this, maxSequences);
        }

        /// <summary>
        /// Count the number of used cells in the KV cache (i.e. cells that have at least one sequence assigned to them)
        /// </summary>
        /// <returns></returns>
        public int KvCacheCountCells()
        {
            return NativeApi.llama_get_kv_cache_used_cells(this);
        }

        /// <summary>
        /// Returns the number of tokens in the KV cache (slow, use only for debug).
        /// If a KV cell has multiple sequences assigned to it, it will be counted multiple times.
        /// </summary>
        /// <returns></returns>
        public int KvCacheCountTokens()
        {
            return NativeApi.llama_get_kv_cache_token_count(this);
        }

        /// <summary>
        /// Clear the KV cache
        /// </summary>
        public void KvCacheClear()
        {
            NativeApi.llama_kv_cache_clear(this);
        }

        /// <summary>
        /// Removes all tokens that belong to the specified sequence and have positions in [p0, p1)
        /// </summary>
        /// <param name="seq"></param>
        /// <param name="p0"></param>
        /// <param name="p1"></param>
        public void KvCacheRemove(LLamaSeqId seq, LLamaPos p0, LLamaPos p1)
        {
            NativeApi.llama_kv_cache_seq_rm(this, seq, p0, p1);
        }

        /// <summary>
        /// Copy all tokens that belong to the specified sequence to another sequence. Note that
        /// this does not allocate extra KV cache memory - it simply assigns the tokens to the
        /// new sequence.
        /// </summary>
        /// <param name="src"></param>
        /// <param name="dest"></param>
        /// <param name="p0"></param>
        /// <param name="p1"></param>
        public void KvCacheSequenceCopy(LLamaSeqId src, LLamaSeqId dest, LLamaPos p0, LLamaPos p1)
        {
            NativeApi.llama_kv_cache_seq_cp(this, src, dest, p0, p1);
        }

        /// <summary>
        /// Removes all tokens that do not belong to the specified sequence
        /// </summary>
        /// <param name="seq"></param>
        public void KvCacheSequenceKeep(LLamaSeqId seq)
        {
            NativeApi.llama_kv_cache_seq_keep(this, seq);
        }

        /// <summary>
        /// Adds relative position "delta" to all tokens that belong to the specified sequence
        /// and have positions in [p0, p1). If the KV cache is RoPEd, the KV data is updated
        /// accordingly.
        /// </summary>
        /// <param name="seq"></param>
        /// <param name="p0"></param>
        /// <param name="p1"></param>
        /// <param name="delta"></param>
        public void KvCacheSequenceAdd(LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int delta)
        {
            NativeApi.llama_kv_cache_seq_add(this, seq, p0, p1, delta);
        }

        /// <summary>
        /// Integer division of the positions by factor of `d > 1`.
        /// If the KV cache is RoPEd, the KV data is updated accordingly.<br />
        /// p0 &lt; 0 : [0, p1]<br />
        /// p1 &lt; 0 : [p0, inf)
        /// </summary>
        /// <param name="seq"></param>
        /// <param name="p0"></param>
        /// <param name="p1"></param>
        /// <param name="divisor"></param>
        public void KvCacheSequenceDivide(LLamaSeqId seq, LLamaPos p0, LLamaPos p1, int divisor)
        {
            NativeApi.llama_kv_cache_seq_div(this, seq, p0, p1, divisor);
        }
        #endregion
    }
}
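As a usage note, one common pattern built on the KV cache helpers above is "context shifting": when a sequence fills the context window, the oldest tokens are removed and the remaining cache entries are shifted back so decoding can continue. The sketch below is an assumed usage example, not code from this repository, and it assumes `LLamaPos` is implicitly convertible from `int`.

```csharp
// Hypothetical context-shifting helper built on SafeLLamaContextHandle's KV cache methods.
// Drops `discard` tokens that follow an initial `keepFirst` prefix, then shifts the
// remaining tokens back so positions are contiguous again.
static void ShiftContext(SafeLLamaContextHandle ctx, LLamaSeqId seq, int keepFirst, int discard, ref int n_past)
{
    // Remove tokens with positions in [keepFirst, keepFirst + discard)
    ctx.KvCacheRemove(seq, keepFirst, keepFirst + discard);

    // Move every later token back by `discard` positions (KV data is updated if the cache is RoPEd)
    ctx.KvCacheSequenceAdd(seq, keepFirst + discard, n_past, -discard);

    n_past -= discard;
}
```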