
LLamaQuantizer.cs 6.9 kB

April 2024 Binary Update (#662)

* Updated binaries, using [this build](https://github.com/SciSharp/LLamaSharp/actions/runs/8654672719/job/23733195669) for llama.cpp commit `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
  - Added all new functions.
  - Moved some functions (e.g. `SafeLlamaModelHandle` specific functions) into `SafeLlamaModelHandle.cs`.
  - Exposed tokens on `SafeLlamaModelHandle` and `LLamaWeights` through a `Tokens` property. As new special tokens are added in the future they can be added here.
  - Changed all token properties to return nullable tokens, to handle some models not having some tokens.
  - Fixed `DefaultSamplingPipeline` to handle a missing newline token in some models.
* Moved native methods to more specific locations.
  - Context-specific things have been moved into `SafeLLamaContextHandle.cs` and made private; they're already exposed through C# properties and methods.
  - Checking that GPU layer count is zero if GPU offload is not supported.
  - Moved methods for creating default structs (`llama_model_quantize_default_params` and `llama_context_default_params`) into the relevant structs.
* Removed the exception when `GpuLayerCount > 0` and GPU is not supported.
* Added low-level wrapper methods for the new per-sequence state load/save in `SafeLLamaContextHandle`.
  - Added high-level wrapper methods (save/load with a `State` object or memory-mapped file) in `LLamaContext`.
  - Moved native methods for per-sequence state load/save into `SafeLLamaContextHandle`.
* Added update and defrag methods for the KV cache in `SafeLLamaContextHandle`.
* Updated the submodule to `f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7`.
* Passing the sequence ID when saving a single sequence state.
1 year ago
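The commit above mentions the new `Tokens` property and the switch to nullable token accessors. As a minimal sketch of what that looks like in practice (the model path is a placeholder, and the `Newline` property name is an assumption beyond what the commit text itself confirms):

```csharp
using System;
using LLama;
using LLama.Common;

// Load model weights ("model.gguf" is a placeholder path).
var parameters = new ModelParams("model.gguf");
using var weights = LLamaWeights.LoadFromFile(parameters);

// Token properties are nullable because some models lack certain special
// tokens. `Newline` is an assumed property name on the Tokens object.
var newline = weights.Tokens.Newline;
if (newline is null)
    Console.WriteLine("This model does not define a newline token.");
```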
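The whole-context state save/load described in the commit can be sketched as below. This assumes the high-level `SaveState`/`LoadState` methods on `LLamaContext`; the exact shape of the new per-sequence overloads may differ from what is shown in the comments.

```csharp
using LLama;
using LLama.Common;

var parameters = new ModelParams("model.gguf"); // placeholder path
using var model = LLamaWeights.LoadFromFile(parameters);
using var context = model.CreateContext(parameters);

// ... run some inference so the context holds KV cache state ...

// Save the whole context state to a file and restore it later.
// (Per the commit, per-sequence variants taking a sequence ID also exist.)
context.SaveState("context.state");
context.LoadState("context.state");
```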
using LLama.Native;
using System;
using System.Collections.Generic;

namespace LLama
{
    /// <summary>
    /// Provides methods to quantize a model.
    /// </summary>
    public static class LLamaQuantizer
    {
        /// <summary>
        /// Quantize the model.
        /// </summary>
        /// <param name="srcFileName">The model file to be quantized.</param>
        /// <param name="dstFilename">The path to save the quantized model.</param>
        /// <param name="ftype">The type of quantization.</param>
        /// <param name="nthread">Threads to be used during quantization. Defaults to the number of physical cores.</param>
        /// <param name="allowRequantize">Whether to allow requantizing tensors that are already quantized.</param>
        /// <param name="quantizeOutputTensor">Whether to quantize the output tensor.</param>
        /// <returns>Whether the quantization succeeded.</returns>
        /// <exception cref="ArgumentException"></exception>
        public static bool Quantize(
            string srcFileName, string dstFilename, LLamaFtype ftype, int nthread = -1, bool allowRequantize = true, bool quantizeOutputTensor = false)
        {
            if (!ValidateFtype(ftype))
            {
                throw new ArgumentException($"The type {Enum.GetName(typeof(LLamaFtype), ftype)} is not a valid type " +
                                            $"to perform quantization.");
            }

            var quantizeParams = LLamaModelQuantizeParams.Default();
            quantizeParams.ftype = ftype;
            quantizeParams.nthread = nthread;
            quantizeParams.allow_requantize = allowRequantize;
            quantizeParams.quantize_output_tensor = quantizeOutputTensor;

            // todo: fill in other quantize params fields.
            // This method could probably do with a redesign - passing in a config object (maybe directly
            // expose `LLamaModelQuantizeParams`) instead of an ever growing list of method parameters!

            return NativeApi.llama_model_quantize(srcFileName, dstFilename, ref quantizeParams) == 0;
        }

        /// <summary>
        /// Quantize the model.
        /// </summary>
        /// <param name="srcFileName">The model file to be quantized.</param>
        /// <param name="dstFilename">The path to save the quantized model.</param>
        /// <param name="ftype">The type of quantization, as a string (see <see cref="StringToFtype"/>).</param>
        /// <param name="nthread">Threads to be used during quantization. Defaults to the number of physical cores.</param>
        /// <param name="allowRequantize">Whether to allow requantizing tensors that are already quantized.</param>
        /// <param name="quantizeOutputTensor">Whether to quantize the output tensor.</param>
        /// <returns>Whether the quantization succeeded.</returns>
        /// <exception cref="ArgumentException"></exception>
        public static bool Quantize(string srcFileName, string dstFilename, string ftype, int nthread = -1, bool allowRequantize = true,
                                    bool quantizeOutputTensor = false)
        {
            return Quantize(srcFileName, dstFilename, StringToFtype(ftype), nthread, allowRequantize, quantizeOutputTensor);
        }

        private static bool ValidateFtype(LLamaFtype ftype)
        {
            // Validation copied from here:
            // https://github.com/ggerganov/llama.cpp/blob/f7001ccc5aa359fcf41bba19d1c99c3d25c9bcc7/llama.cpp#L13450

            switch (ftype)
            {
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_0:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_1:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q5_0:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q5_1:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q8_0:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_F16:
                case LLamaFtype.LLAMA_FTYPE_ALL_F32:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q2_K_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q2_K:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ3_K_XS:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q3_K_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q3_K_M:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q3_K_L:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_K_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_K_M:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q5_K_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q5_K_M:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q6_K:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ2_XXS:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ2_XS:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ2_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ2_M:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ3_XXS:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ1_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ1_M:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ4_NL:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ4_XS:

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ3_S:
                case LLamaFtype.LLAMA_FTYPE_MOSTLY_IQ3_M:
                    return true;

                case LLamaFtype.LLAMA_FTYPE_MOSTLY_Q4_1_SOME_F16:
                case LLamaFtype.LLAMA_FTYPE_GUESSED:
                default:
                    return false;
            }
        }

        /// <summary>
        /// Parse a string into a LLamaFtype. This is a "relaxed" parsing, which allows any string which is contained within
        /// the enum name to be used.
        ///
        /// For example "Q5_K_M" will convert to "LLAMA_FTYPE_MOSTLY_Q5_K_M"
        /// </summary>
        /// <param name="str">The string to parse.</param>
        /// <returns>The single enum value whose name contains <paramref name="str"/>.</returns>
        /// <exception cref="ArgumentException">Thrown if the string matches zero or multiple ftypes.</exception>
        private static LLamaFtype StringToFtype(string str)
        {
            // Find all variants which contain the input string
            var matches = new List<LLamaFtype>();
            foreach (LLamaFtype ftype in Enum.GetValues(typeof(LLamaFtype)))
            {
                var name = Enum.GetName(typeof(LLamaFtype), ftype);

                // Note: this is using "IndexOf" instead of "Contains" to be compatible with netstandard2.0
#pragma warning disable CA2249
                if (name != null && name.IndexOf(str, StringComparison.OrdinalIgnoreCase) >= 0)
                    matches.Add(ftype);
#pragma warning restore CA2249
            }

            // If there was just one match, success!
            if (matches.Count == 1)
                return matches[0];

            // If none matched throw a generic error
            if (matches.Count == 0)
                throw new ArgumentException($"Unknown ftype \"{str}\" for quantization.");

            // There were several matches, throw an error asking the user to be more specific
            throw new ArgumentException($"\"{str}\" matches multiple potential ftypes: {string.Join(",", matches)}");
        }
    }
}
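For reference, a minimal usage sketch of the class above. The file names are placeholders; both overloads shown are taken directly from the source:

```csharp
using LLama;
using LLama.Native;

// Enum overload: quantize an f16 GGUF file down to Q5_K_M
// ("model-f16.gguf" and the output path are placeholder names).
bool ok = LLamaQuantizer.Quantize(
    "model-f16.gguf",
    "model-q5_k_m.gguf",
    LLamaFtype.LLAMA_FTYPE_MOSTLY_Q5_K_M);

// String overload: the relaxed parsing in StringToFtype resolves
// "Q5_K_M" to LLAMA_FTYPE_MOSTLY_Q5_K_M.
bool ok2 = LLamaQuantizer.Quantize("model-f16.gguf", "model-q5_k_m.gguf", "Q5_K_M");
```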