vulkan: lock accesses of pinned_memory vector by jeffbolznv · Pull Request #14333 · ggml-org/llama.cpp (original) (raw)

@jeffbolznv

@github-actions github-actions Bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels on Jun 22, 2025

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request

Jun 30, 2025

@gabe-l-hart

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 2, 2025

@jeffbolznv @qnixsynapse

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 2, 2025

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp


Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml-ci

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com


Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci


Enable uniform linking with subproject and with find_package.

ggml-ci


It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.

The problem was apparently caused by how the tail cells were swapped.

The state_copy shuffle assumes everything is moved at once, which is not true when states_extra is copied back to the cache before copying the range of states between head and head + n_seqs. This is only a problem if any of the cells in [head, head + n_seqs) have an src in [head + n_seqs, head + n_kv), which does happen when n_ubatch > 1 in the llama-parallel example.

Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
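The ordering hazard described above can be illustrated with a toy model. This is only a sketch under stated assumptions: each state is a single int, `src[i]` names the cell whose old value cell `i` should receive, and the function names are hypothetical rather than taken from llama.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference behaviour: gather every cell at once from an untouched snapshot.
std::vector<int> shuffle_at_once(const std::vector<int> & cache,
                                 const std::vector<size_t> & src) {
    std::vector<int> out(cache.size());
    for (size_t i = 0; i < cache.size(); ++i) out[i] = cache[src[i]];
    return out;
}

// Two-phase variant. Cells [0, n_seqs) are the "main" range and cells
// [n_seqs, n) are the "extra" range. When extra_first is true, the extra
// range is written back before the main range has been read.
std::vector<int> shuffle_two_phase(std::vector<int> cache,
                                   const std::vector<size_t> & src,
                                   size_t n_seqs, bool extra_first) {
    const size_t n = cache.size();

    // The extra states are always gathered from the untouched cache.
    std::vector<int> extra;
    for (size_t i = n_seqs; i < n; ++i) extra.push_back(cache[src[i]]);

    if (extra_first) {
        // Hazardous order: overwrite cells the main range may still need.
        for (size_t i = n_seqs; i < n; ++i) cache[i] = extra[i - n_seqs];
    }

    // Gather the main range (reads stale data if the write above happened).
    std::vector<int> main_states(n_seqs);
    for (size_t i = 0; i < n_seqs; ++i) main_states[i] = cache[src[i]];

    if (!extra_first) {
        // Safe order: write the extra range only after the main range was read.
        for (size_t i = n_seqs; i < n; ++i) cache[i] = extra[i - n_seqs];
    }

    for (size_t i = 0; i < n_seqs; ++i) cache[i] = main_states[i];
    return cache;
}
```

With a main cell whose source lies in the extra range, only the "main range first" order matches the all-at-once result.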

This naming should reduce confusion between the state size and the number of states.

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

ggml-ci

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_ which need to be set in cmake, instead of GGML_ that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.

As with x86, however, we use GGML_INTERNAL_ to pass arch flags around within cmake, as we don't have GGML_.

Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.
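A rough sketch of how a scoring function could pick among several builds of the same arch version. The feature bits, the struct, and both function names are assumptions made for illustration; they are not the actual ggml scoring code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical feature bits; a real backend would query the CPU at runtime.
enum : uint32_t { FEAT_DOTPROD = 1u << 0, FEAT_FP16 = 1u << 1, FEAT_SVE = 1u << 2 };

struct variant {
    std::string name;     // e.g. "armv8.2_1"
    uint32_t    required; // features this build was compiled to use
};

// Count set bits without relying on compiler builtins.
static int popcount32(uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Score 0 means "cannot run on this CPU"; otherwise a build that uses more
// features scores higher.
int score_variant(const variant & v, uint32_t cpu_features) {
    if ((v.required & ~cpu_features) != 0) return 0;
    return 1 + popcount32(v.required);
}

// Returns the index of the best usable variant, or -1 if none can run.
int pick_variant(const std::vector<variant> & variants, uint32_t cpu_features) {
    int best = -1;
    int best_score = 0;
    for (size_t i = 0; i < variants.size(); ++i) {
        const int s = score_variant(variants[i], cpu_features);
        if (s > best_score) { best_score = s; best = static_cast<int>(i); }
    }
    return best;
}
```

On a CPU that reports only the first two features, the second build would win in this sketch, because the third requires a feature that is missing.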

The other platforms will need their own specific variants.

This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.

This fixes RWKV inference which otherwise failed when the worst case ubatch.n_seq_tokens rounded to 0.

ggml-ci



Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com

ggml-ci


The rebuild of build-info.cpp still gets triggered when .git/index gets changed.

Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669 which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the MUL_MAT tests now pass on DPC++ CUDA backends with SYCL-Graph enabled. Prior to this change, an error would be thrown.

$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

ggml-ci

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

ggml-ci



Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

Currently, when a model generates output that looks like a tool call but is invalid, an exception is thrown and not handled, causing the cli or llama-server to bail. Instead, handle the chat parser exception and simply return the generated text in such cases.
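The fallback can be sketched as a try/catch around a strict parser. Everything below is hypothetical: the `chat_msg` struct, the `<tool_call>` tag convention, and both function names are stand-ins chosen for the example, not the llama.cpp chat parser.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical parsed message: the text plus whether it was a valid tool call.
struct chat_msg {
    std::string content;
    bool        is_tool_call;
};

// Hypothetical strict parser: throws when the text starts like a tool call
// but is never closed.
chat_msg parse_strict(const std::string & text) {
    const std::string open_tag = "<tool_call>";
    if (text.compare(0, open_tag.size(), open_tag) != 0) {
        return {text, false}; // plain text, nothing to parse
    }
    if (text.find("</tool_call>") == std::string::npos) {
        throw std::runtime_error("invalid tool call");
    }
    return {text, true};
}

// Wrapper that never throws on malformed tool calls: on a parse failure the
// raw generated text is returned as an ordinary message.
chat_msg parse_or_plain_text(const std::string & text) {
    try {
        return parse_strict(text);
    } catch (const std::exception &) {
        return {text, false};
    }
}
```

The caller then always receives a message, so a malformed tool call degrades to plain text instead of terminating the process.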

Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com

ggml-ci


Adds:


The architecture is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally).

The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

The model architecture is a combination of Qwen and Deepseek parts, as seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

ggml-ci

Instead show something like this:

main: server is listening on file.sock - starting the main loop

Signed-off-by: Eric Curtin ecurtin@redhat.com

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com


Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

This fixes the remaining crash in test-thread-safety on my system.
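The PR title describes the fix as locking accesses of the pinned_memory vector. A minimal sketch of that pattern, assuming a device-like struct that owns both the vector and a mutex; the struct, its members, and the helper names are hypothetical and not the Vulkan backend's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a device object that tracks pinned host allocations.
struct device_sketch {
    std::mutex mutex;                                      // guards pinned_memory
    std::vector<std::tuple<void *, size_t>> pinned_memory; // (base pointer, size)

    // Writers take the lock so concurrent push_backs cannot corrupt the vector.
    void add_pinned(void * ptr, size_t size) {
        std::lock_guard<std::mutex> guard(mutex);
        pinned_memory.emplace_back(ptr, size);
    }

    // Readers take the same lock, since a concurrent push_back may reallocate
    // the storage that this loop is iterating over.
    bool is_pinned(const void * ptr) {
        std::lock_guard<std::mutex> guard(mutex);
        const uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
        for (const auto & entry : pinned_memory) {
            const uintptr_t base = reinterpret_cast<uintptr_t>(std::get<0>(entry));
            if (p >= base && p - base < std::get<1>(entry)) return true;
        }
        return false;
    }

    size_t count() {
        std::lock_guard<std::mutex> guard(mutex);
        return pinned_memory.size();
    }
};
```

The important design point is that reads are locked as well as writes: an unlocked lookup is enough to crash if another thread appends at the same time.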

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com

when main_gpu < 0 GPU devices are not used


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

This commit adds the examples in the "list" of targets to ignore MSVC warnings.

The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.

This commit removes the unused ggml_context_container structure from the ggml library. It looks like the usage of this struct was removed in Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc ggml_contexts on the heap (whisper/2525)").

The motivation for this change is to improve code clarity/readability.

This commit disables warnings for tests on windows when using MSVC.

The motivation for this is that this brings the build output more in line with what Linux/MacOS systems produce.

There is still one warning generated for the tests which is:

  Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line  warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
  test-arange.cpp
  test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe

ggml-ci

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


ggml-ci

ggml-ci

Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).
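The delegation could look roughly like the following. The enum values, the model struct, and the `_sketch` suffixes are assumptions for illustration; only the idea of an arch-level predicate with a model-level wrapper comes from the commit message.

```cpp
#include <cassert>

// Hypothetical subset of architectures.
enum llm_arch_sketch { ARCH_LLAMA, ARCH_MAMBA, ARCH_RWKV6 };

// Hypothetical model type that only records its architecture.
struct llama_model_sketch {
    llm_arch_sketch arch;
};

// Arch-level predicate: usable before any model object exists.
bool llm_arch_is_recurrent_sketch(llm_arch_sketch arch) {
    switch (arch) {
        case ARCH_MAMBA:
        case ARCH_RWKV6:
            return true;
        default:
            return false;
    }
}

// Model-level predicate: simply delegates to the arch-level one.
bool llama_model_is_recurrent_sketch(const llama_model_sketch & model) {
    return llm_arch_is_recurrent_sketch(model.arch);
}
```

Code that only knows the architecture, such as hyperparameter loading, can call the arch-level function directly.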

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches.

This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

NOTE: I intentionally did not add support for s_mask since it will be going away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:

This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada

Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

ggml-ci

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

It just happens to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.
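A small sketch of how such a flag might be wired up. The `--no-warmup` spelling, the struct layout, and the run-counting helper are assumptions for the example rather than the tool's actual code.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical parameter struct with the new field defaulting to "warm up".
struct cmd_params_sketch {
    bool no_warmup = false;
    int  reps      = 5;
};

// Minimal parser: only recognises the assumed "--no-warmup" flag.
cmd_params_sketch parse_args(const std::vector<std::string> & args) {
    cmd_params_sketch params;
    for (const std::string & arg : args) {
        if (arg == "--no-warmup") {
            params.no_warmup = true;
        }
    }
    return params;
}

// Returns how many runs would execute, counting the optional warmup run
// whose result would be discarded.
int count_runs(const cmd_params_sketch & params) {
    int runs = 0;
    if (!params.no_warmup) {
        ++runs; // warmup run
    }
    runs += params.reps; // measured runs
    return runs;
}
```

With the defaults above this gives six runs, and five when the flag is passed.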

Addresses #14224

Addresses unused reorder path

Co-authored-by: aa956 27946957+aa956@users.noreply.github.com

Co-authored-by: compilade git@compilade.net


Co-authored-by: compilade git@compilade.net

Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.

ggml-ci


Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread

When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.


Co-authored-by: Diego Devesa slarengh@gmail.com

Create a general function that enables the enqueue_functions extension if it is enabled in the compiler; otherwise, call the general SYCL function to launch kernels.


Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml-ci


Mistral Small 2506 models using Pixtral vision encoder were running out of GPU memory when processing images larger than 1024x1024 pixels due to exponential memory growth from unlimited image size.

This fix applies the same 1024x1024 limit used by Qwen2VL models to prevent OOM issues while maintaining compatibility with existing models.
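One way to picture a size cap is a helper that scales an image down so neither side exceeds the limit. This is an illustrative assumption: the commit message does not say how the limit is enforced, and the function name and aspect-ratio-preserving behaviour are made up for the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Returns (width, height) scaled down, preserving aspect ratio, so that
// neither side exceeds max_side. Images already within the limit are
// returned unchanged.
std::pair<int, int> clamp_image_size(int width, int height, int max_side = 1024) {
    const int longest = std::max(width, height);
    if (longest <= max_side) {
        return {width, height};
    }
    const double scale = static_cast<double>(max_side) / longest;
    const int w = std::max(1, static_cast<int>(width * scale));
    const int h = std::max(1, static_cast<int>(height * scale));
    return {w, h};
}
```

Because the number of image patches grows with width times height, capping both sides bounds the memory needed per image.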

ggml-ci


Signed-off-by: Molly Sophia mollysophia379@gmail.com

Signed-off-by: Molly Sophia mollysophia379@gmail.com


Signed-off-by: Molly Sophia mollysophia379@gmail.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de

This will allow the use of tools on the llama-server

ggml-ci

ggml-ci

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com


Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-ci


Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
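The semantics stated above can be written as a plain reference loop. This sketch works on flat row-major float vectors and uses a hypothetical function name; it is not the ggml operator itself and ignores tensor types, strides, and batching.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// For each i, copies row i of b into row c[i] of a.
// Both a and b are row-major with n_cols values per row.
void set_rows_ref(std::vector<float> & a, const std::vector<float> & b,
                  const std::vector<int64_t> & c, size_t n_cols) {
    for (size_t i = 0; i < c.size(); ++i) {
        assert(c[i] >= 0);
        const size_t dst_row = static_cast<size_t>(c[i]);
        assert((dst_row + 1) * n_cols <= a.size()); // destination row exists
        assert((i + 1) * n_cols <= b.size());       // source row exists
        for (size_t j = 0; j < n_cols; ++j) {
            a[dst_row * n_cols + j] = b[i * n_cols + j];
        }
    }
}
```

This is the scatter counterpart of a row gather: the indices select destination rows rather than source rows.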

ref: #8366

ggml-ci



Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

This setting needs to be passed through to vulkan-shaders-gen

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

style fixes


Co-authored-by: slaren slarengh@gmail.com

ggml-ci



Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

This commit refactors the SYCL element-wise operations to improve performance by:

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

Modify the docker.yml file so that the workflow stops running periodically; the workflow can still be started manually when needed.

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

ggml-ci


This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-ci

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com


Signed-off-by: noemotiovon 757486878@qq.com

Right now it's not easy to find those.

ggml-ci

ggml-ci


Co-authored-by: Diego Devesa slarengh@gmail.com


Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com Signed-off-by: Eric Curtin ecurtin@redhat.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Molly Sophia mollysophia379@gmail.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: noemotiovon 757486878@qq.com Co-authored-by: Yuanhao Ji jiyuanhao@apache.org Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Kai Pastor dg0yt@darc.de Co-authored-by: Isaac McFadyen isaac@imcf.me Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Juk Armstrong 69222624+jukofyork@users.noreply.github.com Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Taylor quantumtraveling@gmail.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Anton Mitkov anton.mitkov@codeplay.com Co-authored-by: Ewan Crawford ewan@codeplay.com Co-authored-by: ddpasa 112642920+ddpasa@users.noreply.github.com Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Svetlozar Georgiev 55534064+sgeor255@users.noreply.github.com Co-authored-by: Piotr piotr.stankiewicz@docker.com Co-authored-by: Pepijn de Vos me@pepijndevos.nl Co-authored-by: Mikko Juola 
mikjuo@gmail.com Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Eric Curtin ecurtin@redhat.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: bashayer hijji bashayer.hijji@gmail.com Co-authored-by: Anton Mitkov anton_b_mitkov@abv.bg Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-…

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 5, 2025

@jeffbolznv @Minh141120

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 6, 2025

@jeffbolznv @qnixsynapse

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 6, 2025

@jeffbolznv @qnixsynapse

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 8, 2025

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp


Signed-off-by: nscipione nicolo.scipione@codeplay.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com


Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

ggml-ci

It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.

The problem was apparently caused by how the tail cells were swapped.

The state_copy shuffle assumes everything is moved at once, which is not true when states_extra is copied back to the cache before copying the range of states between head and head + n_seqs. This is only a problem if any of the cells in [head, head + n_seqs) have an src in [head + n_seqs, head + n_kv), which does happen when n_ubatch > 1 in the llama-parallel example.

Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.

This naming should reduce confusion between the state size and the number of states.

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

ggml-ci

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_ which need to be set in cmake, instead of GGML_ that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.

Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_ as we don't have GGML_.

Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.

The other platforms will need their own specific variants.

This also fixes the bug that the the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

The rebuild of build-info.cpp still gets triggered when .git/index gets changes.

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Adds:


The model is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally) architecture.

The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

The model architecture is a combination of Qwen and Deepseek parts, as seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

ggml-ci

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com


Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

This fixes the remaining crash in test-thread-safety on my system.
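For readers unfamiliar with the pattern, here is a minimal sketch of guarding a shared vector with a mutex, assuming this line refers to the change named in the PR title (locking accesses of the pinned_memory vector). The names `pinned_entry`, `pinned_registry`, and `register_from_threads` are hypothetical and do not come from the Vulkan backend; the actual change may be structured differently.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical record of one pinned host allocation.
struct pinned_entry {
    const void * ptr;
    size_t       size;
};

// Hypothetical registry: every access to `entries_` takes the same mutex,
// so concurrent contexts cannot observe the vector mid-reallocation.
class pinned_registry {
public:
    void add(const void * ptr, size_t size) {
        std::lock_guard<std::mutex> guard(mutex_);
        entries_.push_back({ptr, size});
    }

    bool contains(const void * ptr) const {
        std::lock_guard<std::mutex> guard(mutex_);
        for (const auto & e : entries_) {
            if (e.ptr == ptr) {
                return true;
            }
        }
        return false;
    }

    size_t count() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return entries_.size();
    }

private:
    mutable std::mutex        mutex_;   // guards entries_
    std::vector<pinned_entry> entries_;
};

// Registers n_per_thread entries from each of n_threads threads and
// returns the number of entries in the registry once all threads joined.
size_t register_from_threads(pinned_registry & reg, int n_threads, int n_per_thread) {
    std::vector<char> storage(static_cast<size_t>(n_threads) * n_per_thread);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = 0; i < n_per_thread; ++i) {
                reg.add(&storage[static_cast<size_t>(t) * n_per_thread + i], 1);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return reg.count();
}
```

Without the lock, concurrent `push_back` calls on the same vector are a data race; with it, no insertion is lost.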


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com

When main_gpu < 0, GPU devices are not used.


Co-authored-by: Georgi Gerganov ggerganov@gmail.com


This commit adds the examples in the "list" of targets to ignore MSVC warnings.

The motivation for this is that the examples currently generate a number of warnings that are ignored/disabled for the core ggml project. This makes for cleaner output when building.

This commit disables warnings for tests on windows when using MSVC.

The motivation for this is that it brings the build output more in line with what Linux/macOS systems produce.

There is still one warning generated for the tests which is:

  Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line  warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
  test-arange.cpp
  test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe

ggml-ci

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch, with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).
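The shape of that split can be sketched as follows. This is an illustration only: `arch_kind`, `model_like`, `arch_is_recurrent`, and `model_is_recurrent` are hypothetical stand-ins, not the real llama.cpp types or signatures, and the enum is a reduced example list.

```cpp
#include <cassert>

// Hypothetical, reduced architecture enum (the real list lives in llama-arch).
enum class arch_kind { transformer, mamba, rwkv };

// Arch-level predicate: callable before any model object has been initialized,
// for example while filling per-layer hyperparameters.
inline bool arch_is_recurrent(arch_kind arch) {
    switch (arch) {
        case arch_kind::mamba:
        case arch_kind::rwkv:
            return true;
        default:
            return false;
    }
}

// Hypothetical model object that only carries its architecture.
struct model_like {
    arch_kind arch;
};

// Model-level predicate simply delegates, so the logic lives in one place.
inline bool model_is_recurrent(const model_like & model) {
    return arch_is_recurrent(model.arch);
}
```

Keeping the decision in the arch-level function means both callers always agree, whether or not a model instance exists yet.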

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches.

This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


NOTE: I intentionally did not add support for s_mask since it will be going away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:

This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada

Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

ggml-ci

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

It just happens to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: aa956 27946957+aa956@users.noreply.github.com

Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.


Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread.

When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.


Co-authored-by: Diego Devesa slarengh@gmail.com

Create a general function that uses the enqueue_functions extension if it is enabled in the compiler, and otherwise calls the general SYCL function to launch kernels.


Signed-off-by: nscipione nicolo.scipione@codeplay.com


Co-authored-by: Johannes Gäßler johannesg@5d6.de


Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)


For some reason, the function is not getting a hit when debugged with gdb. We will need to investigate further.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


There are some conversion failures in NNPA that require the eyes of an IBM STSM. Will create a separate PR to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)


This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Fallback logic was already implemented, but I was too sleepy to realise.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de



Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
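As a reading aid, the row-scatter semantics described here can be sketched in plain C++. This is not the ggml API: `set_rows_reference` is a hypothetical helper over flat row-major float buffers, and it ignores tensor types, strides, and broadcasting that the real operator may handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics (assumed from the one-line description):
// for each r, row r of `src` is written to row idx[r] of `dst`.
// Both buffers are row-major with `n_cols` values per row.
void set_rows_reference(std::vector<float> &         dst,
                        const std::vector<float> &   src,
                        const std::vector<int64_t> & idx,
                        size_t                       n_cols) {
    const size_t n_dst_rows = dst.size() / n_cols;
    assert(src.size() == idx.size() * n_cols);
    for (size_t r = 0; r < idx.size(); ++r) {
        assert(idx[r] >= 0 && static_cast<size_t>(idx[r]) < n_dst_rows);
        const size_t dst_row = static_cast<size_t>(idx[r]);
        for (size_t c = 0; c < n_cols; ++c) {
            dst[dst_row * n_cols + c] = src[r * n_cols + c];
        }
    }
}
```

For example, with a 3x2 destination of zeros, source rows {1, 2} and {3, 4}, and indices {2, 0}, the destination becomes {3, 4}, {0, 0}, {1, 2}.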

ref: #8366



Co-authored-by: Georgi Gerganov ggerganov@gmail.com


This setting needs to be passed through to vulkan-shaders-gen

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

style fixes


Co-authored-by: slaren slarengh@gmail.com



Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

This commit refactors the SYCL element-wise operations to improve performance by:

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

Modify docker.yml so that the workflow stops running periodically; it can still be started manually when needed.

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.



Co-authored-by: Diego Devesa slarengh@gmail.com


This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This would otherwise cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON.
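One way to make such a constraint visible at compile time is sketched below. The types `graph_context_like` and `graph_builder_like` are hypothetical stand-ins, not the real llm_graph_context hierarchy, and the size check is only a heuristic guard against accidentally added data members; it does not make destruction through a base pointer well-defined in general.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical base with a non-virtual destructor: destroying an object
// through this type never runs a subclass destructor.
struct graph_context_like {
    int n_layer = 0;
    ~graph_context_like() = default;
};

// Hypothetical subclass that adds behaviour only, no data members.
struct graph_builder_like : graph_context_like {
    int doubled_layers() const { return 2 * n_layer; }
};

// Document the assumption that the base destructor is not virtual.
static_assert(!std::has_virtual_destructor<graph_context_like>::value,
              "sketch assumes a non-virtual base destructor");

// Heuristic guard: a data member added to the subclass would normally
// change its size and trip this assertion.
static_assert(sizeof(graph_builder_like) == sizeof(graph_context_like),
              "subclass must not add fields");
```

If a field with a non-trivial destructor (a `std::vector`, say) were added to the subclass, the size check would fail at compile time instead of surfacing later as a leak under the address sanitizer.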


Signed-off-by: nscipione nicolo.scipione@codeplay.com

Co-authored-by: luyuhong luyuhong@kylinos.cn


Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com



Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
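The packing idea can be sketched like this. The bit layout (flag in bit 16, value in the low 16 bits) and all names here are assumptions made for illustration; the shader and host code in the actual change may use a different split.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit 16 holds the "mask present" flag,
// bits 0..15 hold n_head_log2.
constexpr uint32_t MASK_FLAG_SHIFT  = 16;
constexpr uint32_t N_HEAD_LOG2_BITS = (1u << MASK_FLAG_SHIFT) - 1;

// Combine both values into one 32-bit word so they occupy a single
// push-constant slot instead of two.
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    return (static_cast<uint32_t>(has_mask) << MASK_FLAG_SHIFT) |
           (n_head_log2 & N_HEAD_LOG2_BITS);
}

// Recover the flag on the consuming side.
inline bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

// Recover n_head_log2 on the consuming side.
inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Merging two 4-byte fields into one saves 4 bytes, which matters when the block is right at the 128B limit mentioned above.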


Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Mikko Juola mikjuo@gmail.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-by: Ruikai Peng retr0@retr0.blog Co-authored-by: Acly aclysia@gmail.com Co-authored-by: Daniel Han danielhanchen@gmail.com Co-authored-by: Markus Tavenrath mtavenrath@users.noreply.github.com 
Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Johannes Gäßler johannesg@5d6.de Co-authored-by: Mathieu Baudier mbaudier@argeo.org Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Radoslav Gerganov rgerganov@gmail.com Co-authored-by: Weizhao Ouyang weizhao.ouyang@arm.com Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Renat rntk@users.noreply.github.com Co-authored-by: matteo matteo.serva@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com Co-authored-by: Vedran Miletić vedran@miletic.net Co-authored-by: xiaobing318 71554036+xiaobing318@users.noreply.github.com Co-authored-by: Romain Biessy romain.biessy@codeplay.com Co-authored-by: Björn Ganster mail@bjoern-ganster.de Co-authored-by: Eric Zhang 34133756+EZForever@users.noreply.github.com Co-authored-by: zhouwg zhouwg2000@gmail.com Co-authored-by: Rotem Dan rotemdan@gmail.com Co-authored-by: luyhcsu 110711054+luyhcsu@users.noreply.github.com Co-authored-by: luyuhong luyuhong@kylinos.cn

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 10, 2025

@jeffbolznv @qnixsynapse

olek-tether pushed a commit to tetherto/qvac-fabric-llm.cpp that referenced this pull request

Aug 15, 2025


This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-ci

Signed-off-by: noemotiovon 757486878@qq.com


Right now it's not easy to find those.



Co-authored-by: Diego Devesa slarengh@gmail.com

ggml-ci

ggml-ci

ggml-ci

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: nscipione nicolo.scipione@codeplay.com

Co-authored-by: luyuhong luyuhong@kylinos.cn

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com


Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
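As a rough illustration of the packing described above, here is a minimal sketch. The bit layout (flag in bit 16, n_head_log2 in the low 16 bits) and the helper names are assumptions made for this example, not the layout used by the Vulkan backend.

```cpp
#include <cstdint>

// Hypothetical layout: bit 16 carries the mask flag, bits 0..15 carry n_head_log2.
constexpr uint32_t MASK_FLAG_SHIFT  = 16;
constexpr uint32_t N_HEAD_LOG2_BITS = 0xFFFFu;

// Combine both values into one 32-bit word, replacing two separate fields.
inline uint32_t pack_mask_n_head_log2(bool use_mask, uint32_t n_head_log2) {
    return (static_cast<uint32_t>(use_mask) << MASK_FLAG_SHIFT) |
           (n_head_log2 & N_HEAD_LOG2_BITS);
}

// Recover the mask flag from the packed word.
inline bool unpack_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

// Recover n_head_log2 from the packed word.
inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Replacing two 32-bit fields with one packed word saves 4 bytes in the block, at the cost of a shift and a mask on the reading side.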

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.


Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

Signed-off-by: stevenkuang stevenkuang@tencent.com

k_num can get rather large. Use the whole workgroup to reduce the M/L values.
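For context, a serial CPU sketch of the kind of merge a split-k reduction performs. It assumes the usual flash-attention convention, where M is the per-split row maximum and L the per-split softmax denominator; the commit message does not spell this out, and the struct and function names are hypothetical. The shader spreads this work across the workgroup, while the sketch simply loops.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial result of one split: row max (m), softmax denominator (l),
// and the unnormalized output row (o). Names are illustrative only.
struct SplitPartial {
    float m;
    float l;
    std::vector<float> o;
};

// Merge k_num partial results into one normalized output row.
// Assumes all splits have the same output width and at least one finite m.
std::vector<float> merge_splits(const std::vector<SplitPartial>& parts) {
    float m_max = -INFINITY;
    for (const auto& p : parts) {
        m_max = std::max(m_max, p.m);
    }

    std::vector<float> out(parts.empty() ? 0 : parts[0].o.size(), 0.0f);
    float l_total = 0.0f;
    for (const auto& p : parts) {
        // Rescale each split to the common maximum before accumulating.
        const float scale = std::exp(p.m - m_max);
        l_total += scale * p.l;
        for (std::size_t d = 0; d < out.size(); ++d) {
            out[d] += scale * p.o[d];
        }
    }
    for (float& v : out) {
        v /= l_total;
    }
    return out;
}
```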

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net


Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This also slightly reduces the diff from the master branch

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Some of the tensor names are common with Llama4


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Signed-off-by: ryan-mangeno ryanmangeno@gmail.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

This also slightly reduces the diff from the master branch

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from self-> everywhere, but this is still a bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of build_rs / build_attn. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON.

ggml-ci

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Some of the tensor names are common with Llama4

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Francis Couture-Harpin git@compilade.net Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

ggml-ci

Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request

Mar 23, 2026

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com


Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-ci

ggml-ci

ggml-ci

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
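To make the row-copy semantics concrete, here is a minimal sketch on plain row-major float buffers. It only mirrors the description above (row i of 'b' lands in row c[i] of 'a'); the function name, signature, and use of std::vector are assumptions for illustration, and the real operation works on ggml tensors.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// dst: (n_dst_rows x n_cols), src: (idx.size() x n_cols), both row-major.
// Copies row i of src into row idx[i] of dst. Indices are assumed in range.
void set_rows_sketch(std::vector<float>& dst,
                     const std::vector<float>& src,
                     const std::vector<int64_t>& idx,
                     std::size_t n_cols) {
    for (std::size_t i = 0; i < idx.size(); ++i) {
        const std::size_t r = static_cast<std::size_t>(idx[i]);
        std::copy_n(src.data() + i * n_cols, n_cols, dst.data() + r * n_cols);
    }
}
```

Rows of the destination that no index refers to are left untouched, which is what makes this a scatter-style counterpart to a row gather.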

ref: #8366

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

This setting needs to be passed through to vulkan-shaders-gen

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

style fixes


Co-authored-by: slaren slarengh@gmail.com

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

This commit refactors the SYCL element-wise operations to improve performance by:

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-ci

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com


Signed-off-by: noemotiovon 757486878@qq.com

Right now it's not easy to find those.

ggml-ci

ggml-ci


Co-authored-by: Diego Devesa slarengh@gmail.com

ggml-ci

ggml-ci

ggml-ci

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when runnning Mamba-(1|2) inference when compiled -DGGML_SANITIZE_ADDRESS=ON

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: nscipione nicolo.scipione@codeplay.com

Co-authored-by: luyuhong luyuhong@kylinos.cn

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com


Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.


Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

Signed-off-by: stevenkuang stevenkuang@tencent.com

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net


Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This also slightly reduces the diff from the master branch

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Some of the tensor names are common with Llama4


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Co-authored-by: Xuan-Son Nguyen son@huggingface.co

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

This also slightly reduces the diff from the master branch

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if-statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from self-> everywhere, but this is still a bit cleaner than making those methods static I think.
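The duplicated-base problem described here can be reproduced in isolation. The following is a minimal sketch with hypothetical stand-in types (`context_base`, `mamba_mixin`, `granite_mixin`, `hybrid`), not the real `llm_graph_context` hierarchy: with plain (non-virtual) inheritance, a class that derives from two mixins gets two separate copies of the common base, and the base constructor runs once per copy.

```cpp
#include <cassert>

// Counts how many times the common base has been constructed.
int g_base_inits = 0;

// Hypothetical stand-in for a base class with non-trivial initialization.
struct context_base {
    context_base() { ++g_base_inits; }
};

// Two hypothetical mixins, each deriving from the base without `virtual`.
struct mamba_mixin   : context_base {};
struct granite_mixin : context_base {};

// The hybrid class inherits from both mixins, forming a diamond.
struct hybrid : mamba_mixin, granite_mixin {};

// Constructs one hybrid object and reports how often the base was initialized.
int count_base_inits() {
    g_base_inits = 0;
    hybrid h;
    (void) h;
    assert(g_base_inits > 0);
    return g_base_inits;
}
```

Calling `count_base_inits()` yields 2 for this hierarchy, because `hybrid` contains one `context_base` subobject per mixin. If the base owns resources such as unique pointers, each copy would own its own set, which is the kind of trouble the commit message refers to.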

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of build_rs / build_attn. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
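One way to guard against this class of mistake is a compile-time check that a subclass adds no data members. This is only a sketch: the types `graph_context_base` and `graph_context_mamba` and the helper `adds_no_fields` are hypothetical, and comparing `sizeof` is a heuristic rather than a complete proof that no fields were added.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical base class without a virtual destructor.
struct graph_context_base {
    int n_layer = 0;
};

// Hypothetical subclass that only adds member functions, no data members.
struct graph_context_mamba : graph_context_base {
    int layers() const {
        assert(n_layer >= 0);
        return n_layer;
    }
};

// True when Derived inherits from Base and has the same object size,
// which suggests that it introduces no extra fields of its own.
template <typename Derived, typename Base>
constexpr bool adds_no_fields() {
    return std::is_base_of<Base, Derived>::value && sizeof(Derived) == sizeof(Base);
}

// The base really has no virtual destructor in this sketch.
static_assert(!std::has_virtual_destructor<graph_context_base>::value,
              "sketch assumes a non-virtual destructor");

// Fails to compile if someone later adds a field to the subclass.
static_assert(adds_no_fields<graph_context_mamba, graph_context_base>(),
              "subclass must not add data members");
```

A subclass that gained, for example, a `std::vector` member would change `sizeof` and trip the second `static_assert` at build time rather than surfacing later under the address sanitizer.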

ggml-ci

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Some of the tensor names are common with Llama4

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Francis Couture-Harpin git@compilade.net

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

ggml-ci

Important: LFM2 was merged into transformers, but has not yet been released. To convert to GGUF, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Signed-off-by: Molly Sophia mollysophia379@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request

Mar 23, 2026

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Co-authored-by: slaren slarengh@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com


Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Signed-off-by: Aaron Teo aaron.teo1@ibm.com


Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-ci

ggml-ci

ggml-ci

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
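The row-scatter semantics described above can be sketched on plain row-major arrays. This is not the ggml implementation and does not use ggml tensors; the function name `set_rows_ref` and its parameters are hypothetical, with `dst`, `src` and `idx` playing the roles of 'a', 'b' and 'c'.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Copies row r of `src` into row idx[r] of `dst`.
// Both matrices are row-major with `n_cols` columns.
void set_rows_ref(std::vector<float> & dst,
                  const std::vector<float> & src,
                  const std::vector<int64_t> & idx,
                  std::size_t n_cols) {
    assert(n_cols > 0);
    assert(src.size() == idx.size() * n_cols);  // one index per source row
    assert(dst.size() % n_cols == 0);

    const std::size_t n_dst_rows = dst.size() / n_cols;
    for (std::size_t r = 0; r < idx.size(); ++r) {
        assert(idx[r] >= 0 && static_cast<std::size_t>(idx[r]) < n_dst_rows);
        const std::size_t d = static_cast<std::size_t>(idx[r]);
        for (std::size_t c = 0; c < n_cols; ++c) {
            dst[d * n_cols + c] = src[r * n_cols + c];
        }
    }
}
```

For example, with a 3x2 destination of zeros, a source of two rows `{1, 2}` and `{3, 4}`, and indices `{2, 0}`, the first source row lands in destination row 2 and the second in destination row 0.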

ref: #8366

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

ggml-ci

This setting needs to be passed through to vulkan-shaders-gen

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

style fixes


Co-authored-by: slaren slarengh@gmail.com

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: 0cc4m picard12@live.de

Co-authored-by: Akarshan akarshan@menlo.ai

This commit refactors the SYCL element-wise operations to improve performance by:

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

ggml-ci


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: 0cc4m picard12@live.de

Co-authored-by: Akarshan akarshan@menlo.ai

Co-authored-by: Jeff Bolz jbolz@nvidia.com

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

Modify docker.yml so that the workflow no longer runs periodically; the workflow can still be started manually when needed.

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

ggml-ci

ggml-ci

ggml-ci

ggml-ci

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-ci

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com

Signed-off-by: noemotiovon 757486878@qq.com


Signed-off-by: noemotiovon 757486878@qq.com

Right now it's not easy to find those.

ggml-ci

ggml-ci


Co-authored-by: Diego Devesa slarengh@gmail.com

ggml-ci

ggml-ci

ggml-ci

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

ggml-ci

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: nscipione nicolo.scipione@codeplay.com

Co-authored-by: luyuhong luyuhong@kylinos.cn

ggml-ci

ggml-ci

ggml-ci

ggml-ci

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com


Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Co-authored-by: slaren slarengh@gmail.com

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
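The packing idea can be sketched on the host side as follows. The bit layout here (mask flag in bit 16, `n_head_log2` in the low 16 bits) is an assumption chosen for illustration, and the helper names are hypothetical; the actual layout used by the shader may differ.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit 16 holds the mask flag, bits 0..15 hold n_head_log2.
constexpr uint32_t MASK_FLAG_SHIFT  = 16;
constexpr uint32_t N_HEAD_LOG2_BITS = 0xFFFFu;

// Packs both values into one 32-bit word, saving one dword of push constants.
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_BITS);  // must fit in the low 16 bits
    return (static_cast<uint32_t>(has_mask) << MASK_FLAG_SHIFT) | n_head_log2;
}

// Recovers the mask flag from the packed word.
inline bool unpack_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

// Recovers n_head_log2 from the packed word.
inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Since Vulkan only guarantees 128 bytes of push constants as the minimum supported limit, folding a boolean into an existing 32-bit field is a cheap way to stay inside that budget without adding a separate buffer.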

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-ci

ggml-ci

ggml-ci

ggml-ci


Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.


Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

Signed-off-by: stevenkuang stevenkuang@tencent.com

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net


Co-authored-by: younesbelkada younes.belkada@tii.ae

Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This also slightly reduces the diff from the master branch

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Some of the tensor names are common with Llama4


Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Co-authored-by: Xuan-Son Nguyen son@huggingface.co

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Co-authored-by: compilade git@compilade.net

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

The implementation already supported it, and this makes Mamba's conv step slightly faster.

This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

This also slightly reduces the diff from the master branch

The max index is 31, so trimming the arguments is necessary.

Whoops, this is needed for the offset in the concatenated output.

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

This makes the weight buft detection in src/llama.cpp simpler.

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

Works, but using lambda functions might not be that clean.

There is still room for improvement, but it works!

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from self-> everywhere, but this is still a bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of build_rs / build_attn. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON

ggml-ci

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Some of the tensor names are common with Llama4

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com


Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Francis Couture-Harpin git@compilade.net Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

ggml-ci

Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

ggml-ci

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com


Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request

Apr 26, 2026

@jeffbolznv

phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request

Apr 28, 2026

@firecoperana @0cc4m

vulkan : do not use tensor->extra (ggml-org#9407)

This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used.

Ref: ggml-org#8536


Co-authored-by: 0cc4m picard12@live.de

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan : fix build (#0)

ggml-ci

Improve Vulkan shader build system (ggml-org#9239)

ggml : fix build break for the vulkan-debug (ggml-org#9265)

Signed-off-by: Changyeon Kim cyzero.kim@samsung.com

vulkan: correctly report support for OP_CONT (ggml/946)

test-backend-ops fails because ggml_cont aborts when invoked passing an unsupported type.

This commit makes ggml_cont tests pass

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan: add dryrun support to sin and cos ops (ggml/947)

sin and cos failed test-backend-ops because they tried to dereference a context pointer that is null on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

Conflicts:

ggml/src/ggml-vulkan.cpp

Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (ggml-org#9118)

Conflicts:

ggml/src/ggml-vulkan.cpp

Enable use of the rebar feature to upload buffers to the device. (ggml-org#9251)

vulkan : argsort barriers must be under uniform control flow (ggml/951)

A return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)

vulkan : multithread pipeline creation (ggml/963)

vulkan : mul_mat: fix UB with small warps (ggml/952)

When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. This is because they are calculated as the workgroup size multiplied by LOAD_VEC_* (which can be 1) and divided by 16, and the workgroup size is set to be the same as the warp/subgroup size.

The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.

We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).
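The arithmetic behind this failure can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the real computation lives in the shader and its host-side setup.

```python
def loadstride(workgroup_size: int, load_vec: int) -> int:
    """Loop increment as described above: workgroup size * LOAD_VEC / 16 (integer division)."""
    return workgroup_size * load_vec // 16


def clamped_workgroup_size(subgroup_size: int, minimum: int = 16) -> int:
    """Hypothetical helper: never let the workgroup size drop below 16."""
    return max(subgroup_size, minimum)


# With 8-wide subgroups and LOAD_VEC == 1 the stride truncates to zero, so a
# loop of the form `for (i = start; i < n; i += stride)` never advances.
print(loadstride(8, 1))                          # 0
print(loadstride(clamped_workgroup_size(8), 1))  # 1
```

With the clamp in place, the stride is at least 1 for any subgroup size, so the populate loops always make progress.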

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan : retry allocation with fallback flags (whisper/2451)

Co-authored-by: Samuel Morris samuel.morris@artlist.io

vulkan : improve ggml_vk_create_buffer error handling (ggml-org#9898)

vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (ggml-org#10226)

vulkan: Throttle the number of shader compiles during the build step. (ggml-org#10222)

Fixes ggml-org#9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes" errors on Linux. This change applies the same throttling we use for multithreaded pipeline creation.
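A minimal sketch of this kind of throttling, assuming a counting semaphore guards each job. The function name and job shape are hypothetical; in the real build each job would launch a glslc process rather than call a Python function.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(jobs, max_concurrent):
    """Run callables on many threads, but let at most `max_concurrent` execute at once."""
    gate = threading.Semaphore(max_concurrent)
    lock = threading.Lock()
    state = {"running": 0, "peak": 0}

    def guarded(job):
        with gate:  # blocks while `max_concurrent` jobs are already in flight
            with lock:
                state["running"] += 1
                state["peak"] = max(state["peak"], state["running"])
            try:
                return job()
            finally:
                with lock:
                    state["running"] -= 1

    # Deliberately oversubscribe the pool; the semaphore does the throttling.
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        results = list(pool.map(guarded, jobs))
    return results, state["peak"]
```

The returned peak is only there to make the limit observable; a build tool would drop it and keep just the semaphore.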

Conflicts:

ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp

vulkan: Optimize contiguous copies (ggml-org#10254)

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.

Conflicts:

tests/test-backend-ops.cpp

vulkan: Use macros to make the mat mul pipeline creation more concise (ggml-org#10259)

Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a pipeline. This isn't really used yet.

vulkan: Optimize binary ops (ggml-org#10270)

Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 are the same dimensions and additional modulus for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that have a fast path when the calculation isn't needed or can be done more cheaply.

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/acc.comp

ggml : vulkan logs (whisper/2547)

vulkan: Optimize some mat-vec mul quant shaders (ggml-org#10296)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.

Vulkan: Fix device info output format specifiers (ggml-org#10366)

vulkan: remove use of null initializer (ggml-org#10372)

Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?

vulkan: Optimize soft_max (ggml-org#10301)

Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H.

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.

vulkan: further optimize mul_mat_vec using larger loads (ggml-org#10387)

Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up.
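The round-up and the early return go together, which a small sketch can show. The names and the 512 limit are assumptions for illustration, not values taken from the shaders.

```python
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b


def grid_for_rows(num_rows: int, max_x: int):
    """Split `num_rows` (>= 1) workgroups into an (x, y) grid with x <= max_x, rounding y up."""
    x = min(num_rows, max_x)
    y = ceil_div(num_rows, x)
    return x, y


def rows_processed(num_rows: int, max_x: int):
    """Simulate the dispatch: every workgroup maps to a row, extra ones bail out."""
    x, y = grid_for_rows(num_rows, max_x)
    done = []
    for gy in range(y):
        for gx in range(x):
            row = gy * x + gx
            if row >= num_rows:  # early return for a nonexistent row
                continue
            done.append(row)
    return done
```

Rounding down instead would give a 512 x 1 grid for 1000 rows and silently skip the last 488; rounding up launches 24 surplus workgroups, which is why the bounds check is needed.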

Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions.

In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits.

Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions.

Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.

vulkan: copy iq4_nl LUT into shared memory (ggml-org#10409)

vulkan: predicate max operation in soft_max shaders/soft_max (ggml-org#10437)

Fixes ggml-org#10434

vulkan: Fix a vulkan-shaders-gen argument parsing error (ggml-org#10484)

vulkan-shaders-gen was not parsing the --no-clean argument correctly. The previous code only handled arguments that take a value, and --no-clean does not have one, so it was not being parsed correctly. This commit correctly parses arguments that don't have values.
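The distinction between valued arguments and bare flags can be sketched like this. It is a hypothetical parser written for illustration, not the tool's actual code, and every option name other than --no-clean is only an example.

```python
def parse_args(argv, flags_without_value=("--no-clean",)):
    """Parse `--key value` pairs plus bare flags such as `--no-clean`."""
    args = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if not arg.startswith("--"):
            i += 1            # ignore stray tokens in this sketch
        elif arg in flags_without_value:
            args[arg] = ""    # present, but carries no value
            i += 1
        elif i + 1 < len(argv):
            args[arg] = argv[i + 1]
            i += 2
        else:
            raise ValueError(f"missing value for {arg}")
    return args
```

A parser that always consumes two tokens per option would either swallow the token after --no-clean as its value or drop the flag when it comes last.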

vulkan: fix group_norm (ggml-org#10496)

Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion).

Fixes leejet/stable-diffusion.cpp#439.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: optimize Q2_K and Q3_K mul_mat_vec (ggml-org#10459)

vulkan: skip integer div/mod in get_offsets for batch_idx==0 (ggml-org#10506)

vulkan: further optimize q5_k mul_mat_vec (ggml-org#10479)

vulkan: Handle GPUs with less shared memory (ggml-org#10468)

There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. ggml-org#10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.

vulkan: define all quant data structures in types.comp (ggml-org#10440)

vulkan: get the first command buffer submitted sooner (ggml-org#10499)

This is an incremental improvement over ggml-org#9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.

vulkan: Dynamic subgroup size support for Q6_K mat_vec (ggml-org#10536)

scalable version

tested for subgroup sizes 16-128

vulkan: optimize and reenable split_k (ggml-org#10637)

Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (ggml-org#10642)
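For readers unfamiliar with the trick, here is a sketch of the textbook multiply-and-shift division for a divisor known ahead of time. The function names and the exact constants are assumptions made for illustration; the shader may differ in detail.

```python
def init_fastdiv_values(d: int):
    """Precompute (mp, L) so that n // d == (mulhi(n, mp) + n) >> L for 32-bit n."""
    assert 1 <= d < 2**32
    L = 0
    while L < 32 and (1 << L) < d:   # L = ceil(log2(d))
        L += 1
    mp = ((1 << 32) * ((1 << L) - d)) // d + 1
    return mp, L


def fastdiv(n: int, mp: int, L: int) -> int:
    """Divide with one widening multiply, one add and one shift."""
    msbs = (n * mp) >> 32            # high 32 bits of the 64-bit product
    return (msbs + n) >> L           # Python ints cannot overflow here


def fastmod(n: int, d: int, mp: int, L: int) -> int:
    """Remainder derived from the fast quotient."""
    return n - fastdiv(n, mp, L) * d
```

The pair (mp, L) is computed once on the host per divisor and then reused for every element. A fixed-width implementation also has to account for the carry in `msbs + n`, which this sketch sidesteps by using arbitrary-precision integers.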

vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (ggml-org#10206)

Conflicts:

ggml/src/vulkan-shaders/dequant_funcs_cm2.comp

ggml/src/vulkan-shaders/flash_attn_cm2.comp

ggml/src/vulkan-shaders/mul_mm_cm2.comp

Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (ggml-org#10597)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: compile a test shader in cmake to check for coopmat2 support (ggml-org#10713)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/ggml-vulkan/CMakeLists.txt

ggml/src/vulkan-shaders/test_coopmat2_support.comp

Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (ggml-org#10723)

vulkan: fix compile warnings (ggml-org#10731)

vulkan: disable spirv-opt for coopmat shaders (ggml-org#10763)

There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway.

Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes.

Fix coopmat support reporting when glslc doesn't support NV_coopmat2.

vulkan: dynamic subgroup size for the remaining k quants (ggml-org#10745)

q4_k

q3_k

q2_k

q6_k multi row example

vulkan: request round-to-even for fp16 in im2col/rope_head (ggml-org#10767)

Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.
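As a small illustration of what round-to-even means for fp16, the sketch below uses Python's `struct` half-precision format, which rounds ties to even. It only demonstrates the rounding rule and says nothing about how the shaders request it.

```python
import struct


def to_fp16_rte(x: float) -> float:
    """Round to the nearest fp16 value, ties to even, and return it as a Python float."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


# fp16 has 10 mantissa bits, so neighbours near 1.0 are 2**-10 apart.
tie_low = 1 + 2**-11       # halfway between 1.0 (even mantissa) and 1 + 2**-10 (odd)
tie_high = 1 + 3 * 2**-11  # halfway between 1 + 2**-10 (odd) and 1 + 2**-9 (even)

print(to_fp16_rte(tie_low))   # 1.0
print(to_fp16_rte(tie_high))  # 1.001953125
```

Both ties land on the neighbour with an even mantissa, so rounding errors do not drift in one direction the way they would with truncation or round-half-up.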

Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (ggml-org#10721)

Add accf32 and accf16 checks for coopmats

Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (ggml-org#10798)

vulkan: small mul_mat_vec optimizations (ggml-org#10665)

Change Debug print name

add GGML_ROPE_TYPE_MROPE

rwkv6: add wkv6 support for Vulkan backend (ggml-org#10829)

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/wkv6.comp

vulkan: bugfixes for small subgroup size systems + llvmpipe test (ggml-org#10809)

more fixes

add test

Conflicts:

.github/workflows/build.yml

vulkan : fix soft_max.comp division by zero (whisper/2633)

This change prevents a division by zero error when p.KY is 0.

vulkan: optimize coopmat2 dequant functions (ggml-org#10855)

Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.

vulkan: build fixes for 32b (ggml-org#10927)

Should fix ggml-org#10923

examples, ggml : fix GCC compiler warnings (ggml-org#10983)

Warning types fixed (observed under MSYS2 GCC 14.2.0):

Conflicts:

examples/export-lora/export-lora.cpp

vulkan: multi-row k quants (ggml-org#10846)

vulkan: Use push constant offset to handle misaligned descriptors (ggml-org#10987)

vulkan: im2col and matmul optimizations for stable diffusion (ggml-org#10942)

vulkan: optimize mul_mat for small values of N (ggml-org#10991)

Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better.

Share some code for reducing the result values to memory in mul_mat_vec_base.

Conflicts:

tests/test-backend-ops.cpp

fix: Vulkan shader gen binary path (ggml-org#11037)

Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (ggml-org#11074)

fix lora print

Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (ggml-org#11117)

Conflicts:

ggml/src/vulkan-shaders/test_coopmat_support.comp

llama: add support for QRWKV6 model architecture (ggml-org#11001)

Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (ggml-org#11161)

fix: ggml: fix vulkan-shaders-gen build (ggml-org#10448)

The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host.

Use configure_file to generate host_toolchain.cmake from template

Fix compile error not finding vulkan-shaders-gen

Fix build issues with vulkan-shaders-gen:

Improve host compiler detection for vulkan shader generation:

Simplified the CMake function to improve the process of detecting the host compiler.

Since vulkan-shader-gen.cpp only requires the glslc executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0)

Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

Conflicts:

ggml/src/ggml-vulkan/CMakeLists.txt

vulkan: scale caching for k quants + misc fixes (ggml-org#11081)

This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.

vulkan: optimize coopmat2 q2_k dequant function (ggml-org#11130)

vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (ggml-org#11206)

Do masking on whole dwords, fetch all scales at once.

vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (ggml-org#11166)

Shaders are based on cpy.cu.

Conflicts:

ggml/src/ggml-cpu/ggml-cpu.c

ggml/src/vulkan-shaders/copy_from_quant.comp

ggml/src/vulkan-shaders/copy_to_quant.comp

vulkan: fix coopmat2 flash attention for non-contiguous inputs (ggml-org#11281)

Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes ggml-org#11268.

Conflicts:

tests/test-backend-ops.cpp

vulkan: fix coopmat2 validation failures (ggml-org#11284)

mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop, so it is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16.

coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.

vulkan: fix diag_mask_inf (ggml-org#11323)

With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.

vulkan: sort shaders for more deterministic binary (ggml-org#11315)

Fixes ggml-org#11306.

Vulkan-run-test: fix mmq_wg_denoms (ggml-org#11343)

This looks like a copy-and-paste error.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of wg_denoms.

vulkan: compile shaders on-demand (ggml-org#11406)

Reduce first-run startup time and memory consumption.

Should fix ggml-org#11339.

vulkan: Catch pipeline creation failure and print an error message (ggml-org#11436)

Also, fix some warnings from my on-demand compile change.

vulkan: implement initial support for IQ2 and IQ3 quantizations (ggml-org#11360)


Co-authored-by: Jeff Bolz jbolz@nvidia.com

Conflicts:

ggml/src/vulkan-shaders/dequant_iq2_s.comp

ggml/src/vulkan-shaders/dequant_iq2_xs.comp

ggml/src/vulkan-shaders/dequant_iq2_xxs.comp

ggml/src/vulkan-shaders/dequant_iq3_s.comp

ggml/src/vulkan-shaders/dequant_iq3_xxs.comp

CUDA: non-contiguous (RMS) norm support (ggml-org#11659)

vulkan: use smaller combined allocations to avoid fragmentation (ggml-org#11551)

Conflicts:

ggml/src/ggml-alloc.c

vulkan: initial support for IQ4_XS quantization (ggml-org#11501)

Conflicts:

ggml/src/vulkan-shaders/dequant_iq4_xs.comp

vulkan: optimize coopmat2 iq2/iq3 callbacks (ggml-org#11521)

vulkan: print shared memory size (ggml-org#11719)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: account for lookup tables when checking shared memory size (ggml-org#11502)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (ggml-org#11592)

vulkan: linux builds + small subgroup size fixes (ggml-org#11767)

vulkan: initial support for IQ1_S and IQ1_M quantizations (ggml-org#11528)

Conflicts:

ggml/src/vulkan-shaders/dequant_iq1_m.comp

ggml/src/vulkan-shaders/dequant_iq1_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq1_m.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq1_s.comp

vulkan: support multi/vision rope, and noncontiguous rope (ggml-org#11902)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/rope_multi.comp

ggml/src/vulkan-shaders/rope_vision.comp

vulkan: implement several ops relevant for ggml_opt (ggml-org#11769)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/argmax.comp

ggml/src/vulkan-shaders/count_equal.comp

ggml/src/vulkan-shaders/opt_step_adamw.comp

ggml/src/vulkan-shaders/repeat_back.comp

ggml/src/vulkan-shaders/sub.comp

tests/test-backend-ops.cpp

vulkan: implement more backpropagation operators (ggml-org#11914)

Conflicts:

ggml/src/vulkan-shaders/rms_norm_back.comp

ggml/src/vulkan-shaders/silu_back.comp

ggml/src/vulkan-shaders/soft_max_back.comp

Add memset tensor in all backend interface

SYCL: implement memset ggml backend buffer interface (ggml-org#12580)

Conflicts:

ggml/src/ggml-sycl.cpp

add OP sigmoid (ggml-org#12056)

Co-authored-by: Judd foldl@boxvest.com

Conflicts:

ggml/src/vulkan-shaders/sigmoid.comp

vulkan: fix assertion when qy_needs_dequant (ggml-org#12068)

Looks like a copy/paste bug from qx_needs_dequant.

vulkan: improve im2col (ggml-org#11826)

vulkan: matmul dequantization improvements (ggml-org#12015)

vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (ggml-org#11595)

Conflicts:

ggml/src/vulkan-shaders/mul_mat_vec_iq2_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq2_xs.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq2_xxs.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq3_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq3_xxs.comp

cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)

ggml-ci

Conflicts:

ggml/src/ggml-cuda.cu

tests/test-backend-ops.cpp

mat vec double buffer (ggml-org#12188)

vulkan: fix bug in coopmat1 mul_mat_id (ggml-org#12316)

Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (ggml-org#12301)

vulkan: Adjust coopmat2 tile sizes and selection heuristic (ggml-org#12258)

vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (ggml-org#12273)

vulkan: use fp32 in coopmat2 q4_k dequant function (ggml-org#12309)

vulkan: subgroup size tuning (ggml-org#12087)


Co-authored-by: 0cc4m picard12@live.de

vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (ggml-org#12312)

ggml-vulkan: remove unused find_program(glslc) (ggml-org#12416)

It's already found by FindVulkan.cmake in the parent CMakeLists

Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (ggml-org#12434)

vulkan: Submit once enough matmul work has been recorded (ggml-org#12406)

I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
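One way to picture the heuristic is the sketch below. The 100MB threshold comes from the description above; the ramp fractions for the first few submits and the function name are invented for illustration.

```python
def plan_submits(node_weight_bytes, flush_bytes=100 * 1024 * 1024, ramp=(0.125, 0.25, 0.5)):
    """Return the node indices after which a command buffer would be submitted.

    Each entry of `node_weight_bytes` is the weight-matrix size a graph node
    reads (0 for nodes that are not matmuls). The `ramp` fractions scale the
    threshold for the first few submits and are purely hypothetical.
    """
    submits = []
    pending = 0
    for i, nbytes in enumerate(node_weight_bytes):
        pending += nbytes
        n = len(submits)
        threshold = flush_bytes * (ramp[n] if n < len(ramp) else 1.0)
        if pending >= threshold:
            submits.append(i)
            pending = 0
    last = len(node_weight_bytes) - 1
    if last >= 0 and (not submits or submits[-1] != last):
        submits.append(last)  # flush whatever is left at the end of the graph
    return submits
```

With sixteen nodes of 25MB each this yields submits after nodes 0, 1, 3, 7, 11 and 15: small batches first so the GPU starts early, then one batch per 100MB of weights.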

vulkan: optimize iq1 coopmat2 dequant functions (ggml-org#12427)

vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (ggml-org#12472)

Vulkan: RTE rounding for cpy to quant (ggml-org#12480)

Co-Authored-By: Jeff Bolz jbolz@nvidia.com


vulkan: Optimize mul_mat_vec p021 and nc shaders (ggml-org#12505)

These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.

Conflicts:

tests/test-backend-ops.cpp

vulkan: fix mul_mat_vec failure in backend tests (ggml-org#12529)

The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.

vulkan: fix coopmat shader generation when cross-compiling (ggml-org#12272)

Previously the status of coopmat{,2} support wasn't passed to the vulkan-shaders-gen project built on the host, which led to a build failure because the cross-compiled code expected coopmat{,2} shaders that didn't get generated.

Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject.

Signed-off-by: Icenowy Zheng uwu@icenowy.me

Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com

cmake: improve Vulkan cooperative matrix support checks (whisper/2966)

Co-authored-by: Sandro Hanea me@sandro.rocks

cmake : fix whitespace (#0)

Vulkan: Add DP4A MMQ and Q8_1 quantization shader (ggml-org#12135)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/mul_mmq.comp

ggml/src/vulkan-shaders/mul_mmq_funcs.comp

ggml/src/vulkan-shaders/quantize_q8_1.comp

ggml/src/vulkan-shaders/test_integer_dot_support.comp

vulkan: fix build when glslc doesn't support coopmat (ggml-org#12683)

Vulkan: Fix mmq int dot float cache size (ggml-org#12722)

vulkan: Implement grouped query attention in the coopmat2 FA shader (ggml-org#12559)

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each.

This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.
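The workgroup arithmetic in the example can be sketched as below. It assumes the third dimension of q and k in the shape notation is the head count (32 query heads, 8 KV heads), and `plan_gqa` and `GqaDispatch` are hypothetical names, not functions from the backend.

```cpp
#include <cassert>
#include <cstdint>

struct GqaDispatch {
    uint32_t workgroups;            // number of workgroups launched
    uint32_t results_per_workgroup; // query heads handled by each workgroup
};

// Hypothetical helper: with grouping, query heads that share a KV head are
// batched into one workgroup; without it, every query head gets its own.
GqaDispatch plan_gqa(uint32_t q_heads, uint32_t kv_heads, bool group) {
    assert(kv_heads > 0 && q_heads % kv_heads == 0);
    const uint32_t ratio = group ? q_heads / kv_heads : 1;
    return { q_heads / ratio, ratio };
}
```

For the shapes above, `plan_gqa(32, 8, false)` gives 32 workgroups with 1 result each and `plan_gqa(32, 8, true)` gives 8 workgroups with 4 results each.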

cmake: remove caching from vulkan coopmat checks (ggml-org#12719)

vulkan: Implement split_k for coopmat2 flash attention. (ggml-org#12627)

When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
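To illustrate what a split_k reduction has to do, here is a CPU sketch of the usual way partial softmax-attention results are merged: each split keeps its own running maximum, exponent sum, and unnormalized output, and the reduce step rescales them to a common maximum. This is a generic sketch with a scalar value per key, not the contents of flash_attn_split_k_reduce.comp; `Partial`, `attend_range`, and `reduce_splits` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-split partial result: running max, sum of exponentials, unnormalized output.
struct Partial { float m; float l; float o; };

// Process keys [begin, end) of one query: s holds the scores, v the (scalar) values.
Partial attend_range(const std::vector<float>& s, const std::vector<float>& v,
                     size_t begin, size_t end) {
    assert(begin < end && end <= s.size() && s.size() == v.size());
    Partial p{ s[begin], 0.0f, 0.0f };
    for (size_t i = begin; i < end; ++i) p.m = std::max(p.m, s[i]);
    for (size_t i = begin; i < end; ++i) {
        const float e = std::exp(s[i] - p.m);
        p.l += e;
        p.o += e * v[i];
    }
    return p;
}

// Merge the splits: rescale each partial to the global max, then normalize once.
float reduce_splits(const std::vector<Partial>& parts) {
    assert(!parts.empty());
    float m = parts[0].m;
    for (const Partial& p : parts) m = std::max(m, p.m);
    float l = 0.0f, o = 0.0f;
    for (const Partial& p : parts) {
        const float scale = std::exp(p.m - m);
        l += p.l * scale;
        o += p.o * scale;
    }
    return o / l;
}
```

Splitting the key range in two and reducing should match the unsplit result up to rounding, which is what lets the splits run on different SMs.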

Conflicts:

ggml/src/vulkan-shaders/flash_attn_split_k_reduce.comp

vulkan: Fix missing cmake logic for dot product extension (ggml-org#12721)

vulkan: set cmake minimum and project name in vulkan-shaders (ggml-org#12744)

vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (ggml-org#12630)

There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increases variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete; we waitForFences for the almost_ready fence and then spin (with _mm_pause) waiting for the final fence to be signaled.
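The control flow of the hybrid wait can be sketched without any Vulkan dependency. The three callables are stand-ins chosen for this sketch: in real code the first would be a blocking `vkWaitForFences` on the almost_ready fence, the second a `vkGetFenceStatus` poll on the final fence, and the third something like `_mm_pause()`. `hybrid_wait` itself is a hypothetical name.

```cpp
#include <cassert>
#include <functional>

// Block (cheaply, in the OS) until the graph is nearly done, then spin on the
// final fence so the CPU sees completion without a wake-up delay.
// Returns the number of spin iterations, which is handy for tuning.
int hybrid_wait(const std::function<void()>& block_on_almost_ready,
                const std::function<bool()>& final_fence_signaled,
                const std::function<void()>& pause) {
    assert(block_on_almost_ready && final_fence_signaled && pause);
    block_on_almost_ready();
    int spins = 0;
    while (!final_fence_signaled()) {
        pause();
        ++spins;
    }
    return spins;
}
```

The blocking wait covers most of the graph's runtime, so the busy-wait only burns CPU for the last stretch.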

Conflicts:

ggml/src/ggml-vulkan.cpp

cmake: fix ggml-shaders-gen compiler paths containing spaces (ggml-org#12747)

fixes error for compiler paths with spaces

Vulkan: Tune Vulkan mmq int dot shader for performance (ggml-org#12767)

vulkan: Use unclamped loads for flash attention mask (ggml-org#12720)

nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.

vulkan: fix NaN issue in flash attention shader (ggml-org#12776)

Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
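One way such a NaN can arise, shown on the CPU: in an online-softmax loop the running sum is rescaled by exp(old_max - new_max), and if both are -inf (for example when a whole block is masked out) the subtraction yields NaN. The commit message does not spell out the exact failure path, so treat this as an assumed scenario; `rescale_after_block` is a hypothetical helper.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <limits>

// Rescale factor applied to the running sum and output when the running
// maximum is updated after seeing a block whose largest score is block_max.
float rescale_after_block(float running_max, float block_max) {
    const float new_max = std::fmax(running_max, block_max);
    return std::exp(running_max - new_max);
}
```

With `running_max = -inf` and a fully masked block, the factor is exp(-inf - (-inf)), which is NaN. Starting from `-FLT_MAX / 2` instead, the same call gives exp(0), and a later finite score still drives the factor to 0 as intended.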

vulkan: Use fp16 for the flash attention P*V multiplication (ggml-org#12783)

This is consistent with the ggml-cuda behavior and the mul_mat fallback.

vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (ggml-org#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.

vulkan: use aligned loads for flash attention mask (ggml-org#12853)

Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, to allow using more efficient loads.

vulkan: enable coopmat2 FA gqa and split_k optimizations more often (ggml-org#12931)

The grouped query attention optimization doesn't require a power-of-two ratio; the only thing relying on it was the modulo operation written as a bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.
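The power-of-two dependence mentioned above comes from a standard identity: `i & (n - 1)` equals `i % n` only when `n` is a power of two. A small sketch, with hypothetical function names:

```cpp
#include <cassert>
#include <cstdint>

// Only correct when n is a power of two.
uint32_t wrap_mask(uint32_t i, uint32_t n) {
    assert(n > 0);
    return i & (n - 1);
}

// Correct for any n > 0.
uint32_t wrap_mod(uint32_t i, uint32_t n) {
    assert(n > 0);
    return i % n;
}
```

With a ratio of 4 the two agree for every index, but with a ratio of 3 they diverge (for example, index 3 gives 2 with the mask and 0 with the modulo), so switching to a true modulo removes the restriction.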

vulkan: support noncontiguous rms_norm (ggml-org#13031)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: matmul gcn tuning (ggml-org#13016)

Co-authored-by: 0cc4m picard12@live.de



vulkan: use uint array index to avoid glslang bug (ggml-org#13193)

vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (ggml-org#13191)

vulkan: Add bfloat16 support (ggml-org#12554)

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which currently requires a custom build.

Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available.
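Promoting bf16 to fp32 is cheap because bfloat16 is the upper 16 bits of an IEEE-754 binary32 value. A CPU sketch of the conversion (the shader does the equivalent in GLSL; `bf16_to_f32` is a name chosen for this example):

```cpp
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(uint32_t), "sketch assumes 32-bit float");

// Place the 16 bf16 bits in the high half of a 32-bit word and reinterpret it.
float bf16_to_f32(uint16_t bits) {
    const uint32_t u = static_cast<uint32_t>(bits) << 16;
    float f;
    std::memcpy(&f, &u, sizeof f); // bit-exact reinterpretation, no rounding
    return f;
}
```

For example, `0x3F80` maps to 1.0f and `0x4049` maps to 3.140625f, the bf16 value nearest to pi from below.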

Conflicts:

ggml/src/vulkan-shaders/test_bfloat16_support.comp

vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)

Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.

vulkan: scalar flash attention implementation (ggml-org#13324)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/flash_attn.comp

vulkan: workaround FA compile failures on macos (ggml-org#13517)

vulkan: KHR_coopmat flash attention (ggml-org#13506)

This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.

Conflicts:

ggml/src/vulkan-shaders/flash_attn_cm1.comp

cmake: simplify vulkan shader test logic (ggml-org#13263)

vulkan: use scalar FA rather than coopmat2 when N==1 (ggml-org#13554)

Add pipeline_acc_f32

vulkan: move common FA code to flash_attn_base.comp (ggml-org#13556)

Conflicts:

ggml/src/vulkan-shaders/flash_attn_base.comp

cmake: use the current build config for vulkan-shaders-gen (ggml-org#13595)

Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (ggml-org#13607)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: fix warnings (ggml-org#13626)

use LOG_WARN to replace std::cerr (ggml-org#13657)

vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (ggml-org#13696)

vulkan: support CPY from any type to itself (ggml-org#13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.
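A sketch of the element-count scaling, assuming the choice between the 2-byte and 4-byte copy shader is made from the type's block size in bytes. `RawCopy` and `plan_same_type_copy` are hypothetical names, and the selection rule is an assumption rather than the backend's exact logic.

```cpp
#include <cassert>
#include <cstdint>

struct RawCopy {
    uint32_t unit_bytes; // 4 -> reuse the f32 copy shader, 2 -> reuse the f16 one
    uint64_t units;      // number of f32/f16 "elements" to copy
};

// A same-type copy is a byte copy, so reinterpret the tensor as f32 or f16 data.
// block_elems/block_bytes describe the type, e.g. 32 elements in 18 bytes for q4_0.
RawCopy plan_same_type_copy(uint64_t n_elements, uint32_t block_elems, uint32_t block_bytes) {
    assert(block_elems > 0 && n_elements % block_elems == 0);
    assert(block_bytes % 2 == 0);
    const uint64_t total_bytes = n_elements / block_elems * block_bytes;
    const uint32_t unit = (block_bytes % 4 == 0) ? 4u : 2u;
    return { unit, total_bytes / unit };
}
```

For 64 q4_0 elements (36 bytes) this yields 18 two-byte units, while 256 q4_K elements (144 bytes) yield 36 four-byte units.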

add GGML_LOG_WARN

vulkan: mark IM2COL as supporting non-contig (ggml-org#13783)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: use timestamp queries for GGML_VULKAN_PERF (ggml-org#13817)

Also change it to be controlled by an env var rather than a cmake flag.

vulkan : Remove unexpected ; (ggml/1253)

vulkan: fix warnings in perf logger querypool code (ggml-org#13937)

ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (ggml-org#13813)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/conv_transpose_1d.comp

vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (ggml-org#14001)

Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: force device 0 in CI (ggml-org#14106)

Add GGML_LOG_INFO

vulkan: Track descriptor pools/sets per-context (ggml-org#14109)

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

vulkan: Better thread-safety for command pools/buffers (ggml-org#14116)

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: mutex around vkQueueSubmit (ggml-org#14127)

This fixes the remaining crash in test-thread-safety on my system.

cmake: clean up external project logic for vulkan-shaders-gen (ggml-org#14179)

Conflicts:

.github/workflows/build.yml

cmake: remove shader-gen step-targets from ggml-vulkan (ggml-org#14226)

Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (ggml-org#14249)

Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (ggml-org#13792)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: update windows SDK in CI (ggml-org#14334)

vulkan: update windows SDK in release.yml (ggml-org#14344)

Conflicts:

.github/workflows/release.yml

cmake: regen vulkan shaders when shaders-gen sources change (ggml-org#14398)

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)

This setting needs to be passed through to vulkan-shaders-gen

vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
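The general pattern behind this change, sketched under assumptions: a vector of pinned host ranges is shared between threads, so every read and write takes the same mutex. `PinnedRegistry` and its members are hypothetical and do not mirror the actual types in ggml-vulkan.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

class PinnedRegistry {
public:
    void add(const void* base, size_t size) {
        assert(base != nullptr && size > 0);
        std::lock_guard<std::mutex> guard(mutex_);
        ranges_.push_back({ reinterpret_cast<uintptr_t>(base), size });
    }

    bool remove(const void* base) {
        std::lock_guard<std::mutex> guard(mutex_);
        for (size_t i = 0; i < ranges_.size(); ++i) {
            if (ranges_[i].base == reinterpret_cast<uintptr_t>(base)) {
                ranges_.erase(ranges_.begin() + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    // Lookups also lock: a concurrent push_back may reallocate the vector.
    bool contains(const void* ptr) const {
        std::lock_guard<std::mutex> guard(mutex_);
        const uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
        for (const Range& r : ranges_) {
            if (p >= r.base && p - r.base < r.size) return true;
        }
        return false;
    }

private:
    struct Range { uintptr_t base; size_t size; };
    mutable std::mutex mutex_;
    std::vector<Range> ranges_;
};
```

Locking the read path matters as much as the write path here, since iterating while another thread inserts or erases is a data race even if the reader never modifies anything.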

vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)

Fix cuda build error

test


Co-authored-by: 0cc4m picard12@live.de

Co-authored-by: firecoperana

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request

May 6, 2026

@jeffbolznv

phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request

May 29, 2026

@jeffbolznv

fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request

May 30, 2026

@jeffbolznv

AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request

Jun 2, 2026

@jeffbolznv

