vulkan: lock accesses of pinned_memory vector by jeffbolznv · Pull Request #14333 · ggml-org/llama.cpp

github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels

Jun 22, 2025

gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request

Jun 30, 2025

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 2, 2025

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 2, 2025

CANN: Enable labeler for Ascend NPU (#13914)
add geglu activation function (#14074)

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

sycl: Add reorder to Q6_K mmvq implementation (#13885)
Add Reorder to Q6_K mmvq implementation
Address PR comments: clean up comments
Remove unused parameter after refactoring q4_k
Adding inline to function and removing unnecessary reference to int

Signed-off-by: nscipione nicolo.scipione@codeplay.com

server : fix LRU check (#14079)

ggml-ci

webui: fix sidebar being covered by main content (#14082)
webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

webui: update index.html.gz

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

CANN: Simplify the environment variable setting(#13104)
Simplify the environment variable setting to specify the memory pool type.
Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
update
fix CI
update
delete whitespace
fix according to review
update CANN.md
update CANN.md
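
The case-insensitive matching described for GGML_CANN_ASYNC_MODE can be illustrated with a minimal sketch. The helper name parse_enabled_flag is hypothetical and is not taken from the CANN backend; only the accepted spellings (yes, enable, 1, on) come from the commit message.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not the backend's actual code): returns true when
// `value` is one of the accepted "enabled" spellings, ignoring case.
static bool parse_enabled_flag(std::string value) {
    // Lower-case a copy of the input so the comparison is case-insensitive.
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "yes" || value == "enable" || value == "1" || value == "on";
}
```

A caller would typically read the variable with std::getenv and treat a null pointer as disabled before passing the string to this helper.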
graph : fix geglu (#14077)

ggml-ci

cuda : fix device sync on buffer clear (#14033)
ggml-cpu : split arch-specific implementations (#13892)
move ggml-cpu-aarch64 to repack
split quantize_row_q8_0/1
split helper functions
split ggml_vec_dot_q4_0_q8_0
split ggml_vec_dot_q4_1_q8_1
split ggml_vec_dot_q5_0_q8_0
split ggml_vec_dot_q5_1_q8_1
split ggml_vec_dot_q8_0_q8_0
split ggml_vec_dot_tq1_0_q8_K
split ggml_vec_dot_tq2_0_q8_K
split ggml_vec_dot_q2_K_q8_K
split ggml_vec_dot_q3_K_q8_K
split ggml_vec_dot_q4_K_q8_K
split ggml_vec_dot_q5_K_q8_K
split ggml_vec_dot_q6_K_q8_K
split ggml_vec_dot_iq2_xxs_q8_K
split ggml_vec_dot_iq2_xs_q8_K
split ggml_vec_dot_iq2_s_q8_K
split ggml_vec_dot_iq3_xxs_q8_K
split ggml_vec_dot_iq3_s_q8_K
split ggml_vec_dot_iq1_s_q8_K
split ggml_vec_dot_iq1_m_q8_K
split ggml_vec_dot_iq4_nl_q8_0
split ggml_vec_dot_iq4_xs_q8_K
fix typos
fix missing prototypes
rename ggml-cpu-quants.c
rename ggml-cpu-traits
rename arm folder
move cpu-feats-x86.cpp
rename ggml-cpu-hbm
update arm detection macro in quants.c
move iq quant tables
split ggml_quantize_mat_q8_0/K
split ggml_gemv_*
split ggml_gemm_*
rename namespace aarch64 to repack
use weak aliases to replace test macros
rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
rename more aarch64 to repack
clean up rebase leftover
fix compilation errors
remove trailing spaces
try to fix clang compilation errors
try to fix clang compilation errors again
try to fix clang compilation errors, 3rd attempt
try to fix clang compilation errors, 4th attempt
try to fix clang compilation errors, 5th attempt
try to fix clang compilation errors, 6th attempt
try to fix clang compilation errors, 7th attempt
try to fix clang compilation errors, 8th attempt
try to fix clang compilation errors, 9th attempt
more cleanup
fix compilation errors
fix apple targets
fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

llama : allow building all tests on windows when not using shared libs (#13980)
llama : allow building all tests on windows when not using shared libraries
add static windows build to ci
tests : enable debug logs for test-chat

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

kv-cache : fix shift and defrag logic (#14081)
kv-cache : fix shift

ggml-ci

cont : reset shift[i]

ggml-ci

cont : fix defrag erasing cells that didn't move

ggml-ci

metal : use less stack memory in FA kernel (#14088)
metal : use less stack memory in FA kernel

ggml-ci

cont : fix BF16 variant
Add in-build ggml::ggml ALIAS library (ggml/1260)

Enable uniform linking with subproject and with find_package.

sync : ggml

ggml-ci

rpc : nicer error messages for RPC server crash (#14076)
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099)
ggml : fix weak alias win32 (whisper/0)

ggml-ci

sync : ggml

ggml-ci

Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104)
vulkan: force device 0 in CI (#14106)
llama : support GEGLU for jina-bert-v2 (#14090)
convert : fix duplicate key DeepSeek-R1 conversion error (#14103)
kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
kv-cache : avoid modifying recurrent cells when setting inputs
kv-cache : remove inp_s_mask

It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.

kv-cache : fix non-consecutive token pos warning for recurrent models

The problem was apparently caused by how the tail cells were swapped.

graph : simplify logic for recurrent state copies
kv-cache : use cell without src refs for rs_z in recurrent cache
llama-graph : fix recurrent state copy

The state_copy shuffle assumes everything is moved at once, which is not true when states_extra is copied back to the cache before copying the range of states between head and head + n_seqs. This is only a problem if any of the cells in [head, head + n_seqs) have an src in [head + n_seqs, head + n_kv), which does happen when n_ubatch > 1 in the llama-parallel example.

Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
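
The ordering hazard described above can be shown with a toy model. This is only a sketch under simplifying assumptions: each cell holds a single int, the cells being rebuilt start at index 0, and the function name update_cells is hypothetical. It is not the llama-graph code.

```cpp
#include <cstddef>
#include <vector>

// Toy model: cells [0, n_seqs) are rebuilt by gathering cache[src[i]], and
// cells [n_seqs, n_seqs + extra.size()) receive the values in `extra`.
// If a source index points into the "extra" range, the result depends on
// whether the gather runs before or after the extra states are written back.
static std::vector<int> update_cells(std::vector<int> cache,
                                     const std::vector<std::size_t> & src,
                                     const std::vector<int> & extra,
                                     bool gather_first) {
    const std::size_t n_seqs = src.size();
    std::vector<int> gathered(n_seqs);

    auto gather = [&]() {
        for (std::size_t i = 0; i < n_seqs; ++i) gathered[i] = cache[src[i]];
    };
    auto write_extra = [&]() {
        for (std::size_t j = 0; j < extra.size(); ++j) cache[n_seqs + j] = extra[j];
    };

    if (gather_first) { gather(); write_extra(); }   // sources are read before being overwritten
    else              { write_extra(); gather(); }   // sources may already be overwritten

    for (std::size_t i = 0; i < n_seqs; ++i) cache[i] = gathered[i];
    return cache;
}
```

With cache {10, 20, 30, 40}, src {2, 3} and extra {99, 98}, gathering first keeps the original values 30 and 40 in the first two cells, while writing the extra states first copies 99 and 98 instead.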

llama-graph : rename n_state to state_size in build_recurrent_state

This naming should reduce confusion between the state size and the number of states.

opencl: add mul_mv_id_q4_0_f32_8x_flat (#14003)
vulkan: Track descriptor pools/sets per-context (#14109)

Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.

kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121)
server : pass default --keep argument (#14120)
kv-cache : relax SWA masking condition (#14119)

ggml-ci

webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
webui: Wrap long numbers instead of infinite horizontal scroll
Use tailwind class
update index.html.gz
vulkan: Better thread-safety for command pools/buffers (#14116)

This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.

tests : add test-tokenizers-repo (#14017)
chore : clean up relative source dir paths (#14128)
Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)
ggml-cpu: Factor out feature detection build from x86
ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on the GGML_USE_ macros, which need to be set in CMake, instead of the GGML_ options that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.

ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

This works like x86; however, to pass arch flags around within CMake, we use GGML_INTERNAL_ variables, as we don't have GGML_ ones.

Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.
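
A scoring function of the kind mentioned above could look roughly like the following. This is a hedged sketch, not the ggml implementation: the struct backend_variant, the feature bitmask encoding, and the scoring rule (zero for unusable variants, otherwise higher for more required features) are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical description of one compiled backend variant.
struct backend_variant {
    std::string name;              // e.g. "armv8.2_1"
    std::uint32_t required_features; // bitmask of CPU features the build assumes
};

// Count the set bits in x.
static int popcount32(std::uint32_t x) {
    int n = 0;
    while (x) { x &= x - 1; ++n; }
    return n;
}

// Score 0 means "cannot run here"; otherwise more required features score higher.
static int score_variant(const backend_variant & v, std::uint32_t cpu_features) {
    if ((v.required_features & ~cpu_features) != 0) return 0;
    return 1 + popcount32(v.required_features);
}

// Return the name of the best usable variant, or an empty string if none fits.
static std::string pick_variant(const std::vector<backend_variant> & variants,
                                std::uint32_t cpu_features) {
    std::string best;
    int best_score = 0;
    for (const auto & v : variants) {
        const int s = score_variant(v, cpu_features);
        if (s > best_score) { best_score = s; best = v.name; }
    }
    return best;
}
```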

ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch, which restores the previous behavior.

common: fix issue with regex_escape routine on windows (#14133)
context : round n_tokens to next multiple of n_seqs when reserving (#14140)

This fixes RWKV inference which otherwise failed when the worst case ubatch.n_seq_tokens rounded to 0.
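
As a rough illustration, rounding up to the next multiple can be written as below. The helper name is hypothetical, and the sketch assumes that the per-sequence token count is obtained by integer division of n_tokens by n_seqs, which is why a small n_tokens could otherwise yield 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round n_tokens up to the next multiple of n_seqs so
// that n_tokens / n_seqs is at least 1 whenever n_tokens > 0.
static std::uint32_t round_up_to_multiple(std::uint32_t n_tokens, std::uint32_t n_seqs) {
    assert(n_seqs > 0); // a zero divisor would be a caller bug
    return ((n_tokens + n_seqs - 1) / n_seqs) * n_seqs;
}
```

For example, 1 token with 3 sequences is rounded up to 3, so the per-sequence count becomes 1 instead of 0.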

kv-cache : fix split_equal handling in unified implementation (#14130)

ggml-ci

cmake : handle whitespaces in path during metal build (#14126)
cmake : handle whitespaces in path during metal build

ggml-ci

cont : proper fix

ggml-ci

Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com

batch : remove logits_all flag (#14141)

ggml-ci

context : simplify output counting logic during decode (#14142)
batch : remove logits_all flag

ggml-ci

context : simplify output counting logic during decode

ggml-ci

cont : fix comments
server : re-enable SWA speculative decoding (#14131)

ggml-ci

readme : remove project status link (#14149)
sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125)
vocab : prevent heap overflow when vocab is too small (#14145)

ggml-ci

cmake : Improve build-info.cpp generation (#14156)
cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index changes.

cmake: generate build-info.cpp in build dir
SYCL: Bump oneMath commit (#14152)

Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669 which adds SYCL-Graph support for recording CUDA BLAS commands.

With this change the MUL_MAT tests now pass on DPC++ CUDA backends with SYCL-Graph enabled. Prior to this change, an error would be thrown.

$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2

UR CUDA ERROR:
        Value:           700
        Name:            CUDA_ERROR_ILLEGAL_ADDRESS
        Description:     an illegal memory access was encountered
        Function:        operator()
        Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154

Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
  in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

sycl: Adding additional cpy dbg print output (#14034)
server : fix SWA condition for full context reprocess (#14163)

ggml-ci

pooling : make cls_b and cls_out_b optional (#14165)

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
readme : remove survey link (#14168)
batch : rework llama_batch_allocr (#14153)
batch : rework llama_batch_allocr

ggml-ci

cont : move validation inside class

ggml-ci

cont : move output counting to class

ggml-ci

cont : minor

ggml-ci

batch : add TODOs

ggml-ci

docs : Update multimodal.md (#14122)
Update multimodal.md
Update multimodal.md
batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

cont : improve seq_id display
Merge commit from fork
vocab : prevent integer overflow during load
Add static cast and GGML_ABORT

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sycl: fix docker image (#14144)
vocab : fix build (#14175)

ggml-ci

compare-llama-bench: add option to plot (#14169)
compare llama-bench: add option to plot
Address review comments: convert case + add type hints
Add matplotlib to requirements
fix tests
Improve comment and fix assert condition for test
Add back default test_name, add --plot_log_scale
use log_scale regardless of x_values
llama-chat : Do not throw when tool parsing fails (#14012)

Currently, when a model generates output that looks like a tool call but is invalid, an exception is thrown and not handled, causing the CLI or llama-server to bail. Instead, handle the chat parser exception and simply return the generated text in such cases.
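
The fallback behavior can be sketched as follows. Everything here is illustrative: chat_result, parse_tool_call_strict, and the <tool_call> markers are hypothetical stand-ins and do not reflect the actual llama-chat parser or its formats.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical result type for this sketch.
struct chat_result {
    std::string content;
    bool is_tool_call;
};

// Hypothetical strict parser: throws when the text starts like a tool call
// but is not terminated properly.
static chat_result parse_tool_call_strict(const std::string & text) {
    if (text.rfind("<tool_call>", 0) == 0) {
        if (text.find("</tool_call>") == std::string::npos) {
            throw std::runtime_error("malformed tool call");
        }
        return {text, true};
    }
    return {text, false};
}

// Lenient wrapper: on a parse failure, return the generated text unchanged
// instead of letting the exception propagate to the caller.
static chat_result parse_tool_call_lenient(const std::string & text) {
    try {
        return parse_tool_call_strict(text);
    } catch (const std::exception &) {
        return {text, false};
    }
}
```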

Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com

docs : remove WIP since PR has been merged (#13912)
batch : auto-gen positions + verify multi-sequence input (#14177)
batch : verify multi-sequence input batches

ggml-ci

cont : auto-gen positions + verify multi-seq input

ggml-ci

cont : first print debug info, then perform validation

ggml-ci

cont : fix position auto-gen + add comments

ggml-ci

cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)

ggml-ci

model : add dots.llm1 architecture support (#14044) (#14118)

Adds:

Dots1Model to convert_hf_to_gguf.py
Computation graph code to llama-model.cpp
Chat template to llama-chat.cpp to detect this model's template.

The architecture is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally).

The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:

The model architecture is a combination of Qwen and Deepseek parts, as seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

kv-cache : fix use-after-move of defrag info (#14189)

ggml-ci

HIP: Replace usage of deprecated preprocessor macro AMDGCN_WAVEFRONT_SIZE (#14183)
CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196)
quantize : change int to unsigned int for KV overrides (#14197)
server : When listening on a unix domain socket don't print http:// and port (#14180)

Instead show something like this:

main: server is listening on file.sock - starting the main loop
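
A minimal sketch of how such a message could be built is shown below. The function name and the is_unix_socket parameter are hypothetical, and the exact wording of the non-socket message is an assumption inferred from the commit title; only the unix-socket form is quoted from the commit message.

```cpp
#include <string>

// Hypothetical helper: the caller is assumed to know whether `address`
// names a unix domain socket file or a network host.
static std::string listening_message(const std::string & address, int port, bool is_unix_socket) {
    const std::string where = is_unix_socket
        ? address                                           // socket path only
        : "http://" + address + ":" + std::to_string(port); // scheme, host and port
    return "main: server is listening on " + where + " - starting the main loop";
}
```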

Signed-off-by: Eric Curtin ecurtin@redhat.com

model : Add support for Arcee AI's upcoming AFM model (#14185)
Add Arcee AFM support
Add draft update code
Fix linter and update URL, may still not be final
Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

Remove accidental blank line

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

ggml-cpu : rework weak alias on apple targets (#14146)
ggml-cpu : rework weak alias on apple targets
fix powerpc detection
fix ppc detection
fix powerpc detection on darwin
vulkan: mutex around vkQueueSubmit (#14127)

This fixes the remaining crash in test-thread-safety on my system.
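
The general pattern of serializing submissions to a shared queue can be sketched without any Vulkan calls. The types below are stand-ins invented for illustration; the real change guards vkQueueSubmit, whose queue argument must not be used from several threads at once.

```cpp
#include <mutex>
#include <vector>

// Stand-in for a device queue shared by several contexts (hypothetical type).
struct queue_sketch {
    std::mutex mtx;              // guards every access to `submitted`
    std::vector<int> submitted;  // stand-in for work handed to the driver
};

// Submit one unit of work while holding the queue's mutex, so that
// concurrent callers cannot interleave inside the submission.
static void submit_locked(queue_sketch & q, int work_id) {
    std::lock_guard<std::mutex> guard(q.mtx);
    q.submitted.push_back(work_id);
}
```

The lock is held only for the duration of the submission itself, which keeps contention low when several threads record their own command buffers independently.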

gguf-py : allow key override when adding value to GGUFWriter (#14194)

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

convert : remove arcee change in convert_hf_to_gguf_update.py (#14207)
ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206)
llama : rework embeddings logic (#14208)
llama : rework embeddings logic

ggml-ci

cont : fix rerank

ggml-ci

cont : engrish [no ci]
cont : fix rerank

ggml-ci

server : support both embeddings and completions with single model

ggml-ci

cont : avoid embeddings_org

ggml-ci

HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202)
model : add NeoBERT (#14164)
convert neobert model to gguf
add inference graph
fix flake8 lint
followed reviewer suggestions

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

follow reviewers suggestions

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

override NeoBERT feed-forward length

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com

cmake: clean up external project logic for vulkan-shaders-gen (#14179)
Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time
llama : add thread safety test (#14035)
llama : add thread safety test
llamafile : remove global state
llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0 GPU devices are not used

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

server : fix incorrect usage of llama_get_embeddings() (#14225)
server : fix incorrect usage of llama_get_embeddings()

ggml-ci

cont : fix the fix

ggml-ci

common : suggest --jinja when autodetection fails (#14222)
musa: fix build warning (unused variable) (#14231)

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

ggml-cpu : remove the weak alias trick (#14221)
cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen
examples : include examples in msvc disable warn (ggml/1270)

This commit adds the examples in the "list" of targets to ignore MSVC warnings.

The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.

ggml : remove unused ggml_context_container (ggml/1272)

This commit removes the unused ggml_context_container structure from the ggml library. It looks like the usage of this struct was removed in Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc ggml_contexts on the heap (whisper/2525)").

The motivation for this change is to improve code clarity/readability.

ggml : disable warnings for tests when using MSVC (ggml/1273)
ggml : disable warnings for tests when using MSVC

This commit disables warnings for tests on windows when using MSVC.

The motivation for this is that it brings the build output more in line with what Linux/macOS systems produce.

There is still one warning generated for the tests which is:

  Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line  warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
  test-arange.cpp
  test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe

ggml : fix typo in tests disable list
sync : ggml

ggml-ci

convert : fix null head_dim AutoConfig regression (#14248)
llama-chat : fix multiple system message for gemma, orion (#14246)
mtmd : refactor llava-uhd preprocessing logic (#14247)
mtmd : refactor llava-uhd preprocessing logic
fix editorconfig
ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258)
ggml-cpu: fix uncaught underscore terminators (#14023)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reduce asm calls for hsum (#14037)

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x build documentation (#14264)
docs: add s390x-specific build docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x model conversion steps

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x build indent

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update hyperlinks for s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update llama.h docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x add accelerator and perf optimizations

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x indent blocks

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: revert block indentation

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add support information for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x reword

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: remove indentation for accelerator section s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: remove redundant words s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: reword for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x reword simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: fix trailing whitespace for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : add mean kernel (#14267)
metal : add mean kernel

ggml-ci

cont : dedup implementation

ggml-ci

memory : Hybrid recurrent cache (#13979)
feat: Add llama_model_is_hybrid API call

Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch, with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
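
The delegation pattern described in this commit can be sketched as follows. The enum values and struct are hypothetical placeholders rather than the real llm_arch or llama_model definitions; the point is only that the arch-level predicate can be called before a model object exists.

```cpp
// Hypothetical architecture kinds for this sketch.
enum class arch_sketch { attention_only, recurrent_only, hybrid };

// Arch-level predicates: usable before any model has been constructed.
static bool arch_is_recurrent(arch_sketch arch) { return arch == arch_sketch::recurrent_only; }
static bool arch_is_hybrid(arch_sketch arch)    { return arch == arch_sketch::hybrid; }

// Hypothetical model type that only records its architecture.
struct model_sketch {
    arch_sketch arch;
};

// Model-level predicates simply delegate to the arch-level ones.
static bool model_is_recurrent(const model_sketch & m) { return arch_is_recurrent(m.arch); }
static bool model_is_hybrid(const model_sketch & m)    { return arch_is_hybrid(m.arch); }
```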

feat: Add c++ side constants for attention layer indices hparam

Branch: GraniteFour

feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: rename *_is_hybrid -> *_is_hybrid_recurrent

The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add layer filter to recurrent cache

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer sizing everywhere in kv caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: First pass at llama_kv_cache_hybrid_recurrent

This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches.

This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Construct hybrid recurrent cache for hybrid recurrent models

This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix wrong bool condition for split equal in hybrid cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix shift logic to defer to unified cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Support hybrid recurrent in llama-graph

NOTE: I intentionally did not add support for s_mask since it will be going away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix logic for initializing inputs and attn layers for hybrid caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Update recurrent cache for changes to remove intermediate kv_cache interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix status for init_update sig for recurrent cache state

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Add missing padding to n_ctx for hybrid cache construction

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Update clear signature for data argument after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove errant virtual destructor leftover from previous impl attempt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove n_embd_k/v_s from unified cache

No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

refactor: Remove layer index from n_embd_k/v_s

Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove n_embd_k/v_gqa from recurrent cache

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Allow custom layer filters for hybrid recurrent

This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove logits_all after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove llama_model_is_hybrid_Recurrent public API

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Use llama_memory_state_ptr for child states in hybrid memory state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:

Rename class llm_graph_input_s_copy -> llm_graph_input_rs
Add a corresponding llm_graph_input_rs_hybrid_recurrent
Rename build_inp_s_copy -> build_rs_inp_recurrent
Add a corresponding build_rs_inp_hybrid_recurrent
Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input

This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix resize vs reserve and skip null tensors in size computation

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada

fix: Fix initialization of child states

Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Use a common build_recurrent_state method that is cache-agnostic

This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

recurrent : rework graph inputs + add TODOs

ggml-ci

refactor: Make status and child states const in hybrid and iswa

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache

This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor!: Rename all k/v related values for recurrent/hybrid to r/s

Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: _recurrent -> _recr for brevity

It just happens to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix spacing for ref

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: recurrent_layer() -> is_recurrent()

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix spacing for size_s_bytes declaration

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249)
llamafile : support s390x SIMD instruction set (#14273)
convert : fix remote option in Windows (#14100)
llama-bench : add --no-warmup flag (#14224) (#14270)

Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.

Add no_warmup boolean field to cmd_params struct
Add --no-warmup command-line argument parsing
Add help text documentation for the new flag
Wrap existing warmup logic in conditional check
Maintain full backward compatibility (warmup enabled by default)
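The steps listed above (a boolean field, a command-line flag, and a conditional around the warmup run) can be sketched roughly as follows. This is a minimal illustration in Python, not the llama-bench source; `parse_bench_args`, `run_benchmark`, and the two callbacks are hypothetical names.

```python
import argparse


def parse_bench_args(argv):
    """Parse a hypothetical benchmark command line with a --no-warmup flag."""
    parser = argparse.ArgumentParser(prog="bench-sketch")
    # store_true keeps the default False, so warmup stays enabled by default
    parser.add_argument("--no-warmup", action="store_true",
                        help="skip warmup runs before benchmarking")
    return parser.parse_args(argv)


def run_benchmark(args, warmup, measure):
    """Run the warmup callback unless --no-warmup was given, then measure."""
    if not args.no_warmup:
        warmup()
    return measure()
```

Because the flag defaults to off, existing invocations behave as before, which is the backward-compatibility property the commit message describes.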

Addresses #14224

sycl: Cleanup codepaths in Get Rows in sycl backend (#14215)

Addresses unused reorder path

build : suppress gcc15 compile warnings (#14261)
Change _contains_any() substrs to std::string_view and fix the find comparison logic.
server : add server parameters for draft model cache type (#13782)

Co-authored-by: aa956 27946957+aa956@users.noreply.github.com

gguf-py : make sentencepiece optional (#14200)
Make sentencepiece optional
Bump to 0.18.0
Bump patch instead of minor

Co-authored-by: compilade git@compilade.net

ggml-cpu : remove unnecessary arm feature detection (#14281)

Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.

CUDA: add conv_2d_dw (#14265)
CUDA: add conv_2d_dw
better naming
simplify using template
Review: fix operation ordering in ggml-cuda, use forceinline, use more const
ubatch : new splitting logic (#14217)

ggml-ci

model : more uniform output id handling (#14275)
model : more uniform output id handling

ggml-ci

cont : revert n_outputs < n_tokens optimization

ggml-ci

cont : fix out_ids initialization

ggml-ci

ggml: Update KleidiAI to v1.9.0 (#14277)
ggml : fix repack work size for mul_mat_id (#14292)

ggml-ci

cuda : synchronize graph capture and cublas handle destruction (#14288)

Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread

llama : improve sep token handling (#14272)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
Add PowerPC feature detection and scoring
ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.

Co-authored-by: Diego Devesa slarengh@gmail.com

sycl: add usage of enqueue_functions extension (#14244)
Add header and namespace to use enqueue_functions extension
Convert submit and parallel_for to use new extension in convert.cpp
Convert submit and parallel_for to use extension in ggml-sycl.cpp
Convert submit and parallel_for to use extension in gla.cpp
Convert submit and parallel_for in mmq.cpp
Convert submit and parallel_for in mmvq.cpp
Convert submit and parallel_for in remaining files
Convert all simple parallel_for to nd_launch from enqueue_functions extension
Wrapping extension in general function

Create a general function that uses the enqueue_functions extension if it is enabled in the compiler, and otherwise calls the general SYCL function to launch kernels.

Signed-off-by: nscipione nicolo.scipione@codeplay.com

vocab : prevent tokenizer overflow (#14301)
vocab : prevent stack overflow in tokenize
vocab : return error instead of aborting on oversized token count
vocab : INT32_MIN from llama_tokenize on overflow
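The overflow handling named in these commits can be pictured as a return-code convention. The sketch below is a Python illustration under stated assumptions, not the llama.cpp implementation: `tokenize_into` and `max_count` are hypothetical, and the "negative required count when the buffer is too small" branch is an assumed convention rather than something stated in this log. Only the INT32_MIN-on-overflow result comes from the commit title.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def tokenize_into(tokens, capacity, max_count=INT32_MAX):
    """Return (status, written_tokens) for a caller-supplied buffer capacity.

    status > 0         : number of tokens written
    status < 0         : assumed convention, negated count the caller must allocate
    status = INT32_MIN : the count does not fit in a 32-bit result (overflow)
    """
    n = len(tokens)
    if n > max_count:
        # Report an error value instead of aborting the process
        return INT32_MIN, []
    if n > capacity:
        return -n, []
    return n, list(tokens)
```

Returning a sentinel lets the caller decide how to recover, instead of the library aborting on oversized input.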
lint : remove trailing whitespace (#14304)
CUDA: add conv_2d_transpose (#14287)
CUDA: add conv_2d_transpose
remove direct include of cuda_fp16
Review: add brackets for readability, remove ggml_set_param and add asserts
docs : fix the link to llama.h (#14293)
Add ggml_roll (ggml/1274)
ggml : add ggml_roll
use set/get_op_params & std::min
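The log only names ggml_roll, so the following is a sketch under the assumption that it is a circular shift along one dimension, in the spirit of a "roll" operation in array libraries. The helper `roll` and its shift direction (positive moves elements toward higher indices) are illustrative choices, not the ggml signature.

```python
def roll(values, shift):
    """Circularly shift a 1-D list; elements pushed off one end wrap around."""
    n = len(values)
    if n == 0:
        return []
    s = shift % n  # normalise negative and oversized shifts
    if s == 0:
        return list(values)
    return values[-s:] + values[:-s]
```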
sync : ggml

ggml-ci

convert : fix Llama 4 conversion (#14311)
memory : rename interface to llama_memory_context_i (#14296)
memory : rename interface to llama_memory_context_i

ggml-ci

cont : fix comments
cont : use "mctx" for referencing a memory context

ggml-ci

metal : fix thread-safety (#14300)

ggml-ci

gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.
gguf-py : fix Qwen3-Embedding eos token (#14314)
CUDA: add mean operation (#14313)
CUDA: add mean operation
add back sum_rows_f32_cuda
Review: early exit if col!=0
common : use std::string_view now that we target c++17 (#14319)
mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326)

Mistral Small 2506 models using Pixtral vision encoder were running out of GPU memory when processing images larger than 1024x1024 pixels due to exponential memory growth from unlimited image size.

This fix applies the same 1024x1024 limit used by Qwen2VL models to prevent OOM issues while maintaining compatibility with existing models.
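One way to picture a 1024-pixel cap is as a bound on the longest image side before encoding. The sketch below assumes an aspect-preserving downscale; the actual fix may simply clamp a model hyperparameter, and `cap_image_size` is a hypothetical helper.

```python
def cap_image_size(width, height, max_side=1024):
    """Scale (width, height) down so the longest side is at most max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit, leave untouched
    scale = max_side / longest
    # Keep at least one pixel per side after rounding
    return max(1, round(width * scale)), max(1, round(height * scale))
```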

HIP: enable vec fattn on RDNA4 (#14323)
examples : fix is_first logic for tokenization (#14329)

ggml-ci

run : avoid double tokenization (#14327)
run : avoid double tokenization by adopting common_tokenize heuristic
build : fix windows gcc and clang warnings
lint : fixed trailing whitespace
run : fix is_first flag
gguf-py : fix SpecialVocab parsing when post_processor is null (#14330)
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
vulkan: update windows SDK in CI (#14334)
kv-cells : fix tracking of seq_pos (#14339)
kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

cont : improve error message

ggml-ci

cont : add more comments
CUDA: mul_mat_v support for batch sizes > 1 (#14262)
CUDA: mul_mat_v support for batch sizes > 1
use 64 bit math for initial offset calculation
llama : better rwkv chat template and add missing inputs.use_jinja setting (#14336)
llama-cli : add missing inputs.use_jinja setting

Signed-off-by: Molly Sophia mollysophia379@gmail.com

llama : better legacy chat template for rwkv

Signed-off-by: Molly Sophia mollysophia379@gmail.com

vulkan: update windows SDK in release.yml (#14344)
ci: add workflow for relocatable cmake package (#14346)
CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)

Co-authored-by: Johannes Gäßler johannesg@5d6.de

jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja (#14349)

This will allow the use of tools on the llama-server

main : honor --verbose-prompt on interactive prompts (#14350)
server : move no API key doc to /health (#14352)
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362)
batch : fix check for empty sequences in memory (#14364)
batch : fix check for empty sequences in memory

ggml-ci

cont : reuse the var

ggml-ci

opencl: ref count ggml_backend_opencl_context and refactor profiling (#14254)
Move profiling info into ggml_backend_opencl_context
Add enqueue_ndrange_kernel to launch kernel
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

ggml-cpu: better variable names

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

docs: update s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rework noop

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix typo

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add missing func

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: diagnose why NNPA macro is not being defined

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: update macro tests

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test more macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add debug prints

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

ggml-cpu: move things around

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fix ggml time initialization
fix f32_f16 table init
remove extra line

Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup

ggml-ci

metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
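The row-scatter semantics described here can be shown with plain nested lists. This is a minimal Python sketch of the idea only: it ignores broadcasting, quantized destinations, and the I64 index type mentioned in the follow-up commits, and `set_rows` is an illustrative stand-in rather than the ggml function.

```python
def set_rows(a, b, c):
    """Copy row i of b into row c[i] of a, in place, and return a."""
    if len(b) != len(c):
        raise ValueError("need exactly one destination index per source row")
    for src_row, dst_index in zip(b, c):
        if not 0 <= dst_index < len(a):
            raise IndexError(f"row index {dst_index} out of range")
        if len(src_row) != len(a[dst_index]):
            raise ValueError("source and destination rows differ in length")
        a[dst_index] = list(src_row)
    return a
```

For example, with three zero rows in `a`, source rows `[[1, 2], [3, 4]]` and indices `[2, 0]`, row 2 of `a` receives `[1, 2]` and row 0 receives `[3, 4]`.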

ref: #8366

use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation

ggml-ci

ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR

ggml-ci

ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

vulkan: lock accesses of pinned_memory vector (#14333)
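The change named in this title, guarding a shared container with a lock, can be sketched as follows. This is a Python illustration under assumptions, not the Vulkan backend code: `PinnedMemoryRegistry`, its methods, and the `(base, size)` layout are hypothetical. The point is only that every read and write of the shared list happens while holding the same mutex.

```python
import threading


class PinnedMemoryRegistry:
    """Hypothetical registry of pinned host allocations shared across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._regions = []  # list of (base_address, size) tuples

    def add(self, base, size):
        with self._lock:  # writers take the lock
            self._regions.append((base, size))

    def remove(self, base):
        with self._lock:
            self._regions = [r for r in self._regions if r[0] != base]

    def find(self, addr):
        """Return (base, offset) of the region containing addr, or None."""
        with self._lock:  # readers take the same lock while iterating
            for base, size in self._regions:
                if base <= addr < base + size:
                    return base, addr - base
        return None
```

Holding the lock during lookups as well as updates avoids iterating the container while another thread resizes it.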
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL

Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
Add detection logic and basic fusion logic in ggml-vulkan.
Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
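The use-count check described above can be sketched on a toy graph. The node layout (dicts with `name`, `op`, `srcs`) and both helper names are assumptions made for illustration; the real logic lives in ggml and ggml-vulkan and tracks counts differently.

```python
from collections import Counter


def use_counts(nodes):
    """Count how many times each tensor name is consumed as an input."""
    counts = Counter()
    for node in nodes:
        for src in node["srcs"]:
            counts[src] += 1
    return counts


def can_fuse_rms_norm_mul(nodes, i):
    """True if node i is an rms_norm whose only consumer is the mul at i + 1."""
    if i + 1 >= len(nodes):
        return False
    norm, mul = nodes[i], nodes[i + 1]
    if norm["op"] != "rms_norm" or mul["op"] != "mul":
        return False
    # The norm output may be either operand of the mul (operands can be swapped)
    if norm["name"] not in mul["srcs"]:
        return False
    # Fusing is only safe when nothing else reads the intermediate result
    return use_counts(nodes)[norm["name"]] == 1
```

If a third node also read the rms_norm output, the count would be 2 and the pair would be left unfused, since the intermediate tensor would still have to be materialised.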

extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups

style fixes

last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter

Co-authored-by: slaren slarengh@gmail.com

ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels

ggml-ci

add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints

ggml-ci

64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]

ggml-ci

Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
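The gated ops in this series share one shape: an activation applied to one half of the data, multiplied elementwise by the other half. The sketch below is plain Python under assumptions: which half is activated in the fused layout, and the helper names `glu_split` and `glu_fused`, are illustrative choices. The GELU shown is the common tanh approximation.

```python
import math


def relu(x):
    return max(0.0, x)


def silu(x):
    return x / (1.0 + math.exp(-x))


def gelu_tanh(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def glu_split(gate, up, act):
    """Split layout: gate and up are separate rows (REGLU/SWIGLU/GEGLU by act)."""
    return [act(g) * u for g, u in zip(gate, up)]


def glu_fused(row, act, swapped=False):
    """Fused layout: one row holds both halves; swapped flips their roles."""
    half = len(row) // 2
    first, second = row[:half], row[half:]
    gate, up = (second, first) if swapped else (first, second)
    return glu_split(gate, up, act)
```

Passing `relu`, `silu`, or `gelu_tanh` as `act` gives the REGLU, SWIGLU, and GEGLU variants respectively.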

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
Replacing direct kernel calls with calls to these inlined functions.
Using __dpct_inline__ to encourage compiler inlining.
Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
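The interaction between command-line defaults, per-request overrides, and the prefill restriction can be sketched as below. The `chat_template_kwargs` field name and the `enable_thinking` key come from the commit messages above; the precedence (request over command line) and the helper `resolve_template_kwargs` are assumptions for illustration, not the server's code.

```python
import json


def resolve_template_kwargs(cli_default_json, request_body):
    """Merge template kwargs from a CLI JSON string and a request dict."""
    kwargs = dict(json.loads(cli_default_json)) if cli_default_json else {}
    # Assumed precedence: values sent by the client override CLI defaults
    kwargs.update(request_body.get("chat_template_kwargs", {}))

    messages = request_body.get("messages", [])
    has_prefill = bool(messages) and messages[-1].get("role") == "assistant"
    if has_prefill and kwargs.get("enable_thinking") is True:
        raise ValueError("assistant prefill cannot be combined with enable_thinking")
    return kwargs
```

A request body such as `{"messages": [...], "chat_template_kwargs": {"enable_thinking": false}}` would then turn thinking off for that request only.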

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml

Modify docker.yml so that the workflow stops running periodically; it can still be started manually when needed.

Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels

ggml-ci

cont : disable for q4_1

ggml-ci

cont : disable for iq4_nl

ggml-ci

memory : correctly handle failure in apply() (#14438)

ggml-ci

Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
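The difference between the two bilinear modes is in how an output index maps back to a source coordinate. The 1-D sketch below uses the usual conventions: with align-corners the first and last samples of input and output coincide, and otherwise a half-pixel-centre mapping is used. Whether ggml uses exactly this half-pixel formula in its default mode is an assumption, and both function names are hypothetical.

```python
def source_coord(dst_index, in_size, out_size, align_corners):
    """Map an output index to a (fractional) input coordinate."""
    if align_corners:
        if out_size == 1:
            return 0.0
        return dst_index * (in_size - 1) / (out_size - 1)
    # Half-pixel centres: sample positions sit in the middle of each pixel
    return (dst_index + 0.5) * in_size / out_size - 0.5


def interpolate_1d(values, out_size, align_corners=False):
    """Linear resampling of a 1-D list; works for upscaling and downscaling."""
    n = len(values)
    out = []
    for i in range(out_size):
        x = source_coord(i, n, out_size, align_corners)
        x = min(max(x, 0.0), n - 1)  # clamp to the valid range
        x0 = int(x)
        x1 = min(x0 + 1, n - 1)
        t = x - x0
        out.append(values[x0] * (1.0 - t) + values[x1] * t)
    return out
```

Downscaling `[0.0, 10.0, 20.0]` to two samples keeps the endpoints when align-corners is on, and samples inside the range when it is off.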
sync : ggml

ggml-ci

ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon 757486878@qq.com

Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon 757486878@qq.com

fix editorconfig

Signed-off-by: noemotiovon 757486878@qq.com

Add Vulkan images to docker.md (#14472)

Right now it's not easy to find those.

ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI

ggml-ci

cont : remove -g flag

ggml-ci

ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
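Returning the previous callback from the setter is what makes chaining possible: a new handler can store the old one and call it after doing its own work. The Python sketch below shows the pattern only; `set_abort_callback` and `abort` are hypothetical stand-ins, and the real ggml function aborts the process rather than raising an exception.

```python
_abort_callback = None


def set_abort_callback(callback):
    """Install a callback and return the one that was installed before."""
    global _abort_callback
    previous = _abort_callback
    _abort_callback = callback
    return previous


def abort(message):
    """Give the callback a chance to show or save something, then fail."""
    if _abort_callback is not None:
        _abort_callback(message)
    # Stand-in for terminating the process
    raise RuntimeError(message)
```

An application without a console could install a handler that shows a dialog, then forwards to whatever handler was registered before it.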

Co-authored-by: Diego Devesa slarengh@gmail.com

Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com Signed-off-by: Eric Curtin ecurtin@redhat.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Molly Sophia mollysophia379@gmail.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: noemotiovon 757486878@qq.com Co-authored-by: Yuanhao Ji jiyuanhao@apache.org Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Kai Pastor dg0yt@darc.de Co-authored-by: Isaac McFadyen isaac@imcf.me Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Juk Armstrong 69222624+jukofyork@users.noreply.github.com Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Taylor quantumtraveling@gmail.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Anton Mitkov anton.mitkov@codeplay.com Co-authored-by: Ewan Crawford ewan@codeplay.com Co-authored-by: ddpasa 112642920+ddpasa@users.noreply.github.com Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Svetlozar Georgiev 55534064+sgeor255@users.noreply.github.com Co-authored-by: Piotr piotr.stankiewicz@docker.com Co-authored-by: Pepijn de Vos me@pepijndevos.nl Co-authored-by: Mikko Juola 
mikjuo@gmail.com Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Eric Curtin ecurtin@redhat.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: bashayer hijji bashayer.hijji@gmail.com Co-authored-by: Anton Mitkov anton_b_mitkov@abv.bg Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-…

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 5, 2025

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 6, 2025

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 6, 2025

Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 8, 2025

add geglu activation function (#14074)

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

sycl: Add reorder to Q6_K mmvq implementation (#13885)
Add Reorder to Q6_K mmvq implementation
Address PR comments: clean up comments
Remove unused parameter after refactoring q4_k
Adding inline to function and removing unnecessary reference to int

Signed-off-by: nscipione nicolo.scipione@codeplay.com

webui: fix sidebar being covered by main content (#14082)
webui: fix sidebar being covered by main content

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

webui: update index.html.gz

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

CANN: Simplify the environment variable setting(#13104)
Simplify the environment variable setting to specify the memory pool type.
Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
update
fix CI
update
delete whitespace
fix according to review
update CANN.md
update CANN.md
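
The case-insensitive matching described for GGML_CANN_ASYNC_MODE could look roughly like the sketch below. The helper name `parse_truthy` is hypothetical and is not taken from the CANN backend; only the accepted values (yes, enable, 1, on) come from the commit message.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not the CANN backend's code): returns true when `value`
// is one of "yes", "enable", "1" or "on", compared case-insensitively.
static bool parse_truthy(std::string value) {
    // Lower-case a copy of the input so the comparison ignores case.
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "yes" || value == "enable" || value == "1" || value == "on";
}
```

A caller would typically read the variable with `std::getenv` and treat a null result as "not set" before passing the string to such a helper.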
graph : fix geglu (#14077)

ggml-ci

ggml-cpu : split arch-specific implementations (#13892)
move ggml-cpu-aarch64 to repack
split quantize_row_q8_0/1
split helper functions
split ggml_vec_dot_q4_0_q8_0
split ggml_vec_dot_q4_1_q8_1
split ggml_vec_dot_q5_0_q8_0
split ggml_vec_dot_q5_1_q8_1
split ggml_vec_dot_q8_0_q8_0
split ggml_vec_dot_tq1_0_q8_K
split ggml_vec_dot_tq2_0_q8_K
split ggml_vec_dot_q2_K_q8_K
split ggml_vec_dot_q3_K_q8_K
split ggml_vec_dot_q4_K_q8_K
split ggml_vec_dot_q5_K_q8_K
split ggml_vec_dot_q6_K_q8_K
split ggml_vec_dot_iq2_xxs_q8_K
split ggml_vec_dot_iq2_xs_q8_K
split ggml_vec_dot_iq2_s_q8_K
split ggml_vec_dot_iq3_xxs_q8_K
split ggml_vec_dot_iq3_s_q8_K
split ggml_vec_dot_iq1_s_q8_K
split ggml_vec_dot_iq1_m_q8_K
split ggml_vec_dot_iq4_nl_q8_0
split ggml_vec_dot_iq4_xs_q8_K
fix typos
fix missing prototypes
rename ggml-cpu-quants.c
rename ggml-cpu-traits
rename arm folder
move cpu-feats-x86.cpp
rename ggml-cpu-hbm
update arm detection macro in quants.c
move iq quant tables
split ggml_quantize_mat_q8_0/K
split ggml_gemv_*
split ggml_gemm_*
rename namespace aarch64 to repack
use weak aliases to replace test macros
rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
rename more aarch64 to repack
clean up rebase leftover
fix compilation errors
remove trailing spaces
try to fix clang compilation errors
try to fix clang compilation errors again
try to fix clang compilation errors, 3rd attempt
try to fix clang compilation errors, 4th attempt
try to fix clang compilation errors, 5th attempt
try to fix clang compilation errors, 6th attempt
try to fix clang compilation errors, 7th attempt
try to fix clang compilation errors, 8th attempt
try to fix clang compilation errors, 9th attempt
more cleanup
fix compilation errors
fix apple targets
fix a typo in arm version of ggml_vec_dot_q4_K_q8_K

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

llama : allow building all tests on windows when not using shared libs (#13980)
llama : allow building all tests on windows when not using shared libraries
add static windows build to ci
tests : enable debug logs for test-chat

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sync : ggml

ggml-ci

Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099)
ggml : fix weak alias win32 (whisper/0)

ggml-ci

sync : ggml

ggml-ci

vulkan: force device 0 in CI (#14106)
llama : support GEGLU for jina-bert-v2 (#14090)
convert : fix duplicate key DeepSeek-R1 conversion error (#14103)
kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
kv-cache : avoid modifying recurrent cells when setting inputs
kv-cache : remove inp_s_mask

It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.

kv-cache : fix non-consecutive token pos warning for recurrent models

The problem was apparently caused by how the tail cells were swapped.

graph : simplify logic for recurrent state copies
kv-cache : use cell without src refs for rs_z in recurrent cache
llama-graph : fix recurrent state copy

Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.

llama-graph : rename n_state to state_size in build_recurrent_state

This naming should reduce confusion between the state size and the number of states.

opencl: add mul_mv_id_q4_0_f32_8x_flat (#14003)
vulkan: Track descriptor pools/sets per-context (#14109)

kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121)
kv-cache : relax SWA masking condition (#14119)

ggml-ci

webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
webui: Wrap long numbers instead of infinite horizontal scroll
Use tailwind class
update index.html.gz
vulkan: Better thread-safety for command pools/buffers (#14116)

tests : add test-tokenizers-repo (#14017)
chore : clean up relative source dir paths (#14128)
Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)
ggml-cpu: Factor out feature detection build from x86
ggml-cpu: Add ARM feature detection and scoring

This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_ which need to be set in cmake, instead of GGML_ that users would set for x86.

This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.

ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM

Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_ as we don't have GGML_.

Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.

ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now

The other platforms will need their own specific variants.

kv-cache : fix split_equal handling in unified implementation (#14130)

ggml-ci

batch : remove logits_all flag (#14141)

ggml-ci

context : simplify output counting logic during decode (#14142)
batch : remove logits_all flag

ggml-ci

context : simplify output counting logic during decode

ggml-ci

cont : fix comments
cmake : Improve build-info.cpp generation (#14156)
cmake: Simplify build-info.cpp generation

The rebuild of build-info.cpp still gets triggered when .git/index changes.

cmake: generate build-info.cpp in build dir
pooling : make cls_b and cls_out_b optional (#14165)

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp

cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
batch : rework llama_batch_allocr (#14153)
batch : rework llama_batch_allocr

ggml-ci

cont : move validation inside class

ggml-ci

cont : move output counting to class

ggml-ci

cont : minor

ggml-ci

batch : add TODOs

ggml-ci

batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
batch : add LLAMA_BATCH_DEBUG environment variable

ggml-ci

cont : improve seq_id display
Merge commit from fork
vocab : prevent integer overflow during load
Add static cast and GGML_ABORT

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

vocab : fix build (#14175)

ggml-ci

batch : auto-gen positions + verify multi-sequence input (#14177)
batch : verify multi-sequence input batches

ggml-ci

cont : auto-gen positions + verify multi-seq input

ggml-ci

cont : first print debug info, then perform validation

ggml-ci

cont : fix position auto-gen + add comments

ggml-ci

cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)

ggml-ci

model : add dots.llm1 architecture support (#14044) (#14118)

Adds:

Dots1Model to convert_hf_to_gguf.py
Computation graph code to llama-model.cpp
Chat template to llama-chat.cpp to detect this model's template.

The architecture is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally).

The only models that follow this architecture as of the writing of this commit are "dots.llm1.inst" and "dots.llm1.base" from here:

The model architecture is a combination of Qwen and Deepseek parts, as seen here:

https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py

kv-cache : fix use-after-move of defrag info (#14189)

ggml-ci

model : Add support for Arcee AI's upcoming AFM model (#14185)
Add Arcee AFM support
Add draft update code
Fix linter and update URL, may still not be final
Update src/llama-model.cpp

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

Remove accidental blank line

Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com

ggml-cpu : rework weak alias on apple targets (#14146)
ggml-cpu : rework weak alias on apple targets
fix powerpc detection
fix ppc detection
fix powerpc detection on darwin
vulkan: mutex around vkQueueSubmit (#14127)

This fixes the remaining crash in test-thread-safety on my system.
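
The idea of serializing queue submission can be sketched as follows. Vulkan requires host access to the queue passed to vkQueueSubmit to be externally synchronized, so concurrent submitters need a lock. The types and function names here (`shared_queue`, `submit`, `run_concurrent_submits`) are hypothetical stand-ins, and a plain vector takes the place of real GPU work so the example compiles without the Vulkan headers.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a queue shared between threads; a real backend would hold a VkQueue here.
struct shared_queue {
    std::mutex mtx;
    std::vector<int> submitted; // records submissions in place of real GPU work
};

// Hypothetical wrapper: the lock is held for the whole duration of the submit call.
static void submit(shared_queue & q, int work_id) {
    std::lock_guard<std::mutex> guard(q.mtx);
    q.submitted.push_back(work_id); // in real code this is where vkQueueSubmit would go
}

// Starts n_threads threads that each submit n_per_thread items and
// returns how many submissions were recorded in total.
static std::size_t run_concurrent_submits(int n_threads, int n_per_thread) {
    shared_queue q;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&q, t, n_per_thread] {
            for (int i = 0; i < n_per_thread; ++i) {
                submit(q, t * n_per_thread + i);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return q.submitted.size();
}
```

Without the `lock_guard`, the concurrent `push_back` calls would be a data race, which is the same class of problem the commit addresses for the real queue.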

convert : remove arcee change in convert_hf_to_gguf_update.py (#14207)
ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206)
llama : rework embeddings logic (#14208)
llama : rework embeddings logic

ggml-ci

cont : fix rerank

ggml-ci

cont : engrish [no ci]
cont : fix rerank

ggml-ci

server : support both embeddings and completions with single model

ggml-ci

cont : avoid embeddings_org

ggml-ci

model : add NeoBERT (#14164)
convert neobert model to gguf
add inference graph
fix flake8 lint
followed reviewer suggestions

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

follow reviewers suggestions

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

override NeoBERT feed-forward length

Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com

cmake: clean up external project logic for vulkan-shaders-gen (#14179)
Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time
llama : add thread safety test (#14035)
llama : add thread safety test
llamafile : remove global state
llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0, GPU devices are not used

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

server : fix incorrect usage of llama_get_embeddings() (#14225)
server : fix incorrect usage of llama_get_embeddings()

ggml-ci

cont : fix the fix

ggml-ci

ggml-cpu : remove the weak alias trick (#14221)
cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen
examples : include examples in msvc disable warn (ggml/1270)

This commit adds the examples in the "list" of targets to ignore MSVC warnings.

The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.

ggml : disable warnings for tests when using MSVC (ggml/1273)
ggml : disable warnings for tests when using MSVC

This commit disables warnings for tests on windows when using MSVC.

The motivation for this is that this brings the build output more in line with what Linux/MacOS systems produce.

There is still one warning generated for the tests which is:

  Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line  warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
  test-arange.cpp
  test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe

ggml : fix typo in tests disable list
sync : ggml

ggml-ci

convert : fix null head_dim AutoConfig regression (#14248)
ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258)
docs: add s390x build documentation (#14264)
docs: add s390x-specific build docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x model conversion steps

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x build indent

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update hyperlinks for s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update llama.h docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x add accelerator and perf optimizations

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x indent blocks

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: revert block indentation

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add support information for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x reword

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: remove indentation for accelerator section s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: remove redundant words s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: reword for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: s390x reword simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: fix trailing whitespace for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : add mean kernel (#14267)
metal : add mean kernel

ggml-ci

cont : dedup implementation

ggml-ci

memory : Hybrid recurrent cache (#13979)
feat: Add llama_model_is_hybrid API call

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add c++ side constants for attention layer indices hparam

Branch: GraniteFour

feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: rename *_is_hybrid -> *_is_hybrid_recurrent

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add layer filter to recurrent cache

Branch: HybridCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer sizing everywhere in kv caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: First pass at llama_kv_cache_hybrid_recurrent

This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Construct hybrid recurrent cache for hybrid recurrent models

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix wrong bool condition for split equal in hybrid cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix shift logic to defer to unified cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Support hybrid recurrent in llama-graph

NOTE: I intentionally did not add support for s_mask since it will be going away soon

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix logic for initializing inputs and attn layers for hybrid caches

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Update recurrent cache for changes to remove intermediate kv_cache interface

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix status for init_update sig for recurrent cache state

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Add missing padding to n_ctx for hybrid cache construction

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Update clear signature for data argument after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove errant virtual destructor leftover from previous impl attempt

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove n_embd_k/v_s from unified cache

No longer needed now that unified isn't also supporting recurrent

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069

Branch: HybridRecurrentCache

refactor: Remove layer index from n_embd_k/v_s

Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove n_embd_k/v_gqa from recurrent cache

This is no longer needed now that there are separate implementations

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Allow custom layer filters for hybrid recurrent

This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove logits_all after rebase

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove llama_model_is_hybrid_Recurrent public API

https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Use llama_memory_state_ptr for child states in hybrid memory state

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738

This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:

Rename class llm_graph_input_s_copy -> llm_graph_input_rs
Add a corresponding llm_graph_input_rs_hybrid_recurrent
Rename build_inp_s_copy -> build_rs_inp_recurrent
Add a corresponding build_rs_inp_hybrid_recurrent
Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix resize vs reserve and skip null tensors in size computation

https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada

fix: Fix initialization of child states

Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Use a common build_recurrent_state method that is cache-agnostic

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

recurrent : rework graph inputs + add TODOs

ggml-ci

refactor: Make status and child states const in hybrid and iswa

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor!: Rename all k/v related values for recurrent/hybrid to r/s

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: _recurrent -> _recr for brevity

It just happens to have the same number of letters as _attn!

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix spacing for ref

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: recurrent_layer() -> is_recurrent()

Branch: HybridRecurrentCache

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix spacing for size_s_bytes declaration

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249)
llamafile : support s390x SIMD instruction set (#14273)
convert : fix remote option in Windows (#14100)
build : suppress gcc15 compile warnings (#14261)
Change _contains_any() substrs to std::string_view and fix the find comparison logic.
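
A substring check over `std::string_view` values might be written as in the sketch below. This is not the code from the commit: the function name `contains_any` and its signature are assumptions for illustration, and the comparison against `npos` shows one common way a `find` result is tested correctly.

```cpp
#include <initializer_list>
#include <string_view>

// Hypothetical helper: returns true if `str` contains any of `substrs`.
static bool contains_any(std::string_view str,
                         std::initializer_list<std::string_view> substrs) {
    for (std::string_view s : substrs) {
        // find() returns npos when the substring is absent. A match at index 0
        // is still a match, so the result is compared against npos instead of
        // being used directly as a boolean.
        if (str.find(s) != std::string_view::npos) {
            return true;
        }
    }
    return false;
}
```

Taking `std::string_view` avoids constructing temporary `std::string` objects for each candidate substring.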
server : add server parameters for draft model cache type (#13782)

Co-authored-by: aa956 27946957+aa956@users.noreply.github.com

ggml-cpu : remove unnecessary arm feature detection (#14281)

Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.

CUDA: add conv_2d_dw (#14265)
CUDA: add conv_2d_dw
better naming
simplify using template
Review: fix operation ordering in ggml-cuda, use forceinline, use more const
ubatch : new splitting logic (#14217)

ggml-ci

model : more uniform output id handling (#14275)
model : more uniform output id handling

ggml-ci

cont : revert n_outputs < n_tokens optimization

ggml-ci

cont : fix out_ids initialization

ggml-ci

ggml: Update KleidiAI to v1.9.0 (#14277)
ggml : fix repack work size for mul_mat_id (#14292)

ggml-ci

cuda : synchronize graph capture and cublas handle destruction (#14288)

Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread.

llama : improve sep token handling (#14272)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
Add PowerPC feature detection and scoring
ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
ggml-cpu: Delay some initializations until function is called

When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.
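
One way to delay an initialization until first use is a function-local static, sketched below. The names (`compute_table_value`, `get_table_value`, `init_count`) are hypothetical and only illustrate the pattern; the actual initializations changed by the commit are not shown here.

```cpp
// Counts how often the initializer runs, for demonstration only.
static int init_count = 0;

// Hypothetical expensive or CPU-feature-dependent initializer.
static int compute_table_value() {
    ++init_count;
    return 42;
}

// The static local is initialized on the first call to this function,
// not when the library is loaded, so merely loading the backend does
// not execute compute_table_value().
static int get_table_value() {
    static const int value = compute_table_value();
    return value;
}
```

A namespace-scope `static const int value = compute_table_value();` would instead run at load time, which is the situation the commit message describes as risky for dynamically loaded backends.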

Co-authored-by: Diego Devesa slarengh@gmail.com

sycl: add usage of enqueue_functions extension (#14244)
Add header and namespace to use enqueue_functions extension
Convert submit and parallel_for to use new extension in convert.cpp
Convert submit and parallel_for to use extension in ggml-sycl.cpp
Convert submit and parallel_for to use extension in gla.cpp
Convert submit and parallel_for in mmq.cpp
Convert submit and parallel_for in mmvq.cpp
Convert submit and parallel_for in remaining files
Convert all simple parallel_for to nd_launch from enqueue_functions extension
Wrapping extension in general function

Create a general function that enables the enqueue_functions extension if it is enabled in the compiler, and otherwise calls the general SYCL function to launch kernels.

Signed-off-by: nscipione nicolo.scipione@codeplay.com

vocab : prevent tokenizer overflow (#14301)
vocab : prevent stack overflow in tokenize
vocab : return error instead of aborting on oversized token count
vocab : INT32_MIN from llama_tokenize on overflow
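
The overflow convention named in the last item can be illustrated with a small sketch. The helper `token_count_result` is hypothetical and is not llama.cpp code; it assumes a convention where a count that fits the caller's buffer is returned as-is, a count that exceeds the buffer is returned negated, and a count that cannot be represented as a 32-bit integer yields INT32_MIN.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper illustrating the return convention (not llama.cpp code).
static int32_t token_count_result(std::size_t n_tokens, int32_t n_tokens_max) {
    // A count above INT32_MAX cannot be reported, even negated: signal overflow.
    if (n_tokens > static_cast<std::size_t>(INT32_MAX)) {
        return INT32_MIN;
    }
    const int32_t n = static_cast<int32_t>(n_tokens);
    // Negative result tells the caller how many tokens would have been needed.
    return n > n_tokens_max ? -n : n;
}
```

Checking the size before the cast matters because converting an out-of-range value to `int32_t` would silently produce a wrong, possibly positive, count.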
lint : remove trailing whitespace (#14304)
CUDA: add conv_2d_transpose (#14287)
CUDA: add conv_2d_transpose
remove direct include of cuda_fp16
Review: add brackets for readability, remove ggml_set_param and add asserts
Add ggml_roll (ggml/1274)
ggml : add ggml_roll
use set/get_op_params & std::min
sync : ggml

ggml-ci

convert : fix Llama 4 conversion (#14311)
memory : rename interface to llama_memory_context_i (#14296)
memory : rename interface to llama_memory_context_i

ggml-ci

cont : fix comments
cont : use "mctx" for referencing a memory context

ggml-ci

metal : fix thread-safety (#14300)

ggml-ci

gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.
gguf-py : fix Qwen3-Embedding eos token (#14314)
CUDA: add mean operation (#14313)
CUDA: add mean operation
add back sum_rows_f32_cuda
Review: early exit if col!=0
HIP: enable vec fattn on RDNA4 (#14323)
examples : fix is_first logic for tokenization (#14329)

ggml-ci

run : avoid double tokenization (#14327)
run : avoid double tokenization by adopting common_tokenize heuristic
build : fix windows gcc and clang warnings
lint : fixed trailing whitespace
run : fix is_first flag
gguf-py : fix SpecialVocab parsing when post_processor is null (#14330)
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
vulkan: update windows SDK in CI (#14334)
kv-cells : fix tracking of seq_pos (#14339)
kv-cells : fix tracking of seq_pos during cache reuse

ggml-ci

cont : improve error message

ggml-ci

cont : add more comments
CUDA: mul_mat_v support for batch sizes > 1 (#14262)
CUDA: mul_mat_v support for batch sizes > 1
use 64 bit math for initial offset calculation
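
Why 64-bit math matters for an offset calculation can be shown with a minimal host-side sketch. The function name `element_offset` and its parameters are hypothetical; the actual kernel code is not reproduced here.

```cpp
#include <cstdint>

// Hypothetical helper: computes row * stride without 32-bit overflow.
static int64_t element_offset(int32_t row, int32_t stride) {
    // Widen one operand first so the multiplication itself is done in 64 bits.
    // Writing `int64_t off = row * stride;` would multiply in 32 bits and
    // overflow before the result is widened.
    return static_cast<int64_t>(row) * stride;
}
```

For example, with `row = 70000` and `stride = 70000` the product is 4900000000, which does not fit in a signed 32-bit integer but is represented exactly in 64 bits.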
ci: add workflow for relocatable cmake package (#14346)
CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)

Co-authored-by: Johannes Gäßler johannesg@5d6.de

cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362)
batch : fix check for empty sequences in memory (#14364)
batch : fix check for empty sequences in memory

ggml-ci

cont : reuse the var

ggml-ci

opencl: ref count ggml_backend_opencl_context and refactor profiling (#14254)
Move profiling info into ggml_backend_opencl_context
Add enqueue_ndrange_kernel to launch kernel
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

ggml-cpu: better variable names

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

docs: update s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rework noop

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix typo

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add missing func

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: diagnose why NNPA macro is not being defined

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: update macro tests

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test more macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add debug prints

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

ggml-cpu: move things around

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fix ggml time initialization
fix f32_f16 table init
remove extra line

Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup

ggml-ci

metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
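
As a rough illustration of the semantics described above, the sketch below copies row i of 'b' into the row of 'a' selected by c[i]. It is plain C++ and does not call the ggml API: the helper name set_rows_ref, the flat row-major layout, and the bounds checks are assumptions made for this example only. The 64-bit index type follows the "use I64 for indices" note in this commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical reference helper (not part of ggml): copies each row of `b`
// into the row of `a` selected by the matching index in `c`.
// `a` and `b` are flat row-major buffers with `n_cols` floats per row.
void set_rows_ref(std::vector<float> & a,
                  const std::vector<float> & b,
                  const std::vector<int64_t> & c,
                  size_t n_cols) {
    if (n_cols == 0 || b.size() < c.size() * n_cols) {
        throw std::invalid_argument("source buffer too small for the given indices");
    }
    const size_t n_rows_a = a.size() / n_cols;
    for (size_t i = 0; i < c.size(); ++i) {
        if (c[i] < 0 || static_cast<size_t>(c[i]) >= n_rows_a) {
            throw std::out_of_range("row index out of range");
        }
        const size_t dst = static_cast<size_t>(c[i]) * n_cols; // start of destination row
        const size_t src = i * n_cols;                         // start of source row
        for (size_t j = 0; j < n_cols; ++j) {
            a[dst + j] = b[src + j];
        }
    }
}
```

With a 3x2 destination, a 2x2 source, and indices {2, 0}, the first source row lands in destination row 2 and the second in row 0, while row 1 is left untouched.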

ref: #8366

use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation

ggml-ci

ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR

ggml-ci

ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL

Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
Add detection logic and basic fusion logic in ggml-vulkan.
Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
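
To make the idea of the fusion concrete, here is a minimal CPU sketch of what a fused RMS_NORM+MUL computes for one row: the row is normalized and multiplied by a second tensor in the same pass, so the normalized intermediate is never stored separately. This is not the Vulkan shader; the function name rms_norm_mul_ref and the eps parameter are assumptions for illustration. Such a fusion is only safe when the RMS_NORM output has a single consumer, which is what the use count described above is meant to detect.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical CPU reference (not the Vulkan shader): RMS-normalizes one row
// `x` and multiplies it element-wise by `w` in the same pass.
std::vector<float> rms_norm_mul_ref(const std::vector<float> & x,
                                    const std::vector<float> & w,
                                    float eps) {
    if (w.size() < x.size()) {
        throw std::invalid_argument("w must have at least as many elements as x");
    }
    std::vector<float> out(x.size());
    if (x.empty()) {
        return out;
    }
    double sum_sq = 0.0;
    for (float v : x) {
        sum_sq += static_cast<double>(v) * v;
    }
    // scale = 1 / sqrt(mean(x^2) + eps)
    const float mean_sq = static_cast<float>(sum_sq / static_cast<double>(x.size()));
    const float scale   = 1.0f / std::sqrt(mean_sq + eps);
    for (size_t i = 0; i < x.size(); ++i) {
        out[i] = x[i] * scale * w[i]; // normalize and multiply in one step
    }
    return out;
}
```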

extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups

style fixes

last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter

Co-authored-by: slaren slarengh@gmail.com

ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels

ggml-ci

add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints

ggml-ci

64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]

ggml-ci

Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
Replacing direct kernel calls with calls to these inlined functions.
Using __dpct_inline__ to encourage compiler inlining.
Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.

Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels

ggml-ci

cont : disable for q4_1

ggml-ci

cont : disable for iq4_nl

ggml-ci

memory : correctly handle failure in apply() (#14438)

ggml-ci

Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
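
For readers unfamiliar with the "align corners" option, the sketch below shows the coordinate mapping commonly used for it in bilinear resampling. The commit messages above do not spell out the formula, so both branches are assumptions based on the usual convention, and src_coord is a hypothetical helper rather than a ggml function.

```cpp
// Hypothetical helper: maps an output index to a (fractional) source
// coordinate along one axis. Assumed convention, not taken from ggml:
//  - align_corners: the first and last samples of source and destination coincide
//  - otherwise:     pixel centers are aligned (half-pixel offset)
inline float src_coord(int dst, int n_src, int n_dst, bool align_corners) {
    if (align_corners) {
        // Guard against division by zero when the output has a single sample.
        return n_dst > 1 ? dst * float(n_src - 1) / float(n_dst - 1) : 0.0f;
    }
    return (dst + 0.5f) * float(n_src) / float(n_dst) - 0.5f;
}
```

The same mapping works for downscaling, since nothing in it requires n_dst to be larger than n_src.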
sync : ggml

ggml-ci

ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI

ggml-ci

cont : remove -g flag

ggml-ci

ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
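
The chaining pattern mentioned above can be sketched as follows. All names here (abort_callback_t, set_abort_callback, report_fatal, my_abort_handler) are hypothetical stand-ins written for this example and are not the ggml declarations; the point is only that the setter returns the previously installed callback so a new handler can forward to it.

```cpp
#include <string>

// Hypothetical stand-ins for illustration; not the ggml declarations.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Installs a new callback and returns the previous one so callers can chain.
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

static abort_callback_t g_prev_callback = nullptr;
static std::string      g_last_message;

// Example handler: records the message (an app without a console could show
// a dialog here), then forwards to whichever callback was installed before.
void my_abort_handler(const char * message) {
    g_last_message = message;
    if (g_prev_callback) {
        g_prev_callback(message);
    }
}

// Stand-in for the point just before the library would call abort().
void report_fatal(const char * message) {
    if (g_abort_callback) {
        g_abort_callback(message);
    }
    // A real implementation would call abort() here.
}
```

A caller would store the return value of set_abort_callback(my_abort_handler) in g_prev_callback before any fatal error can occur.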

Co-authored-by: Diego Devesa slarengh@gmail.com

github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition

ggml-ci

cont : fix n_ctx_used computation

ggml-ci

opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)

ggml-ci

vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

ggml : add ggml_commit()

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sync : ggml

ggml-ci

llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This would otherwise cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
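
The underlying C++ rule can be shown with a small sketch. The types below are hypothetical and only mirror the shape of the problem: when an object of a derived class with extra fields is deleted through a base-class pointer, the base destructor must be virtual, otherwise the derived destructor is not run and the deallocation size can mismatch. Declaring the base destructor virtual, as the separate "make llm_graph_context destructor virtual" change in this list does, removes that restriction.

```cpp
#include <memory>

// Hypothetical types for illustration; they are not the llama.cpp classes.
struct graph_context {
    // Virtual: deleting through a base pointer runs the derived destructor
    // and frees the full derived-size allocation.
    virtual ~graph_context() = default;
};

static int g_derived_destroyed = 0;

struct mamba_graph : graph_context {
    int extra_state[4] = {0, 0, 0, 0}; // extra field: derived object is larger than the base
    ~mamba_graph() override { ++g_derived_destroyed; }
};

// Creates the derived object but hands it back as a base pointer,
// which is how the object is later destroyed.
std::unique_ptr<graph_context> make_graph() {
    return std::make_unique<mamba_graph>();
}
```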

cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)

ggml-ci

ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3

ggml-ci

backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows

ggml-ci

graph : separate k and v indices

ggml-ci

cont : remove redundant ifs

ggml-ci

kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments

ggml-ci

ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci

convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)

Co-authored-by: luyuhong luyuhong@kylinos.cn

batch : add n_used count (#14512)

ggml-ci

graph : prepare for 4D mask (#14515)

ggml-ci

batch : add optional for sequential equal split (#14511)

ggml-ci

metal : disable fast math in all quantize kernels (#14528)

ggml-ci

test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

refactor

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

remove visitor nonsense
remove visitor comment

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
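
A minimal sketch of this kind of packing is shown below. The bit layout (flag in bit 16, value in the low 16 bits) and the function names are assumptions chosen for illustration; the actual shader layout may differ. The point is that two small fields share one 32-bit word, saving four bytes in the push constant block.

```cpp
#include <cstdint>

// Hypothetical layout for illustration: the low 16 bits hold n_head_log2 and
// bit 16 holds the "mask present" flag. The actual layout may differ.
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    return (static_cast<uint32_t>(has_mask ? 1u : 0u) << 16) | (n_head_log2 & 0xFFFFu);
}

// Recovers the flag from the packed word.
inline bool unpack_has_mask(uint32_t packed) {
    return ((packed >> 16) & 1u) != 0;
}

// Recovers n_head_log2 from the packed word.
inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & 0xFFFFu;
}
```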

handle null mask for gqa
allow gqa with dim3>1

Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Mikko Juola mikjuo@gmail.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-by: Ruikai Peng retr0@retr0.blog Co-authored-by: Acly aclysia@gmail.com Co-authored-by: Daniel Han danielhanchen@gmail.com Co-authored-by: Markus Tavenrath mtavenrath@users.noreply.github.com 
Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Johannes Gäßler johannesg@5d6.de Co-authored-by: Mathieu Baudier mbaudier@argeo.org Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Radoslav Gerganov rgerganov@gmail.com Co-authored-by: Weizhao Ouyang weizhao.ouyang@arm.com Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Renat rntk@users.noreply.github.com Co-authored-by: matteo matteo.serva@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com Co-authored-by: Vedran Miletić vedran@miletic.net Co-authored-by: xiaobing318 71554036+xiaobing318@users.noreply.github.com Co-authored-by: Romain Biessy romain.biessy@codeplay.com Co-authored-by: Björn Ganster mail@bjoern-ganster.de Co-authored-by: Eric Zhang 34133756+EZForever@users.noreply.github.com Co-authored-by: zhouwg zhouwg2000@gmail.com Co-authored-by: Rotem Dan rotemdan@gmail.com Co-authored-by: luyhcsu 110711054+luyhcsu@users.noreply.github.com Co-authored-by: luyuhong luyuhong@kylinos.cn

qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request

Jul 10, 2025

olek-tether pushed a commit to tetherto/qvac-fabric-llm.cpp that referenced this pull request

Aug 15, 2025

sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

ggml-cpu: better variable names

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

docs: update s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rework noop

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix typo

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add missing func

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: diagnose why NNPA macro is not being defined

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: update macro tests

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test more macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add debug prints

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

ggml-cpu: move things around

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fix ggml time initialization
fix f32_f16 table init
remove extra line

Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup

ggml-ci

metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: #8366
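
To illustrate the row-scatter semantics described above, here is a minimal reference sketch in plain C++. It is not the ggml implementation: the name `set_rows_ref`, the flat row-major float layout, and the lack of broadcasting or quantized destinations are assumptions made purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference for the semantics only (not ggml code):
// for each row r of `b`, copy that row into row c[r] of `a`.
// `a` and `b` are flat row-major buffers holding `row_len` floats per row.
void set_rows_ref(std::vector<float> & a,
                  const std::vector<float> & b,
                  const std::vector<int64_t> & c,
                  std::size_t row_len) {
    assert(row_len > 0);
    assert(b.size() == c.size() * row_len);
    const std::size_t n_rows_a = a.size() / row_len;
    for (std::size_t r = 0; r < c.size(); ++r) {
        // Every index must name an existing destination row.
        assert(c[r] >= 0 && static_cast<std::size_t>(c[r]) < n_rows_a);
        const std::size_t dst = static_cast<std::size_t>(c[r]) * row_len;
        for (std::size_t i = 0; i < row_len; ++i) {
            a[dst + i] = b[r * row_len + i];
        }
    }
}
```

With a 3x2 destination, source rows {1, 2} and {3, 4}, and indices {2, 0}, the first source row lands in destination row 2 and the second in row 0; row 1 is left untouched.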

use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation

ggml-ci

ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR

ggml-ci

ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL

Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
Add detection logic and basic fusion logic in ggml-vulkan.
Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
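
As a rough sketch of what the fusion above amounts to, the following plain C++ compares an unfused RMS_NORM followed by MUL with a single fused pass over one row. All function names are hypothetical, broadcasting is ignored, and the single-use check is a simplification of the detection logic described in the commit message, not the actual ggml-vulkan code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference (unfused) RMS norm over one row: y[i] = x[i] / sqrt(mean(x^2) + eps).
std::vector<float> rms_norm_ref(const std::vector<float> & x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale;
    }
    return y;
}

// Reference element-wise multiply (same-shape operands only).
std::vector<float> mul_ref(const std::vector<float> & a, const std::vector<float> & b) {
    assert(a.size() == b.size());
    std::vector<float> y(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        y[i] = a[i] * b[i];
    }
    return y;
}

// Fused variant: normalizes and multiplies in one pass, with no intermediate row.
std::vector<float> rms_norm_mul_fused_ref(const std::vector<float> & x,
                                          const std::vector<float> & w,
                                          float eps) {
    assert(!x.empty() && x.size() == w.size());
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale * w[i];
    }
    return y;
}

// Simplified legality check (assumption): the RMS_NORM output can be folded
// away only if the MUL is its sole consumer.
bool can_fuse_rms_norm_mul_ref(int rms_norm_use_count) {
    return rms_norm_use_count == 1;
}
```

The use count matters because the fused pass never materializes the normalized row; if another node also read that row, skipping it would change the graph's results.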

extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups

style fixes

last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter

Co-authored-by: slaren slarengh@gmail.com

ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels

ggml-ci

add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints

ggml-ci

64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]

ggml-ci

Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
Replacing direct kernel calls with calls to these inlined functions.
Using __dpct_inline__ to encourage compiler inlining.
Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.

Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels

ggml-ci

cont : disable for q4_1

ggml-ci

cont : disable for iq4_nl

ggml-ci

memory : correctly handle failure in apply() (#14438)

ggml-ci

Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
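
For readers unfamiliar with the term, the sketch below shows the usual difference between an "align corners" coordinate mapping and a half-pixel mapping when resampling one axis. These are the general conventions used by many image libraries; the function names are hypothetical and the code is not taken from ggml, whose exact formulas may differ.

```cpp
#include <cassert>

// Align-corners mapping: the first and last samples of input and output coincide,
// so output index 0 maps to input 0 and output n_out-1 maps to input n_in-1.
float src_coord_align_corners(int dst, int n_in, int n_out) {
    assert(n_in > 0 && n_out > 0 && dst >= 0 && dst < n_out);
    if (n_out == 1) {
        return 0.0f; // single output sample: avoid dividing by zero
    }
    return static_cast<float>(dst) * static_cast<float>(n_in - 1) / static_cast<float>(n_out - 1);
}

// Half-pixel mapping: samples are treated as centers of unit-width cells,
// and the cell grid (not the sample positions) is scaled.
float src_coord_half_pixel(int dst, int n_in, int n_out) {
    assert(n_in > 0 && n_out > 0 && dst >= 0 && dst < n_out);
    const float scale = static_cast<float>(n_in) / static_cast<float>(n_out);
    return (static_cast<float>(dst) + 0.5f) * scale - 0.5f;
}
```

Bilinear interpolation then blends the two input samples on either side of the returned coordinate. The same mappings apply to downscaling, where `n_out < n_in`.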
sync : ggml

ggml-ci

ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon 757486878@qq.com

Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon 757486878@qq.com

fix editorconfig

Signed-off-by: noemotiovon 757486878@qq.com

Add Vulkan images to docker.md (#14472)

Right now it's not easy to find those.

ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI

ggml-ci

cont : remove -g flag

ggml-ci

ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
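
The "return the previous callback" pattern mentioned above can be sketched as follows. This is a standalone illustration with hypothetical names (`abort_callback_t`, `set_abort_callback`, `report_fatal`), not the ggml API itself, and it omits the final `abort()` call so that it can be exercised safely.

```cpp
#include <cassert>

// Hypothetical callback type: receives the error message just before abort.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Installs `cb` and returns the previously installed callback,
// so the caller can invoke it from the new one (chaining).
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the point just before abort(): notify the callback, if any.
// A real implementation would call abort() afterwards; that is left out here.
void report_fatal(const char * message) {
    assert(message != nullptr);
    if (g_abort_callback != nullptr) {
        g_abort_callback(message);
    }
}
```

An application without a console could install a callback that shows a dialog and saves user data, then forward to the callback it replaced so that earlier handlers still run.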

Co-authored-by: Diego Devesa slarengh@gmail.com

github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition

ggml-ci

cont : fix n_ctx_used computation

ggml-ci

opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)

ggml-ci

vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

ggml : add ggml_commit()

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sync : ggml

ggml-ci

llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba

cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)

ggml-ci

ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3

ggml-ci

backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows

ggml-ci

graph : separate k and v indices

ggml-ci

cont : remove redundant ifs

ggml-ci

kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments

ggml-ci

ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci

convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)

Co-authored-by: luyuhong luyuhong@kylinos.cn

batch : add n_used count (#14512)

ggml-ci

graph : prepare for 4D mask (#14515)

ggml-ci

batch : add optional for sequential equal split (#14511)

ggml-ci

metal : disable fast math in all quantize kernels (#14528)

ggml-ci

test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

refactor

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

remove visitor nonsense
remove visitor comment

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
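
A minimal sketch of this kind of packing is shown below. The bit layout (flag in bit 16, `n_head_log2` in the low 16 bits) and the helper names are assumptions for illustration; the commit message only states that the two values share a single 32-bit word.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration: bit 16 = "mask present", bits 0..15 = n_head_log2.
constexpr uint32_t MASK_FLAG_SHIFT = 16;

// Packs the boolean and the integer into one 32-bit push-constant word.
uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 < (1u << MASK_FLAG_SHIFT)); // must fit in the low bits
    return (static_cast<uint32_t>(has_mask) << MASK_FLAG_SHIFT) | n_head_log2;
}

// Shader-side equivalents would extract the two fields like this.
bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & ((1u << MASK_FLAG_SHIFT) - 1u);
}
```

Merging two 4-byte fields into one saves 4 bytes of push-constant space, which matters when a block sits right at the 128-byte limit mentioned above.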

handle null mask for gqa
allow gqa with dim3>1
vulkan: fix rms_norm+mul fusion (#14545)

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont

ggml-ci

cont : fix multi-rope + add test

ggml-ci

sycl : try fix

ggml-ci

cont : fix sycl + clean-up cuda

ggml-ci

vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style

Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.

model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code

Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

model : fix hunyuan moe chat template (#14584)

Signed-off-by: stevenkuang stevenkuang@tencent.com

vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).

convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

flake8 fixes
Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

added hashes
Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

update the update file
Revert "update the update file"

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

d_inner fixed
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

reshaping ssm_norm for 34B
removing generate_mup
remove duplicates metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp

Co-authored-by: compilade git@compilade.net

falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

falcon-h1: fix last comment
Update convert_hf_to_gguf.py

Co-authored-by: compilade git@compilade.net

falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues

Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net

llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

support for smoldocling

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

fixed merge conflicts

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

safetensors tensor mapping

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update include/llama.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net

opencl: add set_rows for f16 and f32 (#14547)
opencl: add set_rows for f16 and f32
opencl: better choose workgroup size for set_rows
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba

This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.

ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

llama : rename llama_cache to llama_past

This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)

Still, I'm open to better suggestions.

examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add Granite 4 conversion

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Plumb bamba through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add bamba to llama_arch_is_hybrid_recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add optional mamba ssm_in bias tensor

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add template specialization for get_arr to load a vector for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Use an explicit bool to determine mamba vs mamba2

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Isolate mamba(2) and granite attention layer building in static methods

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer sizes in granite build_attention_layer

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: First (broken) pass at end-to-end Bamba implementation

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Only do Granite multipliers if set

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Pull granite ffn portion into a static function and reuse in hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat(py): Allow gguf duplicate keys if they match by value and type

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
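
The duplicate-key rule described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the gguf-py implementation: the `KVStore` class, its `data` attribute, and the string type tag are hypothetical names chosen for the example. (A later commit in this same log undoes the behaviour, on the grounds that it encourages sloppy overwriting in the converters.)

```python
from typing import Any, Dict, Tuple


class KVStore:
    """Hypothetical stand-in for a GGUF metadata writer (not the gguf-py API)."""

    def __init__(self) -> None:
        # key -> (value, type tag)
        self.data: Dict[str, Tuple[Any, str]] = {}

    def add_key_value(self, key: str, value: Any, vtype: str) -> None:
        existing = self.data.get(key)
        # A repeated key is tolerated only when both value and type match.
        if existing is not None and existing != (value, vtype):
            raise ValueError(f"duplicate key {key!r} with a different value or type")
        self.data[key] = (value, vtype)
```

With this rule, two parent classes that both set the same key to the same value do not need a try/except around every call, while a conflicting value still fails loudly.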

refactor(py): Simplify granitemoehybrid conversion to use parents better

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add GRANITE_MOE_HYBRID through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Support GRANITE_MOE_HYBRID in llama-model

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix flake8 errors

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix recurrent cache get after rebase

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins

The challenge here is to give both the non-hybrid classes (llm_build_mamba and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access to the same intermediate "base class" functionality (build_mamba*_layer, build_granite_attention_layer) without running into trouble with diamond inheritance of llm_graph_context. Due to the non-trivial initialization that happens in llm_graph_context, diamond inheritance results in multiple initializations of the common base which cause problems around the unique ptrs. I wanted to get away from self-> everywhere, but this is still a bit cleaner than making those methods static I think.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Implement the full copy-paste version to duplicate the layer builders

This follows the pattern where the type of input is pinned to the type of memory and that is used to dispatch to the correct version of build_rs / build_attn. There's a lot of code duplication that can hopefully be pulled into common functions in the graph later.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid

I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

mamba : fix mismatched new and delete size for llm_build_mamba

memory : correctly handle failure in apply()

ggml-ci

style: Remove TODO for adding first hybrid models to the switch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

docs: Fix comment about duplicate key check

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Conform to standard way of initializing inp_out_ids

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use llm_graph_context_mamba in llm_build_granite_hybrid

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins

The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

feat: Add support for dense FFN in GraniteMoeHybrid

This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
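
The equivalence noted in the commit above (zero regular experts plus one shared expert behaves like a single dense FFN) can be seen in a small sketch. Everything here is a simplifying assumption for illustration: the function name, the pre-computed routing weights, and the plain-list vectors are hypothetical and do not mirror the llama.cpp graph code.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Expert = Callable[[Vector], Vector]


def moe_with_shared(x: Vector,
                    experts: Sequence[Expert],
                    weights: Sequence[float],
                    shared: Expert) -> Vector:
    """Shared-expert output plus the weighted sum of routed expert outputs."""
    out = list(shared(x))
    for expert, w in zip(experts, weights):
        y = expert(x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out
```

When `experts` is empty the loop never runs, so the result is exactly `shared(x)`, which is the dense case.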

feat: Add support for dense FFN tensor names on c++ side

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use child inputs for Falcon H1 after merge resolution

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary prefix on tensor constants

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove use of diamond inheritance

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Log mamba params for Granite Hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unused ssm_in_b

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
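
The convention referred to in the commit above, where a layer with zero KV heads is treated as recurrent (the earlier Jamba work in this log considers the Mamba layers to have 0 KV heads), can be sketched as below. The function name and the per-layer list are hypothetical; this is only an illustration of the idea, not the hparams code.

```python
from typing import List


def recurrent_layers(n_head_kv: List[int]) -> List[bool]:
    """Mark each layer as recurrent when it has no KV heads (assumed convention)."""
    return [n == 0 for n in n_head_kv]
```

For example, a per-layer list `[0, 0, 8, 0]` would describe three recurrent layers and one attention layer at index 2, so no separate list of attention layer indices is needed.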

fix: Remove unused template expansion for get_arr

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Review cleanup in convert_hf_to_gguf

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Undo hidden warnings about duplicate identical keys in add_key_value

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: If not using ROPE, context is "infinite"

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

doc: Add a comment outlining expected duplicate key warnings

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary duplicate keys in converter

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Francis Couture-Harpin git@compilade.net Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)

ggml-ci

readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)

Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS

Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.

vulkan: optimize set_rows

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
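
For readers unfamiliar with the operation, here is a minimal CPU-side sketch of what SET_ROWS is assumed to compute: each source row is written into the destination at the row index supplied for it. This is plain Python for illustration only, not the Vulkan shader; the function name and the list-of-lists layout are hypothetical, and quantization of the written rows is left out.

```python
from typing import List, Sequence


def set_rows(dst: List[List[float]],
             src: Sequence[Sequence[float]],
             row_ids: Sequence[int]) -> List[List[float]]:
    """Scatter rows of src into dst: dst[row_ids[i]] = src[i] (assumed semantics)."""
    if len(src) != len(row_ids):
        raise ValueError("need exactly one destination index per source row")
    for row, idx in zip(src, row_ids):
        dst[idx] = list(row)
    return dst
```

Each source row is independent of the others, which is why the work can be spread across a workgroup as the commit describes.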

server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)

ggml-ci

vulkan : implement bilinear interpolation (ggml/1291)

ggml-ci

sync : ggml

ggml-ci

vulkan : remove unused vars (#0)

ggml-ci

sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)

ggml-ci

cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request

Mar 23, 2026

sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

ggml-cpu: better variable names

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

docs: update s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rework noop

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix typo

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add missing func

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: diagnose why NNPA macro is not being defined

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: update macro tests

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test more macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add debug prints

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

ggml-cpu: move things around

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fix ggml time initialization
fix f32_f16 table init
remove extra line

Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com

musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de

docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup

ggml-ci

metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
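A minimal sketch of the semantics described above, written as plain C++ over row-major float buffers rather than ggml tensors. It assumes a scatter: row i of the source is written to row idx[i] of the destination. The name set_rows_ref and its signature are hypothetical and are not part of the ggml API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference semantics only (hypothetical helper, not the ggml implementation).
// dst and src are row-major matrices with the same row width n_cols.
// Row i of src is copied into row idx[i] of dst; other rows are untouched.
void set_rows_ref(std::vector<float> & dst,
                  const std::vector<float> & src,
                  const std::vector<int64_t> & idx,
                  size_t n_cols) {
    assert(src.size() == idx.size() * n_cols);
    for (size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] >= 0);
        const size_t r = static_cast<size_t>(idx[i]);
        assert((r + 1) * n_cols <= dst.size()); // destination row must exist
        for (size_t c = 0; c < n_cols; ++c) {
            dst[r * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

The indices are 64-bit here to match the "use I64 for indices" entry in the same commit.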

ref: #8366

use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation

ggml-ci

ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR

ggml-ci

ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL

Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
Add detection logic and basic fusion logic in ggml-vulkan.
Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
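The idea behind the use count can be sketched as follows. This is a simplified CPU illustration, not the Vulkan shader or the actual ggml fusion check: the helper names are hypothetical, and the real logic has more conditions (later entries in this commit relax the single-use rule for the last node and handle swapped mul operands).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical check: the intermediate rms_norm result may be skipped only
// if the mul is its sole consumer; otherwise another node still needs it.
bool can_fuse_rms_norm_mul(int rms_norm_use_count) {
    return rms_norm_use_count == 1;
}

// Fused form: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i],
// computed without writing the normalized row to memory first.
std::vector<float> rms_norm_mul(const std::vector<float> & x,
                                const std::vector<float> & w,
                                float eps) {
    assert(!x.empty() && x.size() == w.size());
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * scale * w[i];
    }
    return y;
}
```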

extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups

style fixes

last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter

Co-authored-by: slaren slarengh@gmail.com

ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels

ggml-ci

add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints

ggml-ci

64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]

ggml-ci

Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
Replacing direct kernel calls with calls to these inlined functions.
Using __dpct_inline__ to encourage compiler inlining.
Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; to run the workflow, start it manually.

Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels

ggml-ci

cont : disable for q4_1

ggml-ci

cont : disable for iq4_nl

ggml-ci

memory : correctly handle failure in apply() (#14438)

ggml-ci

Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
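For readers unfamiliar with the term, the difference between the two bilinear modes comes down to how a destination index is mapped to a source coordinate. The 1D sketch below uses the conventional definitions (pixel-centre mapping versus aligned end points); it is an assumption that the ggml implementation follows the same formulas, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Maps destination index dst_i to a fractional source coordinate.
float src_coord(int dst_i, int n_src, int n_dst, bool align_corners) {
    if (align_corners) {
        // First and last samples of source and destination coincide.
        return n_dst > 1 ? dst_i * float(n_src - 1) / float(n_dst - 1) : 0.0f;
    }
    // Pixel-centre convention: sample positions sit at i + 0.5.
    return (dst_i + 0.5f) * float(n_src) / float(n_dst) - 0.5f;
}

// Linear interpolation along one axis; works for upscaling and downscaling.
float interp_linear_1d(const std::vector<float> & src, int n_dst, int dst_i,
                       bool align_corners) {
    assert(!src.empty() && n_dst > 0);
    const int n_src = static_cast<int>(src.size());
    float x = src_coord(dst_i, n_src, n_dst, align_corners);
    x = std::min(std::max(x, 0.0f), float(n_src - 1)); // clamp to valid range
    const int x0 = static_cast<int>(x);                // floor, since x >= 0
    const int x1 = std::min(x0 + 1, n_src - 1);
    const float t = x - float(x0);
    return src[x0] * (1.0f - t) + src[x1] * t;
}
```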
sync : ggml

ggml-ci

ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon 757486878@qq.com

Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon 757486878@qq.com

fix editorconfig

Signed-off-by: noemotiovon 757486878@qq.com

Add Vulkan images to docker.md (#14472)

Right now it's not easy to find those.

ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI

ggml-ci

cont : remove -g flag

ggml-ci

ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
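The chaining pattern described above can be sketched as follows. All names are hypothetical stand-ins, not the ggml API: the setter returns whatever callback was installed before, so an application handler can forward to it after doing its own work.

```cpp
#include <cassert>
#include <string>

// Hypothetical callback type and library-side state.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Installs a new callback and returns the one it replaced (may be nullptr).
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// What the library side would do just before calling abort().
void notify_before_abort(const char * message) {
    if (g_abort_callback != nullptr) {
        g_abort_callback(message);
    }
}

// Application side: remember the previous callback and chain to it.
static abort_callback_t g_prev_callback = nullptr;
static std::string g_last_message;

void app_abort_handler(const char * message) {
    g_last_message = message; // e.g. show a dialog or save user data here
    if (g_prev_callback != nullptr) {
        g_prev_callback(message);
    }
}
```

An application without a console would install its handler once at startup with g_prev_callback = set_abort_callback(app_abort_handler).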
style fixes

Co-authored-by: Diego Devesa slarengh@gmail.com

github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition

ggml-ci

cont : fix n_ctx_used computation

ggml-ci

opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)

ggml-ci

vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

ggml : add ggml_commit()

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sync : ggml

ggml-ci

llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba

cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)

ggml-ci

ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3

ggml-ci

backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows

ggml-ci

graph : separate k and v indices

ggml-ci

cont : remove redundant ifs

ggml-ci

kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments

ggml-ci

ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci

convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)

Co-authored-by: luyuhong luyuhong@kylinos.cn

batch : add n_used count (#14512)

ggml-ci

graph : prepare for 4D mask (#14515)

ggml-ci

batch : add optional for sequential equal split (#14511)

ggml-ci

metal : disable fast math in all quantize kernels (#14528)

ggml-ci

test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

refactor

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

remove visitor nonsense
remove visitor comment

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
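A rough sketch of this kind of packing, to show how two fields can share one 32-bit push-constant slot. The bit position of the mask flag (bit 16) and the function names are assumptions made for illustration; the commit text does not state the actual layout.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: low 16 bits hold n_head_log2, bit 16 holds the mask flag.
constexpr uint32_t MASK_FLAG_SHIFT = 16;

uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 < (1u << MASK_FLAG_SHIFT)); // must fit in the low bits
    return (static_cast<uint32_t>(has_mask) << MASK_FLAG_SHIFT) | n_head_log2;
}

bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & ((1u << MASK_FLAG_SHIFT) - 1u);
}
```

Packing saves one dword (4 bytes) compared with storing the two values separately, which is what matters when the block is close to the 128-byte limit.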

handle null mask for gqa
allow gqa with dim3>1
vulkan: fix rms_norm+mul fusion (#14545)

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont

ggml-ci

cont : fix multi-rope + add test

ggml-ci

sycl : try fix

ggml-ci

cont : fix sycl + clean-up cuda

ggml-ci

vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style

Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.

model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code

Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

model : fix hunyuan moe chat template (#14584)

Signed-off-by: stevenkuang stevenkuang@tencent.com

vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
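For context, the reduction step of split-k flash attention can be written on the CPU as below. This sketch assumes each split stores its running maximum M, its sum of exponentials L, and an unnormalized output vector; whether the Vulkan shader stores partial outputs in exactly this form is an assumption. The struct and function names are hypothetical. In the shader, the outer loop over output elements is what gets one thread per element of the HSV dimension.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial result of one split over the KV dimension (hypothetical layout).
struct SplitPartial {
    float m;              // maximum score seen in this split
    float l;              // sum of exp(score - m) over this split
    std::vector<float> o; // unnormalized output: sum of exp(score - m) * v
};

// Combines k_num partial results into the final attention output row.
std::vector<float> split_k_reduce(const std::vector<SplitPartial> & parts) {
    assert(!parts.empty());
    float m_max = parts[0].m;
    for (const SplitPartial & p : parts) {
        m_max = std::max(m_max, p.m);
    }
    float l_total = 0.0f;
    std::vector<float> out(parts[0].o.size(), 0.0f);
    for (const SplitPartial & p : parts) {
        assert(p.o.size() == out.size());
        const float w = std::exp(p.m - m_max); // rescale to the common maximum
        l_total += w * p.l;
        for (size_t d = 0; d < out.size(); ++d) {
            out[d] += w * p.o[d];
        }
    }
    for (float & v : out) {
        v /= l_total;
    }
    return out;
}
```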

convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

flake8 fixes
Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

added hashes
Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

update the update file
Revert "update the update file"

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

d_inner fixed
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

reshaping ssm_norm for 34B
removing generate_mup
remove duplicates metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp

Co-authored-by: compilade git@compilade.net

falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

falcon-h1: fix last comment
Update convert_hf_to_gguf.py

Co-authored-by: compilade git@compilade.net

falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues

llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

support for smoldocling

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

fixed merge conflicts

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

safetensors tensor mapping

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update include/llama.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

opencl: add set_rows for f16 and f32 (#14547)
opencl: add set_rows for f16 and f32
opencl: better choose workgroup size for set_rows
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba

ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

llama : rename llama_cache to llama_past

Still, I'm open to better suggestions.

examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add Granite 4 conversion

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Plumb bamba through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add bamba to llama_arch_is_hybrid_recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add optional mamba ssm_in bias tensor

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add template specialization for get_arr to load a vector for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Use an explicit bool to determine mamba vs mamba2

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Isolate mamba(2) and granite attention layer building in static methods

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer sizes in granite build_attention_layer

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: First (broken) pass at end-to-end Bamba implementation

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Only do Granite multipliers if set

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Pull granite ffn portion into a static function and reuse in hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat(py): Allow gguf duplicate keys if they match by value and type

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
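
The duplicate-key rule described in the commit above can be sketched as follows. This is only an illustration under stated assumptions: `KVStore` and its methods are hypothetical names, not the gguf-py writer API, and the sketch only shows the "same value and same type" check. A later commit in this list undoes the behaviour because it encourages sloppy overwriting.

```python
class KVStore:
    """Hypothetical key/value store; NOT the real gguf-py writer API."""

    def __init__(self):
        self._data = {}

    def add_key_value(self, key, value):
        if key in self._data:
            old = self._data[key]
            # Tolerate a repeat only when both the type and the value match.
            if type(old) is type(value) and old == value:
                return
            raise ValueError(f"Duplicate key {key!r} with a different value or type")
        self._data[key] = value

    def get(self, key):
        return self._data[key]


store = KVStore()
store.add_key_value("block_count", 32)
store.add_key_value("block_count", 32)        # identical repeat: accepted
try:
    store.add_key_value("block_count", 32.0)  # equal value, different type: rejected
    conflict_rejected = False
except ValueError:
    conflict_rejected = True
```

With this rule, several parent classes can each set the same parameter without wrapping every call in try/except, while genuinely conflicting writes still fail.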

refactor(py): Simplify granitemoehybrid conversion to use parents better

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add GRANITE_MOE_HYBRID through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Support GRANITE_MOE_HYBRID in llama-model

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix flake8 errors

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix recurrent cache get after rebase

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Implement the full copy-paste version to duplicate the layer builders

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

mamba : fix mismatched new and delete size for llm_build_mamba

memory : correctly handle failure in apply()

ggml-ci

style: Remove TODO for adding first hybrid models to the switch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

docs: Fix comment about duplicate key check

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Conform to standard way of initializing inp_out_ids

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use llm_graph_context_mamba in llm_build_granite_hybrid

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

feat: Add support for dense FFN in GraniteMoeHybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add support for dense FFN tensor names on c++ side

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use child inputs for Falcon H1 after merge resolution

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary prefix on tensor constants

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove use of diamond inheritance

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Log mamba params for Granite Hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unused ssm_in_b

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com
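
A minimal sketch of the idea in the commit above, assuming (as an earlier commit in this list suggests for Jamba) that recurrent layers are simply the ones with 0 KV heads. The function name and the per-layer values are made up for illustration and are not taken from llama.cpp.

```python
def is_recurrent_layer(n_head_kv_per_layer, il):
    # Assumption: a layer with zero KV heads has no attention and is
    # therefore treated as a recurrent (SSM) layer.
    return n_head_kv_per_layer[il] == 0


# Hypothetical hybrid model: two recurrent layers between attention layers.
n_head_kv = [0, 0, 8, 0, 0, 8]

attention_layers = [
    il for il in range(len(n_head_kv)) if not is_recurrent_layer(n_head_kv, il)
]
recurrent_layers = [
    il for il in range(len(n_head_kv)) if is_recurrent_layer(n_head_kv, il)
]
```

The per-layer head counts already carry the information, so a separate list of attention layer indices becomes redundant.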

fix: Remove unused template expansion for get_arr

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Review cleanup in convert_hf_to_gguf

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Undo hidden warnings about duplicate identical keys in add_key_value

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: If not using ROPE, context is "infinite"

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

doc: Add a comment outlining expected duplicate key warnings

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary duplicate keys in converter

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)

ggml-ci

readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)

Important: LFM2 was merged into transformers but has not yet been released. To convert a model to GGUF, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS

vulkan: optimize set_rows

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)

ggml-ci

vulkan : implement bilinear interpolation (ggml/1291)

ggml-ci

sync : ggml

ggml-ci

vulkan : remove unused vars (#0)

ggml-ci

sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)

ggml-ci

cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: Molly Sophia mollysophia379@gmail.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…

gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request

Mar 23, 2026

sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)

ggml-cpu: add fp16->fp32 nnpa first

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)

ggml-cpu: add fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)

ggml-cpu: better variable names

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)

docs: update s390x docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)

ggml-cpu: add debugging prints to see if dlf16 is correct

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix print vs printf

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix float placeholder

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: ensure fp16 and fp32 load and stores are called

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fp16 load ensured to hit

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove sigint from fp16 store

for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: nnpa switch to vec_xst test

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to vec_xst for 4 element loops also

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rework noop

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove noop, general code cleanup

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify variable naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add breakpoint for debugging

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test fix for conversion failure

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: disable fp32->fp16 nnpa conversions for now

there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to elif macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix typo

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: reattempt fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix compiler types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: change to typedef vector types

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add 4 element loops for fp32->fp16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarified vector naming

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back fp32->fp16 store nnpa

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add nnpa macro check in ggml-impl

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add missing func

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: diagnose why NNPA macro is not being defined

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: import vecintrin.h to fix compiler errors

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: update macro tests

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 157f856c34589566151630e294563a420702db39.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to importing ggml-cpu-impl instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix macro declaration

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: test more macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add debug prints

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bruteforce macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move macro definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h to cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to private macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move s390x typedef to own header file

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)

ggml-cpu: move things around

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile macros

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: switch to quotes for import

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add compiler error macro

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add s390x detection in ggml-src

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: undo cmakelists work

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move s390x typedef to own header file"

This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedefs.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove typedef from cmakelists

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add ggml-impl.h future notes

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: add todo comment for future reference

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: clarify naming of dlf16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove unnecessary target compile definitions

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: update broken huggingface link for s390x

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix duplicate func names during compile

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: fix duplicate func names during compile"

This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"

This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: refactor fp16<->fp32 simd to ggml-cpu

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h import in quants.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix missing simd-mappings.h within repack

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix amx mmq missing simd-mappings.h

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: attempt at fixing loongarch failing build

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move nnpa together with other fp16<->fp32 simd

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: fix wrong refactor of ggml-base

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: remove dependency on ggml-cpu from ggml-base

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: remove mistaken fallback macro

fallback logic was already implemented but i was too sleepy to realise

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"

This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"

This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml: move ggml_table_f32_f16 to ggml-cpu

ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006

Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)

ggml: move ggml_table_f32_f16 to ggml-cpu.c

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: extern c ggml_table_f32_f16 + chore docs

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h

we rely on the variable declaration in ggml-cpu.c instead

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"

This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

ggml-cpu: bring back ggml_table_f32_f16

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

Revert "ggml-cpu: bring back ggml_table_f32_f16"

This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

fix ggml time initialization
fix f32_f16 table init
remove extra line

Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Co-authored-by: slaren slarengh@gmail.com

musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Update ggml/src/ggml-cuda/ggml-cuda.cu

Co-authored-by: Johannes Gäßler johannesg@5d6.de

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Address review comments

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Co-authored-by: Johannes Gäßler johannesg@5d6.de

docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

docs: add s390x z17 build q&a

Signed-off-by: Aaron Teo aaron.teo1@ibm.com

metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup

ggml-ci

metal : handle some edge cases when threadgroup size is not a power of 2

ggml-ci

metal : add special-case mat-vec mul for ne00 == 4 (#14385)

ggml-ci

llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.

ref: #8366
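
The row-scatter semantics described above ("copies rows from 'b' into 'a' using indices from 'c'") can be sketched in plain Python. This is an illustration on nested lists, not the ggml implementation; it assumes one index per source row and ignores broadcasting and quantized destinations, which later commits in this list add.

```python
def set_rows(a, b, c):
    """Copy row i of b into row c[i] of a, in place, and return a."""
    if len(b) != len(c):
        raise ValueError("need exactly one index per source row")
    for i, row_idx in enumerate(c):
        if not 0 <= row_idx < len(a):
            raise IndexError(f"row index {row_idx} out of range")
        a[row_idx] = list(b[i])  # copy so a does not alias b
    return a


a = [[0.0, 0.0] for _ in range(4)]   # destination: 4 rows
b = [[1.0, 2.0], [3.0, 4.0]]         # source: 2 rows
c = [3, 1]                           # b[0] -> a[3], b[1] -> a[1]
set_rows(a, b, c)
```

Rows of `a` that are not named in `c` are left untouched.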

use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst

ggml-ci

ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation

ggml-ci

ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR

ggml-ci

ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

recurrent : call balloc split_reset() in init_batch() (#14414)

ggml-ci

graph : make llm_graph_context destructor virtual (#14410)

ggml-ci

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)

This setting needs to be passed through to vulkan-shaders-gen

ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)

Add Day-0 support for Baidu ERNIE 4.5 0.3B model.

Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com

vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL

Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
Add detection logic and basic fusion logic in ggml-vulkan.
Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
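
The fusion described above can be sketched numerically. The code below is an assumption-laden illustration, not the Vulkan shader or the ggml API: it uses the common RMS norm definition with a made-up `eps`, and `can_fuse` is a hypothetical stand-in for the use-count check, assuming fusion is only safe when the intermediate norm output has a single consumer.

```python
import math


def rms_norm(x, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def mul(x, w):
    return [a * b for a, b in zip(x, w)]


def rms_norm_mul_fused(x, w, eps=1e-6):
    # One pass over the row: normalize and multiply without storing
    # the intermediate rms_norm result.
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale * wv for v, wv in zip(x, w)]


def can_fuse(use_count):
    # Assumption: if another node also reads the rms_norm output,
    # that output must still be materialized, so fusion is skipped.
    return use_count == 1


x = [1.0, -2.0, 3.0, 0.5]
w = [0.5, 2.0, 1.0, 4.0]
unfused = mul(rms_norm(x), w)
fused = rms_norm_mul_fused(x, w)
```

Comparing `fused` against `unfused` is also how a whole-graph test can check a single node's result.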

extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups

style fixes

last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter

Co-authored-by: slaren slarengh@gmail.com

ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels

ggml-ci

add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints

ggml-ci

64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]

ggml-ci

Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai

GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
Replacing direct kernel calls with calls to these inlined functions.
Using __dpct_inline__ to encourage compiler inlining.
Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap

ggml-ci

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
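
As a reference for the gated ops named in this commit, here is a minimal scalar sketch of REGLU, GEGLU and SWIGLU in both the split and the single-tensor form. It is not the ggml implementation: the names `GluOp`, `glu_split` and `glu_fused` are hypothetical, and the sketch assumes that the first half (or first operand) is the one passed through the activation, with `swapped` reversing the roles. GEGLU uses the tanh approximation of GELU here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

enum class GluOp { REGLU, GEGLU, SWIGLU };

inline float relu(float x) { return x > 0.0f ? x : 0.0f; }
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// tanh approximation of GELU
inline float gelu_tanh(float x) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

inline float activate(GluOp op, float x) {
    switch (op) {
        case GluOp::REGLU:  return relu(x);
        case GluOp::GEGLU:  return gelu_tanh(x);
        case GluOp::SWIGLU: return silu(x);
    }
    return x;  // not reached for valid enum values
}

// Split form: the activated input and the multiplier are separate tensors.
std::vector<float> glu_split(GluOp op, const std::vector<float> & a,
                             const std::vector<float> & b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        out[i] = activate(op, a[i]) * b[i];
    }
    return out;
}

// Fused form: one row holds both halves; `swapped` exchanges their roles.
std::vector<float> glu_fused(GluOp op, const std::vector<float> & row, bool swapped) {
    assert(row.size() % 2 == 0);
    const std::size_t n = row.size() / 2;
    const float * x = row.data() + (swapped ? n : 0);  // activated half
    const float * g = row.data() + (swapped ? 0 : n);  // multiplier half
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = activate(op, x[i]) * g[i];
    }
    return out;
}
```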

ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"

This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.

SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

coding standard: common/chat.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Apply suggestions from code review

coding standard: cosmetic changes

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models

Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com

scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml

Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.

Remove redundant include path in CMakeLists.txt

The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.

Enable scheduled Docker image builds

Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.

test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels

ggml-ci

cont : disable for q4_1

ggml-ci

cont : disable for iq4_nl

ggml-ci

memory : correctly handle failure in apply() (#14438)

ggml-ci

Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)

This commit renames the variable best_mad to best_error in the make_qkx2_quants function.

The motivation for this is that the name best_mad can be somewhat confusing if mean absolute deviation (MAD) is not in use.

ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
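
To show what the "align corners" flag changes, here is a minimal 1-D linear-interpolation sketch. It follows the common convention for the two coordinate mappings and is an assumption about the general technique, not a copy of the ggml-cpu code; `resize_linear_1d` is a hypothetical helper. The 2-D bilinear case applies the same mapping along each axis.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Resize a 1-D signal with linear interpolation. Works for both upscaling
// and downscaling. Assumes src is non-empty and out_n > 0.
std::vector<float> resize_linear_1d(const std::vector<float> & src, int out_n,
                                    bool align_corners) {
    assert(!src.empty() && out_n > 0);
    const int in_n = static_cast<int>(src.size());
    std::vector<float> dst(out_n);
    for (int i = 0; i < out_n; ++i) {
        float x;
        if (align_corners) {
            // First and last samples of input and output coincide.
            x = out_n > 1 ? i * float(in_n - 1) / float(out_n - 1) : 0.0f;
        } else {
            // Pixel-center mapping: samples are treated as cell centers.
            x = (i + 0.5f) * float(in_n) / float(out_n) - 0.5f;
        }
        x = std::max(0.0f, std::min(x, float(in_n - 1)));  // clamp to valid range
        const int   x0 = static_cast<int>(std::floor(x));
        const int   x1 = std::min(x0 + 1, in_n - 1);
        const float t  = x - float(x0);
        dst[i] = src[x0] * (1.0f - t) + src[x1] * t;
    }
    return dst;
}
```

With input `{0, 10, 20}` resized to 5 samples, the second output sample is 5 with align corners and 4 without, which is the kind of difference the new test cases exercise.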
sync : ggml

ggml-ci

ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2

Signed-off-by: noemotiovon 757486878@qq.com

Support MUL_MAT_ID on 310p

Signed-off-by: noemotiovon 757486878@qq.com

fix editorconfig

Signed-off-by: noemotiovon 757486878@qq.com

Add Vulkan images to docker.md (#14472)

Right now it's not easy to find those.

ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI

ggml-ci

cont : remove -g flag

ggml-ci

ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes

Co-authored-by: Diego Devesa slarengh@gmail.com
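
The chaining pattern enabled by returning the previous callback can be sketched as below. This is a self-contained illustration with hypothetical names (`abort_callback_t`, `set_abort_callback`, `report_fatal`); it does not reproduce the ggml declarations, and a real library would call `abort()` after invoking the callback.

```cpp
#include <cassert>
#include <string>

// Hypothetical callback type: receives the error message before the abort.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Install a new callback and return the previously installed one,
// so the caller can forward to it (callback chaining).
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the library's fatal-error path (the abort itself is omitted).
void report_fatal(const char * message) {
    assert(message != nullptr);
    if (g_abort_callback) {
        g_abort_callback(message);
    }
}

// Example handlers: the second one forwards to whatever was installed before.
static std::string      g_log;
static abort_callback_t g_prev = nullptr;

void first_handler(const char * message) {
    g_log += "first:";
    g_log += message;
    g_log += ";";
}

void second_handler(const char * message) {
    g_log += "second:";
    g_log += message;
    g_log += ";";
    if (g_prev) {
        g_prev(message);  // chain to the earlier callback
    }
}
```

An application without a console would typically install a handler that shows a dialog or saves state, keep the returned pointer, and forward to it so that earlier handlers still run.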

github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition

ggml-ci

cont : fix n_ctx_used computation

ggml-ci

opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)

ggml-ci

vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version

This commit adds a function ggml_version() to the ggml library that returns the version of the library as a string.

The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.

Usage:

printf("GGML version: %s\n", ggml_version());

Output:

GGML version: 0.0.2219

ggml : add ggml_commit()

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

sync : ggml

ggml-ci

llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba

cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)

ggml-ci

ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3

ggml-ci

backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows

ggml-ci

graph : separate k and v indices

ggml-ci

cont : remove redundant ifs

ggml-ci

kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments

ggml-ci

ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends

ggml-ci

convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)

Signed-off-by: nscipione nicolo.scipione@codeplay.com

ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)

Co-authored-by: luyuhong luyuhong@kylinos.cn

batch : add n_used count (#14512)

ggml-ci

graph : prepare for 4D mask (#14515)

ggml-ci

batch : add optional for sequential equal split (#14511)

ggml-ci

metal : disable fast math in all quantize kernels (#14528)

ggml-ci

test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

refactor

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

remove visitor nonsense
remove visitor comment

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Address review comments

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com

Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com

eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition

Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
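
A minimal sketch of this kind of packing follows. The bit layout is an assumption chosen for illustration (low 16 bits for `n_head_log2`, bit 16 for the mask flag) and is not taken from the shader source; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for illustration only:
//   bits 0..15 : n_head_log2
//   bit  16    : 1 if a mask tensor is present
const uint32_t N_HEAD_LOG2_BITS = 0xFFFFu;
const uint32_t HAS_MASK_BIT     = 1u << 16;

// Combine both values into one 32-bit word (one dword of push constants).
inline uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_BITS);  // must fit in the low 16 bits
    return (n_head_log2 & N_HEAD_LOG2_BITS) | (has_mask ? HAS_MASK_BIT : 0u);
}

inline bool unpack_has_mask(uint32_t packed) {
    return (packed & HAS_MASK_BIT) != 0u;
}

inline uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_BITS;
}
```

Merging the two fields saves four bytes in the push constant block, which matters when the block is already at the 128-byte limit.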

handle null mask for gqa
allow gqa with dim3>1
vulkan: fix rms_norm+mul fusion (#14545)

The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.

vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)

Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260

Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com

CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)

Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com

CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont

ggml-ci

cont : fix multi-rope + add test

ggml-ci

sycl : try fix

ggml-ci

cont : fix sycl + clean-up cuda

ggml-ci

vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style

Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com

server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)

Splits producing more than one ubatch per batch for recurrent models were broken with #14512.

This fixes it by moving the completeness check after the ubatch split loop.

model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code

Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com

model : fix hunyuan moe chat template (#14584)

Signed-off-by: stevenkuang stevenkuang@tencent.com

vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads

k_num can get rather large. Use the whole workgroup to reduce the M/L values.

Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
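
For readers unfamiliar with the reduction, here is a CPU sketch of how per-split flash attention results are commonly merged. It assumes each split stores its row maximum `m`, its exponential sum `l`, and an unnormalized output vector `o`; that storage convention and the names `Partial`, `attend_split` and `reduce_splits` are assumptions for illustration, not the shader's actual layout. In the shader, the loop over the output dimension is what gets one thread per element.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Result of attending over one slice of the KV range.
struct Partial {
    float              m;  // max score seen in this split
    float              l;  // sum of exp(score - m)
    std::vector<float> o;  // unnormalized output: sum of exp(score - m) * value
};

// Compute the partial result for one split (scores and values must be non-empty).
Partial attend_split(const std::vector<float> & scores,
                     const std::vector<std::vector<float>> & values) {
    assert(!scores.empty() && scores.size() == values.size());
    Partial p;
    p.m = *std::max_element(scores.begin(), scores.end());
    p.l = 0.0f;
    p.o.assign(values[0].size(), 0.0f);
    for (std::size_t j = 0; j < scores.size(); ++j) {
        const float w = std::exp(scores[j] - p.m);
        p.l += w;
        for (std::size_t d = 0; d < p.o.size(); ++d) {
            p.o[d] += w * values[j][d];
        }
    }
    return p;
}

// Merge all splits: rescale each one to the global max, then normalize.
std::vector<float> reduce_splits(const std::vector<Partial> & parts) {
    assert(!parts.empty());
    float m = parts[0].m;
    for (const Partial & p : parts) m = std::max(m, p.m);

    float              L = 0.0f;
    std::vector<float> out(parts[0].o.size(), 0.0f);
    for (const Partial & p : parts) {
        const float scale = std::exp(p.m - m);
        L += p.l * scale;
        for (std::size_t d = 0; d < out.size(); ++d) {
            out[d] += p.o[d] * scale;
        }
    }
    for (float & v : out) v /= L;
    return out;
}
```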

convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"

This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.

fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

flake8 fixes
Update src/llama-hparams.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-arch.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

added hashes
Update src/llama-arch.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

update the update file
Revert "update the update file"

This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.

fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model-loader.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

d_inner fixed
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

reshaping ssm_norm for 34B
removing generate_mup
remove duplicates metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp

Co-authored-by: compilade git@compilade.net

falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

falcon-h1: fix last comment
Update convert_hf_to_gguf.py

Co-authored-by: compilade git@compilade.net

falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues

llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
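
The general technique behind a fix like this is to check each multiplication before performing it. The following is a minimal sketch of that idea, not the code from the commit; `checked_mul` and `tensor_nelements` are hypothetical helpers and the four-dimension shape is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Multiply a * b into `out`. Returns false (leaving `out` untouched)
// if the product would not fit in 64 unsigned bits.
bool checked_mul(uint64_t a, uint64_t b, uint64_t & out) {
    if (a != 0 && b > std::numeric_limits<uint64_t>::max() / a) {
        return false;
    }
    out = a * b;
    return true;
}

// Compute the element count of a 4-D shape read from an untrusted file.
// Rejects negative dimensions and products that overflow.
bool tensor_nelements(const int64_t ne[4], uint64_t & out) {
    assert(ne != nullptr);
    uint64_t n = 1;
    for (int i = 0; i < 4; ++i) {
        if (ne[i] < 0) {
            return false;
        }
        if (!checked_mul(n, static_cast<uint64_t>(ne[i]), n)) {
            return false;
        }
    }
    out = n;
    return true;
}
```

A loader would run the same check again when multiplying the element count by the per-element size, and refuse the file if either step fails.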
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
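
As a scalar reference, the operation can be sketched as an element-wise `y = x * s + b`. That formula is my reading of the op name and of the `x` parameter added to the vector helper, so treat it as an assumption; `scale_bias_f32` below is a hypothetical stand-in without SIMD, not the ggml function.

```cpp
#include <cassert>

// Element-wise scale and bias: y[i] = x[i] * s + b.
// Plain scalar loop; a real backend would vectorize this.
void scale_bias_f32(int n, float * y, const float * x, float s, float b) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        y[i] = x[i] * s + b;
    }
}
```

Doing both steps in one pass avoids a separate add node and a second trip over the data.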
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops

Co-authored-by: Georgi Gerganov ggerganov@gmail.com

Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com

merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

support for smoldocling

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

fixed merge conflicts

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-model.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

safetensors tensor mapping

Signed-off-by: ryan-mangeno ryanmangeno@gmail.com

added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update include/llama.h

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update convert_hf_to_gguf_update.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update src/llama-vocab.cpp

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

opencl: add set_rows for f16 and f32 (#14547)
opencl: add set_rows for f16 and f32
opencl: better choose workgroup size for set_rows
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache

This will be necessary to support Jamba (and other recurrent models mixed with Attention).

Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.

llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA

This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.

llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits

This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.

llama : fix edge case finding batch seq_id of split recurrent cell

This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.

llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba

ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes

Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.

llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors

The implementation already supported it, and this makes Mamba's conv step slightly faster.

llama : rename llama_cache to llama_past

Still, I'm open to better suggestions.

examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.

llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift

This also slightly reduces the diff from the master branch

llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.

ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models to avoid some reshapes.

Not sure if it's a good idea, but it makes the graph slightly cleaner.

llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.

convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models

This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add Granite 4 conversion

This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Plumb bamba through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add bamba to llama_arch_is_hybrid_recurrent

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add optional mamba ssm_in bias tensor

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add template specialization for get_arr to load a vector for layer index arr in hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Use an explicit bool to determine mamba vs mamba2

This allows other architectures like bamba and granitemoehybrid to use mamba2 without a growing architecture if statement inside the mamba implementation.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Isolate mamba(2) and granite attention layer building in static methods

This will allow these layer-builder methods to be used from other build structs without complex inheritance.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use per-layer sizes in granite build_attention_layer

Also no need to pass in kv cache since it's already in the inp_attn

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: First (broken) pass at end-to-end Bamba implementation

It generates (garbage) tokens! Still lots of debugging to do.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Only do Granite multipliers if set

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Pull granite ffn portion into a static function and reuse in hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat(py): Allow gguf duplicate keys if they match by value and type

This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.
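
A minimal sketch of the duplicate-key rule described above. `KVStore` and its `add_key_value` are hypothetical stand-ins written for illustration, not the gguf-py writer API: a repeated key is accepted only if both its value and its type tag match what is already stored.

```python
class KVStore:
    """Hypothetical key/value store; illustrates the rule only, not gguf-py."""

    def __init__(self):
        self.data = {}

    def add_key_value(self, key, value, vtype):
        existing = self.data.get(key)
        if existing is not None:
            if existing == (value, vtype):
                return  # identical duplicate: accepted, nothing changes
            raise ValueError(f"duplicate key {key!r} with a different value or type")
        self.data[key] = (value, vtype)
```

With a rule like this, several parent classes can each set the same key without wrapping every call in try/except, as long as they agree on the value.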

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor(py): Simplify granitemoehybrid conversion to use parents better

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add GRANITE_MOE_HYBRID through llama-arch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Support GRANITE_MOE_HYBRID in llama-model

This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

style: Fix flake8 errors

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix recurrent cache get after rebase

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Implement the full copy-paste version to duplicate the layer builders

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid

As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

mamba : fix mismatched new and delete size for llm_build_mamba

memory : correctly handle failure in apply()

ggml-ci

style: Remove TODO for adding first hybrid models to the switch

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

docs: Fix comment about duplicate key check

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Conform to standard way of initializing inp_out_ids

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use llm_graph_context_mamba in llm_build_granite_hybrid

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins

Branch: GraniteFourWithJamba

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

graph : add back hybrid memory graph input

But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).

model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : use ggml_swiglu_split for Mamba

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

feat: Add support for dense FFN in GraniteMoeHybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Add support for dense FFN tensor names on c++ side

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Use child inputs for Falcon H1 after merge resolution

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary prefix on tensor constants

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

gguf-py : avoid adding duplicate tensor mappings for Jamba

Some of the tensor names are common with Llama4

refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid

The only key difference is the use of rope which is now set via rope_finetuned in the hparams

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove use of diamond inheritance

Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance

https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

feat: Log mamba params for Granite Hybrid

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unused ssm_in_b

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv

This matches how recurrent vs attention heads are identified for Jamba

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unused template expansion for get_arr

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Review cleanup in convert_hf_to_gguf

The gist is to be explicit about which base class is being used with the multiple inheritance setup

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Undo hidden warnings about duplicate identical keys in add_key_value

After further discussion, this encourages sloppy overwriting in the model converters

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: If not using ROPE, context is "infinite"

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

doc: Add a comment outlining expected duplicate key warnings

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

fix: Remove unnecessary duplicate keys in converter

Co-authored-by: Francis Couture-Harpin git@compilade.net

(thanks for the sharp eyes and patience!)

Branch: GraniteFour

Signed-off-by: Gabe Goodhart ghart@us.ibm.com

vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)

ggml-ci

readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)

Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:

pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"

vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS

vulkan: optimize set_rows

Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.

server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)

ggml-ci

vulkan : implement bilinear interpolation (ggml/1291)

ggml-ci

sync : ggml

ggml-ci

vulkan : remove unused vars (#0)

ggml-ci

sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)

ggml-ci

cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com

sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…

Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request

Apr 26, 2026

phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request

Apr 28, 2026

Merge vulkan code from mainline up to commit of 6/28/2025
Vulkan Optimizations and Fixes (ggml-org#8959)
Optimize Vulkan REPEAT performance
Use Vulkan GLSL fused multiply-add instruction where possible
Add GGML_VULKAN_PERF option to output performance data per operator
Rework and fix Vulkan descriptor set and descriptor pool handling
Fix float32 concat f16 shader validation error
Add Vulkan GROUP_NORM eps parameter
Fix validation error with transfer queue memory barrier flags
Remove trailing whitespaces

vulkan : do not use tensor->extra (ggml-org#9407)

vulkan : do not use tensor->extra

This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used.

Ref: ggml-org#8536

Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (ggml-org#2)

Co-authored-by: 0cc4m picard12@live.de

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan : fix build (#0)

ggml-ci

Improve Vulkan shader build system (ggml-org#9239)

Improve Vulkan shader build system

Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools

remove not required self dependency

ggml : fix build break for the vulkan-debug (ggml-org#9265)

windows build : Ok.
linux build : Ok.

Signed-off-by: Changyeon Kim cyzero.kim@samsung.com

vulkan: correctly report support for OP_CONT (ggml/946)

test-backend-ops fails because ggml_cont aborts when invoked passing an unsupported type.

This commit makes ggml_cont tests pass

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan: add dryrun support to sin and cos ops (ggml/947)

sin and cos failed test-backend-ops because they tried to dereference a context pointer that is null on dry runs.

This commit prevents that segfault.

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

Conflicts:

ggml/src/ggml-vulkan.cpp

Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (ggml-org#9118)

Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
fix compile issues
Fix issues where the last submit wasn't executed or handled properly.
remove trailing whitespace
Repair GGML_VULKAN_CHECK_RESULTS
Increase submit counter only if actual work has been submitted and increase submit count to 100.
Fix some nodes are not checked with GGML_VULKAN_CHECK_RESULTS enabled.

Conflicts:

ggml/src/ggml-vulkan.cpp

Enable use of the rebar feature to upload buffers to the device. (ggml-org#9251)

vulkan : argsort barriers must be under uniform control flow (ggml/951)

a return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on some others (i.e. "smaller" GPUs).

BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work.

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)

vulkan : multithread pipeline creation (ggml/963)

vulkan : mul_mat: fix UB with small warps (ggml/952)

When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0. Because they are calculated as: the workgroup size, multiplied by LOAD_VEC_* (which can be 1) and divided by 16. And the workgroup size is set to be the same as the warp/subgroup size.

The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.

When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.

We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).
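
The arithmetic above can be sketched as follows. The function names are illustrative and the code only models the integer math described in this message, not the shader itself.

```python
def loadstride(workgroup_size, load_vec):
    # Workgroup size times LOAD_VEC_*, divided by 16 (integer division).
    return workgroup_size * load_vec // 16


def clamped_workgroup_size(subgroup_size, minimum=16):
    # The fix: never go below 16, even if the device's warp size is smaller.
    return max(subgroup_size, minimum)
```

With a warp size of 8 and LOAD_VEC_* equal to 1, the unclamped stride is 8 * 1 // 16 = 0, so a loop that increments by it never advances. After clamping the workgroup size to 16 the stride becomes 1.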

Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com

vulkan : retry allocation with fallback flags (whisper/2451)

Co-authored-by: Samuel Morris samuel.morris@artlist.io

vulkan : improve ggml_vk_create_buffer error handling (ggml-org#9898)

vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (ggml-org#10226)

vulkan: Throttle the number of shader compiles during the build step. (ggml-org#10222)

Fixes ggml-org#9582

Spawning too many concurrent copies of glslc leads to "Failed to create pipes" errors on Linux. This change applies the same throttling we use for multithreaded pipeline creation.
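
A rough sketch of this kind of throttling, assuming a plain counting semaphore. `run_throttled` is a hypothetical helper, and the real build tool may use a different mechanism. Many jobs are queued, but only `max_concurrent` of them run at the same time.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_throttled(jobs, max_concurrent):
    """Run callables on many threads, letting only max_concurrent execute at once."""
    gate = threading.Semaphore(max_concurrent)
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def guarded(job):
        with gate:  # blocks while max_concurrent jobs are already running
            with lock:
                state["active"] += 1
                state["peak"] = max(state["peak"], state["active"])
            try:
                return job()
            finally:
                with lock:
                    state["active"] -= 1

    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        results = list(pool.map(guarded, jobs))
    return results, state["peak"]
```

The returned peak is only there to make the limit observable. In a build tool each job would launch one compiler process.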

Conflicts:

ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp

vulkan: Optimize contiguous copies (ggml-org#10254)

tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.

Conflicts:

tests/test-backend-ops.cpp

vulkan: Use macros to make the mat mul pipeline creation more concise (ggml-org#10259)

Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a pipeline. This isn't really used yet.

vulkan: Optimize binary ops (ggml-org#10270)

Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 are the same dimensions and additional modulus for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that have a fast path when the calculation isn't needed or can be done more cheaply.

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/acc.comp

ggml : vulkan logs (whisper/2547)

vulkan: Optimize some mat-vec mul quant shaders (ggml-org#10296)

Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.

Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before.

Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.

Vulkan: Fix device info output format specifiers (ggml-org#10366)

Vulkan: Fix device info output format specifiers
Vulkan: Use zu printf specifier for size_t instead of ld

vulkan: remove use of null initializer (ggml-org#10372)

Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?

vulkan: Optimize soft_max (ggml-org#10301)

vulkan: Optimize soft_max

Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H.

vulkan: Further soft_max optimizations

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.

vulkan: further optimize mul_mat_vec using larger loads (ggml-org#10387)

vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.

Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up.
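
A small model of the round-up and early-return logic, under the assumption that rows are split into a 2D grid once they exceed a per-dimension limit. `max_x` and both function names are hypothetical.

```python
def grid_for_rows(num_rows, max_x):
    """Split num_rows workgroups into an (x, y) grid, rounding y up."""
    if num_rows <= max_x:
        return num_rows, 1
    return max_x, (num_rows + max_x - 1) // max_x


def rows_processed(num_rows, max_x):
    """Simulate the dispatch; workgroups past the last row return early."""
    x, y = grid_for_rows(num_rows, max_x)
    done = []
    for gy in range(y):
        for gx in range(x):
            row = gy * x + gx
            if row >= num_rows:
                continue  # nonexistent row: early return in the shader
            done.append(row)
    return done
```

Rounding down would leave the last partial group of rows uncomputed. Rounding up covers every row but launches a few extra workgroups, which is why the early return is needed.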

Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions.

vulkan: Add GLSL structure aliases for quant types to allow larger loads

In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits.

vulkan: use larger loads in q5_k and q6_k shaders.

Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions.

vulkan: use larger K step per iteration in mul_mat_vec.

Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system.

The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes.

Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.

vulkan: copy iq4_nl LUT into shared memory (ggml-org#10409)

vulkan: predicate max operation in soft_max shaders/soft_max (ggml-org#10437)

Fixes ggml-org#10434

vulkan: Fix a vulkan-shaders-gen argument parsing error (ggml-org#10484)

The vulkan-shaders-gen was not parsing the --no-clean argument correctly. Because the previous code was parsing the arguments which have a value only and the --no-clean argument does not have a value, it was not being parsed correctly. This commit can now correctly parse arguments that don't have values.
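
A minimal sketch of a parser that handles both kinds of argument. It is an illustration in Python rather than the tool's own code, and option names other than --no-clean are placeholders.

```python
def parse_args(argv, flags_without_value=("--no-clean",)):
    """Parse '--key value' pairs plus value-less flags into a dict."""
    args = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in flags_without_value:
            args[arg] = True  # a flag consumes no following token
            i += 1
        elif arg.startswith("--") and i + 1 < len(argv):
            args[arg] = argv[i + 1]  # an option consumes the next token
            i += 2
        else:
            raise ValueError(f"unexpected argument: {arg}")
    return args
```

A parser that always consumes two tokens would either skip a trailing value-less flag or swallow the option that follows it.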

vulkan: fix group_norm (ggml-org#10496)

Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion).

Fixes leejet/stable-diffusion.cpp#439.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: optimize Q2_K and Q3_K mul_mat_vec (ggml-org#10459)

vulkan: skip integer div/mod in get_offsets for batch_idx==0 (ggml-org#10506)

vulkan: further optimize q5_k mul_mat_vec (ggml-org#10479)

vulkan: Handle GPUs with less shared memory (ggml-org#10468)

There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. ggml-org#10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.

vulkan: define all quant data structures in types.comp (ggml-org#10440)

vulkan: get the first command buffer submitted sooner (ggml-org#10499)

This is an incremental improvement over ggml-org#9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space.

With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.
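
The ramp-up can be pictured with a sketch like the one below. The starting batch size of 5 and the doubling schedule are assumptions made for illustration; only the cap of 100 nodes per submit comes from the message above.

```python
def submit_points(num_nodes, first_batch=5, max_batch=100):
    """Return the node indices after which a command buffer is submitted."""
    points = []
    batch = first_batch
    pending = 0
    for i in range(num_nodes):
        pending += 1
        is_last = i == num_nodes - 1
        if pending >= batch or is_last:
            points.append(i)
            pending = 0
            batch = min(batch * 2, max_batch)  # ramp up towards the cap
    return points
```

With these assumed parameters, a 400-node graph gets its first submit after 5 nodes instead of 100, so the GPU starts working while later command buffers are still being recorded.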

vulkan: Dynamic subgroup size support for Q6_K mat_vec (ggml-org#10536)

subgroup 64 version with subgroup add. 15% faster

scalable version

tested for subgroup sizes 16-128

check for subgroup multiple of 16 and greater than 16
subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45)
force 16 sequential threads per block
make 16 subgroup size a constant

vulkan: optimize and reenable split_k (ggml-org#10637)

Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.

vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (ggml-org#10642)
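
One well-known way to replace a division by a mul+shift is to precompute a magic multiplier and a shift for the divisor. The sketch below shows that general technique. Whether the shaders use exactly these constants is an assumption, and the function names are illustrative.

```python
def init_fastdiv_values(d):
    """Precompute (mp, L) for a divisor d, with 1 <= d < 2**32."""
    L = 0
    while (1 << L) < d:
        L += 1  # L = ceil(log2(d))
    mp = ((1 << 32) * ((1 << L) - d)) // d + 1
    return mp, L


def fastdiv(n, mp, L):
    """Compute n // d for 0 <= n < 2**32 using a multiply and a shift."""
    hi = (n * mp) >> 32  # high 32 bits of the 64-bit product
    return (hi + n) >> L
```

The precomputation runs once per divisor on the host, and each element then costs a multiply-high, an add and a shift. Python integers do not overflow; a 32-bit implementation has to account for `hi + n` exceeding 32 bits when n is large.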

vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (ggml-org#10206)

Conflicts:

ggml/src/vulkan-shaders/dequant_funcs_cm2.comp

ggml/src/vulkan-shaders/flash_attn_cm2.comp

ggml/src/vulkan-shaders/mul_mm_cm2.comp

Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (ggml-org#10597)

Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader
Improve performance with better q4_k and q5_k dequant and store unrolling
Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection
Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device
Vulkan: Implement accumulator switch for specific mul mat mat shaders
Vulkan: Unroll more loops for more mul mat mat performance
Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic
Disable coopmat support on AMD proprietary driver
Remove redundant checks
Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support
Fix rebase typo
Fix coopmat2 MUL_MAT_ID pipeline selection

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: compile a test shader in cmake to check for coopmat2 support (ggml-org#10713)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/ggml-vulkan/CMakeLists.txt

ggml/src/vulkan-shaders/test_coopmat2_support.comp

Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (ggml-org#10723)

Vulkan: fix NaN in tanh.comp
Faster NaN-free tanh

vulkan: fix compile warnings (ggml-org#10731)

vulkan: disable spirv-opt for coopmat shaders (ggml-org#10763)

There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway.

Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes.

Fix coopmat support reporting when glslc doesn't support NV_coopmat2.

vulkan: dynamic subgroup size for the remaining k quants (ggml-org#10745)

q5_k

q4_k

q3_k

q2_k

q6_k multi row example

revert as multi row isn't faster for k quants

vulkan: request round-to-even for fp16 in im2col/rope_head (ggml-org#10767)

Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.
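
To illustrate what round-to-even means for fp16, the sketch below uses Python's `struct` module, whose 'e' format converts to IEEE 754 half precision with round-half-to-even. It shows the rounding rule only and says nothing about how a given Vulkan driver behaves.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest fp16 value (ties go to even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]
```

Between 2048 and 4096 the representable fp16 values are 2 apart, so 2049.0 and 2051.0 are exact ties. Round-to-even sends them to 2048.0 and 2052.0 respectively, while rounding toward zero would give 2048.0 and 2050.0.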

Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (ggml-org#10721)

Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
Fix subgroup size control extension support check

Add accf32 and accf16 checks for coopmats

Also disable coopmats on amdvlk

Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (ggml-org#10798)

vulkan: small mul_mat_vec optimizations (ggml-org#10665)

double the number of rows per workgroup
Update ggml-vulkan.cpp
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
only increase the number of rows for amd and subgroup size 64
fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested
use subgroup min and max to check for gcn (requires ggml-org#10721)
manual merge ggml-vulkan.cpp
set min and max subgroup size in any case
Also double the number of rows for Intel GPUs

Change Debug print name

add GGML_ROPE_TYPE_MROPE

rwkv6: add wkv6 support for Vulkan backend (ggml-org#10829)

rwkv_wkv6 vulkan shader
RWKV_WKV6 Vulkan op tests passed

Signed-off-by: Molly Sophia mollysophia379@gmail.com

Apply code format changes

Signed-off-by: Molly Sophia mollysophia379@gmail.com

add [[unroll]] and remove unnecessary conditions
add uma support
fix errors in EditorConfig Checker

Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Molly Sophia mollysophia379@gmail.com

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/wkv6.comp

vulkan: bugfixes for small subgroup size systems + llvmpipe test (ggml-org#10809)

ensure mul mat shaders work on systems with subgroup size less than 32

more fixes

add test

only s_warptile_mmq needs to be run with 32 threads or more

Conflicts:

.github/workflows/build.yml

vulkan : fix soft_max.comp division by zero (whisper/2633)

This change prevents a division by zero error when p.KY is 0.
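
A sketch of this kind of guard, assuming the division in question is a modulo that maps an output row onto a mask row. The function and parameter names are hypothetical, and the code models the idea in Python rather than reproducing the shader.

```python
def mask_row(row_x, ky):
    # When ky is 0 there is nothing to index, so skip the modulo entirely.
    return row_x % ky if ky > 0 else 0
```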

vulkan: optimize coopmat2 dequant functions (ggml-org#10855)

Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.

vulkan: build fixes for 32b (ggml-org#10927)

vulkan: build fixes for 32b

Should fix ggml-org#10923

vulkan: initialize some buffer/offset variables

examples, ggml : fix GCC compiler warnings (ggml-org#10983)

Warning types fixed (observed under MSYS2 GCC 14.2.0):

format '%ld' expects argument of type 'long int', but argument has type 'size_t'
llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct field except first)

Conflicts:

examples/export-lora/export-lora.cpp

vulkan: multi-row k quants (ggml-org#10846)

multi row k quant shaders!
better row selection
more row choices
readjust row selection
rm_kq=2 by default

vulkan: Use push constant offset to handle misaligned descriptors (ggml-org#10987)

vulkan: im2col and matmul optimizations for stable diffusion (ggml-org#10942)

tests: Add im2col perf tests
vulkan: optimize im2col, more elements per thread
vulkan: increase small tile size for NV_coopmat2
vulkan: change im2col to 512 elements per workgroup

vulkan: optimize mul_mat for small values of N (ggml-org#10991)

Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better.

Share some code for reducing the result values to memory in mul_mat_vec_base.

Conflicts:

tests/test-backend-ops.cpp

fix: Vulkan shader gen binary path (ggml-org#11037)

Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (ggml-org#11074)

Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver
Add (TM) to AMD name check

fix lora print

Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (ggml-org#11117)

Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
Perform Vulkan extensions checks in a more sensible order
Remove unnecessary #ifdef directive

Conflicts:

ggml/src/vulkan-shaders/test_coopmat_support.comp

llama: add support for QRWKV6 model architecture (ggml-org#11001)

Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (ggml-org#11161)

Vulkan: Remove float16 use in shaders
Fix validation error about subgroup_size_control extension

fix: ggml: fix vulkan-shaders-gen build (ggml-org#10448)

fix: ggml: fix vulkan-shaders-gen build

The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host.

refactor: ggml: Improve vulkan-shaders-gen toolchain setup

Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
Auto-detect host toolchain if not set.

refactor: ggml: Improve vulkan-shaders-gen toolchain setup

Use configure_file to generate host_toolchain.cmake from template

fix: ggml: Fix compile error

Fix compile error not finding vulkan-shaders-gen

fix: vulkan-shaders-gen build and path handling

Fix build issues with vulkan-shaders-gen:

Add target dependency for correct build order
Use CMAKE_HOST_SYSTEM_NAME for executable suffix
Fix MSVC output directory in host toolchain
Normalize path handling for cross-compilation

fix: improve host compiler detection in vulkan shader build

Improve host compiler detection for vulkan shader generation:

Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
Consolidate compiler detection logic
Fix Windows-specific MSVC detection
Ensure correct compiler search in cross-compilation

refactor: Simplify CMake function for detecting host compiler

Simplified the CMake function to improve the process of detecting the host compiler.

fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt

Since vulkan-shader-gen.cpp only requires the glslc executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected. (See: ecc93d0)

refactor: Rename host_toolchain.cmake.in

Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in

refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN

Conflicts:

ggml/src/ggml-vulkan/CMakeLists.txt

vulkan: scale caching for k quants + misc fixes (ggml-org#11081)

q6_k scale caching
16 bit unpack
q4_k test (slow)
revert it
q3_k
q2_k
little stuff
try precalculating products of a and q2_k scales
Revert "try precalculating products of a and q2_k scales"

This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.

unpack should be u16, add vim swap to gitignore (about time)
better q4_k scales
q5_k
better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
q2_k better dequant
q3_k optimizations
q3_k use hmask simd from cpu avx version
make the caches happy
q3_k separate out calculation
q2_k separate out
little stuff
use calc_superblock everywhere
q2_k optimize scale calculation
more barriers

vulkan: optimize coopmat2 q2_k dequant function (ggml-org#11130)

vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (ggml-org#11206)

Do masking on whole dwords, fetch all scales at once.

vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (ggml-org#11166)

vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
ggml: copy q->f32 assumes some contiguity in the destination

Conflicts:

ggml/src/ggml-cpu/ggml-cpu.c

ggml/src/vulkan-shaders/copy_from_quant.comp

ggml/src/vulkan-shaders/copy_to_quant.comp

vulkan: fix coopmat2 flash attention for non-contiguous inputs (ggml-org#11281)

Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes ggml-org#11268.

Conflicts:

tests/test-backend-ops.cpp

vulkan: fix coopmat2 validation failures (ggml-org#11284)

The mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop and is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16.

coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 to be enabled, and SPIR-V 1.6 requires Vulkan 1.3.

vulkan: fix diag_mask_inf (ggml-org#11323)

With robustBufferAccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.

vulkan: sort shaders for more deterministic binary (ggml-org#11315)

Fixes ggml-org#11306.

Vulkan-run-test: fix mmq_wg_denoms (ggml-org#11343)

There appears to be a copy-and-paste error here.

*mmq_wg_denoms should be used together with *warptile_mmq, instead of wg_denoms.

vulkan: compile shaders on-demand (ggml-org#11406)

Reduce first-run startup time and memory consumption.

Should fix ggml-org#11339.

vulkan: Catch pipeline creation failure and print an error message (ggml-org#11436)

vulkan: Catch pipeline creation failure and print an error message

Also, fix some warnings from my on-demand compile change.

vulkan: fix pipeline creation logging

vulkan: implement initial support for IQ2 and IQ3 quantizations (ggml-org#11360)

vulkan: initial support for IQ3_S
vulkan: initial support for IQ3_XXS
vulkan: initial support for IQ2_XXS
vulkan: initial support for IQ2_XS
vulkan: optimize Q3_K by removing branches
vulkan: implement dequantize variants for coopmat2
vulkan: initial support for IQ2_S
vulkan: vertically realign code
port failing dequant callbacks from mul_mm
Fix array length mismatches
vulkan: avoid using workgroup size before it is referenced
tests: increase timeout for Vulkan llvmpipe backend

Co-authored-by: Jeff Bolz jbolz@nvidia.com

Conflicts:

ggml/src/vulkan-shaders/dequant_iq2_s.comp

ggml/src/vulkan-shaders/dequant_iq2_xs.comp

ggml/src/vulkan-shaders/dequant_iq2_xxs.comp

ggml/src/vulkan-shaders/dequant_iq3_s.comp

ggml/src/vulkan-shaders/dequant_iq3_xxs.comp

CUDA: non-contiguous (RMS) norm support (ggml-org#11659)

vulkan: use smaller combined allocations to avoid fragmentation (ggml-org#11551)

Conflicts:

ggml/src/ggml-alloc.c

vulkan: initial support for IQ4_XS quantization (ggml-org#11501)

Conflicts:

ggml/src/vulkan-shaders/dequant_iq4_xs.comp

vulkan: optimize coopmat2 iq2/iq3 callbacks (ggml-org#11521)

vulkan: optimize coopmat2 iq2/iq3 callbacks
build: trigger CI on GLSL compute shader changes

vulkan: print shared memory size (ggml-org#11719)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: account for lookup tables when checking shared memory size (ggml-org#11502)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (ggml-org#11592)

vulkan: linux builds + small subgroup size fixes (ggml-org#11767)

mm subgroup size
upload vulkan x86 builds

vulkan: initial support for IQ1_S and IQ1_M quantizations (ggml-org#11528)

vulkan: initial support for IQ1_S and IQ1_M quantizations
vulkan: define MMV kernels for IQ1 quantizations
devops: increase timeout of Vulkan tests again
vulkan: simplify ifdef for init_iq_shmem

Conflicts:

ggml/src/vulkan-shaders/dequant_iq1_m.comp

ggml/src/vulkan-shaders/dequant_iq1_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq1_m.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq1_s.comp

vulkan: support multi/vision rope, and noncontiguous rope (ggml-org#11902)

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/rope_multi.comp

ggml/src/vulkan-shaders/rope_vision.comp

vulkan: implement several ops relevant for ggml_opt (ggml-org#11769)

vulkan: support memset_tensor
vulkan: support GGML_OP_SUM
vulkan: implement GGML_OP_ARGMAX
vulkan: implement GGML_OP_SUB
vulkan: implement GGML_OP_COUNT_EQUAL
vulkan: implement GGML_OP_OPT_STEP_ADAMW
vulkan: fix check_results RWKV_WKV6 crash and memory leaks
vulkan: implement GGML_OP_REPEAT_BACK
tests: remove invalid test-backend-ops REPEAT_BACK tests
vulkan: fix COUNT_EQUAL memset using a fillBuffer command

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/argmax.comp

ggml/src/vulkan-shaders/count_equal.comp

ggml/src/vulkan-shaders/opt_step_adamw.comp

ggml/src/vulkan-shaders/repeat_back.comp

ggml/src/vulkan-shaders/sub.comp

tests/test-backend-ops.cpp

vulkan: implement more backpropagation operators (ggml-org#11914)

vulkan: implement GGML_OP_ROPE_BACK
vulkan: implement GGML_OP_RMS_NORM_BACK
vulkan: implement GGML_OP_SILU_BACK
vulkan: implement GGML_OP_SOFTMAX_BACK

Conflicts:

ggml/src/vulkan-shaders/rms_norm_back.comp

ggml/src/vulkan-shaders/silu_back.comp

ggml/src/vulkan-shaders/soft_max_back.comp

Add memset tensor in all backend interface

SYCL: implement memset ggml backend buffer interface (ggml-org#12580)

SYCL: implement memset ggml backend buffer interface
use GGML_ABORT macro
Do not wait for all queues to finish for memset operation

Conflicts:

ggml/src/ggml-sycl.cpp

add OP sigmoid (ggml-org#12056)

Co-authored-by: Judd foldl@boxvest.com

Conflicts:

ggml/src/vulkan-shaders/sigmoid.comp

vulkan: fix assertion when qy_needs_dequant (ggml-org#12068)

Looks like a copy/paste bug from qx_needs_dequant.

vulkan: improve im2col (ggml-org#11826)

vulkan: improve im2col performance

vulkan: matmul dequantization improvements (ggml-org#12015)

faster dequant for old quants
don't use unpack for iq4_nl
vec2 unpack for q8

vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (ggml-org#11595)

vulkan: implement specialized MMV kernels for IQ2 quantizations
vulkan: add MMV kernels for IQ3 quants
vulkan: Increase MMV batch size and unroll IQ LUT setup
vulkan: fix init_iq_shmem for WG sizes larger than tables
vulkan: common batch size for all I-quants

Conflicts:

ggml/src/vulkan-shaders/mul_mat_vec_iq2_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq2_xs.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq2_xxs.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq3_s.comp

ggml/src/vulkan-shaders/mul_mat_vec_iq3_xxs.comp

cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)

ggml-ci

Conflicts:

ggml/src/ggml-cuda.cu

tests/test-backend-ops.cpp

mat vec double buffer (ggml-org#12188)

vulkan: fix bug in coopmat1 mul_mat_id (ggml-org#12316)

tests: run mul_mat_id with a larger N
vulkan: fix bug in coopmat1 mul_mat_id

Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (ggml-org#12301)

vulkan: Adjust coopmat2 tile sizes and selection heuristic (ggml-org#12258)

vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (ggml-org#12273)

vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking

vulkan: use fp32 in coopmat2 q4_k dequant function (ggml-org#12309)

vulkan: subgroup size tuning (ggml-org#12087)

vulkan: subgroup size test
Vulkan: Add device architecture enum and logic to recognize AMD generations
vulkan: use new architecture logic to specify subgroup size
Initial vulkan subgroup size tuning for RDNA3
vulkan: commonize RDNA subgroup tuning
vulkan: override subgroup size if required_subgroup_size = 0
vulkan: disable warp 32 for RDNA3
vulkan: fine tuned RDNA1 subgroup sizes
vulkan: adjusted subgroup size map
vulkan: fixed RDNA2 subgroup map

Co-authored-by: 0cc4m picard12@live.de

vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (ggml-org#12312)

ggml-vulkan: remove unused find_program(glslc) (ggml-org#12416)

It's already found by FindVulkan.cmake in the parent CMakeLists

Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (ggml-org#12434)

vulkan: Submit once enough matmul work has been recorded (ggml-org#12406)

I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
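A minimal sketch of the byte-based submit heuristic described above. The `SubmitHeuristic` struct, its field names, the decimal reading of 100MB, and the doubling ramp are assumptions for illustration; the commit message does not specify the ramp schedule, and this is not the code from ggml-vulkan.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: decides when to submit recorded work based on how many
// bytes of weight matrix the recorded matmuls read.
struct SubmitHeuristic {
    uint64_t threshold_bytes = 100000000ull; // ~100MB (decimal MB is an assumption)
    uint64_t pending_bytes   = 0;            // weight bytes recorded since the last submit
    uint32_t submit_count    = 0;            // submits issued so far

    // Record one matmul; returns true if the caller should submit now.
    bool record_matmul(uint64_t weight_bytes) {
        assert(threshold_bytes > 0);
        pending_bytes += weight_bytes;
        if (pending_bytes < threshold_bytes) {
            return false;
        }
        pending_bytes = 0;
        ++submit_count;
        // Assumed ramp: double the threshold after every third submit.
        if (submit_count % 3 == 0) {
            threshold_bytes *= 2;
        }
        return true;
    }
};
```

With these assumed numbers, two 60MB matmuls trigger the first submit, and after the third submit the threshold grows to 200MB so later submissions are larger.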

vulkan: optimize iq1 coopmat2 dequant functions (ggml-org#12427)

vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (ggml-org#12472)

Vulkan: RTE rounding for cpy to quant (ggml-org#12480)

Vulkan: RTE rounding for cpy to quant

Co-Authored-By: Jeff Bolz jbolz@nvidia.com

remove trailing whitespace
avoid duplicating pipeline_cpy_f32_quant
fix copypasting issue
remove duplicated code

Co-authored-by: Jeff Bolz jbolz@nvidia.com

vulkan: Optimize mul_mat_vec p021 and nc shaders (ggml-org#12505)

tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.

Conflicts:

tests/test-backend-ops.cpp

vulkan: fix mul_mat_vec failure in backend tests (ggml-org#12529)

The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.

vulkan: fix coopmat shader generation when cross-compiling (ggml-org#12272)

vulkan: fix coopmat shader generation when cross-compiling

Previously the status of coopmat{,2} support wasn't passed to the vulkan-shaders-gen project building on the host, which led to build failures because the cross-compiling code expected coopmat{,2} shaders that didn't get generated.

Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject.

Signed-off-by: Icenowy Zheng uwu@icenowy.me

Only call coop-mat shaders once
Fix whitespace

Signed-off-by: Icenowy Zheng uwu@icenowy.me Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com

cmake: improve Vulkan cooperative matrix support checks (whisper/2966)

Co-authored-by: Sandro Hanea me@sandro.rocks

cmake : fix whitespace (#0)

Vulkan: Add DP4A MMQ and Q8_1 quantization shader (ggml-org#12135)

Vulkan: Add DP4A MMQ and Q8_1 quantization shader
Add q4_0 x q8_1 matrix matrix multiplication support
Vulkan: Add int8 coopmat MMQ support
Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code
Add GL_EXT_integer_dot_product check
Remove ggml changes, fix mmq pipeline picker
Remove ggml changes, restore Intel coopmat behaviour
Fix glsl compile attempt when integer vec dot is not supported
Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq
Remove redundant comment
Fix integer dot check
Fix compile issue with unsupported int dot glslc
Update Windows build Vulkan SDK version

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/mul_mmq.comp

ggml/src/vulkan-shaders/mul_mmq_funcs.comp

ggml/src/vulkan-shaders/quantize_q8_1.comp

ggml/src/vulkan-shaders/test_integer_dot_support.comp

vulkan: fix build when glslc doesn't support coopmat (ggml-org#12683)

Vulkan: Fix mmq int dot float cache size (ggml-org#12722)

vulkan: Implement grouped query attention in the coopmat2 FA shader (ggml-org#12559)

When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:

dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))

previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each.
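The workgroup counts in this example follow from the ratio of Q batches to K/V batches. A small sketch of that arithmetic, where `FaDispatch` and `fa_dispatch` are hypothetical names and the Q count is assumed to divide evenly by the K/V count:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of how the attention work is split across workgroups.
struct FaDispatch {
    uint32_t workgroups;            // number of workgroups launched
    uint32_t results_per_workgroup; // Q batches handled by each workgroup
};

// q_batches: number of Q batches (32 in the example above).
// kv_batches: number of K/V batches (8 in the example above).
FaDispatch fa_dispatch(uint32_t q_batches, uint32_t kv_batches, bool use_gqa) {
    assert(kv_batches > 0 && q_batches % kv_batches == 0);
    const uint32_t gqa_ratio = q_batches / kv_batches;
    if (!use_gqa) {
        // One workgroup per Q batch.
        return { q_batches, 1u };
    }
    // Q batches that share the same K/V batch are grouped into one workgroup.
    return { kv_batches, gqa_ratio };
}
```

For 32 Q batches and 8 K/V batches this gives 32 workgroups with 1 result each without grouping, and 8 workgroups with 4 results each with grouping.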

This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.

cmake: remove caching from vulkan coopmat checks (ggml-org#12719)

vulkan: Implement split_k for coopmat2 flash attention. (ggml-org#12627)

When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
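With split_k, each workgroup handles a slice of the KV range, so the partial results have to be merged afterwards. The sketch below shows one standard way to merge partial softmax results by rescaling each slice to a common maximum, using a single query and scalar values for brevity. `Partial`, `attend_slice`, and `reduce_splits` are hypothetical names, and whether flash_attn_split_k_reduce.comp works exactly this way is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial attention result for one slice of the KV range.
struct Partial {
    float max; // largest score in the slice
    float sum; // sum_j exp(s_j - max)
    float acc; // sum_j exp(s_j - max) * v_j (unnormalized output)
};

// Process scores s[begin, end) and values v[begin, end) independently.
Partial attend_slice(const std::vector<float>& s, const std::vector<float>& v,
                     size_t begin, size_t end) {
    assert(begin < end && end <= s.size() && s.size() == v.size());
    Partial p{ s[begin], 0.0f, 0.0f };
    for (size_t j = begin; j < end; ++j) {
        p.max = std::fmax(p.max, s[j]);
    }
    for (size_t j = begin; j < end; ++j) {
        const float w = std::exp(s[j] - p.max);
        p.sum += w;
        p.acc += w * v[j];
    }
    return p;
}

// Merge the slices: rescale each one to the global maximum, then normalize.
float reduce_splits(const std::vector<Partial>& parts) {
    assert(!parts.empty());
    float global_max = parts[0].max;
    for (const Partial& p : parts) {
        global_max = std::fmax(global_max, p.max);
    }
    float sum = 0.0f;
    float acc = 0.0f;
    for (const Partial& p : parts) {
        const float scale = std::exp(p.max - global_max);
        sum += p.sum * scale;
        acc += p.acc * scale;
    }
    return acc / sum;
}
```

Splitting the KV range in two and reducing gives the same result, up to rounding, as processing the whole range in one slice.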

Conflicts:

ggml/src/vulkan-shaders/flash_attn_split_k_reduce.comp

vulkan: Fix missing cmake logic for dot product extension (ggml-org#12721)

vulkan: set cmake minimum and project name in vulkan-shaders (ggml-org#12744)

vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (ggml-org#12630)

There seems to be a bubble when waking up from waitForFences, which costs a few percent of performance and also increases the variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete; we waitForFences for the almost_ready fence and then spin (with _mm_pause) waiting for the final fence to be signaled.
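The control flow of this hybrid wait can be sketched without any Vulkan calls. In the sketch below the three callables are placeholders: in real code they would wrap something like a waitForFences call on the almost_ready fence, a getFenceStatus check on the final fence, and a pause instruction. `hybrid_wait` is a hypothetical name, not a function from ggml-vulkan.cpp.

```cpp
#include <cassert>
#include <functional>

// Sleep until the "almost ready" point, then busy-poll for final completion.
// Returns how many times the final fence was polled.
int hybrid_wait(const std::function<void()>& block_until_almost_ready,
                const std::function<bool()>& final_fence_signaled,
                const std::function<void()>& cpu_relax) {
    // Blocking wait: cheap on the CPU, but waking up has some latency.
    block_until_almost_ready();

    // Spin for the short remainder so completion is noticed right away.
    int polls = 1;
    while (!final_fence_signaled()) {
        cpu_relax();
        ++polls;
    }
    return polls;
}
```

The design trade-off is that the spin only covers the last portion of the work, so the extra CPU usage stays bounded while the wake-up latency of the blocking wait is hidden.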

Conflicts:

ggml/src/ggml-vulkan.cpp

cmake: fix ggml-shaders-gen compiler paths containing spaces (ggml-org#12747)

fixes error for compiler paths with spaces

Vulkan: Tune Vulkan mmq int dot shader for performance (ggml-org#12767)

vulkan: Use unclamped loads for flash attention mask (ggml-org#12720)

nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.

vulkan: fix NaN issue in flash attention shader (ggml-org#12776)

Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
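One way such a NaN can arise is a streaming softmax whose running maximum starts at -inf: if every score in a row is also -inf (for example, fully masked), the rescale step computes exp(-inf - (-inf)), which is NaN. The sketch below reproduces that on the CPU; `softmax_denominator` is a hypothetical helper, and whether this is the exact failure in the shader is an assumption.

```cpp
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>

// Streaming softmax denominator: keeps a running maximum and rescales the
// running sum whenever the maximum changes.
float softmax_denominator(const std::vector<float>& scores, float initial_max) {
    float running_max = initial_max;
    float running_sum = 0.0f;
    for (float s : scores) {
        const float new_max = std::fmax(running_max, s);
        // If running_max and new_max are both -inf, their difference is NaN.
        running_sum = running_sum * std::exp(running_max - new_max)
                    + std::exp(s - new_max);
        running_max = new_max;
    }
    return running_sum;
}
```

For a row of all -inf scores, an initial maximum of `-INFINITY` yields NaN, while `-FLT_MAX / 2` yields 0. For ordinary finite scores both initial values give the same sum.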

vulkan: Use fp16 for the flash attention P*V multiplication (ggml-org#12783)

This is consistent with the ggml-cuda behavior and the mul_mat fallback.

vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (ggml-org#12833)

q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap.

This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0.

The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.

vulkan: use aligned loads for flash attention mask (ggml-org#12853)

Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, to allow using more efficient loads.

vulkan: enable coopmat2 FA gqa and split_k optimizations more often (ggml-org#12931)

The grouped query attention optimization doesn't require a power of two ratio; the only thing relying on it was the modulo operation written as bitwise &.
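To illustrate why the bitwise form ties the code to power-of-two ratios, the sketch below compares `i % n` with `i & (n - 1)`. The function names are hypothetical and are not taken from the shader.

```cpp
#include <cassert>
#include <cstdint>

// Generic form: valid for any n > 0.
uint32_t index_in_group_mod(uint32_t i, uint32_t n) {
    assert(n > 0);
    return i % n;
}

// Bit trick: agrees with i % n only when n is a power of two.
uint32_t index_in_group_mask(uint32_t i, uint32_t n) {
    assert(n > 0);
    return i & (n - 1);
}
```

With a ratio of 4 the two forms agree for every index. With a ratio of 6, index 6 maps to 0 under modulo but to 4 under the mask, which is why the mask had to be replaced to support other ratios.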

split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.

vulkan: support noncontiguous rms_norm (ggml-org#13031)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: matmul gcn tuning (ggml-org#13016)

tune matmul for gcn
this one is more power efficient
Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

Co-authored-by: 0cc4m picard12@live.de

disable this tune for the proprietary driver

Co-authored-by: 0cc4m picard12@live.de

vulkan: use uint array index to avoid glslang bug (ggml-org#13193)

vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (ggml-org#13191)

vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader

vulkan: Add bfloat16 support (ggml-org#12554)

vulkan: Add bfloat16 support

This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension.

It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that.

The coopmat support also requires a glslc that supports the extension, which currently requires a custom build.

vulkan: Support bf16 tensors without the bf16 extension or coopmat support

Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available.
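Promoting bf16 to float is cheap because a bfloat16 value is the upper 16 bits of an IEEE 754 binary32 value. A minimal CPU-side sketch of that conversion, where `bf16_to_f32` is a hypothetical helper name and not the shader code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Widen a bfloat16 bit pattern to float by placing it in the high 16 bits.
float bf16_to_f32(uint16_t bits) {
    const uint32_t widened = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &widened, sizeof(out)); // bit cast without aliasing issues
    return out;
}
```

For example, the bit pattern 0x3F80 widens to 0x3F800000, which is 1.0f.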

vulkan: bfloat16 fixes (really works without bfloat16 support now)
vulkan: fix spirv-val failure and reenable -O

Conflicts:

ggml/src/vulkan-shaders/test_bfloat16_support.comp

vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)

Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)

This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:

GGML_ASSERT(nei0 * nei1 <= 3072);

The tensor is 8 x 512. Increase this array size to accommodate.

vulkan: scalar flash attention implementation (ggml-org#13324)

vulkan: scalar flash attention implementation
vulkan: always use fp32 for scalar flash attention
vulkan: use vector loads in scalar flash attention shader
vulkan: remove PV matrix, helps with register usage
vulkan: reduce register usage in scalar FA, but perf may be slightly worse
vulkan: load each Q value once. optimize O reduction. more tuning
vulkan: support q4_0/q8_0 KV in scalar FA
CI: increase timeout to accommodate newly-supported tests
vulkan: for scalar FA, select between 1 and 8 rows
vulkan: avoid using Float16 capability in scalar FA

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/flash_attn.comp

vulkan: workaround FA compile failures on macos (ggml-org#13517)

vulkan: KHR_coopmat flash attention (ggml-org#13506)

This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.

Conflicts:

ggml/src/vulkan-shaders/flash_attn_cm1.comp

cmake: simplify vulkan shader test logic (ggml-org#13263)

vulkan: use scalar FA rather than coopmat2 when N==1 (ggml-org#13554)

Add pipeline_acc_f32

vulkan: move common FA code to flash_attn_base.comp (ggml-org#13556)

vulkan: move common FA code to flash_attn_base.comp
vulkan: move common FA index/stride setup code to flash_attn_base.comp
build fix

Conflicts:

ggml/src/vulkan-shaders/flash_attn_base.comp

cmake: use the current build config for vulkan-shaders-gen (ggml-org#13595)

fix: use the current build config for vulkan-shaders-gen
fix: only pass a valid build type to --config

Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (ggml-org#13607)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: fix warnings (ggml-org#13626)

small fixes
remove ifdef

use LOG_WARN to replace std::cerr (ggml-org#13657)

vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (ggml-org#13696)

vulkan: support CPY from any type to itself (ggml-org#13695)

Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.
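A same-type copy never has to interpret the data, so it can be treated as a copy of raw 2-byte or 4-byte units. The sketch below shows one way to scale the element count by the type size; `RawCopy` and `raw_copy_plan` are hypothetical names, and choosing the unit from the block size in bytes is an assumption about how the selection could work.

```cpp
#include <cassert>
#include <cstdint>

// Which raw copy to use and how many units it has to move.
struct RawCopy {
    uint32_t unit_bytes; // 4 for an f32-style copy, 2 for an f16-style copy
    uint64_t unit_count; // number of units to copy
};

// n_elements: logical element count of the tensor.
// block_elements / block_bytes: elements per block and bytes per block of the
// type (for a non-quantized type the block holds a single element).
RawCopy raw_copy_plan(uint64_t n_elements, uint32_t block_elements, uint32_t block_bytes) {
    assert(block_elements > 0 && n_elements % block_elements == 0);
    assert(block_bytes % 2 == 0);
    const uint64_t total_bytes = n_elements / block_elements * block_bytes;
    // Prefer 4-byte units when every block is a multiple of 4 bytes.
    const uint32_t unit = (block_bytes % 4 == 0) ? 4u : 2u;
    return { unit, total_bytes / unit };
}
```

For instance, 64 elements of a type with 32 elements per 18-byte block (the q4_0 layout) occupy 36 bytes, so the plan is 18 two-byte units.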

add GGML_LOG_WARN

vulkan: mark IM2COL as supporting non-contig (ggml-org#13783)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: use timestamp queries for GGML_VULKAN_PERF (ggml-org#13817)

Also change it to be controlled by an env var rather than cmake flag

vulkan : Remove unexpected ; (ggml/1253)

vulkan: fix warnings in perf logger querypool code (ggml-org#13937)

ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (ggml-org#13813)

ggml-vulkan: adds op CONV_TRANSPOSE_1D
test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
Missing barrier added to shader. Number of additional tests reduced to 108.
Fixes typo in variable name.
Removes extra whitespaces.
Adds int64->int32 casts to prevent possible warnings.
Problem size reduced in tests to pass tests with llvmpipe.
supports_op condition moved from unintended position

Conflicts:

ggml/src/ggml-vulkan.cpp

ggml/src/vulkan-shaders/conv_transpose_1d.comp

vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (ggml-org#14001)

allowing B580 and U9-288V
experimenting code to detect Xe2
allowing coopmat only for Xe2 GPUs
fixed comment wording
fixed comment wording
removed unnecessary driver check

Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: force device 0 in CI (ggml-org#14106)

Add GGML_LOG_INFO

vulkan: Track descriptor pools/sets per-context (ggml-org#14109)

vulkan: Better thread-safety for command pools/buffers (ggml-org#14116)

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: mutex around vkQueueSubmit (ggml-org#14127)

This fixes the remaining crash in test-thread-safety on my system.

cmake: clean up external project logic for vulkan-shaders-gen (ggml-org#14179)

Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time

Conflicts:

.github/workflows/build.yml

cmake: remove shader-gen step-targets from ggml-vulkan (ggml-org#14226)

Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen

Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (ggml-org#14249)

Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (ggml-org#13792)

Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.

Conflicts:

ggml/src/ggml-vulkan.cpp

vulkan: update windows SDK in CI (ggml-org#14334)

vulkan: update windows SDK in release.yml (ggml-org#14344)

Conflicts:

.github/workflows/release.yml

cmake: regen vulkan shaders when shaders-gen sources change (ggml-org#14398)

Add shaders-gen sources as target deps

vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)

This setting needs to be passed through to vulkan-shaders-gen

vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
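The general pattern behind this change is to take a mutex around every read and write of a shared vector. A minimal sketch follows, in which `PinnedRegistry`, its members, and its methods are hypothetical stand-ins and not the actual types in ggml-vulkan.cpp:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical registry of pinned host allocations shared between threads.
class PinnedRegistry {
public:
    void add(const void* ptr, size_t size) {
        std::lock_guard<std::mutex> guard(mutex_);
        entries_.push_back({ reinterpret_cast<uintptr_t>(ptr), size });
    }

    // Removes the allocation that starts at ptr; returns false if not found.
    bool remove(const void* ptr) {
        std::lock_guard<std::mutex> guard(mutex_);
        const uintptr_t base = reinterpret_cast<uintptr_t>(ptr);
        for (size_t i = 0; i < entries_.size(); ++i) {
            if (entries_[i].base == base) {
                entries_.erase(entries_.begin() + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    // True if ptr lies inside any registered allocation. Lookups lock too,
    // because another thread may be resizing the vector at the same time.
    bool contains(const void* ptr) const {
        std::lock_guard<std::mutex> guard(mutex_);
        const uintptr_t addr = reinterpret_cast<uintptr_t>(ptr);
        for (const Entry& e : entries_) {
            if (addr >= e.base && addr - e.base < e.size) {
                return true;
            }
        }
        return false;
    }

private:
    struct Entry {
        uintptr_t base;
        size_t    size;
    };

    mutable std::mutex mutex_; // mutable so const lookups can lock
    std::vector<Entry> entries_;
};
```

The lookups need the lock as much as the writers do: a `push_back` on another thread can reallocate the vector's storage while a reader is iterating over it.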

vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)

Fix cuda build error

test

remove new cpu backend and yml files
remove new op and GGML_ROPE_TYPE_NEOX
fix build error
change cmake file to add matrix operation
remove coopmat2 check in flash attention
print gpu info for vulkan
disable fuse to recover vulkan performance

Co-authored-by: 0cc4m picard12@live.de Co-authored-by: firecoperana

ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request

May 6, 2026

phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request

May 29, 2026

fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request

May 30, 2026

AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request

Jun 2, 2026

AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request

Jun 2, 2026
