vulkan: lock accesses of pinned_memory vector by jeffbolznv · Pull Request #14333 · ggml-org/llama.cpp
github-actions Bot added Vulkan
Issues specific to the Vulkan backend
changes relating to the ggml tensor library for machine learning
labels
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request
CANN: Enable labeler for Ascend NPU (#13914)
add geglu activation function (#14074)
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp
sycl: Add reorder to Q6_K mmvq implementation (#13885)
Add Reorder to Q6_K mmvq implementation
Address PR comments: clean up comments
Remove unused parameter after refactoring q4_k
Adding inline to function and removing unnecessary reference to int
Signed-off-by: nscipione nicolo.scipione@codeplay.com
- server : fix LRU check (#14079)
ggml-ci
webui: fix sidebar being covered by main content (#14082)
webui: fix sidebar being covered by main content
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- webui: update index.html.gz
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
CANN: Simplify the environment variable setting(#13104)
Simplify the environment variable setting to specify the memory pool type.
Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
update
fix CI
update
delete whitespace
fix according to review
update CANN.md
update CANN.md
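As a rough illustration of the case-insensitive matching described for GGML_CANN_ASYNC_MODE, the check could look like the sketch below. The helper names `parse_truthy` and `env_flag_enabled` are hypothetical and are not taken from the CANN backend; only the accepted values (yes, enable, 1, on) come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if `value` is one of the accepted "enabled"
// spellings (yes, enable, 1, on), compared case-insensitively.
static bool parse_truthy(std::string value) {
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return value == "yes" || value == "enable" || value == "1" || value == "on";
}

// Hypothetical helper: reads an environment variable and treats an unset
// variable as disabled.
static bool env_flag_enabled(const char * name) {
    assert(name != nullptr);
    const char * raw = std::getenv(name);
    return raw != nullptr && parse_truthy(raw);
}
```

With this sketch, `env_flag_enabled("GGML_CANN_ASYNC_MODE")` would return true for values such as `ON` or `Enable`, and false for anything else.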
graph : fix geglu (#14077)
ggml-ci
cuda : fix device sync on buffer clear (#14033)
ggml-cpu : split arch-specific implementations (#13892)
move ggml-cpu-aarch64 to repack
split quantize_row_q8_0/1
split helper functions
split ggml_vec_dot_q4_0_q8_0
split ggml_vec_dot_q4_1_q8_1
split ggml_vec_dot_q5_0_q8_0
split ggml_vec_dot_q5_1_q8_1
split ggml_vec_dot_q8_0_q8_0
split ggml_vec_dot_tq1_0_q8_K
split ggml_vec_dot_tq2_0_q8_K
split ggml_vec_dot_q2_K_q8_K
split ggml_vec_dot_q3_K_q8_K
split ggml_vec_dot_q4_K_q8_K
split ggml_vec_dot_q5_K_q8_K
split ggml_vec_dot_q6_K_q8_K
split ggml_vec_dot_iq2_xxs_q8_K
split ggml_vec_dot_iq2_xs_q8_K
split ggml_vec_dot_iq2_s_q8_K
split ggml_vec_dot_iq3_xxs_q8_K
split ggml_vec_dot_iq3_s_q8_K
split ggml_vec_dot_iq1_s_q8_K
split ggml_vec_dot_iq1_m_q8_K
split ggml_vec_dot_iq4_nl_q8_0
split ggml_vec_dot_iq4_xs_q8_K
fix typos
fix missing prototypes
rename ggml-cpu-quants.c
rename ggml-cpu-traits
rename arm folder
move cpu-feats-x86.cpp
rename ggml-cpu-hbm
update arm detection macro in quants.c
move iq quant tables
split ggml_quantize_mat_q8_0/K
split ggml_gemv_*
split ggml_gemm_*
rename namespace aarch64 to repack
use weak aliases to replace test macros
rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
rename more aarch64 to repack
clean up rebase leftover
fix compilation errors
remove trailing spaces
try to fix clang compilation errors
try to fix clang compilation errors again
try to fix clang compilation errors, 3rd attempt
try to fix clang compilation errors, 4th attempt
try to fix clang compilation errors, 5th attempt
try to fix clang compilation errors, 6th attempt
try to fix clang compilation errors, 7th attempt
try to fix clang compilation errors, 8th attempt
try to fix clang compilation errors, 9th attempt
more cleanup
fix compilation errors
fix apple targets
fix a typo in arm version of ggml_vec_dot_q4_K_q8_K
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
llama : allow building all tests on windows when not using shared libs (#13980)
llama : allow building all tests on windows when not using shared libraries
add static windows build to ci
tests : enable debug logs for test-chat
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
kv-cache : fix shift and defrag logic (#14081)
kv-cache : fix shift
ggml-ci
- cont : reset shift[i]
ggml-ci
- cont : fix defrag erasing cells that didn't move
ggml-ci
metal : use less stack memory in FA kernel (#14088)
metal : use less stack memory in FA kernel
ggml-ci
cont : fix BF16 variant
Add in-build ggml::ggml ALIAS library (ggml/1260)
Enable uniform linking with subproject and with find_package.
- sync : ggml
ggml-ci
rpc : nicer error messages for RPC server crash (#14076)
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099)
ggml : fix weak alias win32 (whisper/0)
ggml-ci
- sync : ggml
ggml-ci
Fixed spec timings to: accepted/tested instead of accepted/drafted (#14104)
vulkan: force device 0 in CI (#14106)
llama : support GEGLU for jina-bert-v2 (#14090)
convert : fix duplicate key DeepSeek-R1 conversion error (#14103)
kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
kv-cache : avoid modifying recurrent cells when setting inputs
kv-cache : remove inp_s_mask
It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.
- kv-cache : fix non-consecutive token pos warning for recurrent models
The problem was apparently caused by how the tail cells were swapped.
graph : simplify logic for recurrent state copies
kv-cache : use cell without src refs for rs_z in recurrent cache
llama-graph : fix recurrent state copy
The state_copy shuffle assumes everything is moved at once,
which is not true when states_extra is copied back to the cache
before copying the range of states between head and head + n_seqs.
This is only a problem if any of the cells in [head, head + n_seqs)
have an src in [head + n_seqs, head + n_kv),
which does happen when n_ubatch > 1 in the llama-parallel example.
Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
- llama-graph : rename n_state to state_size in build_recurrent_state
This naming should reduce confusion between the state size and the number of states.
opencl: add mul_mv_id_q4_0_f32_8x_flat (#14003)
vulkan: Track descriptor pools/sets per-context (#14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.
kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121)
server : pass default --keep argument (#14120)
kv-cache : relax SWA masking condition (#14119)
ggml-ci
webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
webui: Wrap long numbers instead of infinite horizontal scroll
Use tailwind class
update index.html.gz
vulkan: Better thread-safety for command pools/buffers (#14116)
This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.
tests : add test-tokenizers-repo (#14017)
chore : clean up relative source dir paths (#14128)
Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)
ggml-cpu: Factor out feature detection build from x86
ggml-cpu: Add ARM feature detection and scoring
This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_ which need to be set in cmake, instead of GGML_ that users would set for x86.
This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.
- ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM
Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_ as we don't have GGML_.
Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.
- ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now
The other platforms will need their own specific variants.
This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.
common: fix issue with regex_escape routine on windows (#14133)
context : round n_tokens to next multiple of n_seqs when reserving (#14140)
This fixes RWKV inference which otherwise failed when the worst case ubatch.n_seq_tokens rounded to 0.
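The arithmetic behind this fix can be sketched as follows. The function names are hypothetical and the real reservation code may differ; the sketch only shows why rounding n_tokens up to a multiple of n_seqs keeps the per-sequence token count from truncating to 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round n_tokens up to the next multiple of n_seqs.
static uint32_t round_up_to_multiple(uint32_t n_tokens, uint32_t n_seqs) {
    assert(n_seqs > 0);
    return ((n_tokens + n_seqs - 1) / n_seqs) * n_seqs;
}

// Tokens per sequence after rounding. Plain integer division
// (n_tokens / n_seqs) would yield 0 whenever n_tokens < n_seqs.
static uint32_t seq_tokens_after_rounding(uint32_t n_tokens, uint32_t n_seqs) {
    return round_up_to_multiple(n_tokens, n_seqs) / n_seqs;
}
```

For example, with 3 tokens and 4 sequences, plain division gives 0 tokens per sequence, while the rounded value gives 1.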
- kv-cache : fix split_equal handling in unified implementation (#14130)
ggml-ci
cmake : handle whitespaces in path during metal build (#14126)
cmake : handle whitespaces in path during metal build
ggml-ci
- cont : proper fix
ggml-ci
Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com
- batch : remove logits_all flag (#14141)
ggml-ci
context : simplify output counting logic during decode (#14142)
batch : remove logits_all flag
ggml-ci
- context : simplify output counting logic during decode
ggml-ci
cont : fix comments
server : re-enable SWA speculative decoding (#14131)
ggml-ci
readme : remove project status link (#14149)
sycl: Remove not needed copy f16->f32 for dnnl mul mat (#14125)
vocab : prevent heap overflow when vocab is too small (#14145)
ggml-ci
cmake : Improve build-info.cpp generation (#14156)
cmake: Simplify build-info.cpp generation
The rebuild of build-info.cpp still gets triggered when .git/index gets changes.
cmake: generate build-info.cpp in build dir
SYCL: Bump oneMath commit (#14152)
Update oneMath commit to merged PR https://github.com/uxlfoundation/oneMath/pull/669 which adds SYCL-Graph support for recording CUDA BLAS commands.
With this change the MUL_MAT tests now pass on DPC++ CUDA backends with SYCL-Graph
enabled. Prior to this change, an error would be thrown.
$ GGML_SYCL_DISABLE_GRAPH=0 ./bin/test-backend-ops -b SYCL0 -o MUL_MAT -p type_a=f16,type_b=f32,m=16,n=1,k=256,bs=\\[1,1\\],nr=\\[2
UR CUDA ERROR:
Value: 700
Name: CUDA_ERROR_ILLEGAL_ADDRESS
Description: an illegal memory access was encountered
Function: operator()
Source Location: $HOME/dpcpp/unified-runtime/source/adapters/cuda/queue.cpp:154
Native API failed. Native API returns: 2147483646 (UR_RESULT_ERROR_UNKNOWN)
Exception caught at file:$HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp, line:3598, func:operator()
SYCL error: CHECK_TRY_ERROR((stream)->wait()): Meet error in this line code!
in function ggml_backend_sycl_synchronize at $HOME/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp:3598
$HOME/llama.cpp/ggml/src/ggml-sycl/../ggml-sycl/common.hpp:118: SYCL error
Could not attach to process. If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user. For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.
sycl: Adding additional cpy dbg print output (#14034)
server : fix SWA condition for full context reprocess (#14163)
ggml-ci
- pooling : make cls_b and cls_out_b optional (#14165)
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
readme : remove survey link (#14168)
batch : rework llama_batch_allocr (#14153)
batch : rework llama_batch_allocr
ggml-ci
- cont : move validation inside class
ggml-ci
- cont : move output counting to class
ggml-ci
- cont : minor
ggml-ci
- batch : add TODOs
ggml-ci
docs : Update multimodal.md (#14122)
Update multimodal.md
Update multimodal.md
batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
batch : add LLAMA_BATCH_DEBUG environment variable
ggml-ci
cont : improve seq_id display
Merge commit from fork
vocab : prevent integer overflow during load
Add static cast and GGML_ABORT
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
sycl: fix docker image (#14144)
vocab : fix build (#14175)
ggml-ci
compare-llama-bench: add option to plot (#14169)
compare llama-bench: add option to plot
Address review comments: convert case + add type hints
Add matplotlib to requirements
fix tests
Improve comment and fix assert condition for test
Add back default test_name, add --plot_log_scale
use log_scale regardless of x_values
llama-chat : Do not throw when tool parsing fails (#14012)
Currently, when a model generates output that looks like a tool call but is invalid, an exception is thrown and not handled, causing the CLI or llama-server to bail. Instead, handle the chat parser exception and simply return the generated text in such cases.
Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com
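The fallback described above can be sketched as a try/catch around a strict parser. Everything here is hypothetical: the `<tool_call>` tag format, the `parsed_message` struct and both function names are made up for illustration and do not reflect the actual llama-chat parser.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

struct parsed_message {
    std::string content;
    bool        has_tool_call;
};

// Hypothetical strict parser: throws when text starts like a tool call
// but is not properly terminated.
static parsed_message parse_tool_call_strict(const std::string & text) {
    const std::string prefix = "<tool_call>";
    const std::string suffix = "</tool_call>";
    if (text.compare(0, prefix.size(), prefix) != 0) {
        return { text, false }; // ordinary text, nothing to parse
    }
    if (text.size() < prefix.size() + suffix.size() ||
        text.compare(text.size() - suffix.size(), suffix.size(), suffix) != 0) {
        throw std::runtime_error("unterminated tool call");
    }
    return { text.substr(prefix.size(), text.size() - prefix.size() - suffix.size()), true };
}

// Fallback wrapper: on a parse failure, return the generated text unchanged
// instead of letting the exception propagate to the caller.
static parsed_message parse_or_plain_text(const std::string & text) {
    try {
        return parse_tool_call_strict(text);
    } catch (const std::exception &) {
        return { text, false };
    }
}
```

The design choice is that a malformed tool call degrades to plain text output rather than terminating the process.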
docs : remove WIP since PR has been merged (#13912)
batch : auto-gen positions + verify multi-sequence input (#14177)
batch : verify multi-sequence input batches
ggml-ci
- cont : auto-gen positions + verify multi-seq input
ggml-ci
- cont : first print debug info, then perform validation
ggml-ci
- cont : fix position auto-gen + add comments
ggml-ci
- cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggml-ci
- model : add dots.llm1 architecture support (#14044) (#14118)
Adds:
Dots1Model to convert_hf_to_gguf.py
Computation graph code to llama-model.cpp
Chat template to llama-chat.cpp to detect this model's template.
The architecture is called "dots.llm1" (I generally shortened it to dots1 or DOTS1 in the code).
The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:
The model architecture is a combination of Qwen and Deepseek parts, as seen here:
- kv-cache : fix use-after-move of defrag info (#14189)
ggml-ci
HIP: Replace usage of deprecated preprocessor macro AMDGCN_WAVEFRONT_SIZE (#14183)
CUDA/HIP: fix ssm_scan on devices where warp size is not 32 (#14196)
quantize : change int to unsigned int for KV overrides (#14197)
server : When listening on a unix domain socket don't print http:// and port (#14180)
Instead show something like this:
main: server is listening on file.sock - starting the main loop
Signed-off-by: Eric Curtin ecurtin@redhat.com
model : Add support for Arcee AI's upcoming AFM model (#14185)
Add Arcee AFM support
Add draft update code
Fix linter and update URL, may still not be final
Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com
- Remove accidental blank line
Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com
ggml-cpu : rework weak alias on apple targets (#14146)
ggml-cpu : rework weak alias on apple targets
fix powerpc detection
fix ppc detection
fix powerpc detection on darwin
vulkan: mutex around vkQueueSubmit (#14127)
This fixes the remaining crash in test-thread-safety on my system.
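The general pattern of guarding a shared submission path with a mutex can be sketched without any Vulkan dependency. The `fake_queue` type and both functions are hypothetical stand-ins: in the real backend the protected call is vkQueueSubmit, and the data structures differ.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for a device queue shared between contexts.
struct fake_queue {
    std::mutex       mtx;        // serializes submissions
    std::vector<int> submitted;  // not thread-safe on its own
};

// Every submission takes the queue's mutex for the duration of the call.
static void submit_locked(fake_queue & q, int work_id) {
    std::lock_guard<std::mutex> guard(q.mtx);
    q.submitted.push_back(work_id);
}

// Submits from several threads at once and returns how many submissions
// were recorded.
static std::size_t submit_from_threads(fake_queue & q, int n_threads, int per_thread) {
    assert(n_threads > 0 && per_thread > 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&q, t, per_thread] {
            for (int i = 0; i < per_thread; ++i) {
                submit_locked(q, t * per_thread + i);
            }
        });
    }
    for (auto & w : workers) {
        w.join();
    }
    return q.submitted.size();
}
```

Without the lock, concurrent `push_back` calls on the same vector would be a data race; with it, every submission is recorded.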
- gguf-py : allow key override when adding value to GGUFWriter (#14194)
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp
convert : remove arcee change in convert_hf_to_gguf_update.py (#14207)
ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206)
llama : rework embeddings logic (#14208)
llama : rework embeddings logic
ggml-ci
- cont : fix rerank
ggml-ci
cont : engrish [no ci]
cont : fix rerank
ggml-ci
- server : support both embeddings and completions with single model
ggml-ci
- cont : avoid embeddings_org
ggml-ci
HIP: disable rocwmma on gfx12 by default until rocm 7.0 (#14202)
model : add NeoBERT (#14164)
convert neobert model to gguf
add inference graph
fix flake8 lint
followed reviewer suggestions
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- follow reviewers suggestions
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- override NeoBERT feed-forward length
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com
cmake: clean up external project logic for vulkan-shaders-gen (#14179)
Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time
llama : add thread safety test (#14035)
llama : add thread safety test
llamafile : remove global state
llama : better LLAMA_SPLIT_MODE_NONE logic
when main_gpu < 0 GPU devices are not used
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
server : fix incorrect usage of llama_get_embeddings() (#14225)
server : fix incorrect usage of llama_get_embeddings()
ggml-ci
- cont : fix the fix
ggml-ci
common : suggest --jinja when autodetection fails (#14222)
musa: fix build warning (unused variable) (#14231)
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
ggml-cpu : remove the weak alias trick (#14221)
cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen
examples : include examples in msvc disable warn (ggml/1270)
This commit adds the examples in the "list" of targets to ignore MSVC warnings.
The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.
- ggml : remove unused ggml_context_container (ggml/1272)
This commit removes the unused ggml_context_container structure from
the ggml library. It looks like the usage of this struct was removed in
Commit 4757fe18d56ec11bf9c07feaca6e9d5b5357e7f4 ("ggml : alloc
ggml_contexts on the heap (whisper/2525)").
The motivation for this change is to improve code clarity/readability.
ggml : disable warnings for tests when using MSVC (ggml/1273)
ggml : disable warnings for tests when using MSVC
This commit disables warnings for tests on windows when using MSVC.
The motivation for this is that it brings the build output more in line with what Linux/macOS systems produce.
There is still one warning generated for the tests which is:
Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
test-arange.cpp
test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe
ggml : fix typo in tests disable list
sync : ggml
ggml-ci
convert : fix null head_dim AutoConfig regression (#14248)
llama-chat : fix multiple system message for gemma, orion (#14246)
mtmd : refactor llava-uhd preprocessing logic (#14247)
mtmd : refactor llava-uhd preprocessing logic
fix editorconfig
ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258)
ggml-cpu: fix uncaught underscore terminators (#14023)
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reduce asm calls for hsum (#14037)
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
docs: add s390x build documentation (#14264)
docs: add s390x-specific build docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x model conversion steps
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x build indent
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update hyperlinks for s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update llama.h docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x add accelerator and perf optimizations
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x indent blocks
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: revert block indentation
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add support information for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x reword
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: remove indentation for accelerator section s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: remove redundant words s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: reword for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x reword simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: fix trailing whitespace for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : add mean kernel (#14267)
metal : add mean kernel
ggml-ci
- cont : dedup implementation
ggml-ci
memory : Hybrid recurrent cache (#13979)
feat: Add llama_model_is_hybrid API call
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
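The delegation described in this commit can be sketched as below. All type and function names are simplified, hypothetical stand-ins for the real llama.cpp definitions; the point is only that the architecture-level predicate needs no model instance, so it can be called before the model is initialized.

```cpp
#include <cassert>

// Hypothetical, reduced architecture enum.
enum arch_sketch { ARCH_SKETCH_LLAMA, ARCH_SKETCH_MAMBA, ARCH_SKETCH_RWKV6 };

// Architecture-level predicate: depends only on the enum value.
static bool arch_is_recurrent_sketch(arch_sketch arch) {
    switch (arch) {
        case ARCH_SKETCH_MAMBA:
        case ARCH_SKETCH_RWKV6:
            return true;
        default:
            return false;
    }
}

// Hypothetical, reduced model struct.
struct model_sketch {
    arch_sketch arch;
};

// Model-level predicate: delegates to the architecture-level one.
static bool model_is_recurrent_sketch(const model_sketch * model) {
    assert(model != nullptr);
    return arch_is_recurrent_sketch(model->arch);
}
```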
- feat: Add c++ side constants for attention layer indices hparam
Branch: GraniteFour
- feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: rename *_is_hybrid -> *_is_hybrid_recurrent
The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."
Branch: HybridCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add layer filter to recurrent cache
Branch: HybridCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer sizing everywhere in kv caches
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: First pass at llama_kv_cache_hybrid_recurrent
This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches.
This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Construct hybrid recurrent cache for hybrid recurrent models
This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix wrong bool condition for split equal in hybrid cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix shift logic to defer to unified cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Support hybrid recurrent in llama-graph
NOTE: I intentionally did not add support for s_mask since it will be going away soon
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix logic for initializing inputs and attn layers for hybrid caches
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Update recurrent cache for changes to remove intermediate kv_cache interface
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix status for init_update sig for recurrent cache state
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Add missing padding to n_ctx for hybrid cache construction
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Update clear signature for data argument after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove errant virtual destructor leftover from previous impl attempt
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove n_embd_k/v_s from unified cache
No longer needed now that unified isn't also supporting recurrent
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069
Branch: HybridRecurrentCache
- refactor: Remove layer index from n_embd_k/v_s
Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove n_embd_k/v_gqa from recurrent cache
This is no longer needed now that there are separate implementations
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Allow custom layer filters for hybrid recurrent
This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove logits_all after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove llama_model_is_hybrid_Recurrent public API
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Use llama_memory_state_ptr for child states in hybrid memory state
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738
This is a big overhaul to bring consistency between how inputs and per- layer components are created for attention layers and recurrent layers. The main changes are:
- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
- Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input
This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix resize vs reserve and skip null tensors in size computation
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada
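The commit message does not say which call replaced which, so the following is only a general illustration of the two issues it names: the difference between `resize` and `reserve` on a `std::vector`, and skipping null entries when summing sizes. The `fake_tensor` type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor with a known byte size.
struct fake_tensor {
    std::size_t nbytes;
};

// Sums the sizes of all tensors, skipping slots that were never filled.
static std::size_t total_bytes(const std::vector<const fake_tensor *> & tensors) {
    std::size_t total = 0;
    for (const fake_tensor * t : tensors) {
        if (t == nullptr) {
            continue; // e.g. a layer that has no tensor of this kind
        }
        total += t->nbytes;
    }
    return total;
}

// resize() creates n value-initialized (null) elements that can be indexed;
// reserve() would only allocate capacity and leave size() at 0.
static std::vector<const fake_tensor *> make_slots(std::size_t n_layers) {
    std::vector<const fake_tensor *> slots;
    slots.resize(n_layers);
    return slots;
}
```

Indexing into a vector that was only reserved is undefined behavior, which is why the distinction matters when slots are filled by layer index.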
- fix: Fix initialization of child states
Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Use a common build_recurrent_state method that is cache-agnostic
This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- recurrent : rework graph inputs + add TODOs
ggml-ci
- refactor: Make status and child states const in hybrid and iswa
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache
This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor!: Rename all k/v related values for recurrent/hybrid to r/s
Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: _recurrent -> _recr for brevity
It just happens to have the same number of letters as _attn!
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix spacing for ref
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: recurrent_layer() -> is_recurrent()
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix spacing for size_s_bytes declaration
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249)
llamafile : support s390x SIMD instruction set (#14273)
convert : fix remote option in Windows (#14100)
llama-bench : add --no-warmup flag (#14224) (#14270)
Add no_warmup parameter to cmd_params struct and command-line parsing to allow users to skip warmup runs before benchmarking.
Add no_warmup boolean field to cmd_params struct
Add --no-warmup command-line argument parsing
Add help text documentation for the new flag
Wrap existing warmup logic in conditional check
Maintain full backward compatibility (warmup enabled by default)
Addresses #14224
- sycl: Cleanup codepaths in Get Rows in sycl backend (#14215)
Addresses unused reorder path
build : suppress gcc15 compile warnings (#14261)
Change _contains_any() substrs to std::string_view and fix the find comparison logic.
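As an illustration of the std::string_view change above, here is a minimal sketch of a substring check. The helper name `contains_any` and its signature are assumptions made for this example, not the project's actual code. It shows the usual way to test the result of `find()`, which is to compare against `npos`; whether that was the exact bug fixed in the commit is not stated in the source.

```cpp
#include <cassert>
#include <initializer_list>
#include <string_view>

// Hypothetical helper modeled on the commit message; not the real _contains_any().
// Returns true if `str` contains at least one of the given substrings.
static bool contains_any(std::string_view str,
                         std::initializer_list<std::string_view> substrs) {
    for (std::string_view s : substrs) {
        // find() returns npos when nothing matches. A match at index 0 is
        // still a match, so the result is compared against npos explicitly.
        if (str.find(s) != std::string_view::npos) {
            return true;
        }
    }
    return false;
}
```

Taking `std::string_view` avoids allocating a `std::string` for each literal passed in.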
server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 27946957+aa956@users.noreply.github.com
gguf-py : make sentencepiece optional (#14200)
Make sentencepiece optional
Bump to 0.18.0
Bump patch instead of minor
Co-authored-by: compilade git@compilade.net
- ggml-cpu : remove unnecessary arm feature detection (#14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
CUDA: add conv_2d_dw (#14265)
CUDA: add conv_2d_dw
better naming
simplify using template
Review: fix operation ordering in ggml-cuda, use forceinline, use more const
ubatch : new splitting logic (#14217)
ggml-ci
model : more uniform output id handling (#14275)
model : more uniform output id handling
ggml-ci
- cont : revert n_outputs < n_tokens optimization
ggml-ci
- cont : fix out_ids initialization
ggml-ci
ggml: Update KleidiAI to v1.9.0 (#14277)
ggml : fix repack work size for mul_mat_id (#14292)
ggml-ci
- cuda : synchronize graph capture and cublas handle destruction (#14288)
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
llama : improve sep token handling (#14272)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
Add PowerPC feature detection and scoring
ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
ggml-cpu: Delay some initializations until function is called
When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.
Co-authored-by: Diego Devesa slarengh@gmail.com
sycl: add usage of enqueue_functions extension (#14244)
Add header and namespace to use enqueue_functions extension
Convert submit and parallel_for to use new extension in convert.cpp
Convert submit and parallel_for to use extension in ggml-sycl.cpp
Convert submit and parallel_for to use extension in gla.cpp
Convert submit and parallel_for in mmq.cpp
Convert submit and parallel_for in mmvq.cpp
Convert submit and parallel_for in remaining files
Convert all simple parallel_for to nd_launch from enqueue_functions extension
Wrapping extension in general function
Create a general function that uses the enqueue_functions extension if it is enabled in the compiler, and otherwise calls the general SYCL function to launch kernels.
Signed-off-by: nscipione nicolo.scipione@codeplay.com
vocab : prevent tokenizer overflow (#14301)
vocab : prevent stack overflow in tokenize
vocab : return error instead of aborting on oversized token count
vocab : INT32_MIN from llama_tokenize on overflow
lint : remove trailing whitespace (#14304)
CUDA: add conv_2d_transpose (#14287)
CUDA: add conv_2d_transpose
remove direct include of cuda_fp16
Review: add brackets for readability, remove ggml_set_param and add asserts
docs : fix the link to llama.h (#14293)
Add ggml_roll (ggml/1274)
ggml : add ggml_roll
use set/get_op_params & std::min
sync : ggml
ggml-ci
convert : fix Llama 4 conversion (#14311)
memory : rename interface to llama_memory_context_i (#14296)
memory : rename interface to llama_memory_context_i
ggml-ci
cont : fix comments
cont : use "mctx" for referencing a memory context
ggml-ci
- metal : fix thread-safety (#14300)
ggml-ci
gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.
gguf-py : fix Qwen3-Embedding eos token (#14314)
CUDA: add mean operation (#14313)
CUDA: add mean operation
add back sum_rows_f32_cuda
Review: early exit if col!=0
common : use std::string_view now that we target c++17 (#14319)
mtmd : fix Pixtral OOM with large images by capping image_size to 1024 (#14326)
Mistral Small 2506 models using Pixtral vision encoder were running out of GPU memory when processing images larger than 1024x1024 pixels due to exponential memory growth from unlimited image size.
This fix applies the same 1024x1024 limit used by Qwen2VL models to prevent OOM issues while maintaining compatibility with existing models.
HIP: enable vec fattn on RDNA4 (#14323)
examples : fix is_first logic for tokenization (#14329)
ggml-ci
run : avoid double tokenization (#14327)
run : avoid double tokenization by adopting common_tokenize heuristic
build : fix windows gcc and clang warnings
lint : fixed trailing whitespace
run : fix is_first flag
gguf-py : fix SpecialVocab parsing when post_processor is null (#14330)
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
vulkan: update windows SDK in CI (#14334)
kv-cells : fix tracking of seq_pos (#14339)
kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
- cont : improve error message
ggml-ci
cont : add more comments
CUDA: mul_mat_v support for batch sizes > 1 (#14262)
CUDA: mul_mat_v support for batch sizes > 1
use 64 bit math for initial offset calculation
llama : better rwkv chat template and add missing inputs.use_jinja setting (#14336)
llama-cli : add missing inputs.use_jinja setting
Signed-off-by: Molly Sophia mollysophia379@gmail.com
- llama : better legacy chat template for rwkv
Signed-off-by: Molly Sophia mollysophia379@gmail.com
Signed-off-by: Molly Sophia mollysophia379@gmail.com
vulkan: update windows SDK in release.yml (#14344)
ci: add workflow for relocatable cmake package (#14346)
CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- jinja : Add Mistral-Small-3.2-24B-Instruct-2506.jinja (#14349)
This will allow the use of tools on the llama-server
main : honor --verbose-prompt on interactive prompts (#14350)
server : move no API key doc to /health (#14352)
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362)
batch : fix check for empty sequences in memory (#14364)
batch : fix check for empty sequences in memory
ggml-ci
- cont : reuse the var
ggml-ci
opencl: ref count ggml_backend_opencl_context and refactor profiling (#14254)
Move profiling info into ggml_backend_opencl_context
Add enqueue_ndrange_kernel to launch kernel
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
- ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
- ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
- ggml-cpu: better variable names
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
- docs: update s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
- ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove sigint from fp16 store
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rework noop
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: disable fp32->fp16 nnpa conversions for now
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix typo
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add missing func
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: diagnose why NNPA macro is not being defined
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: update macro tests
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test more macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add debug prints
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)
- ggml-cpu: move things around
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove mistaken fallback macro
fallback logic was already implemented but i was too sleepy to realise
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
- ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
fix ggml time initialization
fix f32_f16 table init
remove extra line
Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de
docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup
ggml-ci
- metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
- metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
ref: #8366
use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst
ggml-ci
ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation
ggml-ci
ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR
ggml-ci
ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
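The row-copy semantics described for ggml_set_rows(a, b, c) in #14274 (copy rows from 'b' into 'a' using indices from 'c') can be sketched on plain arrays. This is a reference model only: the function name `set_rows_ref`, the flat row-major layout, and the bounds checks are assumptions for illustration, and it does not use the ggml API or cover broadcasting or quantized destinations.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference model, not the ggml implementation.
// dst holds some number of rows of n_cols floats (row-major).
// src holds idx.size() rows of n_cols floats.
// Row i of src is copied into row idx[i] of dst.
static void set_rows_ref(std::vector<float> & dst,
                         const std::vector<float> & src,
                         const std::vector<int64_t> & idx,
                         size_t n_cols) {
    assert(src.size() == idx.size() * n_cols);
    for (size_t i = 0; i < idx.size(); ++i) {
        const int64_t r = idx[i]; // 64-bit indices, as in the commit ("use I64 for indices")
        assert(r >= 0 && (static_cast<size_t>(r) + 1) * n_cols <= dst.size());
        for (size_t c = 0; c < n_cols; ++c) {
            dst[static_cast<size_t>(r) * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

In this model, rows of `dst` that are not named in `idx` keep their previous contents.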
- recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
- graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
- vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com
vulkan: lock accesses of pinned_memory vector (#14333)
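A minimal sketch of the idea named in this commit title, guarding every read and write of a shared vector of pinned allocations with a mutex. The type `pinned_registry`, its methods, and the entry layout are hypothetical stand-ins; the actual Vulkan backend structures are not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a device object that tracks pinned host allocations.
struct pinned_registry {
    struct entry { void * ptr; size_t size; };

    void add(void * ptr, size_t size) {
        std::lock_guard<std::mutex> guard(mtx);
        pinned_memory.push_back({ptr, size});
    }

    bool remove(void * ptr) {
        std::lock_guard<std::mutex> guard(mtx);
        for (auto it = pinned_memory.begin(); it != pinned_memory.end(); ++it) {
            if (it->ptr == ptr) {
                pinned_memory.erase(it);
                return true;
            }
        }
        return false;
    }

    // True if [ptr, ptr + size) lies inside a registered allocation.
    // Readers take the lock too, since a concurrent push_back may reallocate.
    bool contains(const void * ptr, size_t size) const {
        std::lock_guard<std::mutex> guard(mtx);
        const uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
        for (const entry & e : pinned_memory) {
            const uintptr_t base = reinterpret_cast<uintptr_t>(e.ptr);
            if (p >= base && p + size <= base + e.size) {
                return true;
            }
        }
        return false;
    }

private:
    mutable std::mutex mtx;           // protects pinned_memory
    std::vector<entry> pinned_memory; // never touched without holding mtx
};
```

Keeping the vector private means callers cannot reach it without going through a locked method.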
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups
style fixes
last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter
Co-authored-by: slaren slarengh@gmail.com
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels
ggml-ci
add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints
ggml-ci
64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]
ggml-ci
Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai
GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using __dpct_inline__ to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com
scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml
Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.
- Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
- Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels
ggml-ci
- cont : disable for q4_1
ggml-ci
- cont : disable for iq4_nl
ggml-ci
- memory : correctly handle failure in apply() (#14438)
ggml-ci
Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable best_mad to best_error in the
make_qkx2_quants function.
The motivation for this is that the name best_mad can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
sync : ggml
ggml-ci
ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2
Signed-off-by: noemotiovon 757486878@qq.com
- Support MUL_MAT_ID on 310p
Signed-off-by: noemotiovon 757486878@qq.com
- fix editorconfig
Signed-off-by: noemotiovon 757486878@qq.com
Signed-off-by: noemotiovon 757486878@qq.com
- Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI
ggml-ci
- cont : remove -g flag
ggml-ci
ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
Co-authored-by: Diego Devesa slarengh@gmail.com
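The pre-abort callback with chaining described in #14481 can be sketched as follows. All names here (`abort_callback_t`, `set_abort_callback`, `report_fatal`, `app_callback`) are hypothetical and chosen for the example; the real ggml declarations may differ. The point shown is that the setter returns the previous callback, so a newly installed handler can forward to it.

```cpp
#include <cassert>
#include <string>

// Hypothetical callback type: receives the error message just before abort.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Installs `cb` and returns the previously installed callback (may be nullptr).
static abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the library's fatal-error path. A real implementation
// would call abort() after notifying the callback; this sketch does not.
static void report_fatal(const char * message) {
    if (g_abort_callback) {
        g_abort_callback(message);
    }
}

static abort_callback_t g_prev = nullptr; // callback that was installed before ours
static std::string g_log;                 // stands in for "show a message / save data"

// Application handler: does its own work, then chains to the previous one.
static void app_callback(const char * message) {
    g_log += "app:";
    g_log += message;
    g_log += ";";
    if (g_prev) {
        g_prev(message);
    }
}
```

An application would store the return value of `set_abort_callback(app_callback)` in `g_prev` so that an earlier handler still runs.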
Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Piotr Stankiewicz piotr.stankiewicz@docker.com Signed-off-by: Eric Curtin ecurtin@redhat.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Molly Sophia mollysophia379@gmail.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: noemotiovon 757486878@qq.com Co-authored-by: Yuanhao Ji jiyuanhao@apache.org Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Kai Pastor dg0yt@darc.de Co-authored-by: Isaac McFadyen isaac@imcf.me Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Juk Armstrong 69222624+jukofyork@users.noreply.github.com Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Taylor quantumtraveling@gmail.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Anton Mitkov anton.mitkov@codeplay.com Co-authored-by: Ewan Crawford ewan@codeplay.com Co-authored-by: ddpasa 112642920+ddpasa@users.noreply.github.com Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Svetlozar Georgiev 55534064+sgeor255@users.noreply.github.com Co-authored-by: Piotr piotr.stankiewicz@docker.com Co-authored-by: Pepijn de Vos me@pepijndevos.nl Co-authored-by: Mikko Juola 
mikjuo@gmail.com Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Eric Curtin ecurtin@redhat.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: bashayer hijji bashayer.hijji@gmail.com Co-authored-by: Anton Mitkov anton_b_mitkov@abv.bg Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-…
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request
Minh141120 pushed a commit to janhq/llama.cpp that referenced this pull request
- add geglu activation function (#14074)
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp
sycl: Add reorder to Q6_K mmvq implementation (#13885)
Add Reorder to Q6_K mmvq implementation
Address PR comments: clean up comments
Remove unused parameter after refactoring q4_k
Adding inline to function and removing unnecessary reference to int
Signed-off-by: nscipione nicolo.scipione@codeplay.com
webui: fix sidebar being covered by main content (#14082)
webui: fix sidebar being covered by main content
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- webui: update index.html.gz
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
CANN: Simplify the environment variable setting(#13104)
Simplify the environment variable setting to specify the memory pool type.
Adjust the GGML_CANN_ASYNC_MODE setting to accept yes, enable, 1, or on (case-insensitive) as valid options.
update
fix CI
update
delete whitespace
fix according to review
update CANN.md
update CANN.md
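The case-insensitive matching described above for GGML_CANN_ASYNC_MODE (accepting yes, enable, 1, or on) could look roughly like this. The function name `parse_bool_option` is an assumption for this sketch and is not taken from the CANN backend.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical parser: true for "yes", "enable", "1" or "on" in any letter case.
// A null pointer (variable not set) and any other value are treated as false.
static bool parse_bool_option(const char * value) {
    if (value == nullptr) {
        return false;
    }
    std::string v(value);
    // Lower-case a copy so the comparison ignores case.
    std::transform(v.begin(), v.end(), v.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return v == "yes" || v == "enable" || v == "1" || v == "on";
}
```

A caller would typically pass the result of `std::getenv("GGML_CANN_ASYNC_MODE")` (from `<cstdlib>`), which is a null pointer when the variable is unset.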
graph : fix geglu (#14077)
ggml-ci
ggml-cpu : split arch-specific implementations (#13892)
move ggml-cpu-aarch64 to repack
split quantize_row_q8_0/1
split helper functions
split ggml_vec_dot_q4_0_q8_0
split ggml_vec_dot_q4_1_q8_1
split ggml_vec_dot_q5_0_q8_0
split ggml_vec_dot_q5_1_q8_1
split ggml_vec_dot_q8_0_q8_0
split ggml_vec_dot_tq1_0_q8_K
split ggml_vec_dot_tq2_0_q8_K
split ggml_vec_dot_q2_K_q8_K
split ggml_vec_dot_q3_K_q8_K
split ggml_vec_dot_q4_K_q8_K
split ggml_vec_dot_q5_K_q8_K
split ggml_vec_dot_q6_K_q8_K
split ggml_vec_dot_iq2_xxs_q8_K
split ggml_vec_dot_iq2_xs_q8_K
split ggml_vec_dot_iq2_s_q8_K
split ggml_vec_dot_iq3_xxs_q8_K
split ggml_vec_dot_iq3_s_q8_K
split ggml_vec_dot_iq1_s_q8_K
split ggml_vec_dot_iq1_m_q8_K
split ggml_vec_dot_iq4_nl_q8_0
split ggml_vec_dot_iq4_xs_q8_K
fix typos
fix missing prototypes
rename ggml-cpu-quants.c
rename ggml-cpu-traits
rename arm folder
move cpu-feats-x86.cpp
rename ggml-cpu-hbm
update arm detection macro in quants.c
move iq quant tables
split ggml_quantize_mat_q8_0/K
split ggml_gemv_*
split ggml_gemm_*
rename namespace aarch64 to repack
use weak aliases to replace test macros
rename GGML_CPU_AARCH64 to GGML_CPU_REPACK
rename more aarch64 to repack
clean up rebase leftover
fix compilation errors
remove trailing spaces
try to fix clang compilation errors
try to fix clang compilation errors again
try to fix clang compilation errors, 3rd attempt
try to fix clang compilation errors, 4th attempt
try to fix clang compilation errors, 5th attempt
try to fix clang compilation errors, 6th attempt
try to fix clang compilation errors, 7th attempt
try to fix clang compilation errors, 8th attempt
try to fix clang compilation errors, 9th attempt
more cleanup
fix compilation errors
fix apple targets
fix a typo in arm version of ggml_vec_dot_q4_K_q8_K
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
llama : allow building all tests on windows when not using shared libs (#13980)
llama : allow building all tests on windows when not using shared libraries
add static windows build to ci
tests : enable debug logs for test-chat
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- sync : ggml
ggml-ci
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099)
ggml : fix weak alias win32 (whisper/0)
ggml-ci
- sync : ggml
ggml-ci
vulkan: force device 0 in CI (#14106)
llama : support GEGLU for jina-bert-v2 (#14090)
convert : fix duplicate key DeepSeek-R1 conversion error (#14103)
kv-cache : avoid modifying recurrent cells when setting inputs (#13834)
kv-cache : avoid modifying recurrent cells when setting inputs
kv-cache : remove inp_s_mask
It was replaced with equivalent and simpler functionality with rs_z (the first zeroed state) and the already-existing inp_s_copy.
- kv-cache : fix non-consecutive token pos warning for recurrent models
The problem was apparently caused by how the tail cells were swapped.
graph : simplify logic for recurrent state copies
kv-cache : use cell without src refs for rs_z in recurrent cache
llama-graph : fix recurrent state copy
The state_copy shuffle assumes everything is moved at once,
which is not true when states_extra is copied back to the cache
before copying the range of states between head and head + n_seqs.
This is only a problem if any of the cells in [head, head + n_seqs)
have an src in [head + n_seqs, head + n_kv),
which does happen when n_ubatch > 1 in the llama-parallel example.
Changing the order of the operations avoids the potential overwrite before use, although when copies are avoided (like with Mamba2), this will require further changes.
- llama-graph : rename n_state to state_size in build_recurrent_state
This naming should reduce confusion between the state size and the number of states.
opencl: add mul_mv_id_q4_0_f32_8x_flat (#14003)
vulkan: Track descriptor pools/sets per-context (#14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.
kv-cache : add LLAMA_KV_CACHE_DEBUG environment variable (#14121)
kv-cache : relax SWA masking condition (#14119)
ggml-ci
webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
webui: Wrap long numbers instead of infinite horizontal scroll
Use tailwind class
update index.html.gz
vulkan: Better thread-safety for command pools/buffers (#14116)
This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.
tests : add test-tokenizers-repo (#14017)
chore : clean up relative source dir paths (#14128)
Implement GGML_CPU_ALL_VARIANTS for ARM (#14080)
ggml-cpu: Factor out feature detection build from x86
ggml-cpu: Add ARM feature detection and scoring
This is analogous to cpu-feats-x86.cpp. However, to detect compile-time activation of features, we rely on GGML_USE_ which need to be set in cmake, instead of GGML_ that users would set for x86.
This is because on ARM, users specify features with GGML_CPU_ARM_ARCH, rather than with individual flags.
- ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for ARM
Like x86, however to pass around arch flags within cmake, we use GGML_INTERNAL_ as we don't have GGML_.
Some features are optional, so we may need to build multiple backends per arch version (armv8.2_1, armv8.2_2, ...), and let the scoring function sort out which one can be used.
- ggml-cpu: Limit ARM GGML_CPU_ALL_VARIANTS to Linux for now
The other platforms will need their own specific variants.
This also fixes the bug that the variant-building branch was always being executed as the else-branch of GGML_NATIVE=OFF. The branch is moved to an elseif-branch which restores the previous behavior.
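The ARM variant commits above build several backends per architecture version and let a scoring function decide which one can be used. A rough sketch of that selection step, assuming a score of 0 means the running CPU lacks a required feature; the `cpu_variant` type and `pick_best_variant` function are hypothetical, and the variant names in any caller are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one compiled backend variant.
struct cpu_variant {
    std::string name; // illustrative label only
    int score;        // assumption: 0 means a required CPU feature is missing
};

// Returns the index of the usable variant with the highest score,
// or -1 when no variant is usable.
static int pick_best_variant(const std::vector<cpu_variant> & variants) {
    int best_index = -1;
    int best_score = 0;
    for (std::size_t i = 0; i < variants.size(); ++i) {
        assert(variants[i].score >= 0);
        if (variants[i].score > best_score) {
            best_score = variants[i].score;
            best_index = static_cast<int>(i);
        }
    }
    return best_index;
}
```

Returning -1 lets the caller fall back to a baseline build when none of the optional-feature variants can run.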
- kv-cache : fix split_equal handling in unified implementation (#14130)
ggml-ci
- batch : remove logits_all flag (#14141)
ggml-ci
context : simplify output counting logic during decode (#14142)
batch : remove logits_all flag
ggml-ci
- context : simplify output counting logic during decode
ggml-ci
cont : fix comments
cmake : Improve build-info.cpp generation (#14156)
cmake: Simplify build-info.cpp generation
The rebuild of build-info.cpp still gets triggered when .git/index changes.
cmake: generate build-info.cpp in build dir
pooling : make cls_b and cls_out_b optional (#14165)
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT (#14167)
cmake: Add ability to pass in LLAMA_BUILD_NUMBER/COMMIT
cmake: Pass on LLAMA_BUILD_* to GGML_BUILD_*
batch : rework llama_batch_allocr (#14153)
batch : rework llama_batch_allocr
ggml-ci
- cont : move validation inside class
ggml-ci
- cont : move output counting to class
ggml-ci
- cont : minor
ggml-ci
- batch : add TODOs
ggml-ci
batch : add LLAMA_BATCH_DEBUG environment variable (#14172)
batch : add LLAMA_BATCH_DEBUG environment variable
ggml-ci
cont : improve seq_id display
Merge commit from fork
vocab : prevent integer overflow during load
Add static cast and GGML_ABORT
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- vocab : fix build (#14175)
ggml-ci
batch : auto-gen positions + verify multi-sequence input (#14177)
batch : verify multi-sequence input batches
ggml-ci
- cont : auto-gen positions + verify multi-seq input
ggml-ci
- cont : first print debug info, then perform validation
ggml-ci
- cont : fix position auto-gen + add comments
ggml-ci
- cparams : rename LLAMA_MAX_PARALLEL_SEQUENCES to LLAMA_MAX_SEQ (#14188)
ggml-ci
- model : add dots.llm1 architecture support (#14044) (#14118)
Adds:
Dots1Model to convert_hf_to_gguf.py
Computation graph code to llama-model.cpp
Chat template to llama-chat.cpp to detect this model's template.
The model is called "dots.llm1" (I decided to shorten it to dots1 or DOTS1 in the code generally) architecture.
The only models that exist as of writing of this commit that follow this architecture are "dots.llm1.inst" and "dots.llm1.base" from here:
The model architecture is a combination of Qwen and Deepseek parts, as seen here:
- kv-cache : fix use-after-move of defrag info (#14189)
ggml-ci
model : Add support for Arcee AI's upcoming AFM model (#14185)
Add Arcee AFM support
Add draft update code
Fix linter and update URL, may still not be final
Update src/llama-model.cpp
Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com
- Remove accidental blank line
Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com
ggml-cpu : rework weak alias on apple targets (#14146)
ggml-cpu : rework weak alias on apple targets
fix powerpc detection
fix ppc detection
fix powerpc detection on darwin
vulkan: mutex around vkQueueSubmit (#14127)
This fixes the remaining crash in test-thread-safety on my system.
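The Vulkan specification requires that access to a queue passed to vkQueueSubmit be externally synchronized, which is why a mutex helps when several threads share one queue. A minimal sketch of the pattern, using a hypothetical `shared_queue` stand-in instead of real Vulkan handles; it is not the code from #14127.

```cpp
#include <cassert>
#include <functional>
#include <mutex>

// Hypothetical stand-in for a device queue that several contexts share.
struct shared_queue {
    std::mutex mutex;   // serializes all submissions to this queue
    int submitted = 0;  // stand-in for driver-side queue state

    // Runs `submit_fn` (where a real backend would call vkQueueSubmit)
    // while holding the queue's mutex.
    void submit(const std::function<void()> & submit_fn) {
        assert(submit_fn); // an empty std::function would throw when called
        std::lock_guard<std::mutex> guard(mutex);
        submit_fn();
        ++submitted;
    }
};
```

Keeping the mutex next to the queue it protects means every caller goes through the same lock, regardless of which context issues the submission.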
convert : remove arcee change in convert_hf_to_gguf_update.py (#14207)
ggml: Add Android support for GGML_CPU_ALL_VARIANTS (#14206)
llama : rework embeddings logic (#14208)
llama : rework embeddings logic
ggml-ci
- cont : fix rerank
ggml-ci
cont : engrish [no ci]
cont : fix rerank
ggml-ci
- server : support both embeddings and completions with single model
ggml-ci
- cont : avoid embeddings_org
ggml-ci
model : add NeoBERT (#14164)
convert neobert model to gguf
add inference graph
fix flake8 lint
followed reviewer suggestions
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- follow reviewers suggestions
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- override NeoBERT feed-forward length
Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Georgi Gerganov ggerganov@gmail.com
cmake: clean up external project logic for vulkan-shaders-gen (#14179)
Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time
llama : add thread safety test (#14035)
llama : add thread safety test
llamafile : remove global state
llama : better LLAMA_SPLIT_MODE_NONE logic
when main_gpu < 0 GPU devices are not used
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
server : fix incorrect usage of llama_get_embeddings() (#14225)
server : fix incorrect usage of llama_get_embeddings()
ggml-ci
- cont : fix the fix
ggml-ci
ggml-cpu : remove the weak alias trick (#14221)
cmake: remove shader-gen step-targets from ggml-vulkan (#14226)
Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen
examples : include examples in msvc disable warn (ggml/1270)
This commit adds the examples in the "list" of targets to ignore MSVC warnings.
The motivation for this is that currently the examples generate a number of warnings that are ignored/disabled for the core ggml project. This makes for a cleaner output when building.
ggml : disable warnings for tests when using MSVC (ggml/1273)
ggml : disable warnings for tests when using MSVC
This commit disables warnings for tests on windows when using MSVC.
The motivation for this is that this brings the build output more inline with what Linux/MacOS systems produce.
There is still one warning generated for the tests which is:
Building Custom Rule C:/ggml/tests/CMakeLists.txt
cl : command line warning D9025: overriding '/DNDEBUG' with '/UNDEBUG'
[C:\ggml\build\tests\test-arange.vcxproj]
test-arange.cpp
test-arange.vcxproj -> C:\ggml\build\bin\Release\test-arange.exe
ggml : fix typo in tests disable list
sync : ggml
ggml-ci
convert : fix null head_dim AutoConfig regression (#14248)
ggml: Add Apple support for GGML_CPU_ALL_VARIANTS (#14258)
docs: add s390x build documentation (#14264)
docs: add s390x-specific build docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x model conversion steps
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x build indent
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update hyperlinks for s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update llama.h docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x add accelerator and perf optimizations
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x indent blocks
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: revert block indentation
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add support information for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x reword
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: remove indentation for accelerator section s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: remove redundant words s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: reword for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: s390x reword simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: fix trailing whitespace for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : add mean kernel (#14267)
metal : add mean kernel
ggml-ci
- cont : dedup implementation
ggml-ci
memory : Hybrid recurrent cache (#13979)
feat: Add llama_model_is_hybrid API call
Also, split llama_model_is_recurrent into llm_arch_is_recurrent in llama-arch with llama_model_is_recurrent delegating to llm_arch_is_recurrent. The same split is done for hybrid. This is needed because there are places where the llama_model has not yet been initialized but we need to check if the model is recurrent (specifically for the per-layer recurrent check array in hparams).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add c++ side constants for attention layer indices hparam
Branch: GraniteFour
- feat: Add support for distinguishing recurrent vs non-recurrent layers in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Auto-fill hparams.recurrent_layer_arr based on whether the model is recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: rename *_is_hybrid -> *_is_hybrid_recurrent
The implementation of the hybrid cache intentionally does not specify the types of the child caches, so there was a naming mismatch with these predicate functions that used "hybrid" to imply "hybrid recurrent."
Branch: HybridCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add layer filter to recurrent cache
Branch: HybridCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer sizing everywhere in kv caches
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: First pass at llama_kv_cache_hybrid_recurrent
This follows the pattern in iswa where the two child caches are held explicitly to support the case where a model requires a single attention cache and a single recurrent cache where each layer uses exactly one of the caches.
This is a rewrite of the more generic approach in the original hybrid cache PR: https://github.com/ggml-org/llama.cpp/pull/13276
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Construct hybrid recurrent cache for hybrid recurrent models
This includes a refactor of the create_memory logic to avoid needing to use the arch enum explicitly unless a model needs explicit cache instantiation logic beyond the standard logic for recurrent, hybrid, unified, and iswa.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix wrong bool condition for split equal in hybrid cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix shift logic to defer to unified cache
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Support hybrid recurrent in llama-graph
NOTE: I intentionally did not add support for s_mask since it will be going away soon
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix logic for initializing inputs and attn layers for hybrid caches
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Update recurrent cache for changes to remove intermediate kv_cache interface
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix status for init_update sig for recurrent cache state
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Add missing padding to n_ctx for hybrid cache construction
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Update clear signature for data argument after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove errant virtual destructor leftover from previous impl attempt
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer n_embd_k/v_s calls for mamba (1) layers
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove n_embd_k/v_s from unified cache
No longer needed now that unified isn't also supporting recurrent
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140761069
Branch: HybridRecurrentCache
- refactor: Remove layer index from n_embd_k/v_s
Now that it's not used at all in the unified cache, we don't need to use the layer index to zero it out for attention layers.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove n_embd_k/v_gqa from recurrent cache
This is no longer needed now that there are separate implementations
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140825128
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Allow custom layer filters for hybrid recurrent
This should help support architectures like Falcon H1 where there is overlap between layers that need attention and recurrent caches.
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2140748922
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove logits_all after rebase
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove llama_model_is_hybrid_Recurrent public API
https://github.com/ggml-org/llama.cpp/pull/13979#discussion_r2141728423
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Use llama_memory_state_ptr for child states in hybrid memory state
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Overhaul build_recurrent_state / build_inp_s_copy to match attention pattern
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2141701738
This is a big overhaul to bring consistency between how inputs and per-layer components are created for attention layers and recurrent layers. The main changes are:
- Rename class llm_graph_input_s_copy -> llm_graph_input_rs
- Add a corresponding llm_graph_input_rs_hybrid_recurrent
- Rename build_inp_s_copy -> build_rs_inp_recurrent
- Add a corresponding build_rs_inp_hybrid_recurrent
- Rename build_recurrent_state -> build_rs to match build_attn w/ llm_graph_input_rs * as the first input
- Add a corresponding overload of build_rs w/ llm_graph_input_rs_hybrid_recurrent * as the first input
- Add a llm_graph_input_attn_kv_hybrid_recurrent analogous to llm_graph_input_attn_kv_unified
- Add a build_attn override that takes llm_graph_input_attn_kv_hybrid_recurrent * as the first input
This makes the two paradigms fully consistent. The main drawback is the code duplication in the build_attn and build_rs implementations where the only difference between implementations is how they cast the memory state.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix resize vs reserve and skip null tensors in size computation
https://github.com/ggml-org/llama.cpp/pull/13979/files#r2149469788
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-Authored-By: @younesbelkada
- fix: Fix initialization of child states
Since initially writing this PR, the logic in the child state types changed such that using the "init full" signature and keeping the ubatches on the parent struct no longer worked.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Use a common build_recurrent_state method that is cache-agnostic
This reduces the code duplication between the different build_rs impls and also retains a similar signature to the previous build_recurrent_state method while standardizing on the input-dispatched build_rs implementation.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- recurrent : rework graph inputs + add TODOs
ggml-ci
- refactor: Make status and child states const in hybrid and iswa
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Rename llama_kv_cache_[recurrent|hybrid_recurrent] to remove kv cache
This removes the notion of "kv" from the interface names for these memory types. There are still many references to kv in the implementation of the recurrent memory which will need further adjustment.
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor!: Rename all k/v related values for recurrent/hybrid to r/s
Anywhere that "kv_<state|cell|size|etc>" is used, I've used the more generic "mem_" prefix. The specifics of "k" (key) translate to "r" (recurrent state) and "v" (value) translate to "s" (state-space embedding states).
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: _recurrent -> _recr for brevity
It just happens to have the same number of letters as _attn!
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix spacing for ref
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: recurrent_layer() -> is_recurrent()
Branch: HybridRecurrentCache
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix spacing for size_s_bytes declaration
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com
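The hybrid recurrent cache in #13979 above holds two child caches explicitly, with each layer using exactly one of them. The following sketch illustrates only that layer-partitioning idea; `child_cache`, `hybrid_cache`, and the `is_recurrent` predicate are hypothetical names, and the real classes manage tensors and state rather than layer indices.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for one child cache: it only records its layers.
struct child_cache {
    std::vector<int> layers;
};

// Hypothetical hybrid container holding both child caches explicitly.
struct hybrid_cache {
    child_cache attn; // layers that use attention (KV) state
    child_cache recr; // layers that use recurrent state

    hybrid_cache(int n_layer, const std::function<bool(int)> & is_recurrent) {
        assert(n_layer >= 0);
        for (int il = 0; il < n_layer; ++il) {
            // each layer is assigned to exactly one of the two children
            (is_recurrent(il) ? recr : attn).layers.push_back(il);
        }
    }
};
```

A single predicate covers only the case where the two layer sets are disjoint; the custom layer filters mentioned in the commit list would need one filter per child so that a layer can appear in both.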
Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249)
llamafile : support s390x SIMD instruction set (#14273)
convert : fix remote option in Windows (#14100)
build : suppress gcc15 compile warnings (#14261)
Change _contains_any() substrs to std::string_view and fix the find comparison logic.
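The commit does not spell out what the previous comparison looked like. As a hedged illustration, a substring check over std::string_view conventionally compares the result of `find` against `npos`; the function below is a hypothetical sketch, not the actual _contains_any().

```cpp
#include <cassert>
#include <string_view>
#include <vector>

// Hypothetical sketch: true when `text` contains at least one of `substrs`.
static bool contains_any(std::string_view text,
                         const std::vector<std::string_view> & substrs) {
    for (std::string_view needle : substrs) {
        assert(!needle.empty()); // an empty needle would match every text
        // find() returns npos when there is no match; any other value,
        // including 0, is a valid match position.
        if (text.find(needle) != std::string_view::npos) {
            return true;
        }
    }
    return false;
}
```

Taking std::string_view avoids constructing temporary std::string objects when callers pass string literals.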
server : add server parameters for draft model cache type (#13782)
Co-authored-by: aa956 27946957+aa956@users.noreply.github.com
- ggml-cpu : remove unnecessary arm feature detection (#14281)
Support for Arm runtime feature detection has now been added to GGML_CPU_ALL_VARIANTS. This removes the old and not very functional code.
CUDA: add conv_2d_dw (#14265)
CUDA: add conv_2d_dw
better naming
simplify using template
Review: fix operation ordering in ggml-cuda, use forceinline, use more const
ubatch : new splitting logic (#14217)
ggml-ci
model : more uniform output id handling (#14275)
model : more uniform output id handling
ggml-ci
- cont : revert n_outputs < n_tokens optimization
ggml-ci
- cont : fix out_ids initialization
ggml-ci
ggml: Update KleidiAI to v1.9.0 (#14277)
ggml : fix repack work size for mul_mat_id (#14292)
ggml-ci
- cuda : synchronize graph capture and cublas handle destruction (#14288)
Works around an issue that may cause CUDA graph capture to fail when a cuBLAS handle is destroyed in a different thread
llama : improve sep token handling (#14272)
Implement GGML_CPU_ALL_VARIANTS for PowerPC (#14286)
Add PowerPC feature detection and scoring
ggml-cpu: Implement GGML_CPU_ALL_VARIANTS for PowerPC
ggml-cpu: Delay some initializations until function is called
When using GGML_BACKEND_DL=ON, these initializations might use instructions that are not supported by the current CPU.
Co-authored-by: Diego Devesa slarengh@gmail.com
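One common way to delay an initialization until a function is called is a function-local static, which runs on first use instead of at library load time. The sketch below shows that general technique with hypothetical names (`build_table`, `get_table`); whether the PowerPC change uses this exact form is an assumption.

```cpp
#include <cassert>

static int g_setup_calls = 0; // counts how often the setup below has run

// Hypothetical setup routine standing in for code that may use
// CPU-specific instructions.
static int build_table() {
    ++g_setup_calls;
    return 42;
}

// A namespace-scope `static const int table = build_table();` would run when
// the library is loaded. A function-local static defers it to the first call.
static int get_table() {
    static const int table = build_table(); // initialized once, on first call
    assert(g_setup_calls == 1);
    return table;
}
```

With a dynamically loaded backend, this means the setup code never executes on a CPU where the backend is loaded only to be scored and then rejected.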
sycl: add usage of enqueue_functions extension (#14244)
Add header and namespace to use enqueue_functions extension
Convert submit and parallel_for to use new extension in convert.cpp
Convert submit and parallel_for to use extension in ggml-sycl.cpp
Convert submit and parallel_for to use extension in gla.cpp
Convert submit and parallel_for in mmq.cpp
Convert submit and parallel_for in mmvq.cpp
Convert submit and parallel_for in remaining files
Convert all simple parallel_for to nd_launch from enqueue_functions extension
Wrapping extension in general function
Create a general function that enables the enqueue_functions extension if it is enabled in the compiler, otherwise call the general SYCL function to launch kernels.
Signed-off-by: nscipione nicolo.scipione@codeplay.com
vocab : prevent tokenizer overflow (#14301)
vocab : prevent stack overflow in tokenize
vocab : return error instead of aborting on oversized token count
vocab : INT32_MIN from llama_tokenize on overflow
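The overflow commits above replace an abort with an error return when a token count does not fit the 32-bit return type. A minimal sketch of that check, assuming the usual convention that a negative return reports the required buffer size; `checked_token_count` is a hypothetical helper, not the llama.cpp implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: converts a size_t token count into the int32_t
// result a tokenize-style API would return.
static int32_t checked_token_count(std::size_t n_tokens, int32_t n_tokens_max) {
    assert(n_tokens_max >= 0);
    if (n_tokens > static_cast<std::size_t>(std::numeric_limits<int32_t>::max())) {
        // the count cannot be represented at all: report overflow
        return std::numeric_limits<int32_t>::min();
    }
    const int32_t n = static_cast<int32_t>(n_tokens);
    if (n > n_tokens_max) {
        return -n; // buffer too small: negative value carries the needed size
    }
    return n;
}
```

INT32_MIN works as a sentinel because negating any representable count can never produce it.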
lint : remove trailing whitespace (#14304)
CUDA: add conv_2d_transpose (#14287)
CUDA: add conv_2d_transpose
remove direct include of cuda_fp16
Review: add brackets for readability, remove ggml_set_param and add asserts
Add ggml_roll (ggml/1274)
ggml : add ggml_roll
use set/get_op_params & std::min
sync : ggml
ggml-ci
convert : fix Llama 4 conversion (#14311)
memory : rename interface to llama_memory_context_i (#14296)
memory : rename interface to llama_memory_context_i
ggml-ci
cont : fix comments
cont : use "mctx" for referencing a memory context
ggml-ci
- metal : fix thread-safety (#14300)
ggml-ci
gguf-py : fix TemplateProcessing pair when bos/eos is missing (#14312)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.
gguf-py : fix Qwen3-Embedding eos token (#14314)
CUDA: add mean operation (#14313)
CUDA: add mean operation
add back sum_rows_f32_cuda
Review: early exit if col!=0
HIP: enable vec fattn on RDNA4 (#14323)
examples : fix is_first logic for tokenization (#14329)
ggml-ci
run : avoid double tokenization (#14327)
run : avoid double tokenization by adopting common_tokenize heuristic
build : fix windows gcc and clang warnings
lint : fixed trailing whitespace
run : fix is_first flag
gguf-py : fix SpecialVocab parsing when post_processor is null (#14330)
quantize : handle user-defined pruning of whole layers (blocks) (#13037)
vulkan: update windows SDK in CI (#14334)
kv-cells : fix tracking of seq_pos (#14339)
kv-cells : fix tracking of seq_pos during cache reuse
ggml-ci
- cont : improve error message
ggml-ci
cont : add more comments
CUDA: mul_mat_v support for batch sizes > 1 (#14262)
CUDA: mul_mat_v support for batch sizes > 1
use 64 bit math for initial offset calculation
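The reason for 64-bit offset math is that a product of two 32-bit indices can exceed the 32-bit range for large tensors. A small host-side sketch of the idea, with a hypothetical `element_offset` helper; the actual kernel code is not reproduced here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical offset helper: row index times row stride.
static int64_t element_offset(int32_t row, int32_t stride) {
    assert(row >= 0 && stride >= 0);
    // Widen before multiplying. Computing `row * stride` in 32-bit and then
    // converting would overflow first (undefined behavior for signed int).
    return static_cast<int64_t>(row) * static_cast<int64_t>(stride);
}
```

For example, a row index and stride of 100000 each give an offset of 10000000000, which does not fit in 32 bits.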
ci: add workflow for relocatable cmake package (#14346)
CUDA/HIP: optimize mmv paths taken for HIP devices (#14324)
Co-authored-by: Johannes Gäßler johannesg@5d6.de
cmake : use LLAMA_BUILD_NUMBER when defining LLAMA_INSTALL_VERSION (#14362)
batch : fix check for empty sequences in memory (#14364)
batch : fix check for empty sequences in memory
ggml-ci
- cont : reuse the var
ggml-ci
opencl: ref count ggml_backend_opencl_context and refactor profiling (#14254)
Move profiling info into ggml_backend_opencl_context
Add enqueue_ndrange_kernel to launch kernel
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
- ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
- ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
- ggml-cpu: better variable names
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
- docs: update s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
- ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove sigint from fp16 store
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rework noop
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: disable fp32->fp16 nnpa conversions for now
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix typo
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add missing func
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: diagnose why NNPA macro is not being defined
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: update macro tests
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test more macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add debug prints
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)
- ggml-cpu: move things around
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove mistaken fallback macro
fallback logic was already implemented but i was too sleepy to realise
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
- ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
fix ggml time initialization
fix f32_f16 table init
remove extra line
Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de
docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup
ggml-ci
- metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
- metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
ref: #8366
use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst
ggml-ci
ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation
ggml-ci
ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR
ggml-ci
ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
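To make the ggml_set_rows(a, b, c) description above concrete, here is a minimal reference sketch of the row-copy semantics in plain C++. It is not the ggml implementation: `set_rows_ref`, the flat row-major `std::vector` layout, and the reading that row i of the source lands in row `idx[i]` of the destination are assumptions made for illustration. Only the 64-bit index type comes from the commit notes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference for the semantics described above, not ggml code.
// dst and src are row-major with n_cols values per row.
// For each i, row i of src is written to row idx[i] of dst.
void set_rows_ref(std::vector<float> & dst,
                  const std::vector<float> & src,
                  const std::vector<int64_t> & idx,
                  size_t n_cols) {
    assert(n_cols > 0);
    assert(src.size() == idx.size() * n_cols);
    const size_t n_dst_rows = dst.size() / n_cols;
    for (size_t i = 0; i < idx.size(); ++i) {
        // Reject indices that fall outside the destination.
        assert(idx[i] >= 0 && static_cast<size_t>(idx[i]) < n_dst_rows);
        const size_t r = static_cast<size_t>(idx[i]);
        for (size_t c = 0; c < n_cols; ++c) {
            dst[r * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

With a 3x2 destination of zeros, a 2x2 source {1, 2, 3, 4} and indices {2, 0}, the first source row goes to destination row 2 and the second to row 0, while row 1 is left untouched.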
- recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
- graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
- vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com
vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups
style fixes
last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter
Co-authored-by: slaren slarengh@gmail.com
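The use-count idea behind the RMS_NORM+MUL fusion above can be sketched as follows. This is an illustrative model only: `Node`, `Op`, `count_uses` and `can_fuse_rms_norm_mul` are hypothetical names, the graph is a plain vector of nodes with integer source indices, and the real ggml code uses hash-table slots rather than this layout. The sketch keeps the use counts next to the graph rather than on the nodes, in line with the later bullet about avoiding data races.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { Input, RmsNorm, Mul, Other };

// Hypothetical graph node: an operation plus indices of the nodes it reads.
struct Node {
    Op op;
    std::vector<int> srcs;
};

// Count how many times each node is consumed as an input within the graph.
std::vector<int> count_uses(const std::vector<Node> & graph) {
    std::vector<int> uses(graph.size(), 0);
    for (const Node & n : graph) {
        for (int s : n.srcs) {
            assert(s >= 0 && static_cast<size_t>(s) < graph.size());
            uses[static_cast<size_t>(s)]++;
        }
    }
    return uses;
}

// True if node i is an RMS_NORM whose only consumer is the MUL at i + 1.
// The MUL may take the norm as either operand.
bool can_fuse_rms_norm_mul(const std::vector<Node> & graph,
                           const std::vector<int> & uses,
                           size_t i) {
    if (i + 1 >= graph.size()) {
        return false;
    }
    if (graph[i].op != Op::RmsNorm || graph[i + 1].op != Op::Mul) {
        return false;
    }
    if (uses[i] != 1) {
        return false; // the norm output is needed elsewhere, so keep it
    }
    for (int s : graph[i + 1].srcs) {
        if (static_cast<size_t>(s) == i) {
            return true;
        }
    }
    return false;
}
```

If another node also reads the norm output, its use count rises to 2 and the pair is left unfused, since the intermediate result must still be materialized.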
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels
ggml-ci
add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints
ggml-ci
64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]
ggml-ci
Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai
GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using __dpct_inline__ to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
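The helper-per-operation pattern described in this commit can be illustrated in plain C++ without any SYCL dependency. The sketch below is an assumption-laden analogy, not the SYCL backend code: `op_abs`, `op_sgn`, `op_silu` follow the op_xxx naming mentioned above but their bodies are written here for illustration, `unary_apply` is a hypothetical stand-in for the shared element-wise kernel, and standard `inline` stands in for __dpct_inline__.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers in the spirit of the op_xxx functions named above.
template <typename T> inline T op_abs(T x) { return x < T(0) ? -x : x; }
template <typename T> inline T op_sgn(T x) {
    return x > T(0) ? T(1) : (x < T(0) ? T(-1) : T(0));
}
template <typename T> inline T op_silu(T x) { return x / (T(1) + std::exp(-x)); }

// One generic element-wise loop; the operation is a template parameter so the
// compiler can inline it into the loop body instead of calling through a pointer.
template <typename T, typename F>
void unary_apply(const std::vector<T> & src, std::vector<T> & dst, F op) {
    assert(src.size() == dst.size());
    for (size_t i = 0; i < src.size(); ++i) {
        dst[i] = op(src[i]);
    }
}
```

A caller would pass the helper through a lambda, for example `unary_apply(src, dst, [](float x) { return op_silu(x); });`, so each operation shares the same loop while keeping its own small, inlinable body.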
vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com
scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml
Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.
- Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
- Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels
ggml-ci
- cont : disable for q4_1
ggml-ci
- cont : disable for iq4_nl
ggml-ci
- memory : correctly handle failure in apply() (#14438)
ggml-ci
Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
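For readers unfamiliar with the "align corners" option, the sketch below shows the coordinate mapping that the term usually refers to in bilinear resampling. It follows the widely used convention (corner samples coincide when the flag is set, pixel centers are aligned otherwise). Whether ggml_interpolate uses exactly these formulas is an assumption, and `src_coord` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

// Map an output index to a fractional source coordinate along one axis.
// Hypothetical helper following the common convention, not ggml code.
float src_coord(int64_t dst_i, int64_t n_src, int64_t n_dst, bool align_corners) {
    assert(n_src > 0 && n_dst > 0);
    if (align_corners) {
        // First and last samples of source and destination coincide.
        if (n_dst == 1) {
            return 0.0f;
        }
        return (float) dst_i * (float) (n_src - 1) / (float) (n_dst - 1);
    }
    // Pixel-center alignment: sample centers sit at half-integer offsets.
    const float scale = (float) n_src / (float) n_dst;
    return ((float) dst_i + 0.5f) * scale - 0.5f;
}
```

The same mapping works for downscaling (n_dst smaller than n_src); a bilinear filter then blends the two source samples on either side of the returned coordinate, clamping at the edges.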
sync : ggml
ggml-ci
ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI
ggml-ci
- cont : remove -g flag
ggml-ci
ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
Co-authored-by: Diego Devesa slarengh@gmail.com
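The "return previous callback" design mentioned above can be sketched as follows. All names here (`abort_callback_t`, `set_abort_callback`, `fatal`) are hypothetical and chosen for illustration; the actual ggml declarations should be checked in ggml.h.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical callback type: receives the error message before the abort.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Install a new callback and hand back the previous one so the caller can
// store it and forward to it, which is what makes chaining possible.
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Invoke the callback (if any) just before terminating the process.
[[noreturn]] void fatal(const char * message) {
    if (g_abort_callback != nullptr) {
        g_abort_callback(message);
    } else {
        std::fprintf(stderr, "%s\n", message);
    }
    std::abort();
}
```

An application without a console would install a callback that shows a dialog and saves user data, keep the returned pointer, and call it at the end of its own handler so earlier handlers still run.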
github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition
ggml-ci
- cont : fix n_ctx_used computation
ggml-ci
opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version
This commit adds a function ggml_version() to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.
Usage:
printf("GGML version: %s\n", ggml_version());
Output:
GGML version: 0.0.2219
- ggml : add ggml_commit()
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- sync : ggml
ggml-ci
llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)
ggml-ci
ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3
ggml-ci
- backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows
ggml-ci
- graph : separate k and v indices
ggml-ci
- cont : remove redundant ifs
ggml-ci
kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments
ggml-ci
- ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione nicolo.scipione@codeplay.com
ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong luyuhong@kylinos.cn
- batch : add n_used count (#14512)
ggml-ci
- graph : prepare for 4D mask (#14515)
ggml-ci
- batch : add optional for sequential equal split (#14511)
ggml-ci
- metal : disable fast math in all quantize kernels (#14528)
ggml-ci
test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- refactor
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
remove visitor nonsense
remove visitor comment
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com
eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
handle null mask for gqa
allow gqa with dim3>1
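The packing of the mask boolean and n_head_log2 into one 32-bit word, mentioned above as a way to stay under the 128-byte push constant limit, can be sketched like this. The bit layout (flag in bit 16, value in the low 16 bits) and the function names are assumptions for illustration; the shader-side layout in ggml-vulkan may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: bits 0..15 hold n_head_log2, bit 16 holds the mask flag.
constexpr uint32_t MASK_FLAG_SHIFT = 16;
constexpr uint32_t N_HEAD_LOG2_MASK = 0xFFFFu;

// Combine both values into a single dword, saving 4 bytes of push constants.
uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert(n_head_log2 <= N_HEAD_LOG2_MASK);
    return (static_cast<uint32_t>(has_mask) << MASK_FLAG_SHIFT) | n_head_log2;
}

// Recover the two fields on the consuming side.
bool unpack_has_mask(uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_MASK;
}
```

The receiving shader would apply the same shift and mask to split the word again, so two logical parameters occupy the space of one.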
Signed-off-by: nscipione nicolo.scipione@codeplay.com Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Signed-off-by: Aaron Teo aaron.teo1@ibm.com Signed-off-by: Gabe Goodhart ghart@us.ibm.com Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: Đinh Trọng Huy 77562200+huydt84@users.noreply.github.com Co-authored-by: dinhhuy huy.dinh@brains-tech.co.jp Co-authored-by: Nicolò Scipione nicolo.scipione@codeplay.com Co-authored-by: R0CKSTAR yeahdongcn@gmail.com Co-authored-by: Xinpeng Dou 15529241576@163.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: xctan axunlei@gmail.com Co-authored-by: Diego Devesa slarengh@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Jeff Bolz jbolz@nvidia.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net Co-authored-by: lhez quic_lih@quicinc.com Co-authored-by: Aman amangupta052@gmail.com Co-authored-by: Christian Kastner ckk@kvr.at Co-authored-by: Guy Goldenberg guy110698@gmail.com Co-authored-by: Mikko Juola mikjuo@gmail.com Co-authored-by: Bartowski 3266127+bartowski1182@users.noreply.github.com Co-authored-by: Xuan-Son Nguyen thichthat@gmail.com Co-authored-by: xctan xc-tan@outlook.com Co-authored-by: Charles Xu charles.xu@arm.com Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com Co-authored-by: Daniel Bevenius daniel.bevenius@gmail.com Co-authored-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: pqnet 119850+pqnet@users.noreply.github.com Co-authored-by: fanyang fanyang89@outlook.com Co-authored-by: aa956 aa956@users.noreply.github.com Co-authored-by: aa956 27946957+aa956@users.noreply.github.com Co-authored-by: Ruikai Peng retr0@retr0.blog Co-authored-by: Acly aclysia@gmail.com Co-authored-by: Daniel Han danielhanchen@gmail.com Co-authored-by: Markus Tavenrath mtavenrath@users.noreply.github.com 
Co-authored-by: uvos philipp@uvos.xyz Co-authored-by: Ed Addario 29247825+EAddario@users.noreply.github.com Co-authored-by: Johannes Gäßler johannesg@5d6.de Co-authored-by: Mathieu Baudier mbaudier@argeo.org Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Radoslav Gerganov rgerganov@gmail.com Co-authored-by: Weizhao Ouyang weizhao.ouyang@arm.com Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Renat rntk@users.noreply.github.com Co-authored-by: matteo matteo.serva@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com Co-authored-by: Vedran Miletić vedran@miletic.net Co-authored-by: xiaobing318 71554036+xiaobing318@users.noreply.github.com Co-authored-by: Romain Biessy romain.biessy@codeplay.com Co-authored-by: Björn Ganster mail@bjoern-ganster.de Co-authored-by: Eric Zhang 34133756+EZForever@users.noreply.github.com Co-authored-by: zhouwg zhouwg2000@gmail.com Co-authored-by: Rotem Dan rotemdan@gmail.com Co-authored-by: luyhcsu 110711054+luyhcsu@users.noreply.github.com Co-authored-by: luyuhong luyuhong@kylinos.cn
qnixsynapse pushed a commit to janhq/llama.cpp that referenced this pull request
olek-tether pushed a commit to tetherto/qvac-fabric-llm.cpp that referenced this pull request
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
- ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
- ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
- ggml-cpu: better variable names
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
- docs: update s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
- ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove sigint from fp16 store
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rework noop
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: disable fp32->fp16 nnpa conversions for now
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix typo
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add missing func
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: diagnose why NNPA macro is not being defined
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: update macro tests
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test more macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add debug prints
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)
- ggml-cpu: move things around
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove mistaken fallback macro
fallback logic was already implemented but i was too sleepy to realise
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
- ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
fix ggml time initialization
fix f32_f16 table init
remove extra line
Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de
docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup
ggml-ci
- metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
- metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
ref: #8366
use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst
ggml-ci
ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation
ggml-ci
ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR
ggml-ci
ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
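The row copy described for ggml_set_rows(a, b, c) can be illustrated with a minimal sketch on plain row-major arrays. It does not use the ggml API: set_rows_sketch and its parameters are hypothetical names, and broadcasting and quantized destinations are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the ggml API: copy row r of src into row idx[r] of dst.
// Both matrices are row-major with n_cols floats per row; idx uses 64-bit indices.
void set_rows_sketch(std::vector<float> & dst, const std::vector<float> & src,
                     const std::vector<int64_t> & idx, size_t n_cols) {
    assert(n_cols > 0 && src.size() == idx.size() * n_cols);
    const size_t n_dst_rows = dst.size() / n_cols;
    for (size_t r = 0; r < idx.size(); ++r) {
        // Reject indices that would write outside the destination.
        assert(idx[r] >= 0 && static_cast<size_t>(idx[r]) < n_dst_rows);
        const size_t row = static_cast<size_t>(idx[r]);
        for (size_t c = 0; c < n_cols; ++c) {
            dst[row * n_cols + c] = src[r * n_cols + c];
        }
    }
}
```

With a 3x2 destination, a 2x2 source and indices {2, 0}, the first source row lands in destination row 2 and the second in row 0; row 1 is left untouched.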
- recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
- graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
- vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com
vulkan: lock accesses of pinned_memory vector (#14333)
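As a rough illustration of what locking accesses to a shared vector of pinned allocations involves, here is a minimal sketch. The pinned_registry type and its members are hypothetical and are not taken from ggml-vulkan; the point is only that every read and write of the vector happens under the same mutex.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-device list of pinned host allocations.
struct pinned_registry {
    std::mutex mtx;
    std::vector<std::pair<std::uintptr_t, size_t>> entries; // (base address, size in bytes)

    void add(const void * ptr, size_t size) {
        assert(size > 0);
        std::lock_guard<std::mutex> guard(mtx);
        entries.emplace_back(reinterpret_cast<std::uintptr_t>(ptr), size);
    }

    // Remove the allocation whose base address is ptr; returns false if unknown.
    bool remove(const void * ptr) {
        std::lock_guard<std::mutex> guard(mtx);
        const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(ptr);
        for (size_t i = 0; i < entries.size(); ++i) {
            if (entries[i].first == base) {
                entries.erase(entries.begin() + static_cast<std::ptrdiff_t>(i));
                return true;
            }
        }
        return false;
    }

    // True if ptr lies inside any registered allocation.
    bool contains(const void * ptr) {
        std::lock_guard<std::mutex> guard(mtx);
        const std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
        for (const auto & e : entries) {
            if (p >= e.first && p - e.first < e.second) {
                return true;
            }
        }
        return false;
    }
};
```

Lookups take the lock as well as insertions and removals, because iterating a std::vector while another thread resizes it is a data race even when the reader never writes.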
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups
style fixes
last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter
Co-authored-by: slaren slarengh@gmail.com
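The use-count idea behind the RMS_NORM+MUL fusion can be sketched on a toy graph. Everything below is hypothetical (the node struct, op_kind and both functions are illustration only, not the ggml data structures): an rms_norm node is treated as fusable with the following mul only if the mul reads its output, through either operand, and nothing else does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class op_kind { input, rms_norm, mul };

// Hypothetical graph node: src0/src1 are indices of producer nodes, -1 when unused.
struct node {
    op_kind op;
    int src0;
    int src1;
};

// Count how many times each node's output is consumed as an operand.
std::vector<int> count_uses(const std::vector<node> & g) {
    std::vector<int> uses(g.size(), 0);
    for (const node & n : g) {
        if (n.src0 >= 0) uses[static_cast<size_t>(n.src0)]++;
        if (n.src1 >= 0) uses[static_cast<size_t>(n.src1)]++;
    }
    return uses;
}

// Node i (rms_norm) can be fused with node i+1 (mul) when the mul consumes the
// norm's output via either operand and that output has exactly one use.
bool can_fuse_rms_norm_mul(const std::vector<node> & g, const std::vector<int> & uses, size_t i) {
    assert(uses.size() == g.size());
    if (i + 1 >= g.size()) return false;
    const node & a = g[i];
    const node & b = g[i + 1];
    if (a.op != op_kind::rms_norm || b.op != op_kind::mul) return false;
    const int id = static_cast<int>(i);
    const bool consumed = (b.src0 == id) || (b.src1 == id);
    return consumed && uses[i] == 1;
}
```

Keeping the counts in a per-graph array rather than on the nodes mirrors the commit note about avoiding data races and double increments when tensors are shared between threads.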
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels
ggml-ci
add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints
ggml-ci
64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]
ggml-ci
Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai
GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using __dpct_inline__ to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
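For readers unfamiliar with gated linear units, the following sketch shows the general shape of the single-tensor REGLU/SWIGLU variants and their swapped forms. It assumes the common convention that a row of 2n values is split into two halves, one passed through the activation and the other used as a multiplier; glu_row is a hypothetical helper and the exact half ordering used by ggml is not confirmed here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

inline float relu(float x) { return x > 0.0f ? x : 0.0f; }
inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Hypothetical helper: row holds 2*n values; one half is activated, the other scales it.
// With swapped == true the roles of the two halves are exchanged.
template <typename Act>
std::vector<float> glu_row(const std::vector<float> & row, Act act, bool swapped = false) {
    assert(row.size() % 2 == 0);
    const size_t n = row.size() / 2;
    std::vector<float> out(n);
    for (size_t i = 0; i < n; ++i) {
        const float x = swapped ? row[n + i] : row[i];
        const float g = swapped ? row[i] : row[n + i];
        out[i] = act(x) * g;
    }
    return out;
}
```

Calling glu_row(row, relu) gives a REGLU-style result and glu_row(row, silu) a SWIGLU-style one; the split up/gate variant from #14181 would take the two halves as separate inputs instead.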
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com
scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml
Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.
- Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
- Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels
ggml-ci
- cont : disable for q4_1
ggml-ci
- cont : disable for iq4_nl
ggml-ci
- memory : correctly handle failure in apply() (#14438)
ggml-ci
Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable best_mad to best_error in the
make_qkx2_quants function.
The motivation for this is that the name best_mad can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
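The difference between the default and "align corners" modes comes down to how an output pixel index is mapped back to a source coordinate. The sketch below uses the two conventions commonly found in image libraries; it is an assumption that ggml follows the same formulas, and src_coord is a hypothetical name.

```cpp
#include <cassert>

// Map an output pixel index to a (fractional) source coordinate along one axis.
// Both formulas are the widely used conventions, assumed here rather than copied from ggml.
float src_coord(int dst, int n_src, int n_dst, bool align_corners) {
    assert(n_src > 0 && n_dst > 0);
    if (align_corners) {
        // First and last pixel centres of source and destination coincide.
        return n_dst > 1 ? dst * float(n_src - 1) / float(n_dst - 1) : 0.0f;
    }
    // Half-pixel convention: the outer pixel edges coincide instead.
    return (dst + 0.5f) * float(n_src) / float(n_dst) - 0.5f;
}
```

The same mapping covers downscaling, since nothing in it requires n_dst to be larger than n_src; a bilinear sampler would then clamp the coordinate and blend the two nearest source pixels.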
sync : ggml
ggml-ci
ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2
Signed-off-by: noemotiovon 757486878@qq.com
- Support MUL_MAT_ID on 310p
Signed-off-by: noemotiovon 757486878@qq.com
- fix editorconfig
Signed-off-by: noemotiovon 757486878@qq.com
Signed-off-by: noemotiovon 757486878@qq.com
- Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI
ggml-ci
- cont : remove -g flag
ggml-ci
ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
Co-authored-by: Diego Devesa slarengh@gmail.com
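A minimal sketch of the pattern described above, a setter that returns the previously installed callback so callers can chain to it. The names abort_callback_t, set_abort_callback and report_fatal are hypothetical; the actual ggml declarations may differ.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical names throughout; the real ggml declarations may differ.
typedef void (*abort_callback_t)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Install a new callback and hand back the one it replaces.
abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// What a fatal-error path could do just before calling abort().
void report_fatal(const char * message) {
    assert(message != nullptr);
    if (g_abort_callback) {
        g_abort_callback(message);
    } else {
        std::fprintf(stderr, "%s\n", message);
    }
}
```

A GUI application could store the returned pointer and call it from its own handler after showing a dialog and saving data, so that an earlier handler still runs.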
github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition
ggml-ci
- cont : fix n_ctx_used computation
ggml-ci
opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version
This commit adds a function ggml_version() to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.
Usage:
printf("GGML version: %s\n", ggml_version());
Output:
GGML version: 0.0.2219
- ggml : add ggml_commit()
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- sync : ggml
ggml-ci
llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)
ggml-ci
ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3
ggml-ci
- backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows
ggml-ci
- graph : separate k and v indices
ggml-ci
- cont : remove redundant ifs
ggml-ci
kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments
ggml-ci
- ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione nicolo.scipione@codeplay.com
ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong luyuhong@kylinos.cn
- batch : add n_used count (#14512)
ggml-ci
- graph : prepare for 4D mask (#14515)
ggml-ci
- batch : add optional for sequential equal split (#14511)
ggml-ci
- metal : disable fast math in all quantize kernels (#14528)
ggml-ci
test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- refactor
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
remove visitor nonsense
remove visitor comment
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com
eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
handle null mask for gqa
allow gqa with dim3>1
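Packing a boolean and a small value into one 32-bit word, as described for the mask flag and n_head_log2, can look like the sketch below. The bit layout (flag in bit 31, value in the remaining bits) and all names are assumptions for illustration, not the layout used by the Vulkan shaders.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout: bit 31 carries the mask flag, the low 31 bits carry n_head_log2.
constexpr uint32_t MASK_FLAG_BIT = 1u << 31;

uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    assert((n_head_log2 & MASK_FLAG_BIT) == 0); // value must leave the flag bit free
    return (has_mask ? MASK_FLAG_BIT : 0u) | n_head_log2;
}

bool unpack_has_mask(uint32_t packed) {
    return (packed & MASK_FLAG_BIT) != 0;
}

uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & ~MASK_FLAG_BIT;
}
```

This saves one dword compared with sending the two fields separately, which matters when the push constant block is close to the 128-byte limit mentioned above.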
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
- vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com
CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont
ggml-ci
- cont : fix multi-rope + add test
ggml-ci
- sycl : try fix
ggml-ci
- cont : fix sycl + clean-up cuda
ggml-ci
vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style
Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com
server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code
Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com
- model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang stevenkuang@tencent.com
vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
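For context, this is what a split-K reduction has to compute, written as a scalar sketch. It follows the standard online-softmax merge (rescale each partial to the common maximum, then normalise) and assumes M is the running maximum and L the running sum of exponentials; fa_partial and split_k_reduce_sketch are hypothetical names and this is not the shader code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Partial result of one KV split: running max m, sum of exp(score - m) as l,
// and the unnormalised output accumulator o (one value per HSV element).
struct fa_partial {
    float m;
    float l;
    std::vector<float> o;
};

// Merge the partials with the standard split-K softmax reduction (assumed, not the shader).
std::vector<float> split_k_reduce_sketch(const std::vector<fa_partial> & parts) {
    assert(!parts.empty());
    float m_max = parts[0].m;
    for (const fa_partial & p : parts) m_max = std::max(m_max, p.m);

    float l_sum = 0.0f;
    std::vector<float> out(parts[0].o.size(), 0.0f);
    for (const fa_partial & p : parts) {
        assert(p.o.size() == out.size());
        const float scale = std::exp(p.m - m_max); // rescale to the common max
        l_sum += p.l * scale;
        for (size_t d = 0; d < out.size(); ++d) out[d] += p.o[d] * scale;
    }
    for (float & v : out) v /= l_sum;
    return out;
}
```

The loop over parts is the M/L reduction that grows with k_num, and the loop over d is independent per output element, which is why one thread per HSV element is a natural mapping.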
convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"
This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.
fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
flake8 fixes
Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
added hashes
Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
update the update file
Revert "update the update file"
This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.
fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
d_inner fixed
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
reshaping ssm_norm for 34B
removing generate_mup
remove duplicate metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp
Co-authored-by: compilade git@compilade.net
falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
falcon-h1: fix last comment
Update convert_hf_to_gguf.py
Co-authored-by: compilade git@compilade.net
falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues
Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net
llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
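One common way to guard a size computation of this kind is to check each multiplication before performing it. The sketch below is a generic illustration with a hypothetical helper name; it is not the code from the gguf reader.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: multiply the dimensions and the element size,
// reporting overflow instead of silently wrapping around.
bool checked_tensor_bytes(const std::vector<uint64_t> & dims, uint64_t elem_size, uint64_t & out) {
    assert(elem_size > 0);
    uint64_t total = elem_size;
    for (uint64_t d : dims) {
        if (d != 0 && total > std::numeric_limits<uint64_t>::max() / d) {
            return false; // total * d would not fit in 64 bits
        }
        total *= d;
    }
    out = total;
    return true;
}
```

A loader would reject the file when the function returns false rather than allocate a buffer based on a wrapped value.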
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
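One generic way to keep the number of swaps low when moving rows into a target order is to follow the cycles of the permutation, which needs n minus the number of cycles swaps. The sketch below illustrates that idea only. It is not taken from the commit, and `scatter_in_place` and `dest` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not llama.cpp code.
// Moves v[i] to position dest[i] in place by following permutation cycles.
// `dest` must be a permutation of 0..v.size()-1; it is taken by value because
// it is consumed as bookkeeping. Returns the number of swaps performed.
template <typename T>
std::size_t scatter_in_place(std::vector<T> & v, std::vector<std::size_t> dest) {
    assert(dest.size() == v.size());
    std::size_t n_swaps = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        // Keep sending whatever sits at i to its destination until the
        // element that belongs at i has arrived.
        while (dest[i] != i) {
            const std::size_t j = dest[i];
            std::swap(v[i], v[j]);
            std::swap(dest[i], dest[j]);
            ++n_swaps;
        }
    }
    return n_swaps;
}
```

Rows that are already in place cost nothing, so a mostly ordered batch needs only a handful of swaps.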
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
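The equivalence can be illustrated on a single token step: a depthwise convolution multiplies each channel's window by that channel's kernel and adds up the products, which is an element-wise MUL followed by a per-row sum. The sketch below is a plain C++ illustration under that assumption. The `matrix` alias and the helper names are hypothetical and do not use the ggml API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout: one row per channel, one column per convolution tap.
using matrix = std::vector<std::vector<float>>;

// Element-wise product of two equally shaped matrices (the MUL step).
matrix mul(const matrix & a, const matrix & b) {
    assert(a.size() == b.size());
    matrix out(a.size());
    for (std::size_t r = 0; r < a.size(); ++r) {
        assert(a[r].size() == b[r].size());
        for (std::size_t c = 0; c < a[r].size(); ++c) {
            out[r].push_back(a[r][c] * b[r][c]);
        }
    }
    return out;
}

// Sum of every row (the SUM_ROWS step): one value per channel.
std::vector<float> sum_rows(const matrix & m) {
    std::vector<float> out;
    for (const auto & row : m) {
        float acc = 0.0f;
        for (float x : row) acc += x;
        out.push_back(acc);
    }
    return out;
}

// Direct depthwise convolution for one output step, for comparison.
std::vector<float> conv_step_direct(const matrix & window, const matrix & weight) {
    assert(window.size() == weight.size());
    std::vector<float> out(window.size(), 0.0f);
    for (std::size_t ch = 0; ch < window.size(); ++ch) {
        for (std::size_t k = 0; k < window[ch].size(); ++k) {
            out[ch] += window[ch][k] * weight[ch][k];
        }
    }
    return out;
}
```

With a two-channel window `{{1, 2, 3}, {4, 5, 6}}` and kernels `{{1, 0, -1}, {0.5, 0.5, 0.5}}`, both `sum_rows(mul(window, weight))` and `conv_step_direct(window, weight)` produce one value per channel.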
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- support for smoldocling
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- fixed merge conflicts
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- safetensors tensor mapping
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update include/llama.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net
opencl: add set_rows for f16 and f32 (#14547)
opencl: add set_rows for f16 and f32
opencl: better choose workgroup size for set_rows
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
- llama : rename llama_cache to llama_past
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)
Still, I'm open to better suggestions.
examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models
This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add Granite 4 conversion
This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Plumb bamba through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add bamba to llama_arch_is_hybrid_recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add optional mamba ssm_in bias tensor
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add template specialization for get_arr to load a vector for layer index arr in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Use an explicit bool to determine mamba vs mamba2
This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture if statement inside the mamba
implementation.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Isolate mamba(2) and granite attention layer building in static methods
This will allow these layer-builder methods to be used from other build structs without complex inheritance.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer sizes in granite build_attention_layer
Also no need to pass in kv cache since it's already in the inp_attn
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: First (broken) pass at end-to-end Bamba implementation
It generates (garbage) tokens! Still lots of debugging to do.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Only do Granite multipliers if set
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Pull granite ffn portion into a static function and reuse in hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat(py): Allow gguf duplicate keys if they match by value and type
This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
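The duplicate-key rule described in this commit message can be sketched as follows. The real writer lives in gguf-py and is Python; this C++17 version is only an illustration, and `kv_value`, `kv_store` and `add` are hypothetical names rather than anything from the gguf API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <variant>

// Hypothetical value type: a small subset of what a metadata store might hold.
using kv_value = std::variant<std::int64_t, double, bool, std::string>;

class kv_store {
public:
    // Returns true if the pair was stored or was an exact duplicate,
    // false if the key already holds a different value or a different type.
    bool add(const std::string & key, kv_value value) {
        assert(!key.empty());
        auto it = data_.find(key);
        if (it == data_.end()) {
            data_.emplace(key, std::move(value));
            return true;
        }
        // std::variant's operator== compares the active alternative first,
        // so an integer 4 and a double 4.0 do not count as a match.
        return it->second == value;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::map<std::string, kv_value> data_;
};
```

Under this rule two parent classes can both set the same key without special handling, provided they agree on value and type.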
- refactor(py): Simplify granitemoehybrid conversion to use parents better
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add GRANITE_MOE_HYBRID through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Support GRANITE_MOE_HYBRID in llama-model
This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix flake8 errors
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix recurrent cache get after rebase
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins
The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from self-> everywhere, but this is still a
bit cleaner than making those methods static I think.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Implement the full copy-paste version to duplicate the layer builders
This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of build_rs /
build_attn. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid
I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).
As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
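The size mismatch can be made visible without triggering undefined behavior by comparing object sizes at compile time. This is a minimal sketch with hypothetical stand-in types (`graph_context_base`, `builder_with_field`, `builder_without_field`), not the actual llama.cpp classes. Formally, deleting a derived object through a base pointer calls for a virtual destructor; the sketch only shows why an extra field is what changes the size seen by `delete`.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a base class with a non-virtual destructor.
struct graph_context_base {
    int n_layer = 0;
    ~graph_context_base() = default; // non-virtual on purpose
};

// Adds a data member: the object allocated by `new` is larger than the base.
struct builder_with_field : graph_context_base {
    double extra_state[4] = {};
};

// Adds only behavior: the object has the same size as the base.
struct builder_without_field : graph_context_base {
    void build() {}
};

// With a non-virtual destructor, `delete` through a base pointer uses the
// base type, so any size difference goes unnoticed by the language but not
// by an address sanitizer that checks the allocation size.
constexpr bool base_dtor_is_virtual =
    std::has_virtual_destructor<graph_context_base>::value;
constexpr bool field_changes_size =
    sizeof(builder_with_field) != sizeof(graph_context_base);
constexpr bool method_changes_size =
    sizeof(builder_without_field) != sizeof(graph_context_base);

static_assert(!base_dtor_is_virtual, "the sketch assumes a non-virtual destructor");
```

Keeping state in the base class, or making the destructor virtual, are the two usual ways out; the commit message describes the former.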
- memory : correctly handle failure in apply()
ggml-ci
- style: Remove TODO for adding first hybrid models to the switch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- docs: Fix comment about duplicate key check
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Conform to standard way of initializing inp_out_ids
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use llm_graph_context_mamba in llm_build_granite_hybrid
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins
The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (i.e. 2x for single mixin, 3x for two mixins, etc...).
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- feat: Add support for dense FFN in GraniteMoeHybrid
This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add support for dense FFN tensor names on c++ side
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use child inputs for Falcon H1 after merge resolution
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary prefix on tensor constants
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
- refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid
The only key difference is the use of rope which is now set via rope_finetuned in the hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove use of diamond inheritance
Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance
https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Log mamba params for Granite Hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unused ssm_in_b
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv
This matches how recurrent vs attention heads are identified for Jamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
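The convention of identifying recurrent layers by a KV head count of zero can be sketched as below. The struct and function names are hypothetical and only stand in for the real hparams; the sketch assumes one `n_head_kv` entry per layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-layer hyperparameters for a hybrid model.
struct hybrid_hparams {
    std::vector<std::uint32_t> n_head_kv; // one entry per layer; 0 marks a recurrent layer
};

// A layer with no KV heads has nothing to store in the attention cache,
// so it is treated as recurrent.
inline bool is_recurrent_layer(const hybrid_hparams & hp, std::size_t il) {
    assert(il < hp.n_head_kv.size());
    return hp.n_head_kv[il] == 0;
}

// Collects the indices of the attention layers, which is the information a
// separate list of attention layer indices would otherwise have carried.
inline std::vector<std::size_t> attention_layers(const hybrid_hparams & hp) {
    std::vector<std::size_t> out;
    for (std::size_t il = 0; il < hp.n_head_kv.size(); ++il) {
        if (!is_recurrent_layer(hp, il)) {
            out.push_back(il);
        }
    }
    return out;
}
```

Deriving the layer type from a value that already exists per layer avoids keeping a second list in sync with it.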
- fix: Remove unused template expansion for get_arr
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Review cleanup in convert_hf_to_gguf
The gist is to be explicit about which base class is being used with the multiple inheritance setup
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Undo hidden warnings about duplicate identical keys in add_key_value
After further discussion, this encourages sloppy overwriting in the model converters
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: If not using ROPE, context is "infinite"
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- doc: Add a comment outlining expected duplicate key warnings
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary duplicate keys in converter
Co-authored-by: Francis Couture-Harpin git@compilade.net
(thanks for the sharp eyes and patience!)
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Francis Couture-Harpin git@compilade.net Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)
ggml-ci
readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)
Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.
- vulkan: optimize set_rows
Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
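For orientation, SET_ROWS can be read as a row scatter: row i of the source is written to the destination row named by the i-th index. The CPU sketch below assumes plain float rows with no quantization and uses a hypothetical name (`set_rows_ref`). It does not reflect how the Vulkan shaders distribute the work.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference implementation, not ggml code.
// dst and src are row-major with n_cols floats per row;
// idx[i] is the destination row for source row i.
void set_rows_ref(std::vector<float> & dst,
                  const std::vector<float> & src,
                  const std::vector<std::int64_t> & idx,
                  std::size_t n_cols) {
    assert(n_cols > 0);
    assert(src.size() == idx.size() * n_cols);
    for (std::size_t i = 0; i < idx.size(); ++i) {
        assert(idx[i] >= 0);
        const std::size_t row = static_cast<std::size_t>(idx[i]);
        assert((row + 1) * n_cols <= dst.size());
        for (std::size_t c = 0; c < n_cols; ++c) {
            dst[row * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

Destination rows that no index refers to are left untouched.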
server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)
ggml-ci
- vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
- sync : ggml
ggml-ci
- vulkan : remove unused vars (#0)
ggml-ci
sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world
Signed-off-by: Molly Sophia mollysophia379@gmail.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…
gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
- ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
- ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
- ggml-cpu: better variable names
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
- docs: update s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
- ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove sigint from fp16 store
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rework noop
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: disable fp32->fp16 nnpa conversions for now
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix typo
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add missing func
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: diagnose why NNPA macro is not being defined
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: update macro tests
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test more macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add debug prints
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)
- ggml-cpu: move things around
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove mistaken fallback macro
fallback logic was already implemented but i was too sleepy to realise
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
- ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
fix ggml time initialization
fix f32_f16 table init
remove extra line
Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de
docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup
ggml-ci
- metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
- metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
ref: #8366
use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst
ggml-ci
ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation
ggml-ci
ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR
ggml-ci
ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
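The row-copy semantics described for ggml_set_rows(a, b, c) above can be illustrated with a minimal standalone sketch. It does not use the ggml API: the function name set_rows_f32, the flat row-major float layout, and the bounds assertion are assumptions made only for illustration; the 64-bit index type follows the "use I64 for indices" note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical helper, not the ggml implementation.
// Copies row r of src into row idx[r] of dst.
// dst is n_dst_rows x n_cols, src is n_src_rows x n_cols, both row-major.
static void set_rows_f32(float * dst, size_t n_dst_rows,
                         const float * src, size_t n_src_rows,
                         const int64_t * idx, size_t n_cols) {
    for (size_t r = 0; r < n_src_rows; ++r) {
        // Reject indices that fall outside the destination.
        assert(idx[r] >= 0 && (size_t) idx[r] < n_dst_rows);
        memcpy(dst + (size_t) idx[r] * n_cols,
               src + r * n_cols,
               n_cols * sizeof(float));
    }
}
```

With a 3x2 destination, a 2x2 source and indices {2, 0}, source row 0 lands in destination row 2 and source row 1 lands in destination row 0; destination row 1 is left untouched.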
- recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
- graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
- vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com
vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups
style fixes
last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter
Co-authored-by: slaren slarengh@gmail.com
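The use-count idea behind the RMS_NORM+MUL fusion can be sketched roughly as follows. This is a simplified model, not the ggml code: node_t, op_t, count_uses and can_fuse_rms_norm_mul are hypothetical names, and the graph is reduced to an array of nodes that refer to their inputs by index. The sketch only shows that an intermediate result may be fused away when exactly one later node consumes it, and that the MUL may take it as either operand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_SRC = 2 };

// Hypothetical op set and node layout for illustration only.
typedef enum { OP_NONE, OP_RMS_NORM, OP_MUL } op_t;

typedef struct {
    op_t op;
    int  src[MAX_SRC]; // indices of input nodes, -1 when unused
} node_t;

// Count how many times each node is consumed as an input.
static void count_uses(const node_t * nodes, size_t n, int * use_counts) {
    for (size_t i = 0; i < n; ++i) {
        use_counts[i] = 0;
    }
    for (size_t i = 0; i < n; ++i) {
        for (int s = 0; s < MAX_SRC; ++s) {
            if (nodes[i].src[s] >= 0) {
                assert((size_t) nodes[i].src[s] < n);
                use_counts[nodes[i].src[s]]++;
            }
        }
    }
}

// Node i (RMS_NORM) can fuse with node i+1 (MUL) only if the norm result
// has a single consumer and that consumer is the MUL (operands may be swapped).
static bool can_fuse_rms_norm_mul(const node_t * nodes, size_t n,
                                  const int * use_counts, size_t i) {
    if (i + 1 >= n) {
        return false;
    }
    if (nodes[i].op != OP_RMS_NORM || nodes[i + 1].op != OP_MUL) {
        return false;
    }
    if (use_counts[i] != 1) {
        return false;
    }
    return nodes[i + 1].src[0] == (int) i || nodes[i + 1].src[1] == (int) i;
}
```

Keeping the counts in a side array rather than in the nodes mirrors the later commit in this list that moves the use count to the graph to avoid data races.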
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels
ggml-ci
add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints
ggml-ci
64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]
ggml-ci
Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai
GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using __dpct_inline__ to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com
scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml
Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually when needed.
- Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
- Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels
ggml-ci
- cont : disable for q4_1
ggml-ci
- cont : disable for iq4_nl
ggml-ci
- memory : correctly handle failure in apply() (#14438)
ggml-ci
Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable best_mad to best_error in the
make_qkx2_quants function.
The motivation for this is that the name best_mad can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
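The difference that an "align corners" mode makes to bilinear resampling comes down to how a destination index is mapped to a source coordinate. The sketch below shows the two commonly used mappings; the function name src_coord is hypothetical, and treating the non-aligned case as the half-pixel-center convention is an assumption, not something stated in the commit.

```c
#include <assert.h>
#include <stdbool.h>

// Hypothetical helper: map destination index dst_i (0 <= dst_i < n_dst)
// to a fractional source coordinate along one axis of length n_src.
static float src_coord(int dst_i, int n_src, int n_dst, bool align_corners) {
    assert(n_src > 0 && n_dst > 0);
    if (align_corners) {
        // First and last samples of source and destination coincide.
        if (n_dst == 1) {
            return 0.0f;
        }
        return (float) dst_i * (float) (n_src - 1) / (float) (n_dst - 1);
    }
    // Half-pixel centers (assumed default): pixel i covers [i, i + 1).
    return ((float) dst_i + 0.5f) * (float) n_src / (float) n_dst - 0.5f;
}
```

The same mapping covers downscaling, since nothing requires n_dst to be larger than n_src. Coordinates outside [0, n_src - 1] in the half-pixel case would need clamping before the interpolation weights are computed.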
sync : ggml
ggml-ci
ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2
Signed-off-by: noemotiovon 757486878@qq.com
- Support MUL_MAT_ID on 310p
Signed-off-by: noemotiovon 757486878@qq.com
- fix editorconfig
Signed-off-by: noemotiovon 757486878@qq.com
Signed-off-by: noemotiovon 757486878@qq.com
- Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI
ggml-ci
- cont : remove -g flag
ggml-ci
ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
Co-authored-by: Diego Devesa slarengh@gmail.com
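The chaining pattern that returning the previous callback enables can be sketched in isolation. Everything here is a stand-in: abort_callback_t, set_abort_callback and report_fatal are hypothetical names that model the described behavior and are not the ggml declarations.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical callback type: receives the error message just before abort.
typedef void (*abort_callback_t)(const char * message);

static abort_callback_t g_abort_callback = NULL;

// Install a new callback and return the previous one so callers can chain.
static abort_callback_t set_abort_callback(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the fatal-error path; a real one would call abort() afterwards.
static void report_fatal(const char * message) {
    assert(message != NULL);
    if (g_abort_callback != NULL) {
        g_abort_callback(message);
    }
}

static int g_base_calls = 0;
static int g_app_calls  = 0;
static abort_callback_t g_prev_callback = NULL;

// An earlier handler, e.g. one installed by a framework.
static void base_on_abort(const char * message) {
    (void) message;
    g_base_calls++;
}

// The application's handler: do its own work, then forward to the previous one.
static void app_on_abort(const char * message) {
    g_app_calls++; // e.g. show a dialog or save user data
    if (g_prev_callback != NULL) {
        g_prev_callback(message);
    }
}
```

An application without a console would store the value returned by the setter in g_prev_callback when it installs app_on_abort, so that any handler registered earlier still runs.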
github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition
ggml-ci
- cont : fix n_ctx_used computation
ggml-ci
opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version
This commit adds a function ggml_version() to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.
Usage:
printf("GGML version: %s\n", ggml_version());
Output:
GGML version: 0.0.2219
- ggml : add ggml_commit()
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- sync : ggml
ggml-ci
llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This would otherwise cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)
ggml-ci
ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3
ggml-ci
- backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows
ggml-ci
- graph : separate k and v indices
ggml-ci
- cont : remove redundant ifs
ggml-ci
kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments
ggml-ci
- ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione nicolo.scipione@codeplay.com
ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong luyuhong@kylinos.cn
- batch : add n_used count (#14512)
ggml-ci
- graph : prepare for 4D mask (#14515)
ggml-ci
- batch : add optional for sequential equal split (#14511)
ggml-ci
- metal : disable fast math in all quantize kernels (#14528)
ggml-ci
test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- refactor
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
remove visitor nonsense
remove visitor comment
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com
eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
handle null mask for gqa
allow gqa with dim3>1
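Packing a boolean and a small integer into one 32-bit word, as described for the push constant block above, can be sketched like this. The bit layout (flag in bit 16, n_head_log2 in the low 16 bits) and the function names are assumptions chosen for illustration; the commit message does not state the actual layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Assumed layout: bits 0..15 hold n_head_log2, bit 16 holds the mask flag.
#define N_HEAD_LOG2_BITS 16u
#define N_HEAD_LOG2_MASK ((1u << N_HEAD_LOG2_BITS) - 1u)

static uint32_t pack_mask_n_head_log2(bool has_mask, uint32_t n_head_log2) {
    // The value must fit in the low field or it would clobber the flag.
    assert(n_head_log2 <= N_HEAD_LOG2_MASK);
    return ((uint32_t) has_mask << N_HEAD_LOG2_BITS) | n_head_log2;
}

static bool unpack_has_mask(uint32_t packed) {
    return ((packed >> N_HEAD_LOG2_BITS) & 1u) != 0u;
}

static uint32_t unpack_n_head_log2(uint32_t packed) {
    return packed & N_HEAD_LOG2_MASK;
}
```

The shader side would perform the same shift and mask to recover both values, which saves one dword compared with passing them as separate push constants.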
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
- vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com
CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont
ggml-ci
- cont : fix multi-rope + add test
ggml-ci
- sycl : try fix
ggml-ci
- cont : fix sycl + clean-up cuda
ggml-ci
vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style
Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com
server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code
Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com
- model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang stevenkuang@tencent.com
vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
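For context, the reduction that a split_k pass has to perform can be written out as follows. This is the standard way of merging partial softmax results and is given here as a sketch, not as a description of the shader: it assumes that each of the k_num splits i produces a running maximum m_i (the M value), a sum of exponentials l_i (the L value) and an unnormalized partial output O_i over its own slice of the KV range with scores s_j and value rows v_j.

```latex
% Per-split quantities (assumed definitions), for scores s_j in split i:
%   m_i = \max_j s_j, \quad
%   \ell_i = \sum_j e^{s_j - m_i}, \quad
%   O_i = \sum_j e^{s_j - m_i} v_j
\begin{aligned}
m    &= \max_{i} m_i, \\
\ell &= \sum_{i=1}^{k_{\mathrm{num}}} e^{m_i - m}\, \ell_i, \\
O    &= \frac{1}{\ell} \sum_{i=1}^{k_{\mathrm{num}}} e^{m_i - m}\, O_i .
\end{aligned}
```

Each factor e^{m_i - m} rescales a split from its local maximum to the global one, since e^{m_i - m} e^{s_j - m_i} = e^{s_j - m}. The sums over i are independent per output element, which is why one thread per element of the HSV dimension is a natural fit.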
convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"
This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.
fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
flake8 fixes
Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
added hashes
Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
update the update file
Revert "update the update file"
This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.
fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
d_inner fixed
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
reshaping ssm_norm for 34B
removing generate_mup
remove duplicate metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp
Co-authored-by: compilade git@compilade.net
falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
falcon-h1: fix last comment
Update convert_hf_to_gguf.py
Co-authored-by: compilade git@compilade.net
falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues
Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net
llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- support for smoldocling
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- fixed merge conflicts
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- safetensors tensor mapping
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update include/llama.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
Co-authored-by: Xuan-Son Nguyen son@huggingface.co
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Co-authored-by: compilade git@compilade.net
opencl: add `set_rows` for `f16` and `f32` (#14547)
opencl: add `set_rows` for `f16` and `f32`
opencl: better choose workgroup size for `set_rows`
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
- llama : rename llama_cache to llama_past
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)
Still, I'm open to better suggestions.
examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models
This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add Granite 4 conversion
This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Plumb bamba through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add bamba to llama_arch_is_hybrid_recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add optional mamba ssm_in bias tensor
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add template specialization for get_arr to load a vector for layer index arr in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Use an explicit bool to determine mamba vs mamba2
This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture if statement inside the mamba
implementation.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Isolate mamba(2) and granite attention layer building in static methods
This will allow these layer-builder methods to be used from other build structs without complex inheritance.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer sizes in granite build_attention_layer
Also no need to pass in kv cache since it's already in the inp_attn
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: First (broken) pass at end-to-end Bamba implementation
It generates (garbage) tokens! Still lots of debugging to do.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Only do Granite multipliers if set
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Pull granite ffn portion into a static function and reuse in hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat(py): Allow gguf duplicate keys if they match by value and type
This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor(py): Simplify granitemoehybrid conversion to use parents better
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add GRANITE_MOE_HYBRID through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Support GRANITE_MOE_HYBRID in llama-model
This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix flake8 errors
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix recurrent cache get after rebase
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins
The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from self-> everywhere, but this is still a
bit cleaner than making those methods static I think.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
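The diamond problem described in the commit above can be illustrated with a minimal sketch in standard C++. All type names here (`context`, `mamba_plain`, `hybrid_virt`, and so on) are hypothetical stand-ins, not the real `llm_graph_context` hierarchy; the sketch only counts how many times the common base is constructed under plain versus virtual inheritance.

```cpp
#include <cassert>

// Counts constructions of the shared base (hypothetical stand-in for a
// context type with non-trivial initialization).
static int g_base_inits = 0;

struct context {
    context() { ++g_base_inits; }
};

// Plain diamond: each intermediate class carries its own copy of the base.
struct mamba_plain   : context {};
struct granite_plain : context {};
struct hybrid_plain  : mamba_plain, granite_plain {};

// Virtual inheritance: the most-derived class owns one shared base subobject.
struct mamba_virt   : virtual context {};
struct granite_virt : virtual context {};
struct hybrid_virt  : mamba_virt, granite_virt {};

// Constructs one T and reports how many times the base constructor ran.
template <typename T>
int count_base_inits() {
    g_base_inits = 0;
    T obj;
    (void) obj;
    return g_base_inits;
}
```

In this toy setup, `count_base_inits<hybrid_plain>()` reports two base constructions and `count_base_inits<hybrid_virt>()` reports one. Whether the same holds for the real classes depends on how their constructors are written.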
- refactor: Implement the full copy-paste version to duplicate the layer builders
This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of build_rs /
build_attn. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid
I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).
As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
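A minimal sketch of the constraint, using hypothetical stand-in types rather than the real `llm_graph_context`: when a base class has a non-virtual destructor and objects are destroyed through a base pointer, a subclass that adds fields changes the allocated size, which a sanitizer can flag. The static assertions below check the "no extra fields" condition at compile time. The second half shows the usual alternative, a virtual destructor, which is an assumption about one possible fix and not what the commit did.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical base with a NON-virtual destructor.
struct graph_context_base {
    int n_layer = 0;
};

// Adds behavior only: same size as the base.
struct build_mamba_ok : graph_context_base {
    int layers_times_two() const { return 2 * n_layer; }
};

// Adds a field: the object is now larger than the base.
struct build_mamba_extra : graph_context_base {
    int extra_state[4] = {};
};

static_assert(!std::has_virtual_destructor<graph_context_base>::value,
              "base destructor is non-virtual in this sketch");
static_assert(sizeof(build_mamba_ok) == sizeof(graph_context_base),
              "subclass without fields keeps the base size");
static_assert(sizeof(build_mamba_extra) > sizeof(graph_context_base),
              "extra fields change the allocation size");

// Alternative design: a virtual destructor makes deletion through a base
// pointer run the subclass destructor.
struct safe_base {
    virtual ~safe_base() = default;
};

struct safe_derived : safe_base {
    explicit safe_derived(int * counter) : destroyed(counter) {}
    ~safe_derived() override { ++*destroyed; }
    int * destroyed;
};

// Destroys a safe_derived through a safe_base pointer and returns how many
// derived destructors ran.
inline int destroy_through_base() {
    int count = 0;
    {
        std::unique_ptr<safe_base> p = std::make_unique<safe_derived>(&count);
    }
    return count;
}
```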
- memory : correctly handle failure in apply()
ggml-ci
- style: Remove TODO for adding first hybrid models to the switch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- docs: Fix comment about duplicate key check
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Conform to standard way of initializing inp_out_ids
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use llm_graph_context_mamba in llm_build_granite_hybrid
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins
The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- feat: Add support for dense FFN in GraniteMoeHybrid
This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add support for dense FFN tensor names on c++ side
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use child inputs for Falcon H1 after merge resolution
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary prefix on tensor constants
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
- refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid
The only key difference is the use of rope which is now set via rope_finetuned in the hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove use of diamond inheritance
Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance
https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Log mamba params for Granite Hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unused ssm_in_b
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv
This matches how recurrent vs attention heads are identified for Jamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
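The idea of identifying layer types from the per-layer KV head count can be sketched as follows. The struct and function names are hypothetical and are not the actual hparams API; the only assumption taken from the commit messages is that a layer with 0 KV heads is treated as recurrent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for hybrid-model hyperparameters.
struct hybrid_hparams {
    std::vector<uint32_t> n_head_kv_per_layer;

    // A layer with zero KV heads has no attention cache, so treat it as
    // recurrent (for example, a Mamba layer).
    bool is_recurrent(std::size_t il) const {
        return n_head_kv_per_layer.at(il) == 0;
    }
};

// Collects the indices of attention layers, replacing the need for a
// separate "attention layer indices" list.
inline std::vector<std::size_t> attention_layers(const hybrid_hparams & hp) {
    std::vector<std::size_t> out;
    for (std::size_t il = 0; il < hp.n_head_kv_per_layer.size(); ++il) {
        if (!hp.is_recurrent(il)) {
            out.push_back(il);
        }
    }
    return out;
}
```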
- fix: Remove unused template expansion for get_arr
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Review cleanup in convert_hf_to_gguf
The gist is to be explicit about which base class is being used with the multiple inheritance setup
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Undo hidden warnings about duplicate identical keys in add_key_value
After further discussion, this encourages sloppy overwriting in the model converters
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: If not using ROPE, context is "infinite"
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- doc: Add a comment outlining expected duplicate key warnings
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary duplicate keys in converter
Co-authored-by: Francis Couture-Harpin git@compilade.net
(thanks for the sharp eyes and patience!)
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Co-authored-by: Francis Couture-Harpin git@compilade.net
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)
ggml-ci
readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)
Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.
- vulkan: optimize set_rows
Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
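For orientation, a CPU reference of what a set-rows operation computes might look like the sketch below. It assumes the semantics are "copy row i of the source into row `row_ids[i]` of the destination" for plain `float` data; the function name `set_rows_ref` is hypothetical, and the quantized variants and workgroup layout discussed above are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: dst and src are row-major matrices with
// n_cols columns; row i of src is written to row row_ids[i] of dst.
inline void set_rows_ref(std::vector<float> & dst,
                         const std::vector<float> & src,
                         const std::vector<int64_t> & row_ids,
                         std::size_t n_cols) {
    assert(src.size() == row_ids.size() * n_cols);
    for (std::size_t i = 0; i < row_ids.size(); ++i) {
        assert(row_ids[i] >= 0);
        const std::size_t r = static_cast<std::size_t>(row_ids[i]);
        assert((r + 1) * n_cols <= dst.size());
        for (std::size_t c = 0; c < n_cols; ++c) {
            dst[r * n_cols + c] = src[i * n_cols + c];
        }
    }
}
```

A GPU version would distribute the outer and inner loops across threads; this sketch only pins down the expected result for testing.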
server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)
ggml-ci
- vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
- sync : ggml
ggml-ci
- vulkan : remove unused vars (#0)
ggml-ci
sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world
Signed-off-by: Molly Sophia mollysophia379@gmail.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: Molly Sophia mollysophia379@gmail.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…
gianni-cor pushed a commit to gianni-cor/qvac-fabric-llm.cpp that referenced this pull request
sycl: GGML_SYCL_DISABLE_OPT on by default for all Intel Devices (#13973)
ggml : do not output unprintable characters on GGUF load failure (#14381)
ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317)
ggml-cpu: add nnpa compile flag
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 4a9f60c201573128f73a65999b3e5cc497fae5c1)
- ggml-cpu: add fp16->fp32 nnpa first
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 8d4a7987f9c1887f716be96250f2caeee0253929)
- ggml-cpu: add fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 0ff0d6516247a41d2ade42b42cf0d676a4dd1627)
- ggml-cpu: better variable names
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 2f58bbcbb89c183340e252362b2a40651f573f1f)
- docs: update s390x docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 01b929491b50071a5d0572235dcf5a449da70aa7)
- ggml-cpu: add debugging prints to see if dlf16 is correct
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix print vs printf
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix float placeholder
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: ensure fp16 and fp32 load and stores are called
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fp16 load ensured to hit
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove sigint from fp16 store
for some reason, the function is not getting a hit when debugged with gdb. we will need to investigate further
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp16_to_fp32
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa activate ggml_cpu_fp16_to_fp32 for 8 elements
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: nnpa switch to vec_xst test
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to vec_xst for 4 element loops also
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rework noop
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove noop, general code cleanup
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify variable naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa for ggml_cpu_fp32_to_fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add breakpoint for debugging
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test fix for conversion failure
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: disable fp32->fp16 nnpa conversions for now
there are some conversion failures in nnpa that require the eyes of an ibm stsm. will create a separate pr to introduce the fp32->fp16 change.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to elif macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix typo
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: reattempt fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix compiler types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: change to typedef vector types
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add 4 element loops for fp32->fp16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarified vector naming
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back fp32->fp16 store nnpa
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: activate nnpa fp32->fp16 or fp16->fp32 compute
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add nnpa macro check in ggml-impl
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add missing func
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: diagnose why NNPA macro is not being defined
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: import vecintrin.h to fix compiler errors
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: update macro tests
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 157f856c34589566151630e294563a420702db39.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to importing ggml-cpu-impl instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix macro declaration
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: test more macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add debug prints
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bruteforce macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move macro definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h to cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to private macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move s390x typedef to own header file
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 157f856c34589566151630e294563a420702db39)
- ggml-cpu: move things around
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile macros
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: switch to quotes for import
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add compiler error macro
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add s390x detection in ggml-src
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: undo cmakelists work
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move s390x typedef to own header file"
This reverts commit 18d79e1a30b39d9aaa0bd58400c5cf2c32135c9a.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedefs.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove typedef from cmakelists
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add ggml-impl.h future notes
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: add todo comment for future reference
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: clarify naming of dlf16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove unnecessary target compile definitions
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa fp16->fp32 and fp32->fp16 to simd-mappings
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: update broken huggingface link for s390x
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix duplicate func names during compile
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: fix duplicate func names during compile"
This reverts commit fbb733451f27677063b914d4f6c9a9841d45b38d.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: refactor fp32->fp16 and fp16->fp32 simd to ggml-cpu"
This reverts commit bd288e8fa52b5244f65cee21cb61062f1a9e0ca5.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: refactor fp16<->fp32 simd to ggml-cpu
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h import in quants.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix missing simd-mappings.h within repack
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix amx mmq missing simd-mappings.h
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: attempt at fixing loongarch failing build
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move nnpa together with other fp16<->fp32 simd
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: fix wrong refactor of ggml-base
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164176555
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: remove dependency on ggml-cpu from ggml-base
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: rename all fp16<->fp32 macros to prefix with ggml_cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164449406
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: remove mistaken fallback macro
fallback logic was already implemented but i was too sleepy to realise
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: move ggml_table_f32_f16 back to ggml-base due to ci failures"
This reverts commit 32a3533564bdb7902cefb9c89b1c9e956a81ce29.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml: move ggml_table_f32_f16 to ggml-cpu"
This reverts commit 9e40d984ad27d7b60392fb2b7548885201864fe4.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml: move ggml_table_f32_f16 to ggml-cpu
ref: https://github.com/ggml-org/llama.cpp/pull/14317#discussion_r2164775006
Signed-off-by: Aaron Teo aaron.teo1@ibm.com (cherry picked from commit 9e40d984ad27d7b60392fb2b7548885201864fe4)
- ggml: move ggml_table_f32_f16 to ggml-cpu.c
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: extern c ggml_table_f32_f16 + chore docs
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h
we rely on the variable declaration in ggml-cpu.c instead
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: dedup ggml_table_f32_f16 from simd-mappings.h"
This reverts commit f71b21d2f74f5e03ec0c2b4fefd3cbf395aecf16.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- ggml-cpu: bring back ggml_table_f32_f16
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- Revert "ggml-cpu: bring back ggml_table_f32_f16"
This reverts commit 2dce119178bed5ef5c8398c4230ddd14fef80e49.
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
fix ggml time initialization
fix f32_f16 table init
remove extra line
Signed-off-by: Aaron Teo aaron.teo1@ibm.com Co-authored-by: slaren slarengh@gmail.com
musa: enable fp16 mma (all) and cublas on qy2 (#13842)
musa: enable fp16 mma (all) and cublas on qy2
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Update ggml/src/ggml-cuda/ggml-cuda.cu
Co-authored-by: Johannes Gäßler johannesg@5d6.de
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- Address review comments
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
- musa: disable MUL_MAT_ID (q2_k × f32) due to precision issues
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com Co-authored-by: Johannes Gäßler johannesg@5d6.de
docs: update s390x documentation + add faq (#14389)
docs: update s390x documentation + add faq
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
- docs: add s390x z17 build q&a
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
Signed-off-by: Aaron Teo aaron.teo1@ibm.com
metal : batch rows copy in a single threadgroup (#14384)
metal : batch rows copy in a single threadgroup
ggml-ci
- metal : handle some edge cases when threadgroup size is not a power of 2
ggml-ci
- metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
llama : return mistral-v7-tekken as default template only (#14390)
cmake: regen vulkan shaders when shaders-gen sources change (#14398)
Add shaders-gen sources as target deps
model : gemma3n text-only (#14400)
gemma3n
add llm_graph_input_one
convert : fix broken sentencepiece vocab (#14416)
ggml : add ggml_set_rows (#14274)
ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using indices from 'c'.
ref: #8366
use I64 for indices
ggml : add repeat impl for i64
ggml : add ggml_is_contiguous_rows
ggml : ggml_set_rows support broadcast
ggml : ggml_set_rows support quantized dst
ggml-ci
ggml : support GGML_TYPE_F32 ".from_float" trait
ggml : ggml_set_rows update comment + better index name
tests : add ggml_set_rows
metal : add ggml_set_rows implementation
ggml-ci
ggml : simplify forward_dup_f32
ggml : fix supports_op
tests : add comment to set_rows
ggml : leave the repeat_i64 for a separate PR
ggml-ci
ggml : set_rows use std::min instead of MIN
ggml : better error message for set_rows unsupported type
metal : perform op->type check only once
tests : more consistent implementation + more tests
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
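The row-copy semantics described for ggml_set_rows(a, b, c) in #14274 can be illustrated with a minimal, self-contained sketch. It does not use the ggml API: the `Matrix` type and `set_rows_sketch` function are hypothetical names introduced only for this example, and broadcasting and quantized destinations are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical row-major matrix used only for this illustration (not a ggml type).
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data; // rows * cols elements, zero-initialized

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0f) {}
    float & at(std::size_t r, std::size_t c) { return data[r * cols + c]; }
};

// Copy row i of `src` into row idx[i] of `dst`, for every i.
// Indices are 64-bit, mirroring the "use I64 for indices" note in the commit list.
void set_rows_sketch(Matrix & dst, const Matrix & src, const std::vector<std::int64_t> & idx) {
    assert(src.cols == dst.cols);
    assert(idx.size() == src.rows);
    for (std::size_t i = 0; i < src.rows; ++i) {
        assert(idx[i] >= 0 && static_cast<std::size_t>(idx[i]) < dst.rows);
        const std::size_t target = static_cast<std::size_t>(idx[i]);
        for (std::size_t c = 0; c < src.cols; ++c) {
            dst.data[target * dst.cols + c] = src.data[i * src.cols + c];
        }
    }
}
```

Rows of the destination that no index refers to are left untouched, which is what makes this shape of operation suitable for scattered writes such as the KV-cache update in #14285.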
- recurrent : call balloc split_reset() in init_batch() (#14414)
ggml-ci
- graph : make llm_graph_context destructor virtual (#14410)
ggml-ci
- vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (#14427)
This setting needs to be passed through to vulkan-shaders-gen
ci : fix windows build and release (#14431)
fix async_mode bug (#14432)
model : add support for ERNIE 4.5 0.3B model (#14408)
Add Day-0 support for Baidu ERNIE 4.5 0.3B model.
Signed-off-by: Weizhao Ouyang weizhao.ouyang@arm.com
vulkan: lock accesses of pinned_memory vector (#14333)
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
CUDA: add bf16 and f32 support to cublas_mul_mat_batched
Review: add type traits and make function more generic
Review: make check more explicit, add back comments, and fix formatting
Review: fix formatting, remove useless type conversion, fix naming for bools
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test.
extract some common fusion logic
fix -Winconsistent-missing-override
move ggml_can_fuse to a common function
build fix
C and C++ versions of can_fuse
move use count to the graph to avoid data races and double increments when used in multiple threads
use hash table lookup to find node index
change use_counts to be indexed by hash table slot
minimize hash lookups
style fixes
last node doesn't need single use. fix type. handle mul operands being swapped.
remove redundant parameter
Co-authored-by: slaren slarengh@gmail.com
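As a rough illustration of the RMS_NORM+MUL fusion described in #14366, the sketch below compares an unfused row normalization with a fused variant that applies the multiplier in the same pass. This is plain CPU code, not the Vulkan shader; all function names are hypothetical, and `can_fuse_sketch` only models the single-use condition on the intermediate result that the use counts are meant to detect.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused reference: scale a row by 1 / sqrt(mean(x^2) + eps).
std::vector<float> rms_norm_row(const std::vector<float> & x, float eps) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale;
    return y;
}

// Fused sketch: normalize and multiply by `w` without materializing the
// intermediate normalized row.
std::vector<float> rms_norm_mul_row(const std::vector<float> & x,
                                    const std::vector<float> & w, float eps) {
    assert(!x.empty() && x.size() == w.size());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) y[i] = x[i] * scale * w[i];
    return y;
}

// Hypothetical fusion check: the intermediate RMS_NORM output may only be
// skipped if the MUL node is its single consumer.
bool can_fuse_sketch(int rms_norm_use_count) {
    return rms_norm_use_count == 1;
}
```

If the normalized row had a second consumer, skipping it would leave that consumer without its input, which is why the use count matters.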
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
implement unary REGLU/GEGLU/SWIGLU cpu ops
relax constraints
duplicate shape of source
fix ggml_vec_geglu_f16
special case gated ops
implement unary REGLU/GEGLU/SWIGLU cuda ops
tighten constraints again
refactor into GGML_GLU_OP
metal : add glu kernels
ggml-ci
add CUDA_GLU_BLOCK_SIZE [no ci]
more constraints and use 64bit ints
ggml-ci
64bit multiplication [no ci]
implement swapped variants (cpu/cuda)
update comment [no ci]
ggml-ci
Vulkan: Add GLU ops and shaders
SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
ggml : implement GLU for split up/gate (#14181)
implement GLU for split up/gate
add tests for ggml_glu_split
Vulkan: Implement glu_split logic and shader support
add split to logging [no ci]
SYCL: refactor element_size ops and add split up and gate support to gated kernels
SYCL: switch GEGLU to use tanh approximation
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai
GGML: increase OP count in assertion
Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions op_xxx for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using __dpct_inline__ to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
vulkan: Increase workgroup size for GLU, for performance (#14345)
vulkan: Increase workgroup size for GLU, for performance
vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
merge fix
metal : add support for split and swap
ggml-ci
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: 0cc4m picard12@live.de Co-authored-by: Akarshan akarshan@menlo.ai Co-authored-by: Jeff Bolz jbolz@nvidia.com
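The gated ops listed in #14158 and #14181 share one pattern: an activation applied to one operand, multiplied elementwise by a gate. The sketch below shows that pattern on the CPU under stated assumptions: in the single-tensor form the activation is assumed to act on the first half of the row and the second half is the gate, with the "swapped" variant reversing the roles. The function names are hypothetical, and the actual ggml layout may differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

float relu_f(float x) { return x > 0.0f ? x : 0.0f; }
float silu_f(float x) { return x / (1.0f + std::exp(-x)); }
// GELU, tanh approximation.
float gelu_tanh_f(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// Single-tensor form: the row holds both operands back to back.
// Assumption: act() is applied to the first half unless `swapped` is set.
std::vector<float> glu_row_sketch(const std::vector<float> & row,
                                  float (*act)(float), bool swapped) {
    assert(row.size() % 2 == 0);
    const std::size_t n = row.size() / 2;
    const float * x = row.data() + (swapped ? n : 0);
    const float * g = row.data() + (swapped ? 0 : n);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = act(x[i]) * g[i];
    return out;
}

// Split form: the activated operand and the gate live in separate buffers.
std::vector<float> glu_split_sketch(const std::vector<float> & x,
                                    const std::vector<float> & gate,
                                    float (*act)(float)) {
    assert(x.size() == gate.size());
    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = act(x[i]) * gate[i];
    return out;
}
```

Passing `relu_f`, `gelu_tanh_f`, or `silu_f` corresponds to the REGLU, GEGLU, and SWIGLU cases respectively.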
ggml : fix unmerged GGML_FPxx_TO_FPxx refactoring (#14443)
SYCL: disable faulty fp16 exp kernel (#14395)
SYCL: disable faulty fp16 CPU exponent for now
Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec31b4eb4b0f275dd7acd41d96a375202.
SYCL: disable faulty fp16 CPU exponent for now
Fix logic of disabling exponent kernel
server : fix appearance of the chats list context menu for Safari (#14322)
server : support jinja extra template kwargs (Qwen3 enable_thinking feature), from command line and from client (#13196)
initial commit for handling extra template kwargs
enable_thinking and assistant prefill cannot be enabled at the same time
can set chat_template_kwargs in command line
added doc
fixed formatting
add support for extra context in generic template init
coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
fix merge conflict
chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
normalize environment variable name
simplify code
prefill cannot be used with thinking models
compatibility with the new reasoning-budget parameter
fix prefill for non thinking models
Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Olivier Chafik olivier.chafik@gmail.com
scripts : make the shell scripts cross-platform (#14341)
cmake : Remove redundant include path in CMakeLists.txt (#14452)
Update docker.yml
Modify docker.yml so that the workflow no longer runs periodically; it can still be started manually if needed.
- Remove redundant include path in CMakeLists.txt
The parent directory '..' was removed from the include directories for the ggml-cpu-feats target, to avoid unnecessary include paths.
- Enable scheduled Docker image builds
Uncomments the workflow schedule to trigger daily Docker image rebuilds at 04:12 UTC, improving automation and keeping images up to date.
test-backend-ops : disable llama test (#14461)
ggml-cpu: sycl: Re-enable exp f16 (#14462)
metal : disable fast-math for some cpy kernels (#14460)
metal : disable fast-math for some cpy kernels
ggml-ci
- cont : disable for q4_1
ggml-ci
- cont : disable for iq4_nl
ggml-ci
- memory : correctly handle failure in apply() (#14438)
ggml-ci
Add Conv2d for CPU (#14388)
Conv2D: Add CPU version
Half decent
Tiled approach for F32
remove file
Fix tests
Support F16 operations
add assert about size
Review: further formatting fixes, add assert and use CPU version of fp32->fp16
opencl : add GEGLU, REGLU, SWIGLU (#14456)
ggml-quants : rename best_mad to best_error (ggml/1283)
This commit renames the variable best_mad to best_error in the
make_qkx2_quants function.
The motivation for this is that the name best_mad can be somewhat
confusing if mean absolute deviation (MAD) is not in use.
ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
add "align corners" mode for bilinear upscale, and allow downscaling
add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
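To make the "align corners" option concrete, the sketch below shows how a destination index could map to a source coordinate in the two modes, using 1D linear interpolation (bilinear applies the same mapping along both axes). The function names are hypothetical, and the half-pixel convention used for the non-aligned mode is an assumption rather than something stated in the commit.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Map a destination index to a (fractional) source coordinate.
float src_coord_sketch(int dst_index, int src_size, int dst_size, bool align_corners) {
    if (align_corners) {
        // First and last samples of source and destination coincide.
        if (dst_size <= 1) return 0.0f;
        return dst_index * static_cast<float>(src_size - 1) / static_cast<float>(dst_size - 1);
    }
    // Assumed half-pixel-center convention for the default mode.
    return (dst_index + 0.5f) * static_cast<float>(src_size) / static_cast<float>(dst_size) - 0.5f;
}

// 1D linear resampling; works for both upscaling and downscaling.
std::vector<float> interpolate_1d_sketch(const std::vector<float> & src, int dst_size,
                                         bool align_corners) {
    assert(!src.empty() && dst_size > 0);
    const int n = static_cast<int>(src.size());
    std::vector<float> out(static_cast<std::size_t>(dst_size));
    for (int i = 0; i < dst_size; ++i) {
        float x = src_coord_sketch(i, n, dst_size, align_corners);
        x = std::min(std::max(x, 0.0f), static_cast<float>(n - 1)); // clamp to valid range
        const int x0 = static_cast<int>(std::floor(x));
        const int x1 = std::min(x0 + 1, n - 1);
        const float t = x - static_cast<float>(x0);
        out[static_cast<std::size_t>(i)] = src[x0] * (1.0f - t) + src[x1] * t;
    }
    return out;
}
```

With align corners enabled, resampling {0, 10} to three samples yields the endpoints unchanged plus their midpoint, which is the behavior the mode is named for.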
sync : ggml
ggml-ci
ggml : remove trailing whitespace (#0)
add GELU_ERF (#14455)
vulkan: Split large mul_mat_id to fit in shared memory (#14451)
CANN: update aclnnGroupedMatmulV2 to aclnnGroupedMatmulV3 (#14411)
[CANN]update to aclnnGroupedMatmulV2
Signed-off-by: noemotiovon 757486878@qq.com
- Support MUL_MAT_ID on 310p
Signed-off-by: noemotiovon 757486878@qq.com
- fix editorconfig
Signed-off-by: noemotiovon 757486878@qq.com
Signed-off-by: noemotiovon 757486878@qq.com
- Add Vulkan images to docker.md (#14472)
Right now it's not easy to find those.
ci : disable fast-math for Metal GHA CI (#14478)
ci : disable fast-math for Metal GHA CI
ggml-ci
- cont : remove -g flag
ggml-ci
ggml : Callback before abort (#14481)
Add a callback that will be called just before abort. This allows apps without a console to display a message to the user and save data if needed.
Return previous callback to allow callback chaining
style fixes
Co-authored-by: Diego Devesa slarengh@gmail.com
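The "return previous callback" design in #14481 is what makes chaining possible: a new handler can keep the old one and forward to it. The sketch below shows the pattern in isolation. Every name in it is hypothetical and it does not call into ggml; a real pre-abort hook would be followed by the actual abort.

```cpp
#include <cassert>
#include <string>

// Hypothetical callback type and registry, for illustration only.
using abort_callback_t = void (*)(const char * message);

static abort_callback_t g_abort_callback = nullptr;

// Install a new callback and return the one that was installed before.
abort_callback_t set_abort_callback_sketch(abort_callback_t cb) {
    abort_callback_t prev = g_abort_callback;
    g_abort_callback = cb;
    return prev;
}

// Stand-in for the point just before abort; real code would abort afterwards.
void notify_abort_sketch(const char * message) {
    if (g_abort_callback) g_abort_callback(message);
}

// Two handlers: the second forwards to whatever was installed before it.
static std::string g_log;
static abort_callback_t g_prev_callback = nullptr;

void first_handler(const char * message) {
    g_log += "first:";
    g_log += message;
    g_log += ';';
}

void second_handler(const char * message) {
    g_log += "second:";
    g_log += message;
    g_log += ';';
    if (g_prev_callback) g_prev_callback(message); // chain to the earlier handler
}
```

An application without a console could use the outer handler to show a dialog or save data, then defer to the previously installed handler so that other components still get notified.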
github : add OpenCL backend to issue templates (#14492)
ci : add OpenCL to labeler workflow (#14496)
opencl : update upscale to support align corners (#14488)
opencl : skip empty nodes on cgraph compute (#14491)
simple-chat : fix context-exceeded condition (#14494)
simple-chat : fix context-exceeded condition
ggml-ci
- cont : fix n_ctx_used computation
ggml-ci
opencl : fix possible buffer overflow in dump_tensor (#14490)
ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
vulkan: support softmax/FA batch and broadcast (#14449)
CUDA: broadcasting for FlashAttention mask (#14500)
CUDA: add softmax broadcast (#14475)
CUDA: add softmax broadcast
Pass by const ref
Review: Use blockDims for indexing, remove designated initializers
Add TODO for noncontiguous input/output
Set RPATH to "@loader_path" / "$ORIGIN" to ensure executables and dynamic libraries search for dependencies in their origin directory. (#14309)
ggml : add version function to get lib version (ggml/1286)
ggml : add version function to get lib version
This commit adds a function ggml_version() to the ggml library that
returns the version of the library as a string.
The motivation for this is that it can be useful to be able to programmatically check the version of the ggml library being used.
Usage:
printf("GGML version: %s\n", ggml_version());
Output:
GGML version: 0.0.2219
- ggml : add ggml_commit()
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- sync : ggml
ggml-ci
llama : initial Mamba-2 support (#9126)
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
cuda : graceful fallback for Mamba-1 models with weird embd size
gguf-py : add support for chat template jinja files (#14508)
add support for chat template jinja files
remove gemma3n hack
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497)
ggml : remove kompute backend (#14501)
ggml-ci
ggml : fix FA mask dim 2 and 3 (#14505)
ggml : fix FA mask dim 2 and 3
ggml-ci
- backends : unsupport batched FA in CUDA and Vulkan
ggml-ci
vulkan : disable FA for mask->ne[2] != 1
kv-cache : use ggml_set_rows (#14285)
kv-cache : use ggml_set_rows
ggml-ci
- graph : separate k and v indices
ggml-ci
- cont : remove redundant ifs
ggml-ci
kv-cache : improve find_slot impl
kv-cache : bounds-check when accessing slot_info indices
kv-cache : add comments
ggml-ci
- ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends
ggml-ci
convert : correct gemma 3n conversion (#14450)
convert : correct gemma 3n conversion
rm redundant code
Fix conditional enabling following arch checks for ggml-sycl (#14504)
Signed-off-by: nscipione nicolo.scipione@codeplay.com
ggml: backward pass for split swiglu (#14483)
vulkan: support mixed/deepseekR1 FA head sizes (#14509)
vulkan: better parameterize FA by head sizes
vulkan: support mixed/deepseekR1 FA head sizes
opencl : broadcast for soft_max (#14510)
ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445)
CANN: Replace aclrtMemsetSync with aclnnInplaceZero operator (#14002)
Co-authored-by: luyuhong luyuhong@kylinos.cn
- batch : add n_used count (#14512)
ggml-ci
- graph : prepare for 4D mask (#14515)
ggml-ci
- batch : add optional for sequential equal split (#14511)
ggml-ci
- metal : disable fast math in all quantize kernels (#14528)
ggml-ci
test-backend-ops: add support for specifying output format (#14368)
test-backend-ops: add support for specifying output format
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Add build_commit and build_number in test_result
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- refactor
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Get build commit from ggml_commit()
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Merge errors into test_operation_info && address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
remove visitor nonsense
remove visitor comment
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
- Address review comments
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com
Signed-off-by: Xiaodong Ye yeahdongcn@gmail.com Co-authored-by: slaren slarengh@gmail.com
eval-callback : check for empty input (#14539)
opencl: add GELU_ERF (#14476)
server : fix assistant prefilling when content is an array (#14360)
vulkan: Handle updated FA dim2/3 definition (#14518)
vulkan: Handle updated FA dim2/3 definition
Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit.
handle null mask for gqa
allow gqa with dim3>1
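Packing a boolean and a small integer into one 32-bit word, as #14518 does to stay within the push constant budget, can be sketched as follows. The bit layout here (flag in bit 16, value in the low 16 bits) and the function names are assumptions made for the example; the shader-side layout in the actual change may differ.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout for this sketch: bit 16 = "mask present", bits 0..15 = n_head_log2.
constexpr std::uint32_t MASK_FLAG_SHIFT = 16;

std::uint32_t pack_mask_n_head_log2(bool has_mask, std::uint32_t n_head_log2) {
    assert(n_head_log2 < (1u << MASK_FLAG_SHIFT)); // value must fit below the flag bit
    return (static_cast<std::uint32_t>(has_mask) << MASK_FLAG_SHIFT) | n_head_log2;
}

bool unpack_has_mask(std::uint32_t packed) {
    return ((packed >> MASK_FLAG_SHIFT) & 1u) != 0;
}

std::uint32_t unpack_n_head_log2(std::uint32_t packed) {
    return packed & ((1u << MASK_FLAG_SHIFT) - 1u);
}
```

Merging the two fields saves one dword in the parameter block, which matters when the block is already close to a fixed size limit.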
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.
Add an env var to disable fusion.
Add some missing checks for supported shapes/types.
Handle fused rms_norm+mul in check_results.
- vulkan: increase LOAD_VEC_A to 8 (IQ1/IQ2) or 4 (IQ3) (#14485)
Commit taken from remyoudompheng's PR https://github.com/ggml-org/llama.cpp/pull/12260
Co-authored-by: Rémy Oudompheng remyoudompheng@gmail.com
CUDA: add bf16 and i32 to getrows (#14529)
llama : remove ggml_cont where possible (#14568)
llama : fix incorrect minicpm3 v_states shape (#14571)
musa: fix build warnings (unused variable) (#14561)
Signed-off-by: Xiaodong Ye xiaodong.ye@mthreads.com
CUDA: add bilinear interpolation for upscale (#14563)
cuda : fix rope with partial rotation and non-cont src (#14580)
cuda : fix rope non-cont
ggml-ci
- cont : fix multi-rope + add test
ggml-ci
- sycl : try fix
ggml-ci
- cont : fix sycl + clean-up cuda
ggml-ci
vulkan: increase timeout for CI (#14574)
model : add hunyuan moe (#14425)
model : add hunyuan moe
tokenizer ok
fix tensor name
cgraph init
chat template
wip
almost working
skip embed, fix bos
cleanup
yarn scaling
cleanup
correct rope type
failed token fix
ntk alpha freq_base
tokenization working
cleanup and pr changes
vocab_size sanity check
ntk alpha generic
Update convert_hf_to_gguf.py
Apply suggestions from code review
fix regression
fix style
Co-authored-by: kooshi 1934337+kooshi@users.noreply.github.com
server: Add ability to mount server at prefix (#14544)
Add server_prefix
Correct server path env
Rename cli flag to --api-prefix
Change all to api_prefix
vulkan : fix rope with partial rotation and non-cont src (#14582)
memory : fix broken batch splits for recurrent cache (#14575)
Splits producing more than one ubatch per batch for recurrent models were broken with #14512.
This fixes it by moving the completeness check after the ubatch split loop.
model : add SmolLM3 (#14581)
Init - first pass.
Model -> ModelBase.
fix errors in conversion.
Update the graph.
up.
up.
wip
cgraph ok
rm redundant code
Co-authored-by: Vaibhavs10 vaibhavs10@gmail.com
- model : fix hunyuan moe chat template (#14584)
Signed-off-by: stevenkuang stevenkuang@tencent.com
vulkan: optimize flash attention split_k_reduce (#14554)
vulkan: allow FA split_k with smaller KV values
vulkan: spread split_k_reduce work across more threads
k_num can get rather large. Use the whole workgroup to reduce the M/L values.
Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).
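For readers unfamiliar with the M/L values mentioned in #14554, the sketch below shows the standard way partial flash-attention results from several KV splits can be merged on the CPU. It assumes each split stores its running maximum M, its exponential sum L, and an unnormalized output vector; the struct and function names are hypothetical, and the real shader's storage format and threading are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Partial {
    float m;               // maximum score seen in this split
    float l;               // sum of exp(score - m) over this split
    std::vector<float> o;  // unnormalized output: sum of exp(score - m) * value
};

// Build the partial result for one split of the KV range.
Partial make_partial(const std::vector<float> & scores,
                     const std::vector<std::vector<float>> & values) {
    assert(!scores.empty() && scores.size() == values.size());
    Partial p;
    p.m = *std::max_element(scores.begin(), scores.end());
    p.l = 0.0f;
    p.o.assign(values[0].size(), 0.0f);
    for (std::size_t i = 0; i < scores.size(); ++i) {
        const float w = std::exp(scores[i] - p.m);
        p.l += w;
        for (std::size_t d = 0; d < p.o.size(); ++d) p.o[d] += w * values[i][d];
    }
    return p;
}

// Merge all splits: rescale each one to the global maximum, then normalize.
std::vector<float> split_k_reduce_sketch(const std::vector<Partial> & parts) {
    assert(!parts.empty());
    float m_max = parts[0].m;
    for (const Partial & p : parts) m_max = std::max(m_max, p.m);
    float l_total = 0.0f;
    std::vector<float> out(parts[0].o.size(), 0.0f);
    for (const Partial & p : parts) {
        const float scale = std::exp(p.m - m_max);
        l_total += p.l * scale;
        for (std::size_t d = 0; d < out.size(); ++d) out[d] += p.o[d] * scale;
    }
    for (float & v : out) v /= l_total;
    return out;
}
```

The loop over output elements is independent per element, which is consistent with the commit's note about launching one thread per element of the HSV dimension.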
convert : fix smollm3 jinja template (#14586)
model : add support for Falcon-H1 family (#14534)
v1
push more fixes
another fix
fix
more fixes
minor fix
more cleaning on python code
python fixes
changed precision for multipliers float 32->64
fixes
another fix
fix
pre-norm -> norm
fix
Revert "fix"
This reverts commit 243e4d1a50bd73467d99f6b289b9a1826f83b94b.
fix
small fix ffn_norm
try
mix instead of max
fix vocab size
conflict solve
fixed multipliers
falcon-h1 specific vocab resolved
read arch from gguf.MODEL_ARCH
mamba_d_ssm added to d_inner find_hparam
remove unused functions from gguf_writer.py
override modify_tensors instead of get_tensors
fix conversion and d_inner
added some cb functions for debugging purposes
inp_out_ids moved outside of layers loop
mup_vec create as float64
fix rope_theta
injected mup
clean ups
rm extra space
rm unused MAMBA_CHUNK_SIZE
rm unused key
add bos False
changed ROPE_TYPE
cleaning debugging stuff
cleaning debug quant
fix comment
some cleanups
some cleanups
Update src/llama-model-loader.cpp
more cleanups
moe cleanups
d_ssm -> d_inner;
cleaning unused hparams
cleanup
more cleanups
more cleanups on python conversion;
minor cleanups
Apply suggestions from code review
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
remove todo
added falcon-h1
tensor not required
clean
remove unneeded attributes
more cleanups and fixed conversion
remove final_norm
flake8 fixes
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
flake8 fixes
Update src/llama-hparams.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-arch.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
added hashes
Update src/llama-arch.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
update the update file
Revert "update the update file"
This reverts commit 082ab4ad2a3927384d878666a5f8cae4eb15f577.
fix: address suggestions
fix: update convert_hf_to_gguf.py
Update gguf-py/gguf/constants.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model-loader.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
d_inner fixed
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
reshaping ssm_norm for 34B
removing generate_mup
remove duplicates metadata keys
rm comment
final comment
fix unused args
fix constants
fix bad merge
Update src/llama-model.cpp
Co-authored-by: compilade git@compilade.net
falcon-h1: remove unused ssm_in_b and bad merge
Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
falcon-h1: fix last comment
Update convert_hf_to_gguf.py
Co-authored-by: compilade git@compilade.net
falcon-h1: revert add_add_bos(False)
falcon-h1: fix tied weights
falcon-h1: remove whitespace
falcon-h1: fix wrong size param
falcon-h1: fix whitespace issues
Co-authored-by: younesbelkada younes.belkada@tii.ae Co-authored-by: Younes B 49240599+younesbelkada@users.noreply.github.com Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net
llama : remove unintended whitespace (#14592)
model : add skt/A.X-4.0 model vocabulary (#14589)
ggml : prevent integer overflow in gguf tensor size calculation (#14595)
ggml : add ggml_scale_bias (#14417)
ggml : add ggml_scale_bias
ggml_vec_mad1_f32
add more simd
add CUDA
sycl
vulkan
cann (placeholder)
opencl
will this fix cpu?
fix cuda
suggestions from coderabbit
fix cann compile error
vDSP_vsmsa
rm __ARM_FEATURE_SVE
use memcpy for op params
make code look more consistent
use scalar for __ARM_FEATURE_SVE
add x param to ggml_vec_mad1_f32
llama : support Jamba hybrid Transformer-Mamba models (#7531)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
mamba : fix non-contiguous usage of ggml_silu
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
convert : fix jamba conv1d shape squeezing
graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
llama : remove llm_graph_input_one (#14603)
cuda : support Falcon-H1 state size for SSM_SCAN (#14602)
cmake : llguidance build parser library only (#14608)
cmake : bump llguidance version to v1.0.1 (#14609)
llama : minor coding style fix for smollm3 (#14605)
SYCL: Initial set_rows kernel implementation (#14562)
SYCL: Initial set_rows kernel implementation
Revert max_threads to 256
Refactor set_rows and address review comments
Deduplicate conversion function
Remove guard before kernel launch and refactor
Fix and add back SFINAE
cmake : do not search for curl libraries by ourselves (#14613)
cmake : do not search for curl libraries by ourselves
run : do not search for curl libraries by ourselves
Docs: script to auto-generate ggml operations docs (#14598)
Docs: script to auto-generate ggml operations docs
Review: formatting changes + change github action
Use built-in types instead of typing
docs : add BLAS and Metal ops
Co-authored-by: Georgi Gerganov ggerganov@gmail.com
Smoldocling support (#14597)
support for smoldocling
fixed merge conflicts
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com
merge conflicts
pre tokenizer merge fix
convert : fix smollm3 jinja template (#14586)
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- support for smoldocling
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- fixed merge conflicts
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-model.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- safetensors tensor mapping
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com
added back accidental removal of clean spaces for hunyuan
Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
updated hash and reordered model list
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update include/llama.h
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update convert_hf_to_gguf_update.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update src/llama-vocab.cpp
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
removed old tensor name
removed tensor mappings -> handled by smolvlm
Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: ryan-mangeno ryanmangeno@gmail.com Co-authored-by: Gabe Goodhart gabe.l.hart@gmail.com Co-authored-by: Xuan-Son Nguyen son@huggingface.co Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com Co-authored-by: compilade git@compilade.net
opencl: add set_rows for f16 and f32 (#14547)
opencl: add set_rows for f16 and f32
opencl: better choose workgroup size for set_rows
opencl: add tiled mul_mat_f16_f32 (#14535)
add tiled mul_mat_f16_f32
fix trailing whitespace
add insightful comments
model : Granite Four (#13550)
wip: llama : separate recurrent states from the KV cache
This will be necessary to support Jamba (and other recurrent models mixed with Attention).
Doesn't compile yet, and finding a slot isn't yet done correctly for recurrent states.
llama : use std::find for seq_nodes in llama_rs_cache
llama : state checkpoints for recurrent models
llama : correctly handle more edge cases for the rs cache
llama : rename many llama_kv_cache_* functions
llama : remove useless return value for some llama_cache_* functions
llama : rethink recurrent state cell counts
llama : begin work on support for variable GQA
This will also be useful for Jamba if we consider the Mamba layers to have 0 KV heads.
llama : gracefully fail when not finding hybrid slot
llama : support Jamba
llama : fix BERT inference without KV cache
convert-hf : check for unprocessed Jamba experts
convert-hf : support Mini-Jamba conversion
llama : fix Jamba quantization sanity checks
llama : sequence-length-aware batch splitting
llama : use equal-sequence-length sub-batches for recurrent models
ggml : simplify SSM-related operators
llama : make recurrent state slot allocation contiguous
llama : adapt internal uses of batches to llama_ubatch
llama : fix batch split output count for embeddings
llama : minimize swaps when reordering logits
This reduces overhead when running hellaswag on thousands of sequences with very small 100k params Mamba models.
- llama : fix edge case finding batch seq_id of split recurrent cell
This otherwise was a problem when running the HellaSwag benchmark with small batch sizes, making it crash.
llama : avoid copies for simple batch splits
llama : use im2col and mul_mat to perform convolution for Mamba
This removes the need for ggml_ssm_conv!!! But performance seems slightly worse on my system, especially for prompt processing. Maybe ggml_mul_mat isn't optimized for small row sizes? More performance testing is necessary until GGML_OP_SSM_CONV is removed.
ggml : make ggml_ssm_scan not modify its source tensors
llama : fix shared recurrent tail cell count for small ubatch sizes
Otherwise it was impossible to run the 'parallel' example with '-ub 1' with a Mamba or Jamba model.
llama : fix .base() compilation error on Windows
llama : allow doing the equivalent of SSM_CONV with SUM_ROWS and MUL
ggml : allow GGML_OP_CONCAT to work on non-contiguous tensors
The implementation already supported it, and this makes Mamba's conv step slightly faster.
- llama : rename llama_cache to llama_past
This can be changed back later if the name change is wrong. I was renaming the functions anyway to generalize kv-cache-related functions to hybrid and recurrent model architectures. I think llama_past is a better name than llama_cache for a combined kv cache and recurrent state cache, because the states it contains pretty much always come before the newly-added ones for any particular sequence. Also 'llama_past_clear' sounds more obvious in what it does than 'llama_kv_cache_clear'. The future is what the models generate. (For embeddings, the kv cache isn't really used anyway)
Still, I'm open to better suggestions.
examples : replace llama_kv_cache_seq_* with llama_past_seq_*
mamba : fix non-contiguous usage of ggml_silu
llama : initial Mamba-2 support
ggml : SIMD ggml_ssm_scan for Mamba-2
ggml : improve ggml_mul speed when masking recurrent states
llama : support running Mamba-Codestral-7B-v0.1
llama : fix Mamba-2 conv state saving
ggml : make the ggml_mul fast broadcast path more consistently formatted
llama : remove unused variable
llama : add missing break
convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present
The tokenizer.json of Mamba-Codestral-7B-v0.1 otherwise requires workarounds to work correctly.
llama : session saving and reloading for hybrid models
convert_hf : fix Jamba conversion
llama : fix mixed signedness comparison
llama : use unused n_embd_k_gqa in k_shift
This also slightly reduces the diff from the master branch
llama : begin renaming llama_past back to llama_kv_cache
llama : avoid redundant state copy for Mamba 1 and 2
metal : attempt to adapt SSM_SCAN for Mamba-2
metal : fix SSM_SCAN pipeline scope
metal : use log and exp instead of log1pf and expf in SSM_SCAN
metal : remove unused arguments for SSM_SCAN
The max index is 31, so trimming the arguments is necessary.
- metal : add back n_seqs to SSM_SCAN args
Whoops, this is needed for the offset in the concatenated output.
metal : fix SSM_SCAN state head offset
metal : fix wrong number of tokens per sequence in SSM_SCAN
ggml : remove unused fast broadcast path in GGML_MUL
This was initially added because states were masked with ggml_mul, but this is no longer done and so this "optimisation" is no longer necessary, or at least not worth the additional code complexity.
- ggml : avoid multiply by D in GGML_OP_SSM_SCAN
This makes the weight buft detection in src/llama.cpp simpler.
- convert : transpose Mamba-2 A, D and reshape SSM_NORM
This breaks existing conversions of Mamba-2 models to avoid some reshapes.
Not sure if it's a good idea, but it makes the graph slightly cleaner.
llama : more appropriate SSM_SCAN and SSM_CONV buft support checks
convert : fix flake8 lint
llama : remove implicit recurrent state rollbacks
llama : partially apply clang-format style
metal : fix confusion between ; and ,
metal : add missing args for nb references in ssm_scan_f32_group
metal : single-user mamba2 inference works
kv-cache : remove const_cast when setting inputs for s_copy
And also fix multi-user inference for recurrent models by using cell_id instead of i as the kv cell index when populating s_copy.
convert : avoid AutoConfig for Mamba and Mamba2 hparams
kv-cache : allow context shift for recurrent models
graph : fix recurrent state copies when avoiding copies
Works, but using lambda functions might not be that clean.
ggml : fix mamba2 ssm scan when compiled with SVE
ggml-cpu : reorder SVE FMA for consistency with other SIMD arches
cuda : implement ssm scan for Mamba2
There is still room for improvement, but it works!
cuda : adapt Mamba1 ssm scan to shape changes from Mamba2
feat: Add conversion for Bamba models
This is borrowed and adapted from the original implementation https://github.com/ggml-org/llama.cpp/pull/10810
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add Granite 4 conversion
This is a manual copy from my draft branch https://github.com/gabe-l-hart/llama.cpp/blob/GraniteFourDraft/convert_hf_to_gguf.py#L5076
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Plumb bamba through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add bamba to llama_arch_is_hybrid_recurrent
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add optional mamba ssm_in bias tensor
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add template specialization for get_arr to load a vector for layer index arr in hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Use an explicit bool to determine mamba vs mamba2
This allows other architectures like bamba and granitemoehybrid to use
mamba2 without a growing architecture if statement inside the mamba
implementation.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Isolate mamba(2) and granite attention layer building in static methods
This will allow these layer-builder methods to be used from other build structs without complex inheritance.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use per-layer sizes in granite build_attention_layer
Also no need to pass in kv cache since it's already in the inp_attn
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: First (broken) pass at end-to-end Bamba implementation
It generates (garbage) tokens! Still lots of debugging to do.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Only do Granite multipliers if set
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Pull granite ffn portion into a static function and reuse in hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat(py): Allow gguf duplicate keys if they match by value and type
This is helpful for hybrid models that want to do gguf param setting by calling multiple parent classes without needing to make those parent classes try/except on every attempt to set a gguf value.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor(py): Simplify granitemoehybrid conversion to use parents better
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add GRANITE_MOE_HYBRID through llama-arch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Support GRANITE_MOE_HYBRID in llama-model
This re-uses the Bamba code paths heavily and simply adds the missing parts for loading MoE and the shared expert.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- style: Fix flake8 errors
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix recurrent cache get after rebase
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix hybrid granite implementation for signature changes in build_mamba*_layer
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor relationship between non-hybrid classes and hybrid impl to use mixins
The challenge here is to give both the non-hybrid classes (llm_build_mamba
and llm_build_granite) AND the hybrid class (llm_build_hybrid_mamba) access
to the same intermediate "base class" functionality (build_mamba*_layer,
build_granite_attention_layer) without running into trouble with diamond
inheritance of llm_graph_context. Due to the non-trivial initialization
that happens in llm_graph_context, diamond inheritance results in multiple
initializations of the common base which cause problems around the unique
ptrs. I wanted to get away from self-> everywhere, but this is still a
bit cleaner than making those methods static I think.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Implement the full copy-paste version to duplicate the layer builders
This follows the pattern where the type of input is pinned to the type of
memory and that is used to dispatch to the correct version of build_rs /
build_attn. There's a lot of code duplication that can hopefully be
pulled into common functions in the graph later.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Rename llm_build_hybrid_mamba -> llm_build_granite_hybrid
I've gone back and forth a lot about how/if to try to implement reuse of the "child model" layer types for hybrid models. At the end of the day, I think hybrid models are their own beast and even if their layers are inspired by other models, they should maintain control of their own layer building (in other words, the copy-paste method). Given that, the name should reflect that this is not a generic hybrid model builder, but rather a granite-specific hybrid model builder that can do MoE (granite 4) or dense (bamba).
As part of this, I also cleaned up dangling comments from previous attempts at using static methods for reusability.
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- mamba : fix mismatched new and delete size for llm_build_mamba
Subclasses of llm_graph_context cannot have extra fields, because the called destructor is not the one from the subclass. This otherwise would cause problems when running Mamba-(1|2) inference when compiled with -DGGML_SANITIZE_ADDRESS=ON
- memory : correctly handle failure in apply()
ggml-ci
- style: Remove TODO for adding first hybrid models to the switch
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge in tensor_mapping.py w/ SSM_NORM
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Fix bad merge resolution with variable renames/moves in llm_build_mamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- docs: Fix comment about duplicate key check
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Conform to standard way of initializing inp_out_ids
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
convert : fix jamba conv1d shape squeezing
fix: Fix input initialization in granite_hybrid after removal of hybrid inputs
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use llm_graph_context_mamba in llm_build_granite_hybrid
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Refactor mamba2/granite/jamba/granite_hybrid relationships as mixins
The key is for the mixin classes (llm_graph_context_mamba, llm_graph_context_granite) to use virtual inheritance from llm_graph_context. This allows the common members to exist only once in the class hierarchy. The downside is that llm_graph_context will be re-initialized once for each parent (ie 2x for single mixin, 3x for two mixins, etc...).
Branch: GraniteFourWithJamba
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- graph : add back hybrid memory graph input
But this time it contains the sub-cache graph inputs. This should make it easier to handle updating the inputs when caching the graph (eventually).
model : add Jamba to Mamba-specific hparams printing
fix: Fix input setup after upstream merge
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
jamba : remove redundant nullptr initializations
model : remove unnecessary prefix for tensor loading constants
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- model : use ggml_swiglu_split for Mamba
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
- feat: Add support for dense FFN in GraniteMoeHybrid
This was already partially supported via reusing the granite ffn builder, and there may be models that leverage this architecture going forward. The naming is a bit odd, but in the transformers version, it reuses the same model class and simply has zero regular experts and a single shared expert (which is the same as a single dense FFN).
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Add support for dense FFN tensor names on c++ side
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Use child inputs for Falcon H1 after merge resolution
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary prefix on tensor constants
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
model : make falcon-h1 use shared mamba2 layer builder
memory : avoid referring to KV in recurrent cache logs
fix: Revert order changes for Falcon H1 to stay consistent with upstream
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- gguf-py : avoid adding duplicate tensor mappings for Jamba
Some of the tensor names are common with Llama4
- refactor: Collapse Bamba and GraniteMoeHybrid into GraniteHybrid
The only key difference is the use of rope which is now set via rope_finetuned in the hparams
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove use of diamond inheritance
Per PR discussion, it's simpler to keep this with basic inheritance and not introduce the complexity of virtual inheritance and multiple inheritance
https://github.com/ggml-org/llama.cpp/pull/13550#issuecomment-3053787556
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- feat: Log mamba params for Granite Hybrid
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unused ssm_in_b
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- refactor: Remove ATTENTION_LAYER_INDICES hparam in favor of n_head_kv
This matches how recurrent vs attention heads are identified for Jamba
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unused template expansion for get_arr
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Review cleanup in convert_hf_to_gguf
The gist is to be explicit about which base class is being used with the multiple inheritance setup
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Undo hidden warnings about duplicate identical keys in add_key_value
After further discussion, this encourages sloppy overwriting in the model converters
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: If not using ROPE, context is "infinite"
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- doc: Add a comment outlining expected duplicate key warnings
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
- fix: Remove unnecessary duplicate keys in converter
Co-authored-by: Francis Couture-Harpin git@compilade.net
(thanks for the sharp eyes and patience!)
Branch: GraniteFour
Signed-off-by: Gabe Goodhart ghart@us.ibm.com
Signed-off-by: Gabe Goodhart ghart@us.ibm.com Co-authored-by: Francis Couture-Harpin git@compilade.net Co-authored-by: Georgi Gerganov ggerganov@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
vocab : add midm-2.0 model pre-tokenizer (#14626)
llama : move enum llama_vocab_pre_type to implementation (#14631)
ggml-ci
readme : add hot PRs (#14636)
readme : add hot PRs
cont
readme : update title
readme : hot PRs links
cont
HIP : Add HIP 7.0+ compatibility for hipBLAS compute types (#14634)
model : support LiquidAI LFM2 hybrid family (#14620)
Important: LFM2 was merged into transformers, but has not yet been released. To convert into gguf, install transformers from source:
pip install "transformers @ git+https://github.com/huggingface/transformers.git@main"
vulkan: optimizations for deepseek prompt processing (#14555)
vulkan: allow unclamped loads in coopmat2 mul_mat_id shader
vulkan: increase coopmat2 mul_mat_id tile size
vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path
vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)
vulkan: support SET_ROWS (#14587)
vulkan: support SET_ROWS
Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now.
- vulkan: optimize set_rows
Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.
server : fix pooled embedding output (#14645)
vulkan : implement ggml_roll (ggml/1290)
ggml-ci
- vulkan : implement bilinear interpolation (ggml/1291)
ggml-ci
- sync : ggml
ggml-ci
- vulkan : remove unused vars (#0)
ggml-ci
sync : ggml
CUDA: add set rows for f32 and f16 (#14551)
CUDA: add set rows for f32 and f16
Review: change kernel params, use strides from host
Use 1-d kernel
Review: use int64_t for blockDim.x, rename nb->s for clarity
docs : add LFM2 to models section (#14650)
readme : add LFM2 to models section
fix copy paste...
tests : cover lfm2 cases in test_ssm_conv (#14651)
cmake : Add CMake presets for Linux and GCC (#14656)
metal : Add missing unary ops Metal support (#14660)
ggml : add build-time message to remind about ggml_set_rows (#14661)
ggml-ci
cuda : add ELU support (#14657)
cuda : add set rows for bf16 (#14664)
quantize : fix minor logic flaw in --tensor-type (#14572)
llama : add jinja template for rwkv-world (#14665)
llama : add jinja template for rwkv-world
Signed-off-by: Molly Sophia mollysophia379@gmail.com
- Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Sigbjørn Skjæret sigbjorn.skjaeret@scala.com
sycl: Batched mulmat rework for oneDNN dispatch (#14617)
SY…
Seunghhon pushed a commit to Seunghhon/llama.cpp that referenced this pull request
phuongncn pushed a commit to phuongncn/llama.cpp-gx10-dgx-sparks-deepseekv4 that referenced this pull request
Merge vulkan code from mainline up to commit of 6/28/2025
Vulkan Optimizations and Fixes (ggml-org#8959)
Optimize Vulkan REPEAT performance
Use Vulkan GLSL fused multiply-add instruction where possible
Add GGML_VULKAN_PERF option to output performance data per operator
Rework and fix Vulkan descriptor set and descriptor pool handling
Fix float32 concat f16 shader validation error
Add Vulkan GROUP_NORM eps parameter
Fix validation error with transfer queue memory barrier flags
Remove trailing whitespaces
vulkan : do not use tensor->extra (ggml-org#9407)
- vulkan : do not use tensor->extra
This patch allows using the Vulkan backend with the RPC backend as tensor->extra is no longer used.
Ref: ggml-org#8536
- Adapt GGML_VULKAN_CHECK_RESULTS to extra removal (ggml-org#2)
Co-authored-by: 0cc4m picard12@live.de
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan : fix build (#0)
ggml-ci
Improve Vulkan shader build system (ggml-org#9239)
- Improve Vulkan shader build system
- Add dependency to vulkan-shaders-gen to rebuild shaders when changing the shader compilation utility.
- Add option to generate debug info for Vulkan shaders to provide shader source to Vulkan shader profiling tools
- remove not required self dependency
ggml : fix build break for the vulkan-debug (ggml-org#9265)
- windows build : Ok.
- linux build : Ok.
Signed-off-by: Changyeon Kim cyzero.kim@samsung.com
vulkan: correctly report support for OP_CONT (ggml/946)
test-backend-ops fails because ggml_cont aborts when invoked passing an unsupported type.
This commit makes ggml_cont tests pass
Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com
vulkan: add dryrun support to sin and cos ops (ggml/947)
sin and cos failed test-backend-ops because they tried to dereference a context pointer that is null on dry runs.
This commit prevents that segfault.
Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com
Conflicts:
ggml/src/ggml-vulkan.cpp
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early. (ggml-org#9118)
Overlap cmdbuffer creation and cmdbuffer execution in Vulkan backend by submitting smaller cmdbuffers early.
fix compile issues
Fix issues where the last submit wasn't executed or handled properly.
remove trailing whitespace
Repair GGML_VULKAN_CHECK_RESULTS
Increase submit counter only if actual work has been submitted and increase submit count to 100.
Fix some nodes not being checked with GGML_VULKAN_CHECK_RESULTS enabled.
Conflicts:
ggml/src/ggml-vulkan.cpp
Enable use of the ReBAR feature to upload buffers to the device. (ggml-org#9251)
vulkan : argsort barriers must be under uniform control flow (ggml/951)
A return before a barrier (that happens only in some threads in a workgroup) leads to UB. While the old code actually works on some devices, it fails on others (e.g. "smaller" GPUs).
BTW, I think it would be better to set specialization constants when the graph is built, in that way the local workgroup could be sized appropriately. But it would take a lot of work.
Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com
vulkan : fix build for GGML_VULKAN_RUN_TESTS, add TFLOPS to log (ggml/961)
vulkan : multithread pipeline creation (ggml/963)
vulkan : mul_mat: fix UB with small warps (ggml/952)
When the device's warp size is less than 16, it is possible for loadstride_a (mul_mm.comp:114) and loadstride_b (mul_mm.comp:115) to be set to 0, because they are calculated as the workgroup size multiplied by LOAD_VEC_* (which can be 1) and divided by 16, and the workgroup size is set to be the same as the warp/subgroup size.
The loadstride_* variables are used as increments in the loops that populate the buffers used for the multiplication.
When they are 0 they cause an infinite loop. But infinite loops without side-effects are UB and the values of loadstride_* are known at compile time. So, the compiler quietly optimizes all the loops away. As a consequence, the buffers are not populated and the multiplication result is just a matrix with all elements set to 0.
We prevent the UB by making sure that the workgroup size will never be less than 16, even if our device has a smaller warp size (e.g. 8).
Signed-off-by: Salvatore Mesoraca s.mesoraca16@gmail.com
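The arithmetic described in the mul_mat commit above can be checked with a small host-side sketch. It is not the shader code: `load_stride` and `clamp_workgroup_size` are hypothetical helper names, and only the formula (workgroup size times LOAD_VEC_*, divided by 16) and the minimum of 16 are taken from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: the stride as the commit message describes it,
// i.e. workgroup size * LOAD_VEC_* / 16, using integer division.
inline uint32_t load_stride(uint32_t workgroup_size, uint32_t load_vec) {
    assert(load_vec > 0);
    return workgroup_size * load_vec / 16;
}

// Hypothetical helper: never let the workgroup size drop below 16,
// even when the device reports a smaller warp/subgroup size.
inline uint32_t clamp_workgroup_size(uint32_t subgroup_size) {
    return std::max<uint32_t>(subgroup_size, 16u);
}
```

With a subgroup size of 8 and a load vector width of 1, `load_stride(8, 1)` is 0, which is the stride that would never advance the loop. After clamping, `load_stride(clamp_workgroup_size(8), 1)` is 1.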
vulkan : retry allocation with fallback flags (whisper/2451)
Co-authored-by: Samuel Morris samuel.morris@artlist.io
vulkan : improve ggml_vk_create_buffer error handling (ggml-org#9898)
vulkan: Fix newly added tests for permuted mul_mat and 1D im2col (ggml-org#10226)
vulkan: Throttle the number of shader compiles during the build step. (ggml-org#10222)
Fixes ggml-org#9582
Spawning too many concurrent copies of glslc leads to "Failed to create pipes" errors on Linux. This change applies the same throttling we use for multithreaded pipeline creation.
Conflicts:
ggml/src/vulkan-shaders/vulkan-shaders-gen.cpp
vulkan: Optimize contiguous copies (ggml-org#10254)
- tests: Fix memory bandwidth calculation for perf tests
Add a flops calculation for flash attention.
Add one GGML_OP_CPY perf test.
- vulkan: Optimize contiguous copies
Add a variant of the copy shader for when the tensors are contiguous. Avoid the complex addressing calculations, and do four elements per invocation to hide some other overhead.
Apply similar changes to the scale shader, since scale is always contiguous.
Add a "progress bar" for shader compiles.
Conflicts:
tests/test-backend-ops.cpp
vulkan: Use macros to make the mat mul pipeline creation more concise (ggml-org#10259)
Also add vk_matmul_pipeline2 to hold f16/f32 accumulator versions of a pipeline. This isn't really used yet.
vulkan: Optimize binary ops (ggml-org#10270)
Reuse the index calculations across all of src0/src1/dst. Add a shader variant for when src0/src1 are the same dimensions and additional modulus for src1 aren't needed. Div/mod are slow, so add "fast" div/mod that have a fast path when the calculation isn't needed or can be done more cheaply.
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/acc.comp
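As a rough illustration of the "fast" div/mod idea in the binary ops commit above, here is a C++ sketch. The function names and the exact shortcuts (a power-of-two check for mod, a "dividend smaller than divisor" check for div) are assumptions chosen for illustration; the actual shader may use different conditions.

```cpp
#include <cassert>
#include <cstdint>

// Assumed shortcut: when b is a power of two (including 1), a % b is a & (b - 1),
// which avoids the integer modulus entirely. Requires b > 0.
inline uint32_t fast_mod(uint32_t a, uint32_t b) {
    assert(b > 0);
    if ((b & (b - 1)) == 0) {
        return a & (b - 1);
    }
    return a % b;
}

// Assumed shortcut: when a < b the quotient is known to be 0, so the
// division can be skipped. Requires b > 0.
inline uint32_t fast_div(uint32_t a, uint32_t b) {
    assert(b > 0);
    return (a < b) ? 0u : (a / b);
}
```

Both helpers return the same values as plain `%` and `/`. They only pay off when the cheap cases are common, for example when dimensions are 1 or a power of two.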
ggml : vulkan logs (whisper/2547)
vulkan: Optimize some mat-vec mul quant shaders (ggml-org#10296)
Compute two result elements per workgroup (for Q{4,5}_{0,1}). This reuses the B loads across the rows and also reuses some addressing calculations. This required manually partially unrolling the loop, since the compiler is less willing to unroll outer loops.
Add bounds-checking on the last iteration of the loop. I think this was at least partly broken before.
Optimize the Q4_K shader to vectorize most loads and reduce the number of bit twiddling instructions.
Vulkan: Fix device info output format specifiers (ggml-org#10366)
Vulkan: Fix device info output format specifiers
Vulkan: Use zu printf specifier for size_t instead of ld
vulkan: remove use of null initializer (ggml-org#10372)
Seems like this isn't working for vulkan-over-metal when the array is sized by a spec constant. Maybe a spirv-cross limitation?
vulkan: Optimize soft_max (ggml-org#10301)
- vulkan: Optimize soft_max
Large soft_max could already saturate memory, but small/medium sizes were pretty slow. The bulk of the gains for them comes from using a smaller workgroup size, and making the workgroup size match the subgroup size also makes the barriers much cheaper.
Cache some values in locals to avoid refetching/recomputing. And stamp out a few "template instantiations" so smaller cases will fully unroll.
Add a missing early return for OOB rows. This happens when there are more than 512 rows and the dispatch is 512 x H.
- vulkan: Further soft_max optimizations
Restore the workgroup size of 512 case, use it for >1024.
Use unrollable loops for more iteration counts.
vulkan: further optimize mul_mat_vec using larger loads (ggml-org#10387)
- vulkan: Use pipeline_robustness to disable robustness in mul_mat_vec.
Add some early returns for nonexistent rows in mul_mat_vec shaders. These can only be hit when dispatching a 2D grid of workgroups. Fix the logic for the 2D grid of workgroups to round up.
Enable the pipeline robustness extension if it's available, and use it to disable robustness for these pipelines. The instructions to do the bounds checking contend for the same ALU resources as the bit twiddling dequant instructions.
- vulkan: Add GLSL structure aliases for quant types to allow larger loads
In Vulkan it's not possible to cast pointer types, so instead you have to declare an aliased binding for the memory with a different type. This commit adds aliases for the quant formats using 16b ints, and in a few places where the struct size is a multiple of 4 also using 32b ints. Currently only q4_k's aliases are used, but others will be used in subsequent commits.
- vulkan: use larger loads in q5_k and q6_k shaders.
Similar to the optimization I did in q4_k recently, this vectorizes some loads and reduces the number of bit twiddling instructions.
- vulkan: use larger K step per iteration in mul_mat_vec.
Add vec4 dequantization functions, and use them to do K=8 per iteration in mul_mat_vec. This uses 16b loads for the quant values and 128b loads for B which helps reduce the load on the memory system.
The K_PER_ITER==2 logic is still there, just for F16/F32, and really only because they support unaligned sizes.
Tweak the num_iters/unrolling logic to be simpler and catch a couple missed unrolling opportunities.
vulkan: copy iq4_nl LUT into shared memory (ggml-org#10409)
vulkan: predicate max operation in soft_max shaders/soft_max (ggml-org#10437)
Fixes ggml-org#10434
vulkan: Fix a vulkan-shaders-gen argument parsing error (ggml-org#10484)
vulkan-shaders-gen was not parsing the --no-clean argument correctly. The previous code only handled arguments that take a value, and --no-clean does not take one. This commit correctly parses arguments that don't have values.
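One way to handle both kinds of argument is sketched below. This is an illustration under assumptions, not the parser from the commit: `parse_args` is a hypothetical function, and only `--no-clean` is taken from the text above.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical parser: flags listed in `valueless` take no value; every
// other "--name" option consumes the next token as its value.
inline std::map<std::string, std::string> parse_args(const std::vector<std::string>& argv) {
    const std::vector<std::string> valueless = {"--no-clean"};
    std::map<std::string, std::string> args;
    for (std::size_t i = 0; i < argv.size(); ++i) {
        const std::string& arg = argv[i];
        if (arg.rfind("--", 0) != 0) {
            continue;  // not an option; ignored in this sketch
        }
        bool is_flag = false;
        for (const auto& f : valueless) {
            if (f == arg) {
                is_flag = true;
            }
        }
        if (is_flag) {
            args[arg] = "";          // present, but carries no value
        } else if (i + 1 < argv.size()) {
            args[arg] = argv[++i];   // consume the following token
        }
    }
    return args;
}
```

Treating the flag separately keeps it from swallowing the token that follows it.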
vulkan: fix group_norm (ggml-org#10496)
Fix bad calculation of the end of the range. Add a backend test that covers the bad case (taken from stable diffusion).
Fixes leejet/stable-diffusion.cpp#439.
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: optimize Q2_K and Q3_K mul_mat_vec (ggml-org#10459)
vulkan: skip integer div/mod in get_offsets for batch_idx==0 (ggml-org#10506)
vulkan: further optimize q5_k mul_mat_vec (ggml-org#10479)
vulkan: Handle GPUs with less shared memory (ggml-org#10468)
There have been reports of failure to compile on systems with <= 32KB of shared memory (e.g. ggml-org#10037). This change makes the large tile size fall back to a smaller size if necessary, and makes mul_mat_id fall back to CPU if there's only 16KB of shared memory.
vulkan: define all quant data structures in types.comp (ggml-org#10440)
vulkan: get the first command buffer submitted sooner (ggml-org#10499)
This is an incremental improvement over ggml-org#9118 to get work to the GPU a bit sooner. The first part is to start with a smaller number of nodes before the first submit, and ramp it up to the current 100 nodes/submit. The second part is to reduce the dryrun overhead for all the nodes that just need to request descriptor space.
With these changes I get around 1-2% speedup on RTX 4070 combined with my old Haswell-era CPU.
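The ramp-up described above can be pictured with a small simulation. Only the ceiling of 100 nodes per submit comes from the text; the starting batch size and the doubling rule are assumptions made for illustration, and `submit_points` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Returns the node indices after which a submit would be issued.
// `first_batch` and the doubling rule are illustrative assumptions.
inline std::vector<uint32_t> submit_points(uint32_t num_nodes,
                                           uint32_t first_batch = 20,
                                           uint32_t max_batch = 100) {
    std::vector<uint32_t> points;
    uint32_t batch = first_batch;
    uint32_t pending = 0;
    for (uint32_t i = 0; i < num_nodes; ++i) {
        ++pending;
        const bool last = (i + 1 == num_nodes);
        if (pending >= batch || last) {
            points.push_back(i);
            pending = 0;
            // Grow the batch until it reaches the steady-state size.
            batch = (batch * 2 > max_batch) ? max_batch : batch * 2;
        }
    }
    return points;
}
```

With these assumed values, a 300-node graph is submitted after nodes 19, 59, 139, 239, and 299, so the GPU starts working after 20 nodes rather than 100.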
vulkan: Dynamic subgroup size support for Q6_K mat_vec (ggml-org#10536)
- subgroup 64 version with subgroup add. 15% faster
scalable version
tested for subgroup sizes 16-128
check for subgroup multiple of 16 and greater than 16
subgroup sizes are always a power of 2 (KhronosGroup/GLSL#45)
force 16 sequential threads per block
make 16 subgroup size a constant
vulkan: optimize and reenable split_k (ggml-org#10637)
Use vector loads when possible in mul_mat_split_k_reduce. Use split_k when there aren't enough workgroups to fill the shaders.
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (ggml-org#10642)
vulkan: Add VK_NV_cooperative_matrix2 support for mul_mat and flash attention (ggml-org#10206)
Conflicts:
ggml/src/vulkan-shaders/dequant_funcs_cm2.comp
ggml/src/vulkan-shaders/flash_attn_cm2.comp
ggml/src/vulkan-shaders/mul_mm_cm2.comp
Vulkan: VK_KHR_cooperative_matrix support to speed up prompt processing (ggml-org#10597)
Vulkan: Implement VK_KHR_cooperative_matrix support in the matrix matrix multiplication shader
Improve performance with better q4_k and q5_k dequant and store unrolling
Add Vulkan MUL_MAT and MUL_MAT_ID accumulator precision selection
Rework mulmat shader selection and compilation logic, avoid compiling shaders that won't get used by device
Vulkan: Implement accumulator switch for specific mul mat mat shaders
Vulkan: Unroll more loops for more mul mat mat performance
Vulkan: Add VK_AMD_shader_core_properties2 support to read Compute Unit count for split_k logic
Disable coopmat support on AMD proprietary driver
Remove redundant checks
Add environment variable GGML_VK_DISABLE_COOPMAT to disable VK_KHR_cooperative_matrix support
Fix rebase typo
Fix coopmat2 MUL_MAT_ID pipeline selection
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: compile a test shader in cmake to check for coopmat2 support (ggml-org#10713)
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/ggml-vulkan/CMakeLists.txt
ggml/src/vulkan-shaders/test_coopmat2_support.comp
Vulkan: fix NaN in tanh.comp with AMD proprietary driver on Windows (ggml-org#10723)
Vulkan: fix NaN in tanh.comp
Faster NaN-free tanh
vulkan: fix compile warnings (ggml-org#10731)
vulkan: disable spirv-opt for coopmat shaders (ggml-org#10763)
There are some bugs in the 1.3.296 SDK, so disable this. It isn't strictly necessary anyway.
Add missing dependency on vulkan-shaders-gen, so shaders get recompiled when it changes.
Fix coopmat support reporting when glslc doesn't support NV_coopmat2.
vulkan: dynamic subgroup size for the remaining k quants (ggml-org#10745)
- q5_k
q4_k
q3_k
q2_k
q6_k multi row example
- revert as multi row isn't faster for k quants
vulkan: request round-to-even for fp16 in im2col/rope_head (ggml-org#10767)
Vulkan doesn't mandate a specific rounding mode, but the shader_float_controls feature allows rounding mode to be requested if the implementation supports it.
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats (ggml-org#10721)
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
Fix subgroup size control extension support check
Add accf32 and accf16 checks for coopmats
- Also disable coopmats on amdvlk
Vulkan: Use improved q4_k and q5_k dequant code in dequant shaders (ggml-org#10798)
vulkan: small mul_mat_vec optimizations (ggml-org#10665)
double the number of rows per workgroup
Update ggml-vulkan.cpp
Vulkan: Add VK_EXT_subgroup_size_control support to ensure full subgroups for coopmats
only increase the number of rows for amd and subgroup size 64
fix missing NUM_ROWS for mul_mat_vec_iq4_nl_f16_f32, untested
use subgroup min and max to check for gcn (requires ggml-org#10721)
manual merge ggml-vulkan.cpp
set min and max subgroup size in any case
Also double the number of rows for Intel GPUs
Change Debug print name
add GGML_ROPE_TYPE_MROPE
rwkv6: add wkv6 support for Vulkan backend (ggml-org#10829)
rwkv_wkv6 vulkan shader
RWKV_WKV6 Vulkan op tests passed
Signed-off-by: Molly Sophia mollysophia379@gmail.com
- Apply code format changes
Signed-off-by: Molly Sophia mollysophia379@gmail.com
add [[unroll]] and remove unnecessary conditions
add uma support
fix errors in EditorConfig Checker
Signed-off-by: Molly Sophia mollysophia379@gmail.com Co-authored-by: Molly Sophia mollysophia379@gmail.com
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/wkv6.comp
vulkan: bugfixes for small subgroup size systems + llvmpipe test (ggml-org#10809)
- ensure mul mat shaders work on systems with subgroup size less than 32
more fixes
add test
- only s_warptile_mmq needs to be run with 32 threads or more
Conflicts:
.github/workflows/build.yml
vulkan : fix soft_max.comp division by zero (whisper/2633)
This change prevents a division by zero error when p.KY is 0.
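A guard of this kind can be sketched as follows, assuming the row index is derived with a modulo by KY. That derivation is an assumption about the shader, and `wrapped_row` is a hypothetical name.

```cpp
#include <cstdint>

// Wrap a row index by `ky`, falling back to 0 when `ky` is 0 so the
// modulo never divides by zero.
inline uint32_t wrapped_row(uint32_t rowx, uint32_t ky) {
    return ky > 0 ? rowx % ky : 0;
}
```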
vulkan: optimize coopmat2 dequant functions (ggml-org#10855)
Change the code to do 16b loads when possible and extract the appropriate component late, so the code is effectively decoding a pair of elements and then selecting one. This can allow more commoning to happen in the compiler when neighboring elements are loaded.
vulkan: build fixes for 32b (ggml-org#10927)
- vulkan: build fixes for 32b
Should fix ggml-org#10923
- vulkan: initialize some buffer/offset variables
examples, ggml : fix GCC compiler warnings (ggml-org#10983)
Warning types fixed (observed under MSYS2 GCC 14.2.0):
- format '%ld' expects argument of type 'long int', but argument has type 'size_t'
- llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp:81:46: warning: missing initializer for member '_STARTUPINFOA::lpDesktop' [-Wmissing-field-initializers] (emitted for all struct fields except the first)
Conflicts:
examples/export-lora/export-lora.cpp
vulkan: multi-row k quants (ggml-org#10846)
multi row k quant shaders!
better row selection
more row choices
readjust row selection
rm_kq=2 by default
vulkan: Use push constant offset to handle misaligned descriptors (ggml-org#10987)
vulkan: im2col and matmul optimizations for stable diffusion (ggml-org#10942)
tests: Add im2col perf tests
vulkan: optimize im2col, more elements per thread
vulkan: increase small tile size for NV_coopmat2
vulkan: change im2col to 512 elements per workgroup
vulkan: optimize mul_mat for small values of N (ggml-org#10991)
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where the batch_strides are overloaded to hold the row strides. Put the loads from the B matrix in the innermost loop because it should cache better.
Share some code for reducing the result values to memory in mul_mat_vec_base.
Conflicts:
tests/test-backend-ops.cpp
fix: Vulkan shader gen binary path (ggml-org#11037)
Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver (ggml-org#11074)
Vulkan: Add device-specific blacklist for coopmat for the AMD proprietary driver
Add (TM) to AMD name check
fix lora print
Disable GL_KHR_cooperative_matrix Vulkan extension if not available. (ggml-org#11117)
Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
Perform Vulkan extensions checks in a more sensible order
Remove unnecessary #ifdef directive
Conflicts:
ggml/src/vulkan-shaders/test_coopmat_support.comp
llama: add support for QRWKV6 model architecture (ggml-org#11001)
Vulkan: Fix float16 use on devices without float16 support + fix subgroup_size_control validation error (ggml-org#11161)
Vulkan: Remove float16 use in shaders
Fix validation error about subgroup_size_control extension
fix: ggml: fix vulkan-shaders-gen build (ggml-org#10448)
- fix: ggml: fix vulkan-shaders-gen build
The vulkan-shaders-gen target was not being built correctly in case of cross-compilation. Other outputs need to be built for the cross compile target, but vulkan-shaders-gen needs to be built for the host.
- refactor: ggml: Improve vulkan-shaders-gen toolchain setup
- Add GGML_SHADERS_GEN_TOOLCHAIN CMake option.
- Auto-detect host toolchain if not set.
- refactor: ggml: Improve vulkan-shaders-gen toolchain setup
Use configure_file to generate host_toolchain.cmake from template
- fix: ggml: Fix compile error
Fix compile error not finding vulkan-shaders-gen
- fix: vulkan-shaders-gen build and path handling
Fix build issues with vulkan-shaders-gen:
- Add target dependency for correct build order
- Use CMAKE_HOST_SYSTEM_NAME for executable suffix
- Fix MSVC output directory in host toolchain
- Normalize path handling for cross-compilation
- fix: improve host compiler detection in vulkan shader build
Improve host compiler detection for vulkan shader generation:
- Add NO_CMAKE_FIND_ROOT_PATH to all compiler searches
- Consolidate compiler detection logic
- Fix Windows-specific MSVC detection
- Ensure correct compiler search in cross-compilation
- refactor: Simplify CMake function for detecting host compiler
Simplified the CMake function to improve the process of detecting the host compiler.
- fix: Remove unnecessary Vulkan library linkage in CMakeLists.txt
Since vulkan-shader-gen.cpp only requires the glslc executable and not the Vulkan headers or libraries, CMakeLists.txt needs to be corrected.
(See: ecc93d0)
- refactor: Rename host_toolchain.cmake.in
- Rename host_toolchain.cmake.in to cmake/host-toolchain.cmake.in
- refactor: GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
Rename the macro GGML_SHADERS_GEN_TOOLCHAIN to GGML_VULKAN_SHADERS_GEN_TOOLCHAIN
Conflicts:
ggml/src/ggml-vulkan/CMakeLists.txt
vulkan: scale caching for k quants + misc fixes (ggml-org#11081)
q6_k scale caching
16 bit unpack
q4_k test (slow)
revert it
q3_k
q2_k
little stuff
try precalculating products of a and q2_k scales
Revert "try precalculating products of a and q2_k scales"
This reverts commit 65110b81f23f66331a50c6e889a7c1ab9470a86b.
unpack should be u16, add vim swap to gitignore (about time)
better q4_k scales
q5_k
better q6_k with separate paths for all threads and partial threads in use, plus some more optimizations
q2_k better dequant
q3_k optimizations
q3_k use hmask simd from cpu avx version
make the caches happy
q3_k separate out calculation
q2_k separate out
little stuff
use calc_superblock everywhere
q2_k optimize scale calculation
more barriers
vulkan: optimize coopmat2 q2_k dequant function (ggml-org#11130)
vulkan: optimize coopmat2 q4_k/q5_k dequant functions. (ggml-org#11206)
Do masking on whole dwords, fetch all scales at once.
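Masking on whole dwords means extracting several packed 4-bit values with one AND instead of one per byte. The sketch below illustrates the general technique with hypothetical helper names. It is not the shader code.

```cpp
#include <cstdint>

// Extract the low 4 bits of each of the four bytes in one operation.
inline uint32_t low_nibbles(uint32_t packed) {
    return packed & 0x0F0F0F0Fu;
}

// Extract the high 4 bits of each byte, moved down into the low position.
inline uint32_t high_nibbles(uint32_t packed) {
    return (packed >> 4) & 0x0F0F0F0Fu;
}
```

For example, `low_nibbles(0xA1B2C3D4)` yields `0x01020304` and `high_nibbles(0xA1B2C3D4)` yields `0x0A0B0C0D`.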
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (ggml-org#11166)
- vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl
Shaders are based on cpy.cu.
vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32
ggml: copy q->f32 assumes some contiguity in the destination
Conflicts:
ggml/src/ggml-cpu/ggml-cpu.c
ggml/src/vulkan-shaders/copy_from_quant.comp
ggml/src/vulkan-shaders/copy_to_quant.comp
vulkan: fix coopmat2 flash attention for non-contiguous inputs (ggml-org#11281)
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid a performance regression.
Add noncontiguous FA tests in test-backend-ops.
Fixes ggml-org#11268.
Conflicts:
tests/test-backend-ops.cpp
vulkan: fix coopmat2 validation failures (ggml-org#11284)
mul mat and flash attention shaders were loading f32 types directly into A/B matrices, which happens to work but is technically invalid usage. For FA, we can load it as an Accumulator matrix and convert; this is not in the inner loop, so it is cheap enough. For mul mat, it's more efficient to do this conversion in a separate pass and have the input(s) be f16.
coopmat2 requires SPIR-V 1.6 (related to using LocalSizeId). LocalSizeId requires maintenance4 be enabled, and SPIR-V 1.6 requires Vulkan 1.3.
vulkan: fix diag_mask_inf (ggml-org#11323)
With robustbufferaccess disabled, this shader was showing OOB stores. There is a bounds check in the code, but the workgroup dimensions were reversed vs CUDA and it was running the wrong number of threads. So fix the workgroup dimensions and disable robustness for this pipeline.
vulkan: sort shaders for more deterministic binary (ggml-org#11315)
Fixes ggml-org#11306.
Vulkan-run-test: fix mmq_wg_denoms (ggml-org#11343)
There appears to be a copy-and-paste error here.
*mmq_wg_denoms should be used together with *warptile_mmq, instead of wg_denoms.
vulkan: compile shaders on-demand (ggml-org#11406)
Reduce first-run startup time and memory consumption.
Should fix ggml-org#11339.
vulkan: Catch pipeline creation failure and print an error message (ggml-org#11436)
- vulkan: Catch pipeline creation failure and print an error message
Also, fix some warnings from my on-demand compile change.
- vulkan: fix pipeline creation logging
vulkan: implement initial support for IQ2 and IQ3 quantizations (ggml-org#11360)
vulkan: initial support for IQ3_S
vulkan: initial support for IQ3_XXS
vulkan: initial support for IQ2_XXS
vulkan: initial support for IQ2_XS
vulkan: optimize Q3_K by removing branches
vulkan: implement dequantize variants for coopmat2
vulkan: initial support for IQ2_S
vulkan: vertically realign code
port failing dequant callbacks from mul_mm
Fix array length mismatches
vulkan: avoid using workgroup size before it is referenced
tests: increase timeout for Vulkan llvmpipe backend
Co-authored-by: Jeff Bolz jbolz@nvidia.com
Conflicts:
ggml/src/vulkan-shaders/dequant_iq2_s.comp
ggml/src/vulkan-shaders/dequant_iq2_xs.comp
ggml/src/vulkan-shaders/dequant_iq2_xxs.comp
ggml/src/vulkan-shaders/dequant_iq3_s.comp
ggml/src/vulkan-shaders/dequant_iq3_xxs.comp
CUDA: non-contiguous (RMS) norm support (ggml-org#11659)
vulkan: use smaller combined allocations to avoid fragmentation (ggml-org#11551)
Conflicts:
ggml/src/ggml-alloc.c
vulkan: initial support for IQ4_XS quantization (ggml-org#11501)
Conflicts:
ggml/src/vulkan-shaders/dequant_iq4_xs.comp
vulkan: optimize coopmat2 iq2/iq3 callbacks (ggml-org#11521)
vulkan: optimize coopmat2 iq2/iq3 callbacks
build: trigger CI on GLSL compute shader changes
vulkan: print shared memory size (ggml-org#11719)
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: account for lookup tables when checking shared memory size (ggml-org#11502)
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (ggml-org#11592)
vulkan: linux builds + small subgroup size fixes (ggml-org#11767)
mm subgroup size
upload vulkan x86 builds
vulkan: initial support for IQ1_S and IQ1_M quantizations (ggml-org#11528)
vulkan: initial support for IQ1_S and IQ1_M quantizations
vulkan: define MMV kernels for IQ1 quantizations
devops: increase timeout of Vulkan tests again
vulkan: simplify ifdef for init_iq_shmem
Conflicts:
ggml/src/vulkan-shaders/dequant_iq1_m.comp
ggml/src/vulkan-shaders/dequant_iq1_s.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq1_m.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq1_s.comp
vulkan: support multi/vision rope, and noncontiguous rope (ggml-org#11902)
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/rope_multi.comp
ggml/src/vulkan-shaders/rope_vision.comp
vulkan: implement several ops relevant for ggml_opt (ggml-org#11769)
vulkan: support memset_tensor
vulkan: support GGML_OP_SUM
vulkan: implement GGML_OP_ARGMAX
vulkan: implement GGML_OP_SUB
vulkan: implement GGML_OP_COUNT_EQUAL
vulkan: implement GGML_OP_OPT_STEP_ADAMW
vulkan: fix check_results RWKV_WKV6 crash and memory leaks
vulkan: implement GGML_OP_REPEAT_BACK
tests: remove invalid test-backend-ops REPEAT_BACK tests
vulkan: fix COUNT_EQUAL memset using a fillBuffer command
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/argmax.comp
ggml/src/vulkan-shaders/count_equal.comp
ggml/src/vulkan-shaders/opt_step_adamw.comp
ggml/src/vulkan-shaders/repeat_back.comp
ggml/src/vulkan-shaders/sub.comp
tests/test-backend-ops.cpp
vulkan: implement more backpropagation operators (ggml-org#11914)
vulkan: implement GGML_OP_ROPE_BACK
vulkan: implement GGML_OP_RMS_NORM_BACK
vulkan: implement GGML_OP_SILU_BACK
vulkan: implement GGML_OP_SOFTMAX_BACK
Conflicts:
ggml/src/vulkan-shaders/rms_norm_back.comp
ggml/src/vulkan-shaders/silu_back.comp
ggml/src/vulkan-shaders/soft_max_back.comp
Add memset tensor in all backend interface
SYCL: implement memset ggml backend buffer interface (ggml-org#12580)
SYCL: implement memset ggml backend buffer interface
use GGML_ABORT macro
Do not wait for all queues to finish for memset operation
Conflicts:
ggml/src/ggml-sycl.cpp
add OP sigmoid (ggml-org#12056)
Co-authored-by: Judd foldl@boxvest.com
Conflicts:
ggml/src/vulkan-shaders/sigmoid.comp
vulkan: fix assertion when qy_needs_dequant (ggml-org#12068)
Looks like a copy/paste bug from qx_needs_dequant.
vulkan: improve im2col (ggml-org#11826)
- vulkan: improve im2col performance
vulkan: matmul dequantization improvements (ggml-org#12015)
faster dequant for old quants
don't use unpack for iq4_nl
vec2 unpack for q8
vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (ggml-org#11595)
vulkan: implement specialized MMV kernels for IQ2 quantizations
vulkan: add MMV kernels for IQ3 quants
vulkan: Increase MMV batch size and unroll IQ LUT setup
vulkan: fix init_iq_shmem for WG sizes larger than tables
vulkan: common batch size for all I-quants
Conflicts:
ggml/src/vulkan-shaders/mul_mat_vec_iq2_s.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq2_xs.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq2_xxs.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq3_s.comp
ggml/src/vulkan-shaders/mul_mat_vec_iq3_xxs.comp
cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
ggml-ci
Conflicts:
ggml/src/ggml-cuda.cu
tests/test-backend-ops.cpp
mat vec double buffer (ggml-org#12188)
vulkan: fix bug in coopmat1 mul_mat_id (ggml-org#12316)
tests: run mul_mat_id with a larger N
vulkan: fix bug in coopmat1 mul_mat_id
Update build.yml for Windows Vulkan builder to use Vulkan 1.4.304 SDK for VK_NV_cooperative_matrix2 support (ggml-org#12301)
vulkan: Adjust coopmat2 tile sizes and selection heuristic (ggml-org#12258)
vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (ggml-org#12273)
- vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking
vulkan: use fp32 in coopmat2 q4_k dequant function (ggml-org#12309)
vulkan: subgroup size tuning (ggml-org#12087)
vulkan: subgroup size test
Vulkan: Add device architecture enum and logic to recognize AMD generations
vulkan: use new architecture logic to specify subgroup size
Initial vulkan subgroup size tuning for RDNA3
vulkan: commonize RDNA subgroup tuning
vulkan: override subgroup size if required_subgroup_size = 0
vulkan: disable warp 32 for RDNA3
vulkan: fine tuned RDNA1 subgroup sizes
vulkan: adjusted subgroup size map
vulkan: fixed RDNA2 subgroup map
Co-authored-by: 0cc4m picard12@live.de
vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (ggml-org#12312)
ggml-vulkan: remove unused find_program(glslc) (ggml-org#12416)
It's already found by FindVulkan.cmake in the parent CMakeLists
Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (ggml-org#12434)
vulkan: Submit once enough matmul work has been recorded (ggml-org#12406)
I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.
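A byte-based submit heuristic of the kind described could look like the sketch below. Only the 100MB figure and the existence of a ramp-up come from the text. `SubmitHeuristic`, its members, and the 1/8, 1/4, 1/2 ramp schedule are hypothetical.

```cpp
#include <cstdint>

struct SubmitHeuristic {
    uint64_t bytes_per_submit = 100ull * 1024 * 1024;  // 100MB, from the text
    uint32_t submit_count = 0;
    uint64_t pending_bytes = 0;

    // Assumed ramp: the first three submits use 1/8, 1/4 and 1/2 of the
    // steady-state threshold so the GPU gets work early.
    uint64_t threshold() const {
        if (submit_count < 3) {
            return bytes_per_submit >> (3 - submit_count);
        }
        return bytes_per_submit;
    }

    // Record one matmul; returns true when the caller should submit.
    bool record_matmul(uint64_t weight_bytes) {
        pending_bytes += weight_bytes;
        if (pending_bytes >= threshold()) {
            pending_bytes = 0;
            ++submit_count;
            return true;
        }
        return false;
    }
};
```

Counting weight bytes rather than graph nodes ties the flush point to the amount of memory traffic recorded, which is closer to how long the recorded work will take.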
vulkan: optimize iq1 coopmat2 dequant functions (ggml-org#12427)
vulkan: workaround for AMD Windows driver 16 bit unpack8 bug (ggml-org#12472)
Vulkan: RTE rounding for cpy to quant (ggml-org#12480)
- Vulkan: RTE rounding for cpy to quant
Co-Authored-By: Jeff Bolz jbolz@nvidia.com
remove trailing whitespace
avoid duplicating pipeline_cpy_f32_quant
fix copypasting issue
remove duplicated code
Co-authored-by: Jeff Bolz jbolz@nvidia.com
vulkan: Optimize mul_mat_vec p021 and nc shaders (ggml-org#12505)
tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, use that conditionally.
Conflicts:
tests/test-backend-ops.cpp
vulkan: fix mul_mat_vec failure in backend tests (ggml-org#12529)
The OOB calculation could be wrong if the last iteration was during one of the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple new backend tests that hit this failure on NVIDIA GPUs.
vulkan: fix coopmat shader generation when cross-compiling (ggml-org#12272)
- vulkan: fix coopmat shader generation when cross-compiling
Previously the status of coopmat{,2} support wasn't passed to the vulkan-shaders-gen project building on the host, which led to build failures because the cross-compiling code expected coopmat{,2} shaders that didn't get generated.
Fix this by passing the coopmat{,2} support status to vulkan-shaders subproject.
Signed-off-by: Icenowy Zheng uwu@icenowy.me
Only call coop-mat shaders once
Fix whitespace
Signed-off-by: Icenowy Zheng uwu@icenowy.me Co-authored-by: bandoti 141645996+bandoti@users.noreply.github.com
cmake: improve Vulkan cooperative matrix support checks (whisper/2966)
Co-authored-by: Sandro Hanea me@sandro.rocks
cmake : fix whitespace (#0)
Vulkan: Add DP4A MMQ and Q8_1 quantization shader (ggml-org#12135)
Vulkan: Add DP4A MMQ and Q8_1 quantization shader
Add q4_0 x q8_1 matrix matrix multiplication support
Vulkan: Add int8 coopmat MMQ support
Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code
Add GL_EXT_integer_dot_product check
Remove ggml changes, fix mmq pipeline picker
Remove ggml changes, restore Intel coopmat behaviour
Fix glsl compile attempt when integer vec dot is not supported
Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq
Remove redundant comment
Fix integer dot check
Fix compile issue with unsupported int dot glslc
Update Windows build Vulkan SDK version
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/mul_mmq.comp
ggml/src/vulkan-shaders/mul_mmq_funcs.comp
ggml/src/vulkan-shaders/quantize_q8_1.comp
ggml/src/vulkan-shaders/test_integer_dot_support.comp
vulkan: fix build when glslc doesn't support coopmat (ggml-org#12683)
Vulkan: Fix mmq int dot float cache size (ggml-org#12722)
vulkan: Implement grouped query attention in the coopmat2 FA shader (ggml-org#12559)
When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when:
dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))
previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each.
This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.
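The workgroup arithmetic in the example above (32 Q heads sharing 8 K/V heads) can be checked with a few lines. `FaDispatch` and `fa_dispatch` are hypothetical names used only to reproduce the counts given in the text.

```cpp
#include <cassert>
#include <cstdint>

struct FaDispatch {
    uint32_t workgroups;
    uint32_t results_per_workgroup;
};

// With grouped query attention, all Q heads that share one K/V head are
// handled by the same workgroup.
inline FaDispatch fa_dispatch(uint32_t q_heads, uint32_t kv_heads, bool use_gqa) {
    assert(kv_heads > 0 && q_heads % kv_heads == 0);
    const uint32_t ratio = use_gqa ? q_heads / kv_heads : 1;
    return {q_heads / ratio, ratio};
}
```

For the shapes in the example, `fa_dispatch(32, 8, false)` gives 32 workgroups with 1 result each, and `fa_dispatch(32, 8, true)` gives 8 workgroups with 4 results each.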
cmake: remove caching from vulkan coopmat checks (ggml-org#12719)
vulkan: Implement split_k for coopmat2 flash attention. (ggml-org#12627)
When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.
Conflicts:
ggml/src/vulkan-shaders/flash_attn_split_k_reduce.comp
vulkan: Fix missing cmake logic for dot product extension (ggml-org#12721)
vulkan: set cmake minimum and project name in vulkan-shaders (ggml-org#12744)
vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (ggml-org#12630)
There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increases variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete; we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.
Conflicts:
ggml/src/ggml-vulkan.cpp
cmake: fix ggml-shaders-gen compiler paths containing spaces (ggml-org#12747)
fixes error for compiler paths with spaces
Vulkan: Tune Vulkan mmq int dot shader for performance (ggml-org#12767)
vulkan: Use unclamped loads for flash attention mask (ggml-org#12720)
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.
vulkan: fix NaN issue in flash attention shader (ggml-org#12776)
Use -FLT_MAX/2 rather than -inf as the initial value for computing the maximum.
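Why the initial value matters can be seen in isolation, assuming the shader keeps a running maximum and rescales by exp(old_max - new_max) as in the usual online-softmax formulation. That formulation is an assumption here, and `rescale_factor` is a hypothetical name.

```cpp
#include <cfloat>
#include <cmath>

// Rescale factor applied to previously accumulated values when the
// running maximum is updated with a new score.
inline float rescale_factor(float running_max, float score) {
    const float new_max = std::fmax(running_max, score);
    return std::exp(running_max - new_max);
}
```

If every score seen so far is -inf (a fully masked row), starting from -inf computes exp(-inf - (-inf)), which is NaN. Starting from -FLT_MAX/2 computes exp(0), which is 1.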
vulkan: Use fp16 for the flash attention P*V multiplication (ggml-org#12783)
This is consistent with the ggml-cuda behavior and the mul_mat fallback.
vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (ggml-org#12833)
q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap.
This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0.
The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.
vulkan: use aligned loads for flash attention mask (ggml-org#12853)
Rewrite the stride logic for the mask tensor in the FA shader to force the stride to be aligned, to allow using more efficient loads.
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (ggml-org#12931)
The grouped query attention optimization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &.
split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.
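The power-of-two dependency mentioned above comes from writing a modulo as a bit mask. A short sketch with hypothetical function names shows where the two forms diverge.

```cpp
#include <cstdint>

// Only equal to i % n when n is a power of two.
inline uint32_t mod_by_mask(uint32_t i, uint32_t n) {
    return i & (n - 1);
}

// Correct for any non-zero n.
inline uint32_t mod_general(uint32_t i, uint32_t n) {
    return i % n;
}
```

With a ratio of 4 both forms agree (5 mod 4 is 1 either way). With a ratio of 3 the mask form returns 0 where the true remainder of 5 is 2.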
vulkan: support noncontiguous rms_norm (ggml-org#13031)
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: matmul gcn tuning (ggml-org#13016)
tune matmul for gcn
this one is more power efficient
Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m picard12@live.de
- disable this tune for the proprietary driver
Co-authored-by: 0cc4m picard12@live.de
vulkan: use uint array index to avoid glslang bug (ggml-org#13193)
vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (ggml-org#13191)
- vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader
vulkan: Add bfloat16 support (ggml-org#12554)
- vulkan: Add bfloat16 support
This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension.
It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that.
The coopmat support also requires a glslc that supports the extension, which currently requires a custom build.
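Promoting bf16 to fp32 is trivial because bf16 is the upper 16 bits of an IEEE fp32 value. The following host-side sketch shows the conversion. `bf16_to_fp32` is a hypothetical name and this is not the shader code.

```cpp
#include <cstdint>
#include <cstring>

// Widen a bf16 bit pattern to fp32 by placing it in the high half of a
// 32-bit word; the low 16 mantissa bits become zero.
inline float bf16_to_fp32(uint16_t bits) {
    const uint32_t wide = static_cast<uint32_t>(bits) << 16;
    float out;
    std::memcpy(&out, &wide, sizeof(out));  // bit cast without aliasing issues
    return out;
}
```

For example, the bf16 pattern `0x3F80` widens to `1.0f` and `0xC000` widens to `-2.0f`.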
- vulkan: Support bf16 tensors without the bf16 extension or coopmat support
Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available.
vulkan: bfloat16 fixes (really works without bfloat16 support now)
vulkan: fix spirv-val failure and reenable -O
Conflicts:
ggml/src/vulkan-shaders/test_bfloat16_support.comp
vulkan: Additional type support for unary, binary, and copy (ggml-org#13266)
Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: Allow up to 4096 elements for mul_mat_id row_ids (ggml-org#13326)
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
vulkan: scalar flash attention implementation (ggml-org#13324)
vulkan: scalar flash attention implementation
vulkan: always use fp32 for scalar flash attention
vulkan: use vector loads in scalar flash attention shader
vulkan: remove PV matrix, helps with register usage
vulkan: reduce register usage in scalar FA, but perf may be slightly worse
vulkan: load each Q value once. optimize O reduction. more tuning
vulkan: support q4_0/q8_0 KV in scalar FA
CI: increase timeout to accommodate newly-supported tests
vulkan: for scalar FA, select between 1 and 8 rows
vulkan: avoid using Float16 capability in scalar FA
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/flash_attn.comp
vulkan: workaround FA compile failures on macos (ggml-org#13517)
vulkan: KHR_coopmat flash attention (ggml-org#13506)
This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.
Conflicts:
ggml/src/vulkan-shaders/flash_attn_cm1.comp
cmake: simplify vulkan shader test logic (ggml-org#13263)
vulkan: use scalar FA rather than coopmat2 when N==1 (ggml-org#13554)
Add pipeline_acc_f32
vulkan: move common FA code to flash_attn_base.comp (ggml-org#13556)
vulkan: move common FA code to flash_attn_base.comp
vulkan: move common FA index/stride setup code to flash_attn_base.comp
build fix
Conflicts:
ggml/src/vulkan-shaders/flash_attn_base.comp
cmake: use the current build config for vulkan-shaders-gen (ggml-org#13595)
fix: use the current build config for vulkan-shaders-gen
fix: only pass a valid build type to --config
Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (ggml-org#13607)
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: fix warnings (ggml-org#13626)
small fixes
remove ifdef
use LOG_WARN to replace std::cerr (ggml-org#13657)
vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (ggml-org#13696)
vulkan: support CPY from any type to itself (ggml-org#13695)
Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.
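One way to picture the element-count scaling: a same-type copy is a raw byte copy, so it can be expressed as a copy of 4-byte or 2-byte words. The sketch below is an assumption about how such a plan could be computed, not the logic in ggml-vulkan.cpp; `RawCopyPlan` and `plan_raw_copy` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical plan: which word size to copy with, and how many words.
struct RawCopyPlan {
    std::size_t word_size;   // 4 (f32-sized words) or 2 (f16-sized words)
    std::size_t word_count;  // number of words the copy shader must move
};

// Assumption: prefer 4-byte words when the byte count allows it, otherwise
// fall back to 2-byte words. The byte count must be even.
inline RawCopyPlan plan_raw_copy(std::size_t total_bytes) {
    if (total_bytes % 4 == 0) {
        return {4, total_bytes / 4};
    }
    assert(total_bytes % 2 == 0 && "byte count must be a multiple of 2");
    return {2, total_bytes / 2};
}
```

For example, a single 34-byte q8_0 block would be copied as 17 two-byte words under this plan, while two such blocks (68 bytes) would be copied as 17 four-byte words.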
add GGML_LOG_WARN
vulkan: mark IM2COL as supporting non-contig (ggml-org#13783)
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: use timestamp queries for GGML_VULKAN_PERF (ggml-org#13817)
Also change it to be controlled by an env var rather than cmake flag
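A rough sketch of the two pieces mentioned here, under stated assumptions: the helper names are hypothetical, the env var is assumed to enable logging whenever it is set to any value, and the tick-to-time conversion uses the Vulkan convention that `VkPhysicalDeviceLimits::timestampPeriod` is the number of nanoseconds per timestamp tick. Timestamp wrap-around is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Assumption: perf logging is on whenever GGML_VULKAN_PERF is present in
// the environment, regardless of its value.
inline bool perf_logging_enabled() {
    return std::getenv("GGML_VULKAN_PERF") != nullptr;
}

// Convert a pair of GPU timestamp query results to elapsed nanoseconds.
// timestamp_period_ns stands in for VkPhysicalDeviceLimits::timestampPeriod.
inline double ticks_to_ns(std::uint64_t start, std::uint64_t end, double timestamp_period_ns) {
    assert(end >= start);
    return static_cast<double>(end - start) * timestamp_period_ns;
}
```

Moving the switch from a cmake flag to an env var means the same binary can be profiled without a rebuild.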
vulkan : Remove unexpected ; (ggml/1253)
vulkan: fix warnings in perf logger querypool code (ggml-org#13937)
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (ggml-org#13813)
- ggml-vulkan: adds op CONV_TRANSPOSE_1D
test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
Missing barrier added to shader. Number of additional tests reduced to 108.
- Fixes typo in variable name.
Removes extra whitespaces.
Adds int64->int32 casts to prevent possible warnings.
Problem size reduced in tests to pass tests with llvmpipe.
supports_op condition moved from unintended position
Conflicts:
ggml/src/ggml-vulkan.cpp
ggml/src/vulkan-shaders/conv_transpose_1d.comp
vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (ggml-org#14001)
allowing B580 and U9-288V
experimenting code to detect Xe2
allowing coopmat only for Xe2 GPUs
fixed comment wording
fixed comment wording
removed unnecessary driver check
Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (ggml-org#14099)
Conflicts:
ggml/src/ggml-vulkan.cpp
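The device-selection rule from the "Don't default to CPU device" change can be sketched as a filter over device types. Everything here is a simplified assumption: the `DeviceType` enum and `pick_default_devices` are hypothetical stand-ins, and the real code works with Vulkan physical device properties.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the Vulkan physical device type.
enum class DeviceType { DiscreteGpu, IntegratedGpu, Cpu };

// Return the indices of devices usable by default. CPU-type devices
// (such as llvmpipe) are never picked, so the result may be empty, in
// which case the caller can fall back to the CPU backend.
inline std::vector<std::size_t> pick_default_devices(const std::vector<DeviceType> &devices) {
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (devices[i] != DeviceType::Cpu) {
            picked.push_back(i);
        }
    }
    return picked;
}
```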
vulkan: force device 0 in CI (ggml-org#14106)
Add GGML_LOG_INFO
vulkan: Track descriptor pools/sets per-context (ggml-org#14109)
Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.
vulkan: Better thread-safety for command pools/buffers (ggml-org#14116)
This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: mutex around vkQueueSubmit (ggml-org#14127)
This fixes the remaining crash in test-thread-safety on my system.
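The pattern is small enough to sketch. Vulkan requires host access to a queue to be externally synchronized during `vkQueueSubmit`, so concurrent submitters must serialize. In the sketch below, `Queue` and `submit` are hypothetical stand-ins and the counter update takes the place of the real submit call; where the mutex actually lives in ggml-vulkan.cpp is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for a queue shared between threads.
struct Queue {
    std::mutex submit_mutex;        // serializes every submission to this queue
    std::uint64_t submissions = 0;  // unsynchronized state touched by a submit
};

// Submit n_command_buffers to the queue while holding its mutex.
inline void submit(Queue &q, std::uint32_t n_command_buffers) {
    assert(n_command_buffers > 0);
    std::lock_guard<std::mutex> guard(q.submit_mutex);
    q.submissions += n_command_buffers;  // stands in for the vkQueueSubmit call
}
```

The lock is held only for the duration of the submit itself, which keeps contention between contexts short.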
cmake: clean up external project logic for vulkan-shaders-gen (ggml-org#14179)
Remove install step for vulkan-shaders-gen
Add install step to normalize msvc with make
Regenerate modified shaders at build-time
Conflicts:
.github/workflows/build.yml
cmake: remove shader-gen step-targets from ggml-vulkan (ggml-org#14226)
Remove step-targets from vulkan-shaders-gen
Unset DESTDIR when building vulkan-shaders-gen
Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (ggml-org#14249)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (ggml-org#13792)
Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled.
remove #ifdef for debug utils and add queue marker.
Conflicts:
ggml/src/ggml-vulkan.cpp
vulkan: update windows SDK in CI (ggml-org#14334)
vulkan: update windows SDK in release.yml (ggml-org#14344)
Conflicts:
.github/workflows/release.yml
cmake: regen vulkan shaders when shaders-gen sources change (ggml-org#14398)
- Add shaders-gen sources as target deps
vulkan: Fix GGML_VULKAN_SHADER_DEBUG_INFO (ggml-org#14427)
This setting needs to be passed through to vulkan-shaders-gen
vulkan: lock accesses of pinned_memory vector (ggml-org#14333)
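A minimal sketch of what locking a pinned-memory vector can look like, assuming the vector maps host pointer ranges to registrations and that lookups race with insertions and removals from other threads. `PinnedRegistry` and its members are hypothetical names, and the choice of a plain `std::mutex` is an assumption rather than the type used in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical registry of pinned host ranges, guarded by one mutex.
class PinnedRegistry {
public:
    void add(const void *ptr, std::size_t size) {
        assert(ptr != nullptr && size > 0);
        std::lock_guard<std::mutex> guard(mutex_);
        ranges_.push_back({reinterpret_cast<std::uintptr_t>(ptr), size});
    }

    // Remove the range that starts at ptr; returns false if none matches.
    bool remove(const void *ptr) {
        std::lock_guard<std::mutex> guard(mutex_);
        const std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
        for (auto it = ranges_.begin(); it != ranges_.end(); ++it) {
            if (it->base == p) {
                ranges_.erase(it);
                return true;
            }
        }
        return false;
    }

    // Find the range containing ptr and report the offset into it.
    bool find(const void *ptr, std::size_t &offset) const {
        std::lock_guard<std::mutex> guard(mutex_);
        const std::uintptr_t p = reinterpret_cast<std::uintptr_t>(ptr);
        for (const Range &r : ranges_) {
            if (p >= r.base && p - r.base < r.size) {
                offset = p - r.base;
                return true;
            }
        }
        return false;
    }

private:
    struct Range {
        std::uintptr_t base;
        std::size_t size;
    };
    mutable std::mutex mutex_;  // guards every access to ranges_
    std::vector<Range> ranges_;
};
```

Every reader takes the lock as well, because a `push_back` or `erase` on another thread can invalidate iterators in the middle of a lookup.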
vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (ggml-org#14378)
Fix cuda build error
test
remove new cpu backend and yml files
remove new op and GGML_ROPE_TYPE_NEOX
fix build error
change cmake file to add matrix operation
remove coopmat2 check in flash attention
print gpu info for vulkan
disable fuse to recover vulkan performance
Co-authored-by: 0cc4m picard12@live.de Co-authored-by: firecoperana
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request
phibya pushed a commit to ziee-ai/llama.cpp that referenced this pull request
fewtarius pushed a commit to fewtarius/CachyLLama that referenced this pull request
AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request
AlexiAlp pushed a commit to minghaop/llama.cpp that referenced this pull request