Extending context size via RoPE scaling · ggml-org/llama.cpp · Discussion #1965

Intro

This is a discussion about a recently proposed strategy of extending the context size of LLaMA models.

Make sure to first get familiar with the info in the links above, as there have already been ongoing discussions and results.

So far the discussion seems to focus on the coherency of the generated text when using a large context. What we can do here in llama.cpp to support these investigations is provide a more objective way of evaluating the proposed method: computing the perplexity at different context sizes, with and without fine-tuning. Very initial results already suggest that this idea might be viable, but we should carefully check that we are doing the computations correctly.

Preliminary tests with LLaMA 7B

Applied the following simple patch as proposed by Reddit user pseudonerv in this comment:


diff --git a/examples/main/main.cpp b/examples/main/main.cpp
index 941312f..7fa3ae2 100644
--- a/examples/main/main.cpp
+++ b/examples/main/main.cpp
@@ -84,8 +84,8 @@ int main(int argc, char ** argv) {
         return 0;
     }

diff --git a/ggml.c b/ggml.c
index 4319683..0aa4bd1 100644
--- a/ggml.c
+++ b/ggml.c
@@ -12172,7 +12172,7 @@ static void ggml_compute_forward_rope_f32(
                 if (ir++ < ir0) continue;
                 if (ir > ir1) break;

@@ -12285,7 +12285,7 @@ static void ggml_compute_forward_rope_f16(
                 if (ir++ < ir0) continue;
                 if (ir > ir1) break;

@@ -12423,7 +12423,7 @@ static void ggml_compute_forward_rope_back_f32(
                 if (ir++ < ir0) continue;
                 if (ir > ir1) break;

@@ -12536,7 +12536,7 @@ static void ggml_compute_forward_rope_back_f16(
                 if (ir++ < ir0) continue;
                 if (ir > ir1) break;

This patch "scales" the RoPE position by a factor of 0.5 which should correspond to extending the max context size from 2048 to 4096.

Running the following perplexity calculation for LLaMA 7B Q4_0 with a context size of 4096 yields:


$ ▶ make -j && time ./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./build/wiki.test.raw --no-mmap -t 24 -c 4096
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c ggml.c -o ggml.o
ggml.c: In function ‘ggml_compute_forward_rope_f32’:
ggml.c:12175:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12175 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_f16’:
ggml.c:12288:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12288 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_back_f32’:
ggml.c:12426:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12426 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_back_f16’:
ggml.c:12539:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12539 |                 float theta = (float)p*0.5;
       |                                       ^
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/main/main.cpp ggml.o llama.o common.o k_quants.o -o main
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize/quantize.cpp ggml.o llama.o k_quants.o -o quantize
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o -o quantize-stats
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o -o perplexity
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o -o embedding
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS pocs/vdot/vdot.cpp ggml.o k_quants.o -o vdot
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o -o train-text-from-scratch
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o -o simple
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void write_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2371:21: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2371 |         file->seek(0-file->tell() & 31, SEEK_CUR);
      |                    ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp:2386:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2386 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void read_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2407:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2407 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~

====  Run ./main -h for help.  ====

In file included from /usr/include/string.h:535,
                 from /usr/include/c++/11/cstring:42,
                 from examples/train-text-from-scratch/train-text-from-scratch.cpp:7:
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:305:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:306:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:307:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
main: warning: model does not support context sizes greater than 2048 tokens (4096 specified);expect poor results
main: build = 721 (2322ec2)
main: seed = 1687419189
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 4096
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 2048.00 MB

system_info: n_threads = 24 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 81 chunks, batch_size=512
perplexity: 138.49 seconds per pass - ETA 3 hours 6 minutes
[1]6.0187,[2]7.0714,[3]6.3656,[4]5.5239,[5]5.5817,[6]5.6883,[7]5.7353,[8]5.8797,[9]6.0222,[10]6.0995,[11]5.9346,[12]6.0113,[13]6.0741,[14]6.1430,[15]6.2255,[16]6.3302,[17]6.2853,[18]6.1894,[19]6.1435,[20]6.1202,[21]5.9613,[22]5.8335,[23]5.7270,[24]5.7918,[25]5.9147,[26]5.9784,[27]5.9983,[28]5.9945,[29]6.0096,[30]5.9743,[31]5.9087,[32]5.8371,[33]5.7800,[34]5.7782,[35]5.8298,[36]5.8891,[37]5.8342,[38]5.8005,[39]5.7749,[40]5.7405,[41]5.7550,[42]5.7732,[43]5.7759,[44]5.7750,[45]5.7827,[46]5.8045,[47]5.7804,[48]5.7699,[49]5.7503,[50]5.7619,[51]5.7707,[52]5.8372,[53]5.8738,[54]5.9270,[55]5.9470,[56]5.9599,[57]5.9449,[58]5.9547,[59]5.9601,[60]5.9670,[61]5.9655,[62]5.9458,[63]5.9253,[64]5.9198,[65]5.9334,[66]5.9447,[67]5.9319,[68]5.9465,[69]5.9305,[70]5.9212,[71]5.9203,[72]5.9134,[73]5.9029,[74]5.9025,[75]5.9006,[76]5.9015,[77]5.8926,[78]5.8630,[79]5.8878,[80]5.8870,[81]5.8945,

llama_print_timings: load time = 16255.05 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 11425948.96 ms / 331776 tokens ( 34.44 ms per token, 29.04 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 11483720.44 ms

real 191m23.868s
user 4568m7.821s
sys 1m11.117s

Final result: 5.8945:

perplexity: 138.49 seconds per pass - ETA 3 hours 6 minutes
[1]6.0187,[2]7.0714,[3]6.3656,[4]5.5239,[5]5.5817,[6]5.6883,[7]5.7353,[8]5.8797,[9]6.0222,[10]6.0995,[11]5.9346,[12]6.0113,[13]6.0741,[14]6.1430,[15]6.2255,[16]6.3302,[17]6.2853,[18]6.1894,[19]6.1435,[20]6.1202,[21]5.9613,[22]5.8335,[23]5.7270,[24]5.7918,[25]5.9147,[26]5.9784,[27]5.9983,[28]5.9945,[29]6.0096,[30]5.9743,[31]5.9087,[32]5.8371,[33]5.7800,[34]5.7782,[35]5.8298,[36]5.8891,[37]5.8342,[38]5.8005,[39]5.7749,[40]5.7405,[41]5.7550,[42]5.7732,[43]5.7759,[44]5.7750,[45]5.7827,[46]5.8045,[47]5.7804,[48]5.7699,[49]5.7503,[50]5.7619,[51]5.7707,[52]5.8372,[53]5.8738,[54]5.9270,[55]5.9470,[56]5.9599,[57]5.9449,[58]5.9547,[59]5.9601,[60]5.9670,[61]5.9655,[62]5.9458,[63]5.9253,[64]5.9198,[65]5.9334,[66]5.9447,[67]5.9319,[68]5.9465,[69]5.9305,[70]5.9212,[71]5.9203,[72]5.9134,[73]5.9029,[74]5.9025,[75]5.9006,[76]5.9015,[77]5.8926,[78]5.8630,[79]5.8878,[80]5.8870,[81]5.8945,

This already looks very promising: without the "RoPE scaling" patch, the perplexity is extremely bad (it starts off above 110.0), which is expected since the vanilla computation does not support context sizes beyond 2048.

Additional tests with a context size of 2048, first without the patch (scale 1.0) and then with it (scale 0.5):

$  make -j && time ./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./build/wiki.test.raw --no-mmap -c 2048 -t 24
I llama.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:  
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

cc  -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS   -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c examples/common.cpp -o common.o
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS   -c -o k_quants.o k_quants.c
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/main/main.cpp ggml.o llama.o common.o k_quants.o -o main 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize/quantize.cpp ggml.o llama.o k_quants.o -o quantize 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o -o quantize-stats 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o -o perplexity 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o -o embedding 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS pocs/vdot/vdot.cpp ggml.o k_quants.o -o vdot 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o -o train-text-from-scratch 
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o -o simple 
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void write_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2371:21: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2371 |         file->seek(0-file->tell() & 31, SEEK_CUR);
      |                    ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp:2386:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2386 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void read_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2407:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2407 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~

====  Run ./main -h for help.  ====

In file included from /usr/include/string.h:535,
                 from /usr/include/c++/11/cstring:42,
                 from examples/train-text-from-scratch/train-text-from-scratch.cpp:7:
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:305:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:306:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:307:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
main: build = 724 (bbca06e)
main: seed  = 1687421414
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required  = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size  = 1024.00 MB

system_info: n_threads = 24 / 64 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 163 chunks, batch_size=512
perplexity: 29.91 seconds per pass - ETA 1 hours 21 minutes
[1]3.9923,[2]5.3389,[3]6.3162,[4]6.4622,[5]5.9443,[6]5.6493,[7]5.1580,[8]4.9461,[9]4.9658,[10]5.0431,[11]5.1182,[12]5.1215,[13]5.0616,[14]5.1292,[15]5.2202,[16]5.3321,[17]5.4008,[18]5.5149,[19]5.5788,[20]5.6355,[21]5.5755,[22]5.4813,[23]5.4999,[24]5.5534,[25]5.5471,[26]5.5828,[27]5.5970,[28]5.6491,[29]5.6481,[30]5.7217,[31]5.7891,[32]5.8522,[33]5.8383,[34]5.8318,[35]5.7966,[36]5.7386,[37]5.7094,[38]5.7035,[39]5.6921,[40]5.7040,[41]5.6324,[42]5.5440,[43]5.4858,[44]5.4032,[45]5.3468,[46]5.3093,[47]5.3060,[48]5.3659,[49]5.4259,[50]5.4784,[51]5.5197,[52]5.5335,[53]5.5438,[54]5.5528,[55]5.5698,[56]5.5450,[57]5.5778,[58]5.5574,[59]5.5457,[60]5.5209,[61]5.4906,[62]5.4601,[63]5.4336,[64]5.3905,[65]5.3565,[66]5.3304,[67]5.3202,[68]5.3309,[69]5.3545,[70]5.3764,[71]5.4046,[72]5.4295,[73]5.3870,[74]5.3834,[75]5.3716,[76]5.3487,[77]5.3396,[78]5.3293,[79]5.3092,[80]5.2992,[81]5.2962,[82]5.3115,[83]5.3299,[84]5.3223,[85]5.3134,[86]5.3201,[87]5.3259,[88]5.3210,[89]5.3312,[90]5.3359,[91]5.3543,[92]5.3527,[93]5.3429,[94]5.3369,[95]5.3398,[96]5.3276,[97]5.3238,[98]5.3080,[99]5.3016,[100]5.3145,[101]5.3226,[102]5.3251,[103]5.3662,[104]5.3941,[105]5.4088,[106]5.4254,[107]5.4511,[108]5.4792,[109]5.4785,[110]5.4911,[111]5.4929,[112]5.5035,[113]5.4953,[114]5.4925,[115]5.4961,[116]5.5013,[117]5.4994,[118]5.5026,[119]5.5040,[120]5.5132,[121]5.5147,[122]5.5055,[123]5.4988,[124]5.4928,[125]5.4807,[126]5.4702,[127]5.4650,[128]5.4691,[129]5.4757,[130]5.4839,[131]5.4918,[132]5.4952,[133]5.4868,[134]5.4739,[135]5.4851,[136]5.4884,[137]5.4806,[138]5.4728,[139]5.4646,[140]5.4613,[141]5.4608,[142]5.4597,[143]5.4605,[144]5.4533,[145]5.4477,[146]5.4465,[147]5.4427,[148]5.4420,[149]5.4448,[150]5.4466,[151]5.4505,[152]5.4492,[153]5.4546,[154]5.4416,[155]5.4180,[156]5.4227,[157]5.4273,[158]5.4457,[159]5.4525,[160]5.4508,[161]5.4582,[162]5.4602,[163]5.4708,

llama_print_timings:        load time =  8579.66 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time = 4890728.66 ms / 333824 tokens (   14.65 ms per token,    68.26 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time = 4940994.89 ms

real	82m21,155s
user	1955m22,704s
sys	0m27,572s

$ ▶ make -j && time ./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./build/wiki.test.raw --no-mmap -t 12 -c 2048
I llama.cpp build info:
I UNAME_S: Linux
I UNAME_P: x86_64
I UNAME_M: x86_64
I CFLAGS: -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I CXXFLAGS: -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS
I LDFLAGS:
I CC: cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0
I CXX: g++ (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0

cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c ggml.c -o ggml.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c llama.cpp -o llama.o
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c examples/common.cpp -o common.o
cc -I. -O3 -std=c11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS -c -o k_quants.o k_quants.c
ggml.c: In function ‘ggml_compute_forward_rope_f32’:
ggml.c:12175:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12175 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_f16’:
ggml.c:12288:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12288 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_back_f32’:
ggml.c:12426:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12426 |                 float theta = (float)p*0.5;
       |                                       ^
ggml.c: In function ‘ggml_compute_forward_rope_back_f16’:
ggml.c:12539:39: warning: implicit conversion from ‘float’ to ‘double’ to match other operand of binary expression [-Wdouble-promotion]
 12539 |                 float theta = (float)p*0.5;
       |                                       ^
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/main/main.cpp ggml.o llama.o common.o k_quants.o -o main
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize/quantize.cpp ggml.o llama.o k_quants.o -o quantize
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/quantize-stats/quantize-stats.cpp ggml.o llama.o k_quants.o -o quantize-stats
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/perplexity/perplexity.cpp ggml.o llama.o common.o k_quants.o -o perplexity
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/embedding/embedding.cpp ggml.o llama.o common.o k_quants.o -o embedding
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS pocs/vdot/vdot.cpp ggml.o k_quants.o -o vdot
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/train-text-from-scratch/train-text-from-scratch.cpp ggml.o llama.o k_quants.o -o train-text-from-scratch
g++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -DGGML_USE_K_QUANTS examples/simple/simple.cpp ggml.o llama.o common.o k_quants.o -o simple
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void write_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2371:21: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2371 |         file->seek(0-file->tell() & 31, SEEK_CUR);
      |                    ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp:2386:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2386 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~
examples/train-text-from-scratch/train-text-from-scratch.cpp: In function ‘void read_tensor(llama_file*, ggml_tensor*)’:
examples/train-text-from-scratch/train-text-from-scratch.cpp:2407:17: warning: suggest parentheses around ‘-’ in operand of ‘&’ [-Wparentheses]
 2407 |     file->seek(0-file->tell() & 31, SEEK_CUR);
      |                ~^~~~~~~~~~~~~

====  Run ./main -h for help.  ====

In file included from /usr/include/string.h:535,
                 from /usr/include/c++/11/cstring:42,
                 from examples/train-text-from-scratch/train-text-from-scratch.cpp:7:
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:305:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:306:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘char* strncpy(char*, const char*, size_t)’,
    inlined from ‘void init_model(my_llama_model*)’ at examples/train-text-from-scratch/train-text-from-scratch.cpp:307:16:
/usr/include/x86_64-linux-gnu/bits/string_fortified.h:95:34: warning: ‘char* __builtin_strncpy(char*, const char*, long unsigned int)’ specified bound 32 equals destination size [-Wstringop-truncation]
   95 |   return __builtin___strncpy_chk (__dest, __src, __len,
      |          ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~
   96 |                                   __glibc_objsize (__dest));
      |                                   ~~~~~~~~~~~~~~~~~~~~~~~~~
main: build = 721 (2322ec2)
main: seed = 1687435610
llama.cpp: loading model from ./models/7B/ggml-model-q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: n_parts = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 3615.71 MB
llama_model_load_internal: mem required = 5407.71 MB (+ 1026.00 MB per state)
...................................................................................................
llama_init_from_file: kv self size = 1024.00 MB

system_info: n_threads = 12 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
perplexity: calculating perplexity over 163 chunks, batch_size=512
perplexity: 63.83 seconds per pass - ETA 2 hours 53 minutes
[1]4.5277,[2]5.9310,[3]7.0203,[4]7.1517,[5]6.6052,[6]6.3426,[7]5.7903,[8]5.5576,[9]5.5795,[10]5.6517,[11]5.7561,[12]5.7770,[13]5.7174,[14]5.7934,[15]5.8778,[16]5.9828,[17]6.0381,[18]6.1546,[19]6.2098,[20]6.2615,[21]6.2092,[22]6.1235,[23]6.1354,[24]6.1845,[25]6.1778,[26]6.2131,[27]6.2235,[28]6.2771,[29]6.2809,[30]6.3569,[31]6.4299,[32]6.4946,[33]6.4728,[34]6.4554,[35]6.4232,[36]6.3592,[37]6.3237,[38]6.3070,[39]6.3018,[40]6.3152,[41]6.2502,[42]6.1575,[43]6.1011,[44]6.0151,[45]5.9520,[46]5.9117,[47]5.9119,[48]5.9750,[49]6.0362,[50]6.0938,[51]6.1433,[52]6.1591,[53]6.1681,[54]6.1780,[55]6.1945,[56]6.1673,[57]6.2046,[58]6.1846,[59]6.1761,[60]6.1512,[61]6.1174,[62]6.0860,[63]6.0553,[64]6.0074,[65]5.9698,[66]5.9433,[67]5.9329,[68]5.9401,[69]5.9661,[70]5.9888,[71]6.0173,[72]6.0475,[73]6.0011,[74]5.9921,[75]5.9784,[76]5.9486,[77]5.9385,[78]5.9258,[79]5.9065,[80]5.8969,[81]5.8917,[82]5.9041,[83]5.9231,[84]5.9185,[85]5.9089,[86]5.9155,[87]5.9189,[88]5.9129,[89]5.9237,[90]5.9295,[91]5.9490,[92]5.9448,[93]5.9291,[94]5.9186,[95]5.9191,[96]5.9055,[97]5.9024,[98]5.8866,[99]5.8808,[100]5.8984,[101]5.9089,[102]5.9097,[103]5.9529,[104]5.9819,[105]5.9978,[106]6.0180,[107]6.0446,[108]6.0737,[109]6.0737,[110]6.0894,[111]6.0915,[112]6.1039,[113]6.0958,[114]6.0914,[115]6.0934,[116]6.0991,[117]6.0972,[118]6.1008,[119]6.1036,[120]6.1124,[121]6.1155,[122]6.1061,[123]6.0980,[124]6.0895,[125]6.0779,[126]6.0657,[127]6.0606,[128]6.0645,[129]6.0721,[130]6.0807,[131]6.0886,[132]6.0931,[133]6.0858,[134]6.0754,[135]6.0878,[136]6.0915,[137]6.0837,[138]6.0755,[139]6.0668,[140]6.0645,[141]6.0642,[142]6.0636,[143]6.0642,[144]6.0566,[145]6.0486,[146]6.0479,[147]6.0447,[148]6.0462,[149]6.0469,[150]6.0471,[151]6.0505,[152]6.0466,[153]6.0514,[154]6.0373,[155]6.0094,[156]6.0135,[157]6.0191,[158]6.0390,[159]6.0461,[160]6.0427,[161]6.0495,[162]6.0525,[163]6.0642,

llama_print_timings: load time = 16124.85 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 10484054.26 ms / 333824 tokens ( 31.41 ms per token, 31.84 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 10535367.30 ms

real 175m35.512s
user 2096m10.839s
sys 0m42.180s

I'm currently running the computations on the CPU because I have more confidence that the changes there are correct, but we should look into updating the GPU code to support the RoPE scaling and run more calculations to determine how the perplexity behaves for different context sizes.

The author of this idea, @kaiokendev, suggests that this approach should work even better with fine-tuned models (https://www.reddit.com/r/LocalLLaMA/comments/14fgjqj/comment/jp2dchb/?utm_source=share&utm_medium=web2x&context=3), so we should also run some tests with those models.

Result summary (live updates)

wiki.test.raw

Model     Format  Scale  Ctx   Chunks  Perplexity
LLaMA 7B  Q4_0    1.0    2048  163     5.4708
LLaMA 7B  Q4_0    0.5    2048  163     6.0642
LLaMA 7B  Q4_0    1.0    4096  81      inf
LLaMA 7B  Q4_0    0.5    4096  81      5.8945
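As a quick sanity check on the table, the sketch below reproduces two pieces of arithmetic from the runs above: the number of tokens scored per run (chunks times context size, which matches the "prompt eval" token counts of 333824 and 331776 in the logs) and the relative change in perplexity against the unscaled 2048 baseline. The helper names `tokens_evaluated` and `relative_change_pct` are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper: total tokens scored in a run of `chunks` windows
 * of `n_ctx` tokens each. */
static long tokens_evaluated(long chunks, long n_ctx) {
    return chunks * n_ctx;
}

/* Hypothetical helper: percentage change of `value` relative to `baseline`. */
static double relative_change_pct(double baseline, double value) {
    assert(baseline > 0.0);
    return 100.0 * (value - baseline) / baseline;
}
```

Using the table's figures, the 0.5 scale at a context of 4096 comes out roughly 7.7% above the unscaled 2048 baseline (5.8945 vs 5.4708), and the same scale at a context of 2048 roughly 10.8% above it (6.0642 vs 5.4708). Both context sizes score nearly the same number of tokens, so the runs cover almost the same text.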