YAML logging and presets #2657

Merged — 1 commit, Aug 28, 2023

Conversation

JohannesGaessler (Collaborator)

This PR adds the ability to log all input and output parameters to YAML files, as well as a Python script run_with_preset.py that takes a YAML file as input and uses it to run the llama.cpp binaries with the specified CLI arguments. The usage is python run_with_preset.py path/to/preset/file.yml. Currently only one preset file can be specified; I plan to make it possible to pass additional CLI arguments to the Python script that override the presets. This PR is intended as a replacement for #2557 and has vaguely similar uses to #2626. By specifying the CLI argument --logdir, a YAML file such as this can be created:

binary: main
build_commit: 97a99ba
build_number: 1008
debug: true
optimize: false
time: 2023-08-18T11:26:53.514823854

###############
# User Inputs #
###############

batch_size: 512 # default: 512
cfg_negative_prompt:
cfg_scale: 1.000000 # default: 1.0
chunks: -1 # default: -1 (unlimited)
color: false # default: false
ctx_size: 2048 # default: 512
escape: false # default: false
export: false # default: false
file: # never logged, see prompt instead. Can still be specified for input.
frequency_penalty: 0.000000 # default: 0.0 
gqa: 1 # default: 1.0
grammar:
grammar-file: # never logged, see grammar instead. Can still be specified for input.
hellaswag: false # default: false
hellaswag_tasks: 400 # default: 400
ignore_eos: true # default: false
instruct: false # default: false
interactive: false # default: false
interactive_first: false # default: false
in_prefix:
in_prefix_bos: false # default: false
in_suffix:
keep: 0 # default: 0
logdir: log # default: unset (no logging)
logit_bias:
lora: 
lora_base: 
low_vram: false # default: false
main_gpu: 0 # default: 0
mirostat: 0 # default: 0 (disabled)
mirostat_ent: 5.000000 # default: 5.0
mirostat_lr: 0.100000 # default: 0.1
mtest: false # default: false
mul_mat_q: true # default: false
memory_f32: false # default: false
mlock: false # default: false
model: models/nvme/llama-7b-ggml-q4_0.bin # default: models/7B/ggml-model.bin
model_alias: unknown # default: unknown
multiline_input: false # default: false
n_gpu_layers: 99 # default: 0
n_predict: 128 # default: -1 (unlimited)
no_mmap: false # default: false
no_penalize_nl: false # default: false
numa: false # default: false
presence_penalty: 0.000000 # default: 0.0
prompt: "Llamas are animals that "
prompt_cache: 
prompt_cache_all: false # default: false
prompt_cache_ro: false # default: false
prompt_tokens: [1, 365, 5288, 294, 526, 15006, 393, 29871]
random_prompt: false # default: false
repeat_penalty: 1.100000 # default: 1.1
reverse_prompt:
rms_norm_eps: 0.000005 # default: 5e-6
rope_freq_base: 10000.000000 # default: 10000.0
rope_freq_scale: 1.000000 # default: 1.0
seed: 1337 # default: -1 (random seed)
simple_io: false # default: false
tensor_split: [0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000, 0.000000]
temp: 0.800000 # default: 0.8
threads: 1 # default: 16
tfs: 1.000000 # default: 1.0
top_k: 40 # default: 40
top_p: 0.950000 # default: 0.95
typical_p: 1.000000 # default: 1.0
verbose_prompt: false # default: false

###########
# Results #
###########

cpu_has_arm_fma: false
cpu_has_avx: true
cpu_has_avx2: true
cpu_has_avx512: false
cpu_has_avx512_vbmi: false
cpu_has_avx512_vnni: false
cpu_has_blas: true
cpu_has_cublas: true
cpu_has_clblast: false
cpu_has_fma: true
cpu_has_gpublas: true
cpu_has_neon: false
cpu_has_f16c: true
cpu_has_fp16_va: false
cpu_has_wasm_simd: false
cpu_has_blas: true
cpu_has_sse3: true
cpu_has_vsx: false
ftype: 2
ftype_str: mostly Q4_0
model_type: 7B
n_eval: 127
n_vocab: 32000
n_p_eval: 8
n_sample: 128
output: |
  90% of the world has never seen. That’s because llamas live in Peru, Bolivia and Argentina. But for the last few years there has been a steady migration north to New York City where they have been discovered by Manhattanites walking their dogs in Central Park.
  But while they are quite common around here, they aren’t really a part of our culture, as llamas are pretty much unknown here, and we don’t really know what to do with them. But even though they are strange creatures that we have trouble identifying, there is no doubt that they are beautiful.
output_tokens: [29929, 29900, 29995, 310, 278, 3186, 756, 2360, 3595, 29889, 2193, 30010, 29879, 1363, 11829, 294, 5735, 297, 25493, 29892, 25765, 423, 322, 13798, 29889, 1205, 363, 278, 1833, 2846, 2440, 727, 756, 1063, 263, 27357, 20332, 6641, 304, 1570, 3088, 4412, 988, 896, 505, 1063, 10943, 491, 29093, 23586, 3246, 22049, 1009, 26361, 297, 8068, 4815, 29889, 13, 6246, 1550, 896, 526, 3755, 3619, 2820, 1244, 29892, 896, 9455, 30010, 29873, 2289, 263, 760, 310, 1749, 9257, 29892, 408, 11829, 294, 526, 5051, 1568, 9815, 1244, 29892, 322, 591, 1016, 30010, 29873, 2289, 1073, 825, 304, 437, 411, 963, 29889, 1205, 1584, 2466, 896, 526, 8515, 907, 3698, 393, 591, 505, 7458, 2893, 9215, 29892, 727, 338, 694, 7404, 393, 896, 526, 9560, 29889, 13, 15597, 526]
t_eval_us: 987228
t_load_us: 1266402
t_p_eval_us: 63403
t_sample_us: 521452

This YAML file could then be used as input for run_with_preset.py to reproduce the generation, or a custom YAML file could be written that specifies only a subset of the CLI arguments, for example (a sketch of how such a preset translates into a command line follows the example):

ctx_size: 2048 # default: 512
mul_mat_q: true # default: false
mlock: true # default: false
model: models/nvme/llama-7b-ggml-q4_0.bin # default: models/7B/ggml-model.bin
n_gpu_layers: 99 # default: 0
n_predict: 128 # default: -1 (unlimited)
prompt: Llamas are animals that
seed: 1337 # default: -1 (random seed)
threads: 1 # default: 16
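
For a rough intuition of what the script does, here is a minimal sketch of a preset-to-command-line conversion. This is not the actual run_with_preset.py code; the underscore-to-dash mapping, the handling of boolean flags, and the hard-coded ./main binary are simplifying assumptions:

#!/usr/bin/env python3

import subprocess
import sys

import yaml

with open(sys.argv[1], "r") as f:
    preset = yaml.safe_load(f)

command = ["./main"]  # assumption: the preset targets the main binary
for key, value in preset.items():
    if value is False or value is None:
        continue  # unset or disabled options are simply omitted
    command.append("--" + key.replace("_", "-"))
    if value is not True:  # boolean true becomes a bare flag, everything else gets a value
        command.append(str(value))

subprocess.run(command)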

This PR is still very much a WIP; for now only the main binary is supported and I still need to do more testing. I made the following design decisions for the implementation:

  • I chose YAML for the data format. I chose it over CSV because CSV would have very poor readability for a large number of properties with varying lengths, and over JSON because JSON does not support comments or multiline strings. Since YAML is a superset of JSON it should still be possible to use JSON as input though.
  • Logging is disabled by default and only enabled when the --logdir CLI argument is set.
  • I separated the log file into three segments: a header that contains static information, a "User Inputs" section, and a "Results" section that contains information only known at runtime. Maybe some of the result properties like cpu_has_blas would make sense to move into the header.
  • I am using timestamps of the format <YEAR>-<MONTH>-<DAY>T<HOURS>:<MINUTES>:<SECONDS>.<NANOSECONDS>. The timestamps are also used as filenames. I chose the ordering of month before day so that alphabetical sorting aligns with temporal order, and I added the nanoseconds to make it very unlikely for two processes to write to the same file (a small sketch of this scheme follows this list).
  • I am using std::experimental::filesystem to create the log directory if it does not exist and to construct the path to the output file in an OS-agnostic manner. In C++17 and later this functionality is part of std::filesystem.
  • I added a long option and a gpt_params property for escaping special characters in the prompt.
  • I am adding lists of tokens (as int) for the prompt and the output to the log file.
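
The timestamp-as-filename scheme from the list above can be reproduced with a few lines of Python, shown here only for illustration (the helper name timestamped_log_path is mine; the actual implementation in the PR is C++):

#!/usr/bin/env python3

import os
import time

def timestamped_log_path(logdir: str) -> str:
    # <YEAR>-<MONTH>-<DAY>T<HOURS>:<MINUTES>:<SECONDS>.<NANOSECONDS>, e.g. 2023-08-18T11:26:53.514823854
    seconds, nanoseconds = divmod(time.time_ns(), 1_000_000_000)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(seconds)) + f".{nanoseconds:09d}"
    os.makedirs(logdir, exist_ok=True)  # create the log directory if it does not exist
    return os.path.join(logdir, stamp + ".yml")

print(timestamped_log_path("log"))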

examples/main/main.cpp — review thread (outdated, resolved)
@JohannesGaessler (Collaborator, Author)

I implemented support for the perplexity binary. I am logging the individual token probabilities because I suspect that just looking at the average negative log-likelihood of the token probabilities is a suboptimal metric for judging quantization quality loss. This is a simple plot where I just histogrammed the probabilities:

[Image: prob_hist — histogram of token probabilities]

The corresponding code is:

#!/usr/bin/env python3

import yaml
import numpy as np
import matplotlib.pyplot as plt

with open("log/2023-08-19T12:48:37.164907565.yml", "r") as f:
    props = yaml.safe_load(f)

plt.hist(props["probs"], bins=np.linspace(0, 1, 201))
plt.title("Probability distribution for perplexity on wikitext, 7b q4_0")
plt.xlim(0, 1)
plt.xlabel("Token probability")
plt.ylabel("Count")
plt.savefig("prob_hist.png", dpi=240)
plt.show()

Even in this very simple analysis it becomes apparent that there are two clusters of token probabilities: one close to 0 and one close to 1. I would argue that the cluster close to 0 is largely irrelevant because the model is essentially guessing and it doesn't matter whether it's correct 1% of the time or 0.1% of the time. However, due to the way the math works out, a drop from 1% to 0.1% is weighted the same as a drop from 50% to 5% or from 100% to 10%, which would mean a much larger loss of quality. I'm not yet 100% sure what a better metric would be, but I think part of it should be to ignore the perplexity of those tokens where the unquantized model was already performing extremely poorly.
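
For concreteness, a rough sketch of that idea in Python. This is purely illustrative and not implemented in the PR; the function name, argument names, and the 1% cutoff are my assumptions:

import numpy as np

def filtered_perplexity(probs_quantized, probs_unquantized, cutoff=0.01):
    p_q = np.asarray(probs_quantized)
    p_u = np.asarray(probs_unquantized)
    # ignore tokens where the unquantized model was already performing extremely poorly
    mask = p_u >= cutoff
    # perplexity over the remaining tokens: exp of the mean negative log-likelihood
    return float(np.exp(-np.mean(np.log(p_q[mask]))))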

@ggerganov (Owner) left a comment

These are some implementation-related comments that I would like to get addressed before merging.

Review threads on llama.h and llama-util.h (outdated, resolved)
@JohannesGaessler (Collaborator, Author) commented Aug 19, 2023

Not directly related to the PR, but I'm making another post because I find the results interesting. I made another histogram where I weighted the frequency of the probabilities with their negative log-likelihoods:

[Image: prob_hist_weighted — probability histogram weighted by negative log-likelihood]

This is the corresponding code:

#!/usr/bin/env python3

import yaml
import numpy as np
import matplotlib.pyplot as plt

with open("log/2023-08-19T12:48:37.164907565.yml", "r") as f:
    props = yaml.safe_load(f)

probs = np.array(props["probs"])

plt.hist(probs, bins=np.linspace(0, 1, 201))
plt.title("Probability distribution for perplexity on wikitext, 7b q4_0")
plt.xlim(0, 1)
plt.xlabel("Token probability")
plt.ylabel("Count")
plt.savefig("prob_hist.png", dpi=240)

probs = probs[probs != 0]  # due to rounding error when writing YAML file, negligible
weights = -np.log(probs)

plt.figure()
plt.hist(probs, bins=np.linspace(0, 1, 101), density=True, weights=weights)
plt.title("Perplexity contributions on wikitext, 7b q4_0")
plt.xlim(0, 1)
plt.xlabel("Token probability")
plt.ylabel("Rel. contributions to total perplexity")
plt.savefig("prob_hist_weighted.png", dpi=240)

plt.show()

As it turns out, the absolute perplexity value is overwhelmingly dominated by low-probability tokens, so the metric is most sensitive to small absolute probability changes for those tokens.

Edit: the binning in this histogram is chosen such that the bin heights correspond to the percentage of the total contribution.

@JohannesGaessler (Collaborator, Author)

Using the code in this PR I think I've found a good metric for estimating the precision loss from quantization: the root mean square of the token probabilities relative to the unquantized model. This metric is sensitive to the probabilities that are not close to 0 or 1 (those close to 0 or 1 seem to be largely unaffected by quantization):

[Image: ppl_test]
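
A minimal sketch of one way to read this metric, assuming it means the RMS of the per-token probability differences between the quantized and unquantized runs (the function and argument names are placeholders, not code from this PR):

import numpy as np

def rms_probability_difference(probs_quantized, probs_unquantized):
    # root mean square of the token probabilities relative to the unquantized model
    diff = np.asarray(probs_quantized) - np.asarray(probs_unquantized)
    return float(np.sqrt(np.mean(diff * diff)))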

There is still a lot that would need to be done for this PR, but I could instead focus on adding this metric to the perplexity binary. Although I would try my best to make a good case for it in the corresponding PR, it would require some degree of trust that I did everything correctly if my data is based on unmerged code.

@JohannesGaessler (Collaborator, Author) commented Aug 23, 2023

This PR should now be feature complete except for directory creation on Windows. I'll probably get to it on Friday.

Edit: Logging of the hellaswag score is not implemented and I still need to ensure that I didn't break anything. Running server and llama-bench with presets is supported, but logging their results as YAML files is not.

@ghost commented Aug 23, 2023

By specifying the CLI argument --logdir

It appears that specifying a path is also required, and no logging occurs if it's unset. I want a YAML file produced, but I'm unclear on how to do so.

I figured -ld ~/ should be enough during main, but nothing is saved to that path. 😔

@cebtenzzre — this comment was marked as outdated.

@ghost commented Aug 23, 2023

I figured -ld ~/ should be enough during main, but nothing is saved to that path. 😔

Maybe the trailing slash is a problem for create_directory_with_parents - try without it. Is there a warning logged to stderr?

I'm trying to understand, but I don't understand stderr or what you mean by it. There are no warnings that I see; main loads like usual. Is it because I CTRL + C after it inferences a bit? This works with --interactive, yeah?

Here's my attempt: ./main -m ~/WizardMath.gguf --color -c 2048 --keep -1 -n -1 -t 3 -b 7 -i -r "USER:" --in-prefix " " -ld /data/data/com.termux/files/home

Other than adding the -ld path, I don't see any indication that it's on.

@cebtenzzre (Collaborator)

I'm trying to understand, but I don't understand stderr or what you mean by it. There are no warnings that I see; main loads like usual. Is it because I CTRL + C after it inferences a bit? This works with --interactive, yeah?

The log only saves if main completes normally, not if you kill it with CTRL+C.

@ghost commented Aug 23, 2023

I'm trying to understand, but I don't understand stderr or what you mean by it. There are no warnings that I see; main loads like usual. Is it because I CTRL + C after it inferences a bit? This works with --interactive, yeah?

The log only saves if main completes normally, not if you kill it with CTRL+C.

😅 Whoops. I guess I'll lower the --context, because I don't really want it to run for so long. Thank you.

@cebtenzzre (Collaborator)

I get a segfault if I don't pass a prompt, because prompt_tokens is empty and dump_vector_int_yaml assumes all vectors have at least one element.

@ghost commented Aug 23, 2023

I get a segfault if I don't pass a prompt, because prompt_tokens is empty and dump_vector_int_yaml assumes all vectors have at least one element.

I think the newest commit helps with that by adding a BOS token, but yeah, I'll add a prompt. I'm a bit confused about how to end main without killing it.

Even if it stops when it reaches max context, I have to kill it to end the program... but also, it's not respecting n_ctx = 50.

main: interactive mode on.                                 
Reverse prompt: 'USER:'
Input prefix: ' '                                          
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 50, n_batch = 7, n_predict = -1, n_keep = 0
                                                           
== Running in interactive mode. ==                          
- Press Ctrl+C to interject at any time.
 - Press Return to return control to LLaMa.                 
 - To return control without starting a new line, end your input with '/'.                                             

 - If you want to submit another line, end your input with '\'.                                                       
 USRR: Hello. everybody, and welcome to the latest episode of the Unsponsored Running Report, I’m your host Will Barling and in this week’s show, we have a bumper crop of interviews…

➖ Three with women who are all on a journey to a more ethical lifestyle.

➖ One woman has already changed her career to do something more sustainable, one is trying to make changes in her current job and another wants to start a business but needs the time and money to do so. Unterscheidung zwischen Urlaub und Ferien.                                                 
The company was started in 1984 by Tim Berners-Lee, who remains the director of the In- ternet consortium. His original idea was to allow scientists from different laboratories throughout the world to exchange data and specimens without having to ship them through the mail or fly them across the globe.ϊ                                                 
 The International Space Station (ISS) is a permanently manned space station in orbit around Earth. It was first assembled in 1998 and completed in 2011, with Russian
                                                           
llama_print_timings:        load time =   585.46 ms
llama_print_timings:      sample time =   434.87 ms /   236 runs   (    1.84 ms per token,   542.69 tokens per second)
llama_print_timings: prompt eval time = 72145.63 ms /   229 tokens (  315.05 ms per token,     3.17 tokens per second)
llama_print_timings:        eval time = 83716.91 ms /   227 runs   (  368.80 ms per token,     2.71 tokens per second)
llama_print_timings:       total time = 156588.79 ms

examples/main/main.cpp — review thread (outdated, resolved)
@JohannesGaessler (Collaborator, Author) commented Aug 23, 2023

@JackJollimore it seems I misinterpreted what the code does. I saw the timings when you interrupt with CTRL+C and assumed it would jump to the end of the program, where I also added the logging code. I'll fix the generation of YAML files on interrupt on Friday.

Even if it stopped when it reaches max context, I have to kill it to end the program.. but also, it's not respecting n_ctx = 50..

You need to set --n-predict, not the context size.

@ghost commented Aug 23, 2023

I'll fix the generation of YAML files on interrupt on Friday.

I wanted to ask whether interrupted generations could be handled, so that's great. Thanks for a PR like this.

You need to set --n-predict, not the context size.

Bug with -n -2 fixed: #2767

Please merge master into this PR when you get a chance.

JohannesGaessler marked this pull request as ready for review on August 27, 2023, 19:55.
@JohannesGaessler (Collaborator, Author)

Alright, this should now be mostly good to merge from my end (I still need to test it some more).

@ghost commented Aug 27, 2023

Working:

./main -m ~/WizardLM.gguf --color -c 2048 --keep -1 -n -2 -t 3 -b 7 -i -r "User:" --in-prefix " " --in-suffix "Assistant:" -f ~/storage/shared/PT/M.txt -ld ~/

binary: main
build_commit: f18cada
build_number: 1095
cpu_has_arm_fma: true
cpu_has_avx: false
cpu_has_avx2: false
cpu_has_avx512: false
cpu_has_avx512_vbmi: false
cpu_has_avx512_vnni: false
cpu_has_blas: false
cpu_has_cublas: false
cpu_has_clblast: false
cpu_has_fma: false
cpu_has_gpublas: false
cpu_has_neon: true
cpu_has_f16c: false
cpu_has_fp16_va: true
cpu_has_wasm_simd: false
cpu_has_blas: false
cpu_has_sse3: false
cpu_has_vsx: false
debug: false
model: wizardlm-7b-v1.0-uncensored.ggmlv3.q4_0.bin 7B mostly Q4_0
optimize: true
time: 2023_08_27-18_32_00.9205951720

###############
# User Inputs #
###############

alias: unknown # default: unknown
batch_size: 7 # default: 512
cfg_negative_prompt:
cfg_scale: 1.000000 # default: 1.0
chunks: -1 # default: -1 (unlimited)
color: true # default: false
ctx_size: 2048 # default: 512
escape: false # default: false
export: false # default: false
file: # never logged, see prompt instead. Can still be specified for input.
frequency_penalty: 0.000000 # default: 0.0 
grammar:
grammar-file: # never logged, see grammar instead. Can still be specified for input.
hellaswag: false # default: false
hellaswag_tasks: 400 # default: 400
ignore_eos: false # default: false
instruct: false # default: false
interactive: true # default: false
interactive_first: false # default: false
in_prefix: " "
in_prefix_bos: false # default: false
in_suffix: " "
keep: 42 # default: 0
logdir: /data/data/com.termux/files/home/ # default: unset (no logging)
logit_bias:
lora: 
lora_base: 
low_vram: false # default: false
main_gpu: 0 # default: 0
mirostat: 0 # default: 0 (disabled)
mirostat_ent: 5.000000 # default: 5.0
mirostat_lr: 0.100000 # default: 0.1
memory_f32: false # default: false
mlock: false # default: false
model: /data/data/com.termux/files/home/WizardLM.gguf # default: models/7B/ggml-model.bin
mtest: false # default: false
n_probs: 0 # only used by server binary, default: 0
multiline_input: false # default: false
n_gpu_layers: 0 # default: 0
n_predict: -2 # default: -1 (unlimited)
no_mmap: false # default: false
no_mul_mat_q: false # default: false
no_penalize_nl: false # default: false
numa: false # default: false
ppl_output_type: 0 # default: 0
ppl_stride: 0 # default: 0
presence_penalty: 0.000000 # default: 0.0
prompt: |
  Below is an instruction that describes a task. Write a response that appropriately completes the request.
  
  ### Instruction:
  Please list 3 movie titles.
  
prompt_cache: 
prompt_cache_all: false # default: false
prompt_cache_ro: false # default: false
prompt_tokens: [1, 13866, 338, 385, 15278, 393, 16612, 263, 3414, 29889, 14350, 263, 2933, 393, 7128, 2486, 1614, 2167, 278, 2009, 29889, 13, 13, 2277, 29937, 2799, 4080, 29901, 13, 12148, 1051, 29871, 29941, 14064, 17735, 29889, 13, 13, 2277, 29937, 13291, 29901]
random_prompt: false # default: false
repeat_penalty: 1.100000 # default: 1.1
reverse_prompt:
  - User:
rope_freq_base: 10000.000000 # default: 10000.0
rope_freq_scale: 1.000000 # default: 1.0
seed: 1693171775 # default: -1 (random seed)
simple_io: false # default: false
tensor_split: [0.000000e+00]
temp: 0.800000 # default: 0.8
threads: 3 # default: 8
tfs: 1.000000 # default: 1.0
top_k: 40 # default: 40
top_p: 0.950000 # default: 0.95
typical_p: 1.000000 # default: 1.0
verbose_prompt: false # default: false

######################
# Generation Results #
######################

output: "\n 1. \"Jurassic Park\" (1993)\n 2. \"The Godfather\" (1972)\n 3. \"The Shawshank Redemption\" (1994)  thanks. whats pi - 2?\nAssistant: The value of Pi is a mathematical constant, defined as the ratio of the circumference of any circle to its diameter. It is approximately equal to 3.14159. To find Pi, you can use the formula:\n\nPi = C / D\n\nWhere C represents the circumference and D represents the diameter.  Sure.. list 2 or 3 morw movie titles.\nAssistant: Here are three additional movie titles for you to consider:\n\n1. \"The Silence of the Lambs\" (1991)\n2. \"The Dark Knight\" (2008)\n3. \"Schindler's List\" (1993)  Thanks.\nAssistant: You're welcome! Is there anything else I can assist you with?"
output_tokens: [13, 29871, 29896, 29889, 376, 29967, 332, 465, 293, 4815, 29908, 313, 29896, 29929, 29929, 29941, 29897, 13, 29871, 29906, 29889, 376, 1576, 4177, 22212, 29908, 313, 29896, 29929, 29955, 29906, 29897, 13, 29871, 29941, 29889, 376, 1576, 28548, 845, 804, 4367, 331, 683, 29908, 313, 29896, 29929, 29929, 29946, 29897, 2, 29871, 3969, 29889, 825, 29879, 2930, 448, 29871, 29906, 29973, 13, 7900, 22137, 29901, 450, 995, 310, 7362, 338, 263, 19475, 4868, 29892, 3342, 408, 278, 11959, 310, 278, 9942, 1659, 310, 738, 8607, 304, 967, 24235, 29889, 739, 338, 14235, 5186, 304, 29871, 29941, 29889, 29896, 29946, 29896, 29945, 29929, 29889, 1763, 1284, 7362, 29892, 366, 508, 671, 278, 7063, 29901, 13, 13, 12197, 353, 315, 847, 360, 13, 13, 11921, 315, 11524, 278, 9942, 1659, 322, 360, 11524, 278, 24235, 29889, 2, 29871, 18585, 636, 1051, 29871, 29906, 470, 29871, 29941, 3036, 29893, 14064, 17735, 29889, 13, 7900, 22137, 29901, 2266, 526, 2211, 5684, 14064, 17735, 363, 366, 304, 2050, 29901, 13, 13, 29896, 29889, 376, 1576, 5664, 663, 310, 278, 26832, 29879, 29908, 313, 29896, 29929, 29929, 29896, 29897, 13, 29906, 29889, 376, 1576, 15317, 22980, 29908, 313, 29906, 29900, 29900, 29947, 29897, 13, 29941, 29889, 376, 4504, 513, 1358, 29915, 29879, 2391, 29908, 313, 29896, 29929, 29929, 29941, 29897, 2, 29871, 1834, 29889, 13, 7900, 22137, 29901, 887, 29915, 276, 12853, 29991, 1317, 727, 3099, 1683, 306, 508, 6985, 366, 411, 29973, 2]

###########
# Timings #
###########

mst_eval: 378.01  # ms / token during generation
mst_p_eval: 393.34  # ms / token during prompt processing
mst_sample: 2.28  # ms / token during sampling
n_eval: 199  # number of tokens generated (excluding the first one)
n_vocab: 32000  # output size of the final layer, 32001 for some models
n_p_eval: 87  # number of tokens processed in batches at the beginning
n_sample: 200  # number of sampled tokens
t_eval_us: 75223089  # total microseconds spent generating tokens
t_load_us: 872038  # total microseconds spent loading the model
t_p_eval_us: 34220995  # total microseconds spent prompt processing
t_sample_us: 456705  # total microseconds spent sampling
ts_eval: 2.65  # tokens / second during generation
ts_p_eval: 2.54  # tokens / second during prompt processing
ts_sample: 437.92  # tokens / second during sampling
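
As a consistency check, the derived ms/token and tokens/second figures follow directly from the raw microsecond counters; for example, using the numbers from the timings block above:

# values taken from the timings block above
t_eval_us, n_eval = 75223089, 199
ts_eval = n_eval / (t_eval_us / 1e6)    # ≈ 2.65 tokens / second during generation
mst_eval = (t_eval_us / 1e3) / n_eval   # ≈ 378.01 ms / token during generation
print(ts_eval, mst_eval)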


Full generation.

@ggerganov (Owner) commented Aug 28, 2023 on the following code:

}

void perplexity(llama_context * ctx, const gpt_params & params) {
std::tuple<std::vector<llama_token>, std::vector<float>, std::vector<float>, float>

I find std::tuple syntax to be quite cumbersome and unreadable, so I normally avoid it by using a simple struct.

Can we change these to struct results_ppl and earlier to struct results_log_softmax?

JohannesGaessler merged commit 6b73ef1 into ggerganov:master on Aug 28, 2023. 25 checks passed.
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023