In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks.
We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6.
Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains.
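
For readers unfamiliar with the parameter, the sketch below illustrates what sampling temperature usually means: the model's logits are divided by the temperature before the softmax, so lower values concentrate probability on the top token and higher values flatten the distribution. This is not the code used in the experiments. The function names, the example logits, and the handful of temperature values are all illustrative assumptions, and treating 0.0 as greedy (argmax) decoding is a common convention rather than something this repository specifies.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after scaling by the temperature."""
    if temperature <= 0.0:
        # Assumption: temperature 0.0 is treated as greedy decoding,
        # so all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Hypothetical logits for four candidate answer tokens.
example_logits = [2.0, 1.0, 0.5, 0.1]

# Illustrative values inside the 0.0 to 1.6 range studied in the paper.
example_temperatures = [0.0, 0.4, 0.8, 1.2, 1.6]

if __name__ == "__main__":
    rng = random.Random(0)
    for t in example_temperatures:
        probs = softmax_with_temperature(example_logits, t)
        token = sample_token(example_logits, t, rng)
        print(t, [round(p, 3) for p in probs], token)
```

As the temperature rises, the probability of the top token shrinks and the alternatives become more likely to be sampled.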
- Research paper
- Pre-print paper
- Research poster
- Presentation video
- Presentation slides
- [Research homepage](https://matthewrenze.com/research/the-effect-of-sampling-temperature-on-llms/)
- Source - contains all source code
- Models - contains the model-specific code
- Prompts - contains LLM agent prompt code
- Exams - contains the code to load exams
- Exams - contains the test dataset
- Results - contains the high-level test results
- Details - contains the low-level test results
- Responses - contains the LLM response text
- Logs - contains the experiment event logs
- Plots - contains all data visualizations
- Source - contains all scripts for experiments, processing, and analysis
- See Requirements.txt for a list of packages used in this experiment.
- GitHub Copilot was used in the creation of this experiment.