This repository provides the code for the paper "Language Models Represent Beliefs of Self and Others". The paper shows that LLMs internally represent their own beliefs and those of other agents, and that manipulating these representations can significantly impact their Theory of Mind reasoning capabilities.

    conda create -n lm python=3.8 anaconda
    conda activate lm
    # Please install PyTorch (<2.4) according to your CUDA version.
    conda install pytorch==2.3.1 torchvision==0.18.1 torchaudio==2.3.1 pytorch-cuda=12.1 -c pytorch -c nvidia
    pip install -r requirements.txt

Then download the language models (e.g. Mistral-7B-Instruct-v0.2, deepseek-llm-7b-chat) to models/. You can also specify the file paths in lm_paths.json.
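
Before running the scripts, it can help to confirm which checkpoints are actually present under models/. The following is a minimal sketch, not part of the repository: `find_local_models` is a hypothetical helper, and the name-to-path mapping it prints is only an assumed layout. The real schema of lm_paths.json may differ, so compare the output with the file shipped in the repository before copying anything into it.

```python
import json
from pathlib import Path


def find_local_models(models_dir="models"):
    """Map each checkpoint folder name under models_dir to its path.

    Hypothetical helper. A folder counts as a checkpoint if it contains
    config.json, which Hugging Face-format checkpoints ship with.
    """
    root = Path(models_dir)
    if not root.is_dir():
        # Nothing downloaded yet, or the script was run from another directory.
        return {}
    return {
        child.name: child.as_posix()
        for child in sorted(root.iterdir())
        if child.is_dir() and (child / "config.json").is_file()
    }


if __name__ == "__main__":
    # Assumed name -> path layout; the actual lm_paths.json schema may differ.
    print(json.dumps(find_local_models(), indent=2))
```

Run from the repository root, this prints one entry per downloaded model folder. An empty result usually means the models were saved elsewhere, in which case their locations need to be given in lm_paths.json.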

    sh scripts/save_reps.sh 0_forward belief
    sh scripts/save_reps.sh 0_forward action
    sh scripts/save_reps.sh 0_backward belief

Binary:

    python probe.py --belief=protagonist --dynamic=0_forward --variable belief
    python probe.py --belief=oracle --dynamic=0_forward --variable belief
    python probe.py --belief=protagonist --dynamic=0_forward --variable action
    python probe.py --belief=oracle --dynamic=0_forward --variable action
    python probe.py --belief=protagonist --dynamic=0_backward --variable belief
    python probe.py --belief=oracle --dynamic=0_backward --variable belief

Multinomial:

    python probe_multinomial.py --dynamic=0_forward --variable belief
    python probe_multinomial.py --dynamic=0_forward --variable action
    python probe_multinomial.py --dynamic=0_backward --variable belief

    sh scripts/0_forward_belief.sh
    sh scripts/0_forward_action.sh
    sh scripts/0_backward_belief.sh

Intervention for the Forward Belief task:

    sh scripts/0_forward_belief_interv_oracle.sh
    sh scripts/0_forward_belief_interv_protagonist.sh
    sh scripts/0_forward_belief_interv_o0p1.sh

Cross-task intervention:

    sh scripts/cross_0_forward_belief_to_forward_action_interv_o0p1.sh
    sh scripts/cross_0_forward_belief_to_backward_belief_interv_o0p1.sh

Citation:

    @inproceedings{zhu2024language,
      title={Language Models Represent Beliefs of Self and Others},
      author={Zhu, Wentao and Zhang, Zhining and Wang, Yizhou},
      booktitle={Forty-first International Conference on Machine Learning},
      year={2024}
    }

