Commit 478522d (parent e22a1b7)

docs(metrics): update README with comprehensive quality assessment documentation

File tree: 6 files changed, +73 / -11 lines

README.md

Lines changed: 26 additions & 0 deletions
@@ -14,6 +14,7 @@ LieGraph is a multi-agent implementation of the popular social deduction game "W
 - **Natural Language Interaction:** Agents communicate and reason in natural language throughout the game
 - **Probabilistic Belief System:** Sophisticated belief tracking with self-belief confidence and suspicions matrix
 - **Strategic Reasoning:** Advanced bluff detection, alliance formation, and long-term planning
+- **Built-in Metrics:** Automatic quality tracking for win balance, identification accuracy, and speech diversity with JSON reports for prompt evaluation workflows

 ## 🚀 Quick Start

@@ -158,6 +159,31 @@ game:
   # ...
 ```

+## 📊 Metrics & Evaluation
+
+LieGraph ships with a lightweight metrics collector (`src/game/metrics.py`) that records quality indicators as games unfold:
+
+- **Win balance:** Civilian vs. spy win rates and a fairness score targeting 50/50 outcomes.
+- **Identification accuracy:** Tracks how confidently players identify their own roles and others over time.
+- **Speech diversity:** Measures lexical variety per speech turn to surface repetitive phrasing.
+
+Metrics are streamed to memory during play and automatically persisted when a game ends:
+
+- Per-game summaries: `logs/metrics/{game_id}.json`
+- Rolling aggregate + functional quality score: `logs/metrics/overall.json`
+
+You can also access the live collector from code:
+
+```python
+from src.game.metrics import metrics_collector
+
+audit = metrics_collector.get_overall_metrics()
+score = metrics_collector.compute_quality_score()  # deterministic
+# metrics_collector.compute_quality_score(method="llm", llm=client) for LLM-based review
+```
+
+These outputs are ready to feed into downstream prompt-evaluation or offline analysis pipelines.
+
 ## 🛠️ Development

 ### Project Structure
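The speech-diversity metric described in the README addition lends itself to a type-token-ratio style measure. A minimal sketch, assuming a naive whitespace tokenizer; the actual formula in `src/game/metrics.py` is not shown in this diff and may differ:

```python
def lexical_diversity(speech: str) -> float:
    """Type-token ratio: unique tokens / total tokens (hypothetical sketch)."""
    tokens = speech.lower().split()  # naive whitespace tokenization
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Repetitive phrasing scores low; varied phrasing scores high.
print(lexical_diversity("round round round round"))  # 0.25
print(lexical_diversity("it rolls and bounces"))     # 1.0
```

A per-turn score like this makes repetitive agents easy to spot in the aggregate report without any LLM call.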

README_zh.md

Lines changed: 27 additions & 1 deletion
@@ -15,6 +15,7 @@ LieGraph is an implementation of the popular social deduction game "Who Is the Spy" built on LangGraph
 - **Natural Language Interaction:** Agents communicate and reason in natural language throughout the game
 - **Probabilistic Belief System:** Sophisticated belief tracking with self-belief confidence and a suspicion matrix
 - **Strategic Reasoning:** Advanced bluff detection, alliance formation, and long-term planning
+- **Built-in Metrics:** Automatically tracks win balance, identification accuracy, and speech diversity, producing JSON reports for later prompt evaluation

 ## 🚀 Quick Start

@@ -32,6 +33,31 @@ LieGraph is an implementation of the popular social deduction game "Who Is the Spy" built on LangGraph
 touch .env
 ```

+## 📊 Metrics & Evaluation
+
+The project ships a lightweight metrics collector (`src/game/metrics.py`) that records the following indicators live during play:
+
+- **Win balance:** Tracks civilian vs. spy win rates and a fairness score targeting 50/50 outcomes.
+- **Identification accuracy:** Monitors how accurately players judge their own and others' identities over time.
+- **Speech diversity:** Computes per-turn lexical variety to surface repetitive or monotonous speech.
+
+When a game ends, metrics are automatically written to:
+
+- Per-game summary: `logs/metrics/{game_id}.json`
+- Overall aggregate plus functional score: `logs/metrics/overall.json`
+
+To access live data from code, call:
+
+```python
+from src.game.metrics import metrics_collector
+
+report = metrics_collector.get_overall_metrics()
+score = metrics_collector.compute_quality_score()  # functional scoring
+# metrics_collector.compute_quality_score(method="llm", llm=client) for an LLM-based review
+```
+
+These outputs can feed directly into downstream prompt-evaluation or offline analysis pipelines.
+
 **OpenAI configuration example:**
 ```
 LLM_PROVIDER=openai

@@ -203,4 +229,4 @@ python -m pytest tests/ -v

 ## 📄 License

-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
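The win-balance fairness score "targeting 50/50 outcomes" can be sketched as a distance from an even split. This is a hypothetical formula for illustration; the actual weighting in `src/game/metrics.py` is not visible in this diff:

```python
def fairness_score(civilian_wins: int, spy_wins: int) -> float:
    """1.0 at a perfect 50/50 split, 0.0 when one side sweeps (sketch)."""
    total = civilian_wins + spy_wins
    if total == 0:
        return 0.0  # no finished games yet
    civilian_rate = civilian_wins / total
    return 1.0 - abs(civilian_rate - 0.5) * 2.0

print(fairness_score(5, 5))   # 1.0  (balanced)
print(fairness_score(3, 1))   # 0.5  (civilian-skewed)
print(fairness_score(10, 0))  # 0.0  (one-sided)
```

A score like this is deterministic, which matches the "functional" scoring path mentioned alongside the LLM-based review.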

src/game/config.py

Lines changed: 4 additions & 3 deletions
@@ -22,10 +22,10 @@
 - Player name pool management
 """

-import random
+import os
 from typing import List, Tuple
+
 import yaml
-import os


 class GameConfig:
@@ -50,7 +50,8 @@ def _load_config(self) -> dict:
         # Return default configuration
         return self._get_default_config()

-    def _get_default_config(self) -> dict:
+    @staticmethod
+    def _get_default_config() -> dict:
         """Get default configuration."""
         return {
             "game": {
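The `_get_default_config` change above follows a common refactor: a method that touches no instance state becomes a `@staticmethod`, callable with or without an instance. A minimal illustration using a toy class (not the real `GameConfig`; the config values are placeholders):

```python
class ConfigDemo:
    """Toy stand-in for GameConfig, illustrating only the refactor."""

    def load(self) -> dict:
        # Instance callers keep working unchanged after the refactor.
        return self._default_config()

    @staticmethod
    def _default_config() -> dict:
        # No `self`: the defaults depend on no instance state, so tests
        # can call ConfigDemo._default_config() without constructing one.
        return {"game": {"num_players": 6}}  # placeholder values

print(ConfigDemo._default_config())  # callable without an instance
print(ConfigDemo().load())           # and via self, as before
```

The same reasoning applies to the `metrics.py` hunks later in this commit, which convert several pure helper methods the same way.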

src/game/llm_strategy.py

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@

 from trustcall import create_extractor

-from src.game.state import GameState, Speech, PlayerMindset, Suspicion, Vote, SelfBelief
+from src.game.state import GameState, Speech, PlayerMindset, Suspicion, SelfBelief
 from src.tools.llm import create_llm

 # Game rules are now managed by the configuration system

src/game/metrics.py

Lines changed: 14 additions & 5 deletions
@@ -333,7 +333,8 @@ def _summarize_game(self, game: Dict[str, Any]) -> Dict[str, Any]:
             "speech_diversity": speech_summary,
         }

-    def _trend(self, round_metrics: Dict[int, Dict[str, Optional[float]]], *, key: str):
+    @staticmethod
+    def _trend(round_metrics: Dict[int, Dict[str, Optional[float]]], *, key: str):
         if not round_metrics:
             return None

@@ -347,7 +348,8 @@ def _trend(self, round_metrics: Dict[int, Dict[str, Optional[float]]], *, key: s
             return None
         return last - first

-    def _summarize_speeches(self, speeches: List[SpeechRecord]) -> Dict[str, Any]:
+    @staticmethod
+    def _summarize_speeches(speeches: List[SpeechRecord]) -> Dict[str, Any]:
         if not speeches:
             return {
                 "average_diversity": 0.0,
@@ -471,7 +473,8 @@ def _aggregate_speech_metrics(self) -> Dict[str, Any]:
             "by_player": per_player,
         }

-    def _compute_functional_score(self, summary: Dict[str, Any]) -> Dict[str, float]:
+    @staticmethod
+    def _compute_functional_score(summary: Dict[str, Any]) -> Dict[str, float]:
         win_balance = summary.get("win_balance_score", 0.0)
         identification = summary.get("identification", {})
         speech = summary.get("speech_diversity", {})
@@ -501,7 +504,8 @@ def _compute_functional_score(self, summary: Dict[str, Any]) -> Dict[str, float]
             "speech_diversity": round(speech_component, 4),
         }

-    def _format_summary_for_llm(self, summary: Dict[str, Any]) -> Dict[str, Any]:
+    @staticmethod
+    def _format_summary_for_llm(summary: Dict[str, Any]) -> Dict[str, Any]:
         """Format metrics into an instruction for an LLM reviewer."""
         instructions = (
             "You are evaluating the quality of repeated 'Who Is Spy' games. "
@@ -524,7 +528,12 @@ def _persist_game_summary(self, summary: Dict[str, Any]) -> None:

     def _persist_overall_metrics(self) -> None:
         path = self._output_dir / "overall.json"
-        payload = self.get_overall_metrics()
+        summary = self.get_overall_metrics()
+        score = self._compute_functional_score(summary)
+        payload = {
+            "metrics": summary,
+            "quality_score": score,
+        }
         with path.open("w", encoding="utf-8") as fp:
             json.dump(payload, fp, ensure_ascii=False, indent=2)
src/tools/graph_viz.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 from __future__ import annotations

 from pathlib import Path
-from typing import Any, Protocol
+from typing import Protocol

 try:  # Imported lazily so pure-Python usage works without IPython.
     from IPython import get_ipython
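The `try:` block shown in context is the optional-dependency import pattern its comment describes: IPython is imported lazily so the module still loads in a pure-Python environment. A generic sketch of that pattern; the fallback branch here is an assumption for illustration, since this diff does not include the file's actual `except` clause:

```python
try:  # Optional dependency: degrade gracefully when IPython is absent.
    from IPython import get_ipython
except ImportError:
    def get_ipython():
        return None  # stand-in: behaves like "no interactive shell"

# Either way, callers can probe the runtime environment uniformly.
shell = get_ipython()
running_interactively = shell is not None
```

This keeps rich-display features available in notebooks without making IPython a hard requirement of the package.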
