datawhalechina/happy-llm

Adjusted Score: 5.6
Raw Score: 5.6
Time Factor: 100%
Last Push: 2026-05-06
Stars: 32.1K
Language: Jupyter Notebook
Lines of Code: 33.0K
Files: (not shown)
Pattern Hits: 139
Scan Date: 2026-07-14
HC Hit Rate: 0.10

What These Metrics Mean

Adjusted Score: Primary synthetic code indicator. Raw score normalised per 1,000 lines of code and multiplied by the temporal discount factor. This is the definitive comparative metric — use it to rank repositories by AI authorship density.
Raw Score: The unmodified sum of all severity-weighted, context-multiplied pattern match scores before temporal discounting. Reflects the absolute signal strength independent of when the repository was last active.
Time Factor: The temporal discount multiplier (0–100%) applied to the raw score. Repositories last updated before ChatGPT's launch (Nov 2022) receive a 5% factor. Full signal is only assigned to repositories active in the post-adoption era (Jan 2024+).
Pattern Hits: Total count of individual pattern matches across all files and categories. A high hit count with a low score may indicate a very large codebase with isolated AI snippets; a low count with a high score indicates dense, concentrated AI signatures.
HC Hit Rate: High+Critical pattern hits per file, averaged across the repository. This orthogonal signal catches repositories where a few files are densely packed with high-severity AI tells — a strong indicator even when the normalised score appears moderate due to codebase size.
Lines of Code / Files: Total lines and files analysed. The scanner examines 94 file extensions. These denominators are used to normalise the score, enabling fair comparison between repositories of vastly different sizes.
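
As a rough check of the definitions above, the per-category points listed under Pattern Findings on this page sum to 184. The sketch below assumes the adjusted score is that sum divided by thousands of lines of code, then multiplied by the time factor. The function name `adjusted_score` is hypothetical, 33,000 is only an approximation of the rounded "33.0K" figure, and the scanner's own rounding rules are not stated on this page.

```python
def adjusted_score(raw_points: float, lines_of_code: int, time_factor: float) -> float:
    """Normalise points per 1,000 lines, then apply the temporal discount.

    Assumption: this mirrors the prose definition on the page; it is not
    taken from the scanner's source code.
    """
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return raw_points / (lines_of_code / 1000) * time_factor


# Points per category as listed under Pattern Findings.
category_points = [71, 26, 20, 18, 14, 10, 8, 8, 5, 2, 2]
raw_points = sum(category_points)

# 33.0K lines (approximated as 33,000) and a 100% time factor.
score = adjusted_score(raw_points, 33_000, 1.0)
print(raw_points, round(score, 1))
```

Under these assumptions the result rounds to 5.6, which agrees with the Adjusted Score shown. The Raw Score field also reads 5.6, which suggests the displayed raw figure may already be normalised per 1,000 lines.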

Score History

This chart maps the temporal evolution of the adjusted synthetic code score across successive scan runs. An upward trajectory indicates ongoing incorporation of AI-generated code or expanding LLM-assisted scaffolding; a stable or declining trajectory may reflect active human refactoring, code removal, or the adoption of stricter authorship policies. The dashed secondary line (right axis) independently tracks total raw pattern hit count, which can diverge from the normalised score when codebase size changes significantly between scans.

Severity Breakdown

Classifies detected patterns by their diagnostic confidence and structural impact. CRITICAL patterns (coefficient 10) represent definitive synthetic signatures — hallucinated imports, explicit LLM attribution metadata — virtually never produced by human authors. HIGH (5) indicates strong structural tells such as cross-file repetition or cross-linguistic idioms. MEDIUM (2) covers recognisable conversational padding and AI-specific vocabulary. LOW (1) captures subtle indicators like tautological comments and generic boilerplate that require density to carry independent signal.

CRITICAL 0 · HIGH 7 · MEDIUM 11 · LOW 121
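
The severity coefficients and counts above can be combined into a base weighted sum. This is a minimal sketch that applies only the coefficients named in this section (10, 5, 2, 1); it deliberately leaves out the context and density multipliers, so it is not expected to equal the points the page reports.

```python
# Coefficients as stated in the Severity Breakdown text.
SEVERITY_COEFFICIENT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}

# Hit counts per severity as shown for this repository.
severity_counts = {"CRITICAL": 0, "HIGH": 7, "MEDIUM": 11, "LOW": 121}

total_hits = sum(severity_counts.values())
base_points = sum(
    SEVERITY_COEFFICIENT[severity] * count
    for severity, count in severity_counts.items()
)
print(total_hits, base_points)
```

The hit total comes to 139, matching the Pattern Hits figure, and the base sum before any multipliers is 178.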

Directory Score Breakdown

This horizontal bar chart decomposes the repository's raw synthetic code score by top-level directory, allowing you to pinpoint precisely which modules or components carry the highest AI authorship density. Directories with disproportionately high scores relative to their size warrant targeted manual review: concentrated AI signatures often trace back to mass-generated configuration layers, auto-ported test suites, LLM-scaffolded boilerplate classes, or entire subsystems authored under heavy copilot assistance. Use this view to prioritise your human code-review effort.

Pattern Findings

The scanner identified 139 distinct pattern matches across 11 syntactic categories. Each entry below represents a discrete location in the source code where the engine recorded a statistically significant AI authorship indicator. Expand any category row to inspect the individual file paths, line numbers, code snippets, and the lexical context (CODE, COMMENT, or STRING) in which each match was detected.

Reading the findings table: The Severity column indicates the diagnostic confidence level (CRITICAL / HIGH / MEDIUM / LOW). The Context column identifies whether the match occurred inside executable code, an inline comment, or a string literal — comment-context matches receive a ×1.5 weight because LLMs systematically over-annotate. The ⚡ bolt icon marks clustered matches: three or more patterns within a 10-line window, each receiving an additional ×1.5 density multiplier as dense clusters constitute far stronger evidence of synthetic authorship than isolated hits.
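
The weighting rules in the paragraph above can be sketched as follows. Several details are assumptions rather than documented behaviour: that the multipliers compose multiplicatively, that a "10-line window" means the first and last match are fewer than 10 lines apart, and that clustering is evaluated per file. The names `Match`, `clustered_indices`, `match_score`, and `file_score` are hypothetical, and the example line numbers are invented for illustration.

```python
from dataclasses import dataclass

SEVERITY_COEFFICIENT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}
COMMENT_WEIGHT = 1.5        # comment-context multiplier from the text
CLUSTER_WEIGHT = 1.5        # density multiplier from the text
CLUSTER_MIN_MATCHES = 3     # "three or more patterns"
CLUSTER_WINDOW_LINES = 10   # "within a 10-line window" (interpretation assumed)


@dataclass(frozen=True)
class Match:
    severity: str
    line: int
    context: str  # "CODE", "COMMENT", or "STRING"


def clustered_indices(matches: list[Match]) -> set[int]:
    """Indices of matches lying in a window of >= 3 matches spanning < 10 lines."""
    order = sorted(range(len(matches)), key=lambda i: matches[i].line)
    flagged: set[int] = set()
    start = 0
    for end in range(len(order)):
        # Shrink the window until it spans fewer than CLUSTER_WINDOW_LINES lines.
        while matches[order[end]].line - matches[order[start]].line >= CLUSTER_WINDOW_LINES:
            start += 1
        if end - start + 1 >= CLUSTER_MIN_MATCHES:
            flagged.update(order[start:end + 1])
    return flagged


def match_score(match: Match, clustered: bool) -> float:
    """Severity coefficient times the optional comment and cluster multipliers."""
    score = float(SEVERITY_COEFFICIENT[match.severity])
    if match.context == "COMMENT":
        score *= COMMENT_WEIGHT
    if clustered:
        score *= CLUSTER_WEIGHT
    return score


def file_score(matches: list[Match]) -> float:
    flagged = clustered_indices(matches)
    return sum(match_score(m, i in flagged) for i, m in enumerate(matches))


# Invented example: three clustered MEDIUM comment matches plus one isolated LOW string match.
example = [
    Match("MEDIUM", 10, "COMMENT"),
    Match("MEDIUM", 12, "COMMENT"),
    Match("MEDIUM", 15, "COMMENT"),
    Match("LOW", 200, "STRING"),
]
print(sorted(clustered_indices(example)), file_score(example))
```

In this invented example each clustered comment match scores 2 × 1.5 × 1.5 = 4.5 and the isolated match scores 1. The per-category points shown on this page do not all follow from this simple composition, so the scanner likely applies further rules that are not described here.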

Unused Imports: 71 hits · 71 pts

Severity	File	Line	Context
LOW	docs/chapter6/code/download_dataset.py	2	CODE
LOW	docs/chapter6/code/download_dataset.py	3	CODE
LOW	docs/chapter6/code/pretrain.py	6	CODE
LOW	docs/chapter6/code/pretrain.py	12	CODE
LOW	docs/chapter6/code/pretrain.py	16	CODE
LOW	docs/chapter6/code/pretrain.py	17	CODE
LOW	docs/chapter6/code/pretrain.py	30	CODE
LOW	docs/chapter6/code/pretrain.py	31	CODE
LOW	docs/chapter6/code/finetune.py	6	CODE
LOW	docs/chapter6/code/finetune.py	11	CODE
LOW	docs/chapter6/code/finetune.py	12	CODE
LOW	docs/chapter6/code/finetune.py	13	CODE
LOW	docs/chapter6/code/finetune.py	19	CODE
LOW	docs/chapter6/code/finetune.py	21	CODE
LOW	docs/chapter6/code/finetune.py	23	CODE
LOW	docs/chapter6/code/finetune.py	23	CODE
LOW	docs/chapter6/code/finetune.py	33	CODE
LOW	docs/chapter6/code/finetune.py	34	CODE
LOW	docs/chapter7/Agent/demo.py	2	CODE
LOW	docs/chapter7/Agent/demo.py	2	CODE
LOW	docs/chapter7/Agent/demo.py	2	CODE
LOW	docs/chapter7/Agent/web_demo.py	3	CODE
LOW	docs/chapter7/Agent/web_demo.py	3	CODE
LOW	docs/chapter7/Agent/web_demo.py	3	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	5	CODE
LOW	docs/chapter7/Agent/src/core.py	7	CODE
LOW	docs/chapter7/Agent/src/utils.py	2	CODE
LOW	docs/chapter7/Agent/src/utils.py	3	CODE
LOW	docs/chapter7/RAG/VectorBase.py	12	CODE
LOW	docs/chapter7/RAG/VectorBase.py	12	CODE
LOW	docs/chapter7/RAG/VectorBase.py	12	CODE
LOW	docs/chapter7/RAG/VectorBase.py	12	CODE
LOW	docs/chapter7/RAG/VectorBase.py	14	CODE
LOW	docs/chapter7/RAG/LLM.py	11	CODE
LOW	docs/chapter7/RAG/LLM.py	11	CODE
LOW	docs/chapter7/RAG/LLM.py	11	CODE
LOW	docs/chapter7/RAG/LLM.py	11	CODE
LOW	docs/chapter7/RAG/Embeddings.py	12	CODE
LOW	docs/chapter7/RAG/Embeddings.py	13	CODE
LOW	docs/chapter7/RAG/Embeddings.py	13	CODE
LOW	docs/chapter7/RAG/Embeddings.py	13	CODE
LOW	docs/chapter7/RAG/Embeddings.py	13	CODE
LOW	docs/chapter7/RAG/utils.py	12	CODE
LOW	docs/chapter7/RAG/utils.py	12	CODE
LOW	docs/chapter7/RAG/utils.py	12	CODE
LOW	docs/chapter7/RAG/utils.py	12	CODE
LOW	docs/chapter7/RAG/utils.py	12	CODE
LOW	docs/chapter7/RAG/utils.py	17	CODE
LOW	docs/chapter5/code/ddp_pretrain.py	3	CODE
LOW	docs/chapter5/code/ddp_pretrain.py	8	CODE
LOW	docs/chapter5/code/deal_dataset.py	1	CODE
LOW	docs/chapter5/code/k_model.py	2	CODE
LOW	docs/chapter5/code/k_model.py	3	CODE
LOW	docs/chapter5/code/k_model.py	4	CODE
LOW	docs/chapter5/code/dataset.py	2	CODE
11 more matches not shown…
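
The page does not say how unused imports are detected. One common approach, sketched here with Python's standard `ast` module, is to collect every imported name and report those never referenced elsewhere in the file. This is an illustrative assumption, not the scanner's method; the helper name `unused_imports` is hypothetical, and the sketch ignores cases such as `__all__` re-exports, star imports, and names used only inside string annotations. Counting each imported name separately would also explain why several hits above share the same line number.

```python
import ast


def unused_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, name) pairs for imported names never referenced in the module."""
    tree = ast.parse(source)
    imported: dict[str, int] = {}  # bound name -> line of the import statement

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the top-level name "a".
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":
                    imported[alias.asname or alias.name] = node.lineno

    # Every bare identifier that appears anywhere in the module.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted((line, name) for name, line in imported.items() if name not in used)


sample = (
    "import os\n"
    "import sys\n"
    "from typing import List, Dict\n"
    "\n"
    "print(sys.argv)\n"
    "x: List[int] = []\n"
)
print(unused_imports(sample))
```

For the sample above, `os` and `Dict` are reported while `sys` and `List` are not, because both appear as identifiers later in the module.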

Structural Annotation Overuse: 19 hits · 26 pts

Severity	File	Line	Snippet	Context
LOW⚡	docs/chapter7/第七章大模型应用.md	107	#### Step 1: RAG流程介绍	COMMENT
LOW	docs/chapter7/第七章大模型应用.md	139	#### Step 2: 文档加载和切分	COMMENT
LOW	docs/chapter7/第七章大模型应用.md	237	#### Step 3: 向量化	COMMENT
LOW	docs/chapter7/第七章大模型应用.md	337	#### Step 4: 数据库与向量检索	COMMENT
LOW	docs/chapter7/第七章大模型应用.md	381	#### Step 5: 大模型模块	COMMENT
LOW	docs/chapter7/第七章大模型应用.md	440	#### Step 6: Tiny-RAG Demo	STRING
LOW⚡	docs/chapter7/第七章大模型应用.md	546	#### Step 1 : 初始化客户端和模型	STRING
LOW⚡	docs/chapter7/第七章大模型应用.md	565	#### Step 2: 定义工具函数	STRING
LOW	docs/chapter7/第七章大模型应用.md	640	#### Step 3: 构造 Agent 类	STRING
LOW	docs/chapter7/第七章大模型应用.md	738	#### Step 4: 运行 Agent	STRING
LOW	docs/chapter5/第五章动手搭建大模型.md	743	#### Step 1: 安装和导入依赖库	COMMENT
LOW	docs/chapter5/第五章动手搭建大模型.md	769	#### Step 2: 加载训练数据	COMMENT
LOW	docs/chapter5/第五章动手搭建大模型.md	793	#### Step 3: 创建配置文件	COMMENT
LOW	docs/chapter5/第五章动手搭建大模型.md	843	#### Step 4: 训练 BPE Tokenizer	COMMENT
LOW⚡	docs/chapter5/第五章动手搭建大模型.md	903	#### Step 5: 使用训练好的 Tokenizer	COMMENT
LOW	…r/s1-vllm-thinking-budget/output/output_1754208752.txt	1969	### Step 1: Understanding the Critical Points	COMMENT
LOW	…r/s1-vllm-thinking-budget/output/output_1754208752.txt	1995	### Step 2: Behavior of the Polynomial	COMMENT
LOW	…r/s1-vllm-thinking-budget/output/output_1754208752.txt	2009	### Step 3: Symmetry and Roots	COMMENT
LOW	…r/s1-vllm-thinking-budget/output/output_1754208752.txt	2029	### Step 4: Final Calculation	COMMENT

Excessive Try-Catch Wrapping: 12 hits · 20 pts

Severity	File	Line	Snippet	Context
MEDIUM	docs/chapter5/第五章动手搭建大模型.md	786	print(f"Error decoding JSON in line {line_num}")	CODE
LOW⚡	docs/chapter5/第五章动手搭建大模型.md	912	except Exception as e:	CODE
MEDIUM⚡	docs/chapter5/第五章动手搭建大模型.md	913	print(f"Error loading tokenizer: {e}")	CODE
MEDIUM	docs/chapter5/第五章动手搭建大模型.md	1094	print(f"Error decoding JSON in line {line_num}")	CODE
LOW	docs/chapter5/第五章动手搭建大模型.md	1199	except Exception as e:	CODE
MEDIUM	docs/chapter5/第五章动手搭建大模型.md	1200	print(f"Error loading tokenizer: {e}")	CODE
MEDIUM	docs/chapter5/code/train_tokenizer.py	27	print(f"Error decoding JSON in line {line_num}")	CODE
LOW	docs/chapter5/code/train_tokenizer.py	132	except Exception as e:	CODE
MEDIUM	docs/chapter5/code/train_tokenizer.py	133	print(f"Error loading tokenizer: {e}")	CODE
LOW	Extra-Chapter/CDDRS/readme.md	160	except Exception as e:	CODE
MEDIUM	Extra-Chapter/CDDRS/readme.md	161	print(f"Error reading {file_path}: {e}")	CODE
LOW	Extra-Chapter/generation-method/llm_generation.py	149	except Exception as e:	CODE

Magic Placeholder Names: 5 hits · 18 pts

Severity	File	Line	Snippet	Context
HIGH⚡	docs/chapter7/第七章大模型应用.md	555	api_key="YOUR_API_KEY", # 替换为你的 API Key	STRING
HIGH⚡	docs/chapter7/第七章大模型应用.md	563	> 注意: 你需要将 `YOUR_API_KEY` 替换为你从 [SiliconFlow](https://cloud.siliconflow.cn/i/ybUFvmqK) 或其他服务商获取的有效 API Key。	STRING
HIGH	docs/chapter7/第七章大模型应用.md	746	api_key="YOUR_API_KEY", # 替换为你的 API Key	STRING
HIGH	Extra-Chapter/CDDRS/readme.md	80	api_key="your-api-key-here",	CODE
HIGH	Extra-Chapter/CDDRS/readme.md	786	api_key='your-api-key',	STRING

Modern AI Meta-Vocabulary: 4 hits · 14 pts

Severity	File	Line	Snippet	Context
MEDIUM⚡	docs/chapter7/第七章大模型应用.md	93	## 7.2 RAG	COMMENT
MEDIUM⚡	docs/chapter7/第七章大模型应用.md	95	### 7.2.1 RAG 的基本原理	COMMENT
MEDIUM⚡	docs/chapter7/第七章大模型应用.md	103	### 7.2.2 搭建一个 RAG 框架	COMMENT
MEDIUM	docs/chapter7/第七章大模型应用.md	440	#### Step 6: Tiny-RAG Demo	STRING

Docstring Block Structure: 2 hits · 10 pts

Severity	File	Line	Snippet	Context
HIGH	docs/chapter7/第七章大模型应用.md	259	获取文本的嵌入向量表示 Args: text (str): 输入文本 model (str): 使用的模型名称 Returns:	STRING
HIGH	docs/chapter7/RAG/Embeddings.py	36	获取文本的嵌入向量表示 Args: text (str): 输入文本 model (str): 使用的模型名称 Returns:	STRING

AI Structural Patterns: 8 hits · 8 pts

Severity	File	Line	Context
LOW	docs/chapter6/code/pretrain.py	180	CODE
LOW	docs/chapter5/code/k_model.py	16	CODE
LOW	docs/chapter5/code/k_model.py	248	CODE
LOW	docs/chapter5/code/k_model.py	307	CODE
LOW	docs/chapter5/code/k_model.py	677	CODE
LOW	docs/chapter2/code/transformer.py	98	CODE
LOW	docs/chapter2/code/transformer.py	151	CODE
LOW	docs/chapter2/code/transformer.py	192	CODE

Hyper-Verbose Identifiers: 9 hits · 8 pts

Severity	File	Line	Snippet	Context
LOW	docs/chapter5/code/k_model.py	378	def _left_pad_by_attention_mask(	CODE
LOW	Extra-Chapter/CDDRS/readme.md	246	def _compute_semantic_discrepancy(self, embeddings: np.ndarray) -> List[float]:	CODE
LOW	Extra-Chapter/CDDRS/readme.md	283	def _enforce_length_constraints(self, chunks: List[str]) -> List[str]:	CODE
LOW	Extra-Chapter/CDDRS/readme.md	423	def compute_document_length_factor(chunk_length: int, avg_length: int = 100) -> float:	STRING
LOW	Extra-Chapter/CDDRS/readme.md	437	def compute_term_significance(term_freq: int, doc_length_factor: float) -> float:	STRING
LOW	Extra-Chapter/CDDRS/readme.md	559	def _compute_knowledge_scores(self, key_info: Dict[str, Tuple[str, float]]) -> List[float]:	STRING
LOW	Extra-Chapter/text-data-processing/readme.md	932	def test_simple_bpe_tokenizer():	CODE
LOW	Extra-Chapter/s1-vllm-thinking-budget/s1.py	28	def run_thinking_budget_sample(llm_model, tokenizer, user_input, thinking_budget):	CODE
LOW	Extra-Chapter/s1-vllm-thinking-budget/readme.md	41	def run_thinking_budget_sample(llm_model, tokenizer, user_input, thinking_budget):	CODE

Deep Nesting: 5 hits · 5 pts

Severity	File	Line	Context
LOW	docs/chapter6/code/finetune.py	87	CODE
LOW	docs/chapter7/RAG/utils.py	34	CODE
LOW	docs/chapter7/RAG/utils.py	61	CODE
LOW	docs/chapter5/code/dataset.py	65	CODE
LOW	docs/chapter5/code/train_tokenizer.py	17	CODE

Modern Structural Boilerplate: 2 hits · 2 pts

Severity	File	Line	Snippet	Context
LOW	docs/chapter6/code/pretrain.py	36	logger = logging.getLogger(__name__)	CODE
LOW	docs/chapter6/code/finetune.py	40	logger = logging.getLogger(__name__)	CODE

Over-Commented Block: 2 hits · 2 pts

Severity	File	Line	Snippet	Context
LOW	docs/chapter5/code/k_model.py	61	def forward(self, x):	COMMENT
LOW	docs/chapter2/第二章 Transformer架构.md	301	# 注意力计算	COMMENT
