EleutherAI/lm-evaluation-harness

Adjusted Score: 12.5
Raw Score: 12.5
Time Factor: 100%
Last Push: 2026-07-13
Stars: 13.3K
Language: Python
Lines of Code: 292.3K
Files: 15.3K
Pattern Hits: 1.9K
Scan Date: 2026-07-14
HC Hit Rate: 0.02

What These Metrics Mean

Adjusted Score: Primary synthetic code indicator. Raw score normalised per 1,000 lines of code and multiplied by the temporal discount factor. This is the definitive comparative metric — use it to rank repositories by AI authorship density.
Raw Score: The unmodified sum of all severity-weighted, context-multiplied pattern match scores before temporal discounting. Reflects the absolute signal strength independent of when the repository was last active.
Time Factor: The temporal discount multiplier (0–100%) applied to the raw score. Repositories last updated before ChatGPT's launch (Nov 2022) receive a 5% factor. Full signal is only assigned to repositories active in the post-adoption era (Jan 2024+).
Pattern Hits: Total count of individual pattern matches across all files and categories. A high hit count with a low score may indicate a very large codebase with isolated AI snippets; a low count with a high score indicates dense, concentrated AI signatures.
HC Hit Rate: High+Critical pattern hits per file, averaged across the repository. This orthogonal signal catches repositories where a few files are densely packed with high-severity AI tells — a strong indicator even when the normalised score appears moderate due to codebase size.
Lines of Code / Files: Total lines and files analysed. The scanner examines 94 file extensions. These denominators are used to normalise the score, enabling fair comparison between repositories of vastly different sizes.
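
As a rough illustration of how these definitions fit together, the sketch below normalises a point total per 1,000 lines of code, applies the time factor, and computes the HC Hit Rate. The function names are hypothetical, and the linear ramp between Nov 2022 and Jan 2024 is an assumption: the definitions above only fix the two endpoints (5% and 100%).

```python
from datetime import date

CHATGPT_LAUNCH = date(2022, 11, 30)   # "Nov 2022" in the definitions above
FULL_SIGNAL_FROM = date(2024, 1, 1)   # "Jan 2024+" in the definitions above


def time_factor(last_push: date) -> float:
    """Temporal discount: 0.05 before the launch, 1.0 from Jan 2024 onward."""
    if last_push < CHATGPT_LAUNCH:
        return 0.05
    if last_push >= FULL_SIGNAL_FROM:
        return 1.0
    # Assumption: linear ramp in between; the report does not describe this period.
    elapsed = (last_push - CHATGPT_LAUNCH).days
    span = (FULL_SIGNAL_FROM - CHATGPT_LAUNCH).days
    return 0.05 + 0.95 * elapsed / span


def adjusted_score(weighted_points: float, lines_of_code: int, factor: float) -> float:
    """Weighted points per 1,000 lines of code, scaled by the time factor."""
    return weighted_points / (lines_of_code / 1000) * factor


def hc_hit_rate(high_and_critical_hits: int, files: int) -> float:
    """High+Critical hits per analysed file."""
    return high_and_critical_hits / files


print(time_factor(date(2026, 7, 13)))        # last push date from this report
print(round(hc_hit_rate(350, 15_300), 2))    # 350 HIGH + 0 CRITICAL hits, ~15.3K files
print(adjusted_score(3000, 300_000, 1.0))    # hypothetical point total and size
```

With this report's figures, a last push of 2026-07-13 falls in the full-signal period, and 350 High+Critical hits over roughly 15.3K files round to the 0.02 HC Hit Rate shown above.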

Score History

This chart maps the temporal evolution of the adjusted synthetic code score across successive scan runs. An upward trajectory indicates ongoing incorporation of AI-generated code or expanding LLM-assisted scaffolding; a stable or declining trajectory may reflect active human refactoring, code removal, or the adoption of stricter authorship policies. The dashed secondary line (right axis) independently tracks total raw pattern hit count, which can diverge from the normalised score when codebase size changes significantly between scans.

Severity Breakdown

Classifies detected patterns by their diagnostic confidence and structural impact. CRITICAL patterns (coefficient 10) represent definitive synthetic signatures — hallucinated imports, explicit LLM attribution metadata — virtually never produced by human authors. HIGH (5) indicates strong structural tells such as cross-file repetition or cross-linguistic idioms. MEDIUM (2) covers recognisable conversational padding and AI-specific vocabulary. LOW (1) captures subtle indicators like tautological comments and generic boilerplate that require density to carry independent signal.

CRITICAL 0 · HIGH 350 · MEDIUM 167 · LOW 1379
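
To make the coefficients concrete, the following sketch applies them to the severity counts in this report. It yields only a base total before any context or density multipliers, so it is not expected to match the reported score; the variable names are illustrative.

```python
# Coefficients as stated in the Severity Breakdown description.
SEVERITY_COEFFICIENT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}

# Hit counts from this report's severity breakdown.
hit_counts = {"CRITICAL": 0, "HIGH": 350, "MEDIUM": 167, "LOW": 1379}

total_hits = sum(hit_counts.values())
base_points = sum(SEVERITY_COEFFICIENT[sev] * n for sev, n in hit_counts.items())

print(total_hits)   # 1896
print(base_points)  # 3463
```

The hit total of 1896 agrees with the number of pattern matches reported in the Pattern Findings section.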

Directory Score Breakdown

This horizontal bar chart decomposes the repository's raw synthetic code score by top-level directory, allowing you to pinpoint precisely which modules or components carry the highest AI authorship density. Directories with disproportionately high scores relative to their size warrant targeted manual review: concentrated AI signatures often trace back to mass-generated configuration layers, auto-ported test suites, LLM-scaffolded boilerplate classes, or entire subsystems authored under heavy copilot assistance. Use this view to prioritise your human code-review effort.
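
A per-directory breakdown of this kind can be approximated by grouping scored hits on the first path component. This is a minimal sketch, assuming each hit is available as a `(path, points)` pair; the function name and the `(root)` bucket for files outside any directory are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def score_by_top_level_dir(scored_hits):
    """Sum hit points per top-level directory, highest total first."""
    totals = defaultdict(float)
    for path, points in scored_hits:
        parts = PurePosixPath(path).parts
        # Files that sit directly in the repository root have no directory part.
        top = parts[0] if len(parts) > 1 else "(root)"
        totals[top] += points
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Illustrative points only; the paths are taken from the findings tables.
example = [
    ("lm_eval/api/registry.py", 3.0),
    ("tests/test_samplers.py", 4.5),
    ("lm_eval/utils.py", 1.0),
    ("setup.py", 1.0),
]
print(score_by_top_level_dir(example))
```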

Pattern Findings

The scanner identified 1896 distinct pattern matches across 21 syntactic categories. Each entry below represents a discrete location in the source code where the engine recorded a statistically significant AI authorship indicator. Expand any category row to inspect the individual file paths, line numbers, code snippets, and the lexical context (CODE, COMMENT, or STRING) in which each match was detected.

Reading the findings table: The Severity column indicates the diagnostic confidence level (CRITICAL / HIGH / MEDIUM / LOW). The Context column identifies whether the match occurred inside executable code, an inline comment, or a string literal; comment-context matches receive a ×1.5 weight because LLMs systematically over-annotate. The ⚡ bolt icon marks clustered matches, meaning three or more patterns within a 10-line window. Each clustered match receives an additional ×1.5 density multiplier, because dense clusters constitute far stronger evidence of synthetic authorship than isolated hits.
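
The weighting rules above can be sketched as follows. This is an illustration under assumptions, not the scanner's code: the `Hit` type and function names are hypothetical, a "10-line window" is read as hits in the same file whose line numbers differ by less than 10, and string- and code-context hits are assumed to carry no extra weight.

```python
from dataclasses import dataclass

SEVERITY_COEFFICIENT = {"CRITICAL": 10, "HIGH": 5, "MEDIUM": 2, "LOW": 1}
COMMENT_MULTIPLIER = 1.5
CLUSTER_MULTIPLIER = 1.5
CLUSTER_WINDOW = 10     # lines
CLUSTER_MIN_HITS = 3


@dataclass(frozen=True)
class Hit:
    path: str
    line: int
    severity: str
    context: str  # "CODE", "COMMENT" or "STRING"


def clustered_indices(hits):
    """Indices of hits that share a 10-line window with at least two others."""
    by_file = {}
    for idx, hit in enumerate(hits):
        by_file.setdefault(hit.path, []).append(idx)
    clustered = set()
    for indices in by_file.values():
        indices.sort(key=lambda i: hits[i].line)
        start = 0
        for end in range(len(indices)):
            # Shrink the window until it spans fewer than CLUSTER_WINDOW lines.
            while hits[indices[end]].line - hits[indices[start]].line >= CLUSTER_WINDOW:
                start += 1
            if end - start + 1 >= CLUSTER_MIN_HITS:
                clustered.update(indices[start:end + 1])
    return clustered


def score_hits(hits):
    """Total points: severity coefficient times context and density multipliers."""
    clustered = clustered_indices(hits)
    total = 0.0
    for idx, hit in enumerate(hits):
        points = SEVERITY_COEFFICIENT[hit.severity]
        if hit.context == "COMMENT":
            points *= COMMENT_MULTIPLIER
        if idx in clustered:
            points *= CLUSTER_MULTIPLIER
        total += points
    return total


demo = [
    Hit("a.py", 10, "MEDIUM", "COMMENT"),
    Hit("a.py", 12, "MEDIUM", "COMMENT"),
    Hit("a.py", 15, "LOW", "CODE"),
    Hit("a.py", 100, "HIGH", "STRING"),
]
print(score_hits(demo))  # 4.5 + 4.5 + 1.5 + 5 = 15.5
```

Under the same assumptions, 337 unclustered HIGH hits in string context score 337 × 5 = 1685 points.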

Cross-File Repetition: 337 hits · 1685 pts

Severity	File	Line	Snippet	Context
HIGH	lm_eval/filters/selection.py	0	assuming each entry of `resps` is a list of model responses, we discard all but the first response.	STRING
HIGH	…val/tasks/darija_bench/darija_transliteration/utils.py	0	assuming each entry of `resps` is a list of model responses, we discard all but the first response.	STRING
HIGH	lm_eval/tasks/darija_bench/darija_translation/utils.py	0	assuming each entry of `resps` is a list of model responses, we discard all but the first response.	STRING
HIGH	…_eval/tasks/darija_bench/darija_summarization/utils.py	0	assuming each entry of `resps` is a list of model responses, we discard all but the first response.	STRING
HIGH	lm_eval/tasks/super_glue/record/t5_utils.py	0	lower text and remove punctuation, extra whitespace.	STRING
HIGH	lm_eval/tasks/french_bench/utils.py	0	lower text and remove punctuation, extra whitespace.	STRING
HIGH	lm_eval/tasks/longbench/metrics.py	0	lower text and remove punctuation, extra whitespace.	STRING
HIGH	lm_eval/tasks/mlqa/utils.py	0	lower text and remove punctuation, extra whitespace.	STRING
HIGH	lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/truthfulqa-multi/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/truthfulqa/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/noreval/norsumm/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/galician_bench/utils.py	0	returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer	STRING
HIGH	lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/truthfulqa-multi/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/truthfulqa/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/noreval/norsumm/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/galician_bench/utils.py	0	returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe	STRING
HIGH	lm_eval/tasks/darijammlu/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/mmlusr/config.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/e2lmc/noor/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/egymmlu/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/arab_culture/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	…val/tasks/arab_culture_completion/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/mmlu/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/tmmluplus/default/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/arabicmmlu/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/tmlu/default/_generate_configs.py	0	take in a yaml, and output all "other" splits with this yaml	STRING
HIGH	lm_eval/tasks/mgsm/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrimmlu/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/paws-x/_generate_config.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/translation/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/xnli/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/openai_mmlu/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/adr/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/mafand/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/naijarc/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/belebele/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/injongointent/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/xlsum/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/afrisenti/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/masakhapos/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/ntrex/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/flores/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/salt/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/masakhanews/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/uhura-arc-easy/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/sib/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/afriqa/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrobench/masakhaner/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrixnli/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrixnli/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/xwinograd/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrimgsm/gen_utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/afrimgsm/utils.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
HIGH	lm_eval/tasks/eus_exams/configs.py	0	generate a yaml file for each configuage. :param output_dir: the directory to output the files to. :param overwrite: whe	STRING
277 more matches not shown…
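
The snippets in this table are lowercased strings that read like docstrings recurring across task directories. One way such repetition could be detected is sketched below. It is an assumption about the approach, not the scanner's implementation: the function names, the whitespace and case normalisation, and the `min_files=3` threshold are all hypothetical.

```python
import ast
from collections import defaultdict


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def repeated_docstrings(sources: dict, min_files: int = 3) -> dict:
    """Map each normalised docstring to the files containing it, if seen in enough files."""
    seen = defaultdict(set)
    documented = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    for path, source in sources.items():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, documented):
                doc = ast.get_docstring(node)
                if doc:
                    seen[normalise(doc)].add(path)
    return {doc: sorted(paths) for doc, paths in seen.items() if len(paths) >= min_files}


# Toy sources; the file names are placeholders, not paths from the repository.
TEMPLATE = 'def normalize_answer(s):\n    """{doc}"""\n    return s\n'
sources = {
    "task_a/utils.py": TEMPLATE.format(doc="Lower text and remove punctuation, extra whitespace."),
    "task_b/utils.py": TEMPLATE.format(doc="Lower text and remove punctuation, extra whitespace."),
    "task_c/utils.py": TEMPLATE.format(doc="Lower text and remove punctuation,  extra whitespace."),
    "task_d/utils.py": TEMPLATE.format(doc="Something unrelated."),
}
print(repeated_docstrings(sources))
```

A check like this cannot distinguish generated text from helper functions that were copied by hand between task directories, so flagged groups still need manual review.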

Hyper-Verbose Identifiers: 660 hits · 582 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/evaluator_utils.py	173	def _compute_task_aggregations(	CODE
LOW	lm_eval/evaluator_utils.py	319	def _collect_groups_bottom_up(groups: dict[str, Group]) -> list[Group]:	CODE
LOW	lm_eval/evaluator_utils.py	404	def _propagate_higher_is_better(	CODE
LOW	lm_eval/utils.py	47	def is_transformers_available() -> bool:	CODE
LOW	lm_eval/utils.py	324	def get_sample_results_filenames(filenames: list[str]) -> list[str]:	CODE
LOW	lm_eval/utils.py	331	def get_rolling_token_windows(	CODE
LOW	lm_eval/utils.py	844	def check_remote_tokenizer_support(	CODE
LOW	lm_eval/tasks/__init__.py	36	def get_task_name_from_config(task_config: dict[str, str]) -> str:	CODE
LOW	lm_eval/tasks/__init__.py	50	def get_task_name_from_object(task_object):	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	207	def generate_optimal_plans_for_problem_state(P, state, num_plans, timeout):	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	330	def create_tmp_dom_prob_replace_init(P, state, result_domain_file, result_problem_file):	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	671	def str_remove_before_first_parentheses(s):	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	680	def str_remove_after_last_parentheses(s):	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	207	def generate_optimal_plans_for_problem_state(P, state, num_plans, timeout):	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	330	def create_tmp_dom_prob_replace_init(P, state, result_domain_file, result_problem_file):	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	671	def str_remove_before_first_parentheses(s):	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	680	def str_remove_after_last_parentheses(s):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	35	def test_normalize_comma_only_between_digits(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	58	def test_extract_answer_multiline_with_answer(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	100	def test_exact_numerical_match(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	115	def test_non_numeric_fallback(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	123	def test_same_unit_different_values(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	145	def test_missing_optional_fields(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	163	def test_exact_and_numerical_match(self):	CODE
LOW	lm_eval/tasks/jfinqa/test_jfinqa_utils.py	169	def test_numerical_match_only(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	133	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	170	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	243	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	293	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	340	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	379	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	427	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	476	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	550	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	600	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	660	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	721	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	784	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	853	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	911	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	939	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1017	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1108	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1153	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1208	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1241	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1295	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1331	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1356	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1432	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1460	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1492	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1523	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1580	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	1612	def get_instruction_args_keys(self):	CODE
LOW	lm_eval/tasks/ifeval/utils.py	24	def test_instruction_following_strict(	CODE
LOW	lm_eval/tasks/ifeval/utils.py	57	def test_instruction_following_loose(	CODE
LOW	lm_eval/tasks/ifeval/multilingual/utils.py	23	def test_instruction_following_strict(	CODE
LOW	lm_eval/tasks/ifeval/multilingual/utils.py	56	def test_instruction_following_loose(	CODE
LOW	…ks/ifeval/multilingual/instructions/es_instructions.py	104	def get_instruction_args_keys(self):	CODE
600 more matches not shown…
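
Every function name listed in this table is at least 25 characters long. The report does not state the rule, so the sketch below simply assumes a length threshold of 25 to show how such names could be collected; the function name and the threshold are hypothetical.

```python
import ast


def long_function_names(source: str, min_length: int = 25) -> list:
    """Return (line, name) for each function whose name has at least min_length characters."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if len(node.name) >= min_length:
                found.append((node.lineno, node.name))
    return sorted(found)


sample = (
    "def is_transformers_available():\n"
    "    return True\n"
    "\n"
    "def load():\n"
    "    return None\n"
)
print(long_function_names(sample))
```

A purely length-based rule also flags descriptive test names such as `test_exact_numerical_match`, which is a common human convention.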

Decorative Section Separators: 98 hits · 358 pts

Severity	File	Line	Snippet	Context
MEDIUM	lm_eval/tasks/cruxeval/utils.py	223	# ============================================================================	COMMENT
MEDIUM	lm_eval/tasks/cruxeval/utils.py	225	# ============================================================================	COMMENT
MEDIUM⚡	lm_eval/tasks/cruxeval/utils.py	241	# ============================================================================	COMMENT
MEDIUM⚡	lm_eval/tasks/cruxeval/utils.py	243	# ============================================================================	COMMENT
MEDIUM⚡	lm_eval/tasks/cruxeval/utils.py	425	# ============================================================================	COMMENT
MEDIUM⚡	lm_eval/tasks/cruxeval/utils.py	427	# ============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	414	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	416	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	460	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	462	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	520	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	522	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	570	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/registry.py	572	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/group.py	327	# =============================================================================	COMMENT
MEDIUM	lm_eval/api/group.py	329	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	13	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	15	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	24	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	26	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	58	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	60	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	112	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	114	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	717	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_fewshot_context.py	719	# =============================================================================	COMMENT
MEDIUM	tests/test_fewshot_context.py	180	# =============================================================================	COMMENT
MEDIUM	tests/test_fewshot_context.py	182	# =============================================================================	COMMENT
MEDIUM	tests/test_fewshot_context.py	455	# =============================================================================	COMMENT
MEDIUM	tests/test_fewshot_context.py	457	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	12	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	14	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	359	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	361	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	921	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	923	# =============================================================================	COMMENT
MEDIUM	tests/test_task_manager.py	84	# =============================================================================	STRING
MEDIUM	tests/test_task_manager.py	86	# =============================================================================	STRING
MEDIUM	tests/test_task_manager.py	205	# =============================================================================	STRING
MEDIUM	tests/test_task_manager.py	207	# =============================================================================	STRING
MEDIUM	tests/test_task_manager.py	777	# =============================================================================	STRING
MEDIUM	tests/test_task_manager.py	779	# =============================================================================	STRING
MEDIUM⚡	tests/test_samplers.py	38	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	40	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	213	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	215	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	268	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	270	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	304	# =============================================================================	COMMENT
MEDIUM⚡	tests/test_samplers.py	306	# =============================================================================	COMMENT
MEDIUM	tests/test_samplers.py	15	# =============================================================================	COMMENT
MEDIUM	tests/test_samplers.py	17	# =============================================================================	COMMENT
MEDIUM	tests/test_aggregation_pipeline.py	26	# ---------------------------------------------------------------------------	COMMENT
MEDIUM	tests/test_aggregation_pipeline.py	28	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_aggregation_pipeline.py	107	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_aggregation_pipeline.py	109	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_evaluator_utils.py	115	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_evaluator_utils.py	117	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_evaluator_utils.py	139	# ---------------------------------------------------------------------------	COMMENT
MEDIUM⚡	tests/test_evaluator_utils.py	141	# ---------------------------------------------------------------------------	COMMENT
38 more matches not shown…
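
Separator lines like those in this table can be matched with a simple regular expression. The sketch below is one possible pattern, not the scanner's rule; the set of separator characters and the minimum run of 20 are assumptions.

```python
import re

# A comment made up only of one repeated separator character (at least 20 times).
SEPARATOR = re.compile(r"^\s*#\s*([=\-*#~_])\1{19,}\s*$")


def is_separator(line: str) -> bool:
    """True if the line is a comment consisting solely of a long separator run."""
    return SEPARATOR.match(line) is not None


print(is_separator("# " + "=" * 77))   # True
print(is_separator("# " + "-" * 75))   # True
print(is_separator("# --- helpers"))   # False
```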

Excessive Try-Catch Wrapping: 164 hits · 202 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/tasks/_index.py	63	except Exception as err:	CODE
LOW	lm_eval/tasks/realtoxicityprompts/metric.py	36	except Exception:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	123	except Exception as e:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	148	except Exception:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	325	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	676	except Exception:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	1058	except Exception as e:	CODE
MEDIUM	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	1053	def parse_prediction(prediction):	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	123	except Exception as e:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	148	except Exception:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	325	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	676	except Exception:	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	1058	except Exception as e:	CODE
MEDIUM	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	1053	def parse_prediction(prediction):	CODE
LOW	lm_eval/tasks/slr_bench/lm_eval_slr_bench.py	14	except Exception as e:	CODE
LOW	lm_eval/tasks/slr_bench/lm_eval_slr_bench.py	59	except Exception as e:	CODE
MEDIUM	lm_eval/tasks/slr_bench/lm_eval_slr_bench.py	60	print(f"Error in process_results: {e}")	CODE
LOW	lm_eval/tasks/humaneval/utils.py	9	except Exception as e:	CODE
LOW	lm_eval/tasks/aime/utils.py	49	except Exception:	CODE
LOW	lm_eval/tasks/hendrycks_math/utils.py	49	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	32	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	70	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	84	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	158	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	189	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	32	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	70	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	84	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	158	except Exception:	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	189	except Exception:	CODE
LOW	lm_eval/tasks/humaneval_infilling/utils.py	9	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/medtext/utils.py	18	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/medtext/utils.py	27	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/medtext/utils.py	33	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/medtext/utils.py	39	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/medtext/utils.py	47	except Exception as e:	CODE
MEDIUM	lm_eval/tasks/medtext/utils.py	24	def doc_eval(pred, refs):	CODE
LOW⚡	lm_eval/tasks/olaph/utils.py	19	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/olaph/utils.py	28	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/olaph/utils.py	34	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/olaph/utils.py	40	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/olaph/utils.py	48	except Exception as e:	CODE
MEDIUM	lm_eval/tasks/olaph/utils.py	25	def doc_eval(pred, refs):	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	197	except Exception as e:	CODE
LOW	lm_eval/tasks/leaderboard/math/utils.py	209	except Exception as e:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	474	except Exception:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	494	except Exception:	CODE
LOW⚡	lm_eval/tasks/toksuite/utils.py	518	except Exception:	CODE
LOW⚡	lm_eval/tasks/toksuite/utils.py	522	except Exception:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	555	except Exception:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	578	except Exception:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	601	except Exception:	CODE
LOW	lm_eval/tasks/toksuite/utils.py	605	except Exception:	CODE
LOW	lm_eval/tasks/meqsum/utils.py	18	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/meqsum/utils.py	52	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/meqsum/utils.py	58	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/meqsum/utils.py	64	except Exception as e:	CODE
LOW⚡	lm_eval/tasks/meqsum/utils.py	72	except Exception as e:	CODE
LOW	lm_eval/tasks/med_prescriptions/utils.py	2060	except Exception:	CODE
LOW	lm_eval/tasks/med_prescriptions/utils.py	2066	except Exception:	CODE
104 more matches not shown…
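
The rows above all flag broad `except Exception` handlers. As a rough illustration of how a match of this kind could be found, the sketch below walks a Python syntax tree with the standard `ast` module and reports handlers that catch `Exception` or use a bare `except`. The function name `find_broad_excepts` and the sample source are hypothetical; the scanner's real implementation is not shown in this report and may differ.

```python
import ast


def find_broad_excepts(source: str) -> list[int]:
    """Return line numbers of handlers that catch Exception or everything.

    Hypothetical helper for illustration only.
    """
    lines = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has no type; `except Exception` has a Name node.
        if node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id == "Exception"
        ):
            lines.append(node.lineno)
    return sorted(lines)


# Made-up sample: one narrow handler and one broad handler.
SAMPLE = '''
try:
    value = float("1.5")
except ValueError:
    value = None
try:
    risky()
except Exception as e:
    pass
'''

print(find_broad_excepts(SAMPLE))
```

The sample is only parsed, never executed, so the undefined `risky()` call is harmless. Handlers that catch a narrower type such as `ValueError` are not reported.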

Unused Imports: 181 hits · 178 pts

Severity	File	Line	Context
LOW	lm_eval/evaluator_utils.py	1	CODE
LOW	lm_eval/__init__.py	2	CODE
LOW	lm_eval/evaluator.py	1	CODE
LOW	lm_eval/filters/__init__.py	1	CODE
LOW	lm_eval/filters/__init__.py	6	CODE
LOW	lm_eval/filters/__init__.py	8	CODE
LOW	lm_eval/filters/__init__.py	8	CODE
LOW	lm_eval/filters/__init__.py	8	CODE
LOW	lm_eval/filters/__init__.py	8	CODE
LOW	lm_eval/filters/extraction.py	1	CODE
LOW	lm_eval/tasks/_index.py	1	CODE
LOW	lm_eval/tasks/_factory.py	1	CODE
LOW	lm_eval/tasks/_yaml_loader.py	1	CODE
LOW	lm_eval/tasks/manager.py	1	CODE
LOW	lm_eval/tasks/manager.py	20	CODE
LOW	lm_eval/tasks/babilong/common_utils.py	11	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	3	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	4	CODE
LOW	lm_eval/tasks/jfinqa/utils.py	12	CODE
LOW	lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py	2	CODE
LOW	lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py	8	CODE
LOW	lm_eval/tasks/aime/utils.py	1	CODE
LOW	lm_eval/tasks/noreval/norsumm/utils.py	1	CODE
LOW	lm_eval/tasks/noreval/norsumm/utils.py	7	CODE
LOW	lm_eval/tasks/spanish_bench/utils.py	2	CODE
LOW	lm_eval/tasks/spanish_bench/utils.py	5	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	5	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	5	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	14	CODE
LOW	lm_eval/tasks/darija_bench/darija_sentiment/utils.py	1	CODE
LOW	lm_eval/tasks/darija_bench/darija_sentiment/utils.py	2	CODE
LOW	…_eval/tasks/darija_bench/darija_summarization/utils.py	1	CODE
LOW	lm_eval/tasks/longbench/utils.py	1	CODE
LOW	lm_eval/tasks/longbench/utils.py	2	CODE
LOW	lm_eval/tasks/longbench/utils.py	3	CODE
LOW	lm_eval/tasks/leaderboard/gpqa/utils.py	1	CODE
LOW	lm_eval/tasks/xquad/utils.py	1	CODE
LOW	lm_eval/tasks/xquad/utils.py	2	CODE
LOW	lm_eval/tasks/xquad/utils.py	4	CODE
LOW	lm_eval/tasks/xquad/utils.py	7	CODE
LOW	lm_eval/tasks/cnn_dailymail/utils.py	6	CODE
LOW	lm_eval/tasks/score/utils.py	21	CODE
LOW	lm_eval/tasks/score/utils.py	25	CODE
LOW	lm_eval/tasks/ruler/vt_utils.py	30	CODE
LOW	lm_eval/tasks/ruler/vt_utils.py	30	CODE
LOW	lm_eval/tasks/ruler/fwe_utils.py	20	CODE
LOW	lm_eval/tasks/ruler/common_utils.py	10	CODE
LOW	lm_eval/tasks/afrobench/nollysenti/prompt_5/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/nollysenti/prompt_2/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/nollysenti/prompt_3/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/nollysenti/prompt_4/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/nollysenti/prompt_1/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/injongointent/prompt_5/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/injongointent/prompt_2/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/injongointent/prompt_3/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/injongointent/prompt_4/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/injongointent/prompt_1/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/afrisenti/prompt_5/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/afrisenti/prompt_2/utils.py	1	CODE
LOW	lm_eval/tasks/afrobench/afrisenti/prompt_3/utils.py	1	CODE
121 more matches not shown…
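
For orientation, an unused-import check can be approximated with the standard `ast` module: record every name an import binds, record every name the module references, and report the difference. The sketch below is a simplified, hypothetical version (`find_unused_imports` is not a function from the scanner or from lm-evaluation-harness), and it deliberately ignores `__all__`, string annotations, and re-exports.

```python
import ast


def find_unused_imports(source: str) -> list[tuple[int, str]]:
    """Return (line, name) pairs for imported names never referenced.

    Hypothetical, simplified check for illustration only.
    """
    imported = {}  # bound name -> line number of the import statement
    used = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                imported[alias.asname or alias.name.split(".")[0]] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                if alias.name != "*":  # star imports bind unknown names
                    imported[alias.asname or alias.name] = node.lineno
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(
        (line, name) for name, line in imported.items() if name not in used
    )


# Made-up sample: `re` is used, `os` and `Any` are not.
SAMPLE = "import os\nimport re\nfrom typing import Any\n\nprint(re.escape('a.b'))\n"
print(find_unused_imports(SAMPLE))
```

A check this simple would also report names that a package `__init__.py` imports only to re-export. That may be one reason `__init__.py` files appear in the rows above, but the report itself does not say.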

Deep Nesting: 166 hits · 164 pts

Severity	File	Line	Context
LOW	lm_eval/evaluator_utils.py	404	CODE
LOW	lm_eval/evaluator_utils.py	483	CODE
LOW	lm_eval/evaluator.py	424	CODE
LOW	lm_eval/filters/extraction.py	39	CODE
LOW	lm_eval/filters/extraction.py	157	CODE
LOW	lm_eval/filters/extraction.py	42	CODE
LOW	lm_eval/tasks/_factory.py	127	CODE
LOW	lm_eval/tasks/realtoxicityprompts/metric.py	12	CODE
LOW	lm_eval/tasks/evalita_llm/metrics.py	49	CODE
LOW	lm_eval/tasks/evalita_llm/metrics.py	63	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	11	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	30	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	91	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	193	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	246	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	439	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	526	CODE
LOW	lm_eval/tasks/simple_cooccurrence_bias/utils.py	29	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	642	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	730	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	91	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	917	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	993	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	642	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	730	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	91	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	917	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	993	CODE
LOW	lm_eval/tasks/mgsm/utils.py	131	CODE
LOW	lm_eval/tasks/chartqa/utils.py	192	CODE
LOW	…asks/catalan_bench/flores_ca/create_yamls_flores_ca.py	273	CODE
LOW	lm_eval/tasks/truthfulqa-multi/utils.py	38	CODE
LOW	lm_eval/tasks/mmmu/utils.py	105	CODE
LOW	lm_eval/tasks/mmmu/utils.py	223	CODE
LOW	lm_eval/tasks/mmmu/utils.py	316	CODE
LOW	lm_eval/tasks/mmmu/utils.py	230	CODE
LOW	lm_eval/tasks/translation/utils.py	41	CODE
LOW	lm_eval/tasks/aime/utils.py	97	CODE
LOW	lm_eval/tasks/hendrycks_math/utils.py	97	CODE
LOW	lm_eval/tasks/qasper/utils.py	6	CODE
LOW	lm_eval/tasks/qasper/utils.py	9	CODE
LOW	lm_eval/tasks/qasper/utils.py	31	CODE
LOW	…s/portuguese_bench/flores_pt/create_yamls_flores_pt.py	272	CODE
LOW	lm_eval/tasks/bbq/utils.py	212	CODE
LOW	lm_eval/tasks/bbq/utils.py	300	CODE
LOW	lm_eval/tasks/bbq/utils.py	303	CODE
LOW	lm_eval/tasks/hrm8k/default/utils.py	146	CODE
LOW	lm_eval/tasks/hrm8k/en/utils.py	146	CODE
LOW	…asks/spanish_bench/flores_es/create_yamls_flores_es.py	272	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	159	CODE
LOW	lm_eval/tasks/leaderboard/math/utils.py	170	CODE
LOW	lm_eval/tasks/toksuite/utils.py	423	CODE
LOW	lm_eval/tasks/toksuite/utils.py	532	CODE
LOW	lm_eval/tasks/med_prescriptions/utils.py	2178	CODE
LOW	lm_eval/tasks/med_prescriptions/utils.py	2271	CODE
LOW	lm_eval/tasks/score/utils.py	74	CODE
LOW	lm_eval/tasks/score/utils.py	199	CODE
LOW	lm_eval/tasks/score/utils.py	93	CODE
LOW	lm_eval/tasks/score/non_greedy_summarizer.py	33	CODE
LOW	lm_eval/tasks/score/non_greedy_summarizer.py	117	CODE
106 more matches not shown…
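
The report does not state what depth counts as deep nesting. As a minimal sketch of how nesting depth could be measured, the code below counts how many control-flow blocks enclose the innermost statement, using the standard `ast` module. The set of node types in `NESTING_NODES` and the function name `max_nesting_depth` are assumptions made for this example.

```python
import ast

# Statement types that open a new nesting level in this sketch (an assumption).
NESTING_NODES = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_nesting_depth(node: ast.AST, depth: int = 0) -> int:
    """Return the deepest level of nested control-flow blocks under `node`."""
    deepest = depth
    for child in ast.iter_child_nodes(node):
        child_depth = depth + 1 if isinstance(child, NESTING_NODES) else depth
        deepest = max(deepest, max_nesting_depth(child, child_depth))
    return deepest


# Made-up sample: for > if > for > if gives four nested blocks.
SAMPLE = '''
def f(rows):
    for row in rows:
        if row:
            for cell in row:
                if cell:
                    print(cell)
'''

print(max_nesting_depth(ast.parse(SAMPLE)))
```

One limitation of this sketch: `ast` represents `elif` as an `If` node inside the parent's `orelse`, so a long `elif` chain is counted as progressively deeper nesting.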

Over-Commented Block: 88 hits · 86 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/result_schema.py	21	{	COMMENT
LOW	lm_eval/result_schema.py	41	# Per-task list of per-document sample results.	COMMENT
LOW	lm_eval/result_schema.py	61	"upper_git_hash": str | None,	COMMENT
LOW	lm_eval/result_schema.py	81	# Model source identifier (e.g. "hf").	COMMENT
LOW	lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py	61	# bleurt_scores_true = self.bleurt.compute(	COMMENT
LOW	lm_eval/tasks/ifeval/instructions.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	lm_eval/tasks/ifeval/instructions_util.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	lm_eval/tasks/ifeval/instructions_registry.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	…val/tasks/ifeval/multilingual/instructions_registry.py	1	# Copyright 2024 The Google Research Authors.	COMMENT
LOW	…multilingual/instruction_utils/ca_instructions_util.py	1	# coding=utf-8	COMMENT
LOW	…multilingual/instruction_utils/es_instructions_util.py	1	# coding=utf-8	COMMENT
LOW	…ks/ifeval/multilingual/instructions/es_instructions.py	1	# coding=utf-8	COMMENT
LOW	…ks/ifeval/multilingual/instructions/ca_instructions.py	1	# Copyright 2024 The Google Research Authors.	COMMENT
LOW	lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py	181	# bleurt_scores_false = self.bleurt.compute(	COMMENT
LOW	lm_eval/tasks/truthfulqa-multi/utils.py	81	completion = results[0]	COMMENT
LOW	lm_eval/tasks/truthfulqa-multi/utils.py	101	bleu_scores = [bleu([[ref]], [completion]) for ref in all_refs]	COMMENT
LOW	lm_eval/tasks/truthfulqa-multi/utils.py	121	# rouge2_max = rouge2_correct	COMMENT
LOW	lm_eval/tasks/truthfulqa/utils.py	61		COMMENT
LOW	lm_eval/tasks/longbench/metrics.py	1	# MIT License	COMMENT
LOW	lm_eval/tasks/longbench/_generate_config.py	1	# MIT License	COMMENT
LOW	lm_eval/tasks/leaderboard/ifeval/instructions.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	lm_eval/tasks/leaderboard/ifeval/instructions_util.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	…eval/tasks/leaderboard/ifeval/instructions_registry.py	1	# Copyright 2023 The Google Research Authors.	COMMENT
LOW	lm_eval/tasks/logiqa2/utils_logiqa2.py	21	# # https://github.com/csitfun/LogiQA2.0/blob/main/logiqa2nli/nli-prompt.py	COMMENT
LOW	lm_eval/tasks/score/utils.py	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	lm_eval/tasks/score/non_greedy_summarizer.py	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	lm_eval/tasks/score/mmlu_pro/utils_mmlu_pro.py	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…ore/math/prompt_robustness_math_counting_and_prob.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/tasks/score/math/prompt_robustness_math_geometry.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…ks/score/math/non_greedy_robustness_math_geometry.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/score/math/non_greedy_robustness_math_num_theory.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…l/tasks/score/math/prompt_robustness_math_precalc.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/score/math/non_greedy_robustness_math_prealgebra.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…asks/score/math/prompt_robustness_math_num_theory.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…math/non_greedy_robustness_math_counting_and_prob.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…asks/score/math/prompt_robustness_math_prealgebra.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/math/prompt_robustness_math_intermediate_algebra.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	lm_eval/tasks/score/math/math_grader.py	1	# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.	COMMENT
LOW	lm_eval/tasks/score/math/math_grader.py	21	# copies of the Software, and to permit persons to whom the Software is	COMMENT
LOW	lm_eval/tasks/score/math/math_grader.py	41	# copies of the Software, and to permit persons to whom the Software is	COMMENT
LOW	lm_eval/tasks/score/math/math_grader.py	61	# copies of the Software, and to permit persons to whom the Software is	COMMENT
LOW	…sks/score/math/non_greedy_robustness_math_precalc.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…h/non_greedy_robustness_math_intermediate_algebra.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/agi_eval/option_order_robustness_agieval_lsat_rc.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…score/agi_eval/prompt_robustness_agieval_lstat_lr.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…e/agi_eval/non_greedy_robustness_agieval_sat_math.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…agi_eval/option_order_robustness_agieval_sat_math.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…e/agi_eval/non_greedy_robustness_agieval_lstat_ar.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	lm_eval/tasks/score/agi_eval/utils_agieval.py	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/agi_eval/option_order_robustness_agieval_lsat_ar.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…s/score/agi_eval/prompt_robustness_agieval_sat_en.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…ore/agi_eval/non_greedy_robustness_agieval_sat_en.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/agi_eval/non_greedy_robustness_agieval_logiqa_en.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…e/agi_eval/non_greedy_robustness_agieval_lstat_lr.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…gi_eval/option_order_robustness_agieval_logiqa_en.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…score/agi_eval/prompt_robustness_agieval_lstat_ar.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…score/agi_eval/prompt_robustness_agieval_sat_math.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…re/agi_eval/non_greedy_robustness_agieval_lsat_rc.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…core/agi_eval/prompt_robustness_agieval_logiqa_en.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
LOW	…/score/agi_eval/prompt_robustness_agieval_lsat_rc.yaml	1	# Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.	COMMENT
28 more matches not shown…

Redundant / Tautological Comments: 44 hits · 68 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/tasks/_yaml_loader.py	50	# Check if this is a built-in task module	COMMENT
LOW	lm_eval/tasks/_yaml_loader.py	69	# Check if we need to reload the module	COMMENT
LOW	lm_eval/tasks/_yaml_loader.py	72	# Check if it was modified	COMMENT
LOW	lm_eval/tasks/evalita_llm/utils.py	126	if results: # Check if results is not empty	CODE
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	183	# Check if new_plan is a plan	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	807	# Check if the answer is equal (as a set) to the real stored answer	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	860	# Check if the plan candidate from the answer (a) is a proper subsequence of the plan in the question and (b	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py	978	# Check if the answer is equal as sets to the correct answers.	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	183	# Check if new_plan is a plan	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	807	# Check if the answer is equal (as a set) to the real stored answer	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	860	# Check if the plan candidate from the answer (a) is a proper subsequence of the plan in the question and (b	COMMENT
LOW	lm_eval/tasks/acpbench/gen_2shot/acp_utils.py	978	# Check if the answer is equal as sets to the correct answers.	COMMENT
LOW	…ks/ifeval/multilingual/instructions/es_instructions.py	1366	# Check if the last character in value is a dot (.)	COMMENT
LOW	…ks/ifeval/multilingual/instructions/es_instructions.py	1506	# Check if all normalized, alphabetic characters are uppercase, ignoring non-alphabetic characters	COMMENT
LOW	…ks/ifeval/multilingual/instructions/ca_instructions.py	1364	# Check if the last character in value is a dot (.)	COMMENT
LOW	…ks/ifeval/multilingual/instructions/ca_instructions.py	1504	# Check if all normalized, alphabetic characters are uppercase, ignoring non-alphabetic characters	COMMENT
LOW	lm_eval/tasks/chartqa/utils.py	211	# Check if the number is in the model answer with commas (e.g. 1,000)	COMMENT
LOW	lm_eval/tasks/chartqa/utils.py	214	# Check if the number is in the model answer without commas (e.g. 1000)	COMMENT
LOW	lm_eval/tasks/graphwalks/utils.py	42	# Check if formatted correctly	COMMENT
LOW	lm_eval/tasks/aime/utils.py	26	# Check if answer matches target	COMMENT
LOW	lm_eval/tasks/bbq/utils.py	224	# Check if answer is "Not known"	COMMENT
LOW⚡	lm_eval/tasks/med_prescriptions/utils.py	2106	# Check if the text contains any Indian script characters	COMMENT
LOW⚡	lm_eval/tasks/arab_culture/utils_mcq.py	17	### Set this to one to add the country and region information to the prompt	COMMENT
LOW⚡	lm_eval/tasks/arab_culture/utils_mcq.py	19	### Set this to one to add the region information to the prompt	COMMENT
LOW⚡	lm_eval/tasks/arab_culture/utils_mcq.py	21	### Set this to change between Arabic and English for the answer keys and the choices keys	COMMENT
LOW	lm_eval/tasks/jsonschema_bench/metrics.py	28	# Check if the schema is valid	COMMENT
LOW	lm_eval/tasks/afrobench/masakhaner/prompt_5/utils.py	17	if pair: # Check if the line is not empty	CODE
LOW	lm_eval/tasks/afrobench/masakhaner/prompt_2/utils.py	17	if pair: # Check if the line is not empty	CODE
LOW	lm_eval/tasks/afrobench/masakhaner/prompt_3/utils.py	17	if pair: # Check if the line is not empty	CODE
LOW	lm_eval/tasks/afrobench/masakhaner/prompt_4/utils.py	17	if pair: # Check if the line is not empty	CODE
LOW	lm_eval/tasks/afrobench/masakhaner/prompt_1/utils.py	17	if pair: # Check if the line is not empty	CODE
LOW⚡	…eval/tasks/arab_culture_completion/utils_completion.py	18	### Set this to one to add the country and region information to the prompt	COMMENT
LOW⚡	…eval/tasks/arab_culture_completion/utils_completion.py	20	### Set this to one to add the region information to the prompt	COMMENT
LOW⚡	…eval/tasks/arab_culture_completion/utils_completion.py	22	### Set this to change between Arabic and English for the answer keys and the choices keys	COMMENT
LOW	lm_eval/decontamination/decontaminate.py	61	# Check if we've decontaminated this combination before	COMMENT
LOW⚡	lm_eval/models/winml.py	326	# Check if encoding empty string gives BOS token	COMMENT
LOW	lm_eval/models/winml.py	556	# Check if greedy (argmax matches actual token)	COMMENT
LOW	lm_eval/models/hf_vlms.py	586	# Check if per-token argmax is exactly equal to continuation	COMMENT
LOW	lm_eval/models/neuron_optimum.py	542	# Check if per-token argmax is exactly equal to continuation	COMMENT
LOW	lm_eval/models/huggingface.py	1529	# Check if per-token argmax is exactly equal to continuation	COMMENT
LOW	lm_eval/models/megatron_lm.py	987	# Check if greedy	COMMENT
LOW	lm_eval/_cli/run.py	478	# Print results	COMMENT
LOW	lm_eval/api/task.py	1078	# Check if answer is provided (handle a=0 as valid answer index)	STRING
LOW	tests/test_tasks.py	28	# Check if task_classes is empty	COMMENT
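
Most snippets in this table begin with a fixed phrase such as "Check if". A comment-prefix match of that kind can be sketched with the standard `tokenize` module, which yields real comment tokens and so does not match `#` characters inside string literals. The prefix list in `PREFIX_RE` is hypothetical and was chosen only to mirror the snippets listed above; the scanner's actual rules are not given in this report.

```python
import io
import re
import tokenize

# Hypothetical prefixes, chosen only to mirror the snippets listed above.
PREFIX_RE = re.compile(r"^#+\s*(check if|print results|set this)\b", re.IGNORECASE)


def find_flagged_comments(source: str) -> list[tuple[int, str]]:
    """Return (line, comment) pairs whose comment starts with a listed prefix."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and PREFIX_RE.match(tok.string):
            hits.append((tok.start[0], tok.string))
    return hits


# Made-up sample: the first comment matches a prefix, the second does not.
SAMPLE = "if pair:  # Check if the line is not empty\n    x = 1  # cache for reuse\n"
print(find_flagged_comments(SAMPLE))
```

A prefix match cannot tell whether a comment actually restates the code beneath it, so hits of this kind are best read as candidates for review.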

Docstring Block Structure: 11 hits · 55 pts

Severity	File	Line	Snippet	Context
HIGH	lm_eval/models/winml.py	388	Run inference using ONNX Runtime GenAI to get full logits sequence. Args: input_text: Inpu	STRING
HIGH	lm_eval/models/ibm_watsonx_ai.py	229	Determines whether a stop token has been generated in the `response_tokens` compared to the `context_tokens`.	STRING
HIGH	lm_eval/models/utils.py	280	Generates and yields batches from the reordered array. The method of grouping and batching depends on the param	STRING
HIGH	lm_eval/models/utils.py	504	This function checks if the (Hugging Face) tokenizer has a padding token and sets it if not present. Some tokenizers req	STRING
HIGH	lm_eval/models/utils.py	611	Normalize generation kwargs for consistent handling across model backends. Model implementations may have different	STRING
HIGH	lm_eval/models/utils.py	829	Truncates input tokens and/or reduces max_gen_toks to fit within max_model_len. Strategy: 1. No truncation	STRING
HIGH	lm_eval/api/registry.py	102	Materialize a lazy placeholder into the actual object. This is at module level to avoid memory leaks from lru_cache	STRING
HIGH	lm_eval/api/registry.py	188	Register an object under one or more aliases. Can be used as a decorator or called directly for direct registra	STRING
HIGH	lm_eval/api/registry.py	279	Retrieve an object by alias, materializing if needed. Thread-safe lazy loading: if the alias points to a placeh	STRING
HIGH	lm_eval/api/registry.py	492	Get a model class by name. Args: model_name: The registered name of the model Returns: The mod	STRING
HIGH	lm_eval/api/registry.py	546	Get a filter by name. Args: filter_name: The registered name of the filter, or a callable Returns:	STRING

AI Slop Vocabulary: 20 hits · 54 pts

Severity	File	Line	Snippet	Context
MEDIUM	lm_eval/evaluator.py	198	# See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412	COMMENT
MEDIUM	lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py	160	# init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const	COMMENT
MEDIUM	lm_eval/tasks/ifeval/instructions_util.py	29	# see https://github.com/EleutherAI/lm-evaluation-harness/issues/2210	COMMENT
MEDIUM	lm_eval/tasks/aime/utils.py	35	# string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math	COMMENT
MEDIUM	lm_eval/tasks/hendrycks_math/utils.py	35	# string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math	COMMENT
MEDIUM	lm_eval/tasks/truthfulqa/utils.py	164	# init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const	COMMENT
LOW	lm_eval/tasks/bbq/utils.py	65	# If all elements are NaN, then we simply return NaN	COMMENT
MEDIUM	lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py	137	# init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const	COMMENT
MEDIUM	lm_eval/tasks/noreval/norsumm/utils.py	87	# init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const	COMMENT
MEDIUM	lm_eval/tasks/minerva_math/utils.py	28	# https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py	COMMENT
LOW	lm_eval/tasks/longbench/_generate_config.py	177	# Now we just set a boolean flag to indicate whether we need a newline	STRING
MEDIUM	lm_eval/tasks/leaderboard/ifeval/instructions_util.py	28	# see https://github.com/EleutherAI/lm-evaluation-harness/issues/2210	COMMENT
MEDIUM	lm_eval/tasks/leaderboard/math/utils.py	25	# https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py	COMMENT
MEDIUM⚡	lm_eval/tasks/cruxeval/utils.py	242	# lm-evaluation-harness Integration Functions	COMMENT
MEDIUM	lm_eval/models/openai_completions.py	314	"Loglikelihood (and therefore `multiple_choice`-type tasks) is not supported for chat completions as OpenAI	CODE
MEDIUM	lm_eval/models/huggingface.py	1393	# See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1678	COMMENT
MEDIUM	lm_eval/models/sglang_causallms.py	40	# batch args from lm-eval interface: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interfa	COMMENT
LOW	lm_eval/models/megatron_lm.py	837	# We just pass through the requests without additional splitting	COMMENT
LOW	lm_eval/models/megatron_lm.py	857	# We just return results without additional gathering	COMMENT
MEDIUM	lm_eval/api/metrics.py	614	# See https://github.com/EleutherAI/lm-evaluation-harness/pull/1390 for more documentation.	COMMENT

Self-Referential Comments: 17 hits · 50 pts

Severity	File	Line	Snippet	Context
MEDIUM	lm_eval/tasks/slr_bench/lm_eval_slr_bench.py	33	# Create the reference in the required format	COMMENT
MEDIUM	lm_eval/tasks/toksuite/utils.py	500	# Create the summary row with column averages	COMMENT
MEDIUM⚡	lm_eval/tasks/med_prescriptions/utils.py	2101	# Create a regular expression pattern for Indian scripts	COMMENT
MEDIUM	lm_eval/tasks/ruler/vt_utils.py	75	# Create a list of the repeated noise	COMMENT
MEDIUM	lm_eval/loggers/utils.py	25	# Define the pattern to match ',none' at the end of the string	COMMENT
MEDIUM	lm_eval/config/evaluate_config.py	223	# Create an instance and validate	COMMENT
MEDIUM	lm_eval/models/api_models.py	269	"""This method is responsible for creating the json payload that will be sent to the API."""	STRING
MEDIUM	tests/test_registry.py	157	# Create a class to test with	COMMENT
MEDIUM	tests/test_metrics.py	12	# Create a minimal config	COMMENT
MEDIUM	tests/test_cli_subcommands.py	444	# Create a minimal valid task yaml	COMMENT
MEDIUM	tests/test_cli_subcommands.py	855	# Create a YAML config file	COMMENT
MEDIUM	tests/test_task_manager.py	529	# Create a custom arc_easy.yaml that has a different metric	COMMENT
MEDIUM	tests/test_task_manager.py	588	# Create a custom task using a real dataset	COMMENT
MEDIUM	tests/test_task_manager.py	640	# Create a completely new task (not overriding any default)	COMMENT
MEDIUM	tests/models/test_vllm_context_length.py	24	# Create a mock VLLM instance with required attributes	COMMENT
MEDIUM	tests/models/test_vllm_context_length.py	205	# Create a mock request	COMMENT
MEDIUM	tests/scripts/test_zeno_visualize.py	17	# Define the process_model_args function that replicates the fixed logic in zeno_visualize.py	COMMENT

Modern AI Meta-Vocabulary: 18 hits · 47 pts

Severity	File	Line	Snippet	Context
MEDIUM	lm_eval/result_schema.py	35	# Number of few-shot examples used per task.	COMMENT
MEDIUM	lm_eval/result_schema.py	93	# Whether few-shot examples were formatted as multi-turn.	COMMENT
MEDIUM	lm_eval/evaluator.py	388	# add info about the model and few shot config	COMMENT
MEDIUM	lm_eval/tasks/metabench/process_docs.py	24	long_prompt = f"{long_prompt}Question: {question}\nAnswer: {answer}\n\n" # no choices are provided in the f	CODE
MEDIUM	lm_eval/tasks/metabench/process_docs.py	43	long_prompt = f"{long_prompt}Question: {question}\nAnswer: {answer}\n\n" # no choices are provided in the f	CODE
MEDIUM	lm_eval/tasks/metabench/process_docs.py	126	long_prompt = f"{long_prompt}{question}\nA. {choice_A}\nB. {choice_B}\nC. {choice_C}\nD. {choice_D}\nAnswer:	CODE
MEDIUM	lm_eval/tasks/metabench/process_docs_permute.py	24	long_prompt = f"{long_prompt}Question: {question}\nAnswer: {answer}\n\n" # no choices are provided in the f	CODE
MEDIUM	lm_eval/tasks/metabench/process_docs_permute.py	126	long_prompt = f"{long_prompt}{question}\nA. {choice_A}\nB. {choice_B}\nC. {choice_C}\nD. {choice_D}\nAnswer:	CODE
MEDIUM	lm_eval/tasks/french_bench/README.md	59	# Not to use in few-shot	COMMENT
MEDIUM	lm_eval/tasks/ruler/vt_utils.py	208	# This condition is to check if we are generating the few-shot.	COMMENT
MEDIUM	lm_eval/tasks/bigbench/generate_tasks.py	219	+ "_zero_shot", # zero-shot version of the dataset	CODE
MEDIUM	lm_eval/tasks/race/README.md	54	- PR #3716. Fixed how fill-in-the-blank ("cloze") sub-questions are rendered in the few-shot context. For a question l	CODE
MEDIUM	lm_eval/models/utils.py	909	# not to the reasoning trace (which often contains \n\n etc.)	COMMENT
MEDIUM	lm_eval/_cli/run.py	33	# Evaluate on multiple tasks with few-shot examples	COMMENT
MEDIUM	tests/models/test_model_utils.py	145	# Prompt exactly fills context window, gen toks reduced to 0	COMMENT
MEDIUM	docs/model_guide.md	170	# ... more few-shot examples, potentially	COMMENT
MEDIUM	docs/interface.md	38	# With few-shot examples	COMMENT
MEDIUM	docs/interface.md	64	# Multiple tasks with few-shot examples	COMMENT

Modern Structural Boilerplate: 26 hits · 26 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/__init__.py	29	__all__ = ["evaluate", "simple_evaluate", "__version__"]	CODE
LOW	lm_eval/filters/__init__.py	27	__all__ = [	CODE
LOW	lm_eval/tasks/__init__.py	25	__all__ = [	CODE
LOW	lm_eval/tasks/ifeval/instructions.py	41	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/tasks/truthfulqa-multi/utils.py	7	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/tasks/leaderboard/ifeval/instructions.py	30	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/loggers/trackio_logger.py	11	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/loggers/wandb_logger.py	13	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/loggers/utils.py	12	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/config/__init__.py	6	__all__ = ["EvaluatorConfig", "TaskConfig", "GroupConfig"]	CODE
LOW⚡	lm_eval/models/winml.py	199	def _setup_winml_devices_and_providers(self) -> None:	CODE
LOW	lm_eval/models/gguf.py	12	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/models/__init__.py	78	__all__ = ["MODEL_MAPPING"]	CODE
LOW	lm_eval/models/neuron_optimum.py	34	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/models/textsynth.py	26	logger = logging.getLogger(__name__)	CODE
LOW	lm_eval/_cli/__init__.py	8	__all__ = ["HarnessCLI"]	CODE
LOW	lm_eval/api/task.py	520	def set_config(self, key: str, value: Any, update: bool = False) -> None:	CODE
LOW	lm_eval/api/task.py	560	def set_fewshot_seed(self, seed: int | None = None) -> None:	CODE
LOW	lm_eval/api/registry.py	58	__all__ = [	CODE
LOW	lm_eval/api/model.py	225	def set_cache_hook(self, cache_hook: "CacheHook") -> None:	CODE
LOW	scripts/make_table_results.py	13	logger = logging.getLogger(__name__)	CODE
LOW	scripts/make_table_tasks.py	14	logger = logging.getLogger(__name__)	CODE
LOW	scripts/clean_training_data/generate_13_grams.py	41	logger = logging.getLogger(__name__)	CODE
LOW	scripts/clean_training_data/compress_and_package.py	13	logger = logging.getLogger(__name__)	CODE
LOW	scripts/clean_training_data/process_sorted_buckets.py	31	logger = logging.getLogger(__name__)	CODE
LOW	scripts/clean_training_data/sort_13_gram_buckets.py	23	logger = logging.getLogger(__name__)	CODE

AI Structural Patterns: 25 hits · 24 pts

Severity	File	Line	Context
LOW	lm_eval/evaluator.py	55	CODE
LOW	lm_eval/evaluator.py	424	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	168	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	176	CODE
LOW	lm_eval/tasks/evalita_llm/utils.py	523	CODE
LOW	lm_eval/tasks/minerva_math/utils.py	96	CODE
LOW	lm_eval/tasks/leaderboard/math/utils.py	91	CODE
LOW	lm_eval/tasks/ruler/qa_utils.py	41	CODE
LOW	lm_eval/tasks/ruler/prepare_niah.py	213	CODE
LOW	lm_eval/loggers/evaluation_tracker.py	130	CODE
LOW	lm_eval/config/task.py	50	CODE
LOW	lm_eval/models/optimum_ipex.py	33	CODE
LOW	lm_eval/models/nemo_lm.py	169	CODE
LOW	lm_eval/models/hf_vlms.py	38	CODE
LOW	lm_eval/models/vllm_causallms.py	61	CODE
LOW	lm_eval/models/neuron_optimum.py	133	CODE
LOW	lm_eval/models/huggingface.py	70	CODE
LOW	lm_eval/models/huggingface.py	748	CODE
LOW	lm_eval/models/trtllm_causallms.py	44	CODE
LOW	lm_eval/models/sglang_causallms.py	37	CODE
LOW	lm_eval/models/api_models.py	114	CODE
LOW	lm_eval/models/megatron_lm.py	154	CODE
LOW	lm_eval/models/megatron_lm.py	703	CODE
LOW	lm_eval/api/task.py	268	CODE
LOW	tests/test_evaluator_utils.py	702	CODE

Verbosity Indicators: 11 hits · 24 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/tasks/score/math/prompt_templates.json	11	"prompt": "You should solve this math problem.\nIf the problem is easy, provide a brief solution with little	CODE
LOW⚡	lm_eval/tasks/infinitebench/utils.py	367	# Step 1: find last standalone A-D letter (official regex)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	372	# Step 2: empty prediction	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	376	# Step 3: first character	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	380	# Step 4: full prediction matches label letter	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	384	# Step 5: replace punctuation, check prefixes (matching official chars)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	395	# Step 6: scan words for first A-D letter	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	430	# Step 1: find last standalone A-J letter (official regex)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	437	# Step 2: replace chars and consolidate spaces (matching official)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	447	# Step 3: check startswith	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	453	# Step 4: check answer prefixes (matching official set)	COMMENT

Structural Annotation Overuse: 10 hits · 22 pts

Severity	File	Line	Snippet	Context
LOW⚡	lm_eval/tasks/infinitebench/utils.py	367	# Step 1: find last standalone A-D letter (official regex)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	372	# Step 2: empty prediction	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	376	# Step 3: first character	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	380	# Step 4: full prediction matches label letter	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	384	# Step 5: replace punctuation, check prefixes (matching official chars)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	395	# Step 6: scan words for first A-D letter	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	430	# Step 1: find last standalone A-J letter (official regex)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	437	# Step 2: replace chars and consolidate spaces (matching official)	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	447	# Step 3: check startswith	COMMENT
LOW⚡	lm_eval/tasks/infinitebench/utils.py	453	# Step 4: check answer prefixes (matching official set)	COMMENT

TODO Padding: 8 hits · 12 pts

Severity	File	Line	Snippet	Context
LOW	lm_eval/filters/selection.py	7	# TODO: implement "arg_max" filter. either it should take in an arbitrary "scoring"/reward function	COMMENT
LOW	lm_eval/tasks/hendrycks_ethics/deontology.yaml	9	# TODO: implement exact-match metric for this subset	COMMENT
LOW	lm_eval/models/anthropic_llms.py	245	temperature=temperature, # TODO: implement non-greedy sampling for Anthropic	CODE
LOW	lm_eval/models/neuron_optimum.py	422	# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context	COMMENT
LOW	lm_eval/models/huggingface.py	1338	# TODO: implement some kind of efficient-request-middleware that lumps together requests with the same context	COMMENT
LOW	scripts/regression.py	36	# TODO: implement num_fewshot and limit per task, e.g. task1:5,task2:1:100,task3::1000	COMMENT
LOW	scripts/regression.py	39	# TODO: implement hf-auto to pick between causal and seq2seq models so we don't need this	COMMENT
LOW	scripts/regression.py	163	# TODO: implement proper timing for each task	COMMENT

Cross-Language Confusion: 1 hit · 8 pts

Severity	File	Line	Snippet	Context
HIGH	lm_eval/tasks/bbq/utils.py	75	# Unfortunately, bias score for `n_non_unk = 0` is undefined,	COMMENT

Synthetic Comment Markers: 1 hit · 8 pts

Severity	File	Line	Snippet	Context
HIGH	lm_eval/tasks/arabic_leaderboard_complete/README.md	181	* `arabic_leaderboard_acva`: Arabic-Culture-Value-Alignment (ACVA) is a yes/no question dataset, generated by GPT3.5 Tur	COMMENT

Fake / Example Data: 7 hits · 8 pts

Severity	File	Line	Snippet	Context
LOW⚡	tests/test_utils.py	426	auth_token="dummy-token",	CODE
LOW⚡	tests/test_utils.py	429	assert tokenizer.headers["Authorization"] == "Bearer dummy-token"	CODE
LOW⚡	tests/test_utils.py	451	auth_token="dummy-token",	CODE
LOW⚡	tests/test_utils.py	454	assert tokenizer.headers["Authorization"] == "Bearer dummy-token"	CODE
LOW	tests/test_utils.py	476	auth_token="dummy-token",	CODE
LOW	tests/test_utils.py	516	auth_token="dummy-token",	CODE
LOW	tests/test_utils.py	541	auth_token="dummy-token",	CODE

Dead Code: 3 hits · 6 pts

Severity	File	Line	Context
MEDIUM	lm_eval/models/hf_vlms.py	413	CODE
MEDIUM	lm_eval/models/hf_vlms.py	414	CODE
MEDIUM	lm_eval/models/hf_vlms.py	435	CODE
