A framework for few-shot evaluation of language models.
1813 matches across 15 categories.

| Severity | File | Line | Snippet |
|---|---|---|---|
| HIGH | lm_eval/filters/selection.py | 0 | can define custom behavior here, if an individual instantiation of a filter class should have state. |
| HIGH | lm_eval/filters/selection.py | 0 | can define custom behavior here, if an individual instantiation of a filter class should have state. |
| HIGH | lm_eval/api/filter.py | 0 | can define custom behavior here, if an individual instantiation of a filter class should have state. |
| HIGH | lm_eval/filters/selection.py | 0 | assuming each entry of `resps` is a list of model responses, we discard all but the first response. |
| HIGH | …val/tasks/darija_bench/darija_transliteration/utils.py | 0 | assuming each entry of `resps` is a list of model responses, we discard all but the first response. |
| HIGH | lm_eval/tasks/darija_bench/darija_translation/utils.py | 0 | assuming each entry of `resps` is a list of model responses, we discard all but the first response. |
| HIGH | …_eval/tasks/darija_bench/darija_summarization/utils.py | 0 | assuming each entry of `resps` is a list of model responses, we discard all but the first response. |
| HIGH | lm_eval/tasks/super_glue/record/t5_utils.py | 0 | lower text and remove punctuation, articles and extra whitespace. |
| HIGH | lm_eval/tasks/french_bench/utils.py | 0 | lower text and remove punctuation, articles and extra whitespace. |
| HIGH | lm_eval/tasks/longbench/metrics.py | 0 | lower text and remove punctuation, articles and extra whitespace. |
| HIGH | lm_eval/tasks/mlqa/utils.py | 0 | lower text and remove punctuation, articles and extra whitespace. |
| HIGH | lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/truthfulqa-multi/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/truthfulqa/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/noreval/norsumm/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/galician_bench/utils.py | 0 | returns `t5` style bleu scores. see the related implementation: https://github.com/google-research/text-to-text-transfer |
| HIGH | lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/truthfulqa-multi/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/truthfulqa/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/noreval/norsumm/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/galician_bench/utils.py | 0 | returns `t5` style rouge scores. see the related implementation: https://github.com/google-research/text-to-text-transfe |
| HIGH | lm_eval/tasks/darijammlu/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/mmlusr/config.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/e2lmc/noor/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/egymmlu/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/arab_culture/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | …val/tasks/arab_culture_completion/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/mmlu/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/tmmluplus/default/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/arabicmmlu/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/tmlu/default/_generate_configs.py | 0 | take in a yaml, and output all "other" splits with this yaml |
| HIGH | lm_eval/tasks/mgsm/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrimmlu/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/paws-x/_generate_config.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/translation/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/xnli/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/openai_mmlu/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/adr/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/mafand/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/naijarc/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/belebele/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/injongointent/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/xlsum/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/afrisenti/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/masakhapos/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/ntrex/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/flores/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/salt/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/masakhanews/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/uhura-arc-easy/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/sib/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/afriqa/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrobench/masakhaner/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrixnli/gen_utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/afrixnli/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| HIGH | lm_eval/tasks/xwinograd/utils.py | 0 | generate a yaml file for each language. :param output_dir: the directory to output the files to. :param overwrite: wheth |
| 293 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/evaluator_utils.py | 173 | def _compute_task_aggregations( |
| LOW | lm_eval/evaluator_utils.py | 319 | def _collect_groups_bottom_up(groups: dict[str, Group]) -> list[Group]: |
| LOW | lm_eval/evaluator_utils.py | 404 | def _propagate_higher_is_better( |
| LOW | lm_eval/utils.py | 47 | def is_transformers_available() -> bool: |
| LOW | lm_eval/utils.py | 324 | def get_sample_results_filenames(filenames: list[str]) -> list[str]: |
| LOW | lm_eval/utils.py | 331 | def get_rolling_token_windows( |
| LOW | lm_eval/utils.py | 844 | def check_remote_tokenizer_support( |
| LOW | lm_eval/tasks/__init__.py | 36 | def get_task_name_from_config(task_config: dict[str, str]) -> str: |
| LOW | lm_eval/tasks/__init__.py | 50 | def get_task_name_from_object(task_object): |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 207 | def generate_optimal_plans_for_problem_state(P, state, num_plans, timeout): |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 330 | def create_tmp_dom_prob_replace_init(P, state, result_domain_file, result_problem_file): |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 671 | def str_remove_before_first_parentheses(s): |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 680 | def str_remove_after_last_parentheses(s): |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 207 | def generate_optimal_plans_for_problem_state(P, state, num_plans, timeout): |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 330 | def create_tmp_dom_prob_replace_init(P, state, result_domain_file, result_problem_file): |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 671 | def str_remove_before_first_parentheses(s): |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 680 | def str_remove_after_last_parentheses(s): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 35 | def test_normalize_comma_only_between_digits(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 58 | def test_extract_answer_multiline_with_answer(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 100 | def test_exact_numerical_match(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 115 | def test_non_numeric_fallback(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 123 | def test_same_unit_different_values(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 145 | def test_missing_optional_fields(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 163 | def test_exact_and_numerical_match(self): |
| LOW | lm_eval/tasks/jfinqa/test_jfinqa_utils.py | 169 | def test_numerical_match_only(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 133 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 170 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 243 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 293 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 340 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 379 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 427 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 476 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 550 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 600 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 660 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 721 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 784 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 853 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 911 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 939 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1017 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1108 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1153 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1208 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1241 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1295 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1331 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1356 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1432 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1460 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1492 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1523 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1580 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1612 | def get_instruction_args_keys(self): |
| LOW | lm_eval/tasks/ifeval/utils.py | 24 | def test_instruction_following_strict( |
| LOW | lm_eval/tasks/ifeval/utils.py | 57 | def test_instruction_following_loose( |
| LOW | lm_eval/tasks/ifeval/multilingual/utils.py | 23 | def test_instruction_following_strict( |
| LOW | lm_eval/tasks/ifeval/multilingual/utils.py | 56 | def test_instruction_following_loose( |
| LOW | …ks/ifeval/multilingual/instructions/es_instructions.py | 104 | def get_instruction_args_keys(self): |
| 595 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 223 | # ============================================================================ |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 225 | # ============================================================================ |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 241 | # ============================================================================ |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 243 | # ============================================================================ |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 425 | # ============================================================================ |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 427 | # ============================================================================ |
| MEDIUM | lm_eval/api/registry.py | 414 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 416 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 460 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 462 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 520 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 522 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 570 | # ============================================================================= |
| MEDIUM | lm_eval/api/registry.py | 572 | # ============================================================================= |
| MEDIUM | lm_eval/api/group.py | 327 | # ============================================================================= |
| MEDIUM | lm_eval/api/group.py | 329 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 13 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 15 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 24 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 26 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 58 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 60 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 112 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 114 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 717 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 719 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 180 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 182 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 455 | # ============================================================================= |
| MEDIUM | tests/test_fewshot_context.py | 457 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 12 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 14 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 359 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 361 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 921 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 923 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 84 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 86 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 205 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 207 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 777 | # ============================================================================= |
| MEDIUM | tests/test_task_manager.py | 779 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 38 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 40 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 213 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 215 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 268 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 270 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 304 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 306 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 15 | # ============================================================================= |
| MEDIUM | tests/test_samplers.py | 17 | # ============================================================================= |
| MEDIUM | tests/test_aggregation_pipeline.py | 26 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_aggregation_pipeline.py | 28 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_aggregation_pipeline.py | 107 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_aggregation_pipeline.py | 109 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_evaluator_utils.py | 115 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_evaluator_utils.py | 117 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_evaluator_utils.py | 139 | # --------------------------------------------------------------------------- |
| MEDIUM | tests/test_evaluator_utils.py | 141 | # --------------------------------------------------------------------------- |
| 38 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/tasks/_index.py | 63 | except Exception as err: |
| LOW | lm_eval/tasks/realtoxicityprompts/metric.py | 36 | except Exception: |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 123 | except Exception as e: |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 148 | except Exception: |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 325 | except Exception as e: |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 676 | except Exception: |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 1058 | except Exception as e: |
| MEDIUM | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 1053 | def parse_prediction(prediction): |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 123 | except Exception as e: |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 148 | except Exception: |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 325 | except Exception as e: |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 676 | except Exception: |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 1058 | except Exception as e: |
| MEDIUM | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 1053 | def parse_prediction(prediction): |
| LOW | lm_eval/tasks/slr_bench/lm_eval_slr_bench.py | 14 | except Exception as e: |
| LOW | lm_eval/tasks/slr_bench/lm_eval_slr_bench.py | 59 | except Exception as e: |
| MEDIUM | lm_eval/tasks/slr_bench/lm_eval_slr_bench.py | 60 | print(f"Error in process_results: {e}") |
| LOW | lm_eval/tasks/humaneval/utils.py | 9 | except Exception as e: |
| LOW | lm_eval/tasks/aime/utils.py | 49 | except Exception: |
| LOW | lm_eval/tasks/hendrycks_math/utils.py | 49 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 32 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 70 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 84 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 158 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 189 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 32 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 70 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 84 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 158 | except Exception: |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 189 | except Exception: |
| LOW | lm_eval/tasks/humaneval_infilling/utils.py | 9 | except Exception as e: |
| LOW | lm_eval/tasks/medtext/utils.py | 18 | except Exception as e: |
| LOW | lm_eval/tasks/medtext/utils.py | 27 | except Exception as e: |
| LOW | lm_eval/tasks/medtext/utils.py | 33 | except Exception as e: |
| LOW | lm_eval/tasks/medtext/utils.py | 39 | except Exception as e: |
| LOW | lm_eval/tasks/medtext/utils.py | 47 | except Exception as e: |
| MEDIUM | lm_eval/tasks/medtext/utils.py | 24 | def doc_eval(pred, refs): |
| LOW | lm_eval/tasks/olaph/utils.py | 19 | except Exception as e: |
| LOW | lm_eval/tasks/olaph/utils.py | 28 | except Exception as e: |
| LOW | lm_eval/tasks/olaph/utils.py | 34 | except Exception as e: |
| LOW | lm_eval/tasks/olaph/utils.py | 40 | except Exception as e: |
| LOW | lm_eval/tasks/olaph/utils.py | 48 | except Exception as e: |
| MEDIUM | lm_eval/tasks/olaph/utils.py | 25 | def doc_eval(pred, refs): |
| LOW | lm_eval/tasks/minerva_math/utils.py | 197 | except Exception as e: |
| LOW | lm_eval/tasks/leaderboard/math/utils.py | 209 | except Exception as e: |
| LOW | lm_eval/tasks/toksuite/utils.py | 474 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 494 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 518 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 522 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 555 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 578 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 601 | except Exception: |
| LOW | lm_eval/tasks/toksuite/utils.py | 605 | except Exception: |
| LOW | lm_eval/tasks/meqsum/utils.py | 18 | except Exception as e: |
| LOW | lm_eval/tasks/meqsum/utils.py | 52 | except Exception as e: |
| LOW | lm_eval/tasks/meqsum/utils.py | 58 | except Exception as e: |
| LOW | lm_eval/tasks/meqsum/utils.py | 64 | except Exception as e: |
| LOW | lm_eval/tasks/meqsum/utils.py | 72 | except Exception as e: |
| LOW | lm_eval/tasks/med_prescriptions/utils.py | 2060 | except Exception: |
| LOW | lm_eval/tasks/med_prescriptions/utils.py | 2066 | except Exception: |
| 104 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/evaluator_utils.py | 1 | |
| LOW | lm_eval/__init__.py | 2 | |
| LOW | lm_eval/evaluator.py | 1 | |
| LOW | lm_eval/filters/__init__.py | 1 | |
| LOW | lm_eval/filters/__init__.py | 6 | |
| LOW | lm_eval/filters/__init__.py | 8 | |
| LOW | lm_eval/filters/__init__.py | 8 | |
| LOW | lm_eval/filters/__init__.py | 8 | |
| LOW | lm_eval/filters/__init__.py | 8 | |
| LOW | lm_eval/filters/extraction.py | 1 | |
| LOW | lm_eval/tasks/_index.py | 1 | |
| LOW | lm_eval/tasks/_factory.py | 1 | |
| LOW | lm_eval/tasks/_yaml_loader.py | 1 | |
| LOW | lm_eval/tasks/manager.py | 1 | |
| LOW | lm_eval/tasks/manager.py | 20 | |
| LOW | lm_eval/tasks/babilong/common_utils.py | 11 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 3 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 4 | |
| LOW | lm_eval/tasks/jfinqa/utils.py | 12 | |
| LOW | lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py | 2 | |
| LOW | lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py | 8 | |
| LOW | lm_eval/tasks/aime/utils.py | 1 | |
| LOW | lm_eval/tasks/noreval/norsumm/utils.py | 1 | |
| LOW | lm_eval/tasks/noreval/norsumm/utils.py | 7 | |
| LOW | lm_eval/tasks/spanish_bench/utils.py | 2 | |
| LOW | lm_eval/tasks/spanish_bench/utils.py | 5 | |
| LOW | lm_eval/tasks/minerva_math/utils.py | 5 | |
| LOW | lm_eval/tasks/minerva_math/utils.py | 5 | |
| LOW | lm_eval/tasks/minerva_math/utils.py | 14 | |
| LOW | lm_eval/tasks/darija_bench/darija_sentiment/utils.py | 1 | |
| LOW | lm_eval/tasks/darija_bench/darija_sentiment/utils.py | 2 | |
| LOW | …_eval/tasks/darija_bench/darija_summarization/utils.py | 1 | |
| LOW | lm_eval/tasks/longbench/utils.py | 1 | |
| LOW | lm_eval/tasks/longbench/utils.py | 2 | |
| LOW | lm_eval/tasks/longbench/utils.py | 3 | |
| LOW | lm_eval/tasks/leaderboard/gpqa/utils.py | 1 | |
| LOW | lm_eval/tasks/xquad/utils.py | 1 | |
| LOW | lm_eval/tasks/xquad/utils.py | 2 | |
| LOW | lm_eval/tasks/xquad/utils.py | 4 | |
| LOW | lm_eval/tasks/xquad/utils.py | 7 | |
| LOW | lm_eval/tasks/cnn_dailymail/utils.py | 6 | |
| LOW | lm_eval/tasks/score/utils.py | 21 | |
| LOW | lm_eval/tasks/score/utils.py | 25 | |
| LOW | lm_eval/tasks/ruler/vt_utils.py | 30 | |
| LOW | lm_eval/tasks/ruler/vt_utils.py | 30 | |
| LOW | lm_eval/tasks/ruler/fwe_utils.py | 20 | |
| LOW | lm_eval/tasks/ruler/common_utils.py | 10 | |
| LOW | lm_eval/tasks/afrobench/nollysenti/prompt_5/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/nollysenti/prompt_2/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/nollysenti/prompt_3/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/nollysenti/prompt_4/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/nollysenti/prompt_1/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/injongointent/prompt_5/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/injongointent/prompt_2/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/injongointent/prompt_3/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/injongointent/prompt_4/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/injongointent/prompt_1/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/afrisenti/prompt_5/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/afrisenti/prompt_2/utils.py | 1 | |
| LOW | lm_eval/tasks/afrobench/afrisenti/prompt_3/utils.py | 1 | |
| 121 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/evaluator_utils.py | 404 | |
| LOW | lm_eval/evaluator_utils.py | 483 | |
| LOW | lm_eval/evaluator.py | 424 | |
| LOW | lm_eval/filters/extraction.py | 39 | |
| LOW | lm_eval/filters/extraction.py | 157 | |
| LOW | lm_eval/filters/extraction.py | 42 | |
| LOW | lm_eval/tasks/_factory.py | 127 | |
| LOW | lm_eval/tasks/realtoxicityprompts/metric.py | 12 | |
| LOW | lm_eval/tasks/evalita_llm/metrics.py | 49 | |
| LOW | lm_eval/tasks/evalita_llm/metrics.py | 63 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 11 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 30 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 91 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 193 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 246 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 439 | |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 526 | |
| LOW | lm_eval/tasks/simple_cooccurrence_bias/utils.py | 29 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 642 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 730 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 91 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 917 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 993 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 642 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 730 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 91 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 917 | |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 993 | |
| LOW | lm_eval/tasks/mgsm/utils.py | 131 | |
| LOW | lm_eval/tasks/chartqa/utils.py | 192 | |
| LOW | …asks/catalan_bench/flores_ca/create_yamls_flores_ca.py | 273 | |
| LOW | lm_eval/tasks/truthfulqa-multi/utils.py | 38 | |
| LOW | lm_eval/tasks/mmmu/utils.py | 105 | |
| LOW | lm_eval/tasks/mmmu/utils.py | 223 | |
| LOW | lm_eval/tasks/mmmu/utils.py | 316 | |
| LOW | lm_eval/tasks/mmmu/utils.py | 230 | |
| LOW | lm_eval/tasks/translation/utils.py | 41 | |
| LOW | lm_eval/tasks/aime/utils.py | 97 | |
| LOW | lm_eval/tasks/hendrycks_math/utils.py | 97 | |
| LOW | lm_eval/tasks/qasper/utils.py | 6 | |
| LOW | lm_eval/tasks/qasper/utils.py | 9 | |
| LOW | lm_eval/tasks/qasper/utils.py | 31 | |
| LOW | …s/portuguese_bench/flores_pt/create_yamls_flores_pt.py | 272 | |
| LOW | lm_eval/tasks/bbq/utils.py | 212 | |
| LOW | lm_eval/tasks/bbq/utils.py | 300 | |
| LOW | lm_eval/tasks/bbq/utils.py | 303 | |
| LOW | lm_eval/tasks/hrm8k/default/utils.py | 146 | |
| LOW | lm_eval/tasks/hrm8k/en/utils.py | 146 | |
| LOW | …asks/spanish_bench/flores_es/create_yamls_flores_es.py | 272 | |
| LOW | lm_eval/tasks/minerva_math/utils.py | 159 | |
| LOW | lm_eval/tasks/leaderboard/math/utils.py | 170 | |
| LOW | lm_eval/tasks/toksuite/utils.py | 423 | |
| LOW | lm_eval/tasks/toksuite/utils.py | 532 | |
| LOW | lm_eval/tasks/med_prescriptions/utils.py | 2178 | |
| LOW | lm_eval/tasks/med_prescriptions/utils.py | 2271 | |
| LOW | lm_eval/tasks/score/utils.py | 74 | |
| LOW | lm_eval/tasks/score/utils.py | 199 | |
| LOW | lm_eval/tasks/score/utils.py | 93 | |
| LOW | lm_eval/tasks/score/non_greedy_summarizer.py | 33 | |
| LOW | lm_eval/tasks/score/non_greedy_summarizer.py | 117 | |
| 106 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/result_schema.py | 21 | { |
| LOW | lm_eval/result_schema.py | 41 | # Per-task list of per-document sample results. |
| LOW | lm_eval/result_schema.py | 61 | "upper_git_hash": str | None, |
| LOW | lm_eval/result_schema.py | 81 | # Model source identifier (e.g. "hf"). |
| LOW | lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py | 61 | # bleurt_scores_true = self.bleurt.compute( |
| LOW | lm_eval/tasks/ifeval/instructions.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | lm_eval/tasks/ifeval/instructions_util.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | lm_eval/tasks/ifeval/instructions_registry.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | …val/tasks/ifeval/multilingual/instructions_registry.py | 1 | # Copyright 2024 The Google Research Authors. |
| LOW | …multilingual/instruction_utils/ca_instructions_util.py | 1 | # coding=utf-8 |
| LOW | …multilingual/instruction_utils/es_instructions_util.py | 1 | # coding=utf-8 |
| LOW | …ks/ifeval/multilingual/instructions/es_instructions.py | 1 | # coding=utf-8 |
| LOW | …ks/ifeval/multilingual/instructions/ca_instructions.py | 1 | # Copyright 2024 The Google Research Authors. |
| LOW | lm_eval/tasks/catalan_bench/truthfulqa_va/utils.py | 181 | # bleurt_scores_false = self.bleurt.compute( |
| LOW | lm_eval/tasks/truthfulqa-multi/utils.py | 81 | completion = results[0] |
| LOW | lm_eval/tasks/truthfulqa-multi/utils.py | 101 | bleu_scores = [bleu([[ref]], [completion]) for ref in all_refs] |
| LOW | lm_eval/tasks/truthfulqa-multi/utils.py | 121 | # rouge2_max = rouge2_correct |
| LOW | lm_eval/tasks/truthfulqa/utils.py | 61 | |
| LOW | lm_eval/tasks/longbench/metrics.py | 1 | # MIT License |
| LOW | lm_eval/tasks/longbench/_generate_config.py | 1 | # MIT License |
| LOW | lm_eval/tasks/leaderboard/ifeval/instructions.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | lm_eval/tasks/leaderboard/ifeval/instructions_util.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | …eval/tasks/leaderboard/ifeval/instructions_registry.py | 1 | # Copyright 2023 The Google Research Authors. |
| LOW | lm_eval/tasks/logiqa2/utils_logiqa2.py | 21 | # # https://github.com/csitfun/LogiQA2.0/blob/main/logiqa2nli/nli-prompt.py |
| LOW | lm_eval/tasks/score/utils.py | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | lm_eval/tasks/score/non_greedy_summarizer.py | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | lm_eval/tasks/score/mmlu_pro/utils_mmlu_pro.py | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …ore/math/prompt_robustness_math_counting_and_prob.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/tasks/score/math/prompt_robustness_math_geometry.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …ks/score/math/non_greedy_robustness_math_geometry.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/score/math/non_greedy_robustness_math_num_theory.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …l/tasks/score/math/prompt_robustness_math_precalc.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/score/math/non_greedy_robustness_math_prealgebra.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …asks/score/math/prompt_robustness_math_num_theory.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …math/non_greedy_robustness_math_counting_and_prob.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …asks/score/math/prompt_robustness_math_prealgebra.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/math/prompt_robustness_math_intermediate_algebra.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | lm_eval/tasks/score/math/math_grader.py | 1 | # Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved. |
| LOW | lm_eval/tasks/score/math/math_grader.py | 21 | # copies of the Software, and to permit persons to whom the Software is |
| LOW | lm_eval/tasks/score/math/math_grader.py | 41 | # copies of the Software, and to permit persons to whom the Software is |
| LOW | lm_eval/tasks/score/math/math_grader.py | 61 | # copies of the Software, and to permit persons to whom the Software is |
| LOW | …sks/score/math/non_greedy_robustness_math_precalc.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …h/non_greedy_robustness_math_intermediate_algebra.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/agi_eval/option_order_robustness_agieval_lsat_rc.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …score/agi_eval/prompt_robustness_agieval_lstat_lr.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …e/agi_eval/non_greedy_robustness_agieval_sat_math.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …agi_eval/option_order_robustness_agieval_sat_math.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …e/agi_eval/non_greedy_robustness_agieval_lstat_ar.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | lm_eval/tasks/score/agi_eval/utils_agieval.py | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/agi_eval/option_order_robustness_agieval_lsat_ar.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …s/score/agi_eval/prompt_robustness_agieval_sat_en.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …ore/agi_eval/non_greedy_robustness_agieval_sat_en.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/agi_eval/non_greedy_robustness_agieval_logiqa_en.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …e/agi_eval/non_greedy_robustness_agieval_lstat_lr.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …gi_eval/option_order_robustness_agieval_logiqa_en.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …score/agi_eval/prompt_robustness_agieval_lstat_ar.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …score/agi_eval/prompt_robustness_agieval_sat_math.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …re/agi_eval/non_greedy_robustness_agieval_lsat_rc.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …core/agi_eval/prompt_robustness_agieval_logiqa_en.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| LOW | …/score/agi_eval/prompt_robustness_agieval_lsat_rc.yaml | 1 | # Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 28 more matches not shown… | | | |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/tasks/_yaml_loader.py | 50 | # Check if this is a built-in task module |
| LOW | lm_eval/tasks/_yaml_loader.py | 69 | # Check if we need to reload the module |
| LOW | lm_eval/tasks/_yaml_loader.py | 72 | # Check if it was modified |
| LOW | lm_eval/tasks/evalita_llm/utils.py | 126 | if results: # Check if results is not empty |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 183 | # Check if new_plan is a plan |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 807 | # Check if the answer is equal (as a set) to the real stored answer |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 860 | # Check if the plan candidate from the answer (a) is a proper subsequence of the plan in the question and (b |
| LOW | lm_eval/tasks/acpbench/gen_2shot_with_pddl/acp_utils.py | 978 | # Check if the answer is equal as sets to the correct answers. |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 183 | # Check if new_plan is a plan |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 807 | # Check if the answer is equal (as a set) to the real stored answer |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 860 | # Check if the plan candidate from the answer (a) is a proper subsequence of the plan in the question and (b |
| LOW | lm_eval/tasks/acpbench/gen_2shot/acp_utils.py | 978 | # Check if the answer is equal as sets to the correct answers. |
| LOW | …ks/ifeval/multilingual/instructions/es_instructions.py | 1366 | # Check if the last character in value is a dot (.) |
| LOW | …ks/ifeval/multilingual/instructions/es_instructions.py | 1506 | # Check if all normalized, alphabetic characters are uppercase, ignoring non-alphabetic characters |
| LOW | …ks/ifeval/multilingual/instructions/ca_instructions.py | 1364 | # Check if the last character in value is a dot (.) |
| LOW | …ks/ifeval/multilingual/instructions/ca_instructions.py | 1504 | # Check if all normalized, alphabetic characters are uppercase, ignoring non-alphabetic characters |
| LOW | lm_eval/tasks/chartqa/utils.py | 211 | # Check if the number is in the model answer with commas (e.g. 1,000) |
| LOW | lm_eval/tasks/chartqa/utils.py | 214 | # Check if the number is in the model answer without commas (e.g. 1000) |
| LOW | lm_eval/tasks/graphwalks/utils.py | 42 | # Check if formatted correctly |
| LOW | lm_eval/tasks/aime/utils.py | 26 | # Check if answer matches target |
| LOW | lm_eval/tasks/bbq/utils.py | 224 | # Check if answer is "Not known" |
| LOW | lm_eval/tasks/med_prescriptions/utils.py | 2106 | # Check if the text contains any Indian script characters |
| LOW | lm_eval/tasks/arab_culture/utils_mcq.py | 17 | ### Set this to one to add the country and region information to the prompt |
| LOW | lm_eval/tasks/arab_culture/utils_mcq.py | 19 | ### Set this to one to add the region information to the prompt |
| LOW | lm_eval/tasks/arab_culture/utils_mcq.py | 21 | ### Set this to change between Arabic and English for the answer keys and the choices keys |
| LOW | lm_eval/tasks/jsonschema_bench/metrics.py | 28 | # Check if the schema is valid |
| LOW | lm_eval/tasks/afrobench/masakhaner/prompt_5/utils.py | 17 | if pair: # Check if the line is not empty |
| LOW | lm_eval/tasks/afrobench/masakhaner/prompt_2/utils.py | 17 | if pair: # Check if the line is not empty |
| LOW | lm_eval/tasks/afrobench/masakhaner/prompt_3/utils.py | 17 | if pair: # Check if the line is not empty |
| LOW | lm_eval/tasks/afrobench/masakhaner/prompt_4/utils.py | 17 | if pair: # Check if the line is not empty |
| LOW | lm_eval/tasks/afrobench/masakhaner/prompt_1/utils.py | 17 | if pair: # Check if the line is not empty |
| LOW | …eval/tasks/arab_culture_completion/utils_completion.py | 18 | ### Set this to one to add the country and region information to the prompt |
| LOW | …eval/tasks/arab_culture_completion/utils_completion.py | 20 | ### Set this to one to add the region information to the prompt |
| LOW | …eval/tasks/arab_culture_completion/utils_completion.py | 22 | ### Set this to change between Arabic and English for the answer keys and the choices keys |
| LOW | lm_eval/decontamination/decontaminate.py | 61 | # Check if we've decontaminated this combination before |
| LOW | lm_eval/models/winml.py | 326 | # Check if encoding empty string gives BOS token |
| LOW | lm_eval/models/winml.py | 556 | # Check if greedy (argmax matches actual token) |
| LOW | lm_eval/models/hf_vlms.py | 586 | # Check if per-token argmax is exactly equal to continuation |
| LOW | lm_eval/models/neuron_optimum.py | 542 | # Check if per-token argmax is exactly equal to continuation |
| LOW | lm_eval/models/huggingface.py | 1529 | # Check if per-token argmax is exactly equal to continuation |
| LOW | lm_eval/models/megatron_lm.py | 987 | # Check if greedy |
| LOW | lm_eval/_cli/run.py | 478 | # Print results |
| LOW | lm_eval/api/task.py | 1078 | # Check if answer is provided (handle a=0 as valid answer index) |
| LOW | tests/test_tasks.py | 28 | # Check if task_classes is empty |

| Severity | File | Line | Snippet |
|---|---|---|---|
| HIGH | lm_eval/models/winml.py | 388 | Run inference using ONNX Runtime GenAI to get full logits sequence. Args: input_text: Inpu |
| HIGH | lm_eval/models/ibm_watsonx_ai.py | 229 | Determines whether a stop token has been generated in the `response_tokens` compared to the `context_tokens`. |
| HIGH | lm_eval/models/utils.py | 280 | Generates and yields batches from the reordered array. The method of grouping and batching depends on the param |
| HIGH | lm_eval/models/utils.py | 504 | This function checks if the (Hugging Face) tokenizer has a padding token and sets it if not present. Some tokenizers req |
| HIGH | lm_eval/models/utils.py | 611 | Normalize generation kwargs for consistent handling across model backends. Model implementations may have different |
| HIGH | lm_eval/models/utils.py | 829 | Truncates input tokens and/or reduces max_gen_toks to fit within max_model_len. Strategy: 1. No truncation |
| HIGH | lm_eval/api/registry.py | 102 | Materialize a lazy placeholder into the actual object. This is at module level to avoid memory leaks from lru_cache |
| HIGH | lm_eval/api/registry.py | 188 | Register an object under one or more aliases. Can be used as a decorator or called directly for direct registra |
| HIGH | lm_eval/api/registry.py | 279 | Retrieve an object by alias, materializing if needed. Thread-safe lazy loading: if the alias points to a placeh |
| HIGH | lm_eval/api/registry.py | 492 | Get a model class by name. Args: model_name: The registered name of the model Returns: The mod |
| HIGH | lm_eval/api/registry.py | 546 | Get a filter by name. Args: filter_name: The registered name of the filter, or a callable Returns: |

| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | lm_eval/evaluator.py | 198 | # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1412 |
| MEDIUM | lm_eval/tasks/tinyBenchmarks/utils_truthfulqa.py | 160 | # init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const |
| MEDIUM | lm_eval/tasks/ifeval/instructions_util.py | 29 | # see https://github.com/EleutherAI/lm-evaluation-harness/issues/2210 |
| MEDIUM | lm_eval/tasks/aime/utils.py | 35 | # string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math |
| MEDIUM | lm_eval/tasks/hendrycks_math/utils.py | 35 | # string normalization from https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_math |
| MEDIUM | lm_eval/tasks/truthfulqa/utils.py | 164 | # init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const |
| LOW | lm_eval/tasks/bbq/utils.py | 65 | # If all elements are NaN, then we simply return NaN |
| MEDIUM | lm_eval/tasks/noreval/nortruthfulqa/generation/utils.py | 137 | # init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const |
| MEDIUM | lm_eval/tasks/noreval/norsumm/utils.py | 87 | # init RougeScorer once (https://github.com/EleutherAI/lm-evaluation-harness/issues/1692)--rouge_types are const |
| MEDIUM | lm_eval/tasks/minerva_math/utils.py | 28 | # https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py |
| LOW | lm_eval/tasks/longbench/_generate_config.py | 177 | # Now we just set a boolean flag to indicate whether we need a newline |
| MEDIUM | lm_eval/tasks/leaderboard/ifeval/instructions_util.py | 28 | # see https://github.com/EleutherAI/lm-evaluation-harness/issues/2210 |
| MEDIUM | lm_eval/tasks/leaderboard/math/utils.py | 25 | # https://github.com/wellecks/lm-evaluation-harness/blob/master/lm_eval/tasks/minerva_math.py |
| MEDIUM | lm_eval/tasks/cruxeval/utils.py | 242 | # lm-evaluation-harness Integration Functions |
| MEDIUM | lm_eval/models/openai_completions.py | 314 | "Loglikelihood (and therefore `multiple_choice`-type tasks) is not supported for chat completions as OpenAI |
| MEDIUM | lm_eval/models/huggingface.py | 1393 | # See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1678 |
| MEDIUM | lm_eval/models/sglang_causallms.py | 40 | # batch args from lm-eval interface: https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/interfa |
| LOW | lm_eval/models/megatron_lm.py | 837 | # We just pass through the requests without additional splitting |
| LOW | lm_eval/models/megatron_lm.py | 857 | # We just return results without additional gathering |
| MEDIUM | lm_eval/api/metrics.py | 614 | # See https://github.com/EleutherAI/lm-evaluation-harness/pull/1390 for more documentation. |

| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | lm_eval/tasks/slr_bench/lm_eval_slr_bench.py | 33 | # Create the reference in the required format |
| MEDIUM | lm_eval/tasks/toksuite/utils.py | 500 | # Create the summary row with column averages |
| MEDIUM | lm_eval/tasks/med_prescriptions/utils.py | 2101 | # Create a regular expression pattern for Indian scripts |
| MEDIUM | lm_eval/tasks/ruler/vt_utils.py | 75 | # Create a list of the repeated noise |
| MEDIUM | lm_eval/loggers/utils.py | 25 | # Define the pattern to match ',none' at the end of the string |
| MEDIUM | lm_eval/config/evaluate_config.py | 223 | # Create an instance and validate |
| MEDIUM | lm_eval/models/api_models.py | 269 | """This method is responsible for creating the json payload that will be sent to the API.""" |
| MEDIUM | tests/test_registry.py | 157 | # Create a class to test with |
| MEDIUM | tests/test_metrics.py | 12 | # Create a minimal config |
| MEDIUM | tests/test_cli_subcommands.py | 444 | # Create a minimal valid task yaml |
| MEDIUM | tests/test_cli_subcommands.py | 855 | # Create a YAML config file |
| MEDIUM | tests/test_task_manager.py | 529 | # Create a custom arc_easy.yaml that has a different metric |
| MEDIUM | tests/test_task_manager.py | 588 | # Create a custom task using a real dataset |
| MEDIUM | tests/test_task_manager.py | 640 | # Create a completely new task (not overriding any default) |
| MEDIUM | tests/models/test_vllm_context_length.py | 24 | # Create a mock VLLM instance with required attributes |
| MEDIUM | tests/models/test_vllm_context_length.py | 205 | # Create a mock request |
| MEDIUM | tests/scripts/test_zeno_visualize.py | 17 | # Define the process_model_args function that replicates the fixed logic in zeno_visualize.py |

| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | lm_eval/tasks/score/math/prompt_templates.json | 11 | "prompt": "You should solve this math problem.\nIf the problem is easy, provide a brief solution with little |
| LOW | lm_eval/tasks/infinitebench/utils.py | 367 | # Step 1: find last standalone A-D letter (official regex) |
| LOW | lm_eval/tasks/infinitebench/utils.py | 372 | # Step 2: empty prediction |
| LOW | lm_eval/tasks/infinitebench/utils.py | 376 | # Step 3: first character |
| LOW | lm_eval/tasks/infinitebench/utils.py | 380 | # Step 4: full prediction matches label letter |
| LOW | lm_eval/tasks/infinitebench/utils.py | 384 | # Step 5: replace punctuation, check prefixes (matching official chars) |
| LOW | lm_eval/tasks/infinitebench/utils.py | 395 | # Step 6: scan words for first A-D letter |
| LOW | lm_eval/tasks/infinitebench/utils.py | 430 | # Step 1: find last standalone A-J letter (official regex) |
| LOW | lm_eval/tasks/infinitebench/utils.py | 437 | # Step 2: replace chars and consolidate spaces (matching official) |
| LOW | lm_eval/tasks/infinitebench/utils.py | 447 | # Step 3: check startswith |
| LOW | lm_eval/tasks/infinitebench/utils.py | 453 | # Step 4: check answer prefixes (matching official set) |

| Severity | File | Line | Snippet |
|---|---|---|---|
| HIGH | lm_eval/tasks/bbq/utils.py | 75 | # Unfortunately, bias score for `n_non_unk = 0` is undefined, |

| Severity | File | Line | Snippet |
|---|---|---|---|
| HIGH | lm_eval/tasks/arabic_leaderboard_complete/README.md | 181 | * `arabic_leaderboard_acva`: Arabic-Culture-Value-Alignment (ACVA) is a yes/no question dataset, generated by GPT3.5 Tur |

| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | lm_eval/models/hf_vlms.py | 413 | |
| MEDIUM | lm_eval/models/hf_vlms.py | 414 | |
| MEDIUM | lm_eval/models/hf_vlms.py | 435 |