Repository Analysis

opendataloader-project/opendataloader-pdf

PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.

7.1 (Low AI signal)

Adjusted Score: 7.1
Raw Score: 7.1
Time Factor: 100%
Last Push: 2026-05-29
Stars: 21,814
Language: Java
Lines of Code: 67,146
Files: 298
Pattern Hits: 225
Scan Date: 2026-05-31

Severity Breakdown

CRITICAL 13 · HIGH 0 · MEDIUM 41 · LOW 171

Pattern Findings

225 matches across 13 categories.
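The headline figures can be cross-checked against the per-category headings. The sketch below copies the hit and point counts from this report and confirms that they add up to the stated 225 matches across 13 categories. The variable names are illustrative, and how the points relate to the 7.1 score is not stated in the report, so the sketch does not attempt to derive it.

```python
# (hits, points) per category, copied from the category headings in this report.
CATEGORIES = {
    "Hallucination Indicators": (13, 160),
    "Hyper-Verbose Identifiers": (114, 128),
    "Decorative Section Separators": (22, 78),
    "AI Slop Vocabulary": (13, 29),
    "Excessive Try-Catch Wrapping": (15, 22),
    "Verbosity Indicators": (10, 20),
    "Unused Imports": (13, 13),
    "Deep Nesting": (10, 10),
    "Over-Commented Block": (8, 8),
    "Fake / Example Data": (4, 6),
    "Slop Phrases": (1, 2),
    "Redundant / Tautological Comments": (1, 2),
    "Example Usage Blocks": (1, 2),
}

# Counts from the Severity Breakdown line.
SEVERITY = {"CRITICAL": 13, "HIGH": 0, "MEDIUM": 41, "LOW": 171}

total_hits = sum(hits for hits, _ in CATEGORIES.values())
total_points = sum(points for _, points in CATEGORIES.values())

print(len(CATEGORIES), total_hits, sum(SEVERITY.values()), total_points)
# prints: 13 225 225 480
```

Both the category hits and the severity counts sum to 225, which matches the Pattern Hits figure.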

Hallucination Indicators: 13 hits · 160 pts
Severity | File | Line | Snippet
CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1246 | assertThat(c.getBoundingBox().getLeftX()).isEqualTo(expectedLeft, org.assertj.core.api.Assertions.within(0.01));
CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1247 | assertThat(c.getBoundingBox().getRightX()).isEqualTo(expectedRight, org.assertj.core.api.Assertions.within(0.01)
CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1248 | assertThat(c.getBoundingBox().getTopY()).isEqualTo(expectedTopY, org.assertj.core.api.Assertions.within(0.01));
CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1249 | assertThat(c.getBoundingBox().getBottomY()).isEqualTo(expectedBottomY, org.assertj.core.api.Assertions.within(0.
CRITICAL | …ava/org/opendataloader/pdf/hybrid/OcrStrategyTest.java | 257 | assertThat(word.getBbox().getLeftX()).isCloseTo(24.0, org.assertj.core.api.Assertions.within(0.01));
CRITICAL | …ava/org/opendataloader/pdf/hybrid/OcrStrategyTest.java | 258 | assertThat(word.getBbox().getRightX()).isCloseTo(72.0, org.assertj.core.api.Assertions.within(0.01));
CRITICAL | …g/opendataloader/pdf/processors/DocumentProcessor.java | 614 | // org.verapdf.gf.model.impl.containers.StaticContainers.setFlavour(Collections.singletonList(PDFAFlavour.WCAG_2_
CRITICAL | …g/opendataloader/pdf/processors/DocumentProcessor.java | 695 | org.verapdf.gf.model.impl.containers.StaticContainers.clearAllContainers();
CRITICAL | …main/java/org/opendataloader/pdf/api/OutputWriter.java | 40 | * org.opendataloader.pdf.processors.DocumentProcessor.extractContents(
CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 833 | org.opendataloader.pdf.cli.CLIOptions.addAllTo(options);
CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 879 | org.opendataloader.pdf.cli.CLIOptions.applyAllTo(coreConfig, cmd);
CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 929 | if (org.opendataloader.pdf.api.Config.HYBRID_OFF.equals(c.getHybrid())) {
CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 935 | if (org.opendataloader.pdf.api.Config.HYBRID_MODE_AUTO.equals(c.getHybridConfig().getMode())) {
Hyper-Verbose Identifiers: 114 hits · 128 pts
Severity | File | Line | Snippet
LOW | …n/opendataloader-pdf/tests/test_convert_integration.py | 6 | def test_convert_generates_output(input_pdf, output_dir):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 47 | def test_defaults_preserve_prior_behavior():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 56 | def test_disable_ocr_sets_do_ocr_false():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 62 | def test_ocr_engine_tesseract_yields_tesseract_options():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 70 | def test_ocr_engine_rapidocr_yields_rapidocr_options():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 78 | def test_force_full_page_ocr_propagates_to_engine_options():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 86 | def test_ocr_lang_overrides_engine_default():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 94 | def test_psm_is_applied_to_tesseract():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 100 | def test_psm_is_ignored_for_non_tesseract_engines():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 107 | def test_ocr_engine_auto_yields_auto_options():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 115 | def test_unknown_engine_raises_value_error_with_clear_message():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 228 | def test_engine_check_easyocr_always_ok():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 235 | def test_engine_check_auto_always_ok():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 242 | def test_engine_check_unknown_kind_returns_false():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 257 | def test_engine_check_tesseract_missing_binary():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 266 | def test_engine_check_tesseract_present():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 274 | def test_engine_check_tesserocr_missing_package():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 283 | def test_engine_check_rapidocr_missing_package():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 292 | def test_engine_check_rapidocr_missing_onnxruntime():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 374 | def test_no_ocr_no_warning_when_engine_left_default(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 382 | def test_no_ocr_warns_when_ocr_lang_set(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 389 | def test_no_ocr_warns_when_psm_set(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 396 | def test_no_ocr_alone_emits_no_warning(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 26 | def _capture_pipeline_options(**kwargs):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 129 | def test_create_converter_rejects_denylisted_engine():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 186 | def test_argparse_no_ocr_and_force_ocr_are_mutually_exclusive():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 194 | def test_argparse_kserve_engine_is_rejected():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 213 | def test_argparse_psm_accepted_as_integer():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 306 | def test_engine_check_ocrmac_off_macos():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 314 | def test_engine_check_ocrmac_on_macos_missing_package():
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 352 | def test_no_ocr_warns_when_engine_explicitly_set(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 361 | def test_no_ocr_warns_when_engine_explicitly_set_to_easyocr(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 409 | def test_main_exits_when_tesseract_binary_missing(monkeypatch, caplog):
LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 428 | def test_main_skips_engine_check_when_no_ocr(monkeypatch, caplog):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 32 | def test_partial_success_status(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 45 | def test_partial_success_multiple_failed_pages(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 60 | def test_partial_success_no_page_range_with_total_pages(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 74 | def test_partial_success_no_page_range_fallback(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 86 | def test_success_no_errors_field(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 97 | def test_document_field_present(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 109 | def test_partial_success_first_page_failed_with_page_range(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 120 | def test_partial_success_last_page_failed_with_page_range(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 131 | def test_partial_success_all_pages_failed(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 143 | def test_partial_success_all_pages_failed_with_total_pages(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 156 | def test_failure_status_no_failed_pages_detection(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 168 | def test_partial_success_missing_pages_key(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 184 | def test_std_bad_alloc_errors(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 222 | def test_failed_pages_with_empty_entries_in_pages_dict(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 233 | def test_boundary_pages_detected_via_errors(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 245 | def test_both_strategies_combined(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 260 | def test_overlap_between_gap_and_error_is_deduplicated(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 271 | def test_duplicate_page_in_errors(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 279 | def test_no_page_pattern_errors_falls_back_to_gap(self):
LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 290 | def test_empty_errors_with_partial_success(self):
LOW | python/opendataloader-pdf/tests/test_runner.py | 40 | def test_streaming_failure_does_not_duplicate_output(monkeypatch, capsys, patched_jar):
LOW | python/opendataloader-pdf/tests/test_runner.py | 68 | def test_quiet_failure_prints_captured_streams_once(monkeypatch, capsys, patched_jar):
LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 22 | def _capture_pipeline_options(**kwargs):
LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 43 | def test_custom_prompt_is_forwarded_to_vlm_options():
LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 56 | def test_default_prompt_is_preserved_when_user_omits_flag():
LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 72 | def test_blank_prompt_falls_back_to_default(blank):
54 more matches not shown…
Decorative Section Separators: 22 hits · 78 pts
Severity | File | Line | Snippet
MEDIUM | scripts/build-all.sh | 10 | # =================================================================
MEDIUM | scripts/build-all.sh | 12 | # =================================================================
MEDIUM | scripts/build-all.sh | 17 | # =================================================================
MEDIUM | scripts/build-all.sh | 19 | # =================================================================
MEDIUM | scripts/build-all.sh | 35 | # =================================================================
MEDIUM | scripts/build-all.sh | 37 | # =================================================================
MEDIUM | scripts/build-all.sh | 48 | # =================================================================
MEDIUM | scripts/build-all.sh | 50 | # =================================================================
MEDIUM | scripts/build-all.sh | 61 | # =================================================================
MEDIUM | scripts/build-all.sh | 63 | # =================================================================
MEDIUM | scripts/build-all.sh | 74 | # =================================================================
MEDIUM | scripts/build-all.sh | 76 | # =================================================================
MEDIUM | .github/workflows/release.yml | 19 | # =================================================================
MEDIUM | .github/workflows/release.yml | 21 | # =================================================================
MEDIUM | .github/workflows/release.yml | 64 | # =================================================================
MEDIUM | .github/workflows/release.yml | 66 | # =================================================================
MEDIUM | .github/workflows/release.yml | 70 | # =================================================================
MEDIUM | .github/workflows/release.yml | 72 | # =================================================================
MEDIUM | .github/workflows/release.yml | 96 | # =================================================================
MEDIUM | .github/workflows/release.yml | 98 | # =================================================================
MEDIUM | .github/workflows/release.yml | 125 | # =================================================================
MEDIUM | .github/workflows/release.yml | 127 | # =================================================================
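For readers who want to reproduce this category locally, here is a minimal sketch of one way such lines could be detected. The report does not describe the scanner's actual rule, so the regular expression, the threshold of 10 repeated characters, and the name `find_separators` are all assumptions.

```python
import re

# Hypothetical rule: a comment marker (# or //) followed only by a run of at
# least 10 copies of one decorative character. The threshold is an assumption.
SEPARATOR = re.compile(r"^\s*(?:#|//)\s*([=\-*#~])\1{9,}\s*$")


def find_separators(text):
    """Return the 1-based line numbers that look like decorative separators."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if SEPARATOR.match(line)
    ]


SAMPLE = "\n".join([
    "#!/bin/bash",
    "# " + "=" * 65,
    "# Build Java",
    "# " + "=" * 65,
    "echo hi",
])

print(find_separators(SAMPLE))  # prints [2, 4]
```

The hits above come in pairs two lines apart (10 and 12, 17 and 19, and so on), which suggests each pair frames a single heading comment. That pairing is visible in the line numbers and is not something the sketch checks.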
AI Slop Vocabulary: 13 hits · 29 pts
Severity | File | Line | Snippet
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 35 | "markdown": "where soas below some threshold cannot be recovered, so that an observer can only guess about order.$
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 658 | "markdown": "This book's approach is premised on a simple assumption: because behavioral economics is foremost a
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1043 | "markdown": "<!-- image -->\n\n<!-- image -->\n\norganizations to navigate successfully the global digital economy
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1267 | "markdown": "## Promotional Materials\n\nA good promotional strategy should include multiple facets, from
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1274 | "markdown": "Figure 12.2. A set of open textbooks printed in bulk are featured in this photo. Open textbooks from
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1323 | "markdown": "Figure 1: Depth up-scaling for the case with n = 32 , s = 48 , and m = 8 . Depth up-scaling is achiev
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1337 | "markdown": "Table 2: Evaluation results for SOLAR 10.7B and SOLAR 10.7B-Instruct along with other top-performing
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1344 | "markdown": "| Model | Alpaca-GPT4 | OpenOrca | Synth. Math-Instruct | H6 (Avg.) | ARC | HellaSw
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1358 | "markdown": "## Acknowledgements\n\nWe would like to extend our gratitude to the teams at Hugging Face, particular
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1358 | "markdown": "## Acknowledgements\n\nWe would like to extend our gratitude to the teams at Hugging Face, particular
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1386 | "markdown": "## A Contributions\n\nThe contributions of this study are as follows:\n\n- Introduction of the SOLAR
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1386 | "markdown": "## A Contributions\n\nThe contributions of this study are as follows:\n\n- Introduction of the SOLAR
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1393 | "markdown": "plexity when compared to MoE. This shift in approach offers a unique and more straightforward way of
Excessive Try-Catch Wrapping: 15 hits · 22 pts
Severity | File | Line | Snippet
MEDIUM | …on/opendataloader-pdf/src/opendataloader_pdf/runner.py | 79 | print("Error running opendataloader-pdf CLI.", file=sys.stderr)
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 159 | except Exception:
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 700 | except Exception as e:
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 794 | except Exception as e:
MEDIUM | …ents/chunking_strategy/docling_page_range_benchmark.py | 215 | print(f"Error: PDF not found at {pdf_path}")
LOW | scripts/experiments/docling_baseline_bench.py | 87 | except Exception as e:
LOW | scripts/experiments/docling_fastapi_bench.py | 100 | except Exception:
LOW | scripts/experiments/docling_fastapi_bench.py | 193 | except Exception as e:
LOW | scripts/experiments/docling_subprocess_bench.py | 103 | except Exception as e:
LOW | scripts/experiments/docling_subprocess_bench.py | 114 | except Exception as e:
LOW | scripts/experiments/docling_subprocess_bench.py | 231 | except Exception as e:
LOW | build-scripts/fetch_shaded_jar.py | 44 | except Exception:
MEDIUM | build-scripts/set_version.py | 36 | print(f"Error: VERSION file not found at {version_path}")
MEDIUM | build-scripts/set_version.py | 39 | print(f"Error: Java pom.xml not found at {java_pom_path}")
MEDIUM | build-scripts/set_version.py | 42 | print(f"Error: Python pyproject.toml not found at {python_pyproject_path}")
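The LOW rows in this category all quote an `except Exception` clause. As a rough illustration, the sketch below uses Python's standard `ast` module to list such handlers in a source string. It is not the scanner's implementation: the function name `broad_handlers` is hypothetical, and the sketch does not cover the MEDIUM rows, which quote `print("Error ...")` lines and presumably come from a different rule.

```python
import ast


def broad_handlers(source):
    """Return line numbers of bare `except:` and `except Exception` handlers."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        is_bare = node.type is None
        is_broad = (
            isinstance(node.type, ast.Name)
            and node.type.id in {"Exception", "BaseException"}
        )
        if is_bare or is_broad:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = '''\
try:
    risky()
except Exception as e:
    print(e)
try:
    open("x")
except FileNotFoundError:
    pass
'''

print(broad_handlers(SAMPLE))  # prints [3]
```

Only the first handler is reported. The second names a specific exception type, which is the usual way to narrow a handler when the failure mode is known.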
Verbosity Indicators: 10 hits · 20 pts
Severity | File | Line | Snippet
LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 197 | // Step 1: DLA + OCR. This is required — downstream steps have nothing
LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 210 | // Step 2: Table Structure — crop each Table region from page image, send to TSR individually
LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 224 | // Step 3: Figure captioning — pdf2img → crop figures → caption each
LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 133 | // Step 1: Upload PDF
LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 137 | // Step 2: Get visual info
LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 143 | // Step 3: Always cleanup
LOW | scripts/bench.sh | 46 | # Step 1: Build Java if needed
LOW | scripts/bench.sh | 57 | # Step 2: Clone or update bench repo
LOW | scripts/bench.sh | 66 | # Step 3: Find JAR path
LOW | scripts/bench.sh | 73 | # Step 4: Run benchmark with JAR
Unused Imports: 13 hits · 13 pts
Severity | File | Line | Snippet
LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 4 |
LOW | …opendataloader-pdf/tests/test_hybrid_server_unicode.py | 10 |
LOW | python/opendataloader-pdf/tests/test_cli_options.py | 3 |
LOW | …n/opendataloader-pdf/src/opendataloader_pdf/wrapper.py | 9 |
LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 |
LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 |
LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 |
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 292 |
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 296 |
LOW | …r-pdf-core/src/test/resources/generate-cid-test-pdf.py | 26 |
LOW | …ents/chunking_strategy/docling_page_range_benchmark.py | 17 |
LOW | examples/python/batch/batch_processing.py | 16 |
LOW | scripts/experiments/docling_fastapi_bench.py | 46 |
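The rows in this category carry no snippet, so it can help to see what a naive unused-import check looks like. The sketch below, again using the standard `ast` module, reports imported names that are never referenced in the same file. It is an illustration under stated assumptions, not the scanner's rule: `unused_imports` is a hypothetical name, and the check ignores `__all__`, string references, and re-exports.

```python
import ast


def unused_imports(source):
    """Return sorted (line, name) pairs for imports never referenced by name."""
    tree = ast.parse(source)
    imported = {}  # bound name -> line number of the import statement
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # `import a.b` binds the top-level name `a`.
                name = alias.asname or alias.name.split(".")[0]
                imported[name] = node.lineno
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported[alias.asname or alias.name] = node.lineno
    used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    return sorted((line, name) for name, line in imported.items() if name not in used)


SAMPLE = '''\
import os
import sys
from json import dumps, loads
print(sys.argv, dumps({}))
'''

print(unused_imports(SAMPLE))  # prints [(1, 'os'), (3, 'loads')]
```

A check of this kind flags every name imported in a package's `__init__.py` that is not used there, even when the import exists to re-export the name. The three hits on line 1 of `__init__.py` may be of that kind, but the report does not show the line, so that is a guess.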
Deep Nesting: 10 hits · 10 pts
Severity | File | Line | Snippet
LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 8 |
LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 38 |
LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 57 |
LOW | …on/opendataloader-pdf/src/opendataloader_pdf/runner.py | 13 |
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 165 |
LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 807 |
LOW | …r-pdf-core/src/test/resources/generate-cid-test-pdf.py | 53 |
LOW | examples/python/rag/basic_chunking.py | 63 |
LOW | scripts/experiments/docling_subprocess_bench.py | 165 |
LOW | build-scripts/fetch_shaded_jar.py | 19 |
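The report does not state what depth counts as "deep" or which constructs are counted. As a hedged sketch, the function below measures the maximum nesting of `if`, `for`, `while`, `with`, and `try` blocks in a Python source string. The choice of constructs and the name `max_depth` are assumptions.

```python
import ast

# Statement types treated as one nesting level each (an assumption).
NESTING = (ast.If, ast.For, ast.While, ast.With, ast.Try)


def max_depth(node, depth=0):
    """Return the deepest nesting level of NESTING statements under `node`."""
    here = depth + 1 if isinstance(node, NESTING) else depth
    children = [max_depth(child, here) for child in ast.iter_child_nodes(node)]
    return max([here] + children)


SAMPLE = '''\
for a in b:
    if a:
        with c:
            pass
'''

print(max_depth(ast.parse(SAMPLE)))  # prints 3
```

One caveat: Python's AST represents an `elif` as an `if` nested in the `orelse` of its parent, so long `elif` chains inflate this count unless they are handled separately.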
Over-Commented Block: 8 hits · 8 pts
Severity | File | Line | Snippet
LOW | …rg/opendataloader/pdf/processors/CaptionProcessor.java | 81 | imageNode.setRecognizedStructureId(content.getRecognizedStructureId());
LOW | …endataloader/pdf/processors/ClusterTableProcessor.java | 81 | // ClusterTableConsumer clusterTableConsumer = new ClusterTableConsumer();
LOW | …endataloader/pdf/processors/ClusterTableProcessor.java | 101 | //// }
LOW | …pendataloader/pdf/processors/AutoTaggingProcessor.java | 781 | // alt/alt_source schema (alt absent ↔ alt_source=missing), but the
LOW | …ava/org/opendataloader/pdf/hybrid/TriageProcessor.java | 681 | return TriageResult.backend(pageNumber, 0.85, signals);
LOW | scripts/run-cli.sh | 1 | #!/bin/bash
LOW | scripts/bench.sh | 1 | #!/usr/bin/env bash
LOW | scripts/build-all.sh | 1 | #!/bin/bash
Fake / Example Data: 4 hits · 6 pts
Severity | File | Line | Snippet
LOW | samples/json/lorem.json | 18 | "content" : "Lorem Ipsum"
LOW | samples/json/lorem.json | 27 | "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et
LOW | samples/json/lorem.json | 27 | "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et
LOW | node/opendataloader-pdf/test/executeJar.unit.test.ts | 71 | await expect(promise).resolves.toBe('Lorem ipsum dolor');
Slop Phrases: 1 hit · 2 pts
Severity | File | Line | Snippet
MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 840 | "markdown": "## Saccharometer DI Water Glucose Solution Yeast Suspension\n\n| 24 ml | 0 ml | 4 ml |\n|------
Redundant / Tautological Comments: 1 hit · 2 pts
Severity | File | Line | Snippet
LOW | scripts/run-cli.sh | 22 | # Check if Java is installed
Example Usage Blocks: 1 hit · 2 pts
Severity | File | Line | Snippet
LOW | scripts/bench.sh | 6 | # Usage: