PDF Parser for AI-ready data. Automate PDF accessibility. Open-source.
225 matches across 13 categories.
| Severity | File | Line | Snippet |
|---|---|---|---|
| CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1246 | assertThat(c.getBoundingBox().getLeftX()).isEqualTo(expectedLeft, org.assertj.core.api.Assertions.within(0.01)); |
| CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1247 | assertThat(c.getBoundingBox().getRightX()).isEqualTo(expectedRight, org.assertj.core.api.Assertions.within(0.01) |
| CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1248 | assertThat(c.getBoundingBox().getTopY()).isEqualTo(expectedTopY, org.assertj.core.api.Assertions.within(0.01)); |
| CRITICAL | …taloader/pdf/hybrid/HancomAISchemaTransformerTest.java | 1249 | assertThat(c.getBoundingBox().getBottomY()).isEqualTo(expectedBottomY, org.assertj.core.api.Assertions.within(0. |
| CRITICAL | …ava/org/opendataloader/pdf/hybrid/OcrStrategyTest.java | 257 | assertThat(word.getBbox().getLeftX()).isCloseTo(24.0, org.assertj.core.api.Assertions.within(0.01)); |
| CRITICAL | …ava/org/opendataloader/pdf/hybrid/OcrStrategyTest.java | 258 | assertThat(word.getBbox().getRightX()).isCloseTo(72.0, org.assertj.core.api.Assertions.within(0.01)); |
| CRITICAL | …g/opendataloader/pdf/processors/DocumentProcessor.java | 614 | // org.verapdf.gf.model.impl.containers.StaticContainers.setFlavour(Collections.singletonList(PDFAFlavour.WCAG_2_ |
| CRITICAL | …g/opendataloader/pdf/processors/DocumentProcessor.java | 695 | org.verapdf.gf.model.impl.containers.StaticContainers.clearAllContainers(); |
| CRITICAL | …main/java/org/opendataloader/pdf/api/OutputWriter.java | 40 | * org.opendataloader.pdf.processors.DocumentProcessor.extractContents( |
| CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 833 | org.opendataloader.pdf.cli.CLIOptions.addAllTo(options); |
| CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 879 | org.opendataloader.pdf.cli.CLIOptions.applyAllTo(coreConfig, cmd); |
| CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 929 | if (org.opendataloader.pdf.api.Config.HYBRID_OFF.equals(c.getHybrid())) { |
| CRITICAL | …perpowers/plans/2026-04-29-hybrid-hancom-ai-options.md | 935 | if (org.opendataloader.pdf.api.Config.HYBRID_MODE_AUTO.equals(c.getHybridConfig().getMode())) { |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | …n/opendataloader-pdf/tests/test_convert_integration.py | 6 | def test_convert_generates_output(input_pdf, output_dir): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 47 | def test_defaults_preserve_prior_behavior(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 56 | def test_disable_ocr_sets_do_ocr_false(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 62 | def test_ocr_engine_tesseract_yields_tesseract_options(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 70 | def test_ocr_engine_rapidocr_yields_rapidocr_options(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 78 | def test_force_full_page_ocr_propagates_to_engine_options(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 86 | def test_ocr_lang_overrides_engine_default(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 94 | def test_psm_is_applied_to_tesseract(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 100 | def test_psm_is_ignored_for_non_tesseract_engines(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 107 | def test_ocr_engine_auto_yields_auto_options(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 115 | def test_unknown_engine_raises_value_error_with_clear_message(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 228 | def test_engine_check_easyocr_always_ok(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 235 | def test_engine_check_auto_always_ok(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 242 | def test_engine_check_unknown_kind_returns_false(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 257 | def test_engine_check_tesseract_missing_binary(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 266 | def test_engine_check_tesseract_present(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 274 | def test_engine_check_tesserocr_missing_package(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 283 | def test_engine_check_rapidocr_missing_package(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 292 | def test_engine_check_rapidocr_missing_onnxruntime(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 374 | def test_no_ocr_no_warning_when_engine_left_default(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 382 | def test_no_ocr_warns_when_ocr_lang_set(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 389 | def test_no_ocr_warns_when_psm_set(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 396 | def test_no_ocr_alone_emits_no_warning(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 26 | def _capture_pipeline_options(**kwargs): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 129 | def test_create_converter_rejects_denylisted_engine(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 186 | def test_argparse_no_ocr_and_force_ocr_are_mutually_exclusive(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 194 | def test_argparse_kserve_engine_is_rejected(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 213 | def test_argparse_psm_accepted_as_integer(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 306 | def test_engine_check_ocrmac_off_macos(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 314 | def test_engine_check_ocrmac_on_macos_missing_package(): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 352 | def test_no_ocr_warns_when_engine_explicitly_set(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 361 | def test_no_ocr_warns_when_engine_explicitly_set_to_easyocr(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 409 | def test_main_exits_when_tesseract_binary_missing(monkeypatch, caplog): |
| LOW | …dataloader-pdf/tests/test_hybrid_server_ocr_options.py | 428 | def test_main_skips_engine_check_when_no_ocr(monkeypatch, caplog): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 32 | def test_partial_success_status(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 45 | def test_partial_success_multiple_failed_pages(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 60 | def test_partial_success_no_page_range_with_total_pages(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 74 | def test_partial_success_no_page_range_fallback(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 86 | def test_success_no_errors_field(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 97 | def test_document_field_present(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 109 | def test_partial_success_first_page_failed_with_page_range(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 120 | def test_partial_success_last_page_failed_with_page_range(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 131 | def test_partial_success_all_pages_failed(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 143 | def test_partial_success_all_pages_failed_with_total_pages(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 156 | def test_failure_status_no_failed_pages_detection(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 168 | def test_partial_success_missing_pages_key(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 184 | def test_std_bad_alloc_errors(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 222 | def test_failed_pages_with_empty_entries_in_pages_dict(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 233 | def test_boundary_pages_detected_via_errors(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 245 | def test_both_strategies_combined(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 260 | def test_overlap_between_gap_and_error_is_deduplicated(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 271 | def test_duplicate_page_in_errors(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 279 | def test_no_page_pattern_errors_falls_back_to_gap(self): |
| LOW | …loader-pdf/tests/test_hybrid_server_partial_success.py | 290 | def test_empty_errors_with_partial_success(self): |
| LOW | python/opendataloader-pdf/tests/test_runner.py | 40 | def test_streaming_failure_does_not_duplicate_output(monkeypatch, capsys, patched_jar): |
| LOW | python/opendataloader-pdf/tests/test_runner.py | 68 | def test_quiet_failure_prints_captured_streams_once(monkeypatch, capsys, patched_jar): |
| LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 22 | def _capture_pipeline_options(**kwargs): |
| LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 43 | def test_custom_prompt_is_forwarded_to_vlm_options(): |
| LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 56 | def test_default_prompt_is_preserved_when_user_omits_flag(): |
| LOW | …er-pdf/tests/test_hybrid_server_picture_description.py | 72 | def test_blank_prompt_falls_back_to_default(blank): |
| 54 more matches not shown… | | | |
| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | scripts/build-all.sh | 10 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 12 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 17 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 19 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 35 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 37 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 48 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 50 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 61 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 63 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 74 | # ================================================================= |
| MEDIUM | scripts/build-all.sh | 76 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 19 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 21 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 64 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 66 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 70 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 72 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 96 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 98 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 125 | # ================================================================= |
| MEDIUM | .github/workflows/release.yml | 127 | # ================================================================= |
| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 35 | "markdown": "where soas below some threshold cannot be recovered, so that an observer can only guess about order.$ |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 658 | "markdown": "This book's approach is premised on a simple assumption: because behavioral economics is foremost a |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1043 | "markdown": "<!-- image -->\n\n<!-- image -->\n\norganizations to navigate successfully the global digital economy |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1267 | "markdown": "## Promotional Materials\n\nA good promotional strategy should include multiple facets, from |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1274 | "markdown": "Figure 12.2. A set of open textbooks printed in bulk are featured in this photo. Open textbooks from |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1323 | "markdown": "Figure 1: Depth up-scaling for the case with n = 32 , s = 48 , and m = 8 . Depth up-scaling is achiev |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1337 | "markdown": "Table 2: Evaluation results for SOLAR 10.7B and SOLAR 10.7B-Instruct along with other top-performing |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1344 | "markdown": "| Model | Alpaca-GPT4 | OpenOrca | Synth. Math-Instruct | H6 (Avg.) | ARC | HellaSw |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1358 | "markdown": "## Acknowledgements\n\nWe would like to extend our gratitude to the teams at Hugging Face, particular |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1358 | "markdown": "## Acknowledgements\n\nWe would like to extend our gratitude to the teams at Hugging Face, particular |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1386 | "markdown": "## A Contributions\n\nThe contributions of this study are as follows:\n\n- Introduction of the SOLAR |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1386 | "markdown": "## A Contributions\n\nThe contributions of this study are as follows:\n\n- Introduction of the SOLAR |
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 1393 | "markdown": "plexity when compared to MoE. This shift in approach offers a unique and more straightforward way of |
| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | …on/opendataloader-pdf/src/opendataloader_pdf/runner.py | 79 | print("Error running opendataloader-pdf CLI.", file=sys.stderr) |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 159 | except Exception: |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 700 | except Exception as e: |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 794 | except Exception as e: |
| MEDIUM | …ents/chunking_strategy/docling_page_range_benchmark.py | 215 | print(f"Error: PDF not found at {pdf_path}") |
| LOW | scripts/experiments/docling_baseline_bench.py | 87 | except Exception as e: |
| LOW | scripts/experiments/docling_fastapi_bench.py | 100 | except Exception: |
| LOW | scripts/experiments/docling_fastapi_bench.py | 193 | except Exception as e: |
| LOW | scripts/experiments/docling_subprocess_bench.py | 103 | except Exception as e: |
| LOW | scripts/experiments/docling_subprocess_bench.py | 114 | except Exception as e: |
| LOW | scripts/experiments/docling_subprocess_bench.py | 231 | except Exception as e: |
| LOW | build-scripts/fetch_shaded_jar.py | 44 | except Exception: |
| MEDIUM | build-scripts/set_version.py | 36 | print(f"Error: VERSION file not found at {version_path}") |
| MEDIUM | build-scripts/set_version.py | 39 | print(f"Error: Java pom.xml not found at {java_pom_path}") |
| MEDIUM | build-scripts/set_version.py | 42 | print(f"Error: Python pyproject.toml not found at {python_pyproject_path}") |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 197 | // Step 1: DLA + OCR. This is required — downstream steps have nothing |
| LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 210 | // Step 2: Table Structure — crop each Table region from page image, send to TSR individually |
| LOW | …java/org/opendataloader/pdf/hybrid/HancomAIClient.java | 224 | // Step 3: Figure captioning — pdf2img → crop figures → caption each |
| LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 133 | // Step 1: Upload PDF |
| LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 137 | // Step 2: Get visual info |
| LOW | …n/java/org/opendataloader/pdf/hybrid/HancomClient.java | 143 | // Step 3: Always cleanup |
| LOW | scripts/bench.sh | 46 | # Step 1: Build Java if needed |
| LOW | scripts/bench.sh | 57 | # Step 2: Clone or update bench repo |
| LOW | scripts/bench.sh | 66 | # Step 3: Find JAR path |
| LOW | scripts/bench.sh | 73 | # Step 4: Run benchmark with JAR |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 4 | |
| LOW | …opendataloader-pdf/tests/test_hybrid_server_unicode.py | 10 | |
| LOW | python/opendataloader-pdf/tests/test_cli_options.py | 3 | |
| LOW | …n/opendataloader-pdf/src/opendataloader_pdf/wrapper.py | 9 | |
| LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 | |
| LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 | |
| LOW | …/opendataloader-pdf/src/opendataloader_pdf/__init__.py | 1 | |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 292 | |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 296 | |
| LOW | …r-pdf-core/src/test/resources/generate-cid-test-pdf.py | 26 | |
| LOW | …ents/chunking_strategy/docling_page_range_benchmark.py | 17 | |
| LOW | examples/python/batch/batch_processing.py | 16 | |
| LOW | scripts/experiments/docling_fastapi_bench.py | 46 | |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 8 | |
| LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 38 | |
| LOW | python/opendataloader-pdf/tests/test_hybrid_server.py | 57 | |
| LOW | …on/opendataloader-pdf/src/opendataloader_pdf/runner.py | 13 | |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 165 | |
| LOW | …dataloader-pdf/src/opendataloader_pdf/hybrid_server.py | 807 | |
| LOW | …r-pdf-core/src/test/resources/generate-cid-test-pdf.py | 53 | |
| LOW | examples/python/rag/basic_chunking.py | 63 | |
| LOW | scripts/experiments/docling_subprocess_bench.py | 165 | |
| LOW | build-scripts/fetch_shaded_jar.py | 19 | |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | …rg/opendataloader/pdf/processors/CaptionProcessor.java | 81 | imageNode.setRecognizedStructureId(content.getRecognizedStructureId()); |
| LOW | …endataloader/pdf/processors/ClusterTableProcessor.java | 81 | // ClusterTableConsumer clusterTableConsumer = new ClusterTableConsumer(); |
| LOW | …endataloader/pdf/processors/ClusterTableProcessor.java | 101 | //// } |
| LOW | …pendataloader/pdf/processors/AutoTaggingProcessor.java | 781 | // alt/alt_source schema (alt absent ↔ alt_source=missing), but the |
| LOW | …ava/org/opendataloader/pdf/hybrid/TriageProcessor.java | 681 | return TriageResult.backend(pageNumber, 0.85, signals); |
| LOW | scripts/run-cli.sh | 1 | #!/bin/bash |
| LOW | scripts/bench.sh | 1 | #!/usr/bin/env bash |
| LOW | scripts/build-all.sh | 1 | #!/bin/bash |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | samples/json/lorem.json | 18 | "content" : "Lorem Ipsum" |
| LOW | samples/json/lorem.json | 27 | "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et |
| LOW | samples/json/lorem.json | 27 | "content" : "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et |
| LOW | node/opendataloader-pdf/test/executeJar.unit.test.ts | 71 | await expect(promise).resolves.toBe('Lorem ipsum dolor'); |
| Severity | File | Line | Snippet |
|---|---|---|---|
| MEDIUM | docs/hybrid/experiments/speed/subprocess_results.json | 840 | "markdown": "## Saccharometer DI Water Glucose Solution Yeast Suspension\n\n| 24 ml | 0 ml | 4 ml |\n|------ |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | scripts/run-cli.sh | 22 | # Check if Java is installed |
| Severity | File | Line | Snippet |
|---|---|---|---|
| LOW | scripts/bench.sh | 6 | # Usage: |