feat: Support instance caching for docling by erichare · Pull Request #11030 · langflow-ai/langflow (original) (raw)

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes introduce LRU caching for DocumentConverter creation in docling_utils to improve performance by reusing converters across runs, and migrate docling_inline from multiprocessing to threading-based worker management with updated queue handling and helper function signatures.

Changes

Cohort / File(s)	Summary
Converter Caching src/lfx/src/lfx/base/data/docling_utils.py	Adds _get_cached_converter() function with LRU cache decorator to reuse DocumentConverter instances across calls, avoiding repeated model loading. Includes fallback non-cached path when picture description config is present. Updates imports to include lru_cache.
Threading Migration src/lfx/src/lfx/components/docling/docling_inline.py	Replaces multiprocessing-based worker management with threading. Renames helper methods: _wait_for_result_with_process_monitoring() → _wait_for_result_with_thread_monitoring() and _terminate_process_gracefully() → _stop_thread_gracefully(). Updates queue handling to use queue.Queue and thread monitoring logic. Adjusts imports: adds queue and threading, removes multiprocessing imports.
Caching Tests src/lfx/tests/unit/base/data/test_docling_utils.py	Introduces TestDocumentConverterCaching class with seven test methods validating cache existence, key-based hits/misses, LRU eviction, performance improvement, cache clearing, and cache differentiation by OCR engine and pipeline parameters.

Sequence Diagram(s)

sequenceDiagram participant Caller participant DoclingInline participant ThreadPool as Worker Thread participant Converter as _get_cached_converter() participant Cache as LRU Cache participant DocumentConverter

Caller->>DoclingInline: process_files()
activate DoclingInline

DoclingInline->>ThreadPool: Thread(target=docling_worker, ...)
activate ThreadPool

ThreadPool->>Converter: _get_cached_converter(pipeline, ocr_engine, ...)
activate Converter

Note over Converter: Check cache key
Converter->>Cache: Lookup (pipeline, ocr_engine, ...)

alt Cache Hit
    Cache-->>Converter: Return cached DocumentConverter
    note over Converter: Reuse existing (seconds)
else Cache Miss
    Converter->>DocumentConverter: Create new instance
    activate DocumentConverter
    DocumentConverter-->>Converter: Converter ready
    deactivate DocumentConverter
    note over Converter: Load models (15-20 min)
    Converter->>Cache: Store in LRU cache
end

Converter-->>ThreadPool: DocumentConverter instance
deactivate Converter

ThreadPool->>ThreadPool: docling_worker(converter, ...)
note over ThreadPool: Process documents
ThreadPool->>DoclingInline: result_queue.put(result)

deactivate ThreadPool

DoclingInline->>DoclingInline: _wait_for_result_with_thread_monitoring()
note over DoclingInline: Monitor thread liveness
DoclingInline->>DoclingInline: _stop_thread_gracefully()

Caller-->>DoclingInline: Processing complete
deactivate DoclingInline

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Thread safety considerations: Review queue and thread synchronization in docling_inline.py, ensuring result_queue operations and thread lifecycle management are correct
Caching behavior: Verify LRU eviction logic, cache key construction (parameter hashing), and fallback to non-cached path when picture description config is present
Converter lifecycle: Ensure converter instances are properly reused and that non-cached fallback doesn't introduce memory leaks or state issues
Test coverage validation: Confirm mocking strategy in tests accurately represents caching and thread behavior

Pre-merge checks and finishing touches

Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 4 warnings)

Check name	Status	Explanation	Resolution
Test Coverage For New Implementations	❌ Error	PR lacks test coverage for critical threading refactor in docling_inline.py and has incorrect patch path in existing tests.	Create test_docling_inline.py with threading tests and fix patch path in test_docling_utils.py line 164 from 'lfx.base.data.docling_utils.DocumentConverter' to 'docling.document_converter.DocumentConverter'.
Docstring Coverage	⚠️ Warning	Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%.	You can run @coderabbitai generate docstrings to improve docstring coverage.
Test Quality And Coverage	⚠️ Warning	Test suite has critical gaps: incorrect mock patch path prevents cache test mocks from working, threading refactor lacks test coverage, error handling paths untested, and fallback logic uncovered.	Fix patch path to 'docling.document_converter.DocumentConverter', add threading tests (queue, timeout, shutdown), add error path tests, add parameter validation tests, and test cache with realistic mocks.
Test File Naming And Structure	⚠️ Warning	Test file follows naming convention (test_*.py) and has descriptive test methods, but contains duplicate TestDocumentConverterCaching class definition and lacks negative/error scenario test coverage.	Remove duplicate TestDocumentConverterCaching class and add pytest.raises() tests for invalid parameters, missing dependencies, and error conditions to ensure comprehensive coverage.
Excessive Mock Usage Warning	⚠️ Warning	Test suite exhibits excessive mock usage (71% of tests) with incorrect patch paths causing silent failures and reduced test validity.	Fix patch paths to use correct import locations and refactor tests to replace mocks with lightweight test objects, separating cache unit tests from integration tests.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title 'feat: Support instance caching for docling' accurately summarizes the main change: introducing instance caching for the Docling document converter.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.