Go: file parser config by JinHai-CN · Pull Request #15989 · infiniflow/ragflow

📝 Walkthrough

The parser infrastructure is refactored to accept library type configuration. Document, presentation, and spreadsheet parsers now validate and dispatch on configurable libType fields via constructors and switch-based routing in Parse methods, replacing hardcoded parsing library bindings. The CLI passes explicit lib_type: "office_oxide" configuration through GetParser.
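The constructor validation and switch-based dispatch described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Parse` signature, the error-returning constructor, the `ErrUnsupportedLibType` sentinel, and the stubbed `OfficeOxideParse` body are all assumptions made for the example. Only the `libType` field, the `OfficeOxide` constant name, and the `"office_oxide"` value come from the summary.

```go
package main

import (
	"errors"
	"fmt"
)

// OfficeOxide mirrors the constant the summary says DOCXParser exports.
// Its string value is assumed to match the lib_type the CLI passes.
const OfficeOxide = "office_oxide"

// ErrUnsupportedLibType is a hypothetical sentinel error for this sketch.
var ErrUnsupportedLibType = errors.New("unsupported lib type")

// DOCXParser carries the backend selected at construction time.
type DOCXParser struct {
	libType string
}

// NewDOCXParser validates libType up front, so a bad configuration
// fails at construction rather than on the first Parse call.
func NewDOCXParser(libType string) (*DOCXParser, error) {
	switch libType {
	case OfficeOxide:
		return &DOCXParser{libType: libType}, nil
	default:
		return nil, fmt.Errorf("%w: %q", ErrUnsupportedLibType, libType)
	}
}

// Parse routes to the backend-specific method based on p.libType.
// The ([]byte) -> (string, error) shape is an assumption.
func (p *DOCXParser) Parse(data []byte) (string, error) {
	switch p.libType {
	case OfficeOxide:
		return p.OfficeOxideParse(data)
	default:
		return "", fmt.Errorf("%w: %q", ErrUnsupportedLibType, p.libType)
	}
}

// OfficeOxideParse is a stub standing in for the real backend call.
func (p *DOCXParser) OfficeOxideParse(data []byte) (string, error) {
	return fmt.Sprintf("office_oxide parsed %d bytes", len(data)), nil
}

func main() {
	p, err := NewDOCXParser(OfficeOxide)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := p.Parse([]byte("hello"))
	fmt.Println(out)

	// An unknown backend is rejected by the constructor.
	_, err = NewDOCXParser("unknown")
	fmt.Println(err)
}
```

Adding a second backend under this layout would mean one new constant, one new `case` in each switch, and one new backend method, without touching callers.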

Changes

Library type configuration for pluggable parsing backends

| Layer / File(s) | Summary |
| --- | --- |
| Parser GetParser configuration interface: internal/ingestion/parser/type.go | GetParser() signature changes to accept config map[string]string, extracts lib_type from config, and passes the derived libType string to each file-type-specific parser constructor instead of zero-argument constructors. |
| Document parsers (DOC, DOCX): internal/ingestion/parser/doc_parser.go, internal/ingestion/parser/docx_parser.go | DOCParser and DOCXParser now carry libType fields, validate library type in constructors accepting libType string, and dispatch Parse() via switch statement on p.libType with error handling for unsupported types. DOCXParser introduces an exported OfficeOxide constant. |
| Presentation parsers (PPT, PPTX): internal/ingestion/parser/ppt_parser.go, internal/ingestion/parser/pptx_parser.go | PPTParser and PPTXParser add libType fields and update constructors to accept libType with validation. Parse() methods dispatch parsing based on p.libType, calling OfficeOxideParse for supported types and returning errors for unsupported values. |
| Spreadsheet parsers (XLS, XLSX): internal/ingestion/parser/xls_parser.go, internal/ingestion/parser/xlsx_parser.go | XLSParser and XLSXParser add libType fields and constructors with validation. Parse() dispatches on p.libType; OfficeOxideParse() is extracted as a new exported method to separate backend-specific logic, enabling future backend pluggability. |
| PDF parser configuration fields: internal/ingestion/parser/pdf_parser.go | PDFParser struct gains ParserType, Model, and LibType string fields to store configuration for backend and model selection. |
| CLI integration with parser configuration: internal/cli/user_command.go | UserParseLocalFile now creates a config map with lib_type set to "office_oxide" and passes it to GetParser(), wiring the new configuration interface into the CLI parsing path. |
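To illustrate how the config map might flow from the CLI into GetParser, here is a hedged sketch. The `Parser` interface, `stubParser`, `newStubParser`, the `fileType` parameter, and the returned error are hypothetical stand-ins. The summary only states that GetParser accepts `config map[string]string`, reads `lib_type`, and forwards the value to the per-format constructors. Only the six office formats are modeled here.

```go
package main

import (
	"fmt"
	"strings"
)

// Parser is a hypothetical minimal interface for this sketch.
type Parser interface {
	Parse(data []byte) (string, error)
}

// stubParser stands in for the real per-format parsers.
type stubParser struct {
	fileType string
	libType  string
}

func (p *stubParser) Parse(data []byte) (string, error) {
	return fmt.Sprintf("%s via %s: %d bytes", p.fileType, p.libType, len(data)), nil
}

// newStubParser plays the role of a constructor that validates libType.
func newStubParser(fileType, libType string) (Parser, error) {
	if libType != "office_oxide" {
		return nil, fmt.Errorf("unsupported lib type %q for %s", libType, fileType)
	}
	return &stubParser{fileType: fileType, libType: libType}, nil
}

// GetParser extracts lib_type from config and hands it to the constructor.
// A missing key yields "", which the constructor rejects.
func GetParser(fileType string, config map[string]string) (Parser, error) {
	libType := config["lib_type"]
	ft := strings.ToLower(fileType)
	switch ft {
	case "doc", "docx", "ppt", "pptx", "xls", "xlsx":
		return newStubParser(ft, libType)
	default:
		return nil, fmt.Errorf("unsupported file type %q", fileType)
	}
}

func main() {
	// Mirrors what the summary says the CLI does: build the config, then call GetParser.
	config := map[string]string{"lib_type": "office_oxide"}
	p, err := GetParser("docx", config)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, _ := p.Parse([]byte("abc"))
	fmt.Println(out)
}
```

A plain `map[string]string` keeps the call site simple, at the cost of typos in keys such as `lib_type` going unnoticed until constructor validation runs.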

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested labels

💞 feature, 🔧 refactoring

🐰 Parsers now dance with grace,
Each backend finds its rightful place.
From DOC to XLSX, all in line,
LibType routing works so fine!
Configuration blooms, backends shine. 🌸

🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description check | ⚠️ Warning | The description is largely incomplete and off-topic. It provides only a minimal statement without context, background, or any explanation of the parser config implementation. | Expand the description to explain what parser config is being added, why it's needed, how it works, and what backends/library types are supported. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is vague and generic, using non-descriptive terms like 'config' without clarifying what is being configured or why. | Use a more specific title that describes the main change, e.g., 'Add configurable library type selection for document parsers' or 'Support pluggable parser backends via config parameter'. |
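For the docstring coverage warning, Go's convention is a doc comment placed directly above each declaration, beginning with the name it documents. The sketch below shows what that could look like for one of the new constructors. The error return and the `LibType` accessor are assumptions for illustration, and whether the coverage check counts such comments is not confirmed by this page.

```go
package main

import "fmt"

// XLSXParser parses XLSX workbooks using the backend named by libType.
type XLSXParser struct {
	libType string
}

// NewXLSXParser returns an XLSXParser bound to libType.
// It returns an error if libType is not a supported backend.
func NewXLSXParser(libType string) (*XLSXParser, error) {
	if libType != "office_oxide" {
		return nil, fmt.Errorf("xlsx: unsupported lib type %q", libType)
	}
	return &XLSXParser{libType: libType}, nil
}

// LibType reports the backend this parser was constructed with.
// This accessor is hypothetical and exists only for the example.
func (p *XLSXParser) LibType() string {
	return p.libType
}

func main() {
	p, err := NewXLSXParser("office_oxide")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(p.LibType())
}
```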

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
