Go: file parser config by JinHai-CN · Pull Request #15989 · infiniflow/ragflow
📝 Walkthrough
The parser infrastructure is refactored to accept library type configuration. Document, presentation, and spreadsheet parsers now validate and dispatch on configurable `libType` fields via constructors and switch-based routing in `Parse` methods, replacing hardcoded parsing library bindings. The CLI passes explicit `lib_type: "office_oxide"` configuration through `GetParser`.
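The constructor validation and switch-based routing described above might look roughly like the following. This is a minimal sketch, not the PR's actual code: the constructor name `NewDOCXParser`, its `(parser, error)` return shape, the string return type of `Parse`, and the value of the `OfficeOxide` constant are all assumptions made for illustration, and the backend call is a placeholder.

```go
package main

import (
	"errors"
	"fmt"
)

// OfficeOxide names the only backend in this sketch. The value is assumed
// to match the "office_oxide" string the CLI passes as lib_type.
const OfficeOxide = "office_oxide"

// DOCXParser is a simplified stand-in for the real parser type.
type DOCXParser struct {
	libType string
}

// NewDOCXParser (hypothetical name and signature) rejects unknown
// library types at construction time.
func NewDOCXParser(libType string) (*DOCXParser, error) {
	switch libType {
	case OfficeOxide:
		return &DOCXParser{libType: libType}, nil
	default:
		return nil, fmt.Errorf("unsupported lib type: %q", libType)
	}
}

// Parse routes to the backend selected by libType.
func (p *DOCXParser) Parse(data []byte) (string, error) {
	switch p.libType {
	case OfficeOxide:
		return p.OfficeOxideParse(data)
	default:
		return "", fmt.Errorf("unsupported lib type: %q", p.libType)
	}
}

// OfficeOxideParse is a placeholder for the backend-specific logic;
// it only reports how many bytes it was given.
func (p *DOCXParser) OfficeOxideParse(data []byte) (string, error) {
	if len(data) == 0 {
		return "", errors.New("empty input")
	}
	return fmt.Sprintf("parsed %d bytes with %s", len(data), p.libType), nil
}

func main() {
	p, err := NewDOCXParser(OfficeOxide)
	if err != nil {
		fmt.Println("constructor error:", err)
		return
	}
	out, err := p.Parse([]byte("hello"))
	fmt.Println(out, err)

	_, err = NewDOCXParser("unknown")
	fmt.Println(err)
}
```

Validating in the constructor and again in `Parse` is redundant in this sketch, but it keeps `Parse` safe if a parser value is ever built without going through the constructor.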
Changes
Library type configuration for pluggable parsing backends
| Layer / File(s) | Summary |
|---|---|
| Parser `GetParser` configuration interface: `internal/ingestion/parser/type.go` | `GetParser()` signature changes to accept `config map[string]string`, extracts `lib_type` from config, and passes the derived `libType` string to each file-type-specific parser constructor instead of zero-argument constructors. |
| Document parsers (DOC, DOCX): `internal/ingestion/parser/doc_parser.go`, `internal/ingestion/parser/docx_parser.go` | `DOCParser` and `DOCXParser` now carry `libType` fields, validate library type in constructors accepting `libType string`, and dispatch `Parse()` via switch statement on `p.libType` with error handling for unsupported types. `DOCXParser` introduces an exported `OfficeOxide` constant. |
| Presentation parsers (PPT, PPTX): `internal/ingestion/parser/ppt_parser.go`, `internal/ingestion/parser/pptx_parser.go` | `PPTParser` and `PPTXParser` add `libType` fields and update constructors to accept `libType` with validation. `Parse()` methods dispatch parsing based on `p.libType`, calling `OfficeOxideParse` for supported types and returning errors for unsupported values. |
| Spreadsheet parsers (XLS, XLSX): `internal/ingestion/parser/xls_parser.go`, `internal/ingestion/parser/xlsx_parser.go` | `XLSParser` and `XLSXParser` add `libType` fields and constructors with validation. `Parse()` dispatches on `p.libType`; `OfficeOxideParse()` is extracted as a new exported method to separate backend-specific logic, enabling future backend pluggability. |
| PDF parser configuration fields: `internal/ingestion/parser/pdf_parser.go` | `PDFParser` struct gains `ParserType`, `Model`, and `LibType` string fields to store configuration for backend and model selection. |
| CLI integration with parser configuration: `internal/cli/user_command.go` | `UserParseLocalFile` now creates a config map with `lib_type` set to `"office_oxide"` and passes it to `GetParser()`, wiring the new configuration interface into the CLI parsing path. |
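To illustrate how the config map could flow from a caller into `GetParser`, here is a hedged sketch. Only the `config map[string]string` parameter and the `lib_type` key come from the summary above; the `filename` parameter, the `Parser` interface, the `stubParser` type, and the extension-based routing are hypothetical stand-ins for the real per-format parsers.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// Parser is an assumed common interface for this sketch.
type Parser interface {
	Parse(data []byte) (string, error)
}

// stubParser stands in for the six file-type-specific parsers.
type stubParser struct {
	kind    string
	libType string
}

// Parse fails unless the configured backend is the one this sketch knows.
func (p *stubParser) Parse(data []byte) (string, error) {
	if p.libType != "office_oxide" {
		return "", fmt.Errorf("unsupported lib type: %q", p.libType)
	}
	return fmt.Sprintf("%s via %s: %d bytes", p.kind, p.libType, len(data)), nil
}

// GetParser reads lib_type from config and hands it to the parser it builds.
// Indexing a nil or incomplete map yields "", which Parse later rejects.
func GetParser(filename string, config map[string]string) (Parser, error) {
	libType := config["lib_type"]
	ext := strings.ToLower(filepath.Ext(filename))
	switch ext {
	case ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx":
		return &stubParser{kind: strings.TrimPrefix(ext, "."), libType: libType}, nil
	default:
		return nil, fmt.Errorf("unsupported file type: %q", ext)
	}
}

func main() {
	// Mirrors what the CLI path is described as doing: build the config, then ask for a parser.
	config := map[string]string{"lib_type": "office_oxide"}
	p, err := GetParser("report.docx", config)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	out, err := p.Parse([]byte("abc"))
	fmt.Println(out, err)
}
```

A plain `map[string]string` keeps the call site simple, at the cost of untyped keys: a misspelled `lib_type` silently becomes an empty string, so rejecting empty or unknown values somewhere along the path matters.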
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
- infiniflow/ragflow#15936: Adds `UserParseLocalFile` CLI handling; this PR updates that same path to pass `lib_type=office_oxide` configuration into `parser.GetParser`.
- infiniflow/ragflow#15976: Introduces DOCX parser and initial `GetParser`/CLI wiring; this PR extends that pathway by adding `libType`-based dispatch across all document, presentation, and spreadsheet parsers.
- infiniflow/ragflow#15979: Expands OfficeOxide byte parsing implementations in parser files; this PR adds the dispatch layer that routes to those implementations based on configurable `libType`.
Suggested labels
💞 feature, 🔧 refactoring
🐰 Parsers now dance with grace,
Each backend finds its rightful place.
From DOC to XLSX, all in line,
LibType routing works so fine!
Configuration blooms, backends shine. 🌸
🚥 Pre-merge checks | ✅ 2 | ❌ 3
❌ Failed checks (2 warnings, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is largely incomplete and off-topic. It provides only a minimal statement without context, background, or any explanation of the parser config implementation. | Expand the description to explain what parser config is being added, why it's needed, how it works, and what backends/library types are supported. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is vague and generic, using non-descriptive terms like 'config' without clarifying what is being configured or why. | Use a more specific title that describes the main change, e.g., 'Add configurable library type selection for document parsers' or 'Support pluggable parser backends via config parameter'. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
|---|---|---|
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |