feat: adds markdown as an output for the URL component by philnash · Pull Request #11312 · langflow-ai/langflow

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@src/lfx/pyproject.toml`:
- Line 46: Update the markitdown dependency constraint in pyproject.toml:
replace the existing "markitdown>=0.1.4,<2.0.0" entry with
"markitdown>=0.1.5b1,<2.0.0" to avoid known security issues, then regenerate
your lockfile or run your package manager (poetry/pip-compile/etc.) and install
to ensure the updated version is used in CI and local environments.

🧹 Nitpick comments (3)

src/lfx/src/lfx/components/data_source/url.py (1)

190-206: Consider adding error handling for markdown conversion failures.

The _markdown_extractor method could fail if the HTML content is malformed or if encoding issues occur during BytesIO conversion. While MarkItDown might handle some edge cases, explicit error handling would improve robustness.

♻️ Suggested improvement with error handling

     @staticmethod
     def _markdown_extractor(x: str) -> str:
         """Convert HTML to Markdown format."""
    -    stream = io.BytesIO(x.encode("utf-8"))
    -    result = MarkItDown(enable_plugins=False).convert_stream(stream)
    -    return result.markdown
    +    try:
    +        stream = io.BytesIO(x.encode("utf-8"))
    +        result = MarkItDown(enable_plugins=False).convert_stream(stream)
    +        return result.markdown
    +    except Exception as e:
    +        logger.warning(f"Failed to convert HTML to Markdown, falling back to text: {e}")
    +        return BeautifulSoup(x, "lxml").get_text()
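The try/except fallback suggested above can be exercised without MarkItDown or BeautifulSoup installed. The following is a minimal sketch under stated assumptions: `extract_with_fallback`, `plain_text`, `failing_converter`, and `decoding_converter` are hypothetical names used only for illustration, the converter is passed in as a plain callable rather than being MarkItDown itself, and the stdlib `html.parser` stands in for the BeautifulSoup text fallback.

```python
import io
import logging
from html.parser import HTMLParser

logger = logging.getLogger(__name__)


class _TextCollector(HTMLParser):
    """Collects text nodes; a stdlib stand-in for a BeautifulSoup get_text() call."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def plain_text(html: str) -> str:
    """Return the concatenated text content of an HTML string."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return "".join(collector.parts)


def extract_with_fallback(html: str, convert) -> str:
    """Run `convert` on a UTF-8 byte stream; fall back to plain text on any failure."""
    try:
        return convert(io.BytesIO(html.encode("utf-8")))
    except Exception as exc:  # deliberately broad, mirroring the suggestion
        logger.warning("Conversion failed, falling back to text: %s", exc)
        return plain_text(html)


def failing_converter(stream):
    """Hypothetical converter that always fails, to trigger the fallback path."""
    raise ValueError("simulated conversion failure")


def decoding_converter(stream):
    """Hypothetical converter that succeeds by decoding the stream unchanged."""
    return stream.read().decode("utf-8")


sample = "<h1>Title</h1><p>Body</p>"
fallback_result = extract_with_fallback(sample, failing_converter)
success_result = extract_with_fallback(sample, decoding_converter)
```

Passing the converter in as an argument is a choice made for this sketch only; it makes the failure path easy to trigger in a test, which is harder when the converter is constructed inside the method.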

src/backend/tests/unit/components/data_source/test_url_component.py (2)

148-168: Docstring mentions "different format options" but only tests HTML.

The docstring says "Test URLComponent with different format options" but the test only covers HTML format. Consider updating the docstring to be more specific.

✏️ Suggested docstring fix

     def test_url_component_html_format(self, mock_recursive_loader):
    -    """Test URLComponent with different format options."""
    +    """Test URLComponent with HTML format."""
         component = URLComponent()

170-191: Tests don't verify the actual extractor logic.

The test mocks RecursiveUrlLoader.load which returns pre-converted content. This means the _markdown_extractor method (and other extractors) is never actually invoked during the test. The extractor is passed to the loader, but since the loader is mocked, the conversion logic isn't tested.

Consider adding unit tests that directly test the extractor methods to ensure the conversion logic works correctly.

✏️ Suggested addition: Direct extractor tests

    def test_markdown_extractor_converts_html(self):
        """Test that _markdown_extractor correctly converts HTML to Markdown."""
        html = "<h1>Title</h1><p>Paragraph</p>"
        result = URLComponent._markdown_extractor(html)
        assert "Title" in result
        assert "Paragraph" in result

    def test_text_extractor_strips_html(self):
        """Test that _text_extractor removes HTML tags."""
        html = "<h1>Title</h1><p>Paragraph</p>"
        result = URLComponent._text_extractor(html)
        assert "<" not in result
        assert "Title" in result

    def test_html_extractor_returns_unchanged(self):
        """Test that _html_extractor returns content unchanged."""
        html = "Content"
        result = URLComponent._html_extractor(html)
        assert result == html
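The concern in the 170-191 comment, that mocking the loader's `load` method bypasses the extractor entirely, can be reproduced in isolation. This is a sketch only: `FakeLoader` and `recording_extractor` are hypothetical stand-ins and do not reflect how `RecursiveUrlLoader` is implemented. The sketch assumes only that the loader calls its extractor from inside `load`.

```python
from unittest.mock import patch


class FakeLoader:
    """Hypothetical loader that applies an extractor to fetched content."""

    def __init__(self, extractor):
        self.extractor = extractor

    def load(self):
        raw = "<h1>Title</h1>"  # pretend this was fetched from a URL
        return [self.extractor(raw)]


calls = []


def recording_extractor(html):
    """Records every invocation so we can see whether it ran."""
    calls.append(html)
    return html.upper()


# With load() mocked, the extractor is handed to the loader but never called.
with patch.object(FakeLoader, "load", return_value=["pre-converted"]):
    mocked_result = FakeLoader(recording_extractor).load()
calls_while_mocked = len(calls)

# Without the mock, load() runs and the extractor is invoked once.
real_result = FakeLoader(recording_extractor).load()
```

In the mocked run the test only checks the canned return value, so a broken extractor would go unnoticed. That is the gap direct extractor tests are meant to close.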

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a673cf and 10bcfeb.

⛔ Files ignored due to path filters (1)

📄 CodeRabbit inference engine (.cursor/rules/backend_development.mdc)

src/backend/**/*.py: Use FastAPI async patterns with await for async operations in component execution methods
Use asyncio.create_task() for background tasks and implement proper cleanup with try/except for asyncio.CancelledError
Use queue.put_nowait() for non-blocking queue operations and asyncio.wait_for() with timeouts for controlled get operations

📄 CodeRabbit inference engine (.cursor/rules/icons.mdc)

In Python component classes, set the icon attribute to a string matching the desired icon name (e.g., icon = "AstraDB"). The string must match the frontend icon mapping exactly (case-sensitive).

📄 CodeRabbit inference engine (.cursor/rules/testing.mdc)

src/backend/tests/**/*.py: Place backend unit tests in src/backend/tests/ directory, component tests in src/backend/tests/unit/components/ organized by component subdirectory, and integration tests accessible via make integration_tests
Use the same filename as the component with the appropriate test prefix/suffix (e.g., my_component.py → test_my_component.py)
Use the client fixture (FastAPI Test Client) defined in src/backend/tests/conftest.py for API tests; it provides an async httpx.AsyncClient with automatic in-memory SQLite database and mocked environment variables. Skip client creation by marking test with @pytest.mark.noclient
Inherit from the correct ComponentTestBase family class located in src/backend/tests/base.py based on API access needs: ComponentTestBase (no API), ComponentTestBaseWithClient (needs API), or ComponentTestBaseWithoutClient (pure logic). Provide three required fixtures: component_class, default_kwargs, and file_names_mapping
Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes
Test both sync and async code paths, mock external dependencies appropriately, test error handling and edge cases, validate input/output behavior, and test component initialization and configuration
Use @pytest.mark.asyncio decorator for async component tests and ensure async methods are properly awaited
Test background tasks using asyncio.create_task() and verify completion with asyncio.wait_for() with appropriate timeout constraints
Test queue operations using non-blocking queue.put_nowait() and asyncio.wait_for(queue.get(), timeout=...) to verify queue processing without blocking
Use @pytest.mark.no_blockbuster marker to skip the blockbuster plugin in specific tests
For database tests that may fail in batch runs, run them sequentially using uv run pytest src/backend/tests/unit/test_database.py r...

📄 CodeRabbit inference engine (Custom checks)

**/test_*.py: Review test files for excessive use of mocks that may indicate poor test design - check if tests have too many mock objects that obscure what's actually being tested
Warn when mocks are used instead of testing real behavior and interactions, and suggest using real objects or test doubles when mocks become excessive
Ensure mocks are used appropriately for external dependencies only, not for core logic
Backend test files should follow the naming convention test_*.py with proper pytest structure
Test files should have descriptive test function names that explain what is being tested
Tests should be organized logically with proper setup and teardown
Consider including edge cases and error conditions for comprehensive test coverage
Verify tests cover both positive and negative scenarios where appropriate
For async functions in backend tests, ensure proper async testing patterns are used with pytest
For API endpoints, verify both success and error response testing

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Inherit from the correct `ComponentTestBase` family class located in `src/backend/tests/base.py` based on API access needs: `ComponentTestBase` (no API), `ComponentTestBaseWithClient` (needs API), or `ComponentTestBaseWithoutClient` (pure logic). Provide three required fixtures: `component_class`, `default_kwargs`, and `file_names_mapping`

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Use same filename as component with appropriate test prefix/suffix (e.g., `my_component.py` → `test_my_component.py`)

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test component versioning and backward compatibility using `file_names_mapping` fixture with `VersionComponentMapping` objects mapping component files across Langflow versions

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Test component build config updates by calling `to_frontend_node()` to get the node template, then calling `update_build_config()` to apply configuration changes

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Place backend unit tests in `src/backend/tests/` directory, component tests in `src/backend/tests/unit/components/` organized by component subdirectory, and integration tests accessible via `make integration_tests`

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/backend_development.mdc:0-0
Timestamp: 2025-11-24T19:46:09.104Z
Learning: Applies to tests/unit/components/**/*.py : Create unit tests in `src/backend/tests/unit/components/` mirroring the component directory structure, using `ComponentTestBaseWithClient` or `ComponentTestBaseWithoutClient` base classes

Learnt from: edwinjosechittilappilly
Repo: langflow-ai/langflow PR: 0
File: :0-0
Timestamp: 2025-08-05T22:51:27.961Z
Learning: The TestComposioComponentAuth test in src/backend/tests/unit/components/bundles/composio/test_base_composio.py demonstrates proper integration testing patterns for external API components, including real API calls with mocking for OAuth completion, comprehensive resource cleanup, and proper environment variable handling with pytest.skip() fallbacks.

Learnt from: CR
Repo: langflow-ai/langflow PR: 0
File: .cursor/rules/testing.mdc:0-0
Timestamp: 2025-11-24T19:47:28.997Z
Learning: Applies to src/backend/tests/**/*.py : Create comprehensive unit tests for all new backend components. If unit tests are incomplete, create a corresponding Markdown file documenting manual testing steps and expected outcomes

Learnt from: Jkavia
Repo: langflow-ai/langflow PR: 11111
File: src/backend/tests/unit/api/v2/test_workflow.py:10-11
Timestamp: 2025-12-19T18:04:08.938Z
Learning: In the langflow-ai/langflow repository, pytest-asyncio is configured with asyncio_mode = 'auto' in pyproject.toml. This means you do not need to decorate test functions or classes with pytest.mark.asyncio; async tests are auto-detected and run by pytest-asyncio. When reviewing tests, ensure they rely on this configuration (i.e., avoid unnecessary pytest.mark.asyncio decorators) and that tests living under any tests/ path (e.g., src/.../tests/**/*.py) follow this convention. If a test explicitly requires a different asyncio policy, document it and adjust the config accordingly.

src/lfx/src/lfx/components/data_source/url.py (3)

2-8: LGTM - Imports are appropriate for the new functionality.

The io module and MarkItDown imports are correctly added to support the new markdown extraction feature.


112-117: LGTM - Clear documentation of the new format option.

The info text clearly explains the three available output formats and their purposes.


251-256: LGTM - Clean extractor selection pattern.

The dict-based extractor mapping is a clean approach that's easy to extend. The fallback to _text_extractor provides safe behavior for unexpected format values.
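For readers unfamiliar with the pattern praised here, a dict-based dispatch with a safe default might look roughly like the sketch below. The format keys (`"Text"`, `"HTML"`, `"Markdown"`), the function names, and the toy regex-based converters are assumptions made for illustration. They are not taken from `url.py`, which uses BeautifulSoup and MarkItDown.

```python
import re


def text_extractor(html: str) -> str:
    """Toy text extractor: strips anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", html)


def html_extractor(html: str) -> str:
    """Returns the HTML unchanged."""
    return html


def markdown_extractor(html: str) -> str:
    """Toy converter: rewrites <h1> headings, then strips remaining tags."""
    with_headings = re.sub(r"<h1>(.*?)</h1>", r"# \1\n", html)
    return text_extractor(with_headings).strip()


# Hypothetical format names mapped to extractor callables.
EXTRACTORS = {
    "Text": text_extractor,
    "HTML": html_extractor,
    "Markdown": markdown_extractor,
}


def select_extractor(fmt: str):
    """Look up the extractor for `fmt`, defaulting to plain text if unknown."""
    return EXTRACTORS.get(fmt, text_extractor)


sample = "<h1>Title</h1><p>Body</p>"
markdown_out = select_extractor("Markdown")(sample)
unknown_choice = select_extractor("not-a-format")
```

Adding a new output format then means adding one function and one dictionary entry, while an unexpected format value degrades to text instead of raising a `KeyError`.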

src/backend/tests/unit/components/data_source/test_url_component.py (2)

10-35: LGTM - Test class follows the component testing guidelines.

The test class correctly inherits from ComponentTestBaseWithoutClient and provides all three required fixtures: component_class, default_kwargs, and file_names_mapping. Based on learnings, this follows the established testing patterns.


127-146: LGTM - Text format test covers the basic scenario.

The test verifies that the text format option works correctly with the expected content type.
