Skip to content

feat: skip config validation during discovery for sources with DynamicSchemaLoader #467

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 21 commits into
base: main
Choose a base branch
from

Conversation

aaronsteers
Copy link
Contributor

@aaronsteers aaronsteers commented Apr 9, 2025

Replaces:

Allow Airbyte sources to run discovery unprivileged with DynamicSchemaLoader

This PR allows Airbyte sources to run discovery unprivileged if the source API doesn't need auth in order to provide the catalog info.

Implementation

  1. Added automatic detection of DynamicSchemaLoader to skip config validation during discovery. When a source uses the dynamic schema feature, we assume that the schema endpoint might not require authentication and automatically skip config validation during discovery.

  2. Modified the AirbyteEntrypoint to make the --config parameter optional for the discover command. This allows running discovery without providing a config file when the source doesn't require config validation.

This enables sources with dynamic schemas to provide catalog information without authentication when the schema endpoint doesn't require it.

Testing

Added unit tests to verify that:

  • Sources using DynamicSchemaLoader skip config validation during discovery
  • Sources not using DynamicSchemaLoader still require config validation
  • The AirbyteEntrypoint correctly handles discovery without config when appropriate

Link to Devin run: https://app.devin.ai/sessions/e6aa19df336347919b6cabcff7143a1c
Requested by: Aaron ("AJ") Steers (aj@airbyte.io)

Summary by CodeRabbit

  • New Features

    • The discovery process now allows execution without a mandatory configuration file when applicable.
    • Dynamic schema support has been introduced to automatically adjust configuration validation during discovery.
  • Enhancements

    • Optimized the execution flow for improved performance and resource efficiency.
devin-ai-integration bot and others added 21 commits April 8, 2025 22:33
…cSchemaLoader

Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
…ation control

Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
…class

Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
…ver property

Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
Co-Authored-By: Aaron <AJ> Steers <aj@airbyte.io>
@Copilot Copilot AI review requested due to automatic review settings April 9, 2025 16:22
@github-actions github-actions bot added the enhancement New feature or request label Apr 9, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py:163

  • [nitpick] The final duplicate assertions for 'check_config_during_discover' and 'check_config_against_spec' are redundant. Consider removing the duplicate assertions to streamline the test.
assert source.check_config_during_discover is False
Copy link
Contributor

coderabbitai bot commented Apr 9, 2025

📝 Walkthrough

Walkthrough

The changes introduce a new boolean attribute check_config_during_discover in both the BaseConnector and ManifestDeclarativeSource classes to govern configuration validation during the discovery phase. The discover command in the entrypoint now accepts an optional configuration, with a conditional branch to handle cases where no config is provided by substituting an empty dictionary. Additionally, new tests have been added to validate the behavior when using either dynamic or static schema loaders, ensuring the discovery process returns the expected catalog.

Changes

File(s) Change Summary
airbyte_cdk/connector.py Added a new boolean attribute check_config_during_discover (default False) in BaseConnector along with updated docstrings for both check_config_against_spec and check_config_during_discover.
airbyte_cdk/entrypoint.py Made the --config argument optional for the discover command; added logic in run for handling a missing config by creating an empty configuration; modified the message yielding to use a generator expression; updated the condition to check not self.source.check_config_during_discover.
airbyte_cdk/sources/declarative/…/manifest_declarative_source.py Introduced the _uses_dynamic_schema_loader method to detect DynamicSchemaLoader usage and initialized the instance variable check_config_during_discover based on this check.
unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py Added new tests to assess the behavior of check_config_during_discover and configuration validation during discovery when using dynamic versus inline schema loaders.
unit_tests/test_entrypoint.py Updated the test for missing required arguments by removing the "discover" command from the parameterization, now focusing solely on the check and read commands.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as "CLI"
    participant EP as "Entrypoint"
    participant SRC as "Source"
    
    CLI->>EP: Execute `discover` command (with/without config)
    EP->>EP: Parse arguments (config now optional)
    alt Config not provided & check_config_during_discover is False
        EP->>EP: Create empty configuration dictionary
    end
    EP->>SRC: Invoke `discover(empty_config)`
    SRC->>EP: Return catalog/messages
    EP->>CLI: Yield results
Loading
sequenceDiagram
    participant MDS as "ManifestDeclarativeSource"
    participant CFG as "Source Config"
    
    MDS->>MDS: Call _uses_dynamic_schema_loader() and inspect streams in CFG
    alt DynamicSchemaLoader found
        MDS->>MDS: Set check_config_during_discover = True
    else No DynamicSchemaLoader found
        MDS->>MDS: Set check_config_during_discover = False
    end
Loading

Possibly related PRs

Suggested labels

enhancement

Suggested reviewers

  • maxi297 – does this reviewer seem like a good fit for scrutinizing the connector changes, wdyt?
  • darynaishchenko – would you be available to review the modifications in the entrypoint and test updates, wdyt?
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

‼️ IMPORTANT
Auto-reply has been disabled for this repository in the CodeRabbit settings. The CodeRabbit bot will not respond to your replies unless it is explicitly tagged.

  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai plan to trigger planning for file edits and PR creation.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)

447-477: Well-implemented detection logic

The implementation of _uses_dynamic_schema_loader() thoroughly checks both regular streams and dynamic streams for the presence of a DynamicSchemaLoader. The documentation clearly explains the purpose and behavior of this method.

One minor suggestion: would adding a simple caching mechanism be beneficial if this method gets called multiple times? wdyt?

unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py (4)

76-128: Thorough testing of discovery with DynamicSchemaLoader.

The test effectively mocks the streams method and validates that discovery works with an empty config when using DynamicSchemaLoader. Should we also add a validation test that explicitly verifies the config validation is being skipped? Perhaps by patching check_config_against_spec_or_exit and asserting it's not called? wdyt?


130-135: Docstring doesn't fully match test implementation.

The docstring states "Test that discovery validates config when DynamicSchemaLoader is not used," but the test doesn't actually verify any validation occurs. Consider updating the docstring or adding validation verification logic?

-    """Test that discovery validates config when DynamicSchemaLoader is not used."""
+    """Test that discovery works with a static schema loader with minimal config."""

13-13: Imported but unused utility.

The check_config_against_spec_or_exit is imported but never used in the tests. If the intent was to verify validation behavior, consider adding tests that mock this function and assert whether it's called or not based on the schema loader type.

For example, you might add a test like:

@patch("airbyte_cdk.sources.utils.schema_helpers.check_config_against_spec_or_exit")
def test_validation_skipped_with_dynamic_schema_loader(mock_check_config):
    # Setup source with DynamicSchemaLoader
    # Perform discovery
    # Assert mock_check_config.assert_not_called()

167-171: Additional assertions at the end of the test are redundant.

Lines 173-174 repeat assertions already made at lines 163-164. Consider removing the redundant assertions to improve test clarity, or if they're meant to verify something different after the discovery call, add a comment explaining the purpose.

    assert isinstance(catalog, AirbyteCatalog)
    assert len(catalog.streams) == 1
    assert catalog.streams[0].name == "test_static_stream"

-    assert source.check_config_during_discover is False
-    assert source.check_config_against_spec is True
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5366b13 and 08397ad.

📒 Files selected for processing (5)
  • airbyte_cdk/connector.py (1 hunks)
  • airbyte_cdk/entrypoint.py (3 hunks)
  • airbyte_cdk/sources/declarative/manifest_declarative_source.py (3 hunks)
  • unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py (1 hunks)
  • unit_tests/test_entrypoint.py (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (2)
unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py (4)
airbyte_cdk/sources/declarative/manifest_declarative_source.py (2)
  • ManifestDeclarativeSource (61-477)
  • streams (147-172)
airbyte_cdk/sources/utils/schema_helpers.py (1)
  • check_config_against_spec_or_exit (168-186)
airbyte_cdk/sources/abstract_source.py (2)
  • discover (85-90)
  • streams (74-79)
airbyte_cdk/entrypoint.py (1)
  • discover (239-248)
airbyte_cdk/entrypoint.py (3)
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (1)
  • discover (171-177)
airbyte_cdk/test/entrypoint_wrapper.py (1)
  • discover (182-199)
airbyte_cdk/sources/source.py (1)
  • discover (48-52)
⏰ Context from checks skipped due to timeout of 90000ms (12)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: SDM Docker Image Build
  • GitHub Check: Pytest (Fast)
  • GitHub Check: Analyze (python)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
🔇 Additional comments (13)
unit_tests/test_entrypoint.py (1)

181-181: Update test to match removed config requirement

The test has been updated to remove "discover" from the required_args dictionary, which correctly aligns with the changes made to the entrypoint making the --config parameter optional for the discover command.

airbyte_cdk/connector.py (1)

35-42: New attribute and documentation looks good!

Adding docstrings for the existing check_config_against_spec attribute and introducing the new check_config_during_discover attribute with clear documentation improves the code readability. The default value of False ensures backward compatibility.

airbyte_cdk/sources/declarative/manifest_declarative_source.py (1)

113-113: Good initialization of the new attribute

Setting check_config_during_discover based on the detection of a DynamicSchemaLoader during initialization ensures the behavior is determined early and consistently.

airbyte_cdk/entrypoint.py (5)

96-96: Parameter updated to match the new feature

Changing the --config parameter from required to optional is the key change that enables discovery without authentication when not needed.


144-148: Style improvement with generator expression

Converting the message yielding process from a list comprehension to a generator expression is a nice optimization that prevents creating an intermediate list.


149-163: Core implementation of the new feature

This conditional block elegantly handles the case when discovery is requested without a config file and the source doesn't require config validation. The empty config dictionary approach keeps the code clean and avoids unnecessary code duplication.


168-171: Consistent style improvement

Another instance of converting from list comprehension to generator expression for more efficient message yielding.


243-243: Updated condition for config validation

The condition has been updated to use the new check_config_during_discover attribute instead of check_config_against_spec, which aligns perfectly with the new feature.

unit_tests/sources/declarative/test_manifest_declarative_source_dynamic_schema.py (5)

5-5: Clear and concise docstring summary.

The docstring effectively summarizes the purpose of these tests, which aligns with the PR goal of skipping config validation during discovery for sources using DynamicSchemaLoader.


9-14: Good imports organization.

The imports are well-organized and include all necessary components for testing the feature.


16-49: Well-structured test for DynamicSchemaLoader detection.

This test correctly verifies that when a source uses DynamicSchemaLoader, the check_config_during_discover flag is set to True, while maintaining check_config_against_spec as True. The detailed configuration setup provides good context for understanding how DynamicSchemaLoader is detected.


51-74: Strong coverage of the negative case.

This test properly verifies the inverse condition - when InlineSchemaLoader is used instead of DynamicSchemaLoader, the check_config_during_discover flag is correctly set to False. This ensures comprehensive logic coverage.


133-172: Comprehensive test with static schema loader.

This test confirms that discovery works with a static schema loader and an empty config. It effectively verifies the catalog structure and stream properties.

Comment on lines +125 to +127
assert isinstance(catalog, AirbyteCatalog)
assert len(catalog.streams) == 1
assert catalog.streams[0].name == "test_dynamic_stream"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be failing. Probably failing to fail because we're mocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
1 participant