🎉 Destination Databricks: Support Azure storage (#15140) by abaerptc · Pull Request #15329 · airbytehq/airbyte



Conversation


abaerptc

What

Adds Azure storage as an option for the Databricks destination.
This PR resolves #15140.

How

Adds Azure storage as an alternative to S3.

  1. *.json - spec and sample configs
  2. Databricks*Config.java - abstracted cloud storage config, with two implementations (S3 and Azure); a rough sketch follows this list
  3. Databricks*Destination.java - switches between the S3 and Azure implementations depending on the config
  4. Databricks*StreamCopierFactory.java - abstracted, with two implementations (S3 and Azure)
  5. Databricks*StreamCopier.java - abstracted, with the existing S3 implementation refactored and an Azure implementation added
  6. DatabricksSqlOperations.java - minor change to allow execution of multiple statements; also sketched after this list
  7. Dockerfile - bumped version
  8. build.gradle - new dependencies
  9. AzureBlobStorageConnectionChecker.java - new constructor to allow reuse in the Databricks connector
  10. *Test.java - tests
  11. *.md - docs
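
For reviewers skimming the diff, items 2 and 3 boil down to the shape below. This is a heavily simplified sketch: the class names and JSON field names are assumptions for illustration, and the real classes in the PR carry more fields and validation.

```java
import com.fasterxml.jackson.databind.JsonNode;

// Simplified sketch of the storage abstraction (hypothetical names and JSON
// fields, not the exact identifiers used in the PR).
public interface DatabricksStorageConfig {

  // Where the staged files live, e.g. an s3:// or wasbs:// URI.
  String getLocation();

  // Hypothetical S3 flavor.
  record S3Config(String bucketName, String bucketPath) implements DatabricksStorageConfig {
    @Override
    public String getLocation() {
      return "s3://" + bucketName + "/" + bucketPath;
    }
  }

  // Hypothetical Azure Blob Storage flavor.
  record AzureConfig(String accountName, String containerName) implements DatabricksStorageConfig {
    @Override
    public String getLocation() {
      return "wasbs://" + containerName + "@" + accountName + ".blob.core.windows.net";
    }
  }

  // The destination picks the implementation based on the connector config.
  static DatabricksStorageConfig fromJson(JsonNode dataSource) {
    String type = dataSource.get("data_source_type").asText();
    return switch (type) {
      case "S3" -> new S3Config(
          dataSource.get("s3_bucket_name").asText(),
          dataSource.get("s3_bucket_path").asText());
      case "Azure_Blob_Storage" -> new AzureConfig(
          dataSource.get("azure_blob_storage_account_name").asText(),
          dataSource.get("azure_blob_storage_container_name").asText());
      default -> throw new IllegalArgumentException("Unsupported data source type: " + type);
    };
  }
}
```

The rest of the connector then only ever talks to the interface, which is what makes adding further storage backends straightforward.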
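And since item 6 is easy to misread as something bigger, here is a plausible sketch of what "allow execution of multiple statements" can mean over JDBC (not the PR's exact code):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Plausible sketch (not the PR's exact code): some JDBC drivers reject a
// multi-statement string in a single execute() call, so split the script on
// semicolons and run each statement separately. The naive split assumes no
// embedded semicolons inside string literals.
public final class MultiStatementSqlRunner {

  public static void executeScript(final Connection connection, final String script) throws SQLException {
    try (Statement statement = connection.createStatement()) {
      for (final String sql : script.split(";")) {
        if (!sql.isBlank()) {
          statement.execute(sql.trim());
        }
      }
    }
  }
}
```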

🚨 User Impact 🚨

No breaking changes.

Pre-merge Checklist

Updating a connector

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

Tests

Unit

[Screenshot: unit test results, 2022-08-04]

Integration

None.

Acceptance

[Screenshot: acceptance test results, 2022-08-04]

Cannot run the S3 acceptance test locally.


@CLAassistant

CLA assistant check
All committers have signed the CLA.

@abaerptc

The Azure integration test passes, but I don't have access to Databricks on AWS, and it's proving difficult to configure Azure Databricks to write to S3, so I'm not able to fully run the S3 integration tests. I have confirmed that the data gets written to S3 fine, but still need to confirm that the tables can be created from the S3 location in Databricks. Could someone from Airbyte please trigger the integration tests to run from this PR?


@abaerptc

I pushed a new commit that should fix the error seen in the previous run of the S3 acceptance test.

The Azure acceptance test will require the addition of a new secrets file by an Airbyte member (see sample_secrets/azure_config.json).

@marcosmarxm

/test connector=connectors/destination-databricks

I added azure_config to the Airbyte secret manager.

@tuliren

@abaerptc, thank you for the PR! Could you add me as a collaborator on your fork? That way I can push changes to this branch if needed while I work on merging it.

@marcosmarxm, there is another Databricks PR pending merge. Would you mind if I coordinate and merge this one later?


@marcosmarxm

> @marcosmarxm, there is another Databricks PR pending merge. Would you mind if I coordinate and merge this one later?

Sure @tuliren, I'd just like to confirm tests are running properly here.


@abaerptc

@tuliren You're added as a collaborator.

@marcosmarxm

@abaerptc now the integration tests are running with your changes!

@abaerptc

> @abaerptc now the integration tests are running with your changes!

@marcosmarxm I see a few failures with messages like "Invalid configuration value detected for fs.azure.account.key". I think you need to give your Databricks instance access to your Azure account via the Spark configuration property fs.azure.account.key.
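
For anyone who hits the same error: on a Databricks cluster this property is typically supplied through the cluster's Spark config as a key-value line, along these lines (the placeholders are hypothetical, not the exact setup used for these tests):

```
fs.azure.account.key.<storage-account-name>.blob.core.windows.net <storage-account-access-key>
```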

@TamGB

Awesome PR @abaerptc !

Just a side note, the AWS version supports destination pattern matching (see airbyte-integrations/connectors/destination-databricks/src/main/java/io/airbyte/integrations/destination/databricks/DatabricksS3StorageConfig.java), is this something you would consider adding? (of course, I am happy to help!)

@abaerptc

> Awesome PR @abaerptc !
>
> Just a side note, the AWS version supports destination pattern matching (see airbyte-integrations/connectors/destination-databricks/src/main/java/io/airbyte/integrations/destination/databricks/DatabricksS3StorageConfig.java), is this something you would consider adding? (of course, I am happy to help!)

@TamGB I don't object to having pattern matching added. However, I don't have the availability to work on it. Please go ahead and implement it if you'd find it valuable.
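
For anyone picking this up later: destination pattern matching here amounts to substituting placeholders in a user-configured path template before staging files. A rough sketch of the idea (the placeholder names are assumptions, not necessarily the connector's actual spec):

```java
import java.time.LocalDate;
import java.util.Map;

// Rough sketch of destination pattern matching: substitute placeholders in a
// user-configured path template. The placeholder names are assumptions for
// illustration only.
public final class PathPattern {

  public static String resolve(String template, String namespace, String streamName) {
    LocalDate today = LocalDate.now();
    Map<String, String> values = Map.of(
        "${NAMESPACE}", namespace,
        "${STREAM_NAME}", streamName,
        "${YEAR}", String.valueOf(today.getYear()),
        "${MONTH}", String.format("%02d", today.getMonthValue()),
        "${DAY}", String.format("%02d", today.getDayOfMonth()));
    String path = template;
    for (Map.Entry<String, String> e : values.entrySet()) {
      path = path.replace(e.getKey(), e.getValue());
    }
    return path;
  }
}
```

With a template like `${NAMESPACE}/${STREAM_NAME}/${YEAR}_${MONTH}_${DAY}`, `resolve` would yield something like `public/users/2022_08_10`.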

@pkudinov

Is this abstracted enough to add support for Databricks on Google Cloud (i.e. add GCS as a target bucket along with AWS S3, and Azure Blob)?

@abaerptc

> Is this abstracted enough to add support for Databricks on Google Cloud (i.e. add GCS as a target bucket along with AWS S3, and Azure Blob)?

You should be able to add another implementation of the data storage method pretty easily based on what I've done here, yes.
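
To make that concrete: under the abstraction sketched after the file list above, a GCS flavor would be one more implementation, plus a branch in the factory and a matching stream copier. A hypothetical illustration (GCS support is not part of this PR):

```java
// Hypothetical GCS flavor of the DatabricksStorageConfig sketch shown earlier.
// GCS support is not part of this PR; this only illustrates the extension point.
record GcsConfig(String bucketName, String bucketPath) implements DatabricksStorageConfig {
  @Override
  public String getLocation() {
    return "gs://" + bucketName + "/" + bucketPath;
  }
}
```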


@natalyjazzviolin

@abaerptc looks like there are conflicts, could you resolve them please?

@tuliren there is a lot of community interest in this PR, would you be able to move it up in your queue for review/merging? Thank you in advance :)


@marcosmarxm

So sorry for the long delay in replying, @abaerptc. @tuliren is actively working to improve the Databricks destination with other features and to incorporate your contribution too. He will get back to you soon!

@tuliren

Sorry about the delay.

To test the Azure integration, we would need a separate Databricks cluster, which I have not been able to set up so far. Given that this has been delayed for a long time, I will do some local testing and mute the Azure-related integration test for now.

I aim to merge it today.

@tuliren

🎉

@tuliren

/publish connector=connectors/destination-databricks run-tests=false

🕑 Publishing the following connectors:
connectors/destination-databricks
https://github.com/airbytehq/airbyte/actions/runs/3255000004

| Connector | Did it publish? | Were definitions generated? |
| --- | --- | --- |
| connectors/destination-databricks | | |

If you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@tuliren

@abaerptc, thank you very much for your contribution. Sorry about the long delay.

Please don't forget to remove me from your repo.

jhammarstedt pushed a commit to jhammarstedt/airbyte that referenced this pull request

Oct 31, 2022

🎉 Destination Databricks: Support Azure storage (airbytehq#15329)

Co-authored-by: Marcos Marx <marcosmarxm@users.noreply.github.com>
Co-authored-by: marcosmarxm <marcosmarxm@gmail.com>
Co-authored-by: Liren Tu <tuliren@gmail.com>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>