Cloud Data Fusion overview

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines. The Cloud Data Fusion web interface lets you build scalable data integration solutions. It lets you connect to various data sources, transform the data, and then transfer it to various destination systems, without having to manage the infrastructure.

Cloud Data Fusion is powered by the open source project CDAP.

Get started with Cloud Data Fusion

You can start exploring Cloud Data Fusion in minutes.

Create a Cloud Data Fusion instance: get started by creating a Cloud Data Fusion instance.
Cost: before you begin, familiarize yourself with Cloud Data Fusion costs.
Concepts: understand the key terminology used in Cloud Data Fusion.
Quickstart: experience Cloud Data Fusion by creating your first pipeline.

Explore Cloud Data Fusion

The main components of Cloud Data Fusion are explained in the following sections.

Tenant project

The services required to build and orchestrate Cloud Data Fusion pipelines and to store pipeline metadata are provisioned in a tenant project, inside a tenancy unit. A separate tenant project is created for each customer project, in which Cloud Data Fusion instances are provisioned. The tenant project inherits all the networking and firewall configurations from the customer project.

Cloud Data Fusion: Console

The Cloud Data Fusion console, also referred to as the control plane, is a set of API operations and a web interface that deal with the Cloud Data Fusion instance itself, such as creating, deleting, restarting, and updating it.

Cloud Data Fusion: Studio

Cloud Data Fusion Studio, also referred to as the data plane, is a set of REST API and web interface operations that deal with the creation, execution, and management of pipelines and related artifacts.
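
The split between the control plane and the data plane can be illustrated by the URLs each one serves. The following Python sketch only builds request URLs and sends nothing. It assumes the `v1` Cloud Data Fusion API for instance management and the CDAP `v3` REST API for pipelines. The instance endpoint, project ID, and region in the usage note are hypothetical placeholders, and a real request would also need an OAuth 2.0 bearer token, which isn't shown here.

```python
from urllib.parse import quote

# Control plane: the Cloud Data Fusion API, which manages instances themselves.
CONTROL_PLANE_BASE = "https://datafusion.googleapis.com/v1"


def instances_url(project: str, location: str) -> str:
    """Control plane: URL that lists the instances in a project and region."""
    return (
        f"{CONTROL_PLANE_BASE}/projects/{quote(project, safe='')}"
        f"/locations/{quote(location, safe='')}/instances"
    )


def pipelines_url(api_endpoint: str, namespace: str = "default") -> str:
    """Data plane: URL that lists the pipelines (CDAP apps) in a namespace.

    api_endpoint is the instance-specific endpoint (an assumption here:
    the caller has already looked it up from the instance details).
    """
    base = api_endpoint.rstrip("/")
    return f"{base}/v3/namespaces/{quote(namespace, safe='')}/apps"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(instances_url("example-project", "us-central1"))
    print(pipelines_url("https://example-instance.invalid/api"))
```

The control-plane URL is the same for every instance in a project and region, while the data-plane URL is rooted at the endpoint of a single instance.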

Concepts

This section introduces some of the core concepts of Cloud Data Fusion.

Concept	Description
Cloud Data Fusion instance	A Cloud Data Fusion instance is a unique deployment of Cloud Data Fusion. To get started with Cloud Data Fusion, you create a Cloud Data Fusion instance through the Google Cloud console. You can create multiple instances in a single Google Cloud project and can specify the Google Cloud region to create your Cloud Data Fusion instances in. Based on your requirements and cost constraints, you can create a Developer, Basic, or Enterprise instance. Each Cloud Data Fusion instance contains a unique, independent Cloud Data Fusion deployment that contains a set of services, which handle pipeline lifecycle management, orchestration, coordination, and metadata management. These services run using long-running resources in a tenant project.
Namespace	A namespace is a logical grouping of applications, data, and the associated metadata in a Cloud Data Fusion instance. You can think of namespaces as a partitioning of the instance. In a single instance, one namespace stores the data and metadata of an entity independently from another namespace.
Pipeline	A pipeline is a way to visually design data and control flows to extract, transform, blend, aggregate, and load data from various on-premises and cloud data sources. Building pipelines lets you create complex data processing workflows that can help you solve data ingestion, integration, and migration problems. You can use Cloud Data Fusion to build both batch and real-time pipelines, depending on your needs. Pipelines let you express your data processing workflows using the logical flow of data, while Cloud Data Fusion handles all the functionality that is required to physically run in an execution environment.
Pipeline node	On the Studio page of the Cloud Data Fusion web interface, pipelines are represented as a series of nodes arranged in a directed acyclic graph (DAG), forming a one-way flow. Nodes represent the various actions that you can take with your pipelines, such as reading from sources, performing data transformations, and writing output to sinks. You can develop data pipelines in the Cloud Data Fusion web interface by connecting together sources, transformations, sinks, and other nodes.
Plugin	A plugin is a customizable module that can be used to extend the capabilities of Cloud Data Fusion. Cloud Data Fusion provides plugins for sources, transforms, aggregates, sinks, error collectors, alert publishers, actions, and post-run actions. A plugin is sometimes referred to as a node, usually in the context of the Cloud Data Fusion web interface. To discover and access the popular Cloud Data Fusion plugins, see Cloud Data Fusion plugins.
Hub	In the Cloud Data Fusion web interface, to browse plugins, sample pipelines, and other integrations, click Hub. When a new version of a plugin is released, it's visible in the Hub in any instance that's compatible. This applies even if the instance was created before the plugin was released.
Pipeline preview	Cloud Data Fusion Studio lets you test the accuracy of a pipeline design by using Preview on a subset of the data. A pipeline in preview runs in the tenant project.
Pipeline execution	Cloud Data Fusion creates ephemeral execution environments to execute pipelines. Cloud Data Fusion supports Managed Service for Apache Spark as an execution environment. Cloud Data Fusion provisions an ephemeral Managed Service for Apache Spark cluster in your customer project at the beginning of a pipeline run, executes the pipeline using Spark in the cluster, and then deletes the cluster after the pipeline execution is complete. Alternatively, if you manage your Managed Service for Apache Spark clusters in controlled environments, through technologies like Terraform, you can also configure Cloud Data Fusion to not provision clusters. In those environments, you can run pipelines against existing Managed Service for Apache Spark clusters.
Compute profile	A compute profile specifies how and where a pipeline is executed. A profile encapsulates any information required to set up and delete the physical execution environment of a pipeline. For example, a compute profile includes the following: the execution provisioner; resources (memory and CPU); the minimum and maximum node count; and other values. A profile is identified by name and must be assigned a provisioner and its related configuration. A profile can exist either at the Cloud Data Fusion instance level or at the namespace level. The Cloud Data Fusion default compute profile is Autoscaling.
Reusable pipeline	Reusable data pipelines in Cloud Data Fusion let you create a single pipeline that can apply a data integration pattern to a variety of use cases and datasets. Reusable pipelines are easier to manage because most of a pipeline's configuration is set at execution time, instead of being hard-coded at design time.
Trigger	Cloud Data Fusion supports creating a trigger on a data pipeline (called the downstream pipeline), to have it run at the completion of one or more different pipelines (called upstream pipelines). You choose when the downstream pipeline runs, for example, upon the success, failure, or stop of the upstream pipeline run, or any combination of these. Triggers are useful in the following cases: (1) cleansing your data once, and then making it available to multiple downstream pipelines for consumption; (2) sharing information, such as runtime arguments and plugin configurations, between pipelines, which is called Payload configuration; (3) having a set of dynamic pipelines that can run using the data of the hour, day, week, or month, instead of using a static pipeline that must be updated on every run.
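
The Pipeline node row describes a pipeline as nodes arranged in a directed acyclic graph. The following Python sketch shows that idea on a small pipeline with hypothetical node names: it derives an order in which every node comes after the nodes that feed it, and it rejects a graph that contains a cycle. It uses the standard library `graphlib` module (Python 3.9 or later) and illustrates the DAG concept only. It isn't how Cloud Data Fusion plans or runs pipelines.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each key is a node, each value lists the nodes it feeds.
connections = {
    "read_source": ["clean_records"],
    "clean_records": ["aggregate", "collect_errors"],
    "aggregate": ["write_sink"],
    "collect_errors": [],
    "write_sink": [],
}


def execution_order(edges: dict[str, list[str]]) -> list[str]:
    """Return the nodes so that each node appears after its upstream nodes."""
    sorter = TopologicalSorter()
    for node, downstream in edges.items():
        sorter.add(node)
        for target in downstream:
            # Record that `target` depends on `node`.
            sorter.add(target, node)
    # static_order() raises CycleError if the graph is not acyclic.
    return list(sorter.static_order())


def is_valid_pipeline(edges: dict[str, list[str]]) -> bool:
    """A pipeline graph is valid here only if it contains no cycle."""
    try:
        execution_order(edges)
        return True
    except CycleError:
        return False


if __name__ == "__main__":
    print(execution_order(connections))
    print(is_valid_pipeline({"a": ["b"], "b": ["a"]}))
```

Because the graph is acyclic, data flows one way from sources to sinks, so a valid order always exists.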

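The Reusable pipeline row describes setting most of the configuration at execution time instead of design time. In Cloud Data Fusion this is typically done with macros, written as `${name}` placeholders that are filled from runtime arguments. The following Python sketch is a simplified stand-in for that substitution, not the product's implementation: the property names, argument names, and bucket are hypothetical, and macro functions and nested macros are deliberately left out.

```python
import re

# Matches a simple ${name} placeholder (no nesting, no macro functions).
MACRO = re.compile(r"\$\{([^${}]+)\}")


def resolve_macros(config: dict[str, str], runtime_args: dict[str, str]) -> dict[str, str]:
    """Return a copy of config with every ${name} replaced by its runtime argument."""

    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in runtime_args:
            # Fail loudly rather than run with an unresolved placeholder.
            raise KeyError(f"no runtime argument for macro '{key}'")
        return runtime_args[key]

    return {name: MACRO.sub(substitute, value) for name, value in config.items()}


if __name__ == "__main__":
    # Hypothetical design-time configuration of a source node.
    design_time = {"path": "gs://${bucket}/${run_date}/input.csv", "format": "csv"}
    print(resolve_macros(design_time, {"bucket": "example-bucket", "run_date": "2024-01-01"}))
```

The same design-time configuration can then serve different datasets by changing only the runtime arguments passed to each run.
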
Cloud Data Fusion resources

Explore Cloud Data Fusion resources:

Release notes provide change logs of features, changes, and deprecations
Pricing for Cloud Data Fusion
Supported regions for Cloud Data Fusion
API and reference

What's next

See Cloud Data Fusion use cases.
Create a Cloud Data Fusion instance.
Work through a tutorial.

Last updated 2026-06-15 UTC.