What is Cloud Data Fusion in Google Cloud Platform (GCP)?

Last Updated : 09 Jul, 2024

Let's start with an introduction to Cloud Data Fusion. Cloud Data Fusion is a fully managed data integration service on Google Cloud. It provides a graphical user interface and APIs that save time and reduce complexity. Its user-friendly visual interface lets you build data pipelines without writing code.

It supports parallel execution, which helps significantly when processing large volumes of data.
You can use existing templates and connectors for Google Cloud and other cloud service providers.
A variety of built-in transformations help you get your data into the desired quality and format.
Cloud Data Fusion is extensible: it can be integrated with Apache Airflow, SQL engines, and more.

Benefits of Using Data Fusion

The following are the benefits of using Data Fusion:

It reduces complexity by providing a simplified graphical user interface.
It supports multiple triggers and extensions to integrate multiple sources.
It supports multi-core processing, which speeds up query execution.

The following are the primary terms related to Cloud Data Fusion:

Transformations (Transform)
Sink
Source
Error Handlers
Wranglers

1. Transformations

When creating a Data Fusion pipeline, a transformation is a step that changes the source data by applying rules to produce the desired result.

Example: CSV Formatter, Compressor.
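To make the idea of a transformation concrete, here is a minimal Python sketch of what a CSV-formatter style step does conceptually: it takes structured records and applies one rule to each of them. This is not the Data Fusion plugin itself; the function name `csv_format` and the sample records are assumptions made up for illustration.

```python
import csv
import io


def csv_format(record, fields, delimiter=","):
    """Turn one structured record (a dict) into a single CSV line.

    Hypothetical helper for illustration; not a Data Fusion API.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    # Missing fields become empty strings so every line has the same columns.
    writer.writerow([record.get(name, "") for name in fields])
    return buffer.getvalue().rstrip("\n")


rows = [
    {"id": 1, "name": "Asha", "city": "Pune"},
    {"id": 2, "name": "Lee, Min", "city": "Seoul"},
]

# Apply the same rule to every record, as a transformation stage would.
lines = [csv_format(row, ["id", "name", "city"]) for row in rows]
print(lines)
```

Note that the second record contains a comma inside a value, so the `csv` module quotes that field rather than breaking the column layout.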

2. Sink

Sink is the term used in Data Fusion for target objects, that is, the destinations a pipeline writes to. Target objects can be of different types.

Example: BigQuery, Cloud Storage (GCS)

3. Source

Source is the term used in Data Fusion for source objects, that is, the systems a pipeline reads from. Source objects can be of different types.

Example: Excel, Bigtable
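The relationship between a source, a transformation, and a sink can be sketched in plain Python. The example below is only a conceptual model, assuming in-memory lists in place of real systems such as Bigtable or BigQuery; `read_source`, `to_upper_name`, `ListSink`, and `run_pipeline` are hypothetical names, not part of Data Fusion.

```python
def read_source(rows):
    """Source: yield records one at a time (stands in for Excel, Bigtable, ...)."""
    for row in rows:
        yield dict(row)


def to_upper_name(record):
    """Transformation: apply a rule to a record and return the changed copy."""
    out = dict(record)
    out["name"] = out["name"].upper()
    return out


class ListSink:
    """Sink: collect the final records (stands in for BigQuery, GCS, ...)."""

    def __init__(self):
        self.written = []

    def write(self, record):
        self.written.append(record)


def run_pipeline(source, transforms, sink):
    """Push every source record through the transforms, then into the sink."""
    for record in source:
        for transform in transforms:
            record = transform(record)
        sink.write(record)
    return sink


sink = run_pipeline(
    read_source([{"id": 1, "name": "asha"}, {"id": 2, "name": "min"}]),
    [to_upper_name],
    ListSink(),
)
print(sink.written)
```

In Data Fusion you wire these same three roles together by dragging plugins onto the canvas instead of writing the loop yourself.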

4. Error Handlers

Error handlers in Data Fusion deal with errors that occur in pipelines, which helps keep data processing and query execution robust.
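The general idea behind an error handler is that a bad record should be set aside with a reason, rather than stopping the whole pipeline. The following Python sketch shows that pattern under simple assumptions; `parse_amount` and `run_with_error_handler` are hypothetical names, and the real Data Fusion error-handling plugins are configured in the UI rather than coded like this.

```python
def parse_amount(record):
    """Transformation that can fail: convert 'amount' from text to a float."""
    out = dict(record)
    out["amount"] = float(out["amount"])
    return out


def run_with_error_handler(records, transform):
    """Apply the transform; route failing records to a separate error list."""
    good, errors = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (KeyError, ValueError, TypeError) as exc:
            # Keep the original record and the reason, so it can be inspected later.
            errors.append(
                {"record": record, "error": f"{type(exc).__name__}: {exc}"}
            )
    return good, errors


records = [
    {"id": 1, "amount": "10.5"},
    {"id": 2, "amount": "n/a"},  # not a number
    {"id": 3},                   # field missing entirely
]

good, errors = run_with_error_handler(records, parse_amount)
print(good)
print(errors)
```

Here the valid record flows on, while the two problem records end up in `errors` together with the exception that rejected them.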

5. Wranglers

Wrangling in Data Fusion provides tools for data preparation, that is, cleaning, structuring, and enriching raw data to bring it into the desired format quickly.
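The three kinds of preparation mentioned above (cleaning, structuring, enriching) can be illustrated with a small Python sketch. It only mimics the type of steps you would apply interactively in the Wrangler; the `wrangle` function, the column names, and the sample data are assumptions for illustration.

```python
def wrangle(raw_rows):
    """Clean, structure, and enrich raw rows (hypothetical example steps)."""
    cleaned = []
    for row in raw_rows:
        # Cleaning: normalise column names and trim stray whitespace in values.
        row = {
            key.strip().lower(): (value.strip() if isinstance(value, str) else value)
            for key, value in row.items()
        }
        # Cleaning: drop rows that have no email address.
        if not row.get("email"):
            continue
        # Structuring: store age as an integer, or None when it is blank.
        age = row.get("age")
        row["age"] = int(age) if age not in (None, "") else None
        # Enriching: derive a new 'domain' column from the email address.
        row["email"] = row["email"].lower()
        row["domain"] = row["email"].split("@")[-1]
        cleaned.append(row)
    return cleaned


raw = [
    {" Name ": "  Asha ", "Email": "Asha@Example.com ", "Age": "31"},
    {" Name ": "Min", "Email": "", "Age": "27"},
    {" Name ": "Ravi", "Email": "ravi@example.org", "Age": ""},
]

result = wrangle(raw)
print(result)
```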

How to use Data Fusion in Google Cloud Console?

Step 1: In the Cloud console, from the Navigation menu select Data Fusion.

Step 2: Click the Create an Instance link at the top of the section to create a Cloud Data Fusion instance.

In the Create Data Fusion instance page that loads:

Step 3: A pictorial representation of the pipeline appears in the user interface, which is a graphical interface for developing data integration pipelines.

Step 4: From the options in the top-right menu, click Deploy. This submits the pipeline to Cloud Data Fusion.
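Creating an instance (Step 2) does not have to go through the console; it can also be requested through the Cloud Data Fusion REST API. The sketch below only builds the request and does not send it. The endpoint shape, the `instanceId` query parameter, and the `type` field reflect my understanding of the v1 API and should be checked against the official API reference before use; the project, location, and instance names are placeholders, and authentication is deliberately left out.

```python
import json
from urllib.parse import urlencode

# Assumed base URL of the Cloud Data Fusion v1 REST API; verify before use.
API_ROOT = "https://datafusion.googleapis.com/v1"


def build_create_instance_request(project_id, location, instance_id, edition="BASIC"):
    """Describe (but do not send) a request that creates a Data Fusion instance."""
    parent = f"projects/{project_id}/locations/{location}"
    query = urlencode({"instanceId": instance_id})
    return {
        "method": "POST",
        "url": f"{API_ROOT}/{parent}/instances?{query}",
        # 'type' is assumed to select the edition of the instance.
        "body": json.dumps({"type": edition}),
    }


# Placeholder values; replace with your own project, region, and instance name.
request = build_create_instance_request("my-project", "us-central1", "demo-instance")
print(request["method"], request["url"])
print(request["body"])
```

To actually send the request, you would add an OAuth 2.0 access token in an `Authorization` header and pass the URL and body to an HTTP client of your choice.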

What are the alternatives to Data Fusion in GCP?

The following services can be used as alternatives to Data Fusion:

Dataproc
Dataflow

1. Dataproc

Cloud Data Fusion lets you create ETL jobs through its graphical pipeline UI, whereas Dataproc lets you run manually written Spark, Hadoop, or Hive jobs, depending on your requirements. If your focus is data transformation and wrangling with a low-code or no-code solution, Data Fusion is the better fit.

2. Dataflow

Dataflow is a Google Cloud service that provides unified stream and batch data processing at scale. If your systems depend on Hadoop, it is wise to choose Dataproc over Dataflow.