Introduction to Data Engineering Agent

GSP1386

Overview

The traditional data engineering lifecycle—ingesting raw data, cleaning it, managing dependencies, and building analytical models—has historically required hundreds of lines of manual SQL and complex configuration files. We are now entering the era of Agentic Data Engineering, where the focus shifts from manual execution to high-level partnership.

An AI Agent in this context is not just a chatbot; it is a specialized collaborator designed to understand data schemas, identify file types, and translate complex business intent into production-ready architectures. In an agentic workflow, you provide the "intent"—the governance rules and business goals—while the agent handles the "execution," such as generating SQL, handling PII anonymization, and managing table dependencies.

In this lab, you act as a Lead Data Engineer partnering with an automated Data Engineering (DE) Agent to build a data pipeline for the marketing team of theLook, a fictional ecommerce company. Your goal is not to write the ETL scripts by hand, but to prompt the agent to design, build, and verify a customer segmentation pipeline that draws on several data sources and formats. You guide the agent through multi-format data ingestion (Parquet, CSV, Avro), PII anonymization, adaptation to your team's naming conventions, and the creation of an RFM model.

What you'll learn

In this lab, you learn how to:

Interface with a DE Agent to coordinate multi-source data ingestion.
Define requirements for data cleansing and PII masking.
Orchestrate the creation of an RFM (Recency, Frequency, Monetary) model.
Implement data quality checks.
[Optional] Verify pipeline logic by analyzing segment distributions against traffic source profiles.
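
To make the RFM terms in the list above concrete, here is a minimal Python sketch of the three measures. It is only an illustration: in the lab the agent generates SQL for you, and the order records, the `rfm` function, and the `as_of` date are all hypothetical names and values chosen for this example.

```python
from datetime import date

# Hypothetical order records: (customer_id, order_date, order_total).
orders = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 3, 1), 60.0),
    ("c2", date(2023, 11, 20), 15.0),
]

def rfm(orders, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    summary = {}
    for customer_id, order_date, total in orders:
        # Track the most recent order, the order count, and the total spend.
        last_order, count, spend = summary.get(customer_id, (order_date, 0, 0.0))
        summary[customer_id] = (max(last_order, order_date), count + 1, spend + total)
    return {
        cid: ((as_of - last_order).days, count, spend)
        for cid, (last_order, count, spend) in summary.items()
    }

scores = rfm(orders, as_of=date(2024, 3, 31))
print(scores["c1"])  # (30, 2, 100.0)
```

Recency is the number of days since the customer's last order, frequency is the number of orders, and monetary is the total spend. A segmentation model then buckets customers by these three values.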

Prerequisites

You should be generally familiar with the basics of data engineering and the basics of ETL or ELT pipelines. A basic understanding of config files is beneficial but not required.

Setup and requirements

Before you click the Start Lab button

Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources are made available to you.

This hands-on lab lets you do the lab activities in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials you use to sign in and access Google Cloud for the duration of the lab.

To complete this lab, you need:

Access to a standard internet browser (Chrome browser recommended).

Note: Use an Incognito (recommended) or private browser window to run this lab. This prevents conflicts between your personal account and the student account, which could otherwise result in extra charges to your personal account.

Time to complete the lab—remember, once you start, you cannot pause a lab.

Note: Use only the student account for this lab. If you use a different Google Cloud account, you may incur charges to that account.

How to start your lab and sign in to the Google Cloud console

Click the Start Lab button. If you need to pay for the lab, a dialog opens for you to select your payment method. On the right is the Lab setup and access panel with the following:
- The Open Google Cloud console button
- The temporary credentials (username and password) that you must use for this lab
- Other information, if needed, to step through this lab
Note that the lab timer is located near the top of the page, showing the remaining time.
Click Open Google Cloud console (or right-click and select Open Link in Incognito Window if you are running the Chrome browser).

The lab spins up resources, and then opens another tab that shows the Sign in page.

Tip: Arrange the tabs in separate windows, side-by-side.
Note: If you see the Choose an account dialog, click Use Another Account.
If necessary, copy the Username below and paste it into the Sign in dialog.
{{{user_0.username | "Username"}}}
You can also find the Username in the Lab setup and access panel.
Click Next.
Copy the Password below and paste it into the Welcome dialog.
{{{user_0.password | "Password"}}}
You can also find the Password in the Lab setup and access panel.
Click Next.
Important: You must use the credentials the lab provides you. Do not use your Google Cloud account credentials. Note: Using your own Google Cloud account for this lab may incur extra charges.
Click through the subsequent pages:
- Accept the terms and conditions.
- Do not add recovery options or two-factor authentication (because this is a temporary account).
- Do not sign up for free trials.

After a few moments, the Google Cloud console opens in this tab.

Note: To access Google Cloud products and services, click the Navigation menu or type the service or product name in the Search field.

Activate Cloud Shell

Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.

Click Activate Cloud Shell at the top of the Google Cloud console.
Click through the following windows:
- Continue through the Cloud Shell information window.
- Authorize Cloud Shell to use your credentials to make Google Cloud API calls.

When you are connected, you are already authenticated, and the project is set to your Project_ID. The output contains a line that declares the Project_ID for this session:

Your Cloud Platform project in this session is set to {{{project_0.project_id | "PROJECT_ID"}}}

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

(Optional) You can list the active account name with this command:

gcloud auth list

Click Authorize.

Output:

ACTIVE: *
ACCOUNT: {{{user_0.username | "ACCOUNT"}}}

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

(Optional) You can list the project ID with this command:

gcloud config list project

Output:

[core]
project = {{{project_0.project_id | "PROJECT_ID"}}}

Note: For full documentation of gcloud, refer to the gcloud CLI overview guide in the Google Cloud documentation.

Task 1. Create your first DEA pipeline

In this task, you create your first pipeline with the help of the Data Engineering Agent and use a natural language prompt to create a plan for the pipeline. Finally, you use your knowledge of data operations to assess what it generates.

Open the BigQuery console

In the Google Cloud Console, select Navigation menu > BigQuery.

The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.

Click Done.

The BigQuery console opens.

Configure the Pipeline

At the top of the BigQuery console, next to the + icon, click the down arrow, and then select Pipeline.
Select the Run with user credentials option, and then click Get Started.
If Ask Agent is not selectable at the top of the screen, hover over it, and then enable the Gemini in Data Analytic API.

When it's enabled, a dialog box titled Ask Agent appears.
Enter the prompt:
In the gs://{{{project_0.project_id|PROJECT_ID}}} bucket in Cloud Storage, use the subscription users document, products.parquet, order_items.avro, and orders.csv to create a plan to build an RFM customer segmentation pipeline. Clean the ingested tables before the analysis and make sure to anonymize PII. Include email and names in the segmentation. After creating the pipeline, compare the average monetary value for each RFM segment based on whether their traffic source is a type of social media or not using the column values from the data profile scan.
Observe three things:
- "subscription users document" refers to subscription_users.csv: This is the benefit of natural language. The other files are identified by their exact file names, but sometimes you only remember roughly what a file is called. Instead of searching Cloud Storage for the exact name, you describe the file and the DEA finds it.
- There are three different file types here: DEA can join datasets of different file types without issues, provided each one is a file type that BigQuery supports.
- DEA goes beyond just pipelining: After creating and preparing the necessary data, it also calculates a comparison that helps you verify the pipeline.
Press ENTER to submit the prompt.
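
The first two observations can be pictured with a small Python sketch. This is not how the agent resolves file names internally (the lab does not describe that); it is a hypothetical word-overlap match, and `resolve`, `FILLER_WORDS`, and `bucket_listing` are names invented for this example. The format names `CSV`, `PARQUET`, and `AVRO` are the source format identifiers BigQuery load jobs use.

```python
import re
from pathlib import PurePosixPath

# BigQuery load-job source formats keyed by file extension.
SOURCE_FORMATS = {".csv": "CSV", ".parquet": "PARQUET", ".avro": "AVRO"}

# Words that describe a file without appearing in its name (assumption).
FILLER_WORDS = {"document", "file", "table", "the"}

def words(text):
    """Split text into lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def resolve(description, object_names):
    """Pick the object whose name shares the most words with the description."""
    wanted = words(description) - FILLER_WORDS

    def overlap(name):
        return len(wanted & words(PurePosixPath(name).stem))

    best = max(object_names, key=overlap)
    if overlap(best) == 0:
        raise LookupError(f"no object matches {description!r}")
    return best, SOURCE_FORMATS[PurePosixPath(best).suffix.lower()]

bucket_listing = ["subscription_users.csv", "products.parquet",
                  "order_items.avro", "orders.csv"]
match = resolve("subscription users document", bucket_listing)
print(match)  # ('subscription_users.csv', 'CSV')
```

A loose description is enough here because only one object name shares words with it. If several objects matched equally well, you would need to name the file more precisely, which mirrors the advice to tell the agent the expected extension.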

It generates a list of steps and assumptions. Read through them and confirm the plan is what you want. There are a few things to watch for:
- Read the assumptions: is there anything you think is off-base?
- It should have recognized that "subscription users document" is subscription_users.csv. If not, either tell it to expect a .csv file or ask it to scan the full bucket and select from there
- There should be an Autoclean summary that discusses columns to anonymize
- An RFM analysis table should be the final step
Once everything is in working order, tell the agent you approve.

Afterwards a node diagram appears on the screen.
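
The node diagram is a dependency graph: a node can only run after the nodes it depends on. As a rough sketch of that idea, the following Python uses the standard library's `graphlib` (Python 3.9 or later) to derive a valid run order. The node names and edges are hypothetical and will differ from what the agent generates for you.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline nodes; each maps to the set of nodes it depends on.
dependencies = {
    "create_dataset": set(),
    "raw_orders": {"create_dataset"},
    "raw_users": {"create_dataset"},
    "stg_orders": {"raw_orders"},
    "stg_users": {"raw_users"},
    "rfm_analysis": {"stg_orders", "stg_users"},
}

# static_order() yields every node after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order[0], "->", run_order[-1])  # create_dataset -> rfm_analysis
```

Several orders can be valid for the nodes in the middle of the graph, so only the first and last positions are printed here.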

Click Check my progress to verify the objective. Create your first DEA pipeline

You created your first pipeline! But you are not quite done yet. Notice that, while the generated names are consistent and readable, you had no control over how the tables and datasets were created or what they were named. Your team has consistent naming conventions and rules for creating datasets. In the next section, you use pipeline instructions to adapt the Data Engineering Agent to your team's needs.
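
The Task 1 prompt asked the agent to anonymize PII while still including email and names in the segmentation. One common way to reconcile those two requests is pseudonymization: replacing each value with a stable digest so rows can still be grouped and joined. The sketch below illustrates that idea in Python. It is an assumption for illustration only, not a description of what the agent's Autoclean step produces, and `pseudonymize`, `mask_row`, and the salt value are hypothetical.

```python
import hashlib

def pseudonymize(value, salt):
    """Replace a PII value with a salted SHA-256 digest (hex string)."""
    normalized = value.strip().lower()
    return hashlib.sha256((salt + normalized).encode("utf-8")).hexdigest()

def mask_row(row, pii_columns, salt):
    """Return a copy of the row with the PII columns pseudonymized."""
    return {
        key: pseudonymize(val, salt) if key in pii_columns else val
        for key, val in row.items()
    }

# Hypothetical user record.
row = {"user_id": 7, "email": "Ada@example.com ", "first_name": "Ada"}
masked = mask_row(row, pii_columns={"email", "first_name"}, salt="lab-only-salt")
```

Because the value is normalized before hashing, the same email always maps to the same digest regardless of case or surrounding whitespace, while non-PII columns such as `user_id` pass through unchanged.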

Task 2. Customize the Data Engineering Agent with pipeline instructions

Now that you have created your first pipeline DAG, you are ready to go a step further and learn how to customize the agent to adapt to your workflows.

At the top of the BigQuery console, next to the + icon, click the down arrow, and then select Pipeline.
Select the Run with user credentials option, and then click Get Started.
Click Ask Agent towards the top of the screen or the Gemini logo at the bottom.
Before entering anything in the box, click Pipeline Instructions.
Select Create instructions file to open a blank GEMINI.md file.

This is where you enter directives to the DE agent to create your pipeline and any naming conventions you have. Your team:
- Prefers one BigQuery dataset for a specific department of your organization
- Wants to maximise the use of wildcard (*) identifiers when granting table-level permissions to certain groups
- Prefers separate table prefixes for raw, staging, production, and aggregated tables meant for downstream viewing
- Ensures column names use snake_case
- Wants logic explained
- Desires clear and concise code
- Needs straightforward dependency management
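
The link between table prefixes and wildcards in the list above can be sketched in a few lines of Python. This only demonstrates the pattern matching; how wildcard-based grants are actually configured is outside this example, and the table names and the `tables_matching` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical table names that follow a prefix-per-layer convention.
tables = ["raw_gcs_orders", "stg_orders", "prod_orders",
          "prod_rfm_analysis", "mart_segment_summary"]

def tables_matching(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]

analyst_tables = tables_matching("prod_*", tables)
dashboard_tables = tables_matching("mart_*", tables)
print(analyst_tables)  # ['prod_orders', 'prod_rfm_analysis']
```

With a consistent prefix per layer, one pattern selects every table a group should see, and new tables are covered as soon as they follow the convention.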
Copy and paste the following into the GEMINI.md file:
# Data Engineering Agent Instructions

## 1. General Guidelines

* Always use the `${ref("filename")}` function to reference other tables or operations within Dataform SQLX files to ensure correct dependency management.
* Generate clear and concise SQL code.
* Add comments to complex logic.

## 2. Naming Conventions

It's preferable to create one dataset to house all levels of data for a specific team. Adhere to the following naming conventions when creating new BigQuery objects:

* **Datasets:** `theLook_[team]` (e.g., `theLook_sales`, `theLook_marketing`)
* **Tables:**
    * raw_[source_name]: For raw data ingested from sources (e.g., raw_gcs_orders, raw_api_users).
    * stg_[entity_name]: For staging tables after initial cleaning and transformation (e.g., stg_orders, stg_users).
    * prod_[entity_name]: For tables that are meant for analyst teams to work on and pull from (e.g., prod_orders).
    * mart_[table_name]: For tables in the presentation layer or specific data marts; meant for use in dashboards for non-technical teams.
* **Columns:** Use `snake_case` (e.g., `order_id`, `customer_name`, `transaction_amount`)
After you paste the text, go back to the pipeline tab. Above Manage instructions, it now says "1 instruction file added".
Click Save.
Now copy and paste the below prompt into the dialog box:
In the {{{project_0.project_id|PROJECT_ID}}}-dea_agent_test bucket in Cloud Storage, use the subscription users document, products.parquet, order_items.avro, and orders.csv to create a plan to build an RFM customer segmentation pipeline. Clean the ingested tables before the analysis and make sure to anonymize PII. Include email and names in the segmentation. After creating the pipeline, compare the average monetary value for each RFM segment based on whether their traffic source is a type of social media or not using the column values from the data profile scan. This is for the marketing team.
Note: The line "This is for the marketing team" dictates the initial dataset name.
Press ENTER to submit the prompt.

It generates a list of steps and assumptions. Read through them and validate it’s what you want. Once again, read through the assumptions and ensure it captured all the correct data file names.
Once it all looks good, type ’approve’.
After nodes appear on the page, click the top node.

It should say create_dataset. You can now read the query behind this node: CREATE SCHEMA IF NOT EXISTS {{{project_0.project_id|PROJECT_ID}}}.theLook_marketing. This shows that DEA read your pipeline instructions and is applying them to your workflow.
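
If you want to double-check generated names against the conventions in GEMINI.md yourself, a small validator is one option. The following Python sketch encodes the dataset, table, and column rules as regular expressions. The exact patterns are an interpretation of the instructions file, and `violations` is a hypothetical helper, not part of BigQuery or the agent.

```python
import re

# Patterns interpreted from the GEMINI.md naming conventions (assumption).
DATASET_RE = re.compile(r"^theLook_[a-z][a-z0-9_]*$")
TABLE_RE = re.compile(r"^(raw|stg|prod|mart)_[a-z][a-z0-9_]*$")
SNAKE_CASE_RE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def violations(dataset, table, columns):
    """Return a list of human-readable naming problems (empty if none)."""
    problems = []
    if not DATASET_RE.match(dataset):
        problems.append(f"dataset {dataset!r} should look like theLook_[team]")
    if not TABLE_RE.match(table):
        problems.append(f"table {table!r} needs a raw_, stg_, prod_, or mart_ prefix")
    for column in columns:
        if not SNAKE_CASE_RE.match(column):
            problems.append(f"column {column!r} is not snake_case")
    return problems

ok = violations("theLook_marketing", "prod_rfm_analysis", ["customer_id", "monetary"])
bad = violations("marketing", "rfm", ["CustomerId"])
print(ok)        # []
print(len(bad))  # 3
```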
Click Run > Run all tasks.
Once finished, select the prod_rfm_analysis node, and on the left hand side, select Data preview.

In this view, observe the column names. The bars represent the distribution of values in the table.

Click Check my progress to verify the objective. Customize the Data Engineering Agent with pipeline instructions

Task 3. Enhance the pipeline with data quality checks

Now that the RFM analysis pipeline is built, enhance it by adding a data quality check using the Data Engineering Agent. You want to ensure that all monetary values in your final analysis table are non-negative, as is typical of RFM analyses. This task demonstrates how you can use DEA to iteratively refine pipelines and incorporate data governance best practices such as data quality assertions, either by having DEA act on the pipeline or by editing it yourself.

Return to the pipeline interface from Task 2.
Click Ask Agent.
Enter the following prompt:
Please add a data quality assertion to the `prod_rfm_analysis` table. The assertion should check that all values in the 'monetary' column are greater than or equal to 0. Name the assertion 'check_non_negative_monetary'.
Review the plan proposed by the agent. It should indicate modification of the script for the prod_rfm_analysis node.
Type approve to apply the changes.
Once updated, click on the prod_rfm_analysis node.

Examine the generated SQL code. You should see an added ASSERT statement within the pre_operations or post_operations block.
Click Apply at the top of the screen.
Click Run > Run all tasks again.

If any data violated this assertion, the pipeline would fail at this step, highlighting the data quality issue.
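
The fail-on-violation behavior can be sketched in Python as a check that collects the offending rows and stops the step if any exist. This mirrors the idea of the assertion rather than the SQL the agent generates; the row layout and the `run_step` helper are hypothetical.

```python
def check_non_negative_monetary(rows):
    """Return the rows that violate monetary >= 0 (empty list means pass)."""
    return [row for row in rows if row["monetary"] < 0]

def run_step(rows):
    """Run the check and stop the step if any row violates it."""
    bad = check_non_negative_monetary(rows)
    if bad:
        raise ValueError(
            f"check_non_negative_monetary failed for {len(bad)} row(s)"
        )
    return len(rows)

# Hypothetical rows from the analysis table.
clean_rows = [{"monetary": 0.0}, {"monetary": 12.5}]
processed = run_step(clean_rows)  # passes: no negative values
```

Calling `run_step` with a row such as `{"monetary": -1.0}` raises `ValueError`, which is the analogue of the pipeline failing at this step.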

What happens when you want to make another assertion but perhaps make a mistake? How would you amend that? Oftentimes it's easier to correct a mistake manually rather than describe the change to be made to the agent and have it remedy it.

Edit the pipeline manually

In the agent chatbox you have used in previous steps, insert this prompt and run:
Add another assertion that checks the age column is between 12 and 23. Call it 'check_reasonable_age'.
After it creates the node, run it.
Navigate to the Execution tab of the pipeline.

Observe that this execution fails.

Why? The check aims to ensure the validity of your data by verifying ages are within expectations. The website does not allow purchases from anyone younger than 12 and the oldest person to ever live was 122 years old. Look back at the prompt: the max value is 23 instead of 123. Rather than tell the Agent to fix it, you will fix it.
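
A quick Python sketch shows why the mistyped bound fails. The ages below are hypothetical sample values, not data from the lab tables, and `age_violations` is a helper invented for this example.

```python
def age_violations(ages, low, high):
    """Return the ages that fall outside the inclusive range [low, high]."""
    return [age for age in ages if not (low <= age <= high)]

ages = [19, 34, 58, 71]  # hypothetical customer ages

typo_failures = age_violations(ages, 12, 23)    # upper bound mistyped as 23
fixed_failures = age_violations(ages, 12, 123)  # intended upper bound

print(typo_failures)   # [34, 58, 71]
print(fixed_failures)  # []
```

With the upper bound at 23, every adult older than 23 is flagged even though the data is valid, so the assertion fails. With 123, the same data passes.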
Go back to the Pipeline tab.
Click the node you just created named 'check_reasonable_age'.
In the bottom panel, click Open > In new tab.
In the new tab that opens, change 23 to 123.
Return to the tab that contains the pipeline and look at the node's contents now.

Observe that it has updated here as well.
Run the task and go to the Execution tab once again.

Observe that now the execution succeeds.
Return to the pipeline tab and click Apply at the top of the page once again.

Click Check my progress to verify the objective. Enhance the pipeline with data quality checks

[Optional] Visualize your RFM analysis

You have already completed the lab tasks that explore the bulk of what the Data Engineering Agent can do. This optional section shows how you can build on that work in BigQuery to get quick visualisations of what the DEA outputs.

In the BigQuery Explorer, find the theLook_marketing dataset and expand it.
Locate the prod_rfm_analysis table.
Click the menu icon (three dots) next to the table name and select Open In > Python Notebook. This opens a new Colab Enterprise notebook.
Run the first two cells either by clicking into the cell and using the CTRL + ENTER keyboard shortcut or by pressing the "▶" that appears when a cell is selected.
Under the second cell, a button labeled "Visualize with results" appears. Click it.
Note: You might have to select it multiple times to see the resulting chart.
After the chart appears, on the right-hand panel, there is a section called Breakdown Dimension.
Click Add Dimension.
Select traffic_source_type.

A bar chart now shows average_monetary_value by traffic_source_type. In Task 2, the Data preview gave you a quick view of the distribution of values. Here you build on that with an easy way to visualize the data in more depth, which is useful for quick quality checks before you hand the tables created by your pipelines to your data teams. Feel free to explore the other functionality in the chart.
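
The aggregation behind this chart is a grouped average. As a rough Python sketch, the following computes the average monetary value per segment, split by whether the traffic source counts as social media. The segment names, the rows, and the contents of `SOCIAL_SOURCES` are assumptions for illustration; your table's values come from the data profile scan.

```python
from collections import defaultdict

# Assumption: which traffic sources count as social media.
SOCIAL_SOURCES = {"Facebook", "YouTube"}

# Hypothetical rows from an RFM analysis table.
rows = [
    {"segment": "Champions", "traffic_source": "Facebook", "monetary": 120.0},
    {"segment": "Champions", "traffic_source": "Search", "monetary": 80.0},
    {"segment": "Champions", "traffic_source": "YouTube", "monetary": 100.0},
    {"segment": "At Risk", "traffic_source": "Email", "monetary": 30.0},
]

def average_monetary(rows):
    """Average monetary value keyed by (segment, is_social_source)."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for row in rows:
        key = (row["segment"], row["traffic_source"] in SOCIAL_SOURCES)
        totals[key][0] += row["monetary"]
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

averages = average_monetary(rows)
print(averages[("Champions", True)])   # 110.0
print(averages[("Champions", False)])  # 80.0
```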

Congratulations!

You have successfully built and deployed a sophisticated data pipeline using the Data Engineering Agent.

By completing this lab, you have demonstrated proficiency in:

Cross-Format Ingestion: You successfully unified data from CSV, Parquet, and Avro files into a single BigQuery environment.
Automated Data Preparation: You used natural language prompts to handle PII anonymization and data cleaning without manual coding.
Strategic Analytics: You generated an RFM customer segmentation model and analyzed traffic sources to find the average monetary value for each segment.
Architectural Governance: You implemented a GEMINI.md file to ensure the agent follows professional standards for raw, staging, and production data layers.

You’ve seen how the Data Engineering Agent shrinks the gap between a business requirement and a production-ready pipeline. You are now ready to apply these automated patterns to your own data engineering challenges!

Next steps / Learn more

Read the documentation about what powers the Data Engineering Agent: BigQuery Pipelines Documentation
Read more about the Agent: Use the Data Engineering Agent to build and modify data pipelines
Is your data estate ready for AI? Get ideas on how to assess: The Era of Agentic AI: Is Your Data Platform Ready?
Learn about autocleaning to speed up more of your data engineering workflow: From Chaos to Clarity: Fast-Track Data Engineering with Autocleaning

Google Cloud training and certification

...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.

Manual Last Updated May 12, 2026

Lab Last Tested May 12, 2026

Copyright 2026 Google LLC. All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
