Before you begin
- Labs create a Google Cloud project and resources for a fixed time
- Labs have a time limit and no pause feature. If you end the lab, you'll have to restart from the beginning.
- On the top left of your screen, click Start lab to begin
Lab tasks:
- Enable the Dataproc API
- Create a BigQuery dataset
- Prepare the PySpark data quality script
- Run the batch pipeline
- Verify the data in BigQuery
Serverless for Apache Spark from Google Cloud is a fully-managed service that simplifies running Spark batch workloads without managing infrastructure. This pattern provides a robust approach for ETL (Extract, Transform, Load) workflows, ensuring that only high-quality data lands in your analytical systems.
Raw data ingested into a data lake often contains imperfections such as missing values, incorrect formats, or invalid entries. Loading this data directly into an analytical warehouse can corrupt reports and lead to poor business decisions.
The solution is to create an automated data quality pipeline. This pipeline intercepts raw data, applies a set of validation rules, and then intelligently routes the data: clean records are sent to the production data warehouse, while records that fail validation are sent to a "dead-letter queue" (DLQ) for inspection and remediation.
In this lab, you will build this solution by running a custom PySpark job on Serverless for Apache Spark. The job will:
- Read the raw customer CSV file from Cloud Storage
- Apply validation rules to each record
- Load the clean records into a BigQuery table
- Route the records that fail validation to a DLQ bucket in Cloud Storage
This pattern ensures that your data warehouse remains pristine and provides a clear, auditable process for handling data errors.
For example, a financial data pipeline might require that every stock record contain a ticker symbol and a close_price. Incomplete ticker data is sent to a DLQ, preventing corruption of time-series analysis models.

In this lab, you will learn how to:
- Run a serverless Spark batch job with a custom PySpark script
- Apply data quality rules to separate clean and invalid records
- Load validated data into BigQuery
- Route failed records to a dead-letter queue in Cloud Storage
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Google Skills using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.
Click the Activate Cloud Shell button at the top right of the console.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.
When you begin this lab, a Terraform script runs to automatically provision most of the necessary infrastructure and resources. The following have been created for you:
Project ID =
Region =
Zone =
- A custom VPC Network (spark-network) and Subnet (spark-subnet) configured with the required network access for Serverless for Apache Spark.
- Two Cloud Storage buckets:
  - A main bucket that stores the PySpark script (scripts/) and the raw input data (source/), and serves as a temporary staging area for the BigQuery connector.
  - A DLQ bucket that receives records that fail validation.
- A raw data file: A Python script has automatically generated and uploaded a 1,000-record CSV file (source/customer_contacts_1000.csv) to the main bucket. Approximately 20% of these records contain intentional imperfections (e.g., missing IDs, invalid emails) to test your pipeline.
Your goal is to write a script that can identify and separate the good records from the bad ones and then load them into the correct destinations.
First, you'll confirm that the lab resources were created correctly and preview the source data you will be working with.
In the Google Cloud Console, go to Cloud Storage > Buckets and confirm that two buckets exist: one ending in -main-bucket and another ending in -dlq-bucket.

Activate Cloud Shell.
Run the following command to view the header and the first 10 records of the raw CSV file located in your main bucket.
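The exact command is shown in the lab panel; assuming the bucket naming convention described above (a bucket ending in -main-bucket, with GOOGLE_CLOUD_PROJECT preset in Cloud Shell), it might look like:

```shell
# Preview the header row plus the first 10 data records of the raw CSV.
gsutil cat gs://${GOOGLE_CLOUD_PROJECT}-main-bucket/source/customer_contacts_1000.csv | head -n 11
```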
Before you can run a job, you must enable the Dataproc API. Run the following command in Cloud Shell to enable it:
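A minimal form of this command, using the standard service name for the Dataproc API:

```shell
# Enable the Dataproc API, which backs Serverless for Apache Spark batches.
gcloud services enable dataproc.googleapis.com
```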
Click Check my progress to verify your performed task.
Your Terraform script has set up the network and storage, but you still need to create the destination BigQuery dataset where your clean data will be loaded.
In Cloud Shell, run the following command to create a new BigQuery dataset named customer_data_clean.
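A minimal sketch of this step (the lab may additionally pin a dataset location with --location):

```shell
# Create the destination dataset for the clean records.
bq mk --dataset ${GOOGLE_CLOUD_PROJECT}:customer_data_clean
```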
You can now validate that the dataset was created successfully in the console. Use the Navigation Menu (☰) to go to BigQuery. In the Classic explorer panel, click the arrow next to your project ID to expand its contents, and you should see your new customer_data_clean dataset.
Click Check my progress to verify your performed task.
Now, create the custom PySpark script that contains the logic for validating the data. The script's logic is straightforward:
- Read the raw CSV file from the main bucket.
- Flag records with a missing ID or an invalid email format.
- Write the clean records to the BigQuery table.
- Write the failed records to the DLQ bucket.
In Cloud Shell, create the PySpark script file named customer_dq.py.
Paste the following commented Python code into the nano editor.
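The full commented script is provided in the lab panel and is not reproduced here. The following is a hedged sketch of the logic it implements; the column names (customer_id, email), the argument order, and the email regex are illustrative assumptions, not the lab's exact code:

```python
import sys
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Arguments passed after `--` on the gcloud command line (assumed order):
# source CSV path, BigQuery dataset.table, DLQ output path, temp staging bucket.
source_path, bq_dataset_table, dlq_path, temp_bucket = sys.argv[1:5]

spark = SparkSession.builder.appName("customer-dq").getOrCreate()

df = spark.read.option("header", True).csv(source_path)

# Validation rules: a non-null customer_id and a plausible email format.
# coalesce() makes the regex check null-safe so invalid rows are never dropped.
email_regex = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
is_valid = (
    F.col("customer_id").isNotNull()
    & F.coalesce(F.col("email"), F.lit("")).rlike(email_regex)
)

clean_df = df.filter(is_valid)
error_df = df.filter(~is_valid)

# Clean records go to BigQuery via the Spark BigQuery connector.
clean_df.write.format("bigquery") \
    .option("table", bq_dataset_table) \
    .option("temporaryGcsBucket", temp_bucket) \
    .mode("overwrite") \
    .save()

# Invalid records go to the DLQ bucket as CSV for later inspection.
error_df.write.option("header", True).mode("overwrite").csv(dlq_path)

spark.stop()
```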
Critical Note: Ensure the final line of the script is spark.stop(). Delete anything below it, such as </bq_dataset_table>.
Press CTRL+X, then Y, and then Enter to save and exit nano.
Upload your new PySpark script to the main Cloud Storage bucket.
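Assuming the main bucket naming convention described earlier, the upload might look like:

```shell
# Copy the script into the scripts/ folder of the main bucket.
gsutil cp customer_dq.py gs://${GOOGLE_CLOUD_PROJECT}-main-bucket/scripts/
```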
Click Check my progress to verify your performed task.
With the script uploaded, you can now configure and submit the job to Serverless for Apache Spark.
Set the following environment variables in Cloud Shell. These variables create shortcuts to the resources provisioned by Terraform.
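The exact variable names and values are shown in the lab panel; a sketch under assumed resource names (the region and the BigQuery table name contacts are placeholders) might look like:

```shell
# Assumed names; adjust to match the values shown in your lab details.
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1                        # use the region from the lab details
export BUCKET=${PROJECT_ID}-main-bucket          # main bucket created by Terraform
export DLQ_BUCKET=${PROJECT_ID}-dlq-bucket       # dead-letter bucket
export BQ_TABLE=${PROJECT_ID}.customer_data_clean.contacts
```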
Review the command below before running it. It submits your script as a batch job and passes your environment variables as arguments.
- --subnet: This flag is critical. It tells the job to run within the secure, custom spark-subnet created by Terraform, which is a security best practice.
- --deps-bucket: This flag specifies a GCS bucket for staging job dependencies.
- --: This double dash separates the gcloud command's flags from the arguments that are passed directly to your PySpark script.

Run the command to submit the job:
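The lab provides the exact command; assuming the environment variables from the previous step and the script arguments described above, it might look like:

```shell
# Submit the PySpark script as a serverless batch job.
gcloud dataproc batches submit pyspark gs://${BUCKET}/scripts/customer_dq.py \
    --region=${REGION} \
    --subnet=spark-subnet \
    --deps-bucket=gs://${BUCKET} \
    -- gs://${BUCKET}/source/customer_contacts_1000.csv \
       ${BQ_TABLE} \
       gs://${DLQ_BUCKET}/errors/ \
       ${BUCKET}
```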
Click Check my progress to verify your performed task.
Now that the pipeline has run, verify that only the clean records were loaded into BigQuery.
In Cloud Shell, run a query to count the clean records in the BigQuery table. The count should be approximately 800.
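Assuming the clean data landed in a table named contacts (the actual table name is set in the lab's script), the count query might look like:

```shell
# Count the clean records; roughly 800 of the 1,000 input rows should pass.
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS clean_records FROM `customer_data_clean.contacts`'
```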
To view a sample of the clean data, run the following command. The output will show records that all have valid IDs and email formats.
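Under the same assumed table name, a sample query might look like:

```shell
# Inspect a sample of the validated rows.
bq query --use_legacy_sql=false \
  'SELECT * FROM `customer_data_clean.contacts` LIMIT 10'
```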
Click Check my progress to verify your performed task.
Finally, verify that the records that failed the data quality checks were correctly routed to the DLQ bucket for later analysis.
In Cloud Shell, view a sample of the invalid records in the DLQ bucket. The head -n 11 command will show the header row plus the first 10 error records.
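Assuming the DLQ bucket naming convention and the errors/ output folder, the command might look like:

```shell
# Show the header plus the first 10 rejected records from the DLQ output.
gsutil cat gs://${GOOGLE_CLOUD_PROJECT}-dlq-bucket/errors/*.csv | head -n 11
```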
The command should return a sample of the approximately 200 records that failed validation. You will see rows with missing IDs or malformed emails.
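To see why such rows fail, the two checks can be reproduced in plain Python; the field names, regex, and sample rows here are illustrative assumptions, not the lab's data:

```python
import re

# A deliberately simple email pattern: something@something.something,
# with no whitespace or extra @ signs.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def is_valid(record: dict) -> bool:
    """A record passes only with a non-empty customer_id and a plausible email."""
    if not record.get("customer_id"):
        return False
    return bool(EMAIL_RE.match(record.get("email") or ""))

# Illustrative rows resembling the lab's generated data.
rows = [
    {"customer_id": "C001", "email": "ada@example.com"},  # clean
    {"customer_id": "",     "email": "bob@example.com"},  # missing ID -> DLQ
    {"customer_id": "C003", "email": "not-an-email"},     # malformed email -> DLQ
]

clean = [r for r in rows if is_valid(r)]
errors = [r for r in rows if not is_valid(r)]
```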
You can also view the error file directly in the Google Cloud Console:
1. Use the Navigation Menu to go to Cloud Storage > Buckets.
2. Open the bucket ending in -dlq-bucket, then open the errors/ folder.
3. Click the .csv file to open and view its contents in the browser.

You have successfully built and tested a production-grade, batch data quality pipeline!
In this lab, you wrote a custom PySpark job to validate and process a file from Cloud Storage, loaded the clean results into a BigQuery table, and routed the invalid records to a DLQ bucket, all within a pre-provisioned, secure network environment. This pattern is a foundational component of modern, reliable data platforms.
Copyright 2026 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.