Convert a custom pipeline into a custom Dataflow Flex Template.
Run a Dataflow Flex Template.
Prerequisites:
Basic familiarity with Python.
A pipeline that accepts command-line parameters is vastly more useful than one with those parameters hard-coded. However, running it still requires a development environment. For pipelines that are expected to be rerun by a variety of different users or in a variety of different contexts, an even better option is to use a Dataflow template.
Many Dataflow templates have already been created as part of Google Cloud Platform; to learn more, explore the Get started with Google-provided templates guide. However, none of them perform the same function as the pipeline in this lab, so in this part of the lab you convert the pipeline into a newer custom Dataflow Flex Template (as opposed to a custom traditional template).
Converting a pipeline into a custom Dataflow Flex Template requires a Docker container to package up your code and its dependencies, a Dockerfile to describe what code to build, Cloud Build to build the underlying container that will be executed at runtime to create the actual job, and a metadata file to describe the job parameters.
Setup and requirements
Lab setup
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
Check project permissions
Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).
In the Google Cloud console, on the Navigation menu, select IAM & Admin > IAM.
Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.
Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
Copy the project number (e.g. 729328892908).
On the Navigation menu, select IAM & Admin > IAM.
At the top of the roles table, below View by Principals, click Grant Access.
In the New principals field, enter {project-number}-compute@developer.gserviceaccount.com, replacing {project-number} with your project number.
For Role, select Project (or Basic) > Editor.
Click Save.
In the Google Cloud console, from the Navigation menu, select Vertex AI > Dashboard.
Click Enable All Recommended APIs.
In the Navigation menu, click Workbench.
At the top of the Workbench page, ensure you are in the Instances view.
Click Create New.
Configure the Instance:
Name: lab-workbench
Region: Set the region to
Zone: Set the zone to
Advanced Options (Optional): If needed, click "Advanced Options" for further customization (e.g., machine type, disk size).
Click Create.
This will take a few minutes to create the instance. A green checkmark will appear next to its name when it's ready.
Click Open JupyterLab next to the instance name to launch the JupyterLab interface. This will open a new tab in your browser.
Next, click Terminal. This will open up a terminal where you can run all the commands in this lab.
Download Code Repository
Next, you will download a code repository for use in this lab.
In the terminal you just opened, enter the following:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst
cd /home/jupyter/training-data-analyst/quests/dataflow_python/
On the left panel of your notebook environment, in the file browser, you will notice the training-data-analyst repo added.
Navigate into the cloned repo /training-data-analyst/quests/dataflow_python/. You will see a folder for each lab, which is further divided into a lab sub-folder with code to be completed by you and a solution sub-folder with a fully working example to reference if you get stuck.
Note: To open a file for editing purposes, simply navigate to the file and click on it. This will open the file, where you can add or modify code.
Click Check my progress to verify the objective.
Create notebook instance and clone course repo
Task 1. Set up your pipeline
For this lab, we will leverage the existing pipeline code from the Branching Pipelines lab (solutions folder).
Open the appropriate lab
In the terminal in your JupyterLab environment, run the following commands:
cd 2_Branching_Pipelines/lab
export BASE_DIR=$(pwd)
Set up the virtual environment and dependencies
Before you can begin editing the actual pipeline code, you need to ensure that you have installed the necessary dependencies.
Go back to the terminal you previously opened in your IDE environment, then create a virtual environment for your work in this lab:
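The exact virtual-environment commands are not included here; the following is a minimal sketch, assuming a Debian-based Workbench instance and the same Apache Beam SDK version pinned by the Dockerfile later in this lab (the environment name df-env is illustrative):
# Create and activate a virtual environment (name is illustrative)
sudo apt-get update && sudo apt-get install -y python3-venv
python3 -m venv df-env
source df-env/bin/activate
# Install build tooling and the Apache Beam SDK with GCP extras
python3 -m pip install -q --upgrade pip setuptools wheel
python3 -m pip install 'apache-beam[gcp]==2.60.0'
With the dependencies installed, set up the data environment: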
# Create GCS buckets and BQ dataset
cd $BASE_DIR/../..
source create_batch_sinks.sh
# Generate event data
source generate_batch_events.sh
# Change to the directory containing the practice version of the code
cd $BASE_DIR
Update your pipeline code
Update the my_pipeline.py file in your IDE by using the solution file, which can be found in training-data-analyst/quests/dataflow_python/2_Branching_Pipelines/solution/:
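The lab does not show the exact command; one way to do this, assuming the paths used earlier in this lab, is to copy the solution file over the practice version from the terminal:
# Copy the solution pipeline over the practice version
cp /home/jupyter/training-data-analyst/quests/dataflow_python/2_Branching_Pipelines/solution/my_pipeline.py $BASE_DIR/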
Click Check my progress to verify the objective.
Set up the data environment
Task 2. Create a custom Dataflow Flex Template container image
First, enable Kaniko cache use by default. Kaniko caches container build artifacts, so using this option speeds up subsequent builds. We will also use pip3 freeze to record the packages and their versions being used in our environment.
gcloud config set builds/use_kaniko True
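The pip3 freeze step is not shown explicitly; a minimal sketch, assuming you simply want to record the current environment in a requirements.txt file for reference, is:
# Record installed packages and versions (for reference; the Dockerfile below pins apache-beam directly)
pip3 freeze > requirements.txt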
Next, we will create our Dockerfile. This will specify the code and the dependencies we need to use.
a. To complete this task, create a new file in the dataflow_python/2_Branching_Pipelines/lab folder in the file explorer of your IDE.
b. To create the new file, click File >> New >> Text File.
c. Rename the file to Dockerfile (right-click the file name to rename it).
d. Open the Dockerfile in the editor panel by clicking on the file.
e. Copy the code below into the Dockerfile and save it:
FROM gcr.io/dataflow-templates-base/python3-template-launcher-base
ARG WORKDIR=/dataflow/template
RUN mkdir -p ${WORKDIR}
WORKDIR ${WORKDIR}
RUN apt-get update && apt-get install -y libffi-dev && rm -rf /var/lib/apt/lists/*
COPY my_pipeline.py .
ENV FLEX_TEMPLATE_PYTHON_PY_FILE="${WORKDIR}/my_pipeline.py"
RUN python3 -m pip install apache-beam[gcp]==2.60.0
Finally, use Cloud Build to build the container image:
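The build command is not included above; a minimal sketch, assuming you run it from the lab directory containing the Dockerfile and that the image name is illustrative (the TEMPLATE_IMAGE variable is reused when staging the template in Task 3):
export PROJECT_ID=$(gcloud config get-value project)
# Image name is illustrative; any Container Registry path in your project works
export TEMPLATE_IMAGE="gcr.io/$PROJECT_ID/my-pipeline:latest"
# Build the container image with Cloud Build and push it to the registry
gcloud builds submit --tag $TEMPLATE_IMAGE .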
This will take a few minutes to build and push the container.
Click Check my progress to verify the objective.
Create a custom Dataflow Flex Template container image
Task 3. Create and stage the flex template
To run a template, you need to create a template spec file in Cloud Storage containing all of the necessary information to run the job, such as the SDK information and metadata.
a. Create a new file in the dataflow_python/2_Branching_Pipelines/lab folder in the file explorer of your IDE.
b. To create the new file, click File >> New >> Text File.
c. Rename the file to metadata.json (right-click the file name to rename it).
d. Open the metadata.json file in the editor panel. To open it, right-click the metadata.json file and select Open With >> Editor.
e. To complete this task, create a metadata.json file in the following format that accounts for all of the input parameters your pipeline expects. Refer to the solution if needed. This requires you to write your own regular expressions to validate each parameter. While not best practice, ".*" will match on any input.
{
  "name": "Your pipeline name",
  "description": "Your pipeline description",
  "parameters": [
    {
      "name": "inputPath",
      "label": "Input file path.",
      "helpText": "Path to events.json file.",
      "regexes": [
        ".*\\.json"
      ]
    },
    {
      "name": "outputPath",
      "label": "Output file location",
      "helpText": "GCS Coldline Bucket location for raw data",
      "regexes": [
        "gs:\\/\\/[a-zA-Z0-9\\-\\_\\/]+"
      ]
    },
    {
      "name": "tableName",
      "label": "BigQuery output table",
      "helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.",
      "isOptional": true,
      "regexes": [
        "[^:]+:[^.]+[.].+"
      ]
    }
  ]
}
Then build and stage the actual template:
export PROJECT_ID=$(gcloud config get-value project)
export TEMPLATE_PATH="gs://${PROJECT_ID}/templates/mytemplate.json"
# Will build and upload the template to GCS
gcloud dataflow flex-template build $TEMPLATE_PATH \
--image "$TEMPLATE_IMAGE" \
--sdk-language "PYTHON" \
--metadata-file "metadata.json"
Verify that the file has been uploaded to the template location in Cloud Storage.
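If you prefer to check from the terminal, a quick way to verify (assuming the TEMPLATE_PATH variable from the previous step) is:
# List the staged template spec file in Cloud Storage
gsutil ls $TEMPLATE_PATH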
Click Check my progress to verify the objective.
Create and stage the Flex Template
Task 4. Execute the template from the UI
To complete this task, follow the instructions below:
Go to the Dataflow page in the Google Cloud console.
Click CREATE JOB FROM TEMPLATE.
Enter a valid job name in the Job name field.
Set the Regional endpoint to .
Select Custom template from the Dataflow template drop-down menu.
Enter the Cloud Storage path to your template file in the template Cloud Storage path field.
Input the appropriate items under Required parameters:
a. For Input file path, enter
b. For Output file location, enter
c. For BigQuery output table, enter
Click RUN JOB.
Note: You don't need to specify a staging bucket; Dataflow will create a private one in your project using your project number, similar to
Examine the Compute Engine console and you will see a temporary launcher VM that is created to execute your container and initiate your pipeline with the provided parameters.
Task 5. Execute the template using gcloud
One of the benefits of using Dataflow templates is the ability to execute them from a wider variety of contexts than a development environment. To demonstrate this, use gcloud to execute a Dataflow template from the command line.
To complete this task, execute the following command in your terminal, modifying the parameters as appropriate:
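The exact command is not included above; a minimal sketch, assuming the TEMPLATE_PATH and PROJECT_ID variables from Task 3, a REGION variable set to your lab region, and illustrative parameter values that match the metadata.json (the job name and paths are placeholders):
# Region and job name are placeholders; substitute your own values
export REGION=us-central1
gcloud dataflow flex-template run my-flex-template-job \
  --template-file-gcs-location "$TEMPLATE_PATH" \
  --region "$REGION" \
  --parameters "inputPath=gs://${PROJECT_ID}/events.json" \
  --parameters "outputPath=gs://${PROJECT_ID}-coldline/" \
  --parameters "tableName=${PROJECT_ID}:logs.logs"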
Click Check my progress to verify the objective.
Execute the template from the UI and using gcloud
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.