Serverless Data Processing with Dataflow - Custom Dataflow Flex Templates (Java)

Lab · 2 hours · 5 credits · Advanced
Note: This lab may have AI tools incorporated to support your learning.

Overview

A pipeline that accepts command-line parameters is vastly more useful than one with those parameters hard-coded. However, running it requires creating a development environment. An even better option for pipelines that are expected to be rerun by a variety of different users or in a variety of different contexts would be to use a Dataflow template.

There are many Dataflow templates that have already been created as part of Google Cloud Platform, which you can explore in the Google-provided templates documentation. But none of them perform the same function as the pipeline in this lab. Instead, in this part of the lab, you convert the pipeline into a newer custom Dataflow Flex Template (as opposed to a custom traditional template).

Converting a pipeline into a custom Dataflow Flex Template requires the use of an Uber JAR to package up your code and the dependencies, a Dockerfile to describe what code to build, Cloud Build to build the underlying container that will be executed at runtime to create the actual job, and a metadata file to describe the job parameters.

Prerequisites

Basic familiarity with Java.

What you learn

In this lab, you:

  • Convert a custom pipeline into a custom Dataflow Flex Template.
  • Run a Dataflow Flex Template.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Activate Google Cloud Shell

Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud.

Google Cloud Shell provides command-line access to your Google Cloud resources.

  1. In Cloud console, on the top right toolbar, click the Open Cloud Shell button.

    Highlighted Cloud Shell icon

  2. Click Continue.

It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:

Project ID highlighted in the Cloud Shell Terminal

gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.

  • You can list the active account name with this command:
gcloud auth list

Output:

Credentialed accounts: - @.com (active)

Example output:

Credentialed accounts: - google1623327_student@qwiklabs.net
  • You can list the project ID with this command:
gcloud config list project

Output:

[core] project =

Example output:

[core] project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), select IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.

Compute Engine default service account name and editor status highlighted on the Permissions tabbed page

Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
  1. In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
  2. Copy the project number (e.g. 729328892908).
  3. On the Navigation menu, select IAM & Admin > IAM.
  4. At the top of the roles table, below View by Principals, click Grant Access.
  5. For New principals, type:
{project-number}-compute@developer.gserviceaccount.com
  6. Replace {project-number} with your project number.
  7. For Role, select Project (or Basic) > Editor.
  8. Click Save.
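If you prefer the command line, a roughly equivalent grant can be made from Cloud Shell. This is a sketch; it looks up the project number with gcloud and assumes the account you are signed in with is allowed to modify IAM policy:

# Grant the Editor role to the default Compute Engine service account
# (the command-line equivalent of the console steps above)
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe $PROJECT_ID --format='value(projectNumber)')
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
  --role="roles/editor"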

Setting up your IDE

For the purposes of this lab, you will mainly be using a Theia Web IDE hosted on Google Compute Engine. It has the lab repo pre-cloned. There is Java language server support, as well as a terminal for programmatic access to Google Cloud APIs via the gcloud command-line tool, similar to Cloud Shell.

  1. To access your Theia IDE, copy and paste the link shown in Google Cloud Skills Boost to a new tab.
Note: You may need to wait 3-5 minutes for the environment to be fully provisioned, even after the URL appears. Until then, you will see an error in the browser.

Credentials pane displaying the ide_url

The lab repo has been cloned to your environment. Each lab is divided into a labs folder with code to be completed by you, and a solution folder with a fully workable example to reference if you get stuck.

  2. Click on the File Explorer button to view the lab files:

Expanded File Explorer menu with the labs folder highlighted

You can also create multiple terminals in this environment, just as you would with Cloud Shell:

New Terminal option highlighted in the Terminal menu

You can see by running gcloud auth list in the terminal that you're logged in as a provided service account, which has the exact same permissions as your lab user account:

Terminal displaying the gcloud auth list command

If at any point your environment stops working, you can try resetting the VM hosting your IDE from the GCE console like this:

Both the Reset button and VM instance name highlighted on the VM instances page

Task 1. Set up your pipeline

For this lab, we will leverage the existing pipeline code from the Branching Pipelines lab (solution folder).

Open the appropriate lab

  1. Create a new terminal in your IDE environment, if you haven't already, and copy and paste the following command:
# Change directory into the lab
cd 2_Branching_Pipelines/labs
# Download dependencies
mvn clean dependency:resolve
export BASE_DIR=$(pwd)
  2. Set up the data environment:
# Create GCS buckets and BQ dataset
cd $BASE_DIR/../..
source create_batch_sinks.sh
# Generate event dataflow
source generate_batch_events.sh
# Change to the directory containing the practice version of the code
cd $BASE_DIR
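If you want to confirm that the buckets and the BigQuery dataset were created, an optional check from the same terminal (the gsutil and bq tools are available in the IDE terminal just as in Cloud Shell) is:

# List the Cloud Storage buckets and BigQuery datasets in the project
# to confirm the resources created by the scripts above
gsutil ls
bq ls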

Click Check my progress to verify the objective. Set up the data environment

Update your pipeline code

  • Update the MyPipeline.java in your IDE by using the solution file, which can be found in 2_Branching_Pipelines/solution/src/main/java/com/mypackage/pipeline:
cp /home/project/training-data-analyst/quests/dataflow/2_Branching_Pipelines/solution/src/main/java/com/mypackage/pipeline/MyPipeline.java $BASE_DIR/src/main/java/com/mypackage/pipeline/
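If you want to see which command-line parameters the updated pipeline expects (the same names you will later account for in the template metadata and pass at run time), you can search the copied solution file for its option getters. This is a sketch; it assumes the options follow the usual Beam pattern of String getters defined in MyPipeline.java:

# Show the pipeline option getters, i.e. the parameter names the template will accept
grep -n 'String get' $BASE_DIR/src/main/java/com/mypackage/pipeline/MyPipeline.java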

Task 2. Create a custom Dataflow Flex Template container image

  1. To complete this task, first add the following plugin to your pom.xml file to enable building an Uber JAR. Add this in the properties tag:
<maven-shade-plugin.version>3.2.3</maven-shade-plugin.version>
  2. Then add this in the build plugins tag:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>${maven-shade-plugin.version}</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
      </configuration>
    </execution>
  </executions>
</plugin>
  3. Now you can build an Uber JAR file using this command:
cd $BASE_DIR
mvn clean package

  4. Note the size of the Uber JAR file:
ls -lh target/*.jar

This Uber JAR file has all the dependencies embedded in it, so you can run it as a standalone application with no external dependencies on other libraries.
  5. In the same directory as your pom.xml file, create a file named Dockerfile with the following text. Be sure to set FLEX_TEMPLATE_JAVA_MAIN_CLASS to your full class name and YOUR-JAR-HERE to the Uber JAR that you've created.
FROM gcr.io/dataflow-templates-base/java11-template-launcher-base:latest

# Define the Java command options required by Dataflow Flex Templates.
ENV FLEX_TEMPLATE_JAVA_MAIN_CLASS="YOUR-CLASS-HERE"
ENV FLEX_TEMPLATE_JAVA_CLASSPATH="/template/pipeline.jar"

# Make sure to package as an uber-jar including all dependencies.
COPY target/YOUR-JAR-HERE.jar ${FLEX_TEMPLATE_JAVA_CLASSPATH}
  6. You will then use Cloud Build to offload the building of this container for you, rather than building it locally. First, turn on caching to speed up future builds:
gcloud config set builds/use_kaniko True
  7. Then execute the actual build. This will tar up the entire directory, including the Dockerfile with instructions on what to build, upload it to the service, build a container, and push that container to Artifact Registry in your project for future use.
# Set the project ID if it is not already defined in your session
export PROJECT_ID=$(gcloud config get-value project)
export TEMPLATE_IMAGE="gcr.io/$PROJECT_ID/my-pipeline:latest"
gcloud builds submit --tag $TEMPLATE_IMAGE .

You can monitor the build status from the Cloud Build UI, and you can see that the resulting container has been uploaded to Artifact Registry.
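If you would rather verify from the terminal instead of the Cloud console, one option (a sketch; it assumes $PROJECT_ID is still set as in the build step above) is to list the images in your project's registry:

# List the container images in the project's gcr.io registry and
# the tags available for the template image built above
gcloud container images list --repository=gcr.io/$PROJECT_ID
gcloud container images list-tags gcr.io/$PROJECT_ID/my-pipeline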

Click Check my progress to verify the objective. Create a custom Dataflow Flex Template container image

Task 3. Create and stage the Flex Template

To run a template, you need to create a template spec file in a Cloud Storage bucket containing all of the necessary information to run the job, such as the SDK information and metadata.

  1. To complete this task, create a metadata.json file in the following format that accounts for all of the input parameters your pipeline expects.

Refer to the solution if needed. Note that you need to write your own regexes to validate each parameter; while not best practice, ".*" will match any input.

{ "name": "Your pipeline name", "description": "Your pipeline description", "parameters": [ { "name": "inputSubscription", "label": "Pub/Sub input subscription.", "helpText": "Pub/Sub subscription to read from.", "regexes": [ "[-_.a-zA-Z0-9]+" ] }, { "name": "outputTable", "label": "BigQuery output table", "helpText": "BigQuery table spec to write to, in the form 'project:dataset.table'.", "is_optional": true, "regexes": [ "[^:]+:[^.]+[.].+" ] } ] }
  2. Then build and stage the actual template:
export TEMPLATE_PATH="gs://${PROJECT_ID}/templates/mytemplate.json"
# Will build and upload the template to GCS
gcloud dataflow flex-template build $TEMPLATE_PATH \
  --image "$TEMPLATE_IMAGE" \
  --sdk-language "JAVA" \
  --metadata-file "metadata.json"
  3. Verify that the file has been uploaded to the template location in Cloud Storage.
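A quick way to do this check from the terminal, assuming $TEMPLATE_PATH is still set from the previous step:

# Confirm that the template spec file exists at the expected location
gsutil ls $TEMPLATE_PATH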

Click Check my progress to verify the objective. Create and stage the Flex Template

Task 4. Execute the template from the UI

To complete this task, follow the instructions below:

  1. Go to the Dataflow page in the Google Cloud console.
  2. Click Create job from template.
  3. Enter a valid job name in the Job name field.
  4. Select Custom template from the Dataflow template drop-down menu.
  5. Enter the Cloud Storage path to your template file in the template Cloud Storage path field.
  6. Input the appropriate items under Required parameters.
  7. Click Run job.
Note: You don't need to specify a staging bucket; Dataflow will create a private one in your project using your project number, similar to gs://dataflow-staging--/staging.
  8. Examine the Compute Engine console and you will see a temporary launcher VM that is created to execute your container and initiate your pipeline with the provided parameters.
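You can also observe this from the terminal; while the job is starting, list the Compute Engine instances in the project (assuming gcloud in your terminal is configured for the lab project):

# A temporary Flex Template launcher VM appears here while the job is being created
gcloud compute instances list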

Task 5. Execute the template using gcloud

One of the benefits of using Dataflow templates is the ability to execute them from a wider variety of contexts, other than a development environment. To demonstrate this, use gcloud to execute a Dataflow template from the command line.

  1. To complete this task, execute the following command in your terminal, modifying the parameters as appropriate:
export PROJECT_ID=$(gcloud config get-value project)
export REGION={{{project_0.default_region | Region}}}
export JOB_NAME=mytemplate-$(date +%Y%m%H%M%S)
export TEMPLATE_LOC=gs://${PROJECT_ID}/templates/mytemplate.json
export INPUT_PATH=gs://${PROJECT_ID}/events.json
export OUTPUT_PATH=gs://${PROJECT_ID}-coldline/
export BQ_TABLE=${PROJECT_ID}:logs.logs_filtered
gcloud dataflow flex-template run ${JOB_NAME} \
  --region=$REGION \
  --template-file-gcs-location ${TEMPLATE_LOC} \
  --parameters "inputPath=${INPUT_PATH},outputPath=${OUTPUT_PATH},tableName=${BQ_TABLE}"
  2. Ensure that your pipeline completes successfully.
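To check on the job from the terminal instead of the console (a sketch; it assumes $REGION and $JOB_NAME are still set from the previous step):

# List Dataflow jobs in the region and filter for this job;
# the job has completed successfully when its state shows Done
gcloud dataflow jobs list --region=$REGION | grep ${JOB_NAME}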

Click Check my progress to verify the objective. Execute the template from the UI and using gcloud

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
