
Creating Reusable Pipelines in Cloud Data Fusion

Lab: 1 hour 30 minutes · Credits: 5 · Level: Advanced

GSP810

Google Cloud Self-Paced Labs logo

Overview

In this lab you will learn how to build a reusable pipeline that reads data from Cloud Storage, performs data quality checks, and writes to Cloud Storage.

Objectives

What you'll learn

  • How to use the Argument Setter plugin to allow the pipeline to read different input in every run.
  • How to use the Argument Setter plugin to allow the pipeline to perform different quality checks in every run.
  • How to write the output data of each run to Cloud Storage.

Setup and requirements

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Google Skills using an incognito window.

  2. Note the lab's access time (for example, 02:00:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

    Note: Once you click Start lab, it will take about 15 - 20 minutes for the lab to provision necessary resources and create a Data Fusion instance. During that time, you can read through the steps below to get familiar with the goals of the lab.

    When you see lab credentials (Username and Password) in the left panel, the instance is created and you can continue logging into the console.
  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud console.

  5. Click Open Google console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Note: Do not click End lab unless you have finished the lab or want to restart it. This clears your work and removes the project.

Log in to Google Cloud Console

  1. Using the browser tab or window you are using for this lab session, copy the Username from the Connection Details panel and click the Open Google Console button.
Note: If you are asked to choose an account, click Use another account.
  2. Paste in the Username, and then the Password as prompted.
  3. Click Next.
  4. Accept the terms and conditions.

Since this is a temporary account, which will last only as long as this lab:

  • Do not add recovery options
  • Do not sign up for free trials
  5. Once the console opens, view the list of services by clicking the Navigation menu (Navigation menu icon) at the top-left.

Navigation menu

Activate Cloud Shell

Cloud Shell is a virtual machine that contains development tools. It offers a persistent 5-GB home directory and runs on Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources. gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab completion.

  1. Click the Activate Cloud Shell button (Activate Cloud Shell icon) at the top right of the console.

  2. Click Continue.
    It takes a few moments to provision and connect to the environment. When you are connected, you are also authenticated, and the project is set to your PROJECT_ID.

Sample commands

  • List the active account name:
gcloud auth list

(Output)

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

(Example output)

Credentialed accounts:
 - google1623327_student@qwiklabs.net
  • List the project ID:
gcloud config list project

(Output)

[core]
project = <project_ID>

(Example output)

[core]
project = qwiklabs-gcp-44776a13dea667a6

Note: Full documentation of gcloud is available in the gcloud CLI overview guide.

Check project permissions

Before you begin working on Google Cloud, you must ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud overview.

Default compute service account

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  1. In the Google Cloud console, on the Navigation menu, click Cloud overview.

  2. From the Project info card, copy the Project number.

  3. On the Navigation menu, click IAM & Admin > IAM.

  4. At the top of the IAM page, click Add.

  5. For New principals, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  6. For Select a role, select Basic (or Project) > Editor.

  7. Click Save.
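If you prefer the command line, the same grant can be made from Cloud Shell. This is a sketch of an equivalent approach using standard gcloud commands, assuming the active project is set in $GOOGLE_CLOUD_PROJECT:

```shell
# Look up the project number for the active project.
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" \
    --format="value(projectNumber)")

# Grant the Editor role to the default Compute Engine service account.
gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
    --member="serviceAccount:${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --role="roles/editor"
```

Because these commands modify project IAM policy, they require the same permissions as the console steps above.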

Task 1. Add necessary permissions for your Cloud Data Fusion instance

  1. On the Google Cloud console title bar, type Data Fusion in the Search field, then click Data Fusion in the search results. Click Instances.
Note: Creation of the instance takes around 20 minutes. Refresh your browser periodically until you see an instance being created. Please wait for it to be ready.

Next, you will grant permissions to the service account associated with the instance, using the following steps.

  1. From the Google Cloud console, navigate to IAM & Admin > IAM.

  2. Confirm that the Compute Engine default service account {project-number}-compute@developer.gserviceaccount.com is present, then copy the service account to your clipboard.

  3. On the IAM Permissions page, click +Grant Access.

  4. In the New principals field paste the service account.

  5. Click into the Select a role field and start typing Cloud Data Fusion API Service Agent, then select it.

  6. Click ADD ANOTHER ROLE.

  7. Add the Dataproc Administrator role.

  8. Click Save.
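The two role grants above can also be scripted in Cloud Shell. A minimal sketch, assuming the standard role IDs roles/datafusion.serviceAgent (Cloud Data Fusion API Service Agent) and roles/dataproc.admin (Dataproc Administrator):

```shell
# Resolve the project number and build the service account address.
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" \
    --format="value(projectNumber)")
SA="${PROJECT_NUMBER}-compute@developer.gserviceaccount.com"

# Grant both roles to the default Compute Engine service account.
for ROLE in roles/datafusion.serviceAgent roles/dataproc.admin; do
  gcloud projects add-iam-policy-binding "$GOOGLE_CLOUD_PROJECT" \
      --member="serviceAccount:$SA" --role="$ROLE"
done
```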

Click Check my progress to verify the objective. Add Cloud Data Fusion API Service Agent role to service account

Grant service account user permission

  1. In the console, on the Navigation menu, click IAM & admin > IAM.

  2. Select the Include Google-provided role grants checkbox.

  3. Scroll down the list to find the Google-managed Cloud Data Fusion service account that looks like service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com and then copy the service account name to your clipboard.

Google-managed Cloud Data Fusion service account listing

  4. Next, navigate to IAM & admin > Service Accounts.

  5. Click the default Compute Engine account that looks like {project-number}-compute@developer.gserviceaccount.com, and select the Principals with access tab on the top navigation.

  6. Click the Grant Access button.

  7. In the New principals field, paste the service account you copied earlier.

  8. In the Role dropdown menu, select Service Account User.

  9. Click Save.
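Equivalently, the Service Account User binding can be added from Cloud Shell. A sketch, assuming the Google-managed Data Fusion service account follows the service-{project-number}@gcp-sa-datafusion.iam.gserviceaccount.com pattern shown above:

```shell
PROJECT_NUMBER=$(gcloud projects describe "$GOOGLE_CLOUD_PROJECT" \
    --format="value(projectNumber)")

# Let the Google-managed Data Fusion service account act as the
# default Compute Engine service account.
gcloud iam service-accounts add-iam-policy-binding \
    "${PROJECT_NUMBER}-compute@developer.gserviceaccount.com" \
    --member="serviceAccount:service-${PROJECT_NUMBER}@gcp-sa-datafusion.iam.gserviceaccount.com" \
    --role="roles/iam.serviceAccountUser"
```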

Task 2. Set up the Cloud Storage bucket

Next, you will create a Cloud Storage bucket in your project to be used later to store results when your pipeline runs.

  • In Cloud Shell, execute the following commands to create a new bucket:

    export BUCKET=$GOOGLE_CLOUD_PROJECT
    gcloud storage buckets create gs://$BUCKET

The created bucket has the same name as your Project ID.

Click Check my progress to verify the objective. Setup Cloud Storage bucket

Task 3. Navigate to the Cloud Data Fusion UI

  1. Navigate to Data Fusion, click Instances, then click the View Instance link next to your Data Fusion instance. Select your lab credentials to sign in, if required.
Note: If you encounter an error 500 warning, close the error browser page and repeat step 1.
  2. If prompted to take a tour of the service, click No, Thanks. You should now be in the Cloud Data Fusion UI.

Task 4. Deploy the Argument Setter plugin

  1. In the Cloud Data Fusion web UI, click Hub on the upper right.

Cloud Data Fusion UI with Hub highlighted

  2. Click the Argument setter action plugin and click Deploy.

  3. In the Deploy window that opens, click Finish.

  4. Click Create a pipeline. The Pipeline Studio page opens.

Task 5. Read from Cloud Storage

  1. In the left panel of the Pipeline Studio page, using the Source drop-down menu, select Google Cloud Storage.

  2. Hover over the Cloud Storage node and click the Properties button that appears.

Cloud Storage properties button highlighted

  3. In the Reference name field, type GCS1.

  4. In the Path field, type ${input.path}. This macro controls what the Cloud Storage input path will be in the different pipeline runs.

  5. In the Format field, select text.

  6. In the right Output Schema panel, remove the offset field from the output schema by clicking the trash icon in the offset field row.

Offset delete button highlighted

  7. Click the X button to exit the Properties dialog box.

Task 6. Transform your data

  1. In the left panel of the Pipeline Studio page, using the Transform drop-down menu, select Wrangler.

  2. In the Pipeline Studio canvas, drag an arrow from the Cloud Storage node to the Wrangler node.

Cloud Storage node and Wrangler nodes

  3. Hover over the Wrangler node and click the Properties button that appears.

  4. In the Input field name, type body.

  5. In the Recipe field, type ${directives}. This macro controls what the transform logic will be in the different pipeline runs.

Directives wrangler

  6. Click the X button to exit the Properties dialog box.

Task 7. Write to Cloud Storage

  1. In the left panel of the Pipeline Studio page, using the Sink drop-down menu, select Cloud Storage.

  2. On the Pipeline Studio canvas, drag an arrow from the Wrangler node to the Cloud Storage node you just added.

Pipeline Studio canvas Wrangler and Cloud Storage nodes connected

  3. Hover over the Cloud Storage sink node and click the Properties button that appears.

  4. In the Reference name field, type GCS2.

  5. In the Path field, type the path of the Cloud Storage bucket you created earlier.

Path field name for Cloud Storage bucket

  6. In the Format field, select json.

  7. Click the X button to exit the Properties menu.

Task 8. Set the macro arguments

  1. In the left panel of the Pipeline Studio page, using the Conditions and Actions drop-down menu, select the Argument Setter plugin.

  2. In the Pipeline Studio canvas, drag an arrow from the Argument Setter node to the Cloud Storage source node.

Pipeline Studio canvas Argument Setter and Cloud Storage nodes connected

  3. Hover over the Argument Setter node and click the Properties button that appears.

  4. In the URL field, add the following:

https://storage.googleapis.com/reusable-pipeline-tutorial/args.json

Argument setter URL

The URL corresponds to a publicly accessible object in Cloud Storage that contains the following content:

{
  "arguments": [
    {
      "name": "input.path",
      "value": "gs://reusable-pipeline-tutorial/user-emails.txt"
    },
    {
      "name": "directives",
      "value": "send-to-error !dq:isEmail(body)"
    }
  ]
}

The first of the two arguments is the value for input.path. The path gs://reusable-pipeline-tutorial/user-emails.txt is a publicly accessible object in Cloud Storage that contains the following test data:

alice@example.com
bob@example.com
craig@invalid@example.com

The second argument is the value for directives. The value send-to-error !dq:isEmail(body) configures Wrangler to filter out any lines that are not a valid email address. For example, craig@invalid@example.com is filtered out.
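To see why the third address is rejected, here is a small Python sketch that parses the args.json payload and applies a rough stand-in for Wrangler's dq:isEmail check. The regex is only an illustration of the idea, not CDAP's actual validation logic:

```python
import json
import re

# The args.json content served by the Argument Setter URL (from the lab).
ARGS_JSON = """
{"arguments": [
  {"name": "input.path", "value": "gs://reusable-pipeline-tutorial/user-emails.txt"},
  {"name": "directives", "value": "send-to-error !dq:isEmail(body)"}
]}
"""
args = {a["name"]: a["value"] for a in json.loads(ARGS_JSON)["arguments"]}

# Rough stand-in for dq:isEmail: exactly one "@", no whitespace, dotted domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

lines = ["alice@example.com", "bob@example.com", "craig@invalid@example.com"]
valid = [line for line in lines if EMAIL_RE.match(line)]
errors = [line for line in lines if not EMAIL_RE.match(line)]

print(args["directives"])  # the recipe the ${directives} macro resolves to
print(valid)               # alice and bob pass the check
print(errors)              # craig@invalid@example.com goes to the error port
```

In the real pipeline, Wrangler evaluates the directive for every record of the input file and routes failing records to the error collector instead of the sink.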

  5. Click the X button to exit the Properties menu.

Task 9. Deploy and run your pipeline

  1. In the top bar of the Pipeline Studio page, click Name your pipeline. Give your pipeline a name (like Reusable-Pipeline), and then click Save.

Reusable pipeline name

  2. Click Deploy on the top right of the Pipeline Studio page. This deploys your pipeline.

  3. Once deployed, click the drop-down menu on the Run button. Notice the boxes for the input.path and directives arguments. These tell Cloud Data Fusion that the pipeline will receive values for these required arguments at runtime, provided through the Argument Setter plugin. Click Run.

  4. Wait for the pipeline run to complete and the status to change to Succeeded.

Reusable pipeline run success

Click Check my progress to verify the objective. Build and execute your pipeline

Congratulations!

In this lab, you have learned how to use the Argument Setter plugin to create a reusable pipeline, which can take in different input arguments with every run.

Take your next lab

Continue with Redacting Confidential Data within your Pipelines in Cloud Data Fusion.

Manual Last Updated June 14, 2025

Lab Last Tested June 14, 2025

Copyright 2026 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
