Streaming Data Processing: Streaming Data Pipelines into Bigtable

2 hours · 5 credits · Introductory

Overview

In this lab, you will use Dataflow to collect traffic events from simulated traffic sensor data made available through Google Cloud Pub/Sub and write them into a Bigtable table.

Note: At the time of this writing, streaming pipelines are not available in the Dataflow Python SDK, so the streaming labs are written in Java.

Objectives

In this lab, you will perform the following tasks:

  • Launch a Dataflow pipeline to read from Pub/Sub and write into Bigtable.
  • Open an HBase shell to query the Bigtable database.

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Sign in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
    There is no pause feature. You can restart if needed, but you have to start at the beginning.

  3. When ready, click Start lab.

  4. Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.

  5. Click Open Google Console.

  6. Click Use another account and copy/paste credentials for this lab into the prompts.
    If you use other credentials, you'll receive errors or incur charges.

  7. Accept the terms and skip the recovery resource page.

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (Navigation menu icon), select IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Cloud Overview > Dashboard.

Compute Engine default service account name and editor status highlighted on the Permissions tabbed page

Note: If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.
  1. In the Google Cloud console, on the Navigation menu, click Cloud Overview > Dashboard.
  2. Copy the project number (e.g. 729328892908).
  3. On the Navigation menu, select IAM & Admin > IAM.
  4. At the top of the roles table, below View by Principals, click Grant Access.
  5. For New principals, type:
{project-number}-compute@developer.gserviceaccount.com
  6. Replace {project-number} with your project number.
  7. For Role, select Project (or Basic) > Editor.
  8. Click Save.

Task 1. Preparation

You will be running a sensor simulator from the training VM. There are several files and some setup of the environment required.

Open the SSH terminal and connect to the training VM

  1. In the Console, on the Navigation menu ( Navigation menu icon), click Compute Engine > VM instances.

  2. Locate the line with the instance called training-vm.

  3. On the far right, under the Connect column, click on SSH to open a terminal window. Then click Connect.

In this lab, you will enter CLI commands on the training-vm.

Verify initialization is complete

  • The training-vm is installing some software in the background. Verify that setup is complete by checking the contents of the new directory:
ls /training

The setup is complete when the result of your ls command matches the listing below. If the full listing does not appear, wait a few minutes and try again.

Note: It may take 2 to 3 minutes for all background actions to complete.

student-04-2324ale56789@training-vm:~$ ls /training
bq-magic.sh  project_env.sh  sensor_magic.sh
student-04-2324ale56789@training-vm:~$

Download code repository

  • Next, you will download a code repository for use in this lab:
git clone https://github.com/GoogleCloudPlatform/training-data-analyst

Set environment variables

  • On the training-vm SSH terminal, enter the following:
source /training/project_env.sh

This script sets the $DEVSHELL_PROJECT_ID and $BUCKET environment variables.

Prepare HBase quickstart files

  • In the training-vm SSH terminal, run the script to download and unzip the quickstart files (you will use these later to run the HBase shell):
cd ~/training-data-analyst/courses/streaming/process/sandiego
./install_quickstart.sh

Click Check my progress to verify the objective. Copy sample files to the training_vm home directory

Task 2. Simulate traffic sensor data into Pub/Sub

  • In the training-vm SSH terminal, start the sensor simulator. The script reads sample data from a CSV file and publishes it to Pub/Sub:
/training/sensor_magic.sh

This command will send 1 hour of data in 1 minute. Let the script continue to run in the current terminal.
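
For orientation, the sketch below shows in Java (the language used for the pipelines in this lab) how a publisher can send one CSV line per Pub/Sub message. It is not the simulator script itself; the topic name sandiego, the file name sensor_sample.csv, and the use of the Java client library are illustrative assumptions.

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

public class SensorPublisherSketch {
  public static void main(String[] args) throws Exception {
    // Topic name is an assumption; the simulator publishes to a topic in your lab project.
    TopicName topic = TopicName.of(System.getenv("DEVSHELL_PROJECT_ID"), "sandiego");
    Publisher publisher = Publisher.newBuilder(topic).build();
    try {
      // Each line of the sample CSV (hypothetical file name) becomes one Pub/Sub message.
      for (String line : Files.readAllLines(Paths.get("sensor_sample.csv"))) {
        PubsubMessage message = PubsubMessage.newBuilder()
            .setData(ByteString.copyFromUtf8(line))
            .build();
        publisher.publish(message); // returns an ApiFuture with the server-assigned message ID
      }
    } finally {
      publisher.shutdown();                            // stop accepting new messages
      publisher.awaitTermination(1, TimeUnit.MINUTES); // flush pending publishes
    }
  }
}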

Open a second SSH terminal and connect to the training VM

  1. In the upper right corner of the training-vm SSH terminal, click on the gear-shaped button (Settings icon), and select New Connection from the drop-down menu. A new terminal window will open.

The new terminal session will not have the required environment variables. Complete the next step to set these variables.

  2. In the new training-vm SSH terminal, enter the following:
source /training/project_env.sh

Click Check my progress to verify the objective. Simulate traffic sensor data into Pub/Sub

Task 3. Launch Dataflow pipeline

  1. To ensure that the proper APIs and permissions are set, execute the following commands in the Cloud Shell:
gcloud services disable dataflow.googleapis.com --force
gcloud services enable dataflow.googleapis.com
  2. In the second training-vm SSH terminal, navigate to the directory for this lab and examine the script in Cloud Shell or using nano. Do not make any changes to the code:
cd ~/training-data-analyst/courses/streaming/process/sandiego
nano run_oncloud.sh

What does the script do?

The script takes three required arguments (project id, bucket name, and classname) and an optional fourth argument (options). In this part of the lab, you will use the --bigtable option, which directs the pipeline to write into Cloud Bigtable.
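
As a hedged sketch only (not the lab's CurrentConditions pipeline; the topic name, instance name, CSV field layout, and row-key format below are assumptions), a Beam pipeline in Java that reads from Pub/Sub and writes to Bigtable has roughly this shape:

import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.IterableCoder;
import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.extensions.protobuf.ByteStringCoder;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.apache.beam.sdk.values.TypeDescriptors;

import java.util.Collections;

public class PubSubToBigtableSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Read raw sensor messages (one CSV line per message) from Pub/Sub.
    PCollection<String> messages = p.apply("ReadFromPubSub",
        PubsubIO.readStrings().fromTopic("projects/YOUR_PROJECT/topics/sandiego"));

    // Convert each message into a Bigtable row key plus a SetCell mutation.
    PCollection<KV<ByteString, Iterable<Mutation>>> mutations = messages
        .apply("ToBigtableMutations",
            MapElements.into(TypeDescriptors.kvs(
                    TypeDescriptor.of(ByteString.class),
                    TypeDescriptors.iterables(TypeDescriptor.of(Mutation.class))))
                .via((String line) -> {
                  // Assumed CSV layout: timestamp,lat,lon,highway,direction,lane,speed
                  String[] f = line.split(",");
                  // Row key such as "15#S#<timestamp>": readings grouped by highway and direction.
                  ByteString rowKey = ByteString.copyFromUtf8(f[3] + "#" + f[4] + "#" + f[0]);
                  Mutation cell = Mutation.newBuilder()
                      .setSetCell(Mutation.SetCell.newBuilder()
                          .setFamilyName("lane")
                          .setColumnQualifier(ByteString.copyFromUtf8("speed"))
                          .setValue(ByteString.copyFromUtf8(f[6])))
                      .build();
                  Iterable<Mutation> cells = Collections.singletonList(cell);
                  return KV.of(rowKey, cells);
                }))
        .setCoder(KvCoder.of(ByteStringCoder.of(),
            IterableCoder.of(ProtoCoder.of(Mutation.class))));

    // Apply the mutations to the Bigtable table.
    mutations.apply("WriteToBigtable",
        BigtableIO.write()
            .withProjectId("YOUR_PROJECT")      // your lab project ID
            .withInstanceId("sandiego")         // assumed Bigtable instance name
            .withTableId("current_conditions"));

    p.run();
  }
}

The essential pattern is the same as in the lab's code: each incoming message is turned into a row key plus one or more cell mutations, and a Bigtable write transform at the end of the pipeline applies them; that final transform corresponds to the write:cbt step you will inspect in the Dataflow graph later.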

  3. Press CTRL+X to exit.

  4. Run the following commands to create the Bigtable instance (a Java sketch of the equivalent table setup appears below, after the example build output):

cd ~/training-data-analyst/courses/streaming/process/sandiego
export ZONE={{{project_0.startup_script.gcp_zone|Lab GCP Zone}}}
./create_cbt.sh
  5. Run the following commands for the Dataflow pipeline to read from Pub/Sub and write into Cloud Bigtable:
cd ~/training-data-analyst/courses/streaming/process/sandiego
export REGION={{{project_0.startup_script.gcp_region|Lab GCP Region}}}
./run_oncloud.sh $DEVSHELL_PROJECT_ID $BUCKET CurrentConditions --bigtable

Example successful run:

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 47.582 s
[INFO] Finished at: 2018-06-08T21:25:32+00:00
[INFO] Final Memory: 58M/213M
[INFO] ------------------------------------------------------------------------
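
For reference, as noted in step 4, here is a hedged Java sketch of the kind of setup create_cbt.sh performs: creating the current_conditions table with a lane column family. The script itself uses command-line tooling rather than this code, and the instance name sandiego is an assumption; the table name and column family match the queries you will run later in this lab.

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.BigtableTableAdminSettings;
import com.google.cloud.bigtable.admin.v2.models.CreateTableRequest;

public class CreateCurrentConditionsTable {
  public static void main(String[] args) throws Exception {
    BigtableTableAdminSettings settings = BigtableTableAdminSettings.newBuilder()
        .setProjectId(args[0])        // e.g. $DEVSHELL_PROJECT_ID
        .setInstanceId("sandiego")    // assumed instance name
        .build();

    try (BigtableTableAdminClient admin = BigtableTableAdminClient.create(settings)) {
      // Create the table with a single column family, "lane",
      // into which the pipeline writes cells such as lane:speed.
      if (!admin.exists("current_conditions")) {
        admin.createTable(CreateTableRequest.of("current_conditions").addFamily("lane"));
      }
    }
  }
}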

Click Check my progress to verify the objective. Launch Dataflow Pipeline

Task 4. Explore the pipeline

  1. Return to the browser tab for Console. On the Navigation menu ( Navigation menu icon), click Dataflow and click on the new pipeline job. Confirm that the pipeline job is listed and verify that it is running without errors.

  2. Find the write:cbt step in the pipeline graph, and click the down arrow on its right to see the writer in action. Click the writer and review the Bigtable Options in the Step summary.

Task 5. Query Bigtable data

  1. In the second training-vm SSH terminal, run the quickstart.sh script to launch the HBase shell:
cd ~/training-data-analyst/courses/streaming/process/sandiego/quickstart
./quickstart.sh
  2. When the script completes, you will be at an HBase shell prompt that looks like this:
hbase(main):001:0>
  3. At the HBase shell prompt, type the following query to retrieve 2 rows from your Bigtable table, which was populated by the pipeline. It may take a few minutes for results to return via the HBase query.

Repeat the 'scan' command until you see a list of rows returned:

scan 'current_conditions', {'LIMIT' => 2}
  4. Review the output. Notice that each row is broken into column, timestamp, value combinations.

  5. Run another query. This time look only at the lane:speed column, limit the scan to 10 rows, and specify rowid patterns for the start and end rows to scan over:

scan 'current_conditions', {'LIMIT' => 10, STARTROW => '15#S#1', ENDROW => '15#S#999', COLUMN => 'lane:speed'}
  6. Review the output. Notice that you see 10 of the column, timestamp, value combinations, all of which correspond to Highway 15. Also notice that the column is restricted to lane:speed. (The same scan, expressed with the Bigtable HBase client for Java, is sketched after these steps.)

  7. Feel free to run other queries if you are familiar with the syntax. Once you're satisfied, enter quit to exit the shell:

quit
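
If you prefer to query from code, here is a minimal Java sketch of the second scan above, assuming the HBase 2.x-based Bigtable client (bigtable-hbase) is on the classpath. The instance name sandiego is an assumption; the table name, row-key range, and lane:speed column match the shell query.

import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanCurrentConditions {
  public static void main(String[] args) throws Exception {
    String projectId = args[0];        // e.g. $DEVSHELL_PROJECT_ID
    String instanceId = "sandiego";    // assumed instance name

    try (Connection connection = BigtableConfiguration.connect(projectId, instanceId);
         Table table = connection.getTable(TableName.valueOf("current_conditions"))) {

      // Same bounds as the shell query: rows for Highway 15, lane:speed only, 10 rows max.
      Scan scan = new Scan()
          .withStartRow(Bytes.toBytes("15#S#1"))
          .withStopRow(Bytes.toBytes("15#S#999"))
          .addColumn(Bytes.toBytes("lane"), Bytes.toBytes("speed"))
          .setLimit(10);

      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result row : scanner) {
          String key = Bytes.toString(row.getRow());
          String speed = Bytes.toString(row.getValue(Bytes.toBytes("lane"), Bytes.toBytes("speed")));
          System.out.println(key + " -> lane:speed = " + speed);
        }
      }
    }
  }
}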

Task 6. Cleanup

  1. In the second training-vm SSH terminal, run the following script to delete your Bigtable instance:
cd ~/training-data-analyst/courses/streaming/process/sandiego
./delete_cbt.sh

If prompted to confirm, enter Y.

  2. On your Dataflow page in your Cloud Console, click on the pipeline job name.

  3. Click Stop on the top menu bar. Select Cancel, and then click Stop Job.

  4. Go back to the first SSH terminal with the publisher, and enter Ctrl+C to stop it.

  5. In the BigQuery console, click on the three dots next to the demos dataset, and click Delete.

  6. Type delete and then click Delete.

End your lab

When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
