 
始める前に
- ラボでは、Google Cloud プロジェクトとリソースを一定の時間利用します
- ラボには時間制限があり、一時停止機能はありません。ラボを終了した場合は、最初からやり直す必要があります。
- 画面左上の [ラボを開始] をクリックして開始します
Check BigQuery datasets creation
/ 5
Check BigQuery tables creation
/ 5
Create a partioned table
/ 10
Check denormalized table creation
/ 10
Check cluster by order_id creation
/ 5
Check cluster by product_id creation
/ 5
Create a materialized view
/ 10
Create a table partition by order_ts
/ 5
BigQuery is Google's fully managed, NoOps, low cost analytics database. With BigQuery you can query terabytes and terabytes of data without having any infrastructure to manage or needing a database administrator. BigQuery uses SQL and can take advantage of the pay-as-you-go model. BigQuery allows you to focus on analyzing data to find meaningful insights.
In this lab, you will learn to improve your database from a performance and cost perspective.
In this lab, you will learn to:
Read these instructions. Labs are timed and you cannot pause them. The timer, which starts when you click Start Lab, shows how long Google Cloud resources will be made available to you.
This hands-on lab lets you do the lab activities yourself in a real cloud environment, not in a simulation or demo environment. It does so by giving you new, temporary credentials that you use to sign in and access Google Cloud for the duration of the lab.
To complete this lab, you need:
Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method. On the left is the Lab details pane, which is populated with the temporary credentials that are needed for this lab.
Copy the Password and then click Open Google Cloud console. The lab spins up resources, then opens another tab that shows the Sign in page.
On the Sign in page, verify that the username from the Lab details pane is auto-filled. Click Next.
Paste the password in the Enter your password field. Click Next.
Click through the subsequent pages:
Understand your account management.
Accept the terms and conditions.
After a few moments, the console opens.
Cloud Shell is a virtual machine (VM) that is loaded with development tools. It offers a persistent 5GB home directory and runs on the Google Cloud. Cloud Shell provides command-line access to your Google Cloud resources.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID. For example:
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
Output:
Example output:
Output:
Example output:
gcloud see the gcloud command-line tool overview.
Now you will add an Instacart data warehouse data files into your Google Cloud project. This export is publicly available on Kaggle. To avoid creating a Kaggle account, the csv files have been uploaded to a publicly accessible Google Cloud Storage bucket for use with this lab.
Click Authorize in the Authorize Cloud Shell dialog box that pops up.
Enter the following snippets into the command-line prompt to copy the Instacart data into your new bucket:
Check BigQuery Datasets creation
Check BigQuery tables creation
The output should be a table with contents similar to the output below:
Below are a few questions regarding the data that you loaded into BigQuery tables earlier. Write SQL queries to solve the following problem statements in the BigQuery Editor and answer the questions. Also, note the bytes billed from the job details page for each of those queries.
The entity relationship diagram may be helpful and is presented below:
In the Google Cloud Console, select Navigation menu > BigQuery:
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.
Click Done.
The BigQuery console opens.
Click on Compose new query.
Use this query to answer the question:
Note that the bytes billed was approximately 53 MB.
Use this query to answer the question:
The bytes billed was approximately 550 MB.
Use this query to answer the question:
The bytes billed was approximately 745 MB.
Use this query to answer the question:
The bytes billed was approximately 10 MB.
In this task, you will learn to optimize the cost measured in bytes billed for a query by partitioning a table. A partitioned table is divided into segments, called partitions, that make it easier to manage and query your data. By dividing a large table into smaller partitions, you can improve query performance and control costs by reducing the number of bytes read by a query. You partition tables by specifying a partition column which is used to segment the table.
If a query uses a qualifying filter on the value of the partitioning column, BigQuery can scan the partitions that match the filter and skip the remaining partitions. This process is called pruning.
Challenge A: Write a query to find which hour was the busiest in terms of order volume on August 1, 2022.
Your boss has written a query to solve this challenge.
In the SQL Workspace toolbar, click the + Compose new query icon to open the SQL code editor.
Copy and paste the following query into the BigQuery editor, then click Run.
This query outputs the busiest hour in terms of order volume for 1st August 2022.
Your boss noted that the cost of this query, measured in bytes billed, was 79 MB.
Your boss is unhappy with this query and has asked you to review its performance and optimize it.
Optimize this query using the techniques that you have learned.
One observation is in stage "S00: Input" you are doing a full table scan, "Records read: 3421083" equals the total number of rows in cymbal_bq_opt_1.orders_with_timestamps. For each record in this table, BigQuery slots are doing Compute to convert a Timestamp to a DATE and compare that to the value "2022-08-01". This is done 3.4 million times.
One way to avoid a full table scan, and to reduce the cost of the query, is to make use of table partitioning.
Check partioned table creation
As shown in the table below, partitioning greatly reduces bytes billed. This is driven by a massive reduction in records read in stage "S00: Input", achieved by avoiding a full table scan.
| Challenge A (Before) | Challenge A (After) | Improvement | |
|---|---|---|---|
| Bytes billed | 79 MB | 10 MB | 87% reduction | 
| No of records read in "S00: Input" | 3,421,083 | 114,036 | 96% reduction | 
Challenge B: Your boss has written the following query to find the most popular product sold on August 1, 2022.
This query outputs the most popular product sold on August 1, 2022.
Your boss sees that the cost of this query, measured in bytes billed, was 550 MB.
Your boss is unhappy with this query and has asked you to review its performance and optimize it.
Optimize this query using the techniques that you have learned.
From the job information page, you see that the bytes billed is 550 MB.
One observation is that the most expensive stage is "S03: Join+" as slots are spending time on Compute to join the three input tables. The dataset seems to be in a normalized form. There is a many-to-many relationship between orders and products. One order has many products, and one product may be purchased across many orders. To deal with this, a bridging table called order_products is introduced. This way, there is a 1-to-many relationship on either side of the bridging table. One order has many order_products and one product has many order_products. Because of this normalized structure, three tables need to be joined. This join is very expensive.
For this analytics use case, this normalized Online Transaction Processing (OLTP) approach is not necessary. By de-normalzining the structure of the dataset, you can greatly improve analytical queries.
Check denormalized table creation
As per the table below, denormalization has a big impact on the cost of the query (measured in bytes billed). This is because, through denormalization, you avoid the three table join that was seen in the original query.
| Challenge B (Before) | Challenge 2B (After) | Improvement | |
|---|---|---|---|
| Bytes billed | 550 MB | 29 MB | 94% reduction | 
Clustered tables in BigQuery are tables that have a user-defined column sort order using clustered columns. Clustered tables can improve query performance and reduce query costs.
In BigQuery, a clustered column is a user-defined table property that sorts storage blocks based on the values in the clustered columns. The storage blocks are adaptively sized based on the size of the table. A clustered table maintains the sort properties in the context of each operation that modifies it. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns instead of the entire table or table partition. As a result, BigQuery might not be able to accurately estimate the bytes to be processed by the query or the query costs, but it attempts to reduce the total bytes at execution.
Challenge C: The Fraud Team has developed a new ML model, and one of its inputs is the number of products per order. The ML model has predicted that order 1564244 is suspicious. Write a query to report which products were purchased in that order. Report on the sequence in which they were added to the basket.
Write a query to find which product was added to the basket first in this suspicious order.
This query finds which product was added to the basket first in this suspicious order.
The Fraud Team noticed that the cost of this query, measured in bytes billed, was 497 MB. They are unhappy with this query and have asked you to review its performance and optimize it.
Optimize this query using the techniques that you have learned.
To test your skills! Don't peek at the hints until you've had try!
From the job information page, ypu see that the bytes billed is 497 MB.
Examine the execution details tab to see how this could be reduced:
<project_id>.cymbal_bq_opt_1.products.<project_id>.cymbal_bq_opt_1.order_products to compare each row's order id with input 1564244 Joining rows from #1 with project_id.cymbal_bq_opt_1.products on product_id.Clustering of column order_id can help to optimize Stage "S02: Output" as it'll sort the rows in the table and make it easier to look up the rows with order_id = 1564244.
Clustering of column product_id can help optimize this stage as well by improving join performance.
order_id.Check cluster by order_id creation
product_id.Check cluster by product_id creation
As shown in the table below, using clustering greatly reduces bytes billed.
| Challenge C (Before) | Challenge 2C (After) | Improvement | |
|---|---|---|---|
| Bytes billed | 497 MB | 20 MB | 95% reduction | 
In this task you will create materialized view. In BigQuery, materialized views are precomputed views that periodically cache the results of a query for increased performance and efficiency. BigQuery leverages precomputed results from materialized views and whenever possible reads only delta changes from the base tables to compute up-to-date results. Materialized views can be queried directly or can be used by the BigQuery optimizer to process queries to the base tables.
Challenge D: It's the end of the month and your boss wants a chart showing order volume by date and hour for August. Write a SQL query to calculate the total order volume by date and hour. On which date and hour was the most number of orders in August (e.g., 2022-08-11 13H)?
Your boss was really happy with your hourly order volume table. She discovered, using her amazing BigQuery skills, that this was the query you wrote to create this table.
She was so happy that she added your query to her dashboard. She made a time series chart of hourly sales in August, powered by your query and shared the dashboard with the CEO!
Excited by the email from your boss, the CEO tries to open the dashboard. He is furious; it is very slow to load. The CEO vents his anger by replying to your boss' email. Unfortunately, your boss has just gone on vacation. Her out-of-office auto reply mentions you as point-of-contact for the dashboard. To make matters worse, you do not have edit access to the dashboard, so you cannot edit the query!
Make the query run faster for the CEO and improve dashboard loading speed and cost.
The original loading speed, measured in query Elapsed Time, was around 1,000 ms. What is your new and improved speed?
The original cost, measured in bytes billed, was around 27MB. What is your new and reduced cost?
The answer here is to use a materialized view.
In particular you need to use smart-tuning to make the hidden read-only query in the dashboard use of your new materialized view.
As per stage "S00: Input" you're doing a full table scan of cymbal_bq_opt_1.orders_with_timestamps which is likely the reason for the slow performance and high cost.
You need to create a materialized view with the same logic in the same dataset cymbal_bq_opt_1 .
Check materialized view creation
As per stage "S00: Input", you are no longer reading from the base table, but are now reading from the materialized view (via the "smart tuning" feature).
The overall impact of using a materialized view is that costs are reduced and query speed improves.
| Challenge D (Before) | Challenge D (After) | Improvement | |
|---|---|---|---|
| Bytes billed | 27 MB | 10 MB | 62% reduction | 
| Elapsed Time | 1,000 ms | 389 ms | 61% reduction | 
The CEO is impressed and your boss gives you a raise!
Challenge E: Your boss has asked for a report for the order IDs on the day with the most customers. Your colleague has written the following SQL to pull the report.
Your boss has told your colleague that the query is too expensive, with bytes billed = 79 MB.
As the resident BigQuery expert, your colleague comes to you for help and advice to reduce this cost.
The original cost, measured in bytes billed, was around 79MB. What is your new and reduced cost?
The Execution Details page shows that you are doing a full table scan. Stage "S00: Input" is reading all 3.4 million rows of the table. To help avoid this, partition the table.
Check table partition by order_ts creation
The Execution Details page shows that you are still doing a full table scan. The subquery is not pruning partitions, but the query is executing faster than before (from 2 sec to 1 sec). The number of bytes billed is not reduced from the original query (seen in the Job Information tab).
By inspecting the results table of query 2, the target date was found to be "2022-08-17". What happens if you do not write a subquery and instead just provide this date directly as per the query below:
Now you are no longer doing a full table scan. You are making use of the partitioning and pruning the partition with the filter on the provided date string. This is seen in stage "S00: Input" where you are no longer reading the full 3.2M rows of the table. You have again reduced the elapsed time (2 sec reduced to 1 sec). As per the job information page, the bytes billed have been reduced from 79MB to 10MB.
| query | bytes billed | bytes saved | elapsed time | time reduction | 
|---|---|---|---|---|
| original query | 79 MB | - | 2 seconds | - | 
| query 2 | 79 MB | 0% | 1 second | 50% | 
| query 3 | 10 MB | 87% | 1 second | 50% | 
As per the table above, adding partitioning alone reduces the elapsed time by 50%, but does not reduce the bytes billed. This is because of the way BigQuery handles pruning of partitions through subqueries. This also occurs when pruning partitions through joins, but they were not explored in this challenge.
For customers using on-demand pricing, this is important to remember. Adding partitioning alone does not reduce the financial costs if a query is using subqueries or joins to prune partitions.
By working through several scenarios you've learned how to reduce the bytes billed by through partitioning and pruning the partiion with filters; optimize a query using denormalization; and improved row filtering and join performance with clustering. You have improved dashboard loading speed and cost by using the materialized view.
...helps you make the most of Google Cloud technologies. Our classes include technical skills and best practices to help you get up to speed quickly and continue your learning journey. We offer fundamental to advanced level training, with on-demand, live, and virtual options to suit your busy schedule. Certifications help you validate and prove your skill and expertise in Google Cloud technologies.
Manual Last Updated November 01, 2023
Lab Last Tested November 01, 2023
Copyright 2024 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.
 
 
 
 
このコンテンツは現在ご利用いただけません
利用可能になりましたら、メールでお知らせいたします
 
ありがとうございます。
利用可能になりましたら、メールでご連絡いたします
 
 
1 回に 1 つのラボ
既存のラボをすべて終了して、このラボを開始することを確認してください
