In BigQuery, you organize data into datasets and define the schema (or structure) of each table using column names and data types. A table's schema can affect query performance and cost, because it determines how quickly and efficiently BigQuery can access and process the table's data. BigQuery supports flexible schemas, and you can change a table's schema without rewriting its data.
The objective of this lab is to give Snowflake professionals the knowledge and skills they need to start designing and implementing effective BigQuery table schemas. Upon completion of this lab, Snowflake professionals will have a deeper understanding of how to design, optimize, and query table schemas in BigQuery.
In this lab, you create BigQuery datasets and tables to store data, create nested and repeated fields to maintain relationships in denormalized data, and create partitioned and clustered tables to optimize query performance.
Create and query nested and repeated fields in BigQuery.
Create and query partitioned tables in BigQuery.
Create and query clustered tables in BigQuery.
Setup and requirements
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
How to start your lab and sign in to the Console
Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method.
On the left is a panel populated with the temporary credentials that you must use for this lab.
Copy the username, and then click Open Google Console.
The lab spins up resources, and then opens another tab that shows the Choose an account page.
Note: Open the tabs in separate windows, side-by-side.
On the Choose an account page, click Use Another Account. The Sign in page opens.
Paste the username that you copied from the Connection Details panel. Then copy and paste the password.
Note: You must use the credentials from the Connection Details panel. Do not use your Google Cloud Skills Boost credentials. If you have your own Google Cloud account, do not use it for this lab, to avoid incurring charges to your account.
Click through the subsequent pages:
Accept the terms and conditions.
Do not add recovery options or two-factor authentication (because this is a temporary account).
Do not sign up for free trials.
After a few moments, the Cloud console opens in this tab.
Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.
Task 1. Create BigQuery dataset and tables
In BigQuery, you can use data definition language (DDL) to create datasets and tables. You can also use the SQL statement LOAD DATA to load data from one or more files into a new or existing table.
In this task, you use DDL to create a dataset and tables in BigQuery, and then load data into the new tables using the LOAD DATA statement.
In the Google Cloud console, in the Navigation menu, under Analytics, click BigQuery.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.
Click Done.
In the SQL Workspace toolbar, click the Editor icon to open the SQL query editor.
In the query editor, copy and paste the following query, and click Run:
CREATE SCHEMA IF NOT EXISTS
  ticket_sales
OPTIONS (location = "us");
This query creates a new BigQuery dataset named ticket_sales. Note that the DDL statement uses the term SCHEMA to refer to a logical collection of tables, views, and other resources, which in BigQuery is known as a dataset.
Expand the Explorer pane (on the left), which lists the dataset, and then click the dataset name ticket_sales to confirm that it was created successfully.
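A LOAD DATA statement with an explicit schema might look like the following sketch; the column list, data types, and file URI here are assumptions based on the sales columns queried later in this lab, not the exact statement used:

LOAD DATA INTO ticket_sales.sales (
  eventid INT64,
  saletime STRING,
  qtysold INT64,
  pricepaid NUMERIC,
  commission NUMERIC )
FROM FILES (
  skip_leading_rows = 1,
  format = 'CSV',
  field_delimiter = ',',
  max_bad_records = 10,
  uris = ['gs://tcd_repo/data/entertainment_media/ticket-sales/sales.csv']);

Listing the columns and types in the statement, rather than relying on schema auto-detection, gives you explicit control over the table's schema.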
This query uses an explicit table schema definition to load data into the sales table from a CSV file in Cloud Storage.
The Results pane displays a message that the LOAD statement ran successfully.
In the Explorer pane, click on the Details and Preview tabs to confirm the data was loaded into the sales table.
You can click Refresh (top right) to refresh the data in the Preview tab.
Click Check my progress to verify the objective.
Create BigQuery dataset and tables
You have now used SQL statements in BigQuery to create a dataset and a table, and load data into it. Practice creating another table and loading data for event information.
In the query editor, run the following query to create a new table named events:
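A minimal sketch of what the CREATE TABLE statement might look like; only the two columns referenced later in this lab are shown, and the actual table likely defines additional columns (the types are assumptions):

CREATE OR REPLACE TABLE ticket_sales.events (
  eventid INT64,
  eventname STRING );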
Review the Explorer pane, and confirm that there are now two tables: one named sales and one named events.
In the query editor, run the following query to load data into the events table:
LOAD DATA INTO ticket_sales.events
FROM FILES (
skip_leading_rows=1,
format = 'CSV',
field_delimiter = ',',
max_bad_records = 10,
uris =['gs://tcd_repo/data/entertainment_media/ticket-sales/events.csv']);
This query uses schema auto-detection to load data into the events table from a CSV file in Cloud Storage.
The Results pane displays a message that the LOAD statement ran successfully.
Click on the Details and Preview tabs to confirm the data was loaded into the events table.
You can click Refresh (top right) to refresh the data in the Preview tab.
Click Check my progress to verify the objective.
Load data into event table
Task 2. Create and query nested and repeated fields
Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized. In BigQuery, the recommended way to denormalize data is to use nested and repeated fields. You can use nested and repeated fields to maintain relationships in denormalized data, instead of completely flattening your data.
In this task, you learn how to create and query nested and repeated fields in BigQuery.
In the query editor, run the following query:
SELECT
e.eventid,
e.eventname,
s.saletime,
s.qtysold,
s.pricepaid,
s.commission
FROM
ticket_sales.events e
JOIN
ticket_sales.sales s
ON
e.eventid = s.eventid
ORDER BY
eventid, eventname;
Note: There is a one-to-many relationship between the events table and the sales table. When you run this query, you see repetition on the "one" side of the one-to-many relationship. For every sale, the event is repeated. To remove the repetition, you can aggregate the sales data into an array.
In the query editor, run the following query:
SELECT
e.eventid,
e.eventname,
ARRAY_AGG(STRUCT(
s.saletime,
s.qtysold,
s.pricepaid,
s.commission)) as sales
FROM
ticket_sales.events e
JOIN
ticket_sales.sales s
ON
e.eventid = s.eventid
GROUP BY
eventid, eventname
ORDER BY
eventid, eventname;
This query is similar to the one in step 1, but notice the addition of ARRAY_AGG, STRUCT, and the GROUP BY clause.
Review the query results.
Rather than repeating the data on the "one" side of the one-to-many relationship, the data on the "many" side is now in an array of structs.
You can also wrap the previous query in a CREATE TABLE statement to create a nested hierarchical table.
In the query editor, run the following query:
CREATE OR REPLACE TABLE ticket_sales.event_sales
as (
SELECT
e.eventid,
e.eventname,
ARRAY_AGG(STRUCT(
s.saletime,
s.qtysold,
s.pricepaid,
s.commission)) as sales
FROM
ticket_sales.events e
JOIN
ticket_sales.sales s
ON
e.eventid = s.eventid
GROUP BY
eventid, eventname
);
Click Go to Table, and examine the table schema.
The schema includes a nested and repeated field named sales, which contains the sale time, quantity sold, price paid, and commission for each event sale.
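Declared explicitly in DDL, a nested and repeated field is an ARRAY of STRUCTs. A sketch of an equivalent schema for this table (the table name and column types here are illustrative assumptions):

CREATE TABLE ticket_sales.event_sales_example (
  eventid INT64,
  eventname STRING,
  sales ARRAY<STRUCT<
    saletime STRING,
    qtysold INT64,
    pricepaid NUMERIC,
    commission NUMERIC >> );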
This new nested and repeated structure changes the way that you write queries.
To count the number of sales by event, run the following query:
SELECT
eventid,
eventname,
ARRAY_LENGTH(sales) AS sale_count
FROM
ticket_sales.event_sales
ORDER BY
sale_count DESC;
What if you wanted to see the top commissions per event? That would require querying inside the array. To do that, you need to unnest (or flatten) the array.
To unnest the array and identify the top two commissions per event, run the following:
SELECT
eventid,
eventname,
ARRAY((SELECT AS STRUCT saletime, commission FROM UNNEST(sales)
ORDER BY commission DESC LIMIT 2)) as top_2
FROM
ticket_sales.event_sales
ORDER BY
eventid;
The UNNEST operator flattens the sales array so that it can be queried, and the ARRAY function converts those results back into an array.
To learn more about using UNNEST to flatten arrays, review the documentation titled UNNEST operator.
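Another common pattern, not shown in this lab, is to flatten the array with a correlated join, which produces one row per array element; a sketch using the same table:

SELECT
  e.eventid,
  e.eventname,
  s.saletime,
  s.commission
FROM
  ticket_sales.event_sales e,
  UNNEST(e.sales) AS s
ORDER BY
  s.commission DESC;

The comma before UNNEST is shorthand for a CROSS JOIN against each row's own sales array.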
Click Check my progress to verify the objective.
Create and query nested and repeated fields
Task 3. Create and query partitioned tables
In BigQuery, one method of reducing the number of bytes processed by a query is to divide a large table into smaller segments called partitions, and then include a filter in your queries to select only data from the appropriate partition. This process is known as partition pruning and can be used to reduce query costs.
In this task, you learn how to create and query time-unit partitioned tables (on a DATETIME column) to minimize the number of bytes processed by queries.
In the query editor, paste the following code, but do not click Run:
SELECT * FROM ticket_sales.sales;
The BigQuery query validator provides an estimate of the number of bytes that will be processed before you run it. Note the estimated number for this query (15.12 MB).
Paste the following query, but do not click Run:
SELECT * FROM ticket_sales.sales
WHERE saletime = '12/14/2008 09:13:17';
Notice that the number of bytes processed is the same as the query in step 1 (15.12 MB), even though the new query is only asking for sales from a specific date.
To create a new sales table that is partitioned daily by the saletime column, run the following query:
CREATE OR REPLACE TABLE
ticket_sales.sales_partitioned_by_date
PARTITION BY
DATETIME_TRUNC(saletime, DAY)
AS (
SELECT
* except (saletime),
PARSE_DATETIME( "%m/%d/%Y %H:%M:%S", saletime) as saletime
FROM
ticket_sales.sales );
Click on the Details tab to confirm that the table is partitioned by DAY on the column named saletime.
Paste the following query, and notice the lower estimate of data to be processed (18.98 KB).
SELECT *
FROM ticket_sales.sales_partitioned_by_date
WHERE saletime = parse_datetime("%m/%d/%Y %H:%M:%S", '12/14/2008 09:13:17');
Click Run to retrieve the query results.
The query processes less data (18.98 KB) because it is running on the partitioned table. BigQuery is able to use partition pruning to process less data which can result in less cost and faster queries.
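Although not part of this lab's tasks, you can list the partitions that BigQuery created by querying the dataset's INFORMATION_SCHEMA.PARTITIONS view; a sketch:

SELECT
  partition_id,
  total_rows
FROM
  ticket_sales.INFORMATION_SCHEMA.PARTITIONS
WHERE
  table_name = 'sales_partitioned_by_date'
ORDER BY
  partition_id;

Each partition_id corresponds to one day of sales, because the table is partitioned by DATETIME_TRUNC(saletime, DAY).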
Click Check my progress to verify the objective.
Create and query partitioned tables
Task 4. Create and query clustered tables
Another method of optimizing query performance in BigQuery is to cluster values within a table to sort and group data into logical storage blocks. Queries that filter or aggregate by the clustered columns only scan the relevant blocks based on the clustered columns, instead of the entire table or table partition. This process is known as block pruning and can make joins, searches, grouping, and sorting faster.
In this task, you learn how to create and query clustered tables to optimize query performance.
In the query editor, run the following query:
SELECT
language,
COUNT(views) AS views
FROM
`cloud-training-demos.wikipedia_benchmark.Wiki1B`
GROUP BY
language
ORDER BY
views DESC;
This query uses one of Google's public datasets that contains a lot of data (in this case, 1 billion rows). The query counts views by language in a table of Wikipedia data.
In the Results pane, click on the Execution details tab.
Note the number of bytes shuffled (619.68 KB).
To create a new dataset for the Wikipedia data, run the following query:
CREATE SCHEMA IF NOT EXISTS
wiki_clustered OPTIONS(
location="us");
To create a table clustered on the language column, run the following query:
CREATE OR REPLACE TABLE
wiki_clustered.Wikipedia_by_language
CLUSTER BY
language
AS (
SELECT * FROM `cloud-training-demos.wikipedia_benchmark.Wiki1B`);
This command can take several minutes to run.
Review the Details tab for the table to confirm that it is clustered on the column named language.
To query the clustered table, run the following query:
SELECT
language,
COUNT(views) AS views
FROM
wiki_clustered.Wikipedia_by_language
GROUP BY
language
ORDER BY
views DESC;
In the Results pane, click on the Execution details tab.
Notice that the number of bytes shuffled is lower (47.98 KB) when the same query is run on the clustered table. Fewer bytes shuffled means faster execution time in BigQuery.
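Partitioning and clustering can also be combined on a single table, so that queries prune both partitions and blocks. A sketch using the sales data from Task 3 (the table name and the choice of eventid as the clustering column are illustrative assumptions, not part of this lab):

CREATE OR REPLACE TABLE
  ticket_sales.sales_partitioned_clustered
PARTITION BY
  DATETIME_TRUNC(saletime, DAY)
CLUSTER BY
  eventid
AS (
  SELECT * FROM ticket_sales.sales_partitioned_by_date );

A query that filters on both saletime and eventid could then benefit from partition pruning and block pruning at the same time.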
Click Check my progress to verify the objective.
Create and query clustered tables
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.