Storing and querying massive datasets can be time-consuming and expensive without the right infrastructure. BigQuery is a serverless, fully managed enterprise data warehouse that enables fast and cost-effective queries using the processing power of Google's infrastructure. In BigQuery, storage and compute resources are decoupled, which gives you the flexibility to store and query your data according to your organization's needs and requirements.
In this lab, you use the BigQuery query validator and the bq command-line tool to estimate the amount of data to be processed before running a query. You also use a SQL query and the API to determine resource usage after a query has run successfully.
Objectives
In this lab, you learn how to:
Use the BigQuery query validator to estimate the amount of data to be processed by a query.
Determine slot usage for executed queries using a SQL query and the API.
Complete a dry run of a query to estimate the amount of data to be processed by a query.
Setup and Requirements
Qwiklabs setup
For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.
Sign in to Qwiklabs using an incognito window.
Note the lab's access time (for example, 1:15:00), and make sure you can finish within that time.
There is no pause feature. You can restart if needed, but you have to start at the beginning.
When ready, click Start lab.
Note your lab credentials (Username and Password). You will use them to sign in to the Google Cloud Console.
Click Open Google Console.
Click Use another account and copy/paste credentials for this lab into the prompts.
If you use other credentials, you'll receive errors or incur charges.
Accept the terms and skip the recovery resource page.
How to start your lab and sign in to the Console
Click the Start Lab button. If you need to pay for the lab, a pop-up opens for you to select your payment method.
On the left is a panel populated with the temporary credentials that you must use for this lab.
Copy the username, and then click Open Google Console.
The lab spins up resources, and then opens another tab that shows the Choose an account page.
Note: Open the tabs in separate windows, side-by-side.
On the Choose an account page, click Use Another Account. The Sign in page opens.
Paste the username that you copied from the Connection Details panel. Then copy and paste the password.
Note: You must use the credentials from the Connection Details panel. Do not use your Google Cloud Skills Boost credentials. If you have your own Google Cloud account, do not use it for this lab, to avoid incurring charges.
Click through the subsequent pages:
Accept the terms and conditions.
Do not add recovery options or two-factor authentication (because this is a temporary account).
Do not sign up for free trials.
After a few moments, the Cloud console opens in this tab.
Note: You can view the menu with a list of Google Cloud Products and Services by clicking the Navigation menu at the top-left.
Activate Google Cloud Shell
Google Cloud Shell is a virtual machine that is loaded with development tools. It offers a persistent 5 GB home directory and runs on Google Cloud.
Google Cloud Shell provides command-line access to your Google Cloud resources.
In Cloud console, on the top right toolbar, click the Open Cloud Shell button.
Click Continue.
It takes a few moments to provision and connect to the environment. When you are connected, you are already authenticated, and the project is set to your PROJECT_ID.
gcloud is the command-line tool for Google Cloud. It comes pre-installed on Cloud Shell and supports tab-completion.
You can list the active account name with this command:
gcloud auth list
You can list the project ID with this command:
gcloud config list project
Example output:
[core]
project = qwiklabs-gcp-44776a13dea667a6
Note: Full documentation of gcloud is available in the gcloud CLI overview guide.
Task 1. Use the query validator to estimate the amount of data to be processed
When you enter a query in the Google Cloud console, the BigQuery query validator verifies the query syntax and provides an estimate of the number of bytes to be processed by the query.
In this task, you query a public dataset (New York Citi Bikes) maintained by the BigQuery public datasets program. Using this dataset, you learn how to use the query validator to validate a SQL query and to estimate the amount of data to be processed by a query before you run it.
In the Google Cloud console, in the Navigation menu, under Analytics, click BigQuery.
The Welcome to BigQuery in the Cloud Console message box opens. This message box provides a link to the quickstart guide and the release notes.
Click Done.
In the SQL Workspace toolbar, click the Editor tab to open the SQL query editor.
In the BigQuery query editor, paste the following query but do not run the query:
SELECT
COUNT(*)
FROM
`bigquery-public-data.new_york_citibike.citibike_trips`
WHERE
start_station_name LIKE '%Broadway%';
When executed, this query returns the number of rows in the citibike_trips table whose start_station_name column contains the text "Broadway".
In the query editor toolbar, notice the circular check icon, which activates the query validator and confirms that the query is valid.
BigQuery will automatically run the query validator when you add or modify the code in the query editor.
A green or red check mark displays above the query editor, depending on whether the query is valid or invalid. If the query is valid, the validator also displays the amount of data to be processed if you choose to run the query.
According to the query validator, this query will process 1.06 GB when run.
Click Run.
The query returns the number of records (5,414,611) that contain the text "Broadway" in the column named start_station_name.
Click Check my progress to verify the objective.
Estimate the amount of data processed by a query
Task 2. Determine slot usage using a SQL query
BigQuery uses slots (virtual CPUs) to execute SQL queries, and it automatically calculates how many slots each query requires, depending on query size and complexity. After you run a query in the Google Cloud console, you receive both the results and a summary of the amount of resources that were used to execute the query.
In this task, you identify the job ID of the query executed in the previous task and use it in a new SQL query to retrieve additional information about the query job.
In the Query results, click on the Job information tab.
Identify the line for the Job ID, and use the provided value to select the project ID and job ID.
For example, the value qwiklabs-gcp-01-5f4dee7a15a3:US.bquxjob_403a14df_185dd37737a begins with the project ID, followed by the location where the job was executed, and ends with the job ID. The :US. portion identifies the location where the job was executed.
The project ID is the first part, qwiklabs-gcp-01-5f4dee7a15a3 (before :US.), and the job ID is the last part, bquxjob_403a14df_185dd37737a (after :US.).
Note: You can copy the full value to a text editor or document to make it easier to select the project ID and the job ID.
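Splitting the fully qualified value into its parts can also be sketched in the shell with parameter expansion. The job reference below is the example value from this lab; substitute your own:

```shell
# Split a fully qualified BigQuery job reference of the form
# PROJECT:LOCATION.JOB_ID into its three parts.
FULL_ID='qwiklabs-gcp-01-5f4dee7a15a3:US.bquxjob_403a14df_185dd37737a'
PROJECT_ID="${FULL_ID%%:*}"        # everything before the first ':'
LOCATION_AND_JOB="${FULL_ID#*:}"   # everything after the first ':'
LOCATION="${LOCATION_AND_JOB%%.*}" # everything before the first '.'
JOB_ID="${LOCATION_AND_JOB#*.}"    # everything after the first '.'
echo "$PROJECT_ID / $LOCATION / $JOB_ID"
```

This avoids error-prone manual copying when the reference is long.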
In the query editor, copy and paste the following query, replacing 'YOUR_ID' with your job ID (such as 'bquxjob_403a14df_185dd37737a'):
SELECT
query,
reservation_id,
CONCAT('*****@',REGEXP_EXTRACT(user_email,r'@(.+)'))
AS user_email,
total_bytes_processed,
total_slot_ms,
job_stages
FROM
`region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
job_id = 'YOUR_ID';
When executed, this query returns the slot usage of the query job previously executed on the Citi Bikes public dataset.
Click Run.
The output of this query provides a table that shows the query stages and the associated slot usage for each stage.
Since an individual task in a query is executed by one slot, the sum of the values in the column named job_stages.completed_parallel_inputs gives the total number of tasks in the query.
However, after a slot has completed the first task assigned to it, it can be reassigned to complete another task, so the task count is only an upper bound on the number of slots used.
Understanding the total slot time used to run the query (the value in the column named total_slot_ms) is therefore also important. Slot time (in milliseconds, or ms) is reported for the entire query job and for each stage of the query, where it represents the amount of slot time used to complete that stage.
For example, a query may complete 150 tasks, but if each task executes quickly, slots are reused and the query may actually use fewer slots, such as 100 rather than 150.
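One related back-of-the-envelope figure is average slot concurrency: total slot time divided by the job's elapsed wall-clock time. The numbers below are made up for illustration, not taken from this lab's job:

```shell
# Estimate average concurrent slots for a job: total slot-milliseconds
# divided by elapsed wall-clock milliseconds (hypothetical values).
TOTAL_SLOT_MS=48000   # total_slot_ms from INFORMATION_SCHEMA.JOBS_BY_PROJECT
ELAPSED_MS=600        # job end time minus creation time, in milliseconds
AVG_SLOTS=$((TOTAL_SLOT_MS / ELAPSED_MS))
echo "Average concurrent slots: $AVG_SLOTS"   # prints 80 for these inputs
```

A job with high total slot time but short elapsed time ran at high concurrency; the same slot time spread over a long elapsed time means fewer slots were busy at once.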
Click Check my progress to verify the objective.
Determine slot usage using SQL query
Task 3. Determine slot usage using an API call
You can also retrieve information about a specific query job using the API. In BigQuery, you can use the API directly by making requests to the server, or you can use client libraries in your preferred language: C#, Go, Java, Node.js, PHP, Python, or Ruby.
In this task, you use the Google APIs Explorer to test the BigQuery API and retrieve the slot usage for the query that you ran in a previous task.
In a new Incognito browser tab, navigate to the BigQuery API page for the jobs.get method.
In the Try this method window, input your project ID and job ID that you identified in the previous task.
For example, qwiklabs-gcp-01-5f4dee7a15a3 for the project ID and bquxjob_403a14df_185dd37737a for the job ID.
Click Execute.
If asked to confirm your login, select the student username that you used to log in to Google Cloud for the previous tasks.
Review the API response for each stage and for the entire query job.
To see the value for completed parallel inputs for the first stage, scroll down to statistics > query > queryPlan > name: S00 > completedParallelInputs.
To see the total slot time (in milliseconds) used for the entire query job, scroll down to the end of the results and review the value for totalSlotMs.
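If you want to pull a single field out of a jobs.get response from the command line, a sketch like the following works. The response text here is a trimmed, hypothetical fragment for illustration, not the full API payload:

```shell
# Extract the totalSlotMs field from a (trimmed, hypothetical) jobs.get
# JSON response using sed.
RESPONSE='{"statistics": {"query": {"totalSlotMs": "4458"}}}'
TOTAL_SLOT_MS=$(echo "$RESPONSE" | sed -n 's/.*"totalSlotMs": "\([0-9]*\)".*/\1/p')
echo "Total slot ms: $TOTAL_SLOT_MS"
```

For real responses, a JSON-aware tool such as jq is more robust than pattern matching, but the idea is the same: the statistics you reviewed in the APIs Explorer are plain JSON fields you can script against.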
Task 4. Complete a dry run of a query to estimate the amount of data processed
In the bq command-line tool, you can use the --dry_run flag to estimate the number of bytes read by the query before you run the query. You can also use the dryRun parameter when submitting a query job using the API or client libraries. Dry runs of queries do not use query slots, and you are not charged for performing a dry run.
In this task, you learn how to complete a dry run of a query using the bq command-line tool in Cloud Shell.
In Cloud Shell, run the following command:
bq query \
--use_legacy_sql=false \
--dry_run \
'SELECT
COUNT(*)
FROM
`bigquery-public-data`.new_york_citibike.citibike_trips
WHERE
start_station_name LIKE "%Lexington%"'
The output shows the estimated number of bytes to be processed by the query, before you run the query to retrieve any results.
Query successfully validated. Assuming the tables are not modified, running this query will process 1135353688 bytes of data.
Now that you know how many bytes will be processed by the query, you have the information needed to decide on your next steps for your workflow.
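To relate the raw byte count to the GB figures shown in the console, you can convert it yourself. This sketch assumes the display unit is 2^30 bytes, which appears consistent with the rounded figures in this lab:

```shell
# Convert the dry-run byte estimate into GB (assuming 1 GB = 2^30 bytes).
BYTES=1135353688   # estimate reported by the dry run above
GB=$(awk -v b="$BYTES" 'BEGIN { printf "%.2f", b / (1024*1024*1024) }')
echo "Approx. GB processed: $GB"
```

Comparing this figure against your budget or expectations is a quick way to decide whether to run, rewrite, or narrow the query.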
Click Check my progress to verify the objective.
Complete a dry run of a query
End your lab
When you have completed your lab, click End Lab. Google Cloud Skills Boost removes the resources you’ve used and cleans the account for you.
You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.
The number of stars indicates the following:
1 star = Very dissatisfied
2 stars = Dissatisfied
3 stars = Neutral
4 stars = Satisfied
5 stars = Very satisfied
You can close the dialog box if you don't want to provide feedback.
For feedback, suggestions, or corrections, please use the Support tab.
Copyright 2022 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.