How Data Scientists Can Increase Their Productivity With the Aid of Data Engineers' Solutions Using BigQuery, Google Colab and Python

Flávio Brito
12 min read · Aug 17, 2021
Photo by Rob Wingate on Unsplash

This article brings a set of Python solutions used by Data Engineers that will increase the productivity of Data Scientists who need to use Google BigQuery in their daily operations and just want things to work!

Before we start, I would like to talk about an amazing tool that is helping many Data Engineers and Data Scientists in their daily projects.

Have you heard about Google Colab? No? It is a spectacular online notebook-based platform that provides you with a free cloud service environment to develop powerful solutions using Python or R, and it gives you superpowers to deal with your data in BigQuery and Google Cloud Storage. You can also connect AWS S3 buckets to it, for example.

Grab a cup of coffee and let's write some cool Python code.

Set up your Google Colab Notebook

Considering that you already have a Google account, go to

https://drive.google.com

There you can create a folder (click on + New, then Folder) to keep your notebooks. To create a notebook inside that folder in your Google Drive, click on + New > More > Google Colaboratory, as shown in the following image.

Following the numbers, you can see:

  1. Code — adds a new code cell to the notebook, and Text adds a text cell.
  2. Notebook cell, with a play button at the beginning to run the code.
  3. RAM and disk space. The free version of Google Colab starts you off with 12 GB of RAM to process your code. Nice! You can continue working on your local machine and let Google Colab do the hard work!
  4. Notebook cell tools — arrows (up/down) to move each cell up and down, link — a URL linking to the cell content, gear — cell settings, mirror — copy the cell, trash — delete the cell.
  5. Folders and files — the place to create folders and store files to be used by the notebook.

In the free option you can import/export data from Google Drive, Google Cloud Storage, and BigQuery, and also use a GPU!

A GPU can be used to improve code performance.

Colab provides a GPU and it's totally free!

Go to the main menu and click on Runtime / Change Runtime Type. By default, no hardware accelerator is provided.

Do you need more power? Google also offers a Google Colab Pro version.

Python Packages

Many packages are pre-installed, and you can install others by running:

!pip install package_name

or, if you need to forcefully remove and reinstall the package at a specific version, you can do:

!pip install 'package_name==1.2.5' --force-reinstall

where package_name is the name of the package you need, 1.2.5 is a specific version, and --force-reinstall is the option that forces the package to be reinstalled if it is already installed.

To use BigQuery from Python code it's necessary to import bigquery from google.cloud. When running code from Google Colab there is no need to install it, but it will probably be necessary elsewhere. A good practice is to install from a requirements.txt file, defining the package name or package_name==version_number.

requirements.txt

google-cloud-bigquery==2.24.0
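Once the package is available (in Colab it comes pre-installed), the import looks like this:

from google.cloud import bigquery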

Before installing any packages, list the Python packages already offered by Google Colab by running:
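One standard option is:

!pip list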

Troubleshooting

If you have issues trying to install Python packages, check the Python version provided by Google Colab by running a command in a notebook cell.
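One standard option is:

!python --version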

Google Colab Authentication

Running the imports previously described, the Google Colab authentication process will be invoked via the auth method from google.colab. It will prompt a message giving you an authentication URL and asking you to return a verification code generated by the authorization session.
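A minimal sketch of that step (the same calls appear again in the Google Sheets section later):

from google.colab import auth

# Invokes the authentication flow and prints the authorization URL
auth.authenticate_user()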

Click on the link and copy the verification code from the authorization page.

This process will be necessary whenever you connect to a new runtime.

Copy the code and paste it into the "Enter verification code" box.

Now that your credentials are valid, let's do our job.

HOW TO …

List Tables from a Dataset using BigQuery CLI

Google Colab gives you a set of pre-installed tools to work with BigQuery. The most important is the Python command-line tool for interacting with it — bq.
You can call it by running:

!bq [--global_flags] <command> [--command_flags] [args]

To check all the commands, run in a Google Colab cell:

!bq --help

To check all flags run:

!bq --helpful

To list all tables in a BigQuery dataset you can run the bq client command with three parameters, as in the example sketched after this list:

  • ls: for list
  • project: data-analytics-live
  • dataset_id: silver_raw
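A minimal sketch of the command, using the project:dataset form:

!bq ls data-analytics-live:silver_raw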

Use BigQuery via magics

The `google.cloud.bigquery` library also includes a magic command which runs a query and either displays the result or saves it to a variable as a `DataFrame`.

%%bigquery

The magic command (%%bigquery) lets the notebook know you are about to load BigQuery data.

%%bigquery --project datasyncbq
SELECT table_schema, table_name
FROM datasyncbq.datasync.INFORMATION_SCHEMA.TABLES
WHERE table_type='VIEW';

The code above will return the table schema and table name attributes from BigQuery information schema tables where the table type is a view from the project datasyncbq.

Add a DataFrame name to export the query outcome:

%%bigquery --project datasyncbq df
SELECT table_schema, table_name
FROM datasyncbq.datasync.INFORMATION_SCHEMA.TABLES
WHERE table_type='VIEW';

List Tables from BigQuery Dataset using Python

The BigQuery Python library has many features; to access it, you need to import bigquery from the google.cloud package.

To include the project name in the output, use the table.project attribute in the format string:

for table in tables:
    print("{}.{}.{}".format(table.project, table.dataset_id, table.table_id))
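The tables iterator above comes from the client's list_tables call. A minimal sketch, assuming the project and dataset from the CLI example:

from google.cloud import bigquery

# Assumed project and dataset names, following the earlier CLI example
client = bigquery.Client(project="data-analytics-live")
tables = client.list_tables("data-analytics-live.silver_raw")

for table in tables:
    print("{}.{}.{}".format(table.project, table.dataset_id, table.table_id))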

BigQuery Metadata — List All Tables From BigQuery

BigQuery contains these views for table metadata:

  • TABLES and TABLE_OPTIONS — metadata about tables and views.
  • COLUMNS and COLUMN_FIELD_PATHS — metadata about columns and fields.
  • PARTITIONS — metadata about table partitions.

SELECT * FROM PROJECT_ID.DATASET_ID.INFORMATION_SCHEMA.TABLES

The image above shows part of the command output. In this case, I would like to get the table name, the column name, whether it is nullable, the data type of the field, and whether it is used as a partitioning field, for a specific table (data_sync_data).
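A hedged sketch of such a query against INFORMATION_SCHEMA.COLUMNS (the dataset and table names follow the earlier examples):

SELECT table_name, column_name, is_nullable, data_type, is_partitioning_column
FROM datasyncbq.datasync.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = 'data_sync_data';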

So far so good, you learned in this first part:

  • How to create a Google Colab notebook,
  • How to enable GPU power,
  • How to connect Google Colab and BigQuery using Python code and the magic command to process SQL queries,
  • How to retrieve metadata from BigQuery

Read BigQuery Data Using Pandas

The pandas-gbq library covers basic functionality, such as writing a DataFrame to BigQuery and running a query, but as a third-party library it may not handle all BigQuery features or use cases.

In the example below, we call the read_gbq function, passing a query that uses public data from bigquery-public-data.

The table contains the number of applicants for a Social Security card by year of birth and sex. The number of such applicants is restricted to U.S. births where the year of birth, sex, State of birth (50 States and District of Columbia) are known, and where the given name is at least 2 characters long. source: http://www.ssa.gov/OACT/babynames/limits.html

We defined the dialect parameter as 'standard', which uses BigQuery's standard SQL (compliant with the SQL 2011 standard), rather than 'legacy', which uses BigQuery's legacy SQL dialect. Another parameter we pass is project_id, which comes from a variable of the same name. The outcome is the sum of occurrences of each name in the state 'TX' (Texas), grouped by name and ordered in descending order by the count value.
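A minimal sketch of that call, assuming the bigquery-public-data.usa_names.usa_1910_current public table and a placeholder project_id:

import pandas_gbq

project_id = "your-project-id"  # placeholder: replace with your GCP project

query = """
SELECT name, SUM(number) AS count
FROM `bigquery-public-data.usa_names.usa_1910_current`
WHERE state = 'TX'
GROUP BY name
ORDER BY count DESC
"""

df = pandas_gbq.read_gbq(query, project_id=project_id, dialect='standard')
df.head()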

Convert DataFrame to JSON

Exporting a DataFrame to a JSON string or file can be done using a special function, to_json.

Here we created a pandas DataFrame based on the SQL query outcome. The bq_credentials came from:

bq_credentials = service_account.Credentials.from_service_account_info(
    { JSON - Credentials }
)

See more in the Credentials section of this tutorial.

The pandas DataFrame has a function, to_json(orient="records"), that converts a DataFrame object to a JSON string, using the orient parameter to indicate the expected JSON string format. In this case:

records : list like [{column -> value}, … , {column -> value}]
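A small self-contained example of the conversion, using hypothetical data:

import pandas as pd

# Hypothetical DataFrame standing in for the query outcome
df = pd.DataFrame({"name": ["James", "Mary"], "count": [50, 42]})

# orient="records" produces a list like [{column -> value}, ...]
json_string = df.to_json(orient="records")
print(json_string)  # [{"name":"James","count":50},{"name":"Mary","count":42}]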

BigQuery and Google Colab are amazing solutions, but what if you want to get data from a spreadsheet? There are some options, like using pandas to read a Microsoft Excel file from Google Drive or read Google Sheets as a data source.

Load Data From Google Sheet to Pandas

In this tutorial, I will explain how to connect to and read data from Google Sheets. Here we will use a package named gspread, a Python API for Google Sheets. It has interesting features like:

  • Google Sheets API v4.
  • Open a spreadsheet by title, key or URL.
  • Read, write, and format cell ranges.
  • Sharing and access control.
  • Batching updates.

To install it in other projects, include it in requirements.txt:

gspread==3.0.1      

In this example, we import these modules to authenticate on Google Colab and to use GoogleCredentials.

from google.colab import auth
auth.authenticate_user()
import gspread
from oauth2client.client import GoogleCredentials
import pandas as pd

A pandas DataFrame will be created to store the data we read. The code snippet below is the core of the setup, where we pass the credentials and open the defined Google Sheet; in this example the name is 'BigQuery - DataSync - First Cleaning', and sheet1 is the first sheet to be considered.

gc = gspread.authorize(GoogleCredentials.get_application_default())
worksheet = gc.open('BigQuery - DataSync - First Cleaning').sheet1

Getting All Values from a Google Sheet

To get all values from the defined sheet, we use the get_all_values() function of the worksheet object already defined, assigning the result to a variable data. We then used the pop(0) function, passing 0, to retrieve the header row of the Google Sheet.

Now, having data and headers, it's possible to create a pandas (pd) DataFrame, passing data to the data parameter and headers to columns. As we are using Google Colab, we don't need to print the DataFrame (df), only invoke it on a separate line.

data = worksheet.get_all_values()
headers = data.pop(0)
df = pd.DataFrame(data, columns=headers)
df

Create a Function to Delete Multiple Tables in Batch from BigQuery Using a Google Sheet

In Data Engineering it's important to automate procedures; one of them is deleting tables in batch in BigQuery. We can use a Google Sheet to store the table metadata that we will use in the delete procedure.

In the code below there is a function that expects two parameters: table_path, which is a concatenation of dataset_id and table_id, and project_id, which identifies the BigQuery project name. We create an instance of the BigQuery client and call delete_table(table_id, not_found_ok=True), a function that deletes a table (table_id) and ignores it if the table is not found.

After that, we create a for-loop that, for each row of the DataFrame df, concatenates dataset_id and table_id with a dot joining both strings, assigns the result to the project_dataset_table_path variable, and calls the delete_tables function, passing project_dataset_table_path and project_id.
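A sketch of the function and loop described above, assuming the DataFrame df read from the Google Sheet has dataset_id and table_id columns and using a placeholder project name:

from google.cloud import bigquery

def delete_tables(table_path, project_id):
    client = bigquery.Client(project=project_id)
    # not_found_ok=True ignores tables that no longer exist
    client.delete_table(table_path, not_found_ok=True)
    print("Deleted table {}".format(table_path))

project_id = "data-analytics-live"  # placeholder project name
for _, row in df.iterrows():
    project_dataset_table_path = "{}.{}".format(row["dataset_id"], row["table_id"])
    delete_tables(project_dataset_table_path, project_id)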

Create Multiple Table Delete Commands Using the BigQuery CLI and the BigQuery Compose Interface

BigQuery can also be accessed and operated using the bq command. It is possible to generate BigQuery console commands using the BigQuery Compose interface.

The command is bq rm -f -t, where:

  • bq is the BigQuery command-line tool
  • rm removes the resource
  • -f forces removal without asking for confirmation
  • -t indicates that the resource to remove is a table

Generating commands in batch using SQL is very useful in Data Engineering to avoid typos and missing quotes and to improve productivity. The outcome can be saved to Google Drive as CSV or JSON, and also to a local file, if you click on Save Query Results.
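A sketch of such a generator query, run from the BigQuery Compose interface (the table filter is only an illustration):

SELECT CONCAT('bq rm -f -t ', table_catalog, ':', table_schema, '.', table_name) AS delete_command
FROM `PROJECT_ID.DATASET_ID`.INFORMATION_SCHEMA.TABLES
WHERE table_name LIKE 'tmp_%';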

You can run the exported commands from a BASH file to remove the tables.

For more details about BigQuery CLI, visit the following page

https://cloud.google.com/bigquery/docs/reference/bq-cli-reference

Create a SQL Command for Dropping Tables Using BigQuery Metadata

Using the same idea, it's possible to create SQL commands using BigQuery metadata. We can concatenate a "DROP TABLE" statement with table_schema and table_name:
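A hedged sketch of that generator query:

SELECT CONCAT('DROP TABLE `', table_schema, '.', table_name, '`;') AS drop_statement
FROM `PROJECT_ID.DATASET_ID`.INFORMATION_SCHEMA.TABLES
WHERE table_type = 'BASE TABLE';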

Export BigQuery Table Schema to JSON using BigQuery CLI

Exporting a BigQuery table schema is very useful if you need to recreate a complex table schema. To demonstrate it, use show to display information about a resource, with --schema to display only the table's schema and the flag --format=prettyjson for an easy-to-read JSON format.

!bq show --schema --format=prettyjson data-analytics-live:datasync.daily_sync_data > daily_sync_data.json
!cat daily_sync_data.json

Create a Table with Partition based on a SELECT

A partitioned table is a large table that is divided into smaller segments (partitions) to make it easier for BigQuery to manage. Using partitions we can reduce the cost of a query because we reduce the number of rows processed: if the query filters on the value of the partitioning column, BigQuery will skip the partitions that don't match the filter.

Steps

  1. Create a SELECT with the needed casts and any new fields
  2. Create a CREATE TABLE command, adding a PARTITION BY clause with the field name; the partitioning field needs to be CAST as DATE
  3. Compose a query with both, as in the sketch below
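A minimal sketch of the composed statement; all table and field names here are placeholders:

CREATE TABLE `PROJECT_ID.DATASET_ID.daily_sync_data_partitioned`
PARTITION BY sync_date
AS
SELECT
  CAST(id AS INT64) AS id,
  CAST(event_timestamp AS DATE) AS sync_date,  -- partitioning field cast as DATE
  payload
FROM `PROJECT_ID.DATASET_ID.daily_sync_data`;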

So far so good, you learned in the second part:

  • List Tables from a Dataset using BigQuery CLI
  • Use BigQuery via magics
  • List Tables from BigQuery Dataset using Python
  • BigQuery Metadata — List All Tables From BigQuery
  • Use BigQuery through pandas-gbq
  • Read BigQuery Data Using Pandas
  • Load Data From Google Sheet to Pandas
  • Create a Function to Delete Multiple Tables in Batch from BigQuery Using a Google Sheet
  • Create Multiple Table Delete Commands Using the BigQuery CLI and the BigQuery Compose Interface
  • Create a SQL Command for Dropping Tables Using BigQuery Metadata
  • Export BigQuery Table Schema to JSON using BigQuery CLI
  • Create a Table with Partition based on a SELECT

BigQuery Credentials

Reading BigQuery Credentials from Dictionary

BigQuery credentials can be passed as a Python dictionary or via a JSON file. In the following example, we used the service_account.Credentials.from_service_account_info() function, passing a Python dictionary. You need to replace the values in uppercase with your BigQuery credential values.

bq_credentials = service_account.Credentials.from_service_account_info(
    { Credentials }
)

where Credentials is the dictionary (key and values) that you downloaded from Google Cloud Platform.

Replace PROJECT_NAME, PRIVATE_ID, PRIVATE_ID_KEY, EMAIL, and CLIENT_ID with the values from your credentials.
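A sketch of that dictionary structure; the uppercase values are placeholders and the call only succeeds once they are replaced with the real contents of the JSON key downloaded from Google Cloud Platform:

from google.oauth2 import service_account

bq_credentials = service_account.Credentials.from_service_account_info({
    "type": "service_account",
    "project_id": "PROJECT_NAME",
    "private_key_id": "PRIVATE_ID",
    "private_key": "PRIVATE_ID_KEY",
    "client_email": "EMAIL",
    "client_id": "CLIENT_ID",
    "auth_uri": "https://accounts.google.com/o/oauth2/auth",
    "token_uri": "https://oauth2.googleapis.com/token",
})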

Reading BigQuery Credentials from JSON File

BigQuery credentials can be stored in a JSON file, which is the most common way to use them. They can be read using from_service_account_file to authenticate with a service account key file, or by passing a dictionary through the service_account.Credentials.from_service_account_info function.

In the following example, we illustrate how to do it using a JSON file.

service_account_json_file_name = "app/config/credentials/bigquery_credentials.json"

Create a variable credentials that receives the service account credential returned by the function from_service_account_file(file, scopes), which expects the file name with its path and the scopes of the credential.

credentials = service_account.Credentials.from_service_account_file(
    service_account_json_file_name,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)

So far so good, you learned in the last part:

  • Read BigQuery credentials from a dictionary and from a JSON file

Conclusion

In this article, we presented focused solutions using BigQuery and Google Colab with Python to improve Data Scientists' and Data Engineers' productivity.

You can check the author’s GitHub repositories for code, ideas, and resources in data science and data engineering. Please feel free to add me on LinkedIn or follow me on Twitter.

Thanks for reading!
