Airflow tutorial 6: Build a data pipeline using Google Cloud BigQuery

In this tutorial, we will build a data pipeline using Google Cloud BigQuery and Apache Airflow.

The GitHub links for this tutorial

A quick look at this tutorial

This tutorial is inspired by this blog post from the official Google Cloud blogs.

We will be using 2 public datasets hosted on Google BigQuery:

  • GitHub Archive: 30 million events monthly, including issues, commits, and pushes on GitHub.
  • Hacker News: contains a full daily update of all the stories and comments from Hacker News.

From these public datasets, we will create a data pipeline that aggregates daily stats about GitHub repos and their associated Hacker News stories, and pushes the result into a new joined table every day. The final table powers this visualization report: a GitHub on Hacker News trends dashboard created with Google Data Studio.
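
To give a feel for the shape of the pipeline before diving in, here is a minimal sketch of such a DAG. It is not the DAG from the repo: the connection id, schedule, destination table, and aggregation SQL below are placeholders, and the import paths assume the Airflow 1.x contrib operators shipped with the tutorial-era Docker images. Only the DAG id and the two check-task ids mirror the ones used later in this tutorial.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.bigquery_check_operator import BigQueryCheckOperator
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2018, 12, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG("bigquery_github_trends",
         default_args=default_args,
         schedule_interval="@daily") as dag:  # placeholder schedule

    # Check that the GitHub Archive daily table for the execution date exists
    bq_check_githubarchive_day = BigQueryCheckOperator(
        task_id="bq_check_githubarchive_day",
        sql="SELECT COUNT(*) FROM `githubarchive.day.{{ ds_nodash }}`",
        use_legacy_sql=False,
        bigquery_conn_id="my_gcp_conn",  # placeholder connection id
    )

    # Check that Hacker News data is available for the execution date
    bq_check_hackernews_full = BigQueryCheckOperator(
        task_id="bq_check_hackernews_full",
        sql="""
            SELECT COUNT(*)
            FROM `bigquery-public-data.hacker_news.full`
            WHERE DATE(timestamp) = '{{ ds }}'
        """,
        use_legacy_sql=False,
        bigquery_conn_id="my_gcp_conn",
    )

    # Aggregate daily stats, join the two sources, and write a date-partitioned table.
    # The real aggregation/join SQL lives in the repo; everything below is a placeholder.
    bq_write_github_hackernews_joined = BigQueryOperator(
        task_id="bq_write_github_hackernews_joined",  # placeholder task id
        sql="SELECT 1",  # placeholder query
        destination_dataset_table="my-project.my_dataset.github_hackernews${{ ds_nodash }}",
        write_disposition="WRITE_TRUNCATE",
        use_legacy_sql=False,
        bigquery_conn_id="my_gcp_conn",
    )

    # Both check tasks must succeed before the daily aggregation runs
    [bq_check_githubarchive_day, bq_check_hackernews_full] >> bq_write_github_hackernews_joined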

Overview of Google BigQuery

Google BigQuery is a big data analytics product from Google that helps you run ad-hoc analysis on massive datasets using Google Cloud infrastructure. With the power of BigQuery, you can run a query that analyzes terabytes of data within seconds.

Some characteristics of Google BigQuery:

  • Serverless: the cloud provider (AWS, Google) dynamically allocates the resources needed to run your code and charges only for the resources actually used.
  • Highly scalable: can scale to tens of thousands of machines in Google's data centers.
  • High performance: massively parallel execution and automatic performance optimization.
  • High availability: automatically replicates data between zones and load balances to provide optimal performance and to minimize the impact of any hardware failure.

The advantages of choosing Google BigQuery for this tutorial:

  • Access to Google BigQuery public datasets: Google pays for the storage of these datasets and provides public access to the data via your cloud project. You pay only for the queries you run on the data, and there is a 1 TB per month free tier for queries, which makes getting started easy (see the query sketch after this list).
  • Real-time analysis of massive datasets
  • No-ops (no operations): no need to do operations work like setting up or maintaining any infrastructure. Just focus on your code.
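
To make the "pay only for the queries you run" point concrete, here is a small sketch (not part of the original tutorial) that uses the google-cloud-bigquery Python client to dry-run a query against one of the public datasets and report how many bytes it would bill. The project id is a placeholder, and application default credentials are assumed to be configured.

from google.cloud import bigquery

# Assumes application default credentials and a project of your own for billing
client = bigquery.Client(project="your-project-id")  # placeholder project id

# A dry run estimates the bytes a query would scan without actually running it
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

query = """
    SELECT COUNT(*) AS story_count
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'story'
"""

query_job = client.query(query, job_config=job_config)
print("This query would process {:.2f} GB".format(query_job.total_bytes_processed / 1e9))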

Working with Google BigQuery

  • The first step is to sign in to your Google account.
  • Search for Google Cloud and sign up for a Google Cloud Platform account.
  • Go to the Google Cloud Console.
  • Search for BigQuery in the navigation bar.

Exploratory data analysis

The first step in working with any new dataset is to do some analysis to explore the data.

You can either run the queries directly in the BigQuery web UI, or reproduce the analysis with the Jupyter notebook in this tutorial's repo.

Steps to run my Jupyter notebook Docker environment and reproduce the analysis:

  • Clone this airflow-tutorial repo
  • Go to the notebooks directory, where you should see a docker-compose.yml file:
cd notebooks
  • Run this command in the terminal to start the Docker environment:
docker-compose up
  • You should see output similar to the following in your terminal, containing the first-time login token:
Copy/paste this URL into your browser when you connect for the first time, to login with a token:
http://(7b568d3eba0f or 127.0.0.1):8888/?token=7d229945745ecd113adc572d50d93c98e3552516ee65a432
  • Copy the token, open a new browser tab, and enter:
localhost:8889/?token=7d229945745ecd113adc572d50d93c98e3552516ee65a432
  • That’s it. You should now see the work/github-trend-analysis.ipynb file.
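
Inside the notebook, the exploratory analysis boils down to running queries against the two public datasets and pulling the results into pandas. The actual queries live in work/github-trend-analysis.ipynb; the sketch below is only an illustration and assumes the google-cloud-bigquery client (with pandas support) and credentials are available in the notebook environment.

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

# Example exploration: which GitHub event types were most common on a given day?
sql = """
    SELECT type, COUNT(*) AS event_count
    FROM `githubarchive.day.20181201`
    GROUP BY type
    ORDER BY event_count DESC
"""
df = client.query(sql).to_dataframe()
print(df)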

Building the data pipeline

Setting up

In order to run this data pipeline example, you need to set up a few things first. Specifically:

  • Create a service account (Cloud Console)
  • Set up a Google Cloud connection in Airflow
  • Supply the config variables

Follow these instructions to set up and run your DAG.
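
As an illustration of the last two items, the DAG typically reads its project and dataset settings from Airflow Variables and references the Google Cloud connection by its conn id. The names below are placeholders, not necessarily the keys the repo's DAG expects; check the linked instructions for the exact values.

from airflow.models import Variable

# Placeholder variable names; the repo's instructions define the real keys
BQ_PROJECT = Variable.get("bq_project")  # GCP project that owns the destination dataset
BQ_DATASET = Variable.get("bq_dataset")  # dataset where the daily joined table is written

# The service-account key created in the Cloud Console is attached to an Airflow
# connection of type "Google Cloud Platform"; operators reference it via its conn id,
# e.g. bigquery_conn_id="my_gcp_conn" in the DAG sketch above.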

Running the Airflow Docker environment

I have already created a Docker environment for running this data pipeline example. Steps to run the Airflow environment:

  • Start the Airflow environment
bash run_gcloud_example.sh
  • Stop the Airflow environment when you are finished
bash stop_gcloud_example.sh

Test your DAG

After the connection and config variables have been set up following these instructions, you can test and run your DAG.

  • Use the command below to test a specific task in the DAG:
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test [DAG_ID] [TASK_ID] [EXECUTION_DATE]
  • Examples:
# Task 1
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test bigquery_github_trends bq_check_githubarchive_day 2018-12-01

# Task 2
docker-compose -f docker-compose-gcloud.yml run --rm webserver airflow test bigquery_github_trends bq_check_hackernews_full 2018-12-01

Resources
