# How to create a pivot table in Google BigQuery

on
jun 27, 2021
in

To calculate the elapsed times of Terraform Cloud run stages, I needed to create a pivot table in Google BigQuery. As the classic SQL query was very expensive, I changed it to use the BigQuery PIVOT operator which was 750 times cheaper.

Terraform Cloud allows you to manage your infrastructure at an enterprise scale. It is a managed service, and the pricing is based on the number of users and the number of concurrent runs you need. By default you get a single agent or runner. Extra runners cost a few thousand dollar per year. So, you need to balance throughput and cost. To do this, you need data on the utilization of the runs. Terraform Cloud does not provide any data out of the box, so I activated the Terraform run notifications webhook and streamed the data into BigQuery. To get useful information I had to pivot the data.

The Terraform run notification webhook, emits a notification whenever a run changes state. These notifications look as follows:

{
"run_url": "https://app.terraform.io/app/binx-io/my-ws/runs/run-BbUuHJv3w1kztNDi",
"run_id": "run-BbUuHJv3w1kztNDi",
"run_message": "Queued manually using Terraform",
"run_created_at": "2021-06-12T12:11:20.000Z",
"run_created_by": "mjvanholsteijn",
"workspace_id": "ws-tgHKyu5U8a3dDUGC",
"workspace_name": "my-ws",
"organization_name": "binx-io",
{
"message": "Run Created",
"trigger": "run:created",
"run_status": "pending",
"run_updated_at": "2021-06-12T12:11:20.000Z",
"run_updated_by": "mjvanholsteijn"
}
]
}

The notification only reports the time of the state change. It does not report the elapsed time since the previous state change.

As I loaded all of these events into BigQuery, I had to find a smart way to collect all of the timestamps of the appropriate state changes in a single row.

As you can see from the sample, the notifications is modeled as an array. So Terraform potentially sends multiple state changes in a single notification. To make it a simple table, I use the unnest operator:

SELECT
e.workspace_id,
e.run_id,
n.run_updated_at,
n.run_status
FROM
UNNEST(notifications) AS n

Which delivers a table with a row for each of the notifications in the array:

|workspace_id       |run_id              |run_updated_at     |run_status|
|-------------------|--------------------|----------|--------|---|------|
|ws-tgHKyu5U8a3dDUGC|run-ubJZBXt4aT2DprjJ|2021-06-13 20:49:02|applied|
|ws-tgHKyu5U8a3dDUGC|run-EMYsYn6oZd36Bk2o|2021-06-13 19:47:53|pending|
|ws-tgHKyu5U8a3dDUGC|run-ubJZBXt4aT2DprjJ|2021-06-13 20:48:53|planned|
|ws-tgHKyu5U8a3dDUGC|run-jLcoV2rTiWpncAzp|2021-06-13 19:43:37|applying|
|ws-tgHKyu5U8a3dDUGC|run-jLcoV2rTiWpncAzp|2021-06-13 19:43:27|planning|
|ws-tgHKyu5U8a3dDUGC|run-BbUuHJv3w1kztNDi|2021-06-12 12:11:20|pending|

To be able to calculate the duration of each stage, I needed to pivot the table.

## Using plain-ol’ SQL

Unaware of the PIVOT operator, my first solution was to use plain old SQL to solve this problem. The following statement uses a sub-select for each of the event notification times:

SELECT
workspace_id,
run_id,
(SELECT ANY_VALUE(n.run_updated_at)
WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'pending') as pending,
(SELECT ANY_VALUE(n.run_updated_at)
WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'planning') as planning,
(SELECT ANY_VALUE(n.run_updated_at)
WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'planned') as planned,
(SELECT ANY_VALUE(n.run_updated_at)
WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'applying') as applying,
(SELECT ANY_VALUE(n.run_updated_at)
WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'applied') as applied,
FROM
( SELECT DISTINCT workspace_id, run_id
) e

It produces the expected result:

workspace_id run_id pending planning planned applying applied
ws-tgHKyu5U8a3dDUGC run-ubJZBXt4aT2DprjJ 2021-06-13 20:48:45 2021-06-13 20:48:47 2021-06-13 20:48:53 2021-06-13 20:48:55 2021-06-13 20:49:02
ws-tgHKyu5U8a3dDUGC run-jLcoV2rTiWpncAzp 2021-06-13 19:43:25 2021-06-13 19:43:27 2021-06-13 19:43:34 2021-06-13 19:43:37 2021-06-13 19:43:44
ws-tgHKyu5U8a3dDUGC run-EMYsYn6oZd36Bk2o 2021-06-13 19:47:53 2021-06-13 19:47:53 2021-06-13 19:48:00 2021-06-13 19:48:02 2021-06-13 19:48:06

But the cost were horrendous! The query returned in 1.5 seconds and spend almost 30 seconds of slot time. For a table with just a handful of rows, that is clearly not acceptable.

## pivot table in Google BigQuery

After some googling, I found that the PIVOT operator does exactly what I want. The following statement will add the columns pending, planning, planned, applying, applied with
the column’s value set to max(run_updated_at).

SELECT
*
FROM (
SELECT
e.workspace_id,
e.run_id,
n.run_updated_at,
n.run_status
FROM
)
PIVOT (
MAX(run_updated_at)
FOR run_status IN ('pending','planning','planned', 'applying','applied')
)

It produces the following result:

workspace_id run_id pending planning planned applying applied
ws-tgHKyu5U8a3dDUGC run-ubJZBXt4aT2DprjJ 2021-06-13 20:48:45 2021-06-13 20:48:47 2021-06-13 20:48:53 2021-06-13 20:48:55 2021-06-13 20:49:02
ws-tgHKyu5U8a3dDUGC run-jLcoV2rTiWpncAzp 2021-06-13 19:43:25 2021-06-13 19:43:27 2021-06-13 19:43:34 2021-06-13 19:43:37 2021-06-13 19:43:44
ws-tgHKyu5U8a3dDUGC run-EMYsYn6oZd36Bk2o 2021-06-13 19:47:53 2021-06-13 19:47:53 2021-06-13 19:48:00 2021-06-13 19:48:02 2021-06-13 19:48:06

The data was returned in 0.2 seconds and only consumed 0.04 seconds of slot time. That is 750 times cheaper. How awesome is that!

## conclusion

After first trying some good old fashioned nested SQL to pivot tables, I found the PIVOT operator very simple to use. it is really sweet.

Image by Harry Strauss from Pixabay

Mark van Holsteijn is a senior software systems architect, and CTO of binx.io. He is passionate about removing waste in the software delivery process and keeping things clear and simple.