Blog

How to create a pivot table in Google BigQuery

27 Jun, 2021
Xebia Background Header Wave

To calculate the elapsed times of Terraform Cloud run stages, I needed to create a pivot table in Google BigQuery. As the classic SQL query was very expensive, I changed it to use the BigQuery PIVOT operator which was 750 times cheaper. Terraform Cloud allows you to manage your infrastructure at an enterprise scale. It is a managed service, and the pricing is based on the number of users and the number of concurrent runs you need. By default you get a single agent or runner. Extra runners cost a few thousand dollar per year. So, you need to balance throughput and cost. To do this, you need data on the utilization of the runs. Terraform Cloud does not provide any data out of the box, so I activated the Terraform run notifications webhook and streamed the data into BigQuery. To get useful information I had to pivot the data.

terraform run notifications

The Terraform run notification webhook, emits a notification whenever a run changes state. These notifications look as follows:

{
  "payload_version": 1,
  "notification_configuration_id": "nc-Mra7BwRJgiyEEPo4",
  "run_url": "https://app.terraform.io/app/binx-io/my-ws/runs/run-BbUuHJv3w1kztNDi",
  "run_id": "run-BbUuHJv3w1kztNDi",
  "run_message": "Queued manually using Terraform",
  "run_created_at": "2021-06-12T12:11:20.000Z",
  "run_created_by": "mjvanholsteijn",
  "workspace_id": "ws-tgHKyu5U8a3dDUGC",
  "workspace_name": "my-ws",
  "organization_name": "binx-io",
  "notifications": [
    {
      "message": "Run Created",
      "trigger": "run:created",
      "run_status": "pending",
      "run_updated_at": "2021-06-12T12:11:20.000Z",
      "run_updated_by": "mjvanholsteijn"
    }
  ]
}

The notification only reports the time of the state change. It does not report the elapsed time since the previous state change. As I loaded all of these events into BigQuery, I had to find a smart way to collect all of the timestamps of the appropriate state changes in a single row.

unnesting the notifications

As you can see from the sample, the notifications is modeled as an array. So Terraform potentially sends multiple state changes in a single notification. To make it a simple table, I use the unnest operator:

SELECT
  e.workspace_id,
  e.run_id,
  n.run_updated_at,
  n.run_status
FROM
  terraform_cloud_notifications.events e,
  UNNEST(notifications) AS n

Which delivers a table with a row for each of the notifications in the array:

|workspace_id       |run_id              |run_updated_at     |run_status|
|-------------------|--------------------|----------|--------|---|------|
|ws-tgHKyu5U8a3dDUGC|run-ubJZBXt4aT2DprjJ|2021-06-13 20:49:02|applied|
|ws-tgHKyu5U8a3dDUGC|run-EMYsYn6oZd36Bk2o|2021-06-13 19:47:53|pending|
|ws-tgHKyu5U8a3dDUGC|run-ubJZBXt4aT2DprjJ|2021-06-13 20:48:53|planned|
|ws-tgHKyu5U8a3dDUGC|run-jLcoV2rTiWpncAzp|2021-06-13 19:43:37|applying|
|ws-tgHKyu5U8a3dDUGC|run-jLcoV2rTiWpncAzp|2021-06-13 19:43:27|planning|
|ws-tgHKyu5U8a3dDUGC|run-BbUuHJv3w1kztNDi|2021-06-12 12:11:20|pending|

To be able to calculate the duration of each stage, I needed to pivot the table.

Using plain-ol’ SQL

Unaware of the PIVOT operator, my first solution was to use plain old SQL to solve this problem. The following statement uses a sub-select for each of the event notification times:

SELECT
  workspace_id,
  run_id,
  (SELECT ANY_VALUE(n.run_updated_at) 
   FROM terraform_cloud_notifications.events, unnest(notifications) n
   WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'pending') as pending,
  (SELECT ANY_VALUE(n.run_updated_at) 
   FROM terraform_cloud_notifications.events, unnest(notifications) n 
   WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'planning') as planning,
  (SELECT ANY_VALUE(n.run_updated_at) 
   FROM terraform_cloud_notifications.events, unnest(notifications) n 
   WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'planned') as planned,
  (SELECT ANY_VALUE(n.run_updated_at) 
   FROM terraform_cloud_notifications.events, unnest(notifications) n 
   WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'applying') as applying,
  (SELECT ANY_VALUE(n.run_updated_at) 
   FROM terraform_cloud_notifications.events, unnest(notifications) n 
   WHERE workspace_id = e.workspace_id AND run_id = e.run_id AND n.run_status = 'applied') as applied,
FROM
( SELECT DISTINCT workspace_id, run_id
  FROM terraform_cloud_notifications.events
) e

It produces the expected result:

workspace_idrun_idpendingplanningplannedapplyingapplied
ws-tgHKyu5U8a3dDUGCrun-ubJZBXt4aT2DprjJ2021-06-13 20:48:452021-06-13 20:48:472021-06-13 20:48:532021-06-13 20:48:552021-06-13 20:49:02
ws-tgHKyu5U8a3dDUGCrun-jLcoV2rTiWpncAzp2021-06-13 19:43:252021-06-13 19:43:272021-06-13 19:43:342021-06-13 19:43:372021-06-13 19:43:44
ws-tgHKyu5U8a3dDUGCrun-EMYsYn6oZd36Bk2o2021-06-13 19:47:532021-06-13 19:47:532021-06-13 19:48:002021-06-13 19:48:022021-06-13 19:48:06

But the cost were horrendous! The query returned in 1.5 seconds and spend almost 30 seconds of slot time. For a table with just a handful of rows, that is clearly not acceptable.

pivot table in Google BigQuery

After some googling, I found that the PIVOT operator does exactly what I want. The following statement will add the columns pending, planning, planned, applying, applied with
the column’s value set to max(run_updated_at).

SELECT
  *
FROM (
  SELECT
    e.workspace_id,
    e.run_id,
    n.run_updated_at,
    n.run_status
  FROM
    terraform_cloud_notifications.events e,
    UNNEST(notifications) AS n 
) 
PIVOT ( 
    MAX(run_updated_at)
    FOR run_status IN ('pending','planning','planned', 'applying','applied')
)

It produces the following result:

workspace_idrun_idpendingplanningplannedapplyingapplied
ws-tgHKyu5U8a3dDUGCrun-ubJZBXt4aT2DprjJ2021-06-13 20:48:452021-06-13 20:48:472021-06-13 20:48:532021-06-13 20:48:552021-06-13 20:49:02
ws-tgHKyu5U8a3dDUGCrun-jLcoV2rTiWpncAzp2021-06-13 19:43:252021-06-13 19:43:272021-06-13 19:43:342021-06-13 19:43:372021-06-13 19:43:44
ws-tgHKyu5U8a3dDUGCrun-EMYsYn6oZd36Bk2o2021-06-13 19:47:532021-06-13 19:47:532021-06-13 19:48:002021-06-13 19:48:022021-06-13 19:48:06

The data was returned in 0.2 seconds and only consumed 0.04 seconds of slot time. That is 750 times cheaper. How awesome is that!

conclusion

After first trying some good old fashioned nested SQL to pivot tables, I found the PIVOT operator very simple to use. it is really sweet. Image by Harry Strauss from Pixabay

Mark van Holsteijn
Mark van Holsteijn is a senior software systems architect at Xebia Cloud-native solutions. He is passionate about removing waste in the software delivery process and keeping things clear and simple.
Questions?

Get in touch with us to learn more about the subject and related solutions

Explore related posts