
Engineering
October 7, 2021

Meeting Expectations: Running Database Changes With Zero Downtime

Editor’s Note: Jay was an intern with Lob for three months in the summer of 2021. He worked on several important projects, including allowing users more control over the privacy of their data, improving our webhooks platform, and upgrading critical dependencies.

At Lob we receive millions of HTTP requests from our customers daily. Our Print and Mail API facilitates the print and delivery system that is our core product. These requests lead to the creation of new records that represent letters, postcards, self mailers, accounts, and many other mailable and non-mailable resources. Over the last eight years, many of our database tables have accumulated a massive volume of data. Herein lies the problem. 

A naive approach to executing column changes 

We needed to update multiple `json` field types to `jsonb` recently. I had never worked with a database that contained more than tens of thousands of entries, so I naively thought that the process would be easy. 

My original approach was to do something like this for each table:

<script src="https://gist.github.com/SidneyAllen/9a2d86821c33d32fe2f081b17a97ff6d.js"></script>

I quickly ran into problems. At Lob, we pride ourselves on API availability, but my approach would lock tables and undermine it: ALTER TABLE statements in Postgres hold an ACCESS EXCLUSIVE lock while the operation runs. ACCESS EXCLUSIVE is the most restrictive table-level lock mode. While an ACCESS EXCLUSIVE lock is held, no concurrent query statements can be executed on the table. Bear in mind that our tables hold a massive volume of data. Locking them for a column type change would essentially deny customers the use of our Print and Mail API for hours, possibly days.
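To make the problem concrete, the naive approach boils down to one statement of this shape per table (the `letters` table and `metadata` column here are hypothetical examples, not Lob's actual schema):

```javascript
// Build the naive in-place type change for one table. This single
// statement takes an ACCESS EXCLUSIVE lock and rewrites every row,
// blocking all other queries on the table until it finishes.
function naiveTypeChangeSql(table, column) {
  return (
    `ALTER TABLE ${table} ` +
    `ALTER COLUMN ${column} TYPE jsonb USING ${column}::jsonb;`
  );
}

console.log(naiveTypeChangeSql('letters', 'metadata'));
```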

Approaching column changes with zero downtime

My approach for tackling this zero downtime constraint was inspired by PostgreSQL at Scale: Database Schema Changes Without Downtime.

Below is my approach in consideration of the zero downtime constraint: 

  1. Run a database migration script that creates a new column of type `jsonb` for each `json` column in affected tables.
  2. Make changes to our Print and Mail API to dual write to both the `json` and `jsonb` columns.
  3. Backfill the new `jsonb` column with the value of the pre-existing `json` column. 
  4. After backfilling the `jsonb` columns, run a database migration script to drop the `json` column and rename the `jsonb` column to the name of the `json` column. 
  • It is important to run both the ALTER TABLE … DROP COLUMN queries and the ALTER TABLE … RENAME COLUMN queries in a single transaction. Otherwise, our Print and Mail API could execute queries on the table between the time a table’s `json` column is dropped and the `jsonb` column is renamed. Those queries would try to write to the old column name, and because the new column hasn’t yet been renamed to it, they would fail.
  5. Change the API to write only to the new column.


You might be wondering, “Why write to both columns in Step 2?” Because zero downtime is the most important consideration in this project. New records can be added to the tables while we are changing them, so we need to account for new data in our execution of the changes.

I used Node.js with the inclusion of a package called Knex, a SQL query builder for relational databases. Migration scripts are executed using the Knex command line interface. Below is the code as it corresponds to each step of the process:


Step 1

Create a new column of type `jsonb` for each table with a `json` column

JavaScript

<script src="https://gist.github.com/SidneyAllen/8d812f8ba2eb43b28393d843ec635692.js"></script>
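As a rough sketch of what such a migration amounts to (the real migration uses Knex's schema builder; the `_jsonb` suffix and table name here are illustrative assumptions):

```javascript
// Generate the ALTER TABLE statement that adds a nullable jsonb
// "shadow" column alongside an existing json column. Adding a nullable
// column with no default is a metadata-only change, so the
// ACCESS EXCLUSIVE lock is held only very briefly.
function addShadowColumnSql(table, jsonColumn) {
  return `ALTER TABLE ${table} ADD COLUMN ${jsonColumn}_jsonb jsonb;`;
}

console.log(addShadowColumnSql('letters', 'metadata'));
```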


Step 2

Enable dual writing to both `json` and `jsonb` columns. 


Example in pseudocode (this goes wherever the controller method creates a new record in a table with a `json` typed column):

<script src="https://gist.github.com/SidneyAllen/09549e339d289fb58987f0a5a28b277f.js"></script>
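A minimal inline stand-in for the dual write, assuming the new column simply carries a `_jsonb` suffix (an assumption for illustration, not Lob's actual naming):

```javascript
// Copy the value destined for the old json column into the new jsonb
// shadow column, so both columns stay in sync for every new record.
function withDualWrite(attributes, jsonColumn) {
  return {
    ...attributes,
    [`${jsonColumn}_jsonb`]: attributes[jsonColumn],
  };
}

const record = withDualWrite({ metadata: { campaign: 'spring' } }, 'metadata');
// record.metadata and record.metadata_jsonb now hold the same value.
```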


Step 3

The backfill script. 

This script is executed via the command line like so: 

node backfill-jsonb-columns-script.js <loop_count> <batch_size> <timeout_delay in ms>

  • loop_count: How many times to execute the update query
  • batch_size: How many records to update in one UPDATE statement
  • timeout_delay: How long to wait between each batch update


JavaScript code

<script src="https://gist.github.com/SidneyAllen/ef3cccc0ecaf367957787c25371b1c09.js"></script>
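The overall shape of the script can be sketched with an in-memory stand-in for the real UPDATE query (the `sleep` helper and `fakeUpdate` function are illustrative, not the production code):

```javascript
// Sketch of the backfill loop: update up to batchSize rows per
// iteration, pause between iterations so concurrent writes can
// proceed, and stop after loopCount iterations or once nothing is
// left to backfill.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function backfill(loopCount, batchSize, timeoutDelayMs, updateBatch) {
  for (let i = 0; i < loopCount; i += 1) {
    const updated = await updateBatch(batchSize); // rows touched this round
    if (updated === 0) break; // nothing left to backfill
    await sleep(timeoutDelayMs);
  }
}

// In-memory stand-in for "UPDATE ... WHERE metadata_jsonb IS NULL LIMIT n".
const rows = Array.from({ length: 7 }, () => ({ metadata_jsonb: null }));
const fakeUpdate = async (n) => {
  const batch = rows.filter((r) => r.metadata_jsonb === null).slice(0, n);
  batch.forEach((r) => { r.metadata_jsonb = {}; });
  return batch.length;
};

backfill(10, 3, 1, fakeUpdate).then(() => {
  console.log(rows.every((r) => r.metadata_jsonb !== null)); // true
});
```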

Step 4 

Drop the `json` column and rename the `jsonb` column to the name of the dropped column. 

JavaScript Code

<script src="https://gist.github.com/SidneyAllen/b23e8b3803a779cdf9b2841b787ed8c1.js"></script>
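The SQL this migration boils down to, wrapped in a single transaction (table and column names are hypothetical again):

```javascript
// Both statements must run in one transaction so no query can hit the
// table between the DROP and the RENAME.
function dropAndRenameSql(table, jsonColumn) {
  return [
    'BEGIN;',
    `ALTER TABLE ${table} DROP COLUMN ${jsonColumn};`,
    `ALTER TABLE ${table} RENAME COLUMN ${jsonColumn}_jsonb TO ${jsonColumn};`,
    'COMMIT;',
  ].join('\n');
}

console.log(dropAndRenameSql('letters', 'metadata'));
```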

Step 5 

Revert changes made in Step 2.

Considerations

In order to make the column type change from `json` to `jsonb`, we need to consider two things for the query statements we execute. 

  1. Would the statement lock the table from other concurrent query statements? 
  2. If the table is locked, is it possible to perform the operation quickly? 

In order to answer these questions, we need to understand what SQL queries are being executed at each step of my approach. 

Step 1

Create a new column of type `jsonb` for each table with a `json` column. 


SQL Statements:

`ALTER TABLE… ADD COLUMN…`

`ALTER TABLE… ADD COLUMN…`

...

Would the statement lock the table from other concurrent query statements?

Yes. This type of ALTER TABLE… statement acquires an ACCESS EXCLUSIVE type table lock before executing.


If the table is locked, is it possible to perform the operation quickly?

Yes. `ALTER TABLE… ADD COLUMN…` executes very quickly when the new column is nullable with no default: it is a metadata-only change, and existing rows simply read the new column as NULL.


Solution

No need to remedy. All tables should be locked for a very short period of time.

Step 2 

This step does not execute any SQL queries.

Step 3

The backfill script.

SQL Statements:

`UPDATE… SET…`

`UPDATE… SET…`


Would the statement lock the table from other concurrent query statements?

Partially. Reads to the table can happen concurrently, and so can writes to rows the UPDATE hasn’t touched. UPDATE statements in Postgres acquire a ROW EXCLUSIVE table-level lock, which does not conflict with other reads or writes, plus row-level locks on every row they modify. Those row-level locks block any concurrent write to the same rows until the UPDATE commits.

If the table is locked, is it possible to perform the operation quickly?

Negative, if we were to do a single UPDATE… SET… query for every table. It would take hours to execute the queries and would inevitably result in downtime: a single statement that rewrites every row holds row-level locks on all of them until it commits. Being able to read from the tables while the UPDATE… SET… queries execute is great, but being unable to write to those tables for hours would amount to a denial of our service. 

Solution

The remedy can be seen in the code for the backfill script. UPDATE… SET… queries are executed in a while loop and done in batches, with a timeout between each UPDATE… SET… query so that other write operations can be executed on the affected tables. The script can be run and stopped as many times as needed, which means we can execute the backfilling process periodically and over different spans of time rather than all at once.
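Since a Postgres UPDATE has no LIMIT clause, each batch can bound its work through an id subquery; roughly something of this shape (hypothetical table, column, and `id` key):

```javascript
// One batch of the backfill: cast the json value into the jsonb shadow
// column for up to batchSize rows that haven't been backfilled yet.
function batchUpdateSql(table, jsonColumn, batchSize) {
  return (
    `UPDATE ${table} SET ${jsonColumn}_jsonb = ${jsonColumn}::jsonb ` +
    `WHERE id IN (` +
    `SELECT id FROM ${table} WHERE ${jsonColumn}_jsonb IS NULL ` +
    `LIMIT ${batchSize});`
  );
}

console.log(batchUpdateSql('letters', 'metadata', 5000));
```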

Step 4

Drop the `json` column and rename the `jsonb` column to the name of the dropped column. 

SQL Statements:

`ALTER TABLE… DROP COLUMN…`

`ALTER TABLE… DROP COLUMN…`

`ALTER TABLE… RENAME COLUMN…`

`ALTER TABLE… RENAME COLUMN…`

Would the statement lock the table from other concurrent query statements?

Yes. Both DROP COLUMN and RENAME COLUMN acquire an ACCESS EXCLUSIVE lock, and because the statements run in a single transaction, the lock is held on the table until the transaction commits.


If the table is locked, is it possible to perform the operation quickly?

Yes. Both statements are very quick to execute.

Solution

No need to remedy. All tables should be locked for a very short period of time.

Step 5 

This step does not execute any SQL queries to our database.

Conclusion

Like most organizations that possess a massive volume of data, Lob needs to ensure that our services have zero downtime. To do so, Lob software engineers need to carefully consider the database changes they make. Migrations and backfill scripts need to be planned in advance to ensure availability.

