
The Longest Schema Change

Sometimes, changing a column's type isn't as simple as a migration with a change_column command. This is the story of one of those times. It happened last year; I’ve reconstructed events and timelines, but none of it is gospel.

I worked on a service that was designed to answer the question “When did people like my post?”. You gave it social media items (“entities”) and it watched them for you, ever ready to answer that question. It did this by polling the relevant social media APIs and storing the results in an Amazon RDS PostgreSQL database.

Schema Diagram

The schema looked vaguely like this, with some pretty normal tables for tracking entities and metadata about them, and then sample tables that used Postgres' partitioned table feature to store sample data in sub-tables broken up by date.

The Outage

One day we got a flood of Sentry alerts (our exception monitoring system) about inserts and updates failing. It turned out we had recorded over 2 billion Instagram samples, blowing past the 2,147,483,647 ceiling of a 32-bit signed integer. We had thought an awful lot about scaling things, but we didn’t think to make our primary keys 64 bit.

In the moment, what was blowing up was trying to insert new rows into the Instagram samples table. A little thinking ahead revealed that even if we fixed that, the entities table had a latest_sample_id column that would also blow up.
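If you want to run that kind of check against your own schema, Postgres can tell you directly. A minimal sketch, with illustrative object names (instagram_samples_id_seq stands in for whatever sequence backs your busiest table):

-- how close is the busiest table's id sequence to the int4 ceiling of 2,147,483,647?
select last_value from instagram_samples_id_seq;

-- list the remaining 32-bit integer id columns that could hit the same wall
select table_name, column_name
  from information_schema.columns
  where table_schema = 'public' and data_type = 'integer'
  and column_name like '%id%';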

Plan in hand, I tried using migrations to alter the two tables. It worked great in development and CI, but failed in production because our deploy process would kill anything that took more than 10 minutes. I went ahead and used psql to run the basic alter table statement on the main Instagram samples table. It took about 45 minutes to complete, and the change to latest_sample_id took another 75 minutes or so. This got the service back up and running.
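For reference, the statements in question were just the plain column type change, something like this (table and column names approximated from memory):

-- the main Instagram samples table
alter table public.instagram_samples alter column id type bigint;

-- the column on entities that would have overflowed next
alter table public.entities alter column latest_sample_id type bigint;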

Excellent!

Early Warning System

Unfortunately, the service was handling a lot more volume than we expected and we now knew we had a bunch of other tables that were going to break in the same way. We needed to update them, and we needed a way to get a warning before they exploded. In fact, we wanted to protect all our systems.

Rails 5 introduced the notion of an ApplicationRecord that all ActiveRecord models inherit from, much like the ApplicationController. We weren’t on Rails 5 everywhere yet, but it seemed like a good idea. I would add a universal after_create callback to check the value of the generated id and scream loudly if it exceeded a threshold.

class ApplicationRecord < ActiveRecord::Base  
  self.abstract_class = true

  after_create :check_for_id_exhaustion

  def check_for_id_exhaustion  
    if self.id.to_i > self.class.id_error_threshold  
      Raven.capture_message("Model #{self.class} has id numbers that are dangerously high", { extra: { id: self.id } }) if (self.id.to_i % 100000 == 0)  
    end

    true  
  end

  def self.id_error_threshold  
    100_000_000  
  end  
end

Notice that the threshold is quite low - high enough to show that the table is growing a lot, but not so high that altering it will take too long. The actual threshold lives in a class method, which makes it easy to override for individual classes or tests. Typically I’d bump it up to 200_000_000 the first time it went off and then much higher once I actually altered the table. This introduced a new failure mode where I’d alter a table to 64-bit ids but keep spamming myself with Sentry alerts because I’d forgotten to raise the limit accordingly.

Also notice that we don’t send a Sentry alert for every single insert past the threshold. If a table is growing fast, you will flood the notification system unless you ease back a bit.

So now we will know if a table we didn’t think of is heading towards a wall, and we will get an insistent reminder if we don’t get around to one of the ones we already knew about fast enough.

Down for Maintenance

Next we needed to actually fix those tables. We thought we’d do them all, starting with the entities table, as it seemed most crucial. We declared an outage on a Saturday morning, when our metrics showed that almost no one used the system, and proceeded to start on the alter table statements. We got entities done, but it took soooo long we decided to pull the plug on the rest until we could find a better way.

Worse, when we tried to go back online, everything was performing incredibly poorly. It turns out that Postgres tables require periodic maintenance called vacuuming to clear out dead rows and keep the query planner's statistics well tuned. Amazon RDS normally does a good job of automatically vacuuming your tables, but sometimes after a major change like this you can end up with a planner that is completely out of whack until you vacuum manually. I really can’t tell you how to predict this, because it hasn’t happened to us again, even with similar changes. Just be warned that if you see the problem, you will need to manually vacuum the table you just changed.
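The fix itself is a one-liner once you know you need it, run against whichever table you just altered (entities, in our case):

-- clear out dead rows and refresh the planner statistics after the big alter
vacuum analyze public.entities;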

Insert From Select, In Batches if Necessary

One of the standard recommendations for doing an alter table more quickly is to create a new table with the schema you want, populate it via an insert statement that selects all the contents of the existing table, then drop the old table and rename the new one over the top of it.

create table public.next_samples_new (id bigserial primary key,  
  entity_id bigint, network_id integer, entity_type_id integer,  
  next_sample_at timestamp, created_at timestamp not null,  
  updated_at timestamp not null, last_sent_at timestamp);

insert into public.next_samples_new select * from next_samples;

alter table public.next_samples rename to next_samples_old;
alter table public.next_samples_new rename to next_samples;

The best part is that you can do a dry run of the create and insert without taking down production. Or so you would think. It worked great for a smaller table, where there weren’t too many rows, but when I moved on to one with ~500M rows the insert took a very long time and other queries started failing. This directly contradicts the Postgres docs, which state that readers (like the select half of an insert from select) don’t block other queries. Nevertheless, it happened.

Luckily, that larger table was append-only: once inserted, rows are never updated. That let me copy a few million rows at a time by selecting them based on their id values - a very quick index query.

insert into public.next_samples_new select * from next_samples where
  id >= 10000000 and id < 15000000;

Do that over and over again until you have copied them all, then catch up on the last few rows while things are paused during a maintenance window, and swap the tables out.
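The tail end of that process, sketched under the assumption that writes to the table are paused for the window (the id boundary is illustrative):

begin;

-- copy whatever arrived after the last big batch
insert into public.next_samples_new select * from next_samples where
  id >= 15000000;

-- then swap the tables while nothing is writing to them
alter table public.next_samples rename to next_samples_old;
alter table public.next_samples_new rename to next_samples;

commit;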

We declared a couple of maintenance windows and worked our way through all the different “normal” tables that needed changing. Some got the straight alter table treatment, some got insert from select, and one of them forced me to do it in batches. None of these later maintenance windows ran very long.

A Note About Migrations

You may have noticed that I’m doing a lot of this database maintenance via the SQL console rather than migrations. I love migrations - they are one of the greatest improvements Rails has over every other framework I had used before it. They let you have lots of different dev, test and production databases without worrying about keeping their schemas in sync. Our continuous deployment system ran them as part of every deploy. Unfortunately, that system couldn’t handle a migration that took more than about 10 minutes.

Worse, when doing insert from select, renaming the tables at the end does not rename the sequence objects Postgres uses to generate id numbers. This really confused Rails' schema dumper and things would blow up. You either need to take the extra step of manually renaming those (automatically created) sequences, or avoid updating schema.rb from a database that has had tables renamed that way.
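If you do want to rename them, the wrinkle is that the old table keeps its original sequence name even after being renamed, so that name has to be moved out of the way first. Roughly, assuming the default <table>_id_seq naming:

-- the renamed-away table still owns next_samples_id_seq, so free up the name
alter sequence public.next_samples_id_seq rename to next_samples_old_id_seq;

-- then give the new table's sequence the name Rails expects to find
alter sequence public.next_samples_new_id_seq rename to next_samples_id_seq;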

As a compromise for these very long and involved changes, I’d deploy a build with a migration that would update smaller databases just fine, but with the actual changes commented out or skipped in production. It probably used the Rails change_column syntax. Once it registered as complete in production, I could uncomment the inside of the migration so my coworkers' dev environments would stay up to date.

I did the actual production changes via psql over tmux. This was a pain in the ass and more error prone than Rails migrations, so only viable for extraordinary changes. I think it is super important to keep the migrations producing a viable dev database even if they no longer contain the exact code that ran in production.

Updating the Sample Tables

At this point, I had updated all of the tables except the non-Instagram sample tables. The other tables are shared across the different types of social media data, but each sample table is specific to a single network and kind of content. The sample tables all use partitioned data, and I didn’t want to try doing insert from select on literally hundreds of daily sub-tables, never mind whether Postgres would let me change them given the way their schemas were inherited from a master table. They would have to be done with a traditional alter table statement.

Luckily, during the very first (unscheduled) outage we had noticed that the alter table statement ran much faster on the partitioned Instagram samples table than on the non-partitioned tables; something about splitting the data into a bunch of smaller tables helped it go a lot faster. I still thought the total time would be too long for a maintenance window, so I decided to do it while the system was online.
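The saving grace is that with this style of partitioning, an alter table run against the master table recurses into every child partition by default, so each network's samples still only needed one statement, something like (table name illustrative):

-- recurses into all of the inherited daily sub-tables unless ONLY is specified
alter table public.facebook_samples alter column id type bigint;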

In order to do the work while staying online, I had to add a series of on/off switches that would let me pause processing of different types of data one at a time. Up to this point, when we wanted a maintenance window we would basically deploy a change that rejected all work and just let requests fail while we changed the db. Other microservices would retry their failed requests and recover once we were back. Inelegant, but it got the job done. The new switches would instead keep accepting those requests and just queue them up. Once maintenance was over, the system worked through the backlog, and no other system saw anything but a little extra latency, as long as we didn’t fill up Redis with too many jobs.

With this in place, I was able to do the rest of the changes during a normal work day. I went through the networks one by one: stop processing Facebook pages, wait for the queues to quiesce, alter the Facebook samples table, check it, turn Facebook pages back on, make sure it’s solid, then move on to YouTube. It took a few hours to get through them all.

After all the build-up, the final changes to the samples tables were completed on a single Wednesday morning as I worked from the comfort of my back yard. I’m going to chalk this up to how doing a better job preparing for things makes them go more smoothly. As someone much wiser said: do the refactorings that will make the change you want easy (this may be hard), then make the change.

All told, the entire process from the initial outage to the final migrations took 4-5 months, in part because once we were out of crisis mode it was generally lower priority than other work. The after_create hook warning system LET it be lower priority, because as individual tables got too big for their id columns, we got reminders to take care of them. The plus side was that it gave me lots of time to think about the best way to do things, and we didn’t have to do anything while under the gun.

Published on 20/02/2017 at 15h36.
