The 202 Pattern

The Problem

A couple of years ago we were working on a reporting project that needed data that wasn’t kept online. A user would select a number of streams and a time range. The relevant log data would be downloaded from S3, indexed in Solr, queried, and the results sent to the client and displayed. Depending on which streams and how much time they asked for, this might take anywhere from a few seconds to a few hours. Once the data was loaded, though, subsequent queries were very fast. Obviously we couldn’t just have the browser stop and wait for the request to complete.

A Solution

What we settled on is something we now call the 202 pattern. The 202 HTTP response code means “Accepted”: the request has been received and is being processed. The client makes a GET or POST request. We validate it and respond with a 202 status and a response body of {try_again_in_seconds: 5}. Five seconds later the client makes THE EXACT SAME request. The server checks whether the work is done, and returns either another 202 or a 200 with the results of the query.

The client is essentially stateless. There’s no difference between starting the process and checking whether it is done. Rather than creating a request object, receiving an ID for it, then polling to see when that ID completes, it just makes the same request over and over. The server is free to coalesce requests for the same data from multiple clients into a single background job - in fact, unless you go out of your way to stop it, that happens by default. The response is also cached until/unless you go out of your way to clean it up.
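To make the flow concrete, here’s a minimal sketch of the client side. The “server” here is a stand-in lambda rather than a real HTTP call, and the names (make_fake_server, fetch_report) are illustrative, not from our codebase:

```ruby
# Fake server: responds 202 twice, then 200 with the data.
def make_fake_server
  calls = 0
  lambda do |_query|
    calls += 1
    if calls < 3
      [202, { "try_again_in_seconds" => 0 }] # 0 so the example runs fast
    else
      [200, { "data" => [1, 2, 3] }]
    end
  end
end

# The client is stateless: it just repeats the identical request
# until it gets a 200 (or gives up).
def fetch_report(server, query, max_attempts: 10)
  max_attempts.times do
    status, body = server.call(query)
    return body["data"] if status == 200
    sleep body["try_again_in_seconds"]
  end
  raise "gave up waiting for #{query.inspect}"
end

server = make_fake_server
puts fetch_report(server, { streams: [42], from: "2016-06-01" }).inspect
```

The key property is that the loop body and the initial request are the same call; there is no “have I already started this?” state to carry around.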

In the two years since, various teams around Spredfast have used the same pattern over and over, both for requests from web browsers and for requests between servers.

We call it “The 202 Pattern”. Catchy, right?

When Not To Do This

Now, like any new technique, we’ve gone a little overboard with this. In one case we used it within a single service, with one job polling to see if another job had finished. Polling like this wastes client resources. That’s OK if the client is a web browser running on a laptop. It’s not so good when the client is also a server, and those attempts to check back and see if the work is done can get in the way of actually getting the work done. If it’s reasonable to use callbacks (e.g. when the work finishes it runs FinishJob.perform_async(some_id)), that’s much more efficient overall.
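For contrast, here’s a sketch of that callback approach in plain Ruby. The QUEUE array stands in for a real job queue like Sidekiq, and build_report / deliver_report are made-up job names:

```ruby
# Instead of a second job polling until the first one finishes,
# the first job enqueues the follow-up as its last step.
QUEUE = []

def build_report(id)
  result = "report-#{id}"            # the slow work happens here
  QUEUE << [:deliver_report, result] # callback: enqueue the next job
  result
end

def deliver_report(result)
  "delivered #{result}"
end

build_report(42)
job, arg = QUEUE.shift
puts send(job, arg)
```

No cycles are spent asking “is it done yet?”; the follow-up work runs exactly once, exactly when the prerequisite completes.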

Lists

Everybody likes lists, don't they? It makes you feel like deciding on the right thing to do is somehow formal and sciency.

Pros

  • Client code stays simpler. With no extra state involved in the process, there’s no danger of losing track and starting over.
  • The client has no idea how the server is doing it - great implementation hiding. At one point we were actually kicking off a Jenkins job, and no one but us and our bartenders had to know.
  • The server can ask the client to poll back faster or slower as needs arise. Our main monolith generally runs with only a few rails processes for servicing web requests. If those get saturated it’s a problem, so it can be worth temporarily trading slower delivery times for lower server load.
  • Responses are naturally cached server side.
  • Multiple clients making the same request will all be mapped to the same background query - no duplicate work.
  • Keeps your web requests short and sweet. Does all the slow stuff in the background.
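The poll-rate point deserves a concrete example. One way to pick try_again_in_seconds is to look at how busy the web workers are. This is a made-up heuristic with invented thresholds, not what we actually shipped:

```ruby
# Hypothetical heuristic: back clients off when web workers are saturated.
def suggested_retry_seconds(busy_workers, total_workers)
  load = busy_workers.to_f / total_workers
  if load > 0.8
    30   # nearly saturated: ask clients to poll much less often
  elsif load > 0.5
    10
  else
    5    # plenty of headroom: let clients poll frequently
  end
end
```

Because the retry interval comes back in the 202 body on every poll, the server can change its answer from one poll to the next without any client-side changes.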

Cons

  • Nowhere near as simple as just doing a blocking request.
  • Definitely not RESTful (especially as we lean towards POSTing a JSON document for the query)
  • User waits a little longer for data than they strictly have to.
  • Some libraries make it hard to check for different 200 series response codes. At least one of our client implementations looks at the content of the response body to figure out whether it got a 200 or a 202 :/
  • No way to force a fresh query from scratch. At least, we haven’t settled on one, and it would mean sending a different request than the repeated 202 checks do. This hasn’t been a problem in practice.
  • Less efficient for the client than waiting for a callback.

Gotchas

  • Is user/session an implicit query parameter? Make sure to include EVERYTHING when you check to see if a query is already running, but don’t include extra stuff or you’ll wind up duplicating requests.
  • How long should results stick around? You probably don’t want to cache forever. Be intentional about this. We typically keep them in our Postgres DB, and have a periodic job that cleans up old ones.
  • Does your query include things like lists of ids? Make sure to sort them so you don’t accidentally run equivalent queries multiple times. This might even apply to the order of keys in a JSON hash. Luckily Postgres JSONB columns are smart about that.
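The last two gotchas boil down to building a canonical key for each query. Here’s one way to do it; canonicalize and canonical_key are made-up helper names, and the array sort is only safe when element order doesn’t matter (e.g. sets of ids):

```ruby
require "json"
require "digest"

# Normalize a query so equivalent requests produce identical keys.
def canonicalize(value)
  case value
  when Hash
    value.map { |k, v| [k.to_s, canonicalize(v)] }.sort.to_h
  when Array
    # Safe for sets of ids; don't do this if element order is meaningful.
    value.map { |v| canonicalize(v) }.sort_by(&:to_s)
  else
    value
  end
end

def canonical_key(user_id, query)
  # The user is an explicit part of the key, so different users
  # never get mapped onto each other's results.
  Digest::SHA256.hexdigest(JSON.generate([user_id, canonicalize(query)]))
end
```

Two requests that differ only in id order or hash key order hash to the same key, so they coalesce into one background job; the same query from a different user does not.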

Some Code

I seem to remember that people like code samples. This feels like it would be easy to generalize, but it’s also so simple that trying to make it perfectly generalized and reusable gets in the way of understanding.

def big_query
  bqr = BigQueryRequest.new query_opts  # grab just the params we want
  if bqr.valid?
    # This implies starting background processing in find_or_create.
    # I have mixed feelings about that side effect.
    query = bqr.find_or_create

    if query.done?
      json = {
        data: query.data,
        meta: {
          build_took: query.processing_finished_at - query.processing_started_at,
          query_id: query.id
        }
      }
      render json: json, status: 200
    elsif query.timed_out?
      render_msg(508, "BigQueryRequest id '#{query.id}' timed out")
    else
      json = {
        try_again_in_seconds: 10,
        query_id: query.id
      }
      render json: json, status: 202
    end
  else
    render_resource_errors bqr
  end
end

It’s only slightly more complicated than a bog standard create action.

  • Is the request valid? This is where you do permissions checks, etc.
  • Find or create if so.
  • Is it done? If so, return 200 with the results.
  • If it’s not done, make sure it didn’t time out.
  • If it’s not timed out, return 202 with a recommended retry rate.

Notice we send back our internal query id in case we need to debug why a specific request seems to get stuck in an infinite 202 loop. We don’t even expose a way to request one of these things by id, but it can be a big help when you want to look things up on the production console.

Congratulations. You have reached the end of the blog post.

Published on 30/06/2016 at 12h05.
