Memory Leak Whack-a-Mole

October 27, 2016

[devops] [puma] [heroku] [sidekiq] [redis] [rails]


(Photo credit: https://www.flickr.com/photos/tpapi/2765541278)
Memory leaks happen. They shouldn’t, but you just can’t squash enough of them. Or the leak is so slow that it takes hours to manifest. The trouble with memory leaks is that they are permanent for the life of a process: once a process has expanded its VSZ, there’s no way in Unix to shrink it. So your process starts to bloat, and eventually your servers page and thrash. Performance tanks. Then your pager starts going off and there goes your beauty sleep.

On your own servers this is when init or monit or god come into play; they watch your production processes for signs of trouble and then issue restarts. On Heroku and other PaaS platforms, you don’t have access to these tools; in fact, quite often you simply can’t restart a process at all, because the container is the process. So what can you do? Killing the container might disrupt a critical transaction and piss off a user. Are you running a bunch of background tasks in Sidekiq? What happens if you pull the plug on a 100-thread worker?

The general recipe, one that won’t impact performance or threaten data loss, is as follows:

1. Detect the problem (trickier than you think);
2. Receive a webhook;
3. Signal the process to stop accepting new work (not to confuse you, but I’m not talking about Unix signals here; I just mean “send a message from one running process to another over some channel.” signal(7) might be used as the channel, but that’s an implementation detail);
4. Wait for the process to finish all “in-flight” work and go “quiet,” and finally;
5. Signal the process to exit, allowing your cloud infrastructure to create a new container instance with a fresh memory footprint.

The challenges: you might not be able to run a monitor inside the container itself, and the logs are streaming to an outside consumer. In addition, each application has its own design for signal handling.

Detection

Heroku will insert log messages into your application’s log stream, for example Error R14, which signals that your container is over its memory quota.

Bad things are about to happen.

You need to have a service that consumes the log stream, looks for patterns (regular expressions) and then triggers an action. Logentries.com has such a facility, but others do, too. It is best to experiment with a few of them before making a commitment.

Stand back! I know regular expressions!
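
The pattern only needs to be tight enough to match Heroku’s error line and capture which dyno is complaining. Something along these lines would do; the constant name and the exact expression are mine, and the alert syntax you configure will be vendor-specific:

# Heroku's memory alerts look like:
#   "heroku web.1 - - Error R14 (Memory quota exceeded)"
# Capture the dyno name ("who") and the error code ("what").
ALERT_PATTERN = /(?<who>(?:web|sidekiq)\.\d+).*Error (?<what>R1[45])/

line  = 'heroku web.1 - - Error R14 (Memory quota exceeded)'
match = ALERT_PATTERN.match(line)
match[:who]  # => "web.1"
match[:what] # => "R14"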

Receiving the Webhook

One of Logentries’ features is the ability to send a webhook when a regular expression matches. Let’s create a controller that will accept the webhook message. (The message format is vendor-specific, but it should be well documented.)

LogentriesAuth is a special helper class to confirm that the message is authentic. We wouldn’t want some nefarious third party injecting messages into our system and restarting our servers over and over again, would we? A gist of it is available here, under the MIT license. https://gist.github.com/kmayer/f7e5a…

class LogentriesAlertsController < ApplicationController
  # The alert is POSTed by Logentries, not submitted from a browser form,
  # so there is no CSRF token to verify.
  skip_before_action :verify_authenticity_token

  def create
    return head(:unprocessable_entity) unless
      LogentriesAuth.new(request).authorized?

    # do something here...

    head :ok
  end
end

Once we’ve parsed the message and determined that we can act on it, we need to send a message, in this first case, to Puma. We’ll leverage a well-known and commonly used service, Redis, as a general-delivery post office: the controller writes a key/value pair, and Puma checks for the key puma::stop::#{hostname} (where hostname looks like “web.5”). If the key is present, Puma starts the orderly shutdown process.
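
Conceptually, the handshake between the two processes is tiny; stripped of the surrounding plumbing it amounts to this (the dyno name here is just an example):

# Controller side: leave a note for the dyno, with a TTL so the message
# simply expires if nobody picks it up.
Redis.current.set("puma::stop::web.5", "*** HEROKU ERROR R14 ***", ex: 60)

# Puma side: poll for the note and, if it's there, start an orderly shutdown.
Redis.current.get("puma::stop::web.5") # => "*** HEROKU ERROR R14 ***"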

For brevity, I’ve omitted the private helper methods used in the controller. A complete gist is available here: https://gist.github.com/kmayer/abed3…

  ...
  def create
    return head(:unprocessable_entity) unless
      LogentriesAuth.new(request).authorized?

    # `what` and `who` are private helpers (see the gist) that pull the Heroku
    # error code and the dyno name out of the webhook payload.
    if what =~ /R1[45]/ && who =~ /web\.\d+/
      # Leave a note for Puma; it expires on its own if nobody picks it up.
      Redis.current.set(
        "puma::stop::#{who}",
        "*** HEROKU ERROR #{what} ***",
        ex: 60.seconds
      )
    end
    
    head :ok
  end
  
  private
  ...

Signal the application to stop accepting new work

Puma version 3 has a new feature that allows you to add custom plugins. With one, you can use Puma’s own API to tell the server to stop; it will handle the rest of the cleanup on its own. This custom plugin checks Redis for the “message” set by our controller. Puma’s background threads are lightweight enough that they won’t impact performance.

require 'puma/plugin'
require 'redis'

Puma::Plugin.create do
  def start(launcher)
    # Heroku sets this environment variable
    hostname = ENV['DYNO'] 
    return unless hostname

    redis = Redis.current
    # we can haz Redis?
    return unless redis.ping == 'PONG' 

    signal_token = "puma::stop::#{hostname}"
    
    in_background do
      loop do
        sleep 5
        if (message = redis.get(signal_token))
          redis.del(signal_token)
          launcher.stop
          break
        end
      end
    end
  end
end
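
Puma only activates plugins that are declared in its configuration file. A minimal sketch, assuming the plugin above is saved where it can be required as puma/plugin/redis_stop (say, lib/puma/plugin/redis_stop.rb with lib/ on Ruby’s load path); the plugin name is my own invention:

# config/puma.rb
# Loads puma/plugin/redis_stop and runs its start() hook at boot.
plugin :redis_stop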

Okay, all of the pieces are in place, so we can do some integration testing. Let’s inject a log entry that looks close enough to the real thing to match our regular expressions and trigger the alert.

Rails.logger.error 'heroku web.1 - - Error R14 (Memory quota exceeded)'

If all goes well, you should see the message hit your log, then sometime later, a POST on the LogentriesAlertsController, then a key/value pair inserted into Redis and finally, your Puma instance quitting. Your PaaS will spawn a new replacement with a clean, fresh serving of memory.

Wait for the process to finish all “in-flight” work

Sidekiq is a little trickier. It has an API for going quiet, which completes all in-flight work and stops pulling new jobs off the queue, and a #stop! that signals the main process to exit. But there’s no API call that says, “go quiet, then quit.” That’s okay; it’s easy enough to build a little class that wraps the Sidekiq process-control API:

class SidekiqControl
  def initialize(pattern)
    @pattern = pattern
  end

  # Ask matching processes to stop pulling new jobs, then poll until all
  # in-flight work has drained (or we give up after ~timeout seconds).
  def quiet(timeout = 300)
    begin
      processes.each(&:quiet!)
      sleep 1
      timeout -= 1
      busy = processes
             .map { |p| p['busy'] }
             .sum
    end until busy == 0 || timeout < 0
  end

  # Drain first, then tell the processes to exit and wait until they are gone.
  def stop(timeout = 300)
    quiet(timeout)
    begin
      processes.each(&:stop!)
      sleep 1
      timeout -= 1
    end until processes.empty? || timeout < 0
  end
  
  private

  # Only the Sidekiq processes whose hostname matches the pattern,
  # e.g. "sidekiq.1" on Heroku.
  def processes
    ::Sidekiq::ProcessSet.new
      .select { |ps| @pattern === ps['hostname'] }
  end
end

Signal the Process to Exit

Unlike Puma, Sidekiq doesn’t offer a plugin architecture for running arbitrary background threads, so we’ll build a little rake task instead:

task 'sidekiq:stop', [:pattern] => 'environment' do |_task, args|
  sq_ctl = SidekiqControl.new(args[:pattern])
  sq_ctl.stop
end

and now, let’s modify the controller:

class LogentriesAlertsController < ApplicationController
  def create
  ...
    if what =~ /R1[45]/ && who =~ /sidekiq\.\d+/
      # Shell out and background the rake task so the slow drain
      # doesn't block the web request.
      system "bundle exec rake sidekiq:stop[#{who}] &"
    end
  ...
  end
end

Again, we can do a quick integration test by injecting another log entry.

Rails.logger.error 'heroku sidekiq.1 - - Error R14 (Memory quota exceeded)'

Ken, that’s a lot of work! Is it worth it?

Boy, howdy! I have a Sidekiq worker that handles a “critical” job queue, for, you know, critical jobs. Every now and then a user submits a job with a ginormous data set that makes our Nokogiri parser bloat like that guy from “Big Trouble In Little China.” If it weren’t for this suite of tools, the worker would exceed its memory allocation and grind to a slow crawl. Instead, we signal the worker, spawn a new one, and work goes on its merry way.

Winning at whack-a-mole.

This keeps my application running and my users happy. I can also service the technical debt at a time of my choosing, instead of being forced to react to a crisis. Is it perfect? Hell, no, but my pager rings less.

Oh, yeah, and remember:

Fix the damn leaks!
