Non-optimal routing for high-traffic Heroku applications

NOTE: This article was written in 2013, and things may have changed since then, although Heroku’s Dev Center still describes random allocation of requests to dynos. I’m also a happy user of Heroku; my product doesn’t come close to the workload that would be a problem for this routing setup. However, I wanted to share the (outdated) analysis below for those who are interested.

I’m a big fan of Heroku. For everything from really small to medium-sized projects, it has helped me focus on developing my applications rather than on how I deploy them. Recently, however, Heroku took some flak after RapGenius, previously a Heroku success story, noted that it did have scaling issues on the platform.

Specifically, the routing was not as smart as advertised: routers do indeed route to ‘idle’ dynos, but each router’s list of busy dynos is based only on the requests that passed through that specific router, without taking the other routers into account. Heroku, due to its size, has so many routers that the result is practically random assignment of work to the nodes. What does this mean in practice? To find out, I’ve simulated a simple application to see what happens at different loads.
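
To make that concrete, here is a minimal sketch (my own illustration, not Heroku’s actual router code) of the difference between routing with global knowledge and the effectively random routing described above. It assumes dynos expose their queue length, as the Node class further down does:

# routing_sketch.rb (illustrative only)

# With global knowledge of every queue, a router can send work to the least-busy dyno.
def route_with_global_knowledge(nodes)
  nodes.min_by(&:queue)
end

# Many independent routers, each with only local knowledge, amount to a random pick.
def route_randomly(nodes)
  nodes.sample
end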

The simulation

We’ll be simulating an application with 100 dynos. There is only 1 kind of job for the dynos, and it always takes the same time (1 tick). The number of incoming jobs is determined by the utilization level, and the jobs are randomly distributed over the dynos. So, during a tick the following happens:

  1. A node may receive 1 or more jobs in its queue (to simulate incoming work)
  2. A node will remove exactly 1 job from its queue, if it has any (to simulate doing work)

# objects.rb

# A Node models a single dyno: a counter of queued jobs, each taking 1 tick.
class Node

  attr_reader :queue

  def initialize
    @queue = 0
  end

  # One tick of work: finish at most one queued job.
  def tick
    remove_job || 0
  end

  def add_job
    @queue += 1
  end

  def remove_job
    @queue -= 1 if @queue > 0
  end

end
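
A quick sanity check in irb shows the queue behaving as intended:

node = Node.new
2.times { node.add_job }
node.tick    # finishes one queued job
node.queue   # => 1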

For the simulation, I’ve run 10 samples for each utilization level (0–110 jobs per tick; with 100 dynos each handling 1 job per tick, that corresponds to 0–110% average utilization). Each sample runs for 500 ticks, after which the queue length of every dyno is recorded; all results are then written to a CSV.

# simulation.rb
require 'csv'
require './objects.rb'

UTILIZATION = (0..110)
SAMPLES = 10
NODES = 100
TICKS = 500

data = {}

UTILIZATION.each do |util|
  data[util] = []
  SAMPLES.times do
    nodes = Array.new(NODES) { Node.new }
    TICKS.times do
      # Route `util` new jobs to random dynos, then let every dyno do one tick of work.
      util.times { nodes.sample.add_job }
      nodes.each(&:tick)
    end
    # Record the final queue length of every dyno for this sample.
    nodes.each { |node| data[util] << node.queue }
  end
end

CSV.open('../data/simulation.csv', 'w') do |csv|
  UTILIZATION.each do |util|
    csv << data[util]
  end
end

I’ve used R to read and plot the data:

[Plot: Queue size vs Dyno utilization]

library(ggplot2)

data <- read.table("../data/simulation.csv", header=FALSE, sep=",")
mean <- apply(data, 1, mean)
median <- apply(data, 1, median)
max <- apply(data, 1, max)

df <- data.frame(utilization=seq(from=0, to=110), mean=mean, median=median, max=max)

png('../plots/queue_vs_utilization.png', width=1200, height=900, res=200)
print(
ggplot(data = df) + 
  scale_shape_manual(name="Type", values=c(2,3,4)) +
  geom_smooth(aes(x = utilization, y = mean), span=0.2) +
  geom_point(aes(x  = utilization, y = mean, shape = "mean")) +
  geom_point(aes(x  = utilization, y = median, shape = "median")) +
  geom_smooth(aes(x = utilization, y = max)) +
  geom_point(aes(x  = utilization, y = max, shape = "max")) +
  scale_y_continuous("Queue size") +
  scale_x_continuous("Dyno Utilization (%)") +
  coord_cartesian(ylim=c(0, 20), xlim=c(0,100))
)
dev.off()

What does the data tell us? If you don’t want to keep your users waiting (on average!) behind more than 1 other request, you have to accept that random routing takes out about 25% of your capacity: you can only run at roughly 75% utilization, which means provisioning about 33% more dynos (100/75 ≈ 1.33). It could be even worse if several of your requests take a long time, because then the maximum queue size becomes more important, and you might even have to go down to a utilization of 50% to get decent results.
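
As a back-of-the-envelope check (my arithmetic, not output of the simulation), the overprovisioning needed grows as 1/utilization:

# Extra dynos needed for a given target utilization (illustrative):
def overprovision(target_utilization)
  1.0 / target_utilization - 1.0
end

overprovision(0.75)  # => ~0.33, i.e. ~33% extra dynos
overprovision(0.50)  # => 1.0, i.e. double the dynos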

Perhaps, by the time you are spending money on over 100 dynos, it is time to run your own load balancers.

If you want to learn more about this kind of simulation, I highly recommend Exploring Everyday Things with R and Ruby by Sau Sheong Chang, available DRM-free from O’Reilly.