Comparing Message Queue Architectures on AWS

At Forter, we crunch multiple data streams, each with its own requirements. To choose the right tool for the job, we mapped the different event dispatching architectures that suit our needs.

For transaction processing we require a durable job queue, and for event processing we require a high-throughput event queue. Both types of queues need low latency. Another consideration is a preference for Amazon-backed SaaS over Do-It-Yourself solutions.

This comparison focuses on event dispatching, and does not dive into the event processing architecture itself. Some stateful products such as Spark and In-Memory Data Grids can replace both dispatching and processing, but are not covered here. For completeness, Elastic Load Balancer is also compared to the queues, even though it is not a queue.

TL;DR

If you are light on DevOps and not latency sensitive, use SQS for job management and Kinesis for event stream processing.

If latency is an issue, use ELB or 2 RabbitMQs (or 2 beanstalkds) for job management and Redis for event stream processing.

|                               | ELB                        | SQS                          | 2 RabbitMQs            | 1 Redis (ElastiCache) | Kinesis          |
|-------------------------------|----------------------------|------------------------------|------------------------|-----------------------|------------------|
| SaaS                          | yes                        | yes                          | nope                   | yes                   | yes              |
| Low Latency                   | yes                        | ~100ms                       | yes                    | yes                   | ~5s              |
| High Throughput               | auto scalable              | auto scalable but expensive  | ~2-10k/s               | ~10-50k/s             | scalable         |
| Throughput spikes             | pre-warm                   | ok                           | ok                     | ok                    | pre-scale        |
| Processing instances needed   | peak traffic               | avg traffic                  | avg traffic (or twice) | avg traffic           | avg traffic      |
| Failure Guarantees            | client retries on failure  | at-least once                | it's complicated       | at most once          | at-least once    |
| Job Queue:                    |                            |                              |                        |                       |                  |
| reliable jobs (dead letter Q) | client reroutes on error   | supported                    | supported              | supported             | nope             |
| prioritization                | nope                       | multiple Qs                  | priority-queue plugin  | multiple lists        | multiple streams |
| Event Processing:             |                            |                              |                        |                       |                  |
| FIFO + batch                  | nope                       | not guaranteed               | ?                      | supported             | supported        |
| multiple recipients           | nope                       | multiple Qs                  | multiple Qs            | multiple lists        | supported        |

Notes:

  • https://www.cloudamqp.com provides a hosted RabbitMQ cluster service and is not covered here. However, a Highly Available RabbitMQ cluster is included below. CloudAMQP does not document what network partitioning mitigations they have configured for their RabbitMQ clusters.
  • https://redislabs.com provides a hosted Redis service and is not covered here. It uses custom replication and sharding proxies, but their assumptions are not documented anywhere.
  • http://iron.io provides a full-featured hosted queuing service. It is not covered here since their product is not based on open source, which makes it more difficult to evaluate its pros/cons.
  • beanstalkd is a really simple job queue. Consider beanstalkd as it provides a priority queue and a dead letter Q.
  • All latency and throughput figures are ballpark figures.

Service Oriented Architecture (ELB)

In classic SOA, each tier exposes a single endpoint. This is implemented using AWS Elastic Load Balancers: a public-facing ELB for the API servers, and an internal-facing ELB for the Processing Servers.

Pros:

  • The latency overhead of the ELB itself is negligible.
  • Session stickiness can be used to preload Processing Servers with cached state, and then send them requests to be processed faster.
  • Out-of-the-box ELB monitoring and automation

Cons:

  • Some API requests need higher priority than other API requests, and the ELB does not take care of that. This was one of our main problems with this architecture, especially with a mix of real-time clients and clients that send batch jobs.
  • This architecture assumes that there are enough Processing Servers to handle all requests (peak throughput). If there aren't, the Processing Server applies back pressure on the API server (error or timeout), which in turn returns an error to the API user, which in turn retries the API request (applying more pressure). To avoid this, the number of running Processing Servers needs to be enough to handle peak traffic.
  • ELB was not designed to handle huge traffic spikes, since it takes a few minutes to scale internally. You should contact AWS support to pre-warm your ELB if you have a planned traffic spike. We at Forter are in the eCommerce market, where traffic spikes are rare.
  • The API server needs to handle retries; these are not provided by the ELB itself.
  • The Processing Server must respond within the HTTP timeout (configurable between 1 and 3600 seconds, as sketched below). Otherwise the protocol would need two phases, which adds more complexity.
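
For illustration, this is roughly how that timeout could be raised, assuming boto3 and a classic ELB (the load balancer name below is hypothetical):

```python
import boto3

elb = boto3.client('elb', region_name='us-east-1')
elb.modify_load_balancer_attributes(
    LoadBalancerName='processing-internal-elb',
    LoadBalancerAttributes={
        'ConnectionSettings': {'IdleTimeout': 300},  # seconds, between 1 and 3600
    },
)
```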

Master Worker Architecture (SQS)

The Master Worker pattern is implemented by introducing a queuing service. The API server puts the request at the end of the queue, and each Processing Server pulls one request from the beginning of the queue.
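
A minimal sketch of this flow, assuming boto3 (the queue name and payload are hypothetical):

```python
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = sqs.get_queue_url(QueueName='transactions')['QueueUrl']

# API server: put the request at the end of the queue.
sqs.send_message(QueueUrl=queue_url, MessageBody='{"transaction_id": "foo"}')

# Processing server: long-poll for one request, process it, then delete it.
def process(body):
    print('processing', body)  # placeholder for the real processing logic

resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
for msg in resp.get('Messages', []):
    process(msg['Body'])
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
```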

Pros:

  • SQS guarantees message delivery.
  • Can handle large spikes of traffic that waits in the queue.
  • The number of processing servers needed could be reduced to handle average load (though sacrificing latency).
  • Request prioritization – The API server may insert different requests into different queues. The Processing Servers remove requests first from the high-priority queue, and only if it is empty, from the lower-priority queue (see the sketch after this list).
  • Request retries
  • Out-of-the-box SQS monitoring and automation
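
A sketch of that prioritization polling loop, again assuming boto3 and two hypothetical queues:

```python
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
high_q = sqs.get_queue_url(QueueName='transactions-high')['QueueUrl']
low_q = sqs.get_queue_url(QueueName='transactions-low')['QueueUrl']

def next_message():
    """Return (queue_url, message), draining the high priority queue first."""
    for url in (high_q, low_q):
        resp = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=1)
        msgs = resp.get('Messages')
        if msgs:
            return url, msgs[0]
    return None, None
```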

Cons:

  • The problem for us at Forter with SQS is that it is too slow for real-time transactions (end-to-end delay can reach 100ms), and its per-message pricing is too expensive for high-throughput event analytics. So it's a lose-lose tradeoff for us.
  • Does not guarantee FIFO processing

DIY Master Worker Architecture (RabbitMQ)

We replace SQS with two EC2 instances running RabbitMQ, each in a separate availability zone. API servers round-robin between the RabbitMQ servers, and each Processing Server listens for messages on both RabbitMQ servers.
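
A sketch of the API server side, assuming the pika client; hostnames and queue name are hypothetical:

```python
import itertools
import pika

# Round-robin between the two standalone brokers (a real client would keep
# connections open and skip a broker that is down).
hosts = itertools.cycle(['rabbitmq-a.internal', 'rabbitmq-b.internal'])

def publish(body):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=next(hosts)))
    channel = connection.channel()
    channel.queue_declare(queue='transactions', durable=True)
    channel.basic_publish(
        exchange='',
        routing_key='transactions',
        body=body,
        properties=pika.BasicProperties(delivery_mode=2),  # persist the message to disk
    )
    connection.close()
```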

Pros:

  • Negligible latency.
  • Can handle large spikes of traffic that waits in the queue.
  • The number of processing servers needed could be reduced to handle average load (though sacrificing latency).
  • Request prioritization (with multiple queues or with the priority plugin)
  • Request retries

Cons:

  • Not as a service. DevOps mana can be better used somewhere else.
  • No message delivery guarantee in face of RabbitMQ server failure.
  • Manual Discovery – Discovering the RabbitMQ instances requires reaching out to the EC2 API (describe instances) to retrieve the latest RabbitMQ IP addresses before connecting/reconnecting (see the sketch after this list).
  • Limited throughput (~10k/sec)
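
A sketch of that discovery step with boto3; tagging the RabbitMQ instances with a "role" tag is a hypothetical convention:

```python
import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
resp = ec2.describe_instances(Filters=[
    {'Name': 'tag:role', 'Values': ['rabbitmq']},
    {'Name': 'instance-state-name', 'Values': ['running']},
])
rabbitmq_ips = [
    instance['PrivateIpAddress']
    for reservation in resp['Reservations']
    for instance in reservation['Instances']
]
```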

DIY HA Master Worker Architecture (RabbitMQ HA)

At Forter we made two assumptions. First, our Processing Servers have enough power to keep RabbitMQ empty most of the time, which reduces message loss in case of failures. Second, the API server degrades gracefully in case the Processing Server did not get the message. So while HA is definitely on our roadmap, we are not stressed out about it.

Deploying your own RabbitMQ HA solution on EC2 requires attention to detail.

In order to recover from a single instance failure, we would need to hack RabbitMQ cluster discovery to remove the old IP and add the new IP to the cluster.

Split brain happens when network latencies between EC2 availability zones rise. When it comes to the CAP theorem, we at Forter would prefer at-least-once messaging semantics (Availability) over at-most-once messaging semantics (Consistency). RabbitMQ can lose messages when handling split brains. It has three strategies:

  • pause_minority configuration – One of the availability zones stops working. The next RabbitMQ version is going to reduce the time it takes to stop working, so as to reduce the number of accepted messages that are never processed. Even with that fix, this solution prefers Consistency over Availability, which is not what we require.
  • auto_heal configuration – All availability zones keep working (separately); however, when the network partition is resolved, one of the availability zones deletes all of its messages, which is bad. The reason is that RabbitMQ has not implemented queue merging semantics for network partition recovery.
  • ignore – No message is lost; however, the cluster will no longer act as a cluster after the network is back to normal, so each time that happens the DevOps team would need to deploy a new version into production (where the cluster is intact).

DIY Twice Master Worker Architecture (RabbitMQ)

Avoiding message loss can also be implemented at the application level. Some requests generate more income than others, so for those we might be willing to spend twice the EC2 costs. The API server could send some requests to both RabbitMQ servers (running standalone, not as a cluster).

For example, given transaction foo, the API server inserts it into the higher-priority queue of the first server as foo1 and into the lower-priority queue of the second server as foo2. The API server then waits for either foo1 or foo2 to complete and returns whichever result arrives first.
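
A sketch of the double publish, again assuming pika; hostnames, queue names, and the copy field are hypothetical:

```python
import json
import pika

def publish_to(host, queue, payload):
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_publish(exchange='', routing_key=queue, body=json.dumps(payload))
    connection.close()

transaction = {'transaction_id': 'foo'}
# High priority copy on the first broker, low priority copy on the second.
publish_to('rabbitmq-a.internal', 'transactions-high', dict(transaction, copy='foo1'))
publish_to('rabbitmq-b.internal', 'transactions-low', dict(transaction, copy='foo2'))
# The API server then waits for whichever of foo1/foo2 completes first.
```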

Pros:

  • Can handle 1 failure (either an instance or network partition)
  • Much lower 99th-percentile processing latency, since chances are low for both foo1 and foo2 to be stuck in a GC pause or virtualization pause.
  • Negligible latency.
  • Can handle large spikes of traffic that waits in the queue.
  • Request prioritization (with multiple queues or with the priority plugin)

Cons:

  • Sacrifices the latency of low-priority requests in favor of high-priority ones. To handle low-priority requests at peak traffic, twice the number of Processing Servers is needed.
  • May require deduplication semantics at the application level to reconcile the foo1 and foo2 results.
  • Limited throughput (~10k/sec)
  • Not as a service.
  • Manual Discovery

DIY Event Streaming Architecture (Redis)

Redis is a fast single-threaded server for storing useful data structures in memory (lists, dictionaries, etc.). AWS provides a managed Redis service (ElastiCache) which can hold a list of unprocessed events.

Each API server inserts to the beginning of the list. Each Processing server removes messages from the end of a single list, effectively implementing a queue.

In order to support N readers handling different messages we would need N lists. The API server would use consistent hashing of the event source to route the event to the correct list, effectively implementing FIFO shards.

In order to support K readers handling the same messages we would need K lists. The API server would write the same message to all K lists.
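
A sketch of this with redis-py; the ElastiCache endpoint, key names, and shard count are hypothetical, and a simple hash-mod stands in for real consistent hashing:

```python
import hashlib
import json
import redis

r = redis.StrictRedis(host='events.abc123.cache.amazonaws.com', port=6379)
NUM_SHARDS = 4        # one list per reader
MAX_LEN = 100000      # cap list length so Redis does not run out of memory

def enqueue(event):
    shard = int(hashlib.md5(event['source'].encode()).hexdigest(), 16) % NUM_SHARDS
    key = 'events:%d' % shard
    pipe = r.pipeline()
    pipe.lpush(key, json.dumps(event))     # API server inserts at the beginning
    pipe.ltrim(key, 0, MAX_LEN - 1)        # discard the oldest events beyond the cap
    pipe.execute()

def dequeue(shard, timeout=5):
    item = r.brpop('events:%d' % shard, timeout=timeout)  # reader pops from the end
    return json.loads(item[1]) if item else None
```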

Pros:

  • Negligible latency.
  • Very high throughput (~10-50k events/sec).
  • Request prioritization (multiple lists).
  • FIFO and batch reads are supported (per list).

Cons:

  • Events are lost in case of a Redis node or Processing Server failure (at-most-once semantics).
  • The API server needs to discard old events from time to time to prevent Redis from running out of memory.
  • Redis stores everything in memory, which means the per-GB price is considerably higher.
  • ElastiCache Redis does not support sharding; it needs to be done client-side.
  • All retries are handled by the Processing Server.

Event Streaming Architecture (Kinesis)

Kinesis exposes an API for streams (buffers stored on disk) that keep events in order for 24 hours (similar to Kafka). It can be used as an event queuing service as long as the client keeps track of the buffer offset from which to continue reading.
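
A sketch of such a stateful reader with boto3; the stream name, shard id, and handle function are hypothetical:

```python
import time
import boto3

kinesis = boto3.client('kinesis', region_name='us-east-1')

def handle(data):
    print('processing', data)  # placeholder for the real processing logic

def read_shard(stream, shard_id, last_seq=None):
    # Resume after the last checkpointed sequence number, or start from the
    # oldest retained record (TRIM_HORIZON) on a cold start.
    if last_seq:
        it = kinesis.get_shard_iterator(
            StreamName=stream, ShardId=shard_id,
            ShardIteratorType='AFTER_SEQUENCE_NUMBER',
            StartingSequenceNumber=last_seq)['ShardIterator']
    else:
        it = kinesis.get_shard_iterator(
            StreamName=stream, ShardId=shard_id,
            ShardIteratorType='TRIM_HORIZON')['ShardIterator']
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=100)
        for record in resp['Records']:
            handle(record['Data'])
            last_seq = record['SequenceNumber']  # checkpoint somewhere durable
        it = resp.get('NextShardIterator')
        time.sleep(1)  # reads are HTTP polls, hence the few seconds of latency
```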

Pros:

  • Highly Available
  • Very high write throughput (sharded). Usually more shards are needed for read capacity.
  • Same event can be consumed by different processing servers (one reading batches to S3, and another reading in realtime)
  • Events are in order (per shard)

Cons:

  • Reading clients are stateful (need to keep track of last index read, what shards to read from, other failed clients that require handover of work)
  • A few seconds of latency (HTTP polling every ~5 seconds by each reader). Had we chosen Kafka, we would have enjoyed shorter latencies, so we might switch from Kinesis to Kafka in the future.
  • All retries are handled by the processing server.
  • All events need to be processed within 24 hours, otherwise data is lost.

Discuss on Hacker News