9.5 Low Latency Decision as a Service Design Patterns

A few hundred milliseconds of latency is achievable for a complex fraud prevention system, but with very little wiggle room. In the past two years, we have settled on a few design patterns that have helped us achieve our latency goals, using the standard technologies most SaaS companies already use (load balancers, queues, the JVM, REST APIs, etc.). Here, we share 9 and a half tips that have helped us build a system we’re proud of.

Some Context First

Low latency and availability are essential to our product – and our SaaS product is the center of our business. Forter provides Fraud Prevention Decision as a Service for eCommerce merchants (retailers with either a web or mobile presence). That’s a fraud decision for every transaction, in real time. Our clients call Forter’s API during the checkout phase of their workflow. This is a very sensitive place to be, since a failure at that point could result in loss of business for our customers.

Low latency is just as crucial to many other SaaS companies. If your service does not respond in time, it is not available. It’s as simple as that. For us, low latency means a few hundred milliseconds. To compare, here are some ballpark figures (these are above-median latencies; the long-tail figures could be worse):

  • A few network hops between AWS availability zones (~1ms)
  • JVM young garbage collection of a few hundred MBs (~20ms)
  • Read by primary key from SSD storage (~30ms)
  • US West Coast to East Coast network latency (~100ms)
  • JVM full garbage collection of a few hundred MBs (~500ms-1000ms)

1. Measure your API Latency as a Function of Probability

If our API does not respond on time, our clients cannot continue their workflow. So, internally, we log the latency of each API request, and calculate percentiles based on that raw data. For example, the API endpoint may return a decision within 300ms 90% of the time, and within 500ms 99.9% of the time.
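
As a minimal sketch (not our production code), percentiles can be computed offline from the raw latency measurements; the sample values and the nearest-rank method here are just for illustration:

import java.util.Arrays;

// Sketch: compute latency percentiles from raw per-request measurements.
// The latenciesMs array stands in for values parsed out of the API request log.
public class LatencyPercentiles {
    static long percentile(long[] sortedLatenciesMs, double p) {
        // Nearest-rank percentile: pick the value at the p-th rank of the sorted array.
        int index = (int) Math.ceil((p / 100.0) * sortedLatenciesMs.length) - 1;
        return sortedLatenciesMs[Math.max(index, 0)];
    }

    public static void main(String[] args) {
        long[] latenciesMs = {120, 180, 210, 250, 260, 290, 310, 340, 480, 900};
        Arrays.sort(latenciesMs);
        System.out.println("p90   = " + percentile(latenciesMs, 90) + "ms");
        System.out.println("p99.9 = " + percentile(latenciesMs, 99.9) + "ms");
    }
}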

This measurement gets tricky when using a stress test tool (there is a great presentation about exactly that if you’d like to know more: How NOT to Measure Latency). The gist is that a typical benchmarking tool such as Apache Bench, when stuck waiting for the one response that never comes back (a hiccup), does not account for the requests it didn’t even send during that time.

Another problem with a stress test tool is that it uses synthetic requests, which simulate a best-case scenario in which the caches are already hot, something that is generally not true for real-world transactions.

Measuring latency is essential, but do it the right way. Use percentiles, and distinguish real traffic from synthetic requests.

2. Define How the Client Should React in Case of a Timeout

The REST API client should place a timeout on the API call and act when there is an error or a timeout: it can either retry the request or perform a predefined default action. In both cases, the service needs to be designed to support it.

Having the client retry means that the service needs to be idempotent. For example, let’s say that due to a network hiccup, a request with id 1234 has timed out. A second (retry) request with the same id 1234 is then sent to the service. The first request may still have reached the service, which means the service received two requests with the same ID. The service should handle that gracefully, and in both cases it should return the same decision.
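
A minimal sketch of one way to get that idempotency; the in-memory map and the placeholder decision logic are illustrative (a real service would persist decisions):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of idempotent decision handling: the first request for a transaction id
// computes and stores the decision; a retry with the same id returns the stored
// decision instead of deciding again.
public class IdempotentDecisions {
    private final ConcurrentMap<String, String> decisionsByTxId = new ConcurrentHashMap<>();

    public String decide(String txId) {
        // computeIfAbsent guarantees the decision function runs at most once per id.
        return decisionsByTxId.computeIfAbsent(txId, id -> runFraudChecks(id));
    }

    private String runFraudChecks(String txId) {
        return "APPROVE"; // placeholder for the actual decision logic
    }

    public static void main(String[] args) {
        IdempotentDecisions service = new IdempotentDecisions();
        System.out.println(service.decide("1234")); // original request
        System.out.println(service.decide("1234")); // retry returns the same decision
    }
}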

Having the client choose a default action means that the transaction entry in our customer’s database may have a different value from the one in our databases. For example, transaction id 1234 timed out, and the client decided to approve the transaction. If the request reached the service, it may have been registered as an approved or a declined transaction; if it never reached the service, it would not have been registered at all. For these cases, a reconciliation process might be needed at the end of the billing period, in which these small discrepancies are sorted out.
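
A sketch of what the client side could look like, with a hard timeout and a predefined default action; the endpoint URL, the timeout values, and the "APPROVE" default are illustrative, not our actual API:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch of the client side: call the decision API with a hard timeout and fall
// back to a predefined default action if it does not answer in time.
public class DecisionClient {
    public static void main(String[] args) {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofMillis(200))
                .build();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://decision.example.com/v1/transactions/1234"))
                .timeout(Duration.ofMillis(500)) // end-to-end budget for this call
                .GET()
                .build();
        String decision;
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            decision = response.body();
        } catch (Exception e) {
            // Timeout or network error: take the predefined default action
            // (or retry, which requires the service to be idempotent).
            decision = "APPROVE";
        }
        System.out.println("decision = " + decision);
    }
}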

The customers care about end-to-end service availability. Network hiccups happen, but they are simple to resolve by making the REST API client part of a coherent decision flow.

3. All Subsystems Must Be Able to Make a Decision

Continuous Deployment is great, since it allows us to change code rapidly. However, the reality is that code that changes can introduce new, unexpected latency or stability issues, and is thus less trustworthy (stable) than code that has worked flawlessly for the past weeks or months.

Our Storm code, written in Java, changes the most. If that component fails to decide on time (approve or decline a transaction), the API servers, written in Node.js, make a decision; if those fail, the nginx Lua code makes a decision. If we hadn’t designed our system that way, the client would need to make the decision for us, which is something we would like to minimize.
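
A minimal sketch of the idea behind such a layered decision; the layer implementations and the 300ms budget are illustrative, not our actual topology:

import java.util.concurrent.*;

// Sketch of a layered decision: ask the full decision engine with a deadline, and
// fall back to a simpler rule if it does not answer in time.
public class LayeredDecision {
    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public String decide(String txId) {
        Future<String> full = executor.submit(() -> fullDecisionEngine(txId));
        try {
            return full.get(300, TimeUnit.MILLISECONDS);  // primary layer
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            full.cancel(true);
            return simpleFallbackDecision(txId);          // degraded layer
        }
    }

    private String fullDecisionEngine(String txId) throws Exception {
        Thread.sleep(50); // stand-in for the full model evaluation
        return "APPROVE";
    }

    private String simpleFallbackDecision(String txId) {
        return "APPROVE"; // stand-in for a conservative rule-based decision
    }

    public static void main(String[] args) {
        LayeredDecision d = new LayeredDecision();
        System.out.println(d.decide("1234"));
        d.executor.shutdown();
    }
}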

It’s ok for the system to perform graceful degradation once in a while, but over some threshold we need to roll back to a previous version (calculating that threshold is for another blog post). Our reality is that we always have two production systems online (we call them “prod” and “old prod”). In case of a graceful degradation alert, we have the option to switch back (via the Amazon ELB API) to the “old prod”, giving us enough time to fix the problem.

There is a famous quote from an Israeli cult movie, where Krembo is asked how to win a swimming race: “You start with your fastest sprint, and gradually pick up the pace.” As a startup we must keep moving faster as we grow (not slower). That means more developers introducing more code changes every day. A multi-layered decision architecture, plus a hot standby, serves as our safety net.

4. Handle Excess Traffic

Don’t Push Back to the Client

Some of our customers might send us transactions in batches. For example, in the case of a problem with their backend systems, they might want to rerun some of the transactions. If we performed simple throttling, the API would have to return an error, which is unacceptable in a Decision as a Service system. If we did not perform throttling, we would run the risk of overloading our real-time system, which would trigger the graceful degradation described above, something we would like to minimize.

So we do perform throttling, but instead of pushing back an error to the client, we divert the excess customer traffic to a different set of machines. We use a simple master/worker pattern in which different queues have different sets of workers, and we effectively push the transaction into a lower-priority queue.
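
A sketch of that routing idea, assuming an illustrative per-tenant in-flight threshold and worker pool sizes:

import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of diverting excess tenant traffic instead of rejecting it: requests over a
// per-tenant in-flight threshold go to a low-priority queue served by a separate
// (smaller) worker pool.
public class PriorityRouting {
    private final BlockingQueue<Runnable> realTimeQueue = new LinkedBlockingQueue<>();
    private final BlockingQueue<Runnable> lowPriorityQueue = new LinkedBlockingQueue<>();
    private final ExecutorService realTimeWorkers =
            new ThreadPoolExecutor(8, 8, 0, TimeUnit.SECONDS, realTimeQueue);
    private final ExecutorService lowPriorityWorkers =
            new ThreadPoolExecutor(2, 2, 0, TimeUnit.SECONDS, lowPriorityQueue);
    private final ConcurrentMap<String, AtomicInteger> inFlightPerTenant = new ConcurrentHashMap<>();

    public void submit(String tenantId, Runnable transaction) {
        AtomicInteger inFlight = inFlightPerTenant.computeIfAbsent(tenantId, t -> new AtomicInteger());
        // Under the threshold: real-time workers. Over it: low-priority workers.
        ExecutorService target = inFlight.incrementAndGet() <= 100 ? realTimeWorkers : lowPriorityWorkers;
        target.execute(() -> {
            try {
                transaction.run();
            } finally {
                inFlight.decrementAndGet();
            }
        });
    }
}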

As with any multi-tenant system, isolation between tenants is a key component. More traffic is good for business, and shouldn’t be a cause for panic. By dynamically isolating traffic with different SLA requirements, we can keep all of our customers happy.

5. Use Dynamic Timeouts for I/O Operations

Each I/O operation (such as an SQL query or an internal Rest API call) must have a timeout. That timeout is a function of the minimum/maximum time for the operation to complete, and the time-to-live (remaining time) of this specific request.

Global Timeout

The timeout as defined by the SLA to the customer (median is good).

Minimum Timeout

The time for the I/O operation to succeed 80% of the time. For example, most Redis read operations should return within 5ms.

Maximum Timeout

The time for the I/O operation to succeed 99.9% of the time. For example, reading by primary key from an in-memory data store (such as Redis) should not take more than 10ms, and reading by primary key from a disk-based data store (such as MySQL) should not take more than 100ms. If it takes longer, then there is either a network hiccup (that was not detected as a failure), a data store hiccup, or some type of congestion, and we consider the operation to have failed.

Absolute Timing

Absolute timing uses the current clock time and the time the request was received, in order to determine if we should use the maximum timeout, the minimum timeout, or somewhere in between.

UpstreamLatency = (CurrentTime - RequestReceivedTimestamp)
DownstreamLatency = Approximate time needed to return a response to the client after this I/O operation completes.
Timeout = Global Timeout - (UpstreamLatency + DownstreamLatency)
if (Timeout > MaximumTimeout) Timeout = MaximumTimeout
if (Timeout < MinimumTimeout) Timeout = MinimumTimeout
The three most important things in real-time systems are timing, timing and timing. In most cases, we would use the maximum timeout. However, in cases where a garbage collection or a network hiccup occurred, we would speed up the rest of the processing and lean towards the minimum timeout.
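
The same rule in Java, as a minimal sketch (the 500ms global SLA and the Redis-like 5ms/10ms bounds are just the example values from above):

// Sketch of the dynamic-timeout rule: clamp the remaining request budget into
// [minTimeoutMs, maxTimeoutMs], leaning towards the minimum when a GC pause or
// network hiccup has already eaten into the budget.
public class DynamicTimeout {
    static long ioTimeoutMs(long globalTimeoutMs, long requestReceivedAtMs,
                            long downstreamLatencyMs, long minTimeoutMs, long maxTimeoutMs) {
        long upstreamLatencyMs = System.currentTimeMillis() - requestReceivedAtMs;
        long timeLeftMs = globalTimeoutMs - (upstreamLatencyMs + downstreamLatencyMs);
        return Math.max(minTimeoutMs, Math.min(maxTimeoutMs, timeLeftMs));
    }

    public static void main(String[] args) {
        long requestReceivedAtMs = System.currentTimeMillis() - 120; // 120ms already spent upstream
        System.out.println(ioTimeoutMs(500, requestReceivedAtMs, 50, 5, 10) + "ms");
    }
}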

6. Use Auto-Healing (Automatic Failover)

This one is obvious, but it must be mentioned too. The reason it doesn’t go without saying is that we humans are not fast enough to react to failures, and there are such good auto-healing tools out there that it’s just a shame not to use them.

Automatic Process Restart

When an exception occurs in the code, it should bubble up the stack and crash the process (fail fast, unless the code explicitly handles retries or batch operations, where exceptions can be handled locally). The reason for fail-fast is that we do not trust the process state once an exception occurs. When the process dies, an external operating system daemon (such as upstart, systemd, or similar) should start it again immediately.
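
One way to get that fail-fast behavior on the JVM, as a sketch (an external daemon such as upstart or systemd is assumed to restart the process afterwards):

// Sketch of fail-fast: let unexpected exceptions crash the process and rely on the
// OS daemon to restart it. A default uncaught-exception handler that halts the JVM
// is one way to do this.
public class FailFast {
    public static void main(String[] args) {
        Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
            throwable.printStackTrace();   // leave a trace for the post-mortem
            Runtime.getRuntime().halt(1);  // do not trust the process state; die immediately
        });

        // Any worker thread that throws now takes the whole process down.
        new Thread(() -> { throw new IllegalStateException("unexpected state"); }).start();
    }
}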

Automatic Virtual Machine Replacement

When an EC2 instance fails unexpectedly, an EC2 Auto Scaling group detects it, terminates the instance, and starts an equivalent instance in its place. In rare cases it takes quite a while to detect a problem, as the instance’s hardware degrades slowly; in those cases the scaling group can take hours to respond.

HTTP Load Balancing

When a web server becomes unhealthy (for whatever reason, such as a single point of failure or an availability zone problem), the Elastic Load Balancer diverts all traffic to the remaining healthy instances.

HTTP Fencing

When the entire service stops responding, you can use Route 53 failover (or the Incapsula advanced load balancer) to divert all traffic to a standby service and trigger an alert.

Auto-healing is awesome, and doesn’t take much time to set up properly. Everyone does it, and so should you.

7. Plot Latencies, It’s a Big Time Saver

We have an internal monitoring system (worth a separate post) that can plot, for each transaction, the latency and the absolute timing of each Storm bolt (names redacted from the snapshot).

A different view shows a timeline of the bolt latencies.

And of course there is a time-based view for multiple transactions, including latency percentiles on the right, exception plots, filters per deployment, per host, and per transaction type, and much more. One of the first things we learned from the time-based plots is that our system needs some time to warm up before it performs at its best, which has implications for the deployment life cycle.

One of the first things you learn about low latency monitoring is that the CPU graph doesn’t say much. The CPU probing doesn’t have high enough resolution to capture millisecond CPU spikes, and some of the latency is due to I/O operations, which do not register as CPU time. So unless the system is overloaded, the CPU will always show some low average. As for profiling tools: tracing profilers skew the latency results too much, and unless the system is overloaded, sampling profilers show most of the time being spent in scheduler threads.

The most reliable way we found to detect latency problems is to take a timestamp before and after every sub-component of the system. Once the problematic sub-component is detected, first look for incriminating recent code changes. If you can’t find the problem, reproduce it with a unit test together with a profiler to pinpoint the root cause.
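
A minimal sketch of that timestamp-around-every-sub-component idea; the step names are illustrative:

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Sketch: wrap each processing step and record its latency so a per-transaction
// breakdown can be plotted later.
public class StepTimer {
    private final Map<String, Long> stepLatenciesMs = new LinkedHashMap<>();

    public <T> T time(String stepName, Supplier<T> step) {
        long start = System.nanoTime();
        try {
            return step.get();
        } finally {
            stepLatenciesMs.put(stepName, (System.nanoTime() - start) / 1_000_000);
        }
    }

    public static void main(String[] args) {
        StepTimer timer = new StepTimer();
        String features = timer.time("feature-extraction", () -> "features");
        String decision = timer.time("model-evaluation", () -> "APPROVE");
        System.out.println(timer.stepLatenciesMs); // e.g. {feature-extraction=0, model-evaluation=0}
    }
}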

8. Know Your (Latency) Enemy

In many cases, however, latency monitoring reveals degradation that affects multiple components at once. Some of those components may not even perform I/O operations, or have had no recent code changes. In those cases, we start looking for the usual suspects.

AWS EC2 Flakiness

We only use c3.large instances or higher, for the sake of our sanity. Any EC2 instance (that is not on dedicated hardware) can fall victim to being hosted in a tough neighborhood. It happens less with larger instances. However, once you have tens of EC2 instances, you should expect at least one machine per month to act strangely (as indicated by weird CPU flakiness in the EC2 graph, even though nothing changed on the instance itself). Sometimes the gods of luck provide us with good machines and we get a 100ms latency improvement without changing a single bit of code.

Logging/Eventing

Many logging frameworks are synchronous and contribute to high latency. Even asynchronous event clients (such as Riemann clients) may apply pushback, which in some extreme cases (unless configured otherwise) blocks the sending thread when all network queues are full.

Regular Expressions

Some regexes are a real piece of work. What you need to know is that given the right regular expression and the right input, you can get a CPU core stuck at 100% for a very long time (see Catastrophic Backtracking). We use a custom automated tool that tries to find these little gremlins at build time, but it’s not perfect yet (worth an open source project and a blog post, I know). In Node.js, a runaway regex blocks the entire process, not just a single core. In Python, it holds the global interpreter lock, which is just as bad.
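
A small illustration of catastrophic backtracking on the JVM; the pattern (a+)+b is a classic textbook example, not one from our codebase:

import java.util.regex.Pattern;

// Sketch of catastrophic backtracking: matching "(a+)+b" against a string of only
// 'a's forces the engine to try exponentially many ways to split the input, so each
// extra character roughly doubles the running time.
public class RegexBacktracking {
    public static void main(String[] args) {
        Pattern evil = Pattern.compile("(a+)+b");
        for (int n = 20; n <= 26; n++) {
            String input = "a".repeat(n); // no 'b', so the match must fail
            long start = System.nanoTime();
            evil.matcher(input).matches();
            System.out.println(n + " chars: " + (System.nanoTime() - start) / 1_000_000 + "ms");
        }
    }
}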

TCP Nagle Algorithm and Delayed ACKs

The TCP Nagle algorithm is on by default in most operating systems and programming languages. Its purpose is to reduce the number of TCP packets that need to be sent over the network, at the expense of latency. If you send a small HTTP request, the Nagle algorithm waits for either:

  • More data to be sent in the same TCP packet, or
  • An ACK from the TCP receiver (for data previously sent).

Alas, most TCP receiver implementations also implement delayed ACKs in order to reduce the number of ACKs (the ACK can be delayed by up to 500ms), so the TCP sender can be delayed by up to 500ms when the Nagle algorithm is turned on. In practice it introduces latency of around 30ms, with spikes of 200ms. To overcome this, set the TCP_NODELAY socket option to true (preferably on both the client and the server).
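
On the JVM this is a one-liner; a sketch (the host and port are placeholders):

import java.net.Socket;

// Sketch: disabling Nagle on a JVM socket. setTcpNoDelay(true) sets the TCP_NODELAY
// option; do the same on the server-side socket for the full effect.
public class DisableNagle {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("example.com", 80)) {
            socket.setTcpNoDelay(true); // send small writes immediately, do not wait for ACKs
            System.out.println("TCP_NODELAY enabled: " + socket.getTcpNoDelay());
        }
    }
}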

JVM Garbage Collection, LRU caches and memory leaks

Most modern languages require some tweaking of the GC settings, and the JVM provides plenty of options to tweak. As a rule, a stateless system should not require much JVM memory. Most transactions evaporate after less than a second, which means their objects can happily stay in the young generation, which is faster to clean up.

Any LRU cache (Least Recently Used eviction, such as Guava Cache) must not be hosted in the same JVM process. Use Memcached/Redis instead, but for the love of GC, do not host Java objects that survive the young generation and get promoted to the old generation. It’s ok to load lookup tables into the old generation, since they never require cleaning, but LRU caching is bad.

Beware of memory leaks. This is what a memory leak looks like: the GC tries to clean up memory but can’t due to the leak, and it tries more and more often as time progresses.

It’s ok to print GC details in production; it should have no effect on performance. If you are on EC2, log to the ephemeral local drive, and don’t use a network drive for GC logging, since it’s synchronous.

-XX:+PrintGCDetails -Xloggc:/mnt/logs/gc-worker-pid%p.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=10M -XX:-PrintTLAB

There is a tradeoff between the amount of work done during young GC and old GC. We run a small young generation and a small old generation heap and force very aggressive GC settings. We increase the heap sizes only when the GC logs indicate that we have to.

Good old parallel GC offers a great performance tradeoff for small heaps. CMS has painful full GCs, but it might be worth evaluating for your use case.

-XX:+UseParallelGC -XX:+UseParallelOldGC

1024MB old, 768MB young (= 77MB x2 survivors + 615MB eden)

-Xms1792m -Xmx1792m -Xmn768m -XX:SurvivorRatio=8

Do not move from survivor space to old generation unless you have no other choice.

-XX:MaxTenuringThreshold=15

If a full GC has already started, then make the most of it (it’s easier to debug memory leaks that way, too). This one is controversial at best.

-XX:HeapMaximumCompactionInterval=1

Use thread-local object allocation buffers (less multi-threading contention).

-XX:+UseTLAB -XX:+ResizeTLAB

No matter what your runtime environment or programming language is, optimizing for low latency has low-hanging fruit. The details change, but the principles are the same. Do not perform premature optimizations, but do take a moment to disable all loggers, or tail the garbage collection logs, or make a side-by-side latency comparison of the same code running on different EC2 instances on different days.

9. Low-Latency Tweaks for Data Stores

Most data store configurations are geared towards higher throughput, and benchmarks usually measure throughput (ops per second) as well. When comparing various SQL/NoSQL stores, make sure to think about low latency and compare apples to apples.

Indexes in Memory or Disk?

Are the indexes served from memory or disk? For example, Couchbase has a default configuration in which all indexes are loaded into memory. Other NoSQL stores depend on the file system cache, which after warm-up could hold all indexes in memory.

Data in Memory or Disk?

Is the data served from memory or disk? For example, with Redis all data is in memory by default, so fetches have lower latency. However, memory is about 3 times more costly than SSDs. So if you are considering reading the data from disk, benchmark the NoSQL store against your own use case. Also, in the low latency use case, more shards will probably perform better than adding more read replicas, due to better file system cache utilization (fewer disk copies, less cache needed).

Available CPU Cores?

Do you have enough available CPU cores to handle the incoming requests? For example, with Elasticsearch, even if all data is cached, some complex queries require a dedicated CPU core, and sending more queries than there are available cores means delaying some of them. Most systems never care about that latency spike, since it has a negligible effect on throughput; however, if you are trying to improve the 99th percentile of latency, the number of CPU cores could matter.

Read vs Write Trade-offs

Many data stores have different write/read trade-offs. For example, a SELECT * returning tens of rows could be slower than SELECTing a single field, so at write time you could lump all the required data in advance into a single field (as a K/V store would). Similarly, with a document database, requesting the complete document could take some time to push through the network; alas, requesting specific fields requires parsing the document on the server side and creating a new document before sending it to the client (which is worse). In that case, creating a scaled-down document at write time would reduce latency at read time.

Client I/O Threads

Many data store clients are optimized for high throughput and less so for low latency. For example, if you don’t have enough TCP connections open, some requests could be queued on the client side. In one case, we got lower (better) latency by proxying the request through a localhost proxy, only because we could set the proxy process priority higher: when the client was hosted in a Java process with too many CPU-hungry threads, those threads competed with the I/O threads of the NoSQL client.

Choosing the right data store for each use case and then tweaking it takes time. Many startups call in external consultants to speed up the learning process; other startups learn it the hard way. We did both.

9.5 Offload Tasks Before/After the Low Latency Critical Path

As Forter’s fraud prevention system grew, we realized tweaking the low latency system would get us only so far, and that we needed to split our workload into three different subsystems.

Before the change:

  • TX Processing (The Critical Path): approves/declines transactions in real time. Limiting design constraint: low latency. Data loss acceptable: no. Hard business requirements: yes.

After the change:

  • Event Stream Processing: pre-processes events before the transaction. Limiting design constraint: high throughput. Data loss acceptable: yes. Hard business requirements: no (best effort).
  • TX Processing (The Critical Path): approves/declines transactions in real time. Limiting design constraint: low latency. Data loss acceptable: no, but… Hard business requirements: no (best effort).
  • Batch Processing: post-processes all data. Limiting design constraint: high volume. Data loss acceptable: yes. Hard business requirements: yes (reconciliation).

Event Stream Processing

Some of the TX processing can be done before the transaction has even started. The transaction might never happen, so we might be doing extra work for nothing (hence the higher throughput requirement), but the tradeoff is there. Having the flexibility to trade low latency transaction processing requirements for high throughput event stream processing requirements can sometimes shorten development time considerably.

Batch Processing

Similarly, if we have a batch processing system that works after the transaction is complete, we can relax the transaction processing requirements to best effort, and perform billing reconciliation and other hard business requirements post-transaction.

Closing notes

Getting below 30ms latency requires a different set of technologies (peer-to-peer clusters or even shared-nothing architectures). However, running complex business logic within a few hundred milliseconds is still achievable with commonly used technologies (HTTP load balancers, SQL databases, NoSQL stores, the JVM). My hope is that you can apply some of the ideas presented here to your own use case.

We are always looking to learn new stuff, and are happy to improve. If you have any comments or questions, let’s discuss on Hacker News