0FLAKE – Reaching Reliable Non-Flaky Tests

Testing Philosophy

Tests provide confidence. They provide confidence during development, in new deployments, and continuously in production (monitoring). Flaky tests, on the other hand, fail or pass for the same configuration, and erode confidence in our tests and in our systems. More often than not, test flakiness is the result of a binary “assertion” (tests passing or failing). Confidence is not binary; it is a spectrum. 0Flake is about the tools to create a spectrum of results.

Confidence is a business advantage. When Forter approaches an enterprise e-commerce merchant, one of the most common questions is, “How do you prepare for Black Friday?” (that time of year during which they make most of their money). In the not so distant past, the “correct” answer was, “Code freeze 1 month ahead of the event.” Today, the “correct” answer is that the processes we have in place make us confident in introducing system changes all year long, including during peak seasons and holidays like Black Friday. Fraudsters (which Forter defends against) never rest, and neither do we. Reliable, non-flaky tests, used during development, deployment, and production, are a big part of why we are able to be confident.

Some Context

Forter’s core offering is fraud prevention for online e-commerce transactions. The input of this service is the transaction details (such as the address of the credit card holder and the shipping address). The output of this service is a decision. Declining the transaction means that the transaction is rejected, and the end user will likely reach out to customer support. As such, we aim to minimize those occurrences. Approving the transaction means that Forter either gets a small cut (if the end user is legit) or returns money to our customer if the credit card holder claims they never bought a specific item and that their identity or credit card was stolen.

Internally, the raw transaction data is processed into machine learning features. For example, the billing and shipping addresses are geo-coded into latitude/longitude and the distance in miles between the billing and shipping addresses is calculated. Next, the machine learning model takes these features and predicts the probability of the transaction being fraudulent. The result is calibrated to reflect the real-world probability, but it is still not a final decision. The next component applies customer specific policies and provides an approve/decline decision.

Flaky Unit Tests Due to Precision and Accuracy Issues

As every developer eventually discovers, some floating-point operations produce unexpected results. Floating-point numbers were designed to give an approximation, trading precision for a wide range of values. Try this in one of your unit tests:
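
A minimal illustration of the kind of surprise meant here. This is our own stand-in, the classic 0.1 + 0.2 case, and not necessarily the snippet the authors had in mind:

```python
import math

total = 0.1 + 0.2

# Exact equality fails: neither 0.1 nor 0.2 has an exact binary
# representation, so the sum differs from 0.3 by a tiny rounding error.
exact_match = (total == 0.3)

# Comparing within a tolerance succeeds.
close_match = math.isclose(total, 0.3, rel_tol=1e-9)

print(exact_match, close_match)
```

A unit test that asserts `total == 0.3` fails, while one that asserts `math.isclose(total, 0.3)` passes.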

This problem also applies to testing the results of the geopy library, which calculates the distance between two geo-locations on Earth:

Developers with healthy instincts quickly patched this test by providing a range of values, just so the test would pass:

A few years passed, and on April 7th 2018, geopy improved the accuracy of the default distance algorithm, and upgrades to the latest version of geopy resulted in test failures (again).

This time we involved a fraud analyst, who told us that for smaller distances a precision of 1 mile is enough, and for larger distances 1% is sufficient. Furthermore, this issue is not specific to geo-distances or floating points. Many of the features we feed to our machine learning model have a similar acceptable range of outputs.
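
The analyst's rule can be sketched with the standard library's `math.isclose`, which accepts a value when it is within either the absolute or the relative tolerance. The function name `distance_matches` is hypothetical. The text does not say where "smaller" distances end; taking whichever tolerance is larger puts the crossover at 100 miles, which is our assumption.

```python
import math


def distance_matches(actual_miles: float, expected_miles: float) -> bool:
    """True when the distances agree within 1 mile or 1%, whichever is larger.

    Assumption: the 1-mile and 1% tolerances are combined with "or",
    so the 1% rule takes over above 100 miles.
    """
    return math.isclose(actual_miles, expected_miles, rel_tol=0.01, abs_tol=1.0)


# Small distance: 0.6 miles apart is within the 1-mile tolerance.
print(distance_matches(10.4, 11.0))
# Large distance: 5 miles apart is within 1% of ~1200 miles.
print(distance_matches(1200.0, 1205.0))
# Large distance: 50 miles apart is well outside 1%.
print(distance_matches(1200.0, 1250.0))
```

A test written this way should survive a small accuracy change in the distance library, while still failing on a difference the business would care about.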

Therefore, the underlying problem was that developers were maintaining tests without enough domain-specific knowledge. With this in mind, we created a slim testing framework, and moved the responsibility of maintaining these tests to analysts. This framework allows an analyst to specify the input as raw transactional values (addresses as text in this case), and the expected results as a range, or one of a few values.
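
To make the idea concrete, here is a rough sketch of how expectations of that shape could be expressed. It is not Forter's framework: the helpers `in_range`, `one_of` and `failed_features`, and the feature name `bill_ship_dist_label`, are all hypothetical, and the step that turns raw addresses into features is left out.

```python
from typing import Any, Callable, Dict, List

Check = Callable[[Any], bool]


def in_range(low: float, high: float) -> Check:
    """Expectation: the value lies in the closed interval [low, high]."""
    return lambda value: low <= value <= high


def one_of(*allowed: Any) -> Check:
    """Expectation: the value is one of a few allowed values."""
    return lambda value: value in allowed


def failed_features(features: Dict[str, Any], expected: Dict[str, Check]) -> List[str]:
    """Names of the features whose computed value violates its expectation."""
    return [name for name, check in expected.items() if not check(features[name])]


# Hypothetical test case as an analyst might write it.
expected = {
    "bill_ship_dist": in_range(1188.0, 1212.0),  # 1200 miles, give or take 1%
    "bill_ship_dist_label": one_of("1200 miles", "greater than 100 miles"),
}
computed = {"bill_ship_dist": 1203.7, "bill_ship_dist_label": "1200 miles"}

print(failed_features(computed, expected))
```

The point of the design is that the expectation encodes domain knowledge (how much deviation matters), so an analyst, not a developer, decides when a change in output is a real failure.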

Flaky Integration Tests Due to Non-Deterministic Results

A fraud machine learning model is fed thousands of input data points, and some of these data points will not be available for every transaction. The code therefore uses fallback logic, feeding the machine learning model whatever data it does have, even when some inputs are missing. The same problem shows up during integration tests, when a service that we are trying to integrate with fails during the test.

The following integration test usually declines the transaction, since the “bill_ship_dist” is “1200 miles.” One day the geocoding service failed and the fallback logic kicked in. The “bill_ship_dist” result changed from “1200 miles” to “greater than 100 miles.” The fraud prediction model produced a fraud probability of 19% (just below the decline threshold), and the final decision changed from “decline” to “approve.” The test failed, even though nothing was wrong with the service that was actually being tested.

Developers with healthy instincts quickly patched this test by adding a “@retry” annotation. This could have worked if the geocoding service was having a temporary hiccup, but more often than not, these problems persist. Delaying tests for long retry periods results in a terrible developer experience.

So, the underlying problem here is that there is an inherent tradeoff between integration test stability and integration coverage.


On the far top left end of this tradeoff, we stub other services. For example, we can use fakeredis instead of connecting to a test Redis server. The only problem is that the parts that are stable (stubs) do not cover all of the other service’s APIs, and definitely not its latest changes. Any such change would be noticeable only in production.

Docker Sidecar

Improving on the stubs solution, the most common practice is for each other service to have its own test-specific Docker container running on localhost. Each test suite comes with a Docker Compose file that starts one container per service. Instead of a stub, we use the latest version of the services with which we want to integrate. This increases integration coverage without degrading test stability. In some cases (such as a Redis container), this solution is perfect. However, given enough Docker dependencies (and their recursive dependencies and configuration), test setup time and complexity can demand a sizeable upfront investment and ongoing maintenance.

Connecting to other Services

On the far bottom right of this tradeoff, we connect only to real instances of the other services. There is a sizeable investment in a “develop env,” which is similar to the “production env,” only scaled down, with a lesser SLA, and with access only to test data. With a lesser SLA, and with a complex enough service mesh, this test could fail every other week, and could result in a terrible developer experience.

One could argue that we can similarly maintain a high SLA for the “develop env,” but this would require redundancy, monitoring, and alerts, and would generally increase developer fatigue. We measure developer fatigue by the number of alerts each developer needs to handle, the number of maintenance operations, and the number of context switches. Not to mention that a lapse in one service’s SLA in development would ripple throughout integration tests and block them for everyone. This would not be a tenable situation.

Ignoring Some Exceptions

Improving on the previous solution, we decided to let the test pass even when some other services are offline. When the code runs in production, the monitoring service propagates exceptions to emails, alerts, and dashboards. During integration tests, however, the monitoring service stub adds the exception to a list. Just before the test completes, the test inspects the swallowed exceptions and decides whether the test should pass or fail. Some exceptions trigger a test failure, but others can be silently ignored, since they are a clear indication that another service is offline.
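
A rough sketch of that mechanism follows. The class `MonitoringStub`, the `ServiceOfflineError` type and the `IGNORABLE` tuple are hypothetical names; which exception types really indicate "another service is offline" depends on the clients in use, so the list here is an assumption.

```python
class ServiceOfflineError(Exception):
    """Hypothetical error raised when a dependency cannot be reached."""


class MonitoringStub:
    """Test double for the monitoring service: records exceptions instead of alerting."""

    def __init__(self):
        self.swallowed = []

    def report(self, exc: Exception) -> None:
        self.swallowed.append(exc)


# Assumed list of exception types that only mean "another service is offline".
IGNORABLE = (ServiceOfflineError, ConnectionError, TimeoutError)


def unexpected_exceptions(monitoring: MonitoringStub) -> list:
    """Swallowed exceptions that should still fail the test."""
    return [exc for exc in monitoring.swallowed if not isinstance(exc, IGNORABLE)]


def assert_no_unexpected_exceptions(monitoring: MonitoringStub) -> None:
    """Call just before the test completes."""
    unexpected = unexpected_exceptions(monitoring)
    assert not unexpected, f"unexpected exceptions: {unexpected!r}"
```

In a test, the code under test would be given the stub in place of the real monitoring client, and `assert_no_unexpected_exceptions(stub)` would run in the teardown step.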

This costs a small reduction in integration coverage, but yields a much more stable test environment.

Relaxing Asserts

Simple solutions usually hide in plain sight. Even after everything we’ve just described, the integration test may still fail on this line:
assert decision == 'decline'
So we remove it. We remove the assert. We delete it. No git revert. No regrets. Confidence is a spectrum, not an equals statement. The test passes. Mission accomplished!

Well…not quite. Keep reading…

Testing In Production

(The truth is, we couldn’t remove that last assert, before we started testing in production).

Canary Deployment (à la Netflix)

Netflix has been doing canary deployments for a long time now, and along the way they have open-sourced their deployment tooling. The domestic canary is a small songbird very sensitive to lethal gas. Coal miners used to bring this songbird into underground tunnels with them, and when the birds’ songs came to an abrupt halt, the miners would know it was time to “git revert” and head back. Similarly, a Canary Deployment involves sacrificing a small part of the traffic in order to detect lethal bugs early.

Here is the gist:

  1. When a new version (v2) is ready to deploy, it starts with only three instances and the load balancer routes a small part of the traffic to this new version.
  2. We would like to compare the monitoring metrics of the canary deployment with the production deployment, but that wouldn’t be a fair comparison. The production deployment has all of its caches and connections set up, and has many more instances.
  3. So, simultaneously with the Canary, a third deployment, called Baseline, is deployed with exactly the same version as the production environment.
  4. Any statistical anomaly between Baseline and Canary is analyzed and a decision is made whether to complete the v2 deployment, or revert it.

Canary isn’t a good fit for us, since we can’t sacrifice any part of our traffic. Forter measures the loss incurred due to analytical and engineering errors. Analytical errors are part of the cost of doing business in an adversarial environment. Fraudsters are very persistent and they try and retry their attempts to commit fraud. Engineering errors, however, are of our own making.

The engineering loss in case of a faulty Canary Deployment is described in the following formula:

In a worst case scenario, in which the Canary Deployment approves all transactions, Forter would have to refund its customers for all products stolen by fraudsters (approximately 1.5% of the Canary transaction total value). In order to be able to detect statistically significant changes between the Canary and the Baseline deployment, we need a minimum number of transactions to be processed (tx_throughput * T1). Multiply these two numbers and the result is the engineering loss in case of a serious bug in the fraud detection logic or configuration.
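
The multiplication described above can be worked through in a few lines. This is our reading of the paragraph, not the authors' formula: the function name, the average transaction value, and all of the numbers are hypothetical; only the 1.5% rate and the `tx_throughput * T1` term come from the text.

```python
def canary_engineering_loss(fraud_rate: float, avg_tx_value: float,
                            tx_throughput: float, t1_seconds: float) -> float:
    """Worst-case loss if the canary approves everything.

    transactions needed = tx_throughput * T1
    loss = fraud_rate * (transactions needed * average transaction value)
    """
    transactions_needed = tx_throughput * t1_seconds
    return fraud_rate * avg_tx_value * transactions_needed


# Hypothetical numbers: 10 canary transactions per second for 14 minutes,
# an average transaction value of 100, and the 1.5% rate from the text.
loss = canary_engineering_loss(
    fraud_rate=0.015, avg_tx_value=100.0, tx_throughput=10.0, t1_seconds=14 * 60
)
print(round(loss))
```

With these made-up inputs the canary has to process 8,400 transactions before a decision can be made, and the worst-case loss comes to roughly 12,600 in the same currency as the transaction value.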

Blue/Green Effectless Deployment (à la Forter)

In a blue/green deployment the v2 environment is already scaled out, and ready to process data. The effectless variation streams a copy of the real traffic to the v2 environment for about 14 minutes, and only if there is no deviation between v2 and v1, do we stream real traffic to the new v2 version.

  1. Every new version is deployed with the “Effectless” toggle turned on by default. This toggle changes the code behavior very slightly so it won’t overwrite production data (for example, instead of uploading to S3 /mybucket/production/myfilename it uploads to /mybucket/effectless/myfilename).
  2. We stream fake test traffic to the new environment, and validate that there are no exceptions being raised. This also forces the new environment to set up connections to the databases.
  3. We then forward a copy of all production requests to the effectless environment and compare the results of both versions (more on this later).
  4. If the blue environment (v2) results do not diverge from the green environment (v1), the new version is ready to process production traffic. The stream of traffic from the database is stopped, the blue env is drained, the effectless toggle is switched off, and production traffic is routed to the blue env and the green env is drained.
  5. Though we routed real traffic to the blue env, we don’t completely trust it just yet. The green env (v1) remains on standby as a quick fallback. Small problems slip through the effectless test and can eventually accumulate, resulting in precision drift. We monitor the business KPIs of the blue env against business thresholds, and after only 4 hours we can safely remove the green env.
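
The toggle in step 1 might be sketched as below. The function name `storage_key` and the boolean flag are hypothetical; only the two path prefixes come from the example in the text.

```python
def storage_key(filename: str, effectless: bool = True) -> str:
    """Build the S3 key for an upload.

    The toggle defaults to on, so a freshly deployed version writes under
    its own prefix and cannot overwrite production data.
    """
    env = "effectless" if effectless else "production"
    return f"/mybucket/{env}/{filename}"


print(storage_key("myfilename"))
print(storage_key("myfilename", effectless=False))
```

Keeping the difference between the two modes this small is what makes the comparison meaningful: the effectless environment runs nearly the same code path as production, with only its side effects redirected.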

The Effectless Test

The effectless tests run for around 14 minutes (T1). A report is sent to Slack after 7 minutes and again after 14 minutes:

The test compares the statistics between the blue and the green environments:

  • The sum of transactions which received different decisions must have insignificant business impact.
  • API latency differences between the environments:
    • It is easiest to compare the 50th and the 95th percentiles.
    • The 99th percentile is noisy, and requires more complex processing.
    • Caching effects favor the effectless environment, since effectless processes the same data that the production environment just processed a few seconds ago, and the databases’ LRU caches are hot.
  • Absolute and relative number of exceptions. Exception thresholds must be gradually tightened with the maturity of the code. It is not realistic to expect zero exceptions in production for new features.
  • All test stats need to be grouped by tenant / by sub-service / by host, in order to detect local problems that are not evident when looking at the entire cluster.
    • Small sub-services are ignored.
    • Small tenants need to be grouped together to be statistically significant.
    • Every host matters.

Sometimes the test fails because the blue environment is actually better than the green environment. In these cases the developer and the analyst who introduced the change need to dive into the details, and then they can override and deploy the blue environment, even though the effectless test failed.

Flaky System Tests Due to Data Pipeline Hiccups

Until now, we described a single (micro) service that handles transaction decisions. The system also contains an analytics service that aggregates interesting fraud-related insights for customers, and billing services that generate invoices.

Each service has its own database, and a reliable data pipeline keeps them synchronized. For example, each time a new transaction completes processing, a notification is sent to the data pipeline, which eventually updates the analytics database. Even though the data pipeline is “reliable”, this can go wrong in a couple of ways:

  • Configuration of the publish-subscribe is not up to date, or spread across too many repositories.
  • Messages that failed processing went to the Dead Letter Queue, but the DLQ pattern requires manual intervention. The engineer that handled the DLQ may have made an honest mistake.
  • A bug in the analytics code got into production for a few days, and during that time it updated the analytics database incorrectly.

The analytics insights are used by our customers’ fraud managers on a daily basis and the invoices are checked by our customers’ billing departments. Any mistakes in that data are immediately visible to our customers’ top decision makers.

So, we decided to go ahead and implement a simple sanity system test. The test sends a fake transaction into the decision service, sleeps for 15 seconds, and then checks that the transaction has arrived in the analytics and billing databases. This test works well, most of the time. During traffic spikes, the data pipeline queues fill up, and the analytics database gets out of sync for longer periods of time. The result is a flaky system test. The important point here is that this doesn’t happen too often and is considered healthy behavior of the data pipeline. However, it is unacceptable for a system test to raise a false alert even once in two weeks.

Developers with healthy instincts quickly patched this test by increasing the sleep period from 15 to 600 seconds. The problem is that the analytics service should only be 15 seconds behind the decision service, and not 600 seconds.

Data Reconciliation

Instead of this simple system test, we built a data reconciliation service. It has direct access to each one of the relevant microservices’ databases, and it performs simple sanity checks. An alert is raised if it detects:

  • Data is missing (max transaction timestamp, missing transaction id).
    • The time-drift threshold can be configured separately for each database.
  • Referential integrity problems (“broken links” between databases).
    • For example, the billing invoice requires special treatment for canceled transactions.
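
The "data is missing" checks could look roughly like this. The function names, database names and thresholds are hypothetical; a real reconciliator would read the timestamps and ids from each service's database, which is left out here.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def lagging_databases(source_max_ts: datetime,
                      replica_max_ts: Dict[str, datetime],
                      max_drift: Dict[str, timedelta]) -> List[str]:
    """Databases whose newest transaction trails the source of truth by more
    than that database's own time-drift threshold."""
    return sorted(
        name for name, ts in replica_max_ts.items()
        if source_max_ts - ts > max_drift[name]
    )


def missing_ids(source_ids: Set[str], replica_ids: Set[str]) -> Set[str]:
    """Transaction ids present in the source of truth but absent downstream."""
    return source_ids - replica_ids


# Hypothetical snapshot: analytics is 20 seconds behind, billing 5 minutes.
source_ts = datetime(2020, 1, 1, 12, 0, 0)
replicas = {
    "analytics": source_ts - timedelta(seconds=20),
    "billing": source_ts - timedelta(minutes=5),
}
thresholds = {"analytics": timedelta(seconds=15), "billing": timedelta(hours=1)}

print(lagging_databases(source_ts, replicas, thresholds))
print(missing_ids({"t1", "t2", "t3"}, {"t1", "t3"}))
```

Because each database has its own threshold, billing can lag by minutes without raising an alert, while analytics is flagged after only a few seconds of extra delay.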

There are two important metadata prerequisites:

  • A mapping between each data “type” and the database that serves as its source of truth. In this example, the transaction’s source of truth resides in the decision service database.
  • Since the microservice API isolation is violated, a unified metadata repository is needed to reduce the tight coupling between the reconciliator and each service db schema.

Maintaining the reconciliation gets harder as you try to add more databases. Therefore, there must be a very good business justification for the implementation of one.

Positive Feedback Loops

The next step is for the data reconciliation service to proactively fix missing data, by “replaying” it into the Data Pipeline automatically.

The implementation details are important. Let’s say that a transaction takes 100ms to be processed by the analytics service. However, rare transactions with large payloads take 10 times longer to be processed. Ordinarily, these transactions are rare, and the system handles them slightly more slowly. Now, for some reason, all of these large transactions fail to process correctly, and the data reconciliation replays them all at once. Although full of good intentions, the reconciliation service has effectively clogged the analytics service. After a few minutes the reconciliator detects more missing transactions (the analytics service is clogged – remember?) and it replays them too…and so on. In control theory this is called a positive feedback loop:

Positive feedback tends to cause system instability…increasing oscillations, chaotic behavior or other divergences from equilibrium. System parameters will typically accelerate towards extreme values, which may damage or destroy the system…

Luckily, there is more:

Positive feedback may be controlled by signals in the system being filtered, damped, or limited, or it can be cancelled or reduced by adding negative feedback.

Translating to our domain:

  • Filtering – Fix only serious data problems that affect the business.
  • Damping – After fixing a problem, wait 1 minute before trying to fix another problem.
  • Limiting – Replay each missing transaction only once.
  • Negative feedback – Inspect the health of the analytics service (or its relevant Data Pipeline Input Queues), and pause reconciliation until service health is back to normal.
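
The four controls can be combined in a single gate in front of the replay step. The sketch below is one possible arrangement, not the implementation described in this post: the class `ReplayController`, the `is_serious` flag, and the injected health check and clock are all hypothetical.

```python
import time
from typing import Callable, Optional, Set


class ReplayController:
    """Decides whether a missing transaction may be replayed right now."""

    def __init__(self, is_healthy: Callable[[], bool],
                 cooldown_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._is_healthy = is_healthy      # e.g. checks analytics queue depth
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._replayed: Set[str] = set()
        self._last_replay: Optional[float] = None

    def should_replay(self, tx_id: str, is_serious: bool) -> bool:
        if not is_serious:                 # filtering
            return False
        if tx_id in self._replayed:        # limiting: at most once per transaction
            return False
        if not self._is_healthy():         # negative feedback: pause while unhealthy
            return False
        now = self._clock()
        if self._last_replay is not None and now - self._last_replay < self._cooldown:
            return False                   # damping: wait between fixes
        self._replayed.add(tx_id)
        self._last_replay = now
        return True
```

Passing the clock and the health check in as callables keeps the gate easy to test without sleeping or talking to a real service. Note that in this sketch a transaction rejected by damping or by the health check is not marked as replayed, so it can be retried later.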

Data Reconciliation is hard, and would require strong business justification and some experimenting. Peer developers might resist the violation of micro-service isolation. However, if implemented correctly, it increases the confidence in the system to the level we desire.

The solutions described in this post were designed, implemented, tweaked and re-implemented by Re’em Ben Simhon, Dan Kilman, Tal Amuyal and Roy Tsabari. If you have any questions, or would like to join us at our Tel Aviv office, email me at [email protected], or email our VP Engineering directly at [email protected].