The River of Bytes

TL;DR

  • We have an Elasticsearch cluster that used to move >3PB of data across AWS Availability Zones every month. The entire cluster houses ~100TB of data.
    • That’s like moving 30 clusters’ worth of data between AZs every month!
  • We did stuff to find out what was causing this
  • We cut it down by 95%, which amounts to hundreds of thousands of dollars in annual savings

Hopefully you’re now asking yourself “what’s stuff”. If so, read on.

Background

At Forter, we have multiple Elasticsearch clusters, all of which are self-managed and deployed using AWS EC2.

Elasticsearch, while a very capable product, is a notoriously finicky beast. Cost is among the metrics you should keep in check if you find yourself responsible for big-ish Elasticsearch clusters.

What We Found

One day, while examining the EC2 cost breakdown of one of our bigger Elasticsearch clusters, we noticed that over 30% of the cost was a result of the UsageType DataTransfer-Regional-Bytes. More specifically, we noticed that we were paying for Elasticsearch Data nodes to transfer 90-120TB of data between one another every day.

Initial Cost Graph

What is This Cost?

In order to increase production resiliency, we deploy critical clusters to 2 Availability Zones in the same AWS region. This is how AWS describes Availability Zones:

Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, and housed in separate facilities.

One of the drawbacks of this approach is cost. AWS advertises a price of $0.01 for every GB that leaves or enters an Availability Zone in the same region. In practice that means $0.02/GB, because a byte that leaves an AZ doesn’t disappear in the cosmic and mysterious dark space between AZs; it enters another AZ.
This may seem low, but when Giga becomes Tera or Peta it adds up to quite the pile of pennies.
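To put rough numbers on it (round figures in the spirit of what we were seeing, not exact billing data):

# ~100TB/day of cross-AZ traffic, billed at $0.02/GB, over a 30-day month
echo "100 * 1024 * 30 * 0.02" | bc
# => 61440.00, i.e. roughly $60K a month, or north of $700K a year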

The Legitimate Portion

Initially, we thought this kind of cost made sense for multi-AZ deployments. After all, we have to copy every byte we index across zones.

In order to actually be more resilient, our clusters are configured to make sure every document has at least 2 copies, spread across availability zones. An application writes (indexes) a document by connecting to a single node, and Elasticsearch takes care of copying (replicating) the data to a node in a different Availability Zone.

Elasticsearch has a shard allocation awareness feature. When replication is enabled (each shard has more than 1 copy in the cluster) this feature allows keeping each copy of a shard in a different AZ, by setting an attribute on nodes which denotes the AZ in which they reside. See the diagram below.

ES Query Diagram
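For reference, the node-level configuration behind this looks roughly like the following sketch - the attribute name and zone value here are just examples; any attribute works, as long as each node is tagged with the AZ it runs in:

# elasticsearch.yml on each node - which AZ this node lives in
node.attr.aws_availability_zone: us-east-1a

# tell the shard allocator to spread copies of each shard across values of that attribute
cluster.routing.allocation.awareness.attributes: aws_availability_zone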

There is a problem with this theory however. We observed a cross-AZ data transfer rate of up to 120TB/day. We don’t index this much data. In fact, when checking the growth rate of actual cluster storage, we saw that we only index around 10-15 GB every day, including replicas. So what gives?

The Hunt

Where do Bytes Come From?

Using AWS Cost Explorer, we could see that all of the relevant cross-AZ data transfer cost was attributed to our Elasticsearch data nodes. This means that both the source and the destination of the offending traffic were data nodes, i.e. the traffic was cluster-internal.

AWS doesn’t know or care what bytes stream on their network backbone between Availability Zones; they just count them and bill. We, however, were very much interested in finding out what kind of barrage our data nodes were frantically sending one another.

Nodes in the same Elasticsearch cluster communicate with each other regularly, using a dedicated transport protocol on a dedicated port (9300 by default). It is used for a very wide variety of internal tasks - from passing copies of documents for replication and distributing a query to all of the relevant shards, to updating settings and more.

We wanted to understand what information the nodes in the cluster are passing around between one another, and if we can perhaps make them quiet down.

Setting Up

Our objective at this point was to understand what kind of messages were being transferred between data nodes. Are they documents being copied around? Some kind of keep-alive? Something else?

At first we were hopeful that Elasticsearch might offer some kind of monitoring data for Transport requests, similar to the client-level monitoring (e.g. query count and index rate). This would have been useful since we could have seen something along the lines of “99% of requests are of type XXX” and then investigated what creates XXX requests.
Unfortunately, we were not able to find metrics of the sort.
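For what it’s worth, the node stats API does expose aggregate transport counters per node - total bytes and message counts over port 9300 - but nothing broken down by request type, which is what we were after:

# Aggregate transport rx/tx counters per node - totals only, no per-action breakdown
curl -s 'localhost:9200/_nodes/stats/transport?pretty'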

Diving In

Verification

So we decided to first verify our assumptions, by looking at the river (flood? tsunami?) of data from as close to it as we could get. We logged into a data node and ran the network analysis tool iftop. If you aren’t familiar, iftop lets you monitor your network interfaces, and specifically how much traffic moves through them, along with its sources and destinations.

We ran iftop -i <interface> -f 'tcp port 9300' -t -s 10 -L 8 in order to find out if the rate of traffic to/from port 9300 explained the amount of cross-AZ traffic we observed.

This is what the output looks like:

Iftop Output

We did the math and indeed, our data nodes were even more chatty than your average Rust enthusiast.

Further Down the Network Route

At this point, my questionable instincts pushed us to try and sniff the data with tcpdump and just look at what it is the nodes were saying to each other. Unfortunately for me, Forter Engineering is a respectable and responsible department, and we don’t pass around data in plain text. Were I to try such shenanigans, I would have been presented with a perfectly useless, endless stream of encrypted bytes.

Confession

To be honest, it is possible to hack around and read decrypted traffic if you are an admin. I wasted some time using the amazing mitmproxy and managed to sniff the data after all. As you are about to see - this was a redundant detour.

Looking for Logs

Elasticsearch is written in Java, and a quick look around revealed that it uses log4j to control its logs. This meant that by editing a node’s log4j2.properties we could control the verbosity of the logs.
log4j allows controlling separate loggers for separate packages, and configuring log verbosity for each on a package/class level. So which logger should be modified, and what classes do we want to look into?

In order to find the answer, we looked in the thankfully-publicly-available Elasticsearch code. We found out that transport-related code lives in org.elasticsearch.transport and its sub-packages. Since we had no idea what specifically creates all the traffic, we decided to go the brutish route and enable TRACE logging for the entire package. We did it by adding the following lines to log4j2.properties:

logger.transport.name = org.elasticsearch.transport
logger.transport.level = trace

Then we restarted the node and waited for a bit. Here is an example line we saw in the log:

[2024-01-31T08:39:55,007][TRACE][o.e.t.TransportLogger    ] [192.168.X.X] Netty4TcpChannel{localAddress=/192.168.X.X:9300, remoteAddress=/192.168.Y.Y:56789, profile=default} [length: 1736, request id: 128852428, type: request, version: 7.17.3, action: indices:data/write/bulk[s][r]] READ: 1736B

Let’s break it down:

  • localAddress=/192.168.X.X:9300 is our node’s address and its transport port (9300). This is the log of another node sending a request to this node’s transport port.
  • remoteAddress=/192.168.Y.Y:56789 is the address of another node in the same cluster, and the port it uses.
  • length - the length in bytes of the payload. Interesting, we’ll get back to it.
  • action: indices:data/write/bulk[s][r] - the type of the request. DING DING DING!

The action and length parts of the log contain all of the information we need. We can parse the log and find out which actions account for the most bytes.

So I wrote some hand-made, artisanal bash with no ChatGPT additives (sadly, it shows - it spins up an embarrassing number of processes)

Horrible Yet Wonderful Bash

for action in $(grep -E '(received|sent)' <trace.log | awk '{print $4}' | tr '[' ' ' | cut -d' ' -f3 | sort | uniq); do
  grep -E $action <trace.log | grep length | sed -E 's/.*(length: [0-9]*).*/\1+/g' | cut -d' ' -f2 | xargs echo | sed -E 's/\+$//g' | bc | xargs -I{} echo $action':' '{}'B
done | sort -k2 -h -r

And got a list of actions, sorted by the cumulative size of their payloads (in bytes):

cluster:monitor/nodes/info: 21086079B
indices:data/write/bulk: 6488481B
indices:admin/seq_no/global_checkpoint_sync: 1145750B
indices:data/read/search: 61445B
cluster:monitor/nodes/stats: 44871B
indices:monitor/stats: 14651B
indices:admin/seq_no/retention_lease_background_sync: 9158B

Weird. We do see what seems to be the indexing action, but it’s in second place. The first place, and by a mile, is cluster:monitor/nodes/info. What is that?

When we noticed monitor/nodes/info, we suspected it was related to Elasticsearch’s Node Info API. This API retrieves information about nodes in the cluster.
This was still odd, however. Most of the traffic was internal, within the cluster (remember - port 9300). And the amounts just didn’t make sense.
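For reference, the HTTP-facing version of that API is a plain GET against the client port (9200); the internal cluster:monitor/nodes/info action we saw in the logs is its transport-level counterpart:

# Node Info over HTTP - returns metadata about every node in the cluster
curl -s 'localhost:9200/_nodes?pretty' | head -n 20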

Looking at Code

It looked like some kind of node info request was responsible for most of the data. This new information created more questions:

  • Why is a supposedly innocent metadata request responsible for so much traffic?
  • What triggers this request, and can we kill it with fire?

Kill it With Fire

First, in order to verify, we circled back to Elasticsearch’s source code. Searching for the log string led us eventually to the implementation of TransportNodesInfoAction.

We could see that it is triggered by an actual call to the Node Info API. It looked like every time such an API call is made, an equivalent internal call is triggered from the node that got the request to all other nodes in the cluster. Since the cluster has 112 nodes, each Node Info call created 111 additional, internal calls.

So the open questions at this point were:

  • Whether something external to the cluster calls the Node Info API so much
  • If so, who is it
  • Does the math add up, because 3PB of data every month is a hell of a lot of info about nodes

Closing In

In order to look into external calls to the Node Info API, we enabled a different set of logs:

logger.http.name = org.elasticsearch.http
logger.http.level = trace

An example log line from the file was:

[2024-01-31T08:40:03,654][TRACE][o.e.t.TransportLogger    ] [192.168.X.X] Netty4TcpChannel{localAddress=/192.168.X.X:9300, remoteAddress=/192.168.Y.Y:58844, profile=default} [length: 10012, request id: 24133878, type: request, version: 7.17.3, action: cluster:monitor/nodes/info[n]] READ: 10012B

Great, we can extract the address after remoteAddress=/ and see who makes the most monitor/nodes/info[n] actions!

Some more swiftly handcrafted bash:

Less Horrible and Less Wonderful Bash

grep 'action: cluster:monitor/nodes/info' <http-requests.log | sed -E 's/.*remoteAddress=\/([^:]+).*/\1/g' | sort | uniq -c | sort -k1 -h -r | head -n10

And we got some (redacted) potential addresses of culprits:

  28 192.168......
  26 192.168......
  26 192.168......
  24 192.168......
  24 192.168......
  24 192.168......
  23 192.168......
  23 192.168......
  23 192.168......
  23 192.168......

All of the top addresses were of instances of the backend of one of our services, which queries this Elasticsearch cluster.

We did some more back-of-the-envelope math. Given the number of external nodes/info requests made by this service, multiplied by the number of nodes in the cluster, we could explain the amount of cross-AZ traffic we observed!

So why was this service so addicted to sniffing?

Sniffing in Elasticsearch is when the client queries a node in the cluster for a list of available nodes to choose from when sending queries.

Fix

We looked at the backend code, and specifically at ES client creation and configuration. We found a few things:

  1. By default, new connections were configured to sniff every 5 seconds.
  2. Due to the way our ORM is set up, we create a connection for each index we query. We query over 10 indices.
  3. We explicitly configured sniffing requests to fetch all HTTP attributes of every node, instead of the default, much smaller set.

So across our production, staging and development environments, each with many instances, we got a constant barrage of sniffing requests.

Cluster Illustration
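To get a feel for how sniffing reaches this scale, here is a purely illustrative back-of-envelope check. The ~10KB per internal message and the 111-way fan-out come from the trace logs above, and the 5-second interval and 10 clients per instance from the findings; the instance count is a made-up round number:

# KB per fan-out * sniffs per second per instance * (hypothetical) 300 instances * seconds per day, converted to TB
echo "10 * 111 * (10 / 5) * 300 * 86400 / 1024 / 1024 / 1024" | bc
# => ~53 (TB/day) - the same order of magnitude as the 90-120TB/day we observed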

And the fixes, in order, were:

  1. We changed the sniffing interval to 60 seconds. That’s not dangerous, since nodes are replaced at a much slower rate. Clients only need 1 reachable node for every request, and errors trigger a sniffing request anyway.
  2. We started reusing ES clients in our ORM.
  3. We changed the client config to request only the attributes we need for choosing nodes.

In the end, client configuration looked something like this:

const { Client } = require('@elastic/elasticsearch')
const client = new Client({
  ...
  sniffInterval: 60000,
  sniffEndpoint: '_nodes/_all/http?filter_path=nodes.*.attributes.aws_*,nodes.*.roles,nodes.*.host'
})
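A quick way to see what that filter_path buys you is to compare the size of the default sniff response with the filtered one (exact numbers will vary with cluster size and node attributes):

# Full HTTP node info vs. only the fields the client actually needs for picking nodes
curl -s 'localhost:9200/_nodes/_all/http' | wc -c
curl -s 'localhost:9200/_nodes/_all/http?filter_path=nodes.*.attributes.aws_*,nodes.*.roles,nodes.*.host' | wc -c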

Summary

An internal service was sniffing like mad. 3-PetaBytes-of-node-metadata-mad. Despite the lack of readily available logs and metrics, our cloud bill gave us motivation to dive into the river of bytes and dam it.

After all fixes were deployed, the cross-AZ data transfer rate was reduced from 3PB/month to less than 0.2PB/month - a 95% reduction. Still a lot, but much less, and it can be explained by regular DB operations. Seeing the following graph was very gratifying:

Lower Cost Graph

The moral of the story, if there is one, is that there is no such thing as a true black-box, especially in your own network and self-hosted services. Even if whatever it is you’re using provides no logs or docs, and encrypts traffic over-the-wire, it’s your data and your cores.

Bandaid Overview

We could have chosen to put a bandaid on the problem. Here is a short list of things we could have done which would have improved the situation a bit, but not solved the problem.

  1. Compress transport data by changing transport.compress (an ES config; a sketch follows below). We tried it, and it cut message size by roughly half - 3PB is a lot, but 1.5PB is still a lot.
  2. Deploy the cluster in a single AZ - would eliminate cost, but sacrifice resilience.
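For the record, the compression bandaid is a one-line node setting - a sketch, assuming a self-managed 7.x-style setup like ours; depending on the version it may require a rolling restart to take effect:

# elasticsearch.yml - compress node-to-node transport traffic
transport.compress: true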