The River of Bytes

TL;DR

  • We have an Elasticsearch cluster that used to move >3PB of data across AWS Availability Zones every month. The entire cluster houses ~100TB of data.
    • That’s like moving 30 clusters’ worth of data between AZs every month!
  • We did stuff to find out what was causing this
  • We cut it down by 95% which amounts to hundreds of thousands of dollars in annual savings

Hopefully you’re now asking yourself “what’s stuff”. If so, read on.

Background

What We Found

At Forter, we have multiple Elasticsearch clusters, all of which are self-managed and deployed using AWS EC2.

Elasticsearch, while a very capable product, is a notoriously finicky beast. If you find yourself responsible for big-ish Elasticsearch clusters, cost is one of the metrics you should keep in check.

One day, while examining the EC2 cost breakdown of one of our bigger Elasticsearch clusters, we noticed that over 30% of the cost was a result of the UsageType DataTransfer-Regional-Bytes. More specifically, we noticed that we were paying for Elasticsearch data nodes to transfer 90-120TB of data between one another every day.

Initial Cost Graph

What is This Cost

In order to increase production resiliency, we deploy critical clusters to 2 Availability Zones in the same AWS region. This is how AWS describes Availability Zones:

Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, and housed in separate facilities.

One of the drawbacks of this approach is cost. AWS advertises a price of $0.01/GB that leaves or enters an Availability Zone in the same region. In practice that means $0.02/GB, because a byte that leaves an AZ doesn’t disappear in the cosmic and mysterious dark space between AZs - it enters another AZ.
This may seem low, but when giga becomes tera or peta, it adds up to quite the pile of pennies.
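
To see how the pennies pile up, here is the back-of-the-envelope arithmetic, using the $0.02/GB effective rate and the 120TB/day peak we observed:

```python
# Back-of-the-envelope cost of the cross-AZ traffic described above.
RATE_PER_GB = 0.02   # $/GB: $0.01 leaving one AZ + $0.01 entering the other
TB_PER_DAY = 120     # observed peak cross-AZ transfer

gb_per_day = TB_PER_DAY * 1000
daily_cost = gb_per_day * RATE_PER_GB
print(f"${daily_cost:,.0f}/day, ~${daily_cost * 365:,.0f}/year")
# → $2,400/day, ~$876,000/year
```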

The Legitimate Portion

Initially, we thought this kind of cost made sense for multi-AZ deployments. After all, we have to copy every byte we index across zones.

In order to actually be more resilient, our clusters are configured to make sure every document has at least 2 copies, spread across availability zones. An application writes (indexes) a document by connecting to a single node, and Elasticsearch takes care of copying (replicating) the data to a node in a different Availability Zone.

Elasticsearch has a shard allocation awareness feature. When replication is enabled (each shard has more than 1 copy in the cluster) this feature allows keeping each copy of a shard in a different AZ, by setting an attribute on nodes which denotes the AZ in which they reside. See the diagram below.
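
For reference, a minimal sketch of what this configuration looks like (the attribute name zone is just a convention - any name works):

```yaml
# elasticsearch.yml on a node running in us-east-1a
node.attr.zone: us-east-1a

# Tell the allocator to spread copies of each shard across
# distinct values of the "zone" attribute
cluster.routing.allocation.awareness.attributes: zone
```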

ES Query Diagram

There is a problem with this theory however. We observed a cross-AZ data transfer rate of up to 120TB/day. We don’t index this much data. In fact, when checking the growth rate of actual cluster storage, we saw that we only index around 10-15 GB every day, including replicas. So what gives?

The Hunt

Where do Bytes Come From?

Using AWS Cost Explorer, we could see that all of the relevant cross-AZ data transfer cost was attributed to our Elasticsearch data nodes. This means that both the source and the destination of the offending traffic were data nodes, i.e. cluster-internal traffic.

AWS doesn’t know or care what bytes stream on their network backbone between Availability Zones, they just count them and bill. We, however, were very much interested in finding out what kind of barrage our data nodes are frantically sending one another.

Nodes in the same Elasticsearch cluster communicate with each other regularly, using a dedicated Transport protocol on a dedicated port (9300 by default). It is used for a very wide variety of internal tasks - from passing copies of documents for replication, to distributing a query to all of the relevant shards, to updating settings, and more.

We wanted to understand what information the nodes in the cluster are passing around between one another, and if we can perhaps make them quiet down.

Setting Up

Our objective at this point was to understand what kind of messages were being transferred between data nodes. Are they documents being copied around? Some kind of keep-alive? Something else?

At first we were hopeful that Elasticsearch might offer some kind of monitoring data for Transport requests, similar to the client-level monitoring (e.g. query count and index rate). This would have been useful since we could have seen something along the lines of “99% of requests are of type XXX” and then investigated what creates XXX requests.
Unfortunately, we were not able to find metrics of the sort.

Diving In

Verification

So we decided to first verify our assumptions, by looking at the river (flood? tsunami?) of data from as close as we could get to it. We logged into a data node and ran the network analysis tool iftop. If you aren’t familiar, iftop lets you monitor your network interfaces, and specifically how much traffic moves through them, alongside sources and destinations.

We ran iftop -i <interface> -f 'tcp port 9300' -t -s 10 -L 8 in order to find out if the rate of traffic to/from port 9300 explains the amount of cross AZ traffic we observed.

This is what the output looks like:

Iftop Output

We did the math and indeed, our data nodes were even more chatty than your average Rust enthusiast.
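
The math itself is simple: a quick conversion shows what sustained transfer rate would account for the observed daily volume (120TB/day is the figure from our cost graphs):

```python
# What sustained rate explains the observed cross-AZ volume?
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_DAY = 120

gb_per_second = TB_PER_DAY * 1000 / SECONDS_PER_DAY
print(f"~{gb_per_second:.2f} GB/s")  # ~1.39 GB/s, sustained, all day
```

A sustained ~1.4 GB/s on port 9300 is roughly what iftop showed us.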

Further Down the Network Route

At this point, my questionable instincts pushed us to try and sniff the data with tcpdump and just look at it to see what it is they are saying to each other. Unfortunately for me, Forter Engineering is a respectable and responsible department, and we don’t pass around data in plain-text. Were I to try such a shenanigan I would have been presented with a perfectly useless endless stream of encrypted bytes.

Confession

To be honest, it is possible to hack around and read decrypted traffic if you are an admin. I wasted some time using the amazing mitmproxy and managed to sniff the data after all. As you are about to see - this was a redundant detour.

Looking for Logs

Elasticsearch is written in Java, and a quick look around revealed that it uses log4j to control its logs. This meant that by editing a node’s log4j2.properties we could control the verbosity of the logs.
log4j allows controlling separate loggers for separate packages, and configuring log verbosity for each on a package/class level. So which logger should be modified, and what classes do we want to look into?

In order to find the answer, we looked in the thankfully-publicly-available Elasticsearch code. We found out that transport-related code lies in org.elasticsearch.transport and its sub-packages. Since we had no idea what specifically creates all the traffic, we decided to go the brutish route and enabled TRACE logging for the entire package. We did it by adding the following lines to log4j2.properties:

logger.transport.name = org.elasticsearch.transport
logger.transport.level = trace

Then we restarted the node and waited for a bit. Here is an example line we saw in the log:

[2024-01-31T08:39:55,007][TRACE][o.e.t.TransportLogger    ] [192.168.X.X] Netty4TcpChannel{localAddress=/192.168.X.X:9300, remoteAddress=/192.168.Y.Y:56789, profile=default} [length: 1736, request id: 128852428, type: request, version: 7.17.3, action: indices:data/write/bulk[s][r]] READ: 1736B

Let’s break it down:

  • localAddress=/192.168.X.X:9300 is our node’s address on port 9300, the transport port. This is probably the log of another node sending a request to this node.
  • remoteAddress=/192.168.Y.Y:56789 is another node’s address in the same cluster, and the port it uses.
  • length - the length in bytes of the payload. Interesting, we’ll get back to it.
  • action: indices:data/write/bulk[s] - the type of the request. DING DING DING!

The action and length parts of the log contain all of the information we need. We can parse the log and find out which actions account for the most bytes.

So I wrote some hand-made, artisanal bash with no ChatGPT additives (sadly, it shows - it spins up an embarrassing number of processes).

Horrible Yet Wonderful Bash

for action in $(grep -E '(received|sent)' <trace.log | awk '{print $4}' | tr '[' ' ' | cut -d' ' -f3 | sort | uniq); do
  grep -E "$action" <trace.log | grep length | sed -E 's/.*(length: [0-9]*).*/\1+/g' | cut -d' ' -f2 | xargs echo | sed -E 's/\+$//g' | bc | xargs -I{} echo "$action"': ' '{}'B
done | sort -k2 -h -r

And got a list of actions, sorted by the cumulative size of their payloads (in bytes):

cluster:monitor/nodes/info: 21086079B
indices:data/write/bulk: 6488481B
indices:admin/seq_no/global_checkpoint_sync: 1145750B
indices:data/read/search: 61445B
cluster:monitor/nodes/stats: 44871B
indices:monitor/stats: 14651B
indices:admin/seq_no/retention_lease_background_sync: 9158B
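
(For readers allergic to that pipeline: the same aggregation can be sketched in Python. The regex assumes the TransportLogger line format shown earlier; field names are from that sample line.)

```python
import re
from collections import defaultdict

# Matches the "length: N ... action: NAME" fields of a TransportLogger
# TRACE line; the action's [s]/[r] suffixes are stripped by stopping
# at the first bracket.
TRANSPORT_LINE = re.compile(r"length: (\d+).*?action: ([^\[\]]+)")

def total_bytes_per_action(lines):
    """Sum the payload length per transport action, largest first."""
    totals = defaultdict(int)
    for line in lines:
        match = TRANSPORT_LINE.search(line)
        if match:
            totals[match.group(2)] += int(match.group(1))
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```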

Weird. We do see what seems to be the indexing action, but it’s in second place. The first place, and by a mile, is cluster:monitor/nodes/info. What is that?

When we noticed cluster:monitor/nodes/info, we suspected it was related to Elasticsearch’s Node Info API. This API retrieves information about nodes in the cluster.
This was still odd, however. Most of the traffic was internal, within the cluster (remember - port 9300). And the amounts just didn’t make sense.

Looking at Code

It looked like some kind of node info request was responsible for most of the data. This new information created more questions:

  • Why is a supposedly innocent metadata request responsible for so much traffic?
  • What triggers this request, and can we kill it with fire?