At Forter we develop an online, low latency fraud prevention service. We started out with a single real time system, but as throughput increased, we gradually decided to offload more and more jobs to run periodically. This varies from sophisticated anomaly detection systems to detect new types of fraud, to simpler, ETL jobs that perform maintenance on our rapidly growing databases.
An offline job is any program you run at scheduled intervals. Usually it is implemented as a command line tool that’s executed by a job scheduler (for example, Jenkins).
So you’ve decided to write such a job. Perhaps it’s some ETL that copies records from Elasticsearch to S3. Or perhaps it’s a daily report, sending you an email with interesting statistics on your production environment. Offline (or scheduled) jobs are first class citizens of most production systems.
However, it can be tempting to write offline jobs as a “one of”, or a “write and forget”. After all, since they are not in the real time critical path, they can feel less critical or less production grade. This can turn out to be a big mistake. In Forter’s case, it could mean a fraudster is slipping past our net. Or perhaps an analyst can’t do her work properly because the data she needs is not available.
Worry not, though, ‘cause I’m here to help, with some practices I found to be super helpful for writing robust and friendly offline jobs.
Properties of a good scheduled job
First up, let’s have a look at what I think makes for a good job:
- Good visibility: if it fails, you can find out why it failed. If it’s stuck, you can tell where it’s stuck.
- Great error handling: does the correct thing with errors – for example, does not fail the entire job because of one timeout against a database.
- Efficient “enough”: your job can keep up with the data it needs to run on and isn’t overly wasting money/resources.
- Doesn’t strain production resources: your job should not overload DBs in a way that interferes with your user facing production applications. For instance, our real time system at Forter should not suffer increased latency caused by offline jobs.
- Consistent and reliable: for the most part, it just works, as opposed to being flaky with many sporadic failures.
- Idempotent: Wikipedia defines idempotency as “… the system state remains the same after one or several calls”. For an offline job this means you can run it again and again and nothing bad would happen.
- Extendable: it’s important for any system to be open to changes. It’s very likely you’ll want to add new parameters and features to your job.
12 best practices for scheduled jobs
So how do we achieve these nice traits I listed above? Here’s what I consider the best practices, almost every job should consider.
1. Batch it baby
One of the first benefits of moving code from a real time service to a scheduled job, is the potential for efficiency and higher throughput. Data sources tend to prefer batched operations to work more efficiently. For example, in Elasticsearch bulk inserts are much faster, and less resource intensive, than inserting records one by one. Therefore, if your job is the kind which operates on a lot of data, prefer to work in batches of records, and not one by one.
However, resist the temptation to process all records at once – this can cause the job to run out of memory, or hit various limits in your data sources. How many records in a batch? It depends, of course – it could be 100, 1000 or 10000, you should try various sizes and choose what works best for your scenario. Sometimes it makes sense to batch things by date: e.g. run on records from Jan 1st, then Jan 2nd, etc. However, be careful there, because some days might contain significantly more data. For instance, on Black Friday, some online merchants have X10 more traffic than regular days. This can cause your job to process X10 more data than you expected, and run out of memory just when you need it most.
2. Log, log, log
Do you know that feeling of looking at the console output of a job that’s been running for hours, and seeing nothing printed in the last hour or so? “WHAT IS IT DOING?!” You scream into the keyboard, but to no avail. Let us avoid this dire situation, and log, log, log.
Here are some basic log entries we should consider when writing a batch job:
- As the job starts, log all the job’s parameters and relevant configuration about the environment or data sources you’re going to use. (e.g. ElasticETL is starting. Environment is PROD, scanning the last 2 days of data, connecting to prod-elastic-01 at IP 18.104.22.168)
- Log before each batch begins. (e.g. Processing Batch 23/50)
- Log after each batch completes. You should make sure log timestamps are enabled, so you can assess how long each batch took. (Batch 23/50 completed)
- Log every query to a database. If your job gets stuck later, this can help you find out what’s taking it so long. (Running query against my-db-02: select * from users…)
- Log every error or exception with a complete stack trace
- Before anything time consuming (Uploading extremely large results file to S3, this might take a while…)
3. Use constant memory
A nice, well behaved job doesn’t fail on out of memory exceptions, right? Beware of leaking memory between batches. Each batch should be self contained, so be wary of this pattern:
results = 
for batch in batches:
In this example, the list of results will grow and grow, and your job might fail on OOM. Instead, try to flush the results of each batch to disk or other persistent store. The same holds for collecting errors: beware of making a list of all the errors your job encountered.
4. Error handling is crucial
Scheduled jobs tend to process a lot of data, and query various databases. This means that errors and timeouts are common, and you should carefully think about how you want to handle each error. I like to split errors into two types:
- Critical failures – this type of failure should fail the entire job. There’s simply no point in continuing the job after this horrible thing had just happened.
- Non critical failures – these should be reported, but shouldn’t fail the job.
There’s a careful balance here: on the one hand, you don’t want to fail a job that’s successfully processed 10M record over the last hour, just because one record had a problem. On the other hand, simply swallowing all exceptions can hide the fact that the job is getting nothing done.
Here’s what I propose:
- Any operation that can be retried (e.g a DB query that timed out), should be retried.
- In most cases, you don’t want to fail the entire job because of one error. You can keep a count of the amount of errors you allow, and fail the job only if you cross some threshold. For example, if more than 1% of the data failed to process, you can fail the job and notify a human.
- Don’t swallow an error without logging it first.
- In many cases it is reasonable to add retries to the entire job. Most job schedulers will allow you to do this without adding any code to your job.
- Make sure you understand what happens if you swallow an error. Will the record whose processing you skipped be processed the next time your job runs? If it is skipped forever, is that OK in your use case? Consider using a dead letter queue.
5. Make your job idempotent
Since failures tend to be common in scheduled jobs, the likelihood of your job running twice on the same data is very high. Even when nothing fails, you might discover a bug and you’d want to reprocess the last few days of data again. It is important that nothing blows up in this case. It is easy to create duplicates if you’re not thinking thoroughly about reruns.
Non-idempotent code usually involves state changes, or stateful side effects. Here’s one example. By default Elasticsearch generates a new random identifier for each inserted document. Running such a job twice would result in two identical documents in Elasticsearch. If you choose the ids yourself by using some identifier from the raw data, you can make sure that your job only updates existing records when it eventually reruns.
6. Long running jobs should be resumable
Let’s say your job usually runs for 10 hours, but it failed after 7 hours. You fix the problem, but now you have to wait the full 10 hours again! Not fun. You should consider allowing your job to pick up where it left off. You can do it in one of two ways:
- The job itself maintains a resumable state, such as checkpoints that contain partially processed data. This is the preferred way, because then you can just rerun the job and it will do the right thing.
- You can also move the complexity outside the batch job itself. For example, provide an optional parameter to your job which tells it to begin at a later batch number/date. This way the job can be manually made to skip data it already processed.
7. Use a command line parser utility
Do not attempt to parse your job arguments by yourself. In every language there are lovely options for parsing command line arguments. These tools make it easier to add new parameters, give clear errors when a required argument is missing, and allow for a handy ‘help’ command which displays all the possible job args.
8. Set timeouts on EVERYTHING
Make sure to configure a proper timeout for each network call. Whether it is a DB query, a REST call or a message to a distributed logging system, you want to make sure that:
- Your job isn’t stuck infinitely on a network call.
- Your job doesn’t fail because of timeouts which are too short, that were configured for online production code and not batched work. As I mentioned in the beginning, it’s common to move code from real time systems to scheduled jobs, so forgetting to properly change the configuration when doing this is a common mistake.
9. Notify on job failures / jobs not running at all
Usually, when a job fails, you’d want someone to know about it. You can use Slack, email or even tools like PagerDuty for critical jobs. Don’t let your jobs fail silently!
The notification should contain, at the very least, a link to view the console output of the job. If possible, include the stack trace of the last relevant exception.
In addition to job failures, you’d probably want to know if your job isn’t being triggered at all. If your job is supposed to run each day, but hasn’t even started for the last week, you’d want to have some monitoring tool notify you about this.
10. Allow running your job locally
Jobs need to be maintained long after they’re written. You’ll want to fix bugs and add new features. Your fellow developers will love you if you include a flag that allows running your job on their laptop. When this flag is on, your job should work against local or develop environment instances of databases. It should not affect any production data. Instead, you can print out comments like INFO: deletion of record X skipped because we are in LOCAL mode, so you’d know that the production job would have deleted something here.
11. Do not strain your DBs: ‘sleep’ is ok for offline jobs
Your job is likely hitting a database that is also being used by other applications. Reading/writing too fast or too much might overload the CPU or other resources of these DBs. You want your job to be a good citizen, so it’d be a good idea to monitor the databases your job uses as it runs, and adjust the code accordingly. While frowned upon in real time code, placing a few sleep calls here and there, to make your job slower, can be a good idea.
The tradeoff is that the job would run longer, and in cases of distributed processing (such as huge spark clusters), it could incur extra costs to keep the job running longer. Usually, though, it’s a good tradeoff to make.
12. Test as much as you can
Testing is a tricky issue for scheduled jobs. They might not have a lot of logic, instead they might contain “glue” code that takes data from one DB, manipulates it and sends it to another. Testing code with a lot of DB access is always a hassle.
The easy way out is just not to write tests. For some simple jobs that’s okay, as long as you at least allow running your job locally. You can mock out DB access code, but that tends to be where most of the logic is, so you end up not actually testing anything.
So what should you do? Write integration tests that allow your code to access real instances of databases. This would require some infrastructure:
- You can use docker-compose to launch local stubs of your databases in the CI server.
- You can use develop-environment copies of your DBs, if you have them. Just make sure running your tests in parallel by different people/builds won’t break them. For this, your tests should use randomly generated identifiers to avoid conflicts, and clean after themselves when they finish.
There is a lot more to writing great offline jobs (for instance, I didn’t even get to talking about data pipelines and flow management). However, from my experience, sticking with the best practices above will get you off to a great start.