Monitor Everything

Or at least as much as you can.

When you start moving some of your tasks away from humans, you're losing one key aspect of their work: the oversight.

And nobody may be thinking about that loss.

When you or your employees perform a task, you don't just do it: you also have an idea of what result to expect from it. You have an expectation of what it involves and what the process should look like, in terms of input, execution, output, ... If something seems off, you check why, try to fix it, or raise an alarm so that the issue gets investigated, problems get fixed, and any negative impact is contained.

How do you do that with your automated processes?

Well, that's where monitoring comes in: having tools and processes in place so that any issue gets detected, handled, fixed, learnt from, or triggers whatever other action you want to take as a result.

There are two things you need to watch:

  • Are processes running successfully?
  • Are processes giving the expected results?

To some extent, it is no different from people: you want to be sure that they do what they are expected to do to deliver.

Where will things go wrong?

Short and quick answer: everywhere...

Any component can fail for any number of reasons.

It can be the infrastructure: a component breaks down, an update messes with some configuration items, usage turns out heavier than expected or planned and a server runs out of memory or disk space, ...

It can be your scripts, applications, processes: some data don't match the expected format, a process hasn't been fully updated, you run into some edge case, ...

It can be a 3rd party provider: they change their interface, they disable some feature you relied on, they change their infrastructure, ...

When dealing with software, there are a lot of moving parts, and it is not always feasible to keep track of everything. So to start, the minimum you should be doing is to abort a process when something goes wrong and get notified of it so that you can take action. Then you can improve on it by adding:

  • an alternate path
  • a recovery process
  • a way to handle issues
  • a more granular detection to help with troubleshooting next time. Knowing that a process failed because of "File A missing" will be a huge time saver compared to "Process Failed".
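
As a sketch, here is what that "abort and get notified" baseline can look like in a Python script. The notify() helper and the file-import scenario are illustrative, not prescriptive: plug in whatever alert channel and actual work you have.

    import sys

    def notify(message: str) -> None:
        # Placeholder: route this to email, Slack, SMS, ... wherever you will see it.
        print(f"ALERT: {message}", file=sys.stderr)

    def process(lines: list[str]) -> None:
        ...  # the real work on the file's content goes here

    def run_import(path: str) -> None:
        try:
            with open(path) as f:
                process(f.readlines())
        except FileNotFoundError:
            # Granular detection: "File A missing" beats "Process Failed".
            notify(f"Import aborted: file '{path}' is missing")
            raise  # abort rather than continue in a bad state
        except Exception as exc:
            notify(f"Import aborted: unexpected error: {exc}")
            raise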

Preemptive actions

One of the best ways to prevent issues is to make sure that your tools and code are flawless: every possibility taken into account, every possible value giving a specific result, ... Bad news: that's not going to happen. There will always be a risk of code or infrastructure related issues, especially if you rely on 3rd parties you don't have control over.

Validation

When dealing with data, make sure that what gets into your process is expected. Programming languages facilitate that with types, specifying that each piece of data must be of a specific kind. Types look at the "shape" of the data: "is it a number?", "a piece of text?", "an object of a specific kind?", ...

However, you may also need to make sure that, besides the shape, the data you get is valid. And that depends on your business logic. "Is an age of -1 valid?", "Is a person a minor when below 18 years old? Below 21?", ... Those are questions that are on you to answer and to write into your code when relevant.

The more specific you can be, the fewer issues you will have. Your code should make sure it runs only in an expected manner: proceed if possible, raise errors or process differently otherwise.
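
A minimal sketch of both levels of validation, assuming contacts arrive as plain dicts (the field names and rules here are illustrative):

    def validate_contact(raw: dict) -> dict:
        # Shape: is each piece of data of the expected kind?
        if not isinstance(raw.get("age"), int):
            raise ValueError("age must be an integer")
        if not isinstance(raw.get("email"), str):
            raise ValueError("email must be a string")
        # Business rules: is the data valid for *your* logic?
        if raw["age"] < 0:
            raise ValueError("age cannot be negative")
        if "@" not in raw["email"]:
            raise ValueError("email looks malformed")
        # Business decisions live here too, e.g. the age of majority.
        raw["is_minor"] = raw["age"] < 18  # or 21, depending on your rules
        return raw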

Error handling

Problems will happen, data will be invalid, issues can come from anywhere... You should aim to handle them as early as they appear and limit their impact.

If you are running a process on each contact from a list, unless you need some kind of aggregate, you can keep processing even if some fail. Set up notifications so you know which ones to look into later, but the rest can keep being processed.
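
A sketch of that "keep going, note the failures" pattern, reusing the hypothetical notify() helper from the first sketch (process_contact() stands in for your actual per-contact work):

    def process_contact(contact: dict) -> None:
        ...  # your actual per-contact work

    def process_all(contacts: list[dict]) -> None:
        failures = []
        for contact in contacts:
            try:
                process_contact(contact)
            except Exception as exc:
                # One bad contact shouldn't stop the others.
                failures.append((contact.get("id"), str(exc)))
        if failures:
            notify(f"{len(failures)} contact(s) failed: {failures}")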

Error handling is that process: specifying what should happen when a problem arises. It is the code that runs when something goes wrong, and you need to think about it so that the impact on the whole system stays as small as possible.

Sometimes, you may get into a situation you hadn't planned for. Some value will be missing or incorrect in a way that wasn't known or thought about, ... A global error handler can be the solution to be warned of such cases and to fail nicely if something has to be shown to the user. Beware: that handler shouldn't hide issues from you.
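
In a Python script, one possible last-resort handler is sys.excepthook, which catches anything otherwise unhandled. A minimal sketch, again reusing the hypothetical notify():

    import sys

    def last_resort(exc_type, exc_value, exc_traceback):
        # Warn yourself first: the handler must not hide the issue from you.
        notify(f"Unhandled {exc_type.__name__}: {exc_value}")
        # Then keep the default behaviour: print the traceback, exit non-zero.
        sys.__excepthook__(exc_type, exc_value, exc_traceback)

    sys.excepthook = last_resort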

Error handling can be as simple as notifying that something has to be done. If you can foresee that a situation will require some special processing but you know you don't need it now, don't overcomplicate things by handling it right away. Set an exit point there, record the context you have, and notify yourself that some work will be needed for that context to be handled. Then write the code for it when the case actually shows up: having a specific use case will help you come up with what to do. In some cases, you may never need to handle it.

Logs

Logs are one way to get information about what's going on in your application. They are text that tracks the status / progress of the execution, for example "Starting evaluation of XXXXX".

When there is an issue, logs can help identify where and why the execution failed (provided there is enough information in the message).

Logs can have different severity levels that you can use for filtering and possibly for alerting you. Trace, debug, info, warning, error and fatal are common levels. Each level carries a different granularity of data and calls for a different reaction: all fatals and most errors should really be looked at, while the others are mainly there to provide contextual details.

One more thing you need to make sure of is that you have a way to link logs from the same execution together. If you send all your logs to the same destination, multiple sources can get mixed, and the logs become useless if you don't know how things relate. That linking component doesn't have to be meaningful; it just has to be a unique enough identifier that you can filter on.
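
A minimal sketch with Python's standard logging module, stamping every line of one execution with a random run identifier:

    import logging
    import uuid

    # One identifier per execution; not meaningful, just unique enough to filter on.
    run_id = uuid.uuid4().hex[:12]

    logging.basicConfig(
        level=logging.INFO,
        format=f"%(asctime)s [{run_id}] %(levelname)s %(message)s",
    )

    log = logging.getLogger(__name__)
    log.info("Starting evaluation")       # every line of this run carries run_id
    log.warning("Contact missing email")  # so the two entries stay linked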

Unit tests

On the subject of code quality, something that comes up quite often is the need for unit tests. These are pieces of code written for the sole purpose of checking that the actual code works as expected.

You will find very passionate people that will tell you you need to test everything. Others saying that they are not useful and that you shouldn't bother.

My personal take is that unit/automated tests are useful. However, they are only as good as they are written, and they won't prevent every bug or issue. In my opinion, you shouldn't overdo them: they are intended to make you confident that what you have works as expected. Use them for that.
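
For instance, a couple of tests for the validate_contact() function sketched earlier, using Python's built-in unittest module:

    import unittest

    class ValidateContactTests(unittest.TestCase):
        def test_valid_adult_passes(self):
            contact = validate_contact({"age": 30, "email": "a@example.com"})
            self.assertFalse(contact["is_minor"])

        def test_negative_age_is_rejected(self):
            with self.assertRaises(ValueError):
                validate_contact({"age": -1, "email": "a@example.com"})

    if __name__ == "__main__":
        unittest.main()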

Not every piece of code is worth testing.

Do just enough given what you know and your constraints.

Remember, don't be over-confident in your test coverage: tests only cover what you know and what they are written to cover. 100% coverage shouldn't be the target; it doesn't mean you won't have bugs or issues, and chasing it will limit how flexible and agile you can be.

Continuous monitoring

The most important part of your monitoring: getting alerted when things go wrong.

Sometimes, it can be obvious: the website is crashing. Sometimes, it may be silent and you won't notice anything for days, weeks, months, ... until you ask yourself the right question and start investigating why something isn't happening.

The less visible a failure can be, the more you should be doing to make it visible.

The bare minimum is to check that what runs is successful. If something goes wrong, investigating the issue will identify which component(s) failed to deliver and need to be looked at. Be aware that sometimes multiple components fail simultaneously or in cascade, A failing because B failed, and so on. The issue may not be where you first identified it, and if many processes fail at the same time, the root cause may well be the same.

The next step is to bring in a more granular system, something that can supervise each component of your system. The goal there is to troubleshoot faster: to pinpoint which component has a problem, and ideally why, helping you identify the severity and impact of issues. Some may need immediate action while others don't require anything, or are even expected...
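
As an illustration, per-component supervision can be as simple as a table of named checks; the component names, URL and check bodies below are placeholders for your own, and notify() is still the hypothetical helper from earlier:

    import urllib.request

    def api_reachable() -> bool:
        try:
            urllib.request.urlopen("https://example.com/health", timeout=5)
            return True
        except OSError:
            return False

    CHECKS = {
        "external API": api_reachable,
        # "database": ..., one entry per component you depend on
    }

    def run_checks() -> None:
        for name, check in CHECKS.items():
            if not check():
                notify(f"Check failed: {name}")  # pinpoints *which* component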

The final step in this monitoring ladder is to set up preemptive checks. While some problems can't be predicted (a 3rd party API sending back wrong data, for example), others may be anticipated. If you are using a database and the storage disk starts to get full, you could set checks at 80%-90% usage to be warned that a problem may be coming. Knowing that beforehand lets you plan a fix before it becomes a problem.
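
That disk example takes only a few lines with Python's standard library; run something like this from a scheduler (the threshold is whatever margin suits you):

    import shutil

    def check_disk(path: str = "/", warn_at: float = 0.80) -> None:
        usage = shutil.disk_usage(path)
        used_fraction = usage.used / usage.total
        if used_fraction >= warn_at:
            # Warn while it is still a trend, not yet an outage.
            notify(f"Disk {path} is at {used_fraction:.0%} capacity")

    check_disk()  # call this from cron or any scheduler, e.g. hourly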

Set up some basic checks with an alert system notifying you of issues. Have the information pushed to you.

Fixing issues

All issues are not created equal. They have different levels of impact, different risks, different cost associated with fixing them, ...

It is on you to determine how to respond to an incident. It can be no action at all (if it is infrequent, costs too much to fix, ...), quarantining the failing process, a full rewrite, ...

Any reaction is valid as long as it is evaluated and understood. There is no bad answer, besides ignoring it completely.

When a problem arises, ask yourself how you can detect, fix, or avoid it better in the future, and set up what's needed if it's worth it.