This is the eighth part of the Chatterbox series. For your convenience you can find other parts in the table of contents in Part 1 – Origins.
When working with software we typically need to integrate with other components, whether external software like a database or libraries we incorporate into our code base. Below are a couple of stories of weird integration issues I had to solve.
Threading is hard. It’s even harder when we start integrating components.
I was using a Java library to handle a protocol, and I was running it in a .NET process via IKVM. There is one big difference between unhandled exception behavior in Java and in C#: in Java an unhandled exception kills only the thread, while in C# it takes the whole process down. So what happens if you take Java code with no try/catch in the thread’s main function and you get an exception? The whole process dies.
How do you fix that? Either you hack around it or you move the code to an external process. The former works but is risky; the latter makes your infrastructure more complex (on the other hand, you should have a watchdog and multiple nodes anyway).
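When you do control how a thread is started, one way to keep an exception from escaping its entry point is a guard around the body. This is only a minimal sketch under that assumption (it was not necessarily possible in the story above), and GuardedRunnable and its failure callback are hypothetical names.

```java
import java.util.function.Consumer;

// Hypothetical wrapper: runs the body and reports any failure instead of
// letting it propagate out of the thread's entry point.
final class GuardedRunnable implements Runnable {
    private final Runnable body;
    private final Consumer<Throwable> onFailure;

    GuardedRunnable(Runnable body, Consumer<Throwable> onFailure) {
        this.body = body;
        this.onFailure = onFailure;
    }

    @Override
    public void run() {
        try {
            body.run();
        } catch (Throwable t) {
            // Catching Throwable is deliberate here: nothing may leave this method.
            onFailure.accept(t);
        }
    }
}
```

It would be used like new Thread(new GuardedRunnable(work, logger)).start(), where work and logger are whatever your code provides.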
I had a similar situation with an external library that didn’t handle threading correctly: it was deadlocking when sending a message. It was running in an external process, so it wasn’t blocking the whole system, but detecting the lock wasn’t easy. How do you recognize whether a thread is working hard or waiting indefinitely? Your ping thread won’t help (because ping still works), so you need other checks in place. It makes things much messier: you need to check whether the action trigger (like a queued message) is processed within a given time. You could go with Wait Chain Traversal, but detecting deadlocks is not simple. Not to mention that it doesn’t need to be a “deadlock” technically; it could be a lost message, etc.
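The “processed within a given time” check can be sketched as a small progress tracker. This is an illustration rather than the code I actually ran; ProgressWatchdog, its method names, and the explicit clock parameter are all assumptions made to keep the example testable.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical tracker: remembers when each trigger was queued and reports
// the ones that were not marked as processed within the limit.
final class ProgressWatchdog {
    private final long limitMillis;
    private final Map<String, Long> pending = new LinkedHashMap<>();

    ProgressWatchdog(long limitMillis) {
        this.limitMillis = limitMillis;
    }

    synchronized void enqueued(String id, long nowMillis) {
        pending.put(id, nowMillis);
    }

    synchronized void processed(String id) {
        pending.remove(id);
    }

    // Returns ids that have been waiting longer than the limit.
    synchronized List<String> overdue(long nowMillis) {
        List<String> stuck = new ArrayList<>();
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
            if (nowMillis - entry.getValue() > limitMillis) {
                stuck.add(entry.getKey());
            }
        }
        return stuck;
    }
}
```

This catches deadlocks and lost messages alike, because it only looks at whether work finished, not at why it didn’t.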
Never wait indefinitely. Never. It’s a recipe for failure. This is pretty clear when you take locks explicitly, but what happens if you use await? Having timeouts in place is harder, but you still need them. Otherwise you end up with a bug where puppeteer doesn’t open a new page and your await never finishes. You either fix it at the source or add more checks and timeouts around it.
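The same idea in Java terms, as a rough sketch: wrap the wait on a future in a helper that always gives up after a deadline. The puppeteer case above was JavaScript; Timeouts and awaitOrFallback are hypothetical names used only to show the shape.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Timeouts {
    // Waits for the future, but never longer than timeoutMillis.
    static <T> T awaitOrFallback(CompletableFuture<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Marks the future as cancelled; this does not stop the underlying work.
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }
}
```

Returning a fallback is just one option; throwing or raising an alarm may fit better, as long as the wait is bounded.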
Whenever you incorporate an external library into your process, you need to take care of segfaults and memory errors. While they are rare in managed code, they still happen. What do you do when your process segfaults? You need to restart it, but you also need to make sure it doesn’t happen again (so you take a memory dump or logs). Always run external code sandboxed and in isolation; you just can’t let your code fail because some other library is buggy.
Time management is hard. There are time zones, there are leap seconds, and so many other things. And what happens if another library handles time differently?
Whenever you handle time, always be consistent. Going with UTC is not a silver bullet, but if your whole system uses it, then translate from local time to UTC as early as possible. Don’t let another library’s conventions leak into your code, because that will make it messy.
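Translating at the boundary could look like the sketch below, which uses java.time. TimeBoundary is a hypothetical helper, and the zone in the usage note is only an example.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class TimeBoundary {
    // Converts a local wall-clock time reported by some library into a UTC
    // instant. The zone must be stated explicitly; gaps and overlaps around
    // DST changes are resolved by the rules of LocalDateTime.atZone.
    static Instant toUtc(LocalDateTime local, ZoneId zone) {
        return local.atZone(zone).toInstant();
    }
}
```

For example, noon on 15 January 2024 in Europe/Warsaw (UTC+1 in winter) becomes 2024-01-15T11:00:00Z, and everything past this point deals only with Instant.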
You can never lose user data, especially in stateful situations which do not retry. Persist data as soon as you get it (whether entered by the user or delivered via a callback from a library). Also, make sure you have audit mechanisms, retries, deadlettering and other solutions in place to get some insight into how the system behaves.
Always have some metrics in place. You’re not following your logs closely (especially when you travel), but you need some mechanism that notifies you when failures happen too often. Whether it’s a p99 performance metric or just a token bucket for too many exceptions in 15 minutes, keep it in place and get notified. There will be some false alarms, but it’s better to know something doesn’t work than to be disappointed later. Especially when you travel and you just can’t log in remotely to see logs or make sure things work. Once you start using the system “for real”, you need to make sure it lets you know when something is wrong.
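A “too many exceptions in 15 minutes” check does not need much code. The sketch below uses a sliding-window counter, a simpler relative of the token bucket mentioned above; FailureRateAlarm and its parameters are hypothetical, and the clock is passed in so the behavior is easy to verify.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class FailureRateAlarm {
    private final int maxFailures;
    private final long windowMillis;
    private final Deque<Long> failures = new ArrayDeque<>();

    FailureRateAlarm(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    // Records one failure and returns true when the window now holds more
    // failures than allowed, which is the moment to send a notification.
    synchronized boolean recordFailure(long nowMillis) {
        failures.addLast(nowMillis);
        // Drop failures that fell out of the window.
        while (nowMillis - failures.peekFirst() > windowMillis) {
            failures.removeFirst();
        }
        return failures.size() > maxFailures;
    }
}
```

With new FailureRateAlarm(2, 15 * 60 * 1000L), the third failure inside 15 minutes trips the alarm, while isolated failures never do.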
Keep your logs clean and tidy. Log enough but don’t log too much. That means logging request contents, but not logging every single line. Just make sure you log all side effects so you can reproduce them from logs.
Also, make sure you log important context: timestamp, thread id, process id, binary name, request context. These things are very important. You may need to trace bugs using memory dumps; things are much harder there and you’ll need as many details as possible. The same goes for resource leasing or even mutex locking. If you log that “mutex was abandoned” then it’s helpful, but how do you know which component held it? You need to log uniformly and as much as possible. See logging in distributed system.
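One way to keep log lines uniform is to route every message through a single formatter that attaches the context. A minimal sketch, assuming Java 9+ for ProcessHandle; ContextLog and the field layout are made up for illustration.

```java
import java.time.Instant;
import java.util.Locale;

final class ContextLog {
    // Builds one log line with the same context fields in the same order,
    // so lines from different components can be correlated later.
    static String format(Instant when, String component, String requestId, String message) {
        return String.format(Locale.ROOT,
                "%s pid=%d thread=%s component=%s request=%s %s",
                when,
                ProcessHandle.current().pid(),
                Thread.currentThread().getName(),
                component,
                requestId,
                message);
    }
}
```

With this in place, a line like “mutex was abandoned” automatically says which component and thread wrote it.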
If you log and take memory dumps, make sure you clean up periodically. The system may not fail “clearly” when you run out of disk space, but it won’t work and you’ll get weird exceptions. Just schedule your cron jobs. Make sure you archive artifacts so you don’t lose them.
Finally, document decisions. You won’t remember why things work the way they do. The same goes for features: you may implement something and then forget it’s there or how to use it. Just keep the readme up to date; it is helpful. Especially if you travel and cannot log in to see the code.
Never deploy on Friday
People go with CD and it’s cool, but never deploy at risky times. If you leave for the weekend, do not deploy on Friday evening. You may have rollbacks in place and catch all issues right away, but you don’t want to log in remotely from some airport (been there, done that). This “super cool feature I need today” really can be implemented when you get back. It’s cool you have a feature, but it needs to be reliable as well.
Trust and check
Whenever you perform an action with side effects, make sure it succeeded. Checking the HTTP status code may not be enough; you don’t actually know whether the external system processed your action correctly. Check somehow: if you send a message to your friend, then ask the server later on whether it was delivered. Have another instance of your client and see if it gets a ping from the server. Download message history periodically and make sure your messages are there. And if they’re not, just let the user know. It’s bad when a system tells you “something failed and I don’t know what”, but it’s even worse when it fails quietly.
Don’t exit deep inside your code
System.exit in a library. Don’t exit deep inside your code. And never use System.exit directly; go with a thin wrapper like ProcessHelper.exit. Log where it’s called from (don’t forget to log the thread etc.), because one day your system may be “failing” just because it exits in a place where you didn’t want it to.
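A wrapper like that can be very small. The following is a sketch of what ProcessHelper.exit might look like, not my actual implementation; the reason parameter and the replaceable terminator and logger fields are assumptions added so the class can be exercised without killing the JVM.

```java
import java.util.function.Consumer;
import java.util.function.IntConsumer;

final class ProcessHelper {
    // Hypothetical hooks: swap these in tests so nothing really exits.
    static IntConsumer terminator = System::exit;
    static Consumer<String> logger = System.err::println;

    static void exit(int code, String reason) {
        // Element 0 of the trace is this method, element 1 is whoever called it.
        StackTraceElement[] stack = new Throwable().getStackTrace();
        String caller = stack.length > 1 ? stack[1].toString() : "unknown";
        logger.accept("exit code=" + code
                + " reason=" + reason
                + " thread=" + Thread.currentThread().getName()
                + " caller=" + caller);
        terminator.accept(code);
    }
}
```

Because every exit goes through one place, a single log search tells you which code path and which thread shut the process down.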
Name your threads. This gives you one crucial piece of information: you know not only what the thread does (by examining a memory dump etc.) but also what it is supposed to do. This is especially important with asynchronous code, which can migrate between threads; maybe your call stack is okay but it is not on the right thread.
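In Java, the usual place to do this is a ThreadFactory handed to an executor. A minimal sketch; NamedThreadFactory and the prefix in the usage note are hypothetical.

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // The name states the thread's purpose, plus a sequence number.
        return new Thread(task, prefix + "-" + counter.incrementAndGet());
    }
}
```

Passing new NamedThreadFactory("mail-sender") to Executors.newFixedThreadPool gives threads called mail-sender-1, mail-sender-2 and so on, which is what you will see in dumps and logs.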
Always assign a unique id to events if possible. If you cannot do it easily (it’s not provided by the protocol), then derive it from a hash function (based on sender, content, timestamp etc.). You’ll be able to remove duplicates.
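Deriving an id could look like the sketch below. The choice of SHA-256 and the exact fields are assumptions; EventIds is a hypothetical helper. Each field is length-prefixed so that ("ab", "c") and ("a", "bc") do not hash to the same id.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class EventIds {
    // Returns a hex-encoded SHA-256 digest of the given event fields.
    static String derive(String sender, String content, long timestampMillis) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            for (String part : new String[] {sender, content, Long.toString(timestampMillis)}) {
                byte[] bytes = part.getBytes(StandardCharsets.UTF_8);
                digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
                digest.update(bytes);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest()) {
                hex.append(Character.forDigit((b >> 4) & 0xf, 16));
                hex.append(Character.forDigit(b & 0xf, 16));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```

Two deliveries of the same event then produce the same id, so a set of seen ids is enough to drop duplicates. Two genuinely different events with identical sender, content and timestamp would collide too, which is the price of not having a protocol-level id.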
Deadletters because of deadletters
What happens if you have a deadletter? You probably want to notify yourself about it. But what happens if you fail when notifying about the deadletter? If you generate another one then your system can easily collapse. You get one deadletter, it then causes another two deadletters, and a couple of minutes later you have thousands of messages which you cannot process but which still consume resources. Make sure you can break this cycle somehow, or at least add a delay so the system survives the load.
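One possible way to break the cycle is to carry a hop counter on each message and refuse to generate new deadletters from deadletter notifications or past a limit. This is only a sketch of that idea; DeadLetterPolicy, the hop counter, and the PARK_LOCALLY action (store it and log, without notifying) are all hypothetical.

```java
final class DeadLetterPolicy {
    enum Action { DEAD_LETTER, PARK_LOCALLY }

    // Decides what to do with a message that failed processing.
    // hops counts how many deadletter generations led to this message.
    static Action onFailure(boolean isDeadLetterNotification, int hops, int maxHops) {
        if (isDeadLetterNotification || hops >= maxHops) {
            // Never let a failed notification spawn yet another deadletter.
            return Action.PARK_LOCALLY;
        }
        return Action.DEAD_LETTER;
    }
}
```

The important property is that the number of messages a single failure can produce is bounded, no matter how broken the notification path is.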
Generate new messages based on context
Whenever one message causes another one (like you send an email and then you get an “email sent” message), always have common code in place to maintain the continuity. This could be as simple as copying one correlation id field but could be much more sophisticated (storing context, deadletters, senders, ids etc). Also, one day you’ll need to copy a new property between messages, and you don’t want to go through your whole codebase and update every single place where you call a constructor.
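The “common code” can be a single method on the message itself, so follow-ups are never built with a raw constructor. A minimal sketch; ChatMessage, its fields, and the causation id are assumptions for illustration.

```java
import java.util.UUID;

final class ChatMessage {
    final String id;
    final String correlationId; // shared by every message in one chain
    final String causationId;   // id of the message that caused this one, or null
    final String body;

    private ChatMessage(String id, String correlationId, String causationId, String body) {
        this.id = id;
        this.correlationId = correlationId;
        this.causationId = causationId;
        this.body = body;
    }

    // Starts a new chain; the first message correlates with itself.
    static ChatMessage first(String body) {
        String id = UUID.randomUUID().toString();
        return new ChatMessage(id, id, null, body);
    }

    // The only way to create a follow-up: context is copied in one place.
    ChatMessage follow(String body) {
        return new ChatMessage(UUID.randomUUID().toString(), correlationId, id, body);
    }
}
```

When a new property has to travel between messages, only follow needs to change.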
Have sanity checks in place
We already know waiting indefinitely is a bad idea. What about other things? If you send one message per second, is it okay? Ten of them? One hundred? Sure, you’ll come up with some ridiculous limits which should never be reached. Put alarms on them so you know when your system sends a thousand messages in 5 minutes. It may not save you from spamming someone else, but at least it will let you stop it early.
You’ll need to change the message schema, encryption keys, and content format. Always make sure these are versioned and you can maintain compatibility.
Whenever you write text to storage, always control the format. Don’t rely on the “default locale” or “the current format”; enforce it explicitly so it doesn’t break when you migrate the software to a different machine/country/continent.
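In Java this mostly means never calling a formatting method without an explicit locale or formatter. A small sketch; StorageFormat and the two fields it formats are hypothetical.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class StorageFormat {
    // Locale.ROOT keeps the decimal separator a dot on every machine.
    static String amount(double value) {
        return String.format(Locale.ROOT, "%.2f", value);
    }

    // ISO-8601 in UTC, independent of the machine's zone and locale.
    static String timestamp(Instant when) {
        return DateTimeFormatter.ISO_INSTANT.format(when);
    }
}
```

With a German default locale, String.format("%.2f", 1234.5) writes a comma as the decimal separator, while StorageFormat.amount(1234.5) writes a dot, so the stored text parses the same way everywhere.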