Radioactive Log Tracing

Shees Usman
3 min read · Jul 14, 2022

Merging Biology with Software Engineering

Photo by Ana Petrenko on Unsplash

Production-level debugging is always a hassle. Figuring out those high-priority burning issues can be really tough if you don't have a stack trace, and sometimes even if you have one. Just figuring out where in the application flow things broke and which component caused the issue can be a pain in the asphalt. It gets harder with backend servers running somewhere you can't access them, which of course makes sense, because you want to keep those production machines safe and sound.

However, all is not lost. We have tools like CloudWatch and Kibana, and log aggregators of every kind, helping us view and search those logs. The only problem is that at a bigger scale you get more logs, and with more logs it gets harder to figure out which log belongs to which transaction on the backend servers.

Now, thankfully, context-aware logging exists. We implement that, and that's the end of the story. Thanks for reading! Bye!

Didn't buy it, eh!? OK then, I'll tell you exactly how we handled it in a few of our systems, and how that made a lot of difference.

So context-aware logging exists, and what you do is attach context to your logs so that at any time they tell you where in the life cycle they are: which service, which function, which controller, and the like. That does help you identify an individual log, but it doesn't help you identify all the logs in one transaction and differentiate them from all the others.
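To make that concrete, here is a rough sketch of context-aware logging using Python's standard logging module. The service and component names (orders, create_order) and the log format are made up for illustration and are not taken from our actual systems.

```python
import io
import logging

# Capture output in memory so the example is self-contained.
context_stream = io.StringIO()
handler = logging.StreamHandler(context_stream)
# Every line carries the service and component it came from.
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(service)s/%(component)s] %(message)s")
)

base_logger = logging.getLogger("context_demo")
base_logger.handlers = [handler]
base_logger.setLevel(logging.INFO)
base_logger.propagate = False

# LoggerAdapter injects the same context into every record it emits.
# "orders" and "create_order" are hypothetical names.
log = logging.LoggerAdapter(
    base_logger, {"service": "orders", "component": "create_order"}
)

log.info("validating cart")  # user A
log.info("validating cart")  # user B, doing the same thing at the same time

context_lines = context_stream.getvalue().splitlines()
```

Both lines come out identical. You know which component wrote them, but nothing in them says which transaction each one belongs to.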

That's a problem we faced a while ago, and in case you don't already know, I'm a lazy guy. I hate hard work, to the extent that I will work hard to find ways to make it easier the next time I have to do it. Can't help it. So when I had to traverse 200 mixed logs to find a single issue, I thought:

"This is so stupid." (Not an exact representation of the words I used, but similar.)

It becomes impossible to find the logs you want when hundreds of users are doing the same thing at the same time. Even if I find one relevant log, it takes me ages to find the next one, and the ones after that.

To solve the problem, I decided to borrow the idea of radioactive isotope tracing. Medical professionals use it to find cancer cells and clots all the time. What if we could add tracing attributes to our logs?
In came the custom unique id HTTP header, generated at the edge microservices and forwarded on all subsequent requests to other microservices, unique to one single transaction and one transaction alone. If a microservice receives an id on an incoming HTTP call (or whatever protocol you use), it reuses the one passed to it instead of generating a new one.
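A minimal sketch of that rule might look like the following. The header name X-Trace-Id, the helper names, and the 12-character length are assumptions for illustration, not what our services actually use, and plain dicts stand in for real HTTP headers.

```python
import uuid

# Hypothetical header name; use whatever your services agree on.
TRACE_HEADER = "X-Trace-Id"


def resolve_trace_id(incoming_headers):
    """Reuse the caller's trace id if one was passed, else mint a new one."""
    for name, value in incoming_headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == TRACE_HEADER.lower() and value:
            return value
    # No id on the request: this service is the edge, so it generates one.
    return uuid.uuid4().hex[:12]


def outgoing_headers(trace_id, extra=None):
    """Build headers for a call to another microservice, carrying the id."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


# The edge service gets a request from the outside world with no id...
edge_id = resolve_trace_id({"Accept": "application/json"})
# ...and the downstream service picks up the one forwarded to it.
downstream_id = resolve_trace_id(outgoing_headers(edge_id))
```

Here edge_id and downstream_id end up being the same string, so every service touched by the transaction can stamp the same id on its logs.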

Add that to your logs for all transactions, just one unique string (a few characters long in our case), and you end up with a way to trace transactions end to end in those aggregated logs. It helps us trace the full flow of events and logs for a specific error case on a specific transaction running across our whole microservices system. It's not even anything new: lots of systems have way better tracing methods, like Kubernetes does, and let's not forget those very useful telemetry tools. Not only that, you could later leverage all of this to build system-level diagrams, execution flow maps, and all that cool stuff if you do it right.
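Here is one way this could look in Python, as a sketch rather than our actual implementation. It assumes the id has already been resolved from the incoming request; the trace= log format, the ids, and the function names are all hypothetical.

```python
import contextvars
import io
import logging

# Holds the id of the transaction currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the current transaction's id."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        return True


# In-memory stream standing in for the aggregated logs.
trace_stream = io.StringIO()
handler = logging.StreamHandler(trace_stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("trace_demo")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False


def handle_request(trace_id, user):
    """Hypothetical request handler; the id would come from the header."""
    token = trace_id_var.set(trace_id)
    try:
        logger.info("validating cart for %s", user)
        logger.info("charging card for %s", user)
    finally:
        trace_id_var.reset(token)


def lines_for(trace_id, text):
    """Pull out only the log lines that belong to one transaction."""
    needle = f"trace={trace_id} "
    return [line for line in text.splitlines() if needle in line]


handle_request("a1b2c3", "alice")
handle_request("d4e5f6", "bob")
alice_lines = lines_for("a1b2c3", trace_stream.getvalue())
```

In a log aggregator, lines_for is just a search on the id. The point is that one short string is enough to pull a single transaction out of the pile.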

Now, even though this works for us, it isn't really the final step. Properly leveraging the telemetry tools we have is where we are heading next. Making sure we can identify issues via logs, trace their origins, and fix them faster is only one part of the stability goal we are working towards. Another thing we have worked on is smarter logging and better use of telemetry tools, so that issues are stack traced and reported in a way that lets us tell whether something in the system has broken or whether an error the system shouldn't throw is being thrown.
Just today this helped us catch and patch an issue on a third party's end before anyone had even reported it to us. But more on that in the next one.

Thanks for reading :)


Shees Usman

An electrical engineer working in the beautiful world of software for fullstack application development.