Logging and observability best practices

You find out how good your observability is at the worst possible moment: when something is broken, customers are noticing, and you have no idea why. Everything I do here is aimed at that moment. The goal is to answer “what is happening and why” in minutes, not hours. Good observability is the difference between a calm incident and a frantic one.

Log structured data, not sentences

Human-readable log lines feel friendly until you need to search ten million of them. Then you are writing fragile regexes against prose. I log structured records, key-value or JSON, so logs are queryable like a database instead of grepped like a diary.

// Not this:
log.info("User " + userId + " failed login from " + ip)

// This:
log.info("login_failed", {
  user_id: userId,
  ip: ip,
  reason: "bad_password",
  attempt: 3
})

Now “show me all failed logins for this user in the last hour” is a filter, not an archaeology project. Pick consistent field names across services so the same concept has the same key everywhere, and a query written once works across the whole system.
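
To make the idea concrete, here is a minimal sketch of a logger that emits one JSON object per line. The names makeLogger, baseFields, and write, and the service field, are my own illustrative choices rather than any particular library's API, and a real logging library will do more than this.

```javascript
// Minimal structured logger sketch: one JSON object per line.
// makeLogger, baseFields, write, and the "service" field are illustrative
// assumptions, not the API of any real logging library.
function makeLogger(baseFields, write = (line) => console.log(line)) {
  const emit = (level, event, fields = {}) => {
    const record = {
      ts: new Date().toISOString(), // timestamp in ISO 8601
      level,
      event, // a stable, searchable event name such as "login_failed"
      ...baseFields, // fields shared by every line from this service
      ...fields, // fields specific to this event
    };
    write(JSON.stringify(record));
    return record;
  };
  return {
    info: (event, fields) => emit("info", event, fields),
    warn: (event, fields) => emit("warn", event, fields),
    error: (event, fields) => emit("error", event, fields),
  };
}

const log = makeLogger({ service: "auth" });
log.info("login_failed", {
  user_id: "u_123",
  ip: "203.0.113.7",
  reason: "bad_password",
  attempt: 3,
});
```

Because every line is a self-contained JSON object with the same keys, a log store can filter on user_id or event directly instead of parsing prose.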

Use levels with discipline

Log levels only help if they mean something consistent. When everything is logged at INFO, the level is noise. My rule of thumb:

ERROR is something broken that a human needs to look at. If it does not warrant attention, it is not an error.
WARN is unexpected but handled, the kind of thing worth watching for a pattern.
INFO is significant business events: an order placed, a job finished.
DEBUG is detail for local development, usually off in production.

The test for ERROR is simple: if a page fired for every one, would you be angry? If yes, it is not really an error, and logging it as one trains you to ignore the level that is supposed to wake you up.
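
One way to keep levels honest is to give them an explicit order and make the paging decision depend on it. The sketch below assumes the four levels named above; the numeric values and the function names severity, shouldEmit, and shouldPage are hypothetical.

```javascript
// Levels ordered by severity. The numbers are arbitrary; only the order matters.
const LEVELS = new Map([
  ["debug", 10],
  ["info", 20],
  ["warn", 30],
  ["error", 40],
]);

// Look up a level's rank, rejecting typos instead of silently dropping logs.
function severity(level) {
  const value = LEVELS.get(level);
  if (value === undefined) throw new Error(`unknown log level: ${level}`);
  return value;
}

// Emit a record only if it is at or above the configured minimum level.
function shouldEmit(level, minLevel) {
  return severity(level) >= severity(minLevel);
}

// Only ERROR reaches a human; everything below it is for search and dashboards.
function shouldPage(level) {
  return severity(level) >= severity("error");
}

console.log(shouldEmit("debug", "info")); // false: DEBUG stays off in production
console.log(shouldPage("warn")); // false: WARN is watched, not paged
console.log(shouldPage("error")); // true
```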

Carry a request ID through everything

In any system with more than one service, a single user action becomes a dozen log lines scattered across machines. Without a thread connecting them you are guessing. I generate a correlation ID at the edge and pass it through every downstream call and into every log line. Then one ID reconstructs the entire path of a request. It is the same reason I keep error shapes consistent in REST API design guidelines: when someone or something else has to follow the trail, structure wins.
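
Here is a rough sketch of that propagation in Node.js, using AsyncLocalStorage from node:async_hooks so the ID does not have to be passed through every function signature. The x-request-id header name and the helpers withRequestId, currentRequestId, logEvent, and outboundHeaders are assumptions for illustration; use whatever header your services already agree on.

```javascript
const { AsyncLocalStorage } = require("node:async_hooks");
const { randomUUID } = require("node:crypto");

// Holds per-request context for the duration of one request's call chain.
const requestContext = new AsyncLocalStorage();

// At the edge: reuse an incoming ID if one was sent, otherwise mint a new one.
// "x-request-id" is an assumed header name, not a requirement.
function withRequestId(incomingHeaders, handler) {
  const requestId = incomingHeaders["x-request-id"] || randomUUID();
  return requestContext.run({ requestId }, handler);
}

// Anywhere inside the request: read the ID back without threading it through.
function currentRequestId() {
  const store = requestContext.getStore();
  return store ? store.requestId : undefined;
}

// Every log line carries the ID automatically.
function logEvent(event, fields = {}) {
  const record = { event, request_id: currentRequestId(), ...fields };
  console.log(JSON.stringify(record));
  return record;
}

// Every downstream call forwards the same ID.
function outboundHeaders(extra = {}) {
  return { ...extra, "x-request-id": currentRequestId() };
}

withRequestId({ "x-request-id": "req-42" }, () => {
  logEvent("order_placed", { order_id: "o_1" });
  console.log(outboundHeaders({ accept: "application/json" }));
});
```

The design choice here is that handlers never touch the ID themselves, so nobody can forget to pass it along.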

Measure the three things that tell you about health

Logs tell you about specific events. Metrics tell you about the system as a whole, and they are what your dashboards and alerts run on. For any service that handles requests I track rate, errors, and duration: how many requests, how many failed, and how long they took. Watch the latency distribution rather than the average, because the average hides the slow tail where real users suffer.

Track the 95th and 99th percentile latency, not just the mean.
Track error rate as a percentage so it is meaningful at any traffic level.
Beyond those three, track saturation, how full your resources are, so you see trouble before it becomes an outage.
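
To show why the tail matters, here is a small sketch that computes error rate, mean, and percentile latency from a batch of request records. It uses the nearest-rank percentile definition, which is one of several in common use, and the field names durationMs and status are assumptions. Real metrics systems usually estimate percentiles from histograms rather than raw samples, and saturation is not covered here.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarize a window of requests. Treating status >= 500 as a failure is an
// assumption; pick the definition of "failed" that matches your service.
function summarize(requests) {
  const total = requests.length;
  const durations = requests.map((r) => r.durationMs);
  const failed = requests.filter((r) => r.status >= 500).length;
  return {
    count: total,
    errorRatePct: total === 0 ? 0 : (failed * 100) / total,
    meanMs: total === 0 ? NaN : durations.reduce((sum, d) => sum + d, 0) / total,
    p95Ms: percentile(durations, 95),
    p99Ms: percentile(durations, 99),
  };
}

// 90 fast requests and 10 slow ones, two of which failed.
const sample = [];
for (let i = 0; i < 90; i++) sample.push({ durationMs: 20, status: 200 });
for (let i = 0; i < 10; i++) sample.push({ durationMs: 2000, status: i < 2 ? 500 : 200 });

console.log(summarize(sample));
// { count: 100, errorRatePct: 2, meanMs: 218, p95Ms: 2000, p99Ms: 2000 }
```

A mean of 218 ms looks tolerable, while the percentiles show that one request in ten takes two seconds.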

Alert on symptoms, not causes

An alert should mean a human needs to act now. If it does not, it should be a dashboard, not a page. The fastest way to make on-call engineers miserable is a stream of alerts that fire constantly and mean nothing, because people learn to swipe them away and then miss the one that mattered. I alert on user-facing symptoms, like error rate crossing a threshold or latency blowing past its budget, rather than on internal causes like high CPU, which may be perfectly fine.
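
As a sketch of that split, the rules below page only on symptoms and send a cause like CPU to a dashboard. The rule names, metric keys, and thresholds are all hypothetical; real thresholds should come from your own latency and error budgets.

```javascript
// Hypothetical rules: symptoms page a human, causes only show on a dashboard.
const ALERT_RULES = [
  { name: "error_rate_high", metric: "errorRatePct", threshold: 1, action: "page" },
  { name: "latency_over_budget", metric: "p99Ms", threshold: 1500, action: "page" },
  { name: "cpu_high", metric: "cpuPct", threshold: 90, action: "dashboard" },
];

// Return every rule whose metric is currently above its threshold.
function evaluateAlerts(metrics, rules = ALERT_RULES) {
  return rules
    .filter((rule) => metrics[rule.metric] > rule.threshold)
    .map((rule) => ({ name: rule.name, action: rule.action, value: metrics[rule.metric] }));
}

// Only these interrupt someone.
function pagesFor(metrics, rules = ALERT_RULES) {
  return evaluateAlerts(metrics, rules).filter((alert) => alert.action === "page");
}

// CPU is hot but users are fine: nothing pages.
console.log(pagesFor({ errorRatePct: 0.2, p99Ms: 400, cpuPct: 97 }));

// Users are seeing errors: this pages, whatever the CPU is doing.
console.log(pagesFor({ errorRatePct: 3.5, p99Ms: 400, cpuPct: 40 }));
```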

One more thing that pays off: never log secrets, passwords, tokens, or full payment details. It is easy to leak them into logs by accident, and logs sprawl across systems that often have weaker access controls than your database. The same constraint-driven care I described in database schema best practices belongs here too. Decide what is sensitive, then make sure it never reaches a log line in the first place.
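
One way to enforce this is to scrub fields before they are serialized. The sketch below assumes plain JSON-like data and an illustrative deny-list of key names; SENSITIVE_KEYS and redact are my own names. A deny-list is a safety net rather than a guarantee, since it cannot catch a secret stored under an unexpected key or embedded in a string.

```javascript
// Illustrative deny-list. The key names are assumptions; extend for your data.
const SENSITIVE_KEYS = new Set([
  "password",
  "token",
  "secret",
  "authorization",
  "card_number",
  "cvv",
]);

// Return a copy of the value with sensitive fields masked at any depth.
// Intended for plain objects, arrays, and primitives only.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value !== null && typeof value === "object") {
    const clean = {};
    for (const [key, inner] of Object.entries(value)) {
      clean[key] = SENSITIVE_KEYS.has(key.toLowerCase()) ? "[REDACTED]" : redact(inner);
    }
    return clean;
  }
  return value;
}

console.log(
  JSON.stringify(
    redact({
      event: "payment_submitted",
      user_id: "u_123",
      card_number: "4111111111111111",
      auth: { token: "abc", scheme: "bearer" },
    })
  )
);
// {"event":"payment_submitted","user_id":"u_123","card_number":"[REDACTED]","auth":{"token":"[REDACTED]","scheme":"bearer"}}
```

Calling redact inside the logger itself, rather than at each call site, means a forgotten call cannot leak a field.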