Modern Logging 101
This article was originally published at https://www.dataart.team/articles/your-ultimate-guide-to-modern-logs and is assembled from my LinkedIn posts.
A few weeks ago, I stumbled upon a message in one of the architecture-related discussion chats. It read, “The client asks us not to log anything on prod due to the cost of log message storage, yet we need to trace the execution flow on dev and stage.”
This situation opens a can of worms and leads us to the topic of modern logs, or more accurately, observability systems: how they work, how to set them up properly, which observability backends exist across platforms such as Java, .NET, and NodeJS, which observability frontends exist, and how to use them all in a way that doesn’t break the bank. To put a number on it, we expect observability to cost less than 5–10% of the overall infrastructure cost.
I started building observability systems in the late ’90s, even creating an open-source library for basic low-cost logging. In a parallel universe, I might one day write a book titled “Observability Made Properly in N Steps.” However, life is moving too quickly for book writing, so here is an in-depth article about the matter. The topic of observability is broad and includes at least logging, monitoring, and tracing. In this article I will talk about logging.
Log Separation by Use Case
Logging is the most widespread tool of observability. When utilizing this tool, it’s crucial to distinguish between use cases. I can identify at least the following:
1. Development Time Logging: Real-Time, Line by Line
Use case: “Let’s see how it works for one request.”
When developing a new feature or fixing a bug and examining the execution flow, you may encounter excessive logging or strange log messages like “— -3” without much context. You simply trigger some system function and observe the execution. Personal data and secrets are not a concern here. All you need is the timestamp, the logger name, and the message itself.
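To make this concrete, here is a minimal sketch of such a dev-time setup using winston (a NodeJS logging backend; see the links at the end). The logger name “checkout” and the message are made up for illustration:

```ts
import winston from "winston";

// Dev-time logging needs only three things per line:
// a timestamp, a logger name, and the message itself.
const log = winston.createLogger({
  level: "debug",
  defaultMeta: { logger: "checkout" }, // the "logger name"
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.printf(
      ({ timestamp, level, message, logger }) =>
        `${timestamp} [${logger}] ${level}: ${message}`
    )
  ),
  transports: [new winston.transports.Console()],
});

log.debug("Entering payment flow");
// e.g., 2024-05-01T12:00:00.000Z [checkout] debug: Entering payment flow
```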
2. Development Time Logging: Real-Time Under Load
Use case: “Let’s see how it works with 1,000 requests for 30 min.”
Main differences from the previous case:
- Instead of 100 log lines, you have 100,000, and the issue is buried somewhere in the middle.
- Due to massive parallel logging from many threads, two log lines that sit consecutively in the code may end up with 100 lines from other threads in between. If a log line is not atomic (self-contained), it is meaningless; thus, a log message should carry all the necessary info.
- With a massive number of logs, you need to set proper log levels, and most likely, you will need a logging frontend — a UI application that provides log filtering, etc.
The most interesting part is that you may need to verify that, for instance, all 1,000 requests were processed correctly, and if not, to find all log messages related to request #3452 scattered somewhere between lines 40,000 and 50,000.
This means you need log message context fields (MDC), and the trickiest part is that you also need to see some execution metrics. This leads us beyond pure logging, into libraries such as Micrometer in Java.
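A hedged sketch of that context idea in NodeJS terms: give every request its own child logger, so each record carries the request number and request #3452 can be filtered out of the 100,000 interleaved lines (handleRequest and the field names are illustrative):

```ts
import winston from "winston";

const base = winston.createLogger({
  level: "info",
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json() // structured records are filterable by field
  ),
  transports: [new winston.transports.Console()],
});

async function handleRequest(requestNo: number): Promise<void> {
  // The child logger stamps requestNo on every record it produces,
  // so each log line is atomic and self-identifying under load.
  const log = base.child({ requestNo });
  log.info("Request started");
  // ... business logic ...
  log.info("Request finished", { status: "ok" });
}
```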
3. Production Time Logging
Use case: “Something bad happened 2 days ago, we cannot reproduce it, let’s find it using logs.”
This case has all the complexities of the previous one, with additional caveats.
- You cannot write too many log messages; otherwise, the observability subsystem will cost more than the main solution. A good metric here is the per-request ratio between the size of business data updated/stored in the system and the size of log messages generated. As a rule of thumb, log volume should not exceed 1,000 times the business data volume, or 10,000 times if you need very good traceability.
- At the same time, when something bad happens, the root cause always sits in the piece of code that always works well and has zero lines of production logging. I know two ways to solve this. The first is iterative polishing of application logging: removing useless messages and adding more useful ones. The second, rarely used, approach is called logbags: a tool that lets you have debug-level logs in prod log files, but only when an error has happened (sketched below).
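Logbags are easy to hand-roll. Here is my minimal sketch of the idea (not a library API): debug records are buffered in memory and reach the real logger only when the request fails.

```ts
type LogRecord = { level: string; message: string; fields?: object };

class LogBag {
  private buffer: LogRecord[] = [];

  debug(message: string, fields?: object): void {
    // Memory only: costs almost nothing and never hits prod storage...
    this.buffer.push({ level: "debug", message, fields });
  }

  flush(write: (r: LogRecord) => void): void {
    // ...unless an error happens, and then the whole debug trail is written.
    this.buffer.forEach(write);
    this.buffer = [];
  }

  discard(): void {
    this.buffer = []; // happy path: drop everything
  }
}

const bag = new LogBag();
try {
  bag.debug("Step 1: validated input");
  bag.debug("Step 2: called payment gateway");
  // ... business logic that may throw ...
  bag.discard();
} catch (e) {
  bag.flush((r) => console.error(r)); // in real code: hand over to the logging backend
  throw e;
}
```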
Dealing with Sensitive Data in Logs
When it comes to log data, it’s not uncommon to encounter sensitive information, ranging from personal details to payment information. In both local and development environments, logging this information is often necessary. However, simply storing this data in logs, even with encryption in transit and at rest, is not a secure practice. The risk of log data leakage increases, especially when logs flow into SaaS logging frontends (e.g., DataDog). There is a consensus that unless a highly isolated on-premises security perimeter surrounds the entire system, logging sensitive data in production is a no-go.
So, what are the viable solutions? Let’s explore a couple of tools that can help mitigate these risks.
Tool #1: Structured Logging and Masking on Logging Backend
The essence here is simple: avoid directly inserting sensitive data into log lines; instead, add it to the logging context. Here’s a practical example:
Do Not:
log.info("Received payment from: " + creditCardNumber);
Do:
log.info("Received payment", { creditCardNumber: creditCardNumber });
By doing so, you can set up a filter in the logging backend that masks specific field values in certain environments, ensuring sensitive info is masked even in transit.
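To illustrate, here is a sketch of such a backend-side filter in winston; the field list and the ENV_NAME variable are assumptions for the example:

```ts
import winston from "winston";

const SENSITIVE_FIELDS = ["creditCardNumber", "phoneNumber"];

// A custom winston format: masks selected context fields outside dev.
const mask = winston.format((info) => {
  if (!(process.env.ENV_NAME ?? "").startsWith("dev")) {
    for (const field of SENSITIVE_FIELDS) {
      if (info[field] !== undefined) info[field] = "****";
    }
  }
  return info;
});

const log = winston.createLogger({
  format: winston.format.combine(mask(), winston.format.json()),
  transports: [new winston.transports.Console()],
});

log.info("Received payment", { creditCardNumber: "4111111111111111" });
// outside dev: {"creditCardNumber":"****","level":"info","message":"Received payment"}
```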
Tool #2: Handling Embedded Sensitive Info in Strings
At times, you’ll deal with data already containing sensitive info within strings, such as payment information in a booking request. Logging this data is necessary yet challenging. One approach involves setting up a filter in the logging backend for a particular logger, using regular expressions (regex) to clean up fields. However, this method has its drawbacks: it’s compute-intensive and unreliable, as regex efficiency depends on formatting and protocol versions, risking data leakage.
A more effective strategy is to store request/response data in a separate secure storage and reference it by ID in the logs. This approach not only keeps the logs readable but also makes the security perimeter around that storage much easier to maintain.
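A rough sketch of that approach, with a deliberately hypothetical SecurePayloadStore interface standing in for whatever encrypted storage you pick:

```ts
import { randomUUID } from "crypto";

// Hypothetical interface for a separate, tightly secured payload store
// (e.g., an encrypted bucket with its own access policy).
interface SecurePayloadStore {
  put(id: string, payload: string): Promise<void>;
}

async function logBookingRequest(
  log: { info: (msg: string, fields: object) => void },
  store: SecurePayloadStore,
  rawRequest: string
): Promise<void> {
  const payloadId = randomUUID();
  await store.put(payloadId, rawRequest); // sensitive data stays inside the secure perimeter
  log.info("Booking request received", { payloadId }); // the log carries only the reference
}
```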
Handling sensitive data in logs requires careful consideration and the right tools. By implementing structured logging with masking and utilizing secure storage for sensitive request/response data, organizations can significantly reduce the risk of data leakage.
Log Levels: What Level Should I Use? A Simple Rule
When you write a log message, you normally specify a log level, for example: `log.info("Some log message");`. Most logging backends support at least 4 log levels; some support 7 or more, and some let you add custom levels, but let’s stick with 4 for simplicity.
What is the purpose of using different log levels? You may say filtering, but filtering is not a use case; it is a tool. The real use case is to be able to trigger different reactions to different log messages.
The rule that I normally use:
- `debug` — for messages that I do not want to see in production logs unless I have switched them on explicitly for troubleshooting.
- `info` — normal messages that show me what the system is doing, and I can understand what happens.
- `warn` — something abnormal happened, but it requires neither code changes nor urgent support actions. For instance, a connection to a 3rd-party service dropped; it may be a temporary network issue. I should be aware of it but do not have to act immediately.
- `error` — a simple rule: an error line in logs means a ticket in Jira (similar messages should be grouped). This ticket is one of:
- fix a bug in code
- fix prod infra, e.g., add more disk space, etc.
- some support team action, e.g., call a 3rd party vendor and complain about breaking service availability SLA
Rephrasing: every single error line in the logs is an action item for someone.
In some cases, there could also be a `critical` level: wake up the dev/support team at night and start fixing immediately. But this heavily depends on the nature of the system. Escalation itself is a separate concern and should be handled by something like PagerDuty, but that is another story.
Just for log messages, I prefer to keep things simple:
- DEBUG: I do not want to see it in prod.
- INFO: Just for your information (JFYI).
- WARN: Something is wrong, but maybe it is temporary.
- ERROR: Go and do something with this NOW!
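Here is what the DEBUG rule looks like in practice, as a small winston sketch; ENV_NAME and LOG_LEVEL are assumed variables, the latter being the explicit troubleshooting switch:

```ts
import winston from "winston";

const isDev = (process.env.ENV_NAME ?? "").startsWith("dev");

const log = winston.createLogger({
  // Prod threshold is info; debug appears only on dev or when
  // someone explicitly sets LOG_LEVEL=debug for troubleshooting.
  level: process.env.LOG_LEVEL ?? (isDev ? "debug" : "info"),
  transports: [new winston.transports.Console()],
});

log.debug("Visible on dev only (or with LOG_LEVEL=debug)");
log.error("Visible everywhere; an action item for someone");
```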
Logging Meta Information
What is wrong with the code below?
System.out.println("Received payment from user: " + userId + " from card: " + creditCardNumber);
1. No timestamp
2. No log level
3. Sensitive info inline
4. No good way to filter by userId
5. No log origin info (logger name)
6. Lack of 5+ other useful fields, such as service name, host name, tracing request ID, auth user ID, environment name, etc.
In the code above, we see a log line. Modern programs operate on log records instead (a log record can then be converted to a line by a formatter). The more structured the log record is, the more useful things we can do with it. Instead of that code line, we can write something like this:
const log = logger("payment processor");
log.info("Received payment",
  { userId: user.id,
    cardNumber: payment.cardNumber });
This looks way better: we can filter log records, and we can mask sensitive info. But there is still an issue: we do not yet have a Mapped Diagnostic Context (MDC), see #6 above. And we obviously do not want to add these 5+ fields to each and every message. Depending on the programming language and logging backend, this is usually solved with a special MDC class. For example:
const log = logger("payment processor"); // this automatically sets fields like env, host, etc.
MDC.set({ requestId: request.id, authenticatedUserId: auth.user.id });
try {
  …
  log.info("Received payment", {
    userId: user.id, cardNumber: payment.cardNumber });
  …
} finally {
  MDC.remove(["requestId", "authenticatedUserId"]);
}
Here we have two groups of context info:
- Service scope — added when we create the logger; in Logback, it could also be added in the appender configuration
- Request scope — added to all log messages associated with the request
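The pseudocode above is language-agnostic. In NodeJS, request scope is usually carried with AsyncLocalStorage; here is a sketch of a minimal MDC built on top of it with winston (the MDC mechanics are hand-rolled for illustration, not a library API):

```ts
import { AsyncLocalStorage } from "async_hooks";
import winston from "winston";

const mdcStorage = new AsyncLocalStorage<Map<string, unknown>>();

// Custom format: merge the current MDC fields into every log record.
const withMdc = winston.format((info) => {
  mdcStorage.getStore()?.forEach((value, key) => {
    info[key] = value;
  });
  return info;
});

const log = winston.createLogger({
  // Service scope: set once when the logger is created.
  defaultMeta: { env: process.env.ENV_NAME, host: process.env.HOSTNAME },
  format: winston.format.combine(withMdc(), winston.format.json()),
  transports: [new winston.transports.Console()],
});

// Request scope: everything logged inside fn() carries these fields,
// and they disappear when the scope ends - no finally block needed.
function runWithMdc(fields: Record<string, unknown>, fn: () => void): void {
  mdcStorage.run(new Map(Object.entries(fields)), fn);
}

runWithMdc({ requestId: "r-123", authenticatedUserId: "u-42" }, () => {
  log.info("Received payment", { userId: "u-42" });
});
```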
Summarizing
- Information should be added to the log record in a structured way, not by string concatenation.
- Fields should be set at the proper level once and then propagated through MDC to all log records in scope.
Logging Backend
A logging backend is a library that receives log records; filters, slices, and dices them; formats and augments them; and then routes them to one or many logging destinations, such as the console or logging frontends (ELK, Splunk, Grafana Loki, etc.).
The log record flow could be different depending on the logger, log level, and log record context.
Common actions that a logging backend usually lets you perform on log records:
- Filter based on field value, combination of fields, or custom code.
- Augment the record with context (hostname, env name, etc).
- Have different flows for different loggers e.g., all log records from “ndc.connector.*” are duplicated to a special “NDC Connector” appender that saves them for certification.
- Transform log records, the most common case being masking sensitive information.
- Do asynchronous buffering. Do not send log records immediately, but buffer them for batch sending.
Use cases could be:
- If the environment name does not start with “dev” — the threshold log level is “info”.
- For all log records, mask fields like creditCard and phoneNumber unless the env name starts with “dev”.
- Write all log records in a human-readable format to stdout.
- In parallel, do async buffering (1 sec / 100 records max) and send to two targets:
  - all logs go to Grafana Loki;
  - log records with level error go to the PagerDuty API.
- Also, records from the logger “ndc.connector.*” are sent to an AWS S3 appender.
Well-developed logging backends allow configuring such flow in a declarative way without custom code.
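For illustration, here is roughly how such a flow could be wired up in winston. The collector and alerting endpoints are placeholders; a real setup would use dedicated transports (e.g., the winston-loki package for Grafana Loki) or a log-shipping agent, which also handle the async batching:

```ts
import winston from "winston";

const isDev = (process.env.ENV_NAME ?? "").startsWith("dev");

// Mask sensitive fields unless the env name starts with "dev".
const maskFields = winston.format((info) => {
  if (!isDev) {
    for (const f of ["creditCard", "phoneNumber"]) {
      if (info[f] !== undefined) info[f] = "****";
    }
  }
  return info;
});

const log = winston.createLogger({
  level: isDev ? "debug" : "info", // threshold depends on the environment
  format: maskFields(),
  transports: [
    // Human-readable stream to stdout for people.
    new winston.transports.Console({ format: winston.format.simple() }),
    // Structured stream to a log collector (placeholder endpoint).
    new winston.transports.Http({
      host: "logs.example.com",
      path: "/ingest",
      format: winston.format.json(),
    }),
    // Errors only, routed separately for alerting (placeholder endpoint).
    new winston.transports.Http({
      host: "alerts.example.com",
      path: "/events",
      level: "error",
      format: winston.format.json(),
    }),
  ],
});
```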
Some anti-patterns:
- Writing JSON to the console and then streaming it to, e.g., AWS CloudWatch. This approach is usually motivated by saving up to 5% of CPU time on logging. Why I believe this is an anti-pattern:
- 5% in distributed scalable systems is not a game-changer. In most cases, it’s not even visible.
- Non-human readable logs add a constant 10+% penalty to development & support cost.
- Unless you’re doing something really performance-sensitive, it’s a good idea to use advanced features of logging backends.
- Writing log records as plain text to the console, then parsing them and sending them to log storage. Usually, this story sounds like “developers write logs as multiline text, we store them in ELK somehow, task done” (while the logs in ELK are useless).
In conclusion, logs are a very useful tool for the development, testing, and support phases of SDLC, especially support. A mature logging backend makes this tool shine.
Logging Frontends and Typical Problems Associated with This Topic
Disclaimer: the term logging frontend is not very widespread. It refers to a system that stores logs and lets us view them, preferably in a convenient way. Typical representatives: DataDog, Splunk, New Relic, ELK, Grafana Loki. Other names include log aggregation system, log data platform, etc. I prefer the term logging frontend because it is more straightforward.
All such systems face one design challenge: they need to store a huge volume of logs, provide flexible and fast access to log info, and not cost you a fortune. The volume of logs can be 100–1,000 times higher than the volume of business data. A sample data point, just to grasp the scale: I know one big bank that operates 50+ PB (petabytes) of logs alone.
Here’s a joke that I heard from colleagues:
— Here is the bill for <name of enterprise PaaS log system>
— Is it a monthly bill, or are we buying a new company?
For small systems, logs can be stored simply in files, and you can use handy tools like FAR Manager to view and search them. But this is a very limited approach. Once the system grows bigger, especially if it is not a monolith, you need to store logs somewhere else.
Some clouds have very convenient log frontends; my favorite is GCP Cloud Logging. A popular self-hosted option is ELK (Elasticsearch, Logstash, Kibana); you can run it right from a Docker container/compose, and up to a certain scale it works OK. But here is the problem again: an application that itself runs on an m5.large (2 CPU/8 GB RAM) may require an m5.xlarge just to host its ELK stack.
How to solve this issue?
- Have a reasonable retention policy, e.g., store live logs for 7 days and, in parallel, save zipped logs to some blob storage (e.g., S3), which is extremely cheap; see the sketch after this list. This solution is the most popular one, and it is also the least convenient: when something happens, it always happens outside the retention window, and then you wait for several hours while DevOps restores historical logs while everything is burning.
- Review each log statement; each log record should be meaningful. A good question to ask: how could this log record help a colleague who will be fixing a bug in this function on prod in 3 months?
- Do not save debug info. Use logbags for abnormal cases.
- Have team-wide metrics/rule of thumb — something like “an HTTP call may produce up to 10 log records, update/delete — up to 20 per entity. Corner cases should be reviewed explicitly.” (The real rule should be smarter).
- Do not log raw request/response. If you need them — use separate storage.
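And here is the archiving half of the retention policy from the first bullet: compress a rotated log file and push it to blob storage. This sketch uses the AWS SDK v3; the bucket name and key layout are made up:

```ts
import { readFileSync } from "fs";
import { gzipSync } from "zlib";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

async function archiveLogFile(localPath: string, day: string): Promise<void> {
  // Compress in memory; fine for rotated chunks of a reasonable size.
  const body = gzipSync(readFileSync(localPath));
  const s3 = new S3Client({});
  await s3.send(
    new PutObjectCommand({
      Bucket: "my-archived-logs",      // assumed bucket name
      Key: `logs/${day}/app.log.gz`,   // partitioning by day keeps restores targeted
      Body: body,
      ContentEncoding: "gzip",
    })
  );
}
```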
Basically, all of this looks trivial. The only problem is that these good practices should be integrated into the SDLC at the early stages (who said, “Sprint 1”?), and unfortunately, this does not always happen. One way to fix this is to involve dev teams in the support process so that they understand the typical problems that occur in the prod environment.
Management Aspect of Observability (Including Logging)
Developers write lines of code for a purpose. Each line (actually, each character) serves a purpose. What purpose do log statements serve, and how can we streamline the process of writing logs?
Let’s consider the following code:
function createOrder(order: Order) {
  console.log("Creating order");
  // ...
}
Is this useful? Maybe to some degree, during development only. But is it useful in a production environment where the application creates 300 orders per minute on 3 nodes running in parallel? Only to a very limited degree.
The main purpose of logs is to provide insight into what is going on (and, especially, what happened) in higher environments like test, stage, and prod. During dev time, we have plenty of other tools.
A good way to test whether a log statement is useful is to ask: “Something went wrong with order ORD52345 on prod 7 days ago. Can this statement help me understand what was wrong?” This may lead to a more detailed log statement:
log.info("Creating order", { orderId, userId, sessionId, basketId, orderItems: order.items });
Is it enough to tell developers to ask themselves, “Can this statement help me to understand what went wrong?” In most cases, it is not. Observability should be part of the Software Development Life Cycle (SDLC) from the stage of “tech implementation planning” to the support stage. Specifically:
- Tech Implementation Planning: Design the observability subsystem, keeping in mind Functional Requirements (FR) and Non-Functional Requirements (NFR).
- Development: Have logging guidelines and review logging during PR reviews.
- QA: Make log reviews part of the acceptance gates. Do not release if observability is not done right.
- Maintenance: There should be a feedback loop from the support team to the development team. Ideally, each development team member should participate in problem-solving on production on a regular basis. For example, each sprint could have two team members helping the L2/L3 support team with production issues.
This creates an additional burden on the development team, and they may move slower, but at the same time, the application becomes easier to support.
In summary: The key here is to have a clear purpose for the log subsystem from the early stages of the SDLC.
Useful Links
- Logback (Java) — https://logback.qos.ch
- Winston (NodeJS) — https://www.npmjs.com/package/winston
- Serilog (.NET) — https://serilog.net
- Logback MDC — https://logback.qos.ch/manual/mdc.html
- DataDog — https://www.datadoghq.com
- New Relic — https://newrelic.com
- Grafana Loki — https://grafana.com/oss/loki/
- ELK — https://www.elastic.co/elastic-stack