<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Architectureway.dev – Your Software Design Journey]]></title><description><![CDATA[Explore deep-dive articles on software architecture, design patterns, and industry best practices. Empower your development with proven strategies and expert advice.]]></description><link>https://architectureway.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1740061962208/3d695c33-7913-4eae-badd-f7346b53184b.png</url><title>Architectureway.dev – Your Software Design Journey</title><link>https://architectureway.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Mon, 20 Apr 2026 08:17:26 GMT</lastBuildDate><atom:link href="https://architectureway.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Observability 2.0: A Unified, High-Resolution Approach for Modern Software Development]]></title><description><![CDATA[Most of us used at work tools like Grafana, ELK, and Jeager to monitor, and track the behavior of our applications. 
The traditional “Observability 1.0” refers to an approach to monitoring and understanding system performance, based on three main data...]]></description><link>https://architectureway.dev/observability-20</link><guid isPermaLink="true">https://architectureway.dev/observability-20</guid><category><![CDATA[#Observability2.0]]></category><category><![CDATA[RealTimeAnalysis]]></category><category><![CDATA[UnifiedTelemetry]]></category><category><![CDATA[MetricsLogsTraces]]></category><category><![CDATA[observability]]></category><category><![CDATA[OpenTelemetry]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Site Reliability Engineering]]></category><category><![CDATA[elk]]></category><category><![CDATA[elk-stack]]></category><category><![CDATA[distributed tracing]]></category><category><![CDATA[fluentbit]]></category><category><![CDATA[kibana]]></category><category><![CDATA[SRE]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Konrad Jędrzejewski]]></dc:creator><pubDate>Wed, 30 Apr 2025 13:13:59 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746018884157/c48b17f7-2b7e-4a84-9064-44140e67266d.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most of us have used tools like Grafana, ELK, and Jaeger at work to monitor and track the behavior of our applications. The traditional “Observability 1.0” refers to an approach to monitoring and understanding system performance based on three main data types: metrics, logs, and traces, commonly called the “three pillars”. This model relies on separate tools for each data type, leading to siloed information and challenges in correlating data across systems. Moreover, separate dashboards and disconnected data sources can make it hard to know what’s happening in your environment. In contrast, the modern approach, called “Observability 2.0”, focuses on high-resolution data unification. That means telemetry data from all pillars is ingested as structured data into a single technical solution.</p>
<p>Curious why so many teams are shifting in this direction? When your system can store structured, raw data with high-cardinality fields, you can analyze behavior at a very granular level. This evolution brings new capabilities, like asking ad-hoc questions of your systems in real time, providing deeper insights than traditional monitoring could, and revealing “unknown unknowns”. As a developer, wouldn’t it be nice to see everything in one spot?</p>
<h1 id="heading-what-is-the-core-concept-behind-it">What is the core concept behind it?</h1>
<p>The idea behind Observability 2.0 is to capture detailed event data and keep it accessible for immediate and deeper inspection. Instead of sprinkling metrics, logs, and traces across several tools, all that information goes into one place. By doing this, teams remove barriers that previously made correlation a challenge. The result is a flexible data foundation that supports both real-time alerting and retrospective investigation. None of this would be possible without the key concepts behind it:</p>
<ul>
<li><p>Structured High-Cardinality and High-Dimensional Data</p>
</li>
<li><p>Unified Data Store for Raw Events</p>
</li>
<li><p>Real-Time and Historical Analysis Capabilities</p>
</li>
</ul>
<p>In the next couple of paragraphs, I will explain the constraints of traditional observability and show how the concepts above bring benefits, using a real-life application as an example.</p>
<h1 id="heading-limitations-of-traditional-observability">Limitations of Traditional Observability</h1>
<p>Conventional methods frequently rely on basic aggregations and partial samples. These may mask unusual behaviors that could be critical to spot. Having data in different systems also makes it difficult to piece things together, especially when production goes off-track. Teams end up juggling separate dashboards and might lose valuable time trying to reassemble the bigger picture.</p>
<h1 id="heading-what-benefits-do-the-key-concepts-provide">What benefits do the key concepts provide?</h1>
<p>Here I would like to come back to the key concepts I mentioned earlier. Each of them has a purpose and brings benefits. Moreover, these concepts can be the starting point of a journey toward building value around a new observability implementation.</p>
<h2 id="heading-structured-high-cardinality-and-high-dimensional-data">Structured High-Cardinality and High-Dimensional Data</h2>
<p>Unlike older monitoring tools that struggled with data explosion, Observability 2.0 prioritizes capturing as much detail as needed. Event data is stored in a structured way with bespoke granularity, enabling fine-grained filtering and analysis. Telemetry with a large number of unique values, like user IDs or session IDs (high cardinality), and events enriched with many attributes or tags (high dimensionality) give a comprehensive view of the system state. This rich context means engineers can differentiate even very similar events and slice data along virtually any dimension, e.g. filter events by a specific customer, feature tag, or event type, all without relying on pre-aggregated metrics. That capability is a cornerstone of Observability 2.0’s superpower.</p>
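<p>As a simple illustration (every field name below is hypothetical, not tied to any particular vendor or schema), a single structured “wide event” carrying both high-cardinality identifiers and many descriptive dimensions might look like this in Python:</p>

```python
import json

# A hypothetical "wide event": one structured record per unit of work,
# mixing high-cardinality fields (user_id, trace_id) with many
# contextual dimensions (route, status, flags, region).
event = {
    "timestamp": "2025-04-30T13:13:59Z",
    "service.name": "payment-service",
    "event.type": "http_request",
    "http.route": "/payments",
    "http.status_code": 200,
    "duration_ms": 42.7,
    "user_id": "u-982341",           # high cardinality: unique per user
    "trace_id": "4bf92f3577b34da6",  # high cardinality: unique per request
    "feature_flag.new_checkout": True,
    "region": "eu-central-1",
}

# One structured line, ready for ingestion by any backend.
line = json.dumps(event, sort_keys=True)
print(line)
```

<p>Because every dimension travels with the event itself, a query like “p95 latency for requests with <code>feature_flag.new_checkout</code> enabled in <code>eu-central-1</code>” needs no pre-defined metric.</p>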
<h2 id="heading-unified-data-store-for-raw-events">Unified Data Store for Raw Events</h2>
<p>Using a common standard for traces, logs, and metrics simplifies the software architecture. Tools like OpenTelemetry make it possible to create a single source of truth for your systems, which means data no longer needs to be broken into silos. By storing events in their raw, unaggregated form, the system can derive metrics, traces, and logs on the fly from the same dataset without switching between or integrating many tools. Nothing prevents metrics and traces from being stored as structured logs or spans. That model unification simplifies data management and correlation: metrics graphs, distributed trace views, and log search results all reference the same underlying events. Developers benefit from the elimination of context switching between tools and formats, and it becomes easier to follow a chain of events across services, enhancing the ability to identify and resolve issues.</p>
<h2 id="heading-real-time-and-historical-analysis-capabilities">Real-Time and Historical Analysis Capabilities</h2>
<p>All the benefits mentioned earlier make real-time analysis easy to implement. In the era of AI, such a pipeline can be integrated with LLMs to detect anomalies in the data stream, and models can be continuously trained to discover new types of anomalies. This is a significant change compared with the traditional stack and its time-consuming manual searches across many tools.</p>
<h1 id="heading-most-popular-tools">Most popular tools</h1>
<p>My quick research surfaced the following popular tools:</p>
<ul>
<li><p><strong>Honeycomb.io</strong>: Designed around event-based data for real-time debugging. It supports high-cardinality data.</p>
</li>
<li><p><strong>Datadog</strong>: Offers a suite of integrations, from application performance monitoring (APM) to logs.</p>
</li>
<li><p><strong>New Relic</strong>: Broad product range, including distributed tracing and dashboards.</p>
</li>
<li><p><strong>Dynatrace</strong>: Full-stack observability platform with automated, AI-assisted root cause analysis.</p>
</li>
<li><p><strong>Prometheus</strong>: Often used for time-series metrics, and can integrate with various exporters.</p>
</li>
<li><p><strong>Elasticsearch</strong>: Commonly known for logs, but also used for storing and searching large amounts of event data.</p>
</li>
</ul>
<h1 id="heading-efk-as-implementation-of-observability-20">EFK as implementation of Observability 2.0</h1>
<p>The EFK stack (Elasticsearch, Fluentd/Fluent Bit, Kibana) has evolved from primarily a log management solution into a comprehensive observability platform capable of handling metrics, logs, and traces, which makes it a good fit for implementing Observability 2.0. You can measure every aspect of your technology stack, from infrastructure to application metrics. Moreover, Elastic supports anomaly detection and is ready to integrate with AI. Here I would like to focus on the basic capabilities.</p>
<p>A complete Observability 2.0 solution using the EFK stack typically includes the following components:</p>
<ol>
<li><p>Elasticsearch - Core storage and search engine</p>
</li>
<li><p>Fluent Bit - Log collection and processing</p>
</li>
<li><p>Kibana - Visualization and dashboarding</p>
</li>
<li><p>Elastic APM - Application performance monitoring</p>
</li>
<li><p>OpenTelemetry Collector - Centralized collection and routing of telemetry data</p>
</li>
<li><p>OpenTelemetry integrations - Standardized telemetry instrumentation for applications</p>
</li>
</ol>
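<p>As an illustration of how the pieces wire together, a minimal Fluent Bit configuration might tail container logs and forward them to Elasticsearch. This is only a sketch for a local setup: the paths, hostnames, and index name here are assumptions, not values taken from any particular deployment:</p>

```ini
[SERVICE]
    Flush        1
    Log_Level    info

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log
    Tag          app.*
    Parser       docker

[OUTPUT]
    Name         es
    Match        app.*
    Host         elasticsearch
    Port         9200
    Index        app-logs
    Suppress_Type_Name On
```

<p>In a full setup, the OpenTelemetry Collector would sit alongside this pipeline, receiving traces and metrics from instrumented services and exporting them to Elastic APM.</p>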
<p>You can find source code on my personal GitHub:<br /><a target="_blank" href="https://github.com/konradjed/observability-2.0">https://github.com/konradjed/observability-2.0</a></p>
<p>After setting up the infrastructure and instrumenting your services, you can use Kibana to create dashboards that combine metrics, logs, and traces:</p>
<ul>
<li><p>APM Traces View: Navigate to Observability → APM to see service maps, traces, and performance metrics.</p>
</li>
<li><p>Logs Correlation: Kibana allows you to click on a trace and see logs correlated by trace ID.</p>
</li>
<li><p>Custom Dashboards: Built from any custom metrics your applications expose, custom dashboards in Kibana are a powerful tool that can serve not only the technical team but also the business.</p>
</li>
</ul>
<p>To ensure logs are properly linked to traces, make sure your logs include:</p>
<ul>
<li><p><code>trace.id</code> - The OpenTelemetry trace ID</p>
</li>
<li><p><code>span.id</code> - The span ID within the trace</p>
</li>
<li><p><code>service.name</code> - The name of the service generating the log</p>
</li>
</ul>
<p>These fields enable Elastic’s APM UI to show logs related to specific traces, providing context when troubleshooting issues.</p>
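<p>As a sketch of what such a correlated log line might contain (the IDs and helper function below are made up for illustration; in a real service the IDs come from the active OpenTelemetry span context):</p>

```python
import json

def make_log_record(message: str, trace_id: str, span_id: str,
                    service_name: str) -> str:
    """Render a JSON log line carrying the trace-correlation fields.

    In a real service, trace_id and span_id would be read from the
    current OpenTelemetry span context rather than passed in by hand.
    """
    record = {
        "message": message,
        "trace.id": trace_id,
        "span.id": span_id,
        "service.name": service_name,
    }
    return json.dumps(record)

line = make_log_record(
    "payment accepted",
    "4bf92f3577b34da6a3ce929d0e0e4736",  # made-up 128-bit trace ID
    "00f067aa0ba902b7",                  # made-up 64-bit span ID
    "payment-service",
)
print(line)
```
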
<h2 id="heading-real-life-application-examples">Real-Life application examples</h2>
<p>Maintaining resilient, cloud-native, modern systems couldn’t be done without modern observability. The example I developed illustrates how we can effectively monitor a heterogeneous microservices architecture using OpenTelemetry and the Elastic stack. The sample application, a payments system, is composed of multiple services written in different programming languages:</p>
<ul>
<li><p>Python - A core payment-service responsible for processing transaction requests.</p>
</li>
<li><p>Java - A fee-calculator service that implements computation of transaction fees.</p>
</li>
<li><p>Node.js - A user-service that provides user data stored in a PostgreSQL database.</p>
</li>
</ul>
<p>Think of OpenTelemetry as a universal instrument panel: whether you’re running a service in Python, Java, Go, or Node.js, you just plug in the SDK and all your logs, metrics, and traces stream into Elastic. This breaks down silos between teams and tech stacks—no more guessing what happens when a request hops from one language to another. With SDKs and auto-instrumentation libraries for nearly every major language, exporting telemetry data is as simple as adding a handful of configuration lines.</p>
<p>I made a Postman collection to mimic real-world traffic, and to test resilience, I deliberately knocked the user-service and fee-calculator offline and watched how the system healed itself. In the sections that follow, we’ll dive into key observability features: automatic service detection, performance dashboards, dependency graphs, error tracking, and more.</p>
<h3 id="heading-service-detection-and-building-a-service-map">Service Detection and Building a Service Map</h3>
<p>Elastic’s service map is like a live network diagram that updates itself: once you instrument your code with OpenTelemetry, Elastic spots each service and plots the connections in real time. You’ll instantly see who’s talking to whom, how often, and where the slowdowns live. It’s a godsend for both developers and architects—whether you’re hunting down a sneaky bottleneck or planning an architecture refactor, you always get an up-to-date picture of your entire microservices landscape.</p>
<p>Elastic Observability automatically detects services instrumented with OpenTelemetry and constructs a dynamic service map. This visualization provides a comprehensive overview of the system architecture, highlighting the interactions between services. Such a map is invaluable for developers and architects to understand the system’s structure and identify potential bottlenecks or points of failure.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745912443918/ea5c77bc-8b6f-4d9f-816a-bb46425fd4f4.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745910997493/b5d76bf6-fea8-4c23-a9b7-83c29471db58.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-overview-of-service-performance-metrics"><strong>Overview of Service Performance Metrics</strong></h3>
<p>Imagine having a dashboard that tells you, at a glance, how fast your services are responding, how many requests they’re handling per second, and where errors are creeping in. I have built a couple of these by hand, and it always meant the same annoying, repetitive work. That’s precisely what Elastic automates: fine-grained response-time histograms, throughput charts, and error-rate trends for each service. Armed with this data, your team can spot performance hiccups early—before they impact users—and drill down to the exact code paths or dependencies that need attention.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941051707/6ba5d030-5a1a-4891-a4b4-ea16bfa561b2.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941097742/0fed73c5-1f5f-4500-a65a-3c3819b52a22.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-dependencies-overview"><strong>Dependencies Overview</strong></h3>
<p>Think of your system’s dependencies view as a circuit wiring diagram—it instantly shows which services feed data downstream and which pull from upstream. When something breaks, you don’t have to guess which component caused the cascade; you can trace the failure’s path in seconds. This is a game-changer when we need to analyze the root cause of a failure and understand the true size of any outage.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941238742/af99bf96-24c3-4c55-a8ec-b0de7b0c96f8.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-error-metrics-and-logs"><strong>Error Metrics and Logs</strong></h3>
<p>Imagine having a single command center where every service’s error counts and log entries funnel in. Moreover, OpenTelemetry metrics and traces let you perform advanced analysis of application performance, which may itself be the cause of an error. That’s what Elastic gives you: a centralized dashboard showing not just how often errors occur, but the exact log snippets and stack traces behind them. When an issue pops up, you can jump into the trace dashboard and analyze the cause. This kind of visibility is a lifesaver when you need to diagnose failures fast.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941247979/b4bf3369-12b3-431a-97fe-e8b4d973882b.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-service-map-with-direct-calls"><strong>Service Map with Direct Calls</strong></h3>
<p>While the high-level service map gives you the big picture, Elastic also lets you zoom in to see exactly which services are calling each other—and how often. It’s like flipping from a city map to a street view. This kind of detail is incredibly useful when you’re trying to untangle inefficient request chains, debug chatty services, or just understand the real traffic patterns flowing through your system.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941277201/b2cf7171-0a01-4fd9-9a19-42f008ba4e71.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-log-filtering-per-service"><strong>Log Filtering Per Service</strong></h3>
<p>In a busy system with dozens of services chatting away, digging through logs can feel like trying to find a signal in static. Elastic makes this a non-issue by letting you filter logs down to just the service you care about. Whether you’re chasing a bug in the payment API or tracking weird behavior in a background worker, you can zero in instantly—no more wading through noise from unrelated parts of the stack.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941285242/1989fff8-6f38-49a9-a8fc-2c56b44e8501.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-tracing"><strong>Tracing</strong></h3>
<p>If you’ve used tools like Jaeger or Zipkin before, you’ll feel right at home—but with more firepower. Elastic takes tracing a step further by tying together spans, logs, and metrics into a single, unified view. You can follow a request as it jumps across services, see exactly where latency creeps in, and pull the related logs without switching tabs. It’s like having X-ray vision for your distributed system—perfect for tracking down those elusive bottlenecks or unexpected slowdowns.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941874779/12de1e64-d0e8-4b95-8996-447d41f74c25.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941882432/8f7d6374-9269-4fe6-b422-ed790b883e29.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941891448/7ccac905-59aa-4f0f-a311-39bf05f56d7d.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941905557/e8b4ee6c-2f40-4040-8e6c-fead0140bb5f.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941927930/7d28aaa3-b8cf-4a9d-b069-4761be520400.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-infrastructure-metrics"><strong>Infrastructure Metrics</strong></h3>
<p>It’s easy to forget that sometimes, the issue isn’t in your code—it’s in the box it’s running on. Elastic helps surface those problems by tracking infrastructure-level metrics like CPU load, memory usage, disk I/O, and network traffic. You can spot when a service is slow because the node it’s on is maxed out, or when a noisy neighbor is hogging resources. These insights give you the full picture, making it easier to fine-tune performance and plan capacity before things go sideways.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745941304932/0fcf71b8-9b18-4b36-86e6-6548a93986e8.webp" alt class="image--center mx-auto" /></p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745942067904/9c4459fc-239e-4e64-a233-84a44253ec84.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-custom-metrics-and-dashboards"><strong>Custom Metrics and Dashboards</strong></h3>
<p>Elastic allows the creation of custom metrics and dashboards tailored to specific operational needs and key performance indicators (KPIs). This flexibility empowers teams to monitor aspects most relevant to their objectives and to visualize data in a manner that best supports decision-making.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745942052113/9bbd68a3-6539-4f14-ab48-baebf0a35dfa.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-log-explorer">Log Explorer</h3>
<p>Elastic’s Log Explorer offers a centralized interface for searching, filtering, and analyzing log data across your entire system. It enables users to quickly access logs from various sources without the need to log into individual servers or navigate through directories. This tool is especially beneficial in heterogeneous environments, providing a unified view of logs from services written in different languages and running on diverse platforms.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1745942083033/dbcd789f-fbd3-4b72-9190-bda4907350da.webp" alt class="image--center mx-auto" /></p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>This implementation provides a comprehensive Observability 2.0 solution using the EFK stack, incorporating metrics, logs, and traces. The integration with OpenTelemetry ensures compatibility with the broader observability ecosystem while leveraging Elastic’s powerful search and visualization capabilities. By following this approach, you’ll have a complete view of your system’s health and performance, enabling faster troubleshooting and a better understanding of your applications’ behavior.</p>
<h1 id="heading-future-trends-in-observability"><strong>Future Trends in Observability</strong></h1>
<p>As the technology landscape grows more complex, observability is undergoing a transformative shift. No longer confined to traditional monitoring, today’s observability practices are expanding to encompass new dimensions that empower teams to operate faster, smarter, and more securely. Several key trends are shaping the future of observability, redefining how organizations build, maintain, and scale their systems.</p>
<h2 id="heading-multi-dimensional-observability">Multi-Dimensional Observability</h2>
<p>In the past, observability primarily focused on system performance metrics like latency, throughput, and error rates. However, organizations are increasingly recognizing the need for a more holistic view. Multi-dimensional observability integrates cost analysis, compliance tracking, and security monitoring alongside traditional performance data. By doing so, it fosters a collaborative environment where DevOps, SecOps, and FinOps teams can align their efforts. This cross-functional approach not only enhances operational efficiency but also ensures that systems are resilient, secure, and cost-effective from the ground up.</p>
<h2 id="heading-ai-powered-autonomic">AI-Powered Autonomic Operations</h2>
<p>The next major leap in observability is the shift towards AI-powered autonomic operations. In this paradigm, AI doesn’t just identify anomalies or predict potential failures—it actively remediates issues without human intervention. Machine learning algorithms analyze patterns, make decisions, and execute corrective actions in real time. This evolution reduces the burden on IT teams, minimizes downtime, and accelerates incident response, paving the way for self-healing systems that can operate at scale with minimal oversight.</p>
<h2 id="heading-cost-optimized-observability">Cost-Optimized Observability</h2>
<p>As data volumes skyrocket, the cost of storing, processing, and analyzing telemetry data has become a significant concern. Organizations are now prioritizing cost-optimized observability strategies to manage expenses without compromising insights. Techniques like smarter data sampling, tiered storage solutions, and intelligent data retention policies are being widely adopted. These methods ensure that critical data remains readily accessible while less critical information is archived or discarded appropriately, striking a balance between cost-efficiency and operational visibility.</p>
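<p>A tiny Python sketch of one such retention policy (the thresholds here are invented for illustration): always keep errors and slow requests, and sample only a fraction of routine successes:</p>

```python
import random

def should_keep(event, success_sample_rate=0.01):
    """Head-sampling policy: errors and slow requests are always kept,
    routine successes are sampled at success_sample_rate."""
    if event.get("status", 200) >= 500 or event.get("duration_ms", 0) > 1000:
        return True  # critical data stays readily accessible
    return random.random() < success_sample_rate

# With a 0% sample rate, only the error and the slow request survive.
kept = [e for e in
        [{"status": 500}, {"status": 200, "duration_ms": 1500}, {"status": 200}]
        if should_keep(e, success_sample_rate=0.0)]
print(len(kept))  # prints 2
```

<p>Real platforms layer this with tail-based sampling and tiered storage, but the trade-off is the same: spend storage where the diagnostic value is highest.</p>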
<h1 id="heading-closing-thoughts">Closing Thoughts</h1>
<p>The transition toward what many call “Observability 2.0” offers a compelling value proposition: faster incident resolution, deeper system insights, and a more unified approach to managing software lifecycles. By adopting a unified data model, teams can achieve consistency across monitoring, alerting, and analysis, resulting in smoother production releases and more effective troubleshooting. Organizations that embrace these emerging trends position themselves for greater agility and resilience in an increasingly complex digital world.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Circuit Breakers: A Complete Guide]]></title><description><![CDATA[Imagine your favourite streaming app crashes during peak hours or an e-commerce site freezes during a flash sale, because one recommendation service fails. Sound familiar? In microservices, a single weak link can drag down the entire system. That’s wh...]]></description><link>https://architectureway.dev/understanding-circuit-breakers</link><guid isPermaLink="true">https://architectureway.dev/understanding-circuit-breakers</guid><category><![CDATA[microservice patterns]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[Microservice Architecture]]></category><category><![CDATA[patterns]]></category><category><![CDATA[circuit breaker]]></category><category><![CDATA[#CircuitBreaker]]></category><category><![CDATA[Resilience]]></category><category><![CDATA[fault tolerance]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[distributed system]]></category><category><![CDATA[resilience4j]]></category><category><![CDATA[Site Reliability Engineering]]></category><category><![CDATA[high availability]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Konrad Jędrzejewski]]></dc:creator><pubDate>Mon, 10 Mar 2025 11:00:15 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746019009766/87ddf07f-9674-400c-8386-babe9e25a0b7.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Imagine your favourite streaming app crashes during peak hours, or an e-commerce site freezes during a flash sale, because one recommendation service fails. Sound familiar? In microservices, a single weak link can drag down the entire system. That’s where the Circuit Breaker pattern comes in: an architectural pattern inspired by the electrical fuse that trips when there’s an overload. The goal is to protect the system and keep it working even if another part goes down. Let’s break down why it matters and how to use it effectively.</p>
<h1 id="heading-definition-of-circuit-breaker">Definition of Circuit Breaker</h1>
<p>A Circuit Breaker in software tracks the calls from one service to another. As mentioned earlier, it mirrors the idea of a power switch tripping to protect an electrical circuit from excessive current. When error rates climb too high, it blocks further calls for a while; after a cooldown, it tests whether the remote service is healthy again. If it is, traffic resumes. If not, the breaker stays open to avoid more failures. No more cascading failures. No more angry users. Just controlled behaviour when another service fails.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741507045534/26b008dc-488e-458b-a7d5-f9b07a8bec4f.png" alt class="image--center mx-auto" /></p>
<p>But why is this critical for microservices?</p>
<h1 id="heading-microservices-need-guardrails">Microservices Need Guardrails</h1>
<p>In a distributed system, services depend on each other. In my career I have met situations where slow payment gateway processing or a faulty recommendation API completely blocked the operability of an application. The Circuit Breaker tackles three big issues:</p>
<ul>
<li><p><strong>Preventing System Overloads:</strong> Stopping repeated calls to failing services, safeguards the system from being overwhelmed.</p>
</li>
<li><p><strong>Isolating Service Failures:</strong> Separates issues from individual services, preventing them from affecting the entire system.</p>
</li>
<li><p><strong>Enhancing Resilience and Reliability:</strong> With controlled and managed failure responses, the overall system remains stable and can recover gracefully.</p>
</li>
</ul>
<p>When developers use this pattern, it gives them the power to isolate issues and keep them from escalating into critical system-wide outages. While leading a team implementing PSD2-compliant payment systems, we used Circuit Breakers to handle third-party bank API failures. Without them, a small, error-prone API that calculates payment fees could break the entire payment flow; with the Circuit Breaker pattern in place, we boosted user trust and avoided widespread service interruptions.</p>
<h1 id="heading-the-three-states-of-a-circuit-breaker"><strong>The Three States of a Circuit Breaker</strong></h1>
<p>The working mechanism of the Circuit Breaker pattern is quite simple. It operates in three basic states:</p>
<div class="hn-table">
<table>
<thead>
<tr>
<td><strong>State</strong></td><td><strong>What does it mean</strong></td></tr>
</thead>
<tbody>
<tr>
<td><strong>Closed</strong></td><td>The service is operating normally, and all requests are routed directly to it.</td></tr>
<tr>
<td><strong>Open</strong></td><td>Multiple failures (e.g., timeouts, 500 responses) have tripped the circuit breaker, which blocks subsequent requests to prevent further stress on the service.</td></tr>
<tr>
<td><strong>Half-Open</strong></td><td>A limited number of requests are allowed through to test whether the service has recovered. If they succeed, the breaker resets to Closed; otherwise, it returns to Open.</td></tr>
</tbody>
</table>
</div><p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1741507065850/0c6d08ca-3e1b-4c44-9926-e1ce94fb2978.png" alt class="image--center mx-auto" /></p>
<p>These states allow the system to adapt dynamically to real-time conditions, enabling both protective blocking during failures and cautious probing during recovery.</p>
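<p>To make the state transitions concrete, here is a minimal, framework-free Python sketch of the three states. The thresholds, timings, and class design are arbitrary choices for illustration; production libraries such as Resilience4j implement this far more robustly:</p>

```python
import time

class CircuitBreaker:
    """Toy three-state circuit breaker: Closed -> Open -> Half-Open."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold  # failures before tripping
        self.cooldown_seconds = cooldown_seconds    # how long to stay Open
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, func):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "HALF_OPEN"  # cooldown elapsed: probe cautiously
            else:
                raise RuntimeError("circuit is open, call rejected")
        try:
            result = func()
        except Exception:
            self._record_failure()
            raise
        self._record_success()
        return result

    def _record_success(self):
        self.failures = 0
        self.state = "CLOSED"             # recovery confirmed, resume traffic

    def _record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"           # trip: block subsequent requests
            self.opened_at = time.monotonic()
```

<p>With <code>failure_threshold=3</code>, three consecutive failures trip the breaker; after the cooldown, a single successful probe in Half-Open closes it again, while a failed probe reopens it immediately.</p>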
<h2 id="heading-supplementary-patterns">Supplementary patterns</h2>
<p>The base pattern above works fine in most cases, but sometimes requirements are more sophisticated. There are a couple of enhancements that can be applied to the base mechanism:</p>
<ul>
<li><p><strong>Retry pattern:</strong> Automatically reattempts operations after transient failures, reducing the probability of tripping the circuit breaker unnecessarily. But you have to be careful with it: a failed call does not prove that the target service never received the request. Perhaps a short network failure occurred while the response was on its way back, so retrying a non-idempotent operation may execute it twice.</p>
</li>
<li><p><strong>Timeout pattern:</strong> Enforces a maximum wait period for service responses, ensuring that delays result in prompt failure detection.</p>
</li>
<li><p><strong>Fallback pattern:</strong> Provides an alternative action when a service is unavailable, maintaining functionality despite disruptions. It serves as a safety net, ensuring that users still receive a response. There are many strategies for this pattern, and you can choose one depending on your current requirements:</p>
<ul>
<li><p><strong>Returning Default Data:</strong> Usually provides static data when the service is unavailable.</p>
</li>
<li><p><strong>Caching Results:</strong> Utilizes cached responses from previous successful calls.</p>
</li>
<li><p><strong>Redirecting Requests:</strong> Forwards requests to alternative services that can handle the operation.</p>
</li>
</ul>
</li>
</ul>
<p>Combining all of the above maximizes resource utilization and keeps applications responsive and robust even amid network fluctuations and service hiccups, effectively mitigating the various failure scenarios encountered in distributed systems. These complementary patterns work together to sustain uninterrupted service operation.</p>
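<p>As a rough illustration of how the retry and fallback patterns compose, here is a small, hypothetical helper in plain Java. The class and method names are made up for this sketch; real projects would typically use a library such as Resilience4j, which ships both decorators:</p>

```java
import java.util.function.Supplier;

// Retry an operation a few times; if every attempt fails, return a fallback.
// Deliberately simplified: production code would add backoff between attempts
// and distinguish transient errors from permanent ones before retrying.
class RetryWithFallback {

    static <T> T call(Supplier<T> operation, int maxAttempts, T fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();   // success: return immediately
            } catch (RuntimeException e) {
                // treat as transient and try again until attempts run out
            }
        }
        return fallback;                  // fallback: degrade gracefully
    }
}
```

<p>Wrapping such a call with the circuit breaker (check <code>allowRequest()</code> first, record the outcome after) yields the combined behaviour described above.</p>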
<h1 id="heading-tools-for-implementing-the-pattern"><strong>Tools for Implementing the pattern</strong></h1>
<p>Many libraries and tools simplify the implementation of the Circuit Breaker Pattern; here are a few I would recommend:</p>
<ul>
<li><p><strong>Netflix Hystrix:</strong> A well-established tool for Java applications that isolates points of access between services and provides robust fallback options. Note that Hystrix is now in maintenance mode, with Resilience4j as its recommended successor.</p>
</li>
<li><p><strong>Resilience4j:</strong> Lightweight, modular, and functional Java library for Java 8 and newer that offers extensive customization for circuit breaking.</p>
</li>
<li><p><strong>Spring Cloud Circuit Breaker:</strong> Provides a consistent API to various circuit breaker implementations, including Resilience4j.​</p>
</li>
<li><p><strong>Polly:</strong> .NET library that allows developers to implement resilience strategies, including retry and circuit breaker policies, in a fluent and thread-safe manner.</p>
</li>
<li><p><strong>Istio:</strong> Offers a more holistic approach by allowing us to enforce resilience policies at the network level, decoupling them from individual service implementations. It uses Envoy proxies to manage traffic between services, enabling the enforcement of circuit-breaking policies without altering application code. That is a crucial advantage: it provides resilience without implementing fault tolerance in the codebase, which is especially valuable when retrofitting the pattern onto legacy applications. Alternatively, you could use Linkerd.</p>
</li>
</ul>
<p>These tools are designed to integrate seamlessly with their respective platforms, allowing developers to focus on business logic rather than low-level error handling.</p>
<h2 id="heading-implementing-the-circuit-breaker-with-resilience4j-in-spring-boot">Implementing the Circuit Breaker with Resilience4j in Spring Boot</h2>
<p>As I have spent most of my career writing Java business applications, I would like to provide a comprehensive example of implementing the pattern using Resilience4j:</p>
<ol>
<li><p><strong>Add Dependencies:</strong> Include the Resilience4j starter in your <code>pom.xml</code>:​</p>
<pre><code class="lang-xml"> <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
     <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.springframework.cloud<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
     <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>spring-cloud-starter-circuitbreaker-resilience4j<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
 <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
</code></pre>
</li>
<li><p><strong>Configure the Circuit Breaker:</strong> Define properties in your <code>application.yml</code>:​</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">resilience4j:</span>
   <span class="hljs-attr">circuitbreaker:</span>
     <span class="hljs-attr">configs:</span>
       <span class="hljs-attr">default:</span>
         <span class="hljs-attr">registerHealthIndicator:</span> <span class="hljs-literal">true</span>
         <span class="hljs-attr">failureRateThreshold:</span> <span class="hljs-number">50</span>
         <span class="hljs-attr">waitDurationInOpenState:</span> <span class="hljs-string">10s</span>
         <span class="hljs-attr">permittedNumberOfCallsInHalfOpenState:</span> <span class="hljs-number">3</span>
         <span class="hljs-attr">slidingWindowSize:</span> <span class="hljs-number">20</span>
     <span class="hljs-attr">instances:</span>
       <span class="hljs-attr">myService:</span>
         <span class="hljs-attr">baseConfig:</span> <span class="hljs-string">default</span>
</code></pre>
</li>
<li><p><strong>Apply the Circuit Breaker:</strong> Use the <code>@CircuitBreaker</code> annotation in your service:​</p>
<pre><code class="lang-java"> <span class="hljs-meta">@Service</span>
 <span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">MyService</span> </span>{

     <span class="hljs-meta">@CircuitBreaker(name = "myService", fallbackMethod = "fallbackMethod")</span>
     <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">callExternalService</span><span class="hljs-params">()</span> </span>{
         <span class="hljs-comment">// Logic to call external service</span>
     }

     <span class="hljs-function"><span class="hljs-keyword">public</span> String <span class="hljs-title">fallbackMethod</span><span class="hljs-params">(Throwable throwable)</span> </span>{
         <span class="hljs-comment">// Fallback logic</span>
         <span class="hljs-keyword">return</span> <span class="hljs-string">"Fallback response"</span>;
     }
 }
</code></pre>
</li>
<li><p><strong>Create a Controller:</strong> to call the potentially unhealthy service, let’s create a controller:</p>
<pre><code class="lang-java"> <span class="hljs-meta">@GetMapping(value = "/get-my-service")</span>
 <span class="hljs-function"><span class="hljs-keyword">public</span> ResponseEntity&lt;Response&gt; <span class="hljs-title">getMyServicecall</span><span class="hljs-params">()</span> </span>{
    <span class="hljs-keyword">try</span> {
        String user = myServices.callExternalService();
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> ResponseEntity&lt;&gt;(<span class="hljs-keyword">new</span> Response(user,<span class="hljs-string">"external service called successfully"</span>, Boolean.TRUE), HttpStatus.OK);
    } <span class="hljs-keyword">catch</span> (Exception e) {
        <span class="hljs-keyword">return</span> <span class="hljs-keyword">new</span> ResponseEntity&lt;&gt;(<span class="hljs-keyword">new</span> Response(<span class="hljs-string">"Error during service call"</span>, Boolean.FALSE),
                HttpStatus.INTERNAL_SERVER_ERROR);
    }
 }
</code></pre>
</li>
<li><p><strong>Use Spring Actuator:</strong> add metrics to gain full insight into the breaker’s behaviour:<br /> First, add the dependency:</p>
<pre><code class="lang-xml"> <span class="hljs-tag">&lt;<span class="hljs-name">dependency</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">groupId</span>&gt;</span>org.springframework.boot<span class="hljs-tag">&lt;/<span class="hljs-name">groupId</span>&gt;</span>
    <span class="hljs-tag">&lt;<span class="hljs-name">artifactId</span>&gt;</span>spring-boot-starter-actuator<span class="hljs-tag">&lt;/<span class="hljs-name">artifactId</span>&gt;</span>
 <span class="hljs-tag">&lt;/<span class="hljs-name">dependency</span>&gt;</span>
</code></pre>
<p> Next, let’s configure Spring to expose circuit breaker health through the Actuator:</p>
<pre><code class="lang-yaml"> <span class="hljs-attr">management:</span>
   <span class="hljs-attr">endpoint.health.show-details:</span> <span class="hljs-string">always</span>
   <span class="hljs-attr">health.circuitbreakers.enabled:</span> <span class="hljs-literal">true</span>
</code></pre>
<p> Voilà! Now you can measure circuit breaker behaviour in your application via the health endpoint:</p>
<pre><code class="lang-json"> {
     <span class="hljs-attr">"status"</span>: <span class="hljs-string">"UP"</span>,
     <span class="hljs-attr">"components"</span>: {
         <span class="hljs-attr">"circuitBreakers"</span>: {
             <span class="hljs-attr">"status"</span>: <span class="hljs-string">"UNKNOWN"</span>,
             <span class="hljs-attr">"details"</span>: {
                 <span class="hljs-attr">"myService"</span>: {
                     <span class="hljs-attr">"status"</span>: <span class="hljs-string">"CIRCUIT_OPEN"</span>,
                     <span class="hljs-attr">"details"</span>: {
                         <span class="hljs-attr">"failureRate"</span>: <span class="hljs-string">"100.0%"</span>,
                         <span class="hljs-attr">"failureRateThreshold"</span>: <span class="hljs-string">"50.0%"</span>,
                         <span class="hljs-attr">"slowCallRate"</span>: <span class="hljs-string">"0.0%"</span>,
                         <span class="hljs-attr">"slowCallRateThreshold"</span>: <span class="hljs-string">"100.0%"</span>,
                         <span class="hljs-attr">"bufferedCalls"</span>: <span class="hljs-number">3</span>,
                         <span class="hljs-attr">"slowCalls"</span>: <span class="hljs-number">0</span>,
                         <span class="hljs-attr">"slowFailedCalls"</span>: <span class="hljs-number">0</span>,
                         <span class="hljs-attr">"failedCalls"</span>: <span class="hljs-number">3</span>,
                         <span class="hljs-attr">"notPermittedCalls"</span>: <span class="hljs-number">2</span>,
                         <span class="hljs-attr">"state"</span>: <span class="hljs-string">"OPEN"</span>
                     }
                 }
             }
         }
     }
 }
</code></pre>
</li>
</ol>
<h1 id="heading-real-world-and-personal-examples"><strong>Real-World and Personal Examples</strong></h1>
<p>In the real world there are many places where you can use the Circuit Breaker Pattern. Here are some common examples:</p>
<ul>
<li><p>A payment service in an online store often calls external gateways. If a gateway fails, the Circuit Breaker halts further attempts. Customers can still browse and place orders. Payment finalisation happens once the gateway is back.</p>
</li>
<li><p>Another scenario is a recommendation service in a streaming platform. It might query several external sources. If one source is slow or offline, the Circuit Breaker prevents repeated timeouts and offers fallback suggestions. This way, the main service remains usable.</p>
</li>
</ul>
<p>Beyond these common cases, over the years I have gained hands-on experience implementing and analyzing resilient architectures.</p>
<p>While working at one of the biggest banks in Europe, I used the pattern when a payment fee calculation provided by an external team kept failing. We built a mechanism that returned a sensible default value within the range of possible fees.</p>
<p>Another great example was when I integrated a document management system with an OCR solution. The company had several OCR engines with different levels of performance. When the better-performing engine failed, we kept operating by calling a second, slower engine. The back office was able to continue its work with reduced throughput. That was a crucial trade-off between working more slowly and waiting idle with a whole stack of documents to process.</p>
<h1 id="heading-final-thoughts">Final Thoughts</h1>
<p>The Circuit Breaker Pattern is a fundamental component in building resilient microservices architectures, preventing single-service failures from cascading into large-scale outages. By temporarily halting calls to an unstable or non-responsive service and gradually reintroducing them, the system remains protected against overload and maintains overall stability. It brings key strategic benefits to your system: overload prevention, failure isolation, and improved resilience and reliability.</p>
<p>By combining proactive failure detection with fallback strategies and leveraging powerful tools like Netflix Hystrix, Resilience4j, and Polly, architects and developers can build systems that are both robust and flexible.</p>
<p>Whether you’re dealing with payment gateways, recommendation engines, or any external integrations, applying the Circuit Breaker pattern ensures that a single failure doesn’t cripple your entire platform. Instead, the system intelligently adapts and recovers, offering users a stable and reliable experience.</p>
]]></content:encoded></item><item><title><![CDATA[Outbox Pattern]]></title><description><![CDATA[It was 2 AM, and I was staring at a production incident ticket titled “User payment processed, but order wasn’t created.” The culprit? Our team had naively implemented a dual write - updating the database and sending a message to Kafka in the same tra...]]></description><link>https://architectureway.dev/outbox-pattern</link><guid isPermaLink="true">https://architectureway.dev/outbox-pattern</guid><category><![CDATA[microservices patterns]]></category><category><![CDATA[Microservices]]></category><category><![CDATA[outbox pattern]]></category><category><![CDATA[data-consistency]]></category><category><![CDATA[eda]]></category><category><![CDATA[event-driven-architecture]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Konrad Jędrzejewski]]></dc:creator><pubDate>Mon, 03 Mar 2025 11:00:48 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746019077237/66621d4f-715d-41e6-81ae-c3de7089d305.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It was 2 AM, and I was staring at a production incident ticket titled <em>“User payment processed, but order wasn’t created.”</em> The culprit? Our team had naively implemented a <strong>dual write</strong> - updating the database and sending a message to Kafka in the same transaction, assuming both would magically succeed. They didn’t.</p>
<p>This incident sparked my deep dive into solving data consistency in microservices. Enter the <strong>Outbox Pattern</strong>, a lifeline for architects battling the chaos of distributed systems. In this article, I’ll share how this pattern rescued us from inconsistency hell, why it’s a cornerstone of microservice design, and how to implement it pragmatically in Java/Spring ecosystems.</p>
<h1 id="heading-why-traditional-approaches-fail"><strong>Why Traditional Approaches Fail</strong></h1>
<h2 id="heading-the-transactional-mirage"><strong>The Transactional Mirage</strong></h2>
<p>In monolithic systems, ACID transactions rule. But microservices? They’re a different beast. Early in my career, I assumed distributed transactions could save us. They didn’t. The overhead was brutal, and failure modes multiplied.</p>
<p>Take our e-commerce platform: when a user placed an order, we deducted inventory <em>and</em> emitted an <strong>OrderPlaced</strong> event. If the Kafka call failed after the DB commit, the system became inconsistent. Classic dual write problem.</p>
<h2 id="heading-outbox-pattern-to-the-rescue"><strong>Outbox Pattern to the Rescue</strong></h2>
<p>The Outbox Pattern elegantly sidesteps this by treating event publishing as part of the <em>same transaction</em> as the business operation. No more “hoping” both writes succeed—either both happen, or neither does.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740740901312/491cb7ba-5ce2-4ebd-85d0-d97fb500735a.png" alt class="image--center mx-auto" /></p>
<h1 id="heading-how-it-works-under-the-hood"><strong>How It Works Under the Hood</strong></h1>
<h2 id="heading-architecture-the-magic-of-an-outbox-table"><strong>Architecture: The Magic of an Outbox Table</strong></h2>
<p>The core idea is simple: add an <strong>outbox_table</strong> to your service’s database. When your application performs a business operation (e.g., saving an order), it also writes an event to this table—<em>atomically</em>.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> outbox (
  <span class="hljs-keyword">id</span> <span class="hljs-keyword">UUID</span> PRIMARY <span class="hljs-keyword">KEY</span>,
  aggregate_type <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  event_type <span class="hljs-built_in">VARCHAR</span>(<span class="hljs-number">255</span>),
  payload JSONB,
  created_at <span class="hljs-built_in">TIMESTAMP</span>
);
</code></pre>
<p>In Java/Spring, this translates to a JPA entity (<code>@Entity</code>) that’s persisted within the same <code>@Transactional</code> boundary as your domain logic.</p>
<h2 id="heading-relay-the-unsung-hero"><strong>Relay: The Unsung Hero</strong></h2>
<p>Then, an independent component—often called the <em>relay</em>—regularly polls the outbox table, retrieves unprocessed messages, and sends them to the message broker (such as Kafka or RabbitMQ). In my projects, I consider two strategies for retrieving messages:</p>
<ul>
<li><p><em>Polling</em>: Simple (e.g., a Spring <code>@Scheduled</code> task), but adds latency and load.</p>
</li>
<li><p><em>CDC</em>: More efficient (Debezium/Qlik streams database changes), but requires setup.<br />  We chose CDC for scalability, but I’ve seen polling work well in smaller systems.</p>
</li>
</ul>
<h1 id="heading-but-how">But How?</h1>
<p>In practice, implementing the Outbox Pattern in a Java/Spring Boot environment is not complicated—provided that the transactions are well designed. In my projects, I typically proceed as follows:</p>
<ul>
<li><p><strong>Message Writing:</strong> In a method annotated with <code>@Transactional</code>, I first persist the domain changes (e.g., an order) and then insert a record into the outbox table.</p>
</li>
<li><p><strong>Message Processing:</strong> A separate process (e.g., a scheduled task using <code>@Scheduled</code>) periodically scans the outbox table, sends messages to the message broker, and removes them upon successful delivery.</p>
</li>
<li><p><strong>Idempotence:</strong> This is crucial—we must ensure that processing messages is idempotent to avoid issues with duplicate event delivery.</p>
</li>
<li><p><strong>Housekeeping:</strong> Regularly cleaning up the outbox table is important to prevent it from growing uncontrollably, which could impact database performance.</p>
</li>
</ul>
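<p>The idempotence point deserves a concrete sketch. With at-least-once delivery, the same outbox event can be handled twice, so consumers typically remember the IDs of events they have already processed and skip duplicates. The class below is a simplified, in-memory illustration (names are made up; in production the ID check would be a database table updated in the same transaction as the business change):</p>

```java
import java.util.HashSet;
import java.util.Set;

// Idempotent consumer sketch: remember processed event IDs so that a
// redelivered outbox event does not trigger the business side effect twice.
class IdempotentConsumer {
    private final Set<String> processed = new HashSet<>();
    private int sideEffects = 0;

    // Returns true when the event was handled, false when it was a duplicate.
    boolean handle(String eventId) {
        if (!processed.add(eventId)) {
            return false;          // already handled: skip the duplicate
        }
        sideEffects++;             // the real business logic would run here
        return true;
    }

    int sideEffects() { return sideEffects; }
}
```
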
<p>Below are two examples demonstrating two different approaches: one using polling and another using CDC.</p>
<h2 id="heading-implementation-using-polling">Implementation Using Polling</h2>
<p>In this approach, we define a scheduled task (using the <code>@Scheduled</code> annotation) that periodically retrieves unprocessed records from the outbox table, sends them to a message broker (e.g., Kafka), and deletes them upon confirmation. Here’s an example:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Service</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">OutboxPollingService</span> </span>{

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">static</span> <span class="hljs-keyword">final</span> Logger LOGGER = LoggerFactory.getLogger(OutboxPollingService.class);

    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> OutboxRepository outboxRepository;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> KafkaTemplate&lt;String, String&gt; kafkaTemplate;
    <span class="hljs-keyword">private</span> <span class="hljs-keyword">final</span> ObjectMapper objectMapper;

    <span class="hljs-meta">@Value("${app.outbox.topic}")</span>
    <span class="hljs-keyword">private</span> String outboxTopic;

    <span class="hljs-meta">@Autowired</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-title">OutboxPollingService</span><span class="hljs-params">(OutboxRepository outboxRepository,
                                KafkaTemplate&lt;String, String&gt; kafkaTemplate,
                                ObjectMapper objectMapper)</span> </span>{
        <span class="hljs-keyword">this</span>.outboxRepository = outboxRepository;
        <span class="hljs-keyword">this</span>.kafkaTemplate = kafkaTemplate;
        <span class="hljs-keyword">this</span>.objectMapper = objectMapper;
    }

    <span class="hljs-comment">// This task runs every 5 seconds</span>
    <span class="hljs-meta">@Scheduled(fixedRateString = "5000")</span>
    <span class="hljs-meta">@Transactional</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> <span class="hljs-keyword">void</span> <span class="hljs-title">processOutboxEvents</span><span class="hljs-params">()</span> </span>{
        List&lt;OutboxEvent&gt; events = StreamSupport
                .stream(outboxRepository.findAll().spliterator(), <span class="hljs-keyword">false</span>)
                .collect(Collectors.toList());
        LOGGER.info(<span class="hljs-string">"Found {} outbox events"</span>, events.size());

        <span class="hljs-keyword">for</span> (OutboxEvent event : events) {
            <span class="hljs-keyword">try</span> {
                String payload = objectMapper.writeValueAsString(event.getPayload());
                kafkaTemplate.send(outboxTopic, event.getEventType(), payload).get();
                <span class="hljs-comment">// After a successful send, delete the record from the outbox table</span>
                outboxRepository.delete(event);
            } <span class="hljs-keyword">catch</span> (Exception e) {
                LOGGER.error(<span class="hljs-string">"Error processing outbox event {}: {}"</span>, event.getId(), e.getMessage());
                <span class="hljs-comment">// The record remains in the table for retry in the next cycle</span>
            }
        }
    }
}
</code></pre>
<p>In the example above, the <code>processOutboxEvents()</code> method is executed periodically, retrieves all records from the outbox repository, sends each message to Kafka, and then removes the record upon successful delivery. This polling-based implementation is simple and easy to understand.</p>
<h2 id="heading-implementation-using-cdc">Implementation Using CDC</h2>
<p>When, in a high-throughput system processing 10k+ orders per minute, polling becomes a bottleneck, we can use Change Data Capture (CDC) technology to detect changes in the outbox table almost immediately. One popular tool for this purpose is Debezium. Ideally, a CDC solution is deployed as a separate component close to the database to capture changes in near real-time while preserving event order. However, for demonstration purposes, the following example shows how to implement CDC within a Spring Boot application using the Debezium Embedded Engine.</p>
<p>Below is an example configuration for Debezium Embedded Engine:</p>
<pre><code class="lang-java"><span class="hljs-meta">@Configuration</span>
<span class="hljs-keyword">public</span> <span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">DebeziumConfig</span> </span>{

    <span class="hljs-meta">@Bean</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> EmbeddedEngine <span class="hljs-title">debeziumEngine</span><span class="hljs-params">()</span> </span>{
        Properties props = <span class="hljs-keyword">new</span> Properties();
        <span class="hljs-comment">// Basic configuration for PostgreSQL (example)</span>
        props.setProperty(<span class="hljs-string">"name"</span>, <span class="hljs-string">"outbox-connector"</span>);
        props.setProperty(<span class="hljs-string">"connector.class"</span>, <span class="hljs-string">"io.debezium.connector.postgresql.PostgresConnector"</span>);
        props.setProperty(<span class="hljs-string">"database.hostname"</span>, <span class="hljs-string">"localhost"</span>);
        props.setProperty(<span class="hljs-string">"database.port"</span>, <span class="hljs-string">"5432"</span>);
        props.setProperty(<span class="hljs-string">"database.user"</span>, <span class="hljs-string">"postgres"</span>);
        props.setProperty(<span class="hljs-string">"database.password"</span>, <span class="hljs-string">"postgres"</span>);
        props.setProperty(<span class="hljs-string">"database.dbname"</span>, <span class="hljs-string">"microservices"</span>);
        props.setProperty(<span class="hljs-string">"database.server.name"</span>, <span class="hljs-string">"dbserver1"</span>);
        <span class="hljs-comment">// Limit monitoring only to the outbox table</span>
        props.setProperty(<span class="hljs-string">"table.include.list"</span>, <span class="hljs-string">"public.outbox"</span>);
        <span class="hljs-comment">// Other CDC-specific settings</span>
        props.setProperty(<span class="hljs-string">"plugin.name"</span>, <span class="hljs-string">"pgoutput"</span>);
        props.setProperty(<span class="hljs-string">"slot.name"</span>, <span class="hljs-string">"debezium_slot"</span>);
        <span class="hljs-comment">// The embedded engine also needs offset storage to track its position;</span>
        <span class="hljs-comment">// a file-based store is shown here for demonstration purposes</span>
        props.setProperty(<span class="hljs-string">"offset.storage"</span>, <span class="hljs-string">"org.apache.kafka.connect.storage.FileOffsetBackingStore"</span>);
        props.setProperty(<span class="hljs-string">"offset.storage.file.filename"</span>, <span class="hljs-string">"/tmp/outbox-offsets.dat"</span>);
        props.setProperty(<span class="hljs-string">"offset.flush.interval.ms"</span>, <span class="hljs-string">"10000"</span>);

        <span class="hljs-keyword">return</span> EmbeddedEngine.create()
            .using(props)
            .notifying(<span class="hljs-keyword">this</span>::handleChangeEvent)
            .build();
    }

    <span class="hljs-comment">// Method to handle CDC events</span>
    <span class="hljs-function"><span class="hljs-keyword">private</span> <span class="hljs-keyword">void</span> <span class="hljs-title">handleChangeEvent</span><span class="hljs-params">(SourceRecord record)</span> </span>{
        <span class="hljs-comment">// Extract event data and send message to Kafka, for example</span>
        Map&lt;String, ?&gt; sourcePartition = record.sourcePartition();
        Map&lt;String, ?&gt; sourceOffset = record.sourceOffset();
        String topic = record.topic();
        Object value = record.value();

        System.out.println(<span class="hljs-string">"Debezium event received: "</span> + value);
        <span class="hljs-comment">// Add logic here to send the event using kafkaTemplate</span>
    }

    <span class="hljs-comment">// Run Debezium Engine on application startup</span>
    <span class="hljs-meta">@Bean</span>
    <span class="hljs-function"><span class="hljs-keyword">public</span> ApplicationRunner <span class="hljs-title">runner</span><span class="hljs-params">(EmbeddedEngine engine)</span> </span>{
        <span class="hljs-keyword">return</span> args -&gt; {
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine);
        };
    }
}
</code></pre>
<p>In this configuration, Debezium listens for changes in the <em>outbox</em> table. When a new record (i.e., a message saved within a transaction) appears, Debezium triggers the <code>handleChangeEvent()</code> method, where you can process the event—for instance, sending it to Kafka using a <code>KafkaTemplate</code>. With CDC, we eliminate the delay associated with periodic polling, and changes are detected almost in real time.</p>
<p>Both approaches have their advantages and drawbacks. The polling mechanism is simpler to implement and may suffice for systems with lower loads, while CDC provides near-real-time detection of changes and better event order preservation, which is crucial for large-scale microservices systems.</p>
<h1 id="heading-what-the-good-and-bad-things-about">The Good and the Bad</h1>
<p>When it comes to implementing an EDA style, the Outbox Pattern offers a robust solution for ensuring data consistency and reliable communication across services. One of the most significant benefits is that it guarantees that messages are only sent if the corresponding database transaction commits successfully. This "at-least-once delivery" mechanism minimizes the risk of inconsistencies between the state stored in the database and the messages sent to external systems. Furthermore, by decoupling the business logic from the actual communication process, the pattern simplifies the service’s core responsibilities. This separation allows the system to scale more effectively and increases overall resilience, since the asynchronous nature of message processing ensures that temporary failures in the messaging infrastructure do not immediately impact the business logic.</p>
<p>On the other hand, adopting the Outbox Pattern does introduce additional complexity into the system. Developers must manage an extra outbox table and implement mechanisms for polling or change data capture. Ensuring that messages are processed in an idempotent manner is also a critical practice. In a highly asynchronous environment, duplicate or out-of-order message processing can easily occur, potentially leading to inconsistencies across the system. There is also a potential overhead on the database due to the storage and regular cleanup of outbox records. Moreover, whether you opt for a polling method or CDC, you might experience a slight delay in message delivery, which may or may not be acceptable given your application's performance requirements.</p>
<p>Overall, while the Outbox Pattern brings added architectural complexity and may impose some performance considerations, its ability to enforce data consistency and decouple critical system components often makes it a valuable approach for building robust EDA style microservices.</p>
<h1 id="heading-final-thoughts">Final thoughts</h1>
<p>In general, the Outbox Pattern is highly effective in any scenario where a reliable exchange of information between independent services is critical. Many systems—regardless of their domain—benefit from the pattern when they need to ensure that state changes are propagated consistently and asynchronously. For example, any system that processes transactions, whether it's handling orders, processing payments, or updating user statuses, can utilize the Outbox Pattern to guarantee that the subsequent events reflecting these changes are delivered reliably to other services.</p>
<p>This approach is particularly valuable in event-driven architectures where multiple downstream processes depend on the successful completion of a business transaction. By recording events within the same transaction that updates the core data, and then processing these events asynchronously, the pattern supports workflows that involve notifications, integrations with external systems, and the triggering of further business logic. In essence, the Outbox Pattern serves as a foundational mechanism for coordinating complex, interdependent operations in a distributed system, regardless of the specific business domain.</p>
<p>I hope that sharing my experiences and insights helps you better understand and implement the Outbox Pattern in your microservices projects. If you have any questions or would like to share your experiences, please feel free to leave a comment - I always appreciate exchanging ideas with fellow professionals.</p>
<p>Thank you for reading, and see you in the next discussion on microservices patterns series!</p>
]]></content:encoded></item><item><title><![CDATA[Eventual Consistency in Microservices]]></title><description><![CDATA[Data Consistency - why is it so important?
Achieving perfect consistency can be challenging and often unnecessary, especially in the context of microservice or Event-Driven Architecture styles. Unlike immediate consistency, which ensures tha...]]></description><link>https://architectureway.dev/eventual-consistency-in-microservices</link><guid isPermaLink="true">https://architectureway.dev/eventual-consistency-in-microservices</guid><category><![CDATA[Microservices]]></category><category><![CDATA[consistency]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[event-driven-architecture]]></category><category><![CDATA[System Design]]></category><category><![CDATA[CAP-Theorem]]></category><category><![CDATA[scalability]]></category><category><![CDATA[Resilience]]></category><category><![CDATA[high availability]]></category><dc:creator><![CDATA[Konrad Jędrzejewski]]></dc:creator><pubDate>Sun, 23 Feb 2025 23:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1746019119241/7b773fdb-3346-487c-8050-7b1715c0c5e9.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1 id="heading-data-consitency-why-it-is-so-important">Data Consistency - why is it so important?</h1>
<p>Achieving perfect consistency can be challenging and often unnecessary, especially in microservice or Event-Driven Architecture styles. Unlike immediate consistency, which ensures that all nodes in a system reflect the same state instantly, eventual consistency offers a balance between availability and reliability: given enough time, all components eventually synchronize. In other words, it is a foundation of modern scalable architectures.</p>
<p>So why is eventual consistency so important? Regardless of communication style, waiting for all components to become perfectly synchronized takes time, can reduce performance, and can hurt system availability. Eventual consistency allows systems to handle large volumes of data and requests without breaking service SLAs or sacrificing the user experience.</p>
<p>In this article, we’ll explore what eventual consistency is, why it’s essential for microservices, and how it’s implemented in real-world systems. Regardless of the role you play on a project, understanding eventual consistency is critical for designing resilient and scalable applications. Let’s dive into the details and see how this principle shapes the systems we rely on every day.</p>
<h1 id="heading-eventual-consistency-crucial-in-microservices-world">Why Eventual Consistency Is Crucial in the Microservices World</h1>
<p>In short, microservice architecture breaks an application down into smaller services that communicate with each other to tackle complex business workflows. These components are frequently deployed across multiple servers, containers, or even data centers, each managing its own data in formats ranging from SQL and NoSQL databases to in-memory caches. Synchronizing all of them in real time isn’t just complex: it can also slow down the system and reduce overall reliability.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740175993758/32cf7341-3965-4bb4-934c-7b1bdaa7b5a9.png" alt class="image--center mx-auto" /></p>
<p>One reason is rooted in the <a target="_blank" href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing">Fallacies of distributed computing</a>, which remind us that we cannot assume the network is always reliable, latency is zero, or bandwidth is infinite. Networks might experience failures, variable latencies, and bandwidth constraints. Trying to maintain strict global data consistency in such an environment exposes you to real risks:</p>
<ul>
<li><p>Performance Bottlenecks: Global locks or coordination points increase response times, frustrating users and taxing infrastructure.</p>
</li>
<li><p>Reduced Resilience: A single slow or unreachable service can stall the entire system if strict consistency forces all other services to wait.</p>
</li>
<li><p>Increased Complexity: Protocols like two-phase commit become a tangled web of coordination, complicating both development and operations.</p>
</li>
</ul>
<p>That's why architecture styles like microservices rely on eventual consistency. It allows each service to operate independently, updating its state as soon as it receives data-update events. Over time, every service processes the updates and the system aligns with a single source of truth.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740131579206/1a28a9a2-0992-489f-a978-9d511f6224d9.png" alt class="image--center mx-auto" /></p>
<p>Eric Brewer proposed a theory that became a fundamental concept of distributed computing. The <a target="_blank" href="https://en.wikipedia.org/wiki/CAP_theorem">CAP Theorem</a> states that a distributed system or data store can guarantee at most two of the following three capabilities:</p>
<ul>
<li><p>Consistency (C): All nodes see the same data at the same time.</p>
</li>
<li><p>Availability (A): Every request receives a response—even if some nodes are down.</p>
</li>
<li><p>Partition Tolerance (P): The system continues operating despite network failures that split it into multiple parts.</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740469321426/f49dbcc2-239e-4470-a7ad-a0927c1474a5.png" alt class="image--center mx-auto" /></p>
<p>Given the distributed nature of microservices architecture, partition tolerance is a non-negotiable capability, as network outages are inevitable. That forces architects to choose between consistency and availability. User experience, which requires the system to keep working smoothly, is the main factor that tips the choice toward availability.</p>
<p>By relaxing strict global consistency, you gain availability and performance. Requests are processed quickly and users see faster responses, while the system resolves data inconsistencies in the background. At some point, all services converge to a consistent state.</p>
<p>When deciding whether to use eventual consistency, a key question is how tolerant your application is to short-lived data discrepancies. For instance, in an e-commerce platform, a user might briefly see an outdated inventory count if the stock microservice hasn’t yet propagated the latest change. However, this delay is often acceptable because it enables higher availability and faster overall response times. Similarly, social media apps can afford a momentary gap before a new “like” appears on all friends’ feeds, prioritizing uninterrupted interactions over perfectly synchronized data. On the other hand, systems that handle critical data (e.g., financial transactions requiring an immediately consistent global state) may need stronger guarantees. By weighing the importance of immediate accuracy against the cost of coordination overhead—such as performance bottlenecks and reduced resilience—you can decide whether eventual consistency strikes the right balance for your particular use case.</p>
<p>This approach, rooted in an understanding of both the Fallacies of Distributed Computing and the CAP theorem, underpins the design of resilient microservices. It acknowledges that networks are inherently unreliable and ensures the system continues to function effectively, ultimately delivering a better user experience through eventual consistency.</p>
<h1 id="heading-mechanisms-ensuring-eventual-consistency">Mechanisms Ensuring Eventual Consistency</h1>
<p>As the discussion above suggests, designing microservices around eventual consistency means embracing asynchronous data flows while accepting that certain updates take time to propagate. Below are common mechanisms and architectural patterns that help achieve an eventually consistent system:</p>
<ul>
<li><p>Asynchronous Communication</p>
</li>
<li><p>Sagas (Distributed Transactions)</p>
</li>
<li><p>Event Sourcing</p>
</li>
<li><p>CQRS (Command Query Responsibility Segregation)</p>
</li>
</ul>
<p>By applying these techniques, each service can operate at its own pace, and the entire system converges to a consistent state over time.</p>
<h2 id="heading-asynchronous-communication">Asynchronous communication</h2>
<p>One of the most straightforward ways to achieve eventual consistency is by employing asynchronous messaging, where services communicate through message brokers like RabbitMQ or Apache Kafka. This approach allows microservices to publish events whenever they update data, while other services consume these events at their own pace and update their local state accordingly. In a publish/subscribe pattern, for example, a microservice that owns a specific domain—such as orders—emits events whenever there’s a change. Multiple subscribers, like inventory or billing services, then receive these events and refresh their respective data stores asynchronously. This results in minimal blocking, as services do not wait on each other to confirm every update. Instead, they rely on the eventual arrival of messages, leading to a loosely coupled architecture where each service can scale and deploy independently without disrupting the entire data flow.</p>
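<p>The publish/subscribe flow above can be sketched with a tiny in-memory event bus. This is a minimal illustration, not a real broker: in production the bus would be Kafka or RabbitMQ, and the topic and service names here are purely illustrative.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Minimal in-memory event bus illustrating publish/subscribe decoupling.
public class EventBusDemo {
    static class EventBus {
        private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

        void subscribe(String topic, Consumer<String> handler) {
            subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
        }

        void publish(String topic, String payload) {
            // The publisher does not wait for subscribers to finish their own work.
            subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(payload));
        }
    }

    public static void main(String[] args) {
        EventBus bus = new EventBus();
        List<String> inventoryLog = new ArrayList<>();
        List<String> billingLog = new ArrayList<>();

        // Each "service" updates its own local state when an event arrives.
        bus.subscribe("orders", e -> inventoryLog.add("reserve stock for " + e));
        bus.subscribe("orders", e -> billingLog.add("invoice " + e));

        bus.publish("orders", "order-42");

        System.out.println(inventoryLog); // [reserve stock for order-42]
        System.out.println(billingLog);   // [invoice order-42]
    }
}
```

<p>The key property to notice is that the order-owning publisher knows nothing about its subscribers; adding a new downstream consumer requires no change to the publisher.</p>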
<h2 id="heading-event-sourcing">Event sourcing</h2>
<p>Event Sourcing captures every change to an application’s state as a sequence of immutable events. Rather than updating rows in a database, each action—such as <code>OrderPlaced</code> or <code>PaymentReceived</code>—is recorded in an event store. A service’s current state can then be reconstructed by replaying all relevant events in the correct order, which proves invaluable when data becomes inconsistent or corrupted, as you can reprocess the event log to restore a consistent view. This approach also offers robust auditability; because every state change is preserved, it’s far simpler to troubleshoot production issues or roll back to a particular point in time. Moreover, other services can subscribe to the event store to ensure they’re kept informed of changes, allowing the system as a whole to converge on the true state without requiring immediate, synchronous updates between all components.</p>
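<p>A minimal sketch of the idea in plain Java: an append-only log plus a replay function that folds events into the current state. The event names <code>PaymentReceived</code> and <code>RefundIssued</code> are illustrative; a real event store such as EventStore would add persistence, streams, and subscriptions.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of event sourcing: state is never stored directly, only derived
// by replaying an append-only log of immutable events.
public class EventSourcingDemo {
    record Event(String type, int amount) {}

    static class EventStore {
        private final List<Event> log = new ArrayList<>();
        void append(Event e) { log.add(e); }               // append-only, never updated
        List<Event> stream() { return List.copyOf(log); }  // immutable view for replay
    }

    // Rebuild the current balance by folding over the full event history.
    static int replayBalance(List<Event> events) {
        int balance = 0;
        for (Event e : events) {
            switch (e.type()) {
                case "PaymentReceived" -> balance += e.amount();
                case "RefundIssued"    -> balance -= e.amount();
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        EventStore store = new EventStore();
        store.append(new Event("PaymentReceived", 100));
        store.append(new Event("PaymentReceived", 50));
        store.append(new Event("RefundIssued", 30));
        System.out.println(replayBalance(store.stream())); // 120
    }
}
```

<p>Because the log is never mutated, replaying it to any prefix gives the state at that point in time, which is what makes the auditability and recovery properties described above possible.</p>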
<h2 id="heading-saga-pattern">Saga pattern</h2>
<p>When a business process spans multiple services and each step must either succeed or be compensated, the Saga pattern enables consistency without resorting to a global transaction lock. Two common approaches to implementing Sagas are orchestration and choreography. In the orchestration model, a central coordinator oversees the entire process, issuing commands to relevant microservices and triggering compensating transactions if any step fails. In the choreography model, each service listens for events and reacts by performing its own work and then emitting additional events, allowing the workflow to emerge organically without a single orchestrator. Sagas are resilient to failures because compensating actions can be sent to reverse or adjust previous steps. This design also lends itself to scalability, as each microservice independently handles its part of the overall transaction logic, eliminating the need for a monolithic transaction engine.</p>
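<p>The orchestration variant can be sketched as a loop that remembers every completed step and, on the first failure, runs the compensations in reverse order. This is a toy model of the pattern, not Axon or Camunda, and the step names are made up.</p>

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch of an orchestrated saga: completed steps are tracked so that a
// failure triggers their compensating actions, newest first.
public class SagaDemo {
    record Step(String name, boolean succeeds) {}

    static List<String> runSaga(List<Step> steps) {
        List<String> actions = new ArrayList<>();
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (step.succeeds()) {
                actions.add("do:" + step.name());
                completed.push(step);
            } else {
                actions.add("fail:" + step.name());
                // Compensate already-completed steps in reverse order.
                while (!completed.isEmpty()) {
                    actions.add("undo:" + completed.pop().name());
                }
                return actions;
            }
        }
        return actions;
    }

    public static void main(String[] args) {
        List<String> trace = runSaga(List.of(
                new Step("reserveStock", true),
                new Step("chargePayment", true),
                new Step("bookShipment", false)));
        System.out.println(trace);
        // [do:reserveStock, do:chargePayment, fail:bookShipment,
        //  undo:chargePayment, undo:reserveStock]
    }
}
```

<p>Note that there is no global lock: each step commits locally, and consistency is restored by compensation rather than rollback.</p>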
<h2 id="heading-cqrs">CQRS</h2>
<p>CQRS splits the way an application handles writes (commands) and reads (queries). The Command side is responsible for processing write operations, such as creating orders or updating user data, and it often incorporates Event Sourcing to record changes. Meanwhile, the Query side is optimized for reading data by maintaining materialized views or projections that are updated asynchronously in response to events originating from the Command side.</p>
<p>By separating these responsibilities, CQRS brings clear benefits in terms of performance and scalability, as each side can be scaled independently and can even use different database technologies suited to its specific tasks. However, this separation also introduces eventual consistency: because the Query side updates are asynchronous, users may occasionally encounter slightly outdated information. Despite this temporary lag, the system remains highly available and responsive, ultimately converging to a consistent state.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740469340829/8b84ca89-1620-4647-a781-f1332172b26c.png" alt class="image--center mx-auto" /></p>
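<p>A compact sketch of the split: the command side appends <code>OrderPlaced</code> events, and the query side maintains a denormalized read model refreshed from them. In a real deployment the projection updates asynchronously; here the refresh is an explicit call so the stale-read window is visible.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of CQRS: writes go through the command side; reads are served
// from a separate, eventually refreshed read model.
public class CqrsDemo {
    record OrderPlaced(String customer) {}

    static class CommandSide {
        final List<OrderPlaced> events = new ArrayList<>();
        void placeOrder(String customer) { events.add(new OrderPlaced(customer)); }
    }

    static class QuerySide {
        final Map<String, Integer> ordersPerCustomer = new HashMap<>();
        private int processed = 0;

        // Projection: fold any new events into the read model.
        void refreshFrom(CommandSide commands) {
            List<OrderPlaced> events = commands.events;
            for (; processed < events.size(); processed++) {
                ordersPerCustomer.merge(events.get(processed).customer(), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        CommandSide commands = new CommandSide();
        QuerySide queries = new QuerySide();

        commands.placeOrder("alice");
        // Before the projection catches up, the read side is stale:
        System.out.println(queries.ordersPerCustomer.getOrDefault("alice", 0)); // 0

        queries.refreshFrom(commands); // eventual convergence
        System.out.println(queries.ordersPerCustomer.get("alice")); // 1
    }
}
```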
<h1 id="heading-examples-from-microservices">Examples from Microservices</h1>
<p>In an e-commerce system, consider a scenario where one microservice manages inventory while another handles payments. As traffic increases, maintaining accurate product availability across these services becomes critical. The Microservices architecture allows each service to independently manage its domain, while Event-Driven Architecture (EDA) facilitates asynchronous communication between them. Changes in inventory availability are communicated as events asynchronously, ensuring scalability and responsiveness. However, this asynchronous nature can occasionally result in temporary delays, where the inventory status appears "out of sync" in the shopping cart during checkout. Technologies such as Spring Boot for rapid microservice development and Apache Kafka for event streaming and message brokering are commonly used in the Java ecosystem to ensure seamless communication and eventual consistency between these microservices.</p>
<p>Similarly, in a reservation platform handling ticket bookings for flights or hotels through different services, the short time window during which a ticket might appear available can lead to concurrency issues. The Saga pattern within an Event-Driven Architecture allows microservices to manage distributed transactions effectively. Events are propagated asynchronously, enabling services to coordinate booking processes across multiple steps while ensuring data consistency and resilience against failures. Tools like Axon Framework or Camunda provide the necessary orchestration capabilities to handle complex workflows across distributed components.</p>
<p>In the context of a social media application, where microservices handle user profiles, messaging, and activity feeds, Event-Driven Architecture principles ensure real-time updates and interactions. Asynchronous notifications play a crucial role in propagating status updates efficiently across the platform. Technologies such as Spring Cloud Stream for event-driven microservices and Redis for caching and pub/sub messaging enable seamless communication and scalability. By adopting both Microservices and Event-Driven Architecture styles, the application supports dynamic scalability, responsiveness to user interactions, and robust data synchronization across distributed services.</p>
<p>These examples illustrate how integrating both Microservices and Event-Driven Architecture styles, along with relevant technologies in the Java ecosystem, addresses complex challenges of scalability, real-time data synchronization, and resilience in modern distributed systems.</p>
<h1 id="heading-when-to-use-strong-consistency">When to Use Strong Consistency</h1>
<p>While eventual consistency offers advantages in scalability and availability, there are specific scenarios where strong consistency is essential to ensure data integrity and correctness:</p>
<ul>
<li><p>Financial Transactions: In banking and financial sectors, transactions must be accurate and immediate to prevent discrepancies and ensure the reliability of monetary exchanges. Strong consistency guarantees that all transactions are processed in real time and reflect accurate balances across all systems.</p>
</li>
<li><p>Regulatory Compliance (Healthcare, Government Applications): Industries governed by strict regulations, such as healthcare and government, require strong consistency to maintain compliance. Accurate and auditable records are crucial for patient care, legal compliance, and government reporting. Strong consistency ensures that data updates are immediately reflected and verifiable.</p>
</li>
<li><p>Real-time Data Analytics (Immediate Data Accuracy for Decision-Making): Applications that rely on real-time data analytics, such as stock trading platforms or predictive analytics tools, require strong consistency to ensure that decision-making processes are based on the most up-to-date information. Immediate data accuracy is critical for making informed decisions quickly and effectively.</p>
</li>
<li><p>Inventory Control Systems (Preventing Overselling or Discrepancies): Systems that track stock levels need strong consistency to avoid overselling items or accumulating discrepancies between recorded and actual availability.</p>
</li>
</ul>
<p>In these contexts, strong consistency ensures that all nodes in the system see the same data at the same time, providing immediate data accuracy and preventing conflicts that could arise from concurrent updates. While strong consistency may come with trade-offs in terms of latency and availability under network partitions, its use is essential in applications where data correctness and integrity are paramount.</p>
<h1 id="heading-dealing-with-inconsistencies">Dealing with Inconsistencies</h1>
<p>Handling data inconsistencies is a crucial aspect of designing resilient microservices architectures. Various approaches and strategies can mitigate the challenges posed by eventual consistency:</p>
<ul>
<li><p>Approaches to Conflict Detection: Implement mechanisms to detect conflicts when multiple services update the same data concurrently. Techniques such as versioning—using timestamps, optimistic locking, or vector clocks—help identify and resolve conflicts by ensuring that updates are applied in the correct sequence.</p>
</li>
<li><p>Updating Data Based on Versioning: Utilize optimistic locking or vector clocks to manage concurrent updates effectively. These approaches allow services to determine the order of updates and ensure data consistency by preventing conflicting changes from being applied simultaneously.</p>
</li>
<li><p>Compensation Strategies: Implement sagas with compensating transactions to rectify inconsistent states caused by failed operations or conflicting updates. For example, if a payment transaction fails after deducting funds from an account, a compensating transaction can be executed to refund the deducted amount and restore data integrity.</p>
</li>
<li><p>Timeouts and Retries: Set timeouts and implement retry mechanisms for operations that rely on eventual consistency. Timeout settings define the maximum duration a service waits for a response before considering the operation failed, while retry strategies ensure eventual success by reattempting actions until they are completed or an error condition persists.</p>
</li>
<li><p>Ensuring Idempotency: A critical practice in distributed systems is designing operations to be idempotent. Idempotency ensures that performing the same operation multiple times yields the same result as executing it once. This is especially important for retry mechanisms; if a service receives duplicate messages or retries an operation due to transient failures, idempotency prevents these repetitions from causing unintended side effects or data anomalies.</p>
</li>
</ul>
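<p>The versioning and idempotency points above can be combined in one small handler sketch: a processed-message-id set drops duplicate deliveries, and an optimistic version check rejects stale updates. Field and message names are illustrative.</p>

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an idempotent event handler with optimistic versioning.
public class IdempotentHandlerDemo {
    static class InventoryProjection {
        private final Set<String> processedMessageIds = new HashSet<>();
        int stock = 0;
        long version = 0;

        // Returns true if the update was applied, false if it was a
        // duplicate delivery or carried a stale version.
        boolean apply(String messageId, long eventVersion, int newStock) {
            if (!processedMessageIds.add(messageId)) return false; // duplicate retry
            if (eventVersion <= version) return false;             // out-of-date event
            stock = newStock;
            version = eventVersion;
            return true;
        }
    }

    public static void main(String[] args) {
        InventoryProjection p = new InventoryProjection();
        System.out.println(p.apply("m1", 1, 10)); // true  - applied
        System.out.println(p.apply("m1", 1, 10)); // false - duplicate delivery ignored
        System.out.println(p.apply("m2", 1, 99)); // false - stale version rejected
        System.out.println(p.stock);              // 10
    }
}
```

<p>Because the duplicate delivery is a no-op, an at-least-once broker can safely redeliver this message any number of times.</p>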
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740470990304/7c6cfdad-a3aa-457d-80e1-f54313fa0e1e.png" alt class="image--center mx-auto" /></p>
<p>These strategies are essential for maintaining data consistency and reliability in distributed systems where services operate independently and communicate asynchronously. By adopting robust conflict detection, versioning mechanisms, compensating transactions, effective retry strategies, and ensuring idempotency, microservices can mitigate the risks associated with eventual consistency and maintain reliable operation under varying conditions.</p>
<h1 id="heading-best-design-practices-for-eventual-consistency-in-microservices">Best Design Practices for Eventual Consistency in Microservices</h1>
<p>Ensuring eventual consistency in microservices architectures demands careful attention to design practices that enhance resilience, reliability, and long-term data correctness. Here are several key best practices to consider:</p>
<ul>
<li><p>Design for Failure and Idempotency - Distributed systems are inherently subject to partial failures and network delays. Architect your services to anticipate transient inconsistencies by implementing robust retry mechanisms with exponential backoff, circuit breakers, and fallback strategies. Ensure that operations are idempotent—processing the same message multiple times should not lead to duplicate effects—so that your system can safely recover from intermittent failures.</p>
</li>
<li><p>Robust Error Handling in Asynchronous Communication - Since eventual consistency relies heavily on asynchronous message passing, effective error management is critical. Develop comprehensive strategies to handle exceptions in event processing. This includes logging critical errors, employing alerting systems, and triggering compensating actions when necessary, so that any inconsistencies can be resolved over time and the system converges to a consistent state.</p>
</li>
<li><p>Comprehensive Monitoring and Observability - Implement monitoring tools and observability practices that provide real-time insights into the flow of events across microservices. Utilize distributed tracing, centralized dashboards, and detailed logging of key metrics—such as event propagation delays and processing times—to quickly detect and diagnose potential consistency issues. This visibility is essential for proactive maintenance and for ensuring that the system behaves as expected even during failures.</p>
</li>
<li><p>Test Automation and Simulation of Inconsistencies - Automated testing is crucial for validating eventual consistency mechanisms. Develop integration tests that simulate real-world scenarios, including network delays, message loss, and duplicate event deliveries. Such tests help verify that your compensating transactions and conflict resolution strategies are effective, ensuring that the system gracefully recovers and eventually reaches a consistent state.</p>
</li>
</ul>
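<p>As one concrete example of the retry guidance above, here is a minimal retry loop with exponential backoff in plain Java. It is a sketch: a production version would add jitter, a circuit breaker, and actual sleeping between attempts (delays here are only computed and printed to keep the example fast).</p>

```java
import java.util.function.Supplier;

// Sketch of retry with exponential backoff for transient failures.
public class RetryDemo {
    static <T> T retry(Supplier<T> op, int maxAttempts, long baseDelayMs) {
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                long delay = baseDelayMs << attempt; // doubles each attempt: 100, 200, 400, ...
                System.out.println("attempt " + attempt + " failed; backoff " + delay + "ms");
            }
        }
        throw last; // all attempts exhausted
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated transient failure: the operation succeeds on the third attempt.
        String result = retry(() -> {
            if (++calls[0] < 3) throw new RuntimeException("transient");
            return "ok";
        }, 5, 100);
        System.out.println(result + " after " + calls[0] + " calls"); // ok after 3 calls
    }
}
```

<p>Paired with idempotent handlers, such a retry loop lets a service recover from transient network faults without risking duplicate side effects.</p>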
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1740471570655/d5f79797-a618-468c-aa9f-0901a1d82b66.png" alt class="image--center mx-auto" /></p>
<p>By embracing these best practices, you can build a microservices architecture that scales effectively while ensuring that data eventually converges to a consistent and reliable state, even in the face of transient errors and asynchronous communication delays.</p>
<h1 id="heading-tools-and-libraries-for-microservices-development">Tools and Libraries for Microservices Development</h1>
<p>Enhancing microservices architectures relies on specialized tools and libraries that support eventual consistency, asynchronous communication, and robust distributed transactions. By leveraging messaging platforms like RabbitMQ and Apache Kafka, developers can enable asynchronous communication, decoupling services and ensuring reliable message delivery. Tools that support Sagas help manage distributed transactions, while event sourcing and CQRS patterns offer powerful mechanisms to track state changes and segregate command and query responsibilities. Integrating these approaches fosters best design practices for achieving eventual consistency in complex microservice environments. Together, these tools and libraries empower teams to build scalable, maintainable systems that effectively handle the inherent challenges of distributed architectures. Here are key tools for implementing microservices with eventual consistency:</p>
<ul>
<li><p>Messaging Platforms (RabbitMQ, Apache Kafka):</p>
<ul>
<li><p>RabbitMQ: Install RabbitMQ and set up exchanges and queues to route messages between microservices. Use client libraries in your preferred programming language to publish and consume messages, enabling asynchronous communication and decoupling of services.</p>
</li>
<li><p>Apache Kafka: Deploy Kafka clusters and create topics to which microservices can publish and subscribe. Utilize Kafka's producer and consumer APIs to handle high-throughput, low-latency message exchanges, ensuring reliable data flow across services.</p>
</li>
</ul>
</li>
<li><p>Schema Management (Avro, Protobuf, JSON Schema): Use a Schema Registry (e.g., Confluent, Apicurio) to enforce structured data contracts. Define schemas in Avro (binary efficiency), Protobuf (speed), or JSON Schema (readability) to standardize message formats. Producers serialize data using schema IDs from the registry; consumers deserialize by fetching schemas. Enforce compatibility rules (e.g., BACKWARD) to safely evolve schemas—like adding fields without breaking consumers. Automated validation during schema registration blocks incompatible changes, ensuring reliable cross-service communication.</p>
</li>
<li><p>Frameworks for Implementing Sagas (e.g., Axon, NServiceBus, Camunda):</p>
<ul>
<li><p>Axon Framework: Integrate Axon into your Java-based microservices to manage complex business transactions. Define command and event handlers to coordinate operations across services, and implement compensating actions to maintain consistency in case of failures.</p>
</li>
<li><p>NServiceBus: Use NServiceBus with .NET applications to design long-running processes. Configure sagas to handle state transitions and message routing, ensuring that distributed transactions are managed effectively.</p>
</li>
<li><p>Camunda: Deploy Camunda's workflow engine to model and execute complex business processes. Design BPMN diagrams to represent saga workflows, allowing for visual orchestration and monitoring of distributed transactions.</p>
</li>
</ul>
</li>
<li><p>Libraries for Event Sourcing (EventStore, CQRS Frameworks, Temporal, Axon):</p>
<ul>
<li><p>EventStore: Set up EventStore to persist events generated by your microservices. Append events to streams and leverage projections to query the current state, facilitating event sourcing and ensuring an immutable log of state changes.</p>
</li>
<li><p>Temporal: Integrate Temporal's SDKs into your services to manage workflows and state transitions. Define workflows as code, enabling reliable execution of complex processes and simplifying the implementation of event sourcing patterns.</p>
</li>
</ul>
</li>
</ul>
<p>By incorporating these tools and libraries into your microservices architecture, you can achieve more robust, scalable, and maintainable systems.</p>
<h1 id="heading-the-future-of-eventual-consistency">The Future of Eventual Consistency</h1>
<p>As cloud and edge computing continue to evolve, managing distributed workloads and ensuring real-time responsiveness introduce new challenges—and opportunities—for eventual consistency in microservices. Wider data distribution, higher volumes of concurrent requests, and increasingly sophisticated use cases make asynchronous communication and event-driven architectures more important than ever. At the same time, many organizations are exploring hybrid consistency models that merge the benefits of strong and eventual consistency. For instance, semantic consistency approaches aim to provide a consistent view of data under certain conditions, offering performance advantages without compromising data integrity when correctness is critical.</p>
<p>Ultimately, eventual consistency remains a foundational principle for building scalable, resilient microservices. By understanding the trade-offs between strong and eventual consistency, teams can balance performance and complexity according to their specific application requirements. Adopting best practices, robust tooling, and adaptive strategies—such as monitoring, observability, and automated testing—ensures that distributed systems remain reliable even in the face of network partitions, high traffic, and partial failures. With careful planning and a clear-eyed approach to design, eventual consistency can help deliver services that are both efficient and responsive, ready to meet the demands of modern cloud-native environments.</p>
]]></content:encoded></item></channel></rss>