System Modernization 101

Denis Tsyplakov
Nov 9, 2024


Recently, my colleagues and I discussed several aspects of system modernization, and questions like “How do we do X?” or “Where can we read about Y?” arose. I realized that my expertise in this area is based on my personal experience and articles I have read over the past 20 years. Even if I saved links, they might be from 5–7 laptops ago.

I decided to write a series of posts on this topic primarily for reference purposes. Instead of making a 20-minute ad hoc speech on a call, I can just provide a link and answer any questions that arise. Here are the areas I want to cover:

  • Reference architectures: 2-tier, 3-tier, service bus, service-based, microservices.
  • Pros & Cons, potential motivations for modernization for typical cases.
  • Splitting a monolith into services: how to split, typical approaches.
  • Common issues: distributed transactions, function per service grouping, data transfer volume between services.
  • Some common concerns with async processing.
  • Microservice monolith — why it is bad, how to avoid it.
  • Process-related issues of modernization: disposable environments, typical concerns related to development, testing, release, and backup.

Reference Architectures

Before doing a deep dive into the topic of modernization, let’s talk about reference architectures. Modernization is a path from point A towards point B, and for better navigation, we need to understand the landscape and know the most remarkable landmarks.
The list I want to cover includes: 2-tier, 3-tier, Service Oriented Architecture (SOA) (usually comes with ESB), and Microservices (usually not that “micro”).

Important note: Each reference architecture appeared for a purpose and has significant advantages, which is why people used them for decades. While microservices are, in general, far more capable than 2-tier, each approach has its own advantages and caveats.

2-tier

Essentially, this is a system where data and logic sit in one RDBMS, and some lightweight, business-logic-less client (e.g., Oracle Forms) is used to access the application.

PROs:
- Data is strongly consistent at any point in time. Data manipulations have almost no side effects (no race conditions, etc.), making it easy to develop and modify. If you need very strong data consistency, 2-tier is your choice (with a modern fancy UI on top).
- One tech stack to rule them all.

CONs:
- Easy to create, very difficult to modify; code is nailed to tables and vice versa.
- Modern SDLC best practices are not possible or are very effort-intensive.
- You can scale only by scaling hardware.
- Zero downtime deployment and release rollbacks are expensive exercises.
- Somewhat rusty tooling & languages.

3-tier

A system with a database, one or many application services, and a client (SPA, mobile app, etc.). This was the gold standard for almost 20 years and is still a solid choice nowadays. When someone says, “I need to update a monolithic app,” this “monolith” is most likely a 3-tier app.

PROs:
- DB storage and logic are separated and can be developed independently.
- You can use a variety of SQL and NoSQL DBs. Migration between RDBMS is a possibility.
- You can use modern languages and create complex application logic.
- Easy to scale until you hit DB performance limits, and if you use a DB that supports horizontal sharding — scale even beyond this.
- You can use all modern SDLC best practices with low DevOps complexity.

CONs:
- Not scalable and difficult to manage in terms of code base size. While a 3-tier app is a nice idea, the code tends to grow into a spaghetti mess after some point. Technically, in a microservice architecture, each microservice is a small 3-tier app. The key word here is small.
- Feature granularity. Under common assumptions — if you change one feature in a monolith app, you need to test all others afterward. There are no strict boundaries between features and modules. Everything in one execution module is tightly coupled. Full regression of 30M lines of code after one line change — why not.

Service-Oriented Architectures

The next obvious point in architecture evolution was the so-called service-based architecture, which usually means Service-Oriented Architecture (SOA) or microservices. The basic idea was simple: instead of one service in a 3-tier architecture, can’t we have many services? The difference between the two, in a few words, lies in how services communicate with each other and in their lifecycle. SOA usually implies an ESB, a centralized service contract registry, etc. Microservices are usually decentralized, more granular, and less cumbersome.

Service-Oriented Architecture (SOA)

Key features of an ESB-based SOA:
1. Use of ESB
2. Service registry
3. Standardized service contract
4. Enforced top-level approach to security & governance.

PROs:
1. Overcomes the main disadvantage of the monolith — the system consists of a set of granular modules/services. Each service is dedicated to a specific functional area.
2. Unified service communications, service registry, security, etc.

CONs:
1. SOA can easily turn into an enterprise monolith. You may have many services, losing the simplicity of a mono-service, but the lifecycle of services is still tightly coupled through ESB. The development process still requires a lot of coordination.
2. Steep learning curve. In most cases, the team needs to know the specifics of the used ESB, SOAP, etc.
3. QA automation (QAA) and SDLC can be cumbersome. Disposable pre-merge environments for end-to-end tests can be challenging to implement.

As a general note: nowadays, SOA is not a very popular choice for new systems.

Microservices

The basic idea of microservices is: what if we just create as many services as we need and group them using some criteria that make us happy? And use any language/lifecycle/etc. that we consider convenient.

A later addition: it is better to use an API gateway for discovery and other functions, and to design the system so that each service manages its own data.

This makes microservice architecture very powerful. At the same time, it sets a high bar for service layout design.

PROs:
1. Highly scalable, flexible, universal, fits almost any purpose.
2. Low entry level. A developer can create a useful microservice in hours using just common programming language skills.

CONs:
1. Without proper system design, it quickly turns into a microservice monolith (or better said, microservice hell).
2. Proper SDLC is a must. When you have a system with only one application service (3-tier), it can tolerate almost any SDLC flow. When you have a system that consists of 30 (or 300; I have personally seen 600+) microservices, any SDLC-related effort is multiplied by 30.
3. Deployment, configuration, observability, etc. require very good tooling. You must have a DevOps team that understands the needs of developers.
4. Making complex, consistent data updates could be overcomplicated.

Rephrasing: SOA delegates part of the system design complexity to the ESB provider; microservices are flexible, but you have to do the system design yourself.

Motivation for architecture transformation

Now that we have defined a system of coordinates, let’s talk about the motivation for change. Why might a business want to take risks and invest money into a system that has worked for decades?

The answer is quite pragmatic because:

  • It may stop serving its purpose in the short/midterm future. Example from the recent past: A system uses Adobe Flex as a frontend technology, and Adobe Flex will be decommissioned soon.
  • Its current capacity stops the business from growing.
  • There are critical risks associated with technology that may lead to a “stop the world” event for the business.
  • Constantly increasing operational costs. For instance, the system uses Perl for the backend, or all business logic is based on PL/SQL and direct DB-to-DB communication (yes, this still happens nowadays).
  • Current setup works, but is very expensive in terms of hosting costs.

Or something less generic. For example, after several M&As, the core entity owns four companies, each with its own system with somewhat overlapping functionality. They obviously want to keep only one of them, but none of the four has the functional and nonfunctional capabilities to cover all needs. So, if they do not want to support four different platforms forever and tolerate fragmentation, they need to pick one platform and evolve it to a state that covers the other three platforms.

Rephrasing, there are two key points:
- Newly appeared (or newly discovered) tangible business requirements, functional or nonfunctional, that cannot be implemented with the current architecture.
- Money lost if the transformation is not done.

It should not be “we want to migrate from a .Net monolith to Rust microservices because .Net is not fancy anymore.” Unless “is not fancy” means the cost of a .Net team is 150% higher than that of a Rust team (which is not true).

Why are these points important to me as an architect? Because these considerations set the transformation vector. Solution architecture assumes the existence of a problem that this solution solves. No problem — no transformation.

Architecture transformations can eliminate some risks or bring new value (money). Usually, both. Knowing these factors allows us to draw a roadmap from the “as-is” architecture to the shiny “to-be” future state.

System Modernization Tooling

When discussing system modernization, once we define the current state and our goals, it’s time to create a modernization roadmap — the plan for moving from the “as-is” to the “to-be” state.

Important note: We need to define requirements first, especially non-functional ones such as latency, throughput, and data size. This is crucial. Designing a system for too low or too high throughput can be a potentially fatal mistake.

How to?

Use your best skills and knowledge to design the “to-be” architecture and transformation roadmap that fits business needs optimally. This statement, while true, is not actually helpful. To add value, let’s discuss common “to-be” architectures and techniques to transition legacy systems to the desired state.

To-Be State

Based on my observations, the most commonly used architecture today is microservice architecture (often tending towards larger services, which we can call macro service architecture) and asynchronous processing via RabbitMQ, SQS, or Kafka.

Key features include a database used exclusively by a single service and the use of an API gateway (or two gateways, public and private). Another crucial feature is having all business logic within services, avoiding stored procedures.

Why is this architecture so popular?

— Separate services allow granular development flows, resulting in lower complexity and reduced development costs.
— Exclusive use of databases simplifies data management and manipulation, and helps maintain low DB schema complexity.
— Asynchronous processing increases tolerance to load spikes and ensures services are loosely coupled (when used correctly).
— API gateways with a proper URL structure act as service registries and request routers (a small routing sketch follows at the end of this section).

In most cases, this architecture meets business needs effectively.

Note: This is a high-level view. There are many caveats, such as stateful vs stateless designs and session management.
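
To make the routing point from the list above concrete, here is a minimal sketch of URL-based routing, assuming Spring Cloud Gateway; the route names, paths, and service addresses are illustrative, not prescriptive:

import org.springframework.cloud.gateway.route.RouteLocator;
import org.springframework.cloud.gateway.route.builder.RouteLocatorBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class GatewayRoutes {

    @Bean
    public RouteLocator routes(RouteLocatorBuilder builder) {
        // The URL prefix doubles as a lightweight service registry:
        // /api/orders/** goes to the order service, /api/payments/** to payments.
        return builder.routes()
                .route("orders", r -> r.path("/api/orders/**")
                        .uri("http://order-service:8080"))
                .route("payments", r -> r.path("/api/payments/**")
                        .uri("http://payment-service:8080"))
                .build();
    }
}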

Tools for System Transformation

Function shuffle: Extract/split function groups from the monolith to services, regroup functions, etc.
DB structure shuffle: Similar to functions, but requires more skill.
Configurations: Unify configurations, use API GW instead of direct links, and use secret stores, etc.
Move logic from DB to services: Keep only data in the DB (D in DB stands for “data”, DB is “data base”, not CB for “compute base” or BLB for “business logic base”).
Extract stateless, async function groups: Move these to async workers.
Use NoSQL services: Especially Redis, to add additional processing qualities like idempotency and data correlation, achieving desired performance.
Build proper instrumentation for SDLC: This includes QAA, IaC, API contract checks, etc.

These top seven techniques cover around 80% of the scope.

Service Split

How to split one service into several, move functional code between services, and understand why it could be easy or difficult.

One of the most powerful tools of architecture transformation is moving functional code (classes & methods) between services and extracting groups of functions into separate services.

For what purpose could this tool be used?
1. Group tightly coupled functions together in one service so changes in one business area lead to changes in just one service.
2. Move functions that cause specific load patterns (high CPU/high RAM, etc.) to a separate service, making it easier to scale and control system load.
3. Move functions with specific dependencies and/or configurations to a separate service. For example, all payments go through one service, and only one service contains payment-related secrets in its configuration.

Let’s assume we have two functions: funBook calls funPay with parameters:

`int paymentResult = funPay(item, amount);`

And we decided to move funPay to a separate service. Technically, it is very easy — just do an HTTP POST, serialize and pass the item and amount data as parameters, then get the reply, deserialize, and save the result. For instance, in the Java world, OpenFeign https://github.com/OpenFeign/feign does this seamlessly.
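
For illustration, here is a minimal sketch of such a remote call with plain OpenFeign; the endpoint path, service URL, and the PaymentRequest wrapper are my assumptions, not a prescribed contract:

import feign.Feign;
import feign.Headers;
import feign.RequestLine;
import feign.gson.GsonDecoder;
import feign.gson.GsonEncoder;

public class PaymentClientExample {

    // Plain data carriers; the field names are illustrative.
    record Item(String sku, String name) {}
    record PaymentRequest(Item item, int amount) {}

    // Remote contract for the extracted payment service.
    interface PaymentClient {
        @RequestLine("POST /payments")
        @Headers("Content-Type: application/json")
        int funPay(PaymentRequest request);
    }

    public static void main(String[] args) {
        PaymentClient payments = Feign.builder()
                .encoder(new GsonEncoder())
                .decoder(new GsonDecoder())
                .target(PaymentClient.class, "http://payment-service:8080");

        // Behind this call, Feign does exactly the POST-serialize-deserialize
        // cycle described above.
        int paymentResult = payments.funPay(
                new PaymentRequest(new Item("A-42", "Concert ticket"), 100));
        System.out.println("payment result: " + paymentResult);
    }
}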

So, is moving functions/methods between services not a big deal? Yes and no. Yes, because it is just a matter of passing data and making a call, local or remote. No, because there are caveats (as usual, the devil is in the details):

1. HTTP calls are 1,000–10,000 times slower. I mean the call mechanics itself.
2. Not all resources can be serialized, such as file handles, semaphores, etc.
3. Transactions. If two methods participate in a transaction, you are in trouble.
4. Passing huge data. If the size of a serialized parameter is 1 MB, it means the service will use lots of compute resources for serializing and then deserializing it. And 1 MB is not the biggest structure you may have in memory.
5. 1+N calls. If, for instance, we have 100 items in our order and funPay is called once per item, we end up making 100 HTTP calls.
6. Implicit parameter changes. If funPay implicitly changes some fields in the item, these changes will not be propagated back to funBook.
7. Global state. If funPay changes something in the global service state, this change will not be transported back over the wire.

The points above do not mean we cannot move functions between services. We can, but each of these points should be considered, and specific fixes should be applied in each case.

How to Deal with Different Issues That Occur During Service Split

[Issues with split #1]: HTTP calls are 1,000–10,000 times slower than method calls

An HTTP call is naturally much slower than an in-process method call. How can you deal with this? The short answer is, there is no way to completely bridge the gap. You can use gRPC or other techniques to potentially increase the speed by 2–3 times, maybe even 10 times. However, this will not cover a 1,000 times gap. If your function is extremely latency-sensitive and adding a 10ms penalty to the function call is critical, then do not split this function into several services; keep it in one process space.

But in most cases, even an additional 50ms does not matter much because the network latency from a mobile phone in the hands of a user in Yerevan to a data center in Frankfurt and back adds way more randomness and weirdness.

Important Points to Consider

1. During the design phase, the architect should clearly understand what is acceptable in terms of latency, SLA, etc. This is not always the case. For instance, there was a situation where a client’s system had to reply to a GDS call within 3 seconds. After 3 seconds, the reply was ignored. The chain of 20+ microservice calls was physically incapable of replying within 3 seconds. This requirement was not considered during the design stage.

2. When mentioning a 10ms increase, I mean a close-to-ideal case, such as an HTTP call from one JIT-compiled Java service to another, both using Spring Boot and Undertow. If we use a slower language or add additional layers in between, such as an API Gateway -> Apache HTTP Server -> Tomcat container with the application packaged as a .WAR file, the latency can increase significantly. Therefore, some reasonable optimization of HTTP calls is essential.

3. It may seem trivial, but it is not always taken into consideration: the number of hops matters. One service-to-service call could be acceptable, but 10 calls will increase the call-related part of the latency by 10 times.

Rephrasing: When splitting code into services, understand the SLA, mind your technology limitations, and do not overengineer. Rule of thumb: for any incoming HTTP request from a browser, the internal chain of sequential calls should not be longer than 5 hops; the ideal number is 2–3.

[Issues with split #2]: Not all parameters are actually data, i.e., not everything is serializable

For instance, in the method:

public void saveConfig(FileOutputStream configFile, Config config)

We can serialize config, but what about the file handle that points to a file opened in the local file system? There are plenty of resources that cannot be serialized. Specifically, these could be:

- Various types of system resources that are not actually data, but object wrappers around system handles, such as files, semaphores, ports, threads, locks, etc.

- Application objects that have references to system resources somewhere deep inside the object tree (basically #1 but hidden inside a harmless data hierarchy). E.g., a School has grades, grades have classes, classes have students, and a student has marks. Out of the blue, a SchoolMark object may have a field MongoDBConnection. This is bad design, but it happens.

- Objects that are theoretically serializable but are part of some framework and could not be restored safely in the context of another service, such as a Logger or Spring Application Context.

- Various kinds of callbacks.

- Thread-local variables.

Normally, you should not pass these kinds of resources between modules; they are supposed to be used in a local, isolated context.
How to deal with this? Quick note: in some cases, mostly when you have tight performance requirements, it is not possible to split methods that depend on critical local OS resources. In most other cases, it is possible, using a combination of these tricks:

- Replace local resources like files, semaphores, locks, etc., with remote versions, e.g., a Redis-based lock instead of a local system lock (see the sketch after this list). Usually, this requires some adjustments in code logic; for example, a remote lock should always have an expiration period, and you need to handle the edge cases.

- Do not pass objects like Loggers, Locks, ClassLoaders between business functions. Just do not.

- Replace callbacks with remote calls of a different kind (queues, etc.). Try to avoid direct circular dependencies between services. If possible, try to avoid callbacks. Sometimes you do need them, but there are drawbacks.

- If you absolutely need to pass some global state between business functions (e.g., some huge data structure), use Redis. It is fast, and you can split a big state into more granular pieces. In some cases, when the global state is a 5GB video and functions are changing random parts of this video, do not split; group the functions around this 5GB in one service.
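
Here is a minimal sketch of the remote-lock trick, assuming the Redisson client; the lock name, timeouts, and Redis address are illustrative:

import java.util.concurrent.TimeUnit;
import org.redisson.Redisson;
import org.redisson.api.RLock;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class RemoteLockExample {
    public static void main(String[] args) throws InterruptedException {
        Config config = new Config();
        config.useSingleServer().setAddress("redis://localhost:6379");
        RedissonClient redisson = Redisson.create(config);

        // Unlike a local lock, a remote lock must have a lease (expiration),
        // so a crashed service instance cannot hold it forever.
        RLock lock = redisson.getLock("booking:item-42");
        if (lock.tryLock(2, 30, TimeUnit.SECONDS)) { // wait up to 2s, auto-release after 30s
            try {
                // ... critical section: the code that used the local lock before ...
            } finally {
                lock.unlock();
            }
        } else {
            // The edge case you now have to handle explicitly: lock not acquired.
        }
        redisson.shutdown();
    }
}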

In general, if you have some non-serializable resource and this is not a mistake in design, but a design decision with a good reason, this is a very good function grouping criterion. Just group your functions around this resource, especially if the cost of splitting is too high.

[Issues with split #3]: Transactions

What is good about a monolith? You have transactional integrity at a very low effort cost. You just wrap the method call in a transaction, and you get DB integrity (unless you are doing something weird). Your changes to the DB are either applied all together or not applied at all.

Then how to deal with integrity in the case where you have a call chain methodA -> methodB -> methodC, the transaction is started in methodA, and all three methods sit in three different services?

There are two common, widespread answers:
- Distributed transactions
- Saga

See: Baeldung: Saga Pattern in Microservices: https://www.baeldung.com/cs/saga-pattern-microservices for more details. The actual list is longer; it could include Idempotent updates, 2PC, CQRS, TSS, etc.

The main problem here is that the level of complexity in most cases is 5–10+ times higher than just a basic transaction in a monolith. This effectively means that for each transaction, you need to review changes in several services and analyze corner cases of various kinds. You also need to keep it consistent over many years of development done by many people from many teams from many countries. The chance that in 5 years from now, some dev from another country who is currently studying at university will accidentally break the saga your team is currently working on is high.

And it is difficult to cover with auto tests. Tests can also be broken. Then what is the solution? My rule of thumb:

- Keep high-level parts of your data model tightly coupled, if possible.
- Have some reconciliation mechanisms — “If data from dbA is not consistent with dbB — synchronize it.” It sounds a bit scary, but humanity has lived in this model for the past 2000+ years. Data in archives of various kinds in different cities were always slightly different up until the beginning of the 21st century.
- Group functions around strongly consistent parts of data.
- Use queues + idempotent updates to transport data between data model parts (this is a kind of lightweight saga; a sketch follows below). The Baeldung link above is a good entry point for details on how it can work.
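
A minimal sketch of the idempotent-update half of this trick, assuming a PostgreSQL-style ON CONFLICT clause and a processed_messages table with a unique message_id column (both are my assumptions, not part of any standard saga library):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class IdempotentPaymentConsumer {

    private final DataSource dataSource;

    public IdempotentPaymentConsumer(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Called by the queue listener (RabbitMQ/Kafka/SQS) for each delivered message.
    public void onPaymentMessage(String messageId, long orderId, int amount) throws SQLException {
        try (Connection c = dataSource.getConnection()) {
            c.setAutoCommit(false);
            // 1. Record the message id; the unique constraint rejects duplicates,
            //    so a redelivered message is detected inside the same local transaction.
            try (PreparedStatement ps = c.prepareStatement(
                    "INSERT INTO processed_messages(message_id) VALUES (?) ON CONFLICT DO NOTHING")) {
                ps.setString(1, messageId);
                if (ps.executeUpdate() == 0) {
                    c.rollback(); // duplicate delivery: already applied, just acknowledge
                    return;
                }
            }
            // 2. Apply the business update in the same local transaction.
            try (PreparedStatement ps = c.prepareStatement(
                    "UPDATE orders SET paid_amount = paid_amount + ? WHERE id = ?")) {
                ps.setInt(1, amount);
                ps.setLong(2, orderId);
                ps.executeUpdate();
            }
            c.commit();
        }
    }
}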

Rephrasing: Tolerate the fact that data between your data model parts is eventually consistent, and build your business processes with this fact in mind. Where that is not possible, keep all related functions in one service. If even that is not possible, read the article above and prepare for a cognitive complexity spike (or, better said, a strike).

Let’s look at this from another side. One of the main reasons for service split is to reduce the cognitive complexity of the service and make it possible to make atomic changes in the service behind a service contract without knowledge of other services. If you are splitting one transaction’s complexity across several services, most likely you are making the split senseless.

[Issues with split #4]: Big parameter size

Sometimes serialized arguments can be big. You might encounter something like:

refinedReply = refineSearchReply(searchReply);

Where `searchReply` size in serialized form is 2MB. Is it big? Or not? Passing 2MB of data over a 1Gbps network takes 16 ms, which is probably not that much. But there are additional costs like network stack overhead, CPU load, memory fragmentation, and marshaling/unmarshaling. The latter is the worst. I know cases when unmarshaling 1.5 MB of XML took up to 10 seconds, and the resulting object tree consisted of thousands of data objects.

So, let’s define “big.” A precise but not very useful definition is: “a parameter is big if passing it through the wire imposes a significant addition to call latency.” For example, if the total acceptable latency is 1 second, and one call with a big parameter takes 50–100 ms.

Some rough “rule of thumb” numbers:
- Up to 100 KB is OK
- 100 KB to 5 MB is big
- More than 5 MB is huge

“Huge” means that you probably should not pass parameters of this size between microservices in an OLTP system.

How to measure?

Let’s assume you want to extract some method and you have concerns regarding parameter size. How do you measure this? The easiest way is to add an aspect around the method definition that measures parameter size and serialization time. Then, in a prod-like environment, run a test suite and collect data.
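
Such a measuring aspect could look roughly like this, assuming Spring AOP and Jackson; the pointcut expression and class names are illustrative:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class ParameterSizeAspect {

    private final ObjectMapper mapper = new ObjectMapper();

    // The pointcut targets the method you are about to extract.
    @Around("execution(* com.example.search.SearchService.refineSearchReply(..))")
    public Object measure(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        int totalBytes = 0;
        for (Object arg : pjp.getArgs()) {
            totalBytes += mapper.writeValueAsBytes(arg).length; // size + serialization cost
        }
        long serializationMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("args: %d bytes, serialization: %d ms%n", totalBytes, serializationMs);
        return pjp.proceed();
    }
}

In a prod-like environment, you would log these numbers instead of printing them and aggregate them over the whole test suite.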

OK (actually NOK): the data is big, or even huge. Then what?

There are several tricks that help in such situations:

- Refine data, do not pass everything, pass only necessary data. This is not always easy because you may need to apply some refining business logic before a call, while the code that implements the logic sits inside the method. So, you may need to do some refactoring first.

- Slice the data into granular pieces, store them somewhere, e.g., in Redis, and pass only a reference to the data piece (see the sketch after this list). For example, once you receive a reply from the search API, slice and dice it, put it in Redis, and pass the data piece ID in the method call.

- As I showed earlier, the main issue is not passing data through the wire, but data processing inside the server process. You may use different tricks to make marshaling lighter/faster, not converting the whole data to an object tree, but extracting part of it, etc. Modern tooling offers a broad variety of fast and lightweight methods of data processing.

- An edge case, but still possible: if we are talking about modifications and queries to some huge data structure, put it into some NoSQL server on arrival and do all modifications and queries inside this server. This NoSQL could be MongoDB or PostgreSQL jsonb.
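
Here is a minimal sketch of the “store in Redis, pass a reference” trick mentioned above, using the Jedis client; the key naming and TTL are illustrative assumptions:

import java.util.UUID;
import redis.clients.jedis.Jedis;

public class ReplyStore {

    private final Jedis jedis = new Jedis("localhost", 6379);

    // Producer side: store the big serialized reply once, return a small reference.
    public String storeReply(byte[] serializedReply) {
        String key = "search-reply:" + UUID.randomUUID();
        jedis.setex(key.getBytes(), 600, serializedReply); // expires in 10 minutes
        return key; // pass this id in the remote call instead of megabytes of data
    }

    // Consumer side: fetch the data only when (and if) it is actually needed.
    public byte[] loadReply(String key) {
        return jedis.get(key.getBytes());
    }
}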

Recap: In my experience, mitigating the big parameter size issue is easy. There are plenty of powerful ways to do this. The key is to identify the issue at early stages. It is important to keep this issue in mind and use some tooling to check the numbers.

[Issues with split #5]: N+1 calls

Another issue is 1+N calls. Let’s imagine you have code like this:

for (var order : fetchOrders()) {
    if (!orderIsValid(order)) {
        throw new InvalidDataException(order.id());
    }
}

While this code runs in one address space, everything is OK. But what if you move the functions fetchOrders() and orderIsValid() to separate services, so that these function calls become REST calls?

This means that executing this code block leads to 1+N REST calls, and the REST call overhead is multiplied. Usually, this results in unacceptable latency.

What does this mean in terms of system design?

Detection

We have to detect such areas during the design phase, and this is not always easy. It makes sense to use instrumentation to trace the number of service-to-service calls inside an external transaction, then run the full set of regression tests and analyze whether there are any 1+N (or 1+N*M) patterns.

Fixing

OK, we found such a pattern. What next? There are several basic tricks that may help:

  1. Change the signature of orderIsValid and pass the whole list of orders at once (sketched after this list). This is probably the easiest solution, though it may not work if the list of orders is huge.
  2. Make order validation asynchronous. Put the list of orders into a queue (RabbitMQ, Kafka, etc.) and read the result back from a queue or from Redis.
  3. Do not split the services this way. Use another composition, e.g., merge the order storage and order validation functions, or make validation part of the orders service.
  4. If we know for sure that the number of orders is small (say, fewer than 5), leave it as-is and consider the overhead acceptable. One danger here: this is an implicit design assumption, and if the number of orders increases in the future, the system will break in some non-obvious way.
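
Trick #1 from the list above could look like this; the types, the client interface, and the exception are illustrative:

import java.util.List;

public class BatchValidationExample {

    record Order(long id) {}
    record ValidationResult(long orderId, boolean valid) {}

    // Remote contract: one call validates the whole list instead of one call per order.
    interface OrderValidationClient {
        List<ValidationResult> validateOrders(List<Order> orders);
    }

    static void checkOrders(List<Order> orders, OrderValidationClient client) {
        // 1 REST call instead of 1+N.
        for (ValidationResult result : client.validateOrders(orders)) {
            if (!result.valid()) {
                throw new IllegalStateException("Invalid order: " + result.orderId());
            }
        }
    }
}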

The key point here is detection. Usually, if you are aware of this pattern, fixing the issue is not a big deal. The problem is a case when:

  1. You have a 1+N pattern.
  2. Sometimes it is 1+3 or 1+5, but sometimes it is 1+50.
  3. And you have an API gateway timeout for 0.1% of requests. Good luck 🍀 finding it without proper observability in place.

[Issues with split #6&7]: Implicit parameter changes and global state

Two types of issues of a similar kind.

This code is OK:

for (var order : fetchOrders()) {
    if (!orderIsValid(order)) {
        throw new InvalidDataException(order.id());
    }
}

This is not:

for (var order : fetchOrders()) {
    validateOrder(order);
}

...

private void validateOrder(Order order) {
    if (someCondition()) {
        order.isValid = false; // change internal order state
    }
    invalidOrdersCount++; // change global state
}

The second code fragment, aside from the 1+N issue, has problems with implicit state changes. “Mutation of hidden state” is one of the most difficult issues to deal with in software development. For instance:

In February 2010, Toyota announced a significant recall affecting approximately 437,000 hybrid vehicles worldwide, including the 2010 Toyota Prius and the Lexus HS 250h models. The recall was initiated due to a software issue in the Anti-Lock Brake System (ABS), which is directly related to the brake pedal’s functionality.

One of the reported causes of the broken software was the use of roughly 10,000 global variables.

Technically, it is possible to convert the second piece of code to REST without a rewrite. You have to:

  1. Return the modified order object and assign it back to the order element in the list.
  2. Introduce the notion of a global context and pass global-context updates (the invalidOrdersCount delta, in our case) back. Or store the global state in shared storage such as Redis.

I strongly advise against doing this; it will make things even hairier. The right way is to:

  1. Convert the local code to a functional style, where you do not mutate function arguments and do not touch implicit state (sketched after this list).
  2. If you have global state, create explicit shared storage for it, possibly even a separate microservice.
  3. Convert the local code to REST calls.
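
A minimal sketch of steps 1 and 2 applied to the fragment above; the validation condition and names are illustrative:

import java.util.List;

public class OrderValidationRefactored {

    record Order(long id, int amount) {}
    record ValidationResult(Order order, boolean valid) {}

    // Step 1: a pure function; no argument mutation, no hidden state.
    static ValidationResult validateOrder(Order order) {
        boolean valid = order.amount() > 0; // stand-in for someCondition()
        return new ValidationResult(order, valid);
    }

    // Step 2: the former global invalidOrdersCount becomes an explicit result;
    // if several services need it, keep it in shared storage (e.g., Redis), not in a field.
    static long countInvalidOrders(List<Order> orders) {
        return orders.stream()
                .map(OrderValidationRefactored::validateOrder)
                .filter(r -> !r.valid())
                .count();
    }
}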

The most difficult part here is change detection. Implicit changes may happen 3–4 or 5–7 levels down the call graph, and there is no good universal instrumental way to detect them; in the general case, only a manual full code scan, which may involve reading tens of thousands of lines of code. Rephrasing: if you (or the folks who worked on the project 5 years before you joined the company) dove into the dangerous waters of implicit data changes, things are bad enough already.

Your state has changed

On some platforms, there are tools that allow detecting state changes along the call graph, like the Chronon Debugger for Java (see https://blog.jetbrains.com/idea/2014/03/try-chronon-debugger-with-intellij-idea-13-1-eap/), but this is not an easy exercise anyway.

Intermediate summary

The article above discusses typical issues with splitting a monolith, specifically the aspect of splitting service code. There are more aspects to be covered in further articles, specifically:

  • Database split
  • Use of NoSQL
  • Move business logic from DB to services
  • SDLC aspects of moving from monolith to microservices
  • Configurations
