Down the rabbit hole and into memory leak lane

Shay Dratler
Dec 5, 2022

“The best way to explain it is to do it.” — Alice’s Adventures in Wonderland by Lewis Carroll

You’re a software engineer, and you’ve just joined a new team whose assignment is to increase and improve monitoring. While working on this new task, you find a neglected service, one that handles all background tasks, such as recovery, reports, and other long-running tasks. But, because it’s the “backyard,” no one has thought it’s worth measuring… until now. Once you start, though, you find that this service is constantly working with over 90% memory usage at all times.

Memory leak screenshot

“How did we get here,” you think to yourself, “and what now?”

Your team thinks the best way for you to step into the team and earn their trust is to handle this issue.

I’m Shay, a software engineer with over 10 years of experience, and I’ve encountered some challenges like this. If you think you have a similar issue or might encounter one in the future, you’ll find some useful tips here. So, let’s get started, shall we?

The basics

Before handling any issue, it’s important to lay down some ground rules. I don’t see any shortcuts to this step, but there are two books that can help and that every developer should read: Clean Code and The Clean Coder, both by Robert Cecil Martin, a.k.a. Uncle Bob. Both books provide solid explanations of what makes good code and what might cause bad code. Baeldung also provides best practices and more information.

The next step is to go over the service to understand what might cause issues and then run the project on your local machine to understand how the application works. Once you have a basic understanding, it’s time to get some work done.

Known issues and low-hanging fruit

I consider the issues below to be relatively easy to find, and fixing them can reduce work time. But they can spread over large code areas, so take your time, and remember that the most important thing is to fix the leak without compromising the application’s business flows.

Nested loops
Nested loops look harmless at the start, but they can cause major increases in memory. This is because every object allocated inside the loops stays in memory until the code block is done with it and it becomes unreachable; only then can the garbage collector release it. For example, let’s say we have two lists of items, and we want to find all the items that appear in both lists:

public List<Item> matchedItems(List<Item> liOne, List<Item> liTwo) {
    List<Item> outcome = new ArrayList<>();
    for (Item one : liOne) {
        for (Item two : liTwo) {
            // Compare every item of liOne with every item of liTwo
            if (one.value().equals(two.value())) {
                outcome.add(one);
                break; // stop scanning liTwo after the first match
            }
        }
    }
    return outcome;
}

For this sample, the complexity is O(N²), which means that for each item in liOne we might iterate over all the items in liTwo.

Also, nested loops usually “hide” in the code. One might look more or less something like this:

public List<Item> matchedItems(List<Item> liOne, List<Item> liTwo) {
    List<Item> outcome = new ArrayList<>();
    for (Item one : liOne) {
        // The inner loop is hidden inside existsInList
        if (existsInList(liTwo, one.value())) {
            outcome.add(one);
        }
    }
    return outcome;
}

private boolean existsInList(List<Item> items, String value) {
    for (Item item : items) {
        if (item.value().equals(value)) {
            return true;
        }
    }
    return false;
}

We can see that instead of a visibly nested loop, the inner loop now lives in another function, with the same effect. In the end, the issue remains, but it’s harder to find.

There are some actions that can be performed, such as sorting the lists before iterating over them and breaking out of the loop early. This might reduce the number of items that need to be iterated, but what if we use other solutions, like a map, for the lookup?

public List<Item> matchedItems(List<Item> liOne, List<Item> liTwo) {
    // Index the first list by value so each lookup is a single map access
    Map<String, Item> mappedItems = new HashMap<>();
    for (Item one : liOne) {
        mappedItems.put(one.value(), one);
    }
    List<Item> outcome = new ArrayList<>();
    for (Item two : liTwo) {
        if (mappedItems.containsKey(two.value())) {
            outcome.add(two);
        }
    }
    return outcome;
}

We can see that with this code the complexity dropped to O(2N), which is O(N). Yes, we still needed to iterate over both lists, but each one is traversed only once, at the cost of the memory held by one extra map. We can be even more efficient if we use a better technique, such as a cache or a better algorithm.

Schedulers
Nearly every modern Java framework uses a scheduler, whether it’s Quartz or Spring Boot. So, you might encounter an issue if you need to handle a heavy payload.

@Scheduled(fixedDelayString = "1000")
public void collectMessageFromSQS() {
ReceiveMessageRequest receiveMessageRequest = new ReceiveMessageRequest(sqsUrl);
receiveMessageRequest.setMaxNumberOfMessages(config.getMaxMessageToPull());
ReceiveMessageResult receiveMessageResult = config.amazonSQSAsync().receiveMessage(receiveMessageRequest);
//handle SQS message
}

Here we can see that the method connects to SQS every second and tries to collect messages. If the downstream actions are not handled within that time, the actions that still need to be performed start to queue up. From there on, memory will start to fill up, and the garbage collector will not have time to handle the backlog.

So, what can be done? First, work asynchronously: the framework will run each process on its own thread, taken from a thread pool rather than created from scratch. This way, you are leveraging the Java framework’s capabilities for multiple threads.
Second, we can reuse the same connection or define a connection pool (for SQS this is not relevant).
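
The thread pool idea can be sketched with plain java.util.concurrent, without any framework. This is only an illustration under assumptions: the class name BoundedWorker, the pool size of 4, and the backlog of 100 are made up for the example. The point it tries to show is that a bounded queue keeps the backlog from growing without limit in memory.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: hands polled messages to a fixed pool with a bounded backlog.
final class BoundedWorker {
    private final AtomicInteger handled = new AtomicInteger();

    // 4 threads and a backlog of 100 are illustrative values, not recommendations.
    private final ThreadPoolExecutor pool = new ThreadPoolExecutor(
            4, 4, 0L, TimeUnit.MILLISECONDS,
            new ArrayBlockingQueue<>(100),
            // When the backlog is full, the submitting (polling) thread runs the task itself,
            // which slows down polling instead of piling up work in memory.
            new ThreadPoolExecutor.CallerRunsPolicy());

    // Called by the scheduler with whatever was pulled from the queue.
    void submitAll(List<String> messages) {
        for (String message : messages) {
            pool.execute(() -> handle(message));
        }
    }

    private void handle(String message) {
        // Real message handling would go here.
        handled.incrementAndGet();
    }

    int handledCount() {
        return handled.get();
    }

    // Stops accepting work and waits for the backlog to drain.
    boolean shutdownAndWait(long seconds) {
        pool.shutdown();
        try {
            return pool.awaitTermination(seconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With this shape, a slow downstream makes the poller slow down too, which is usually easier to live with than an ever-growing heap.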

Streams and anonymous functions
In the last few years, the smart people that built the Java language have tried to modernize it by adding streams and anonymous functions. Some of what they’ve done has made significant improvements, both in experience and writing complexity, reducing the amount of code being written.

All of the above is good practice, but debugging anonymous functions is more challenging. For example:

List<Prop> goodProps = new ArrayList<>();
list.stream().forEach(item -> {
    for (Prop p : item.props) {
        if (p instanceof GoodValue) {
            goodProps.add(p);
        }
    }
});
return goodProps;

Here we can see a nested loop within the stream. We could also stream the props list itself, but the main idea is the same. So, what can we do better?

List<Prop> goodProps = list
    .stream()
    .map(item -> item.getEnrichedProp())
    .filter(p -> p instanceof GoodValue)
    .collect(Collectors.toList());
return goodProps;

Do you see the difference? Here, we collected the goodProps in a single pipeline, and now we can do whatever is needed with one iteration. We can do even better using flattening functions such as flatMap, but if you are working with a loop in streams, please read up on how to use it well and think very carefully about what you are doing. Performance testing on streams is not always straightforward; it depends on object manipulation complexity, the nesting depth of an object, and more.
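
As a rough sketch of the flattening approach, here is one way it could look with flatMap. The Prop, GoodValue, and Item classes below are hypothetical stand-ins for the types in the snippets above, reduced to the minimum needed to run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-ins for the article's types.
class Prop {}
class GoodValue extends Prop {}
class Item {
    final List<Prop> props = new ArrayList<>();
}

final class GoodProps {
    // Flattens every item's props into one stream, then filters in a single pipeline.
    static List<Prop> collect(List<Item> items) {
        return items.stream()
                .flatMap(item -> item.props.stream())
                .filter(p -> p instanceof GoodValue)
                .collect(Collectors.toList());
    }
}
```

The nested loop is still there conceptually, but it is expressed as one pipeline with no mutable list captured by a lambda.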

Find problematic static methods and static objects
Objects referenced by static variables are not collected by the GC as long as their class stays loaded, so once you create them, they will usually remain until the end of the JVM’s life. So, what needs to be done? Let’s look at an example:

public static Optional<Object> convertToJson(final String jsonString) {
ObjectMapper mapper = new ObjectMapper();
try {
return Optional.of(mapper.readValue(jsonString, Object.class));
} catch (Exception e) {
return Optional.empty();
}
}

The intent of the developer was good. They wanted to create a generic method that would get input and convert it to a JSON object. The problem, though, is the static method. Let me share why.

race condition

Let’s say we have two threads, 1 and 2. Both of them call the same static function. Because the mapper is not a shared singleton, each call builds its own instance, so two instances can exist at the same time. Also, if you take a good look at the code itself, we are creating a new ObjectMapper for every request, meaning it allocates a new, relatively heavy object on every single call.

So, what can we do differently?

public Optional<Object> convertToJson(final String jsonString) {
ObjectMapper mapper = new ObjectMapper();
try {
return Optional.of(mapper.readValue(jsonString, Object.class));
} catch (Exception e) {
return Optional.empty();
}
}

Simply remove the static! Now the method belongs to an instance, and once that instance is no longer referenced, the GC will be able to clean it up. Want something even better?

public class JsonUtils {
private final ObjectMapper mapper = new ObjectMapper();
public Optional<Object> convertToJson(final String jsonString) {
try {
return Optional.of(mapper.readValue(jsonString, Object.class));
} catch (Exception e) {
return Optional.empty();
}
}
}

In this code block, we create the mapper once per JsonUtils instance and reuse it every time we call the convertToJson method. This way, each call leaves behind only short-lived objects that the GC can clean up once the method completes.

The same goes for constants:

public class MyConstants{
public static String MY_STRING = "this is my String";
}

We can create more than one instance of MyConstants, and we can even change the values, possibly impacting other parts of the code.

The right way is as follows:

public final class MyConstants {
    private MyConstants() {} // no instances
    public static final String MY_STRING = "this is my String";
}

The final keyword on the class prevents extending it, the private constructor prevents creating instances, and the final keyword on the field prevents changing the value, making it permanent.

Third-party SDKs, decorators, and queues

When working with SDKs, usually for cloud-managed services such as SQS or managed Kafka services, some are adept at creating connections, but finding resource allocation problems is difficult and requires additional work. It may be best to do a Google search to see if someone has filed an issue on GitHub or provided a workaround for such an issue.

Close all leftover connections

It doesn’t matter if you create connections to a database, open a web socket, or even use a plain old HTTP client. If you obtained a connection, then you should think about how to close the connection and release it.

Some APIs don’t know how to release locked objects or don’t know how to reuse connections. You might encounter an API that allocates a connection over and over without releasing it on time, making the GC’s life harder and preventing connections from being closed.
The connection pool might stay open for a long time and might not be marked as ready to be cleaned.
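
A common way to make sure a connection is always released in Java is try-with-resources. The sketch below uses a made-up TrackedConnection class (an assumption for illustration, not a real client) that counts open connections, so the effect is visible.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical connection type; many real clients (JDBC, sockets) also implement AutoCloseable.
final class TrackedConnection implements AutoCloseable {
    static final AtomicInteger OPEN = new AtomicInteger();

    TrackedConnection() {
        OPEN.incrementAndGet();
    }

    String query(String sql) {
        if (sql.isEmpty()) {
            throw new IllegalArgumentException("empty query");
        }
        return "result of " + sql;
    }

    @Override
    public void close() {
        OPEN.decrementAndGet();
    }
}

final class ConnectionDemo {
    // try-with-resources calls close() on every path, including when query() throws.
    static String run(String sql) {
        try (TrackedConnection connection = new TrackedConnection()) {
            return connection.query(sql);
        }
    }
}
```

Whether the call succeeds or fails, the open-connection counter goes back to zero, which is the property you want from any code that obtains a connection.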

Logger formats and MDC
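Most log formats and format templates rely on an object called Mapped Diagnostic Context (MDC). It is provided by the logging libraries (Log4j, Logback, SLF4J) and has been around for a long time. It stores its values per thread, so when threads are reused from a pool, values that were never removed stay attached to the thread and leak into the next task. Even so, most templates use it.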

If you are using it, release the value when you’re done by doing the following:

MDC.clear();

The above code will release your MDC variables.
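
To show why clearing matters with pooled threads, here is a small sketch. It deliberately uses FakeMdc, a hypothetical ThreadLocal-based stand-in, so it runs without any logging library; a real MDC is assumed to behave similarly with respect to thread reuse. The clearing is done in a finally block so it also happens when the task fails.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for a logging MDC: a per-thread map of context values.
final class FakeMdc {
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(HashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }
    static String get(String key) { return CONTEXT.get().get(key); }
    static void clear() { CONTEXT.remove(); }
}

final class MdcDemo {
    // Runs two tasks on the same pooled thread and returns what the second task sees.
    static String valueSeenByNextTask(boolean clearWhenDone) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable handleRequest = () -> {
            FakeMdc.put("requestId", "abc-123");
            try {
                // Handle the request and log with the context here.
            } finally {
                if (clearWhenDone) {
                    FakeMdc.clear();
                }
            }
        };
        Callable<String> readContext = () -> FakeMdc.get("requestId");
        try {
            pool.submit(handleRequest).get();
            return pool.submit(readContext).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Without the clear, the second task still sees the request ID of the first one; with it, the context is empty again.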

Memory dump

There are no shortcuts here. You will need to log in to whatever hosts that JVM, whether it is a physical machine, a virtual machine, or a Kubernetes pod. For each one of them, I recommend you do two types of sampling: a thread dump and a heap dump.

You will need to take one sample of each kind both before any memory spikes in the application and after the memory increases. The logic behind this is that it gives you the ability to compare which processes and which threads are holding the memory and not releasing their resources. You will also find out whether you are getting the primitive object or the class itself.
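
Dumps are usually taken from the command line with JDK tools such as jstack, jmap, or jcmd, but they can also be triggered from inside the JVM. The following is a minimal sketch using the standard management beans; the Dumps class name is an assumption, and the heap dump part only works on HotSpot-based JVMs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Hypothetical helper for taking "before" and "after" samples from inside the JVM.
final class Dumps {
    // Returns a text thread dump of the current JVM, including lock information.
    // ThreadInfo.toString() shortens long stack traces, so this is a summary view.
    static String threadDump() {
        StringBuilder dump = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, true)) {
            dump.append(info);
        }
        return dump.toString();
    }

    // Writes a heap dump of live objects to the given path.
    // The target file must not exist yet, and recent JDKs expect a ".hprof" suffix.
    static void heapDump(String hprofPath) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(hprofPath, true);
    }
}
```

Take one pair of dumps before the spike and one after, then open the heap dumps in your analyzer of choice and compare them.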

Memory consumption rate

There are three types of leaks as I see it:

“Hit the wall,” meaning something within the application drains all the memory at once

“Stairway,” where you see a gradual increase over a relatively short time period

“Water leak,” where you see a constant increase over a long period of time

Figures: memory usage graphs for “Hit the wall,” “Stairway,” and “Water leak”

Each one of these might indicate different issues within the code, but what they have in common is that the memory is not being released as quickly as it’s being allocated.
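
If you don’t have a monitoring graph at hand, a rough picture of the consumption rate can be obtained by sampling the heap from inside the JVM. The sketch below is an assumption-laden illustration: HeapSampler and averageGrowth are made-up names, and the growth number is only a crude hint, not a diagnosis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for sampling heap usage over time.
final class HeapSampler {
    // Used heap as a percentage of the maximum heap,
    // or of the committed heap when no maximum is defined.
    static double usedHeapPercent() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long limit = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        return 100.0 * heap.getUsed() / limit;
    }

    // Average change per sample between the first and last reading.
    // A value that stays positive over a long period hints at a slow, constant leak.
    static double averageGrowth(double[] samples) {
        if (samples.length < 2) {
            return 0.0;
        }
        return (samples[samples.length - 1] - samples[0]) / (samples.length - 1);
    }
}
```

Sampling usedHeapPercent at a fixed interval and feeding the readings into averageGrowth gives a first impression of which of the three patterns you are looking at.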

Read, evaluate, print loop

This is never a fun task, but from my experience, your application might have more than one leak at different locations, so it’s important to be persistent. You will need to do more than one iteration of this process with patience and consistency, carefully investigating the details.

Dry run

So, you found all the leaks, and you think you’re done… but wait! Now it’s time to do a dry run on the non-prod environment by deploying it to QA. Then run an end-to-end test to measure memory allocation performance on small amounts of data.

I consider this to be the most critical phase. It might be the most difficult step, but skipping it can result in major problems.

Final words

Kudos to you for reaching this point! Working on memory leaks is difficult, annoying, and frustrating, but I hope these insights help you along your journey.

Even when I use these tips and tricks, it takes time, so stay sharp and be cool. Remember that a leak occurs when your code is not carefully thought through, necessitating that you find the issue and fix it. But it’s okay; mistakes happen. The important thing is that you learn from these mistakes and share lessons learned with your team.

As for me, here’s my progress:

Final step print screen

You can see that it took a great deal of trial and error, but in the end, I was able to decrease my application’s memory usage from 97% to around 75%. That’s a lot, and it will impact my application’s stability.

Happy coding, and keep learning!
