If you use NodeJS on your backend, you have probably run into memory problems at one point or another. Unlike systems that create and terminate a child process with every request, a Node process is started once, runs for a long time, and is particularly unforgiving of any task that leaves residue behind.
This post is an attempt at explaining how to dive into the seeming mess of memory lane and come up with insights, or better yet, solutions.
Let’s start with a problem.
When we did the Foundation Release and put the production version of todoist.com up to the test of real traffic, this showed up in our panel under the wash of visitors:
(As background information, for the latest version of todoist.com, we have a NodeJS server backing a NextJS front end.)
Roots of Memory Leaks
There could be 3 types of leaks, depending on your situation.
1. The code you wrote leaks:
With NextJS, server-side code is usually limited to:
- server.js, and whatever files it requires
- getInitialProps and everything it calls, on each page (or just the one page, if only one page leaks)
And when checking your own code, memory leaks boil down to either:
- Global variables
- Closures
However, there is no need to get too nervous around them: as long as no new data is appended to a global or closed-over array with every page fetch, simply keeping a fixed number of objects and strings around is usually fine.
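To make the distinction concrete, here is a minimal sketch (the variable and function names are made up for illustration) of a global that leaks versus one that does not:

```js
// Leaky: this array grows with every request and is never trimmed,
// so the server's memory grows with traffic.
const requestLog = [];
function handleRequestLeaky(req) {
  requestLog.push({ url: req.url, at: Date.now() });
}

// Fine: a fixed set of objects kept around for the lifetime of the process.
// It never grows with traffic, so it is not a leak.
const translations = { en: 'Hello', de: 'Hallo' };
function handleRequestOk() {
  return translations.en;
}
```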
2. The package you use leaks:
Sometimes the leak comes from outside your own code. If, after digging through call-chain hell, you discover that the contents of a particular package come up time and again, it’s time to search for related memory leak issues on GitHub.
3. Node itself leaks:
This is not common, fortunately. But with unstable or edge versions of Node, it is still a possibility. Always make sure that you are on a stable and supported version of Node.
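If you want to double-check what your server is actually running, Node reports its own version and LTS status; a quick sketch (the codename in the comment is just an example):

```js
// Log the running Node version and whether it is an LTS release.
// `process.release.lts` holds the LTS codename (e.g. "Erbium") on LTS builds
// and is undefined otherwise.
console.log(`Node ${process.version}`);
console.log(process.release.lts ? `LTS release: ${process.release.lts}` : 'Not an LTS release');
```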
What Tools to Use
1. Memwatch:
Memwatch is a popular NodeJS library for tracking leaks, and it shows up at the top of the list when you Google memory leaks. Sadly, as of the end of 2019, the package is seven years out of date and, predictably, doesn’t work with the current stable version of Node. Even memwatch-next, someone’s valiant attempt to make it work with the then-current Node, is two years old and no longer cooperates with current Node either.
2. Good ol’ Chrome Memory Inspector:
To make use of the Chrome developer tools, simply run your production build with the --inspect option.
We run the production version of NextJS because in development mode, pages are compiled on the fly as NodeJS serves them. The production build contains the already-compiled code, so while debugging NodeJS leaks we don’t have compilation muddling about in our memory.
Running Chrome Memory Inspector
If you don’t need to use Chrome Memory Inspector on a weekly or even monthly basis, it’s easy to forget what all those buttons and dials are for. Let’s start the server and take a look:
$ next build
$ NODE_ENV=production node --inspect server.js
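For context, server.js here is a custom Next server. If you don’t have one handy, a minimal sketch along these lines (the port and the bare-bones routing are assumptions, not our actual setup) is enough to follow along:

```js
// server.js — a minimal custom Next server, shown only as a reference point.
const { createServer } = require('http');
const next = require('next');

const app = next({ dev: process.env.NODE_ENV !== 'production' });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  // Every request is handed straight to Next's request handler.
  createServer((req, res) => handle(req, res)).listen(3000, () => {
    console.log('> Ready on http://localhost:3000');
  });
});
```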
Then go to chrome://inspect/ and open up the inspection window:
A note on this screen: every time the server is restarted, the old memory inspection window becomes stale. A new window needs to be opened from this page to inspect the currently running server.
Here we finally enter the Memory Inspection Tab:
First, it’s always a good idea to get a sense of which actions are related to the memory leak. Select the second option, “Allocation instrumentation on timeline”, and click “Record”:
This is what a healthy timeline should look like:
- Every time an action happens (in this case a page load), there is a burst of memory usage, in the form of blue bars.
- The next time the same action is executed (another page load), the previous memory is garbage collected, indicated by the blue bars turning grey.
As opposed to an unhealthy timeline, where some percentage of the blue bars always gets stuck behind.
In the case of NextJS, after starting the Node server, don’t forget to complete at least one page load before starting the timeline recording. Otherwise, the first blue bar will consist of the giant heap of memory used to generate the page for the first time (typically around 1MB), which completely dwarfs the 50KB leak that happens every time an action is performed.
Play around with the app, trying all the actions you suspect might be causing the leak, until you can reproduce it consistently. In the screenshot below we can see a consistent ~40KB of memory left behind with every page load, so we will be using the page load action as our identifier.
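If clicking around by hand gets tedious, a tiny script can repeat the suspect action for you. The URL and the repetition count below are placeholders for your own setup:

```js
// repeat-action.js — hit the suspect page N times so any per-action leak
// shows up as a clean multiple in the profiler (PAGE and N are placeholders).
const http = require('http');

const N = 10;
const PAGE = 'http://localhost:3000/';

function loadOnce(i) {
  return new Promise((resolve, reject) => {
    http.get(PAGE, (res) => {
      res.resume(); // consume the body so the socket is released
      res.on('end', () => {
        console.log(`load ${i + 1}/${N}: ${res.statusCode}`);
        resolve();
      });
    }).on('error', reject);
  });
}

(async () => {
  for (let i = 0; i < N; i++) {
    await loadOnce(i);
  }
})();
```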
Exactly How to Read the Memory Profile
Now that we’ve pinned down the criminal action, it’s time to put this fine knowledge to good use:
- Restart the server to have a true clean slate.
- Load the page once, to get the first-load memory consumption over with.
- Open up the Chrome Memory Inspector.
- Hit the “manual garbage collection” button to make sure what remains in memory cannot be garbage collected.
- Select the first profiling option “Heap snapshot”.
- Do a heap snapshot.
Next comes the magic trick:
- Perform the identified sinful action a certain number of times, say 10.
- Hit the “manual garbage collection” button to make sure what can be collected is collected.
- Then take another heap snapshot.
Now we can check the Summary view, sorted by “Retained Size”, with the filter set to “Objects allocated between Snapshot 1 and Snapshot 2”:
Whichever objects accumulated exactly 10 times, or a multiple of 10 times (20, 30, 40…), are very likely to be the culprits of the memory leak.
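As an aside, if you ever need to capture the two snapshots without DevTools attached (say, on a remote box), Node’s built-in v8 module can write them to disk. A sketch, with arbitrary file names:

```js
// Take "before" and "after" heap snapshots programmatically.
// Run with: node --expose-gc your-script.js  (so global.gc() is available).
// v8.writeHeapSnapshot() exists in Node 11.13+.
const v8 = require('v8');

global.gc();                                 // like the DevTools GC button
v8.writeHeapSnapshot('before.heapsnapshot');

// ...perform the suspect action 10 times here...

global.gc();
v8.writeHeapSnapshot('after.heapsnapshot');
// Load both files into the Chrome Memory tab via "Load" to compare them.
```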
An explanation of terms, for the table headers in the above screenshot:
- Distance: how many references this object is away from the root (window).
- Shallow Size: how much memory this object itself takes up.
- Retained Size: how much memory is held up by the references this object keeps (and would be freed if the object were collected).
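To make Shallow Size versus Retained Size concrete, here is a small illustration with made-up numbers:

```js
// `holder` is a tiny object by itself (small shallow size), but it is the only
// reference to a 10MB buffer, so its retained size is roughly 10MB:
// deleting `holder` would free the buffer too.
const holder = {
  label: 'cache entry',                  // a few bytes of shallow size
  data: Buffer.alloc(10 * 1024 * 1024),  // 10MB retained through `holder`
};

// Dropping the reference lets the garbage collector reclaim the 10MB:
holder.data = null;
```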
Depending on your situation, the memory leak could be either in the objects themselves or in the references they hold. So try sorting by both Shallow Size and Retained Size and see which one shows the bigger numbers. Sort that column from highest to lowest, and work down the table, checking each row that accumulated exactly 10 times between Snapshot 1 and Snapshot 2.
In our case, at the top of the list, the biggest item that accumulated in memory 10 times is the i18n object. Opening it up and checking the source reveals it to be a translation-related third-party library:
The Object Tab on the lower half of the inspector is very helpful in tracing back to where the leak is coming from. Each level reveals what is referring to the previous one.
In this case, we click on one of the i18n objects, and in the Object Tab we can see that the references go back to a global variable called queueMicrotask, which takes up 99% of its retained size.
This could be our problem. But just to be sure, let’s check a few more of those 10-time consumers. The second one on the list is called ServerResponse, which looks like an innocent Node built-in:
But checking its source, we can see that it points at another i18n service instance.
The third down the list looks like a gzip library:
But the Object Tab again reveals that it has something fishy to do with the i18n service.
Other views can be helpful too, like the Comparison View.
Go to the second snapshot, select Comparison from the dropdown in the top left corner, and in the second dropdown choose the first snapshot to compare against. Sort the constructors by Delta, then check from top to bottom: are there any numbers that accumulated exactly 10 times?
For instance, this JSArrayBufferData has exactly 10 instances of 16,384 bytes in Allocated Size:
After selecting one of those BufferData objects and checking the Object Tab below, we can see that it is, yet again, related to the i18n instance.
At this point, we can be relatively certain that whatever shady business this i18nServerInstance is conducting is probably the source of our pains.
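We never had to read the library’s internals, but the general shape of this kind of leak is worth keeping in mind. A purely hypothetical sketch (none of these names come from the actual package): per-request objects get attached to something long-lived and are never released:

```js
const { EventEmitter } = require('events');

// Lives as long as the server process.
const localeUpdates = new EventEmitter();

function handleRequest(req) {
  // Created fresh for every request.
  const i18nInstance = { locale: req.headers['accept-language'], cache: new Map() };

  // The listener closes over i18nInstance and is never removed,
  // so every request leaves one more instance stuck in memory.
  localeUpdates.on('change', () => i18nInstance.cache.clear());
}
```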
After some Googling, it turned out that this package already had an open issue and a newer release that fixed things, so all we had to do was upgrade. The alternative would have been to patch it on our side while submitting a pull request upstream, which could have taken much longer… Phew.
To Recap
Make sure Node is on a stable version, keep node packages upgraded, and avoid leaky closures and globals. Use an allocation timeline to pin down the leaky action, then use the fixed-number-of-times method with two heap snapshots to pin down the source.
Happy speeding! 😀