The day our web server reached 100% capacity 💾

Doaa Mahely - Sep 6 '20 - Dev Community

How it began

One day a few months back, the QA engineer messaged me on Slack saying he couldn't log into the web application. Naturally, I tried to log in with his credentials and was able to. I figured he'd forgotten his password, so I sent it to him and he confirmed he could log in. I didn't think anything of it.

Two hours later, right around log-off time, I got an email from a client saying they weren't able to log in. I dismissed it, figuring they had forgotten their password and intending to get back to them first thing the next morning. Then a mobile developer on my team reported the same thing.

Smells a bit fishy

So I got to investigating. I went to the website and tried to log in, and I couldn't. The page would simply reload when I hit Enter, without showing any errors.

Wait what

I quickly started debugging when a colleague mentioned it could be an issue with the database connection. We had recently moved from a single database instance to a database cluster, and we assumed the migration might be causing the issue, or that one instance was taking on too much load. Since this had been the biggest and most recent change, we narrowed our focus to it. However, the database console looked fine and didn't show any extra load on any particular instance.

What does that mean

The issue only happened on production, while the same code ran fine everywhere else, so it was safe to assume it wasn't related to any code changes. At this point we were getting more and more complaints from clients, so I decided to do something dangerous: debug on production.

Buckle up

I connected to the server using Cyberduck, navigated to the login view file, and added a log statement (something like "logging in"). To my surprise, when I hit save, the file didn't get saved. Cyberduck showed a vague error that I can't remember and didn't understand at the time.

Huh

After a couple more hours of debugging, we realized that the server had reached 100% disk usage. That day, I learned two useful Unix commands: du and df (there's a quick example of both after the excerpts below). From their man pages:

The du utility displays the file system block usage for each file argument and for each directory in the file hierarchy rooted in each directory argument.

The df utility displays statistics about the amount of free disk space on the specified filesystem or on the filesystem of which file is a part.
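
In practice, putting the two together looks something like this; /var is just the directory that mattered on this server:

$ df -h                                                 # how full is each mounted filesystem, human-readable
$ sudo du -sh /var/* 2>/dev/null | sort -h | tail -5    # the five heaviest directories under /var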

This meant one thing: we had to upgrade the disk size. Thankfully my colleague figured out how to do that with no downtime.
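
I wasn't the one who did the resize, but as a rough sketch: on an Ubuntu cloud VM with an ext4 root filesystem, growing the disk online typically means enlarging the volume from the provider's console first, then running something like the following (the device and partition names here are assumptions, not our actual server's):

$ lsblk                        # confirm the block device now reports the larger size
$ sudo growpart /dev/xvda 1    # grow partition 1 to fill the enlarged device
$ sudo resize2fs /dev/xvda1    # grow the ext4 filesystem while it stays mounted
$ df -h /                      # verify the new capacity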

Crisis was averted. People were able to login.

Phew

The end... Not

Believe it or not, due to the immense workload we had at the time, no further action was taken to monitor the server's disk space or to dig deeper into why this had happened. So, somewhat unsurprisingly, two months later the server reached 100% capacity again!

We were better prepared this time, quickly identified the issue, and upgraded the disk size again. This time around, I also took the time to dig into why it happened, since we hadn't uploaded anywhere near enough files in the previous two months to justify filling up roughly 90 gigabytes.

Again, I used the du and df commands to pinpoint the directory that was eating up the disk space:

$ du -sh /var/*
...
170.3G    /var/mail
...

Surprised

Imagine that: the mail directory was taking up 170 gigabytes, almost 80% of the entire server's disk space! Further digging showed that the culprit was crontab. We had several cron jobs running, and cron emails each job's output to the user the crontab belongs to (the ubuntu user in our case); those emails get stored under /var/mail. This behaviour is spelled out right in the crontab file, as shown below, but one particular cron job was producing so much junk output that it managed to fill up the directory quite quickly.

$ crontab -l
...
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
...
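
To make that concrete: any crontab entry whose output isn't redirected generates one local email per run, and a chatty job on a tight schedule adds up fast. The entry below is a made-up example, not our actual job:

# hypothetical job: runs every minute and prints progress and warnings to stdout/stderr
* * * * * /usr/local/bin/sync-reports.sh

# every line it prints ends up as a message appended to the owner's mailbox in /var/mail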

Now what?

The plan of action was to first stop further emails, then to delete the existing ones to free up disk space.

$ crontab -e
# setting MAILTO to an empty string disables cron emails for every job in this crontab
MAILTO=""

$ sudo rm /var/mail/ubuntu
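
Setting MAILTO to an empty string silences mail for the whole crontab. A more surgical option we could have used instead is to redirect only the noisy job's output (again using the made-up job from above):

* * * * * /usr/local/bin/sync-reports.sh >/dev/null 2>&1    # discard stdout and stderr so cron has nothing to mail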

Smarter and wiser, we figured we should set up a monitoring service to catch this particular issue if it ever happens again. The service of choice was Monit, and it was surprisingly easy to start using. It provides a dashboard that lets us easily visualize all the numbers we care about, from disk space to CPU usage to memory, and it sends email alerts on custom events. This great article is very helpful for setting up Monit on an Ubuntu server.
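
For a sense of what the configuration looks like, a minimal disk space check in Monit's control file (/etc/monit/monitrc, or a snippet under /etc/monit/conf.d/) is only a few lines. The poll interval, threshold, and email address below are placeholders, not our production values:

set daemon 120                       # poll every two minutes
set mailserver localhost             # mail server Monit uses to send alerts
set alert ops@example.com            # placeholder address that receives the alerts

check filesystem rootfs with path /
    if space usage > 80% then alert  # warn well before the disk hits 100% again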

And the rest is history. We haven't faced a disk space issue since. So far.

So relieved

Thanks for reading! Until next time 👋

Cover photo by Taylor Vick on Unsplash
