Tales of production downtime — get a managed database!

Alex Renoki
FAUN — Developer Community 🐾
5 min read · Mar 13, 2019


Photo by Felix Mittermeier from Pexels

One incident made me rethink why I should use a managed database and why replication across different physical servers or regions is really important.

The infrastructure of the app I’m going to talk about looks like this:

  • two worker nodes with a load balancer on top of them
  • one MongoDB instance
  • one MySQL instance
  • one Redis instance

The nodes communicate with each other over the private network. The only public inbound/outbound traffic goes through the LB, which distributes incoming requests across the worker nodes.

The incident was caused by a hardware problem on the server one of our instances runs on. Since the provider is DigitalOcean, I’m going to call the instances “droplets”.

Thursday. March. Sunny day…

12:44–12:47: Our monitoring apps report that the app is down. Time to gear up and find out what’s going on. The website loads slowly and, after the timeout, throws a 504 Gateway Timeout error. This happens when the server that received your request (a gateway or proxy) tries to reach another server, usually over a private network, and gives up because the response takes too long.

13:04: After many attempts to test each worker node’s connectivity with the other instances, I find out that the MySQL queries are timing out. SSH into the droplet is also timing out; one possible cause is a hardware issue. Proceeding with a soft reboot of the droplet via the CLI and reporting the incident to DigitalOcean.

13:08: The app goes into maintenance mode, so that users at least get a proper response (we have maintenance mode pages).

13:10: The soft reboot of the droplet stops with an error. Proceeding with a hard power cycle. (I don’t recommend doing this, since it can corrupt data. But the droplet hadn’t sent or received data for 15 minutes and we suspected hardware issues, so putting it to sleep with a pillow on its face seemed reasonably safe.)

At this point, the plan was to spin up a new droplet with MySQL, import the latest backup (which was 11 hours old), and update the connection settings on each worker node.

14:19: After several attempts, spinning up a new instance, then downloading and importing the backups, turned out to come too late: by then the app had been down for almost 2 hours, and the hardware issues had been fixed in the meantime. The app goes back online. Monitors report that the app is healthy.

The next day, early in the morning: DigitalOcean sends an email explaining that there were hardware issues. The instance was moved to different hardware, which made the problem disappear.
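The connectivity checks from the 13:04 step can be sketched roughly like this. It is a minimal illustration, not the script used during the incident: the private IP addresses are made up, the ports are just the default ones for each service, and `can_connect` and `probe` are hypothetical helper names.

```python
import socket


def can_connect(host, port, timeout_s=3.0):
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False


def probe(services, timeout_s=3.0):
    """Check every service once and return {name: reachable}."""
    return {
        name: can_connect(host, port, timeout_s)
        for name, (host, port) in services.items()
    }


# Hypothetical private-network addresses (assumption), default service ports.
SERVICES = {
    "mysql": ("10.0.0.11", 3306),
    "mongodb": ("10.0.0.12", 27017),
    "redis": ("10.0.0.13", 6379),
}

# Usage, from a worker node:
#   for name, ok in probe(SERVICES).items():
#       print(name, "reachable" if ok else "NOT reachable")
```

A successful TCP connection only proves that the port is open. It doesn’t prove that the database answers queries in time, so a check like this narrows down which instance is in trouble rather than confirming that everything is healthy.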

You don’t want production downtime!

With a multi-node/multi-AZ (availability zone) database, this shouldn’t have happened at all. The downtime wouldn’t have been noticed: one of the standby nodes would have taken control and served the requests while the unhealthy node recovered from the hardware issue.
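As a rough mental model of that failover, here is a toy sketch. It assumes a simple rule (keep the primary while it is healthy, otherwise promote the first healthy standby); real managed services use their own, more involved election logic. `Node` and `elect_primary` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str      # "primary" or "standby"
    healthy: bool


def elect_primary(nodes):
    """Keep the current primary if it is healthy; otherwise promote
    the first healthy standby. Returns None if no node is healthy."""
    current = next((n for n in nodes if n.role == "primary"), None)
    if current is not None and current.healthy:
        return current
    for node in nodes:
        if node.role == "standby" and node.healthy:
            if current is not None:
                current.role = "standby"   # demote the failed primary
            node.role = "primary"
            return node
    return None


# Example: the primary sits on broken hardware, so a standby takes over.
cluster = [
    Node("db-1", "primary", healthy=False),
    Node("db-2", "standby", healthy=True),
    Node("db-3", "standby", healthy=True),
]
new_primary = elect_primary(cluster)
```

In our incident there was a single MySQL droplet, so the equivalent of this function had nothing to promote.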

So why should you stop wrapping your head around deploying your own database instance on, for example, Ubuntu, and use a managed database instead?

A managed database service is a service offered by a cloud provider that takes care of many of the problems you would otherwise face when running and managing a database instance on your own:

  • It’s managed. It’s not your job to update the software to the latest patch version. Other people do that, usually with little or no downtime. It’s also a safer way to upgrade database versions.
  • It is configured automatically, both single-node and multi-node. No hassle. Want more nodes? Click. Want multiple availability zones? Click. If one instance goes down, the service doesn’t go down: all inbound traffic goes only to the healthy instances of your database. Once the unhealthy instances come back up, the healthy ones help them catch up (sync).
  • Scalability is easier. You can just replicate data to more nodes, without the hassle of setting up replica sets and other complex configuration. You can even have read replicas across the globe to ensure low latency.
  • One-click backups with no downtime. Snapshot the database at any point and sleep at night knowing that your data is safe: if disaster strikes, you can restore the database from the backup in a few clicks.
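To make the “catch up (sync)” idea from the second point more concrete, here is a toy sketch. It assumes the healthy node keeps an ordered log of writes and the recovering node replays only the entries it missed. This illustrates the concept and is not how any particular managed service implements replication; the `Replica` class is hypothetical.

```python
class Replica:
    """A node that rebuilds its state by replaying an ordered write log."""

    def __init__(self):
        self.data = {}
        self.applied = 0   # number of log entries already applied

    def catch_up(self, log):
        # Replay only the entries this node has not seen yet.
        for key, value in log[self.applied:]:
            self.data[key] = value
        self.applied = len(log)


# The healthy node's write log: a list of (key, value) pairs.
log = [("user:1", "alice"), ("user:2", "bob")]

replica = Replica()
replica.catch_up(log)          # the replica is in sync

# The replica goes down while new writes keep arriving...
log.append(("user:1", "alice-renamed"))
log.append(("user:3", "carol"))

# ...and replays the two missed entries once it is back up.
replica.catch_up(log)
```

With a managed database, this bookkeeping is the provider’s job, not yours.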

What are my options?

AWS offers a variety of managed database services that help you build highly available and easy-to-scale apps. AWS seems the most reliable and gives you a lot of choices, whether you’re running SQL or NoSQL. AWS even helps you migrate between database engines. For instance, you can migrate a MySQL/MariaDB database to Postgres using their Database Migration Service.

For relational databases, you have one strong option: Relational Database Service (RDS), which supports major SQL engines like Postgres, Oracle, or MySQL. Amazon also promotes Amazon Aurora, their own MySQL- and PostgreSQL-compatible database engine, which they state is built for the cloud and comes at a much lower cost. The choice is yours.

If you work with NoSQL and want a strong managed database, you have two options:

  • Amazon DocumentDB, which is compatible with MongoDB. If you are working with Mongo, this is your way to get a fully managed database.
  • Amazon DynamoDB, which is also NoSQL but works differently from Mongo. DynamoDB is a key-value database, meant more for storing data on the go. It’s widely used in gaming and IoT because it has low latency and is easy to scale up as you store more data.

Conclusion

Learning from others’ mistakes is a good thing. Take this lesson as a strong reason to switch to an easy, highly available, and scalable infrastructure for your projects. It’s worth the cost just so your app won’t face unexpected downtime.

💸 Sponsorship

Hi, I’m Alex, the founder of Renoki Co. Thank you for taking the time to read this article; I hope it helped you. Developing and maintaining packages and delivering good articles about Laravel, Kubernetes, and AWS takes a lot of time, but I believe it’s time well spent.

If you want to support more helpful articles, or you are using one or more Renoki Co. open-source packages in your production apps, presentation demos, hobby projects, school projects, or the like, sponsor our work with GitHub Sponsors. 📦

Follow us on Twitter 🐦 and Facebook 👥 and join our Facebook Group 💬.


Minimalist Laravel developer. Full stacking with AWS and Vue.js. Sometimes I deploy to Kubernetes. 🚢