Vultr, Why?
Let me begin by saying that Vultr is a good provider, even though it used to get a lot of negative feedback in the community. My experience with them has been good, their prices are reasonable, and their servers, especially the High Frequency instances, have excellent performance. Their support is…well, let’s say it’s OK. It’s fast, faster than some of its peers, but how helpful? That’s subjective. It’s fine as long as nothing is seriously wrong, and something did go seriously wrong on the 15th of July, 2020.
I had just wrapped up a 15-hour workday and was winding down to sleep when I started getting downtime notifications on Slack for one of our servers, which had around 20 sites on it. It’s common to receive 5-8 downtime notifications a day lasting 1-2 minutes as servers experience latency issues. Out of the 850+ sites we currently host, I can live with 5-8 notifications of 1-2 minutes each. Most of the time the sites aren’t even down for a full minute, but we see 1 minute because that’s the shortest interval at which our uptime monitoring tool checks. I get the notification, I go and check the site, and it’s up.
So back to the 15th. I got these downtime notifications and figured I’d see if the sites came back in 1-2 minutes, but they didn’t. When 5 minutes had passed, I logged into my Vultr dashboard and saw that the host node of that instance was facing connectivity issues and was being investigated. Not something I hadn’t seen before. I waited, as these issues typically get resolved in 10-15 minutes. When 30 minutes had passed, I decided to create a support ticket, which promptly got a reply that they were looking into it (didn’t I say the support is fast?). Being a nice guy, I waited patiently. As it was around 11pm EST, most of the clients were asleep, so I wasn’t getting any emails from angry clients, but this was starting to border on unacceptable.
After 45 minutes, I asked for an update. Again I got a quick reply: they were working on it. By then 1.5 hours had passed since the server went down. After another hour I asked for an update and got the same reply. This went on a couple more times until around the 6-hour mark, when I’d had enough. That’s when they told me they had been trying to recover the host node without success so far; they would keep trying, but it would take considerable time and there was no guarantee of success.

That’s when I knew it was going to be a long night; I had been working for 21 hours at that point. I had backups for the 16 sites on this instance, but the worst part was that I couldn’t release the attached IP and assign it to another instance. That meant pointing DNS at a new IP, which meant getting access to DNS. Most of our clients keep DNS with their domain registrars or their IT providers, so getting 2FA codes and updated logins can turn into a mess. We spun up a new instance and started restoring backups; within 2 hours all our backups were restored.
My cutoff was 7am EST. If I didn’t have a resolution from Vultr by then, I’d have to start changing DNS for the clients whose DNS I had access to, and start emailing and calling the rest to get updated logins. At 6:45am Vultr responded that the host node was up enough for them to migrate the instance to another host node, and within 30 minutes the server was back up after roughly 8 hours of downtime. This was the longest downtime we have ever experienced.
This event just accelerated the move we were already planning with BionicWP to GCP (Google Cloud Platform), and it solidified our intention to move everything to GCP’s more redundant, highly available infrastructure.
After a 25-hour shift, I finally got the chance to take a 2-hour break before heading back to the office for another long day. What can I say? Thank you, Vultr.