Your Web Site Is Never Down
Clients are already reaping the benefits of our third-generation Cirrus Cloud Hosting. We have been able to upgrade the operating system on their accounts with absolutely zero downtime for their web sites. Operating system upgrades used to require rebooting the Cirrus account, and that meant five to ten minutes of downtime. Such service interruptions are now quaint relics of the past.
We have written about our new “high availability cluster,” but it can be hard to see the real benefit behind such highfalutin language. Let me make this more understandable by explaining how maintenance used to work and how it works now.
Once upon a time, in ye olde days of traditional computers, software updates were a simple, if disruptive, process. We would copy the files with the new software onto the computer and then reboot the computer. Since rebooting the computer meant that the web site would be “off the air” for a few minutes, we tried to do this at a non-disruptive time of day. We would also closely monitor the computer, to be sure that it actually rebooted correctly. It almost always did but, occasionally, a problem would occur and a web site might be down for much longer than “a few minutes.”
Cirrus Cloud Hosting accounts now run on top of a cluster of cooperating computers, instead of on top of a set of computers, each operating independently. In the cluster, each computer helps the others to keep the web sites running 100% of the time. Let’s consider a small example. In our simplified cluster, we have three machines named Alpha, Bravo, and Charlie. There are several Cirrus accounts (each with web sites) running on Alpha and several more running on Bravo. Charlie is just idling away, being held in reserve in case it is needed.
We get a security patch for the Linux operating system which needs to be applied to all of the Cirrus accounts. We do that by following this recipe:
- We patch the operating system on the spare machine, Charlie. Once patched, we reboot Charlie. Since Charlie had been idle, no web sites are disrupted during this process.
- We “live migrate” all of the Cirrus accounts off of Alpha and onto Charlie. During the live migrations, the Cirrus accounts (and the web sites) pause for a few seconds but they do not go down or reboot. People surfing the web sites or uploading files with FTP rarely even notice the brief pause. With the migrations complete, the Cirrus accounts are now running on Charlie and Bravo, while Alpha sits idle.
- With Alpha idle, we patch its operating system and reboot it.
- We then live migrate the Cirrus accounts from Bravo to Alpha, again without any downtime for the web sites. This leaves Cirrus accounts running on Charlie and Alpha, with Bravo idle.
- Finally, we patch the operating system on Bravo and reboot it.
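For readers who like to see a recipe as code, here is a minimal Python sketch of the rotation described above. It is only a toy simulation, not our actual tooling: the function name `rolling_patch`, the account labels, and the log messages are all made up for illustration, and "patching" and "live migrating" are reduced to moving names between lists.

```python
def rolling_patch(placement, spare):
    """Simulate patching every machine in a cluster, one idle machine at a time.

    placement: dict mapping machine name -> list of accounts running on it
    spare:     name of the machine that starts out idle
    Returns (final placement, name of the machine left idle, list of steps).
    All names here are hypothetical; this is an illustration only.
    """
    # Work on a copy so the caller's data is left untouched.
    placement = {host: list(accounts) for host, accounts in placement.items()}
    busy_hosts = [host for host in placement if host != spare]
    steps = []

    def patch_and_reboot(host):
        # A machine may only be rebooted while it hosts no accounts.
        if placement[host]:
            raise RuntimeError(f"{host} still hosts accounts")
        steps.append(f"patch and reboot {host}")

    # Step 1: the spare is idle, so it can be patched right away.
    patch_and_reboot(spare)

    for host in busy_hosts:
        # Move every account onto the freshly patched spare...
        placement[spare].extend(placement[host])
        placement[host] = []
        steps.append(f"live migrate {host} -> {spare}")
        # ...then patch the machine that was just emptied.
        patch_and_reboot(host)
        # The machine we just patched becomes the new spare.
        spare = host

    return placement, spare, steps


cluster = {"Alpha": ["acct1", "acct2"], "Bravo": ["acct3"], "Charlie": []}
final_placement, idle, steps = rolling_patch(cluster, spare="Charlie")
for step in steps:
    print(step)
```

Run against the Alpha, Bravo, and Charlie example, the sketch ends with Bravo idle, which matches the walkthrough. Notice that a machine is never rebooted while accounts are still on it; the simulation refuses to do so.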
As you can see, all of the Cirrus accounts were always up, so all of the web sites were always up. Everything got upgraded. And we always have a spare server standing by “just in case.” You might well wonder, just in case of what? Excellent question! Tune in next month for the answer.