Using CVS for Web Server Farms

September 19, 2000

Let's assume that you've adopted the techniques described in Using CVS for Web Development (or [Dreilinger]) and you are the Release Master. Now imagine that instead of one web server you have a web server farm of 10 or 100. How do you prevent chaos and carpal tunnel syndrome from keeping you chained to your desk for days on end? Does this look familiar?

  1. Log onto the web server
  2. Use CVS to check out a new release into a separate directory from the currently running production code
  3. Quickly update your data model and switch directories
  4. Restart the server
  5. Check that the new server is running okay
  6. Log out of the web server
  7. Go to step 1 until you run out of servers
Does this look like an algorithm to you, too?

The way out of the madness

Part of the problem with scripting a process like this is that you can't be certain what responses the various tools will return. You don't want to blindly continue, because that might bring your site down. Add a new tool to your quiver: Expect. It is a tool for automating interactive applications. Part of the Expect release is a sample script, passmass, which demonstrates how to change a user password on multiple systems (that are not using NIS or Kerberos, of course). With just a few modifications, a new script was born, cvsmass(1), which performs (almost) all of the tasks listed above.

cvsmass knows how to log onto remote machines in various ways (telnet, rlogin, slogin, ssh). It will ask you for the remote password, if it is needed, and then reuse it for each subsequent server (this assumes that you use the same password on each system). Once the shell session is started, it will change into a directory (e.g. /webroot) and then run any program that you specify (cvs update by default).
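To make the shape of the loop concrete, here is a rough shell sketch of the same idea. It is not cvsmass itself (that is an Expect script that answers password prompts); this version assumes key-based ssh logins so there is nothing interactive to answer, and the function names remote_command and run_on_hosts are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the cvsmass idea, not the Expect script itself.
# Assumes key-based ssh logins, so no password prompt has to be answered.

# Build the command line that will run on each remote host.
remote_command() {
    dir=$1
    program=$2
    printf "cd '%s' && %s\n" "$dir" "$program"
}

# Run the program on every host in turn and stop at the first failure,
# rather than blindly continuing.
run_on_hosts() {
    dir=$1
    program=$2
    shift 2
    for host in "$@"; do
        echo "== $host =="
        ssh "$host" "$(remote_command "$dir" "$program")" || {
            echo "failed on $host, stopping" >&2
            return 1
        }
    done
}

# Example: run_on_hosts /webroot 'cvs update' webserv1 webserv2
```

Stopping at the first failure is the point: a half-updated farm is easier to reason about when you know exactly which server broke.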

You'll also need a couple of utility shell scripts: cvs-pull-snapshot, which pulls a release from the CVS repository, and switchrel, which switches the release directory on the production server. I am assuming that the name used in your server config files is a symbolic link that points to a date-stamped directory. These scripts are heavily customized for my site, so you must modify them to suit your needs.
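As an illustration of the symbolic link arrangement, here is a minimal sketch of what a switchrel-style script might do. The real switchrel is site-specific and not reproduced here; the function name switch_release, the link name www, and the www-yyyy-mm-dd directory naming are all assumptions. The -n option to ln is a GNU/BSD extension, not POSIX.

```shell
#!/bin/sh
# Hypothetical stand-in for switchrel; the real script is site-specific.
# Assumed layout: $root/www is a symlink to $root/www-yyyy-mm-dd.

switch_release() {
    root=$1
    stamp=$2
    target="www-$stamp"

    # Refuse to switch to a release that was never pulled.
    if [ ! -d "$root/$target" ]; then
        echo "no such release: $root/$target" >&2
        return 1
    fi

    # Refuse to clobber a real directory that happens to be named www.
    if [ -e "$root/www" ] && [ ! -L "$root/www" ]; then
        echo "$root/www is not a symbolic link" >&2
        return 1
    fi

    # -n keeps ln from following the existing link into the old release.
    ln -sfn "$target" "$root/www"
}

# Example: switch_release /web 2000-09-19
```

Because the old release directory is left untouched, switching back is the same call with the previous date.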

Once the tools are installed, here's the general process:

  1. I'll assume that the Release Master has created a branch in the CVS repository. We usually call ours Snapshot-yyyy-mm-dd.
  2. Run cvsmass to pull the snapshot onto each server:
    cvsmass -dir . -program 'cvs-pull-snapshot.sh yyyy-mm-dd' webserv1 webserv2 ...
    

    The utility script extracts the release into a directory with a date extension (yyyy-mm-dd) so it won't conflict with the running release. The date is also part of the CVS tag, so the two are related. Depending on how many servers need to be updated, this can take anywhere from minutes to hours, but so what? You can do this well before the cutover is scheduled.

  3. Now that the new release is available on each of the servers, you must go to each server, switch a few symbolic links and restart the server.
    cd /web
    switchrel yyyy-mm-dd
    restart the server (usually kill `cat nspid`)
    
    I've considered scripting this part, too, but I'm too chicken to trust even my own tools. If switchrel fails, I want to be able to go back and fix it before restarting the web server. In addition, our restart requires root privileges, which we grant by using sudo.
  4. Check the server logs to make sure everything is okay, move to the next server and go back to step 3.
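For step 2, the utility script boils down to a tagged checkout into a date-stamped directory. The sketch below shows one way that could look; it is not the actual cvs-pull-snapshot.sh, and the module name www, the www-yyyy-mm-dd directory naming, and both function names are assumptions. It also assumes CVSROOT is already set in the environment.

```shell
#!/bin/sh
# Hypothetical sketch of a cvs-pull-snapshot helper; the module name "www"
# and the www-yyyy-mm-dd directory naming are assumptions.

# Print the cvs command that would check out one snapshot.
snapshot_command() {
    stamp=$1
    case $stamp in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
        *) echo "usage: snapshot_command yyyy-mm-dd" >&2; return 1 ;;
    esac
    # -r selects the tag; -d names the directory to create.
    echo "cvs -q checkout -r Snapshot-$stamp -d www-$stamp www"
}

# Check out the snapshot into a fresh, date-stamped directory.
pull_snapshot() {
    cmd=$(snapshot_command "$1") || return 1
    if [ -e "www-$1" ]; then
        echo "www-$1 already exists" >&2
        return 1
    fi
    $cmd
}

# Example: pull_snapshot 2000-09-19
```

The sketch uses checkout rather than export so that the CVS administrative files stay in place, which a later cvs update on the server needs.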

Exceptions and variations

ssh magic

In order to use CVS on production machines, you need to communicate with the CVS repository server. There may be no direct (unfirewalled) connection between the production network and the development network. Rightly so. What is required is a "RemoteForward" connection using ssh. This creates a local port binding on the remote machine that tunnels back to the local machine, and opens a connection to another service.

[dev(ssh)]      ==> [webservN(sshd)]
[dev:2401(cvs)] <== [webservN:2401(sshd, RemoteForward tunnel)]
On the production systems, it appears that there's a local connection to a CVS server; in reality it is a tunnel back to the development network via ssh. You can do this on the command line by adding -o 'RemoteForward=2401 dev:2401' or you can add it to your .ssh/config file, as I have:
Host = webserv*
RemoteForward = 2401 dev:2401
The only caveat to this is that if you want to open a second shell to a webserv, you need to override the RemoteForward (because only one process can bind to a particular port at a time, in this case 2401). Add -o ClearAllForwardings=yes to the ssh command line for second and subsequent connections. You'll know that you've forgotten to do this if you get an error message like: bind: Address already in use
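If you find yourself forgetting which kind of connection to open, a tiny wrapper can spell out the two cases. This is only a sketch: ssh_command is a made-up helper that prints the command line rather than running it, and it uses just the two ssh options mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print the ssh command line for a web server.
# The first session carries the CVS tunnel; pass "extra" for additional
# sessions so they do not try to bind port 2401 a second time.
ssh_command() {
    host=$1
    mode=${2:-first}
    if [ "$mode" = extra ]; then
        echo "ssh -o ClearAllForwardings=yes $host"
    else
        echo "ssh -o 'RemoteForward=2401 dev:2401' $host"
    fi
}

# Example:
#   ssh_command webserv1          (first shell, sets up the tunnel)
#   ssh_command webserv1 extra    (second shell, no forwarding)
```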

Updates that don't require a full release

Scenario 1: Your service is up and running just fine, and you need to make a minor change to half a dozen files. Do you make a whole new release? If not, how do you get the changes propagated out to all of the servers? The answer is simply to run cvs update on all of your servers. If the changes don't change any function definitions or initializations, you can get away without restarting the servers at all.

  1. Your team makes the changes in the directories served by your stage server. They should be checked out using the same CVS tag as the one you used to make the original release. Once the changes are tested and QA'd, check them into the repository (either manually or by a nightly cron job).
  2. Run cvsmass, but this time the program to run will be cvs update.
    cvsmass -dir www -program 'cvs -q up' webserv1 webserv2 ...
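    When running cvs -q up across many servers, it helps to have something scan the output for trouble. cvs update prefixes each file with a status letter (U or P for updated, M for locally modified, C for conflict). The sketch below reads that output on stdin and fails if any conflict is reported; check_update_output is a made-up name, not part of cvsmass or CVS.

```shell
#!/bin/sh
# Hypothetical filter for `cvs -q up` output, read from stdin.
# Returns nonzero if any file is reported with status C (conflict).
check_update_output() {
    awk '
        /^C / { conflicts++; print "conflict: " substr($0, 3) }
        /^M / { modified++ }
        END {
            printf "%d locally modified, %d conflicts\n", modified, conflicts
            exit (conflicts > 0)
        }
    '
}

# Example: cvs -q up 2>&1 | check_update_output || echo "fix before restart"
```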
    

Scenario 2: You're running a service that has numerous co-brand arrangements. One of your customers has a legal requirement to be ever so slightly different than the rest of your site. It isn't worth creating a whole new body of code for it, so how do you keep track of the difference while still updating the site so it stays in sync with the rest of the world?

It should be no surprise that I've had to deal with this problem firsthand. We had to comment out one line of HTML code on one group of servers. It was a rush job, so we made all of the changes on the live production system. Here's what we did:

  1. Edited the offending code
  2. Ran cvs diff -u > patchfile
  3. Saved patchfile in a common spot (we have a few directories NFS mounted for utilities and such).
  4. Ran cvsmass to run the patch program on each of the servers
    cvsmass -dir www/bus -program 'patch -b < $HOME/patchfile' webserv4 webserv5 ...
    

An interesting note is that when we do minor updates, as in Scenario 1, the patch we made does not disappear. CVS is smart enough to apply incremental changes without wiping out any local modifications (it merges the repository changes into the modified working file, much as patch would). On the other hand, we have a large note to remind ourselves to re-apply this patch every time we roll out a new release into new directories.

Some advantages and things that could be done better

It never fails that about 2 hours before I'm scheduled to switch to a new release, a developer comes running in with a last-minute fix or modification. In the bad old days, I'd have to re-roll a tar ball and scp it to every server in the farm (and in those days there were fewer servers in the farm to begin with). Now, it's a no-brainer. I just check the changes in (it did take a good month of nagging before the HTML folks got the idea that they had to make changes in the staging area), and run an update script that calls cvsmass with the appropriate arguments.

Rollbacks (in case the sh*t really hits the fan) are just the same as a release: call switchrel with the previous release date, restart the servers, lather, rinse, repeat.

It would be nice if I had also written a script that does the switching and restarting, but, at least in my current situation, it pays to go slow. We are using Cisco LocalDirector to load balance the traffic. If a site goes down (during a restart) it is taken out of the load balance pool for a few minutes. That is just about enough time for me to move (logout, login, cd, start a 'tail -f') from one server to the next, so we never have more than one server offline at a time, and we never really go "off the air."

My one true regret is that the name 'cvsmass' is a little inappropriate since CVS is not integral to its function, but I can't come up with a better name in any fewer characters.


References

Reference material and source code

Using CVS for Web Server Farms - September 19, 2000 - Ken Mayer