How do you deal with the loss of a single member server?

Recently I've been working on failover and high availability planning for my production servers. As I thought through all of the options I started wondering how most people handle this in their production environments. Specifically, I'm interested in what happens when a standard (or "member") MetaFrame Presentation Server fails. How do you deal with it?

There are really only a few options:

  • Restore from backup
  • Manual reinstall
  • Repeat the original installation process
  • Break a mirror from another server
  • Mirroring software ??

I guess the most obvious / easiest way to handle this is to restore from backup. Then again, I don't think any of my clients actually back up their Citrix servers (usually on my recommendation). Why not? The Citrix member servers shouldn't contain any user or application data that could be lost, so there's no reason to back them up (well, not all of them anyway). If a server does fail then they can easily restore it using one of these other methods.

Another common way to deal with a crashed server is to wipe the drives and reinstall the operating system and applications from scratch. Depending on your environment, this could take anywhere from a few hours to a few days. One of the nice things about reinstalling from scratch is that it gives you a "pristine" server. Of course the downside (in addition to the time wasted) is that most likely this new server would not be 100% identical to your other existing servers, and that can make troubleshooting and management more difficult.

If you originally used unattended installation scripts or images to deploy your servers, then you could just re-run the original deployment process. You'd still have to manually apply all the changes that you made since the original installation date, but at least you wouldn't have to sit through the basic setup stuff again.

If your failed server has the same hardware as another good server in the silo, you could use the "mirror breaking" deployment method. To do this, just pull a RAID 1 drive out of a good server and put a blank drive in its place. That server will automatically replicate the stuff from the good drive to the new blank drive. In the meantime, you can unplug your new server from the network, pop the broken mirror drive into it, boot Windows, change the computer name, SID, and IP address, reboot, pop in the network cable, and add it back into the domain. (Don't forget to manually delete the old computer account from the domain.) This whole process takes about ten minutes, and you'll have a perfect copy for the new server. The main downside to this is that you have to have another server that's identical and that's stable enough to use as the source image.

The final option you have is new to me, and frankly I'm not sure whether it would work well in a Citrix environment or not. This option involves the use of server data mirroring or replication software. I got to thinking about this when I was working on my web servers. I'm implementing NSI's Double-Take product, which is software that keeps files (even locked and in-use ones) in sync and replicated across multiple servers. In my case I have two SQL servers. All of the SQL database files on my primary server are continuously replicated to a second server. The primary server supports my website. But if it goes down or if I need to perform maintenance, the software fires up the SQL services on the second server and moves the IP address over, and (since the database files are already there) I have my full SQL database back up and running in about three seconds. The software continuously keeps the replicated copies of the database files up to date.

This got me thinking. Could software like this be used in a Citrix environment? (I mean beyond the data store itself, in which case it's perfect.) My thoughts would be that you would have this mirroring software installed on all of your Citrix servers. If one server failed, you would have a live "image" of it on some other server that you could instantly deploy to new hardware. It might be kind of cool.

On the other hand, you'd have to pay for this imaging software. Double-Take costs $2500 per server, although there are cheaper solutions that don't have quite as many features. (Peer Software, for example, has one that does everything except the automatic failover for $500 per server.)

I'm not sure if this imaging software makes sense in a Citrix environment or not. There are certainly many other ways to be able to quickly recover from a server loss, and if you do have data on each server then why not just use regular backup software? Even in the worst case you should be able to back up one server from each silo that you can restore in the event that one fails.

So, what do you do if you lose one of your Citrix servers?

Join the conversation



Image your existing servers (physical or virtual) as VMware-based suspended servers (HD space is cheap) and monitor the original servers' heartbeat. If a server is considered down, jumpstart the corresponding VMware image while making sure the original box does not come back unannounced. This approach (in theory) can maintain a large pool of suspended servers that are ready to replace a few downed servers. Practical implementation will depend on the complexity of the target server(s) and might be limited.
PS. VMware "jumpstart" can be automated with VMware SDKs.
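The heartbeat idea above could be sketched roughly as follows. This is only an illustration, not anything VMware ships: the `HeartbeatMonitor` class, the `start_standby` hook, and the threshold of three missed heartbeats are all assumptions made for the example. The hook stands in for whatever would actually resume the suspended image.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class HeartbeatMonitor:
    """Counts missed heartbeats per server and triggers a standby once."""

    # Hypothetical hook: whatever resumes the suspended image for a server.
    start_standby: Callable[[str], None]
    # Assumed threshold of consecutive misses before a server counts as down.
    max_missed: int = 3
    missed: Dict[str, int] = field(default_factory=dict)
    failed_over: Set[str] = field(default_factory=set)

    def record(self, server: str, alive: bool) -> None:
        if server in self.failed_over:
            # The standby already took over. Ignore the original box even if
            # it answers again, so it cannot come back unannounced.
            return
        if alive:
            self.missed[server] = 0
            return
        self.missed[server] = self.missed.get(server, 0) + 1
        if self.missed[server] >= self.max_missed:
            self.failed_over.add(server)
            self.start_standby(server)
```

Requiring several consecutive misses keeps a single dropped ping from starting a duplicate server, and the `failed_over` set is what guards against the original box rejoining on its own.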
is to simply reinstall the server from scratch using a fully automated unattended setup routine.

We're using it all the time. I can barely think of a customer where I've installed servers manually in the last year.

But this of course depends heavily on working only with software packaging tools (Citrix IM, for example) and on keeping your unattended setup regularly updated with all the hotfixes etc.

If you have that, it's only a matter of inserting the disk/CD and starting your routine. The server will automatically do all the work.
You can use Microsoft RIS and in particular RIPrep to deploy your image out to the server. Whenever you patch or otherwise update the server, simply RIPrep the server again to get the latest image. You can use the process outlined in CTX18194 to prepare the box before the RIPrep and for what to do after it comes up.

Lots of installs I've done recently have been on blade hardware, and they often have their own imaging software (Altiris etc.) which you can also use. These tend to be better as it's easier to implement things like scripting the breaking and re-creating of NIC teaming (which avoids duplicate virtual MAC addresses).
Never do ANYTHING on a Citrix farm by hand! You're bound to forget something. Automate everything and a crashed server won't pose much of a problem!
We use RIS for the standard Citrix server installation. After the last reboot, at every boot a script looks at a QFE registry key. We have jobs prepared in separate numbered directories. If there is a directory with the same number, the application, hotfix, or registry files within this directory are installed, and are therefore active after the next scheduled reboot.

All applications are installed through IM.
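The numbered-directory scheme described in this comment might look something like the sketch below. It is not the poster's script. The function name `pending_jobs` is hypothetical, and it assumes that the stored number advances by one after each job and that a gap in the numbering pauses the chain. Reading the registry key and running the installers are left out.

```python
import re
from typing import Iterable, List


def pending_jobs(current_level: int, dir_names: Iterable[str]) -> List[int]:
    """Return the job numbers to run, in order, starting at current_level.

    Assumption: the counter advances by one after each job, and the chain
    stops at the first missing number until that job is published.
    """
    # Keep only purely numeric directory names, e.g. "0003" -> 3.
    numbered = {int(name) for name in dir_names if re.fullmatch(r"\d+", name)}
    jobs = []
    level = current_level
    while level in numbered:
        jobs.append(level)
        level += 1
    return jobs
```

A freshly rebuilt server starting at the lowest number would walk through every job in order, which is what lets a rebuild catch up with the rest of the farm without manual steps.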
We used to use Double-Take for our SQL server. It seemed like a good idea at the time. Keep this in mind if you decide to use it: after you have "failed over" and rebuilt/recovered the primary system, it is a manual process to "fail back". We've had mixed success with the software. When it was time to upgrade this year, we bit the bullet and opted for MS Clustering with a SAN backend. Not the cheapest route, but much, much more user-friendly.
Assuming no hardware failure, TrueImage is a good solution. It can create an image while Windows is running and store the image in a secure partition on the logical drive (using mirrored drives to protect against drive failure).

If you hork your OS up, simply reboot, press F11 to launch the Acronis secure zone utility, and restore the last image created. You can also store images in other locations (network shares, and then back those up to tape) if you are afraid of hardware failures or don't use mirrored drives (but this is not as convenient as the SecureZone option). Heck - with an HP iLO you can reboot the server remotely and press F11 from the comfort of home using a web interface.

Acronis also supports incremental backups so you don't have to create a full disk image each time. It also has a scheduling system, so you just set up your schedule of full and incremental backups to your secure zone.

Acronis has crappy tech support, but the product usually works pretty good.
We use RIS extensively to build everything from laptops to servers, including terminal servers. Now I haven't automated the Citrix software (since we are primarily straight Terminal Server), but I have a single VBS that has the logic to install EVERYTHING.

The one exception - it doesn't disable unused NICs (that would otherwise show up as Red X'es in the tray). Otherwise, the server would be completely usable by any person, on any hardware platform we support, from a desktop to an 8-way.


A right sized Citrix farm can sustain server outages without the need for a failover strategy that includes an immediate rebuild. If budget is no problem, why wouldn't you just create a farm large enough to handle twice the load being placed on it. And if a single server rebuild is all you want, then simply have one extra server in the farm. This would be the equivalent of what would be accomplished with a one to one mirroring without the cost and complexity of the mirroring software solution. Then you can rebuild servers without pressure using whatever method suits you.
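The sizing argument above is simple arithmetic, sketched below. The function names and the figures in the usage note are made up for illustration, and the model assumes every server carries the same number of users, which real farms only approximate.

```python
import math


def servers_needed(peak_users: int, users_per_server: int, spares: int = 1) -> int:
    """Servers required to carry the peak load, plus spare capacity."""
    base = math.ceil(peak_users / users_per_server)
    return base + spares


def survives_loss(total_servers: int, peak_users: int,
                  users_per_server: int, lost: int = 1) -> bool:
    """True if the remaining servers can still carry the peak load."""
    return (total_servers - lost) * users_per_server >= peak_users
```

For example, with a hypothetical 400 peak users at 50 users per server, `servers_needed(400, 50)` gives 9: eight to carry the load and one that can be down for a rebuild without anyone noticing.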

I would prefer an imaging solution or mirror break. Rapid deploy tools like Altiris are good as well for images or scripted installs, and have the RIP and Replace functionality as well.

Additionally, I recommend removing the failed server from the farm prior to reintroduction to prevent any data store corruption, and running dscheck after removal. This is in addition to SID changes and domain removal when using the break mirrored drives method.

Ultimately, I am excited about the possibilities presented with solutions like VMware and the potential of upcoming features like dynamic load balancing and failover capabilities. If VMware can truly support production load Citrix servers, this is the solution to solve issues like failover and resource hungry apps (read runaway processes) that can intermittently consume too many server resources to the detriment of all other connected users. This type of dynamic load balancing could be a great addition towards solving the real pitfalls of server based computing, rerouting entire servers to the available resources they so desperately need.
I use RIS and custom VBS and BAT files to build a server. It takes time to set up, but saves you several hours and gives you server consistency across the server farm.

Application can be installed by IM, or custom scripts, using administrative install feature, AD Software Policies, and sometimes simply "setup -s" :-D
No matter how you build your servers, you or someone else will eventually make some undocumented changes to them. I suggest using Symantec Live State Recovery.
I always build my servers with unattended scripts for the OS/Citrix initial install. After that EVERYTHING gets deployed through Installation Manager. If a new patch comes out, I create a package and add it to my "complete install" package group. This way I can easily rebuild a server and ensure it has all the changes I need applied. I even use IM for configuration changes, etc. Images get out dated and are not as clean when hardware differs. I have seen many other solutions implemented and have ripped them out to go with unattended/IM.
Our servers are located at 2 Remote sites, so here we use Altiris Deployment Solution to Build and rebuild our servers, when we have a failed server we kick off a redeployment job that scripts the OS installation, configures the server, joins the domain and then installs the packages that we use on the servers.

I create MSIs or Altiris Rapid Install Packages (RIP) for all changes that we make on the server. Any new applications are packaged and added as a job, and any reg fixes that are applied to resolve problems or to improve optimisation are imported into a RIP and added to the build process as a job. If it's an application-specific fix, then I usually go back to the original RIP, modify it to include the fix, and then redeploy it to the environment. This actually works well for us, because we manage around 60 servers and the number is likely to increase shortly. Deploying a package of the changes works well, and in general we can test the changes on our Test environment, compile a RIP, and then deploy it to Production without fiddling. We also ensure consistency of build by using an automated process to deploy changes.

The result is that with proper management, we can rebuild any Citrix Server completely unattended, we have enough capacity that we should be able to lose half of our environment, so losing a server for a couple of hours while it does a rebuild is not an issue, we use load balancing for our Citrix Failover and similarly we have redundancy in every critical component of the system.

I would recommend Altiris as a product to anyone, I've also used it with Blades and it does some funky stuff with those, as it can keep a history of jobs on a per slot basis, if a blade in a slot fails, swap the blade and the previous history of jobs will be applied to the slot.
Ok... so that was my post... I just wasn't logged in.
Simply use Software Deployment Tools. In a Citrix Environment use VisionApp and you're out of trouble.
I administer a farm of 15 servers. Some are RAID5, some RAID1 and some are just Workstation class non-redundant systems.

I agree with comment from Anonymous "what's your budget" regarding time sensitive applications. With enough capacity, a failed server should have no real impact in production environment.

We started with the RAID5 servers- all manual build. The next set of applications inspired the "mirror breaking" method and were thus set up as RAID1. The final applications inspired the use of workstation class systems with single IDE drives. We use ghost in the style of the mirror break method.

In general, we prefer the mirror break method (aka Ghost creation for non-RAID systems) to replace a defective unit.
We use the sepagoINSTALLER to build Citrix farms from scratch. It is really simple to implement disaster recovery that way. Never do manual stuff in a Citrix environment. You lose productivity and money!
Where do I get the sepagoINSTALLER?
I just use good old Ghost, have a batch of network boot disks and if something untoward happens to my TS server I just boot up and re-ghost.

Smaller scale network here, so that works perfectly for me and cheap as chips. :-)

We're running RAID 1 and for every member server we have 3 disks. Before we change anything on our member servers we pull a disk out and replace it with the other. This way if the change fails or the server goes down, we have a good disk with a config we know works. With this solution you could even keep a disk off-site in case things went really bad and your offices burned/flooded/got raided! One large credit card and a pleading phone call to Dell/IBM/HP and your new server is being delivered :-)

Also, we always keep one server on our farm updated but off.
But is the SAN still a single potential point of failure? That is, if the SAN box goes down, then there is no failover. I am assuming the SAN is used for file storage.
For what it's worth, here's what I've done at some clients:
Point an alias to the SQL server and create a dump on a regular basis. When the SQL server can't be repaired in a reasonable time, there's a backup server waiting. Restore the DB on the backup and point DNS to the new server.
This is an older article now, but I must say, Altiris RDP for HP blades is the best option I have come across. It is fantastic: drop a job on your server and off it goes, rebuilding or reimaging to an exact standard you have set up. It takes about 1 hour to complete the entire reimage, app install and configuration.
Why did you not try Neverfail?
If you're interested, there's a Symantec Live State Recovery forum on
We have all but one of our Citrix servers diskless, using an iSCSI SAN. The SAN has a snapshot feature and we keep 7 days' worth of snapshots for every server, and our server platform is blade. Our strategy encompasses 2 areas: one is data corruption, the other is hardware failure. If data becomes corrupt for whatever reason, we simply fall back to a snapshot, which takes minutes to perform. If a server blade fails, we have spare blades to replace the down server with. If there are no spares, we can sacrifice a 'less critical' server until replacements can be acquired. Of course, the final question: What if the SAN fails? The answer to this falls into our Disaster Recovery strategy and entails catastrophe, so everything important in the environment is backed up to tape and will fall into our DR timelines for re-establishing services. Thus far, this strategy has proven VERY stable and has existed for three years running.
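The snapshot fallback described above amounts to two small decisions: which snapshots to keep, and which one to roll back to. The sketch below shows only that logic. It does not talk to any SAN. The function names are hypothetical, and it assumes the time at which the corruption began is known.

```python
from datetime import datetime, timedelta
from typing import List, Optional


def prune(snapshots: List[datetime], now: datetime, keep_days: int = 7) -> List[datetime]:
    """Keep only snapshots taken within the retention window."""
    window = timedelta(days=keep_days)
    return [s for s in snapshots if now - s <= window]


def restore_point(snapshots: List[datetime], corrupted_at: datetime) -> Optional[datetime]:
    """Pick the newest snapshot taken before the corruption started."""
    earlier = [s for s in snapshots if s < corrupted_at]
    return max(earlier) if earlier else None
```

If the corruption is older than the retention window, `restore_point` returns `None`, which is the case where the tape backups mentioned above would have to take over.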
If you have never recovered a server from a snapshot, or by redirecting a SAN LUN to another boot initiator, you really don't know what you're missing.
Do you have any hints or tips for me on how to get Altiris to clone Citrix servers? I'm working with Citrix 4.0 servers and Altiris 5.6 sp1.
Kind regards