Your users have one objective: to access their applications. They don't care about uptime percentages or system backups or server hot-swap components. They just know that when they click the button, their application had better start loading.
It's up to you to determine how available your system needs to be. (This actually means that it's up to you to determine how much money to spend on making your system redundant.) For example, it's possible to create a system that is online and available 24 hours a day, 7 days a week, 365 days per year. There are many real-world manufacturing companies that use MetaFrame XP to serve applications to their assembly line workers. These plants operate 24 hours a day; a single hour of downtime costs them millions of dollars.
On the other hand, there are quite a few companies that only have a few MetaFrame XP servers for a few dozen users. In these environments, system downtime might only cost a few thousand dollars per hour in lost business.
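To put rough numbers on that trade-off, the following sketch converts an availability target into expected downtime hours per year and an approximate yearly cost. The function names and the $2,000-per-hour figure are illustrative assumptions, not measurements from any real environment.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability such as 0.999."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Rough yearly cost of downtime at a given (hypothetical) hourly cost."""
    return annual_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    # Hypothetical small site that loses $2,000 for every hour it is down.
    for target in (0.99, 0.999, 0.9999):
        hours = annual_downtime_hours(target)
        cost = annual_downtime_cost(target, 2_000)
        print(f"{target:.2%} available: {hours:.2f} hours down per year, about ${cost:,.0f}")
```

Comparing the yearly cost at two availability targets gives a ceiling on what it is worth spending to move from one to the other.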
In order to determine what is required for your environment, we need to look at the various components of a MetaFrame XP system from the perspective of redundancy. Figure 17.1 shows a generic view of the technical components required for a user to access a MetaFrame XP application. If any one of these components is not available, the user will not be able to do so.
Figure 17.1: The MetaFrame XP components that must be functional
Throughout the remainder of this section, we'll analyze how each component in Figure 17.1 affects overall redundancy and what can be done to that component to strengthen the redundancy of the overall system. It's important to remember that all of these components work together as a system. Therefore, it does no good to think about the redundancy of one component without considering the redundancy of others. Your MetaFrame XP environment is only as strong as its weakest link.
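The "weakest link" idea can be sketched numerically. Assuming that components fail independently (a simplification) and that every component in the chain must be working, the availability of the whole path is the product of the individual availabilities. The component figures below are made-up placeholders.

```python
from math import prod


def system_availability(component_availabilities):
    """Availability of a chain in which every component must be working."""
    return prod(component_availabilities)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of `copies` independent, interchangeable units when any one suffices."""
    return 1.0 - (1.0 - single) ** copies


if __name__ == "__main__":
    # Hypothetical figures for client, network, web server, MetaFrame server, data.
    chain = [0.999, 0.995, 0.99, 0.999, 0.999]
    print(f"Whole path: {system_availability(chain):.4f}")

    # Doubling up only the weakest component (the 0.99 web server).
    chain[2] = redundant_availability(0.99, copies=2)
    print(f"With a second web server: {system_availability(chain):.4f}")
```

The chain can never be more available than its least available member, which is why strengthening an already strong component yields very little.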
ICA Client Devices
Of course by this point you are well aware that one of the major architectural advantages of thin client environments is that any user can connect from any client device. If a client device ever fails, a user can begin using a different device and pick up right where he left off.
Chapter 9 presented the issues to consider when designing your client device strategy. It can be summarized as follows: With ICA clients, apply "high availability" not by changing anything on the client itself, but rather by having a spare client device available to quickly replace a failed unit.
In a MetaFrame XP environment (or any thin client environment), your users instantly become unproductive if their network connection is lost. Short of running dual network cables to every user's ICA client device, you can configure clients to point to multiple MetaFrame XP servers on multiple network segments (as covered in Chapter 10).
You can also put dual network cards in your MetaFrame XP servers, configuring them for failover in case one stops working. Best practices suggest that each network card be connected to a different switch so that the servers can still function if a switch is lost.
NFuse Web Server
Since most people use NFuse to provide access to applications, take the necessary steps to ensure that a working NFuse web page appears whenever users enter the URL into their browser. There are two areas that need to be addressed to ensure that an NFuse web site is available:
- Users must be able to find a functional NFuse web server.
- The NFuse web server must be able to find a functional MetaFrame XP server.
Let's examine the steps that can be taken to ensure that neither of these becomes the Achilles' heel of your NFuse environment.
Ensuring Users can find an NFuse Server
The first item to consider when designing highly available NFuse environments is to ensure that your users will always be able to connect to a functioning NFuse web server, even if your primary server is down. Fortunately, people have been focusing on creating redundant websites for years, and there is nothing proprietary to prevent NFuse from working like a regular website. Three of the most common ways of ensuring website availability are:
- Connect to the server via a DNS name.
- Cluster the web servers.
- Create a manual backup address.
Option 1. Use a DNS Name to Connect to the NFuse Web Server
If users connect to an NFuse website via a DNS name rather than an IP address, that DNS name can be configured to point to any IP address. If something happens to the main server, the DNS record can be modified to point to a backup server. The disadvantage here is that the failover must be done manually.
Advantages of Using a DNS Name for Redundancy
- Quick to implement.
- Transparent to end users.
Disadvantages of Using a DNS Name for Redundancy
- Manual failover.
Option 2. Create a Web Server Cluster
Many web servers can be configured as a cluster, allowing one web server to take over if another fails. Cluster failover is automatic, although the hardware and software needed to run a cluster can get expensive.
Advantages of Building a Web Cluster for Redundancy
- Fast, automatic failover.
Disadvantages of Building a Web Cluster for Redundancy
- Specialized cluster hardware and software can be pricey.
Option 3. Manual Backup Address
Some people configure two identical web servers and instruct their users to try the alternate address if the first is not available. This is cheap and easy to implement, although it requires that your users remember a second address.
Advantages of Using a Manual Address for Redundancy
- Cheap and easy to implement.
Disadvantages of Using a Manual Address for Redundancy
- Requires user competence.
Ensuring NFuse can find a MetaFrame XP Server
When using NFuse, in addition to making sure that an NFuse web server actually responds to user requests, make sure that the NFuse server is able to find a MetaFrame XP server running the Citrix XML service. There are two ways that this can be done:
- Configure multiple Citrix XML Service server addresses.
- Use Enterprise Services for NFuse to connect to multiple server farms.
This section describes how to configure these two options so that your overall environment is as highly available as possible. Full details about the advantages and disadvantages of these options, as well as how they are configured, are included in Chapter 11.
Option 1. Specifying Multiple Citrix XML Service Addresses
Remember that you can configure the Java Objects on an NFuse Classic web server to cycle through a list of MetaFrame XP servers when contacting the Citrix XML Service. While this list is primarily designed for load-balancing purposes, it can also alleviate the risk of a server being lost. If you specify multiple servers for the NFuse web server to contact, the loss of a single server will not affect the availability of the NFuse web site.
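The failover side of such a list can be illustrated with a short sketch. This is not how the NFuse Java Objects are implemented; it only shows the general idea of walking an ordered list of servers and using the first one that answers. The host names, the port, and the function names are hypothetical.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def tcp_probe(address: Address, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(address, timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    servers: Iterable[Address],
    probe: Callable[[Address], bool] = tcp_probe,
) -> Optional[Address]:
    """Walk the list in order and return the first server that answers."""
    for address in servers:
        if probe(address):
            return address
    return None


if __name__ == "__main__":
    # Hypothetical server list; the port is an assumption for illustration.
    xml_servers = [("mfxp01.example.com", 80), ("mfxp02.example.com", 80)]
    # Simulate an outage of the first server instead of touching the network.
    down = {"mfxp01.example.com"}
    chosen = first_reachable(xml_servers, probe=lambda addr: addr[0] not in down)
    print("Using", chosen)
```

Passing the probe in as a parameter keeps the selection logic separate from the network check, so the same loop can be exercised without any live servers.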
Option 2. Using Enterprise Services for NFuse
Since you can use Enterprise Services for NFuse to load-balance users across server farms, your users can still access their MetaFrame XP applications even if an entire server farm is lost (or, more likely, if communication to an entire server farm is lost).
MetaFrame Server Redundancy
The actual MetaFrame XP servers that host users' sessions are usually the first target when people begin to think about how to increase the availability of their MetaFrame XP environments. Similar to making NFuse available, there are two aspects that must be considered with MetaFrame servers:
- A functioning MetaFrame XP server must be available for users.
- The users must be able to seamlessly find that functioning server.
Ensuring a Functioning MetaFrame XP Server is Available
Make sure that there will always be a MetaFrame XP server available when users need to connect to one. There are two different strategies that can be used for this:
- Try to make each individual server's hardware as redundant as possible.
- View each MetaFrame XP server as "expendable." Build redundancy by having extra servers.
Chapter 4 outlined strategies for the "farm / silo" model of deploying MetaFrame XP servers and Chapter 6 detailed the advantages and disadvantages of building large or small servers. This section builds upon those two chapters by addressing the design options of whether you should approach server redundancy with "quality" or "quantity."
The exact approach that you take depends on your environment. What does "high availability" mean for you? Does this mean users' sessions can never go down, or does it mean that they can go down as long as they are restored quickly?
Option 1. Build Redundancy with High Quality Servers
One approach to making MetaFrame XP servers highly available is to increase the redundancy of the systems themselves. This usually involves servers with redundant hardware, including disks, power supplies, network cards, fans, and memory. (Yes, today's newest servers have RAID-like configurations for redundant memory banks.)
Advantages of Building Servers with Redundant Hardware
- By using redundant server hardware, you are assured that a simple hardware failure will not kick users off the system.
Disadvantages of Building Servers with Redundant Hardware
- No economies of scale. Every server must contain its own redundant equipment.
- This strategy still doesn't mean that your servers are bullet-proof.
- What happens if you lose a server even after your planning? Will you have the capacity to handle the load?
Option 2. Build Redundancy with a High Quantity of Servers
As outlined in Chapters 4 and 6, you'll most likely need to build multiple identical servers to support all of your users and their applications regardless of your availability strategy. In most cases it's more efficient to purchase an extra server (for N+1 redundancy) than it is to worry about many different redundant components on each individual server.
Advantages of Building Extra Servers
- Better economies of scale.
- You will have the capacity to handle user load shifts after a server failure.
Disadvantages of Building Extra Servers
- If a simple failure takes down a server, all users on that server will need to reconnect to establish their ICA sessions on another server.
- An extra server might cost more than simply adding a few redundant components as needed.
If you have applications that cannot go down (because users would lose work), you'll have to spend money buying redundant components for individual servers. However, if it's okay to lose a server as long as the user can instantly connect back to another server, you can use the "high quantity" approach. Of course it is never okay to "lose a server." However, even without redundant components, losing a server is a rare event. Users are always safer on a server than on their workstations since the configuration and security rights are always configured on the server. Traditional environments don't have redundant components on every single desktop and they're still widely accepted, so not having redundant components on servers should also be acceptable as long as users can connect back in.
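The N+1 sizing behind the "high quantity" approach comes down to simple arithmetic, sketched below. The user counts and per-server capacity are hypothetical inputs that you would replace with figures from your own sizing tests.

```python
import math


def servers_required(total_users: int, users_per_server: int, spares: int = 1) -> int:
    """Servers needed to carry the load, plus `spares` extra (N+1 when spares is 1)."""
    if users_per_server <= 0:
        raise ValueError("users_per_server must be positive")
    return math.ceil(total_users / users_per_server) + spares


def users_per_survivor(total_users: int, server_count: int, failed: int = 1) -> float:
    """Average users each remaining server carries after `failed` servers are lost."""
    survivors = server_count - failed
    if survivors <= 0:
        raise ValueError("no servers left")
    return total_users / survivors


if __name__ == "__main__":
    # Hypothetical farm: 300 users, 50 users per server.
    count = servers_required(300, 50)
    print(f"Build {count} servers")
    print(f"After one failure: {users_per_survivor(300, count):.0f} users per server")
```

With these sample figures, the farm still runs at its rated capacity per server after a single failure, which is the point of the extra server.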
Ensuring Users are Routed to a Functioning Server
Chapter 4 included details about how Citrix Load Manager functions in MetaFrame XPa and XPe environments. In addition to the obvious ways that Load Manager can be used to load balance multiple MetaFrame XP servers (and therefore enable users to always find a functioning server), it can be used to route users to "backup" sets of MetaFrame XP servers.
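The idea of preferring a primary set of servers and falling back to a backup set can be sketched as follows. This is a simplified illustration rather than Citrix Load Manager's actual algorithm; the `Server` class, its `load` field, and the `pick_server` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Server:
    name: str
    online: bool
    load: int  # hypothetical load index; lower means less busy


def pick_server(primary: Sequence[Server], backup: Sequence[Server]) -> Optional[Server]:
    """Return the least-loaded online primary server, else the least-loaded online backup."""
    for group in (primary, backup):
        candidates = [server for server in group if server.online]
        if candidates:
            return min(candidates, key=lambda server: server.load)
    return None


if __name__ == "__main__":
    primary = [Server("mf1", online=False, load=10), Server("mf2", online=True, load=70)]
    backup = [Server("mf9", online=True, load=5)]
    # mf2 is chosen: the backup set is consulted only when no primary server is online.
    print(pick_server(primary, backup))
```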
A Note about Server "Clustering"
Many people think of clustering when they think of high availability and redundant environments. Some people use the terms "load balancing" and "clustering" interchangeably. In MetaFrame XP environments, Citrix Load Manager performs load balancing. Load-balanced groups of MetaFrame servers should not be thought of as clusters.
The term "load balancing" is used to describe a mechanism that distributes load across multiple resources. "Clustering" is used to describe multiple resources that support a single load that is dynamically transferred from one resource to another in the event of a failure. Clustering is possible because the members of the cluster share certain common components, such as storage.
While it's true that Microsoft Terminal Services can be used in Microsoft Clusters, a user's remote session cannot be "clustered." What this means is that if a user's session is running on a server that goes down, it is not possible for that session to be dynamically switched over to another server. (Of course the user could instantly reconnect to the server environment, but they would lose whatever changes they made after they last hit "save.")
Clustering in a Terminal Services environment presents several significant technical challenges and it will most likely be several years before we see a true clustered environment. To understand why, think about what happens when a user logs on to a MetaFrame XP server. To function as a cluster, the users' applications and memory spaces would have to be loaded onto multiple servers so that one server could pick up when another failed.
User and Application Data
When determining the actions that you will take to ensure that your data is highly available, you need to first classify your data. All data can be divided into two categories:
- Unique data is the important data that is unique to your environment that you don't want to lose. This includes user profiles, home drives, databases, the IMA data store, and application data.
- Non-unique data is anything that you can load off of a CD from a vendor, such as Windows 2000 server, MetaFrame XP, SQL Server, and your applications.
In the real world, it's unrealistic to build a server that "can't fail." Therefore, when designing redundant environments, assume that certain failures will occur and make the necessary provisions to deal with them. This is where the "redundant array of inexpensive servers" comes into play.
When thinking about your data, you need to ensure that your MetaFrame XP servers only contain non-unique data. Your environment's unique data should be stored elsewhere, such as on a SAN or NAS device, as shown in Figure 17.2.
Figure 17.2: Redundant servers with data on a SAN
In this environment, your data is protected if you lose one or more MetaFrame XP servers. Your SAN should have the necessary redundancy built into it, such as RAID, multiple power supplies, multiple controller cards, and multiple interfaces to the servers. Instead of using a SAN, you can use a standard Windows 2000 Server file share driven by a Microsoft Cluster.
Advantages of using RAIS for Data Redundancy
- Quick recovery in the event of a failure
Disadvantages of using RAIS for Data Redundancy
- Doesn't work in smaller environments.
- Requires an "extra" server (for N+1 redundancy).
- Since all your unique data is on a SAN or NAS, you'd better make sure that's backed up.
MetaFrame XP Support Components
The final aspect to consider when designing a highly available MetaFrame XP environment is the back-end support components that MetaFrame XP requires.
Zone Data Collectors
As outlined in Chapter 3, zone data collectors are dynamically elected. If one is ever lost, an election will take place and another server will assume that role. Because the zone data collector plays such an important role in MetaFrame XP server farms, Citrix has designed it to be highly available. As long as your server farm is designed following the practices from Chapter 3, your zone data collectors will be able to perform their jobs.
Ensuring IMA Data Store Availability
Since the IMA data store contains all of the information about the configuration of a server farm, you need to protect it and to ensure that your MetaFrame XP servers will always be able to access it. Remember from Chapter 3 that each MetaFrame XP server creates a local host cache from the IMA data store. This allows that server to function even if it loses communication with the data store. A MetaFrame XP server running Service Pack 2 will continue to function for up to 96 hours without any communication with the IMA data store. This is an increase from the 48 hour limit in environments prior to Service Pack 2.
Because of this, a loss of the IMA data store is not necessarily catastrophic. While it's true that you will not be able to make any administrative changes to your server farm without a data store, your servers will continue to function for a few days. Losing a data store does not affect in any way the operation of zone data collectors and the election process. It just means that you cannot manage or view your environment with the CMC (although you will be able to use the command-line tools since they query the ZDC).
If you lose the data store, you can take a few hours or days to get the database server rebuilt and the IMA data store restored from tape.
Your only real concern during the temporary timeframe while you're operating without a data store is what would happen if you had to reboot one of your MetaFrame XP servers. You can control whether the IMA Service needs to connect to the data store in order to start via the following registry key:
Data: 0 = No connection required, 1 = connection required
If you ensure that your MetaFrame XP servers have this registry value set to "0," they can be rebooted even though the IMA Service will not be able to access the IMA data store after the reboot. In this case, the server will simply use its existing local host cache. If you plan to use this strategy, remember that it will only work for 96 hours (or 48 hours before Service Pack 2), even if you never reboot a server.
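The timing rules above can be captured in a small planning aid. The sketch below only models the limits stated in this section (96 hours with Service Pack 2, 48 hours before it) and the meaning of the registry value (0 or 1). It does not read the registry or talk to the IMA Service, and all function names are hypothetical.

```python
GRACE_HOURS_SP2 = 96      # per the text: Service Pack 2
GRACE_HOURS_PRE_SP2 = 48  # per the text: before Service Pack 2


def grace_hours(service_pack_2: bool) -> int:
    """Hours a server keeps functioning on its local host cache alone."""
    return GRACE_HOURS_SP2 if service_pack_2 else GRACE_HOURS_PRE_SP2


def hours_remaining(hours_without_data_store: float, service_pack_2: bool = True) -> float:
    """Hours left before the grace period runs out (never negative)."""
    return max(0.0, grace_hours(service_pack_2) - hours_without_data_store)


def can_restart_ima(connection_required: int, data_store_reachable: bool) -> bool:
    """Model the registry value: 0 = no connection required, 1 = connection required."""
    if connection_required not in (0, 1):
        raise ValueError("value must be 0 or 1")
    return data_store_reachable or connection_required == 0


if __name__ == "__main__":
    # Data store has been down for 30 hours on a Service Pack 2 farm.
    print(f"{hours_remaining(30):.0f} hours left to restore the data store")
    print("Safe to reboot:", can_restart_ima(connection_required=0, data_store_reachable=False))
```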