Yesterday on our podcast, our guest was Login VSI's Adam Carter. He was there to talk about their new product called "Login PI." As I wrote in the title of today's post, Login PI is the brilliant VDI & RDSH monitoring product that I've wanted to exist for over 15 years. Thanks, Login, for making this thing!
(You may remember that we shot a video with Jeroen van de Kamp at VMworld 2014 where he first told us about this idea. Now it's a reality!)
The problem with traditional monitoring
I've never liked traditional systems monitoring when it comes to RDSH and VDI environments (including Citrix XenApp / XenDesktop, VMware Horizon, Dell vWorkspace, etc.). The problem is that traditional monitoring solutions are based on system performance metrics--things like CPU load, memory usage, IOPS, etc.
Traditional monitoring tools watch these metrics and then alert you or take action when one of them moves outside of a certain threshold. The CPU usage on Server01 has gone above 95% for 30 seconds. Send an alert!
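To make that concrete, here's a minimal sketch of the kind of rule a traditional monitor evaluates. This isn't any particular product's logic; the function name, the sample format (timestamp in seconds, metric value), and the defaults are all my own assumptions, with the 95% and 30 seconds borrowed from the example above.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold: float = 95.0,   # assumed: CPU percent, from the example above
    duration: float = 30.0,    # assumed: seconds the metric must stay high
) -> bool:
    """Return True if the metric stays above `threshold` for `duration` seconds.

    `samples` is a hypothetical series of (timestamp_seconds, value) pairs
    in chronological order.
    """
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            # Remember when the current run above the threshold began.
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= duration:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            breach_start = None
    return False


# CPU on Server01 sampled every 10 seconds.
server01 = [(0, 96.0), (10, 97.5), (20, 98.0), (30, 99.0)]
if sustained_breach(server01):
    print("Send an alert!")
```

Notice that nothing in this rule knows anything about what a user is seeing.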
The problem with this is that desktop virtualization is all about the user experience, and one of these metrics moving above a pre-defined threshold doesn't necessarily mean that the user experience is suffering.
Take a look at this excerpt from the Citrix Resource Manager chapter from my 2001 MetaFrame XP book which highlights the problem:
Your Pager: <beep beep beep>
You: I see that Server A has gone red.
Other Citrix Administrator: Really?
You: Yeah, the context switches have hit 14,000 per second.
Other Citrix Administrator: Really?
You: Yeah, they've been that way for over two minutes now.
Other Citrix Administrator: Really?
You: Yeah, really.
Other Citrix Administrator: Is that bad?
Other Citrix Administrator: ?
You: It's red.
Other Citrix Administrator: So that's bad?
If this happened to you and you were at lunch, would you drop your sandwich and run back to the office? Probably not. You'd probably call someone back at the office and say, "Hey, I got a performance alert. Are any users complaining? No? Okay then, I'll finish my sandwich."
The flipside of this is also true. If you're at lunch and users start to complain of bad performance, then someone is going to call you to fix it. You can't tell them, "Well, our monitoring solution didn't alert me, so I'm not coming back now."
The point is that as a desktop virtualization admin, you don't actually dig into performance problems until user experience suffers.
Of course when that happens, it's nice to have logs of performance counters and stuff to try to find the cause of the problem, but the technical metrics are used to investigate why a problem happened after it happened. They're worthless when it comes to alerting you that a problem is currently happening.
But wait, don't we have "smart" monitors with artificial intelligence now?
Some might argue that my example is outdated, and that today's modern monitoring solutions have artificial intelligence or big data or something that makes them much smarter than they used to be. While that's certainly true if you read all the vendor marketing material, so far I haven't been impressed with these.
Sure, they might use artificial intelligence to automatically set alert thresholds or to know that a metric spiking at 6pm on Tuesday is an anomaly while the same spike at 8:30am on Monday is fine--but at the end of the day, they're all looking at perfmon and WMI counters that don't necessarily have a correlation to end user experience.
What I want instead
Since desktop virtualization (whether RDSH- or VDI-based, and whether full desktop or single app) is about end users, I want a monitoring solution that's based on actual real-world performance of what's happening in a user's session. I want to be able to know that the time it takes from when a user clicks on an application icon until that app comes up is usually 3 seconds, and if it takes more than 5 seconds, alert me. I want to know that a user running a spell check on a certain document usually takes 8 seconds, and if it ever takes longer than 11 seconds, let me know. I want to know if clicking the "New Message" button in Outlook starts taking 3 seconds for the New Message window to pop up when it usually takes 250ms.
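Here's a rough sketch of what I mean, assuming you had some way to drive an action inside a session and wait for it to finish. The step names, the `ALERT_AFTER` table, and both functions are hypothetical; the limits are just the numbers from the paragraph above.

```python
import time
from typing import Callable, Optional

# Hypothetical alert limits in seconds, taken from the examples above.
ALERT_AFTER = {
    "launch application": 5.0,  # usually takes about 3 seconds
    "spell check": 11.0,        # usually takes about 8 seconds
}


def measure(action: Callable[[], None]) -> float:
    """Run `action` and return how long it took in seconds."""
    start = time.perf_counter()
    action()
    return time.perf_counter() - start


def check(step: str, elapsed: float) -> Optional[str]:
    """Return an alert message if `step` took longer than its limit."""
    limit = ALERT_AFTER[step]
    if elapsed > limit:
        return f"ALERT: {step} took {elapsed:.1f}s (limit {limit:.1f}s)"
    return None


# Stand-in for a real action such as clicking an icon and waiting for the app.
elapsed = measure(lambda: time.sleep(0.01))
print(check("launch application", elapsed))  # None, since 0.01s is well under 5s
```

The measurement is the user-visible wait itself, not a counter that may or may not correlate with it.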
The beauty of looking at that type of performance metric is that it's the actual user experience irrespective of what the server, network, or WMI metrics might say. If it suddenly starts taking 5 seconds for a patient record to pop up when it's usually 2 seconds, I don't give a hoot about what my metrics are--I have a problem that I want to address! And if my IOPS and CPU are through the roof but everything users are doing still runs at full speed, I'm not sure I really care about that.
(To be clear, again, I'm not saying that server metrics are completely worthless. If Outlook suddenly starts taking twice as long to pop up the New Message window, I'll need to look at server and network metrics to figure out what the problem is. And if I have a VDI server that usually runs at 80% memory consumption and it suddenly spikes to 100%, I do want to know about it so I can make sure there's not a security breach or anything. But I can't use metrics alone to tell me something is wrong with my environment.)
How Login PI works
This is exactly what Login PI does.
When you use Login PI, you write a script that automatically logs in a user and does a bunch of things. (They include a bunch of scripts you can use to get started, or you can customize them or build your own to capture things in your own environment.) Login PI launches a remote VDI or RDSH session from a real client and then runs the script within the remote VM to do whatever you want: open a Word document across a network share, send an email, open a spreadsheet, launch a business app and generate a report, etc. You configure it to run that script at whatever interval you want. (Continuously, once a day, every 15 minutes... whatever.) Login PI tracks how long every step of the script takes, and then it can post events to the Event Log whenever something takes longer than it usually does. You can configure these event alerts to be based on specific times, e.g. post an event if this step takes longer than 3 seconds, or you can configure them to be posted if any step starts taking longer by a certain relative percentage, e.g. post an event if any step starts taking 25% longer than it usually does.
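The two kinds of alert rules are easy to picture in code. To be clear, this is not Login PI's implementation or API, just a sketch of the idea. The function name and parameters are made up, and I'm assuming "usually" means the median of past timings, which may not be how the product defines it.

```python
from statistics import median
from typing import List, Optional, Sequence


def evaluate_step(
    history: Sequence[float],
    latest: float,
    max_seconds: Optional[float] = None,   # absolute rule, e.g. 3 seconds
    max_increase: Optional[float] = None,  # relative rule, e.g. 0.25 for 25%
) -> List[str]:
    """Return which rules the latest timing of a step breaks.

    `history` holds earlier timings of the same step in seconds. The
    baseline is their median, which is an assumption made for this sketch.
    """
    reasons = []
    if max_seconds is not None and latest > max_seconds:
        reasons.append("absolute")
    if max_increase is not None and history:
        baseline = median(history)
        if latest > baseline * (1 + max_increase):
            reasons.append("relative")
    return reasons


# A step that usually takes about 2 seconds.
past_runs = [2.0, 2.1, 1.9]
print(evaluate_step(past_runs, 2.6, max_seconds=3.0, max_increase=0.25))
# ['relative']  (2.6s is under the 3s limit but more than 25% over the 2.0s median)
```

The relative rule is the handy one when you have lots of steps and don't want to hand-pick a number of seconds for each.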
Login PI is based on the same core concept and engine as Login VSI--the product used to simulate user loads on RDSH and VDI servers. So basically Login took that product, added the tracking for specific timings and a database to hold everything, and now they have Login PI.
I love the idea that end user customers can buy this product to make sure their virtual desktop environment is providing the experience they think it should. I also like the idea of using Login PI to verify that a DaaS provider is delivering the experience they promise. (How great would it be to start seeing SLAs which include terms like, "The New Message window in Outlook should open within 3 seconds"?)
DaaS providers could also use Login PI to make sure their environment was working as it should as well as to proactively notify customers of problems. (Think about what kind of great service that would be. Imagine if you were a DaaS customer and you got an email from your provider which says, "We noticed that your records management app has slowed down." That's some great service!)
Login PI is a new product and I don't know of anyone using it yet. (I know tons of people using Login VSI, just none of them using Login PI yet.) But this product should definitely become a standard for every VDI and RDSH deployment, as it addresses exactly what we've needed in a monitoring solution in our space for a long time.