Citrix XenApp Resource Manager Metrics

submitted by John Streeter

Memory

Memory bottlenecks have a severe and immediate impact on performance, with all clients experiencing pauses and freezing. The average working set should not exceed 70% of available memory; the additional 30% of memory above the average working set is considered adequate to handle peak loads. The average working set is calculated by subtracting the average Available Bytes from the amount of physical memory installed on the server. Peak load periods can and should utilise more than 70% of available memory, with bursts at 90% utilisation being about normal.
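
As a rough illustration of the calculation, the sketch below works out the average working set from installed RAM and an averaged Available Bytes figure, then checks it against the 70% guideline. The sample values are hypothetical.

    # Average working set = physical memory - average Available Bytes (sample figures)
    physical_memory = 4 * 1024**3            # assumed 4Gb server
    avg_available_bytes = 1.5 * 1024**3      # averaged Memory\Available Bytes reading

    avg_working_set = physical_memory - avg_available_bytes
    utilisation = avg_working_set / physical_memory

    print(f"Average working set: {avg_working_set / 1024**2:,.0f}Mb "
          f"({utilisation:.0%} of physical memory)")
    if utilisation > 0.70:
        print("Average working set exceeds the 70% guideline - little headroom for peaks")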

Performance Counters Recommended

Available Bytes: The amount of physical memory not in use, in bytes
Page Reads/Sec: Number of hard page faults per second
Pool Paged Bytes: The amount of memory in use in the system’s pool of pageable virtual memory
Pool Paged Resident Bytes: The amount of physical memory used by pages from the system’s pool of pageable virtual memory

Available Bytes

System performance degrades when there is excess paging for virtual memory addresses. Generally, available bytes of around 5% - 10% of physical memory is adequate, and memory bottlenecks are not expected if available memory is at or above these levels. When available memory reaches 4Mb, Windows takes a more aggressive approach to paging: least recently used clean pages are zeroed out, dirty pages are flushed back to the page file, and working sets are aggressively trimmed to restore an adequate supply of free memory. Observing Available Bytes is not a conclusive indicator on its own, as memory can be near 100% utilisation without any impact on system performance. Performance only deteriorates when there is contention for physical memory from addresses in virtual memory. This contention results in high levels of swapping and attendant high disk I/O.

We use available memory as a useful raw indicator of memory bottlenecks in the system. The following thresholds have been set to alert on an impending or existing memory bottleneck.

Available Memory


Yellow Alert: 194,495,000 bytes (5% of physical memory is available)
Red Alert: 16,384,000 bytes (four times the 4Mb minimum buffer)

This recommendation is based on a 32-Bit Windows server.
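
The same thresholds can be derived for any amount of installed RAM using the 5% and four-times-4Mb rules above. A minimal sketch follows; the RAM figure is illustrative, so the yellow value will differ from the table if your server carries a different amount of memory.

    # Derive Available Bytes alert thresholds from installed RAM (illustrative figures)
    physical_memory = 4 * 1024**3                    # assumed 32-bit server with 4Gb RAM
    yellow_threshold = int(physical_memory * 0.05)   # alert when less than 5% is available
    red_threshold = 4 * 4_096_000                    # four times the 4Mb paging floor

    print(f"Yellow alert below {yellow_threshold:,} bytes")
    print(f"Red alert below {red_threshold:,} bytes")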

Hard Page Faults

Page Reads/Sec gives a raw count of the number of times per second the memory manager has to fetch a page from the hard disk. Spikes are expected with this metric, as calls to functions that are not present in physical memory are not uncommon. A persistent high value (50 or greater) is an indication of a performance bottleneck. The following thresholds are recommended:

Page Reads/Sec


Yellow Alert: 50
Red Alert: 200

Pool Paged Bytes
Pool Paged Resident Bytes

Windows does not provide a metric to measure contention for memory by running processes. This being the case, we need to provide our own indicator. We do this by calculating a memory contention index:
V:R Ratio = Pool Paged Bytes / Pool Paged Resident Bytes
This ratio reveals the contention for physical memory resources (R) by virtual memory addresses (V). The pageable pool is an allocation for all system resources that do not need to reside in physical memory permanently; this includes drivers and libraries that can be paged to disk when they are not actively being used. Pool Paged Bytes is the total usage of this pool; that is, it includes both pages loaded into physical memory and pages that have been swapped out to disk. Pool Paged Resident Bytes is the portion of the pageable pool that is loaded in physical memory. If this ratio is high (3:1 or over), a significant volume of system resources has been paged out of physical memory and contention for these resources is likely to exist. We would expect Memory - Available Bytes to be low when such a ratio is observed.
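
A minimal sketch of the contention index check follows. It assumes the two counter values have already been sampled (for example from Performance Monitor), and uses the 3:1 trigger described above.

    # Memory contention index: V:R ratio = Pool Paged Bytes / Pool Paged Resident Bytes
    def memory_contention_index(pool_paged_bytes, pool_paged_resident_bytes):
        """Return the V:R ratio and whether it breaches the 3:1 guideline."""
        ratio = pool_paged_bytes / pool_paged_resident_bytes
        return ratio, ratio >= 3.0

    # Hypothetical samples, in bytes
    ratio, contended = memory_contention_index(450_000_000, 120_000_000)
    print(f"V:R ratio = {ratio:.1f}:1, contention likely: {contended}")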

The maximum possible size of the paged pool is 650Mb, and the memory manager attempts to trim the paged pool when it reaches 80% of this maximum, or 520Mb. Performance degradation may occur once this trimming commences. Based on this, the following thresholds have been set in Resource Manager:

Pool Paged Bytes:

Yellow Alert: 455,000,000 (70% of the maximum paged pool memory size)
Red Alert: 520,000,000 (80% of the maximum paged pool memory size)

Contention for memory resources can be expected if the paged pool resident in physical memory falls below 33% of the virtual paged pool. We are therefore interested in Pool Paged Resident Bytes only as a proportion of Pool Paged Bytes: the higher the resident value relative to Pool Paged Bytes, the better. In light of this, the thresholds have been set at the same values as those for Pool Paged Bytes.
Pool Paged Resident Bytes:

Yellow Alert: 455,000,000 (70% of the maximum paged pool memory size)
Red Alert: 520,000,000 (80% of the maximum paged pool memory size)

System

Performance Counters Recommended

Context Switches: Number of times per second the high-speed processor cache for memory mapping tables is re-populated
Processor Queue Length: The number of threads queued for processor time-slices.

Context Switches

A context switch typically occurs when the processor swaps over to another thread. Intel processors have a high-speed cache called the Translation Lookaside Buffers (TLBs). This cache contains the most recent set of virtual-to-physical memory address translations. Having this cache on-chip substantially increases performance, as the processor does not have to access the page tables to convert a virtual address into a physical address before accessing memory for code or data pages. When a new thread is submitted to the processor, Control Register 3 is reloaded to point to a new set of page tables and the TLBs are flushed. The initial lookup of each address translation then references the Page Table Entries (PTEs) directly; subsequent references are served from the TLBs.

Context switches also occur when an application calls an operating system service. These calls can be file, memory or driver operations; or any of the tasks Windows performs on behalf of a process running in user mode.

A sustained high level of context switches will slow system performance, as the processor is unable to perform at peak levels whilst continually accessing the page tables for address translations. This stress can be reduced by increasing the timeslice using the Advanced/Performance options in the System applet. Terminal Services defaults to the Applications option. The values created by this dialog can be fine-tuned under the HKLM\System\CCS\Control\PriorityControl key. The hexadecimal values for this key translate to a 6-bit binary value. From the left, the first two bits indicate whether long or short timeslices are used, the next two bits indicate whether fixed or variable timeslices are used, and the final two bits indicate the degree of quantum stretching for the foreground application. Using this mapping, a setting of 0x18 is entered if the Control Panel applet is set to Background services. This equals binary 011000. The timeslice instructions embedded in this value are:

01 Use long timeslice intervals
10 Fixed length intervals
00 No quantum stretching for foreground applications

The value for the application preference is 0x26, which translates to binary 100110, indicating:
10 Use short timeslice intervals
01 Variable intervals
10 Stretch the quantum of the foreground application by a factor of 3

Generally these values do not need changing, but if context switches are consistently high and no other measures have addressed the problem, the values can be experimented with to see if any improvement occurs.
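
For reference, the bit-pair decoding described above can be sketched as follows. The registry value concerned is Win32PrioritySeparation under the PriorityControl key; reading the registry is left out to keep the example self-contained, so the two known settings are passed in directly.

    # Decode the 6-bit timeslice setting into its three bit pairs (mapping as above)
    def decode_priority_separation(value):
        bits = f"{value & 0b111111:06b}"               # e.g. 0x18 -> '011000'
        interval, length, quantum = bits[0:2], bits[2:4], bits[4:6]
        return {
            "interval": {"01": "long timeslices", "10": "short timeslices"}.get(interval, interval),
            "length": {"10": "fixed length", "01": "variable length"}.get(length, length),
            "foreground quantum": {"00": "no stretching", "10": "stretched by a factor of 3"}.get(quantum, quantum),
        }

    print(decode_priority_separation(0x18))   # Background services setting
    print(decode_priority_separation(0x26))   # Applications setting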

Hyper-threading processors may also reduce the count of context switches. Processing multiple threads on each processor reduces the number of times the TLBs need to be flushed, as two threads share a single set of virtual-to-physical memory address maps.

A little birdie at Citrix Technical Support told me that we can expect to handle 50,000 context switches per processor core on current generation hardware. Based on this advice, the following values are recommended for a server with two dual-core processors:

Context Switches/Sec:

Yellow Alert: 170,000 (85% of the full load value per processor core)
Red Alert: 200,000 (100% of the full load value per processor core)
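
The 50,000-per-core rule of thumb scales easily to other hardware; the sketch below reproduces the figures above for a four-core box and can be adjusted for a different core count.

    # Context Switches/Sec thresholds from the 50,000-per-core rule of thumb
    PER_CORE_FULL_LOAD = 50_000

    def context_switch_thresholds(cores):
        full_load = cores * PER_CORE_FULL_LOAD
        return int(full_load * 0.85), full_load   # (yellow, red)

    yellow, red = context_switch_thresholds(cores=4)   # two dual-core processors
    print(f"Yellow at {yellow:,}/sec, red at {red:,}/sec")   # 170,000 and 200,000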

Processor Queue Length

Processor queue length is a better indicator of processor bottlenecks than processor utilisation. A processor can be running at 100% utilisation without any adverse performance impact, provided all threads are being processed in a timely manner. Bottlenecks occur when threads are not being processed in a timely manner and are queuing at the processor. Processor Queue Length provides a real-time measure of the number of ready threads that are waiting to run. Any value greater than 0 indicates work is being delayed by processor bottlenecks, and the length of the delay is directly proportional to the queue length.

Any value over 20 for this metric should be investigated and understood. Any applications frequently causing high queue length values are candidates for siloing if they cannot be controlled by Citrix Processor Utilisation Management or AppSense Performance Manager.

These thresholds are a good starting point:

Processor Queue Length:

Yellow Alert: 4 (Moderate performance degradation)
Red Alert: 20 (Exceptional congestion at the queue)

Processor

Performance Counters Selected

% Interrupt Time: The time spent receiving and servicing hardware interrupts.
% Processor Time: The time spent executing a non-idle thread.

% Interrupt Time

% Interrupt Time is less an indicator of a processor bottleneck than of bottlenecks caused by other hardware devices. An overloaded or malfunctioning hardware device can be expected to produce a lot of Interrupt Service Routine (ISR) or Deferred Procedure Call (DPC) traffic. This metric only alerts us that a potential problem exists; it does not offer any indication of where the problem might lie.

Microsoft advises that interrupt time exceeding 20% - 30% per processor indicates the system is generating more interrupts than it can handle. Interrupts can be expected to increase as server workload, network packets per second and disk I/O operations increase. A Windows workstation is expected to generate over 100 interrupts per second. Scaling this to a 4-core Citrix server with 100 users, we could expect 1,200 - 3,000 interrupts per second on a normally functioning system.

Starting with Microsoft's benchmarks for a user workstation, and taking into account the nature of a multi-user server, the following thresholds are recommended:

% Interrupt Time:

Yellow Alert: 15 (Investigation into the cause is warranted)
Red Alert: 20 (Action to remedy this level of activity may be necessary)

% Processor Time

This is the percentage of time during a sampling period that the processor spends performing work. If the idle thread executed for 20% of the sample time, the % Processor Time value would be 80. This counter is loaded by default during Resource Manager’s installation. Any value up to and including 100% is acceptable, as long as the Processor Queue Length is not sustained at a level above 0.

Processor utilisation is highly variable, and any machine will report 100% utilisation some of the time. The values selected have deliberately been set high to minimise alerts.

% Processor Time:

Yellow Alert: 95 (High utilisation, potential problems)
Red Alert: 100 (Full utilisation, potential problems)

Server Work Queues

Performance Counters Selected

Queue Length: The number of threads queued for timeslices on a processor.

Queue Length

These counters are the same as Processor Queue Length, but measure the number of threads queued at each individual processor. This is a useful metric when sustained queuing occurs: the offending threads can be identified on the processor where the delays are being experienced, and the source dealt with. Depending on whether the source is an application or a hardware device, the following options are available:

  • Reconfigure
  • Assign affinity to one processor
  • Give a lower base priority
  • Disable
  • Silo

The thresholds for these metrics are set to 150% of the value for Processor Queue Length per processor. We are only interested in identifying delinquent threads when things get really bad. They are:

Queue Length 0
Queue Length 1
Queue Length 2
Queue Length 3

Yellow Alert: 3 (1.5 times the processor queue length)
Red Alert: 6 (1.5 times the processor queue length)

Server

Performance Counters Selected

Files Open: The number of files opened on a server

Files Open

Windows 2003 has a limit of 16,384 files open between a client and server, per user. This figure is the number of slots for File IDs (FIDs) in Windows' internal tables. When all slots have been allocated, a request for a new FID cannot be serviced and errors are thrown. It is highly unlikely that any single user would approach the maximum number of FIDs, and applications are likely to be misbehaving if more than 100 files are open for any single user. To ensure Resource Manager alerts flag real problems, the thresholds have been set at 100 and 150 FIDs per user at a load of 100 users.
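
The threshold arithmetic is simply FIDs per user multiplied by the expected user load; the sketch below reproduces the figures used here and can be rescaled for a different user count.

    # Server\Files Open thresholds: FIDs per user scaled by the expected user load
    def files_open_thresholds(users, yellow_per_user=100, red_per_user=150):
        return users * yellow_per_user, users * red_per_user

    yellow, red = files_open_thresholds(users=100)
    print(f"Yellow at {yellow:,} open files, red at {red:,}")   # 10,000 and 15,000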

Files Open

Yellow Alert: 10,000 (100 FIDs per user)
Red Alert: 15,000 (150 FIDs per user)

Network Interface

Performance Counters Selected

Bytes Total/Sec: The amount of throughput per second on an adapter.

Bytes Total/Sec

The network interface is unlikely to be a bottleneck; other components should max out well before the NIC comes under sustained load. We can reasonably expect a transfer rate of 250MB/sec on a server-class gigabit ethernet interface, and a Citrix user load of 100 is unlikely to generate sustained throughput of that order. This metric is included to indicate transient problems and assist in identifying the cause. Additionally, some viruses and malware may generate high levels of traffic. Be afraid if this metric turns red on a sustained basis on all servers.

Bytes Total/Sec

Yellow Alert: 160,000,000
Red Alert: 200,000,000
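
These values work out to roughly 64% and 80% of the assumed 250MB/sec ceiling. A minimal sketch of the alert check, with that ceiling and those thresholds as assumptions, is shown below.

    # Classify Network Interface\Bytes Total/Sec against the thresholds above
    NIC_CEILING = 250_000_000   # bytes/sec, assumed server-class gigabit figure

    def nic_alert(bytes_total_per_sec):
        if bytes_total_per_sec >= 200_000_000:   # red threshold, 80% of the ceiling
            return "red"
        if bytes_total_per_sec >= 160_000_000:   # yellow threshold, 64% of the ceiling
            return "yellow"
        return "ok"

    print(nic_alert(175_000_000))   # -> 'yellow'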

Physical Disk

Performance Counters Selected

% Disk Time: The percentage of a time sample that the disk is busy.
Disk Queue Length: Disk requests queued for service.

% Disk Time

Sustained high levels of disk traffic are likely to be due to paging caused by a memory shortage. This metric only records activity on local SCSI drives, which consists of application loading, profile and data caching, shared library loading and paging. Periodic spikes are expected as user activity reads and writes data.

If sustained disk activity is recorded and memory metrics do not reveal high loads, check the disk queue length. It should not be greater than 2. At a value higher than this, the offending process needs to be found and terminated. The thresholds for this metric are left at the defaults set by Citrix.

% Disk Time

Yellow Alert: 60 (Citrix default)
Red Alert: 80 (Citrix default)

Disk Queue Length

If requests are queued at the disk controller awaiting service, you have an I/O bottleneck that is likely to be affecting users. Any sustained value on this counter is a problem; occasional spikes are expected. If requests are queued, the likely causes are defective hardware, arrays that are shared with too many hosts, or arrays that have an inadequate number of spindles for the amount of I/O they are expected to service.

Any sustained value of 2 or more on this metric warrants investigation. The thresholds chosen should minimise alerts, and be indicative of an underlying problem if they are reached.

Disk Queue Length

Yellow Alert: 4
Red Alert: 8

Terminal Services

Performance Counters Selected

Active Sessions: The number of active sessions
Inactive Sessions: The number of inactive sessions

Active Sessions
Inactive Sessions

These metrics are a simple count of active and inactive sessions. Set the yellow threshold for active sessions to the planned capacity of your servers, and the red threshold at a level approaching the estimated maximum capacity of the servers. Inactive sessions are a little less scientific, as we don't usually expect to have many of these about; I usually set the yellow and red values for these at 10% of the active sessions values. These values will depend on how resource hungry your applications are, how many applications you have running in each session, and the size of the box.
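
A minimal sketch of that rule of thumb follows; the planned and maximum capacity figures are hypothetical and should be replaced with your own sizing estimates.

    # Session thresholds from planned and maximum capacity (hypothetical sizing figures)
    planned_capacity = 60       # users per server the farm has been sized for
    estimated_maximum = 75      # users at which the server is estimated to top out

    active_yellow, active_red = planned_capacity, estimated_maximum
    inactive_yellow = round(active_yellow * 0.10)
    inactive_red = round(active_red * 0.10)

    print(f"Active sessions: yellow {active_yellow}, red {active_red}")
    print(f"Inactive sessions: yellow {inactive_yellow}, red {inactive_red}")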

Active Sessions

Yellow Alert: Need to calculate (Planned capacity)
Red Alert: Need to calculate (Approaching estimated maximum capacity)

Inactive Sessions

Yellow Alert: 10% of active sessions value (Citrix default)
Red Alert: 10% of active sessions value (Citrix default)

Citrix MetaFrame

Performance Counters Selected

Data Store Connection Failure: The number of minutes that the server has been disconnected from the data store.

Data Store Connection Failure

MetaFrame farms are capable of operating disconnected from the data store for 96 hours, so it's not the loss of connection that's a problem but the length of the outage. At most sites we see the odd momentary disconnection, and this is no cause for alarm. I like to set the yellow alert at 10 minutes and the red alert at 1 hour. That way we know a red alert means there is something seriously wrong, while users are not yet affected.

Data Store Connection Failure

Yellow Alert: 15 (Citrix default)
Red Alert: 60 (Citrix default)

ICA Session

Performance Counters Selected

Latency – Last Recorded: The last recorded latency measurement for the session.

Latency – Last Recorded

Latency last recorded captures the time taken for a packet to travel from the client to the server. It is a one way measure.

Latency values recorded only change when the client sends data back to the server, such as keystrokes or mouse activity. A reading may therefore remain high for an extended period simply because the user is doing nothing. Citrix XenApp sessions deteriorate at around 600ms latency, are virtually unusable at 750ms, and die at around 1,000ms.

Latency – Last Recorded

Yellow Alert: 450 (Approaching degradation point at 600)
Red Alert: 650 (Approaching unusable at 750)
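
A minimal sketch of the resulting classification, using the deterioration points and thresholds above (latency in milliseconds), follows.

    # Classify ICA Session latency (Last Recorded, in ms) against the bands above
    def latency_alert(latency_ms):
        if latency_ms >= 650:    # approaching the 750ms unusable point
            return "red"
        if latency_ms >= 450:    # approaching the 600ms degradation point
            return "yellow"
        return "ok"

    print(latency_alert(500))   # -> 'yellow'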
