Linux CPU Reporting under VM is WRONG!
When running in a virtual environment
such as z/VM, ALL Linux CPU accounting is wrong. This presentation
has been given many times at SHARE and IBM technical conferences,
yet people still want to depend on products that are simply not aware
of this problem to provide their CPU measurements.
These tools just do NOT report CPU correctly - they have been
proven to be incorrect by as much as an order of magnitude.
TEST your solution before buying.
ALL agents and performance programs in Linux get their data from the same /proc file system. It is easy to show that this data is wrong by design. Linux measures CPU using a time-of-day sampling technique that works well enough in a dedicated environment.
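As a rough illustration of where an agent's numbers come from, the sketch below derives a busy percentage from two snapshots of the aggregate "cpu" line in /proc/stat. The column order (user, nice, system, idle, then iowait and others on later kernels) is the standard /proc/stat layout; the function names and the sample snapshots are hypothetical, and real agents may compute this differently.

```python
def parse_cpu_line(stat_text):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            # Keep at most the first 8 counters; the later guest counters
            # are already included in the user and nice columns.
            return [int(value) for value in parts[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def busy_percent(before, after):
    """Percentage of elapsed ticks that Linux counted as non-idle."""
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        return 0.0
    # Column 4 is idle; column 5 (iowait), when present, is also not busy time.
    idle = deltas[3] + (deltas[4] if len(deltas) > 4 else 0)
    return 100.0 * (total - idle) / total


# Hypothetical snapshots taken some seconds apart (user nice system idle).
SAMPLE_T0 = "cpu  1000 0 500 8500\ncpu0 1000 0 500 8500\n"
SAMPLE_T1 = "cpu  1300 0 600 9100\ncpu0 1300 0 600 9100\n"

if __name__ == "__main__":
    print(busy_percent(parse_cpu_line(SAMPLE_T0), parse_cpu_line(SAMPLE_T1)))
```

On a live server the two snapshots would be read from /proc/stat itself. Either way, the result only reflects what Linux believes it used; nothing in these counters comes from z/VM.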
When running in a virtual environment, Linux accounts for processor time as if the processor were dedicated. On an idle system, where only one Linux server is running under z/VM, the difference in reporting is not usually noticed. But under load, when two or more servers are active, Linux does not know whether there are delays in getting CPU. This results in Linux reporting more CPU utilization than was actually used.
To test this, run "top" or your agent of choice in one server, then log on multiple servers that are in loops. The more servers that are logged on, the more CPU "top" reports itself using, even though its requirements are unchanged. Or set a maximum share for the Linux server to simulate what happens when resources are limited.
Imagine one of your servers goes into a loop. Any server that you log on to and run "top" on will show very high CPU utilization. How can you correctly determine the real problem? Or use this data for capacity planning? The VM data is required.
Velocity Software corrects the Linux CPU numbers using a unique data capture method that absolutely requires z/VM performance data for the same interval as the Linux data. This method is unique to ESALPS. No other vendor or product can collect the data concurrently and correct the data. Ask them, and ask for proof.
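The general idea of correcting Linux figures against z/VM data for the same interval can be illustrated with simple proportional scaling. This is only a guess at the kind of arithmetic involved, not the ESALPS method itself; the function name, the process names and all the figures below are hypothetical.

```python
def prorate(linux_process_pct, vm_measured_pct):
    """Scale per-process Linux CPU figures so they sum to the VM-measured total.

    linux_process_pct: process name -> percent of a CPU as reported by Linux
    vm_measured_pct:   percent of a CPU that z/VM charged to the whole server
                       over the same interval (assumed to be the trusted figure)
    """
    linux_total = sum(linux_process_pct.values())
    if linux_total == 0:
        return {name: 0.0 for name in linux_process_pct}
    factor = vm_measured_pct / linux_total
    return {name: pct * factor for name, pct in linux_process_pct.items()}


# Hypothetical interval: Linux reports 80% in total, z/VM measured only 8%.
reported = {"top": 20.0, "httpd": 60.0}
corrected = prorate(reported, vm_measured_pct=8.0)

if __name__ == "__main__":
    for name, pct in corrected.items():
        print(f"{name}: reported {reported[name]:.1f}%, prorated {pct:.1f}%")
```

The scaling only makes sense when both measurements cover the same interval, which is why the text stresses collecting the two data sources concurrently.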
For more information, see the PRORATE PRESENTATION as presented at many conferences in 2001 and 2002.
How much CPU should your idle servers require for instrumentation? The worst performance monitor is the one that becomes the performance problem. Most of the agents that work today in Linux, Unix or NT environments are not efficient, but in most distributed environments this does not matter. With all the cycles available on those platforms, an idle server measuring itself is not a problem.
In the shared resource environment with Linux running under z/VM,
there are two issues to address.
The first issue is the cost
of the instrumentation, usually in the form of an agent. If this
cost is 5% of a processor, and you expect to run 100 servers,
you have just allocated 5 processors to instrumentation. Not Good.
NETSNMP is a VERY low-cost agent and is readily available from SourceForge.
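The arithmetic behind the first issue is worth making explicit, since it scales linearly with the number of servers. The short sketch below reproduces the 5% and 100-server figures from the text; the function name is just an illustration.

```python
def processors_for_instrumentation(agent_cost_pct, server_count):
    """Whole-processor equivalents consumed by running one agent per server."""
    return agent_cost_pct * server_count / 100.0


if __name__ == "__main__":
    # Figures from the text: 5% of a processor per agent, 100 servers.
    print(processors_for_instrumentation(5, 100))  # 5.0
```

Plugging in the measured cost of a candidate agent and the planned server count gives a quick estimate of what the instrumentation alone will consume.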
The second issue is the cost of measuring idle servers. Why wake
up a server to see what it is doing when it is not doing anything?
Waking up an idle server involves CPU, storage and paging, all of
which are unnecessary, and all of which take resource away from
the other servers.
Velocity Software's ESALPS is the Low Cost Performance Monitor! With ESALPS, the VM data tells us that the server is idle, thus there is no need to request the performance data. As ESALPS uses NETSNMP, the agent is passive and only wakes up when data is requested - unlike other available agents in this environment.
The z/VM platform is a shared resource platform that may be configured to support hundreds of servers. In performing capacity planning or performance tuning, knowing where the resources are being used is critical. Agents that allow a perspective into one or two Linux servers do not provide a system perspective.
ESALPS gathers all z/VM data, Linux data, network data, and data from other distributed servers into one database. From this exclusive integrated database, a full system perspective can be obtained.
Monitoring idle servers on z/VM wastes expensive CPU and storage resources that other servers could use for real work. ESALPS uses the z/VM data to detect an idle server, or a server with very low utilization, and can then discontinue monitoring until the server becomes active. This can save significant resources and, when running hundreds of servers, still allows full monitoring of all of them.
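The decision described above, using the hypervisor's view to choose which servers are worth waking, can be sketched in a few lines. This is an illustration of the idea only, not ESALPS code; the threshold value, the server names and the utilization figures are hypothetical.

```python
IDLE_THRESHOLD_PCT = 0.5  # hypothetical cutoff below which a server counts as idle


def servers_to_poll(vm_cpu_pct_by_server, threshold=IDLE_THRESHOLD_PCT):
    """Return the servers whose VM-measured CPU justifies requesting agent data.

    vm_cpu_pct_by_server: server name -> CPU percent as measured by z/VM,
    which is available without waking the guest.
    """
    return sorted(
        name for name, pct in vm_cpu_pct_by_server.items() if pct >= threshold
    )


# Hypothetical interval sample from the VM side.
vm_sample = {"LINUX01": 12.4, "LINUX02": 0.1, "LINUX03": 0.0, "LINUX04": 3.7}

if __name__ == "__main__":
    print(servers_to_poll(vm_sample))  # ['LINUX01', 'LINUX04']
```

Because the check is repeated every interval, a server that becomes active again crosses the threshold and is picked up for polling on the next pass.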
The following report, published in the Domino 6.5 Redbook, is one example of the kinds of reporting possible with ESALPS. The example shows that, for this workload, java was not the CPU problem. When all the processes with the same name were added up, the "server" process with 67 threads became the obvious place to start tuning. But with an average thread requirement of only .9%, this was not obvious until the report was analyzed.
Report: ESAHSTA      LINUX HOST Application Report      Domino Redbook   ESAMAP
-------------------------------------------------------------------------
Node/    Process/    <-Application Process Counts------> <-----Processor---->
Date     Application                                     <---Utilization---->
Time     name        Total active Running ResWait Loaded Percent seconds  Avg
-------- ----------- ----- ------ ------- ------- ------ ------- ------- ----
LINUXA   java         15.0   15.0     2.0    13.0      0    10.3    92.6  0.7
         kswapd        1.0    1.0       0     1.0      0     9.1    82.2  9.1
         router       11.0   11.0       0    11.0      0    10.6    95.4  1.0
         server       67.0   67.0     1.0    63.0    3.0    63.2   568.5  0.9
         update        3.0    3.0     1.0     2.0      0    10.2    91.7  3.4
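To see why grouping by process name matters, the sketch below takes the Total and Percent columns from the report and ranks the applications two ways: by per-process average and by summed utilization. The data values come from the report; the function and variable names are hypothetical.

```python
# (application name, process count, total CPU percent) from the report above.
REPORT_ROWS = [
    ("java", 15, 10.3),
    ("kswapd", 1, 9.1),
    ("router", 11, 10.6),
    ("server", 67, 63.2),
    ("update", 3, 10.2),
]


def with_average(rows):
    """Add the per-process average (total percent / process count) to each row."""
    return [(name, count, pct, round(pct / count, 1)) for name, count, pct in rows]


def rank_by_total(rows):
    """Applications ordered by summed CPU percent, highest first."""
    return [r[0] for r in sorted(with_average(rows), key=lambda r: r[2], reverse=True)]


def rank_by_average(rows):
    """Applications ordered by per-process average, highest first."""
    return [r[0] for r in sorted(with_average(rows), key=lambda r: r[3], reverse=True)]


if __name__ == "__main__":
    print("by total:  ", rank_by_total(REPORT_ROWS))
    print("by average:", rank_by_average(REPORT_ROWS))
```

Ranked by average, "server" sits near the bottom; ranked by total, it is first by a wide margin, which is the point the report makes.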