Many installations are asking for information on how to manage performance for SMT-enabled LPARs. This article covers the first three parts of performance management: understanding the metrics, capacity planning, and chargeback.
In an SMT-enabled LPAR, it is important to understand that we now talk in terms of threads and IFLs. From a z/VM perspective, we now see threads: wherever the traditional z/VM term "CPU" is used, one now has to think "thread".
At a high level, the ESALPARS report shows how much IFL time is allocated to the LPAR. When an IFL is assigned to an LPAR under SMT, both threads on that IFL are assigned with it. However, both threads might not be used, so there is time when the core is assigned but one thread is idle. This thread idle time is extra capacity that is available, and it shows up on the ESALPARS report as "thread idle": time when one IFL (two threads) is assigned to an LPAR, but only one thread is actually doing work.
Following is the LPAR summary showing the LPARs and their allocations. Starting from the far right, the "Entitled CPU" is the share of the shared engines that this LPAR is guaranteed by the assigned LPAR weights. The data comes from the LXB5 LPAR, which is guaranteed 10.91 cores. In this case, the LPAR was assigned a physical core 828% of the time, meaning 8.28 cores were assigned on average during this one-minute reporting interval.
When a core is assigned to an LPAR in an SMT-2 environment, both threads are part of that assignment. Even though both threads are assigned, that does not mean both are utilized. The CP monitor provides another metric for this, the "idle thread" metric. If this LPAR is assigned 828% and the (non-SMT) LPAR overhead is subtracted, that leaves 816% "core assignment" for real work, which corresponds to 1632% "thread assignment". Of that 1632%, in this case 594% was thread idle: time when one thread was being utilized and the other was idle, but the core was assigned.
Report: ESALPARS      Logical Partition Summary
Monitor initialized: 07/07/15 at 13:03
---------------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned            Entitled
              Virt CPU  <%Assigned> <---LPAR--> <-Thread->         CPU
Time     Name     Nbr CPUs Type Total  Ovhd Weight  Pct  Idle cnt  Cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ----- --- ------
13:05:00 Totals:   00   71  IFL  1055  18.7   1001  100
         LXB5      05   20  IFL 828.6  12.6    475 47.5 594.5   2  10.91
         LXBX      0F    1  IFL   0.5   0.1     50  5.0     0   1   1.15
         LXB2      02   12  IFL  1201   0.1    Ded 21.8     0   1      0
         LXB3      03   20  IFL  2000   0.1    Ded 36.4     0   1      0
         LXB8      08   10  IFL 224.7   5.7    475 47.5     0   1  10.91
         TS02      0E    8  IFL   1.3   0.3      1  0.1     0   1   0.02

Totals by Processor type:
     <---------CPU-------> <-Shared Processor busy->
Type Count Ded shared  Total  Logical Ovhd Mgmt
---- ----- --- ------ ------ -------- ---- ----
IFL    55   32     23 1073.7   1036.5 18.7 18.4
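The entitlement and core/thread arithmetic can be sketched in a few lines of Python. This is our own illustration, not a zVPS calculation; the function name is ours, and the inputs (23 shared IFLs, weight 475 of a total 1001) come from the report above.

```python
# Sketch of the entitlement and SMT-2 assignment arithmetic for LXB5,
# using values from the ESALPARS report above.

def entitled_cores(shared_cores, lpar_weight, total_weight):
    """Shared cores guaranteed to an LPAR by its relative weight."""
    return shared_cores * lpar_weight / total_weight

# LXB5: weight 475 out of a total weight of 1001, 23 shared IFLs on the CEC
entitlement = entitled_cores(23, 475, 1001)

# Assigned core time minus LPAR overhead, doubled for the SMT-2 threads
core_pct = 828.6 - 12.6       # ~816% core assignment doing real work
thread_pct = core_pct * 2     # ~1632% thread assignment
thread_idle = 594.5           # threads assigned but idle (spare capacity)

print(f"entitled: {entitlement:.2f} cores")
print(f"core {core_pct:.0f}%  thread {thread_pct:.0f}%  idle {thread_idle:.0f}%")
```

Note that the weight-based entitlement (10.91 cores) is a guarantee, not a cap: the report shows LXB5 assigned only 8.28 cores in this interval.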
From a capacity planning perspective, there is unused capacity. This happens to be a z13, which had inherent bottlenecks in the TLB that were corrected on the z14 and z15. Understanding this requires hardware knowledge and use of the hardware metrics from the PRCMFC data. Many z13 installations saw a "less than zero" increase in capacity by enabling SMT on the z13. This is only detectable using the MFC data when evaluating production data.
The objective of SMT is to better utilize the processor core. The "z" processors have a very sophisticated cache hierarchy to increase the amount of time a core can actually execute instructions. Any instruction execution must have the instruction and all related data in the level 1 cache. Any time there is a cache miss, the core sits idle while the data comes from level 2, level 3, level 4, level 4 on a remote book, or memory; each of these sources requires an increasing number of cycles to load the data into the level 1 cache. If the core can process instructions from another thread during these cache loads, then core utilization should go up.
There are then two measures for evaluating whether capacity has increased when SMT is enabled.
Without proper measurement capability it is very difficult to know whether capacity has increased. One installation said their Linux admins think performance is better, but the method of analysis was not scientific. From a capacity planning perspective, look at instructions per second per core and cycles per instruction to know whether more work is being processed. If IFL utilization is low, enabling SMT changes very little; SMT is useful from a capacity perspective when IFL utilization is high and more capacity is desired.
Capacity planning becomes more difficult with SMT because capacity no longer grows in a straight line. It requires multiple metrics and the understanding that as CPU utilization grows, there will be more contention for cache and TLB, and thus less work done per cycle allocated.
IBM states in many places that when running in SMT mode, workloads WILL run slower. With SMT, two workloads share the same processor core, so there will be cycles where both workloads are ready to run but one has to wait. The performance question then becomes the impact of this "core contention".
The IBM monitor provides metrics at the system level and the user level to assist in understanding how SMT impacts the system. There is also the PRCMFC data (mainframe cache statistics) that shows the impact of two threads on the hardware cache. zVPS has been enhanced to utilize and expose these new metrics on the ESAMFC reports for every processor from the z196 to the current z16.
For system-level performance reporting, it is important to understand that with SMT enabled there are two sets of counters for most metrics. From a z/VM perspective there are twice as many CPUs, all with the traditional measurements. But there is still the physical hardware utilization, and as in any performance analysis, utilization of hardware has an impact on performance and throughput.
In the above case, LPAR LXB5 has 20 physical cores available, and in SMT-2 mode z/VM sees 40 threads. From the LPAR perspective we see the 20 cores with their assigned percentages and idle thread time per core.
Report: ESALPAR       Logical Partition Analysis        Velocity Software  ZMAP 5.1.1  10/29/20  Pg 1257
--------------------------------------------------------------------------------------------------------------
          CEC <-Logical Partition-> <---------Logical Processor---------> <------(percentages)------->
         Phys Pool          VCPU VCPU <%Assigned>      Weight/ Total  User   Sys  Idle  Stl   Idle
Time     CPUs Name     No   Name Addr Total Ovhd TYPE   Polar   util ovrhd ovrhd  time  Pct   Time cp1/cp2
-------- ---- -------- --- ----- ---- ----- ---- ---- ------- ----- ----- ----- ----- ---- ------ -------
13:05:00   55 LXB5      05 .        0  41.2  0.8 IFL  475 Hor  52.9   1.4   3.0 144.8 2.32  27.81   0 / 0
                                    1  39.6  0.6 IFL  475 Hor  50.3   1.3   2.1 147.9 1.75  26.65   2 / 3
                                    2  34.1  0.6 IFL  475 Hor  41.1   1.1   2.1 157.3 1.63  25.18   4 / 5
                                    3  34.8  0.5 IFL  475 Hor  41.1   0.9   1.7 157.5 1.39  26.68   6 / 7
                                    4  38.4  0.6 IFL  475 Hor  47.3   1.1   1.9 151.0 1.64  27.57   8 / 9
                                    5  43.5  0.6 IFL  475 Hor  55.0   1.2   2.3 143.3 1.66  30.12  10 /11
                                    6  44.1  0.7 IFL  475 Hor  56.5   1.4   2.2 141.6 1.89  29.47  12 /13
                                    7  40.3  0.7 IFL  475 Hor  50.1   1.4   2.3 148.0 1.95  28.37  14 /15
                                    8  44.5  0.5 IFL  475 Hor  53.4   0.8   1.7 145.2 1.36  33.99  16 /17
                                    9  39.2  0.6 IFL  475 Hor  48.1   1.1   1.8 150.3 1.62  28.38  18 /19
                                   10   6.4  0.2 IFL  475 Hor   6.4   0.2   0.8 192.9 0.75   5.82  20 /21
                                   11   5.8  0.1 IFL  475 Hor   5.8   0.1   0.4 193.8 0.38   5.41  22 /23
                                   12  27.9  0.5 IFL  475 Hor  32.3   0.7   1.7 165.4 2.31  21.76  24 /25
                                   13  30.4  0.6 IFL  475 Hor  36.0   0.9   2.3 161.2 2.78  22.70  26 /27
                                   14  62.6  0.8 IFL  475 Hor  79.0   1.3   3.1 117.6 3.42  43.40  28 /29
                                   15  52.7  0.9 IFL  475 Hor  65.1   1.3   3.4 131.4 3.47  37.30  30 /31
                                   16  49.9  0.8 IFL  475 Hor  65.0   0.9   3.2 131.8 3.24  31.95  32 /33
                                   17  64.6  0.8 IFL  475 Hor  75.4   0.9   3.2 121.3 3.28  50.80  34 /35
                                   18  61.4  0.9 IFL  475 Hor  76.8   1.6   3.6 119.3 3.91  42.56  36 /37
                                   19  67.1  1.0 IFL  475 Hor  81.9   1.2   3.9 113.9 4.11  48.57  38 /39
                                      ----- ----            ----- ----- ----- ----- ---- ------ -------
                            LPAR      828.6 12.6             1020  20.9  46.8  2936 44.9  594.5   0 / 0
And then from the z/VM side, we can look at the system thread by thread:
Report: ESACPUU       CPU Utilization Report
----------------------------------------------------------------------
         <----Load---->          <--------CPU (percentages)-------->
         <-Users-> Tran          Total  Emul  User   Sys  Idle Steal
Time     Actv In Q /sec CPU Type  util  time ovrhd ovrhd  time  time
-------- ---- ---- ---- --- ---- ----- ----- ----- ----- ----- -----
13:05:00   97  218  3.1   0 IFL   26.4  24.2   0.7   1.5  72.4   1.2
                          1 IFL   25.4  23.7   0.6   1.1  73.5   1.2
                          2 IFL   24.5  22.8   0.7   1.1  74.6   0.9
                          3 IFL   25.8  24.1   0.6   1.0  73.3   0.9
                          4 IFL   20.0  18.3   0.6   1.1  79.2   0.8
                          5 IFL   21.1  19.6   0.5   1.0  78.1   0.8
                          6 IFL   20.8  19.5   0.5   0.9  78.5   0.7
                          7 IFL   20.3  19.0   0.5   0.8  79.0   0.7
                          8 IFL   23.8  22.3   0.5   1.0  75.4   0.8
                          9 IFL   23.5  22.0   0.6   1.0  75.6   0.8
                         10 IFL   26.0  24.0   0.6   1.4  73.2   0.8
                         11 IFL   29.0  27.6   0.6   0.9  70.1   0.8
                         12 IFL   27.3  25.4   0.7   1.1  71.8   0.9
                         13 IFL   29.2  27.4   0.7   1.1  69.8   1.0
                         14 IFL   25.5  23.7   0.7   1.2  73.5   1.0
                         15 IFL   24.5  22.8   0.7   1.1  74.5   1.0
                         16 IFL   22.8  21.4   0.4   1.0  76.5   0.7
                         17 IFL   30.6  29.4   0.4   0.8  68.7   0.7
                         18 IFL   23.3  21.8   0.6   0.9  75.9   0.8
                         19 IFL   24.8  23.5   0.5   0.8  74.4   0.8
                         20 IFL    3.3   2.6   0.1   0.5  96.4   0.4
                         21 IFL    3.1   2.7   0.1   0.3  96.5   0.4
                         22 IFL    2.1   1.9   0.1   0.2  97.7   0.2
                         23 IFL    3.7   3.4   0.1   0.2  96.1   0.2
                         24 IFL   16.0  14.8   0.3   0.8  82.9   1.1
                         25 IFL   16.4  15.1   0.4   0.9  82.5   1.2
                         26 IFL   16.9  15.2   0.5   1.2  81.7   1.4
                         27 IFL   19.1  17.5   0.5   1.1  79.5   1.4
                         28 IFL   36.1  33.7   0.7   1.8  62.2   1.7
                         ....
                         38 IFL   36.7  33.9   0.6   2.2  61.2   2.0
                         39 IFL   45.2  43.0   0.5   1.6  52.7   2.1
                                 ----- ----- ----- ----- ----- -----
                        System:   1019 951.3  20.8  46.4  2937  44.9
Now with 816% core-assigned time (828.6 minus 12.6 overhead), z/VM sees 1019% total thread-busy time. With the 20 cores, there are two different utilization numbers: core busy, 816% out of 20 cores, and thread utilization, 1019% out of 40 threads. Both are important from a performance analysis perspective.
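The two utilization views can be computed side by side. This is an illustrative sketch in Python using the figures above; the variable names are ours, not zVPS metric names.

```python
# Core vs. thread utilization for LXB5, using figures from the
# ESALPAR and ESACPUU reports above.
cores, threads = 20, 40
core_assigned_pct = 816.0    # core time after subtracting LPAR overhead
thread_busy_pct = 1019.0     # total thread-busy time as z/VM sees it

core_util = core_assigned_pct / cores      # average busy % per core
thread_util = thread_busy_pct / threads    # average busy % per thread
print(f"core utilization:   {core_util:.1f}%")    # 40.8%
print(f"thread utilization: {thread_util:.1f}%")  # 25.5%
```

The same LPAR thus looks roughly 41% busy by cores but only about 25% busy by threads; which number matters depends on whether the question is capacity (cores) or dispatching (threads).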
One of the most interesting scenarios showing the value of the mainframe cache data is the following. The ESAMFC data below comes from an IBM benchmark without SMT. This is a z13 (the 5 GHz processor speed gives that away) with 6 processors in the LPAR. The report shows the cycles used by the workload on each processor and the number of instructions executed by each processor, all as per-second rates. At the tail end of the benchmark, processor utilization drops from 92% to 67% as some of the drivers complete. But note that the instruction rate goes up.
Even though utilization dropped, the instructions executed went up: as the remaining drivers stopped fighting for the CPU cache, cache residency greatly improved. The last metric is the important one, cycles per instruction. If the processor cache is overloaded, cycles are wasted loading data into the level 1 cache. As contention for the L1 cache drops, so do the cycles used per instruction, and as a result more instructions are executed using much less CPU.
Report: ESAMFC        MainFrame Cache Analysis
-------------------------------------------------
         <CPU Busy> <-------Processor------>
          <percent> Speed/<-Rate/Sec->
Time     CPU Totl User Hertz Cycles Instr Ratio
-------- --- ---- ---- ----- ------ ----- -----
14:05:32   0 92.9 64.6 5000M  4642M 1818M 2.554
           1 92.7 64.5 5000M  4630M 1817M 2.548
           2 93.0 64.7 5000M  4646M 1827M 2.544
           3 93.1 64.9 5000M  4654M 1831M 2.541
           4 92.9 64.8 5000M  4641M 1836M 2.528
           5 92.6 64.6 5000M  4630M 1826M 2.536
             ---- ---- ----- ------ ----- -----
System:       557  388 5000M  25.9G 10.2G 2.542
-------------------------------------------------
14:06:02   0 67.7 50.9 5000M  3389M 2052M 1.652
           1 67.8 51.4 5000M  3389M 2111M 1.605
           2 69.0 52.4 5000M  3450M 2150M 1.605
           3 67.2 50.6 5000M  3359M 2018M 1.664
           4 60.8 44.5 5000M  3042M 1625M 1.872
           5 70.1 53.8 5000M  3506M 2325M 1.508
             ---- ---- ----- ------ ----- -----
System:       403  304 5000M  18.8G 11.4G 1.640
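The CPI ratio in the last column is simply cycles divided by instructions. A quick sketch using the system-total rates from the two intervals above ("G" means billions per second):

```python
# Cycles per instruction (CPI) for the two ESAMFC intervals above,
# computed from the system-total rates.
def cpi(cycles_per_sec, instr_per_sec):
    return cycles_per_sec / instr_per_sec

busy = cpi(25.9e9, 10.2e9)   # ~92% busy interval
tail = cpi(18.8e9, 11.4e9)   # ~67% busy interval, drivers completing
print(f"busy: {busy:.2f}  tail: {tail:.2f}")   # busy: 2.54  tail: 1.65
```

Less utilization, fewer cycles, yet more instructions per second: the CPI drop from ~2.54 to ~1.65 is the cache-contention effect described above.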
A typical production workload with SMT enabled shows 8 threads with a respectable average cycles-per-instruction (CPI) ratio of 1.68, at about 50% thread utilization. The question for the capacity planner is what happens to the CPI when core utilization goes up. If the CPI goes up significantly, work may take much more time (and cycles) to execute, and the system capacity available is much less than it appears.
Report: ESAMFC        MainFrame Cache Magnitudes
------------------------------------------------
         <CPU Busy> <-------Processor------>
          <percent> Speed/<-Rate/Sec->
Time     CPU Totl User Hertz Cycles Instr Ratio
-------- --- ---- ---- ----- ------ ----- -----
09:01:00   0 47.0 45.9 5000M  2290M 1335M 1.716
           1 50.0 48.9 5000M  2439M 1480M 1.648
           2 45.5 44.4 5000M  2219M 1329M 1.669
           3 47.3 46.1 5000M  2313M 1331M 1.738
           4 42.5 41.0 5000M  2078M 1164M 1.785
           5 53.6 52.7 5000M  2623M 1750M 1.499
           6 44.3 43.3 5000M  2163M 1179M 1.834
           7 56.3 55.3 5000M  2758M 1665M 1.657
             ---- ---- ----- ------ ----- -----
System:       386  378 5000M  17.6G 10.5G 1.681
In this case, 17.6 billion cycles ("G" on the report means billions) per second are being utilized. The L1 cache is broken out into instruction cache and data cache. Of the 17.6B cycles consumed, 2.3B are used for instruction cache loads and another 4.2B for data cache loads. Thus of the 17.6B cycles per second used, only about 11B are used for executing instructions.
Report: ESAMFC        MainFrame Cache Magnitudes        Velocity Software Corpor
------------------------------------------------------------------------
         <CPU Busy> <-------Processor------> <-Level 1 cache/second->
          <percent> Speed/<-Rate/Sec->       Instruction <---Data-->
Time     CPU Totl User Hertz Cycles Instr Ratio Writes Cost Writes Cost
-------- --- ---- ---- ----- ------ ----- ----- ------ ---- ------ ----
09:01:00   0 47.0 45.9 5000M  2290M 1335M 1.716    13M 285M  8771K 470M
           1 50.0 48.9 5000M  2439M 1480M 1.648    13M 287M  9592K 564M
           2 45.5 44.4 5000M  2219M 1329M 1.669    13M 285M  8207K 455M
           3 47.3 46.1 5000M  2313M 1331M 1.738    13M 289M  9584K 568M
           4 42.5 41.0 5000M  2078M 1164M 1.785    11M 295M  7381K 447M
           5 53.6 52.7 5000M  2623M 1750M 1.499    14M 283M    11M 566M
           6 44.3 43.3 5000M  2163M 1179M 1.834    12M 309M  9235K 455M
           7 56.3 55.3 5000M  2758M 1665M 1.657    14M 320M    15M 685M
             ---- ---- ----- ------ ----- ----- ------ ---- ------ ----
System:       386  378 5000M  17.6G 10.5G 1.681   102M 2353M   79M 4210M

But it gets worse. There is also the cost of DAT (dynamic address translation). Each reference to an address must have a valid translated address in the TLB (translation lookaside buffer). In this installation's case, of the 17.6B cycles used, about 6.5B were used for loading the cache, and the next report shows that another 3.6B cycles are used for address translation. Here, roughly 19% of the cycles utilized are for address translation. This also goes up as the core becomes more utilized and there are more cache misses.
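The cache-load share of the cycle budget can be checked with a few lines. The figures are the system totals from the report above; the variable names are ours.

```python
# Share of the 17.6B cycles/second spent loading the L1 cache,
# from the ESAMFC system totals above.
total_cycles = 17.6e9
icache_cost = 2.353e9    # cycles spent on instruction-cache loads
dcache_cost = 4.210e9    # cycles spent on data-cache loads

cache_share = (icache_cost + dcache_cost) / total_cycles
print(f"cache loading: {cache_share:.0%}, executing: {1 - cache_share:.0%}")
```

Roughly a third of the consumed cycles go to cache loading before any instruction executes.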
Report: ESAMFC        MainFrame Cache Magnitudes        Velocity Software Corpor
-------------------------------------------------------------
         <CPU Busy> <-Translation Lookaside buffer(TLB)->
          <percent> <cycles/Miss> <Writes/Sec>  CPU Cycles
Time     CPU Totl User  Instr  Data  Instr  Data  Cost  Lost
-------- --- ---- ---- ------ ----- ------ ----- ----- -----
09:01:00   0 47.0 45.9     87   517  1832K  539K 19.13  438M
           1 50.0 48.9    109   506  1471K  525K 17.48  426M
           2 45.5 44.4    127   470  1258K  542K 18.66  414M
           3 47.3 46.1     81   522  1980K  560K 19.55  452M
           4 42.5 41.0    115   524  1363K  496K 20.06  417M
           5 53.6 52.7     47   660  2949K  466K 17.01  446M
           6 44.3 43.3     82   541  2050K  538K 21.27  460M
           7 56.3 55.3     34   728  4796K  538K 20.10  554M
             ---- ---- ------ ----- ------ ----- ----- -----
System:       386  378     72   557    18M 4205K 19.11 3609M

At this point, anyone performing capacity planning must realize that there is a lot of guesswork in future capacity planning models.
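The translation cost can be recomputed from the "Lost" cycles. A rough recalculation from the rounded totals gives about 20%, close to the 19.11% the report computes against its own (unrounded) cycle base; the sketch below is our own arithmetic, not the zVPS formula.

```python
# TLB (address translation) overhead from the ESAMFC TLB report above.
total_cycles = 17.6e9    # cycles consumed per second, system total
tlb_lost = 3.609e9       # cycles lost to address translation ("Lost")

tlb_cost = tlb_lost / total_cycles
print(f"translation cost: {tlb_cost:.1%}")
```

So after cache loading and address translation, well under two thirds of the consumed cycles are doing the workload's actual instructions.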
Traditional chargeback methods charge for CPU seconds consumed. CPU consumed was based on the time the virtual machine was actually dispatched on a CPU, and that number was very repeatable. In the SMT world, when the workload on one thread shares a core and cache with a second thread, the time to complete a given workload will normally be larger. It is larger because even though the virtual machine is dispatched on a thread of a core for a period of time, some of that time the core is being utilized by the other thread, increasing the time on thread without necessarily changing the cycle requirement for the unit of work.
The question then is what metrics a chargeback model should use to be accurate and fair. Thread time should not be used, as there are two threads sharing the hardware resource.
The IBM monitor facility attempts to alleviate this problem. The traditional metrics are still reported, along with two additional metrics: "MT-Equivalent", an estimate of what the server would have used if running alone on the core, and "MT-Prorated", which attempts to charge for the cycles actually consumed. Customers have shown that the IBM "prorated" metrics significantly overcharge users.
For chargeback, it is the resource consumed that should be charged for. In the following example from a database workload, ESALPARS shows the LPAR assigned an average of 11.23 IFLs over a one-minute period. The objective is to charge for the resources assigned to the LPAR. One might subtract the 10.3% overhead associated with managing the LPAR, in which case charging for 11.13 engines is the ideal. The thread idle time of 7.03 threads should also be subtracted, as that is extra capacity that was not utilized.
The capacity that should be charged to that LPAR is calculated as:

11.13 IFLs - (7.03 threads / 2) = 7.6 IFLs

which is the CPU consumed by the LPAR.
Report: ESALPARS      Logical Partition Summary
Monitor initialized: 07/07/15 at 13:03
--------------------------------------------------------------------
         <--------Logical Partition-------> <-Assigned
              Virt CPU  <%Assigned> <---LPAR--> <-Thread->
Time     Name     Nbr CPUs Type Total  Ovhd Weight  Pct  Idle cnt
-------- -------- --- ---- ---- ----- ----- ------ ---- ----- ---
09:01:00 Totals:   00  295  IFL  7410  64.7   1198  100
         MNGDMW08  08   30  IFL  1123  10.3    150 12.5 703.9   2
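The chargeback calculation above can be written out explicitly. This is a sketch of the arithmetic only; the variable names are ours and the figures come from the ESALPARS report.

```python
# Chargeback sketch for LPAR MNGDMW08, from the report above.
assigned_ifls = 11.23    # average IFLs assigned during the interval
overhead_ifls = 0.103    # 10.3% LPAR-management overhead, in IFLs
idle_threads = 7.039     # thread idle time, in threads

# Two idle threads represent one unused core, hence the divide by two.
chargeable = (assigned_ifls - overhead_ifls) - idle_threads / 2
print(f"chargeable: {chargeable:.1f} IFLs")   # chargeable: 7.6 IFLs
```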
Analyzing the user workload by traditional CPU time (now really thought of as "thread time"), the capture ratio is 100%: we know exactly which virtual machine to charge for the thread time consumed. As a chargeback model, the 100% capture ratio validates the data. But realistically, this measure is the time a virtual machine was dispatched on a thread. It is very accurate, yet less useful for chargeback because it does not reflect that the resource consumed was 7.6 IFLs.
Subtracting the z/VM system overhead gives a prorate that takes the virtual machine thread time and converts it to chargeable core time. This prorated value is shown in the "VSI Prorated" columns, and we believe this is what should be used in a chargeback model.
In the user CPU consumption analysis, the numbers should start to make sense. The "traditional" CPU time is thread time. The "MT-Equivalent" metrics are noticeably less; this is the estimated time the thread would have used if SMT were disabled. This number has not been validated and is not believed to be useful.
The "IBM Prorate" values are what IBM provides in the CP monitor. It appears the unused thread time is being charged to the users, which would make the chargeback model variable based on other workload. Customers have reported significant overcharging based on these metrics, which is why the additional "VSI Prorated" metrics are being provided.
Report: ESAUSP5       User SMT CPU Consumption Analysis
---------------------------------------------------------------------
         <-TOTAL CPU--> <------CPU Percent Consumed (Total)---->
UserID   <Traditional>  <MT-Equivalent> <IBM Prorate> <VSI Prorated>
/Class    Total  Virt    Total Virtual  Total Virtual  Total Virtual
-------- ------ ----- ------- ------- ------- ------- ------- -------
09:01:00   1454  1421    1206    1180    1078    1055   739.6   723.0
***Top User Analysis***
MNGDB3F8  293.9 290.6   236.2   233.5   208.9   206.7   149.5   147.8
MNGDB3FD  274.7 258.7   228.0   215.0   196.0   184.7   139.7   131.6
MNGDB529  256.5 249.4   201.4   195.8   169.4   164.8   130.4   126.8
MNGDB41B  189.2 188.7   168.5   168.1   146.8   146.5   96.22   95.98
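The proration idea behind the "VSI Prorated" column can be sketched as scaling each user's thread time by the ratio of chargeable core time to total thread time. The function below is our own illustration of that idea, not the zVPS implementation; the figures come from the ESAUSP5 report above.

```python
# Sketch: scale each user's thread time so the totals match the
# chargeable core capacity, as described above.
def prorate(user_thread_pct, total_thread_pct, chargeable_pct):
    """User's share of chargeable core time, from its thread time."""
    return user_thread_pct * chargeable_pct / total_thread_pct

# MNGDB3F8: 293.9% traditional thread time; system totals are
# 1454% thread time and 739.6% chargeable (VSI Prorated) time.
print(round(prorate(293.9, 1454, 739.6), 1))   # 149.5, per the report
```

Because the scaling factor is the same for every user in the interval, the 100% capture ratio of the traditional thread-time data is preserved while the total charged matches the 7.6 IFLs actually consumed.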
There are a lot of new metrics, and a real need to understand how SMT impacts user chargeback and capacity planning. Please provide feedback to Barton on any ideas or information you learn in your endeavors.