#### MAKLEE

software engineering solutions

## OpenVMS on BL890c i2 Servers

Guy Peleg, President, Maklee Engineering (guy.peleg@maklee.com)





### **Maklee Engineering**

- Consulting firm operating all over the world.
  - Team of "Top Gun" engineers.
  - Former members of various engineering groups at HP.
  - Gold Oracle partner.
- Specializes in performance tuning of:
  - OpenVMS
  - Oracle (HP-UX, Linux, VMS, Solaris, AIX, Windows)
  - Oracle Rdb
  - Java (HP-UX, Linux, VMS, Solaris, AIX, Windows)
  - Adabas
- Also offers custom engineering services, custom training and on-going support contracts.







## Maklee guarantees <u>doubling</u> the performance of your Oracle database, or our service is provided free of charge!

## And... we speak German!!

http://www.maklee.com/indexDeutsch.html

#### For more information: info@maklee.com 1-800-224-4513

Oracle Services/SQL Tuning

Maklee has comprehensive expertise in Oracle tuning, with specialized experience in tuning the most demanding workloads.

#### **The Maklee Advantage**

The Maklee team has a deep understanding of both Oracle and the underlying operating systems. We maintain close working relationships with the development teams of the leading operating system vendors and with the engineering groups at Oracle Corporation. Because we take the customer's feedback into account at all times, our solutions meet the customer's needs exactly. In addition, Maklee continuously stays up to date with the latest technical developments and changes.

#### Oracle Performance Tuning

Oracle tuning offers unlimited potential for improving performance. Oracle's default settings are not always optimal. Tuning is crucial to getting the best out of a system. Our creative approach results in an outstanding degree of performance improvement. In a recent engagement for a leading global bank, the Maklee team reduced the run time of a query from 90 minutes to 4 seconds, a 1350-fold performance improvement. Our specialization includes monitoring and tuning of all Oracle databases, including RAC and Oracle applications, Oracle instance tuning, and SQL tuning. To deliver on our success guarantee, we perform evaluations throughout the entire tuning process. These evaluations cover the operating system and database parameters, the execution plans of the key SQL statements, and the rewriting of problematic SQL statements.

#### Maklee makes it possible.

#### Contact

<u>info@maklee.com</u> Phone: 1-800-224-4513 Fax: 1-646-452-9402

#### **Corporate Headquarters:**

322 W 57th Street, New York, NY 10019



- Why do we need to spend the next hour discussing OpenVMS on the new BL890c i2 server?
- What's unique about the new server?
- What happened to "VMS is VMS is VMS"?
- The BL890c i2 was built using a new memory architecture.
  - Understanding the new architecture is essential for achieving optimal performance on the new server.



#### **Extreme Example**

- Superdome2, 2TB RAM, HP-UX 11.31 (update 7)
- Oracle 11gR2





### Memory Latency and NUMA

- The CPU is MUCH faster than physical memory.
  - A CPU cycle is ~0.5 nanosecond.
- Memory latency is the amount of time it takes for data to arrive from physical memory into the CPU.
  - Varies from 40 to 500 ns.
  - 80-1000 times slower than the CPU.
- Most CPUs spend a significant amount of time waiting for data to arrive from physical memory.
  - From the VMS perspective, the CPU looks busy.
- On a Non-Uniform Memory Access (NUMA) architecture, accessing local memory is faster than accessing remote memory.
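The "80-1000 times slower" figure follows directly from the numbers above. A minimal sketch of that arithmetic, using only the slide's values (the function name is illustrative):

```python
# Figures taken from the slide: ~0.5 ns per CPU cycle, 40 to 500 ns memory latency.
CPU_CYCLE_NS = 0.5


def stall_cycles(latency_ns: float, cycle_ns: float = CPU_CYCLE_NS) -> float:
    """Return how many CPU cycles pass while one memory access is outstanding."""
    return latency_ns / cycle_ns


if __name__ == "__main__":
    for latency_ns in (40, 500):
        cycles = stall_cycles(latency_ns)
        print(f"{latency_ns} ns of latency costs about {cycles:.0f} CPU cycles")
```

Every cycle counted here is a cycle in which the CPU appears busy to the operating system while doing no useful work.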





## NUMA System

| Building Block #0             | Building Block #1             |
|-------------------------------|-------------------------------|
| CPUs 0-7                      | CPUs 8-15                     |
| Memory (interleaved)          | Memory (interleaved)          |
| SCHED and SCS spinlock        | Disk and Network I/O adapters |
| Disk and Network I/O adapters |                               |

| Building Block #2             | Building Block #3             |
|-------------------------------|-------------------------------|
| CPUs 16-23                    | CPUs 24-31                    |
| Memory (interleaved)          | Memory (interleaved)          |
| LCKMGR and TCPIP spinlock     |                               |



Life is not fair !!





## OKAY !

## BUT....does it really matter??

## Oh YES !!!



#### **Memory Latency**

- 2-cell, 4P/8C rx8640 Integrity server.
- In preparation for future growth, a customer purchased 4 processors and spread them across 2 cells.
- 32GB RAM.
- The customer noticed very high CPU utilization compared to an older Integrity box running the same workload.
- Maklee recommended consolidating all of the processors into a single cell and powering off the second cell, thereby improving memory latency.





[Cell]

| Hardware Location | Actual Usage | CPU OK/Deconf/Max | Memory (GB) OK/Deconf | Connected To       | Core Cell Capable | Use On Next Boot | Par Num |
|-------------------|--------------|-------------------|-----------------------|--------------------|-------------------|------------------|---------|
| cab0,cell0        | Active       | 4/0/8             | 16.0/                 | cab0,bay0,chassis0 |                   | yes              | 0       |
| cab0,cell1        | Active       | 4/0/8             | 16.0/                 | cab0,bay0,chassis1 |                   | yes              | 0       |
| cab0,cell2        | Absent       | -                 | -                     | -                  | -                 | -                | -       |
| cab0,cell3        | Absent       | -                 | -                     | -                  | -                 | -                | -       |

Notes: * = cell has no interleaved memory.

[Chassis]

| Hardware Location  | Usage  | Core IO | Connected To |
|--------------------|--------|---------|--------------|
| cab0,bay0,chassis0 | Active | yes     | cab0,cell0   |
| cab0,bay0,chassis1 | Active | yes     | cab0,cell1   |

[Partition]

| Par Num | Status | # of Cells | # of I/O Chassis | Core Cell  | Partition Name (first 30 chars) |
|---------|--------|------------|------------------|------------|---------------------------------|
| 0       | Active | 2          | 2                | cab0,cell0 | Partition 0                     |

| Par Num | Hyperthreading Enabled | Hyperthreading Active |
|---------|------------------------|-----------------------|
| 0       |                        | no                    |

### **The Golden Rules**

- Run your application on the smallest Integrity server that fits your workload.
  - rx6600 and blade BL870c provide outstanding performance, with very low memory latency.
  - On a cellular system, do not turn on extra cells unless you REALLY need them.
- For workloads that do not fit a small system:
  - A process should be "close" to its memory.



#### **New Line of Integrity Servers**

#### HP Integrity server blades

Flexible mission-critical server blades combined with the efficiency of HP BladeSystem to accelerate IT effectiveness.

#### Server blades



#### HP Integrity BL860c i2 Server Blade

Cost-effective mission-critical Converged Infrastructure: a versatile and expandable 2-socket blade that is ideal for application tier and transaction workloads, database, Java™, and technical computing applications.

#### HP Integrity BL870c i2 Server Blade

Flexible mission-critical server blades, combined with the efficiency of BladeSystem: a 4-socket blade that is ideal for the database tier of multi-tiered enterprise applications such as SAP and Oracle enterprise applications.

#### HP Integrity BL890c i2 Server Blade

Kick off the mission-critical revolution with the industry's first 8-socket UNIX scale-up server blade, ideal for larger mission-critical workloads such as enterprise resource planning, customer relationship management, business intelligence, and large shared-memory applications.

|                             | BL860c i2 | BL870c i2 | BL890c i2 |
|-----------------------------|-----------|-----------|-----------|
| Processors supported        | Intel® Itanium® processor 9300 series<br>1.73 GHz (quad-core) with 24 MB cache<br>1.60 GHz (quad-core) with 20 MB cache<br>1.33 GHz (quad-core) with 16 MB cache<br>1.60 GHz (dual-core) with 10 MB cache | Intel® Itanium® processor 9300 series<br>1.73 GHz (quad-core) with 24 MB cache<br>1.60 GHz (quad-core) with 20 MB cache<br>1.33 GHz (quad-core) with 16 MB cache | Intel® Itanium® processor 9300 series<br>1.73 GHz (quad-core) with 24 MB cache<br>1.60 GHz (quad-core) with 20 MB cache<br>1.33 GHz (quad-core) with 16 MB cache |
| Number of processors        | 1-2 | 2-4 | 4-8 |
| Maximum number of cores     | 8 | 16 | 32 |
| Operating systems supported | HP-UX 11i v3 <sup>1</sup><br>Microsoft® Windows® server 2008 R2 for Itanium-based systems and OpenVMS v8.4 <sup>2</sup> | HP-UX 11i v3 <sup>1</sup><br>Microsoft® Windows® server 2008 R2 for Itanium-based systems and OpenVMS v8.4 <sup>2</sup> | HP-UX 11i v3 <sup>1</sup><br>Microsoft® Windows® server 2008 R2 for Itanium-based systems and OpenVMS v8.4 <sup>2</sup> |
| Maximum memory              | 192 GB (24 x 8 GB) | 384 GB (48 x 8 GB) | 768 GB (96 x 8 GB) |





### **The Tukwila Processor**

- Tukwila is the code name for the generation of <u>Intel</u>'s <u>Itanium</u> processor family following <u>Itanium 2</u>, <u>Montecito</u> and Montvale. It was released on 8 February 2010 as the Itanium 9300 Series.
- Quad-core processor, 1.73 GHz, 6 MB L3 cache per core.
- Socket compatibility between Intel's Xeon and Itanium processors, by introducing a new interconnect called Intel QuickPath Interconnect (QPI).
  - Point-to-point processor interconnect.
  - Allows one processor module to access memory connected to another processor module.
  - Developed by members of what had been DEC's Alpha group.
  - Replaces the Front Side Bus (FSB) for Xeon and Itanium.
  - First delivered on the Intel Core i7-9xx desktop processors and the X58 chipset.



The memory controller is part of the processor module and not the chipset.

### **Memory Subsystem Overview**







### **BL860c i2 Overview**





### **BL870c i2 Overview**





### **BL890c i2 Overview**



#### **Superdome2 Overview**

#### Superdome Blade



### **Local Memory Latency**



Maklee Confidential

**Local Memory** 

#### **Remote Memory Latency**

#### **Remote memory**



### Latency on the BL890c i2

#### Memory latency

- Inside interleaving domain
  - Local latency: 217 nsec
  - Latency to a 2nd processor in same blade: 288 nsec
  - Latency to a processor in 2nd blade: 300 nsec
- Across interleaving domains
  - Latency to direct path processor: 300 nsec
  - Latency to processor in other blade: 400 nsec
- Memory latency is not as good as what we were used to on the Alpha.
- Applications should be tuned to use local memory as much as possible.
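To put the figures above in perspective, the following sketch compares each access path with local memory and estimates the average latency for a given mix of accesses. The latencies are the slide's; the dictionary keys and the example 90/10 mix are illustrative assumptions, not measurements.

```python
# Latencies in nanoseconds, as listed on the slide for the BL890c i2.
LATENCY_NS = {
    "local": 217,
    "same_blade": 288,
    "second_blade_in_domain": 300,
    "direct_path_across_domains": 300,
    "other_blade_across_domains": 400,
}


def relative_cost(path: str) -> float:
    """Latency of an access path expressed as a multiple of local latency."""
    return LATENCY_NS[path] / LATENCY_NS["local"]


def average_latency(mix: dict[str, float]) -> float:
    """Weighted average latency for a mix of access paths.

    `mix` maps a path name to the fraction of accesses taking that path;
    the fractions must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("access fractions must sum to 1")
    return sum(LATENCY_NS[path] * share for path, share in mix.items())


if __name__ == "__main__":
    for path in LATENCY_NS:
        print(f"{path}: {relative_cost(path):.2f}x local")
    # Hypothetical workload: 90% local accesses, 10% to the farthest memory.
    mix = {"local": 0.9, "other_blade_across_domains": 0.1}
    print(f"average latency: {average_latency(mix):.1f} ns")
```

Even a small share of far-remote accesses raises the average noticeably, which is the argument for keeping a process close to its memory.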



### Local Vs. Interleaved Memory

- Challenges of NUMA-based servers:
  - Some CPUs may have an advantage acquiring spinlocks.
  - Some CPUs may have an advantage acquiring locks.
  - Inconsistent performance
    - Performance may change based on the CPU a process is scheduled to.
- What could be done to make life a little more fair?
  - Make sure an application is running close to its memory.
    - For example, the dedicated lock manager needs to run close to the lock manager spinlock.
    - Oracle server processes need to run close to the SGA.
  - When the memory footprint of the application is high (shared memory sections that span more than one domain), consider using interleaved memory.
  - Until VMS V8.4, VMS supported only interleaved memory.
    - OpenVMS on Integrity became NUMA-aware again starting with OpenVMS V8.4.



### MEMCONFIG

When migrating to the new BL890c i2, you need to decide on a memory management policy. Use the EFI MEMCONFIG utility to select it.

| Option         | Description | Comments |
|----------------|-------------|----------|
| MaxUMA         | Maximized Uniform Memory Access, 100% ILM | Memory is interleaved across all processor modules installed in the system. Has the potential to improve bandwidth by distributing memory regions across more DIMMs. When choosing this option one needs to consider the longer latencies associated with 1 or 2 QPI hops. |
| MostlyUMA      | Mostly Uniform Memory Access, 6/8 ILM and 2/8 SLM | 6/8 of the available system memory is interleaved across all processor modules installed in the system and 2/8 is interleaved as local memory. |
| Balanced       | Equal allocation of Uniform and Non-Uniform Memory Access, 4/8 ILM and 4/8 SLM | |
| MostlyNUMA     | Mostly Non-Uniform Memory Access, 1/8 ILM and 7/8 SLM | Default memory interleaving selection at boot, optimum for HP-UX. |
| MostlyNUMA_MBI | Mostly Non-Uniform Memory Access, Minimum Balanced Interleaving, 1 GB ILM and the rest of the memory SLM | Optimum for Windows. Allows for enough shared memory space for the kernel and any registers which need to be accessed by all processor modules while minimizing memory latency by configuring most of the memory space as SLM. |
| MaxNUMA        | Maximized Non-Uniform Memory Access, 100% SLM | Lowest memory latency configuration. |





## **OpenVMS** Implementation



View of Cluster from system ID 10241 node, 2-SEP-2010 15:36:03

| NODE | HW_TYPE | SOFTWARE | STATUS |
|------|---------|----------|--------|
|      | HP BL870c i2 (1.73GHz/6.0MB)<br>HP BL870c (1.59GHz/12.0MB)<br>hp AlphaServer GS1280 7/1300<br>HP BL870c (1.59GHz/12.0MB)<br>HP rx6600 (1.59GHz/12.0MB)<br>HP rx6600 (1.59GHz/12.0MB) | VMS V8.4<br>VMS V8.3-1H1<br>VMS V8.3<br>VMS V8.3-1H1<br>VMS V8.3-1H1<br>VMS V8.3-1H1 | MEMBER<br>MEMBER<br>MEMBER<br>MEMBER<br>MEMBER |

#### System Processor Configuration:

| Field       | Value                  | Field       | Value                                                               |
|-------------|------------------------|-------------|---------------------------------------------------------------------|
| CPU ID      | 0                      | CPU State   | rc,pa,pp,cv,pv,pmv,pl                                               |
| CPU Type    | Quad-Core Itanium (Intel Itanium 9300 Rev E0) |  |                                                            |
| Halt PC     | 00000000.00000000      | Halt PS     | 00000000.00000000                                                   |
| Halt code   | Bootstrap or Powerfail | Halt Req.   | Default, No Action                                                  |
| Slot VA     | FFFFFFFF.9ADB9000      | CPUDB VA    | FFFFFFFF.8A1D6000                                                   |
| Package     | 0                      | Core        | 0                                                                   |
| Thread id   | 0                      | Cothread id | 16                                                                  |
| FW Usage    | 00000000.00000000      | CPU die     | 0                                                                   |
| ACPI CPU id | 00000000.00000000      | Serial Num  |                                                                     |
| LID         | 00000000.00000000      | CFG flags   | 00000000.00000631 Hardware Initialized Primary Present Reassignable |
| CPU ID      | 1                      | CPU State   | rc,pa,pp,cv,pv,pmv,pl                                               |
| CPU Type    | Quad-Core Itanium (Intel Itanium 9300 Rev E0) |  |                                                            |
| Halt PC     | 00000000.00000000      | Halt PS     | 00000000.00000000                                                   |
| Halt code   | Bootstrap or Powerfail | Halt Req.   | Default, No Action                                                  |
| Slot VA     | FFFFFFFF.9ADBA000      | CPUDB VA    | FFFFFFFF.9B852000                                                   |
| Package     | 0                      | Core        | 1                                                                   |
| Thread id   | 0                      | Cothread id | 17                                                                  |
| FW Usage    | 00000000.00000100      | CPU die     | 0                                                                   |

Press RETURN for more.

SDA>

### RAD\_SUPPORT

#### RAD\_SUPPORT

(Alpha only) RAD\_SUPPORT enables RAD-aware code to be executed on systems that support Resource Affinity Domains (RADs); for example, AlphaServer GS160 systems. A RAD is a set of hardware components (CPUs, memory, and I/O) with common access characteristics.

Bits are defined in the RAD\_SUPPORT parameter as follows:

RAD\_SUPPORT (default is 79; bits 0-3 and 6 are set)

| Bits  | 31-28 | 27-24 | 23-22 | 21-20 | 19-18 | 17-16 | 15-7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
|-------|-------|-------|-------|-------|-------|-------|------|---|---|---|---|---|---|---|
| Field | 00 00 | skip  | ss    | gg    | ww    | pp    | 0    | p | d | f | c | r | a | e |

Bit 0 (e): Enable - Enables RAD support

Bit 1 (a): Affinity - Enables Soft RAD Affinity (SRA) scheduling. Also enables the interpretation of the skip bits, 24-27.

Bit 2 (r): Replicate - Enables system-space code replication

## RAD\_SUPPORT

Bit 3 (c): Copy - Enables copy on soft fault

Bit 4 (f): Fault - Enables special page fault allocation. Also enables the interpretation of the allocation bits, 16-23.

Bit 5 (d): Debug - Reserved to HP

Bit 6 (p): Pool - Enables per-RAD non-paged pool

Bits 7-15: Reserved to HP

Bits 16-23: If bit 4 is set, bits 16-23 are interpreted as follows:

- Bits 16,17 (pp): Process = Pagefault on process (non global) pages
- Bits 18,19 (ww): Swapper = Swapper's allocation of pages for processes
- Bits 20,21 (gg): Global = Pagefault on global pages
- Bits 22,23 (ss): System = Pagefault on system space pages
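When experimenting with different RAD\_SUPPORT values, it helps to see which bits a given number turns on. The sketch below decodes a value according to the bit definitions listed above; the function and table names are illustrative and are not part of OpenVMS.

```python
# Single-bit flags, as defined in the RAD_SUPPORT help text above.
FLAG_BITS = {
    0: "Enable",
    1: "Affinity",
    2: "Replicate",
    3: "Copy",
    4: "Fault",
    5: "Debug (reserved)",
    6: "Pool",
}

# Two-bit allocation fields (only interpreted when bit 4, Fault, is set).
ALLOCATION_FIELDS = {"Process": 16, "Swapper": 18, "Global": 20, "System": 22}


def decode_rad_support(value: int):
    """Split a RAD_SUPPORT value into flag names, allocation fields and skip bits."""
    flags = [name for bit, name in FLAG_BITS.items() if (value >> bit) & 1]
    allocation = {
        name: (value >> shift) & 0b11 for name, shift in ALLOCATION_FIELDS.items()
    }
    skip = (value >> 24) & 0xF  # bits 24-27, interpreted when bit 1 is set
    return flags, allocation, skip


if __name__ == "__main__":
    flags, allocation, skip = decode_rad_support(79)  # documented default
    print("flags:", ", ".join(flags))
    print("allocation fields:", allocation)
    print("skip:", skip)
```

For the default of 79 this reports bits 0-3 and 6 set, which agrees with the help text.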



### **VMS representation MostlyNUMA**

#### \$@sys\$examples:rad

Node: XXXX Version: V8.4

System: HP BL870c i2 (1.73GHz/6.0MB)

| RAD | Memory (GB) | CPUs        |
|-----|-------------|-------------|
| 0   | 28.00       | 0-3,16-19   |
| 1   | 28.00       | 8-11,24-27  |
| 2   | 28.00       | 4-7,20-23   |
| 3   | 28.00       | 12-15,28-31 |
| 4   | 15.99       | 0-31        |
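The RAD sizes in this listing are consistent with the MostlyNUMA split from the MEMCONFIG table (1/8 ILM, 7/8 SLM). The sketch below reproduces them under two assumptions that the slide does not state explicitly: 128 GB of installed memory and four processor modules (RADs 0-3), with RAD 4 being the interleaved region.

```python
from fractions import Fraction


def split_memory(total_gb, ilm_fraction, modules):
    """Return (interleaved GB, local GB per processor module) for a policy.

    `ilm_fraction` is the share of memory that is interleaved (ILM);
    the remainder is divided evenly among the processor modules (SLM).
    """
    ilm_gb = total_gb * ilm_fraction
    slm_per_module_gb = (total_gb - ilm_gb) / modules
    return ilm_gb, slm_per_module_gb


if __name__ == "__main__":
    # Assumed configuration: 128 GB total, 4 processor modules, MostlyNUMA.
    ilm, per_module = split_memory(128, Fraction(1, 8), 4)
    print(f"interleaved RAD: {float(ilm):.2f} GB")
    print(f"each local RAD:  {float(per_module):.2f} GB")
```

Under those assumptions the result is 16 GB interleaved and 28 GB per local RAD, close to the 15.99 GB and 28.00 GB reported above.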



### SDA SHOW PFN

sda> show pfn/rad

Page RAD summary

| RAD  | Free pages | Zeroed pages |
|------|------------|--------------|
| 0000 | 0          | 0            |
| 0001 | 233783     | 65535        |
| 0002 | 3538223    | 1 *          |
| 0003 | 3395833    | 3396682      |
| 0004 | 0          | 0            |

There are -3247242 additional pages in the free list

\* An error occurred scanning this list. The count of additional pages given may not be correct.

SDA>



### show rad/pxml

Locality #03 (RAD #02)

- Address: **FFFFF802.ECF22788**
- Size: 000000D8
- Spread: 000001B0
- Average: 000016AA
- Base RAD: 04
- CPU count: 80000008
- CPU bitmap: 00000000.00F000F0
- Memory range(s): 00000001
  - 00000020.00000000:00000026.FFFFFFFF
  - (as PFNs) 00000000.01000000:00000000.0137FFFF
- Total memory: 00000007.00000000 (28672MB)
- RAD preference array: 00000002 00000004 00000003 00000000 00000001
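The CPU bitmap and memory range in this dump can be cross-checked against the RAD listing from `sys$examples:rad`. The sketch below decodes the bitmap into CPU numbers and converts the address range into a size and into PFNs. The helper names are illustrative, and the 8 KB page size is an assumption that happens to match the PFN values shown in the dump.

```python
PAGE_SIZE = 8192  # assumed page size in bytes; consistent with the PFNs in the dump


def parse_hex_pair(text: str) -> int:
    """Convert an SDA-style 'HHHHHHHH.LLLLLLLL' value into an integer."""
    return int(text.replace(".", ""), 16)


def cpus_in_bitmap(bitmap: int) -> list[int]:
    """List the CPU numbers whose bits are set in a CPU bitmap."""
    return [bit for bit in range(bitmap.bit_length()) if (bitmap >> bit) & 1]


def range_size_mb(start: int, end: int) -> int:
    """Size in MB of an inclusive physical address range."""
    return (end - start + 1) // 2**20


if __name__ == "__main__":
    bitmap = parse_hex_pair("00000000.00F000F0")
    start = parse_hex_pair("00000020.00000000")
    end = parse_hex_pair("00000026.FFFFFFFF")

    print("CPUs in this RAD:", cpus_in_bitmap(bitmap))
    print("memory in this RAD:", range_size_mb(start, end), "MB")
    print("PFN range:", hex(start // PAGE_SIZE), "to", hex(end // PAGE_SIZE))
```

The decoded CPUs (4-7 and 20-23) and the 28672 MB total agree with the RAD 2 row of the RAD listing.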



# VMS

- Use the SHOW FASTPATH command and move device interrupts to the low-numbered CPUs.
- Move the dedicated lock manager close to the lock manager spinlock.
- Move TCP/IP close to the TCP/IP spinlock.
- Memory sections
  - Use the /RAD qualifier to allocate reserved memory from a specific RAD.
    - mc sysman add reserved\_section\_name /rad=x
  - Use interleaved memory for shared memory sections that span more than one RAD.
    - This also makes sense for systems running a single Oracle database.
    - /rad=4 on BL890c i2
  - Use local memory for small memory sections.
- Experiment with RAD\_SUPPORT
  - There is no documentation as to what is happening under the hood.
  - Disable it if interleaved memory is used, to reduce unnecessary overhead in MMG.

### Now...Can you explain it??

- Superdome2, 2TB RAM, HP-UX 11.31 (update 7)
- Oracle 11gR2



