
(Re)Designed for High Performance: Solaris and Multi-Core Systems

Paper presented at Multicore World 2013 on how Solaris 11 was (re)designed for high performance.

Multicore World 2013

February 20, 2013

(Re)Designed for High Performance: Solaris and Multi-Core Systems

James C. McPherson
Solaris Modernization group, Solaris Core OS Engineering, Systems Division, Oracle Corporation

January 14, 2013

(The author wishes to thank Blake Jones, Jonathan Adams, BJ Wahl, Liane Praza, Bart Smaalders and Rafael Vanoni for their comments and support.)

Abstract

Solaris 11 provided a rare opportunity to redesign and reengineer many components of the Solaris kernel. Designed to scale efficiently to the largest available systems (currently 256 cores and 64 TB of memory per system on SPARC, a little less on x64), significant work was done to remove obsolete tuning concepts and to ensure that the OS can handle the large datasets (both in memory and on disk) which are now commonplace. The continuing growth in virtualisation plays to Solaris' strengths: with the built-in hypervisor in the multi-threaded, multi-core SPARC T-series we can provide hardware partitions even within a single processor socket. Solaris also features support for Zones (somewhat similar to the BSD Jails concept), which provide soft partitioning and allow presentation of an environment which mimics earlier releases of Solaris. This allows software limited to running on those older releases to obtain some of the performance and observability benefits of the host operating system.

Many years of development and support experience have given the Solaris engineering division an acute awareness that new features must include the capacity to observe them. While we have extensive testing, benchmarking and workload simulation capabilities to help bring a new release to market, building in tools that help customers and support teams diagnose problems in their real-world usage is essential. The Solaris 11 release extended the work done with DTrace in Solaris 10, providing more probe points than ever before. This paper describes some of the changes made to several parts of the operating system in Solaris 11, and the motivations behind those changes.

How did we get here? (or, init())

The Solaris operating system has been designed around multiprocessor systems for over two decades. Starting with support for the Sun SPARCServer 600MP
SMP (4 processors), there was a small increase to support for 20 CPUs, then a jump to 64 in 1997 with the release of the Sun Enterprise 10000 (the "E10k" or Starfire system) and Solaris 2.6. The physical characteristics of the E10k provided a reasonably good upper bound on the performance envelope for CPUs and memory (up to 64 physical CPUs on 16 system boards, and up to 64 GB of RAM), which, when coupled with the hardware partitioning scheme, dynamic reconfiguration and capacity-on-demand, provided scope for early steps in virtualisation development.

The next eight years saw incremental tuning for scale: support for more RAM, new I/O busses (SBus and UPA gave way to PCI and PCI-X) and more emphasis on 64-bit operation. Dual-core SPARC IV+ CPUs first shipped in 2004, closely followed in 2005 by the throughput-oriented, chip multi-threading UltraSPARC T series ("Coolthreads", but more commonly known as "Niagara"). The initial release of the T series had 4, 6 or 8 cores with 4 hardware threads per core, the memory controller and a crossbar switch between the cores on the die, but only one floating-point unit in the socket. The architecture name changed too: sun4v, with the 'v' reflecting the hypervisor capabilities designed in from the start. Later incarnations of the T series added on-die 10 Gigabit Ethernet, PCI Express controllers, hardware crypto engines and support for SMP operation (four sockets) without extra glue circuitry.

Year   NCPU   Max RAM         Solaris version   SPARC architecture
1992      4   1 GB            Solaris 2.1       sun4m
         20   5 GB                              sun4d
1997     64   64 GB           Solaris 2.6       sun4u (Starfire)
2000    512   1152 GB         Solaris 8         sun4u (Serengeti)
2007     64   64 GB (T2000)   Solaris 10        sun4v (Coolthreads)
2007    256   4 TB (M9000)    Solaris 10        sun4u (M series/OPL)
2009    256   1 TB (T4)       Solaris 10        sun4v

The progression is clear: scale within the socket first, then scale outside it, then bring that scale back inside the socket as the contents of the socket become more capable.

Virtual Memory

The same forces that have pushed microprocessor capability and scaling over the past several years have pushed memory capacity as well. Today's entry-level servers can have nearly as much memory as Sun's highest-end "Starfire" system from fifteen years ago. To deal with this increased memory capacity, many microprocessors have added support for large memory pages. Large pages increase the amount of memory that a CPU can reach without delay by a factor of 512 or much more; they are critical for the performance of workloads with large memory footprints.
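
For readers who want to see this on a live system, here is a minimal sketch (an illustrative addition, not from the original paper): pagesize(1) reports the page sizes the platform supports, and pmap(1) with the -s option shows which page size each of a process's mappings is actually using. The exact sizes and output layout vary by platform.

$ pagesize -a        # list every page size the platform supports, smallest first
$ pmap -s $$         # per-mapping page size ("Pgsz" column) for the current shell
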
Solaris has adapted to these changes over the years. Since the growth in memory capacity has been at least as intense as the growth in CPU capability, the Solaris virtual memory system has been at the forefront of many incremental scalability improvements. The most important features added to the virtual memory system prior to Solaris 11 were the Memory Hierarchy Awareness (i.e. NUMA) and Multiple Page Size Support projects. These features, added in Solaris 9, allowed applications much more control over NUMA placement and the use of large pages.

Although these features were critical for dealing with some of the biggest changes in system memory usage over the past fifteen years, many parts of the virtual memory system were not designed with them in mind. For example, the large page support had its own subtle performance bottleneck built in, as many software operations on large pages actually required operating on a collection of small pages. When a large page was only 512 times larger than a small page, this wasn't a big problem. But modern systems have large pages that are made up of a quarter-million small pages, and future systems threaten to stretch that ratio yet further.

Solaris 11 has taken substantial steps toward a more durable solution to these problems. The core of the virtual memory system has been changed so that software operations on large pages are just as efficient as operations on small pages. We collect much more information on memory usage within each NUMA partition and for each page size, and we use that information to make better-informed decisions about how to create large pages. And since OS virtualization has become ubiquitous, we have laid the groundwork to make memory hot-add and hot-remove events much more robust.

There is more work in progress that will allow applications to take full advantage of the new virtual memory system. But we have already seen upwards of 40% improvement on some networking throughput benchmarks due to better control over memory placement, and it has become easier to innovate in this part of the operating system.

CPUs

The first SPARC T-series systems were released during the process of Solaris 11 development. The T-series CPUs in these systems inspired many changes in Solaris. First, these CPUs were the first in the new "sun4v" system architecture, which assumes the existence of a hypervisor. Solaris uses the hypervisor on these systems to perform very CPU-specific operations, such as programming the MMU or fixing a memory ECC error; the hypervisor also provides the standard virtualization features such as running, managing, and migrating guest operating systems.

The T-series CPUs were also much more heavily multi-threaded and multi-core than any that Solaris had previously run on, and as a result they prompted several innovations in the thread scheduler. Solaris maintains a model of the topology of the system's CPUs. This model describes not only how the CPUs are connected with one another, but also how various resources (such as caches or execution pipelines) are shared between various parts of a single CPU. The scheduler uses this model to maximize cache warmth and minimize contention for CPU-internal resources when scheduling threads.
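
The socket/core/thread layout that this model captures can be inspected directly from a shell. As a rough sketch (an illustrative addition, not from the original paper), psrinfo(1M) prints the physical-processor view that the dispatcher's topology model is built from:

$ psrinfo            # one line per virtual processor, with its state
$ psrinfo -pv        # physical processors, the cores in each, and the threads per core
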
Finally, the recent SPARC T4 CPUs have a feature where a core can execute a thread more quickly if there is only one thread of execution running on it. The "critical threads" project leveraged the traditional Unix notion of thread priority to automatically determine whether a thread should be able to use a whole core by itself, in order to take advantage of this speedup.

Solaris 11 removed support for two major CPU implementations: the 32-bit-only x86 family, and the UltraSPARC-I, -II, -III and -IV family (collectively, the sun4u family as implemented by Sun). These implementations restricted our ability to virtualise the OS in hardware, do not support the optimised, extended instructions (such as those for on-chip crypto engines and efficient 64- and 128-bit operations) which made a measurable difference to system performance, and prevented efficient operation with the large address spaces which we need to support for future growth.

(Note: there is no hardware support for virtualization in the UltraSPARC-I/II/III/IV family. The hardware partitioning support available in the Enterprise and Sun Fire families is limited to per-system-board granularity, and configuration and management was done via a supervisor which inhabited the LOM (Lights Out Management). With the industry's increased focus on virtualization, this was another reason to remove support for booting and running the Solaris 11 kernel on this family of CPUs and hardware.)

The Scheduler

The introduction of the sun4v architecture provided an opportunity to rewrite the scheduler. From its inception, Solaris has provided a facility for the system administrator to define preferred scheduling timeslice quanta (see dispadmin(1M)). The manual page for the command warns that "inappropriate values can have a negative effect on system performance", which was encountered too often during the dot-com bubble of 1997-2000. This in turn led to the development of the Solaris Fair Share Scheduler (FSS) and influenced our understanding of the use-cases to check when testing changes. [Oram and Wilson(2007)] [Vanoni(2012)]

(Note: one specific case which the author was involved in analysing was summarised by the bug synopsis "Low priority TS (timeshare class) threads on a sleep queue can be victimized". In that case, large CPU/memory and I/O configurations hit a single writer lock for a filesystem module. The workaround of forcing UFS DirectIO was unpalatable for long-term operation; a proper solution was called for.)

A typical Solaris 11 system has seven scheduler classes available:

Class   Description
SYS     System Class
TS      Time Sharing
SDC     System Duty-Cycle Class
FX      Fixed Priority
IA      Interactive
RT      Real Time
FSS     Fair-Share Scheduler
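
As a quick, hedged illustration (not part of the original paper), the classes configured on a running system, their priority ranges, and the time-sharing dispatch table whose quanta the text refers to can all be listed with standard utilities:

$ priocntl -l            # scheduling classes and their priority ranges
$ dispadmin -l           # classes currently configured in the running kernel
$ dispadmin -c TS -g     # dump the time-sharing dispatcher parameter table
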
The final class, FSS, differs from the traditional timesharing class (TS) in that the goal of the class is to enforce explicit allocation of the available CPU cycles to projects.

System processes and threads have been made visible by naming them and placing them in the System Duty-Cycle (SDC) class:

$ ps -o pid,class,pri,time,comm -p 1,898,3719,6000,25930
  PID CLS PRI     TIME COMMAND
    1  TS  59    00:27 /usr/sbin/init
  898 SDC  99 03:28:52 zpool-sink
 3719  TS  59    16:21 /usr/java/bin/java
 6000  IA  59    00:52 prstat
25930  IA  59    44:35 /opt/local/firefox-17/firefox

This enhances observability by enabling the system administrator to see how much CPU time is being used by, for instance, the controlling thread for your imported zpools.

Coupled with the insight provided by aggressive use of DTrace, use of libmicro, and early access to Niagara-class systems, we determined that switching hardware threads across cores incurred a more significant performance cost than had been expected. Rewriting the scheduler and thread dispatcher to avoid this switching played well with the optimisations which the NUMA group were working on, based around locality groups ("lgroups"). By giving each core its own dispatch queue and decreasing the likelihood of the scheduler migrating a thread outside of its core (making the thread sticky), it was a fairly simple effort to make the same condition true for an lgroup, and then to start thinking of an in-socket core as its own lgroup as well.
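
As a hedged aside (an illustrative addition, not from the original paper), the lgroup hierarchy described above can be examined from a shell with lgrpinfo(1), and a process's home lgroup can be queried with plgrp(1):

$ lgrpinfo           # lgroup hierarchy, with the CPUs and memory in each lgroup
$ plgrp $$           # home lgroup of the current shell
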
On a related note, work to improve the hyperthreading (aka SMT, or "simultaneous multithreading") support on Intel and AMD multi-core processors gave a surprising result: using the HALT instruction in preference to SLEEP improved single-threaded workloads by about 30%, as well as decreasing the CPU's power needs.

Device drivers: networking and storage

For many years now Solaris has provided well-designed frameworks and APIs to assist in writing device drivers [Oracle Corporation(2012a)]. The network framework is called GLD (Generic LAN Device), and the storage framework is called SCSA (Sun Common SCSI Architecture). Both of these underwent major revisions during the Solaris 11 development project. In both cases the enhancements were designed to pull in support for new technology (such as second-generation SAS, 10 Gb and faster Ethernet, and InfiniBand) while at the same time simplifying the work that drivers need to do. Virtualisation was designed in, ranging from pervasive multipathing and target-mode operation on HBAs to virtual NICs and support for converged network adapters.
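
To make the virtual-NIC point concrete, here is a minimal sketch (my own illustration, not from the paper) using dladm(1M), the Solaris 11 datalink administration command; the link name net0 and the VNIC name vnic0 are assumptions for illustration:

# dladm show-phys                   # physical datalinks registered with GLD
# dladm create-vnic -l net0 vnic0   # carve a virtual NIC out of net0
# dladm show-vnic                   # list the virtual NICs now present
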
The Network

The original Solaris network stack was designed and implemented when the interconnected world was much simpler and slower. For years, the use of STREAMS had provided a flexible and performant architecture. Yet by the time we were picking through the pieces of the dot-com bubble and looking at how we could improve our network stack, many of those features had grown other features on top of or alongside them. [Smaalders(2006)] The result was a very complex and opaque stack with many knobs to tune, large stack depths (commonly 15 calls, though up to 22 were seen under some conditions) and performance which was often an order of magnitude worse than Linux under comparable conditions.

Fixing this required work at all levels of the stack, measuring each change against just two metrics: latency and throughput. While it is possible to simply attack the number of instructions required to process a packet, that does not always improve either of these two metrics. The increase in network device wire speed meant that we could not do all the processing in interrupt context, and the stream plumb/unplumb cost, coupled with message passing between the tcp and ip modules, was not "shrink to fit" in any way, shape or form. (The phrase "shrink to fit" achieved great popularity within Solaris engineering as we started to remove excess baggage from our engineering processes.)

To enable the high-performance and sustainable network stack which Solaris deserved (and which we very clearly needed), the multi-pronged rewrite stuck to two basic design principles: data locality (packets for the same connection are processed on the same core wherever possible) and use of a function-call API rather than message passing. Enforcing the first blended in very well with the afore-mentioned work from the NUMA group on memory placement optimisation (MPO) and the scheduler. Enforcing the second decreased the general packet processing time, and thus latency, by a significant percentage, enabling better throughput and increasing system idle time as well.
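
Those two metrics can be watched directly on a running system. As a rough, hedged sketch (my own, not the paper's, and assuming the tcp DTrace provider shipped with Solaris 11), this one-liner counts TCP send and receive events each second as a crude throughput indicator:

# dtrace -n 'tcp:::send, tcp:::receive { @[probename] = count(); } tick-1sec { printa(@); trunc(@); }'
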
Storage

Solaris 11 started development when switched 4 Gb Fibre Channel fabric was the peak of storage interconnect technology. While 8 Gb fabric came along very shortly afterwards, we did not see it in production for another year or so, while vendors ramped up their offerings. At that point, Solaris' Fibre Channel framework had already been split apart from SCSA, and driver development was (and still is) mostly done by two IHVs, QLogic and Emulex. The "Leadville" layer allowed Solaris to provide a consistent multipathing implementation with the QLogic and Emulex drivers underneath, obviating the need for third-party implementations such as EMC's PowerPath or Veritas (now Symantec) Dynamic Multipathing, vxdmp.

On a cosier level (direct connection within a host), Solaris' support for the first generation of Serial Attached SCSI (SAS) controllers arrived in April 2007. This driver (for the LSI 1064/1068E chip) also supported parallel SCSI, as does the chip itself. While we designed and implemented this support (the author was part of the four-strong team which delivered it), we were very conscious that the need to deliver the product militated against cutting out parallel SCSI. That task was left to a follow-on project which delivered support for second-generation SAS with a new chip from LSI. This subsequent project was worked in parallel with the effort to update the SCSA framework to properly support the features which were showing up on the SAS and Fibre Channel protocol roadmaps.

Central to the new version of SCSA are the concepts of multipathing and hotplug (sometimes known as "Dynamic Reconfiguration"). Prior to Solaris 10's release, an effort was made to enable pervasive, always-on multipathing. There was insufficient design-time coordination between the install group and the multipathing group, and it was believed that the feature's impact on customers would be unacceptable. The feature was disabled by default (except for FC-attached devices on x86/x64 systems) and a utility was written to allow turning multipathing on if desired. Leaving the Solaris-using community stuck in a world of static topology using parallel SCSI nomenclature held us back in delivering new storage features, so the SCSA update during the Solaris 11 cycle made this a central goal. This time, however, there was significant redesign and reimplementation work occurring in the installation space (the Automated Installer and Distro Constructor) and the packaging space (the Image Packaging System, IPS), allowing cooperation to occur at the right time.

Hotplug operation was held back for years by limitations imposed by the parallel SCSI protocol. People who used Sun's six- or twelve-disk SCSI MultiPacks in clusters will recall the difficulties with cable lengths, termination and reservations, and the pain caused when you had to physically replace a disk. Fibre Channel switched fabrics and SAS fabrics have uncoordinated (dynamic) target and logical unit discovery designed in to the protocol. Once we no longer had to worry about supporting parallel SCSI hardware, we were able to forge ahead with a correct implementation of dynamic discovery.

System Topology, or, why is that LED blue?

A perhaps-unexpected pairing of technologies, SCSAv3 and the Solaris Fault Management Architecture (FMA), turned into a really interesting usability feature: system topology. We've known for a very long time what the relationship is between a device path and its physical position in a system (assuming that it's a Sun/Oracle-branded piece of hardware). The system designers assign addresses to each slot and know which hardware is installed in that slot. Coordinating code efforts with the designers allows us to match up this information and change LED states, or deliver telemetry from the hardware to the fault management daemon, fmd. The path from the hardware up to the fault management daemon makes use of dynamic discovery. The daemon notifies subscribers to classes of events, minimising the need for downtime to physically explore the system in the event of an error.
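
As a brief sketch of the administrator's view (not taken from the paper), the faults fmd has diagnosed, and the telemetry behind them, can be reviewed with the standard fault-management utilities:

# fmadm faulty      # resources the fault manager currently considers faulty
# fmdump            # one-line summary of each diagnosed problem
# fmdump -e         # the underlying error telemetry (ereports) received by fmd
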
This output from fmdump -v -u (edited for clarity) illustrates the principle:

Oct 22 14:31:17.7232 a6539f33-f21f-4417-94df-9e573a107e7e FMD-8000-6U Resolved
  100% fault.io.scsi.cmd.disk.dev.rqs.derr        Replaced
       Problem in: hc://:chassis-mfg=Sun-Microsystems
         :chassis-name=SUN-Storage-J4200:...
         :chassis-serial=0848QAJ001:fru-serial=0821T4P519--------3LM4P519
         :fru-part=SEAGATE-ST330055SSUN300G:fru-revision=0B92
         :devid=id1,sd@n5000c5000b21f4e3/ses-enclosure=1/bay=3/disk=0...
       FRU Location: SCSI Device 3

We're given a timestamp, an event UUID, its classification in the FMA scheme and its status (Resolved). The 100% is the confidence level that the diagnosis engine has in its determination of the cause of the fault, and the system administrator action (Replaced) shows what has occurred. We're then provided with a more human-friendly piece of information: the type of chassis where the problem was observed, its serial number, and the FRU's (field replaceable unit's) serial number and part number. We supply the devid, which uniquely identifies the device to the storage stack, the fact that it is located in bay 3 of enclosure #1, and finally the silkscreen label that you would look for on the chassis. This particular event is associated with a disk replacement, so there is an equivalent message from the ZFS diagnosis engine, which informs the user of which pool is involved and which devid is affected.

The system topology work started with the delivery of FMA in Solaris 10, but accelerated for Solaris 11 with the FISHworks (Fully-integrated Software and Hardware Works) project, a stealth project started by the DTrace team to produce a storage appliance based on Solaris. Each Sun/Oracle-branded system, as well as those from Fujitsu, has its hardware topology built into the fault management system. To provide a better customer experience, we also have support for some generic hardware, as long as that hardware supports certain generic inquiry methods. [T10.org(2009)] [T10.org(2008)] [T10.org(2012)]

(Note: for those wishing to experiment, ensure that you have pkg:/system/fault-management installed and, using a privileged account (root or sudo), walk through the output from /usr/lib/fm/fmd/fmtopo. You might want to use a pager such as /bin/less.)

As a further example, here is the (slightly edited) output from two instances of the diskinfo command. The first is from an x64 host with a J4200 JBOD attached, the second from a SPARC T3-1:

D:devchassis-path                                    c:occupant-compdev
---------------------------------------------------  ------------------
SUN-Storage-J4200.0848QAJ001/SCSI_Device__0/disk      c11t43d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__1/disk      c11t26d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__2/disk      c11t38d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__3/disk      c11t42d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__4/disk      c11t29d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__5/disk      c11t30d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__6/disk      c11t41d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__7/disk      c11t32d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__8/disk      c11t40d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__9/disk      c11t39d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__10/disk     c11t35d0
SUN-Storage-J4200.0848QAJ001/SCSI_Device__11/disk     c11t36d0

D:devchassis-path             c:occupant-compdev
---------------------------   ---------------------
/dev/chassis//SYS/HDD0/disk   c4t5000CCA0153966B8d0
/dev/chassis//SYS/HDD1/disk   c4t5000CCA012633C24d0
/dev/chassis//SYS/HDD2/disk   c4t5000CCA01263C824d0
/dev/chassis//SYS/HDD3/disk   c4t5000CCA0125DC1A8d0
/dev/chassis//SYS/HDD4/disk   c4t5000CCA0153A0B7Cd0
/dev/chassis//SYS/HDD5/disk   c4t5000CCA0125DDE44d0
/dev/chassis//SYS/HDD6/disk   c4t5000CCA0125BF590d0
/dev/chassis//SYS/HDD7/disk   c4t50015179594F5DD2d0

The J4200's occupant names show that we do not have multipathing enabled for the controller which this JBOD is attached to, whereas the disks in the T3-1 are enabled for multipathing. There is physically only one path to those disks, but the framework handles this seamlessly.

We also make use of the system topology information at the level of the format command, to remove doubt and misunderstanding about which device is being operated on:

AVAILABLE DISK SELECTIONS:
  0. c4t5000CCA0153966B8d0 <SUN300G cyl 46873 alt 2 hd 20 sec 625>
     /scsi_vhci/disk@g5000cca0153966b8
     /dev/chassis//SYS/HDD0/disk
  1. c4t5000CCA012633C24d0 <HITACHI-H106060SDSUN600G-A2B0-558.91GB>
     /scsi_vhci/disk@g5000cca012633c24
     /dev/chassis//SYS/HDD1/disk
  2. c4t5000CCA01263C824d0 <HITACHI-H106060SDSUN600G-A2B0-558.91GB>
     /scsi_vhci/disk@g5000cca01263c824
     /dev/chassis//SYS/HDD2/disk
  ...

Observability

It is widely recognised that unless you know what is happening in a problem space, you cannot do anything to make a meaningful change to that problem. [Tregoe and Kepner(1997)] [McDougall et al.(2007)] [Gregg(2012)] [Gregg(2010)] Having a tool like DTrace at our disposal meant that Solaris engineering and support staff could peek under the covers of the
system to observe what was really happening. With DTrace we could clearly see just which locks were hot, which low-level code paths were ripe for hand-tuning, and just how long interrupt service routines took to process data and to handle success and failure. As noted earlier, we made use of DTrace at every level of the stack, from timing kernel functions using the fbt (function boundary tracing) provider up to seeing per-instruction call paths in user code.

As the development cycle progressed, engineering teams became more familiar with turning to DTrace to investigate and measure problems. The older and heavyweight TNF tracing system was removed entirely, and we observed a decrease in the number of times that debugging a problem required booting a debug kernel. Since DTrace is so lightweight (no probe effect when it is not enabled, and minimal effect otherwise), we could run new nightly or biweekly kernel builds with less overhead, with significantly increased confidence that we were solving the same problem that our internal end users were seeing. More probes were added to every driver; frameworks such as GLD and SCSA were enhanced to use the feature, and some driver error processing was rewritten to use DTrace static probes rather than custom logging which would traditionally end up in syslog, with a noticeable performance cost.

It is very important to remember that DTrace is not merely an engineering or software development tool for operating system engineers. Since it was designed to help us solve customer problems, we have been delighted to see it used in anger, whether by field support staff or by customer system administrators and developers, illuminating customer problems when they occur rather than requiring an outage to reboot to a custom debug kernel which may or may not have helped gather the necessary data. With that data available, we have been able to solve problems for the Solaris 11 release and fix the customer's problem with a backport to Solaris 10.

Assisting in that effort has been the DTrace Toolkit created by Brendan Gregg before he joined the FISHworks group. The toolkit provides a collection of example scripts which are easily accessible to every user, in several cases showing how to do with DTrace what you might have done using truss in previous releases. (The toolkit is available from http://brendangregg.com/dtrace.html; you can install it in Solaris 11 with 'pkg install pkg:/system/dtrace/dtrace-toolkit', or download a tarball from http://brendangregg.com/dtrace.html#DTraceToolkit.)
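
As a small, hedged example of the kind of one-liner this enables (my own, not the paper's), the syscall provider summarises activity by process, and the fbt provider can time any kernel function; zfs_read is used below purely as an illustrative function name:

# dtrace -n 'syscall:::entry { @[execname] = count(); }'
# dtrace -n 'fbt::zfs_read:entry { self->ts = timestamp; }
    fbt::zfs_read:return /self->ts/ { @["ns"] = quantize(timestamp - self->ts); self->ts = 0; }'

The first counts system calls by process until interrupted; the second prints a latency distribution, in nanoseconds, for each call to the named kernel function.
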
Filesystems

In addition to the other areas where Solaris 11 broke with the past, the standard filesystem changed too. Whereas previous versions had installed to the now-venerable (shorthand for crufty, unreliable and much hacked-upon) UFS, Solaris 11 installs only to ZFS. UFS is still supported as a filesystem (you can even create a UFS filesystem on a ZFS dataset if you wish), but it is no longer preferred. ZFS' advantages in data integrity, extensibility, ease of maintenance (for the customer) and performance rendered all arguments supporting installation to and booting from UFS irrelevant.

Taking a step back from what we have come to know about filesystems, it is helpful to ask what function they really serve. Once you get past "to store my data", their raison d'être is to provide read and write access to blocks of data. Nothing more, nothing less. All that your application should care about is whether the correct data is there when requested. Not only should the filesystem take care of everything else, but the system administrator should not need to know intimate details of disk labels, partition tables, how many files can be stored on the filesystem, whether a snapshot of the filesystem can be taken (and if so, how long it will take and how much space it will need), and so on. ZFS solves these pain points elegantly and simply. [Brown(2007)] [Bonwick(2007)]

By removing the volume manager layer present in other products (primarily the older Solaris Volume Manager, SVM, and Veritas Volume Manager, VxVM), ZFS simultaneously removes a layer of administrative complexity and a layer of code complexity. The volume manager layer ties filesystems to specific sizes determined at creation time, and gets in the way if you wish to virtualise. Treating the pooled storage in a similar manner to memory (add a new DIMM pair and it is immediately available to the system) and allowing the system to manage how and where it puts data onto physical media just makes more sense.

By integrating ZFS with Solaris FMA and other features found in Solaris 11, such as delegated administration, iSCSI, NFS and SMB/CIFS sharing, the filesystem becomes almost boring. It provides the data you need, where and when you need it. It is fast and easy to manage (you need to remember just two commands) and it gets out of your way.

The first command to remember is /usr/sbin/zpool:

blinder:jmcp $ zpool list rpool
NAME   SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool  136G   113G  23.9G  82%  1.00x  ONLINE  -

blinder:jmcp $ zpool status rpool
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 4h35m with 0 errors on Sat Sep 15 17:40:23 2012
config:

        NAME                         STATE   READ WRITE CKSUM
        rpool                        ONLINE     0     0     0
          mirror-0                   ONLINE     0     0     0
            c5t5000CCA00510A7CCd0s0  ONLINE     0     0     0
            c8t1d0s0                 ONLINE     0     0     0

errors: No known data errors

The second is /usr/sbin/zfs:

blinder:jmcp $ sudo zfs create -o mountpoint=/help rpool/example
Password:
blinder:jmcp $ zfs list -r rpool
NAME               USED  AVAIL  REFER  MOUNTPOINT
rpool              113G  21.7G  3.17M  /rpool
rpool/ROOT         112G  21.7G    21K  legacy
rpool/ROOT/s12_09  22.8M 21.7G  12.0G  /
rpool/ROOT/s12_10  112G  21.7G  12.4G  /
rpool/VARSHARE     15.5M 21.7G  15.5M  /var/share
rpool/example       31K  21.7G    31K  /help

There is no need to add a separate entry to /etc/vfstab for the new dataset: it is mounted at creation time at the given mountpoint and will be remounted there on every reboot. Further examples of zpool and zfs usage can be found in the ZFS Administration Guide (http://docs.oracle.com/cd/E19082-01/817-2271/index.html).
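
To follow the snapshot thread from earlier, here is a short, hedged sketch (mine, not the paper's) against the same example dataset; the snapshot name is an arbitrary illustration. Snapshots are created, listed and rolled back with the same zfs command:

blinder:jmcp $ sudo zfs snapshot rpool/example@before-changes
blinder:jmcp $ zfs list -t snapshot -r rpool/example
blinder:jmcp $ sudo zfs rollback rpool/example@before-changes
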
Virtualization

When discussing virtualization we most often talk about hardware virtualization solutions such as OVM, Oracle VM for SPARC (LDoms), VirtualBox, LPARs and VMware's ESXi. While there was a project during the Solaris 11 development cycle to port Xen to Solaris for the x64 platform, it was discontinued and never shipped. Solaris Logical Domains continue to receive the majority of the hardware virtualization work; the sun4v hypervisor layer provides a very convenient way for Oracle to get customers running old SPARC systems on to newer hardware, and a gateway to running their applications on newer versions of Solaris. For x64 systems Oracle offers Oracle VM Server (OVM), a Xen implementation, for which Solaris 11 has support for running as a domU. There is also Oracle VM VirtualBox; many Solaris developers run Solaris inside VirtualBox as well as on bare metal. VirtualBox makes it easy to spin up a new environment to mimic a customer environment, or to test a fix.

Logical Domains, OVM domU and VirtualBox are not the only virtualisation options available with Solaris. The Solaris Zones feature, present in Solaris 10 for both architectures, has been expanded to include support for "n-1" zones on SPARC systems: you can configure a zone to look like a system running Solaris 8, 9 or 10. Again, this is primarily a method of getting customers onto newer hardware and providing the upgrade gateway. Zones are a fast and lightweight method of consolidating many systems onto one; installing the minimum required software for a new zone takes as little as a few minutes. The Image Packaging System makes use of ZFS snapshots to create and update zones, so your zones will stay in sync with your global zone boot environment.

At a lower level, virtualisation support is built in to ZFS (pooled storage), iSCSI, Fibre Channel target-mode HBA operation, PCI Express Single Root I/O Virtualization (SR-IOV), InfiniBand RDMA, and virtual network interfaces (VNICs) from the GLD framework.
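
As a minimal, hedged sketch of that workflow (my own; the zone name webzone and its zonepath are illustrative assumptions), a zone is configured with zonecfg(1M), then installed from IPS and booted with zoneadm(1M):

# zonecfg -z webzone
zonecfg:webzone> create
zonecfg:webzone> set zonepath=/zones/webzone
zonecfg:webzone> exit
# zoneadm -z webzone install     # pulls the minimum required packages via IPS
# zoneadm -z webzone boot
# zlogin webzone                 # log in to the running zone
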
fini()

Solaris 11 represents the culmination of close to eight years of development effort by more than 1000 engineers. In that time the industry underwent significant upheaval, changing focus from straight-line performance to throughput. By identifying obsolete assumptions and cutting out needlessly complex code we laid the foundation for reinventing Solaris as the performance platform of choice. The result is an operating system that scales from one core to thousands, from one gigabyte of RAM to terabytes, with fault management and observability features which are unrivalled. Solaris 11 is built for high-performance, multi-core operation from the silicon to the application. Why aren't you running your application on top of it?

References

[Oram and Wilson(2007)] A. Oram and G. Wilson (Eds.), Beautiful Code, O'Reilly & Associates, Inc., Sebastopol, CA, 2007.

[Vanoni(2012)] R. Vanoni, Extending the semantics of scheduling priorities, ACM Queue 10 (6), 2012. URL http://queue.acm.org/detail.cfm?id=2282337

[Oracle Corporation(2012a)] Oracle Corporation, Writing Device Drivers, vol. 819-3196-13 of the Oracle Solaris 11 Documentation Library, Oracle Press, Redwood City, CA, 2012. URL http://docs.oracle.com/cd/E23824_01/html/819-3196

[Smaalders(2006)] B. Smaalders, Performance anti-patterns, ACM Queue 4 (1) (2006) 44-50.

[T10.org(2009)] T10.org, Serial Attached SCSI - 2 (SAS2), 2009. URL http://www.t10.org/cgi-bin/ac.pl?t=f&f=sas2r16.pdf

[T10.org(2008)] T10.org, SCSI Enclosure Services - 2 (SES2), 2008. URL http://www.t10.org/cgi-bin/ac.pl?t=f&f=ses2r20.pdf

[T10.org(2012)] T10.org, SCSI Primary Commands - 4 (SPC4), 2012. URL http://www.t10.org/cgi-bin/ac.pl?t=f&f=spc4r36e.pdf

[Tregoe and Kepner(1997)] C. H. Tregoe and B. B. Kepner, The New Rational Manager: An Updated Edition for a New World, Princeton Research Press, Hillsborough, NJ, 1997.

[McDougall et al.(2007)] R. McDougall, J. Mauro and B. Gregg, Solaris Performance and Tools: DTrace and MDB Techniques for Solaris 10 and OpenSolaris, Sun Microsystems Press/Prentice Hall, Upper Saddle River, NJ, 2007. ISBN 0-13-156819-1.
[Gregg(2012)] B. Gregg, Thinking Methodically about Performance, ACM Queue 10 (12) (2012) 40. URL http://doi.acm.org/10.1145/2405116.2413037

[Gregg(2010)] B. Gregg, Visualizing System Latency, ACM Queue 8 (5) (2010) 30. URL http://doi.acm.org/10.1145/1794514.1809426

[Brown(2007)] D. Brown, A Conversation with Jeff Bonwick and Bill Moore, ACM Queue 5 (9), 2007. URL http://queue.acm.org/detail.cfm?id=1317400

[Bonwick(2007)] J. Bonwick, ZFS, in: LISA, USENIX, 2007. URL http://www.usenix.org/events/lisa07/htgr_files/bonwick_htgr.pdf

[Cantrill et al.(2004)] B. Cantrill, M. W. Shapiro and A. H. Leventhal, Dynamic Instrumentation of Production Systems, in: USENIX Annual Technical Conference, General Track, USENIX, 2004, 15-28. URL http://www.usenix.org/publications/library/proceedings/usenix04/tech/general/cantrill.html

[Bonwick(2004)] J. Bonwick, 128-bit storage: are you high?, 2004. URL https://blogs.oracle.com/bonwick/entry/128_bit_storage_are_you

[Cantrill(2006)] B. Cantrill, Hidden in plain sight, ACM Queue 4 (1) (2006) 26-36. URL http://doi.acm.org/10.1145/1117389.1117401

[Cantrill and Bonwick(2008)] B. Cantrill and J. Bonwick, Real-world concurrency, Communications of the ACM 51 (11) (2008) 34-39. doi:10.1145/1400214.1400227

[Cantrill et al.(2008)] B. M. Cantrill, M. W. Shapiro and A. H. Leventhal, Solaris Kernel Development. URL http://www.sun.com/bigadmin/content/dtrace/dtrace_usenix.pdf

[Oracle Corporation(2012b)] Oracle Corporation, Oracle Solaris Administration: ZFS File Systems, vol. 821-1448-12 of the Oracle Solaris 11 Documentation Library, Oracle Press, Redwood City, CA, 2012. URL http://docs.oracle.com/cd/E23824_01/html/821-1448/

[Oracle Corporation(2011)] Oracle Corporation, Oracle Solaris 11 11/11 - What's New, 2011. URL http://www.oracle.com/technetwork/server-storage/solaris11/documentation/solaris11-whatsnew-201111-392603.pdf

[McDougall and Mauro(2007)] R. McDougall and J. Mauro, Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture, second edn., Sun Microsystems Press/Prentice Hall, Upper Saddle River, NJ, 2007. ISBN 0-13-148209-2.
[Teer(2005)] R. Teer, Solaris Systems Programming, Addison-Wesley, 2005. ISBN 0-201-75039-2.

[Zhang et al.(2010)] Y. Zhang, A. Rajimwale, A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau, End-to-end Data Integrity for File Systems: A ZFS Case Study, in: R. C. Burns and K. Keeton (Eds.), FAST, USENIX, 2010, 29-42. URL http://www.usenix.org/events/fast10/tech/full_papers/zhang.pdf