Developing for illumos

My talk at SCALE 10x (Jan 2012) presenting some topics about developing (and debugging) for illumos -- especially for folks working with kernel software and subsystems unique to illumos

Garrett D'Amore

March 16, 2012
Transcript

  1. Overview of illumos

    - illumos descended from Solaris & OpenSolaris
    - POSIX & “real” UNIX derived
    - Distros add other bits (OpenIndiana, NexentaStor, StormOS, etc.)
    - illumos is the OS core - kernel, drivers, core libraries, key system utilities

    Friday, March 16, 12
  2. Documentation Resources

    - manual pages - note that sections are numbered differently ... e.g. 1m instead of 8 for admin pages
    - Google!
    - A lot of printed matter
    - Sadly, docs.sun.com is gone now. :-(
    - We need help documenting more!
  3. Userspace Development

    - It's just POSIX (mostly)
    - POSIX is the open standard for UNIX-like operating systems
    - Linux tries to be POSIX, but beware of “embrace and extend”
    - Portable applications stick to POSIX
    - Latest POSIX not fully supported
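Since the latest POSIX revision may not be fully supported, a portable program can ask the system which revision it claims instead of assuming one. The sketch below uses only `sysconf(_SC_VERSION)` from POSIX; the helper names `posix_runtime_version` and `posix_at_least` are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L   /* ask the headers for POSIX interfaces only */

#include <assert.h>
#include <unistd.h>

/*
 * Returns the POSIX.1 revision the running system claims to support
 * (for example 200112L or 200809L), or -1 if it cannot be determined.
 */
long
posix_runtime_version(void)
{
	return (sysconf(_SC_VERSION));
}

/*
 * Returns nonzero if the system claims at least revision `wanted`.
 * (Hypothetical helper, not part of any standard API.)
 */
int
posix_at_least(long wanted)
{
	long have;

	assert(wanted > 0);
	have = posix_runtime_version();
	return (have != -1 && have >= wanted);
}
```

A caller could test `posix_at_least(200809L)` and fall back to an older interface when it returns zero.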
  4. X11/Gnome/KDE/etc.

    - These all come from the distro!
    - OpenIndiana is Gnome based
    - Check your distro docs
    - Probably like your fav. Linux distro
    - Really, not part of illumos core! :-)
  5. Services

    - Traditional init(8) and inetd(8) are replaced by SMF
    - Service is described via XML
    - Service configuration stored in SQLite
  6. Example SMF Manifest

        <service_bundle type='manifest' name='SUNWckr:intrd'>
        <service name='system/intrd' type='service' version='1'>
            <create_default_instance enabled='false' />
            <single_instance/>
            <dependency name='milestone' grouping='require_all' restart_on='none' type='service'>
                <service_fmri value='svc:/milestone/multi-user' />
            </dependency>
            <exec_method type='method' name='start' exec='/lib/svc/method/svc-intrd' timeout_seconds='60' />
            <exec_method type='method' name='stop' exec=':kill' timeout_seconds='10' />
            <stability value='Unstable' />
            <template>
                <common_name>
                    <loctext xml:lang='C'> interrupt balancer </loctext>
                </common_name>
                <documentation>
                    <manpage title='intrd' section='1M' manpath='/usr/share/man' />
                </documentation>
            </template>
        </service>
        </service_bundle>
  7. Interface Stability

    - Solaris was very careful to provide interface guarantees
    - But only if documented “stable” interfaces are used
    - Googling for PSARC cases can help
    - “Bundled” code can use more interfaces (“Consolidation Private” in PSARC speak.)
  8. Interfaces Unique to Solaris & illumos

    - STREAMS - message oriented API
    - sysevent - used to receive events from kernel
    - doors (ported to Linux) - very fast local RPC/IPC
    - libdevinfo - device tree
  9. Source Tree Organization

    - usr/src - top level
    - tools - contains tools to build, incl. env.
    - cmd - commands
    - uts - kernel & drivers (“UNIX time share”)
        - common - shared bits
        - intel - intel x86 bits
        - i86pc - specific PC hardware support
    - lib - libraries
  10. Kernel

    - SVR4.2 derived (e.g. STREAMS, etc.)
    - A “real” DDI/DDK - binary stability as well as source stability
    - But, many useful things have no DDI :-(
    - We are working to fix this. Wanna help?
    - Threaded throughout
  11. Kernel Doc Resources

    - man section 9 (and sometimes 7)
    - Writing Device Drivers (a bit dated, but remember the DDI!)
    - PSARC case logs
    - Use the Source, Luke!
    - Seriously, [email protected]
  12. Surprising Kernel Things

    - kmem_free() requires you to know how much you allocated
    - Everything is a thread! (almost)
    - Device tree and auto-configuration
    - Portable DMA interfaces (ala NetBSD)
    - ddi_put() & ddi_get() access device memory & registers - no endianness hacks!
    - 32-bit processes with 64-bit kernels
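Because `kmem_free()` takes the allocation size as well as the pointer, kernel code usually stores the length next to the buffer it describes. The following is a userland sketch of that habit, not kernel code: `sized_alloc()` and `sized_free()` are hypothetical stand-ins (built on `malloc`/`free`) for `kmem_alloc(size, KM_SLEEP)` and `kmem_free(buf, size)`, and `msgbuf_t` is an invented structure.

```c
#include <assert.h>
#include <stdlib.h>

/* Bytes handed out and not yet returned; lets us catch size mismatches. */
static size_t bytes_outstanding;

/* Hypothetical stand-in for kmem_alloc(size, KM_SLEEP). */
void *
sized_alloc(size_t size)
{
	void *p = malloc(size);

	if (p != NULL)
		bytes_outstanding += size;
	return (p);
}

/* Hypothetical stand-in for kmem_free(buf, size): caller supplies the size. */
void
sized_free(void *p, size_t size)
{
	assert(p != NULL);
	assert(size <= bytes_outstanding);
	bytes_outstanding -= size;
	free(p);
}

size_t
sized_outstanding(void)
{
	return (bytes_outstanding);
}

/* The buffer remembers its own length so it can be freed correctly later. */
typedef struct msgbuf {
	char	*data;
	size_t	len;
} msgbuf_t;

int
msgbuf_init(msgbuf_t *mb, size_t len)
{
	mb->data = sized_alloc(len);
	if (mb->data == NULL)
		return (-1);
	mb->len = len;
	return (0);
}

void
msgbuf_fini(msgbuf_t *mb)
{
	sized_free(mb->data, mb->len);	/* same size that was allocated */
	mb->data = NULL;
	mb->len = 0;
}
```

Keeping the length in the structure means the teardown path never has to guess the size.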
  13. Kernel Frameworks

    - gldv3 - for NICs
    - SCSAv3 - for HBAs and targets
    - blkdev - simple block devices
    - USBA - USB devices (you were surprised?)
    - boomer - audio
    - Nexus - undocumented (NDI)
    - STREAMS - message based, warrants a talk unto itself
  14. Device Tree

    - All Devices Presented in a Tree...
    - prtconf displays the tree
    - Use prtconf -vp to see the “hardware” tree from “PROM” (BIOS/ACPI)
    - Each node also can have properties
    - Both nodes and properties can originate from hardware or software
  15. Device Tree Example

        System Configuration: Joyent i86pc
        Memory size: 1228 Megabytes
        System Peripherals (PROM Nodes):

        Node 0x000001
            bios-boot-device: '9?'
            stdout: 00000000
            name: 'i86pc'

        Node 0x000002
            existing: 00dd0000.00000000.109a0001.00000000
            name: 'ramdisk'

        Node 0x000003
            acpi-namespace: '\_SB_.PCI0'
            compatible: 'pciex_root_complex'
            device_type: 'pciex'
            reg: 00000000.00000000.00000000
            #size-cells: 00000002
            #address-cells: 00000003
            name: 'pci'

        Node 0x000004
            reg: 00000000.00000000.00000000.00000000.00000000
            compatible: 'pci8086,7190.15ad.1976.1' + 'pci8086,7190.15ad.1976' + 'pci15ad,1976' + 'pci8086,7190.1' + 'pci8086,7190' + 'pciclass,060000' + 'pciclass,0600'
            model: 'Host bridge'
            power-consumption: 00000001.00000001
            devsel-speed: 00000001
            max-latency: 00000000
            min-grant: 00000000
            subsystem-vendor-id: 000015ad
            subsystem-id: 00001976
            unit-address: '0'
            class-code: 00060000
            revision-id: 00000001
            vendor-id: 00008086
            device-id: 00007190
            name: 'pci15ad,1976'
  16. Driver Binding

    - Drivers bind by alias
    - There is a search order - the ‘compatible’ property in the tree
    - Typically e.g. pci8086,1000
    - Only one driver name for a given alias - no dynamic probing!
    - Well, there are exceptions (ISA, *cough!*)
    - A nexus can supply a node name with an implicit alias
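The search order can be pictured as a first-match walk: take the node's alias list from most specific to least specific and stop at the first alias that has a driver registered. This is a userland model only, not the kernel's binding code. The table reuses the two alias/driver pairs that appear elsewhere in this deck (`pci8086,100f` for e1000g and `pciex1077,8000` for qlge); `alias_table` and `bind_driver()` are invented names, and a real system keeps its aliases in `/etc/driver_aliases`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct alias_entry {
	const char *alias;	/* e.g. "pci8086,100f" */
	const char *driver;	/* exactly one driver name per alias */
};

/* Illustrative table only; the pairs are taken from examples in this deck. */
static const struct alias_entry alias_table[] = {
	{ "pci8086,100f",   "e1000g" },
	{ "pciex1077,8000", "qlge" },
};

/*
 * Walk the alias list in order (most specific first) and return the
 * driver for the first alias that is registered, or NULL if none is.
 */
const char *
bind_driver(const char *const compat[], size_t ncompat)
{
	size_t ntable = sizeof (alias_table) / sizeof (alias_table[0]);
	size_t i, j;

	assert(compat != NULL || ncompat == 0);
	for (i = 0; i < ncompat; i++) {
		for (j = 0; j < ntable; j++) {
			if (strcmp(compat[i], alias_table[j].alias) == 0)
				return (alias_table[j].driver);
		}
	}
	return (NULL);
}
```

Because each alias maps to a single driver name, the lookup is a plain table search and no driver code has to run to decide the binding.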
  17. Minor Nodes

    - Minor nodes = service points = character special devices (usually)
    - Nexus drivers don’t need (but often have)
    - /devices ... links put into /dev by devfsadm
    - Drivers can have 0, 1, or many minor nodes
    - Backed in driver by cb_ops entry points
    - Typically created in attach and destroyed in detach
  18. Adding a New Driver

    - ELF modules live in /kernel/drv/amd64 (usually)
    - Example: # add_drv -i '"pciex1077,8000"' qlge
    - rem_drv to remove a driver
    - Usually only done once; for testing use modload and modunload
  19. Autoconfiguration

    - _init() called at driver load time
        - should initialize globals, etc.
        - calls mod_install() to register module entry points
    - attach() called to attach a specific instance of a driver
  20. Deconfiguration

    - detach() called to detach an instance of a device
        - should return DDI_FAILURE if instance is in use
    - _fini() called before unloading ELF module
        - calls mod_remove(), which will also call detach() for all conf’d instances
        - should free up globals, etc.
  21. Contexts

    - User context: current thread associated with a user process, called as a result of a system call from userland
    - Kernel context: like user context, can sleep, etc., but no user process associated (so e.g. copyout() is not possible)
    - Interrupt context: running in an interrupt thread
    - High level interrupt context: running on a very high priority interrupt - special, and rare.
  22. Synchronization

    - mutex_enter/exit - simple mutexes
        - mutexes used by interrupt handlers need the interrupt cookie passed to mutex_init()
    - rw_enter/exit - RW locks
    - cv_broadcast/signal/wait - cond vars
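For readers coming from userland, the kernel mutex and condition variable calls follow the same pattern as POSIX threads: take the mutex, re-check a predicate in a loop around the wait, and broadcast after changing the predicate. The sketch below is a userland analogue using pthreads, not kernel code; `gate_t` and its functions are hypothetical names, and the comments name the kernel call each pthread call roughly corresponds to.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct gate {
	pthread_mutex_t	lock;	/* kernel: kmutex_t */
	pthread_cond_t	cv;	/* kernel: kcondvar_t */
	bool		open;	/* predicate protected by lock */
} gate_t;

void
gate_init(gate_t *g)
{
	assert(g != NULL);
	pthread_mutex_init(&g->lock, NULL);	/* kernel: mutex_init() */
	pthread_cond_init(&g->cv, NULL);	/* kernel: cv_init() */
	g->open = false;
}

/* Open the gate and wake every waiter. */
void
gate_open(gate_t *g)
{
	pthread_mutex_lock(&g->lock);		/* kernel: mutex_enter() */
	g->open = true;
	pthread_cond_broadcast(&g->cv);		/* kernel: cv_broadcast() */
	pthread_mutex_unlock(&g->lock);		/* kernel: mutex_exit() */
}

/* Block until the gate is open; returns immediately if it already is. */
void
gate_wait(gate_t *g)
{
	pthread_mutex_lock(&g->lock);
	while (!g->open)			/* always re-check the predicate */
		pthread_cond_wait(&g->cv, &g->lock);	/* kernel: cv_wait() */
	pthread_mutex_unlock(&g->lock);
}
```

The `while` loop matters in both worlds: a wakeup only says the predicate may have changed, so the waiter checks it again while holding the lock.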
  23. Asynchronous Stuff

    - Prefer taskqs - see ddi_taskq_dispatch()
    - For periodic things, ddi_periodic_add()
    - For one off timers, timeout()
    - For rare cases, soft interrupts (ddi_intr_add_softint())
  24. Module Types

    - drv - drivers (have a struct dev_ops)
    - misc - libraries, common support code, etc.
    - strmod - STREAMS modules
    - crypto - crypto modules
    - sys - system calls
    - fs - filesystems
  25. Module Directories

    - /kernel - modules needed for root filesystem and early boot - most live here
    - /platform/kernel - platform specific modules
    - /usr/kernel - other modules (e.g. audio)
  26. 32 vs 64 bit

    - x86 supports both 32 and 64 bit kernels (for now)
    - SPARC is only 64 bit
    - /kernel/drv for 32-bit
    - /kernel/drv/amd64 for 64-bit
  27. Modules/Linking

    - Global symbols in unix/genunix visible to all modules
    - Global symbols in drivers/modules not visible unless link dependency established
    - E.g. for SCSI drivers, you need ld -N misc/scsi
  28. Adding A New Driver

    - Locate source under uts/common/io/<driver> or somesuch
    - Edit common/Makefile.files
    - Edit common/Makefile.rules
    - Create intel/<driver>/Makefile
    - Edit intel/Makefile.intel.shared
  29. Building the Source

    - Copy illumos.sh & edit to taste
        - Especially CODEMGR_WS and GATE
    - /opt/onbld/bin/bldenv -d illumos.sh
    - WAIT.... depending on hardware, hours or days....
    - See the wiki
  30. Building Just One Module

    - Setup the environment for a full nightly
    - Recommend doing a full build first
        - Otherwise you must do “make setup”
    - cd usr/src/uts/intel/afe
    - dmake install
    - Modules live in obj64, obj32, debug64, debug32
  31. Register Access

    - Registers and device memory accessed via handles
    - ddi_regs_map_setup()
    - ddi_getXX, ddi_putXX - these will automatically deal with endianness
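To see what "automatically deal with endianness" buys you, here is roughly what an accessor has to do for a little-endian device so that the same driver source works on both x86 and SPARC. This is a userland sketch over a plain byte array, not the DDI: `le32_get()` and `le32_put()` are hypothetical helpers standing in for what `ddi_get32()`/`ddi_put32()` do behind an access handle set up for little-endian structure access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read a 32-bit little-endian value at byte offset `off`. Assembling the
 * value byte by byte gives the same result on any host byte order.
 */
uint32_t
le32_get(const uint8_t *base, size_t off)
{
	const uint8_t *p;

	assert(base != NULL);
	p = base + off;
	return ((uint32_t)p[0] |
	    ((uint32_t)p[1] << 8) |
	    ((uint32_t)p[2] << 16) |
	    ((uint32_t)p[3] << 24));
}

/* Write a 32-bit value in little-endian byte order at offset `off`. */
void
le32_put(uint8_t *base, size_t off, uint32_t val)
{
	uint8_t *p;

	assert(base != NULL);
	p = base + off;
	p[0] = (uint8_t)(val & 0xff);
	p[1] = (uint8_t)((val >> 8) & 0xff);
	p[2] = (uint8_t)((val >> 16) & 0xff);
	p[3] = (uint8_t)((val >> 24) & 0xff);
}
```

With the real DDI the driver states the device's byte order once, when the mapping is set up, and every later access goes through the handle instead of open-coded swaps like these.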
  32. Managing DMA

    - When possible allocate memory for DMA rather than trying to bind it
    - ddi_dma_alloc_handle()
    - ddi_dma_mem_alloc()
    - ddi_dma_addr_bind_handle()
    - Access via ddi_putXX/ddi_getXX will handle endianness for you (e.g. for descriptor rings)
  33. Cache Flush

    - ddi_dma_sync() used to sync/flush caches
    - Should *always* be done
    - Doesn’t solve PCI posted-write problem
    - Uncached/consistent memory fast to sync
  34. 32-bit/64-bit Interop

    - Use ddi_copyin() and ddi_copyout()
    - Try to avoid model sensitive structures
    - ddi_model_convert_from()
    - Look for _MULTI_DATAMODEL for examples
    - Strategy is to copy between 32-bit specific structure and native structure
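The copy-between-structures strategy looks like this in miniature. A 32-bit caller lays out pointers and sizes as 32-bit fields, so the driver reads that layout into a 32-bit specific structure and widens each field into its native one. This is a userland sketch with invented names (`xx_ioc32`, `xx_ioc`, `xx_ioc_from32`); in a real driver the choice between the two layouts would be made from the result of `ddi_model_convert_from()`, and the bytes would arrive via `ddi_copyin()`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as a 32-bit process sees it: pointer and length are 32 bits wide. */
struct xx_ioc32 {
	uint32_t	buf;	/* user address, as a 32-bit value */
	uint32_t	len;
};

/* Native layout used by the rest of the (hypothetical) driver. */
struct xx_ioc {
	void		*buf;
	size_t		len;
};

/* Widen each field of the 32-bit structure into the native structure. */
void
xx_ioc_from32(const struct xx_ioc32 *in, struct xx_ioc *out)
{
	assert(in != NULL && out != NULL);
	out->buf = (void *)(uintptr_t)in->buf;	/* zero-extend the address */
	out->len = in->len;
}
```

Converting once at the boundary keeps the rest of the code working only with the native structure, which is why avoiding model-sensitive structures in the first place is the cheaper option.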
  35. kstat - kernel statistics

    - very lightweight -- “usually” read-only
    - generally typed (strings, numbers, etc.)
    - User access via kstat(1M), or libkstat
    - Kernel API: kstat_create() et al.
    - basis for iostat, netstat, vmstat, etc.
    - good for aggregate data only
  36. Debugging Tools

    - mdb and kmdb
    - DTrace
    - kstat
    - truss (like strace)
    - snoop/tcpdump
    - lockstat
    - prstat, pfiles, and friends
    - logs (syslog same as Linux, BSD)
  37. mdb - modular debugger

    - Uses CTF data to provide “symbolic” and type-aware debugging
    - Same debugger can debug both kernel and user space
    - Extensible via “modules” - subsystem authors can supply their own modules
    - Not a source level debugger!
  38. mdb examples

        > ::prtconf !grep e1000g
        ffffff00dfe322b8 pci8086,100f, instance #0 (driver name: e1000g)
        > ffffff00dfe322b8 $<devinfo !grep data
        devi_parent_data = 0xffffff00e15f72c0
        devi_driver_data = 0xffffff00e0859000
        > 0xffffff00e0859000$<e1000g
        {
            instance = 0
            dip = 0xffffff00dfe322b8
            priv_dip = 0xffffff00e0eb8d00
            priv_devi_node = 0xffffff00e1d110d0
            mh = 0xffffff00e1cb6a08
            mrh = 0
            shared = {
                back = 0xffffff00e085d578
                hw_addr = 0xffffff009afaa000
                flash_address = 0
                io_base = 0x2000
                mac = {
                    ops = {
                        init_params = e1000_init_mac_params_82540
                        id_led_init = e1000_id_led_init_generic
                        blink_led = e1000_null_ops_generic
                        check_for_link = e1000_check_for_copper_link_generic
                        check_mng_mode = e1000_null_mng_mode
                        cleanup_led = e1000_cleanup_led_generic
                        clear_hw_cntrs = e1000_clear_hw_cntrs_82540
                        clear_vfta = e1000_clear_vfta_generic
                        get_bus_info = e1000_get_bus_info_pci_generic
  39. mdb threadlist

        > ::threadlist !head
                    ADDR             PROC              LWP CMD/LWPID
        fffffffffbc2fa00 fffffffffbc2eac0 fffffffffbc31500 sched/1
        ffffff0002605c40 fffffffffbc2eac0                0 idle()
        ffffff000260bc40 fffffffffbc2eac0                0 thread_reaper()
        ffffff0002611c40 fffffffffbc2eac0                0 tq:kmem_move_taskq
        ffffff0002617c40 fffffffffbc2eac0                0 tq:kmem_taskq
        ffffff000261dc40 fffffffffbc2eac0                0 tq:pseudo_nexus_enum_tq
        ffffff0002623c40 fffffffffbc2eac0                0 scsi_hba_barrier_daemon()
        ffffff0002629c40 fffffffffbc2eac0                0 scsi_lunchg1_daemon()
        ffffff000262fc40 fffffffffbc2eac0                0 scsi_lunchg2_daemon()
        > ffffff000262fc40 $< threadlist
                    ADDR             PROC              LWP CLS PRI            WCHAN
        ffffff000262fc40 fffffffffbc2eac0                0   0  60 fffffffffbd17d10
          PC: _resume_from_idle+0xf1    THREAD: scsi_lunchg2_daemon()
          stack pointer for thread ffffff000262fc40: ffffff000262fb30
          [ ffffff000262fb30 _resume_from_idle+0xf1() ]
            swtch+0x141()
            cv_wait+0x70()
            scsi_lunchg2_daemon+0x121()
            thread_start+8()
  40. mdb Example: Interrupts

        [root@host ~]# echo ::interrupts | mdb -k
        IRQ Vect IPL Bus Trg Type  CPU Share APIC/INT# ISR(s)
        1   0x40 5   ISA Edg Fixed 1   1     0x0/0x1   i8042_intr
        3   0xb1 12  ISA Edg Fixed 1   1     0x0/0x3   asyintr
        4   0xb0 12  ISA Edg Fixed 0   1     0x0/0x4   asyintr
        9   0x80 9   PCI Lvl Fixed 1   1     0x0/0x9   acpi_wrapper_isr
        12  0x41 5   ISA Edg Fixed 0   1     0x0/0xc   i8042_intr
        15  0x43 5   ISA Edg Fixed 1   1     0x0/0xf   ata_intr
        17  0x42 5   PCI Lvl Fixed 1   1     0x0/0x11  mpt_intr
        18  0x60 6   PCI Lvl Fixed 0   1     0x0/0x12  e1000g_intr
        20  0xd1 14  PCI Lvl Fixed 0   1     0x0/0x14  hpet_isr
        24  0x81 7   PCI Edg MSI   1   1     -         pcieb_intr_handler
        25  0x30 4   PCI Edg MSI   0   1     -         pcieb_intr_handler
        26  0x82 7   PCI Edg MSI   1   1     -         pcieb_intr_handler
        27  0x31 4   PCI Edg MSI   0   1     -         pcieb_intr_handler
  41. DTrace

    - DTrace lets us probe dynamically
    - Free “function boundary probing” (kernel only)
    - Static probes give us other events
    - D language makes powerful constructs possible
    - Useful as base for other things e.g. lockstat
  42. DTrace Examples

        [root@host ~]# dtrace -n fbt::e1000g_intr:
        dtrace: description 'fbt::e1000g_intr:' matched 2 probes
        CPU     ID                    FUNCTION:NAME
          0  54336                e1000g_intr:entry
          0  54337                e1000g_intr:return
          0  54336                e1000g_intr:entry
          0  54337                e1000g_intr:return
          0  54336                e1000g_intr:entry
        ^C
  43. More DTrace Examples (Credit: Brendan Gregg)

        # New processes with arguments,
        dtrace -n 'proc:::exec-success { trace(curpsinfo->pr_psargs); }'

        # Files opened by process,
        dtrace -n 'syscall::open*:entry { printf("%s %s",execname,copyinstr(arg0)); }'

        # Syscall count by program,
        dtrace -n 'syscall:::entry { @num[execname] = count(); }'

        # Syscall count by syscall,
        dtrace -n 'syscall:::entry { @num[probefunc] = count(); }'

        # Syscall count by process,
        dtrace -n 'syscall:::entry { @num[pid,execname] = count(); }'

        # Read bytes by process,
        dtrace -n 'sysinfo:::readch { @bytes[execname] = sum(arg0); }'

        # Write bytes by process,
        dtrace -n 'sysinfo:::writech { @bytes[execname] = sum(arg0); }'

        # Read size distribution by process,
        dtrace -n 'sysinfo:::readch { @dist[execname] = quantize(arg0); }'

        # Write size distribution by process,
        dtrace -n 'sysinfo:::writech { @dist[execname] = quantize(arg0); }'

        # Disk size by process,
        dtrace -n 'io:::start { printf("%d %s %d",pid,execname,args[0]->b_bcount); }'

        # Pages paged in by process,
        dtrace -n 'vminfo:::pgpgin { @pg[execname] = sum(arg0); }'

        # Minor faults by process,
        dtrace -n 'vminfo:::as_fault { @mem[execname] = sum(arg0); }'
  44. DTrace Example

        > dtrace -n 'sysinfo:::readch { @dist[execname] = quantize(arg0); }'
        dtrace: description 'sysinfo:::readch ' matched 4 probes
        ^C

          sshd
                   value  ------------- Distribution ------------- count
                       1 |                                         0
                       2 |@@@@@@@@@@@@@@@@@@@@                     1
                       4 |                                         0
                       8 |                                         0
                      16 |                                         0
                      32 |@@@@@@@@@@@@@@@@@@@@                     1
                      64 |                                         0

          nscd
                   value  ------------- Distribution ------------- count
                     512 |                                         0
                    1024 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
                    2048 |                                         0
  45. Contribution Process

    - Bugs located on bugs.illumos.org
    - Send out a code review (webrev) to [email protected]
    - Full nightly build (incl. lint clean)
    - hg pbchk clean (cstyle, etc.)
    - RTI - email [email protected]
    - Process subject to change
  46. webrev

    - Generates a web-based review from an hg tree
    - You have to upload it somewhere
    - Nice because it provides more context than other tools
    - And gives options for different views