Linux x86 Storage



Overview

This article combines a number of core concepts to provide an overview of the process without going down the rabbit hole too far in any one subject area; the general areas of focus are outlined in the sections below.

Partitioned mass storage devices (such as fixed disks) are our target medium; non-partitioned devices (such as floppies) are not considered as part of this article. The focus is also on traditional rotating magnetic media as storage - newer devices such as SSDs still inherit the older design mechanisms, so understanding the basic mechanics is key.


BIOS vs. UEFI

BIOS is still the most widely deployed firmware - modern systems are embracing UEFI, but until it becomes the de facto standard we must adhere to the design and limitations of the BIOS firmware infrastructure in order to address storage and boot operating systems. Most of this article discusses traditional BIOS design.

BIOS is only capable of reading the boot sector of the disk and executing its code, while recognizing the MBR partition format itself. BIOS has no concept of filesystem types and only executes the first 440 bytes before relinquishing control. This leads to the use of secondary (multi-staged or chained) boot loaders such as GRUB.

UEFI by contrast does not execute boot sector code; instead there exists an EFI System Partition (ESP) in which the required firmware files are loaded (each called a UEFI Application); this partition is typically a FAT32-formatted 512 MiB space and supports multiple UEFI Applications at the same time. The firmware itself has an embedded boot menu that defines the disks and partitions to be launched for the applications, effectively acting like Stage 1 of GRUB.
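On a running UEFI system the firmware boot menu and the ESP contents can be inspected from Linux; a quick sketch (the /boot/efi mount point and the entry names are assumptions that vary by distribution):

# efibootmgr -v              (list the firmware boot entries and the disk/partition each one points to)
# ls /boot/efi/EFI/          (one directory per installed UEFI Application, e.g. the distribution's grub*.efi)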

BIOS-GPT is a hybrid; it allows BIOS to load the boot code of the protective MBR on a GPT-partitioned disk and execute it. Typically this requires a BIOS boot partition around 1 MiB in size, as the 440-byte bootstrap area is not large enough and GPT leaves no extra sectors where the boot loader would typically be embedded on an MBR disk.
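As a sketch, a BIOS boot partition can be created with parted on a GPT-labeled disk (the device name and exact sizes below are examples only):

# parted /dev/sdX mkpart biosboot 1MiB 2MiB      (a ~1 MiB partition reserved for the embedded boot loader)
# parted /dev/sdX set 1 bios_grub on             (flag it so GRUB will embed core.img there)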


4096 Everywhere

The traditional storage sector size is 512 bytes; the introduction of the Advanced Format 4096-byte sector size brings to light a question - why is everything based on 4 KiB? The Linux kernel memory page size, the largest usable Linux filesystem block size and the disk sector size all max out at 4 KiB - and this is all based on the classic x86 MMU design.

The classic x86 MMU architecture uses two-level page tables of 1024 4-byte entries, making each table 4 KiB in size. One is called the page directory and the other the page table; together they provide virtual-to-physical mapping of the memory within the system. If you do the math, that limits us to 4 GiB (32 bits) of address space - hence the introduction of PAE to allow up to 64 GiB (36 bits) to be addressed. The x86_64 platform further increases this with 48-bit virtual addresses (256 TiB of virtual address space, of which Linux currently gives 128 TiB to userspace); a theoretical maximum of 16 EiB exists if the full 64 bits were ever implemented.

Modern CPUs are starting to offer larger page sizes ("huge pages") but they are not the norm - as such, the alignment of the x86 MMU 4 KiB page with other parts of the system is why we don't see 8192-byte sectors or 16 KiB filesystem blocks in most Linux infrastructure at this time, even though the code itself supports them (such as XFS block sizes > 4096).

The Linux kernel has various extensions such as transparent huge pages to perform larger virtual mappings, however at the end of the day performance is limited by the specific x86 CPU in use. As such the 4096-byte sector size, block size and memory page size are our daily use scenarios.
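These three 4096s are easy to confirm on a live system; for example (device and filesystem names below are placeholders):

# getconf PAGESIZE                               (kernel memory page size - 4096 on x86)
# blockdev --getss --getpbsz /dev/sdX            (logical and physical sector sizes of the disk)
# tune2fs -l /dev/sdX1 | grep 'Block size'       (filesystem block size for an ext2/3/4 filesystem)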


Drive Geometry

The root of everything is the radial geometry of the physical drive itself: how it is segregated and, most importantly, how we "find" specific data at any given moment on the magnetic medium.


CHS Addressing

At the core of our x86 history is the cylinder-head-sector addressing scheme that was invented to determine where on a storage device the needed information physically lives. This legacy design, carried into the Master Boot Record (MBR) format, is part of why MBR storage is limited to 2 TiB, as the spec has limits (more on that below). CHS is expressed as a tuple - 0/0/1, 12/9/17, etc. - to refer to a finite physical location of data.

Term Description
Platter The platter is the thin piece of magnetic storage medium; it has two sides that can be used. All the platters are stacked on top of each other with heads in between.
Head A head is the little "arm" that moves over the platters to read the magnetic information, so for each platter you have 2 heads (one on each side). The maximum is 256 however an old bug limits this to 255 in use.
Track A track is one of the concentric rings on a platter; they start with 0 at the outer edge and increase numerically inwards.
Cylinder A cylinder is the set of "stacked" tracks of all platters in the same physical location; they start at 0 along with the tracks. The address of a cylinder is always the same as an individual track since it's just a stack of them.
Sector A sector is the segregated block within a track, maximum 63 sectors/track with 512 bytes per sector. Sectors start counting at 1, not 0.

Where this all comes into play is the concept that BIOS will look specifically at location 0/0/1 (first cylinder/track, first head, first sector) to load the initial machine language boot code. This creates an absolute physical location for every storage device to boot and has carried forth into the more modern LBA addressing mechanism.

Note that the maximums for cylinders, heads and sectors evolved over time and ended with the ATA-5 specification of 16383/16/63, requiring a full 24-bit value.

16383 cylinders * 16 heads * 63 sectors = 16,514,064 sectors * 512 bytes/sector = 8,257,032 KiB = ~7.9 GiB (the classic "8 GB" barrier)

The INT 13h Extensions are what permit us to read beyond the original CHS limit; INT 13h CHS is 24 bits wide and the ATA spec is 28 bits, and BIOS routines exist to translate between the two for full compatibility. Translating the ATA 16:4:8-bit scheme to the 10:8:6-bit scheme used by the INT 13h routines is what allows mapping up to ~8 GiB.
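A quick sanity check of both ceilings with shell arithmetic:

# echo $(( 1024 * 255 * 63 * 512 ))      (INT 13h 10:8:6 scheme: 8,422,686,720 bytes, ~7.8 GiB)
# echo $(( 16383 * 16 * 63 * 512 ))      (ATA-5 16:4:8 scheme: 8,455,200,768 bytes, ~7.9 GiB)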


ZBR Tracks

Initially the design was that all tracks contained the same number of sectors (MFM and RLL drives) - this was updated with a newer technique called zone bit recording in ATA/IDE drives that allowed more sectors on the outer (larger) tracks and fewer moving inwards. This technique, however, created a problem - the physical geometry of the drive no longer matched the CHS addressing.

Because data (such as a partition) needs to start/end on a track/cylinder boundary, this leaves surplus sectors at the end of the drive less than 1 cylinder in size since they almost never line up perfectly. This is why when making partitions in tools like fdisk or parted you will see unused sectors even though you specified using the whole drive - the tools are translating your request into cylinder boundaries and discarding any surplus sectors as unusable since they are not aligned.


LBA Addressing

LBA Value CHS Tuple
0    0 / 0 / 1
62    0 / 0 / 63
1008    1 / 0 / 1
1070    1 / 0 / 63
16,514,063    16382 / 15 / 63

The limitations of the CHS design were quickly encountered; as such a more extensive format was introduced called Logical Block Addressing. Now that CHS has been defined understanding LBA becomes easy and is best explained with a simple table.

As exemplified, LBA addressing simply starts at 0 and increases by 1 for each CHS tuple. The original LBA was native 28-bit (see the CHS mapping above), the current ATA-6 spec is a 48-bit wide LBA allowing addressing up to 128 PiB of storage. As might be obvious there is a cutoff after 8 GiB of being able to translate CHS to LBA for backwards compatibility. Modern INT 13H extensions allow native LBA access thereby negating any need to use CHS style structures.

Note that CHS tuple 0/0/1 and LBA value 0 refer to the same location - and this is what we care about most for booting the system.
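The mapping in the table above follows a simple formula: LBA = (C * heads_per_cylinder + H) * sectors_per_track + (S - 1). A sketch with shell arithmetic, using the 16-head/63-sector geometry from the CHS section:

# C=1; H=0; S=1;       echo $(( (C * 16 + H) * 63 + (S - 1) ))      (gives 1008, matching the table)
# C=16382; H=15; S=63; echo $(( (C * 16 + H) * 63 + (S - 1) ))      (gives 16514063, the last addressable sector)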


Boot Sector

Now that we understand CHS and LBA addressing, let's look at what happens once the BIOS reads the first 512-byte sector of the drive to get going. This breaks down into two formats - the traditional Master Boot Record (MBR) format, and the GUID Partition Table (GPT) format. The Wikipedia pages on both are fantastic; I highly recommend reading them to gain a deeper understanding.

There are two kinds of basic boot sector: the Master Boot Record (MBR) and the Volume Boot Record (VBR).

We are used to thinking of the boot sector as the MBR, but in fact both can be present on x86 partitioned storage. GPT contains a 512-byte protective MBR for backwards compatibility. Essentially an MBR and a VBR are the same thing, just located at different locations for different purposes. A non-partitioned device like a floppy disk uses only a VBR at the beginning, whereas a partitioned device typically uses an MBR (which may then load a VBR later).


MBR Format

Classic Generic MBR Structure
Address Description Size
(bytes)
+0 Bootstrap code area 446
+446 PTE #1 16
+462 PTE #2 16
+478 PTE #3 16
+494 PTE #4 16
+510 Boot signature (55h AAh) 2
Modern Standard MBR Structure
Address Description Size
(bytes)
+0 Bootstrap code area (part 1) 218
+218 Disk timestamp (optional, MS-DOS 7.1-8.0 (Windows 95B/98/98SE/ME), alternatively can serve as OEM loader signature with NEWLDR) 6
+224 Bootstrap code area (part 2) 216
+440 Disk signature (optional, UEFI, Windows NT/2000/Vista/7 and other OSes) 6
+446 PTE #1 16
+462 PTE #2 16
+478 PTE #3 16
+494 PTE #4 16
+510 Boot signature 2

The MBR is at minimum the first 512-byte sector of the storage. There are two basic structures in use for our purposes, as detailed in the table; of most import are the bootstrap code area and the partition table entries (PTEs). The first data partition traditionally does not start until sector 63 (for historical CHS reasons), leaving a 62-sector "MBR gap" on the system. This gap of unused sectors is typically used for Stage 1.5 chained boot managers, low-level device utilities and so forth.
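The structures in the table are easy to examine directly with standard tools; a sketch against an example device (adjust the device name, and only read - never write - a disk in use):

# dd if=/dev/sdX of=/tmp/mbr.bin bs=512 count=1      (copy just the 512-byte MBR to a file)
# file /tmp/mbr.bin                                  (file(1) decodes the boot sector and partition entries it finds)
# hexdump -C -s 510 -n 2 /tmp/mbr.bin                (the 55 AA boot signature at offset +510)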


Bootstrap Code Area

This area of the sector is pure machine language code run in real mode; think of it like a BASIC program: it is read and executed instruction by instruction to manipulate CPU registers (more or less - it's complicated!). This mechanism allows the CPU to execute arbitrary code without understanding anything about the higher level filesystem or storage design.

Notice that we have an extremely limited amount of space (440 bytes) - this is nowhere near enough room to run a fancy modern boot manager like GNU GRUB full of graphics, features and whatnot. Hence we have the concept of staged (or chained) boot managers. This area represents Stage 1 of the bootloader process and serves to simply provide instructions on where to load the next bit of code from physically. More on that in the GRUB section.


Partition Table Entry

We come finally to the Achilles' heel of the MBR design - the partition table design and its relation to the CHS addressing format. As each PTE is only 16 bytes we have a finite limit on what can be stored; extrapolating this is where our limit is created in how much disk can be addressed, leading to the 2 TiB ceiling of an MBR-based storage disk.

16-byte PTE
Length Description
1 Status (active/inactive)
3 CHS address of partition start
1 Partition type
3 CHS address of partition end
4 LBA address of partition start
4 Total sectors in partition

Given this design, at most 4 bytes (32 bits) are available to store the sector count and starting LBA, with the limitations discussed above. These are referred to as the Primary partitions of the disk, and the above exemplifies why only 4 of them exist when using tools like fdisk and parted.
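The 2 TiB figure falls straight out of that 4-byte sector count field:

# echo $(( 2 ** 32 * 512 ))          (2,199,023,255,552 bytes = 2 TiB maximum per MBR partition)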


GPT Format

LBA 0 (Legacy MBR)
Address Description Size
(bytes)
+0 Bootstrap code, Disk timestamp and signature 446
+446 PTE #1 Type 0xEE (EFI GPT) 16
+462 PTE #2 (unused) 16
+478 PTE #3 (unused) 16
+494 PTE #4 (unused) 16
+510 Boot Signature (55h AAh) 2
LBA 1 (Primary GPT Header)
+512 Definition of usable blocks on disk, number and size of PTEs, GUID, etc. 512
LBA 2-33 (Primary Partition Table Entries)
+1024 128x 128-byte PTEs 16384
LBA 34+ (Partitions)
+17408 Actual partitions n/a
LBA -33 to -2 (Secondary Partition Table Entries)
-16896 128x 128-byte PTEs 16384
LBA -1 (Secondary GPT Header)
-512 Definition of usable blocks on disk, number and size of PTEs, GUID, etc. 512

The GUID Partition Table format was invented to solve the whole mess of CHS, MBR and 32-bit LBA limitations. It's actually part of the Unified Extensible Firmware Interface (UEFI) specification designed to replace the aging Basic Input/Output System (BIOS) design, however due to its widespread use to address storage larger than 2 TiB it's often treated as its own project. LBA is used exclusively; there is no CHS mapping. While it's possible to start partitions at LBA 34, due to the prevalence of MBR track-boundary requirements the first partition often starts at LBA 63. This allows chained bootloaders such as GRUB to store their Stage 1.5 images prior to sector 63, similar to the MBR technique.


Legacy MBR Sector

Notice the Legacy MBR is clearly defined per the specification; this allows booting a GPT-based storage medium using BIOS techniques as it contains the same area for bootstrap code and PTEs in the same disk locations.

The bootstrap code area remains, and only the first PTE is used, denoting a type of 0xEE (EFI GPT). This sufficiently protects the disk from tools which do not understand GPT, as they should simply report a partition of an unknown type in the worst-case scenario.


Partition Table Entry

The PTE format of GPT is very similar to the MBR style and should come as no surprise; most notable of the structure is the use of 8-byte (64-bit) values for the LBA address. Much like MBR, this defines a hard limit on the maximum LBA value that could be addressed as useful storage within GPT.

128-byte PTE
Length Description
16 Partition type GUID
16 Unique partition GUID
8 First LBA (little endian)
8 Last LBA (inclusive, usually odd)
8 Attribute flags
72 Partition name (36 UTF-16LE code units)
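These fields can be viewed with the gdisk/sgdisk tools where installed; a sketch (the device name is an example):

# sgdisk -p /dev/sdX         (print the GPT header summary and partition table)
# sgdisk -i 1 /dev/sdX       (show PTE #1 in detail: both GUIDs, first/last LBA, attributes and name)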



OS Compatibility

The "modern approach" is to start partitions on a 1 MiB boundary (sector 2048), which divides evenly by the 4096-byte physical sector size and typical RAID stripe sizes. Userspace tools such as fdisk (2.17.1+) and parted contain checks and balances for this - one must ensure not to use "DOS Compatibility Mode" and to work in sectors mode inside a utility like fdisk or parted to achieve the desired perfect alignment. Additionally the LVM subsystem starting with 2.02.73 will align to this 1 MiB boundary - previous versions used a 64 KiB alignment, akin to the LBA 63 offset. The same goes for software RAID - as long as it's using the modern Superblock Metadata v1 format it will align to 1 MiB.[1][2][3]
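Alignment is simple to verify after the fact; a sketch assuming a first partition on /dev/sdX:

# parted /dev/sdX align-check optimal 1          (reports whether partition 1 is optimally aligned)
# cat /sys/block/sdX/sdX1/start                  (starting sector; a multiple of 2048 means 1 MiB aligned)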


GRUB Bootstrap

Understanding how GRUB loads becomes fairly straightforward once the mechanics of the MBR/GPT world are understood. The installation of GRUB onto the MBR (or, optionally the VBR) consists of three primary parts, the first two of which are concrete in design.

Stage 1 The boot.img 440-byte code is loaded into the bootstrap area as defined in the MBR design, and is coded to load the first sector of core.img (the next stage) using LBA48 addressing.
Stage 1.5 MBR The core.img ~30 KiB code is loaded into the 62 empty sectors between the end of the MBR and beginning of the first partition (sector 63). This code contains the ability to recognize filesystems to read stage 2 configuration.
Stage 1.5 GPT The core.img ~30 KiB code is loaded starting at sector 34 after the GPT structure (typically inside the BIOS boot partition described earlier). This code contains the ability to recognize filesystems to read stage 2 configuration.
Stage 2 This stage reads the configurations by file/path names under /boot/grub to build the TUI and present choices. The majority of userspace code and configuration is located here.

Once stage 2 is loaded this is where the higher level GRUB magic begins; the most visible example of this is the user interface allowing for selection of multiple boot choices on multiple partitions. UEFI is similar, however instead of core.img a different piece of code (grub*.efi) is copied to the EFI System Partition and acts as the UEFI Application as outlined above.
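On a BIOS system the whole chain is laid down by a single command; a sketch for RHEL/CentOS 7 style systems (Debian/Ubuntu use grub-install and update-grub instead):

# grub2-install /dev/sdX                         (writes boot.img to the bootstrap area and embeds core.img)
# grub2-mkconfig -o /boot/grub2/grub.cfg         (regenerates the Stage 2 configuration read at boot)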


LVM Boot Volumes

The default behavior of pvcreate is to use the second 512-byte sector of the device (just after the MBR/VBR) to hold its label structure; however the LVM subsystem will scan the first 4 sectors of the device for its data. The physical volume label begins with the string LABELONE and contains 4 basic items of information:

  • Physical volume UUID
  • Size of block device in bytes
  • NULL-terminated list of data area locations
  • NULL-terminated list of metadata area locations

Metadata locations are stored as offset and size (in bytes). There is room in the label for about 15 locations, but the LVM tools currently use 3: a single data area plus up to two metadata areas.[4]
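The label and metadata areas can be examined with the standard LVM tools; a sketch against an example PV:

# pvck /dev/sdX1                                 (locates the LABELONE header and checks the metadata areas)
# pvs -o pv_name,pv_uuid,pe_start                (shows each PV UUID and where the first physical extent begins)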

Historically the lack of LVM-capable boot loaders (such as LILO and GRUB1) required the /boot/ filesystem to reside at the beginning of the disk (yet another CHS legacy issue) or be of a more basic filesystem format such as ext2 or ext3. With the advent of GRUB2 (GRUB version 2) the ability exists to read from more complex filesystems such as ext4, LVM and RAID.[5]

In the default installation of many server-oriented Linux distributions such as Ubuntu 14.04 LTS and RHEL/CentOS 7, the /boot/ filesystem is still a non-LVM partition placed first on the disk for maximum backwards compatibility, even though they use GRUB2.


Scanning Devices

When expanding an existing device or adding a new device, the underlying controller(s) need to be rescanned. There are two separate interfaces into the kernel to perform this work, each a little different from the other.

Scan for new Devices

Given either a single controller or multiple controllers for the same storage (in the case of high availability), we need to issue a scan request to those controllers to look for newly presented devices and create /dev device nodes for the ones found. host0 is always the local controller; host1 and above tend to be add-in controllers to external storage, for example.

# echo "- - -" > /sys/class/scsi_host/hostX/scan  (where X is your HBA)

Local controller (which includes VMware vDisks):

# echo "- - -" > /sys/class/scsi_host/host0/scan

Add-on HBA (Host Bus Adapter) cards of some sort:

# echo "- - -" > /sys/class/scsi_host/host1/scan
# echo "- - -" > /sys/class/scsi_host/host1/scan


Rescanning and Deleting Devices

The scenario is an existing block device already presented (i.e. /dev/sda) that has been expanded upstream of the OS - for example the VMware vDisk was grown or the SAN/DAS LUN was expanded. In this case every block device that comprises that piece of storage has to be rescanned - for a single controller it's only one device, but for HA situations (using Multipath for instance) all individual path devices need to be rescanned.

# echo 1 > /sys/block/sda/device/rescan

Multiple paths to the same storage (Multipath, etc.):

# echo 1 > /sys/block/sdb/device/rescan
# echo 1 > /sys/block/sdc/device/rescan

Deleting those block device entries from the Linux kernel maps is just as easy - however, the devices have to be completely unused and released by the OS first. Do not force it: a kernel panic may (and most probably will) ensue if you try to force a block device delete while the kernel still thinks it's in use.

# echo 1 > /sys/block/sdb/device/delete
# echo 1 > /sys/block/sdc/device/delete
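Before deleting, a quick check that nothing is still holding the device is cheap insurance; a sketch:

# lsblk /dev/sdb                                 (no mountpoints or holders should remain on the device)
# multipath -ll                                  (if multipathing, confirm the path is already removed from its map)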


The udev Subsystem

Udev is the device manager for the Linux 2.6 kernel that creates/removes device nodes in the /dev directory dynamically. It is the successor of devfs and hotplug. It runs in userspace and the user can change device names using udev rules.[6] The udev subsystem allows for a very wide variety of user control over devices, whether they be storage, network, UI (keyboard/mouse) or others - one of the common uses in udev is to name network interfaces.

When it comes to Linux storage this can have subtle yet extremely important implications on how the server finds and uses its boot devices. When the kernel initializes, it and the udev subsystem scan the bus and create device nodes for the storage devices found. Logistically, this means that if a supported HBA (Host Bus Adapter, a PCI-based Fibre/SAS card for instance) is found before the internal SCSI controller, it's highly possible (and in practice and experience does happen) that a device from outside the chassis (SAN/DAS) consumes device node "sda" instead of the internal disk or RAID array.
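This is one reason persistent names are preferable to raw sdX nodes; udev maintains stable symlinks out of the box (the device name below is an example):

# ls -l /dev/disk/by-id/                             (symlinks keyed on WWN/serial, independent of probe order)
# udevadm info --query=all --name=/dev/sda | grep ID_SERIAL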

Care should be taken when researching modern udev - in some distributions it has now been subsumed by systemd and is no longer a discrete entity within the Linux ecosystem; the specific methodology has changed for certain parts of the process. For example, in traditional udev the bootstrap process initialized /dev from the pre-prepared data in the /lib/udev/devices tree; the systemd implementation reads from /etc/udev/hwdb.bin instead.


HBA Blacklisting

One graceful solution to the boot-from-HBA problem is simply to blacklist the kernel module in the initrd only, so the kernel does not have the device driver at boot and therefore does not find the HBA controllers. Once the kernel switches to the real root filesystem and releases the initrd, it has already assigned "sda" to the expected internal array and can then load the HBA driver at runtime, initialize the controllers and find the storage.

The rdblacklist mechanism is used on the kernel boot line of your GRUB configuration - just append as needed with the specific HBA to blacklist:

Blacklist the Brocade HBAs:

 rdblacklist=bfa

Blacklist the QLogic HBAs:

 rdblacklist=qla2xxx

The kernel will then respect the /etc/modprobe.d/*.conf entries to load the appropriate module once it's switched to the real root filesystem and discovers the devices during scan.
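On GRUB2-based distributions the parameter is persisted through the normal configuration mechanism; a sketch for a QLogic HBA (paths differ on Debian-style systems, and legacy GRUB is edited directly in its menu file):

 GRUB_CMDLINE_LINUX="... rdblacklist=qla2xxx"    (appended to the existing line in /etc/default/grub)
# grub2-mkconfig -o /boot/grub2/grub.cfg         (rebuild grub.cfg so the new kernel line takes effect)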


References


Citations

  1. http://www.thomas-krenn.com/en/wiki/Partition_Alignment
  2. http://www.ibm.com/developerworks/linux/library/l-4kb-sector-disks/#benchmarks
  3. http://www.rodsbooks.com/gdisk/advice.html
  4. https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Logical_Volume_Manager_Administration/lvm_metadata.html
  5. http://www.gnu.org/software/grub/manual/grub.html#Changes-from-GRUB-Legacy
  6. http://www.linux.com/news/hardware/peripherals/180950-udev