Device Mapper Mechanics



Overview

Concisely, device mapper is the Linux kernel's framework for mapping physical block devices onto logical ("virtual") devices for further use; LVM2, multipath and dm-crypt (LUKS) are its most widespread implementations. Each implementation is known as a target to the device-mapper subsystem within the kernel.
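
The targets registered with the running kernel (built in or loaded as modules) can be listed directly with dmsetup; the exact set and version numbers will vary by distribution and kernel:

// Typically includes linear, striped and error, plus whatever modules are loaded (mirror, crypt, multipath, ...)
# dmsetup targets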

The fundamental tool is named dmsetup and is provided by the device-mapper package on most (if not all) Linux distributions. In a sense it's analogous to common partitioning - using it requires a start, end and size for each piece of a device. By default, when creating a logical device with dmsetup, the human-readable name ends up symlinked in /dev/mapper/, just as with higher level tools like lvcreate.
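
To see which maps already exist on a host and how the friendly names relate to the kernel's dm devices, dmsetup can list and describe them (the output depends entirely on what is already configured):

// One line per existing map with its major:minor pair
# dmsetup ls
// Open count, active/suspended state and UUID for every map
# dmsetup info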

This wiki presents low-level examples for building targets; when a higher level subsystem such as dm-raid, dm-crypt, LVM or dm-multipath exists, it should always be used. The higher level subsystems provide a wealth of additional features required in a production situation.

A secondary tool named dmstats is also delivered which allows for collecting statistics about the underlying regions and areas of the device map (similar to what one might see from sar and the sysstat package). This can be useful for examining the performance characteristics of combined physical block devices and looking for deltas.


Basic Usage

In this example we have a single 75G physical block device named /dev/xvdb (a Xen virtual disk). Using dmsetup it will be split into two virtual devices without using a partition table (aka a "raw" device). Each virtual device is then formatted and mounted as normal, like any other block device. Key to all the work is understanding the table mapping format as listed in the man page, as it can be a bit confusing at first.
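
Regardless of target type, every table line handed to dmsetup follows the same general shape; only the target-specific arguments differ:

// <logical start sector> <number of sectors> <target type> <target-specific arguments>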

First, we need to determine the starting and ending sectors on the disk that we will use; a small one-liner shell script can do all the math for us. Note that we keep a 2048s offset into the device for standard performance alignment across all tiers - this is unique per situation, used here as a best practice.

DISK="/dev/xvdb"; OFFSET=2048 \
  parted ${DISK} unit s print 2>/dev/null | \
  grep "^Disk ${DISK}" | \
  awk -v OFF=${OFFSET} '{gsub(/s$/,"",$3); \
    printf "STA1=%s\nEND1=%s\nLEN1=%s\nSTA2=%s\nEND2=%s\nLEN2=%s\n",
            OFF,(($3/2)-OFF),((($3/2)-OFF)-OFF),
            ((($3/2)-OFF)+1),$3,($3-((($3/2)-OFF)+1))
  }'

STA1=2048
END1=78641152
LEN1=78639104
STA2=78641153
END2=157286400
LEN2=78645247
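
These figures can be sanity-checked against the raw sector count of the device; blockdev reports the size in 512-byte sectors, which should match END2 above:

# blockdev --getsz /dev/xvdb
157286400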

Given the start, end and size of each part of the disk we can use dmsetup to build the virtual maps; exactly as with previous LVM, LUKS and multipath work, the real devices are /dev/dm-? and the names in /dev/mapper/ are symlinks telling the user which map carries which logical name. We will also create a statistics area to see how that looks.

// Table format for linear target used below:
//  <virtual start> <virtual size> linear <physical device> <physical start offset>

# dmsetup create xyzzy1 --table "0 78639104 linear /dev/xvdb 2048"
# dmsetup create xyzzy2 --table "0 78645247 linear /dev/xvdb 78641153"

# ls -og /dev/mapper/xyzzy*
lrwxrwxrwx. 1 7 Jan 10 18:32 /dev/mapper/xyzzy1 -> ../dm-0
lrwxrwxrwx. 1 7 Jan 10 18:32 /dev/mapper/xyzzy2 -> ../dm-1

# dmsetup table
xyzzy1: 0 78639104 linear 202:16 2048
xyzzy2: 0 78645247 linear 202:16 78641153

# dmstats create /dev/mapper/xyzzy1 
xyzzy1: Created new region with 1 area(s) as region ID 0
# dmstats create /dev/mapper/xyzzy2 
xyzzy2: Created new region with 1 area(s) as region ID 0

# dmstats list
Name             RgID RgSta RgSize #Areas ArSize ProgID 
xyzzy1              0     0 37.50g      1 37.50g dmstats
xyzzy2              0     0 37.50g      1 37.50g dmstats

# dmstats report
Name             RgID ArID ArStart ArSize RMrg/s WMrg/s R/s  W/s  RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
xyzzy1              0    0       0 37.50g   0.00   0.00 0.00 0.00     0     0       0  0.00  0.00  0.00    0.00    0.00
xyzzy2              0    0       0 37.50g   0.00   0.00 0.00 0.00     0     0       0  0.00  0.00  0.00    0.00    0.00

Now it's just the same work as usual using these two new names; a simple dd is used below for testing.

# dd if=/dev/zero of=/dev/mapper/xyzzy1 bs=1024 count=100
100+0 records in
100+0 records out
102400 bytes (102 kB) copied, 0.0371653 s, 2.8 MB/s

# dd if=/dev/zero of=/dev/mapper/xyzzy2 bs=1024 count=100
100+0 records in
100+0 records out
102400 bytes (102 kB) copied, 0.0024724 s, 41.4 MB/s

# dmstats report
Name             RgID ArID ArStart ArSize RMrg/s WMrg/s R/s   W/s    RSz/s   WSz/s   AvgRqSz QSize Util% AWait RdAWait WrAWait
xyzzy1              0    0       0 37.50g   0.00   0.00 88.00  25.00 556.00k 100.00k   5.50k  0.18 12.30  1.58    1.47    2.00
xyzzy2              0    0       0 37.50g   0.00   0.00 63.00 200.00 455.50k 100.00k   2.00k  0.51 10.60  1.94    1.75    2.00

# dmstats delete /dev/mapper/xyzzy2 --allregions
# dmstats delete /dev/mapper/xyzzy1 --allregions
# dmsetup remove /dev/mapper/xyzzy2
# dmsetup remove /dev/mapper/xyzzy1


Linear Target

The linear target is the most basic, as shown above; however, in a slightly more complex example we can build our own LVM-like single filesystem that spans two physical block devices. The LVM subsystem at its core uses this linear methodology by default, however it contains many additional features (mapping UUIDs, maintaining block device lists and the tables, checksumming, management, etc.) which make it desirable in daily use.
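
On a host that already runs LVM you can see the same linear target underneath its logical volumes; the same dmsetup table command used above shows one or more linear (or striped) lines per LV, built exactly like the tables we are writing by hand here:

// Run on any LVM host; each line reads <lv name>: <start> <size> linear <major:minor> <offset>
# dmsetup table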

Two 75G block devices are presented to the host; to add a bit more complication and exemplify the math, each block device is given an empty GPT partition table to simulate not being able to use the end of the disk (the backup GPT data structures occupy the tail of the disk, leaving the last usable sector 34 sectors short of the end).

# parted /dev/xvdb mktable gpt
# parted /dev/xvdc mktable gpt

Next, we need to get the last usable sector of the disk and subtract our performance-oriented 2048s starting offset from it to get the size of the fully usable disk area; for this I prefer the sgdisk utility (part of the gdisk package):

# sgdisk -p /dev/xvdb | grep "last usable sector" | awk '{print $NF-2048}'
157284318

# sgdisk -p /dev/xvdc | grep "last usable sector" | awk '{print $NF-2048}'
157284318
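
Tying the numbers together: each 75G device is 157286400 sectors, the backup GPT structures leave 157286366 as the last usable sector, and removing the 2048s alignment offset yields the figure reported above:

// 157286400 - 34   = 157286366  (last usable sector reported by sgdisk)
// 157286366 - 2048 = 157284318  (sectors left for the linear map)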

Because this requires two lines to feed dmsetup (one line for each disk), we create the mapping in a text file:

// Table format used below is the same as the Basic example, but notice that the
// virtual start of the second disk equals the length of the first -
// sectors are 0-indexed, so the first segment covers virtual sectors 0 through 157284317

# cat linear.table 
0 157284318 linear /dev/xvdb 2048
157284318 157284318 linear /dev/xvdc 2048

Now it's just a matter of creating the map using the table and testing it out by making a filesystem and writing a file larger than any single physical device (below, 120G is used):

# dmsetup create xyzzy linear.table

# mkfs.ext4 -v /dev/mapper/xyzzy 
# mkdir /mnt/xyzzy
# mount /dev/mapper/xyzzy /mnt/xyzzy/

# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=240
240+0 records in
240+0 records out
128849018880 bytes (129 GB) copied, 367.6 s, 351 MB/s

# umount /mnt/xyzzy
# dmsetup remove xyzzy
# dmsetup create foobar linear.table 
# mount /dev/mapper/foobar /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 125829148
drwx------. 2        16384 Jan 10 21:16 lost+found
-rw-r--r--. 1 128849018880 Jan 10 21:25 testfile

Notice that the maps are disassembled and recreated as part of the testing to simulate what will happen when the server is rebooted - device maps live in memory only, so in real use startup/shutdown scripts would be required to implement the above correctly. We also gave the device map a completely different name the second time around, as a test, to show the name itself is arbitrary.
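
As a minimal sketch of such a startup hook - assuming systemd, with the table file kept at /etc/dmtables/linear.table (the path, the unit name and the dmsetup location are all illustrative and should be adjusted per distribution) - a oneshot unit can assemble the map before local filesystems are mounted and tear it down on shutdown:

# cat /etc/systemd/system/dm-xyzzy.service
[Unit]
Description=Assemble the xyzzy device-mapper map
DefaultDependencies=no
Before=local-fs.target

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/sbin/dmsetup create xyzzy /etc/dmtables/linear.table
ExecStop=/usr/sbin/dmsetup remove xyzzy

[Install]
WantedBy=local-fs.target

The matching /etc/fstab entry would then reference /dev/mapper/xyzzy as usual (adding x-systemd.requires=dm-xyzzy.service to its options keeps the ordering explicit).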


Striped Target

The striped target is the basis of software RAID0 and can be used with LVM. Using the same techniques as the linear target, we'll build a simple striped target across our two physical block devices with the intent of increasing performance (and we'll add dmstats so we can measure it).

First, we have to do a bit of math; with striping, data is written in chunks whose size must be a power of 2 and is typically chosen to suit the data. A chunk of 256k is very common on physical RAID controllers; dmsetup takes the chunk size in 512-byte sectors, and here we'll use a 256-sector (128k) chunk per disk, giving a 256k full stripe across the two devices. Given that, we must ensure the target's size is an exact multiple of the 512-sector full stripe, so 512 is our divisor.

In order to determine the largest size we can make the striped target, we take the combined usable size of both disks (in sectors), divide it by 512, take the floor() of that value and re-multiply by 512 (in layperson's terms: divide the size by 512, throw away the remainder and re-multiply by 512 to get the perfect multiple). For this we'll use a bc function:

# bc

// add both usable sizes to get one large size for striping
157284318*2
314568636

// now divide by 512, throw away the remainder, re-multiply by 512
define floor(x) {
  auto os,xx;os=scale;scale=0
  xx=x/1;if(xx>x).=xx--
  scale=os;return(xx)
}
floor(314568636/512)*512
314568192
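
The same rounding can be done with plain shell integer arithmetic, since integer division already discards the remainder:

# echo $(( 314568636 / 512 * 512 ))
314568192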

Armed with this perfect multiple of 512 sectors (one full 256k stripe), build a striped map. Create the device as before, and this time we'll create 2 dmstats areas so that we can compare/contrast the performance of each half of the map. Notice that because we have two identically sized devices, the dmstats --areas 2 usage splits the region evenly for us so we don't have to define each area by hand:

# cat striped.table
0 314568192 striped 2 256 /dev/xvdb 2048 /dev/xvdc 2048

# dmsetup create xyzzy striped.table 

# dmstats create xyzzy --areas 2
xyzzy: Created new region with 2 area(s) as region ID 0

# dmstats list
Name             RgID RgSta RgSize  #Areas ArSize ProgID 
xyzzy               0     0 150.00g      2 75.00g dmstats

# dmstats report
Name             RgID ArID ArStart ArSize RMrg/s WMrg/s R/s  W/s  RSz/s WSz/s AvgRqSz QSize Util% AWait RdAWait WrAWait
xyzzy               0    0       0 75.00g   0.00   0.00 0.00 0.00     0     0       0  0.00  0.00  0.00    0.00    0.00
xyzzy               0    1  75.00g 75.00g   0.00   0.00 0.00 0.00     0     0       0  0.00  0.00  0.00    0.00    0.00

Now that statistics gathering is in place, create the filesystem and write data as per our normal testing plan:

# mkfs.ext4 -v /dev/mapper/xyzzy 
# mount /dev/mapper/xyzzy /mnt/xyzzy/
# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=240
240+0 records in
240+0 records out
128849018880 bytes (129 GB) copied, 224.258 s, 575 MB/s

# dmstats report
Name             RgID ArID ArStart ArSize RMrg/s WMrg/s R/s    W/s        RSz/s   WSz/s  AvgRqSz QSize    Util%  AWait RdAWait WrAWait
xyzzy               0    0       0 75.00g   0.00   0.00 185.00 1786193.00   1.13m 71.35g  41.50k 23550.26 100.00 13.18   30.79   13.18
xyzzy               0    1  75.00g 75.00g   0.00   0.00  61.00 1257156.00 244.00k 51.14g  42.50k 17752.53 100.00 14.12   35.92   14.12

# umount /mnt/xyzzy
# dmsetup remove xyzzy
# dmsetup create xyzzy striped.table 
# mount /dev/mapper/xyzzy /mnt/xyzzy
# ls -og /mnt/xyzzy
total 125829148
drwx------. 2        16384 Jan 10 22:28 lost+found
-rw-r--r--. 1 128849018880 Jan 10 22:33 testfile

Based on the data above, the first half of the map performed noticeably better than the second; this is a public cloud instance, so it's likely the two block volumes are served by different back-end cloud hosts via iSCSI (keep in mind that the striped chunks alternate across both disks, so each dmstats area covers half of the logical device rather than a single physical disk). This exemplifies the risk of combining such objects with LVM in the cloud: performance characteristics will vary from block device to block device in this environment. What we did see, though, was raw write throughput rise from roughly 351 MB/s with the linear map to 575 MB/s here - about a 1.6x improvement from striping.


Mirror Target

The mirror target is arguably the most difficult to construct; in essence it's a RAID1 (and, again, can also be used via LVM), but it requires a log device (similar to a classic filesystem journal, if you will - a space to record metadata about writes). To build this example we're going to create disk partitions and use a technique sometimes described as "mirroring the mirror": the log device is itself a small mirror, so each leg carries its own copy of the log.

First we'll prep two partitions on each physical block device: one to store the log data on disk - so that after a reboot or a recreate, the mirror still has its log on each leg and doesn't have to be bootstrapped (fully resynced) again - and one to store the data. They will be exactly the same on both disks, as this is a mirror configuration.

# sgdisk -Z /dev/xvdb 
# parted /dev/xvdb mktable gpt
# parted /dev/xvdb mkpart primary ext3 2048s 18432s
# parted /dev/xvdb mkpart primary ext3 20480s 100%

// Note that the sizes chosen here are somewhat arbitrary; a few MiB for the log partition is plenty
# sgdisk -Z /dev/xvdc
# parted /dev/xvdc mktable gpt
# parted /dev/xvdc mkpart primary ext3 2048s 18432s
# parted /dev/xvdc mkpart primary ext3 20480s 100%

# parted /dev/xvdb unit s print
[...]
Number  Start   End         Size        File system  Name     Flags
 1      2048s   18432s      16385s                   primary
 2      20480s  157284351s  157263872s               primary

Now we need to build the virtual maps for these devices; in a sense it's like connecting two linear targets together, but with the extra mirror parameters outlined in the dmsetup man page. The log device itself uses a "core" (in-memory) log, while the data mirror uses a "disk" log so its region state is pushed out to the mirrored log device. Notice that since partitions are already defined, the offset is 0 (beginning of partition) for each virtual map.

// Table format used (core log):
//  <virtual start> <virtual size> mirror core <#log params = 1> <region size> <#devs = 2> <dev1> <offset1> <dev2> <offset2> <#features = 1> <feature = handle_errors>

# cat mirror-log.table
0 8192 mirror core 1 1024 2 /dev/xvdb1 0 /dev/xvdc1 0 1 handle_errors

// Table format used (disk log):
//  <virtual start> <virtual size> mirror disk <#log params = 2> <log device> <region size> <#devs = 2> <dev1> <offset1> <dev2> <offset2> <#features = 1> <feature = handle_errors>

# cat mirror-data.table 
0 157263872 mirror disk 2 /dev/mapper/xyzzy-log 1024 2 /dev/xvdb2 0 /dev/xvdc2 0 1 handle_errors

Now we create the device maps using these tables, ensuring the log is created first:

# dmsetup create xyzzy-log mirror-log.table 
# dmsetup create xyzzy-data mirror-data.table 
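
Before leaning on the mirror it's worth watching the initial resync; dmsetup status on the data map reports the in-sync versus total region counts and the health of each leg (the exact field layout varies by kernel version):

// Repeat until the synced/total counts match and both legs report healthy
# dmsetup status xyzzy-data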

# mkfs.ext4 -v /dev/mapper/xyzzy-data 
# mount /dev/mapper/xyzzy-data /mnt/xyzzy/
# dmstats create xyzzy-log
# dmstats create xyzzy-data
# dd if=/dev/zero of=/mnt/xyzzy/testfile bs=512M count=120
120+0 records in
120+0 records out
64424509440 bytes (64 GB) copied, 235.429 s, 274 MB/s

# dmstats report
Name             RgID ArID ArStart ArSize RMrg/s WMrg/s R/s   W/s        RSz/s   WSz/s   AvgRqSz QSize     Util%  AWait  RdAWait WrAWait
xyzzy-log           0    0       0  4.00m   0.00   0.00  0.00   35357.00       0 690.57m  20.00k     42.16 100.00   1.19    0.00    1.19
xyzzy-data          0    0       0 74.99g   0.00   0.00 47.00 1486089.00 188.00k  60.23g  42.00k 163854.02 100.00 110.26   66.28  110.26

# umount /mnt/xyzzy
# dmsetup remove xyzzy-data
# dmsetup remove xyzzy-log
# dmsetup create xyzzy-log mirror-log.table 
# dmsetup create xyzzy-data mirror-data.table 
# mount /dev/mapper/xyzzy-data /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2       16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile

# umount /mnt/xyzzy/
# dmsetup remove xyzzy-data
# dmsetup remove xyzzy-log

# mount /dev/xvdb2 /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2       16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile

# umount /mnt/xyzzy/
# mount /dev/xvdc2 /mnt/xyzzy/
# ls -og /mnt/xyzzy/
total 62914584
drwx------. 2       16384 Jan 10 23:39 lost+found
-rw-r--r--. 1 64424509440 Jan 10 23:45 testfile

Notice that we also test destroying the device maps and then manually mounting and verifying each raw mirror leg, as might be needed in a real situation where one member of the mirror goes offline.
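
Rather than mounting a raw leg directly, a degraded map can also be stood up so that the /dev/mapper name stays stable for fstab entries and scripts; a minimal sketch, assuming /dev/xvdb2 is the surviving leg (rebuilding the mirror onto a replacement disk would still need to be handled separately):

// Present the surviving leg under the old name via a plain linear map
# dmsetup create xyzzy-data --table "0 157263872 linear /dev/xvdb2 0"
# mount /dev/mapper/xyzzy-data /mnt/xyzzy/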

