LVM Mechanics



Overview

After understanding the x86 storage design and partitioning, the next step is LVM - Logical Volume Management. Generically, LVM is the concept of taking individual pieces of storage (whole disks or individual partitions) and combining them together with a layer of software to make the group appear as one single entity. This single entity can then be sub-divided into logical filesystems and treated individually, even though they share the same physical storage.

While we use the short phrase LVM in practice, technically we are referring to LVM2 as opposed to LVM1. LVM1 has not shipped in modern distros for quite some time, having been superseded once the Device Mapper infrastructure was introduced to the kernel. LVM2 utilizes device-mapper fully, unlike LVM1 - only LVM2 is discussed as LVM here.

LVM has many configuration options; this article is an introduction to the overall LVM world and does not cover all the ways lvm.conf can be tuned. See man 5 lvm.conf for more information.


Best Practices

A few rules to live by in the LVM world:

  • Name the Volume Group with a name that represents the design - for example vglocal01, vgiscsi05, vgsan00, vgraid5, etc.
  • Never combine two disparate objects together - for example, do not combine local (in the chassis) storage with remote (iSCSI/SAN/DAS) storage
  • Never combine different performance tiers - for example, do not combine a RAID-1 array and RAID-5 array in the same group
  • Never combine non-partitioned and partitioned devices - this could lead to performance issues or end-user confusion in the future


Device Mapper

At the heart of the system is the Device Mapper kernel-level infrastructure introduced with the 2.6.x kernel; it's a kernel framework for mapping block devices to virtual block devices. Not only is it the underpinning of LVM2, but also of RAID (dm-raid), cryptsetup (dm-crypt) and others like dm-cache. It's also the component that provides the snapshot feature for LVM.

Normally ioctls (I/O controls) are sent to the block device itself; within the DM world there is a special endpoint, /dev/mapper/control, that is used instead - all ioctls are sent to this device node. The userspace tool dmsetup can be used to manually investigate and manipulate the device-mapper subsystem; dmsetup ls is a common command techs use to quickly review device maps.

With LVM this is exactly what we're doing - creating virtual block devices on top of physical block devices.


kpartx

While not technically LVM related, kpartx deserves a mention here - where the partx tool is designed to read partition tables and create the proper device nodes, the kpartx tool reads partition tables and creates device maps over the partition segments detected. This tends to come into play more when using Multipath than LVM, however be aware there are two tools to perform these functions, each in its own discrete fashion.
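
As a rough illustration of the difference (the multipath device name mpatha and the disk /dev/sdb are only examples), the two tools are invoked in a similar fashion:

Create kernel partition device nodes from a disk's partition table:
# partx -a /dev/sdb

Create device-mapper maps over the partitions of a multipath device:
# kpartx -a /dev/mapper/mpatha

List the maps kpartx would create without actually creating them:
# kpartx -l /dev/mapper/mpatha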


LVM Components

LVM is the name/acronym of the entire puzzle; there are three discrete components that make it all work:

Physical Volume (PV) - A physical volume is the actual storage space itself as a single item. It can be the whole object (entire local drive, entire SAN LUN, etc.) or an individual partition on that storage device. In the latter case, each partition is its own discrete physical volume.
Volume Group (VG) - A volume group is a collection of physical volumes treated as a single entity. It is used as a single block of storage to carve up into logical volumes. Space from one logical volume can be transferred to another logical volume within the same group, filesystem type permitting.
Logical Volume (LV) - A logical volume is an area of space - akin to a partition on a drive - that is used to hold the filesystem. A logical volume cannot span volume groups, and is the object which is manipulated with userspace tools like mount, umount, cryptsetup, etc.
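
As a minimal sketch of how the three components stack, assuming a spare partition /dev/xvdb1 and purely illustrative names and sizes:

# pvcreate /dev/xvdb1
# vgcreate vglocal /dev/xvdb1
# lvcreate -L 10G -n lvtest vglocal
# mkfs.ext4 /dev/vglocal/lvtest
# mount /dev/vglocal/lvtest /mnt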


Physical Volumes

A PV is the most direct part of the LVM puzzle, and the most critical. The PV first has to be created using the command pvcreate - what this does is write a metadata block starting at the 2nd sector and extending up to 1MB, at the beginning of the partition (on a partitioned device) or of the device itself if it is not partitioned. When scanning for metadata the LVM subsystem reads this data to determine all the information it needs - for example, on a 512-byte sector drive the first 2048 sectors might be scanned to try and locate the PV metadata.

The pvcreate utility provides basic protection - if a partition table is present, even with no partitions defined, it will refuse to run; the partition table (whether MBR or GPT) must be zapped before a whole device can be used as a PV without partitions. Otherwise a partition must be created and pvcreate run on that partition - but be careful: if a device is already a whole-disk PV, it is still possible to use a tool like fdisk to write a partition table into its first 512-byte sector afterwards!
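
For example, to use a whole device that currently carries an unwanted (empty) partition table, something along these lines is needed before pvcreate will succeed - the device name is hypothetical and wipefs destroys the existing table:

# wipefs -a /dev/xvdc
# pvcreate /dev/xvdc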

The metadata area starts with the string LABELONE followed by several groups of data:

  • Physical volume UUID
  • Size of block device in bytes
  • NULL-terminated list of data area locations
  • NULL-terminated list of metadata area locations
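
A quick, read-only way to peek at this label on an existing PV (the device name is just an example) is to dump the second sector and look for the LABELONE string:

# dd if=/dev/xvdb1 bs=512 skip=1 count=1 2>/dev/null | strings | head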

What's important to understand is that the concepts of the higher-level VG and LV are stored in this PV metadata. This protects the VG and LV from activating if one of the PVs in the group is missing - because every PV carries a complete view of the objects, the group is self-referential and self-sufficient. Because this metadata also stores the location of data as it's written, it's critical that all items be present before the group becomes available for use to the end user.

This design makes the PVs themselves portable as a group - a group of disks can be removed from one server and presented to another, and so long as all PVs are present and the metadata is intact that LVM group can simply be activated on the new server with ease.
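
A minimal sketch of bringing such a group up on the receiving server, assuming the disks are already visible to the kernel and using an illustrative VG/LV name:

# pvscan
# vgscan
# vgchange -ay vgsan00
# mount /dev/vgsan00/lvdata /mnt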

An example of the PV-stored LVM metadata can be found at the end of this page.


Whole Device vs Partition

The decision whether or not to use partitions with a PV has only one concrete advantage, but several disadvantages in practice. So long as the alignment to storage is correct (radial geometry) there is no performance gain or loss in using either method. The concrete advantage to using an entire device for the PV is the ability to expand that PV using pvresize at a later date. This simplifies the work in expanding the underlying storage and increasing the PV/VG/LV sizes.
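
For instance, after the underlying LUN or virtual disk has been grown, a whole-device PV can be expanded in place (the device name is illustrative):

# pvresize /dev/xvdb
# pvs /dev/xvdb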

The practical disadvantages to using an entire device all revolve around visibility. For example, when a tech uses fdisk/parted/gdisk and does not see a partition they may be inclined to think the drive is unused; this can result in accidentally adding a partition table to a device already in a VG. The boot device, which does need an MBR/GPT to operate, cannot be used as a whole-disk PV, so if other PVs will be added later it's considered bad practice to combine partitioned and non-partitioned PVs in the same VG. While not technical disadvantages, these considerations should be taken into account when setting up the PV.


Volume Groups

A volume group is the abstraction layer that sits between the PV and the LV - its role is to combine and hide the physical block devices, presenting one picture of unified storage to the LVs on top. The VG operates on Physical Extents (PEs) - think of these as blocks of data of a given size, where 4 MiB is the default size when creating the VG. Much like the physical sectors of a disk, the PE is treated as a unit - data is read/written to it as one chunk, and moving it is handled in the same chunk. During vgcreate a different size can be chosen depending on the expected workload - the minimum is 1 KiB and the size must be a power of 2.
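
A hedged example of picking a non-default PE size at creation time - the 32 MiB value, VG name and PV are purely illustrative:

# vgcreate -s 32M vgdata00 /dev/xvdc1
# vgdisplay vgdata00 | grep "PE Size"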

The default VG allocation policy when writing PEs is normal - this has some basic intelligence built in, for instance preventing parallel stripes from being placed on the same PV. This can be changed to other policies - for example, the contiguous policy requires new PEs to be placed directly after existing PEs, while the cling policy places new PEs on the same PV as existing PEs in the same stripe of the LV. Note that these allocation policies are not the same thing as the LV segment types.
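
The policy can also be changed on an existing VG; for example, hypothetically switching a VG to the cling policy - the policy letter shows up in the Attr column of vgs, the 'n' in wz--n- being normal:

# vgchange --alloc cling vglocal
# vgs -o vg_name,vg_attr vglocal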

A single VG can span many, many PVs; however, a VG cannot be combined with another VG - a VG's size is therefore bounded by the PVs underneath it and how they're used by the LVs on top. The VG can be expanded or reduced by adding/removing PVs or by expanding/reducing the existing PVs.

Migrating Volume Groups

Volume Groups are independent of the system itself, provided the VG is not the container for the root filesystem of the server. They can be exported from one system, physically moved, then imported on another system. The vgexport command clears the VG metadata hostname and marks the group as exported from the current system, while the vgimport command sets the VG metadata hostname to the current system so it can be activated. The VG should be deactivated with vgchange first to ensure it's unmounted and not in use - as the example below shows, vgexport refuses to run against an active VG.

# vgs vglocal
  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   1   1   0 wz--n- 50.00g    0 

# vgexport vglocal
  Volume group "vglocal" has active logical volumes

# vgchange -an vglocal
  0 logical volume(s) in volume group "vglocal" now active

# vgexport vglocal
  Volume group "vglocal" successfully exported

# vgimport vglocal
  Volume group "vglocal" successfully imported


Logical Volumes

The logical volume is the top-most container, segmenting a given amount of space from the underlying VG; an LV is restricted to the single VG it sits on top of - an LV cannot span two or more VGs for increased space. To increase space in an LV, the underlying VG has to be increased first. The LV acts as the final virtual block device endpoint of the device mapper design - this container is what is used with tools like mkfs.ext4, mount and so forth. It acts and reacts just like a real block device for all intents and purposes, save that it is more like a single partition than a whole device (it doesn't use an MBR/GPT table).

The LV can be manipulated in two primary ways - via the /dev/vgname/lvname symlink or the /dev/mapper/vgname-lvname symlink. Using either is fine since they point to the same actual device mapper node entry in the /dev/ tree that corresponds to the virtual block device:

# ls -l /dev/vglocal/lvtest /dev/mapper/vglocal-lvtest 
lrwxrwxrwx 1 root root 7 May  8 19:54 /dev/mapper/vglocal-lvtest -> ../dm-0
lrwxrwxrwx 1 root root 7 May  8 19:54 /dev/vglocal/lvtest -> ../dm-0

# lvdisplay | awk '/LV Name/{n=$3} /Block device/{d=$3;sub(".*:","dm-",d);print d,n;}'
dm-0 lvtest

So our actual DM node is /dev/dm-0 - this should never be used directly since it can change after a reboot, for instance; always use the symlink names instead for maximum resilience to change. These nodes use the block device major number 253 in Linux, and the minor number is simply the order in which the device was discovered/added by the kernel. They can be examined with the dmsetup tool as outlined above:

# dmsetup ls
vglocal-lvtest	(253:0)

# dmsetup info vglocal-lvtest
Name:              vglocal-lvtest
State:             ACTIVE
Read Ahead:        256
Tables present:    LIVE
Open count:        0
Event number:      0
Major, minor:      253, 0
Number of targets: 1
UUID: LVM-c1MITBqcCORe5icvRwAhlAQUJvVceVDfQXSRQz0T42vcwnPKbotggmXwxrTWB1l5

Logical volumes can be created in a number of different modes that might look familiar: linear, striped and mirrored are the three most common. The default mode is linear - use the space from beginning to end as a whole. Striped and mirrored are exactly like basic RAID - both require a minimum of 2 PVs and write across them like RAID-0 and RAID-1 respectively. Other modes exist, one of which is the snapshot - the usage of striped and mirrored LVs is not covered in depth here as they tend to be specific, use-case oriented solutions.
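
For reference only, hypothetical creation commands for those two modes - both assume at least two PVs in vglocal, and the sizes/names are illustrative:

Striped across 2 PVs with a 64 KiB stripe size:
# lvcreate -i 2 -I 64 -L 20G -n lvstripe vglocal

Mirrored with one additional copy:
# lvcreate -m 1 -L 20G -n lvmirror vglocal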


LV Sizing Methods

A note here about specifying the size of the logical volume to be created, extended or reduced: two commandline options exist to perform the same work but tend to be confusing. Think of them this way:

  • -l - dynamic math: "100%VG", "+90%VG" and so forth
  • -L - absolute math: "100G", "+30G" and so forth

Thinking of these flags in this manner aids their usage later - the same operation can often be performed with either one, but it may be easier with one or the other depending on the exact situation. By default, -l with no qualifier uses physical extents (PEs) as its unit, which is handy if you need to move an exact number of physical extents around for the task at hand.
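
As a hypothetical illustration, the following two commands land in the same place if the VG happens to have exactly 20 GiB free - the first uses the dynamic form, the second the absolute form:

# lvextend -l +100%FREE /dev/vglocal/lvtest
# lvextend -L +20G /dev/vglocal/lvtest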


LVM Filters

One of the more critical parts to using LVM within environments which contain multiple HA paths to the storage is setting up LVM filters to ignore the individual paths and only respect the meta (pseudo) path to the storage, whether that be SAN, DAS or iSCSI in nature. If the filters are not set correctly the underpinnings of LVM will use a single path by name – if that path dies, LVM dies.

The way filters are written is simple - "add these, remove these" in nature. Looking at a few examples reveals the concepts used in /etc/lvm/lvm.conf:

# A single /dev/sda internal device, PowerPath devices:
filter = [ "a|^/dev/sda[0-9]+$|", "a|^/dev/emcpower|", "r|.*|" ]

# Two internal devices, /dev/sda and /dev/sdb, and PowerPath devices:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/emcpower|", "r|.*|" ]

# Two internal devices, /dev/sda and /dev/sdb, and Device Mapper Multipath devices:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/mapper/mpath|", "r|.*|" ]

# Two internal devices, Device Mapper Multipath and PowerPath devices all at once:
filter = [ "a|^/dev/sd[ab][0-9]+$|", "a|^/dev/mapper/mpath|", "a|^/dev/emcpower|", "r|.*|" ]

Note in the above examples that regular expressions can be used for configuration.
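
After adjusting the filter it's worth confirming that only the intended devices are being scanned - lvmdiskscan and pvs are quick checks; if the root filesystem lives on LVM, the initramfs may also need to be rebuilt so early boot honors the same filter:

# lvmdiskscan
# pvs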


LVM Snapshots

A snapshot is a special type of LV that uses a method called Copy on Write (CoW) to store a point-in-time view of the source LV plus all changes since that point. A source LV and the name of the new snapshot LV are specified during creation; the snapshot LV must exist in the same VG as the source LV. The snapshot LV only needs to be large enough to hold the changes made after the time it was taken - it does not need to be the same size as the source. The VG must have this space free - it cannot already be in use by any LV.

When the snapshot LV is created, essentially a copy-on-write exception table is set up against the source LV's extents - hence the need for the source and snapshot LVs to exist within the same VG. From that point all new changes are recorded in the CoW table, to either be discarded or applied depending on usage. However, be aware that a lot of magic happens under the hood to support this! It's not a simple LV - let's take a look:

# lvremove /dev/mapper/vglocal-lvtest 
# lvcreate -l 50%VG -n lvtest vglocal
# lvcreate -L 10G -s -n lvsnap /dev/vglocal/lvtest 

# lvs
  LV     VG      Attr       LSize  Pool Origin Data%
  lvsnap vglocal swi-a-s--- 10.00g      lvtest   0.00                          
  lvtest vglocal owi-a-s--- 50.00g

# dmsetup ls
vglocal-lvsnap-cow	(253:3)
vglocal-lvsnap	(253:1)
vglocal-lvtest	(253:0)
vglocal-lvtest-real	(253:2)

Notice how there are two additional device maps - "vglocal-lvsnap-cow" and "vglocal-lvtest-real" - used behind the scenes to store and work with the CoW changes to the source volume that occur while the snapshot is alive. If the snapshot fills up with changes and flips to read-only mode, it can be a bit of an ordeal to get the snapshot fully released if something goes wrong within LVM, so proper planning should be taken to remove the snapshot in a timely fashion or to plan for its expected growth.
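
To stay ahead of that, keep an eye on the Data% column of lvs and grow the snapshot before it fills (the extra 5 GiB here is arbitrary); lvm.conf also offers snapshot autoextend settings in the activation section that can automate this:

# lvs vglocal
# lvextend -L +5G /dev/vglocal/lvsnap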


Reverting Snapshots

It is possible to roll back changes made to the original logical volume (lvtest) by merging the original LV extents from the CoW volume back onto the origin volume, provided that the snapshot-merge target is available.

Check if supported by kernel
# dmsetup targets | grep snapshot-merge
snapshot-merge   v1.1.0

This operation is seamless to the user and starts automatically when the origin (lvtest) and snapshot (lvsnap) volumes are activated but not opened. If either the origin or snapshot volumes are opened, the merge operation is deferred until the next time both volumes are activated. As soon as the merge operation starts, the origin volume can be opened and the filesystem within it mounted.

From this point, all read and write operations to the origin volume are seamlessly routed to the correct logical extents (at the start of the merge operation, these would be the original extents on lvsnap-cow and the unchanged extents on lvtest-real) until the merge is complete. The lvsnap-cow, lvsnap and lvtest-real volumes are then removed from the system.

Following the lvtest/lvsnap example above, the following command would start the merge/rollback operation:

# lvconvert --merge /dev/vglocal/lvsnap
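
In practice, if lvtest is mounted the merge is simply deferred as described above; a hypothetical sequence to have it start immediately, with lvs used to keep an eye on things (the mountpoint is illustrative):

# umount /mnt/test
# lvconvert --merge /dev/vglocal/lvsnap
# lvs vglocal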


Selected Examples

Expand VG and LV

One of the more common scenarios: your boot disk has two partitions. Partition 1 is a non-LVM /boot, and partition 2 holds the LVM-based / (root) filesystem. You have run out of space and wish to add more - the new space can be either a new partition on the same storage device that was just expanded (SAN/DAS LUN, VMware vDisk, etc.) or an entirely new device and partition.

After creating your new partition and using pvcreate on it, review the mission goal - we're adding the new space from xvdb3 to the VG, growing the LV and resizing the ext4 filesystem.
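
That preparation might look roughly like this, assuming the new space becomes partition 3 on /dev/xvdb - the partition boundaries and tooling will vary by environment:

# parted /dev/xvdb -- mkpart primary 11GiB 100%
# parted /dev/xvdb -- set 3 lvm on
# partprobe /dev/xvdb
# pvcreate /dev/xvdb3

With the new PV in place, check the starting state: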

# pvs; echo; vgs; echo; lvs
  PV         VG      Fmt  Attr PSize  PFree 
  /dev/xvdb2 vglocal lvm2 a--  10.00g     0 
  /dev/xvdb3         lvm2 a--  20.00g 20.00g

  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   1   1   0 wz--n- 10.00g    0 

  LV     VG      Attr       LSize
  lvroot vglocal -wi-a----- 10.00g

First, add the new PV into the VG. Then grow the LV with the newly added space. Lastly grow the ext4 filesystem itself:

# vgextend vglocal /dev/xvdb3 
  Volume group "vglocal" successfully extended

# lvextend -l +100%FREE /dev/vglocal/lvroot 
  Extending logical volume lvroot to 29.99 GiB
  Logical volume lvroot successfully resized

# resize2fs /dev/vglocal/lvroot 
  Resizing the filesystem on /dev/vglocal/lvroot to 7862272 (4k) blocks.
  The filesystem on /dev/vglocal/lvroot is now 7862272 blocks long.

Check our work again:

# pvs; echo; vgs; echo; lvs
  PV         VG      Fmt  Attr PSize  PFree
  /dev/xvdb2 vglocal lvm2 a--  10.00g    0 
  /dev/xvdb3 vglocal lvm2 a--  20.00g    0 

  VG      #PV #LV #SN Attr   VSize  VFree
  vglocal   2   1   0 wz--n- 29.99g    0 

  LV     VG      Attr       LSize
  lvroot vglocal -wi-a----- 29.99g


Migrate PVs

The scenario: an existing LV contains a PV we wish to replace - this could be for migrating from one type of storage to another for instance, or replacing several small PVs with one large PV for better performance at the storage side. The pvmove command is used, and the PV being added must be at least as large as the one being removed!

Existing LV has one PV in VG "vglocal" of 9.3 GiB in size:

# pvs
  PV         VG      Fmt  Attr PSize  PFree 
  /dev/xvdb1 vglocal lvm2 a--   9.31g     0 
  /dev/xvdb2         lvm2 a--  10.00g 10.00g

We will replace xvdb1 with xvdb2 - note how it's 10 GiB, at least as large as the one being replaced. After the VG is extended to add the second PV, we check again and see that it has been added but all the PEs (PFree) from xvdb2 are still unused. Do not extend the LV on top of the VG - the new PV must show as free in order to use it as a migration target.

# vgextend vglocal /dev/xvdb2

# pvs
  PV         VG      Fmt  Attr PSize  PFree 
  /dev/xvdb1 vglocal lvm2 a--   9.31g     0 
  /dev/xvdb2 vglocal lvm2 a--  10.00g 10.00g

Now we move all the PEs from xvdb1 to xvdb2 with a few commandline options to show verbose info and a progress update every 5 seconds. After all the PEs have been moved to xvdb2 we do a quick check again, then if all looks kosher we remove the old PV:

# pvmove -v -i5 /dev/xvdb1 /dev/xvdb2 

# pvs
  PV         VG      Fmt  Attr PSize  PFree  
  /dev/xvdb1 vglocal lvm2 a--   9.31g   9.31g
  /dev/xvdb2 vglocal lvm2 a--  10.00g 704.00m

# vgreduce vglocal /dev/xvdb1

# pvs
  PV         VG      Fmt  Attr PSize  PFree  
  /dev/xvdb1         lvm2 a--   9.31g   9.31g
  /dev/xvdb2 vglocal lvm2 a--  10.00g 704.00m


LVM Metadata Example

Using the lvmdump -m command is the easiest way to extract the metadata from all the PVs on the system; here is an example of the data with basic formatting added (spaces/indents, etc.) for easier readability. Note that the metadata area stores rolling revisions of the changes made, which can be useful in a given situation to determine what has transpired.

LABELONE LVM2 001wTDfgkU6aRyAwCheopo1LeCEFWWodQbd

vglocal {
 id = "7PHX1A-PJ0n-fgdv-qRup-In2G-dah1-iOgWm4"
 seqno = 2
 format = "lvm2" # informational
 status = ["RESIZEABLE", "READ", "WRITE"]
 flags = []
 extent_size = 8192
 max_lv = 0
 max_pv = 0
 metadata_copies = 0

 physical_volumes {
  pv0 {
   id = "wTDfgk-U6aR-yAwC-heop-o1Le-CEFW-WodQbd"
   device = "/dev/xvdb1"
   status = ["ALLOCATABLE"]
   flags = []
   dev_size = 209711104
   pe_start = 2048
   pe_count = 25599
  }
 }

 logical_volumes {
  lvtest {
   id = "ZSHT4d-K4lc-pUma-6UtB-vJ9e-9jox-hRTibF"
   status = ["READ", "WRITE", "VISIBLE"]
   flags = []
   creation_host = "r7rc-ha"
   creation_time = 1399583935
   segment_count = 1
   segment1 {
    start_extent = 0
    extent_count = 12800
    type = "striped"
    stripe_count = 1        # linear
    stripes = [
     "pv0", 0
    ]
   }
  }
 }
}

# Generated by LVM2 version 2.02.100(2)-RHEL6 (2013-10-23): Thu May  8 21:18:55 2014
contents = "Text Format Volume Group"
version = 1
description = ""
creation_host = "localhost"     # Linux localhost 2.6.32-431.11.2.el6.x86_64 #1 SMP Tue Mar 25 19:59:55 UTC 2014 x86_64
creation_time = 1399583935      # Thu May  8 21:18:55 2014

