Device Mapper Multipath

From trapsink.com


Overview

The device-mapper-multipath subsystem (a sub-component of device-mapper) is the native way of configuring 2 or more individual paths to the same storage LUN, typically in an HA (failover) capacity. If one underlying path fails, the system transfers I/O to another path; higher-level layers (such as LVM) use the single multipath pseudo device and are abstracted from the underlying physical links.


Initial Setup

A standard setup requires 2 RPMs which provide the multipathd service and udev rules for naming the multipaths:

  1. device-mapper
  2. device-mapper-multipath

For a Dell DAS such as the MD32xx, 2 more packages are required, typically from the vendor install media:

  1. dkms (Dynamic Kernel Module Support - framework required for the below RPM)
  2. scsi_dh_rdac (Dell custom version, the kernel also contains one)

The multipathd service ties it all together: it monitors the paths and fails or reinstates them as their state changes.
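
The setup above can be sketched as a small script. This is only a sketch, assuming the stock RHEL/CentOS 5 or 6 tooling (yum, chkconfig, service); the run and setup_multipath helpers are hypothetical names, and by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: prints the setup commands by default.
# Set DRY_RUN=0 (as root) to actually run them.
# Assumes RHEL/CentOS 5 or 6 tooling: yum, chkconfig, service.
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: echo the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Install the two standard RPMs, then enable and start multipathd.
setup_multipath() {
  run yum -y install device-mapper device-mapper-multipath
  run chkconfig multipathd on
  run service multipathd start
}

setup_multipath
```

The Dell-specific packages (dkms, scsi_dh_rdac) are left out on purpose since they normally come from the vendor media rather than a yum repository.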


DAS Config

A well-formed config for a deployed Dell MD32xx DAS might look like:

# DAS /etc/multipath.conf
blacklist {
  device {
    vendor  "*"
    product "Universal Xport"
  }
  device {
    vendor  "*"
    product "MD3000"
  }
  device {
    vendor  "*"
    product "MD3000i"
  }
  device {
    vendor  "*"
    product "Virtual Disk"
  }
  device {
    vendor "*"
    product "PERC|Perc"
  }
}
defaults {
  user_friendly_names  yes
  max_fds              8192
  polling_interval     5
}
devices {
  device {
    vendor                "DELL"
    product               "MD32xxi"
    path_grouping_policy  group_by_prio
    prio                  rdac
    path_checker          rdac
    path_selector         "round-robin 0"
    hardware_handler      "1 rdac"
    failback              immediate
    features              "2 pg_init_retries 50"
    no_path_retry         30
    rr_min_io             100
  }
  device {
    vendor                "DELL"
    product               "MD32xx"
    path_grouping_policy  group_by_prio
    prio                  rdac
    path_checker          rdac
    path_selector         "round-robin 0"
    hardware_handler      "1 rdac"
    failback              immediate
    features              "2 pg_init_retries 50"
    no_path_retry         30
    rr_min_io             100
  }
  device {
    vendor                "DELL"
    product               "MD36xxi"
    path_grouping_policy  group_by_prio
    prio                  rdac
    path_checker          rdac
    path_selector         "round-robin 0"
    hardware_handler      "1 rdac"
    failback              immediate
    features              "2 pg_init_retries 50"
    no_path_retry         30
    rr_min_io             100
  }
  device {
    vendor                "DELL"
    product               "MD36xxf"
    path_grouping_policy  group_by_prio
    prio                  rdac
    path_checker          rdac
    path_selector         "round-robin 0"
    hardware_handler      "1 rdac"
    failback              immediate
    features              "2 pg_init_retries 50"
    no_path_retry         30
    rr_min_io             100
  }
}
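
The four device stanzas above differ only in the product string, so if you maintain several of these hosts it may be easier to generate them. A minimal sketch, with the settings copied verbatim from the config above; dell_md_stanza is a hypothetical helper name:

```shell
#!/bin/sh
# dell_md_stanza PRODUCT: print one device stanza for the given product string.
# All settings are copied from the DAS config above; only the product varies.
dell_md_stanza() {
  cat <<EOF
  device {
    vendor                "DELL"
    product               "$1"
    path_grouping_policy  group_by_prio
    prio                  rdac
    path_checker          rdac
    path_selector         "round-robin 0"
    hardware_handler      "1 rdac"
    failback              immediate
    features              "2 pg_init_retries 50"
    no_path_retry         30
    rr_min_io             100
  }
EOF
}

# Emit the whole devices section for the four models used above.
echo "devices {"
for product in MD32xxi MD32xx MD36xxi MD36xxf; do
  dell_md_stanza "$product"
done
echo "}"
```

Redirect the output to a scratch file and review it before merging it into /etc/multipath.conf.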


NAS iSCSI Config

An example config for a NetApp NAS over iSCSI might look like:

# NAS iSCSI /etc/multipath.conf
blacklist {
  device {
    vendor "*"
    product "PERC|Perc"
  }
  device {
    vendor "*"
    product "Universal Xport"
  }
  device {
    vendor "*"
    product "Virtual Disk"
  }
}

defaults {
  user_friendly_names yes
  max_fds max
  queue_without_daemon no
}

devices {
  device {
    vendor "NETAPP"
    product "LUN"
    getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
    #
    # RHEL5 style
    prio_callout "/sbin/mpath_prio_ontap /dev/%n"
    # RHEL6 style
    # prio ontap
    #
    features "1 queue_if_no_path"
    hardware_handler "0"
    path_grouping_policy group_by_prio
    failback immediate
    rr_weight uniform
    rr_min_io 128
    path_checker directio
    flush_on_last_del yes
  }
}


Multipath Names

By default in RHEL/CentOS, multipath devices live in /dev/mapper/ and are named "mpath" followed by a number (v5) or a letter (v6). A partition within a multipath then gets "p" followed by its number. These names are controlled by udev and a rules file installed by the device-mapper-multipath RPM; for example on RHEL6/CentOS6 it's named /lib/udev/rules.d/40-multipath.rules.

Examples:

  /dev/mapper/mpath1p2 - 2nd partition on multipath #1 (1) (v5)
  /dev/mapper/mpathbp1 - 1st partition on multipath #2 (b) (v6)

These names are a human-friendly stand-in for the WWID, enabled by the setting user_friendly_names yes in the config file. They can be changed to suit your needs - it's easy and can save a lot of confusion later if a dozen LUNs are used as RAW devices (such as in an Oracle RAC).
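
The naming convention is simple enough to express as a one-line helper, which can be handy in provisioning scripts. A sketch assuming the RHEL/CentOS 5 and 6 convention described above; mpath_part_dev is a hypothetical name:

```shell
#!/bin/sh
# mpath_part_dev NAME N: print the device node for partition N of multipath
# NAME, following the "p" + number convention described above.
mpath_part_dev() {
  printf '/dev/mapper/%sp%s\n' "$1" "$2"
}

mpath_part_dev mpath1 2   # prints /dev/mapper/mpath1p2
mpath_part_dev mpathb 1   # prints /dev/mapper/mpathbp1
```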


Administering Multipaths

The main tool for administering multipaths is called multipath and is normally found in /sbin/ (root only). The primary use day-to-day will be the -l or -ll flags to simply list multipaths and their associated 'real' SCSI devices (paths). Using this tool you can examine the health of the (multi)paths and all associated information.

Example:

## DAS multipath
# multipath -l
[...]
VOTING5 (3690b11c0001b99ba0000098f5192345e) dm-5 DELL,MD32xx
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:4  sdw  65:96  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:4  sdf  8:80   active undef running
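
For a quick health check across many LUNs, output like the above can be boiled down to one line per multipath. The following is a rough sketch, assuming the "NAME (WWID) dm-N" header layout shown above (i.e. named multipaths); summarize_maps is a hypothetical helper and the sample input is the example output from this page:

```shell
#!/bin/sh
# summarize_maps: read "multipath -l" / "multipath -ll" style output on stdin
# and print "NAME WWID PATHCOUNT" for each multipath.
summarize_maps() {
  awk '
    # Map header line: NAME (WWID) dm-N VENDOR,PRODUCT
    $2 ~ /^\([0-9a-f]+\)$/ && $3 ~ /^dm-[0-9]+$/ {
      if (name != "") print name, wwid, paths
      name = $1; wwid = substr($2, 2, length($2) - 2); paths = 0
      next
    }
    # Path line: one of its fields is an H:C:T:L address such as 2:0:0:4
    {
      for (i = 1; i < NF; i++)
        if ($i ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/) { paths++; break }
    }
    END { if (name != "") print name, wwid, paths }
  '
}

summarize_maps <<'EOF'
VOTING5 (3690b11c0001b99ba0000098f5192345e) dm-5 DELL,MD32xx
size=1.0G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 rdac' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:0:4  sdw  65:96  active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
  `- 1:0:0:4  sdf  8:80   active undef running
EOF
# prints: VOTING5 3690b11c0001b99ba0000098f5192345e 2
```

On a live system the input would come from the tool itself, e.g. multipath -ll | summarize_maps; any multipath reporting fewer paths than expected deserves a closer look.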

The system knows which SCSI devices belong together by the WWID (aka WWN, UUID) presented by the storage host - if the WWIDs match, the devices belong together in a multipath. For the example LUN above, the -v3 flag shows that they match:

## DAS WWIDs (WWN/UUID)
# multipath -v3
[...]
uuid                              hcil     dev  dev_t  pri dm_st chk_st vend/p
3690b11c0001b99ba0000098f5192345e 2:0:0:4  sdw  65:96  14  undef ready  DELL,M
3690b11c0001b99ba0000098f5192345e 1:0:0:4  sdf  8:80   9   undef ready  DELL,M

The WWID 3690b11c0001b99ba0000098f5192345e matches on both SCSI devices, so now the multipath daemon knows they belong together and creates a pseudo device for us to work with. If one underlying path (device) fails, it goes over to the other one without any manual intervention. Magic.
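
The matching itself is just a group-by on the first column. Here is a small sketch that mimics it on the path table shown above; it assumes that exact column layout (uuid, hcil, dev, ...), which may differ between multipath versions, and group_by_wwid is a hypothetical helper:

```shell
#!/bin/sh
# group_by_wwid: read path rows (uuid hcil dev ...) on stdin and print
# "WWID: dev dev ..." for each WWID. The header and any other lines are
# skipped because their second column is not an H:C:T:L address.
group_by_wwid() {
  awk '
    $2 ~ /^[0-9]+:[0-9]+:[0-9]+:[0-9]+$/ { devs[$1] = devs[$1] " " $3 }
    END { for (w in devs) print w ":" devs[w] }
  ' | sort
}

group_by_wwid <<'EOF'
uuid                              hcil     dev  dev_t  pri dm_st chk_st vend/p
3690b11c0001b99ba0000098f5192345e 2:0:0:4  sdw  65:96  14  undef ready  DELL,M
3690b11c0001b99ba0000098f5192345e 1:0:0:4  sdf  8:80   9   undef ready  DELL,M
EOF
# prints: 3690b11c0001b99ba0000098f5192345e: sdw sdf
```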

There are other uses of the multipath tool, such as the -f/-F flags (flush paths) and -p (change policies) -- be careful using these on a live server. Check the man page for detailed information, and know there is a -d (dry run) option to test things before committing. Depending on what you're doing (such as renaming - see below), it's sometimes easier to restart the multipathd daemon instead.


Partitioning Multipaths

The tool kpartx is what an administrator will use to have the kernel re-examine newly partitioned multipaths and create new device entries for us; it's the equivalent of using partx on normal devices.

## Normal SCSI device

# parted /dev/sdb    (create new partition 1)
# partx -a /dev/sdb
# ls -1 /dev/sdb*
/dev/sdb
/dev/sdb1

## Multipath device

# parted /dev/mapper/mpathb     (create new partition 1)
# kpartx -a /dev/mapper/mpathb
# ls -1 /dev/mapper/mpathb*
/dev/mapper/mpathb
/dev/mapper/mpathbp1

The device /dev/mapper/mpathbp1 is now used just like /dev/sdb1 would be by any other tools (mkfs, pvcreate, vgextend, etc.) -- the kernel's device-mapper multipath target routes the actual I/O out to the active device (path) in the multipath, while the multipath daemon monitors path health.
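
If you script the partitioning step, the partx/kpartx choice can be wrapped in a helper. This is a naive sketch, assuming anything under /dev/mapper/ is a multipath (LVM volumes live there too, so check before running the result); reread_cmd is a hypothetical name and it only prints the command rather than running it:

```shell
#!/bin/sh
# reread_cmd DEV: print the command that registers new partitions for DEV.
# Naive assumption: anything under /dev/mapper/ is a multipath device.
reread_cmd() {
  case "$1" in
    /dev/mapper/*) echo "kpartx -a $1" ;;
    *)             echo "partx -a $1" ;;
  esac
}

reread_cmd /dev/sdb            # prints: partx -a /dev/sdb
reread_cmd /dev/mapper/mpathb  # prints: kpartx -a /dev/mapper/mpathb
```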


Clustered Multipaths

Using the WWIDs as described above lets you verify that a host group of LUNs presented to 2 or more servers maps to the same multipath names on every server. The mapping of a WWID to multipath on one node must match on all other nodes, otherwise you're writing to different storage areas on different nodes. If your examination finds they do not match you may need to rename them manually - see below.

Always double-check the WWID to multipath mappings match on all nodes in a cluster! This may not be quick but it's extremely important the time be spent doing this work. Never assume it's "just right" on a new build.
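
The cross-node check can be partly automated. Below is a minimal sketch, assuming you have already collected one file per node containing "NAME WWID" lines (for example by reducing the multipath -ll output on each node); compare_maps and the file layout are illustrative assumptions, not a standard tool:

```shell
#!/bin/sh
# compare_maps FILE_A FILE_B
# Each file holds "NAME WWID" lines from one node (FILE_A must not be empty).
# Prints every WWID whose multipath name differs between the two nodes, or
# that is missing on one of them; returns non-zero if any is found.
compare_maps() {
  awk '
    NR == FNR { a[$2] = $1; next }   # first file: remember the name per WWID
    {
      seen[$2] = 1
      if (!($2 in a))       { print "MISSING on A: " $2 " (" $1 " on B)"; bad = 1 }
      else if (a[$2] != $1) { print "MISMATCH " $2 ": " a[$2] " vs " $1; bad = 1 }
    }
    END {
      for (w in a)
        if (!(w in seen)) { print "MISSING on B: " w " (" a[w] " on A)"; bad = 1 }
      exit bad
    }
  ' "$1" "$2"
}

# Demo with two made-up node listings that disagree on the name.
a=$(mktemp); b=$(mktemp)
printf '%s\n' 'VOTING5 3690b11c0001b99ba0000098f5192345e' > "$a"
printf '%s\n' 'mpathb 3690b11c0001b99ba0000098f5192345e'  > "$b"
compare_maps "$a" "$b" || echo "nodes disagree"
rm -f "$a" "$b"
```

A clean run prints nothing and returns 0. It complements the manual check rather than replacing it.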

Renaming Multipaths

Renaming them is easy - add a multipaths stanza to the bottom of multipath.conf, with one multipath block per LUN that maps its WWID to the alias you want. The setting user_friendly_names yes is required in multipath.conf for this to work as expected. For example, here is a rename of a shared Oracle RAC voting LUN from the spurious name into something that makes sense for use inside Oracle as a RAW device:

multipaths {
  multipath {
    wwid   3690b11c0001b99ba0000098f5192345e
    alias  VOTING5
  }
}

Restart the multipathd service and now the multipath is named like so:

# ls -1 /dev/mapper/VOTING5
/dev/mapper/VOTING5

The partitions within a renamed multipath follow the same convention, 'p' followed by a number. You would expect names like /dev/mapper/VOTING5p1, /dev/mapper/VOTING5p2, etc. if you partitioned this LUN for use as a normal filesystem.
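
With a dozen LUNs, typing the stanza by hand gets error-prone, so it may be worth generating it from a list. A sketch assuming "NAME WWID" pairs on stdin; gen_multipaths is a hypothetical helper:

```shell
#!/bin/sh
# gen_multipaths: read "NAME WWID" pairs on stdin and print a multipaths
# stanza in the same layout as the example above.
gen_multipaths() {
  echo "multipaths {"
  while read -r name wwid; do
    [ -n "$name" ] || continue   # skip blank lines
    printf '  multipath {\n    wwid   %s\n    alias  %s\n  }\n' "$wwid" "$name"
  done
  echo "}"
}

gen_multipaths <<'EOF'
VOTING5 3690b11c0001b99ba0000098f5192345e
EOF
```

The same pair list can be reused on every node in a cluster, which keeps the WWID to name mapping identical everywhere.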


Multipath Ownership

One of the other common desires is to set the UID, GID and mode on the multipaths; alas there's a different method for RHEL/CentOS v5 and v6.


RHEL5 / CentOS5

This is done in the same block schema used for renaming them, like so:

multipaths {
  multipath {
    wwid   3690b11c0001b99ba0000098f5192345e
    alias  VOTING5
    uid    503
    gid    503
    mode   755
  }
}

Note that the system requires the numerical UID/GID and octal mode as shown above.
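
The numeric values can be looked up with the standard id tool. A small sketch; numeric_ids is a hypothetical helper, and root is used in the demo only because it exists everywhere (substitute the account that should own the device, e.g. an oracle user if your system has one):

```shell
#!/bin/sh
# numeric_ids USER: print the numeric UID and primary GID of USER,
# ready to paste into the uid/gid settings of a multipath block.
numeric_ids() {
  printf 'uid %s\ngid %s\n' "$(id -u "$1")" "$(id -g "$1")"
}

numeric_ids root
```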


RHEL6 / CentOS6

The above method was deprecated in RHEL6 in favor of udev rules - Red Hat's article on how to set it up is a wee bit lacking; use a ruleset like this instead of their official doc:

/etc/udev/rules.d/12-dm-permissions.rules
ENV{DM_NAME}=="VOTING5", OWNER:="oracle", GROUP:="oinstall", MODE:="660"

This builds on the rename outlined above; to get the DM_NAME value to match on, use the "udevadm" tool to query the raw device-mapper node.

  • Get the raw node-name with a simple ls:
# ls -l /dev/mapper/VOTING5
lrwxrwxrwx 1 root root 7 May 30 22:41 /dev/mapper/VOTING5 -> ../dm-5
  • Use that dm-?? number against the sysfs interface for it:
# udevadm info --query=all --path=/devices/virtual/block/dm-5/
P: /devices/virtual/block/dm-5
N: dm-5
S: mapper/VOTING5
S: disk/by-id/dm-name-VOTING5
S: disk/by-id/dm-uuid-mpath-3690b11c0001b99ba0000098f5192345e
S: block/253:5
E: UDEV_LOG=3
E: DEVPATH=/devices/virtual/block/dm-5
E: MAJOR=253
E: MINOR=5
E: DEVNAME=/dev/dm-5
E: DEVTYPE=disk
E: SUBSYSTEM=block
E: DM_SBIN_PATH=/sbin
E: DM_UDEV_PRIMARY_SOURCE_FLAG=1
E: DM_UDEV_RULES_VSN=2
E: DM_NAME=VOTING5
E: DM_UUID=mpath-3690b11c0001b99ba0000098f5192345e
E: DM_SUSPENDED=0
E: MPATH_SBIN_PATH=/sbin
E: DEVLINKS=/dev/mapper/VOTING5 /dev/disk/by-id/dm-name-VOTING5 /dev/disk/by-id/dm-uuid-mpath-3690b11c0001b99ba0000098f5192345e /dev/block/253:5

Use any line item that begins with "E: " as the match clause in your udev rule; DM_NAME is the most obvious choice, however your situation may require using one of the others.
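
Pulling a property out of that output and turning it into a rule line is easy to script. A minimal sketch; udev_prop and dm_perm_rule are hypothetical helpers, the sample text is taken from the udevadm output above, and the owner, group and mode are the ones used in the example rule:

```shell
#!/bin/sh
# udev_prop KEY: read "udevadm info --query=all" output on stdin and print
# the value of the "E: KEY=value" line.
udev_prop() {
  sed -n "s/^E: $1=//p"
}

# dm_perm_rule NAME OWNER GROUP MODE: print a permissions rule matching DM_NAME.
dm_perm_rule() {
  printf 'ENV{DM_NAME}=="%s", OWNER:="%s", GROUP:="%s", MODE:="%s"\n' "$1" "$2" "$3" "$4"
}

# Two lines from the udevadm output above, used as sample input.
sample='E: DM_NAME=VOTING5
E: DM_UUID=mpath-3690b11c0001b99ba0000098f5192345e'

name=$(printf '%s\n' "$sample" | udev_prop DM_NAME)
dm_perm_rule "$name" oracle oinstall 660
# prints: ENV{DM_NAME}=="VOTING5", OWNER:="oracle", GROUP:="oinstall", MODE:="660"
```

On a live system the input would come from udevadm info itself, and the rule would go into the rules file named above. New rules are typically picked up after udevadm control --reload-rules followed by udevadm trigger, though it's worth confirming the resulting ownership with ls -l on the dm node.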


References