Thursday, August 31, 2006

Linux Native Multipathing (Device Mapper-Multipath)

Over the past couple of years, a flurry of OS-native multipathing solutions has become available. As a result, we are seeing a trend toward these solutions and away from vendor-specific multipathing software.

The latest OS-native multipathing solution is Device Mapper-Multipath (DM-Multipath), available with Red Hat Enterprise Linux 4.0 U2 and SuSE SLES 9 SP2.

I had the opportunity to configure it in my lab a couple of days ago, and I was pleasantly surprised at how easy it was to configure. Before I show how it's done, let me talk a little about how it works.

The multipathing layer sits above the protocols (FCP or iSCSI) and determines whether the devices discovered on the target represent separate devices or just separate paths to the same device. In this case, Device Mapper (DM) is the multipathing layer for Linux.

To determine which SCSI devices/paths correspond to the same LUN, the DM layer initiates a SCSI Inquiry. The inquiry response, among other things, carries the LUN serial number. Regardless of the number of paths a LUN is associated with, the serial number for the LUN will always be the same. This is how the multipathing software determines which paths, and how many, are associated with each LUN.
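
A quick way to see this for yourself is to run the same scsi_id call that the getuid_callout in the configuration below uses against two candidate devices (the device names here are just examples); if the printed identifiers match, the two devices are paths to the same LUN:

# /sbin/scsi_id -g -u -s /block/sdc
# /sbin/scsi_id -g -u -s /block/sde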

Before you get started, you want to make sure the following are installed:

  • device-mapper-1.01-1.6 RPM
  • multipath-tools-0.4.5-0.11 RPM
  • Netapp FCP Linux Host Utilities 3.0
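
A quick way to confirm the first two are actually present is to query the RPM database, as shown later in the comments (on Red Hat the multipath tools ship as device-mapper-multipath; on SuSE the package is multipath-tools):

# rpm -q device-mapper
# rpm -q device-mapper-multipath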

Make a copy of the /etc/multipath.conf file. Edit the original file and make sure only the following entries are uncommented. If the Netapp section isn't there, add it.


defaults {
        user_friendly_names yes
}

devnode_blacklist {
        devnode "sd[a-b]$"
        devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
        devnode "^hd[a-z]"
        devnode "^cciss!c[0-9]d[0-9]*"
}

devices {
        device {
                vendor "NETAPP "
                product "LUN"
                path_grouping_policy group_by_prio
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
                prio_callout "/opt/netapp/santools/mpath_prio_ontap /dev/%n"
                features "1 queue_if_no_path"
                path_checker readsector0
                failback immediate
        }
}

The devnode_blacklist lists devices for which you do not want multipathing enabled. So if you have a couple of local SCSI drives (i.e., sda and sdb), the first entry in the blacklist will exclude them. The same goes for IDE drives (hd).
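
If you want to sanity-check the blacklist before any maps are created, a dry run prints what multipath would build without actually creating anything (a hedged suggestion; in this multipath-tools release -d means dry run and -v2 raises the verbosity):

# multipath -v2 -d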


Add the multipath service to the boot sequence by entering the following:

chkconfig --add multipathd
chkconfig multipathd on
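
To start the daemon right away and build the maps without rebooting (a hedged sketch; the service name matches what chkconfig registered above, and multipath -v2 is the same command a commenter further down uses to refresh paths):

service multipathd start
multipath -v2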


Multipathing on Linux is Active/Active with a Round-Robin algorithm.

The path_grouping_policy is group_by_prio, which assigns paths to Path Groups based on path priority values. Each path is given a priority (higher value = higher priority) by a callout program written by Netapp Engineering (part of the FCP Linux Host Utilities 3.0).

The priority values of the paths in a Path Group are summed to obtain a group priority value. The paths belonging to the Path Group with the highest priority value are used for I/O.

If a path fails, the value of the failed path is subtracted from its Path Group's priority value. If that Path Group's priority value is still higher than the values of the other Path Groups, I/O will continue within that Path Group. If not, I/O will switch to the Path Group with the highest priority.
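
As a rough illustration (the actual values returned by the Netapp callout may differ), suppose each direct path is assigned a priority of 4 and each proxy path a priority of 1, with two paths in each group:

Group 1 (direct paths): 4 + 4 = 8   <- highest sum, carries the I/O
Group 2 (proxy paths):  1 + 1 = 2   <- standby

If one direct path fails, Group 1 drops to 8 - 4 = 4, which is still higher than 2, so I/O stays on the remaining direct path rather than switching groups.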

Create and map some LUNs to the host. If you are using the latest Qlogic or Emulex drivers, then run the respective utilities they provide to discover the LUNs:



  • qla2xxx_lun_rescan all (QLogic)
  • lun_scan_all (Emulex)

To view a list of multipathed devices (if the maps haven't been created yet, run multipath once on its own first to build them):

# multipath -l

[root@rhel-a ~]# multipath -l

360a9800043346461436f373279574b53
[size=5 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [active]
 \_ 2:0:0:0 sdc 8:32 [active]
 \_ 3:0:0:0 sde 8:64 [active]
\_ round-robin 0 [enabled]
 \_ 2:0:1:0 sdd 8:48 [active]
 \_ 3:0:1:0 sdf 8:80 [active]

The above shows 1 LUN with 4 paths. Done. It's that easy to set up.
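
Since the map now lives under /dev/mapper, you can put a filesystem on it like any other block device. A minimal sketch, using the map name multipath -l reported above (with user_friendly_names you may instead see an mpathN alias; the mount point is arbitrary):

# mkfs.ext3 /dev/mapper/360a9800043346461436f373279574b53
# mkdir /mnt/lun0
# mount /dev/mapper/360a9800043346461436f373279574b53 /mnt/lun0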

33 comments:

Anonymous said...

Nice blog on Linux native multipathing - I keep reading your blog regularly and follow all update postings. Please post more SAN or storage infrastructure information and all the troubleshooting involved.
Please post more, guys, we have been reading your posts regularly. It's very informative and I thank you for your time.

http://storage-jobs.blogspot.com storage area network or SAN jobs

Anonymous said...

Do you know if multipath and FCP utility are available for RHEL3?

Nick Triantos said...

Device-Mapper multipath is only available with RHEL 4.0 U2, SuSE 9 SP2, or above.

RHEL3 uses mdadm (multiple device administration driver) for path failover. It's active/passive, and personally I've found the setup process prone to error.

mdadm creates a multipath device accessible as /dev/md[x] using partition 1 on each of the LUNs (i.e., /dev/sda, /dev/sdb). However, it doesn't check to see whether they are paths to the same LUN or partition, so it is very easy to mess up. The last time I played with it, it supported automatic failover, but failback was a manual process.
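
For what it's worth, a rough sketch of the mdadm approach looks like this (device names are illustrative, and as noted it is entirely up to you to verify that the two partitions really are paths to the same LUN):

# mdadm --create /dev/md0 --level=multipath --raid-devices=2 /dev/sda1 /dev/sdb1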

If you are running RHEL 3 with Qlogic cards then you may want to use the Qlogic driver which provides failover/failback capability.

While the Qlogic driver does not provide I/O load balancing for the same LUN across 2 different HBA host ports, it does give you the ability to balance the LUNs across host ports. In this scenario, using Qlogic's SANSurfer you'd assign a "Preferred Path" to each of your LUNs.

Anonymous said...

How do you get multiple paths into a path group? We've got 2 HDS LUNs presented to 2 HBAs. 'multipath -l' looks like this:

mpath1 (1HITACHI_R4509C66106C)
[size=339 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 1:0:0:1 sdc 8:32 [active][ready]
\_ round-robin 0 [enabled]
\_ 2:0:0:1 sde 8:64 [active][ready]

mpath0 (1HITACHI_R4509C661F15)
[size=13 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active]
\_ 1:0:0:0 sdb 8:16 [active][ready]
\_ round-robin 0 [enabled]
\_ 2:0:0:0 sdd 8:48 [active][ready]


With each path in its own path group, we're in active/standby for each LUN, when we want to be active/active and make use of the round-robin load balancing.

Nick Triantos said...

Hi, in order to get them into different Path Groups you have to use the group_by_prio path grouping policy. Then you need a callout program, provided by the vendor (HDS in this case), that assigns priority values to each path.

You need to look at /etc/multipath.conf and check your path grouping policy. There are 2 policies:

1) group_by_prio which we discussed.

2) The default path group policy is "multibus". The algorithm used in this case is simple: all the paths available for a LUN are grouped into a single group, which is active all the time. I/O is started on the first path in the group.

After "rr_min_io" number of io's are completed on that path, io is switched to the next available path in the group in a round robin method. If one of the paths fail, that path is excluded for selection for io. That is io is round robin'd on remaining paths, if you have more than 2 per LUN. If the "failback" parameter is set to "immediate" in /etc/multipath.conf, the path is added into group for selection for io, as soon as it becomes available.

The rr_min_io setting for dm-multipath specifies the number of I/Os sent through a path before switching to the next path.

Lowering this value from the default of 1000 has been seen to dramatically affect overall throughput for dm-multipath, especially for large I/O workloads (64 KB).

The rr_min_io value can only be set by changing the defaults section in /etc/multipath.conf. You cannot change it in the devices section. You must put a section such as the following at the very top of the /etc/multipath.conf file:
defaults {
        rr_min_io 128
}

Here's what the /etc/multipath.conf for a Netapp box would look like using multibus grouping policy:

devices {
        device {
                vendor "NETAPP"
                product "LUN"
                path_grouping_policy multibus
                features "1 queue_if_no_path"
                path_checker readsector0
                failback immediate
        }
}

The above will round-robin I/O across all available paths in a single Path Group.

The "1 queue_if_no_path" entry enables I/O queuing in case all paths to a device are lost.

Eyal Traitel said...

I want to note that it's possible to follow the same procedure with CentOS - but I have encountered some package name inconsistencies.
I have documented it here:
http://filers.blogspot.com/2006/08/configuring-fcp-multipathing-in-redhat.html

Eyal Traitel
http://filers.blogspot.com
http://stupidstorage.blogspot.com

Anonymous said...

Hi Nick,

I'd like to add multipathd to the boot sequence, but when I run chkconfig --add multipathd, I see the following: multipathd: unknown service. I installed multipath-tools-0.4.7. Any ideas why this command is returning with unknown service? Any help is greatly appreciated.

Thanx!

Anonymous said...

Thank you for all the information about DM Multipath for Linux.

I read through the comment for Hitachi. I have a similar question.

I have 2 HBAs, 2 array ports for my HITACHI Tagmastore array. Is there a way I can set a "preferred" path for I/O or redirect I/O from one path to another?

multipath output :
------------------
mpath0 (360060e800427dd00000027dd00003f80)
[size=500 MB][features="0"][hwhandler="0"]
\_ round-robin 0 [prio=1][active]
\_ 2:0:0:0 sdc 8:32 [active][ready]
\_ round-robin 0 [prio=1][enabled]
\_ 3:0:0:0 sde 8:64 [active][ready]

iostat displays that the I/O is routed through the path "sdc". Is there a way to route it through "sde" ? And I want to be able to do this without any I/O disruption.

extended device statistics
device mgr/s mgw/s r/s w/s kr/s kw/s size queue wait svc_t %b
sdc 0 4625 0.0 74.4 0.0 4700.9 63.2 36.6 492.9 10.9 81
sde 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0
dm-2 0 0 0.0 4700.9 0.0 4700.9 1.0 2364.9 503.1 0.2 81
dm-4 0 0 0.0 4700.9 0.0 4700.9 1.0 2364.9 503.1 0.2 81

Also why do I see I/O on dm-2 and dm-4?

Thanks, really appreciate the help.

Nick Triantos said...

Response to the multipathd: unknown service...

Check to see if the device-mapper-multipath package is loaded. Most likely it's not. I know that when I was installing RHEL in my lab, the RPM was not loaded by default. Also check for device-mapper.

For RHEL
# rpm -q device-mapper
# rpm -q device-mapper-multipath

If you're on SLES:

# rpm -q device-mapper

Anonymous said...

Device-mapper appears to be loaded :

[root@dot4 dev]# rpm -q device-mapper
device-mapper-1.02.07-4.0.RHEL4
[root@dot4 dev]# rpm -q device-mapper-multipath
device-mapper-multipath-0.4.5-16.1.RHEL4
[root@dot4 dev]#

Nick Triantos said...

What kind of Tagma do you have?

Is it a USP Tagma or is it an AMS Tagma?

HDS provides a callout prioritizer for the AMS series, "/sbin/pp_hds_modular", but I don't know which distro it is part of.

If you have a USP then the policy should be multibus.

Regardless of what you have (AMS or USP), here's the HDS section. Which policy is used depends on the product ID (OPEN-V, DF600F, etc.).

devices {
        device {
                vendor "HITACHI"
                product "DF600F"
                path_grouping_policy group_by_prio
                prio_callout "/sbin/pp_hds_modular %d"
                path_checker readsector0
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        }
        device {
                vendor "HITACHI"
                product "DF600F-CM"
                path_grouping_policy group_by_prio
                prio_callout "/sbin/pp_hds_modular %d"
                path_checker readsector0
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        }
        device {
                vendor "HITACHI"
                product "OPEN-V"
                path_grouping_policy multibus
                path_checker readsector0
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        }
        device {
                vendor "HITACHI"
                product "OPEN-V-CM"
                path_grouping_policy multibus
                path_checker readsector0
                getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        }
}

Anonymous said...

Thanks Nick,

I have a Hitachi USP 100. I don't want to use "multibus" since I am testing one of our company's products, which specifically requires me to turn off load balancing/round-robin. So I intend to use the "failover" policy even though I have a Hitachi USP 100.

Also the other main question I had was, how would I re-direct I/O to a specific path? When I am testing my product, I will be presenting more paths to the USP 100 LUN and I want to be directing I/O to the newly presented path.

Basically I need I/O routed through only one path and I want to be able to choose which path I can route I/O through.

Thanks again for the help.

Anonymous said...

Does anyone have any experience on Multipath with OpenFiler either in SUSE/RHEL?

Anonymous said...

Today I tested your recipe and found some small errors: the line in multipath.conf should be
prio_callout "/opt/netapp/santools/mpath_prio_ontap /dev/%n"
and to get multipath devices I had to run multipath without -l.

m.schlett@fzd.de

Anonymous said...

Hi Nick,

Excellent blog, good info. A quick question if I may. Have you any experience with IBM DS4700 and dm-multipath? IBM keeps telling me to use RDAC but I can't get SLES 9 SP3, boot from SAN and RDAC working reliably. Novell provided good info on setting up dm-multipath but I don't recall reading anything about callouts. Any insight is most appreciated.

Nick Triantos said...

Hi, thanks for the kind words.

IBM's nudging you towards RDAC because DM-Multipath on SLES 9 or RHEL 4.0 does not support SAN Boot. In fact it is problematic.

This is addressed in RHEL 5.0 and SLES 10, though I have not done any testing with those.

The multipath-tools package supports pretty much every vendor. IBM should have several entries in there, some of which use a callout program and some that don't, depending on the array.

Anonymous said...

Hi Nick,
very good blog and useful info.

We have many servers with no internal disk at all, only SAN disks (USP600), booting from the SAN.
How should I configure multipath?
I know multipath does not support the system/root disk, but I need multipath for all other LUNs.
Some parameters are common for all LUNs; for example, ql2xfailover=0 (which is needed) in modprobe.conf disables failover for all LUNs, even the root disk.

Thanks for your help
/David
david_at_dahey_dot_com

Nick Triantos said...

David,

Like you said, DM-Multipath does not support SAN booting right now. However, one route you could take is to use Qlogic HBAs and deploy the Qlogic driver. The parameter you mentioned is a Qlogic driver parameter (0=disable failover, 1=enable). It supports both boot and non-boot LUNs and works well. While it doesn't provide any dynamic I/O load balancing for a given LUN, you can manually balance the LUNs across a set of host HBAs.

ethan.john said...

"Basically I need I/O routed through only one path and I want to be able to choose which path I can route I/O through."
I've been working toward this end extensively for a while. The best method I've found is pretty nasty:
I've been working toward this end extensively for a while. The best method I've found is pretty nasty:

- Use a custom script as the prio_callout, so that you can make the script return whatever values you need for path priorities.
- Set rr_min_io to something very large -- 2 billion or so.

Linux multipathing doesn't currently allow you to set custom path priorities.
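
For what it's worth, here is a rough sketch of what such a callout script could look like (the device name, priority values, and install path are all made up for illustration); multipathd runs the callout with the argument given in multipath.conf, e.g. %n for the device name, and reads the integer printed to stdout as the path priority:

#!/bin/sh
# hypothetical custom prio_callout -- device names and values are illustrative
case "$1" in
    sde) echo 10 ;;   # the path you want carrying the I/O
    *)   echo 1  ;;   # every other path
esac

and in /etc/multipath.conf:

prio_callout "/usr/local/sbin/my_prio %n"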

Anonymous said...

We have been testing Linux DM in both RedHat and SUSE against our storage with both QLogic and Emulex HBAs. All scenarios work except for SUSE 9 SP3 with Emulex HBAs.

In the SUSE 9 SP3 w/ Emulex environment, DM successfully fails I/O over on an Active/Active service (in this test, a cable pull is performed). After plugging the cable back in, the I/O does not fail back. We can only bring the paths back to an A/A state by entering a "multipath -v2" command.

With a QLogic HBA, or running I/O in RedHat, we always see:
"tur checker reports path is up
.. multipathd: 66:240: reinstated
.. mpathA6: remaining active paths: 2"

Are there any unique settings with Emulex in SUSE that must be set for DM to work automatically?

Nick Triantos said...

Hi,

You're not the only one that has seen this issue with Emulex. Somebody told me that it happens only with LUNs with an ID of 0, but I haven't tried it.

Anonymous said...

Hello!
Can I use this kind of multipathing if I boot blade servers from an external storage controller? I need multiple paths to the root partition.

Nick Triantos said...

Hi,

If you are using RHEL 4, you can't use dm-multipath for SAN booting. However, I have been told that RHEL 5 U1 will support this.

Anonymous said...

As others have said, this is great info... do you have any experience with RHEL4 on IBM's POWER5 (p-Series)? I know dm-multipath doesn't support SAN boot, but on POWER5, through a VIO server, the LUNs are presented to the client Linux LPAR as vscsi. I still haven't been able to get the boot partitions to work with dm-multipath, though non-boot partitions work fine with this method. Any suggestions? mdadm? If I can get this to work, we'll have redundancy and SAN boot via the VIO servers. Any help is appreciated.

Nick Triantos said...

Hi,

I have not played with VIO to set up Linux MP using Device-Mapper. Having said that, I'm not so sure it would alter anything from a SAN booting perspective, regardless of whether the LUN is presented as a vscsi device.
The issue with device-mapper and SAN boot support is not dependent upon hardware capability (i.e., virtualization).

As far as mdadm, you'd have the same challenge. You have to create software-based RAID partitions and then configure mdadm for multiple paths. This works for any partition other than /boot. I tried it a long time ago and failed miserably, but that was with RHEL 3.

If you'd like mdadm instructions you can find them here:

http://www.redhat.com/docs/manuals/enterprise/RHEL-4-Manual/en-US/rhel-ig-s390-multi-en-4/s2-s390info-raid.html

Anonymous said...

There is a new doc on RHEL multipath:
http://www.redhat.com/docs/manuals/enterprise/RHEL-5-manual/en-US/RHEL510/DM_Multipath/index.html

Also - multipath boot from SAN is supported on RHEL 5.1

Unknown said...

Does the default example you post send I/O through the Netapp interconnect cables as well, or does it just send I/O to the controller that owns the LUN? Does the Netapp group policy automatically know which path to send the I/O to so that it does not cross the interconnect on the filer?

Nick Triantos said...

No I/O is sent through the interconnect. Although you have 4 paths per LUN, you round-robin ONLY through the 2 direct paths to the controller that owns them.

Cheers

Anonymous said...

Great blog! nice content, good responses.

I have a Compellent SAN connected to SLES 10 via both iSCSI and FC. Currently the FC and iSCSI paths are in a single group with no priority. I'm trying to find a way to give greater priority to the FC path and use the iSCSI path for failover only. It looks like this calls for the custom script for prio_callout. Anyone know what it would look like?

BCEgg said...

Hi Nick - I'm a little confused about the whole Multipathing configuration. Do I use fdisk to create the partitions or kpartx?

Thanks in advance.

Unknown said...

Hi,

I was wondering if anyone has had any experience with an IBM DS4700 SAN with CentOS 5.1, with multipath. The DS4700 reports errors saying "preferred path not defined" (I am only relaying this message, so it could be slightly different).

Thanks beforehand
Dmitry

Unknown said...

I'm using multipathing with a Solaris system, but have no idea about Linux. Can you let me know whether all 3 of the items below are required for multipathing?

* device-mapper-1.01-1.6 RPM is loaded
* multipath-tools-0.4.5-0.11
* Netapp FCP Linux Host Utilities 3.0


Are they all built into Linux?

Is multipathing available on RHEL5?

Nick Triantos said...

suchi,
RHEL 5 provides built-in multipathing. Everything you need is part of the distro by default.