GlusterFS Build Steps



Overview

Prior to starting work, a fundamental decision must be made: what type of Volume(s) will be used for the given scenario. While several volume types exist, two are used most often to achieve different results:

Replicated

This type of Volume provides file replication across multiple bricks. It is the best choice for environments where High Availability and High Reliability are critical, and for cases where you want to self-mount the volume on every node, such as a webserver DocumentRoot - the GlusterFS nodes are their own clients.

Files are copied to each brick in the volume, similar to a RAID-1; however, you can have 3 or more bricks, including an odd number. Usable space is the size of one brick, and all files written to one brick are replicated to all the others. This makes the most sense if you are going to self-mount the GlusterFS volume, for instance as the web docroot (/var/www) or similar, where all files must reside locally on each node. The value passed to replica is the same as the number of bricks in the volume.

Distributed-Replicated

In this scenario files are distributed across replicated sets of bricks in the volume. Use this type of volume where the requirement is to scale storage while keeping high availability. Volumes of this type also offer improved read performance in most environments, and they are the most common type used when clients are external to the GlusterFS nodes themselves.

Somewhat like a RAID-10, an even number of bricks must be used; usable space is the combined brick size divided by the replica count. For example, if there are 4 bricks of 20G and you pass replica 2 to the creation, files distribute across 2 replica pairs (40G usable) and each file is stored on 2 bricks. With 6 bricks of 20G and replica 2 they would distribute across 3 pairs (60G usable), while replica 3 would distribute across only 2 sets (40G usable) with each file stored on 3 bricks. This type is used when your clients are external to the cluster, not local self-mounts.

All the fundamental work in this document is the same except for the one step where the Volume is created with the replica keyword, as outlined above and illustrated below. Stripe-based volumes are not covered here.
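
As a quick illustration of that one step (hypothetical names - gvolX and nodeA through nodeF are not part of the build below): bricks are grouped into replica sets in the order they are listed, so with replica 3 the first three bricks form one set and the last three form the second (2 x 3 = 6):

 # gluster volume create gvolX replica 3 transport tcp \
    nodeA:/data/gluster/gvolX/brick1 \
    nodeB:/data/gluster/gvolX/brick1 \
    nodeC:/data/gluster/gvolX/brick1 \
    nodeD:/data/gluster/gvolX/brick1 \
    nodeE:/data/gluster/gvolX/brick1 \
    nodeF:/data/gluster/gvolX/brick1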


Prerequisites

  1. 2 or more servers with separate Storage
  2. Private network between servers


Build Document Setup

This build document uses the following setup, which can be stood up easily; using Cloud block devices is no different from using VMware vDisks, SAN/DAS LUNs, iSCSI LUNs, etc.

  • 4x Performance 1 Tier 2 Rackspace Cloud servers - a 20G /dev/xvde ready to use for each brick
  • 1x Cloud Private Network on 192.168.3.0/24 for GlusterFS communication
  • GlusterFS 3.7 installed from Vendor package repository


Node Prep

  • Configure /etc/hosts and iptables
  • Install base toolset(s)
  • Install GlusterFS software
  • Connect GlusterFS nodes

Configure /etc/hosts and iptables

In lieu of DNS, we prepare /etc/hosts on every machine to ensure the nodes can talk to each other by name. All servers have glusterN as their hostname, so we'll use glusN names for the private communication layer between nodes.

# vi /etc/hosts
  192.168.3.2  glus1
  192.168.3.4  glus2
  192.168.3.1  glus3
  192.168.3.3  glus4

# ping -c2 glus1; ping -c2 glus2;  ping -c2 glus3;  ping -c2 glus4

## Red Hat oriented:
# vi /etc/sysconfig/iptables
  -A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables restart

## Debian oriented
# vi /etc/iptables/rules.v4
  -A INPUT -s 192.168.3.0/24 -j ACCEPT
# service iptables-persistent restart

Granular iptables

The above generic iptables rule opens all ports to the subnet; if a more granular setup is required, these are the ports GlusterFS uses:

  • 111 - portmap / rpcbind
  • 24007 - GlusterFS Daemon
  • 24008 - GlusterFS Management
  • 38465 to 38467 - Required for GlusterFS NFS service
  • 24009 to +X - GlusterFS versions less than 3.4, OR
  • 49152 to +X - GlusterFS versions 3.4 and later

Each brick for every volume on the host requires its own port. For every new brick, one new port will be used, starting at 24009 for GlusterFS versions below 3.4 and at 49152 for version 3.4 and above.

Example: If you have one volume with two bricks, you will need to open 24009 - 24010, or 49152 - 49153.
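
For example, a more granular rule set for this document's setup (GlusterFS 3.7, so the 49152+ range, with one brick per node and the built-in NFS service enabled) might look like the following sketch; widen the brick port range to match the number of bricks you actually host:

## Red Hat oriented, granular alternative to the subnet-wide rule:
# vi /etc/sysconfig/iptables
  -A INPUT -s 192.168.3.0/24 -p tcp -m multiport --dports 111,24007,24008 -j ACCEPT
  -A INPUT -s 192.168.3.0/24 -p udp --dport 111 -j ACCEPT
  -A INPUT -s 192.168.3.0/24 -p tcp --dport 38465:38467 -j ACCEPT
  -A INPUT -s 192.168.3.0/24 -p tcp --dport 49152 -j ACCEPT
# service iptables restart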

Install Packages

  1. Install the basic packages for partitioning, LVM2 and XFS
  2. Install the GlusterFS repository and glusterfs* packages
  3. Disable automatic updates of gluster* packages

Some of the required packages may already be installed on the cluster nodes.

## YUM/RPM Based:
# yum -y install parted lvm2 xfsprogs
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse glusterfs-server

## Ubuntu based (Default Ubuntu repo has glusterfs 3.4, here's how to install 3.7):
# apt-get install lvm2 xfsprogs python-software-properties
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-server

Ensure that the gluster* packages are filtered out of automatic updates; upgrading the packages while the cluster is running can crash the bricks.

# grep ^exclude /etc/yum.conf
exclude=kernel* gluster*

## Ubuntu method:
# apt-mark hold glusterfs*

Prepare Bricks

  1. Partition block devices
  2. Create LVM foundation
  3. Prepare volume bricks

The underlying bricks are a standard filesystem and mount point. However, make sure to mount each brick in such a way as to discourage anyone from changing into the directory and writing to the underlying brick directly. Writing directly to a brick will corrupt your Volume!

The bricks must be unique per node, and there should be a directory within the mount to use in volume creation. Attempting to create a replicated volume using the top-level of the mounts results in an error with instructions to use a subdirectory.

all nodes:
 # parted -s -- /dev/xvde mktable gpt
 # parted -s -- /dev/xvde mkpart primary 2048s 100%
 # parted -s -- /dev/xvde set 1 lvm on
 # partx -a /dev/xvde
 # pvcreate /dev/xvde1 
 # vgcreate vgglus1 /dev/xvde1 

Logical Volumes
---------------
 Standard LVM:
  # lvcreate -l 100%VG -n gbrick1 vgglus1

 For GlusterFS snapshot support:
  # lvcreate -l 100%FREE --thinpool lv_thin vgglus1
  # lvcreate --thin -V $(lvdisplay /dev/vgglus1/lv_thin | awk '/LV\ Size/ { print $3 }')G -n gbrick1 vgglus1/lv_thin

Filesystems for bricks
----------------------
 For XFS bricks: (recommended)
  # mkfs.xfs -i size=512 /dev/vgglus1/gbrick1
  # echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 xfs inode64,nobarrier 0 0' >> /etc/fstab
  # mkdir -p /data/gluster/gvol0
  # mount /data/gluster/gvol0

 For ext4 bricks:
  # mkfs.ext4 /dev/vgglus1/gbrick1
  # echo '/dev/vgglus1/gbrick1 /data/gluster/gvol0 ext4 defaults,user_xattr,acl 0 0' >> /etc/fstab
  # mkdir -p /data/gluster/gvol0
  # mount /data/gluster/gvol0

glus1:
 # mkdir -p /data/gluster/gvol0/brick1

glus2:
 # mkdir -p /data/gluster/gvol0/brick1

glus3:
 # mkdir -p /data/gluster/gvol0/brick1

glus4:
 # mkdir -p /data/gluster/gvol0/brick1
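
Before moving on, a quick sanity check on each node that the brick filesystem is really mounted is worthwhile - if it is not, the brick1 directory just created would silently live on the root filesystem:

all nodes:
 # mountpoint /data/gluster/gvol0
 /data/gluster/gvol0 is a mountpoint
 # df -h /data/gluster/gvol0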


GlusterFS Setup

Start glusterd daemon

The glusterd management daemon needs to be started on every node and enabled at boot; it can be restarted at runtime as well:

## Red Hat based:
# service glusterd start
# chkconfig glusterd on
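
On Debian/Ubuntu the PPA package normally registers the service as glusterfs-server (verify the exact name on your release); a rough equivalent:

## Debian/Ubuntu based:
# service glusterfs-server start
# update-rc.d glusterfs-server defaults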

Build Peer Group

This is what's known as a Trusted Storage Pool in the GlusterFS world. Since the early 3.x releases, you only need to probe all of the other nodes from glus1; the peer list is then automatically distributed to every peer from there.

glus1:
 # gluster peer probe glus2
 # gluster peer probe glus3
 # gluster peer probe glus4
 # gluster peer status

[root@gluster1 ~]# gluster pool list
UUID                                    Hostname   State
734aea4c-fc4f-4971-ba3d-37bd5d9c35b8    glus4      Connected
d5c9e064-c06f-44d9-bf60-bae5fc881e16    glus3      Connected
57027f23-bdf2-4a95-8eb6-ff9f936dc31e    glus2      Connected
e64c5148-8942-4065-9654-169e20ed6f20    localhost  Connected
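
Optionally, because the other peers learned glus1 by the IP address it probed them from, probing back once from any other node records glus1 by hostname rather than IP in the peer list:

glus2:
 # gluster peer probe glus1
 # gluster pool list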

Volume Creation

We will set basic auth restrictions so that only our private subnet is allowed, because by default the glusterd NFS service allows global read/write at Volume creation; glusterd automatically starts an NFS server on each node and exports the volume through it from each of the nodes. The reason for this behaviour is that, in order to use the native client (FUSE) for mounting the volume on clients, the clients have to run exactly the same version of the GlusterFS packages. If the versions differ, there can be differences in the hashing algorithms used by servers and clients, and the clients won't be able to connect.

Replicated Volume

This example will create replication to all 4 nodes - each node contains a copy of all data and the size of the Volume is the size of a single brick. Notice how the info shows 1 x 4 = 4 in the output.

one node only:
 # gluster volume create gvol0 replica 4 transport tcp \
    glus1:/data/gluster/gvol0/brick1 \
    glus2:/data/gluster/gvol0/brick1 \
    glus3:/data/gluster/gvol0/brick1 \
    glus4:/data/gluster/gvol0/brick1
 # gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
 # gluster volume set gvol0 nfs.disable off
 # gluster volume set gvol0 nfs.addr-namelookup off
 # gluster volume set gvol0 nfs.export-volumes on
 # gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
 # gluster volume set gvol0 performance.io-thread-count 32
 # gluster volume start gvol0

[root@gluster1 ~]# gluster volume info gvol0
Volume Name: gvol0
Type: Replicate
Volume ID: 65ece3b3-a4dc-43f8-9b0f-9f39c7202640
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32
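
A quick way to confirm that all four brick processes (and the per-node NFS server, while nfs.disable is off) came online after the start:

# gluster volume status gvol0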

Distributed-Replicated Volume

This example will create distributed replication across 2 x 2 nodes - each file is replicated to one pair of nodes, and the size of the Volume is the size of two bricks. Notice how the info shows 2 x 2 = 4 in the output.

one node only:
 # gluster volume create gvol0 replica 2 transport tcp \
    glus1:/data/gluster/gvol0/brick1 \
    glus2:/data/gluster/gvol0/brick1 \
    glus3:/data/gluster/gvol0/brick1 \
    glus4:/data/gluster/gvol0/brick1
 # gluster volume set gvol0 auth.allow 192.168.3.*,127.0.0.1
 # gluster volume set gvol0 nfs.disable off
 # gluster volume set gvol0 nfs.addr-namelookup off
 # gluster volume set gvol0 nfs.export-volumes on
 # gluster volume set gvol0 nfs.rpc-auth-allow 192.168.3.*
 # gluster volume set gvol0 performance.io-thread-count 32
 # gluster volume start gvol0

[root@gluster1 ~]# gluster volume info gvol0
Volume Name: gvol0
Type: Distributed-Replicate
Volume ID: d883f891-e38b-4565-8487-7e50ca33dbd4
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
nfs.rpc-auth-allow: 192.168.3.*
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1
performance.io-thread-count: 32

Volume Deletion

After ensuring that no clients (either local or remote) have the Volume mounted, stop the Volume and delete it.

# gluster volume stop gvol0
# gluster volume delete gvol0
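
Both commands ask for y/n confirmation; if this is being scripted, the CLI's script mode can suppress the prompts:

# gluster --mode=script volume stop gvol0
# gluster --mode=script volume delete gvol0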

Clearing Bricks

If bricks were used in a volume and need to be reused, GlusterFS has set extended attributes on the brick subdirectories (and created a .glusterfs metadata directory). These need to be cleared before the bricks can be reused - or the subdirectory can simply be deleted and recreated.

glus1:
 # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
 # setfattr -x trusted.gfid /data/gluster/gvol0/brick1
 # rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus2:
 # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
 # setfattr -x trusted.gfid /data/gluster/gvol0/brick1
 # rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus3:
 # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
 # setfattr -x trusted.gfid /data/gluster/gvol0/brick1
 # rm -rf /data/gluster/gvol0/brick1/.glusterfs
glus4:
 # setfattr -x trusted.glusterfs.volume-id /data/gluster/gvol0/brick1/
 # setfattr -x trusted.gfid /data/gluster/gvol0/brick1
 # rm -rf /data/gluster/gvol0/brick1/.glusterfs

...or just delete and recreate the brick directories (destroying all data on them):

glus1:
 # rm -rf /data/gluster/gvol0/brick1
 # mkdir /data/gluster/gvol0/brick1
glus2:
 # rm -rf /data/gluster/gvol0/brick1
 # mkdir /data/gluster/gvol0/brick1
glus3:
 # rm -rf /data/gluster/gvol0/brick1
 # mkdir /data/gluster/gvol0/brick1
glus4:
 # rm -rf /data/gluster/gvol0/brick1
 # mkdir /data/gluster/gvol0/brick1
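
Either way, a quick check before reusing a brick directory (assuming the attr package providing getfattr is installed) confirms that no trusted.* GlusterFS attributes remain:

glus1:
 # getfattr -m . -d -e hex /data/gluster/gvol0/brick1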

Adding Bricks

Additional bricks can be added to a running Volume easily:

# gluster volume add-brick gvol0 glus5:/data/gluster/gvol0/brick1

The add-brick command can also be used to change the layout of your volume, for example changing a 2-node Distributed volume into a 4-node Distributed-Replicated volume. After such an operation you must rebalance the volume; new files will automatically be created on the new nodes, but existing files will not be moved.

# gluster volume add-brick gvol0 replica 2 \
   glus5:/data/gluster/gvol0/brick1 \
   glus6:/data/gluster/gvol0/brick1
 # gluster volume rebalance gvol0 start
 # gluster volume rebalance gvol0 status

## If needed (something didn't work right)
# gluster volume rebalance gvol0 stop

When expanding distributed replicated and distributed striped volumes, you must add a number of bricks that is a multiple of the replica or stripe count. For example, to expand a distributed replicated volume with a replica count of 2, you need to add bricks in multiples of 2 (such as 4, 6, 8, etc.):

# gluster volume add-brick gvol0 \
   glus5:/data/gluster/gvol0/brick1 \
   glus6:/data/gluster/gvol0/brick1

Volume Options

To view configured volume options:

# gluster volume info gvol0
 
Volume Name: gvol0
Type: Replicate
Volume ID: bcbfc645-ebf9-4f83-b9f0-2a36d0b1f6e3
Status: Started
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: glus1:/data/gluster/gvol0/brick1
Brick2: glus2:/data/gluster/gvol0/brick1
Brick3: glus3:/data/gluster/gvol0/brick1
Brick4: glus4:/data/gluster/gvol0/brick1
Options Reconfigured:
performance.cache-size: 1073741824
performance.io-thread-count: 64
cluster.choose-local: on
nfs.rpc-auth-allow: 192.168.3.*,127.0.0.1
nfs.export-volumes: on
nfs.addr-namelookup: off
nfs.disable: off
auth.allow: 192.168.3.*,127.0.0.1

To set an option for a volume, use the set keyword like so:

# gluster volume set gvol0 performance.write-behind off
volume set: success

To clear an option on a Volume back to its default, use the reset keyword like so:

# gluster volume reset gvol0 performance.read-ahead
volume reset: success: reset volume successful
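
The full list of tunable options, along with default values and descriptions, is available from the CLI itself:

# gluster volume set help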


Client Mounts

From a client perspective the GlusterFS Volume can be mounted in two fundamental ways:

  1. FUSE Client
  2. NFS Client

FUSE Client

The FUSE client allows the mount to happen with a GlusterFS "round robin" style connection; in /etc/fstab the name of one node is used, however internal mechanisms allow that node to fail and the client will roll over to other connected nodes in the Trusted Storage Pool. Performance is slightly lower than the NFS method in testing, though not drastically so; the gain is automatic HA failover on the client side, which is typically worth the small performance hit.

## RPM based:
# wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/CentOS/glusterfs-epel.repo
# yum -y install glusterfs glusterfs-fuse

## Ubuntu based (the stock glusterfs-client 3.4 works with glusterfs-server 3.5, but for the most recent version do this):
# add-apt-repository ppa:gluster/glusterfs-3.7
# apt-get update
# apt-get install glusterfs-client
##

## Common:
# vi /etc/hosts
  192.168.3.2  glus1
  192.168.3.4  glus2
  192.168.3.1  glus3
  192.168.3.3  glus4

# modprobe fuse
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 glusterfs defaults,_netdev,backup-volfile-servers=glus2 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
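
The same mount can also be done as a one-off from the command line, without the fstab entry (same options as above):

# mount -t glusterfs -o backup-volfile-servers=glus2 glus1:/gvol0 /mnt/gluster/gvol0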

NFS Client

The standard Linux NFSv3 client tools are used to mount one of the GlusterFS nodes; performance is typically a little better than the FUSE client, however the downside is that the connection is 1-to-1: if that GlusterFS node goes down, the client will not roll over to another node. A separate solution such as HAProxy/keepalived, a load balancer, etc. has to be added to provide a floating IP in this use case.

## RPM based:
# yum -y install rpcbind nfs-utils
# service rpcbind restart; chkconfig rpcbind on
# service nfslock restart; chkconfig nfslock on

## Ubuntu:
# apt-get install nfs-common
##

## Common:
# echo 'glus1:/gvol0 /mnt/gluster/gvol0 nfs rsize=4096,wsize=4096,hard,intr 0 0' >> /etc/fstab
# mkdir -p /mnt/gluster/gvol0
# mount /mnt/gluster/gvol0
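
A couple of optional follow-ups, hedged on your client defaults: showmount confirms the Gluster NFS server is exporting the volume, and since the built-in Gluster NFS server only speaks NFSv3, clients that default to NFSv4 may need vers=3 in the mount options:

# showmount -e glus1
# mount -t nfs -o vers=3,rsize=4096,wsize=4096,hard,intr glus1:/gvol0 /mnt/gluster/gvol0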

