[UPDATED with more info] regression in Gluster

It's looking like updates with older versions lying around didn't make the new versions very happy. Actually, it made them very unhappy. To be more technical about it: it appears the library search path found the older libs first, and they didn't mesh well with the newer libs. This was with an rpm -Uvh upgrade, at that …

On an absolutely clean install, I cannot reproduce this problem. With an upgrade, I seem to be able to reproduce it.

So it's not solved, but I have a work-around: scrub the old libraries off the system (remove them from the library search paths, etc.).
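
For anyone else bitten by this, something along these lines is how I hunted down the strays (a rough sketch; the library names and paths are what gluster uses on my boxes, adjust for yours):

# see every copy of the gluster libraries the linker knows about
ldconfig -p | grep -i gluster

# hunt for leftover copies outside the package-managed locations
find /usr /usr/local /opt -name 'libglusterfs*' -o -name 'libgfrpc*' 2>/dev/null

# confirm which copies the binaries actually resolve against
ldd $(which glusterfsd) | grep -i gluster

# after removing/relocating the stale copies, rebuild the linker cache
ldconfig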

… and yes, I filed a bug, and yes, it was closed on me. Still can’t quite figure out why. Just asked for an escalation.

Gluster case 00002644 and Red Hat case 00568461

Short version: it appears (and we've been able to reproduce this in the lab) that glusterfs no longer uses the RDMA transport, even when volumes are built with both the rdma and tcp transport options.

Imagine you have one storage pool for a Franken-Cluster. This cluster has InfiniBand, 1GbE and 10GbE connectivity. Not all nodes have all connectivity options, though all storage is reachable over 1GbE on all nodes. For the IB-connected nodes, we want to use rdma. For the 10GbE nodes, we want to use those NICs. In the event of no other options, we want to fall back to the 1GbE NICs.

The first problem is in the way gluster specifies its volumes. I'd opened a ticket on this before … basically you have to play DNS games to have the volume's server names mapped to a relevant IPoIB or IP address in order to see the storage at all. That is, an IB-only connected cluster (and yes, these are out there) may not see the same mapping of

brick_1 -> ip_address_1

as the 1GbE or 10GbE nodes do, as the packets may take a different route to reach the storage unit. In the past, gluster appeared to do the right thing and negotiate the fastest possible connection. As of 3.1 and higher, this no longer appears to be happening.
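
The "DNS games" boil down to resolving the same server name to a different address depending on what the client can reach. A minimal sketch with /etc/hosts (the GbE address below is made up for the example):

# /etc/hosts on an IB-connected client: point the brick host at its IPoIB address
172.16.64.102   dv4-2

# /etc/hosts on a 1GbE-only client: same name, GbE address instead
10.100.0.102    dv4-2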

But the real issue is the rdma functionality. Which appears to be broken.

In 3.2.5 (and in earlier releases after 3.0.x), the mechanism to force the rdma transport changed from a mount option:

mount -o transport=rdma,... -t glusterfs server_1:/brick_1 /gluster/mount/point

to

mount -o ... -t glusterfs server_1:/brick_1.rdma /gluster/mount/point

Which is all well and good. Apart from the not working part.
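
Presumably the same convention carries over to fstab; something like the line below should do it (I haven't verified this, so treat it as a guess):

# hypothetical /etc/fstab entry using the new volname.rdma convention
server_1:/brick_1.rdma  /gluster/mount/point  glusterfs  defaults,_netdev  0  0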

Here’s a session on an IB and 10GbE connected server:

root@dv4-2:~# ifinfo --ifname=ib0
device:	address/netmask			MTU	   Tx (MB)	   Rx (MB)
ib0:	172.16.64.102/255.255.0.0      	2044	     0.219	     2.425

root@dv4-2:~# gluster -V
glusterfs 3.2.5 built on Dec  4 2011 19:58:44
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.

root@dv4-2:~# mount | grep data
10.100.100.250:/data/tiburon/diskless/hn on / type nfs (rw,errors=remount-ro,vers=4,addr=10.100.100.250,clientaddr=10.100.0.33)
/dev/md1 on /data/1 type xfs (rw)
/dev/md2 on /data/2 type xfs (rw)

root@dv4-2:~# gluster volume create sicluster_glfs transport tcp,rdma dv4-2:/data/1/glusterfs/dht/ dv4-2:/data/2/glusterfs/dht/
Creation of volume sicluster_glfs has been successful. Please start the volume to access data.

root@dv4-2:~# gluster volume start sicluster_glfs
Starting volume sicluster_glfs has been successful

root@dv4-2:~# gluster volume set sicluster_glfs auth.allow "*"
Set volume successful

root@dv4-2:~# gluster volume info

Volume Name: sicluster_glfs
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp,rdma
Bricks:
Brick1: dv4-2:/data/1/glusterfs/dht
Brick2: dv4-2:/data/2/glusterfs/dht
Options Reconfigured:
auth.allow: *
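
Before heading to the client, a quick sanity check that the volume really did get rdma plumbing on the server side (a sketch; on 3.2.x the generated volfiles live under /etc/glusterd, and the exact path varies with version):

# which of the generated volfiles mention rdma at all?
grep -l rdma /etc/glusterd/vols/sicluster_glfs/*.vol

# and are the brick daemons actually running?
ps aux | grep [g]lusterfsd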

Now on the client side

[root@jr5-lab ~]# ifinfo --ifname=ib0
device:	address/netmask			MTU	   Tx (MB)	   Rx (MB)
ib0:	172.16.64.241/255.255.0.0      	65520	     2.887	     0.616


[root@jr5-lab ~]# ping -c 3 172.16.64.102
PING 172.16.64.102 (172.16.64.102) 56(84) bytes of data.
64 bytes from 172.16.64.102: icmp_seq=1 ttl=64 time=0.174 ms
64 bytes from 172.16.64.102: icmp_seq=2 ttl=64 time=0.144 ms
64 bytes from 172.16.64.102: icmp_seq=3 ttl=64 time=0.145 ms

[root@jr5-lab ~]# modprobe -v fuse
insmod /lib/modules/2.6.32.46.scalable/kernel/fs/fuse/fuse.ko 

[root@jr5-lab ~]# mount -t glusterfs 172.16.64.102:/sicluster_glfs.rdma /mnt/t

[root@jr5-lab ~]# mount
/dev/md0 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /ipathfs type ipathfs (rw)
192.168.1.2:/home on /home type nfs (rw,noatime,intr,bg,tcp,rsize=65536,wsize=65536,addr=192.168.1.2)
172.16.64.102:/sicluster_glfs.rdma on /mnt/t type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)

[root@jr5-lab ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
192.168.1.2:/home     451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected

Yeah, it's broken.

And it gets worse.

[root@jr5-lab ~]# mount -t glusterfs 172.16.64.102:/sicluster_glfs /mnt/t

[root@jr5-lab ~]# df -h /mnt/t
df: `/mnt/t': No such file or directory

[root@jr5-lab ~]# df -h 
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
192.168.1.2:/home     451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected

Will do some more experimentation. But it does seem that something is broken. Working on figuring out what.
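
Part of that figuring out is staring at the client-side logs. The glusterfs fuse client names its log after the mount point, so for /mnt/t it should be something like this (a sketch, assuming the default log location):

# fuse client log for the /mnt/t mount
tail -50 /var/log/glusterfs/mnt-t.log

# pull out anything rdma / disconnect related
grep -iE 'rdma|disconnect|failed' /var/log/glusterfs/mnt-t.log | tail -20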

Needless to say, we are catching lots of heat from customers over this. It's kind of a shame to be left twisting in the wind. We'd never do this to our customers or partners.

[update 1]

-bash-4.1# mount -t glusterfs dv4-2:/sicluster_glfs.rdma /mnt/t
-bash-4.1# df -h
Filesystem            Size  Used Avail Use% Mounted on
tmpfs                 7.9G     0  7.9G   0% /dev/shm
none                  7.9G  4.0K  7.9G   1% /tmp
/dev/md1               15T   34M   15T   1% /data/1
/dev/md2               15T   34M   15T   1% /data/2
dv4-2:/sicluster_glfs.rdma
                       30T   67M   30T   1% /mnt/t

-bash-4.1# touch /mnt/t/x
-bash-4.1# ls -alF !$
ls -alF /mnt/t/x
-rw-r--r-- 1 root root 0 Dec  5 04:28 /mnt/t/x

While this works, it only works on the local host, not on a remote host.

On the remote host (another dv4 in the same cluster, on the same IB and 10GbE switches), the mount silently falls back to tcp (IPoIB). We can see the network traffic as tcp traffic over the ib0 interface.
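
For the curious, spotting this is straightforward: if the bytes show up on ib0's normal network counters (or in tcpdump), it is riding IPoIB tcp, not RDMA verbs. A quick sketch using stock tools:

# IPoIB byte counters tick up for tcp-over-ib0, but not for verbs/RDMA traffic
cat /sys/class/net/ib0/statistics/rx_bytes /sys/class/net/ib0/statistics/tx_bytes

# or just sniff the interface while copying a file into the gluster mount;
# rdma traffic will not show up here, IPoIB tcp will
tcpdump -i ib0 -n tcp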

Note: testing is being done with RHEL 6.1 on the storage node and an Ubuntu client. Same issue with Ubuntu on both client and server. Will try CentOS 5.x later (it's not fully integrated into Tiburon yet … and Tiburon makes it real easy to select which OS to boot from … 🙂 )

Of course, RHEL 6.1 to RHEL 6.1 seems to work now with OFED 1.5.4-rc5. Ugh. Just what I want to tell our customers: they must upgrade to make this work.
