[UPDATED with more info] regression in Gluster

It’s looking like updates with older versions lying around didn’t make the new versions very happy. Actually, it made them very unhappy. To be more technical: it appears that a library search path found the older libs first, and they didn’t mesh well with the newer libs. This was with an rpm -Uvh upgrade at that …
On an absolutely clean install, I cannot reproduce this problem. With an upgrade, I seem to be able to reproduce it.
So it’s not solved, but I have a work-around: scrub the old libraries off (remove them from the search paths, etc.).
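As a sketch of that scrub, something like the following can flag leftover copies before they shadow the new install. The directory list is an assumption; adjust for your distro:

```shell
# Count versioned copies of libglusterfs along the usual runtime search path.
# More than one copy after an rpm -Uvh is a red flag: the loader may resolve
# the stale one first.
stale=0
for dir in /usr/lib64 /usr/lib /usr/local/lib; do
    [ -d "$dir" ] || continue
    n=$(find "$dir" -maxdepth 1 -name 'libglusterfs.so*' 2>/dev/null | wc -l)
    stale=$((stale + n))
done
echo "libglusterfs copies found: $stale"

# Check which copy the dynamic loader would actually pick, where ldconfig exists
if command -v ldconfig >/dev/null 2>&1; then
    ldconfig -p | grep glusterfs || echo "no glusterfs libs in the ld cache"
fi
```

If more than one copy turns up, removing the stale ones (or their search-path entries) and re-running ldconfig matches the work-around above.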

… and yes, I filed a bug, and yes, it was closed on me. Still can’t quite figure out why. Just asked for an escalation.
Gluster case 00002644 and Red Hat case 00568461
Short version: it appears (and we’ve been able to reproduce this in the lab) that glusterfs no longer uses the RDMA transport, even when volumes are built with the rdma and tcp transport options.
Imagine you have one storage pool for a Franken-Cluster. This cluster has InfiniBand, 1GbE, and 10GbE connectivity. Not all nodes have all connectivity options, though all storage is reachable over 1GbE on all nodes. For the IB-connected nodes, we want to use RDMA. For the 10GbE nodes, we want to use those NICs. In the event of no other options, we want to use the 1GbE NICs.
First problem is in the way gluster specifies its volumes. I’d opened a ticket on this before … basically you have to play DNS games to map the volume names to a relevant IPoIB or IP address in order to see the storage at all. That is, an IB-only connected cluster (and yes, these are out there) may not see the same mapping of
brick_1 -> ip_address_1
as a 1GbE or 10GbE node does, as the packets may take a different route to reach the storage unit. In the past, gluster appeared to do the right thing and negotiate the fastest possible connection. As of 3.1 and higher, this no longer appears to happen.
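Since gluster volfiles carry hostnames rather than transports, the “DNS games” amount to split-horizon name resolution: each class of client resolves the same brick hostname to the address on the fabric it should use. A minimal /etc/hosts sketch, with all addresses invented for illustration:

```
# On IB-connected clients: brick hostnames resolve to IPoIB addresses
10.10.0.12    server_1     # hypothetical IPoIB subnet

# On 1GbE/10GbE-only clients: the same name points at the Ethernet side
# 192.168.1.12  server_1   # hypothetical Ethernet subnet
```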
But the real issue is the rdma functionality. Which appears to be broken.
In 3.2.5 (and in earlier releases above 3.0.x), the mechanism to force the rdma transport changed from a mount option:
mount -o transport=rdma,... -t glusterfs server_1:/brick_1 /gluster/mount/point
to a suffix on the volume name:
mount -o ... -t glusterfs server_1:/brick_1.rdma /gluster/mount/point
Which is all well and good. Apart from the not working part.
Here’s a session on an IB and 10GbE connected server:

root@dv4-2:~# ifinfo --ifname=ib0
device:	address/netmask			MTU	   Tx (MB)	   Rx (MB)
ib0:      	2044	     0.219	     2.425
root@dv4-2:~# gluster -V
glusterfs 3.2.5 built on Dec  4 2011 19:58:44
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. 
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
root@dv4-2:~# mount | grep data
 on / type nfs (rw,errors=remount-ro,vers=4,addr=,clientaddr=
/dev/md1 on /data/1 type xfs (rw)
/dev/md2 on /data/2 type xfs (rw)
root@dv4-2:~# gluster volume create sicluster_glfs transport tcp,rdma dv4-2:/data/1/glusterfs/dht/ dv4-2:/data/2/glusterfs/dht/
Creation of volume sicluster_glfs has been successful. Please start the volume to access data.
root@dv4-2:~# gluster volume start sicluster_glfs
Starting volume sicluster_glfs has been successful
root@dv4-2:~# gluster volume set sicluster_glfs auth.allow "*"
Set volume successful
root@dv4-2:~# gluster volume info
Volume Name: sicluster_glfs
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp,rdma
Brick1: dv4-2:/data/1/glusterfs/dht
Brick2: dv4-2:/data/2/glusterfs/dht
Options Reconfigured:
auth.allow: *
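To see which transports the generated client volfiles actually request, grepping them directly is a quick sanity check. The volfile directory here is an assumption (3.2.x kept them under /etc/glusterd, later releases under /var/lib/glusterd):

```shell
# Look for the transport-type lines in the volfiles for this volume
found=""
for d in /etc/glusterd/vols/sicluster_glfs /var/lib/glusterd/vols/sicluster_glfs; do
    [ -d "$d" ] || continue
    # -h suppresses filenames; sort -u collapses duplicates across volfiles
    found=$(grep -h 'transport-type' "$d"/*.vol 2>/dev/null | sort -u)
done
echo "${found:-no volfiles found}"
```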

Now on the client side

[root@jr5-lab ~]# ifinfo --ifname=ib0
device:	address/netmask			MTU	   Tx (MB)	   Rx (MB)
ib0:      	65520	     2.887	     0.616
[root@jr5-lab ~]# ping -c 3
PING ( 56(84) bytes of data.
64 bytes from icmp_seq=1 ttl=64 time=0.174 ms
64 bytes from icmp_seq=2 ttl=64 time=0.144 ms
64 bytes from icmp_seq=3 ttl=64 time=0.145 ms
[root@jr5-lab ~]# modprobe -v fuse
insmod /lib/modules/
[root@jr5-lab ~]# mount -t glusterfs /mnt/t
[root@jr5-lab ~]# mount
/dev/md0 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /ipathfs type ipathfs (rw)
 on /home type nfs (rw,noatime,intr,bg,tcp,rsize=65536,wsize=65536,addr=
 on /mnt/t type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
[root@jr5-lab ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
                      451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected

Yeah, it’s broken.
And it gets worse.

[root@jr5-lab ~]# mount -t glusterfs /mnt/t
[root@jr5-lab ~]# df -h /mnt/t
df: `/mnt/t': No such file or directory
[root@jr5-lab ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
                      451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected

Will do some more experimentation, but it does seem that something is broken. Working on figuring out what.
Needless to say, we are catching lots of heat from customers over this. It’s kind of a shame to be left twisting in the wind. We’d never do this to our customers or partners.
[update 1]

-bash-4.1# mount -t glusterfs dv4-2:/sicluster_glfs.rdma /mnt/t
-bash-4.1# df -h
Filesystem            Size  Used Avail Use% Mounted on
tmpfs                 7.9G     0  7.9G   0% /dev/shm
none                  7.9G  4.0K  7.9G   1% /tmp
/dev/md1               15T   34M   15T   1% /data/1
/dev/md2               15T   34M   15T   1% /data/2
                       30T   67M   30T   1% /mnt/t
-bash-4.1# touch /mnt/t/x
-bash-4.1# ls -alF !$
ls -alF /mnt/t/x
-rw-r--r-- 1 root root 0 Dec  5 04:28 /mnt/t/x

While this works, it works on the local host, not on a remote host.
On the remote host (another dv4 in the same cluster, on the same IB and 10GbE switch), the mount silently falls back to tcp (IPoIB). We can see the network traffic as tcp traffic over the ib0 interface.
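A crude way to confirm a silent fallback like this: native RDMA bypasses the ib0 netdev, so its byte counters should stay roughly flat during heavy I/O, while IPoIB TCP traffic makes them climb in step with the transfer. A sketch, assuming the IPoIB interface is named ib0:

```shell
# Read the kernel byte counters for an interface (prints 0 if it is absent)
counter() { cat "/sys/class/net/$1/statistics/${2}_bytes" 2>/dev/null || echo 0; }

rx0=$(counter ib0 rx); tx0=$(counter ib0 tx)
# ... run the gluster I/O under test here, e.g. a dd onto the mount point ...
sleep 1
rx1=$(counter ib0 rx); tx1=$(counter ib0 tx)
echo "ib0 delta during the run: rx=$((rx1 - rx0)) tx=$((tx1 - tx0)) bytes"
```

Counters that barely move while gigabytes flow suggest the traffic really is RDMA; large deltas mean TCP over IPoIB.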
Note: testing is being done with RHEL 6.1 on the storage node and an Ubuntu client. Same issue with Ubuntu as both client and server. Will try CentOS 5.x later (not fully integrated into Tiburon yet … and that makes it real easy to select which OS to boot from … 🙂 )
Of course, RHEL 6.1 to 6.1 seems to work now with OFED 1.5.4-rc5. Ugh. Just what I want to tell our customer: they must upgrade to make this work.

7 thoughts on “[UPDATED with more info] regression in Gluster”

  1. Shame – this was the one you and I were chatting about on the show floor of SC11… Maybe the push to get gluster appliances out of the door made the priority of other functioning code fall off the road map. Hrrm – thanks for sharing the results Joe! In other news the Si coffee mug is going gangbusters :-))

  2. @James
    I don’t want to guess the reasons why right now. I do want to push on this to make sure others test it as well. We can reproduce it easily apart from one specific case this evening (using the latest model OFED stack).
    Will try on Centos 5.x with this.

  3. Unfortunately, “case closed without resolution” is my experience with Red Hat support, and one of the reasons that I was not excited to see Gluster acquired.

  4. Ah, how unfortunate. A mix of 1/10GbE and IB is definitely a use case we are interested in. Really, it should be for any site that has multiple clusters and one-off systems on the floor.

  5. @John
    I had responded within a few hours of the email in question in October/November. I even sent a few more emails about this to the site. Even mentioned this issue to Jacob Shucart at SC11.
    I wasn’t implying or suggesting Red Hat didn’t want to look at it. I was expressing frustration that my ticket with a regression was closed without a response to my response.
    I’ve spent many hours on the phone with the customer in question, and they have not taken kindly to the responses we’ve been getting. I have taken heat over this issue from them.
    Right now, I am attempting to discern if we have a gluster version collision of some sort (libraries), or something else. RHEL 6.1 with OFED 1.5.4-rc5 works client and server.
    As for the custom kernels, they have never been an issue for us.
    This case has worked in the past, and it is my expectation it will work again. I need to understand what specifically is failing. I am starting to suspect some sort of library collision between gluster versions (all machines had been upgraded, not clean installations).
    My hope has been that they wouldn’t be slow. Generally Gluster support has been good before the acquisition.
    There are too many variables, and what’s starting to look suspect is library collisions from previous versions. Which one might expect would normally be handled by an RPM upgrade process.
    I can’t say that they have been slow. I’ve asked a question before in the support system, and got an answer relatively quickly. Wasn’t the answer I had hoped for.
    Red Hat has been helpful and supportive in the past and we expect that they will remain so. I’ve not found them to be slow by our measures. Nothing like some of the folks we’ve dealt with.
    My annoyance appears to be due to a lost email, that resulted in a ticket closing, and an unhappy customer wanting this solved. I didn’t want to circumvent the support system by going directly to the people I know. I wanted to make the system work.
    Again, Gluster is a good tool, it works well. Gluster and Red Hat have been supportive.

  6. Sadly my experience with Red Hat is not that great either, I’m still waiting for a kernel driver bug that had already been fixed upstream when reported against RHEL 5.5 to get fixed, 12 months after the initial report.
    It’s not like it’s not serious – packets on one interface of a dual port 10gigE card are occasionally delivered by the driver on the *other* interface. Strangely enough, TCP doesn’t appreciate this. We’ve got a work-around by bonding the two interfaces together and tagging the two networks down them, so it no longer matters which they appear on, but that won’t help should one of the two links fail, as the packets on the good link can get dropped because the kernel thinks they’re arriving on the down link.
    The last set of correspondence on the bugzilla indicated that they wanted to delay it again to 5.9, plus they removed it as a blocker for 5.8.
    The guy on the other end of their customer support system (separate to BZ) has been very helpful, but I can sense that he’s been frustrated trying to get answers from engineering at times.

Comments are closed.