It's looking like updates with older versions lying around didn't make the new versions very happy. Actually, it made them very unhappy. More technically, it appears that a library search path found the older libs first, and they didn't mesh well with the newer libs. This was with an rpm -Uvh upgrade at that …
On an absolutely clean install, I cannot reproduce this problem. With an upgrade, I can reproduce it.
So it's not solved, but I have a workaround: scrub the old libraries off (remove them from the search paths, etc.).
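The scrubbing itself is just hunting down stray copies. A minimal sketch of that hunt, assuming the usual linker search directories (the directory list is illustrative, not from the bug report):

```shell
# Hypothetical scan: walk the usual linker search directories in order and
# report every libglusterfs copy, so stale ones left over from an old
# install can be spotted and removed.
dirs="/usr/local/lib /usr/local/lib64 /usr/lib64 /usr/lib"
hits=0
for d in $dirs; do
    [ -d "$d" ] || continue
    for lib in "$d"/libglusterfs*; do
        [ -e "$lib" ] || continue      # glob matched nothing in this dir
        echo "found: $lib"
        hits=$((hits + 1))
    done
done
echo "libglusterfs copies found: $hits"
```

`ldconfig -p | grep gluster` gives a similar view of what the runtime linker actually sees, and `rpm -qf` on each hit tells you whether any installed package still owns that file.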
… and yes, I filed a bug, and yes, it was closed on me. Still can’t quite figure out why. Just asked for an escalation.
Gluster case 00002644 and Red Hat case 00568461
Short version: it appears (and we've been able to reproduce this in the lab) that glusterfs no longer uses the RDMA transport, even when volumes are built with the rdma and tcp transport options.
Imagine you have one storage pool for a Franken-Cluster. This cluster has InfiniBand, 1GbE, and 10GbE connectivity. Not all nodes have all connectivity options, though all storage is reachable over 1GbE on all nodes. For the IB-connected nodes, we want to use rdma. For the 10GbE nodes, we want to use those NICs. In the event of no other options, we want to use the 1GbE NICs.
The first problem is in the way gluster specifies its volumes. I'd opened a ticket before on this … basically you have to play DNS games to map the volume names to a relevant IPoIB or IP address in order to see the storage at all. That is, an IB-only connected cluster (and yes, these are out there) may not see the same mapping of
brick_1 -> ip_address_1
as the 1GbE or 10GbE node does, since the packets may take a different route to reach the storage unit. In the past, gluster appeared to do the right thing and negotiate the fastest possible connection. As of 3.1 and higher, this no longer appears to happen.
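Concretely, those DNS games come down to per-node name overrides. A hypothetical /etc/hosts sketch (the 10GbE and 1GbE addresses here are made up for illustration):

```
# /etc/hosts on an IB-connected node: resolve the brick host to its IPoIB address
172.16.64.102   dv4-2

# /etc/hosts on a 10GbE node: same name, pointing at the 10GbE address
10.10.64.102    dv4-2

# /etc/hosts on a 1GbE-only node: same name, the 1GbE address
10.100.64.102   dv4-2
```

Each node class then resolves the same brick hostname to whichever interface it can actually reach, which is exactly the kind of thing the filesystem ought to negotiate for you.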
But the real issue is the rdma functionality, which appears to be broken.
In 3.2.5 (and earlier releases above 3.0.x), the mechanism to force rdma transport changed from a mount option:

```
mount -o transport=rdma,... -t glusterfs server_1:/brick_1 /gluster/mount/point
```

to a .rdma suffix on the volume name:

```
mount -o ... -t glusterfs server_1:/brick_1.rdma /gluster/mount/point
```
Which is all well and good. Apart from the not working part.
Here’s a session on an IB and 10GbE connected server:
```
root@dv4-2:~# ifinfo --ifname=ib0
device: address/netmask             MTU    Tx (MB)   Rx (MB)
ib0:    172.16.64.102/255.255.0.0   2044   0.219     2.425
root@dv4-2:~# gluster -V
glusterfs 3.2.5 built on Dec  4 2011 19:58:44
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. <http://www.gluster.com>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU General Public License.
root@dv4-2:~# mount | grep data
10.100.100.250:/data/tiburon/diskless/hn on / type nfs (rw,errors=remount-ro,vers=4,addr=10.100.100.250,clientaddr=10.100.0.33)
/dev/md1 on /data/1 type xfs (rw)
/dev/md2 on /data/2 type xfs (rw)
root@dv4-2:~# gluster volume create sicluster_glfs transport tcp,rdma dv4-2:/data/1/glusterfs/dht/ dv4-2:/data/2/glusterfs/dht/
Creation of volume sicluster_glfs has been successful. Please start the volume to access data.
root@dv4-2:~# gluster volume start sicluster_glfs
Starting volume sicluster_glfs has been successful
root@dv4-2:~# gluster volume set sicluster_glfs auth.allow "*"
Set volume successful
root@dv4-2:~# gluster volume info

Volume Name: sicluster_glfs
Type: Distribute
Status: Started
Number of Bricks: 2
Transport-type: tcp,rdma
Bricks:
Brick1: dv4-2:/data/1/glusterfs/dht
Brick2: dv4-2:/data/2/glusterfs/dht
Options Reconfigured:
auth.allow: *
```
Now on the client side
```
[root@jr5-lab ~]# ifinfo --ifname=ib0
device: address/netmask             MTU     Tx (MB)   Rx (MB)
ib0:    172.16.64.241/255.255.0.0   65520   2.887     0.616
[root@jr5-lab ~]# ping -c 3 172.16.64.102
PING 172.16.64.102 (172.16.64.102) 56(84) bytes of data.
64 bytes from 172.16.64.102: icmp_seq=1 ttl=64 time=0.174 ms
64 bytes from 172.16.64.102: icmp_seq=2 ttl=64 time=0.144 ms
64 bytes from 172.16.64.102: icmp_seq=3 ttl=64 time=0.145 ms
[root@jr5-lab ~]# modprobe -v fuse
insmod /lib/modules/22.214.171.124.scalable/kernel/fs/fuse/fuse.ko
[root@jr5-lab ~]# mount -t glusterfs 172.16.64.102:/sicluster_glfs.rdma /mnt/t
[root@jr5-lab ~]# mount
/dev/md0 on / type ext3 (rw,noatime)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
none on /ipathfs type ipathfs (rw)
192.168.1.2:/home on /home type nfs (rw,noatime,intr,bg,tcp,rsize=65536,wsize=65536,addr=192.168.1.2)
172.16.64.102:/sicluster_glfs.rdma on /mnt/t type fuse.glusterfs (rw,allow_other,default_permissions,max_read=131072)
[root@jr5-lab ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
192.168.1.2:/home     451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected
```
Yeah, it's broken.
And it gets worse.
```
[root@jr5-lab ~]# mount -t glusterfs 172.16.64.102:/sicluster_glfs /mnt/t
[root@jr5-lab ~]# df -h /mnt/t
df: `/mnt/t': No such file or directory
[root@jr5-lab ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/md0               24G   22G  861M  97% /
tmpfs                  34G     0   34G   0% /dev/shm
192.168.1.2:/home     451G  341G  110G  76% /home
df: `/mnt/t': Transport endpoint is not connected
```
Will do some more experimentation. But it does seem that something is broken. Working on figuring out what.
Needless to say, we are catching lots of heat from customers over this. It's kind of a shame to be left twisting in the wind. We'd never do this to our customers or partners.
```
-bash-4.1# mount -t glusterfs dv4-2:/sicluster_glfs.rdma /mnt/t
-bash-4.1# df -h
Filesystem                  Size  Used Avail Use% Mounted on
tmpfs                       7.9G     0  7.9G   0% /dev/shm
none                        7.9G  4.0K  7.9G   1% /tmp
/dev/md1                     15T   34M   15T   1% /data/1
/dev/md2                     15T   34M   15T   1% /data/2
dv4-2:/sicluster_glfs.rdma   30T   67M   30T   1% /mnt/t
-bash-4.1# touch /mnt/t/x
-bash-4.1# ls -alF !$
ls -alF /mnt/t/x
-rw-r--r-- 1 root root 0 Dec  5 04:28 /mnt/t/x
```
While this works, it only works on the local host, not on a remote host.
On the remote host (another dv4 in the same cluster, on the same IB and 10GbE switches), the mount silently falls back to tcp (IPoIB). We can see the network traffic as tcp traffic over the ib0 interface.
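A quick way to catch that silent fallback, sketched here with the raw sysfs byte counters rather than ifinfo (the sleep stands in for real gluster I/O, and it defaults to lo so the sketch runs anywhere; on a real node set IFNAME=ib0):

```shell
# Sample an interface's kernel byte counters before and after some gluster
# I/O. If a mount that is supposedly using RDMA makes the ib0 counters
# climb, the data is really moving as tcp over IPoIB.
ifname=${IFNAME:-lo}
stats=/sys/class/net/$ifname/statistics
[ -d "$stats" ] || { echo "no such interface: $ifname" >&2; exit 1; }
rx0=$(cat "$stats/rx_bytes"); tx0=$(cat "$stats/tx_bytes")
sleep 1    # ... run the gluster read/write test here instead ...
rx1=$(cat "$stats/rx_bytes"); tx1=$(cat "$stats/tx_bytes")
echo "traffic during test: rx +$((rx1 - rx0)) bytes, tx +$((tx1 - tx0)) bytes"
```

Native RDMA bypasses the IP stack, so a genuinely rdma-transported test should leave these counters essentially flat.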
Note: testing is being done with RHEL 6.1 on the storage node and an Ubuntu client. Same issue with Ubuntu on both client and server. Will try CentOS 5.x later (it's not fully integrated into Tiburon yet … and Tiburon makes it real easy to select which OS to boot from … 🙂 )
Of course, RHEL 6.1 to RHEL 6.1 seems to work now with OFED 1.5.4-rc5. Ugh. Just what I want to tell our customer: they must upgrade to make this work.