Discussion:
[Lustre-discuss] LustreError: 11-0: an error occurred while communicating with 192.168.16.24@o2ib. The ost_connect operation failed with -19
Dennis Nelson
2009-03-24 23:30:39 UTC
Permalink
Hi,

I have encountered an issue with Lustre that has happened a couple of times
now. I am beginning to suspect an issue with the IB fabric but wanted to
reach out to the list to confirm my suspicions. The odd part is that even
when the MDS complains that it cannot connect to a given ost, lctl ping to
the OSS that owns the OST works without an issue. Also, the OSS in question
has other OSTs which, in the latest case, have not reported any errors.
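
For reference, -19 is ENODEV ("no such device"), which I gather usually indicates that the node being contacted does not have that OST device set up at the moment of the connect attempt. The ping I mention is simply the LNET-level one, along these lines (NID taken from the error messages):

# from the MDS: LNET-level ping of the NID named in the error
lctl ping 192.168.16.24@o2ib

# on the OSS that owns the OST: confirm the OST devices are configured
lctl dl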

I have attached a file with the errors that I encountered on the MDS. I am
running Lustre 1.6.6 with a pair of MDSs and 8 OSSs, with 28 OSTs spread
across the 8 OSSs. I am using IB DDR interconnects between all systems.

Thanks,

-------------- next part --------------
A non-text attachment was scrubbed...
Name: errors
Type: application/octet-stream
Size: 33745 bytes
Desc: errors
URL: http://lists.lustre.org/pipermail/lustre-discuss/attachments/20090324/c25b0e15/attachment-0001.obj
Kevin Van Maren
2009-03-25 16:12:06 UTC
Permalink
Dennis,

You haven't provided enough context for people to help.

What have you done to determine if the IB fabric is working properly?

What are the hostnames and NIDs for the 10 servers (lctl list_nids)?
Which OSTs are on which servers?

OST4 is on a machine at 192.168.16.23.
What machine is 192.168.16.24? Is that OST4's failover partner?

You have a client at 192.168.16.1?

Kevin
Dennis Nelson
2009-03-25 17:03:59 UTC
Permalink
Post by Kevin Van Maren
Dennis,
You haven't provided enough context for people to help.
What have you done to determine if the IB fabric is working properly?
Basic functionality appears to be there. I can lctl ping between all
servers. I have run ibdiagnet and it appears to be clean. I have also run
several instances of ib_rdma_bw between various Lustre servers, and it
completes with good performance.
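
Roughly what I ran, reconstructing the exact invocations from memory (the options may have differed slightly):

# fabric sweep from one of the servers
ibdiagnet

# RDMA bandwidth test between two servers: start the server side on one node...
oss3:~ # ib_rdma_bw
# ...and point the client side at it from another
oss4:~ # ib_rdma_bw oss3

# LNET-level ping between servers
lctl ping 192.168.16.23@o2ib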
Post by Kevin Van Maren
What are hostnames and NIDs for the 10 servers (lctl list_nids)?
Executing on mds2
192.168.17.11@o2ib
Executing on mds1
192.168.16.11@o2ib
Executing on oss1
192.168.16.21@o2ib
Executing on oss2
192.168.16.22@o2ib
Executing on oss3
192.168.16.23@o2ib
Executing on oss4
192.168.16.24@o2ib
Executing on oss5
192.168.17.21@o2ib
Executing on oss6
192.168.17.22@o2ib
Executing on oss7
192.168.17.23@o2ib
Executing on oss8
192.168.17.24@o2ib
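
The output above was gathered with a loop along these lines (the exact invocation is reconstructed from memory):

for h in mds2 mds1 oss1 oss2 oss3 oss4 oss5 oss6 oss7 oss8; do
    echo "Executing on $h"
    ssh $h lctl list_nids
done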
Post by Kevin Van Maren
Which OSTs are on which servers?
Lustre filesystems on mds2
Lustre filesystems on mds1
/dev/mapper/mdt 2009362216 485528 2008876688 1% /mnt/mdt
Lustre filesystems on oss1
/dev/mapper/ost0000 1130279280 715816 1129563464 1% /mnt/ost0000
/dev/mapper/ost0001 1130279280 659436 1129619844 1% /mnt/ost0001
/dev/mapper/ost000f 1130279280 667208 1129612072 1% /mnt/ost000f
Lustre filesystems on oss2
/dev/mapper/ost0002 1130279280 697520 1129581760 1% /mnt/ost0002
/dev/mapper/ost0003 1130279280 585260 1129694020 1% /mnt/ost0003
/dev/mapper/ost0010 1130279280 600640 1129678640 1% /mnt/ost0010
Lustre filesystems on oss3
/dev/mapper/ost0004 1130279280 515628 1129763652 1% /mnt/ost0004
/dev/mapper/ost0005 1130279280 549292 1129729988 1% /mnt/ost0005
/dev/mapper/ost0011 1130279280 697956 1129581324 1% /mnt/ost0011
Lustre filesystems on oss4
/dev/mapper/ost0006 1130279280 565684 1129713596 1% /mnt/ost0006
/dev/mapper/ost0012 1130279280 482856 1129796424 1% /mnt/ost0012
/dev/mapper/ost0013 1130279280 482856 1129796424 1% /mnt/ost0013
Lustre filesystems on oss5
/dev/mapper/ost0007 1130279280 532844 1129746436 1% /mnt/ost0007
/dev/mapper/ost0008 1130279280 682308 1129596972 1% /mnt/ost0008
/dev/mapper/ost0014 1130279280 532016 1129747264 1% /mnt/ost0014
/dev/mapper/ost0015 1130279280 482856 1129796424 1% /mnt/ost0015
Lustre filesystems on oss6
/dev/mapper/ost0009 1130279280 482860 1129796420 1% /mnt/ost0009
/dev/mapper/ost000a 1130279280 585260 1129694020 1% /mnt/ost000a
/dev/mapper/ost0016 1130279280 499244 1129780036 1% /mnt/ost0016
/dev/mapper/ost0017 1130279280 482856 1129796424 1% /mnt/ost0017
Lustre filesystems on oss7
/dev/mapper/ost000b 1130279280 482852 1129796428 1% /mnt/ost000b
/dev/mapper/ost000c 1130279280 482872 1129796408 1% /mnt/ost000c
/dev/mapper/ost0018 1130279280 581172 1129698108 1% /mnt/ost0018
/dev/mapper/ost0019 1130279280 665556 1129613724 1% /mnt/ost0019
Lustre filesystems on oss8
/dev/mapper/ost000d 1130279280 687688 1129591592 1% /mnt/ost000d
/dev/mapper/ost000e 1130279280 606008 1129673272 1% /mnt/ost000e
/dev/mapper/ost001a 1130279280 511600 1129767680 1% /mnt/ost001a
/dev/mapper/ost001b 1130279280 482852 1129796428 1% /mnt/ost001b
Post by Kevin Van Maren
OST4 is on a machine at 192.168.16.23
Yes, oss3.
Post by Kevin Van Maren
What machine is 192.168.16.24? Is that the OST4 failover partner?
Yes, oss4 is the failover partner.
Post by Kevin Van Maren
You have a client at 192.168.16.1?
Yes, and it hangs each time I attempt I/O.
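
For reference, the client-side view of the OST import can be checked with something like the following (paths from memory for 1.6.x):

# which server does the client currently believe owns OST0004?
cat /proc/fs/lustre/osc/*OST0004*/ost_server_uuid

# overall device list and their states on the client
lctl dl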

oss3:~ # tunefs.lustre --dryrun /dev/mapper/ost0004
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

Read previous values:
Target: lustre-OST0004
Index: 4
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib
failover.node=192.168.16.24@o2ib


Permanent disk data:
Target: lustre-OST0004
Index: 4
Lustre FS: lustre
Mount type: ldiskfs
Flags: 0x2
(OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=192.168.16.11@o2ib mgsnode=192.168.17.11@o2ib
failover.node=192.168.16.24@o2ib

exiting before disk write.
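
If I understand the failover.node line correctly, the MDS will retry the connect against 192.168.16.24@o2ib (oss4) whenever oss3 does not answer, and oss4 would return -19 if it does not actually have lustre-OST0004 running at that point. A quick way to confirm what oss4 is currently serving (just a sketch):

oss4:~ # lctl dl | grep -i ost0004
oss4:~ # cat /proc/fs/lustre/devices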