[Lustre-discuss] Lustre locking up on login/interactive nodes
Brock Palen
2008-07-21 15:43:39 UTC
Every so often Lustre locks up. It will recover eventually. The
affected processes show up in 'D' (uninterruptible I/O wait); in this
case it was 'ar' making an archive.
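
A quick way to see which processes are stuck and what they are blocked
on is something like the following (the wchan column is only as
descriptive as the running kernel makes it):

  # list processes in uninterruptible sleep with their kernel wait channel
  ps -eo pid,stat,wchan:30,cmd | awk 'NR==1 || $2 ~ /^D/'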

Dmesg then shows:

Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages
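
For reference, those return codes are just negative errnos; any Python
on the node will decode them:

  # prints the symbolic names: 108 = ESHUTDOWN, 116 = ESTALE on Linux
  python -c 'import errno, os; print([(e, errno.errorcode[e], os.strerror(e)) for e in (108, 116)])'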


Are there special options that should be set on interactive/login
nodes? I remember something about how much memory should be available
on login vs. batch nodes, but I don't know how to change that; I just
assumed Lustre would use it. The login nodes have 8 GB.
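
If what I am half-remembering is the client lock LRU, I would guess it
is inspected and pinned with something like this (untested here, and
assuming the lctl parameter names from the 1.6 docs):

  # per-namespace lock LRU sizes on the client
  lctl get_param ldlm.namespaces.*.lru_size
  # pinning a fixed value caps how many locks (and how much memory) the node holds
  lctl set_param ldlm.namespaces.*.lru_size=600
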
__________________________________________________
www.palen.serveftp.net
Center for Advanced Computing
http://cac.engin.umich.edu
brockp@umich.edu
Brian J. Murrell
2008-07-21 15:51:55 UTC
Post by Brock Palen
Every so often Lustre locks up. It will recover eventually. The
affected processes show up in 'D' (uninterruptible I/O wait); in this
case it was 'ar' making an archive.
Syslog is usually a better place to get messages from as it gives some
context as to the time of the messages.
Post by Brock Palen
Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages
This looks like a pretty standard eviction. Probably the most
interesting information is on the node that did the evicting. If its
log doesn't contain much more than a "have not heard from", then you
have a node that is either disappearing from the network or getting
wedged badly enough to stop sending pings (or any other traffic in
lieu of them).
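
A cheap way to test that from the client side is to drive some traffic
by hand, e.g. (assuming the stock client tools are installed):

  # LNET-level ping of the MDS NID, then poke every configured server
  lctl ping 141.212.30.184@tcp
  lfs check servers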

b.

Brock Palen
2008-07-21 16:04:54 UTC
Post by Brian J. Murrell
Post by Brock Palen
Every so often Lustre locks up. It will recover eventually. The
affected processes show up in 'D' (uninterruptible I/O wait); in this
case it was 'ar' making an archive.
Syslog is usually a better place to get messages from as it gives some
context as to the time of the messages.
OK, I'll keep that in mind. It looks the same, though. It's odd:
if I log in to the same machine I can move to that directory, list
the files, read files on those OSTs, and so on, and I notice this
was an eviction by the MDS.

I see no lost network connections or network errors. Strange; not
good, not good at all.
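
By "no network errors" I mean counters along these lines all look
clean; substitute whichever interface carries the tcp NID for eth0:

  netstat -i              # per-interface RX/TX error and drop columns
  ip -s link show eth0    # same counters in more detail
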
The syslog data is the same; it's below:

Brock


Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection to service nobackup-MDT0000 via nid 141.212.30.184@tcp was lost; in progress operations using this service will wait for recovery to complete.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 167-0: This client was evicted by nobackup-MDT0000; in progress operations using this service will fail.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@0000010189e2f400 x912452/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 488/768 ref 1 fl Rpc:P/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 17575:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(client.c:519:ptlrpc_import_delay_req()) @@@ IMP_INVALID req@00000101ed528a00 x912464/t0 o101->nobackup-MDT0000_UUID@***@tcp:12 lens 440/768 ref 1 fl Rpc:/0/0 rc 0/0
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27076:0:(mdc_locks.c:423:mdc_finish_enqueue()) ldlm_cli_enqueue: -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12653753 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) inode 12195682 mdc close failed: rc = -108
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 27489:0:(file.c:97:ll_close_inode_openhandle()) Skipped 46 previous similar messages
Jul 21 11:38:39 nyx-login1 kernel: Lustre: nobackup-MDT0000-mdc-00000101fc467800: Connection restored to service nobackup-MDT0000 using nid 141.212.30.184@tcp.
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 11-0: an error occurred while communicating with 141.212.30.184@tcp. The mds_close operation failed with -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) inode 11441446 mdc close failed: rc = -116
Jul 21 11:38:39 nyx-login1 kernel: LustreError: 26930:0:(file.c:97:ll_close_inode_openhandle()) Skipped 113 previous similar messages
Brian J. Murrell
2008-07-22 13:18:00 UTC
Post by Brock Palen
OK, I'll keep that in mind. It looks the same, though.
Indeed, in most cases it is the same, with the one exception that from
syslog, we get time context.
Post by Brock Palen
It's odd: if I log in to the same machine I can move to that
directory, list the files, read files on those OSTs, and so on, and
I notice this was an eviction by the MDS.
I see no lost network connections or network errors. Strange; not
good, not good at all.
Right. There is nothing terribly useful in it, though. It's just the
notification of an eviction. The real question is why it got
evicted, and the evictor will know more about that than the evictee.

If the evictor doesn't log anything more than a "didn't hear from
client ... in ... seconds, evicting", then something is preventing the
client from sending traffic to the evictor.
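
Concretely, on the MDS that usually means something like:

  # on the MDS, look for the eviction record near the client's 11:38 timestamps
  grep -i evict /var/log/messages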

b.
