[lustre-discuss] lustre 2.5.2 - unable to mount ost
Perez, Rafael
2015-11-22 23:12:28 UTC
Hello.

I'm wondering if someone can help. We're running Lustre 2.5.2 and our filesystem is unable to start because one of the OSTs will not mount. The configuration logs appear to be corrupted (specifically the "lfs1-client" log, which I'm guessing is specific to our setup). I am able to mount the underlying ldiskfs filesystem for this OST. Here are the mount command and the syslog errors reported when attempting to mount:

***@oss2 ~ # mount -t lustre /dev/mapper/ost5 /mnt/ost5
mount.lustre: mount /dev/mapper/ost5 at /mnt/ost5 failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)


LustreError: 10476:0:(llog_osd.c:254:llog_osd_read_header()) lfs1-OST0006-osd: bad log lfs1-client [0xa:0xcc:0x0] header magic: 0x20d1 (expected 0x10645539)
LustreError: 10476:0:(llog_osd.c:254:llog_osd_read_header()) Skipped 1 previous similar message
LustreError: 10476:0:(mgc_request.c:1707:mgc_llog_local_copy()) ***@o2ib: failed to copy remote log lfs1-client: rc = -5
LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
LustreError: 15c-8: ***@o2ib: The configuration from log 'lfs1-client' failed (-2). This may be the result of communication errors between this node and the MGS, a bad configuration, or other errors. See the syslog for more information.
LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets()) lfs1-OST0006: failed to start LWP: -2
LustreError: 10476:0:(obd_mount_server.c:1739:server_fill_super()) Unable to start targets: -2
Lustre: Failing over lfs1-OST0006
Lustre: server umount lfs1-OST0006 complete
LustreError: 10476:0:(obd_mount.c:1323:lustre_fill_super()) Unable to mount (-2)

It looks like Lustre has some built-in fencing mechanism and refuses to mount this volume. We have 6 OSTs in total and the others mount without issue. Any suggestions on what to try? I've already run e2fsck, which comes back with a few minor issues. The other OSTs show the same minor issues, but they are able to mount.
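
For reference, the negative numbers in the messages above ("rc = -5", "(-2)", "LWP: -2") are negated Linux errno values. A minimal Python sketch for decoding them follows; the helper names decode_rc and rcs_in_line are hypothetical and not part of any Lustre tool, and the regular expression only covers the three spellings seen in this log.

```python
import errno
import os
import re


def decode_rc(rc):
    """Map a negative kernel-style return code to (errno name, description)."""
    code = abs(rc)
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


def rcs_in_line(line):
    """Extract codes written as 'rc = -5', '(-2)' or ': -2' from one log line."""
    return [int(m) for m in re.findall(r"(?:rc = |\(|: )(-\d+)\b", line)]


if __name__ == "__main__":
    sample = [
        "failed to copy remote log lfs1-client: rc = -5",
        "lfs1-OST0006: failed to start LWP: -2",
        "Unable to mount (-2)",
    ]
    for line in sample:
        for rc in rcs_in_line(line):
            name, text = decode_rc(rc)
            print(f"{rc}: {name} ({text})")
```

On Linux, -5 decodes to EIO (an I/O error, here while copying the log) and -2 to ENOENT, which matches the "No such file or directory" text printed by mount.lustre.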

Thanks,
Rafael
Mohr Jr, Richard Frank (Rick Mohr)
2015-11-23 15:57:50 UTC
Post by Perez, Rafael
LustreError: 13a-8: Failed to get MGS log lfs1-client and no local copy.
LustreError: 10476:0:(obd_mount_server.c:1285:server_start_targets()) lfs1-OST0006: failed to start LWP: -2
Does this server have other OSTs that mount? Or is this the only OST on this OSS server? You can use tune2fs to list the OST config parameters and verify that they are correct. I have also seen this kind of error when there are network problems. I would look for IB errors or other signs of problems. (Maybe even do a bandwidth test to see if it is performing as expected.) You can also run “lctl ping” to test LNet connectivity between the OSS server and the MGS server.

If the network checks out and it really is the llog that is the problem, you can try doing a writeconf to fix things up.

--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu
Rafael Perez
2015-11-23 19:13:02 UTC
Hi Rick,

Thanks for your suggestions. Turns out I was able to get the filesystem
started this morning and restore access to the critical data. It was a
long journey of troubleshooting but here are the steps I ended up taking
to fix the issue.

- stop the Lustre filesystem (umount the OSTs and MDT/MGT)
- mount the ldiskfs filesystem for the problematic OST (/dev/mapper/ost5
  to /mnt/ost5 in this case)
- back up the CONFIGS/lfs1-client file
  # cp -a /mnt/ost5/CONFIGS/lfs1-client /mnt/ost5/CONFIGS/lfs1-client.ORIG
- copy a working, non-corrupted 'lfs1-client' file from the MGS (from the
  mounted ldiskfs filesystem on the MGS)
  (the bad lfs1-client file showed signs of corruption: running
  llog_reader against it produced unexpected output)
- umount all ldiskfs filesystems
- run a writeconf on the MDS and all OSTs
  # tunefs.lustre --verbose --writeconf /dev/mapper/ostX
- restart the filesystem
  (this is where lfs1-OST0006 finally mounted!)
- mount the filesystem on a client
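
Before copying a replacement log into place, the header magic that the mount error complained about (0x20d1 found, 0x10645539 expected) can be checked directly on the ldiskfs-mounted file. The Python sketch below is only a rough check and rests on an assumption: that the file starts with a record header of four little-endian 32-bit fields (length, index, type, id), with the magic stored in the type field at byte offset 8. The function names are hypothetical; llog_reader remains the proper tool for inspecting a log.

```python
import struct

# Value the mount error reported as the expected header magic.
LLOG_HDR_MAGIC = 0x10645539


def read_llog_magic(path):
    """Return the 32-bit field assumed to hold the header magic (offset 8)."""
    with open(path, "rb") as f:
        head = f.read(16)
    if len(head) < 16:
        raise ValueError("file too short to hold a log record header")
    # Assumed layout: four little-endian u32 fields (length, index, type, id).
    _length, _index, rec_type, _rec_id = struct.unpack("<4I", head)
    return rec_type


def looks_like_valid_llog(path):
    """True if the assumed magic field matches the expected value."""
    return read_llog_magic(path) == LLOG_HDR_MAGIC


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        magic = read_llog_magic(path)
        status = "ok" if magic == LLOG_HDR_MAGIC else "BAD"
        print(f"{path}: magic {magic:#x} ({status})")
```

Run against both the suspect file on the OST and the candidate copy from the MGS (for example /mnt/ost5/CONFIGS/lfs1-client), this only confirms that the first record header looks sane; it does not validate the rest of the log.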

Our setup has 2 OSS servers (oss1 and oss2), each serving 3 OSTs:
oss1:
/mnt/ost0
/mnt/ost1
/mnt/ost2

oss2:
/mnt/ost3
/mnt/ost4
/mnt/ost5

I'm sending this out for reference.

Thanks again,
Rafael
--
Rafael Perez
***@bnl.gov
ITD HPC Support, Sr Technology Engineer
(631) 344-4426