Discussion:
[Lustre-discuss] problem reading HDF files on 1.8.5 filesystem
Christopher Walker
2011-05-04 20:47:26 UTC
Permalink
Hello,

We have a user who is trying to post-process HDF files in R. Her script
goes through a number (~2500) of files in a directory, opening and
reading the contents. This usually goes fine, but occasionally the
script dies with:


HDF5-DIAG: Error detected in HDF5 (1.9.4) thread 46944713368080:
#000: H5F.c line 1560 in H5Fopen(): unable to open file
major: File accessability
minor: Unable to open file
#001: H5F.c line 1337 in H5F_open(): unable to read superblock
major: File accessability
minor: Read failed
#002: H5Fsuper.c line 542 in H5F_super_read(): truncated file
major: File accessability
minor: File has been truncated
Error in hdf5load(file = myfile, load = FALSE, verbosity = 0, tidy =
TRUE) :
unable to open HDF file:
/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5
HDF5-DIAG: Error detected in HDF5 (1.9.4) thread 46944713368080:
#000: H5F.c line 2012 in H5Fclose(): decrementing file ID failed
major: Object atom
minor: Unable to close file
#001: H5I.c line 1340 in H5I_dec_ref(): can't locate ID
major: Object atom
minor: Unable to find atom information (already closed?)
Error in hdf5cleanup(16778754L) : unable to close HDF file


But this file definitely does exist -- any stat or ls command shows it
without a problem. Further, once I 'ls' this file, if I rerun the same
script, it successfully reads this file, but then dies on the next one
with the same error. If I 'ls' the entire directory, the script runs to
completion without a problem. strace output shows:

open("/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5",
O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(3, 0, SEEK_SET) = 0
read(3, "\211HDF\r\n\32\n", 8) = 8
read(3, "\0", 1) = 1
read(3,
"\0\0\0\0\10\10\0\4\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377\377\377\377\377@"...,
87) = 87
close(3) = 0
write(2, "HDF5-DIAG: Error detected in HDF"..., 42) = 42
etc

which initially looks fine to me, followed by an abrupt close.
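A quick way to see the mismatch directly is to compare the size stat() reports against the byte count a full read returns. This is a minimal sketch with a made-up demo file; on an affected client you would run the same two commands against the real .h5 file and could see 0 from stat while the data still reads back in full:

```shell
# Demo file name is hypothetical -- substitute the real .h5 path.
f=/tmp/lustre-size-demo.h5
printf 'demo-contents' > "$f"

stat -c %s "$f"   # size from the inode attributes (what fstat() reports)
wc -c < "$f"      # bytes actually readable

rm -f "$f"
```

On a healthy filesystem both numbers match; a stale client-side size attribute shows up as the first number lagging the second.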

NFS filesystems and our 1.6.7.2 filesystem have no such problems -- any
suggestions?

Thanks very much,
Chris
David Dillow
2011-05-04 21:06:35 UTC
Permalink
Post by Christopher Walker
Hello,
We have a user who is trying to post-process HDF files in R. Her script
goes through a number (~2500) of files in a directory, opening and
reading the contents. This usually goes fine, but occasionally the
open("/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5",
O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
The client thinks this is a zero-length file, but I'll bet you see the
right length later.

Perhaps you could test Johann's suggestion from
http://jira.whamcloud.com/browse/LU-274 to see if that helps, and report
the results in Jira?
Larry
2011-05-05 02:05:56 UTC
Permalink
try mounting the lustre filesystem with -o flock or -o localflock
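For reference, a sketch of what that mount looks like. The MGS nodename and filesystem name below are hypothetical; substitute your own. localflock keeps flock/fcntl lock semantics local to each client, while flock makes them coherent across the cluster (at some performance cost):

```shell
# Local (per-client) flock semantics:
mount -t lustre -o localflock mgs01@tcp0:/scratch2 /n/scratch2

# Or cluster-wide coherent locks:
# mount -t lustre -o flock mgs01@tcp0:/scratch2 /n/scratch2
```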

On Thu, May 5, 2011 at 4:47 AM, Christopher Walker
Post by Christopher Walker
Hello,
We have a user who is trying to post-process HDF files in R. Her script
goes through a number (~2500) of files in a directory, opening and
reading the contents. This usually goes fine, but occasionally the
#000: H5F.c line 1560 in H5Fopen(): unable to open file
major: File accessability
minor: Unable to open file
#001: H5F.c line 1337 in H5F_open(): unable to read superblock
major: File accessability
minor: Read failed
#002: H5Fsuper.c line 542 in H5F_super_read(): truncated file
major: File accessability
minor: File has been truncated
Error in hdf5load(file = myfile, load = FALSE, verbosity = 0, tidy =
/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5
#000: H5F.c line 2012 in H5Fclose(): decrementing file ID failed
major: Object atom
minor: Unable to close file
#001: H5I.c line 1340 in H5I_dec_ref(): can't locate ID
major: Object atom
minor: Unable to find atom information (already closed?)
Error in hdf5cleanup(16778754L) : unable to close HDF file
But this file definitely does exist -- any stat or ls command shows it
without a problem. Further, once I 'ls' this file, if I rerun the same
script, it successfully reads this file, but then dies on the next one
with the same error. If I 'ls' the entire directory, the script runs to
open("/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5",
O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(3, 0, SEEK_SET) = 0
read(3, "\211HDF\r\n\32\n", 8) = 8
read(3, "\0", 1) = 1
read(3,
87) = 87
close(3) = 0
write(2, "HDF5-DIAG: Error detected in HDF"..., 42) = 42
write(2, "HDF5-DIAG: Error detected in HDF"..., 42) = 42
etc
which initially looks fine to me, followed by an abrupt close.
NFS filesystems and our 1.6.7.2 filesystem have no such problems -- any
suggestions?
Thanks very much,
Chris
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss at lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
Christopher Walker
2011-05-05 03:57:45 UTC
Permalink
Hi Larry,

Everything below is with the filesystem mounted with localflock.

This does indeed look a lot like the bug referred to by David Dillow
(thanks!)

Chris
Post by Larry
try mounting the lustre filesystem with -o flock or -o localflock
On Thu, May 5, 2011 at 4:47 AM, Christopher Walker
Post by Christopher Walker
Hello,
We have a user who is trying to post-process HDF files in R. Her script
goes through a number (~2500) of files in a directory, opening and
reading the contents. This usually goes fine, but occasionally the
#000: H5F.c line 1560 in H5Fopen(): unable to open file
major: File accessability
minor: Unable to open file
#001: H5F.c line 1337 in H5F_open(): unable to read superblock
major: File accessability
minor: Read failed
#002: H5Fsuper.c line 542 in H5F_super_read(): truncated file
major: File accessability
minor: File has been truncated
Error in hdf5load(file = myfile, load = FALSE, verbosity = 0, tidy =
/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5
#000: H5F.c line 2012 in H5Fclose(): decrementing file ID failed
major: Object atom
minor: Unable to close file
#001: H5I.c line 1340 in H5I_dec_ref(): can't locate ID
major: Object atom
minor: Unable to find atom information (already closed?)
Error in hdf5cleanup(16778754L) : unable to close HDF file
But this file definitely does exist -- any stat or ls command shows it
without a problem. Further, once I 'ls' this file, if I rerun the same
script, it successfully reads this file, but then dies on the next one
with the same error. If I 'ls' the entire directory, the script runs to
open("/n/scratch2/moorcroft_lab/nlevine/Moore_sites_final/met/LT_spinup/ms67/analy/s67-E-1628-04-00-000000-g01.h5",
O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(3, 0, SEEK_SET) = 0
read(3, "\211HDF\r\n\32\n", 8) = 8
read(3, "\0", 1) = 1
read(3,
87) = 87
close(3) = 0
write(2, "HDF5-DIAG: Error detected in HDF"..., 42) = 42
etc
which initially looks fine to me, followed by an abrupt close.
NFS filesystems and our 1.6.7.2 filesystem have no such problems -- any
suggestions?
Thanks very much,
Chris
Peter Kjellström
2011-05-05 11:17:18 UTC
Permalink
Post by Christopher Walker
Hello,
We have a user who is trying to post-process HDF files in R. Her script
goes through a number (~2500) of files in a directory, opening and
reading the contents. This usually goes fine, but occasionally the
#000: H5F.c line 1560 in H5Fopen(): unable to open file
major: File accessability
minor: Unable to open file
...
Post by Christopher Walker
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
Seems just like https://bugzilla.lustre.org/show_bug.cgi?id=24458, which is a
real pain for us at the moment. Lustre returns a file size of 0 from a stat
of a non-empty file.
Post by Christopher Walker
NFS filesystems and our 1.6.7.2 filesystem have no such problems -- any
suggestions?
This is what we've done: downgraded all our clients from 1.8.5 (patchless) to
1.6.7.1 (patchless).

/Peter
Post by Christopher Walker
Thanks very much,
Chris