Discussion:
[lustre-discuss] Understanding MDT getxattr stats
Kirk, Benjamin (JSC-EG311)
2018-09-25 21:01:50 UTC
Hi all,

We’re using jobstats under SLURM and have pulled together a tool to integrate SLURM job info and Lustre OST/MDT jobstats. The idea is to correlate filesystem use cases with particular applications as targets for refactoring.
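
As a concrete example of what the tool pulls, the per-job getxattr totals can be read straight off the MDT with roughly the sketch below (assuming jobstats is enabled with jobid_var=SLURM_JOB_ID, and that the usual "op: { samples: N, unit: reqs }" job_stats layout applies):

  # sum getxattr samples per job and show the heaviest offenders
  lctl get_param -n mdt.*.job_stats | awk '
    /job_id:/   { job = $3 }
    /getxattr:/ { gsub(",", "", $4); n[job] += $4 }
    END { for (j in n) printf "%-24s %12d\n", j, n[j] }
  ' | sort -k2 -nr | head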

In doing so, I’m seeing some applications really trigger getxattr on the MDT, and others do not. A particularly egregious example is below: 360 cores, ~10s of GB of output, ~6,500 files, but 16,608,476 calls to getxattr during a ~5-hour runtime. And this is a nominally compute-bound problem, so the I/O pattern is likely compressed into small windows of time.

The system is CentOS 7.5 / Lustre 2.10.5 / zfs-0.7.9, with a single MDT and 12 OSS with 2 OSTs each. Default stripe count of 4.

A couple questions:

1) Should I care about this? We do see sporadic MDT slowness under ZFS, but that doesn’t seem to be rare. I’m looking for a good way to trace it to jobs / use cases.
2) What types of operations might be triggering the getxattr usage on a moderate number of files (e.g., what should we watch for in the refactoring process)?

Thanks,

-Ben

--------------------------

.
TRES : cpu=360,node=30,billing=360
RunTime : 04:59:14
GroupId : eg3(3000)
ExitCode : 0:0
MDT:rename : 373
MDT:snapshot_time : 2018-09-21 08:36:29
MDT:setattr : 444
MDT:mkdir : 361
MDT:getattr : 1570
MDT:getxattr : 16608476
MDT:mknod : 265
MDT:rmdir : 1
MDT:samedir_rename : 373
MDT:close : 6331
MDT:unlink : 113
MDT:open : 6345
OST0009:write_bytes : 3.46 GB
OST0008:write_bytes : 3.11 GB
OST0001:write_bytes : 1.01 GB
OST0000:write_bytes : 396.19 MB
OST0005:read_bytes : 8.19 KB
OST0005:write_bytes : 2.38 GB
OST0005:setattr : 1
OST0004:write_bytes : 790.65 MB
OST0007:write_bytes : 3.02 GB
OST0006:write_bytes : 817.14 MB
OST0016:write_bytes : 4.57 GB
OST0017:write_bytes : 5.15 GB
OST0017:setattr : 1
OST0014:write_bytes : 8.8 GB
OST0015:write_bytes : 1.37 GB
OST0012:write_bytes : 7 GB
OST0012:setattr : 1
OST0013:read_bytes : 8.39 MB
OST0013:write_bytes : 8.4 GB
OST0013:setattr : 1
OST0010:write_bytes : 1.98 GB
OST0011:read_bytes : 27.28 MB
OST0011:write_bytes : 9.42 GB
OST000c:read_bytes : 131.07 KB
OST000c:write_bytes : 5.83 GB
OST000c:setattr : 2
OST000b:read_bytes : 28.12 MB
OST000b:write_bytes : 4.23 GB
OST000e:read_bytes : 8.02 MB
OST000e:write_bytes : 7.48 GB
OST000e:setattr : 1
OST000d:write_bytes : 1.21 GB
OST000f:write_bytes : 2.88 GB
Andreas Dilger
2018-09-25 22:58:56 UTC
Post by Kirk, Benjamin (JSC-EG311)
Hi all,
We’re using jobstats under SLURM and have pulled together a tool to integrate SLURM job info and Lustre OST/MDT jobstats. The idea is to correlate filesystem use cases with particular applications as targets for refactoring.
In doing so, I’m seeing some applications really trigger getxattr on the MDT, and others do not. A particularly egregious example is below: 360 cores, ~10s of GB of output, ~6,500 files, but 16,608,476 calls to getxattr during a ~5-hour runtime. And this is a nominally compute-bound problem, so the I/O pattern is likely compressed into small windows of time.
The system is CentOS 7.5 / Lustre 2.10.5 / zfs-0.7.9, with a single MDT and 12 OSS with 2 OSTs each. Default stripe count of 4.
1) Should I care about this? We do see sporadic MDT slowness under ZFS, but that doesn’t seem to be rare. I’m looking for a good way to trace it to jobs / use cases.
Having SLURM report the JobID stats from the servers seems like a great idea to me. This makes I/O much more visible to users/developers, and they can start to get a feeling about whether they are doing a lot or a little I/O.
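
For anyone else wanting to do the same, the stats get tied to SLURM jobs via jobid_var; a hedged example, with "fsname" as a placeholder for the actual filesystem name:

  # persistent, filesystem-wide setting (run on the MGS):
  lctl conf_param fsname.sys.jobid_var=SLURM_JOB_ID
  # or temporarily on an individual client:
  lctl set_param jobid_var=SLURM_JOB_ID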

I pushed a patch recently that reports the start time of the jobstats data in the output, so one can get a better idea of the I/O rates involved.

I've also wondered whether we should keep an I/O histogram for each JobID (like brw_stats), but perhaps count and sum are enough to get the average I/O size, plus sum_squared to calculate the stddev?
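
With made-up numbers, recovering those two statistics from the three counters would look something like:

  # hypothetical count/sum/sum_squared for a write_bytes-style counter
  awk 'BEGIN {
    n = 6345; s = 6.2e10; ss = 7.4e17
    mean = s / n
    var  = ss / n - mean * mean        # population variance
    printf "mean = %.0f bytes, stddev = %.0f bytes\n", mean, sqrt(var)
  }'
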
Post by Kirk, Benjamin (JSC-EG311)
2) what types of operations might be triggering the getxattr usage on a moderate amount of files (e.g. what to watch for in the refactoring process
)
There are a number of different possibilities:
- spurious SELinux security checks
- ACLs (which are stored as xattrs on disk)
- user xattrs (if you have these enabled)
- xattrs that are too large to fit into the client xattr cache

There is already an xattr cache on the client, but it doesn't cache very large xattrs. You could try running strace on the program while it runs to see what it is doing. If you know the input/output files, you could check with getfattr and getfacl to see what xattrs are stored there. With ~17M calls over the roughly 5-hour runtime, that averages nearly 1,000/sec, and much more if the I/O really is compressed into short windows. While it is great that the MDS can handle this load, it isn't great that it is doing it at all.
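
Something along these lines, for example (the binary name and paths are placeholders):

  # count getxattr-family syscalls issued by one instance of the program
  strace -f -c -e trace=getxattr,lgetxattr,fgetxattr ./app.exe
  # dump all xattrs and the ACL stored on a representative output file
  getfattr -d -m - /lustre/scratch/some_output_file
  getfacl /lustre/scratch/some_output_file
  # check whether the client-side xattr cache is enabled on the mount
  lctl get_param llite.*.xattr_cache
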
Cheers, Andreas
---
Andreas Dilger
CTO Whamcloud
