Discussion:
[Lustre-discuss] Slow metadata, small file performance
Brent A Nelson
19 years ago
Well, thanks to several people on the list, I got Lustre 1.4.5 running on
my test setup (with Ubuntu Breezy, no less!), and it seems stable (no
problems so far that I didn't cause myself ;-)).

However, I've noticed that some things are performing rather subpar in
some limited testing with a single client. Large sequential reads and
writes seem quick (although perhaps not as quick as this setup could manage
in theory, that may even out once multiple clients are running), but
"ls -lR >/dev/null" (even with just a
single client and no other activity) and a "cp -a /usr /lustre1/test1"
(~3.5 minutes for a <350MB /usr) both perform more slowly than they do against
an older Linux box running NFS over fast Ethernet (my Lustre servers have
channel-bonded gigabit and dual 1GHz PIII processors, versus the NFS
server's 450MHz processors).

I tried increasing the lru_size on everything, but that didn't seem to
have any effect at all in this scenario (maybe it only matters when there
are many more clients). I also added mballoc and extents to the mount
options for the OSTs (small effect, if any). Setting the debug level to
zero helped significantly, but it's still much slower than NFS. The cp
takes maybe 50% longer than NFS and the ls takes about 300% longer. The
numbers are fairly similar whether I have 2 OSS servers serving 3 drbd
mirrors each, or the same servers just serving out a logical volume from
the system drive on each (although the more complex scenario is actually a
little faster for these tests, but still much slower than NFS). I
originally had the MDS on one of the OSS servers and tried moving it to a
third server, but the speed stayed the same.
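For reference, the debug tweak above was roughly the following (the exact /proc
path varies between Lustre releases, so treat it as a best guess and adjust for
yours):

echo 0 > /proc/sys/lnet/debug   # or /proc/sys/portals/debug on older stacks; disables debug logging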

Any ideas?

Many thanks! I know I've been rather scant on details; just let me know
and I'll provide whatever info you need.

Brent Nelson
Director of Computing
Dept. of Physics
University of Florida
Andreas Dilger
19 years ago
"ls -lR >/dev/null" (even with just a single client and no other activity)
and a "cp -a /usr /lustre1/test1" (~3.5 minutes for a <350MB /usr) both
perform more slowly than to an older Linux box running NFS
I tried increasing the lru_size on everything, but that didn't seem to
have any effect at all in this scenario (maybe it only matters when there
are many more clients).
The single-client "ls -lR" case is one of the worst usage cases for Lustre.
We have plans to fix this, but haven't done so yet. If you are doing
repeat "ls -lR" on the same directory (and working set fits into LRU)
then the performance is greatly improved and you are still guaranteed
coherency, unlike NFS. That is why we recommend increased LRU sizes for
nodes that are being used interactively. HPC compute nodes rarely have a
large working set.
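A quick way to see the effect (assuming the client mount point is /mnt/lustre;
substitute your own):

time ls -lR /mnt/lustre > /dev/null   # cold run: every lock fetched from the MDS/OSTs
time ls -lR /mnt/lustre > /dev/null   # warm run: much faster once the working set fits in the LRU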

Similarly, if you have a large number of clients doing such operations,
the aggregate performance will be higher than that of NFS. In some
usage scenarios, adding more clients improves Lustre performance instead
of hurting it.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Brent A Nelson
19 years ago
Once again, thanks for the reply!
...
Does this apply for the single-client copy case, as well? I imagine for a
/usr directory, with lots of small files, there would be a lot of metadata
activity.

I haven't noticed any performance improvement at all with the lru_size
change. Perhaps the directory metadata is small enough to fit in the
default lru_size. All I have in my test case is 2 ~350MB copies of /usr
and a single 4GB file (this totals to about 37000 files and 3000
directories).

Also, I don't notice any speed change between successive, identical ls
timings on a client. If I start up a new client and do the ls test twice
in a row I get almost identical timings (even when I bump up the
lru_size).

Is the caching on the metadata server, on the client, or both? If on the
server, perhaps the metadata has stayed in cache since the initial copies
and I've always been experiencing the cached performance. But I see no
evidence of a client-side caching effect on the ls tests.

Also, during the ls tests, I noticed there was some activity on the OSTs;
why is that? I would have thought this was purely a metadata operation.

Assuming nothing's wrong (and it may be a while before someone improves
the behavior of the code in this situation), what would I do to improve
performance in this case? What hardware improvement would be of most
benefit with the present code, faster processors, faster disks, lower
latency disks, lower latency connection between nodes, more OSS nodes,...?

Thanks,

Brent
Andreas Dilger
19 years ago
Post by Brent A Nelson
I haven't noticed any performance improvement at all with the lru_size
change. Perhaps the directory metadata is small enough to fit in the
default lru_size. All I have in my test case is 2 ~350MB copies of /usr
and a single 4GB file (this totals to about 37000 files and 3000
directories).
If your 40k files exceed the LRU size then it is expected that
performance would not improve. The default LRU size is 100 locks,
and in newer versions 100 * num_cpus.

Also, if you are increasing the LRU size, both the MDC and the OSCs need
larger LRUs. The OSC LRU size can be $default_stripe_count/$num_osts
of the MDC LRU, since each file's object locks are spread across the
larger number of OSTs.
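To make that concrete for the setup described earlier (the six-OST count and a
default stripe count of 1 are assumptions about this configuration, not something
stated above):

MDC LRU   >= ~40000                  # one lock per file/directory in the working set
each OSC  >= ~40000 * 1 / 6 ~= 6700  # each file's single object lives on one of the six OSTs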
Post by Brent A Nelson
Is the caching on the metadata server, on the client, or both? If on the
server, perhaps the metadata has stayed in cache since the initial copies
and I've always been experiencing the cached peformance. But I see no
evidence of a client-side caching effect on the ls tests.
There is cache on both sides. The DLM locks (kept in the aforementioned LRU)
are client-side locks. The server also has the Linux filesystem caches
(dcache, icache, pagecache) to cache data on the server side.
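You can watch the client-side lock cache directly; for example (lock_count is the
proc file name I recall next to lru_size, so verify it on your version):

cat /proc/fs/lustre/ldlm/namespaces/MDC*/lock_count   # metadata locks held by this client
cat /proc/fs/lustre/ldlm/namespaces/OSC*/lock_count   # object locks held per OSC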
Post by Brent A Nelson
Also, during the ls tests, I noticed there was some activity on the OSTs;
why is that? I would have thought this was purely a metadata operation.
The file size is stored on the OST, so this is normal.
Post by Brent A Nelson
Assuming nothing's wrong (and it may be a while before someone improves
the behavior of the code in this situation), what would I do to improve
performance in this case? What hardware improvement would be of most
benefit with the present code, faster processors, faster disks, lower
latency disks, lower latency connection between nodes, more OSS nodes,...?
The biggest improvement in "ls" performance would likely come from
more MDS RAM (and a 64-bit CPU to be able to use it effectively), so
that the MDS can keep this information cached in memory. Recent tests showed a
10x performance improvement when the inode information was in cache on the
MDS vs. reading it from disk.
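A rough way to check whether the MDS is serving inodes from memory or from disk
(the slab names below are typical for an ext3-based MDS, so treat them as an
assumption) is to watch the MDS while the ls runs:

free -m                                             # RAM left over for cache
grep -E 'inode_cache|dentry_cache' /proc/slabinfo   # cached inodes/dentries
vmstat 1                                            # non-zero "bi" means inodes are coming from disk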

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
Oleg Drokin
19 years ago
Hello!
...
Yes.
Post by Brent A Nelson
/usr directory, with lots of small files, there would be a lot of metadata
activity.
Right.
Post by Brent A Nelson
I haven't noticed any performance improvement at all with the lru_size
change. Perhaps the directory metadata is small enough to fit in the
default lru_size. All I have in my test case is 2 ~350MB copies of /usr
Well, that might be so.
For metadata it is unimportant how much space the data takes.
Post by Brent A Nelson
and a single 4GB file (this totals to about 37000 files and 3000
directories).
So around 40000 entries.
This means you need to set your lock lru to at least 40000 so that the entire
ls -lR working set fits into it.
Post by Brent A Nelson
Also, I don't notice any speed change between successive, identical ls
timings on a client. If I start up a new client and do the ls test twice
in a row I get almost identical timings (even when I bump up the
lru_size).
How big was your increased lru size?
Post by Brent A Nelson
Is the caching on the metadata server, on the client, or both? If on the
Caching is on client only.
Post by Brent A Nelson
Also, during the ls tests, I noticed there was some activity on the OSTs;
That's right.
Post by Brent A Nelson
why is that? I would have thought this was purely a metadata operation.
Part of the metadata is stored on the OSTs; currently this is the file size
and mtime.
So you need to increase the lru size for both the MDC and all OSCs.
You can do this with the following sequence of commands (on every client where
you want the change):
NEWLRUSIZE=41000                                        # set lru size to 41k, above the ~40k entries
for d in /proc/fs/lustre/ldlm/namespaces/MDC*/lru_size; do
    echo $NEWLRUSIZE > $d                               # set the MDC LRU
done
for d in /proc/fs/lustre/ldlm/namespaces/OSC*/lru_size; do
    echo $NEWLRUSIZE > $d                               # set all OSC LRUs
done
Post by Brent A Nelson
Assuming nothing's wrong (and it may be a while before someone improves
the behavior of the code in this situation), what would I do to improve
performance in this case? What hardware improvement would be of most
An lru size where everything fits should help for the second-run "ls".
Post by Brent A Nelson
benefit with the present code, faster processors, faster disks, lower
latency disks, lower latency connection between nodes, more OSS nodes,...?
A lower-latency connection between nodes is essentially what you want for fast
metadata operations.
The more OSTs your files are striped over, the slower stat(2) gets, because
every such node must be queried for the mtime and size of the stripe it might
hold. (Note that if you have many OSS nodes but all files are striped over
only one OST, then stat(2) performance would be the same as with a single
OST, assuming no other load, of course.)
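If you want to keep every new file on a single stripe, something like the
following should do it (1.4-era lfs uses positional stripe-size, stripe-offset,
stripe-count arguments, so check "lfs help" on your version first):

lfs setstripe /mnt/lustre/testdir 0 -1 1    # default stripe size, any starting OST, one stripe

Files created under that directory will then inherit the single-stripe layout.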

Bye,
Oleg
