Discussion:
[lustre-discuss] bad performance with Lustre/ZFS on NVMe SSD
Riccardo Veraldi
2018-04-07 05:04:53 UTC
So I have been struggling for months with these low performances on Lustre/ZFS.

Looking for hints.

3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6

each OSS has one raidz OST:

  pool: drpffb-ost01
 state: ONLINE
  scan: none requested
  trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
config:

    NAME          STATE     READ WRITE CKSUM
    drpffb-ost01  ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0

While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
with Lustre on top of it performance is really poor.
Above all, it is not stable at all and goes up and down between
1.5GB/s and 6GB/s. I tested with obdfilter-survey.
LNET is OK and working at 6GB/s (using InfiniBand FDR).

What could be the cause of the OST performance going up and down like a
roller coaster?

For reference, here are a few considerations:

filesystem parameters:

zfs set mountpoint=none drpffb-ost01
zfs set sync=disabled drpffb-ost01
zfs set atime=off drpffb-ost01
zfs set redundant_metadata=most drpffb-ost01
zfs set xattr=sa drpffb-ost01
zfs set recordsize=1M drpffb-ost01

The NVMe SSDs are 4KB/sector

ashift=12
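
(ashift cannot be changed later; it is fixed when the vdev is created. For
reference, a pool like this would have been created with something along the
following lines; the command is a reconstruction, not taken from the original
message:)

# ashift=12 forces 4KB-aligned allocations to match the 4KB-sector NVMe devices
zpool create -o ashift=12 drpffb-ost01 raidz1 \
    nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1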


ZFS module parameters

options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
Dilger, Andreas
2018-04-09 23:15:11 UTC
Post by Riccardo Veraldi
So I have been struggling for months with these low performances on Lustre/ZFS.
Looking for hints.
3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6
each OSS has one raidz OST:
  pool: drpffb-ost01
 state: ONLINE
  scan: none requested
  trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
    NAME          STATE     READ WRITE CKSUM
    drpffb-ost01  ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0
While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
with Lustre on top of it performance is really poor.
Above all, it is not stable at all and goes up and down between
1.5GB/s and 6GB/s. I tested with obdfilter-survey.
LNET is OK and working at 6GB/s (using InfiniBand FDR).
What could be the cause of the OST performance going up and down like a
roller coaster?
Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently, with some low-level tool like
sgpdd or vdbench? After that is known to be working, have you tested
with obdfilter-survey locally on the OSS, and then remotely from the client(s),
so that we can isolate where the bottleneck is being hit?

Cheers, Andreas
Post by Riccardo Veraldi
zfs set mountpoint=none drpffb-ost01
zfs set sync=disabled drpffb-ost01
zfs set atime=off drpffb-ost01
zfs set redundant_metadata=most drpffb-ost01
zfs set xattr=sa drpffb-ost01
zfs set recordsize=1M drpffb-ost01
The NVMe SSDs are 4KB/sector
ashift=12
ZFS module parameters
options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation
Alexander I Kulyavtsev
2018-04-10 18:43:35 UTC
Riccardo,
It can be helpful to see the output of these commands on the ZFS pool host while you read files through the Lustre client, and directly through ZFS:

# zpool iostat -lq -y zpool_name 1
# zpool iostat -w -y zpool_name 5
# zpool iostat -r -y zpool_name 5

-q  queue statistics
-l  latency statistics
-r  request size histogram
-w  (undocumented) latency statistics

I did see different behavior of ZFS reads on the ZFS pool for the same dd/fio command when reading a file from a Lustre mount on a different host versus directly from ZFS on the OSS. I created a separate ZFS dataset with similar ZFS settings on the Lustre zpool.
Lustre IO is seen on the ZFS pool as 128KB requests, while dd/fio directly on ZFS issues 1MB requests. The dd/fio command used 1MB IO.
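
(The read tests were of this general form; the mount point and file paths here
are illustrative, not from the original message:)

# read through the Lustre client, 1MB application IO
dd if=/mnt/lustre/testfile of=/dev/null bs=1M
# read directly from a dataset on the OSS zpool
dd if=/zptevlfs6/test/testfile of=/dev/null bs=1M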

zptevlfs6    sync_read    sync_write    async_read    async_write   scrub
req_size     ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0
4K              0      0      0      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0      0      0      0
16K             0      0      0      0      0      0      0      0      0      0
32K             0      0      0      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      0      0      0      0
128K            0      0      0      0  2.00K      0      0      0      0      0  <====
256K            0      0      0      0      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0    125      0      0      0      0      0  <====
2M              0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0
--------------------------------------------------------------------------------
^C

Alex.
Post by Riccardo Veraldi
So I have been struggling for months with these low performances on Lustre/ZFS.
Looking for hints.
3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6
each OSS has one raidz OST:
  pool: drpffb-ost01
 state: ONLINE
  scan: none requested
  trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
    NAME          STATE     READ WRITE CKSUM
    drpffb-ost01  ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0
While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
with Lustre on top of it performance is really poor.
Above all, it is not stable at all and goes up and down between
1.5GB/s and 6GB/s. I tested with obdfilter-survey.
LNET is OK and working at 6GB/s (using InfiniBand FDR).
What could be the cause of the OST performance going up and down like a
roller coaster?
Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently, with some low-level tool like
sgpdd or vdbench? After that is known to be working, have you tested
with obdfilter-survey locally on the OSS, and then remotely from the client(s),
so that we can isolate where the bottleneck is being hit?

Cheers, Andreas
Post by Riccardo Veraldi
zfs set mountpoint=none drpffb-ost01
zfs set sync=disabled drpffb-ost01
zfs set atime=off drpffb-ost01
zfs set redundant_metadata=most drpffb-ost01
zfs set xattr=sa drpffb-ost01
zfs set recordsize=1M drpffb-ost01
The NVMe SSDs are 4KB/sector
ashift=12
ZFS module parameters
options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation







_______________________________________________
lustre-discuss mailing list
lustre-***@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Riccardo Veraldi
2018-04-13 00:07:36 UTC
Yes, I tested every single disk, and also the disks in a raidz pool
without Lustre.
The disks perform to spec, 1.2TB each, and up to 6GB/s in the zpool.
When using Lustre, the zpool performs really badly, no more than 1.5GB/s.

I then configured one OST per disk without any raidz (6 OSTs total).
I can scale up performance by distributing processes across OSTs in
this way, but if I use striping across all OSTs instead of manually
binding processes to a specific OST, the performance decreases (see the
striping sketch below).
Also, running a single process on a single OST I can never get more than
700MB/s, while I can reach 1.2GB/s using at least 4 processes on the same
OST.
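
(File striping is controlled with lfs setstripe; a minimal sketch of the two
layouts compared above, with an illustrative client mount point:)

# stripe new files in this directory across all OSTs, 1MB stripe size
lfs setstripe -c -1 -S 1M /mnt/lustre/striped_dir
# pin new files in this directory to a single OST (index 0)
lfs setstripe -c 1 -i 0 /mnt/lustre/ost0_dir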

I did a test using obdfilter-survey; this is what I got:

ost  1 sz 524288000K rsz 1024K obj    4 thr    4 write 4872.92 [1525.83, 6120.75]

I did run LNet selftest and got 6GB/s using FDR.

But when I write from the client side, the performance drops
dramatically, especially when using Lustre on raidz.

So I was wondering: is there any RPC parameter setting that I need to
set to get better performance out of Lustre?

thank you
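
(For reference, the client-side RPC tuning knobs are usually inspected and
adjusted with lctl; the values below are only illustrative starting points,
not a recommendation from this thread:)

# inspect current per-OSC RPC settings on a client
lctl get_param osc.*.max_pages_per_rpc osc.*.max_rpcs_in_flight osc.*.max_dirty_mb
# example adjustments
lctl set_param osc.*.max_rpcs_in_flight=16
lctl set_param osc.*.max_dirty_mb=512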
Post by Dilger, Andreas
Post by Riccardo Veraldi
So I have been struggling for months with these low performances on Lustre/ZFS.
Looking for hints.
3 OSSes, RHEL 7.4, Lustre 2.10.3 and ZFS 0.7.6
each OSS has one raidz OST:
  pool: drpffb-ost01
 state: ONLINE
  scan: none requested
  trim: completed on Fri Apr  6 21:53:04 2018 (after 0h3m)
    NAME          STATE     READ WRITE CKSUM
    drpffb-ost01  ONLINE       0     0     0
      raidz1-0    ONLINE       0     0     0
        nvme0n1   ONLINE       0     0     0
        nvme1n1   ONLINE       0     0     0
        nvme2n1   ONLINE       0     0     0
        nvme3n1   ONLINE       0     0     0
        nvme4n1   ONLINE       0     0     0
        nvme5n1   ONLINE       0     0     0
While the raidz without Lustre performs well at 6GB/s (1GB/s per disk),
with Lustre on top of it performance is really poor.
Above all, it is not stable at all and goes up and down between
1.5GB/s and 6GB/s. I tested with obdfilter-survey.
LNET is OK and working at 6GB/s (using InfiniBand FDR).
What could be the cause of the OST performance going up and down like a
roller coaster?
Riccardo,
to take a step back for a minute, have you tested all of the devices
individually, and also concurrently, with some low-level tool like
sgpdd or vdbench? After that is known to be working, have you tested
with obdfilter-survey locally on the OSS, and then remotely from the client(s),
so that we can isolate where the bottleneck is being hit?
Cheers, Andreas
Post by Riccardo Veraldi
zfs set mountpoint=none drpffb-ost01
zfs set sync=disabled drpffb-ost01
zfs set atime=off drpffb-ost01
zfs set redundant_metadata=most drpffb-ost01
zfs set xattr=sa drpffb-ost01
zfs set recordsize=1M drpffb-ost01
The NVMe SSDs are 4KB/sector
ashift=12
ZFS module parameters
options zfs zfs_prefetch_disable=1
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#options zfs zfs_vdev_sync_write_min_active=64
#options zfs zfs_vdev_sync_write_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Intel Corporation