Discussion:
[lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation
Kevin Abbey
2015-05-04 18:18:35 UTC
Hi,

For a single node OSS I'm planning to use a combined MGS/MDS. Can
anyone recommend an enterprise ssd designed for this workload? I'd like
to create a raid10 with 4x ssd using zfs as the backing fs.

Are there any published/documented systems using zfs in raid 10 using ssd?

Thanks,
Kevin


--
Kevin Abbey
Systems Administrator
Rutgers University
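
For reference, in zpool/mkfs.lustre terms the setup being asked about looks roughly like the sketch below; the pool name, device names, and fsname are placeholders, not recommendations:

    # Striped pair of mirrors ("RAID10") across four SSDs; device names are hypothetical
    zpool create -o ashift=12 mdt0pool \
        mirror /dev/sdb /dev/sdc \
        mirror /dev/sdd /dev/sde

    # Combined MGS/MDT on that pool (fsname and index are examples)
    mkfs.lustre --mgs --mdt --fsname=lfs01 --index=0 \
        --backfstype=zfs mdt0pool/mdt0

    mkdir -p /mnt/mdt0
    mount -t lustre mdt0pool/mdt0 /mnt/mdt0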
Patrick Farrell
2015-05-04 18:32:42 UTC
Kevin,

I don't have an answer to your question, but I thought you should know:
ZFS MDTs have some fairly significant performance issues vs ldiskfs MDTs.
There are plans to resolve this, but as it is, it's much slower.

- Patrick

Charlie D Whitehead III
2015-05-04 22:01:50 UTC
Kevin,

I must agree with Patrick. I used ZFS on our metadata servers and saw an immediate reduction in performance compared to ldiskfs. There are a lot of factors involved in performance, but in our case ZFS for MDTs was not the best choice. That being said, I have not tried it on SSDs... yet.

Regards
--
Charlie D Whitehead III
Andrew Wagner
2015-05-05 02:20:41 UTC
I can offer some guidance on our experiences with ZFS Lustre MDTs.
Patrick and Charlie are right - you will get less performance per $ out
of ZFS MDTs vs. LDISKFS MDTs. That said, our RAID10 with 4x Dell Mixed
Use Enterprise SSDs achieves similar performance to most of our LDISKFS
MDTs. Our MDS was a Dell server and we wanted complete support coverage.

One of the most important things for good performance with our ZFS MDS
was RAM. We doubled the amount of RAM in the system after experiencing
performance issues that were clearly memory pressure related. If you
expect to have tens of millions of files, I wouldn't run the MDS without
at least 128GB of RAM. I would be prepared to increase that number if
you run into RAM bottlenecks - we ended up going to 256GB in the end.

For a single OSS, you may not need 4x SSDs to deal with the load. We use
the 4 disk RAID10 setup with a 1PB filesystem and 1.8PB filesystem. Our
use case was more for archive purposes, so we wanted to go with a
complete ZFS solution.
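
As a rough illustration of the RAM point, the ARC can be sized with a ZFS module option and checked at runtime; the value below is only an example, not a recommendation:

    # /etc/modprobe.d/zfs.conf -- cap (and effectively size) the ARC; example value only
    options zfs zfs_arc_max=137438953472    # 128 GiB

    # Watch ARC size and metadata usage under load
    egrep '^(size|arc_meta_used|arc_meta_limit)' /proc/spl/kstat/zfs/arcstats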



Andrew Holway
2015-05-05 06:30:36 UTC
ZFS should not be slower for very long. I understand that, now that ZFS on Linux
is stable, many significant performance problems have been identified and
are being worked on.

Patrick Farrell
2015-05-05 14:13:44 UTC
The Livermore folks leading this effort can correct me if I misspeak, but they (specifically Brian Behlendorf) presented on this topic at the Developers' Day at LUG 2015 (no video of the DD talks, sorry).

From his discussion, the issues have been identified, but the fixes are between six months and two years away, and may still not fully close the gap. It'll be a bit yet.

- Patrick
Wolfgang Baudler
2015-05-05 14:16:38 UTC
> The Livermore folks leading this effort can correct me if I misspeak, but
> they (Specifically Brian Behlendorf) presented on this topic at the
> Developers' Day at LUG 2015 (no video of the DD talks, sorry).
>
> From his discussion, the issues have been identified, but the fixes are
> between six months and two years away, and may still not fully close the
> gap. It'll be a bit yet.
>
> - Patrick

So, are these performance issues specific to Lustre using ZFS, or are they
problems with ZFS on Linux in general?

Wolfgang
Rick Wagner
2015-05-05 15:07:10 UTC

It's Lustre on ZFS, especially for metadata operations that create, modify, or remove inodes. Native ZFS metadata operations are much faster than what Lustre on ZFS is currently providing. That said, we've gone with a ZFS-based patchless MDS, since read operations have always been more critical for us, and our performance is more than adequate.

--Rick
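
One way to see the create/remove gap directly is a metadata benchmark such as mdtest run from the clients; the rank count, file count, and path below are placeholders:

    # Creates, stats, and removes files and reports ops/sec per phase
    mpirun -np 16 mdtest -n 10000 -i 3 -F -u -d /mnt/lustre/mdtest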

Patrick Farrell
2015-05-05 15:11:18 UTC
My understanding is it's primarily ZFS rather than ZFS+Lustre, but the issue is that the Lustre MDT does a lot of IO in a manner that ZFS is not (currently) great at handling. As I understood, the fixes will be in ZFS, not the Lustre layers above it.

Bear in mind that Lustre metadata operations are notably different from regular ZFS metadata operations.

I've copied Brian in hopes he's able to chime in - This is all second hand.

Brian - Would you mind offering a few words on this topic?
Peter Kjellström
2015-05-05 15:22:05 UTC
On Tue, 5 May 2015 15:11:18 +0000
Patrick Farrell <***@cray.com> wrote:

> My understanding is it's primarily ZFS rather than ZFS+Lustre, but
> the issue is that the Lustre MDT does a lot of IO in a manner that
> ZFS is not (currently) great at handling. As I understood, the
> fixes will be in ZFS, not the Lustre layers above it.

One thing that will be in the Lustre layer is Lustre support for the ZIL. As
it stands today you can't move the intent log to fast NVRAM, simply
because Lustre (unlike the ZFS POSIX layer) doesn't use the intent log.

As I understood it, people originally hoped that this would make it into
2.7.0, but that was way too optimistic. Also, many systems in production use
osd_*_sync_delay_us to skip the very expensive pool sync.

/Peter K
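
Once osd-zfs can use the ZIL, the zpool side of moving the intent log onto fast devices already exists; a sketch with hypothetical pool and device names:

    # Add a mirrored dedicated intent-log (SLOG) vdev on NVRAM/SSD
    zpool add mdt0pool log mirror /dev/nvram0 /dev/nvram1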
Scott Nolin
2015-05-05 17:06:26 UTC
I just want to second what Rick said - it's create/remove, not stat of
files, where there are performance penalties. We covered this issue for
our workload just by using SSDs for our MDT, where normally we'd just
use fast SAS drives.

A bigger deal for us was RAM on the server, and improvements with SPL 0.6.3+

Scott

>
> It's Lustre on ZFS, especially for metadata operations that create,
> modify, or remove inodes. Native ZFS metadata operations are much
> faster than what Lustre on ZFS is currently providing. That said,
> we've gone with a ZFS-based patchless MDS, since read operations have
> always been more critical for us, and our performance is more than
> adequate.
>
> --Rick
>
Stearman, Marc
2015-05-05 15:43:20 UTC
We are using the HGST S842 line of 2.5" SSDs. We have them configured as a RAID10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two-device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.

We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.

We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.

-Marc

----
D. Marc Stearman
Lustre Operations Lead
***@llnl.gov
925.423.9670
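
The live SAS-to-SSD migration described above maps onto plain zpool operations; a sketch with hypothetical pool and device names:

    # Grow a two-way SAS mirror into a four-way mirror by attaching two SSDs
    zpool attach mdt0pool sas0 ssd0
    zpool attach mdt0pool sas0 ssd1
    # (repeat for the other mirror vdevs, e.g. sas2 -> ssd2/ssd3)

    # Wait for the resilver to complete
    zpool status mdt0pool

    # Then detach the SAS devices, leaving an all-SSD mirror
    zpool detach mdt0pool sas0
    zpool detach mdt0pool sas1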




Alexander I Kulyavtsev
2015-05-05 17:16:14 UTC
How much space is used per inode on the MDT in production installations?
What is the recommended size of the MDT?

I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.

I ran out of inodes on a ZFS MDT in my tests and ZFS got "locked": the MDT zpool had all of its space used.

We have the zpool created as a stripe of mirrors (mirror s0 s1, mirror s2 s3). Total size is ~940 GB; it got stuck at about 97 million files.
ZFS v0.6.4.1, default 128 KB recordsize. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity).

Should we use a smaller ZFS recordsize on the MDT, say 8 KB or 16 KB? If an inode takes ~10 KB and the ZFS record is 128 KB, we are dropping caches and reading data we do not need.

Alex.
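
If a smaller recordsize does turn out to help, it is just a dataset property (it only affects newly written blocks); the dataset name below is hypothetical:

    # Drop the MDT dataset's recordsize from the 128K default
    zfs set recordsize=16K mdt0pool/mdt0
    zfs get recordsize mdt0pool/mdt0

    # Capacity and fragmentation as reported by the pool (fragmentation needs zfs >= 0.6.4)
    zpool list -o name,size,capacity,fragmentation mdt0pool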

Alexander I Kulyavtsev
2015-05-05 19:52:16 UTC
I checked a Lustre 1.8.8 ldiskfs MDT: 106*10^6 inodes take 610 GB on the MDT, or 3.5 KB/inode. I had thought it was less.
So the ZFS MDT size is just a factor of three more compared to old ldiskfs.
How many files do you plan to have?
Alex.
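
For comparison on a live filesystem, the per-inode cost of an MDT can be read from any client; a sketch (mount point is a placeholder):

    # Space and inodes consumed on the MDT, as seen from a client mount
    lfs df /mnt/lustre | grep MDT      # used KB on the MDT
    lfs df -i /mnt/lustre | grep MDT   # inodes used on the MDT
    # bytes per inode ~ (used KB * 1024) / inodes used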

On May 5, 2015, at 12:16 PM, Alexander I Kulyavtsev <***@fnal.gov<mailto:***@fnal.gov>> wrote:

I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
Isaac Huang
2015-05-07 01:29:39 UTC