Discussion:
zfs -- mds/mdt -- ssd model / type recommendation
Kevin Abbey
2015-05-04 18:18:35 UTC
Permalink
Hi,

For a single node OSS I'm planning to use a combined MGS/MDS. Can
anyone recommend an enterprise ssd designed for this workload? I'd like
to create a raid10 with 4x ssd using zfs as the backing fs.

Are there any published/documented systems using zfs in raid 10 using ssd?

Thanks,
Kevin


--
Kevin Abbey
Systems Administrator
Rutgers University
Patrick Farrell
2015-05-04 18:32:42 UTC
Permalink
Kevin,

I don't have an answer to your question, but I thought you should know:
ZFS MDTs have some fairly significant performance issues vs ldiskfs MDTs.
There are plans to resolve this, but as it is, it's much slower.

- Patrick

On 5/4/15, 1:18 PM, "Kevin Abbey" <***@rutgers.edu> wrote:

>Hi,
>
> For a single node OSS I'm planning to use a combined MGS/MDS. Can
>anyone recommend an enterprise ssd designed for this workload? I'd like
>to create a raid10 with 4x ssd using zfs as the backing fs.
>
>Are there any published/documented systems using zfs in raid 10 using ssd?
>
>Thanks,
>Kevin
>
>
>--
>Kevin Abbey
>Systems Administrator
>Rutgers University
>
>_______________________________________________
>lustre-discuss mailing list
>lustre-***@lists.lustre.org
>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Charlie D Whitehead III
2015-05-04 22:01:50 UTC
Permalink
Kevin,

I must agree with Patrick. I used ZFS on our metadata servers and found an immediate reduction in performance compared to ldiskfs. Now, there are a lot of factors involved in performance, but in our case, ZFS for MDTs was not the best choice. That being said, I have not tried it on SSDs... yet.

Regards
--
Charlie D Whitehead III
Andrew Wagner
2015-05-05 02:20:41 UTC
Permalink
I can offer some guidance on our experiences with ZFS Lustre MDTs.
Patrick and Charlie are right - you will get less performance per $ out
of ZFS MDTs vs. LDISKFS MDTs. That said, our RAID10 with 4x Dell Mixed
Use Enterprise SSDs achieves similar performance to most of our LDISKFS
MDTs. Our MDS was a Dell server and we wanted complete support coverage.

One of the most important things for good performance with our ZFS MDS
was RAM. We doubled the amount of RAM in the system after experiencing
performance issues that were clearly memory pressure related. If you
expect to have tens of millions of files, I wouldn't run the MDS without
at least 128GB of RAM. I would be prepared to increase that number if
you run into RAM bottlenecks - we ended up going to 256GB in the end.

For a single OSS, you may not need 4x SSDs to deal with the load. We use
the 4 disk RAID10 setup with a 1PB filesystem and 1.8PB filesystem. Our
use case was more for archive purposes, so we wanted to go with a
complete ZFS solution.
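
If it helps, a minimal sketch of creating a 4-disk RAID10 (two mirrored
pairs) combined MGS/MDT with ZFS backing via mkfs.lustre might look like
the following. The filesystem name, pool/dataset names, and device names
are placeholders, not our actual config:

    mkfs.lustre --fsname=lustre --mgs --mdt --index=0 --backfstype=zfs \
        lustre-mdt0/mdt0 mirror sda sdb mirror sdc sdd

mkfs.lustre builds the zpool from the vdev spec and creates the MDT
dataset on top of it; check mkfs.lustre(8) for the exact options for your
version.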



On 5/4/2015 1:18 PM, Kevin Abbey wrote:
> Hi,
>
> For a single node OSS I'm planning to use a combined MGS/MDS. Can
> anyone recommend an enterprise ssd designed for this workload? I'd
> like to create a raid10 with 4x ssd using zfs as the backing fs.
>
> Are there any published/documented systems using zfs in raid 10 using
> ssd?
>
> Thanks,
> Kevin
>
>
Andrew Holway
2015-05-05 06:30:36 UTC
Permalink
ZFS should not be slower for very long. I understand that, now that ZFS on
Linux is stable, many significant performance problems have been identified
and are being worked on.

On 5 May 2015 at 04:20, Andrew Wagner <***@ssec.wisc.edu> wrote:

> I can offer some guidance on our experiences with ZFS Lustre MDTs. Patrick
> and Charlie are right - you will get less performance per $ out of ZFS MDTs
> vs. LDISKFS MDTs. That said, our RAID10 with 4x Dell Mixed Use Enterprise
> SSDs achieves similar performance to most of our LDISKFS MDTs. Our MDS was
> a Dell server and we wanted complete support coverage.
>
> One of the most important things for good performance with our ZFS MDS was
> RAM. We doubled the amount of RAM in the system after experiencing
> performance issues that were clearly memory pressure related. If you expect
> to have tens of millions of files, I wouldn't run the MDS without at least
> 128GB of RAM. I would be prepared to increase that number if you run into
> RAM bottlenecks - we ended up going to 256GB in the end.
>
> For a single OSS, you may not need 4x SSDs to deal with the load. We use
> the 4 disk RAID10 setup with a 1PB filesystem and 1.8PB filesystem. Our use
> case was more for archive purposes, so we wanted to go with a complete ZFS
> solution.
>
>
>
> On 5/4/2015 1:18 PM, Kevin Abbey wrote:
>
>> Hi,
>>
>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone
>> recommend an enterprise ssd designed for this workload? I'd like to create
>> a raid10 with 4x ssd using zfs as the backing fs.
>>
>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>
>> Thanks,
>> Kevin
>>
>>
>>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
Patrick Farrell
2015-05-05 14:13:44 UTC
Permalink
The Livermore folks leading this effort can correct me if I misspeak, but they (specifically Brian Behlendorf) presented on this topic at the Developers' Day at LUG 2015 (no video of the DD talks, sorry).

From his discussion, the issues have been identified, but the fixes are between six months and two years away, and may still not fully close the gap. It'll be a bit yet.

- Patrick
________________________________
From: lustre-discuss [lustre-discuss-***@lists.lustre.org] on behalf of Andrew Holway [***@gmail.com]
Sent: Tuesday, May 05, 2015 1:30 AM
To: Andrew Wagner
Cc: ***@rutgers.edu; lustre-***@lists.lustre.org
Subject: Re: [lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

ZFS should not be slower for very long. I understand that, now ZFS on Linux is stable, many significant performance problems have been identified and are being worked on.

On 5 May 2015 at 04:20, Andrew Wagner <***@ssec.wisc.edu<mailto:***@ssec.wisc.edu>> wrote:
I can offer some guidance on our experiences with ZFS Lustre MDTs. Patrick and Charlie are right - you will get less performance per $ out of ZFS MDTs vs. LDISKFS MDTs. That said, our RAID10 with 4x Dell Mixed Use Enterprise SSDs achieves similar performance to most of our LDISKFS MDTs. Our MDS was a Dell server and we wanted complete support coverage.

One of the most important things for good performance with our ZFS MDS was RAM. We doubled the amount of RAM in the system after experiencing performance issues that were clearly memory pressure related. If you expect to have tens of millions of files, I wouldn't run the MDS without at least 128GB of RAM. I would be prepared to increase that number if you run into RAM bottlenecks - we ended up going to 256GB in the end.

For a single OSS, you may not need 4x SSDs to deal with the load. We use the 4 disk RAID10 setup with a 1PB filesystem and 1.8PB filesystem. Our use case was more for archive purposes, so we wanted to go with a complete ZFS solution.



On 5/4/2015 1:18 PM, Kevin Abbey wrote:
Hi,

For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.

Are there any published/documented systems using zfs in raid 10 using ssd?

Thanks,
Kevin
Wolfgang Baudler
2015-05-05 14:16:38 UTC
Permalink
> The Livermore folks leading this effort can correct me if I misspeak, but
> they (Specifically Brian Behlendorf) presented on this topic at the
> Developers' Day at LUG 2015 (no video of the DD talks, sorry).
>
> From his discussion, the issues have been identified, but the fixes are
> between six months and two years away, and may still not fully close the
> gap. It'll be a bit yet.
>
> - Patrick

So, these performance issues are specific to Lustre using ZFS or is it
problems with ZFS on Linux in general?

Wolfgang
Rick Wagner
2015-05-05 15:07:10 UTC
Permalink
On May 5, 2015, at 7:16 AM, Wolfgang Baudler <***@gb.nrao.edu> wrote:

>> The Livermore folks leading this effort can correct me if I misspeak, but
>> they (Specifically Brian Behlendorf) presented on this topic at the
>> Developers' Day at LUG 2015 (no video of the DD talks, sorry).
>>
>> From his discussion, the issues have been identified, but the fixes are
>> between six months and two years away, and may still not fully close the
>> gap. It'll be a bit yet.
>>
>> - Patrick
>
> So, these performance issues are specific to Lustre using ZFS or is it
> problems with ZFS on Linux in general?

It's Lustre on ZFS, especially for metadata operations that create, modify, or remove inodes. Native ZFS metadata operations are much faster than what Lustre on ZFS is currently providing. That said, we've gone with a ZFS-based patchless MDS, since read operations have always been more critical for us, and our performance is more than adequate.

--Rick

>
> Wolfgang
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Patrick Farrell
2015-05-05 15:11:18 UTC
Permalink
My understanding is it's primarily ZFS rather than ZFS+Lustre, but the issue is that the Lustre MDT does a lot of IO in a manner that ZFS is not (currently) great at handling. As I understood, the fixes will be in ZFS, not the Lustre layers above it.

Bear in mind that Lustre metadata operations are notably different from regular ZFS metadata operations.

I've copied Brian in hopes he's able to chime in - This is all second hand.

Brian - Would you mind offering a few words on this topic?
________________________________________
From: Rick Wagner [***@sdsc.edu]
Sent: Tuesday, May 05, 2015 10:07 AM
To: Wolfgang Baudler
Cc: Patrick Farrell; ***@rutgers.edu; lustre-***@lists.lustre.org
Subject: Re: [lustre-discuss] zfs -- mds/mdt -- ssd model / type recommendation

On May 5, 2015, at 7:16 AM, Wolfgang Baudler <***@gb.nrao.edu> wrote:

>> The Livermore folks leading this effort can correct me if I misspeak, but
>> they (Specifically Brian Behlendorf) presented on this topic at the
>> Developers' Day at LUG 2015 (no video of the DD talks, sorry).
>>
>> From his discussion, the issues have been identified, but the fixes are
>> between six months and two years away, and may still not fully close the
>> gap. It'll be a bit yet.
>>
>> - Patrick
>
> So, these performance issues are specific to Lustre using ZFS or is it
> problems with ZFS on Linux in general?

It's Lustre on ZFS, especially for metadata operations that create, modify, or remove inodes. Native ZFS metadata operations are much faster than what Lustre on ZFS is currently providing. That said, we've gone with a ZFS-based patchless MDS, since read operations have always been more critical for us, and our performance is more than adequate.

--Rick

>
> Wolfgang
>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Peter Kjellström
2015-05-05 15:22:05 UTC
Permalink
On Tue, 5 May 2015 15:11:18 +0000
Patrick Farrell <***@cray.com> wrote:

> My understanding is it's primarily ZFS rather than ZFS+Lustre, but
> the issue is that the Lustre MDT does a lot of IO in a manner that
> ZFS is not (currently) great at handling. As I understood, the
> fixes will be in ZFS, not the Lustre layers above it.

One thing that will be in the lustre layer is lustre support for the ZIL.
As it stands today you can't move the intent log to fast NVRAM, simply
because lustre (unlike the zfs posix layer) doesn't use the intent log.

As I understood it, people originally hoped that this would make it into
2.7.0, but that was way too optimistic. Also, many systems in production
use osd_*_sync_delay_us to skip the very expensive pool sync.

/Peter K
Scott Nolin
2015-05-05 17:06:26 UTC
Permalink
I just want to second what Rick said - it's create/remove, not stat, of
files where there are performance penalties. We covered this issue for
our workload just by using SSDs for our MDT, where normally we'd just
use fast SAS drives.

A bigger deal for us was RAM on the server, and improvements with SPL 0.6.3+

Scott

>
> It's Lustre on ZFS, especially for metadata operations that create,
> modify, or remove inodes. Native ZFS metadata operations are much
> faster than what Lustre on ZFS is currently providing. That said,
> we've gone with a ZFS-based patchless MDS, since read operations have
> always been more critical for us, and our performance is more than
> adequate.
>
> --Rick
>
Stearman, Marc
2015-05-05 15:43:20 UTC
Permalink
We are using the HGST S842 line of 2.5" SSDs. We have them configured as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two-device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.

We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.

We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
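
Roughly, the swap for each mirror vdev looked like this (pool and device names here are just placeholders):

    zpool attach -o ashift=9 <pool> <sas-dev> <ssd-dev>   # add the SSD as an extra mirror leg
    zpool status <pool>                                   # wait for the resilver to finish
    zpool detach <pool> <sas-dev>                         # drop the SAS drive from the vdev

Repeat for the other SAS drive in each pair.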

-Marc

----
D. Marc Stearman
Lustre Operations Lead
***@llnl.gov
925.423.9670




On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:

> Hi,
>
> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>
> Are there any published/documented systems using zfs in raid 10 using ssd?
>
> Thanks,
> Kevin
>
>
> --
> Kevin Abbey
> Systems Administrator
> Rutgers University
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Alexander I Kulyavtsev
2015-05-05 17:16:14 UTC
Permalink
How much space is used per inode on the MDT in production installations?
What is the recommended size of the MDT?

I'm presently at about 10 KB/inode, which seems too high compared with ldiskfs.

I ran out of inodes on a zfs MDT in my tests and zfs got "locked". The MDT zpool had all its space used.

We have the zpool created as a stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB; it got stuck at about 97 million files.
zfs v0.6.4.1, default 128 KB record size. Fragmentation went to 83% when things got locked at 98% capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity).

Shall we use a smaller ZFS record size on the MDT, say 8 KB or 16 KB? If an inode is ~10 KB and the zfs record is 128 KB, we are dropping caches and reading data we do not need.

Alex.

On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:

> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>
> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>
> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> ***@llnl.gov
> 925.423.9670
>
>
>
>
> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>
>> Hi,
>>
>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>
>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>
>> Thanks,
>> Kevin
>>
>>
>> --
>> Kevin Abbey
>> Systems Administrator
>> Rutgers University
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-***@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Alexander I Kulyavtsev
2015-05-05 19:52:16 UTC
Permalink
I checked a lustre 1.8.8 ldiskfs MDT: 106*10^6 inodes take 610GB on the MDT, or 3.5 KB/inode. I thought it was less.
So the MDT size is just a factor of three more compared to old ldiskfs.
How many files do you plan to have?
Alex.

On May 5, 2015, at 12:16 PM, Alexander I Kulyavtsev <***@fnal.gov<mailto:***@fnal.gov>> wrote:

I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
Isaac Huang
2015-05-07 01:29:39 UTC
Permalink
The dnodes are ditto'ed over whatever redundancy the raidz/mirror
already provides, so for 2-way mirrors that'd be multiplied by 4 from
the compressed dnode size. BTW, all ZFS metadata are compressed by
default. The recent 0.6.4 release supports LZ4 compression of metadata,
which I found in some benchmarks to increase object creation/removal
rates by roughly 9%. YMMV though.
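
For anyone who wants to try it, enabling the feature on an existing pool
is just the following (pool name is a placeholder; as I understand it, a
pool feature cannot be disabled again once enabled):

    zpool set feature@lz4_compress=enabled <pool>
    zpool get feature@lz4_compress <pool>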

-Isaac

On Tue, May 05, 2015 at 07:52:16PM +0000, Alexander I Kulyavtsev wrote:
> I checked lustre 1.8.8 ldiskfs MDT: 106*10^6 inodes take 610GB on MDT, or 3.5 KB/inode. I've thought it is less.
> So MDT size just 'factor three' more compared to old ldiskfs.
> How many files do you plan to have?
> Alex.
>
> On May 5, 2015, at 12:16 PM, Alexander I Kulyavtsev <***@fnal.gov<mailto:***@fnal.gov>> wrote:
>
> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>

> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


--
-Isaac
Stearman, Marc
2015-05-05 20:07:35 UTC
Permalink
Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.

Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.

Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
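
You can check these settings on your own MDT dataset, e.g. (pool and dataset names are placeholders):

    zfs get redundant_metadata,checksum <pool>/<mdt-dataset>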

-Marc

----
D. Marc Stearman
Lustre Operations Lead
***@llnl.gov
925.423.9670




On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
wrote:

> How much space is used per i-node on MDT in production installation.
> What is recommended size of MDT?
>
> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>
> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>
> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>
> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>
> Alex.
>
> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>
>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>
>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>
>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>
>> -Marc
>>
>> ----
>> D. Marc Stearman
>> Lustre Operations Lead
>> ***@llnl.gov
>> 925.423.9670
>>
>>
>>
>>
>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>
>>> Hi,
>>>
>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>
>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>
>>> Thanks,
>>> Kevin
>>>
>>>
>>> --
>>> Kevin Abbey
>>> Systems Administrator
>>> Rutgers University
>>>
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-***@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-***@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
Andrew Wagner
2015-05-06 13:50:43 UTC
Permalink
Marc,

Have you backed up your ZFS MDT? Our SSD RAID10 with 4 disks and ~200GB
of metadata can take days to back up a snapshot.

Andrew Wagner
Research Systems Administrator
Space Science and Engineering
University of Wisconsin
***@ssec.wisc.edu | 608-261-1360

On 05/05/2015 03:07 PM, Stearman, Marc wrote:
> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>
> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>
> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> ***@llnl.gov
> 925.423.9670
>
>
>
>
> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
> wrote:
>
>> How much space is used per i-node on MDT in production installation.
>> What is recommended size of MDT?
>>
>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>
>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>
>> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
>> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>>
>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>
>> Alex.
>>
>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>>
>>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>>
>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>
>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> ***@llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>
>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> --
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Rutgers University
>>>>
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-***@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-***@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Stearman, Marc
2015-05-06 15:40:49 UTC
Permalink
No, all of our Lustre file systems are scratch space. They are not backed up. We have an HPSS archive to store files forever, and we use NetApp filers for NFS home space, which is backed up.

I do recall we tried to do a migration years ago under ldiskfs to reformat with more inodes, and the backup of the MDT was taking forever (more than a week), so we decided for future migrations to just build a new file system alongside and ask the users to move the most important files that they needed.

-Marc

----
D. Marc Stearman
Lustre Operations Lead
***@llnl.gov
925.423.9670




On May 6, 2015, at 6:50 AM, Andrew Wagner <***@ssec.wisc.edu>
wrote:

> Marc,
>
> Have you backed up your ZFS MDT? Our SSD RAID10 with 4 disks and ~200GB of metadata can take days to backup a snapshot.
>
> Andrew Wagner
> Research Systems Administrator
> Space Science and Engineering
> University of Wisconsin
> ***@ssec.wisc.edu | 608-261-1360
>
> On 05/05/2015 03:07 PM, Stearman, Marc wrote:
>> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>>
>> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>>
>> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>>
>> -Marc
>>
>> ----
>> D. Marc Stearman
>> Lustre Operations Lead
>> ***@llnl.gov
>> 925.423.9670
>>
>>
>>
>>
>> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
>> wrote:
>>
>>> How much space is used per i-node on MDT in production installation.
>>> What is recommended size of MDT?
>>>
>>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>>
>>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>>
>>> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
>>> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>>>
>>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>>
>>> Alex.
>>>
>>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>>>
>>>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>>>
>>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>>
>>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>>
>>>> -Marc
>>>>
>>>> ----
>>>> D. Marc Stearman
>>>> Lustre Operations Lead
>>>> ***@llnl.gov
>>>> 925.423.9670
>>>>
>>>>
>>>>
>>>>
>>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>>
>>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>>
>>>>> --
>>>>> Kevin Abbey
>>>>> Systems Administrator
>>>>> Rutgers University
>>>>>
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-***@lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-***@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>> _______________________________________________
>> lustre-discuss mailing list
>> lustre-***@lists.lustre.org
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
Scott Nolin
2015-05-06 19:13:09 UTC
Permalink
Regarding ZFS MDT backups -

Dealing with the metadata performance issues, we performed many zfs MDT
backup/recoveries and used snapshots to test things. It did work, but
slowly. It was *agonizing* to wait a day for each testing iteration.

So now in production we've just got a full and incrementals on top,
which helps with total time to snapshot. But I'd still rather run
frequent full snapshots too.
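
The cycle itself is just zfs snapshot/send, roughly like this (pool,
dataset, snapshot names, and output paths are placeholders):

    zfs snapshot mdtpool/mdt0@full-20150501
    zfs send mdtpool/mdt0@full-20150501 > /backup/mdt0-full-20150501.zfs
    zfs snapshot mdtpool/mdt0@incr-20150506
    zfs send -i @full-20150501 mdtpool/mdt0@incr-20150506 > /backup/mdt0-incr-20150506.zfs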

We use lustre for more than just scratch, so backup of the MDT matters to
us. And really, if it weren't hard to do, wouldn't you do it for scratch
filesystems too?

Of course critical stuff gets backed up in home or whatever is really
unique, but there's still plenty of data that will be painful if the
filesystem dies. This has been a bigger decision point for us than
performance for MDT backing filesystem choices. For our next 2 lustre
file systems we are leaning more towards ldiskfs on the MDT because of
it. It's going to be tough if we give up the other features because of
this, but it's important.

One interesting thing we tested with ZFS is using it to mirror MDTs
between 2 servers via InfiniBand SRP. Notes are here:
http://wiki.lustre.org/MDT_Mirroring_with_ZFS_and_SRP - This would give
you a true mirror of your data (separate servers, separate disks) at
what looked like little or no performance penalty in my testing. Not a
backup, but nice. We were only able to test it for 1 week, so we could
not put it into production.

Scott



On 5/6/2015 10:40 AM, Stearman, Marc wrote:
> No, all of our Lustre file systems are scratch space. They are not backed up. We have an HPSS archive to store files forever, and we use NetApp filers for NFS home space, which is backed up.
>
> I do recall we tried to do a migration years ago under ldiskfs to reformat with more inodes and the backup for the MDT was taking forever (more than a week), so we decided for future migrations to just build a new file system along side and ask the users to move the most important files that they needed.
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> ***@llnl.gov
> 925.423.9670
>
>
>
>
> On May 6, 2015, at 6:50 AM, Andrew Wagner <***@ssec.wisc.edu>
> wrote:
>
>> Marc,
>>
>> Have you backed up your ZFS MDT? Our SSD RAID10 with 4 disks and ~200GB of metadata can take days to backup a snapshot.
>>
>> Andrew Wagner
>> Research Systems Administrator
>> Space Science and Engineering
>> University of Wisconsin
>> ***@ssec.wisc.edu | 608-261-1360
>>
>> On 05/05/2015 03:07 PM, Stearman, Marc wrote:
>>> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>>>
>>> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>>>
>>> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> ***@llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
>>> wrote:
>>>
>>>> How much space is used per i-node on MDT in production installation.
>>>> What is recommended size of MDT?
>>>>
>>>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>>>
>>>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>>>
>>>> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
>>>> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>>>>
>>>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>>>
>>>> Alex.
>>>>
>>>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>>>>
>>>>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>>>>
>>>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>>>
>>>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>>>
>>>>> -Marc
>>>>>
>>>>> ----
>>>>> D. Marc Stearman
>>>>> Lustre Operations Lead
>>>>> ***@llnl.gov
>>>>> 925.423.9670
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>>>
>>>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>>>
>>>>>> Thanks,
>>>>>> Kevin
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Kevin Abbey
>>>>>> Systems Administrator
>>>>>> Rutgers University
>>>>>>
>>>>>> _______________________________________________
>>>>>> lustre-discuss mailing list
>>>>>> lustre-***@lists.lustre.org
>>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-***@lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-***@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
Kevin Abbey
2015-05-06 16:44:16 UTC
Permalink
Dear Marc and list,

All of the replies are very helpful.

Could you share your method (command lines) to expand the zfs pool? I
created my initial setup last summer and used mkfs.lustre with zfs
backing. Are you expanding the zpool using the zfs commands directly on
the zpool? I have not edited an existing pool with live data and just
want to be sure I understand the methods correctly.


Regarding the statistics of the file system, can you or others share the
cli methods to obtain:

- number of files on the file system
- number of inodes
- total space used by inodes
- size of inodes (? inodes total used space / inodes )
- fragmentation %
- distribution of file sizes on the system?
- frequency of file access?
- random vs streaming IO?


Perhaps a link to a reference is sufficient.


This is the layout of the system:
------------------------------------------------------------------------------------------------
zpool list
------------------------------------------------------------------------------------------------
NAME          SIZE   ALLOC  FREE   CAP  DEDUP  HEALTH  ALTROOT
lustre-mdt0   7.25T  1.01G  7.25T   0%  1.00x  ONLINE  -
lustre-ost0   72.5T  28.1T  44.4T  38%  1.00x  ONLINE  -
lustre-ost1   72.5T  33.1T  39.4T  45%  1.00x  ONLINE  -
lustre-ost2   72.5T  27.2T  45.3T  37%  1.00x  ONLINE  -
lustre-ost3   72.5T  31.2T  41.3T  43%  1.00x  ONLINE  -
------------------------------------------------------------------------------------------------
(using) df -h

lustre-mdt0/mdt0 7.1T 1.1G 7.1T 1% /mnt/lustre/local/lustre-MDT0000
lustre-ost0/ost0 57T 23T 35T 40% /mnt/lustre/local/lustre-OST0000
lustre-ost1/ost1 57T 27T 31T 47% /mnt/lustre/local/lustre-OST0001
lustre-ost2/ost2 57T 22T 35T 39% /mnt/lustre/local/lustre-OST0002
lustre-ost3/ost3 57T 25T 32T 45% /mnt/lustre/local/lustre-OST0003
------------------------------------------------------------------------------------------------
***@tcp1:/lustre
226T 96T 131T 43% /lustre
------------------------------------------------------------------------------------------------
Disks in use.

lustre-mdt0/mdt0: raid10, each disk is a 4TB Enterprise SATA -- ST4000NM0033-9Z
lustre-ost: stripe across 2x raidz2, each raidz2 is 10x 4TB Enterprise SATA -- ST4000NM0033-9Z
------------------------------------------------------------------------------------------------




The zfs benefits you described are why I am using it.


The current mds/mdt I have consists of a zfs raid 10 using 4TB
enterprise SATA drives. I haven't measured performance specifically, but
I assume this is a good place to make a performance improvement by using
the proper type of ssd drive. I'll be doubling the number of OSTs within
~60 days. I may implement a new lustre with 2.7, then migrate data, then
incorporate the existing jbods into the new lustre. One issue to resolve
is that the existing setup did not have o2ib added as an option. I read
that adding this after creation is not guaranteed to proceed without
failure; thus the reason for starting with a new mds/mdt. It is currently
using tcp and IPoIB. We only have 16 ib clients and 26 tcp clients. Most
of the file accesses are large files for genomic/computational biology or
MD simulations, with file sizes ranging from a few GB to 100-500GB.


The zil is another place for performance improvement. I've read that
since the zil is small, the zils from multiple pools could be located on
partitions of mirrored disks, thus sharing mirrored ssds. Is this
incompatible with lustre? It has been a while since I read about this,
and I did not find any example usage with a lustre setup, only a
zfs-alone setup. I also read that there is a zil support plan for lustre.
Is there a link where I can read more about this and the schedule for
implementation? It would be interesting to learn if I can deploy a
system now and turn on zil support when it becomes available.


Thank you for any comments/assistance possible,
Kevin



On 05/05/2015 04:07 PM, Stearman, Marc wrote:
> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>
> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>
> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>
> -Marc
>
> ----
> D. Marc Stearman
> Lustre Operations Lead
> ***@llnl.gov
> 925.423.9670
>
>
>
>
> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
> wrote:
>
>> How much space is used per i-node on MDT in production installation.
>> What is recommended size of MDT?
>>
>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>
>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>
>> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
>> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>>
>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>
>> Alex.
>>
>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>>
>>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>>
>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>
>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>
>>> -Marc
>>>
>>> ----
>>> D. Marc Stearman
>>> Lustre Operations Lead
>>> ***@llnl.gov
>>> 925.423.9670
>>>
>>>
>>>
>>>
>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>
>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>>
>>>> --
>>>> Kevin Abbey
>>>> Systems Administrator
>>>> Rutgers University
>>>>
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-***@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> lustre-***@lists.lustre.org
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Kevin Abbey
Systems Administrator
Center for Computational and Integrative Biology (CCIB)
http://ccib.camden.rutgers.edu/

Rutgers University - Science Building
315 Penn St.
Camden, NJ 08102
Telephone: (856) 225-6770
Fax:(856) 225-6312
Email: ***@rutgers.edu
Stearman, Marc
2015-05-06 17:19:11 UTC
Permalink
On May 6, 2015, at 9:44 AM, Kevin Abbey <***@rutgers.edu>
wrote:

> Dear Marc and list,
>
> All of the replies are very helpful.
>
> Could you share your method (command lines) to expand the zfs pool? I created my initial setup last summer and used the mkfs.lustre with zfs backing. Are you expanding the zpool using the zfs commands directly on the zpool ? I have not done editing of an existing pool with live data and just want to be sure I understand the methods correctly.

Sure. Given a pool like this:

# stout-mds1 /dev/disk/by-vdev > zpool status
  pool: stout-mds1
 state: ONLINE
  scan:
config:

        NAME          STATE     READ WRITE CKSUM
        stout-mds1    ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            A0        ONLINE       0     0     0
            B0        ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            A1        ONLINE       0     0     0
            B1        ONLINE       0     0     0
          mirror-2    ONLINE       0     0     0
            A2        ONLINE       0     0     0
            B2        ONLINE       0     0     0


We have 10 mirror pairs defined, but the commands are all the same.

To add a device (this is all in the zpool manpage btw) you could run: "zpool add <pool-name> mirror <dev1> <dev2>" This would add another mirror pair as a vdev to the pool. If you want to do what we did, and replace SAS drives with SSDs, the procedure is a bit different.

	• Run zpool attach -o ashift=9 <pool> <dev1> <dev2> for each SSD being added.
	• ashift=9 is necessary to align the SSDs to the same sector size boundaries, in this case spinning disk is 512B (or 2^9).
	• pool is the zpool. In LC, this is the hostname of the MDS (i.e. stout-mds1).
	• dev1 is the first device in the existing mirror set (i.e. A0, A1, A2, etc.)
	• dev2 is the device name of the SSD you are adding (i.e. A10, B10, A11, etc.)
	• man zpool for more detailed information

So to add an SSD to the above pool, the command would be: "zpool attach -o ashift=9 stout-mds1 A0 A10". This would add a new device as a third mirror to the vdev.

> Regarding the statistics of the file system, can you or others share the cli methods to obtain:
>
> - number of files on the file system
> - number of inodes
> - total space used by inodes
> - size of inodes (? inodes total used space / inodes )
> - fragmentation %
> - distribution of file sizes on the system?
> - frequency of file access?
> - random vs streaming IO?

You can use the " -i " flag to df to show inodes. Run that on the MDS and you can see how many total files you can support. With a 7TB MDS, I suspect you can support roughly 4 billion files, but realistically you want to keep that volume around 50% so maybe 2 billion is more realistic.

You can run "lfs df" and "lfs df -i" to see how your OSTs are balanced for the distribution of objects within the file system. If you add an OST later, lustre will give preference to the new OSTs to get them in balance. This may impact performance a bit.

The newest version of ZFS (0.6.3) has stats you can use to look at fragmentation of the pool via files in /proc (we haven't done the pool upgrade yet, so I don't recall the path).

Typically, we run IOR and mdtest to benchmark the file system before we give it to users. I will often run small IORs and log the data to splunk so that I can trend over time to see if changes impacted performance. FIO is another good benchmarking tool to test your I/O.
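
A small mdtest run looks something like this (rank count, file counts, and the target directory are arbitrary examples, not our actual test parameters):

    mpirun -np 16 mdtest -n 5000 -i 3 -u -d /lustre/mdtest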

>
> Perhaps a link to a reference is sufficient.
>
>
> This is the layout of the system:
> ------------------------------------------------------------------------------------------------
> zpool list
> ------------------------------------------------------------------------------------------------
> NAME SIZE ALLOC FREE CAP DEDUP HEALTH ALTROOT
> lustre-mdt0 7.25T 1.01G 7.25T 0% 1.00x ONLINE -
> lustre-ost0 72.5T 28.1T 44.4T 38% 1.00x ONLINE -
> lustre-ost1 72.5T 33.1T 39.4T 45% 1.00x ONLINE -
> lustre-ost2 72.5T 27.2T 45.3T 37% 1.00x ONLINE -
> lustre-ost3 72.5T 31.2T 41.3T 43% 1.00x ONLINE -
> ------------------------------------------------------------------------------------------------
> (using) df -h
>
> lustre-mdt0/mdt0 7.1T 1.1G 7.1T 1% /mnt/lustre/local/lustre-MDT0000
> lustre-ost0/ost0 57T 23T 35T 40% /mnt/lustre/local/lustre-OST0000
> lustre-ost1/ost1 57T 27T 31T 47% /mnt/lustre/local/lustre-OST0001
> lustre-ost2/ost2 57T 22T 35T 39% /mnt/lustre/local/lustre-OST0002
> lustre-ost3/ost3 57T 25T 32T 45% /mnt/lustre/local/lustre-OST0003
> ------------------------------------------------------------------------------------------------
> ***@tcp1:/lustre
> 226T 96T 131T 43% /lustre
> ------------------------------------------------------------------------------------------------
> Disks in use.
>
> lustre-mdt0/mdt0 raid10, each disk is a 4TB Enterprise SATA -- ST4000NM0033-9Z
> lustre-ost stripe across 2xraidz2, each raidz is 10x 4TB Enterprise SATA-- ST4000NM0033-9Z
> ------------------------------------------------------------------------------------------------
>
>
>
>
> The zfs benefits you described are why I am using it.
>
>
> The current mds/mdt I have consists of a zfs raid 10 using 4TB enterprise sata drives. I haven't done a performance measure specifically but I have the assumption that this is a good place to make a performance improvement by using the proper type of ssd drive. I'll be doubling the number of OSTs within a ~60days. I may implement a new lustre with 2.7, then migrate data, then incorporate the existing jbods in the new lustre. One issue to resolve is that the existing setup did not have the o2ib added as an option. I read that adding this after the creation is not guaranteed to proceed without failure. Thus, the reason for starting with a new mds/mdt. It is currently using tcp and IPoIB. We only have 16 ib clients and 26 tcp clients. Most of the files access are large files for genomic/computational biology or md simulations, files sizes ranging from a few GB to 100-500GB.

You should be able to change NIDs without reformatting or migrating. You just need to do a write_conf on all the servers and restart them (clients unmounted of course). This is all described in the Lustre Manual. We've done it a few times here and there and it works.

>
> The zil is another place for performance improvement. I've read that since the zil is small, the zil from multiple pools could be located on partitions of mirrored disks, thus sharing mirrored ssds. Is this incompatible with lustre? It has been a while since I read about this and did not find any example usage with a lustre setup, only zfs alone setup. I also read that there is a zil support plan for lustre. Is there a link to where I can read more about this and the schedule for implementation. It will be interesting to learn if I can deploy a system now and turn on the zil support when it becomes available.

The ZIL is not supported by Lustre. There are plans to add that to Lustre, but I don't know all the details. Adding a cache device does work with the pool and can help if you are running on spinning disks.
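
Adding a cache (L2ARC) device is a one-liner, e.g. (pool and device names are placeholders):

    zpool add <pool> cache /dev/disk/by-id/<ssd>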

I will second what everyone is saying about maxing out your memory in the MDS. We are at 128GB today. I would prefer 256GB. Also, if you have a large number of clients, you should check /proc/slabinfo (or run slabtop) and see what is using the memory on your MDS. We found that ldlm_locks and ldlm_resources was consuming a great deal of memory on our MDS nodes, and have taken steps to limit the clients to avoid OOM situations. Also the more memory you have for the ZFS ARC, the better. I think the memory is better used in the ARC, than in the ldlm_locks. Obviously you want to be reasonable with your limits, but I doubt the MDS needs to hold onto 50+GB of RAM for locks.

-Marc

----
D. Marc Stearman
Lustre Operations Lead
***@llnl.gov
925.423.9670


>
> On 05/05/2015 04:07 PM, Stearman, Marc wrote:
>> Most of our production MDS nodes have a 2.7TB zpool. They vary in amount full, but one file system has 1 billion files and is 63% full. I plan on adding a bit more storage to try and get % full down to about 50%. This is another nice attribute of ZFS. I can increase the size of the pool online without having to reformat the file system, thereby adding more inodes.
>>
>> Also, remember that by default, ZFS has redundant_metadata=all defined. ZFS is storing an extra copy of all the metadata for the pool. And, ZFS stores a checksum of all blocks on the file system, so yes there is more overhead, but you do not need to do offline fsck, and it's checking not just metadata, but actual data as well.
>>
>> Disks today are relatively inexpensive, and I feel the benefits of ZFS (online fsck, data integrity, etc) are well worth the cost (slightly slower perf., need a few extra drives)
>>
>> -Marc
>>
>> ----
>> D. Marc Stearman
>> Lustre Operations Lead
>> ***@llnl.gov
>> 925.423.9670
>>
>>
>>
>>
>> On May 5, 2015, at 10:16 AM, Alexander I Kulyavtsev <***@fnal.gov>
>> wrote:
>>
>>> How much space is used per i-node on MDT in production installation.
>>> What is recommended size of MDT?
>>>
>>> I'm presently at about 10 KB/inode which seems too high compared with ldiskfs.
>>>
>>> I ran out of inodes on zfs mdt in my tests and zfs got "locked". MDT zpool got all space used.
>>>
>>> We have zpool created as stripe of mirrors ( mirror s0 s1 mirror s3 s3). Total size ~940 GB, get stuck at about 97 mil files.
>>> zfs v 0.6.4.1 . default 128 KB record. Fragmentation went to 83% when things get locked at 98 % capacity; now I'm at 62% fragmentation after I removed some files (down to 97% space capacity.)
>>>
>>> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>>>
>>> Alex.
>>>
>>> On May 5, 2015, at 10:43 AM, Stearman, Marc <stearman2@llnl.gov> wrote:
>>>
>>>> We are using the HGST S842 line of 2.5" SSDs. We have them configures as a raid10 setup in ZFS. We started with SAS drives and found them to be too slow, and were bottlenecked on the drives, so we upgraded to SSDs. The nice thing with ZFS is that it's not just a two device mirror. You can do an n-way mirror, so we added the SSDs to each of the vdevs with the SAS drives, let them resilver online, and then removed the SAS drives. Users did not have to experience any downtime.
>>>>
>>>> We have about 100PB of Lustre spread over 10 file systems. All of them are using SSDs. We have a couple using OCZ SSDs, but I'm not a fan of their RMA policies. That has changed since they were bought by Toshiba, but I still prefer the HGST drives.
>>>>
>>>> We configure them as 10 mirror pairs (20 drives total), spread across two JBODs so we can lose an entire JBOD and still have the pool up.
>>>>
>>>> -Marc
>>>>
>>>> ----
>>>> D. Marc Stearman
>>>> Lustre Operations Lead
>>>> ***@llnl.gov
>>>> 925.423.9670
>>>>
>>>>
>>>>
>>>>
>>>> On May 4, 2015, at 11:18 AM, Kevin Abbey <***@rutgers.edu> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone recommend an enterprise ssd designed for this workload? I'd like to create a raid10 with 4x ssd using zfs as the backing fs.
>>>>>
>>>>> Are there any published/documented systems using zfs in raid 10 using ssd?
>>>>>
>>>>> Thanks,
>>>>> Kevin
>>>>>
>>>>>
>>>>> --
>>>>> Kevin Abbey
>>>>> Systems Administrator
>>>>> Rutgers University
>>>>>
>>>>> _______________________________________________
>>>>> lustre-discuss mailing list
>>>>> lustre-***@lists.lustre.org
>>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> lustre-***@lists.lustre.org
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> --
> Kevin Abbey
> Systems Administrator
> Center for Computational and Integrative Biology (CCIB)
> http://ccib.camden.rutgers.edu/
> Rutgers University - Science Building
> 315 Penn St.
> Camden, NJ 08102
> Telephone: (856) 225-6770
> Fax:(856) 225-6312
> Email: ***@rutgers.edu
>
Kevin Abbey
2015-05-07 19:10:28 UTC
Permalink
Dear list,

All of the replies to my last email were very helpful. This made me
wonder whether it would be of interest to create a new method to collect
implementation details and compare usage of the Lustre
software/hardware. This could be an automated script or a voluntary
form that Lustre admins/users complete when creating a new
implementation. These real examples and user experiences may be
interesting to gather and learn from.

Clearly many on this list are already willing to share notes. A
form/method of some sort may encourage more involvement and create
opportunities for discussion and learning from each other. The developers
already do this for the software via git and bug reporting/tracking.
The hardware and deployment information is more obscure and diverse, at
least to me as a new admin.


Thanks again for all of the useful suggestions,
Kevin

--
Kevin Abbey
Systems Administrator
Rutgers University
Isaac Huang
2015-05-07 03:02:11 UTC
Permalink
On Tue, May 05, 2015 at 05:16:14PM +0000, Alexander I Kulyavtsev wrote:
> ......
> Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.

The ZFS recordsize does not affect Lustre OST/MDT. The Lustre osd-zfs
driver sets block sizes for its objects regardless of the ZFS
recordsize. See:
https://jira.hpdd.intel.com/browse/LU-4865
https://jira.hpdd.intel.com/browse/LU-5391

-Isaac
Isaac Huang
2015-05-07 03:13:26 UTC
Permalink
The dnodes are stored in data blocks of the meta_dnode, whose block
size is a fixed constant:
#define DNODE_BLOCK_SHIFT 14 /* 16k */

Again, this is not affected by ZFS recordsize.

-Isaac

On Wed, May 06, 2015 at 09:02:11PM -0600, Isaac Huang wrote:
> On Tue, May 05, 2015 at 05:16:14PM +0000, Alexander I Kulyavtsev wrote:
> > ......
> > Shall we use smaller ZFS record size on MDT, say 8KB or 16KB? If inode is ~10KB and zfs record 128KB, we are dropping caches and read data we do not need.
>
> The ZFS recordsize does not affect Lustre OST/MDT. The Lustre osd-zfs
> driver sets block sizes for its objects regardless of the ZFS
> recordsize. See:
> https://jira.hpdd.intel.com/browse/LU-4865
> https://jira.hpdd.intel.com/browse/LU-5391
>
> -Isaac
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
-Isaac
Charlie D Whitehead III
2015-05-06 17:15:11 UTC
Permalink
We did something similar to what Marc described. All of our data is ingested via tape, so it already exists on backup and archive media/facilities. Only data that can be quickly restored is ever copied onto the Lustre file system. We have home directories served on a separate file system, which is backed up, and users know to save there for long-term retention. Our MDTs total ~100TB, so we try to avoid any little thing that may cause us unnecessary grief. Luckily, we used advice from others in the Lustre community, and did not have to learn some of these lessons on our own.

Regards
--
Charlie D Whitehead III
508.596.0710
Isaac Huang
2015-05-07 00:48:54 UTC
Permalink
Since there's no TRIM support for ZFS on Linux yet, I wonder if
someone has data/experience to share about ZFS on SSD performance as
the SSDs age. Some believe for modern over-provisioned SSDs, lack of
TRIM isn't any big deal but I talked with some SSD developers here
and they all disagreed.

-Isaac

On Mon, May 04, 2015 at 02:18:35PM -0400, Kevin Abbey wrote:
> Hi,
>
> For a single node OSS I'm planning to use a combined MGS/MDS. Can anyone
> recommend an enterprise ssd designed for this workload? I'd like to create
> a raid10 with 4x ssd using zfs as the backing fs.
>
> Are there any published/documented systems using zfs in raid 10 using ssd?
>
> Thanks,
> Kevin
>
>
> --
> Kevin Abbey
> Systems Administrator
> Rutgers University
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-***@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
-Isaac