Discussion:
[Lustre-discuss] InfiniBand QoS with Lustre ko2iblnd.
Daniel Kobras
2009-05-18 10:04:37 UTC
Hi!

Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?

Thanks,

Daniel.
Jim Garlick
2009-05-18 20:34:03 UTC
Post by Daniel Kobras
Hi!
Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?
Thanks,
Daniel.
Hi, I don't know much about this stuff, but our IB guys did use QoS
to help us when we found LNET was falling apart when we brought up
our first 1K node cluster based on quad socket, quad core opterons,
and ran MPI collective stress tests on all cores.

Here are some notes they put together - see the "QoS Policy file" section.

Jim
____________________________________
QoS configuration on Infiniband

May 18, 2009

Albert Chu
chu11 at llnl.gov

Overview
--------
Quality of Service (QoS) is available in Infiniband as a means to offer some
guarantees/minimum requirements for certain applications on the fabric.

Definitions
-----------

Virtual Lanes (VLs): Infiniband supports up to 15 Virtual Lanes (VLs),
numbered 0-14, for data traffic. The virtual lanes provide
independent transmit/receive buffering for each port on the
fabric.

Service Level (SL): A number (0-15) that can be assigned to any
Infiniband packet. The meaning of an SL is not defined by the
standard; it's up to the user to determine.

Basic QoS Implementation in Infiniband
--------------------------------------

There are three basic parts to QoS in Infiniband.

1) Assign/configure protocols/tools/applications to use appropriate
SLs.

Normally, you assign different SLs to different protocols,
applications, etc. (e.g. MPI, Lustre). This allows each
protocol/application to be given unique QoS requirements.

2) Configure SL2VL mapping

Map SLs to VLs. For example, SL0->VL0, SL1->VL1, etc.

3) Configure VL Arbitration

Determines VL transmission rules based on a set of prioritization
rules.

It is the responsibility of administrators/users to use and configure
the SLs/VLs properly. VLs and SLs do nothing/mean nothing in the
Infiniband card.

SL2VL Mapping Configuration
---------------------------

This is pretty basic. You assign an SL to a VL. It's a direct
one-to-one mapping, e.g. SL1->VL1, SL2->VL2.

Normally, you map SLX -> VLX. If you do otherwise, you're starting to
do something pretty crazy.

VL Arbitration Configuration
----------------------------

This is not so basic. There are three components to VL Arbitration
configuration, the High-Priority Table, the Low-Priority Table, and
the Limit of High Priority.

High/Low VL Arbitration Tables
------------------------------

High & Low Priority VL Arbitration Tables are lists of VL number
(0-14) and weighting value (0-255) pairs. The weighting value
indicates the number of 64 byte units that can be transmitted from
that VL when it is that VL's turn to transmit. A weight of 0 means no
data can be transferred. Counters are rounded up as needed for
packets (i.e. a weight of 1 means a packet > 64 bytes can still be
sent). The High Priority VL Arbitration Table is weights for "high
priority" data while the Low Priority VL Arbitration Table is weights
for "low priority" data (the usefulness will make more sense after you
read "Limit of High Priority" below).

Note that 64*255 =~ 16K, which is a small number for many institutions.
I think it is easiest to think of the weights as ratios for percentage
bandwidth if the network is completely flooded with data from all
protocols/applications.

For example:

A) VL0 Weight = 255, VL1 Weight = 255

50% bandwidth for VL0 and VL1 each.

B) VL0 Weight = 255, VL1 Weight = 255, VL2 Weight = 255

33% bandwidth for VL0, VL1, and VL2 each.

C) VL0 Weight = 200, VL1 Weight = 100

66% bandwidth for VL0, 33% bandwidth for VL1.

D) VL0 Weight = 200, VL1 Weight = 100, VL2 Weight = 100

50% bandwidth for VL0, 25% bandwidth for VL1 and VL2 each.
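
(For reference: in the opensm vlarb syntax shown in the "Configuring
for OpenSM" section below, example C would be written as the pair list
0:200,1:100 and example D as 0:200,1:100,2:100.)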

Limit of High Priority
----------------------

Indicates the amount of high-priority data (from the High VL
Arbitration Table) that can be sent without an opportunity to send a
low priority packet (from the Low VL Arbitration Table). Increments
are in units of 4K bytes (special values: 0 = one packet, 255 =
unlimited data).

4K*254 =~ 1M, which again is a small number for many institutions. The
most likely numbers to consider using are:

0 - one packet
254 - max high limit data w/o being unlimited
255 - unlimited data

VL Arbitration Examples
-----------------------

When you combine the High/Low VL Arbitration tables with the Limit of
High Priority, you can create some interesting QoS behavior.

Example 1:

(Following example is borrowed from the "Quality and Service in OFED
3.1" presentation listed below.)

High-Limit: 0
VL-Arb-High: VL2 Weight = 1
VL-Arb-Low: VL0 Weight = 200, VL1 Weight = 50

Effectively, anytime any data on VL2 is available, send at most one
packet from VL2 before sending data from VL0 or VL1. If no VL2 data
is available, VL0 gets 80% bandwidth, VL1 gets 20% of bandwidth.

Idea:

(Assume Lustre Meta Data Servers and Lustre OSTs are on the same
fabric)

MPI -> SL0 -> VL0
Lustre OST Data -> SL1 -> VL1
Lustre Meta Data -> SL2 -> VL2

In this example, Lustre meta data traffic is assumed to be low, but
with the high priority it is serviced faster, theoretically allowing
for better Lustre interaction. When there is no Lustre meta data
traffic on the fabric, MPI is given the majority share of bandwidth
because it is more timing sensitive.
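
As a rough sketch (not tested), Example 1 could be expressed with the
opensm.opts keywords from the "Configuring for OpenSM" section below,
assuming three data VLs:

qos_ca_max_vls 3
qos_ca_high_limit 0
qos_ca_vlarb_high 2:1
qos_ca_vlarb_low 0:200,1:50
qos_ca_sl2vl 0,1,2,15,15,15,15,15,15,15,15,15,15,15,15,15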

Example 2:

High-Limit: 254
VL-Arb-High: VL0 Weight = 255
VL-Arb-Low: VL1 Weight = 1

Effectively, whenever there is data on VL0, always send it before VL1.
But do not allow VL0 to starve VL1. Let VL1 send *something* once in a
while.

Idea:

MPI -> SL0 -> VL0
Lustre -> SL1 -> VL1

So MPI always gets priority over Lustre, but cannot starve it out.
The High-Limit of 254 means a low priority packet must be sent once in
a while. This could be important if Lustre "pings" are done to keep
some services alive.

Configuring for OpenSM
----------------------

Currently configured in /var/cache/opensm/opensm.opts (later to be in
/etc/opensm/opensm.conf).

#
# QoS OPTIONS
#
qos TRUE

qos_policy_file /var/cache/opensm/qos-policy.conf

# QoS default options
qos_max_vls 2
qos_high_limit 254
qos_vlarb_high 0:255
qos_vlarb_low 1:1
qos_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

qos_ca_max_vls 2
qos_ca_high_limit 254
qos_ca_vlarb_high 0:255
qos_ca_vlarb_low 1:1
qos_ca_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

# achu: VL2 not used, need to give non-null input to buggy opensm
qos_swe_max_vls 2
qos_swe_high_limit 255
qos_swe_vlarb_high 0:225,1:25
qos_swe_vlarb_low 2:1
qos_swe_sl2vl 0,1,15,15,15,15,15,15,15,15,15,15,15,15,15,15

Notes/Comments:

There are default QoS options, and specific QoS options
for channel adapters, switches, etc. They allow you to configure
for different port-types across the fabric.

The "max_vls" entries can be ignored.

The "high_limit", "vlarb_high", and "vlarb_low" fields are hopefully
self exaplanatory. The "vlarb_high"/"vlarb_low" entries take inputs
as <VL>:<Weight> as input.

In the above example, channel Adapters have:

VL0 Weight = 255 -> For MPI

VL1 Weight = 1 -> For Lustre

Idea: With the High Limit of 254, MPI always gets priority, but cannot
starve Lustre.

In the above example, Switches have:

VL0 Weight = 225 -> For MPI
VL1 Weight = 25 -> For Lustre

Idea: Across the entire cluster, MPI, Lustre, etc. are going on from
different jobs/tasks. We don't want MPI to starve out other traffic
so we give it a nice chunk of bandwidth but not all bandwidth (in this
example 90% for MPI, 10% for Lustre).

SLs are mapped to VLs by listing the VL for each SL in increasing SL
order. In the above example, SL0 -> VL0 and SL1 -> VL1. An entry of
15 is used for SLs you don't care about (VL15 is reserved for subnet
management, so data mapped to it is dropped).

Assigning SLs
-------------

The QoS configuration itself is now done, but we still need to make
protocols/applications use the appropriate SLs.

Some tools allow you to pick an SL when you run.

i.e.
mpirun -sl 0
However, it may not be easy to force/change users/applications to use
different SLs. The easiest way to configure SLs is through the OpenSM
QoS policy file.

QoS Policy File
---------------

Depending on OpenSM version, this file is in
/var/cache/opensm/qos-policy.conf or /etc/opensm/qos-policy.conf.

The following is a short summary of the options I think are needed for
our environment. See "QoS Management in OpenSM" for the full set of
options.

Format:

qos-ulps
<user level protocol>, <options> : <SL level>
end-qos-ulps

<user level protocol> = IPoIB, SDP, SRP, iSER

<options> = port-num, pkey, service-id, target-port-guid
(Note: options depends on which user level protocol is selected)

<SL level> = SL level 0-15.

Example:

qos-ulps
default : 0
any, target-port-guid 0x0002c9030002879d,0x0002c90300028765 : 1
end-qos-ulps

Idea:

Everything (most notably MPI) defaults to SL0. Any of the above
locations with the listed destination GUID gets SL1.

If the GUIDs listed under target-port-guid are Lustre routers, then
Lustre data gets SL=1. In combination with the VL
Arbitration and SL2VL Mapping configuration listed above, hopefully it
can be seen how MPI gets priority over Lustre, but does not starve it
out.

Note that files with target-port-guids must be kept up to date if
GUIDs change. You can determine GUIDs via /usr/sbin/ibstat.
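
For example, on a Lustre router something like the following should
show the port GUID to list (exact output format may vary by ibstat
version):

# > /usr/sbin/ibstat | grep "Port GUID"
Port GUID: 0x0002c9030002879d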

Verifying Configuration
-----------------------

The tool smpquery can be used to verify that VL Arbitration tables and
SL2VL tables have been configured in cards/switches properly.

# > /usr/sbin/smpquery sl2vl 346
# SL2VL table: Lid 346
# SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in 0, out 0: | 0| 1|15|15|15|15|15|15|15|15|15|15|15|15|15|15|

# > /usr/sbin/smpquery vlarb 346
# VLArbitration tables: Lid 346 port 0 LowCap 8 HighCap 8
# Low priority VL Arbitration Table:
VL : |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0x1 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
# High priority VL Arbitration Table:
VL : |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |
WEIGHT: |0xFF|0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |0x0 |

The high limit can be determined by issuing portinfo queries via
/usr/sbin/smpquery.

# > /usr/sbin/smpquery portinfo 346 | grep Limit
VLHighLimit:.....................0
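
Note that on switches the VL Arbitration tables and high limit are
per-port, so query the external ports individually, e.g. (the lid and
port number here are just placeholders):

# > /usr/sbin/smpquery vlarb 6 12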

Random Configuration Notes
--------------------------

SLs are most often assigned during Infiniband Queue Pair (QP) creation
time. So, if you change your QoS settings, any tools/applications
(including Lustre) that are currently running and have already created
QPs may not have absorbed the newest QoS policy. The appropriate
tools/applications should be restarted.

Not all Infiniband adapters support VLs. Those that do may not
support all 15 VLs. You can determine what your system supports by
issuing portinfo queries via /usr/sbin/smpquery.

References
----------

QoS Management in OpenSM

(this is a link to the Git Tree - hopefully the URL is always legit)

http://www.openfabrics.org/git/?p=~sashak/management.git;a=blob_plain;f=opensm/doc/QoS_management_in_OpenSM.txt;hb=HEAD

Quality and Service in OFED 3.1 - Liran Liss

http://www.openfabrics.org/archives/spring2008sonoma/Tuesday/qos_sonoma08_ofa_v1.ppt

QoS support in OFED

(this is a link to the Git Tree - the URL is on the ofed_1_4 branch,
so it probably will change at some point)

http://www.openfabrics.org/git/?p=~tziporet/docs.git;a=blob_plain;f=QoS_architecture.txt;hb=ofed_1_4
Sébastien Buisson
2009-05-19 15:55:21 UTC
Hi,

We took a slightly different approach to deal with IB QoS in Lustre.

We decided to assign a specific service-id to Lustre: in ofa-kernel we
added a new value in the rdma_port_space enum, that we called
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
because we wanted the service-id to be a ko2iblnd module parameter, so
we added some stuff in o2iblnd_modparams.c for instance).
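
Roughly, the idea looks like this (a sketch only, not the actual patch;
the existing enum values are from the OFED headers of that era, while
the RDMA_PS_LUSTRE value and the exact o2iblnd call site are
approximations):

/* include/rdma/rdma_cm.h */
enum rdma_port_space {
        RDMA_PS_SDP    = 0x0001,
        RDMA_PS_IPOIB  = 0x0002,
        RDMA_PS_TCP    = 0x0106,
        RDMA_PS_UDP    = 0x0111,
        RDMA_PS_LUSTRE = 0x0120, /* new; value picked arbitrarily here */
};

/* o2iblnd_cb.c, in kiblnd_connect_peer() */
cmid = rdma_create_id(kiblnd_cm_callback, peer, RDMA_PS_LUSTRE);
                                               /* was RDMA_PS_TCP */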

The next step is to tell OpenSM to assign an SL to this service-id.
Here is an extract of our "QoS policy file":
qos-ulps
default : 0
any, service-id=0x.....: 3
end-qos-ulps

The major drawback of this solution is that the modification we made in
the ofa-kernel is not OpenFabrics Alliance compliant, because the
portspace list is defined in the IB standard.

Cheers,
Sebastien.
Isaac Huang
2009-05-19 19:48:10 UTC
Post by Sébastien Buisson
Hi,
We took a slightly different approach to deal with IB QoS in Lustre.
We decided to assign a specific service-id to Lustre: in ofa-kernel we
added a new value in the rdma_port_space enum, that we called
RDMA_PS_LUSTRE. Then we modified the calls to rdma_create_id in
o2iblnd.c and o2iblnd_cb.c to use this new port space value instead of
RDMA_PS_TCP (well, we did a little more than that in the Lustre code,
because we wanted the service-id to be a ko2iblnd module parameter, so
we added some stuff in o2iblnd_modparams.c for instance).
Maybe I missed something, but it seemed to me overkill to specify
service-id this way. Without any code changes, you might figure out
the service-id by the ko2iblnd 'service' option:
rdma_resolve_route->cma_resolve_ib_route->cma_query_ib_route->cma_get_service_id
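
If memory serves, cma_get_service_id() composes the service-id from the
port space and the port number, roughly (sketch from memory of cma.c):

    return cpu_to_be64(((u64) ps << 16) + be16_to_cpu(cma_port(addr)));

So with the o2iblnd defaults (RDMA_PS_TCP = 0x0106, service = 987 =
0x3db), the service-id comes out as 0x00000000010603db, which is the
value one would list in a qos-policy.conf service-id rule.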

Isaac
Daniel Kobras
2009-05-19 17:05:03 UTC
Hi!
Post by Jim Garlick
Hi, I don't know much about this stuff, but our IB guys did use QoS
to help us when we found LNET was falling apart when we brought up
our first 1K node cluster based on quad socket, quad core opterons,
and ran MPI collective stress tests on all cores.
Here are some notes they put together - see the "QoS Policy file" section.
Great summary, thanks for sharing! Seems like qos-ulp is a rather recent
OpenSM-specific feature, and the SMs in our switches apparently don't
offer a similar SID-to-SL mapping either, but it certainly got me a
leap further.

Thanks,

Daniel.
Isaac Huang
2009-05-19 19:25:55 UTC
Post by Daniel Kobras
Hi!
Does anyone know how to use QoS with Lustre's o2ib LND? The Voltaire IB
LND allowed to #define a service level, but I couldn't find a similar
facility in o2ib. Is there a different way to apply QoS rules?
The o2iblnd SL is set by the OFED RDMA CM, indirectly based on the
o2iblnd service port (set via ko2iblnd option 'service', 987 by
default) and its port space (RDMA_PS_TCP). For a complete, and more
complicated story, please see:
https://bugzilla.lustre.org/show_bug.cgi?id=18360#c2

Isaac
Sébastien Buisson
2009-06-22 14:49:03 UTC
Hi all,

We have been thinking about this IB QoS thing in Lustre for a while, and
we would like to express a need that may not be satisfied by the current
solution exposed by Isaac (which consists in using the ko2iblnd
'service' option).

Let's consider we have two sets of OSSes, each set serving a different
Lustre file system (i.e. all the OSTs of an OSS are part of the same
Lustre file system). The same Lustre clients have access to both
filesystems.
In these conditions, how can we enforce different IB QoS in Lustre for
the 2 file systems?
- by using the ko2iblnd 'service' option, the o2iblnd SL would be the
same for all connections initiated by a given Lustre client, regardless
of the destination file system. So we would not achieve our goal.
Unless what really matters is the SL of the connections created by the
servers (I think I have seen in the Lustre debug logs that the 'real'
data transfers are always done via the servers' connections).
What do you think?
- if the 'service id' information was stored on the MGS on a file system
basis, one could imagine retrieving it at mount time on the clients.
The 'service id' information stored on the MGS could consist of a port
space and a port id. Thus it would be possible to assign different
service ports to the various connections initiated by the client,
depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?


Thanks in advance.
Sebastien.
Sébastien Buisson
2009-06-24 07:46:19 UTC
Post by Sébastien Buisson
Hi all,
We have been thinking about this IB QoS thing in Lustre for a while, and
we would like to express a need that may not be satisfied by the current
solution exposed by Isaac (which consists in using the ko2iblnd
'service' option).
Let's consider we have two sets of OSSes, each set serving a different
Lustre file system (i.e. all the OSTs of an OSS are part of the same
Lustre file system). The same Lustre clients have access to both
filesystems.
In these conditions, how can we enforce different IB QoS in Lustre for
the 2 file systems?
- by using the ko2iblnd 'service' option, the o2iblnd SL would be the
same for all connections initiated by a given Lustre client, regardless
of the destination file system. So we would not achieve our goal.
Unless what really matters is the SL of the connections created by the
servers (I think I have seen in the Lustre debug logs that the 'real'
data transfers are always done via the servers' connections).
What do you think?
I tried to make a client, for which I set the ko2iblnd 'service'
option to 986, communicate with a server for which I set the ko2iblnd
'service' option to 987: it does not work.
This is not surprising, because the ko2iblnd 'service' parameter is
used on the client side in the kiblnd_connect_peer function to
designate the port of the remote peer (the server in this case).
So, the ko2iblnd 'service' option must be the same for all the nodes
participating in the same file system.

In our case where the same clients access both file systems, it means
that we will not be able to set different o2iblnd SLs for the two file
systems.
Post by Sébastien Buisson
- if the 'service id' information was stored on the MGS on a file system
basis, one could imagine retrieving it at mount time on the clients.
The 'service id' information stored on the MGS could consist of a port
space and a port id. Thus it would be possible to assign different
service ports to the various connections initiated by the client,
depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?
The peer's port information could be stored in the kib_peer_t structure.
That way, it would be possible to make clients connect to servers which
listen on different ports.
What do you think?
Daniel Kobras
2009-06-24 08:05:36 UTC
Hi Sébastien!
Post by Sébastien Buisson
Post by Sébastien Buisson
- if the 'service id' information was stored on the MGS on a file system
basis, one could imagine retrieving it at mount time on the clients.
The 'service id' information stored on the MGS could consist of a port
space and a port id. Thus it would be possible to assign different
service ports to the various connections initiated by the client,
depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?
The peer's port information could be stored in the kib_peer_t structure.
That way, it would be possible to make clients connect to servers which
listen on different ports.
What do you think?
Why do you want to distinguish the two filesystems solely by service id
rather than, say, service id + port guids of the respective Lustre
servers? You'll need a full QoS policy file instead of the simplified
syntax, and configuration needs to be adapted on hardware changes, but
this still looks simpler to me than modifying the wire protocol.

Regards,

Daniel.
Isaac Huang
2009-06-25 18:49:06 UTC
Post by Sébastien Buisson
......
The peer's port information could be stored in the kib_peer_t structure.
That way, it would be possible to make clients connect to servers which
listen on different ports.
What do you think?
At this point it can't be done. But we have in our development plans
to implement dynamic LNet configuration which includes per-NI options
(i.e. it'd be possible to specify the 'service' option on a per-NI
basis instead of being just LND global), and once it's implemented
you'd be able to specify different 'service' option if you'd create
two server networks for the two FS.

For your current concern of setting up different SLs, I'd believe that
it could be achieved via target GUIDs as mentioned in my previous reply.

Hope this helps,
Isaac
Sébastien Buisson
2009-06-26 11:42:53 UTC
Post by Isaac Huang
Post by Sébastien Buisson
......
The peer's port information could be stored in the kib_peer_t structure.
That way, it would be possible to make clients connect to servers which
listen on different ports.
What do you think?
At this point it can't be done. But we have in our development plans
to implement dynamic LNet configuration which includes per-NI options
(i.e. it'd be possible to specify the 'service' option on a per-NI
basis instead of being just LND global), and once it's implemented
you'd be able to specify different 'service' option if you'd create
two server networks for the two FS.
OK, if I understand correctly, the major hurdle with what I proposed is
that LNET is not able to get configuration information dynamically at
the moment, right?

I agree with you, I think the per-NI options in LNET would do the trick.
Do you have plans about when this feature would be available? Have you
already begun to work on it?
If you have some pre-alpha work, we would be glad to evaluate it.
Post by Isaac Huang
For your current concern of setting up different SLs, I'd believe that
it could be achieved via target GUIDs as mentioned in my previous reply.
Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
complicated. As the size of clusters grows, it would require listing
hundreds of GUIDs in the QoS policy rules.

Sebastien.
Isaac Huang
2009-07-01 06:07:33 UTC
Post by Sébastien Buisson
Post by Isaac Huang
Post by Sébastien Buisson
......
The peer's port information could be stored in the kib_peer_t
structure. That way, it would be possible to make clients connect to
servers which listen on different ports.
What do you think?
At this point it can't be done. But we have in our development plans
to implement dynamic LNet configuration which includes per-NI options
(i.e. it'd be possible to specify the 'service' option on a per-NI
basis instead of being just LND global), and once it's implemented
you'd be able to specify different 'service' option if you'd create
two server networks for the two FS.
OK, if I understand correctly, the major hurdle with what I proposed is
that LNET is not able to get configuration information dynamically at
the moment, right?
Yes.
Post by Sébastien Buisson
I agree with you, I think the per-NI options in LNET would do the trick.
Do you have plans about when this feature would be available? Have you
already begun to work on it?
It's too early to make any realistic estimate at the moment. Though it's
already on the LNet roadmap, I'm not sure when we're going to start
working on it.
Post by Sébastien Buisson
If you have some pre-alpha work, we would be glad to evaluate it.
Thanks, I'll remember to ping you when it's available.
Post by Sébastien Buisson
Post by Isaac Huang
For your current concern of setting up different SLs, I'd believe that
it could be achieved via target GUIDs as mentioned in my previous reply.
Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
complicated. As the size of clusters grows, it would require listing
hundreds of GUIDs in the QoS policy rules.
Yes, it's rather cumbersome at bigger scales.

Thanks,
Isaac
Isaac Huang
2009-07-01 06:31:36 UTC
Post by Isaac Huang
......
Post by Sébastien Buisson
Post by Isaac Huang
For your current concern of setting up different SLs, I'd believe that
it could be achieved via target GUIDs as mentioned in my previous reply.
Unfortunately, configuring IB QoS via target GUIDs quickly becomes too
complicated. As the size of clusters grows, it would require listing
hundreds of GUIDs in the QoS policy rules.
Yes, it's rather cumbersome at bigger scales.
It just occurred to me that it might work by configuring QoS policy
based on IB partition keys. It's just an initial thought - if you'd
configure two @o2ib networks over two IB partitions over the same
fabric, one for each filesystem, then you might differentiate traffic
of the two FS based on their partition keys. I think it'd be much
easier to configure an additional @o2ib network than to maintain
hundreds of GUIDs that could change in the policy file.
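
In qos-policy.conf terms that would be something like (the pkey value
below is just a placeholder):

qos-ulps
default : 0
any, pkey 0x8002 : 1
end-qos-ulps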

By default the o2iblnd runs over the default IB partition. Please see
bug 18602 for how to configure the o2iblnd over a non-default partition.
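
(As a rough sketch of one possible shape, not necessarily the exact
recipe from the bug: an IPoIB child interface per partition with an
LNet network bound to each; pkey and interface names below are
placeholders.)

# > echo 0x8002 > /sys/class/net/ib0/create_child

options lnet networks="o2ib0(ib0),o2ib1(ib0.8002)"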

Thanks,
Isaac

Isaac Huang
2009-06-25 18:34:47 UTC
Post by Sébastien Buisson
......
Let's consider we have two sets of OSSes, each set serving a different
Lustre file system (i.e. all the OSTs of an OSS are part of the same
Lustre file system). The same Lustre clients have access to both
filesystems.
In these conditions, how can we enforce different IB QoS in Lustre for
the 2 file systems?
By assigning different SLs to the two sets of servers based on server
GUIDs, i.e. target-port-guid in QoS policy file.
Post by Sébastien Buisson
- by using the ko2iblnd 'service' option, the o2iblnd SL would be the
same for all connections initiated by a given Lustre client, regardless
of the destination file system. So we would not achieve our goal.
Not necessarily. The service-id would be the same, but SLs could be
different if the SM has been configured in a way that doesn't
determine SLs solely based on service-id (e.g. also based on target
GUIDs).
Post by Sébastien Buisson
......
- if the 'service id' information was stored on the MGS on a file system
basis, one could imagine retrieving it at mount time on the clients.
The 'service id' information stored on the MGS could consist of a port
space and a port id. Thus it would be possible to assign different
service ports to the various connections initiated by the client,
depending on the target file system.
What do you think? Would you say this is feasible, or can you see major
issues with this proposal?
The LNet configuration could not reside on the MGS, because LNet must
already be properly configured before any configuration on the MGS
could be fetched over the network.

Isaac