Mark Butler | 1 Dec 2005 01:51

Re: SCTP Partial delivery / SOCK_SEQPACKET ambiguity

Brian F. G. Bidulock wrote:
>> Yes, standard Posix SOCK_SEQPACKET semantics. The serious problem is with partial deliveries on a one-to-many socket, because they serialize communication from all other associations, potentially hanging the socket for a very long time. Lksctp is currently vulnerable to this problem.
> Well, no. POSIX SOCK_SEQPACKET sets MSG_TRUNC if the message (not the record) is larger than the supplied buffer, just as does SOCK_DGRAM.

LKSCTP is vulnerable as implemented. Partial delivery locks one-to-many sockets by association.

Regrettably, the Posix documentation for SOCK_SEQPACKET is inconsistent. This is the description from section 2.10.6:

The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and is also connection-oriented. The only difference between these types is that record boundaries are maintained using the SOCK_SEQPACKET type. A record can be sent using one or more output operations and received using one or more input operations, but a single operation never transfers parts of more than one record. Record boundaries are visible to the receiver via the MSG_EOR flag in the received message flags returned by the recvmsg() function. It is protocol-specific whether a maximum record size is imposed.


If record boundaries are the only difference from SOCK_STREAM, then message boundaries must be either invisible or identical with record boundaries. There cannot be two independent sets of boundaries.

However the documentation for recvmsg() implies that records and messages are different:

The recvmsg() function shall return the total length of the message. For message-based sockets, such as SOCK_DGRAM and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffers, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded, and MSG_TRUNC shall be set in the msg_flags member of the msghdr structure. For stream-based sockets, such as SOCK_STREAM, message boundaries shall be ignored. In this case, data shall be returned to the user as soon as it becomes available, and no data shall be discarded.
...

MSG_EOR
End-of-record was received (if supported by the protocol).

This is a complete contradiction. It only makes sense if there are protocols that support two sets of boundaries, lower level message boundaries embedded in higher level record boundaries.  Are there any?

More likely there are two conceptions of SOCK_SEQPACKET that made their way into the standard - One is SOCK_DGRAM with ordering and reliability added.  The other is a byte stream with occasional boundary markers.  The draft api uses the latter conception, consistent with Posix XSH 2.10.6.  Which conception do existing SOCK_SEQPACKET protocols use?

I guess I can't say standard SOCK_SEQPACKET semantics anymore. Sigh.

- Mark


POSIX references:

 http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_10.html
 http://www.opengroup.org/onlinepubs/009695399/functions/recvmsg.html
_______________________________________________
tsvwg mailing list
tsvwg <at> ietf.org
https://www1.ietf.org/mailman/listinfo/tsvwg
Mark Butler | 1 Dec 2005 02:19

Re: SCTP partial delivery

Brian,
>>>> SCTP_FRAGMENT_INTERLEAVE would enable interleaving fragments from different streams. It is an unexpected, non-Posix compliant behavior, and should be disabled by default.
>>> POSIX does not say that fragments from different streams can't be interleaved.
>> But it does specify that records may not be interleaved, which is the same thing. Adding streams shouldn't be an excuse to break existing semantics - principle of least surprise, at a minimum. Streams should be transparent to readers. - Mark
> Not the same thing. That is interleaving records from the SAME stream.

We will have to disagree here.

- Mark
Mark Butler | 1 Dec 2005 02:25

Re: SCTP Partial delivery / SOCK_SEQPACKET ambiguity (corrected)

(This is a repeat with extraneous whitespace from a defective mail client removed)

Brian F. G. Bidulock wrote:
>> Yes, standard Posix SOCK_SEQPACKET semantics. The serious problem is with partial deliveries on a one-to-many socket, because they serialize communication from all other associations, potentially hanging the socket for a very long time. Lksctp is currently vulnerable to this problem.
> Well, no. POSIX SOCK_SEQPACKET sets MSG_TRUNC if the message (not the record) is larger than the supplied buffer, just as does SOCK_DGRAM.

LKSCTP is vulnerable as implemented. Partial delivery locks one-to-many sockets by association.

Regrettably, the Posix documentation for SOCK_SEQPACKET is inconsistent. This is the description from section 2.10.6:

The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and is also connection-oriented. The only difference between these types is that record boundaries are maintained using the SOCK_SEQPACKET type. A record can be sent using one or more output operations and received using one or more input operations, but a single operation never transfers parts of more than one record. Record boundaries are visible to the receiver via the MSG_EOR flag in the received message flags returned by the recvmsg() function. It is protocol-specific whether a maximum record size is imposed.


If record boundaries are the only difference from SOCK_STREAM, then message boundaries must be either invisible or identical with record boundaries. There cannot be two independent sets of boundaries.

However the documentation for recvmsg() implies that records and messages are different:

The recvmsg() function shall return the total length of the message. For message-based sockets, such as SOCK_DGRAM and SOCK_SEQPACKET, the entire message shall be read in a single operation. If a message is too long to fit in the supplied buffers, and MSG_PEEK is not set in the flags argument, the excess bytes shall be discarded, and MSG_TRUNC shall be set in the msg_flags member of the msghdr structure. For stream-based sockets, such as SOCK_STREAM, message boundaries shall be ignored. In this case, data shall be returned to the user as soon as it becomes available, and no data shall be discarded.
...

MSG_EOR
End-of-record was received (if supported by the protocol).

This is a complete contradiction. It only makes sense if there are protocols that support two sets of boundaries, lower level message boundaries embedded in higher level record boundaries.  Are there any?

More likely there are two conceptions of SOCK_SEQPACKET that made their way into the standard - One is SOCK_DGRAM with ordering and reliability added.  The other is a byte stream with occasional boundary markers.  The draft api uses the latter conception, consistent with Posix XSH 2.10.6.  Which conception do existing SOCK_SEQPACKET protocols use?

I guess I can't say standard SOCK_SEQPACKET semantics anymore. Sigh.

- Mark


POSIX references:

 http://www.opengroup.org/onlinepubs/009695399/functions/xsh_chap02_10.html
 http://www.opengroup.org/onlinepubs/009695399/functions/recvmsg.html
Brian F. G. Bidulock | 1 Dec 2005 07:31

Re: SCTP partial delivery

Mark,

On Wed, 30 Nov 2005, Mark Butler wrote:

> 
>    Brian,
> 
> SCTP_FRAGMENT_INTERLEAVE would enable interleaving fragments from 
> different streams.
> It is an unexpected, non-Posix compliant behavior, and should be 
> disabled by default.
>     
> 
> POSIX does not say that fragments from different streams can't be
> interleaved.
>   
> 
>    But  it does specify that records may not be interleaved, which is the
>    same  thing.   Adding streams shouldn't be an excuse to break existing
>    semantics  - principle of least surprise, at a minimum. Streams should
>    be transparent to readers.
>     - Mark
>     
> 
> Not the same thing.  That is interleaving records from the SAME stream.
>   
> 
>    We will have to disagree here.
>    - Mark

Apparently you disagree with POSIX here too.  While you seem to be
reading out loud, consider these two passages:

   The SOCK_SEQPACKET socket type is similar to the SOCK_STREAM type, and
   is  also  connection-oriented. The only difference between these types
   is  that  record  boundaries  are  maintained using the SOCK_SEQPACKET
   type.  A  record  can  be sent using one or more output operations and
   received  using  one  or more input operations, but a single operation
   never  transfers  parts of more than one record. Record boundaries are
   visible  to  the receiver via the MSG_EOR flag in the received message
   flags  returned  by  the  recvmsg()  function. It is protocol-specific
   whether a maximum record size is imposed.

"a single operation never transfers parts of more than one record."

And here:

   The  contents of a receive buffer are logically structured as a series
   of data segments with associated ancillary data and other information.
   A  data segment may contain normal data or out-of-band data, but never
   both.  A  data  segment may complete a record if the protocol supports
   records  (always  true  for  types  SOCK_SEQPACKET  and SOCK_DGRAM). A
   record  may  be  stored  as more than one segment; the complete record
   might never be present in the receive buffer at one time, as a portion
   might  already  have  been  returned  to  the application, and another
   portion  might  not  yet  have  been  received from the communications
   provider. A data segment may contain ancillary protocol data, which is
   logically  associated  with the segment. Ancillary data is received as
   if  it  were  queued  along  with  the  first normal data octet in the
   segment  (if  any). A segment may contain ancillary data only, with no
   normal  or  out-of-band  data.  For  the  purposes  of this section, a
   datagram  is considered to be a data segment that terminates a record,
   and  that  includes  a  source  address as a special type of ancillary
   data.  Data segments are placed into the queue as data is delivered to
   the socket by the protocol. Normal data segments are placed at the end
   of the queue as they are delivered. If a new segment contains the same
   type  of data as the preceding segment and includes no ancillary data,
   and if the preceding segment does not terminate a record, the segments
   are logically merged into a single segment.

   The receive queue is logically terminated if an end-of-file indication
   has been received or a connection has been terminated. A segment shall
   be  considered  to  be terminated if another segment follows it in the
   queue,  if  the  segment  completes  a record, or if an end-of-file or
   other  connection  termination  has been reported. The last segment in
   the  receive queue shall also be considered to be terminated while the
   socket has a pending error to be reported.

   A  receive  operation  shall  never return data or ancillary data from
   more than one segment.

Now, I believe that I am doing this exactly correct, particularly
considering stream id is "ancillary data", and a fragment is a "segment".

What you propose deviates quite widely from these passages.

--brian

-- 
Brian F. G. Bidulock    ¦ The reasonable man adapts himself to the ¦
bidulock <at> openss7.org    ¦ world; the unreasonable one persists in  ¦
http://www.openss7.org/ ¦ trying  to adapt the  world  to himself. ¦
                        ¦ Therefore  all  progress  depends on the ¦
                        ¦ unreasonable man. -- George Bernard Shaw ¦
Brian F. G. Bidulock | 1 Dec 2005 07:41

Re: SCTP Partial delivery / SOCK_SEQPACKET ambiguity (corrected)

Mark,

While you're reading, read down in chapter 2 under Socket Receive Queue:

    Socket Receive Queue

   A  socket has a receive queue that buffers data when it is received by
   the  system  until  it  is removed by a receive call. Depending on the
   type  of  the socket and the communication provider, the receive queue
   may  also  contain  ancillary  data  such  as the addressing and other
   protocol  data  associated  with the normal data in the queue, and may
   contain  out-of-band  or  expedited  data. The limit on the queue size
   includes  any normal, out-of-band data, datagram source addresses, and
   ancillary  data  in the queue. The description in this section applies
   to  all  sockets,  even though some elements cannot be present in some
   instances.

   The  contents of a receive buffer are logically structured as a series
   of data segments with associated ancillary data and other information.
   A  data segment may contain normal data or out-of-band data, but never
   both.  A  data  segment may complete a record if the protocol supports
   records  (always  true  for  types  SOCK_SEQPACKET  and SOCK_DGRAM). A
   record  may  be  stored  as more than one segment; the complete record
   might never be present in the receive buffer at one time, as a portion
   might  already  have  been  returned  to  the application, and another
   portion  might  not  yet  have  been  received from the communications
   provider. A data segment may contain ancillary protocol data, which is
   logically  associated  with the segment. Ancillary data is received as
   if  it  were  queued  along  with  the  first normal data octet in the
   segment  (if  any). A segment may contain ancillary data only, with no
   normal  or  out-of-band  data.  For  the  purposes  of this section, a
   datagram  is considered to be a data segment that terminates a record,
   and  that  includes  a  source  address as a special type of ancillary
   data.  Data segments are placed into the queue as data is delivered to
   the socket by the protocol. Normal data segments are placed at the end
   of the queue as they are delivered. If a new segment contains the same
   type  of data as the preceding segment and includes no ancillary data,
   and if the preceding segment does not terminate a record, the segments
   are logically merged into a single segment.

   The receive queue is logically terminated if an end-of-file indication
   has been received or a connection has been terminated. A segment shall
   be  considered  to  be terminated if another segment follows it in the
   queue,  if  the  segment  completes  a record, or if an end-of-file or
   other  connection  termination  has been reported. The last segment in
   the  receive queue shall also be considered to be terminated while the
   socket has a pending error to be reported.

   A  receive  operation  shall  never return data or ancillary data from
   more than one segment.

If you realize that a message fragment is a "segment" and stream id is
"ancillary data", then this is precisely how I have implemented the
sockets interface: adjacent fragments are merged and delivered together,
fragments with different stream id are delivered separate, fragments
are delivered in order of arrival (at the buffer, after gap logic), all
fragments do not have to be received to deliver the first ones, MSG_EOR
is set on the last segment of a record.
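
The delivery rules described above can be sketched as a toy model. This is only an illustration of the reading given in this message (fragment = "segment", stream id = "ancillary data"), not code from any SCTP stack. The names ReceiveQueue, Segment, deliver and receive are hypothetical, and the MSG_EOR value is a stand-in used only inside the model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MSG_EOR = 0x80  # stand-in flag value for this model only (assumption)


@dataclass
class Segment:
    stream_id: int     # treated as ancillary data, per the reading above
    data: bytes
    ends_record: bool  # True if this segment completes a record


class ReceiveQueue:
    """Toy model of a receive queue in the style of POSIX XSH 2.10.11."""

    def __init__(self) -> None:
        self.segments: List[Segment] = []

    def deliver(self, stream_id: int, data: bytes, last_fragment: bool) -> None:
        """Queue a fragment in arrival order, merging adjacent fragments."""
        tail = self.segments[-1] if self.segments else None
        if tail is not None and not tail.ends_record and tail.stream_id == stream_id:
            # Same stream and the record is still open: merge into the tail.
            tail.data += data
            tail.ends_record = last_fragment
        else:
            # Different stream id, or the previous record ended: new segment.
            self.segments.append(Segment(stream_id, data, last_fragment))

    def receive(self, bufsize: int) -> Optional[Tuple[int, bytes, int]]:
        """Return (stream_id, data, flags) taken from at most one segment."""
        if not self.segments:
            return None
        head = self.segments[0]
        chunk, rest = head.data[:bufsize], head.data[bufsize:]
        if rest:
            # Buffer too small: keep the remainder queued, nothing is discarded.
            head.data = rest
            return head.stream_id, chunk, 0
        self.segments.pop(0)
        return head.stream_id, chunk, MSG_EOR if head.ends_record else 0
```

In this model a reader sees fragments of different streams interleaved, never receives data from two segments in one call, and sees MSG_EOR only on the read that finishes a record.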

I don't see why this could not apply to 1:many also, particularly if
you consider association id as "ancillary data".

--brian

Mark Butler | 1 Dec 2005 13:04

Re: SOCK_SEQPACKET semantics

Brian,

> Well, no.  POSIX SOCK_SEQPACKET sets MSG_TRUNC if the message (not the
> record) is larger than the supplied buffer, just as does SOCK_DGRAM.

After further checking, it seems the SOCK_SEQPACKET world is split three 
ways:

Packet oriented protocols generally use packet level reads with 
truncation / discard semantics and no MSG_EOR.  X.25, Bluetooth, IRDA, 
and Unix domain sockets use SOCK_SEQPACKET this way.

Record oriented protocols generally use byte stream reads and MSG_EOR - 
no packet level visibility, no truncation / discard.  DECNet and ISO TP 
use SOCK_SEQPACKET that way.

Packet / record hybrids generally use SOCK_SEQPACKET with truncation / 
discard semantics on the packet level, and record terminating packets 
marked with MSG_EOR.  SPX and XNS SPP use SOCK_SEQPACKET this way.
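
The first (packet-oriented) style can be observed with a Unix domain socket pair. The sketch below assumes a Linux host, where AF_UNIX supports SOCK_SEQPACKET; the function name truncation_demo is hypothetical, and other platforms may behave differently or not support this socket type at all.

```python
import socket


def truncation_demo(bufsize: int = 4):
    """Send two packets over an AF_UNIX SOCK_SEQPACKET pair and read the
    first one with a buffer that is too small."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_SEQPACKET)
    try:
        a.send(b"first-packet")
        a.send(b"second")
        # Short read: with truncation / discard semantics the excess bytes
        # of this packet are dropped and MSG_TRUNC is reported in the flags.
        data1, _anc1, flags1, _addr1 = b.recvmsg(bufsize)
        # The next read starts at the next packet, not at the dropped tail.
        data2, _anc2, flags2, _addr2 = b.recvmsg(64)
        return (
            (data1, bool(flags1 & socket.MSG_TRUNC)),
            (data2, bool(flags2 & socket.MSG_TRUNC)),
        )
    finally:
        a.close()
        b.close()


if __name__ == "__main__":
    print(truncation_demo())
```

A record-oriented protocol in the second style would instead return the remainder of "first-packet" on the second read and never set MSG_TRUNC.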

- Mark
Mark Butler | 1 Dec 2005 07:21

Re: SCTP and Multicast

(Replying to a previous discussion)

Liangping,

SCTP could support reliable, sequenced multicast in one of two ways.  The easier way is to leave the protocol alone, and add an interface that retransmits messages to a set of connected endpoints.  The draft SCTP socket api, as it stands, allows an application to keep track of who is currently attached to a one-to-many socket.  Then whenever you want to multicast a message, just scan the list and use sctp_sendmsg() or sendto() to send a message to each  connected endpoint. 
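
A minimal sketch of that "easier way", assuming the application already keeps its own set of connected peers. The helper name fan_out and the injected send callable are hypothetical; in a real program the callable would wrap sctp_sendmsg() or sendto() on the one-to-many socket.

```python
from typing import Callable, Hashable, Iterable, List


def fan_out(message: bytes,
            peers: Iterable[Hashable],
            send: Callable[[Hashable, bytes], None]) -> List[Hashable]:
    """Send one message to every tracked peer; return the peers that failed."""
    failed: List[Hashable] = []
    for peer in list(peers):  # copy: the tracked set may change meanwhile
        try:
            send(peer, message)
        except OSError:
            # Leave the decision (retry, drop the peer) to the caller.
            failed.append(peer)
    return failed
```

Each peer still gets its own reliable, sequenced copy, so this costs one transmission per endpoint; that is the bandwidth trade-off the protocol extension below tries to avoid.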

If bandwidth efficiency is the primary concern, then the SCTP protocol could be extended to support IP multicast  as follows:
Clients establish associations with multicast server endpoints, with an INIT parameter indicating multicast capability. The server endpoint confirms multicast capability with an INIT ACK that contains a parameter indicating the multicast address it uses, and an initial multicast TSN. 

The client would then attempt a layer 3 multicast join, using IGMP as necessary, and report presumptive reachability in the COOKIE_ACK response.

The server would use different stream numbers for unicast and multicast streams. A message on a multicast stream would be sent over IP multicast to all IP multicast reachable clients, and over IP unicast to all the others.  Multicast data would be delivered in MDATA chunks that contain the sender's current multicast TSN.  MDATA chunks would otherwise behave like regular DATA chunks.

If multicast data chunks are missed, each receiver would individually request retransmission.  From time to time, each receiver would MSACK what it has received, and the sender would update the multicast cumulative TSN.  If a presumptively multicast capable client reports a considerable series of multicast messages as missing, the server would change that client's layer 3 multicast status to disabled, and transmit multicast messages to that client using unicast instead.
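
The receiver-side bookkeeping for such a scheme could look roughly like this. MDATA and MSACK are the hypothetical chunk types proposed above, the class and method names are invented for illustration, and the sketch ignores TSN wraparound (real TSNs are 32-bit serial numbers).

```python
from typing import List, Set, Tuple


class MulticastReceiver:
    """Tracks which multicast TSNs have arrived (illustrative model only)."""

    def __init__(self, initial_tsn: int) -> None:
        # Highest TSN received with no gaps before it.
        self.cumulative = initial_tsn - 1
        self.out_of_order: Set[int] = set()

    def on_mdata(self, tsn: int) -> None:
        """Record the arrival of one MDATA chunk."""
        if tsn <= self.cumulative:
            return  # duplicate
        self.out_of_order.add(tsn)
        # Advance the cumulative point across any now-contiguous TSNs.
        while self.cumulative + 1 in self.out_of_order:
            self.cumulative += 1
            self.out_of_order.remove(self.cumulative)

    def missing(self) -> List[int]:
        """TSNs this receiver would individually ask to have retransmitted."""
        if not self.out_of_order:
            return []
        top = max(self.out_of_order)
        return [t for t in range(self.cumulative + 1, top)
                if t not in self.out_of_order]

    def msack(self) -> Tuple[int, List[int]]:
        """Contents of a periodic MSACK: cumulative TSN plus extra TSNs held."""
        return self.cumulative, sorted(self.out_of_order)
```

A sender could take the minimum cumulative TSN over all receivers as its multicast cumulative TSN, and use a long missing() list from one client as the signal to fall back to unicast for that client.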

Interoperability with clients running existing SCTP stacks could be achieved by using regular DATA chunks for those clients.
Ordering between multicast streams and unicast streams would not be guaranteed. This could be fixed by adding a TSN and stream sequence number bearing BARRIER chunk that would specify an inter-stream ordering dependency.  This would be necessary for use of multicast in some applications, notably database clusters, filesystems, and cache coherency protocols. 

Peer-to-peer multicast is a straightforward refinement.

Multicast SCTP would be an interesting project. I imagine that a reliable multicast layer on top of hardware RDMA would be more efficient for HPC applications, however, given adequate bandwidth.  Out of curiosity, what did you have in mind?

 - Mark
Mark Butler | 2 Dec 2005 03:26

Re: POSIX "segments" and "records"

Brian,

> <POSIX section 2.10.11 omitted>
>
> Now, I believe that I am doing this exactly correct, particularly
> considering stream id is "ancillary data", and a fragment is a "segment".

I agree that what you are doing is in full compliance with the letter of 
POSIX  XSH sections 2.10.6 and 2.10.11 as currently specified.  However, 
whether some fragment is a "segment", a portion of a "segment", or 
even a series of "segments" is arbitrary as far as POSIX is concerned.   
The mapping of lower level entities to POSIX "segments" varies by both 
protocol and socket type.  Anything can be presented to the socket layer 
as a "segment" as long as the relevant rules are followed. Whether such 
a presentation is useful is a different issue.

A POSIX "segment" on a SOCK_STREAM socket normally extends the length of 
the whole connection.  A SOCK_SEQPACKET "segment" typically extends for 
either one packet (SPX, Unix domain, X.25, etc.) or one arbitrary length 
"record" (DECnet, ISO TP).  Many of these protocols support both 
SOCK_STREAM and SOCK_SEQPACKET semantics over identical wire protocols.

> What you propose deviates quite widely from these passages.

All I am proposing is a SOCK_DGRAM socket type (or equivalent mode) for 
SCTP.  There is absolutely nothing  in POSIX that prevents an SCTP 
message from being mapped to a SOCK_DGRAM "segment".  Nothing in POSIX 
prohibits interleaved SAR below the socket layer.  Otherwise the socket 
interfaces for IP, ATM AAL5, and TCP would all be in violation.

- Mark
Brian F. G. Bidulock | 2 Dec 2005 04:50

Re: POSIX "segments" and "records"

Mark,

On Thu, 01 Dec 2005, Mark Butler wrote:

> 
> All I am proposing is a SOCK_DGRAM socket type (or equivalent mode) for 
> SCTP.  There is absolutely nothing  in POSIX that prevents an SCTP 
> message from being mapped to a SOCK_DGRAM "segment".  Nothing in POSIX 
> prohibits interleaved SAR below the socket layer.  Otherwise the socket 
> interfaces for IP, ATM AAL5, and TCP would all be in violation.
> 
> - Mark

Yet, it is that very approach that causes the blocking problems that
you wish to avoid.

--brian

Mark Butler | 2 Dec 2005 06:25

SCTP Partial delivery vulnerability example / solutions


  Here is an example that demonstrates the vulnerability of SCTP partial 
delivery on a one-to-many socket (as in LKSCTP, no stream 
interleaving).  Suppose you decided to adapt SMTP to SCTP.  SMTP is a 
message oriented protocol where reasonably sized messages predominate, 
so you decide to map each SMTP message to one SCTP message, 
with a few other control messages.

Then you have an implementation choice - you can either do  a 
connect/accept/fork style server using one-to-one sockets or a simpler 
design that uses one or more worker threads to read messages from a 
shared one-to-many socket.  You choose the latter, set a large socket 
buffer size, and start processing messages.

Occasionally, a much longer than usual message overflows the buffer, 
causing a partial delivery of the first portion of the message. This 
causes the stack to lock the socket against further reads from any other 
association (pd_mode = 1 in lksctp).  Normally the rest of the message 
arrives in a fraction of a second, and no one notices.  However, on 
occasion the SMTP sender crashes or has its network connection fail 
while a partial delivery is pending.  In that situation, other clients 
can establish new associations with the server, but no messages can be 
delivered until the stack times out the failed association, and aborts 
the partial delivery, a process that probably takes about thirty 
seconds.   Thirty second delays may be fine for MTA - MTA transfers, but 
are not acceptable for MUA - MTA transfers.  Most mail user agents time 
out after about five seconds.
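
The stall can be reproduced with a toy model of a shared socket that holds a partial-delivery lock. This is a simplified illustration of the behavior described above, not lksctp code; the class OneToManySocket and its methods are hypothetical, and pd_assoc only loosely mirrors the pd_mode flag mentioned earlier.

```python
from collections import deque
from typing import Deque, Hashable, Optional, Tuple

Item = Tuple[Hashable, bytes, bool]  # (association id, data, is last fragment)


class OneToManySocket:
    """Toy model: one shared receive queue with a partial-delivery lock."""

    def __init__(self) -> None:
        self.queue: Deque[Item] = deque()
        self.pd_assoc: Optional[Hashable] = None  # holder of the lock, if any

    def enqueue(self, assoc_id: Hashable, data: bytes, last: bool = True) -> None:
        self.queue.append((assoc_id, data, last))

    def recv(self) -> Optional[Item]:
        """Return the next deliverable item, or None if the socket is stalled."""
        for i, (assoc, data, last) in enumerate(list(self.queue)):
            if self.pd_assoc is not None and assoc != self.pd_assoc:
                continue  # locked: data from every other association waits
            del self.queue[i]
            # A non-final fragment takes (or keeps) the lock; a final one frees it.
            self.pd_assoc = None if last else assoc
            return assoc, data, last
        return None

    def abort(self, assoc_id: Hashable) -> None:
        """The association timed out: drop its data and release the lock."""
        self.queue = deque(item for item in self.queue if item[0] != assoc_id)
        if self.pd_assoc == assoc_id:
            self.pd_assoc = None
```

If association A delivers the first part of a large message and then goes silent, recv() returns None for everyone until abort("A") runs, even though a complete message from association B is already queued.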

One solution might be to use smaller abort timeouts.  However, anything 
reasonable will either abort slow clients unnecessarily (wasting 
resources and multiplying the number of in-doubt message transactions) 
or greatly increase delivery latency and multiply buffer space 
requirements on a busy MTA.    And creating a number of associations 
that stall mid message on purpose would make for a very effective denial 
of service attack on a public MTA with a shared socket.

Of course SMTP is an extraordinarily relaxed environment compared to 
something like an LDAP server.  A typical LDAP server might have 
thousands of clients and low millisecond response time requirements.  
With a single one-to-many server socket the stalling problem becomes 
much more serious.

LDAP has relatively low request lengths, and so SOCK_DGRAM code over a 
reliable transport like SCTP would be trivial.  Even UDP with client 
retransmission timers (e.g. DNS) is easier to support than the 
extraneous code required to operate a one-to-many socket server reliably 
with the api as it stands.

Granted, the SOCK_DGRAM style only works with ULPs that allow restricted 
message lengths, and in addition the stack has to truncate or discard 
messages beyond some specified threshold, but the atomic message 
operations make things much easier where those requirements are 
acceptable, a domain that covers a very broad span of upper layer protocols.

Latency sensitive applications that need to support very large messages 
from multiple clients must use listen()/accept() and some form of thread 
dispatch.  sctp_peeloff() almost works, but would do nothing to prevent 
DoS attacks without message length limitations on the primary socket. 

Here are four ways to avoid the problem with one-to-many sockets:

 1) An option (or socket type) to allow applications to get SOCK_DGRAM 
semantics where restricted message lengths are acceptable. (A new socket 
type might be required to get different semantics on BSD kernels.)

 2) An option to allow applications to reassemble stream fragments 
(where necessary) at the application layer, instead of blocking across 
associations. (SCTP_STREAM_FRAGMENT).

 3) Write server code to use one-to-one sockets instead.

 4)  Set the association abort timeouts low enough to mitigate the 
stalling problem, and do the necessary event handling in the 
application to avoid silently gluing unrelated messages together. 
(Perhaps a final zero length fragment tagged with MSG_TRUNC and MSG_EOR 
would solve the latter problem).
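
Option 2 moves reassembly into the application. A rough sketch of what that could look like follows, assuming the stack hands up each fragment with its association id, stream id and an end-of-record indication. The Reassembler class, its method names and the size limit are all hypothetical.

```python
from typing import Dict, Optional, Tuple


class Reassembler:
    """Per-(association, stream) message reassembly at the application layer."""

    def __init__(self, max_message: int = 1 << 20) -> None:
        self.partial: Dict[Tuple[int, int], bytearray] = {}
        self.max_message = max_message  # cap per-message buffering

    def on_fragment(self, assoc_id: int, stream_id: int,
                    data: bytes, eor: bool) -> Optional[bytes]:
        """Add one fragment; return the whole message once it is complete."""
        key = (assoc_id, stream_id)
        buf = self.partial.setdefault(key, bytearray())
        buf += data
        if len(buf) > self.max_message:
            # Refuse to buffer without bound for a slow or hostile sender.
            del self.partial[key]
            raise ValueError("message exceeds configured limit")
        if not eor:
            return None
        del self.partial[key]
        return bytes(buf)

    def on_abort(self, assoc_id: int) -> None:
        """Drop half-built messages so they are never glued to later data."""
        for key in [k for k in self.partial if k[0] == assoc_id]:
            del self.partial[key]
```

Because every association keeps its own buffer, a stalled sender only ties up its own partial message; complete messages from other associations continue to flow.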

I believe SCTP has greater potential as a next generation UDP than a 
next generation TCP, and supporting the SOCK_DGRAM programming model 
would go a great distance toward achieving that goal.

 - Mark
