Your analysis is very cool, and I appreciate the insight,
especially as to the effect of increased latencies depending on the
application. I'm not discouraged yet. There are pretty much two
simple goals in this application. One is to save as
many host processor cycles as possible, and the other is to
achieve line rate as measured by a ~1500-byte streaming
uC/OS-II looks nice, and it's priced right.
Unfortunately, in this case our coprocessor has only 384kB of RAM, most of
which is already allocated for large drivers and network buffers, so
I've basically been prohibited from incorporating threads. I'm going to
have to implement the coprocessor side as an interrupt-driven state
The latest plan is to run just the raw API on the
coprocessor, and write new sockets interface client/server layers that can
operate over a true message passing boundary. As you say, hopefully I can
model critical portions of this before committing to implement the whole
With the coprocessor being so much slower than the host, I'm really
concerned about the overall effect upon latencies, and perhaps even
bandwidth. You could end up reducing TCP/IP performance by adding
coprocessor functionality. I would again urge you to look at the fraction
of time your host is spending in the TCP/IP stack, if at all possible. If
you are bound by stack performance, that may devolve to determining the amount
of time you are spending in the kernel as opposed to your app(s). If that
fraction is small, then it may not be worth your while to try to reduce
it. For example, if you are spending 90% of your time in your app and 10%
of your time in TCP/IP, then cutting the TCP/IP time in half would only net you
about a 5% reduction in total run time.
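To put a number on the 90%/10% example, here is a minimal sketch in C of the usual Amdahl's-law arithmetic. The function name overall_speedup and its parameters are made up for illustration and do not come from any library.

```c
/* Hypothetical helper: overall speedup when a fraction `frac` of the
 * total run time is made `factor` times faster and the remainder is
 * left unchanged (Amdahl's law). */
double overall_speedup(double frac, double factor)
{
    return 1.0 / ((1.0 - frac) + frac / factor);
}
```

Calling overall_speedup(0.10, 2.0) gives 1 / 0.95, or roughly 1.05, so halving a 10% TCP/IP share buys only about a 5% gain overall. The same call with a measured fraction would show quickly whether the offload is worth pursuing.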
If your protocol is heavily
acknowledged and you find yourself bound by the performance of
the protocol, any additions to latencies will end up making you slower, not
faster. All that is speculation on my part, of course. You could be
compute bound with a streaming TCP/IP output, in which case additions to
latencies wouldn't have any effect at all.
As for the RTOS question, you
can find some surprisingly small ones. We have used uC/OS-II without being
horrified by its size. Depending on the CPU you are using in the
coprocessor, you may find that you have some pretty good options.
I believe that it would probably be easiest to put on a top layer as you describe,
but I also think that it would be feasible to transport the messages to the tcp
thread as you originally described. As you note, there are some
difficulties, and it is possible that the message contents will have to be
augmented to deal with some of the existing data references. In either
event, you will almost certainly find yourself tinkering with the stack in one
way or another. The good news is that with a small open source project
like this, it is definitely feasible to do this. The bad news is that it
can still be a fair amount of work.
I'm really a bit conflicted about
this. On the one hand, it does sound like a really interesting thing to do
technically. On the other, it may actually end up costing you in system
performance. I hope you'll be able to make a good analysis of the likely
outcome before you commit yourself.
Curt McDowell wrote:
Thanks for the input, Jim.
As for the performance improvement, that's a very significant question.
First, I think that it is important to ask what kind of performance
improvement you seek. If you are just seeking to offload the host, so
that it can go on to do some other task faster, then you stand a reasonable
chance of seeing that happen. If you are ultimately seeking to increase
TCP/IP throughput, that will be a more difficult
In our case, the host processor would be about 4
times as powerful as the coprocessor. The coprocessor has
some spare cycles, and it'll be there regardless of whether it
ends up doing TOE. The goal is simply to reduce CPU
consumption on the host processor with no reduction in throughput.
The MAC has no checksum acceleration, so that's actually one of the
most important things to off-load.
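For a sense of the per-byte work involved, here is a minimal sketch in C of the 16-bit one's-complement checksum used by IP, TCP and UDP (RFC 1071). The name inet_checksum is an assumption for illustration; this is not lwIP's own routine, and it is written for clarity rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative one's-complement checksum over `len` bytes (RFC 1071).
 * Intended for packet-sized buffers; the 32-bit accumulator is not
 * guarded against overflow on very large inputs. */
uint16_t inet_checksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    /* Add the data as big-endian 16-bit words. */
    while (len > 1) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    /* An odd trailing byte is treated as if padded with a zero byte. */
    if (len == 1) {
        sum += (uint32_t)p[0] << 8;
    }
    /* Fold the carries back into the low 16 bits. */
    while (sum >> 16) {
        sum = (sum & 0xFFFFu) + (sum >> 16);
    }
    return (uint16_t)~sum;
}
```

Every payload byte has to be read once for this sum, which is why it is a natural candidate to move off the host when the MAC cannot do it.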
> I feel that your assessment of feasibility is
sound and that your list of problems and their resolution is reasonably
complete. Something always shows up in implementation, and I'm sure that
your project will be no exception, but I do think that your design is
finding that splitting the modules in the manner depicted is not so easy
after all. E.g., for efficiency reasons the top layer routine
netconn_write() calls tcp_sndbuf(), which peeks in the
bottom layer data structure. It's tempting to just add a
top layer to RPC the whole sockets API (but unfortunately,
the tiny RTOS on the TOE processor would then need to support
Curt McDowell wrote:
I'm looking into using lwIP as the basis for
a TOE (TCP/IP offload engine). If I understand correctly, the lwIP
environment is implemented as one thread for the IP stack, and one thread
for each application:
<-> Sockets <-> API-mux <------------> API-demux <-> Stack <-> netif
This architecture appears to
lend itself fairly well to the following TOE implementation (actually, SOE,
as it would be a full sockets offload):
                                               ADAPTER W/ EMBEDDED CPU
| App using   |---| lwIP library |------------| lwIP  |---| Network  |--->
| sockets API |   | Sockets API  |  Hardware  | stack |   | hardware |
- Does this assessment sound correct?
- Could a significant
performance improvement be realized, compared to using a
host-native IP stack?
- Is anyone else interested in this type of
The only problems that I see are with the mbox transport
mechanism, in that it assumes a shared address space.
- It would need to send the data, instead of pointers
to the data.
- It would need to send messages for event
notifications instead of using callbacks.
- Message reception on either
side of the hardware bus would be signaled through
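The first point in the list above (send the data, not pointers to it) can be sketched in C as follows. Everything here is hypothetical: the toe_msg_hdr layout, the function names and the field sizes are assumptions for illustration, not part of lwIP, and the sketch assumes both processors share the same byte order.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical wire header; field names and sizes are assumptions. */
struct toe_msg_hdr {
    uint16_t type;  /* e.g. a write request or an event notification */
    uint16_t len;   /* number of payload bytes following the header */
};

/* Copy header and payload into `out`, so that no pointer crosses the
 * bus. Returns the total bytes written, or 0 if `out` is too small. */
size_t toe_msg_pack(uint8_t *out, size_t out_size, uint16_t type,
                    const void *payload, uint16_t len)
{
    struct toe_msg_hdr hdr = { type, len };

    if (out_size < sizeof hdr + len) {
        return 0;
    }
    memcpy(out, &hdr, sizeof hdr);
    if (len > 0) {
        memcpy(out + sizeof hdr, payload, len);
    }
    return sizeof hdr + len;
}

/* Receiving side: recover the header and locate the payload inside
 * `in`. Returns 0 on success, -1 if the buffer is truncated. */
int toe_msg_unpack(const uint8_t *in, size_t in_size,
                   struct toe_msg_hdr *hdr, const uint8_t **payload)
{
    if (in_size < sizeof *hdr) {
        return -1;
    }
    memcpy(hdr, in, sizeof *hdr);
    if (in_size - sizeof *hdr < hdr->len) {
        return -1;
    }
    *payload = in + sizeof *hdr;
    return 0;
}
```

An event notification would travel the same way, as a message with its own type value and a small payload, in place of the callback that the shared-address-space version relies on.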