SLD: S-Link Driver for Linux.

Performance

Test setup

All tests were run on the same target system: a Pentium III at 700 MHz with 128 MB of RAM running an almost standard Red Hat 6.2 distribution. It is non-standard in two respects: the kernel has been recompiled (which is always smart), and the kgdb patch from SGI has been added to allow remote target debugging when necessary. A serial console was also used to capture output and save it in case of a system crash.

Initial tests were executed using a Slidas LDC, which allowed quite a bit of testing. However, for the benchmarks below a SharcLink was used to generate data, as it allows a configurable event size. The S-Link used is a simple parallel connection which runs at 40 MHz. At this speed it is capable of delivering more data than the PCI bus can handle, so the S-Link will not be the limiting factor in the tests.
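As a rough sanity check of that claim, the theoretical peak rates can be compared. The sketch below assumes a 32-bit wide S-Link data path and a 32-bit, 33 MHz PCI bus; neither bus width nor the PCI clock is stated above, so treat them as assumptions. These are peak figures without any protocol overhead, not measurements.

```python
# Theoretical peak bandwidth comparison (assumed bus parameters, not measurements).
SLINK_CLOCK_HZ = 40_000_000   # from the text: the S-Link runs at 40 MHz
SLINK_WIDTH_BYTES = 4         # assumption: 32-bit data path
PCI_CLOCK_HZ = 33_000_000     # assumption: 33 MHz PCI
PCI_WIDTH_BYTES = 4           # assumption: 32-bit PCI


def peak_bytes_per_second(clock_hz, width_bytes):
    """One transfer of width_bytes per clock cycle, no protocol overhead."""
    return clock_hz * width_bytes


slink_peak = peak_bytes_per_second(SLINK_CLOCK_HZ, SLINK_WIDTH_BYTES)
pci_peak = peak_bytes_per_second(PCI_CLOCK_HZ, PCI_WIDTH_BYTES)

print(f"S-Link peak: {slink_peak / 1e6:.0f} MB/s")
print(f"PCI peak:    {pci_peak / 1e6:.0f} MB/s")
print("S-Link can outrun PCI:", slink_peak > pci_peak)
```

Under these assumptions the link can deliver more than the bus can accept, which is why the PCI side is the one to watch in the numbers below.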

The normal configuration of the driver is to use 100 buffers of 4KB (PAGE_SIZE on this system). The benchmarks started with version 1.0.2 of the driver.

Benchmarks

As this project is only about writing the driver, there is not much sense in doing a lot of benchmarks, since these must be done in the real-world environment where the driver is intended to be used. Without the load of the rest of the project, it is difficult to say something useful about the performance of the driver. So there is only one real measurement to be done: what is the maximum transfer speed while throwing the data away as fast as possible? This actually ended up being three measurements: 1. the absolute maximum speed (raw speed), 2. the raw speed with `vmstat 5' running at the same time to monitor the CPU load and 3. the raw speed with `vmstat 5' running and also checking the data that came in on the interface. The last test should give an indication of the load this puts on the CPU, or how many CPU cycles are available to process the data from the S-Link.

One other interesting item to investigate is the behaviour of the driver around the point where it needs to do multi-DMA. Obviously performance will decrease at first, since another interrupt must be handled while it brings (almost) no extra data.

Throughput


The first graph shows the throughput in the three described situations. The lines are so close together that it is hard (at the resolution used) to see the differences. Don't worry about the blue line being straight from event size 2000 to 6000; I just skipped those measurements.

How close the graphs actually are is easily seen when the values at event size 10000 are compared: 82490KB/s raw speed, 82227KB/s with vmstat added and 81898KB/s when also checking for errors. So that's just 592KB/s difference between the fastest and the slowest, or 0.72%.
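The arithmetic behind those two figures, written out as a small check. The three values are the ones quoted above; the variable names are only for illustration.

```python
# Throughput at event size 10000, in KB/s, as quoted in the text.
raw_speed = 82490
with_vmstat = 82227
with_checking = 81898

# Absolute and relative cost of running vmstat plus checking the data.
difference = raw_speed - with_checking
relative = 100.0 * difference / raw_speed

print(f"difference: {difference} KB/s")
print(f"relative:   {relative:.2f}%")
```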

CPU load

Getting useful values using vmstat turned out to be difficult. It doesn't have a cumulative option and the numbers jump up and down quite a bit. The second figure shows, for event sizes up to 500, how the CPU load (system, idle, user) interacts with the throughput and the number of interrupts per second for the third set of measurements (so that's with error checking). Unfortunately, vmstat could not produce numbers for the smallest event size, so those lines only start at size 50.
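One way to tame the jumping numbers is to capture the `vmstat 5' output to a file and average the samples afterwards. The following is a minimal sketch of that idea, not the procedure used for the measurements here. The function name is hypothetical, the column names are taken from the header line vmstat prints, and the sample text is made up purely to show the format.

```python
def average_vmstat(text, columns=("us", "sy", "id", "in", "cs"), skip_first=True):
    """Average the named columns over the numeric sample rows of vmstat output.

    The first sample vmstat prints covers the time since boot, so it is
    skipped by default.
    """
    names = None
    sums = {c: 0 for c in columns}
    seen = 0    # numeric rows encountered
    used = 0    # numeric rows included in the average
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "us" in fields and "sy" in fields and "id" in fields:
            names = fields              # column-name header (vmstat repeats it)
            continue
        if names is None or len(fields) != len(names):
            continue                    # group header or malformed line
        try:
            values = [int(f) for f in fields]
        except ValueError:
            continue
        seen += 1
        if skip_first and seen == 1:
            continue
        row = dict(zip(names, values))
        for c in columns:
            sums[c] += row[c]
        used += 1
    if used == 0:
        raise ValueError("no vmstat samples found")
    return {c: sums[c] / used for c in columns}


# Made-up sample in vmstat's layout, for illustration only (not measured data).
SAMPLE = """\
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0  90000   5000  20000   0   0     0     0  200    50   5  10  85
 1  0  0      0  90000   5000  20000   0   0     0     0 9000   300  10  70  20
 1  0  0      0  90000   5000  20000   0   0     0     0 9200   320  12  72  16
"""

averages = average_vmstat(SAMPLE)
print(averages)
```

Looking the columns up by header name rather than by position keeps the sketch independent of the exact column layout of a particular vmstat version.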


The interrupt line is interesting for the first few sizes: apparently the interrupts come in at such a rate that Linux doesn't get the opportunity to schedule any processes at all. The CPU load lines also point to this, with the system % starting high, in the 70s.

Around event sizes 100 to 200 something else interesting happened that cannot be seen in the graph, but that I noticed during the measurements: the idle % is zero and the number of context switches is really low at size 100 and goes up a bit when events get bigger. Obviously the application has problems processing all the data from the S-Link. For larger event sizes (although not shown) the idle % keeps hovering between 20 and 30%.

One experiment was to increase the number of buffers from 100 to 1000, but the effect was limited to the very smallest events. For event size 5, throughput increased from 418 to 1070KB/s, but event size 50 and larger would only show a marginal increase in throughput that might just as well have been a measurement error. However, if one expects a lot of small events, having more buffers will help to keep things flowing.

Multi-DMA

The multi-DMA tests were done using the same hardware, but a newer software version. I realized that there was one tiny bug left in the multi-DMA handling when there were just 1, 2 or 3 words more than a buffer full of data. Fixing that resulted in version 1.0.3.

The third and last graph shows how the interrupts and throughput behave around the first multi-DMA point, or around 1024 words of data.


The beginning and end of the two graphs are understandable: performance increases slowly as more data is coming in because the overhead of the driver is spread over more words of data. And the rise in the number of interrupts was also expected: the first DMA is finished because the buffer is full and a new one must be started to get the rest. This is the cause of an extra interrupt per extra buffer needed.
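The "overhead spread over more words" effect can be illustrated with a simple cost model: a fixed cost per event plus a cost per word. This is only a sketch. Both timing constants are hypothetical values chosen for illustration, not measurements of this driver, and the 4-byte word size is an assumption.

```python
WORD_BYTES = 4               # assumption: 32-bit words
FIXED_OVERHEAD_S = 20e-6     # hypothetical fixed driver cost per event
PER_WORD_S = 50e-9           # hypothetical transfer cost per word


def throughput_bytes_per_second(event_words):
    """Bytes per second when every event costs a fixed part plus a per-word part."""
    event_time = FIXED_OVERHEAD_S + event_words * PER_WORD_S
    return event_words * WORD_BYTES / event_time


# Throughput rises with event size and levels off towards WORD_BYTES / PER_WORD_S.
for words in (5, 50, 500, 1000):
    print(words, round(throughput_bytes_per_second(words) / 1000), "KB/s")
```

In this model the throughput keeps rising with event size but never exceeds the per-word limit, which matches the slow increase at both ends of the graph.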

However, one would expect the jump in number of interrupts and throughput to happen at 1024 words, when the first buffer is full. Well, that doesn't happen because the AMCC chip has an 8-word FIFO. So the S-Link add-on hardware can put 8 more words into the AMCC after the buffer is full (and will do that before we get to handle the interrupt). So actually the jump in the graph should be just past 1024 + 8 = 1032 words. And so it is: the largest number of interrupts is at an event size of 1033 words.

There is also quite a drop in performance and interrupts right before the jump at 1033. What happens is that an interrupt comes in, but there is still data in the AMCC FIFO. So the interrupt routine has to take this data from the FIFO before it can act on the interrupt. And because reacting to the interrupt takes more time than it takes the hardware to fill the AMCC FIFO, the S-Link change in data type is already present. This causes the interrupt routine not to start a DMA transfer for this last data, but rather to read it out word-for-word. This is a lot slower than a running DMA, but (probably) faster than starting a DMA for these few words.
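The three regimes around the first buffer boundary can be summarized in a few lines. This is a simplified model derived from the description above, not code from the driver. The function name is hypothetical, and it only covers event sizes near the first boundary.

```python
BUFFER_WORDS = 1024   # one 4KB buffer holds 1024 words, as in the text
FIFO_WORDS = 8        # depth of the AMCC FIFO, as in the text


def first_boundary_case(event_words):
    """Classify how an event near the first buffer boundary is transferred."""
    if event_words <= BUFFER_WORDS:
        return "single DMA"
    if event_words <= BUFFER_WORDS + FIFO_WORDS:
        # The tail already sits in the FIFO when the interrupt is handled,
        # so it is read out word-for-word instead of starting a new DMA.
        return "single DMA, tail read from FIFO"
    return "multi-DMA"


for words in (1024, 1025, 1032, 1033):
    print(words, first_boundary_case(words))
```

The middle case corresponds to the dip just before 1033 words, and the last case to the jump in the number of interrupts.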


© 2000 by Jan Evert van Grootheest for Nikhef. Information : Ruud van Wijk