S-LINK performance measurements in the PC environment

 
Nata Kruszynska,  Jos Vermeulen

NIKHEF, September 1998


Introduction:

At NIKHEF, Windows NT drivers for AMCC-based PCI S-LINK interfaces have been written. Using these drivers, the performance of an S-LINK connection over a ~2 m long SCSI cable was determined. We have also obtained from S. Luitz a memory-mapping driver for Linux for the same hardware, together with example applications that drive the interface [1]. We have used this software to measure the performance of PCs running the source application under Linux. This page presents the results of measurements with two types of PC, one with a PentiumPro (200 MHz) and one with a Pentium II (300 MHz) processor.
 

Timing of the S-link drivers:

Fig.1 Source driver for Windows NT

In Fig.1 the results for the directly mapped, DMA-based, interrupt-driven source driver on a PentiumPro (200 MHz) PC are shown. The total elapsed time is given in seconds, the rate in 1000 messages/sec and the bandwidth in MBytes/sec, as a function of the message size (in bytes).
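
The three quantities plotted in Fig.1 are related by simple arithmetic. The sketch below shows that relation; the function name, the choice of 1 MByte = 10^6 bytes and the example numbers are assumptions made for illustration and are not measured values.

```python
def rate_and_bandwidth(n_messages, message_bytes, elapsed_s):
    """Return (rate in 1000 messages/s, bandwidth in MBytes/s).

    Assumption: 1 MByte is taken as 10**6 bytes; the note does not say
    whether 10**6 or 2**20 was used for the plots.
    """
    rate_k = n_messages / elapsed_s / 1000.0
    bandwidth_mb = n_messages * message_bytes / elapsed_s / 1.0e6
    return rate_k, bandwidth_mb


# Illustrative input only (not a measurement):
# 100000 messages of 1024 bytes sent in 4.0 s.
rate_k, bandwidth_mb = rate_and_bandwidth(100000, 1024, 4.0)
print(f"{rate_k:.1f} k messages/s, {bandwidth_mb:.1f} MBytes/s")
# prints "25.0 k messages/s, 25.6 MBytes/s"
```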
 

Fig.2 Differences in the modelling of the NT driver

In Fig.2 the influence of different design strategies on the performance of the source driver is shown. The line marked with DMA represents the standard model of a directly mapped, interrupt-driven device driver. The poll driver does not use interrupts, but actively polls for completion of the DMA. In the poll-fixed model the poll driver was modified to map the user's buffer when it is first used, after which the buffer is reused. Results for the DMA-based, interrupt-driven destination driver are also shown. These measurements were done with a source sending messages of the same length as those received, and the throughput was limited by the source. We will see later that with the destination driver a larger throughput is obtained than with the source driver for the same message length.
 

Fig.3 Bandwidth comparisons

In Fig.3 the bandwidth as a function of the message length is shown for the DELL PentiumPro (200 MHz) and Pentium II (300 MHz) machines. The line labelled "Linux" represents results for the source application running under Linux on the PentiumPro machine. Under Linux the application communicates directly with the S-LINK interface, which is memory mapped by the device driver. The line labelled "Lin-300" represents the same for the Pentium II machine. NT results for the source driver on the Pentium II machine are labelled "NT-src300", those on the PentiumPro machine "NT-src", and results for the DMA-based, interrupt-driven NT destination driver on the PentiumPro machine "NT-dest". For the measurements with the NT destination driver, the Linux source application running on another PentiumPro PC was used. The results clearly show that with the NT destination driver a higher throughput can be obtained than with the NT source driver over the range of message sizes studied (up to 20 kBytes). We can also see that the asymptotic values for long messages depend more on the machine used than on the driving method.
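
One common way to read curves like those in Fig.3 is a two-parameter model in which each message costs a fixed overhead plus a transfer time proportional to its length. This is an interpretation, not something established by the measurements; the function names and the parameter values (20 us overhead, 40 MBytes/s asymptote) are hypothetical and chosen only to make the arithmetic easy to follow. In such a model the overhead (for instance from the driver) dominates for short messages, while the asymptote for long messages is set by the hardware path.

```python
def bandwidth_mb_s(message_bytes, overhead_us, asymptotic_mb_s):
    """Bandwidth under the model t = overhead + size / asymptotic rate.

    With 1 MByte/s taken as 1 byte/us, size / rate is directly in us.
    """
    transfer_us = message_bytes / asymptotic_mb_s
    return message_bytes / (overhead_us + transfer_us)


def half_performance_length(overhead_us, asymptotic_mb_s):
    """Message size (bytes) at which half the asymptotic rate is reached."""
    return overhead_us * asymptotic_mb_s


# Hypothetical parameters, NOT fitted to the data in Fig.3.
OVERHEAD_US = 20.0
ASYMPTOTIC_MB_S = 40.0

for size in (100, 800, 20000):
    bw = bandwidth_mb_s(size, OVERHEAD_US, ASYMPTOTIC_MB_S)
    print(f"{size:6d} bytes: {bw:5.1f} MBytes/s")
```

With these illustrative parameters the half-performance length is 800 bytes, and a 20 kByte message already reaches more than 95% of the asymptotic value.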

 

Analysing bus actions:

We connected a VMETRO PCI bus analyser to look for clues that could explain these results. Typical waveforms are shown in Fig. 4. The bus analyser sampled every 30 ns; one dot of the upper trace represents one sampling point.
 

Fig.4 Waveform results obtained with the PCI bus analyser

Roughly speaking, the PCI bus is busy transferring data when both signals IRDY# and TRDY# are low. As can be seen from Fig. 4, this is the case for about a third of the time during the transfer of a block of data, which accounts for the poor observed performance of about 1/3 of the maximum of 133 MByte/s. We suspected that the host bridge of the PC could be partially responsible for these results, as the PC is optimized for short-burst devices. However, changing the latency of the host bridge did not improve the performance appreciably.
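
The figures quoted above can be reproduced with a short calculation. The sketch assumes a 32-bit PCI bus with a clock period of about 30 ns (the sampling interval of the analyser) and takes the busy fraction of one third as read off Fig. 4; the function and constant names are illustrative.

```python
CLOCK_PERIOD_NS = 30.0    # approximate period of the 33 MHz PCI clock
BYTES_PER_TRANSFER = 4    # one 32-bit word per data phase


def peak_mbyte_per_s():
    """Peak rate if a word is transferred on every clock cycle."""
    return BYTES_PER_TRANSFER / CLOCK_PERIOD_NS * 1000.0


def effective_mbyte_per_s(busy_fraction):
    """Rate when data is transferred only during busy_fraction of the time."""
    return peak_mbyte_per_s() * busy_fraction


print(f"peak:      {peak_mbyte_per_s():.0f} MByte/s")            # prints 133
print(f"1/3 busy:  {effective_mbyte_per_s(1.0 / 3.0):.0f} MByte/s")  # prints 44
```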

The following trace shows how, during short bursts, a word is read from the PCI bus every 30 ns (indicated in blue), while between bursts there are relatively long periods of inactivity (red lines). In this trace the CPU polls the status of the data transfer quite frequently; for the measurement results presented above the polling was considerably less frequent. The second column in the trace contains the time intervals between successive PCI cycles.

TRIG       0ns   ...... AD32 MemWri  FFBDFC2C 002D3000 Ok     --
   1      30ns   ...... AD32 MemWri  FFBDFC2C 00002710 TRetry --
   2     120ns     30ns AD32 MemWri  FFBDFC30 00002710 Ok     --
   3      90ns     30ns AD32 MemWri  FFBDFC3C 0000D000 Ok     --
   4     120ns     60ns AD32 MemRd   FFBDFC3C 0000D0A6 Ok     --
   5     390ns    330ns AD32 MRdMul  002D3000 00000000 Ok     -- 1st word transferred  
   The cycles above are the start-up procedure of the DMA.
   6      30ns   ...... AD32 MRdMul  002D3000 00000001 Ok     --
   7      30ns   ...... AD32 MRdMul  002D3000 00000002 Ok     --
   8      30ns   ...... AD32 MRdMul  002D3000 00000003 Ok     --
   9      30ns   ...... AD32 MRdMul  002D3000 00000004 Ok     --
  10      30ns   ...... AD32 MRdMul  002D3000 00000005 Ok     --
  11      30ns   ...... AD32 MRdMul  002D3000 00000006 Ok     --
  12     120ns   ...... AD32 MRdMul  002D3000 00000007 Ok     --
  13      30ns   ...... AD32 MRdMul  002D3000 00000008 Ok     --
  14      30ns   ...... AD32 MRdMul  002D3000 00000009 Ok     --
  15      30ns   ...... AD32 MRdMul  002D3000 0000000A Ok     --
  16      30ns   ...... AD32 MRdMul  002D3000 0000000B Ok     --
  17     210ns     60ns AD32 MemRd   FFBDFC3C 0000D0A0 Ok     -- polling
  18     510ns    450ns AD32 MRdMul  002D3030 0000000C Ok     --  
  19      30ns   ...... AD32 MRdMul  002D3030 0000000D Ok     --
  20      30ns   ...... AD32 MRdMul  002D3030 0000000E Ok     --
  21      60ns   ...... AD32 MRdMul  002D3030 0000000F Ok     --
  22      30ns   ...... AD32 MRdMul  002D3030 00000010 Ok     --
  23      30ns   ...... AD32 MRdMul  002D3030 00000011 Ok     --
  24      30ns   ...... AD32 MRdMul  002D3030 00000012 Ok     --
  25      30ns   ...... AD32 MRdMul  002D3030 00000013 Ok     --
  26      30ns   ...... AD32 MRdMul  002D3030 00000014 Ok     --
  27      30ns   ...... AD32 MRdMul  002D3030 00000015 Ok     --
  28     150ns     60ns AD32 MemRd   FFBDFC3C 0000D0A0 Ok     -- polling
  29     480ns    420ns AD32 MRdMul  002D3058 00000016 Ok     --
  30      90ns   ...... AD32 MRdMul  002D3058 00000017 Ok     --
  31      30ns   ...... AD32 MRdMul  002D3058 00000018 Ok     --
  32      30ns   ...... AD32 MRdMul  002D3058 00000019 Ok     --
  33      30ns   ...... AD32 MRdMul  002D3058 0000001A Ok     --
  34      30ns   ...... AD32 MRdMul  002D3058 0000001B Ok     --
  35      30ns   ...... AD32 MRdMul  002D3058 0000001C Ok     --
  36      30ns   ...... AD32 MRdMul  002D3058 0000001D Ok     --
  37      30ns   ...... AD32 MRdMul  002D3058 0000001E Ok     --
  38      30ns   ...... AD32 MRdMul  002D3058 0000001F Ok     --
  39     540ns    450ns AD32 MRdMul  002D3080 00000020 Ok     -- 
  40      30ns   ...... AD32 MRdMul  002D3080 00000021 Ok     --
  41      30ns   ...... AD32 MRdMul  002D3080 00000022 Ok     --
  42      30ns   ...... AD32 MRdMul  002D3080 00000023 Ok     --
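
As a rough cross-check of the one-third estimate, the trace itself can be summed over one complete period: the idle gap before a burst, the burst, and the polling cycle that follows (samples 18 to 28). The sketch below assumes that the first time column is the interval since the previous PCI cycle and that each MRdMul cycle transfers one 32-bit word; both are readings of the trace, not statements from the analyser documentation. Since polling is more frequent in this trace than in the measurements, the result is indicative only.

```python
# Samples 18..28 copied from the trace above.
TRACE_WINDOW = """\
  18     510ns    450ns AD32 MRdMul  002D3030 0000000C Ok     --
  19      30ns   ...... AD32 MRdMul  002D3030 0000000D Ok     --
  20      30ns   ...... AD32 MRdMul  002D3030 0000000E Ok     --
  21      60ns   ...... AD32 MRdMul  002D3030 0000000F Ok     --
  22      30ns   ...... AD32 MRdMul  002D3030 00000010 Ok     --
  23      30ns   ...... AD32 MRdMul  002D3030 00000011 Ok     --
  24      30ns   ...... AD32 MRdMul  002D3030 00000012 Ok     --
  25      30ns   ...... AD32 MRdMul  002D3030 00000013 Ok     --
  26      30ns   ...... AD32 MRdMul  002D3030 00000014 Ok     --
  27      30ns   ...... AD32 MRdMul  002D3030 00000015 Ok     --
  28     150ns     60ns AD32 MemRd   FFBDFC3C 0000D0A0 Ok     -- polling
"""

BYTES_PER_WORD = 4  # assumption: one 32-bit word per MRdMul cycle


def summarize(trace):
    """Return (words transferred, elapsed ns, MByte/s) for a trace window."""
    total_ns = 0
    words = 0
    for line in trace.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank lines and annotations
        total_ns += int(fields[1].replace("ns", ""))  # first time column
        if fields[4] == "MRdMul":                      # DMA read of one word
            words += 1
    mbyte_per_s = words * BYTES_PER_WORD / total_ns * 1000.0
    return words, total_ns, mbyte_per_s


words, total_ns, rate = summarize(TRACE_WINDOW)
print(f"{words} words in {total_ns} ns -> {rate:.1f} MByte/s")
# prints "10 words in 960 ns -> 41.7 MByte/s"
```

Under these assumptions the window gives roughly 42 MByte/s, which is of the same order as one third of 133 MByte/s.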
 

Conclusions:

The preliminary results lead to the following conclusions. The performance is lower than that reported by the LynxOS group: at a message length of 1 kByte we observe about 30% of the bandwidth reported (see [2]). We are trying to develop a better understanding of the results, also with the help of the MicroEnable board of Silicon Software [3].

References:


[1] S. Luitz, Linux approach towards PC solutions of destination S-LINK based on the PCI controller S5933 of AMCC, software received by e-mail.
[2] M. Niculescu, S-LINK Performance measurements in the environment of ATLAS DAQ/EF prototype -1,
http://atddoc.cer.ch/Atlas/Notes/070/Note070-1.html
[3] MicroEnable, user manual & hardware applet microSlate, Silicon Software, Jan 98,
http://www.silicon-software.com