Performance and scalability analysis of TCP/IP protocol for ATLAS HLT model.
(preliminary)
Piotr Golonka, 10 Sept 2001.

o TCP/IP tests on LINUX were needed to improve and calibrate the model of TCPBroker.
o Scalability issues were of main concern.
o
o All standardized ways of programming multi-connection applications were successfully set up and tested.
o POSIX standard describes 3 methods: select()/poll() system call, real-time signal-driven communication (SIGIO) and asynchronous communication (AIO).
o select()/poll() - most popular among programmers, straightforward in implementation.
o SIGIO - not as easy, applied in some high-performance servers
o Asynchronous I/O: not widely known(!), though standardized(!); few vendors implement it in the operating system.
o LINUX currently has 2 implementations: KAIO (by SGI) and "native" (as a real-time extension to the standard "glibc" library).
o KAIO (=Kernel Asynchronous I/O) - a patch for the 2.4.* kernels; needed some improvements in the code (posted to SGI), current status: experimental. Many "tweaking points" possible.
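A minimal sketch of the poll() method listed above, written in Python for brevity; it is not the code used in the tests. Local socket pairs stand in for TCP connections, and the helper name drain_ready and the four-connection setup are hypothetical.

```python
import select
import socket

def drain_ready(poller, socks_by_fd, timeout_ms=100):
    """Poll once and read one message from every readable socket."""
    received = {}
    for fd, event in poller.poll(timeout_ms):
        if event & select.POLLIN:
            received[fd] = socks_by_fd[fd].recv(1460)
    return received

# Hypothetical setup: local socket pairs stand in for client connections.
pairs = [socket.socketpair() for _ in range(4)]
poller = select.poll()
socks_by_fd = {}
for server_end, _client_end in pairs:
    poller.register(server_end.fileno(), select.POLLIN)
    socks_by_fd[server_end.fileno()] = server_end

# Only clients 1 and 3 send; poll() reports just those descriptors.
pairs[1][1].send(b"one")
pairs[3][1].send(b"three")
received = drain_ready(poller, socks_by_fd)
payloads = sorted(received.values())
```

The server side holds one descriptor per connection and asks the kernel which of them are readable, so the cost of every call depends on how many descriptors are registered, not only on how many are active.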
o
o TESTS performed:
o Setup: a few PCs (400MHz and 2*800MHz), FastEthernet
o one or a few clients sending stream of data to one server with controlled rate
o various "message sizes" (chosen 1460B=1 packet for final tests),
o 1-1000 connections open (and used) simultaneously.
o External application for measuring CPU load on the server.
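A sketch of what a rate-controlled client of this kind could look like, assuming fixed-size messages sent at evenly spaced deadlines. The names message_interval and send_paced are hypothetical, a local socket pair replaces the real network link, and the 1460-byte size is taken as the 1500-byte Ethernet MTU minus 20 bytes each of IP and TCP header.

```python
import socket
import time

MSG_SIZE = 1460  # bytes; 1500-byte Ethernet MTU minus 20 B IP and 20 B TCP headers

def message_interval(rate_bytes_per_s, msg_size=MSG_SIZE):
    """Seconds between messages needed to hold the requested byte rate."""
    return msg_size / rate_bytes_per_s

def send_paced(sock, n_messages, rate_bytes_per_s, msg_size=MSG_SIZE):
    """Send n_messages of msg_size bytes, pacing them to the requested rate."""
    interval = message_interval(rate_bytes_per_s, msg_size)
    payload = b"x" * msg_size
    next_deadline = time.monotonic()
    for _ in range(n_messages):
        sock.sendall(payload)
        next_deadline += interval
        delay = next_deadline - time.monotonic()
        if delay > 0:
            time.sleep(delay)
    return n_messages * msg_size

# Hypothetical stand-in for a client/server connection.
client, server = socket.socketpair()
sent = send_paced(client, n_messages=5, rate_bytes_per_s=1_000_000)

got = b""
while len(got) < sent:
    got += server.recv(65536)
```

At 10 MBytes/s (taken here as 10^7 bytes/s) a 1460-byte message has to leave every 146 microseconds, which is why the per-message cost on the server dominates at high rates.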
o
o RESULTS:
o We have shown that reliable communication using TCP may be realised for up to 1000 connections, with a total rate of up to 10MBytes/s, without saturating a 400MHz machine. It is possible to use the socket API for network communication (keep standards!); TCP overhead should not be a problem.
o CPU load vs Throughput for READING operation (Ping-Pong test gave the results for mixed read+write operations, we were unable to separate the two!)
o Limits on the total data rate which may be accepted by the server (approx. 9MBytes/s; instabilities, signal queue overflows, etc.)
o
o LIMITATIONS:
o select() - natural limit of 1023 connections: the bitmap data structure (fd_set) passed to this syscall can only describe descriptors below FD_SETSIZE (1024 on LINUX). poll() solves the problem: the bitmap is replaced with an array of per-descriptor structures (performance penalty!)
o Signal-driven methods: high message rates may overflow internal signal queues (1024 signals may wait in a signal queue in kernel 2.4.*). The method fails when an overflow happens.
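The select() limit described above can be observed from Python, whose select module refuses descriptor numbers that do not fit into the bitmap. This is only a sketch of the range check: descriptor number 2000 is an arbitrary, unopened value, and the helper names select_accepts and poll_accepts are hypothetical.

```python
import select

def select_accepts(fd):
    """True if the descriptor number fits into select()'s bitmap."""
    try:
        select.select([fd], [], [], 0)
    except ValueError:
        # Number is outside the fd_set range (FD_SETSIZE).
        return False
    except OSError:
        # Number fits, but the descriptor happens not to be open.
        return True
    return True

def poll_accepts(fd):
    """poll() stores the descriptor number itself, so any value registers."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return True

small_ok = select_accepts(0)
large_ok = select_accepts(2000)
large_poll_ok = poll_accepts(2000)
```

Only the acceptance of the descriptor number is shown here; nothing is read or written.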
o
o RESULT ANALYSIS:
o select(): scales very poorly! Applicable only to applications that have fewer than a few tens of connections open for data transfer. We EXCLUDE this method.
o SIGIO: scales very well with increasing number of connections. Very good performance!
o (K)AIO: some unexplained behaviour. Scales well. Slightly poorer performance compared to SIGIO.
o
o OUR CONCLUSIONS FOR MODELING:
o One scalability parameter (!) needs to be added to the model if we decide to model SIGIO or AIO behaviour: the impact of the number of connections on load.
o The value of the parameter may be found directly from the plots.
o The model will be applicable up to a "maximal rate" limit, i.e. 9-10 MBytes/s for FastEthernet (additional parameter?)
o All conclusions concerning these methods should be applicable to ALL communication protocols which use the socket API (e.g. UDP, raw packets, etc.); not applicable to the MESH driver though.
o Now: put it into at2sim and try out!
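One way such a parameter could enter the model is sketched below, assuming purely for illustration that CPU load grows linearly with both throughput and number of connections. The functional form, the names cpu_load, LOAD_PER_MBYTE and LOAD_PER_CONNECTION, and all coefficient values are placeholders, not measurements; the real values are to be read from the plots.

```python
# Placeholder coefficients for illustration only (NOT measured values).
LOAD_PER_MBYTE = 0.05         # CPU fraction per MByte/s of throughput
LOAD_PER_CONNECTION = 0.0001  # CPU fraction per open connection (scalability parameter)
MAX_RATE_MBYTES = 9.0         # model validity limit, lower end of the 9-10 MBytes/s range

def cpu_load(rate_mbytes, n_connections):
    """Estimated server CPU load under the assumed linear model."""
    if rate_mbytes > MAX_RATE_MBYTES:
        raise ValueError("rate above the limit where the model applies")
    return LOAD_PER_MBYTE * rate_mbytes + LOAD_PER_CONNECTION * n_connections

load_few = cpu_load(5.0, 10)
load_many = cpu_load(5.0, 1000)

try:
    cpu_load(12.0, 100)
    over_limit_rejected = False
except ValueError:
    over_limit_rejected = True
```

Keeping the rate limit as an explicit check matches the remark above that the model is only applicable up to a maximal rate.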
o
o PROBLEMS, DISCUSSION:
o Signal queue overflow protection for KAIO may be implemented.
o The same for SIGIO - we need some hints from LINUX kernel hackers to find out how SIGIO works.
o Further experiments with AIO needed, especially to find the optimal number of "Slave threads".
o Experiments on Gigabit Ethernet and much faster machines needed (To be done in Krakow, Oct/Nov 2001)
o Lots of things to be tested, no manpower (e.g. the same tests for UDP, zero-copy patches, etc.) :-(
o HOPEFULLY WE CAN PROVE THAT TCP MAY BE USED IN ATLAS HLT.