Performance and scalability analysis of TCP/IP protocol for ATLAS HLT
model.
(preliminary)
Piotr Golonka, 10 Sept 2001.
o TCP/IP tests on LINUX were needed to improve and calibrate the model
of TCPBroker.
o Scalability issues were of main concern.
o
o All standardized ways of programming multi-connection applications
were successfully set up and tested.
o The POSIX standard describes 3 methods: the select()/poll() system calls,
real-time signal-driven communication (SIGIO) and asynchronous communication
(AIO)
o select()/poll() - most popular among programmers, straightforward
to implement,
o SIGIO - not as easy, applied in some high-performance servers
o Asynchronous I/O: not widely known(!), though standardized(!); few
vendors implement it in their operating systems.
o LINUX currently has 2 implementations: KAIO (by SGI) and "native"
(as a real-time extension to the standard "glibc" library).
o KAIO (=Kernel Asynchronous I/O) - patch to the 2.4.* kernels; it needed
some improvements in the code (posted to SGI). Current status: experimental.
Many "tweaking points" possible.
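As an illustration of the first method only, the sketch below shows a minimal readiness loop written once with poll() and once with select(). It uses Python's standard `select` module, which wraps the same system calls; the function names and the 1460-byte read size are choices made for this example and are not taken from the test programs. SIGIO and (K)AIO are not covered here.

```python
import select
import socket

READ_SIZE = 1460  # bytes per read; assumed here to match one message


def read_ready_with_poll(socks, timeout_ms=1000):
    """Return {fd: data} for every socket that poll() reports as readable."""
    poller = select.poll()
    by_fd = {s.fileno(): s for s in socks}
    for fd in by_fd:
        # poll() takes an explicit list of descriptors, not a bitmap.
        poller.register(fd, select.POLLIN)
    received = {}
    for fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            received[fd] = by_fd[fd].recv(READ_SIZE)
    return received


def read_ready_with_select(socks, timeout_s=1.0):
    """Same as above, but using select(); limited by the fd_set bitmap size."""
    readable, _, _ = select.select(socks, [], [], timeout_s)
    return {s.fileno(): s.recv(READ_SIZE) for s in readable}
```

Both functions scan the whole set of connections on every call, which is the cost that grows with the number of open connections.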
o
o TESTS performed:
o Setup: a few PCs (400MHz and 2*800MHz), FastEthernet
o one or a few clients sending a stream of data to one server at a controlled
rate
o various "message sizes" (chosen 1460B=1 packet for final tests),
o 1-1000 connections open (and used) simultaneously.
o External application for measuring CPU load on the server.
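A client sending at a controlled rate could be sketched as below. This is only an assumed reconstruction of the idea, not the test program itself: the names `message_interval` and `send_paced` are invented for the example, and pacing by sleeping until a per-message deadline is one possible approach among several.

```python
import socket
import time

MESSAGE_SIZE = 1460  # bytes; the size chosen for the final tests (1 packet)


def message_interval(rate_bytes_per_s, message_size=MESSAGE_SIZE):
    """Seconds between consecutive messages for a target data rate."""
    return message_size / rate_bytes_per_s


def send_paced(sock, rate_bytes_per_s, n_messages, message_size=MESSAGE_SIZE):
    """Send n_messages fixed-size messages, spaced to approximate the rate.

    Returns the total number of payload bytes sent.
    """
    payload = b"\0" * message_size
    interval = message_interval(rate_bytes_per_s, message_size)
    start = time.monotonic()
    for i in range(n_messages):
        # Sleep until the deadline of message i; skip if already late.
        delay = start + i * interval - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        sock.sendall(payload)
    return n_messages * message_size
```

At 10 MBytes/s (taken as 10^7 bytes/s) and 1460-byte messages the interval is 146 microseconds, i.e. roughly 6850 messages per second in total over all connections.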
o
o RESULTS:
o We have shown that reliable communication using TCP may be realised
for up to 1000 connections, with a total rate of up to 10MBytes/s, without
saturating a 400MHz machine. It is possible to use the socket API for network
communication (keep standards!); TCP overhead should not be a problem.
o CPU load vs throughput for the READING operation (the Ping-Pong test gave
results for mixed read+write operations; we were unable to separate the two!)
o Limits on the total data rate which may be accepted by the server (approx.
9MBytes/s; instabilities, signal queue overflows, etc.)
o
o LIMITATIONS:
o select() - natural limit of 1023 connections that can be described
by the bitmap data structure which is passed to this syscall. poll() solves
the problem: the bitmap is replaced with a dedicated data structure (performance
penalty!)
o Signal-driven methods: high message rates may overflow internal signal
queues (1024 signals may wait in a signal queue in kernel 2.4.*). The method
fails when overflow happens.
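The overflow condition can be pictured with a toy bounded-queue model. This is a deliberately simplified sketch and not a description of how the kernel queues signals: it assumes a fixed number of signals arriving and being handled per time step, and the function name and parameters are invented for the example. Only the limit of 1024 comes from the notes above.

```python
SIGNAL_QUEUE_LIMIT = 1024  # queued signals, as quoted above for kernel 2.4.*


def first_overflow_tick(arrivals_per_tick, handled_per_tick, ticks,
                        limit=SIGNAL_QUEUE_LIMIT):
    """Return the first tick at which the queue would overflow, or None.

    Toy model: each tick, a fixed number of signals arrive, then the
    application handles a fixed number of them.
    """
    queued = 0
    for tick in range(ticks):
        queued += arrivals_per_tick
        if queued > limit:
            return tick  # the queue cannot hold the new arrivals
        queued = max(0, queued - handled_per_tick)
    return None
```

In this model the queue stays bounded as long as the handler keeps up with the arrival rate; once arrivals exceed the handling rate, overflow is only a matter of time, whatever the queue size.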
o
o RESULT ANALYSIS:
o select(): scales very poorly! Applicable only to applications that have
fewer than a few tens of connections open for data transfer. We EXCLUDE
this method.
o SIGIO: scales very well with increasing number of connections. Very
good performance!
o (K)AIO: some unexplained behaviour. Scales well. Slightly poorer
performance compared to SIGIO.
o
o OUR CONCLUSIONS FOR MODELING:
o One scalability parameter (!) needs to be added to the model, if
we decide to model SIGIO or AIO behaviour: the impact of the number of
connections on the load.
o The value of the parameter may be found directly from the plots.
o The model will be applicable up to a "maximal rate" limit, i.e. 9-10
MBytes/s for FastEthernet (additional parameter?)
o All conclusions concerning these methods should be applicable to
ALL communication protocols which use the socket API (e.g. UDP, raw packets,
etc.); they are not applicable to the MESH driver, though.
o Now: put it into at2sim and try out!
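How the extra parameter might be extracted and used can be sketched as follows. The linear form of the load model is an assumption made purely for illustration (the notes do not state the functional form), and the names `fit_connection_slope` and `predicted_load` are hypothetical; this is not at2sim code.

```python
def fit_connection_slope(points):
    """Least-squares line through (n_connections, cpu_load) points.

    The points are meant to be read off a plot taken at a fixed data rate.
    Returns (intercept, slope); the slope is the per-connection load.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predicted_load(rate, n_connections, load_per_rate, load_per_connection,
                   max_rate):
    """ASSUMED linear model: load grows with rate and with connection count.

    Raises ValueError above max_rate, where the model is not applicable.
    """
    if rate > max_rate:
        raise ValueError("rate exceeds the range where the model applies")
    return load_per_rate * rate + load_per_connection * n_connections
```

The `max_rate` argument plays the role of the "maximal rate" limit mentioned above; whether it should be a separate model parameter is left open, as in the notes.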
o
o PROBLEMS, DISCUSSION:
o Signal queue overflow protection for KAIO may be implemented.
o The same for SIGIO - we need some hints from LINUX kernel hackers to
find out how SIGIO works.
o Further experiments with AIO are needed, especially to find the optimal
number of "Slave threads"
o Experiments on Gigabit Ethernet and much faster machines are needed (to
be done in Krakow, Oct/Nov 2001)
o Lots of things to be tested, no manpower (e.g. the same tests for UDP,
zero-copy patches, etc.) :-(
o HOPEFULLY WE CAN PROVE THAT TCP MAY BE USED IN ATLAS HLT.