February 2000
(Kors Bos, NIKHEF)
On behalf of the D0 group at NIKHEF and the D0 group of the University of Nijmegen we propose a computer facility for data analysis and Monte Carlo event production for the D0 experiment. D0 is an international collaboration of about 400 physicists based at Fermilab near Chicago, USA. The D0 experiment will start taking data at the Tevatron proton-antiproton accelerator on March 1, 2001. Until the Large Hadron Collider (LHC) at CERN in Geneva, Switzerland starts running in the second half of this decade, the Tevatron will represent the “high energy frontier” in experimental particle physics. The Amsterdam and Nijmegen NIKHEF groups joined D0 in the summer of 1998 because we think that for at least the next five years the Tevatron is by far the most promising place to pursue precision electro-weak measurements and direct searches for new physics, and that it will provide measurements in B physics competitive with the dedicated electron-positron “B-factories”. Participating in the D0 experiment now is an excellent preparation for working on the LHC; moreover, PhD students and Research Associates will have a unique chance to do top-level physics research for their degrees.
Huge computing resources are needed to extract the physics from the D0 data. The experiment will record around 6 x 10^8 proton-antiproton collisions per year and will produce a dataset of 300 TB per year. The computing power available to the collaboration at Fermilab alone will amount to many hundreds of GIPs. Whilst modest on these scales, the proposed D0NIK computer facility, with a data store of 1 TB and CPU power of 35 GIPs, will be sufficient to allow us to participate effectively in physics analysis in D0. To succeed in this aim we need to be able to:
The overall D0 computing strategy is based on the assumption that Monte Carlo data will be generated outside Fermilab by the collaborating institutes and universities. By making the NIKHEF generated Monte Carlo available to the rest of the collaboration (corresponding to about one quarter of the total needs of D0) we shall be making a very significant contribution to the overall success of the experiment. Moreover the proposed facility will significantly enhance the computer infrastructure available to staff and students at NIKHEF in Amsterdam and Nijmegen to participate in D0 data analysis.
The proposed system is described in detail in the next section, but in overview it will consist of:
Considerable expertise has been developed within the particle physics community, at Fermilab and elsewhere, in the operation of PC-based computing farms for large experiments. Within the D0 collaboration similar facilities will be operational in Prague, Lyon, Lancaster and Texas. We can profit greatly from this experience and shall be able to take a significant fraction of the required system software from Fermilab.
In the case of a Monte Carlo event simulation and reconstruction task, the following steps will be carried out:
The hardware requirements are determined by the need to have sufficient processing power for the Monte Carlo production and reconstruction and to have sufficient data storage and throughput. The system is required to provide around 35 GIPs. Measurements at Fermilab using realistic particle physics applications indicate that a 450 MHz Intel Pentium processor can deliver 260 MIPs [3]. Assuming performance scales with clock speed, a 600 MHz processor delivers roughly 350 MIPs, so the equivalent of 50 dual-processor 600 MHz PCs (100 processors) will deliver the required 35 GIPs.
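The sizing argument above can be checked with a few lines of arithmetic. The sketch below assumes that throughput scales linearly with clock frequency, which is an assumption of this illustration and not a measured result; the function names are hypothetical.

```python
# Rough CPU sizing check for the farm.
# ASSUMPTION: MIPS scale linearly with processor clock speed.
MEASURED_MIPS = 260.0        # measured at Fermilab on a 450 MHz Pentium [3]
MEASURED_CLOCK_MHZ = 450.0


def mips_per_cpu(clock_mhz: float) -> float:
    """Estimated MIPS of one processor at the given clock speed."""
    return MEASURED_MIPS * clock_mhz / MEASURED_CLOCK_MHZ


def farm_gips(n_nodes: int, cpus_per_node: int, clock_mhz: float) -> float:
    """Estimated total capacity of the farm in GIPS."""
    return n_nodes * cpus_per_node * mips_per_cpu(clock_mhz) / 1000.0


if __name__ == "__main__":
    # 50 dual-processor nodes at 600 MHz
    print(f"{farm_gips(50, 2, 600.0):.1f} GIPs")
```

Under this assumption, 50 dual 600 MHz nodes come out just under the 35 GIPs target, so the quoted farm size has little margin if real applications scale worse than linearly.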
At this stage of the proposal we assume the usage of dual processor
PC's. However, more detailed analysis may show that it is better to use PC
boards with 4 or even 8 CPU's. In that case the numbers quoted below have to be
adjusted accordingly.
To minimize network traffic, each worker node needs enough disk space to hold the input and output files being processed by each processor, plus additional space for data files that are being staged into or out of the tape store. Output files of the simulation are of order 1 GB, so 2 GB are required for dual processors. Output files from the reconstruction are 50% bigger, because the simulated data is stored again together with the reconstructed data chunks. This total of 3 GB must be multiplied by 2 to account for the files of the previous run that are still being transferred to the data store. This gives a minimum requirement of 6 GB of disk space for the simulated and reconstructed data. In addition, some space is needed for the input and calibration files of the simulation and reconstruction programs and, last but not least, the disk must hold the operating system and necessary tools. A “standard” hard disk of around 10 GB therefore seems appropriate.
It may technically be advantageous to separate the system and the data
disk onto two different hard drives.
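The per-node disk budget can be written out as a short calculation. This is a minimal sketch under one reading of the numbers above, namely that the 3 GB figure is the reconstruction output of both processors together; the constant and function names are hypothetical.

```python
# Scratch disk budget for one dual-processor worker node.
# ASSUMPTION: the 3 GB total is the reconstruction output for both CPUs.
SIM_OUTPUT_GB = 1.0      # simulation output file per processor
RECO_GROWTH = 1.5        # reconstruction output is 50% bigger
CPUS_PER_NODE = 2
RUN_BUFFERS = 2          # current run plus previous run being staged out
DISK_SIZE_GB = 10.0      # the "standard" hard disk considered in the text


def data_disk_gb(cpus: int = CPUS_PER_NODE) -> float:
    """Minimum disk space for simulated and reconstructed data on one node."""
    reco_per_cpu_gb = SIM_OUTPUT_GB * RECO_GROWTH
    return cpus * reco_per_cpu_gb * RUN_BUFFERS


def headroom_gb(cpus: int = CPUS_PER_NODE) -> float:
    """Space left for the OS, tools, input and calibration files."""
    return DISK_SIZE_GB - data_disk_gb(cpus)


if __name__ == "__main__":
    print(f"data: {data_disk_gb():.0f} GB, headroom: {headroom_gb():.0f} GB")
```

The same function shows why boards with more processors need proportionally larger disks: with 4 CPUs the data space alone would be 12 GB.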
A typical process takes between 5 and 10 minutes of CPU time per event on a 500 MHz PIII processor. Simpler processes such as Z→ee take 5 minutes, while the more complicated ttbar events take about 7 minutes. The output data files are 2 to 3 MB per event. So 200 ttbar events take 23 hours and produce a 600 MB output data file. A farm of 100 processors can therefore produce on average 20,000 events and 60 GB of simulated data per day. This amounts to roughly 6,000,000 generated events per year, counting on 80% uptime of the farm.
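The production estimate can be reproduced with the sketch below. It takes the ttbar figures quoted above (7 minutes and 3 MB per event) and 100 processors, and uses decimal units (1 GB = 1000 MB); the function names are hypothetical.

```python
# Monte Carlo production estimate for the full farm.
MINUTES_PER_DAY = 24 * 60


def events_per_day(n_cpus: int, minutes_per_event: float) -> float:
    """Events generated per day when all processors run continuously."""
    return n_cpus * MINUTES_PER_DAY / minutes_per_event


def data_gb_per_day(n_events: float, mb_per_event: float) -> float:
    """Daily output volume in GB (1 GB = 1000 MB)."""
    return n_events * mb_per_event / 1000.0


def events_per_year(n_events_per_day: float, uptime: float) -> float:
    """Yearly event count given the fraction of time the farm is up."""
    return n_events_per_day * 365 * uptime


if __name__ == "__main__":
    per_day = events_per_day(100, 7.0)
    print(f"{per_day:.0f} events/day")
    print(f"{data_gb_per_day(per_day, 3.0):.1f} GB/day")
    print(f"{events_per_year(per_day, 0.8):.2e} events/year")
```

The results are slightly above the rounded figures in the text (about 20,600 events and 62 GB per day), which is consistent with quoting 20,000 events and 60 GB.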
The size of the Monte Carlo program is expected to be on the order of 150 MB under the Linux operating system. Therefore the worker nodes require 256 MB of RAM. The reconstruction program will require significantly more, but its precise size is not yet known. At this moment the program is as big as 500 MB, but it is not yet finished and is compiled in a non-optimized mode. It therefore seems appropriate to equip the early machines with 256 MB on a single memory module, so that a slot remains free for a second module to upgrade to 512 MB.
Roughly speaking, when the processors are in use, they can produce 1 GB of data per processor per day. One hundred processors can thus produce 100 GB per day, or 30 TB per year counting on some down time. Although we don’t plan to store all simulated and reconstructed data locally, we should count on the possibility of storing this amount of data. This will not be needed during the year 2000, and experience will allow us to refine our ideas and to specify this number more accurately. D0 and Fermilab are currently testing tape media for data import/export between Fermilab and the collaborating institutes and universities. Final recommendations have not been made yet, but Exabyte Mammoth tapes are currently the frontrunner. The Mammoth-2 technology, which will be commercially available this summer, provides a storage capacity of 60 GB per tape and a reading speed of 12 MB/s per drive. Exabyte X-200 Arrowhead Mammoth libraries allow automated random access to 200 tapes. The exact amount of data that will be needed at NIKHEF for analysis at any time is not known yet and will depend very much on the analysis manpower and the channels we will study.
If we assume that disk space should be available for about 10% of the data on tape, and take a fully loaded 200-tape library at 60 GB per tape (12 TB) as the tape capacity, we should aim for a common disk space of 1.2 TB. This is not an overestimate, since it corresponds to only 12 GB of common disk space per CPU.
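The storage figures can be tied together in a short calculation. The sketch assumes that "the data on tape" means one fully loaded 200-tape library of 60 GB Mammoth-2 tapes, and that the farm has 100 CPUs producing 100 GB per day; these inputs and the function names are assumptions of this illustration.

```python
# Tape and disk capacity estimate.
# ASSUMPTION: tape capacity = one full 200-tape library of 60 GB tapes.
TAPE_CAPACITY_GB = 60.0
TAPES_PER_LIBRARY = 200
DISK_FRACTION = 0.10         # fraction of tape data kept on common disk
N_CPUS = 100
FARM_OUTPUT_GB_PER_DAY = 100.0


def library_capacity_tb() -> float:
    """Total capacity of one tape library in TB (1 TB = 1000 GB)."""
    return TAPE_CAPACITY_GB * TAPES_PER_LIBRARY / 1000.0


def common_disk_tb() -> float:
    """Common disk space needed to hold the chosen fraction of tape data."""
    return library_capacity_tb() * DISK_FRACTION


def common_disk_gb_per_cpu() -> float:
    """Common disk space per processor in GB."""
    return common_disk_tb() * 1000.0 / N_CPUS


def days_to_fill_library() -> float:
    """Days of full-rate production that one library can absorb."""
    return library_capacity_tb() * 1000.0 / FARM_OUTPUT_GB_PER_DAY


if __name__ == "__main__":
    print(f"library: {library_capacity_tb():.1f} TB")
    print(f"common disk: {common_disk_tb():.1f} TB")
    print(f"per CPU: {common_disk_gb_per_cpu():.0f} GB")
    print(f"library full after {days_to_fill_library():.0f} days")
```

Under these assumptions a single library holds about four months of full-rate output, which illustrates why not all simulated and reconstructed data can be kept locally.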
As time goes by the ideas described above may change. Prices for disk
space decrease much faster than prices for tapes and tape robots. By the time the storage capacity is needed it may be cheaper to store
all data permanently on disk. It may still be desirable to have the capability
to write tapes in case the network is not able to cope with the dataflow we
need.
Given the above rates, it seems clear that 1 Gbit/s network is required eventually for the tape store but also for the manager nodes and possibly a few worker nodes that might be dedicated to I/O tasks. For individual worker nodes running data analysis or Monte Carlo simulation jobs 100 Mbit/s network connectivity will be perfectly adequate.
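To put the network requirement in numbers, the sketch below converts the production and tape rates quoted earlier in this document (1 GB per processor per day, 100 GB per day for the farm, 12 MB/s per tape drive) into Mbit/s. It uses decimal units and ignores protocol overhead, which is a simplification of this illustration; the function names are hypothetical.

```python
# Network load estimate from the data rates quoted in this document.
# ASSUMPTION: decimal units (1 GB = 1000 MB), no protocol overhead.
SECONDS_PER_DAY = 24 * 60 * 60


def avg_mbit_per_s(gb_per_day: float) -> float:
    """Average rate in Mbit/s for a given daily data volume in GB."""
    return gb_per_day * 1000.0 * 8.0 / SECONDS_PER_DAY


def tape_drive_mbit_per_s(mb_per_s: float) -> float:
    """Line rate in Mbit/s needed to keep one tape drive streaming."""
    return mb_per_s * 8.0


if __name__ == "__main__":
    print(f"one processor: {avg_mbit_per_s(1.0):.2f} Mbit/s")
    print(f"whole farm:    {avg_mbit_per_s(100.0):.1f} Mbit/s")
    print(f"tape drive:    {tape_drive_mbit_per_s(12.0):.0f} Mbit/s")
```

A single worker needs well under 1 Mbit/s on average, so 100 Mbit/s is ample there, whereas one Mammoth-2 drive streaming at 12 MB/s already needs about 96 Mbit/s, which is why the tape store is the first candidate for a 1 Gbit/s link.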
The milestone cast in stone is the start of Run II at Fermilab: March 1, 2001. By that time the farm has to be up and running. This means that the last moment to order the hardware for the full farm is the end of 2000. Until then we can run with a 10% scaled-down model. From the software point of view, a farm with 10 nodes is the same as a farm with 100 nodes. From the hardware point of view this is not the case. It is not clear that hardware problems scale linearly with the size of the farm. Moreover, it is not clear how data throughput scales with the number of nodes. We have to keep these uncertainties in mind when working with the scaled-down model.
This is now: it is time to design the farm. The result should be a description of the full farm in some detail, and this document is the first attempt. A much more detailed plan for the 10% model must follow, and appointments should be made with industrial partners. A more detailed description of the parts needed and a first estimate of the cost should be finished by the end of February, and by the end of March we should be in a position to order the hardware for the 10% model.
The 10% model will be installed and connected. The major task for this period is to install all needed software and to make it work. We can rely on our colleagues at Fermilab and at other sites with PC farms for help. No major financial investments are foreseen at this stage. The farm should be up and running with the latest versions of the D0 simulation and reconstruction software before the summer of this year, i.e. June 21.
During the second half of the year we will have to operate the 10% model of the farm to its maximum and generate and reconstruct as much data as we can to learn more about its performance and limitations. Based on this experience we should then define the parameters of the final farm and should have a detailed specification on what has to be purchased, so negotiations with the PC makers can start about prices and delivery times. The order for a substantial increase of the farm should go out in December 2000 to have sufficient CPU power when data taking starts.
The full farm will be installed and powered up. The software will be adapted to the new environment and simulation and reconstruction will start at the time the data taking starts at Fermilab: March 2001.
A detailed description of the proposed 10% farm follows:
High performance workstation with:
dual 750 MHz processor
512 MB RAM
18 GB hard disk
100 Mbit Ethernet card
monitor
This node is likely to need
the 1 Gbit ethernet connection when it needs to operate as file server for the
farm in operation and has to copy the data from the nodes to the common disk
space and/or tape store. However, a 100 Mbit connection will most likely be
sufficient for the 10% model.
High performance workstation with:
750 MHz processor
512 MB RAM
18 GB hard disk
100 Mbit Ethernet card
monitor
The existing NIKHEF/D0
software server <biotoop> could be re-used for this purpose.
700 MHz per processor
256 MB RAM per processor
18 GB hard disk per processor
100 Mbit Ethernet card
It remains to be seen what is
the most cost-effective solution: 2, 4 or 8 CPU's per board. One CPU per board
is an expensive solution because there is too much duplication of hardware in
this case. The 8 CPU solution is quite expensive because large-scale
integration on a limited board size makes the production more delicate. It
seems that the optimum is either at 2 or at 4 CPU's per board. Memory and disk
space requirements scale with the number of processors per board. In the case
of 4 or 8 CPU's per board 10 Mbit Ethernet cards may have to be replaced by 100
Mbit.
120 Gbyte disk space to be connected to the file server control node
Not really known well enough to specify
3COM or Cisco or similar network switch with 10/100Base ports.
This allows us to have all worker nodes in a separate local network.
| Item | Price |
|---|---|
| 1 File server | kf 10 |
| 100 Gbyte disk space | kf 10 |
| 1 Batch server | kf 10 |
| 5 Worker nodes (10 cpu's) | kf 50 |
| Network Switch | kf 5 |
| Infrastructure | kf 10 |
| Total | kf 95 |
For the full farm we expect the price to
scale with the number of worker nodes. So if we go for a 100 CPU farm we must
plan to invest an additional 45 * kf 10 = kf 450. This does not include the costs for a tape library.
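The cost figures can be cross-checked with the sketch below. It assumes, as in the text, dual-processor worker nodes and a price that scales linearly with the number of nodes; the dictionary keys and function names are hypothetical labels for the items in the price list.

```python
# Cost cross-check for the 10% model and the full farm (prices in kf).
# ASSUMPTION: dual-CPU nodes, cost linear in the number of worker nodes.
PROTOTYPE_COSTS_KF = {
    "file server": 10,
    "disk space": 10,
    "batch server": 10,
    "worker nodes": 50,      # 5 nodes, 10 CPUs
    "network switch": 5,
    "infrastructure": 10,
}
PROTOTYPE_NODES = 5
CPUS_PER_NODE = 2


def prototype_total_kf() -> int:
    """Total cost of the 10% model."""
    return sum(PROTOTYPE_COSTS_KF.values())


def extra_cost_kf(target_cpus: int) -> float:
    """Additional worker-node cost to reach the target number of CPUs."""
    cost_per_node = PROTOTYPE_COSTS_KF["worker nodes"] / PROTOTYPE_NODES
    extra_nodes = target_cpus // CPUS_PER_NODE - PROTOTYPE_NODES
    return extra_nodes * cost_per_node


if __name__ == "__main__":
    print(f"10% model: kf {prototype_total_kf()}")
    print(f"additional for 100 CPUs: kf {extra_cost_kf(100):.0f}")
```

As in the text, the estimate covers worker nodes only; a tape library, extra disk space and additional network hardware would come on top of it.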
For plans 0 to 2, i.e. this year, we expect to need:
0.5 fte for the hardware installation and commissioning and
0.5 fte for the installation and testing of the D0 application software
For the full farm, so for next year, we
expect this need to increase by a factor of 2.