The BaBar experiment has been collecting data since 1999 to examine the violation of charge and parity (CP) symmetry in high energy physics. Event simulation is a compute-intensive task due to the complexity of the Monte Carlo simulation implemented on the GEANT engine. The input data for the simulation (stored in the ROOT format) fall into two categories: conditions data, describing the detector status when data are recorded, and background triggers data, the noise signal necessary to obtain a realistic simulation. In this chapter, the grid approach is applied to the BaBar production framework using the INFN-GRID network.
The BaBar experiment (Cowan, 2007), developed at SLAC (Stanford Linear Accelerator Center), Stanford University, studies the violation of charge and parity (CP) symmetry, a well-known topic in the high energy physics field. The composition of the universe shows only a subtle difference between matter and anti-matter, and the experiment is therefore geared towards understanding why matter prevails over anti-matter. High-energy electrons and positrons collide 250 million times per second to create rare B-mesons and anti-B-mesons. Such events are recorded for further analysis.
High-speed electronics record events that each require about 30 kB of storage. Events are reconstructed from raw data and then separated ("skimmed") into approximately 200 data streams according to their physics properties. These data streams are made available as datasets for analysis and are used by 600 researchers based at 75 institutes in 10 countries. Skimming increases storage requirements, since an event may be duplicated across different streams, but each individual stream can be analyzed more quickly. The BaBar experiment has accumulated to date about 525 fb⁻¹ of integrated luminosity.
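The skimming trade-off can be illustrated with a back-of-envelope calculation. The 30 kB per-event size is the figure quoted above; the average number of streams an event lands in is a hypothetical assumption for illustration only.

```python
EVENT_SIZE_KB = 30  # per-event storage quoted in the text

def skim_storage_gb(n_events, avg_streams_per_event):
    """Storage (GB) needed when each event is duplicated into, on average,
    avg_streams_per_event skim streams (a hypothetical duplication factor)."""
    return n_events * avg_streams_per_event * EVENT_SIZE_KB / 1e6

print(skim_storage_gb(1_000_000, 1))  # no duplication: 30.0 GB
print(skim_storage_gb(1_000_000, 3))  # assumed 3x duplication: 90.0 GB
```

The duplicated storage grows linearly with the duplication factor, which is the price paid so that each physics stream can be read without scanning the full dataset.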
Another important task, called Simulation Production (SP), focuses on reconstructing events generated through a Monte Carlo simulation of the experiment, so that real data can be compared with the theoretical model. Accurate simulations based on the Monte Carlo method require fast data reprocessing in order to distribute a large number of simulated events for analysis purposes.
All information concerning the detector, such as calibrations and efficiencies, represents its status during data acquisition and is called conditions data. This information is mandatory for describing the real state of the system during the generation of simulated events. Alongside conditions data, another important input is the background triggers component: the noise recorded when data are taken, which is required for a realistic reconstruction of simulated events. At least three times as many simulated events are needed as data events. With the traditional production system, each simulated event takes 4 seconds on a modern processor and occupies 20 kB of storage.
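The figures above (three simulated events per data event, 4 s of CPU and 20 kB of storage per simulated event) allow a rough cost estimate. The batch size below is a hypothetical assumption, used only to show the scale of the computation.

```python
SIM_PER_DATA = 3           # simulated-to-data event ratio (from the text)
CPU_PER_EVENT_S = 4        # CPU seconds per simulated event (from the text)
STORAGE_PER_EVENT_KB = 20  # storage per simulated event (from the text)

def sp_cost(n_data_events):
    """Return (CPU-hours, storage in GB) needed to cover n_data_events
    with the minimum 3x simulated sample."""
    n_sim = SIM_PER_DATA * n_data_events
    cpu_hours = n_sim * CPU_PER_EVENT_S / 3600
    storage_gb = n_sim * STORAGE_PER_EVENT_KB / 1e6
    return cpu_hours, storage_gb

cpu_h, gb = sp_cost(1_000_000)  # hypothetical batch of one million data events
print(f"{cpu_h:.0f} CPU-hours, {gb:.0f} GB")
```

Even a modest one-million-event batch demands thousands of CPU-hours, which is exactly the kind of embarrassingly parallel workload a grid of farms absorbs well.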
In order to speed up data access for the huge number of events produced, both types of data are stored following the ROOT (Brun and Rademakers, 1997) framework schema, which represents data as objects: parameters such as energy, speed and trajectory become attributes that the analysis code can easily access through specific methods.
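The object layout described above can be sketched in plain Python. This is an illustrative data model only, not the actual ROOT C++ API; the class and attribute names are hypothetical.

```python
class SimulatedEvent:
    """Illustrative event object: physics parameters stored as attributes
    and exposed through accessor methods, mirroring the ROOT object schema."""

    def __init__(self, energy, speed, trajectory):
        self.energy = energy          # e.g. in GeV
        self.speed = speed            # e.g. as a fraction of c
        self.trajectory = trajectory  # e.g. a list of (x, y, z) points

    def get_energy(self):
        return self.energy

    def get_trajectory(self):
        return self.trajectory

event = SimulatedEvent(energy=9.0, speed=0.99,
                       trajectory=[(0, 0, 0), (1, 2, 3)])
print(event.get_energy())
```

The point of the schema is that analysis code asks each event object for exactly the attributes it needs, instead of parsing a flat record.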
Key Terms in this Chapter
ROOT: A set of frameworks for data analysis that can be easily extended thanks to its object-oriented design. Data are represented as objects that can be accessed to retrieve all the information needed for further computation.
LFC: LCG File Catalog. A high-performance file catalogue that addresses availability and scalability issues by storing mappings between logical and physical file names.
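The catalogue concept — one logical file name (LFN) mapped to one or more physical replicas (PFNs) — can be sketched as follows. This is a toy model, not the LFC API, and the file and site names are hypothetical.

```python
# Toy logical-to-physical file catalogue: each logical file name (LFN)
# maps to the list of its physical replicas (PFNs) across sites.
catalog = {}

def register_replica(lfn, pfn):
    """Record one more physical replica for a logical file."""
    catalog.setdefault(lfn, []).append(pfn)

def replicas(lfn):
    """Return all known physical locations of a logical file."""
    return catalog.get(lfn, [])

register_replica("/grid/babar/run42/cond.root",
                 "srm://se.site-a.example/babar/cond.root")
register_replica("/grid/babar/run42/cond.root",
                 "srm://se.site-b.example/babar/cond.root")
print(replicas("/grid/babar/run42/cond.root"))  # two physical replicas
```

Jobs refer only to the stable logical name, so replicas can be added or moved between sites without touching the analysis code.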
BaBar Software: BaBar software is organized in terms of packages. A package is a self-contained piece of software intended to perform a well defined task. Some packages may not be usable on their own, requiring integration with others. A software release consists of a coherent set of packages together with the libraries and binaries created for various machine architectures.
Distributed Computing: The distributed computing paradigm envisages the execution of a particular piece of software on two or more computational systems. The software can be designed for pure parallel computation (e.g. data parallelism, task parallelism, instruction-level parallelism) or can be organized by a dependency structure such as a pipeline or farm model. In the BaBar scenario the computational tasks are performed using data parallelism on distributed computer farms (twenty farms in five countries) and on the LCG/gLite-based Grid in Italy and the UK.
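The data-parallel model above can be sketched with Python's standard multiprocessing module: the same task runs independently on disjoint slices of the event workload, as it does across the BaBar farms. The worker function and the event list are hypothetical stand-ins.

```python
from multiprocessing import Pool

def process_chunk(events):
    # Hypothetical per-farm task: each worker processes its own slice of
    # events independently (data parallelism); summing stands in for
    # real event reconstruction.
    return sum(events)

if __name__ == "__main__":
    all_events = list(range(100))                   # hypothetical workload
    chunks = [all_events[i::4] for i in range(4)]   # split across 4 "farms"
    with Pool(4) as pool:
        partial = pool.map(process_chunk, chunks)
    # Combining the partial results reproduces the serial answer.
    print(sum(partial))
```

Because the chunks share no state, the farms never need to communicate during the run; only the partial results are merged at the end.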
Monte Carlo Method: A statistical approach ideal for problems that are too complicated to solve with analytical methods; its results become more accurate as more random samples are drawn from the problem domain. A classical example is the computation of the value of π by generating random pairs of numbers (x, y). The ratio between the number of pairs that satisfy x² + y² <= 1 and the total number of pairs generated approximates π/4. The more random pairs are generated, the more accurate the approximation is.
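The π example above can be sketched directly; the function name and the fixed seed (used to make the run repeatable) are illustrative choices.

```python
import random

def estimate_pi(n_pairs, seed=0):
    """Monte Carlo estimate of pi: the fraction of random points (x, y)
    in the unit square with x**2 + y**2 <= 1 approaches pi/4 as
    n_pairs grows, so 4 times that fraction approaches pi."""
    rng = random.Random(seed)
    inside = sum(1 for _ in range(n_pairs)
                 if rng.random() ** 2 + rng.random() ** 2 <= 1)
    return 4 * inside / n_pairs

print(estimate_pi(10_000))     # rough estimate
print(estimate_pi(1_000_000))  # more pairs, better approximation
```

The statistical error shrinks roughly as 1/√n, which is why Monte Carlo production in BaBar needs so many events — and so much computing — to reach useful precision.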
BaBar Import/Export Tools: The main tasks the BaBar collaboration carries out at remote sites are event reconstruction, simulation production and physics data analysis. Although each task needs to access a different data type and different database metadata, a single import/export software suite is shared by all the site managers.
Grid Monitoring: The INFN-Grid infrastructure includes several monitoring systems with different purposes and granularity. Monitoring activity can focus on resource status and service availability at each site or across the grid as a whole, and can display useful data aggregated per site, service and VO.