Molecular Structure Determination on the Grid


Russ Miller, Charles Weeks
DOI: 10.4018/978-1-60566-374-6.ch017


Grids represent an emerging technology that allows geographically- and organizationally-distributed resources (e.g., compute systems, data repositories, sensors, imaging systems, and so forth) to be linked in a fashion that is transparent to the user. The New York State Grid (NYS Grid) is an integrated computational and data grid that provides access to a wide variety of resources to users from around the world. NYS Grid can be accessed via a Web portal, where the users have access to their data sets and applications, but do not need to be made aware of the details of the data storage or computational devices that are specifically employed in solving their problems. Grid-enabled versions of the SnB and BnP programs, which implement the Shake-and-Bake method of molecular structure (SnB) and substructure (BnP) determination, respectively, have been deployed on NYS Grid. Further, through the Grid Portal, SnB has been run simultaneously on all computational resources on NYS Grid as well as on more than 1100 of the over 3000 processors available through the Open Science Grid.

1. Introduction

The Grid is a rapidly emerging and expanding technology that allows geographically-distributed resources spanning administrative boundaries to be linked together in a transparent fashion (Berman et al., 2003; Foster & Kesselman, 1999). These resources include compute systems, data storage devices, sensors, imaging systems, visualization devices, and a wide variety of Internet-ready instruments. The concept and terminology of the Grid are borrowed from the electrical grid, where utility companies share and move resources (electricity) in a fashion that is transparent to the consumer. With rare exception, consumers expect to be able to plug a piece of equipment into a power outlet and obtain electricity; they do not need to know, and in fact do not want to know, how the electricity makes its way to the outlet. Similarly, the power of both computational grids (i.e., seamlessly connected compute systems and their local storage) and data grids (i.e., seamlessly connected large storage systems) lies not only in the aggregate computing power, data storage, and network bandwidth that can readily be brought to bear on a particular problem, but also in their ease of use.

Numerous government-sponsored reports state that grid computing is a key to 21st century discovery by providing seamless access to the high-end computational infrastructure that is required for revolutionary advances in contemporary science and engineering. In fact, National Science Foundation Director Arden Bement stated that “leadership in cyberinfrastructure may determine America’s continued ability to innovate – and thus our ability to compete successfully in the global arena.”

Grids are now a viable solution to certain computationally- and data-intensive computing problems for reasons that include the following.

  • Users can access many grids through a Web portal from virtually anywhere in the world; an account on a grid administrative server is all that is required. This is similar to the way one uses a search engine or a large e-business system: access to a gateway suffices, without an account on each individual server the company has configured to handle the requests. For most grids, a user needs access to a Grid Portal, but does not need to be logged in to a site that hosts a particular grid resource, does not need to be logged in to a computer that is on a grid, and does not need to install any additional software on their Web-accessible system (workstation, cellular phone, laptop, etc.) in order to use a grid.

  • The Internet is mature and able to serve as the fundamental infrastructure for network-based computing. In fact, network bandwidth, which has been doubling approximately every 12 months, has increased to the point of being able to provide efficient and reliable services for the vast majority of Grid applications.

  • Storage capacity, which has been doubling approximately every 9 months, has now reached commodity levels, where one can purchase a terabyte of disk for roughly the same price as a high-end PC.

  • Many instruments are Internet-aware.

  • Clusters, supercomputers, and storage and visualization devices are becoming more mainstream in terms of their ability to host scientific applications.

Key Terms in this Chapter

X-Ray Crystallography: The scientific method most commonly used to determine molecular structures. Individual molecules cannot be seen under a light microscope because the wavelength of visible light is larger than the molecular size. However, crystals are made up of an array of many (~10¹¹–10¹²) identical, regularly-spaced molecules, and the regular spacing allows a technique called X-ray diffraction to be used to “see” the molecules that comprise the crystal.

Molecular Structure Determination: The study of the three-dimensional architecture of molecules and the arrangement of their component atoms in space.

The Phase Problem: A diffraction pattern consists of a set of spots called reflections that result from capturing the image of an X-ray beam as it is scattered by the atoms in a crystal. Each reflection has a magnitude, which is experimentally accessible, and a phase, which is not. The inability to measure phase angles experimentally is known as the phase problem. Together, the magnitude and phase of a reflection constitute a quantity known as a structure factor. The set of structure factors provides the reciprocal-space representation of the structure, and the set of atomic coordinates provides the real-space representation. A Fourier transformation of the complex-valued structure factors leads directly to the real-valued electron density, which, after suitable interpretation, reveals the atomic positions and describes the molecular architecture of the crystalline material responsible for the scattering. Therefore, a solution to the phase problem requires an algorithm that will recover phase information that cannot be directly measured in the diffraction experiment.
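The Fourier relationship described above can be made concrete with a small numerical sketch. The toy 1-D "crystal" below (grid size, atom positions, and electron counts are all invented for illustration) shows that magnitudes plus phases recover the electron density exactly, while magnitudes alone do not; this is precisely why the unmeasurable phases matter.

```python
import numpy as np

# Hypothetical 1-D unit cell sampled on a grid, with three point atoms
# (delta-like peaks) standing in for the electron density.
n = 64
rho = np.zeros(n)
rho[[5, 21, 40]] = [6.0, 8.0, 16.0]   # illustrative "electron counts"

# The Fourier transform gives the structure factors F(h): each reflection
# h has a magnitude |F(h)| (measurable) and a phase (not measurable).
F = np.fft.fft(rho)
magnitudes = np.abs(F)
phases = np.angle(F)

# With both magnitudes and phases, the inverse transform recovers the
# real-space electron density, and hence the atomic positions, exactly.
rho_back = np.fft.ifft(magnitudes * np.exp(1j * phases)).real

# With magnitudes alone (all phases set to zero), the resulting map is
# wrong: the information locating the atoms resides largely in the phases.
rho_nophase = np.fft.ifft(magnitudes).real
```

Running this, `rho_back` matches `rho` to floating-point precision, while `rho_nophase` piles density at the origin instead of at the atomic sites.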

SnB and BnP: SnB is a computer program developed in Buffalo (at the Hauptman-Woodward Medical Research Institute and SUNY-Buffalo) that provides an efficient implementation of the Shake-and-Bake method of molecular structure determination. This program has been distributed worldwide from the website and is available for workstations, networks of workstations, clusters, and grids. BnP is a computer program that combines the direct-methods program SnB with components of Bill Furey’s (University of Pittsburgh) PHASES suite. BnP targets the determination of large protein structures. First, the SnB component is used to find a substructure consisting of heavier atoms, and then the substructure is used as a starting point for phasing the complete protein.

Grid Computing: Computational efforts involving computing, networking, storage, or visualization that draw on geographically-distributed, independently-operated resources linked together in a transparent fashion.

Shake-and-Bake: A powerful algorithmic formulation of direct methods that, given complete and accurate diffraction data, has made possible the ab initio phasing of crystal structures containing as many as ~2000 independent non-hydrogen atoms. The distinctive feature of this algorithm is the cyclical alternation of phase refinement with the imposition of atomicity constraints.
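The dual-space cycle that characterizes Shake-and-Bake can be sketched in a heavily simplified form: alternate between reciprocal space, where the observed magnitudes are imposed, and real space, where atomicity is imposed by keeping only the largest peaks in the trial map. The real SnB program refines phases with a parameter-shift optimization of a minimal function; the loop below replaces that step with simple peak picking, so it is an illustrative toy of the cyclical alternation, not the actual algorithm, and all numbers in it are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D structure whose atom positions we pretend not to know.
n = 128
true_rho = np.zeros(n)
true_rho[[10, 47, 90, 101]] = 1.0
obs_mag = np.abs(np.fft.fft(true_rho))   # the "measured" magnitudes
n_atoms = 4

# Start from random phases and cycle between the two spaces.
phases = rng.uniform(0.0, 2.0 * np.pi, n)
for cycle in range(100):
    # Reciprocal -> real: build a trial density map from current phases
    # combined with the observed magnitudes.
    rho = np.fft.ifft(obs_mag * np.exp(1j * phases)).real
    # Real-space constraint ("Bake"): keep only the n_atoms largest peaks.
    model = np.zeros(n)
    peaks = np.argsort(rho)[-n_atoms:]
    model[peaks] = rho[peaks]
    # Real -> reciprocal ("Shake"): adopt the atomic model's phases while
    # the observed magnitudes are reimposed on the next cycle.
    phases = np.angle(np.fft.fft(model))
```

Whether such a toy loop converges depends on the starting phases; in practice SnB runs many trials from different random starts and ranks them with a figure of merit.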

GUI: A graphical user interface allows people to interact with a computer and computer-controlled devices. Instead of offering only text menus or requiring typed commands, graphical icons, visual indicators or special graphical elements are presented. Often the icons are used in conjunction with text, labels or text navigation to fully represent the information and actions available to a user. The actions are usually performed through direct manipulation of the graphical elements.

Diffraction: In very simple terms, diffraction is the bending or spreading of waves as they pass through an obstruction or gap. The gaps or distances between molecules in a crystal are such that X-rays (with a wavelength of ~10⁻⁶ mm) are diffracted by crystals. The X-rays scattered in a diffraction experiment produce a distinctive pattern that is related to the atomic arrangement within the crystal that was irradiated. Diffraction patterns can be recorded on photographic film or a suitable electronic recording device.

Cyberinfrastructure: Provides for the transparent and ubiquitous application of technologies central to contemporary science and engineering, including high-end computing, networking, and visualization, data warehouses, science gateways, and virtual organizations, to name a few. It is a comprehensive phenomenon that involves creation, dissemination, preservation, and application of knowledge.

Direct Methods: A mathematical approach that makes it possible to glean phase information from the diffraction magnitudes. For example, the fact that molecules consist of atoms, and that atoms are small, discrete points relative to the spaces between them, creates certain constraints. Since there are many more reflections in a diffraction pattern than there are independent atoms in the corresponding crystal, the phase problem is overdetermined, and the existence of relationships among the measured magnitudes is implied. Certain linear combinations of three phases have been identified as relationships useful for determining unknown structures, and direct methods use probabilistic techniques to exploit these relationships. Herbert Hauptman and Jerome Karle won the Nobel Prize in 1985 for their work in developing direct methods.
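The three-phase relationships mentioned above (triplet invariants) can be computed directly for a toy structure. The sketch below (grid size and atom positions are invented for illustration) evaluates φ(h) + φ(k) − φ(h+k) for pairs of strong reflections; probabilistically, such sums tend to cluster near zero when the participating magnitudes are large, which is the statistical handle that direct methods exploit.

```python
import numpy as np

# Toy 1-D point-atom structure; the positions are the "unknown" answer.
n = 256
rho = np.zeros(n)
rho[[17, 63, 122, 201]] = 1.0

F = np.fft.fft(rho)
mag = np.abs(F)
phase = np.angle(F)

# Select a handful of strong reflections (skipping h = 0 and the
# conjugate-symmetric upper half of the transform).
strong = [h for h in np.argsort(mag)[::-1] if 0 < h < n // 2][:8]

# Triplet invariants phi(h) + phi(k) - phi(h+k), wrapped to (-pi, pi].
# For a real density, phi(-m) = -phi(m), so this equals the invariant
# phi(h) + phi(k) + phi(-h-k).
sums = []
for h in strong:
    for k in strong:
        if h != k and (h + k) < n:
            s = phase[h] + phase[k] - phase[h + k]
            sums.append((s + np.pi) % (2.0 * np.pi) - np.pi)
sums = np.array(sums)
```

Because the sums involve only differences of phases of the same structure, they are invariant to shifting the structure's origin, which is what makes them usable before any atomic positions are known.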
