Massively Threaded Digital Forensics Tools

Lodovico Marziale, Santhi Movva, Golden G. Richard III, Vassil Roussev, Loren Schwiebert
DOI: 10.4018/978-1-60566-836-9.ch010

Abstract

Digital forensics comprises the set of techniques to recover, preserve, and examine digital evidence, and has applications in a number of important areas, including investigation of child exploitation, identity theft, counter-terrorism, and intellectual property disputes. Digital forensics tools must exhaustively examine and interpret data at a low level, because data of evidentiary value may have been deleted, partially overwritten, obfuscated, or corrupted. While forensics investigation is typically seen as an off-line activity, improving case turnaround time is crucial, because in many cases lives or livelihoods may hang in the balance. Furthermore, if more computational resources can be brought to bear, we believe that preventative network security (which must be performed on-line) and digital forensics can be merged into a common research focus. In this chapter we consider recent hardware trends and argue that multicore CPUs and Graphics Processing Units (GPUs) offer one solution to the problem of maximizing available compute resources.
Chapter Preview

Introduction

The complexity of digital forensic analysis continues to grow in lockstep with the rapidly increasing size of forensic targets—as the rate at which digital content is generated keeps climbing, so does the amount of data that ends up in the forensic lab. According to FBI statistics (Federal Bureau of Investigation, 2007), the average amount of data examined per criminal case has been growing at an average annual rate of 35%—from 83GB in 2003 to 277GB in 2007. However, this is just the tip of the iceberg—the vast majority of forensic analyses are in support of either civil cases or internal investigations and can easily involve the examination of terabyte-scale data sets.

Ultimately, a tiny fraction of that information ends up being relevant—the proverbial ‘needle in a haystack’—so there is a pressing need for high-performance forensic tools that can quickly sift through the data with increasing sophistication. As an illustration of the difficulty of the problem, consider the 2002 Department of Defense investigation into a leaked memo with Iraq war plans. It has been reported (Roberts, 2005) that a total of 60TB of data were seized in an attempt to identify the source. Several months later, the investigation was closed with no results. The Enron case involved over 30TB of raw data and took many months to complete. While these examples might seem exceptional, it is not difficult to come up with similar, plausible scenarios in a corporate environment involving large amounts of data. As media capacity continues to double every two years, such huge data sets will be increasingly the norm, not the exception.

Current state-of-the-art forensic labs use a private network of high-end workstations backed by a Storage Area Network as their hardware platform. Almost all processing for a case is done on a single workstation—the target is first pre-processed (indexed) and subsequently queried. Current technology trends (Patterson, 2004) unambiguously render such an approach unsustainable: storage capacity is growing at a significantly faster rate than I/O throughput and latency are improving.

This means that, in relative terms, we are falling behind in our ability to access data on the forensic target. At the same time, our raw hardware capability to process the data has kept up with capacity growth. The basic problem is two-fold: a) current tools do a poor job of maximizing the use of available compute resources; b) the current index-query model of forensic computation effectively neutralizes most of the gains in compute power by pushing the data across the I/O bottleneck multiple times.

Before we look at the necessary changes in the computational model, let us briefly review recent hardware trends. Starting in 2005, with AMD's introduction of the dual-core Opteron processor, single-chip multiprocessors entered the commodity market. The main reason for their introduction is that chip manufacturing technologies are approaching fundamental limits, and the decades-old pursuit of speedup by doubling transistor density every two years, a.k.a. keeping up with Moore's Law, had to make a 90-degree turn. Instead of shrinking the processor and increasing its clock rate, manufacturers now pack more processing units onto the same chip, and each processor can simultaneously execute multiple threads of computation. This is an abrupt paradigm shift towards massive CPU parallelism, and existing forensic tools are clearly not designed to take advantage of it.
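To make the opportunity concrete, the sketch below shows one way a forensic tool might spread work across all available cores. It is an illustration rather than anything taken from the chapter: the function names (parallelScan, scanChunk), the single fixed keyword, and the even chunking of an in-memory disk-image buffer are assumptions made for the example. Each hardware thread scans its own slice of the buffer, so the image is read once while every core stays busy.

```cpp
// Hypothetical sketch: divide an in-memory disk image among CPU hardware
// threads and let each one scan its chunk for a keyword in a single pass.
#include <algorithm>
#include <atomic>
#include <cstring>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Count occurrences of `needle` whose starting offset lies in [begin, end).
// Each worker may read up to needle.size()-1 bytes past `end` so that matches
// straddling a chunk boundary are neither missed nor double-counted.
static void scanChunk(const unsigned char* image, size_t imageLen,
                      size_t begin, size_t end, const std::string& needle,
                      std::atomic<size_t>& hits) {
    size_t limit = std::min(end + needle.size() - 1, imageLen);
    for (size_t i = begin; i + needle.size() <= limit; ++i)
        if (std::memcmp(image + i, needle.data(), needle.size()) == 0)
            ++hits;
}

size_t parallelScan(const unsigned char* image, size_t imageLen,
                    const std::string& needle) {
    if (needle.empty() || imageLen == 0) return 0;
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    size_t chunk = (imageLen + cores - 1) / cores;
    std::atomic<size_t> hits{0};
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < cores; ++t) {
        size_t begin = t * chunk;
        if (begin >= imageLen) break;
        size_t end = std::min(begin + chunk, imageLen);
        workers.emplace_back(scanChunk, image, imageLen, begin, end,
                             std::cref(needle), std::ref(hits));
    }
    for (auto& w : workers) w.join();
    return hits.load();
}
```

The only subtlety is the chunk boundary: each worker is allowed to read slightly past its nominal end so that a keyword straddling two chunks is still found exactly once. A production tool would also overlap this scanning with asynchronous reads from the evidence drive rather than assume the whole image fits in memory.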

Another important hardware development that gives us a peek into how massively parallel computation on the desktop will look in the near future is the rise of Graphics Processing Units (GPUs) as a general-purpose compute platform. GPUs evolved out of the need to speed up graphics computations, which tend to be highly parallelizable and follow a very regular pattern. As a result, GPU architectures have followed a different evolutionary path from that of the CPU: instead of relatively few, very complex processing units and large caches, GPUs provide hundreds of simpler processing units and very little on-board cache.
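The data-parallel style this architecture favors is easiest to see in a small example. The CUDA sketch below is illustrative only—the kernel name, the choice of the three-byte JPEG header FF D8 FF as the search target, and the fixed-size result buffer are assumptions made for this example, not something prescribed by the chapter. Every GPU thread examines a single byte offset of the disk image, so hundreds of simple cores test offsets simultaneously.

```cuda
// Hypothetical CUDA sketch: one GPU thread per byte offset, each testing
// whether a JPEG header (FF D8 FF) starts at its offset in the disk image.
#include <cuda_runtime.h>

__global__ void findJpegHeaders(const unsigned char* image, size_t len,
                                size_t* hitOffsets, unsigned int* hitCount,
                                unsigned int maxHits) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i + 3 > len) return;                         // stay inside the image
    if (image[i] == 0xFF && image[i + 1] == 0xD8 && image[i + 2] == 0xFF) {
        unsigned int slot = atomicAdd(hitCount, 1u); // claim an output slot
        if (slot < maxHits) hitOffsets[slot] = i;    // record the match offset
    }
}

// Host-side launch: cover every offset with 256-thread blocks. The device
// buffers (d_image, d_hitOffsets, d_hitCount) are assumed to have been
// allocated and populated with cudaMalloc/cudaMemcpy beforehand.
void launchHeaderScan(const unsigned char* d_image, size_t len,
                      size_t* d_hitOffsets, unsigned int* d_hitCount,
                      unsigned int maxHits) {
    const int threadsPerBlock = 256;
    // Assumes the block count fits in the grid's x dimension.
    int blocks = (int)((len + threadsPerBlock - 1) / threadsPerBlock);
    findJpegHeaders<<<blocks, threadsPerBlock>>>(d_image, len, d_hitOffsets,
                                                 d_hitCount, maxHits);
    cudaDeviceSynchronize();
}
```

The kernel keeps almost no per-thread state and relies on sheer thread count rather than caching, which is exactly the kind of regular, data-parallel work the GPU's simple cores are built for.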

Key Terms in this Chapter

Single Instruction Multiple Thread (SIMT): SIMT is an approach to parallel computing where multiple threads execute the same computations on different data items.

Graphics Processing Unit (GPU): A GPU is a computing device that was traditionally designed specifically to render computer graphics. Modern GPU designs more readily support general computations.

File Carving: File carving is the process of extracting deleted files or file fragments from a disk image without reliance on filesystem metadata.

Single Instruction Multiple Data (SIMD): SIMD is an approach to parallel computing where multiple processors execute the same instruction stream but on different data items.

Digital Forensics: Digital forensics is the application of forensic techniques to the legal investigation of computers and other digital devices.

String Matching Algorithm: A string matching algorithm is a procedure for finding all occurrences of a string in a block of text.

Multicore CPU: A multicore CPU is a single-chip processor that contains multiple processing elements.

Multi-Pattern String Matching: A multi-pattern string matching algorithm is a procedure for finding all occurrences of any of a set of text strings in a block of text (a minimal code sketch follows this list).

Beowulf Cluster: A Beowulf cluster is a parallel computer built from commodity PC hardware.
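As noted in the multi-pattern string matching entry above, here is a minimal, deliberately naive sketch of that idea; the names multiPatternScan and Match are invented for this example and do not come from the chapter. Every offset in the buffer is tested against every pattern, which makes the definition concrete even though practical tools replace the inner loop with a single-pass structure such as an Aho-Corasick automaton.

```cpp
// Hypothetical sketch of multi-pattern string matching in its simplest form:
// report every (offset, pattern) pair at which some pattern occurs.
#include <cstring>
#include <string>
#include <vector>

struct Match {
    size_t offset;        // where in the data the pattern starts
    size_t patternIndex;  // which pattern matched
};

std::vector<Match> multiPatternScan(const unsigned char* data, size_t len,
                                    const std::vector<std::string>& patterns) {
    std::vector<Match> matches;
    for (size_t i = 0; i < len; ++i) {
        for (size_t p = 0; p < patterns.size(); ++p) {
            const std::string& pat = patterns[p];
            if (!pat.empty() && pat.size() <= len - i &&
                std::memcmp(data + i, pat.data(), pat.size()) == 0) {
                matches.push_back({i, p});
            }
        }
    }
    return matches;
}
```

The naive scan does work proportional to the buffer size times the number of patterns, which is precisely why forensic tools that search for thousands of keywords at once depend on true multi-pattern algorithms.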
