Article Preview
TopIntroduction
In C, C++, or Objective-C programs, there are two types of code files: header files and source files. Header files contain the declarations of classes, functions, global variables, and identifiers. Source files contain the implementations of classes and functions and the usages of variables and identifiers. The purpose of this organization is to separate program structure and interface from specific implementations (Stroustrup, 2013). On the other hand, software evolution is inevitable. Software products need to adapt to new requirements and new environments (Bendifallah, 1987; Gall et al., 1997; Kemerer & Slaughter, 1999; Lehman, 1980; Yu & Ramaswamy, 2009b; Yu & Ramaswamy, 2006). In this evolution process, both the program structure (interface) and the program implementation will need to be changed.
Because header files contain the program structure and interface, they are expected to be more robust than source files to requirement changes and environment changes. However, there has been no empirical research to validate this speculation. The objective of this study is to understand and compare the evolution of header files and source files. Using normalized compression distance that is derived from Kolmogorov complexity, we measure the difference of versions of header files and difference of versions of source files. Our study is performed on the source code of Apache HTTP, a web server written in C. Through this study, our objective is to answer the following two questions:
- 1.
In the source code of Apache HTTP, do header files and source files have similar paces of evolution?
- 2.
In the source code of Apache HTTP, is the evolution of header files correlated with the evolution of source files?
To answer these two questions, the remainder of the paper is organized as follows. Section 2 reviews Kolmogorov complexity and normalized compression distance. Section 3 describes our research method and research data. Section 4 presents the results of this study. Conclusions appear in Section 5.
TopKolmogorov Complexity And Normalized Compression Distance
In algorithmic information theory, the Kolmogorov complexity of an object, such as a string (a piece of text), is the length of the string’s shortest description in some fixed universal description language (Li & Vitányi, 2009). The Kolmogorov complexity of a binary string x can be represented by denoting the shortest length of string x described using a universal language.
Based on Kolmogorov complexity, Bennett et al. (1998) defined information distance between two binary strings x and y as the following.
(1)In Equation 1, denotes the conditional Kolmogorov complexity of string x given string y. Information distance measures the absolute distance between two strings x and y and thus does not reflect the relative difference between the two strings. Li et al. [10] then defined normalized information distance between two binary strings x and y.
(2)
Equation 2 is based on Kolmogorov complexity, which is non-computable in practice. To make the information distance more practical to use, Li et al. (2004) further substituted Kolmogorov complexity of a string x with the optimal compressed length of string x. The normalized compression distance between two binary strings x and y is defined below (Li et al., 2004).