Article Preview
Top1. Introduction
Till today, software theft has been causing serious damage to software industry. From the BSA global software survey 20161, 39% of software installed on computers in the world is not properly licensed. Also, violation of open source software (OSS) licenses, such as GPL2, by unexpected and unaware reuse of OSS source code3 has now become a serious problem for both software companies and OSS developers (Monden et al., 2011). Software birthmark methods have been proposed against such software theft to enable us to detect the theft (Tamada et al., 2004), (Tamada et al., 2005). A software birthmark is a set of characteristics which a program originally possesses. It is extracted from a binary code and used to evaluate the similarity between one program and another (extraction and comparison phases). Various types of birthmarks have been proposed, each focusing on different characteristics in a program. Different extraction methods and comparison methods have also been defined for each type of birthmark and have been evaluated according to those definitions.
Software birthmarks are designed to search of large amounts of software to detect suspected copies; hence, their use requires high-speed, large-volume software repository searches. However, the software birthmark has the one essential problem in the practical use case. That is, the scale of the target software was not assumed. Figure 1 illustrates the problem in use of the software birthmarks. The developer can examine for detecting the copy of p0 from the target set programs p1 to pn. However, the many unchecked programs are still existing in the Internet. The most important issue of the software theft is to detect suspected copies. The programs pn+1 to pn+m in the Figure 1 are never investigated because memory constraints, vast amount of time consumed for comparison, and the enormous computational complexity. However, almost programs are innocent and quite different. Therefore, to detect the software theft requires more simple and casual algorithm for huger target set.
Therefore, we proposed the method for the software birthmark procedure to narrow the defendants with compressing birthmark information and simplifies comparison algorithms. Figure 2 shows the difference of the conventional and the proposed birthmark procedures. Form Figure 2, we insert two phases, compression phase and pre-comparison phase, between the conventional phases. The compression phase compresses the birthmark information for the next phase. The pre-comparison phase compares compressed birthmarks by simple algorithm and computes similarity. Then, remains of the pre-comparison are still defendants, then, the remains are the inputs for the comparison phase.
This remainder of this paper is organized as following. Section 2 reviews the related works. Section 3 describes the proposed method and illustrates the novel procedure of the birthmark system. Section 4 represents the empirical studies of our method. Section 5 shows conclusion and some future works.
Figure 1. The problem of the current birthmark system
Figure 2. The difference between the conventional method and the proposed method