The Image as Big Data Toolkit: An Application Case Study in Image Analysis, Feature Recognition, and Data Visualization

Kerry E. Koitzsch (Kildane Software Technologies Inc., USA)
DOI: 10.4018/978-1-5225-3142-5.ch018

Abstract

This chapter is a brief introduction to the Image as Big Data Toolkit (IABDT), a Java-based open source framework for performing a variety of distributed image processing and analysis tasks. IABDT has been developed over the last two years in response to the rapid evolution of Big Data architectures and technologies and of distributed and image processing systems. The chapter presents an architecture for image analytics that uses Big Data storage and compression methods; IABDT, a sample implementation of this architecture, addresses some of the most frequent challenges experienced by image analytics developers. Baseline applications developed with IABDT, the current status of the toolkit, and directions for future extension, with emphasis on image display, presentation, and reporting case studies, are discussed to motivate the design and technology stack choices. Sample applications built using IABDT, as well as future development plans, are also discussed.
Chapter Preview

Overview

Rapid changes in the evolution of “Big Data” software techniques have made it possible to perform image analytics (the automated analysis and interpretation of complex semi-structured and unstructured data sets derived from computer imagery) with much greater ease, accuracy, flexibility, and speed than was previously possible, even with the most sophisticated and high-powered single computers or data centers. The “Big Data” processing paradigm, including Hadoop, Apache Spark, and distributed computing systems, has enabled a host of application domains to benefit from image analytics and the treatment of images as Big Data, including medical, aerospace, geospatial analysis, and document processing applications. However, several challenges exist when developing image analytic applications. Modular, efficient, and flexible toolkits are still in formative or experimental development. Integration of image processing components, data flow control, and other aspects of image analytics remains poorly defined and tentative. The rapid changes in Big Data technologies have made even the selection of a technology stack for building image analytic applications problematic. The need to solve these challenges has led us to develop an architecture and baseline framework implementation specifically to support Big Data image analytics.

In the past, low-level image analysis and machine learning modules have been combined within a computational framework to accomplish domain tasks. With the advent of distributed processing frameworks such as Hadoop and Apache Spark, it has become possible to build image frameworks that integrate seamlessly with other distributed frameworks and libraries, and in which the ‘image as Big Data’ concept is a fundamental principle of the framework architecture.

IABDT provides a flexible, modular, plug-in oriented architecture that makes it possible to combine many different software libraries, toolkits, systems, and data sources within one integrated, distributed computational framework. It is a Java- and Scala-centric framework, using both Hadoop and its ecosystem and the Apache Spark framework and its ecosystem to perform image processing and image analytics. IABDT may be used with NoSQL databases such as Neo4j or Cassandra, as well as with traditional relational database systems such as MySQL, to store computational results. Apache Camel and the Spring Framework may also be used as “glue” to integrate components with one another.

One of the motivations for creating IABDT is to provide a modular, extensible infrastructure for preprocessing, analysis, visualization, and reporting of analysis results, specifically for images and signals. Leveraging the power of distributed processing frameworks such as Apache Hadoop and Apache Spark, and inspired by toolkits such as BoofCV, HIPI, LIRE, Caliph, Emir, ImageTerrier, and Apache Mahout, IABDT provides frameworks, modular libraries, and extensible examples for performing Big Data analysis on images using efficient, configurable, and distributed data pipelining techniques.

Image as Big Data toolkits and components are becoming resources in an arsenal of other distributed software packages based on Apache Hadoop and Apache Spark as shown in Figure 1.

Figure 1.

Image as Big Data toolkits as distributed systems (Koitzsch, 2016)

Some potential modules being investigated as distributed module technologies in IABDT include:

Key Terms in this Chapter

Sensor Fusion: The combination of information from multiple sensors or data sources into an integrated, consistent, and homogeneous data model. Sensor fusion may be accomplished by a number of mathematical techniques, including Bayesian techniques.
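As a concrete illustration of the Bayesian side of fusion, the sketch below (hypothetical code, not part of IABDT; the class name is invented for illustration) fuses two independent Gaussian sensor readings of the same quantity by inverse-variance weighting, a standard closed-form Bayesian fusion technique:

```java
// Minimal sketch (not IABDT code): Bayesian fusion of two independent
// Gaussian measurements of the same quantity via inverse-variance weighting.
public class GaussianFusion {
    // Fuses (mean1, var1) and (mean2, var2); returns {fusedMean, fusedVar}.
    public static double[] fuse(double mean1, double var1,
                                double mean2, double var2) {
        double w1 = 1.0 / var1;               // precision of sensor 1
        double w2 = 1.0 / var2;               // precision of sensor 2
        double fusedVar = 1.0 / (w1 + w2);    // fused variance shrinks
        double fusedMean = fusedVar * (w1 * mean1 + w2 * mean2);
        return new double[] { fusedMean, fusedVar };
    }
}
```

For example, fusing readings of 10.0 and 12.0, each with unit variance, yields a fused estimate of 11.0 with variance 0.5: the combined estimate is more confident than either sensor alone.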

Experimental Metadata Based Image Retrieval (EMIR): Emir is short for Experimental Metadata based Image Retrieval; it uses the metadata created by Caliph for retrieval (Lux, 2015).

Deep Learning (DL): A branch of machine learning based on learning data representations and on algorithms that model high-level data abstractions. Deep learning uses multiple, complex processing levels and multiple nonlinear transformations. For a review of deep learning techniques, see Masters (2015) and Awad and Khanna (2015).

Read-Eval-Print Loops (REPLs): A read-eval-print loop (REPL), also known as an interactive top level or language shell, is a simple, interactive computer programming environment that takes single user inputs (i.e., single expressions), evaluates them, and returns the result to the user; a program written in a REPL environment is executed piecewise (Wikipedia, 2015). The term most often refers to programming interfaces similar to the classic Lisp machine interactive environment. Common examples include command-line shells and similar environments for programming languages; REPLs are particularly characteristic of scripting languages (Wikipedia, 2015).

Genetic Algorithms (GA Algorithms): See Koza (1992) for a thorough treatment of this class of algorithm.

Lucene Image Retrieval (LIRE): LIRE is an open source Java library that provides a simple way to retrieve images and photos based on color and texture characteristics (Lux, 2015). LIRE creates a Lucene index of image features for content-based image retrieval (CBIR) using local and global state-of-the-art methods (Lux, 2015). Easy-to-use methods for searching the index and browsing results are provided. LIRE is used successfully at WIPO, a United Nations agency, to search millions of trademark images, and by the Danish National Police to find similar scenes and to detect near duplicates (Lux, 2015).

Hadoop Image Processing Interface (HIPI): HIPI is an image processing library designed to be used with the Apache Hadoop MapReduce parallel programming framework. HIPI facilitates efficient and high throughput image processing with MapReduce style parallel programs typically executed on a cluster (University of Virginia Computer Graphics Lab, 2016). It provides a solution for how to store a large collection of images on the Hadoop Distributed File System (HDFS) and make them available for efficient distributed processing (University of Virginia Computer Graphics Lab, 2016). HIPI also provides integration with OpenCV, a popular open source library that contains many computer vision algorithms. The latest release of HIPI has been tested with Hadoop 2.7.1 (University of Virginia Computer Graphics Lab, 2016).

Machine Learning (ML): Machine Learning techniques may be used for a variety of image processing tasks including feature extraction, scene analysis, object detection, hypothesis generation, model building and model instantiation.

Image As Big Data (IABD): The IABD concept entails treating signals, images and video in the same way as any other source of “Big Data”, including the 4V conceptual basis of “variety, volume, velocity and veracity”. Special requirements for IABD include various kinds of automatic processing such as compression, format conversion and feature extraction.
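One of the simplest forms of the automatic processing mentioned above is feature extraction. The sketch below (illustrative only; `HistogramFeature` is a hypothetical class, not IABDT code) reduces an image to a compact grayscale-histogram feature vector using only the standard Java imaging API, the kind of per-image computation that an IABD pipeline would distribute across a cluster:

```java
import java.awt.image.BufferedImage;

// Illustrative sketch (not IABDT code): extract a coarse grayscale
// histogram from an image as a compact feature vector.
public class HistogramFeature {
    public static int[] grayHistogram(BufferedImage img, int bins) {
        int[] hist = new int[bins];
        for (int y = 0; y < img.getHeight(); y++) {
            for (int x = 0; x < img.getWidth(); x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int gray = (r + g + b) / 3;    // simple luminance estimate
                hist[gray * bins / 256]++;     // map 0..255 into the bins
            }
        }
        return hist;
    }
}
```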

Sparkling Water: Sparkling Water integrates H2O’s fast, scalable machine learning engine with Spark. Sparkling Water excels in leveraging existing Spark-based workflows that need to call advanced machine learning algorithms (Malohlava, Tellez, & Lanford, 2015). A typical example involves data “munging” with the help of the Spark API, in which a prepared table is passed to the H2O Deep Learning algorithm (Malohlava, Tellez, & Lanford, 2015). The constructed Deep Learning model estimates different metrics based on the testing data, which can be used in the rest of the Spark workflow (Malohlava, Tellez, & Lanford, 2015). More details can be found at https://h2o-release.s3.amazonaws.com/h2o/rel-slater/9/docs-website/h2o-docs/booklets/SparklingWaterVignette.pdf.

Ontology Driven Modeling: Ontologies as a description of entities within a model and the relationships between these entities may be developed to drive and inform a modeling process in which model refinements, metadata and even new ontological forms and schemas are evolved as an output of the modeling process.

Neural Net: Neural nets are mathematical models that emulate biological neural networks. Many types of distributed neural net algorithms are useful for image analysis, feature extraction, and two- and three-dimensional model building from images.

Apache Maven (maven.apache.org): Maven is a build automation tool for Java projects. Maven describes how the software is to be built and declares its dependencies. More details about Apache Maven can be found at maven.apache.org (Wikipedia, 2016).

BoofCV: An open source computer vision Java library intended for software developers. Gradle is the preferred way to build BoofCV. The manual, examples, and tutorials are available at http://boofcv.org/index.php?title=Manual.

Agency Based Systems: Cooperative multi-agent systems, or agencies, are an effective way to design and implement IABD systems. Individual agent node processes cooperate in a programmed network topology to achieve common goals. For a discussion of the theory of agency, see Minton (1993).

Classification Algorithm: Distributed classification algorithms within IABDT include large- and small-margin classifiers (a margin is the confidence level of a classification). A variety of techniques, including genetic algorithms, neural nets, boosting, and support vector machines (SVMs), may be used for classification. Distributed clustering algorithms such as standard k-means and fuzzy k-means are also included in standard support libraries such as Apache Mahout. K-means and fuzzy k-means algorithms are discussed in Bezdek and Pal (1992).
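To make the k-means iteration concrete, the following is a minimal single-node sketch on one-dimensional data (hypothetical code, not the Mahout implementation; distributed versions parallelize the assignment step across a cluster, but the core assign/update loop is the same):

```java
// Hedged sketch (not Mahout/IABDT code): single-node k-means on 1-D data,
// showing the assign/update loop that distributed versions parallelize.
public class KMeans1D {
    public static double[] cluster(double[] data, double[] initialCenters,
                                   int iterations) {
        double[] centers = initialCenters.clone();
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double x : data) {              // assignment step
                int best = 0;
                for (int c = 1; c < centers.length; c++)
                    if (Math.abs(x - centers[c]) < Math.abs(x - centers[best]))
                        best = c;
                sum[best] += x;
                count[best]++;
            }
            for (int c = 0; c < centers.length; c++)  // update step
                if (count[c] > 0) centers[c] = sum[c] / count[c];
        }
        return centers;
    }
}
```

With data {1, 2, 10, 11} and initial centers {0, 5}, the centers converge to 1.5 and 10.5 after a single iteration.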

Apache Mahout: A project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms, focused on collaborative filtering, clustering, and classification. These implementations use the Apache Hadoop platform (Wikipedia, 2016).

ImageTerrier: ImageTerrier is an open source, scalable, high-performance search engine platform for content based image retrieval applications. The ImageTerrier platform provides a comprehensive test-bed for experimenting with image retrieval techniques (The University of Southampton, 2011-2015). The platform incorporates a state-of-the-art implementation of the single pass indexing technique for constructing inverted indexes and is capable of producing highly compressed index data structures. ImageTerrier is written as an extension to the open source Terrier test-bed platform for textual information retrieval research (The University of Southampton, 2011-2015).

Common and Light-Weight PHoto Annotation (CALIPH): MPEG-7 image annotation and retrieval GUI tools (Lux, 2015). Caliph is short for Common And Light-weight PHoto annotation and provides a means to create an MPEG-7 XML-based description of a photo (Lux, 2015).

Bayesian Image Processing: Array-based image processing using Bayesian techniques typically involves constructing and computing with a Bayes network: a graph in which the nodes are treated as random variables and the edges are conditional dependencies, both standard concepts from fundamental Bayesian statistics. Image feature extraction is performed in the usual way, and Bayesian inferential processes are applied to the extracted feature data. Following Opper and Winther (Smola, Bartlett, Schölkopf & Schuurmans, 2001), Bayes-optimal prediction is an inference task of the kind described in that work: the goal is to predict the correct binary label y for each point x by maximizing its posterior probability given D_m, the training set used to train the classifiers, and the Bayesian ‘prior’ p, the probability distribution that expresses belief about the quantity before the evidence is accounted for.
Object hypotheses, prediction, and sensor fusion are typical problem areas for Bayesian image processing, and many serial and distributed versions of standard Bayesian algorithms, such as naive Bayes, Gaussian naive Bayes, multinomial naive Bayes, Averaged One-Dependence Estimators (AODE), and Bayesian belief networks (BBNs), have been implemented in mainstream toolkits such as MLlib, Mahout, and H2O.
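As a minimal example of the first algorithm in that list, the sketch below (illustrative only; not the MLlib or Mahout implementation, and the class name is invented) is a tiny two-class Bernoulli naive Bayes classifier with Laplace smoothing. It returns the label that maximizes the log prior plus the summed log likelihoods of the binary features, which is exactly the maximum-posterior prediction described above:

```java
// Illustrative sketch (not MLlib/Mahout code): two-class Bernoulli naive
// Bayes with Laplace smoothing over binary feature vectors.
public class NaiveBayes {
    // X: training vectors of 0/1 features; y: labels in {0, 1};
    // query: the feature vector to classify.
    public static int predict(int[][] X, int[] y, int[] query) {
        double bestScore = Double.NEGATIVE_INFINITY;
        int bestClass = 0;
        for (int c = 0; c <= 1; c++) {
            int n = 0;
            for (int label : y) if (label == c) n++;
            double score = Math.log((double) n / y.length);      // log prior
            for (int f = 0; f < query.length; f++) {
                int on = 0;                      // count feature f "on" in class c
                for (int i = 0; i < y.length; i++)
                    if (y[i] == c && X[i][f] == 1) on++;
                double p = (on + 1.0) / (n + 2.0);               // Laplace smoothing
                score += Math.log(query[f] == 1 ? p : 1.0 - p);  // log likelihood
            }
            if (score > bestScore) { bestScore = score; bestClass = c; }
        }
        return bestClass;
    }
}
```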

Distributed System: Software systems based on a message passing architecture over a networked hardware topology. Distributed systems may be implemented in part by software frameworks such as Apache Hadoop and Apache Spark.
