Migrating a Legacy Web-Based Document-Analysis Application to Hadoop and HBase: An Experience Report

Himanshu Vashishtha (University of Alberta, Canada), Michael Smit (University of Alberta, Canada) and Eleni Stroulia (University of Alberta, Canada)
DOI: 10.4018/978-1-4666-2488-7.ch010

Migrating a legacy application to a more modern computing platform is a recurring software-development activity. This chapter describes the authors’ experience with a contemporary rendition of this activity: migrating a Web-based system to a service-oriented application on two different cloud software platforms, Hadoop and HBase. Using the case study as a running example, they review the information needed for a successful migration and examine the trade-offs between development/re-design effort and performance/scalability improvements. The two levels of re-design, towards Hadoop and towards HBase, require notably different levels of effort and, as the authors found by exercising the migrated applications, achieve different benefits. Both redesigns substantially improved performance, and the additional effort required by the more complex migration notably improved the application’s ability to leverage the benefits of the platform.
Chapter Preview


Migrating applications to cloud-computing environments is a software-engineering activity attracting increasing attention, as cloud environments become more accessible and better supported. Such migrations pose questions regarding the changes necessary to the code and to the architecture of the original software system, the effort necessary to perform these changes, and the possible performance improvements to be gained by the migration. The software-development team undertaking a migration-to-the-cloud project needs to address the following questions.

  • What types of software (i.e., components and/or libraries) can developers expect when undertaking a migration project?

  • What modifications are typically required for the migrated application to better leverage the potential of the target cloud platform? What are the implications of the various platforms for the architectural and detailed design of the software deployed on them?

  • Will the particular software application benefit from its migration to a cloud environment? How might one assess the trade-off between the costs of the planned modifications and the improvements anticipated in the application post-migration?

The term cloud computing characterizes the perspective of end users, who are offered a service (which could be in the form of a computing platform or infrastructure) while being agnostic about its underlying technology. The implementation details of the service are abstracted away, and it is consumed on a pay-per-use basis, as opposed to being acquired as an asset. In principle, one distinguishes among three different types of cloud-based services. When infrastructure is offered as a service (IaaS), end users are able to procure virtualized hardware. When a software platform is offered as a service (PaaS), end users consume a software platform, i.e., a combination of an operating system, basic tools and libraries. Finally, when a software application is offered as a service (SaaS), end users consume as clients a specific application that is independently deployed and managed. Of course, these offerings can be combined into a stack of service offerings.

All three scenarios above promote improved scalability, albeit through different mechanisms. The first scenario eliminates the need for users to acquire, manage and replace hardware, since any number of appropriately configured virtual machines can be easily procured (and abandoned), for example, through Amazon Web Services. The second scenario promises improved scalability with novel tools and computational metaphors, such as those of the Hadoop ecosystem for storing and manipulating “big data.” Finally, when a software system is offered as a service, such as SalesForce, its consumers are offered state-of-the-art functionality that is regularly maintained and extended, with guaranteed quality, at negotiable costs.

In this chapter, we report on our experience migrating a legacy application, TAPoR, to take advantage of IaaS (using AWS) and PaaS (in two scenarios, Hadoop and HBase). The original version of TAPoR had severe performance limitations, and it was the promise of scalability through its migration “to the cloud” that motivated our study. The original application ran on a single machine, in a single thread, within a single process. Taking advantage of the IaaS model, we modified it to incorporate a load-balancing component that distributes incoming requests to multiple identical processes running on multiple virtual machines (Smit, Nisbet, Stroulia, Iszlai, & Edgar, 2009). This change, however, did not address the fundamental inability of the application to scale to large documents. To that end, we investigated an architectural shift to exploit the advantages of (two variants of) the Hadoop ecosystem as a platform. To summarize, we have performed three types of modification to the original system:

  • No architectural changes; deploy the software (with a load balancer) to multiple machines (on Amazon EC2, for instance);

  • Rearchitecting towards the MapReduce paradigm; modify the architecture and implementation to make use of the distributed computation features of Hadoop; and

  • Rearchitecting to use a NoSQL database; further change the implementation to also make use of HBase, the distributed database of the Hadoop ecosystem.
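
The second modification can be illustrated with a minimal sketch of the MapReduce paradigm. It is written in plain Python rather than against the Hadoop API, and it assumes word-frequency counting as a stand-in for a document analysis. The function names and the chunking by lines are hypothetical and are not taken from TAPoR. The point it shows is that each chunk of a large document is processed independently (map) before the partial results are merged (reduce), which is what allows the work to be spread over many machines.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def split_into_chunks(text: str, lines_per_chunk: int = 2) -> List[str]:
    """Split a document into chunks of whole lines.

    Hypothetical stand-in for the input splits a MapReduce framework creates.
    """
    lines = text.splitlines()
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def map_chunk(chunk: str) -> Counter:
    """Map step: count the words of one chunk, independently of all others."""
    return Counter(word.lower() for word in chunk.split())


def reduce_counts(partials: Iterable[Counter]) -> Counter:
    """Reduce step: merge the per-chunk counts into a single result."""
    return reduce(lambda left, right: left + right, partials, Counter())


def word_frequencies(text: str, lines_per_chunk: int = 2) -> Counter:
    """Run the map step on every chunk, then reduce the partial counts."""
    chunks = split_into_chunks(text, lines_per_chunk)
    return reduce_counts(map_chunk(chunk) for chunk in chunks)
```

In this sketch the map calls run one after another in a single process; on a platform such as Hadoop, each chunk would instead be handed to a different worker, and only the small per-chunk counts would need to be brought together.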
