Special Offers
- IGI Global’s New Emerging Topic e-Book Collections
  Acquire highly focused and affordable Cutting-Edge Peer-Reviewed Research Content through a selection of 17 topic-focused e-Book Collections discounted up to 90%, compared to list prices. Collection topics include Artificial Intelligence, Data Science, Language Learning, Marketing and Customer Relations, Sustainability, and many more. Hosted on the InfoSci^® platform, these collections feature no DRM, no additional cost for multi-user licensing, no embargo of content, full-text PDF & HTML format, and more.
  Learn More
- Open Access Book (Free Access) - Encyclopedia of Information Science and Technology, Sixth Edition (ISBN: 9781668473665)
  The Encyclopedia of Information Science and Technology, Sixth Edition) continues the legacy set forth by the first five editions by providing comprehensive coverage and up-to-date definitions of the most important issues, concepts, and trends pertaining to technological advancements and information management within a variety of settings and industries. The entire book is being published under open access.
  Read Now
- Open Access Book (Free Access) - Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries (ISBN: 9781668456293)
  Food Sustainability, Environmental Awareness, and Adaptation and Mitigation Strategies for Developing Countries provides information on the recent technology, mitigation, and environmental protection that must be applied for food sustainability in developing countries. This book is being published under Platinum Open Access through funding from Diponegoro University, Indonesia.
  Read Now
- Open Access Book (Free Access) - New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY (ISBN: 9781668438091)
  The Walmart Corporation and the Lumina Foundation have provided funding to make New Models of Higher Education: Unbundled, Rebundled, Customized, and DIY fully open access, completely removing any paywall between scholars in education and the latest research on new models for the future of higher education.
  Read Now
- Open Access Book (Free Access) - Handbook of Research on the Global View of Open Access and Scholarly Communications (ISBN: 9781799898054)
  Through a collaboration between IGI Global and the University of North Texas, the Handbook of Research on the Global View of Open Access and Scholarly Communications has been published as fully open access, completely removing any paywall between researchers of any field, and the latest research on the equitable and inclusive nature of Open Access and all of its complications.
  Read Now
Books
- - Books by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Books by Field
Journals
- - Journals
  - OnDemand Journal Articles
  - Journals by Subject
  - Business, Administration, & Management
  - Scientific, Technical, & Medical (STM)
  - Education
  - Journals by Field
e-Collections
Open Access
- View All Open Access Opportunities
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Find an Open Access Journal for Your Next Manuscript
  Search across all of IGI Global’s available open access publishing opportunities to unleash your research potential.
  Submit an Open Access Book Proposal
  Learn more about open access book publishing and how it can propel your research forward in the field.
  Convert Your Work to Open Access
  Already published? You can convert your work to open access to increase its impact through IGI Global’s Restrospective Open Access Program.
  Utilize Open Access Collection Database
  Open up your research potential by utilizing our open access content or integrating the open access collection into your library
  Consider Open Access Agreements
  For Libraries: consider no-cost or investment-level open access agreements with IGI Global to support your faculty's research endeavors.
  Search Funding Resources
  Looking for additional funding resources to support your open accesss endeavors? View industry resources compiled by our open access team.
  Review Open Access Policies & Ethical Guidelines
  Considering IGI Global to publish your work under open access? Review IGI Global’s open access policies and ethical guidelines
Publish with Us
Resources
- - Instructors
  - Course Adoption
  - Teaching Cases
  - K-12 Online Learning Collection
  - Authors and Editors
  - eEditorial Discovery^® System
  - Peer Review Process
  - Ethics and Malpractice
  - COPE Membership
  - Fair Use Policy
  - Open Access Publishing
  - FAQ
Catalogs
About Us
Newsroom

Scheduling of Extract, Transform, and Load (ETL) Procedures with Genetic Algorithm

Vedran Vrbanić, Damir Kalpić

Source Title: International Journal of Business Analytics (IJBAN) 2(3)

DOI: 10.4018/IJBAN.2015070103

OnDemand:

(Individual Articles)

Available

$37.50

Current Special Offers

No Current Special Offers

Abstract

A number of ETL procedures are used in the process of loading data to data warehouse systems. Some procedures can be executed concurrently in parallel mode, while for the others there are precedence constraints. Thus, the problem in scheduling procedures for execution is similar to the problem of scheduling of jobs in multiprocessor systems. The solution to this problem has been proposed in the optimum schedule of jobs minimizing the total execution time. When optimizing the schedule for ETL procedures, minimization of the total execution time is not the primary goal. Namely, the ETL procedures provide data required for reports aimed for business users and such reports need to be prepared until the user-defined deadlines. If the deadlines are not breached, the solution is satisfactory, regardless of the total execution time. Also, one cannot assume that all ETL processes are of the same importance – some have higher priorities than the others. That is the reason why prioritization and introduction of explicit bounds to completion time for individual ETL processes is attempted with genetic algorithm (GA). This paper encompasses implementation of the algorithm, experiments with different parameters and testing the quality of obtained solutions.

Article Preview

Top

1. Introduction

A scenario for optimum execution schedule of ETL procedures is of great significance for functioning of a real-world data warehouse (DWH) system (Vassiliadis, Simitsis, Georgantas, Terrovitis, & Skiadopoulos, 2005). Relational database management system is a powerful solution in storing and processing structured data and generating reports using complex queries, including features like transaction consistency, real-time writing and multi-table queries (Abdelhafez, 2014), but ETL process development constitutes the most expensive part of a data warehouse project, expressed both in consumption of time and resources. Principal reason for this is that in practice ETL processes have been traditionally designed with a specific vendor tool in mind, from the start of the data warehouse project. Unfortunately, commercial ETL tools are not easy to use, they have a steep learning curve, due to a lack of standard capabilities, e.g. they provide different underlying languages with a wide spectrum of functionality features or complex wizards (Akkaoui, Zimányi, Mazón, & Trujillo, 2013). During the first years of the 21^st century, the importance of ETL procedures for data integration tasks has become obvious since they are expensive, labour intensive, mission critical and therefore most important for success of any data warehousing project. The difficulty lies in the combination of data integration and data cleaning tasks where data are transferred from one data store to another, while the details of schema and value mappings play a great role (Anitha & Babu, 2014).

A DWH system is designed in such a way that data processing begins late at night or in early morning hours. In a perfect scenario case, complete data extraction process should be completed before the daily work starts, so the business users can always have the latest data at their disposal. This goal is sometimes reachable, and sometimes not, due to different limitations in the legacy information systems and alike, so that occasionally one has to improvise with imperfect solutions.

The term optimal schedule needs to be defined in the context of this problem. A DWH system consists of a certain number of data marts which are used as a background for the creation of corresponding reports (Bonifati, Cattaneo, Ceri, Fuggetta, & Paraboschi, 2011). Every report must be completed until its deadline. In this perspective, the best schedule is the one in which the minimum number of reports breaches user-defined deadlines. Total execution time does not affect the quality of the solution.

The reports differ in importance. E.g. if a bank’s DWH system is in question, the reports on bank portfolio structure which are sent to Central National Bank are of maximum priority because, in case of their delays, the bank would have to pay penalties. That is why the system must urge procedures which are a precondition for the creation of these kind of reports; even if that means that lower priority reports would be delayed. Regardless to the prioritization, explicit time limits must also be introduced. For example, if the Central National Bank requests data every day by 11 AM, then this condition must be specified within the genetic algorithm.

DWH systems fetch data from a number of legacy transaction systems. There are cases when some of these sources fail at the moment when the DWH system must start the data loading process. Data processing cannot be initiated until the data source is enabled. This is a real scenario and in that case the ETL procedure execution schedule is not optimal. But, the solution to this problem is fairly simple, because such cases can be predicted due to previous data source failures and one can create the optimal schedule for this scenarios using GA. If such a scenario happens, one simply should implement the adequate schedule which now does not contain procedures dependent on inoperable data sources. As those procedures are executed using shell scripts or Database Management System (DBMS) jobs, it is not difficult to use different schedules due to data source conditions.

Complete Article List

Search this Journal:

Reset

Volume 11: 1 Issue (2024)

Volume 10: 1 Issue (2023)

Volume 9: 6 Issues (2022): 4 Released, 2 Forthcoming

Volume 8: 4 Issues (2021)

Volume 7: 4 Issues (2020)

Volume 6: 4 Issues (2019)

Volume 5: 4 Issues (2018)

Volume 4: 4 Issues (2017)

Volume 3: 4 Issues (2016)

Volume 2: 4 Issues (2015)

Volume 1: 4 Issues (2014)

View Complete Journal Contents Listing

MLA

APA

Chicago

Export Reference

Scheduling of Extract, Transform, and Load (ETL) Procedures with Genetic Algorithm

Abstract

1. Introduction

Complete Article List