A Big Data Pipeline and Machine Learning for Uniform Semantic Representation of Data and Documents From IT Systems of the Italian Ministry of Justice

Beniamino Di Martino, Luigi Colucci Cante, Salvatore D'Angelo, Antonio Esposito, Mariangela Graziano, Fiammetta Marulli, Pietro Lupi, Alessandra Cataldi
Copyright: © 2022 | Pages: 31
DOI: 10.4018/IJGHPC.301579

Abstract

In this paper a Big Data Pipeline is presented, taking into consideration both structured and unstructured data made available by the Italian Ministry of Justice regarding its Telematic Civil Process. The complexity and volume of the data provided by the Ministry require the application of Big Data analysis techniques, in concert with Machine and Deep Learning frameworks, to be analysed correctly and to yield meaningful information that can support the Ministry in better managing Civil Processes. The Pipeline has two main objectives: to provide a consistent workflow of activities to be applied to the incoming data, aimed at extracting useful information for the Ministry's decision-making tasks; and to homogenize the incoming data, so that they can be stored in a centralized and coherent Data Lake used as a reference for further analysis and considerations.

Introduction

A data pipeline is a series of data processing steps built around three key elements: a source, a set of processing steps, and a destination (sometimes called a sink). Data pipelines enable the flow of data from a producer (e.g., an application) to a data warehouse, from a data lake to an analytics database, or into a payment processing system, for example.
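As a purely illustrative sketch (not taken from the paper), the three elements can be lined up in a few lines of Python: a source that yields raw records, a processing step that cleans them, and a sink that loads the result into a target store.

def source(records):
    """Source: yield raw records one at a time (e.g., rows read from an application)."""
    for record in records:
        yield record

def clean(records):
    """Processing step: normalize whitespace and drop empty records."""
    for record in records:
        record = record.strip()
        if record:
            yield record

def sink(records, destination):
    """Destination (sink): append processed records to a target store (here, a list)."""
    for record in records:
        destination.append(record)

if __name__ == "__main__":
    target = []                                    # stand-in for a warehouse table
    raw = ["  case 123 filed ", "", " hearing scheduled "]
    sink(clean(source(raw)), target)
    print(target)                                  # ['case 123 filed', 'hearing scheduled']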

Data Pipelines differ from more traditional ETL (Extract/Transform/Load) layers, since the latter refer to a specific type of data pipeline. ETL is the process of moving data from a source, such as an application, to a destination, usually a data warehouse, and has typically been used for batch workloads, especially at large scale. Data Pipelines, by contrast, also encompass streaming tools suited to real-time event data.
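The following minimal Python sketch, again only illustrative, contrasts the two styles: a batch ETL function that transforms and loads a complete extract in one pass, and a streaming step that handles each event as soon as it arrives.

def batch_etl(rows):
    """Batch ETL: transform the whole extract, then load it in one pass."""
    transformed = [row.upper() for row in rows]    # transform
    warehouse = list(transformed)                  # load into the destination
    return warehouse

def streaming_step(event, live_sink):
    """Streaming: transform and emit a single event as soon as it arrives."""
    live_sink.append(event.upper())

if __name__ == "__main__":
    print(batch_etl(["filed", "decided"]))         # whole batch at once
    sink = []
    for event in ["filed", "decided"]:             # events arriving over time
        streaming_step(event, sink)
    print(sink)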

Like many components of data architecture, data pipelines have evolved to support big data. The term “big data” implies that there is a huge volume to deal with, which opens opportunities for use cases such as predictive analytics, real-time reporting, and alerting, among many other applications.

Big data pipelines are data pipelines built to fit the three main traits of big data better than a traditional data pipeline: velocity, volume, and variety. The velocity of big data requires building streaming data pipelines, so that data can be captured and processed in real time and timely actions can follow. The volume of big data requires that data pipelines be scalable, since the volume can vary over time. In practice, many big data events are likely to occur simultaneously or very close together, so the big data pipeline must be able to scale to process significant volumes of data concurrently. The variety of big data requires that big data pipelines be able to recognize and process data in many different formats: structured, unstructured, and semi-structured.
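A hedged example of the variety aspect, with a hypothetical helper name, is the small dispatcher below: it tries to parse an incoming payload as structured JSON, then as semi-structured XML, and otherwise hands it on as unstructured text for downstream processing.

import json
import xml.etree.ElementTree as ET

def classify_and_parse(payload):
    """Hypothetical dispatcher: recognize structured (JSON), semi-structured (XML)
    and unstructured (plain text) payloads and route each to a suitable parser."""
    try:
        return "structured", json.loads(payload)
    except ValueError:
        pass
    try:
        return "semi-structured", ET.fromstring(payload)
    except ET.ParseError:
        pass
    return "unstructured", payload                 # raw text, left for downstream NLP

if __name__ == "__main__":
    samples = ['{"case_id": 42}', "<case id='42'/>", "Plain-text court order"]
    for sample in samples:
        kind, parsed = classify_and_parse(sample)
        print(kind, type(parsed).__name__)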

Furthermore, with the renewed interest in machine learning techniques and the rise of deep learning and deep neural networks, the relationship between intelligence-based analysis and information mining systems and the data itself has grown significantly. All the techniques for mining and analyzing data that fall within the sphere of so-called “artificial intelligence” (AI) are strongly linked to the universe of data, to the point of raising the need to analyze the relationships between AI and Big Data storage, such as Big Data pipelines and Data Lakes, and how these data storage systems can effectively support meaningful AI applications.

Data collection is a major bottleneck in current deep learning applications and an active research topic in multiple communities. There are largely two reasons why data collection has recently become a critical issue. First, as deep learning becomes more widely used, new applications do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs but in return may require larger amounts of labeled data. Interestingly, recent research on data collection comes not only from the machine learning, natural language processing, and computer vision communities, but also from the data management community, due to the importance of handling large amounts of data.

Data collection largely consists of data acquisition, data labeling, and the improvement of existing data or models. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

In this work, a Big Data Pipeline is proposed, built to extract, collect, and store both structured and unstructured data from real users’ data flows, in order to mine the data and extract new information supporting data producers and consumers, based on the exploitation of knowledge bases (e.g., domain ontologies) and machine and deep learning techniques.
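A minimal sketch of the kind of stage described above is given below; the LEGAL_ONTOLOGY dictionary and the annotate function are hypothetical stand-ins, not the actual knowledge base or code used in the paper, and serve only to show how a domain vocabulary can drive the semantic enrichment of an unstructured document.

# Hypothetical stand-ins: a toy domain vocabulary and an annotation step,
# not the actual ontology or code used in the paper.
LEGAL_ONTOLOGY = {
    "writ of summons": "ProceduralAct",
    "judgment": "Decision",
    "hearing": "ProceduralEvent",
}

def annotate(document):
    """Return the document together with the ontology concepts found in its text."""
    text = document.lower()
    concepts = {term: cls for term, cls in LEGAL_ONTOLOGY.items() if term in text}
    return {"text": document, "concepts": concepts}

if __name__ == "__main__":
    doc = "The judge scheduled a hearing after the writ of summons was filed."
    print(annotate(doc))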

The proposed pipeline was built around a real case study based on the big data made available by the Italian Ministry of Justice and related to a particular procedure known as the “Telematic Civil Process”. Indeed, the complexity and volume of the data provided by the Ministry require the application of Big Data analysis techniques, in concert with Machine and Deep Learning frameworks, to be correctly analysed and to obtain meaningful information that could support the Ministry itself in better managing Civil Processes.
