Techniques and Methods That Help to Make Big Data the Simplest Recipe for Success

Copyright: © 2019 |Pages: 34
DOI: 10.4018/978-1-5225-7609-9.ch006


Data analytics has grown up alongside machine learning. Whatever the purpose for which data is exploited, whether customer segmentation or marketing targeting, it must first be processed and represented as feature vectors. Many algorithms, including clustering, regression, and classification, need to be described and clarified in order to facilitate processing and statistical analysis. The previous chapters established the importance of big data analysis (the Why?); as with every major innovation, the greatest confusion lies in its exact scope (the What?) and its implementation (the How?). This chapter looks at the different analytics algorithms and techniques that can be used to exploit large amounts of data.
Chapter Preview


Our weakness forbids our considering the entire universe and makes us cut it up into slices.

Poincaré (1913, p. 1386)

With the widespread use of computers and the internet, the amount of publicly available data that can be analyzed has grown enormously. The data analyzed is no longer necessarily structured as in traditional analysis; it can now be text, images, multimedia content, digital traces, data from connected objects, and so on. With big data, a new object seems to have entered our lives: the algorithm. Yet algorithms have always existed: an algorithm is nothing more than a series of instructions for obtaining a result. What is new is the application of algorithms to gigantic masses of data.

More and more companies are turning to big data, that is, the analysis of ever larger volumes of data that poses a significant business challenge, in order to refine their business strategy. In addition, algorithms (complex equations programmed to run automatically on a computer in response to a specific problem), known only to their owners, now govern the operation of most social networks and websites.

This chapter presents a variety of methods and algorithms that can be adopted when working with big data. By opening the black box of algorithms and categorizing them, it helps you better understand what the algorithms do and how they work.


Big Data Analytics Coupled With Machine Learning Algorithms

The profitability of big data depends largely on a company's ability to analyze large amounts of data in order to generate useful information. How can it do so? The answer is “machine learning algorithms” (Sedkaoui, 2018a).

Born from pattern recognition, machine learning refers to all the approaches that give computers the ability to learn autonomously. These approaches go beyond strictly static programs in that they can predict and make decisions based on input data. They were first used in 1952 by Arthur Samuel, one of the pioneers of AI, for a game of checkers. Samuel defined machine learning as the field of study that gives computers the ability to learn without being explicitly programmed.

Tom Mitchell of Carnegie Mellon University proposed a more precise definition:

A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.

For example, suppose you are looking for a solution capable of estimating house prices. Mitchell's definition can be applied to this problem by identifying T, P, and E:

  • Task (T): Estimate the price of houses

  • Performance (P): The accuracy of the algorithm's predictions, that is, how close they are to the actual house prices

  • Experience (E): The descriptions of houses and their actual prices
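As a minimal sketch of how T, P, and E might map onto code, the following Python example fits a straight line to house prices. Everything in it is an illustrative assumption rather than the chapter's own method: the single feature (floor area), the synthetic data produced by the hypothetical make_houses helper, the least-squares model, and the choice of mean absolute error as the performance measure P.

```python
import random


def fit_line(areas, prices):
    """Task T: learn price = slope * area + intercept by least squares."""
    n = len(areas)
    mean_a = sum(areas) / n
    mean_p = sum(prices) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(areas, prices))
    var = sum((a - mean_a) ** 2 for a in areas)
    slope = cov / var
    intercept = mean_p - slope * mean_a
    return slope, intercept


def mean_absolute_error(model, areas, prices):
    """Performance P: average distance between predicted and actual prices."""
    slope, intercept = model
    errors = [abs(slope * a + intercept - p) for a, p in zip(areas, prices)]
    return sum(errors) / len(errors)


def make_houses(n, rng):
    """Experience E (synthetic, for illustration only): areas and noisy prices."""
    areas = [rng.uniform(50, 250) for _ in range(n)]
    prices = [1000 * a + 50000 + rng.gauss(0, 20000) for a in areas]
    return areas, prices


if __name__ == "__main__":
    rng = random.Random(0)
    train_areas, train_prices = make_houses(200, rng)
    test_areas, test_prices = make_houses(100, rng)
    # Give the learner more and more experience and measure P on unseen houses.
    for n in (3, 20, 200):
        model = fit_line(train_areas[:n], train_prices[:n])
        print(n, round(mean_absolute_error(model, test_areas, test_prices)))
```

With noisy data the error on unseen houses tends to shrink as the number of training examples grows, which is the sense in which the program "learns" under Mitchell's definition, although any single run may not improve at every step.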

The objective of machine learning is to learn from data, in other words, from real observations. As noted previously, these data can come from different sources and take different natures and forms. Depending on the case, they are more or less complex to analyze. To this end, algorithms aim to extract regularities that make learning possible.

For example, by analyzing the content of websites, search engines can determine which words and phrases are most important in characterizing a given web page, and they can use this information to return the most relevant results for a given search phrase (Witten et al., 2016).
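One common way to score how characteristic a word is of a page is TF-IDF weighting. The sketch below is an assumption chosen for illustration, not a description of how any particular search engine works; the tf_idf function name and the three sample pages are hypothetical.

```python
import math
from collections import Counter


def tf_idf(pages):
    """Return one {word: score} dictionary per page.

    A word scores high when it is frequent in the page (term frequency)
    but appears in few of the other pages (inverse document frequency).
    """
    tokenized = [page.lower().split() for page in pages]
    n_pages = len(tokenized)
    # Document frequency: in how many pages does each word occur?
    doc_freq = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_pages / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


if __name__ == "__main__":
    pages = [
        "big data analytics for big companies",
        "machine learning for data analytics",
        "checkers game played by a machine",
    ]
    for page_scores in tf_idf(pages):
        # Print the three highest-scoring words of each page.
        print(sorted(page_scores, key=page_scores.get, reverse=True)[:3])
```

A word that occurs in every page gets a score of zero under this scheme, so very common words contribute nothing to ranking.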

Another notable example of a machine learning application is IBM Watson, which saved the life of a woman dying from cancer (Bort, 2016). The Watson computer system analyzed her genomic sequence and found that she had two strains of leukemia rather than the one that had been diagnosed. This enabled a different and better-founded treatment.

Key Terms in this Chapter

Cluster Analysis: A statistical technique whereby data or objects are classified into groups (clusters) whose members are similar to one another but different from the data or objects in other clusters.

Supervised Learning: A supervised learning algorithm takes a known set of input data and the known responses to that data, and trains a model to produce reasonable predictions for the responses to new data. Supervised learning develops predictive models using classification and regression techniques.

Unsupervised Learning: Unsupervised learning identifies hidden patterns or intrinsic structures in the data. It is used to draw inferences from datasets consisting of input data without labeled responses.

Algorithm: A set of computational rules to be followed to solve a mathematical problem. More recently, the term has been adopted to refer to a process to be followed, often by a computer.

Big Data: A generic term that designates the massive volume of data that is generated by the increasing use of digital tools and information systems. The term big data is used when the amount of data that an organization has to manage reaches a critical volume that requires new technological approaches in terms of storage, processing, and usage. Volume, velocity, and variety are usually the three criteria used to qualify a database as “big data.”

Analytics: Has emerged as a catch-all term for a variety of different business intelligence (BI) and application-related initiatives. For some, it is the process of analyzing information from a particular domain, such as website analytics. For others, it is applying the breadth of BI capabilities to a specific content area (for example, sales, service, supply chain and so on). In particular, BI vendors use the “analytics” moniker to differentiate their products from the competition. Increasingly, “analytics” is used to describe statistical and mathematical data analysis that clusters, segments, scores and predicts what scenarios are most likely to happen. Whatever the use cases, “analytics” has moved deeper into the business vernacular. Analytics has garnered a burgeoning interest from business and IT professionals looking to exploit huge mounds of internally generated and externally available data.

Deep Learning: Also known as deep structured learning or hierarchical learning, deep learning is part of a broader family of machine learning methods based on learning data representations, as opposed to task-specific algorithms.

Regression: Regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (predictors).

Classification: In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.
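The cluster analysis and unsupervised learning entries above can be made concrete with a small sketch. The example below uses k-means on one-dimensional values, which is only one of many clustering methods; the function names, the deterministic choice of starting centroids, and the sample spending figures are assumptions made for illustration.

```python
def assign(values, centroids):
    """Put each value into the cluster of its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in values:
        nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
        clusters[nearest].append(v)
    return clusters


def k_means_1d(values, k, iterations=20):
    """Group unlabeled numbers into k clusters; no known responses are used."""
    ordered = sorted(values)
    # Deterministic start: k values spread evenly across the sorted data.
    step = (len(ordered) - 1) // max(k - 1, 1)
    centroids = [ordered[i * step] for i in range(k)]
    for _ in range(iterations):
        clusters = assign(values, centroids)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, assign(values, centroids)


if __name__ == "__main__":
    # Hypothetical monthly spend per customer, with no segment labels attached.
    monthly_spend = [12, 15, 14, 210, 190, 205]
    centroids, segments = k_means_1d(monthly_spend, k=2)
    print(centroids)
    print(segments)
```

Because the algorithm never sees a "correct" segment for any customer, the groups it returns reflect structure in the data alone, in contrast to classification, where category membership is known for the training set.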
