The Emerging Threats of Web Scrapping to Web Applications Security and Their Defense Mechanism

Rizwan Ur Rahman (Maulana Azad National Institute of Technology, Bhopal, India), Danish Wadhwa (JayPee University of Information Technology, Solan, India), Aakash Bali (JayPee University of Information Technology, Solan, India) and Deepak Singh Tomar (Maulana Azad National Institute of Technology, Bhopal, India)
Copyright: © 2020 | Pages: 22
DOI: 10.4018/978-1-5225-9715-5.ch053


Web scraping is a technique for automatically obtaining particular information from web applications instead of copying it manually. The purpose of a web scraper is to search for a certain class of information, extract it, and aggregate it into a new database. More precisely, web scrapers transform unstructured web data and store it in structured databases. Web scraping is a continuing threat to web applications because it can be used to steal sensitive data from a victim or from the applications themselves. The key objective of this article is to examine to what extent web scraping threatens web application security. The article explores the classification of web scraping, such as content scraping, web scraping, price scraping, and database scraping in general, and presents the most widely used scraping tools, such as Web Content Extractor and Screen Scrapper. Consequently, the aim of this article is to evaluate the vulnerabilities and threats of web scraping associated with web applications and the effective measures to counter them.
Chapter Preview

Web Scraping

Web scraping, also known as web harvesting and web data extraction, is used to extract data from websites on the World Wide Web. In other words, it can be defined as the process of extracting and combining content gathered from the web in a systematic manner (Vargiu & Urru, 2012).

Software applications are available for web scraping; they access the World Wide Web either directly over the Hypertext Transfer Protocol or through a web browser. Web scraping can also be done manually by a user, but it is preferably automated using a bot or web crawler. In this case, a piece of software, also known as a web robot, mimics the way a human browses the web in a conventional web traversal.

This robot may gather data from as many websites as needed. It parses the contents to find and fetch the required data and stores the data in the desired structures.

In general, web scraping is somewhat similar to copying: particular data is collected from the Internet and copied into a manageable and readable storage structure such as a spreadsheet or database.

In the first step of this process, the web page is downloaded, or fetched (as happens whenever a browser opens a page), and saved for later use; the data is then extracted from it. Hence, web crawling is an important component of the process.

In the second step, the content of the page is parsed, searched, or reformatted so that the required data can be identified and copied into a spreadsheet or database. Web scraping software may sometimes take only a part of the page, which can then be used for some other purpose.
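The two steps described above (fetch the page, then parse it and store the result in a structured form) can be illustrated with a minimal Python sketch that uses only the standard library. This is not the chapter's own implementation: the names `fetch_page`, `ProductParser`, and `scrape_to_csv`, the sample markup, and the `product`/`name`/`price` class names are all assumptions made for illustration.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import urlopen


def fetch_page(url: str) -> str:
    """Step 1: download the page over HTTP, much as a browser would."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class ProductParser(HTMLParser):
    """Step 2: pull (name, price) pairs out of the markup.

    Assumes a hypothetical layout such as
    <li class="product"><span class="name">..</span><span class="price">..</span></li>
    """

    def __init__(self):
        super().__init__()
        self.rows = []        # extracted (name, price) tuples
        self._field = None    # field currently being read, if any
        self._current = {}    # fields collected for the current item

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "li" and cls == "product":
            self._current = {}
        elif tag == "span" and cls in ("name", "price"):
            self._field = cls

    def handle_data(self, data):
        # Text is kept only while inside a name or price span.
        if self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li":
            if "name" in self._current and "price" in self._current:
                self.rows.append((self._current["name"], self._current["price"]))
            self._current = {}


def scrape_to_csv(html: str) -> str:
    """Parse the page and return the extracted rows as CSV text."""
    parser = ProductParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "price"])
    writer.writerows(parser.rows)
    return out.getvalue()


# Hypothetical page content, used here in place of a live download.
SAMPLE_HTML = """
<ul>
  <li class="product"><span class="name">Kettle</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Toaster</span><span class="price">24.50</span></li>
</ul>
"""

if __name__ == "__main__":
    print(scrape_to_csv(SAMPLE_HTML))
```

In practice the string passed to `scrape_to_csv` would come from `fetch_page`, and the CSV text could be written to a file or loaded into a database. The sketch parses an inline sample so that it runs without network access.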

Web scraping is used in many areas today: in advertising and marketing, generally through contact scraping; as an important part of applications built for data mining and web mining; for price comparison, online price change monitoring, and weather data monitoring; in research; and for providing services whose content is drawn from more than one source, also known as web mashups, for instance the trivago and mybestprice applications.

Basically, these web scrapers are APIs that are used to extract data from a web page or website on the Internet. Also, some big companies like Amazon Web Services and Google provide web scraping tools free of cost to end users.

Key Terms in this Chapter

News Scraping: It is a process of scraping news from newspaper websites.

Database Scraping: It is a process of directly extracting data from a database.

Article Scraping: It is a process of scraping articles from blogs or websites.

Content Scraping: It is a process of lifting displayed content from various websites and using it elsewhere or displaying it on other websites.

Data Scraping: It is a process used to extract massive amounts of data from websites, in which the data is stored on a local computer system or in a structured database.

Price Scraping: It is a process of extracting or collecting the prices of various items from e-commerce sites on the Internet without consent.

Web Scraping: The process of extracting data from websites in a systematic manner.

Email Harvesting: The mechanism to obtain a large number of email addresses using different methods or techniques.

