Approximation of Hate Detection Processes in Spanish and Other Non-Anglo-Saxon Languages

Copyright: © 2023 |Pages: 16
DOI: 10.4018/978-1-6684-8427-2.ch005

Abstract

In this chapter, the authors present how artificial intelligence (AI) can help to identify and reduce the new digital crimes related to hate messages. The arrival of the internet in our lives, at the end of the last century, brought a great technological advance, providing easier access to a huge volume of information and easier communication between people. Communication-oriented networks have grown to the point of creating true digital environments, the so-called social networks, with millions of users all over the planet. This has changed, to a large extent, our personal relationships and, unfortunately, has opened new ways of sending hate messages. The work presented is a digital tool built for the automatic detection of hate (and non-hate) messages in Spanish and other non-Anglo-Saxon languages, using AI algorithms trained on Spanish-language data.

Background

The advent of the Internet has enabled the spread of social networks in all countries, particularly non-English-speaking countries. The Ethnologic study by Pereltsvaig (2020) gives the following ranking of languages by number of speakers, counting mother tongue only: Mandarin Chinese (917.8 million); Spanish (460.1 million); English (379 million); Hindi-Urdu (341.2 million); Arabic (280 million); Bengali (228.3 million); Portuguese (220.7 million); Russian (153.7 million); Japanese (128.2 million); Punjabi (92.7 million).

However, when second-language speakers are also counted, the ranking of the most spoken languages in the world becomes: English (1.268 billion); Mandarin Chinese (1.12 billion); Hindi (637 million); Spanish (537 million); French (280 million); Arabic (274 million); Bengali (265 million); Russian (258 million); Portuguese (252 million); Indonesian (199 million).

While English is the most widely spoken, many other languages also have very large populations of speakers. The following ranking considers only the languages used on the Internet: English (25.3%); Mandarin Chinese (19.8%); Spanish (8%); Arabic (4.8%); Portuguese (4.1%); Indonesian (4.1%); Japanese (3%); Russian (2.8%); French (2.8%); German (2.2%); Other (23.1%).

Another significant factor is that, thanks to automatic translators, many news items are now available in translation into other languages. The most translated languages are:

  1. English

  2. Spanish

  3. Chinese

  4. French

  5. German

This ranking differs slightly from the previous one. For example, French is in fourth place. This is not surprising given that French is an official language of the European Union, the United Nations and the International Court of Justice.

At the European level, in addition to English, languages such as Spanish, French, German, Portuguese and Italian are widely used in social networks and by the media. Another important factor is the ease with which news and comments can be translated online, using tools such as DeepL, Google Translate, WordReference and Bing Translator.

Social networks are a powerful tool for connecting with people around the world and sharing information and opinions in real time. However, a number of problems have been associated with the overuse of social networks in non-English-speaking countries. Some of the most common are:

  • Spread of false and misleading news: The circulation of false and misleading news through social media is a common problem in many non-English-speaking countries. A UNESCO report notes that the spread of false information on social media has been particularly problematic during the COVID-19 pandemic in countries like Brazil, Mexico, India, and Nigeria (FreedomHouse, 2020; UNESCO, 2020).

  • Threats to online privacy and security: social media also presents significant challenges in terms of online privacy and security. In many non-English speaking countries, regulation of social media is inadequate, and technology companies have little legal liability for misuse of users' personal data. This has led to several cases of data breaches and account hacks in countries like Brazil and Mexico (Kirchgaessner, 2022).

  • Censorship and government restrictions: Many non-English speaking countries have restrictive government regulations regarding social media. In China, for example, authorities have blocked access to several Western social media platforms, such as Facebook and Twitter. In other countries, such as Turkey and India, restrictions have been imposed on social media in response to civil unrest and protests.

  • Difficulties accessing content in other languages: In many non-English speaking countries, access to content in other languages is limited due to language barriers. This can limit the ability of social media users to interact with people and communities in other countries and perpetuate online fragmentation.

Key Terms in this Chapter

NIS Directive: European Union. “Directive (EU) 2016/1148 of the European Parliament and of the Council, of July 6, 2016”. Official Journal of the European Union.

TN: This means true negative.

FN: This means false negative.

Recall: Recall, also known as sensitivity, is defined as the ratio of true positives (TP) over all actual positives (TP + FN). That is, it measures the ability of the model to detect all the real positives.

Confusion Matrix: This compares the predictions made by the model with the actual results, showing how many times the model is correct and how many times it is wrong in each of the classes being evaluated. In technical terms, a confusion matrix shows good results when high precision and sensitivity are observed in the classification of the data. That is, when most of the predictions made by the model are correct and there is a minimum number of false positives and false negatives.

Precision: Precision is defined as the ratio of true positives (TP) over all predicted positives (TP + FP). That is, it measures the accuracy of positive predictions.

F1-Score: The F1-score is a measure that combines precision and recall and is defined as the harmonic mean of the two values.

TP: This means true positive.

Accuracy: Accuracy is the proportion of all predictions that are correct. If a binary model has an accuracy of 90%, it means that 90% of the predictions are correct. If it is assumed that the model has a balanced performance between precision and recall, then precision and recall should also be close to 90%. In this case, the F1-score is also at 90%, as can be seen in the previous table.

FP: This means false positive.

NLP: Natural Language Processing is a subfield of artificial intelligence (AI) that focuses on the interaction between computers and human language. The goal of NLP is to enable computers to understand, interpret, and generate human language in a way that is useful to humans. NLP involves a wide range of techniques and methods, including machine learning, statistical analysis, and computational linguistics.
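The metric definitions among the key terms above (confusion matrix, accuracy, precision, recall and F1-score) can be checked with a short sketch. This is only an illustration and not the chapter's implementation: the class name `ConfusionMatrix`, the helper `from_labels`, the label convention (1 = hate, 0 = non-hate) and the example counts are all assumptions made here. The counts are chosen so that a balanced model with 90% accuracy also shows precision, recall and F1-score of about 90%.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts for a binary classifier (hypothetical helper, illustration only)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives

    def accuracy(self) -> float:
        # Share of all predictions that are correct.
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        # TP over all predicted positives (TP + FP).
        predicted_positive = self.tp + self.fp
        return self.tp / predicted_positive if predicted_positive else 0.0

    def recall(self) -> float:
        # TP over all actual positives (TP + FN).
        actual_positive = self.tp + self.fn
        return self.tp / actual_positive if actual_positive else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall,
        # algebraically equal to 2*TP / (2*TP + FP + FN).
        denominator = 2 * self.tp + self.fp + self.fn
        return 2 * self.tp / denominator if denominator else 0.0


def from_labels(y_true: Sequence[int], y_pred: Sequence[int]) -> ConfusionMatrix:
    """Build the matrix from labels, assuming 1 = hate and 0 = non-hate."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return ConfusionMatrix(tp=tp, fp=fp, fn=fn, tn=tn)


if __name__ == "__main__":
    # Made-up balanced example: 100 messages, 90 classified correctly.
    cm = ConfusionMatrix(tp=45, fp=5, fn=5, tn=45)
    print(f"accuracy={cm.accuracy():.2f} precision={cm.precision():.2f} "
          f"recall={cm.recall():.2f} f1={cm.f1():.2f}")
```

With unbalanced counts the four values diverge, which is why accuracy alone can be misleading when hate messages are a small minority of the data.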
