About the Project

Project description

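This project stems from the thesis submitted for the professional degree (Título Profesional) in Systems Engineering, titled «Comparación de Modelos de Clasificación para detectar Ciberbullying en Twitter en el Lenguaje Español-Peruano» ("A Comparison of Classification Models to Detect Cyberbullying in the Peruvian Spanish Language on Twitter").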

This research addresses cyberbullying detection on Twitter, focusing on Peruvian Spanish, by creating a dataset and comparing four machine learning models. With a comprehensive approach, it seeks to overcome the lack of studies focused on the linguistic particularities of Latin America and the Caribbean (LAC). It aims to provide results that contribute to developing more effective strategies to combat cyberbullying in the region.

Background

In an increasingly digitalized context, cyberbullying has emerged as a significant social concern affecting multiple groups and communities. In LAC, the scarcity of research and the complexity of the regional language pose additional challenges for detecting and preventing it. This project responds to that gap: it seeks to understand the linguistic and cultural peculiarities that shape the dynamics of cyberbullying in Peru, in order to develop effective countermeasures and promote safer and healthier digital environments.

A Comparison of Classification Models to Detect Cyberbullying in the Peruvian Spanish Language on Twitter

This study compares four machine-learning models for cyberbullying detection on Twitter, specifically focusing on Peruvian Spanish. It uses a specially designed dataset for training these models and presents detailed results on each model's performance and ability to identify cyberbullying.


Manual: Creation and Validation of a Dataset for the Detection of Cyberbullying in the Peruvian Spanish Language

This manual addresses the linguistic challenges associated with creating data to address social problems in the LAC digital environment, particularly in Peruvian Spanish. It describes in detail the process of creating and validating a specific dataset for this task, facing the complexities of the regional language. In addition, it provides a thorough explanation of data pre-processing using natural language processing techniques to improve detection efficiency.

General Objective

Create resources and evaluate machine learning models to contribute to improving cyberbullying detection in the LAC digital environment.

Specific Objectives

1. Develop a detailed manual describing the process of creating and validating a specific dataset for cyberbullying detection in Peruvian Spanish, considering the linguistic and social complexities of the region.

2. Evaluate the performance of four machine learning models in cyberbullying detection on Twitter, using a dataset designed to adequately reflect the characteristics of language and online interactions in LAC.

3. Provide detailed results on the effectiveness of the evaluated models, highlighting their strengths and limitations. The aim is to inform and guide future research and intervention strategies on cyberbullying in the region.

Findings and Results Achieved

A Comparison of Classification Models to Detect Cyberbullying in the Peruvian Spanish Language on Twitter

  • The study addressed the lack of research on cyberbullying in LAC, focusing on Peruvian Spanish and comparing four machine learning models for its detection on Twitter.
  • The analyses revealed that machine learning models based on semantic representation outperformed those based on syntax, highlighting the importance of understanding the context and meaning of language in detecting cyberbullying. 
  • Exploring the impact of emoticons and local slang on cyberbullying detection opens new avenues for research and technological development. These considerations enrich our understanding of online behavior and can guide the design of future tools and policies to address digital violence more effectively.
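
As an illustration of how classification models can be compared on a shared test set, here is a minimal Python sketch. The metrics used in the study are not stated on this page; precision, recall, and F1 are assumed here as common choices for a binary task. The model names and label lists are hypothetical toy values, not results from the study.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy gold labels (1 = cyberbullying, 0 = not); illustrative only.
y_true = [1, 1, 1, 0, 0, 0]

# Hypothetical predictions from two placeholder models.
predictions = {
    "model_a": [1, 1, 0, 0, 0, 1],
    "model_b": [1, 1, 1, 0, 0, 1],
}

scores = {name: precision_recall_f1(y_true, y_pred)
          for name, y_pred in predictions.items()}

# Rank the models by F1, highest first.
ranking = sorted(scores, key=lambda name: scores[name][2], reverse=True)

for name in ranking:
    p, r, f1 = scores[name]
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Scoring every model on the same held-out labels keeps the comparison fair. F1 is a reasonable headline number when the positive class is rare, as is often the case with abusive content.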

Manual: Creation and Validation of a Dataset for the Detection of Cyberbullying in the Peruvian Spanish Language

  • A specific dataset was created for cyberbullying detection in Peruvian Spanish, representing a significant advance in the availability of resources to address this problem in LAC.
  • The dataset's content was validated by domain experts through a web application. This process helps ensure the quality and relevance of the data used to train the models.
  • The linguistic challenges specific to Peruvian Spanish were recognized, enabling the creation of a dataset that accurately captures the nuances of regional language and online interactions.
  • Advanced natural language processing techniques were applied to pre-process the data, improving the effectiveness of cyberbullying identification. This approach helps strengthen cyberbullying detection and prevention strategies in the region.
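
To make the pre-processing step concrete, here is a minimal Python sketch of tweet normalization. The manual's actual steps are not listed on this page, so every rule below is an assumption: lowercasing, URL and mention placeholders, hashtag unwrapping, and collapsing elongated characters. The function name `preprocess` and the placeholder tokens `<url>` and `<user>` are hypothetical. Emoticons and punctuation runs are kept as tokens rather than discarded, since they may carry signal.

```python
import re
import unicodedata

# Assumed patterns, for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # a character repeated 3+ times
# Placeholders, words, or runs of symbols (emoticons, punctuation).
TOKEN_RE = re.compile(r"<\w+>|\w+|[^\w\s]+")


def preprocess(tweet: str) -> list[str]:
    """Normalize a tweet and split it into tokens."""
    text = unicodedata.normalize("NFC", tweet).lower()
    text = URL_RE.sub(" <url> ", text)       # hide concrete links
    text = MENTION_RE.sub(" <user> ", text)  # hide user handles
    text = text.replace("#", " ")            # keep the hashtag word only
    text = REPEAT_RE.sub(r"\1\1", text)      # "tontooooo" -> "tontoo"
    return TOKEN_RE.findall(text)


tokens = preprocess("@juan eres un tontooooo!!! :( https://t.co/abc #Lima")
print(tokens)
# ['<user>', 'eres', 'un', 'tontoo', '!!', ':(', '<url>', 'lima']
```

Accented characters and "ñ" are left untouched, because stripping them can change the meaning of Spanish words. Whether to do so is a design choice that depends on the dataset.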

Impact and Conclusions

The findings of this research contribute to the understanding of, and response to, cyberbullying in the LAC context.

The manual on creating and validating a specific dataset for cyberbullying detection in Peruvian Spanish provides a fundamental tool to address this problem in the region, overcoming linguistic challenges and facilitating more effective intervention strategies. 

In turn, the comparison of machine learning models for detecting cyberbullying on Twitter highlights the importance of research specific to LAC, demonstrating the effectiveness of these models and pointing out areas for improvement, such as accounting for emoticons and local slang.

Together, these projects contribute to a more holistic and culturally sensitive approach to tackling cyberbullying in the region, promoting digital safety for all.
