Sensitive data identification and classification using natural language processing (NLP) techniques

Authors

DOI:

https://doi.org/10.23925/ddem.v.3.n.12.68378

Keywords:

Machine Learning (ML), Data Classification, Naive Bayes, Random Forest, Confusion Matrix

Abstract

This study investigates the application of Natural Language Processing (NLP) and Machine Learning (ML) techniques to the identification and classification of sensitive data, with an emphasis on compliance with Brazil's General Data Protection Law (Lei Geral de Proteção de Dados, LGPD). The process comprises preprocessing of the textual data, vectorization with TF-IDF, and the implementation of Naive Bayes and Random Forest classifiers, with hyperparameter optimization using Grid Search. Model performance is evaluated using accuracy, the confusion matrix, and the ROC curve. The proposed approach aims to assist companies in protecting and managing data in compliance with the privacy and security requirements set by the legislation.
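
The pipeline described in the abstract can be illustrated with a minimal, dependency-free sketch. The study's own code and dataset are not reproduced on this page, so everything below is an assumption made for illustration: the toy corpus, the labels, the function names, the smoothed IDF formula and the smoothing constant `alpha`. Only the preprocessing, TF-IDF, Naive Bayes and confusion-matrix steps are shown; Random Forest, Grid Search and the ROC curve are left out.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical toy corpus (not from the study). Label 1 = sensitive, 0 = not sensitive.
TRAIN = [
    ("patient diagnosis and health record attached", 1),
    ("customer cpf number and home address", 1),
    ("biometric data and health insurance number", 1),
    ("meeting agenda for the quarterly review", 0),
    ("office lunch menu for friday", 0),
    ("quarterly sales report and meeting notes", 0),
]
TEST = [
    ("health record with cpf number", 1),
    ("agenda for friday meeting", 0),
]


def tokenize(text):
    """Preprocessing: lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    """TF-IDF weights of one document; terms unseen in training are dropped."""
    counts = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes fitted on TF-IDF weights with additive smoothing."""
    weight_sums = defaultdict(lambda: defaultdict(float))
    class_counts = Counter(labels)
    for vec, y in zip(vectors, labels):
        for term, w in vec.items():
            weight_sums[y][term] += w
    model = {}
    for y, n_docs in class_counts.items():
        total = sum(weight_sums[y].values()) + alpha * len(vocab)
        log_prior = math.log(n_docs / len(labels))
        log_like = {t: math.log((weight_sums[y][t] + alpha) / total) for t in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model, vec):
    """Return the class with the highest log-posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(sorted(model), key=score)


def confusion_matrix(y_true, y_pred):
    """2x2 matrix indexed as matrix[true_label][predicted_label]."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


train_docs = [tokenize(text) for text, _ in TRAIN]
train_labels = [label for _, label in TRAIN]
idf = fit_idf(train_docs)
model = train_nb([tfidf(d, idf) for d in train_docs], train_labels, list(idf))

test_labels = [label for _, label in TEST]
predictions = [predict(model, tfidf(tokenize(text), idf)) for text, _ in TEST]
matrix = confusion_matrix(test_labels, predictions)
accuracy = sum(t == p for t, p in zip(test_labels, predictions)) / len(test_labels)

print("predictions:", predictions)
print("confusion matrix:", matrix)
print("accuracy:", accuracy)
```

On a corpus this small the numbers carry no statistical meaning; the sketch only shows how the stages connect. In practice a library implementation of the vectorizer and classifiers, together with a held-out test set, would be the usual choice.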

Author Biographies

Eric Henrique da Silva Passos, Universidade de São Paulo - USP, São Paulo, SP

MBA (postgraduate course) in Data Science and Analytics, USP/ESALQ, 2024. Bachelor of Laws.

Lisleandra Machado, IFSUDESTEMG

Professor and researcher at CNPq, FAPEMIG, FUNDEP and CAPES. She holds degrees in Law, Business Administration, Production Engineering and Pedagogy, a Master's degree in Production Engineering from UFSC (Federal University of Santa Catarina) and a PhD in Production Engineering from UNIMEP. She is a professor at the Federal Institute of Education, Science and Technology of Southeast Minas Gerais, Juiz de Fora, MG, where she currently coordinates the undergraduate course in Railway and Metro Engineering. She has in-depth knowledge of Data Science and Analytics and of Digital Business (Business Intelligence). Since 2002, she has been an ad hoc evaluator of undergraduate courses (INEP/MEC).

Domingos Sávio da Cunha Garcia, Universidade Estadual de Campinas - UNICAMP, Campinas, SP

PhD in Economic History from UNICAMP.

Leonardo Amorim de Araújo, Universidade Federal do Rio de Janeiro - UFRJ - Rio de Janeiro, RJ

PhD in Transportation Engineering from UFRJ.

Samuel Alves de Freitas, Universidade Federal de São João del Rei, Campus Santo Antônio, São João del Rei, MG

Master's degree in Mechanical Engineering from UFSJ.

Ana Paula Lima dos Santos, Universidade Federal de São João del Rei, Campus Santo Antônio, São João del Rei, MG

Master's degree in Electrical Engineering from UFSJ.

Gustavo José Santiago Rosseti, Universidade Federal de Juiz de Fora, Juiz de Fora, MG, Brasil

PhD in Electrical Engineering from UFJF.

Silvana Rodrigues Pires Moreira, Universidade Federal de Viçosa - UFV - Viçosa, MG

PhD in Agricultural Biochemistry from UFV.

Published

2024-12-19