MDLText and Semantic Indexing Applied to Spam Detection in YouTube Comments

Renato Moraes Silva, Túlio C. Alberto, Tiago A. Almeida, Akebo Yamakami

Abstract


Many YouTube users produce content regularly and make it their main livelihood. This success, however, has attracted malicious users, who post unwanted comments to promote themselves or to spread malicious links. In this scenario, traditional text categorization methods can be limited by characteristics inherent to the problem: (1) comments are usually short and poorly written, and (2) the classification problem is naturally online. This article evaluates a classification method based on the minimum description length (MDL) principle and compares its results with those of traditional online learning methods. It also proposes an ensemble technique that combines the classification methods with different natural language processing techniques. The experiments were carefully conducted, and the statistical analysis of the results indicates that the proposed technique outperformed the same methods applied to the original comments alone.

Keywords


machine learning; text categorization; minimum description length principle; YouTube
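
To make the idea in the abstract concrete, the sketch below shows a toy online classifier that assigns a comment to the class able to encode it in the fewest bits. It is only an illustration under simplifying assumptions, not the MDLText method evaluated in the article: the class name `ToyMDLClassifier`, the tokenizer, the Laplace-smoothed unigram model, and the sample comments are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


class ToyMDLClassifier:
    """Toy online classifier (hypothetical, not MDLText): choose the class
    whose term statistics encode the comment with the fewest bits."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # class -> term frequencies
        self.total_terms = Counter()             # class -> tokens seen so far
        self.vocabulary = set()                  # all terms seen in any class

    @staticmethod
    def tokenize(text):
        # Assumption: lowercase alphanumeric tokens are enough for a demo.
        return re.findall(r"[a-z0-9]+", text.lower())

    def description_length(self, tokens, label):
        # Bits needed to encode the tokens under a Laplace-smoothed unigram
        # model of `label`; the extra +1 reserves mass for unseen terms.
        denom = self.total_terms[label] + len(self.vocabulary) + 1
        return sum(
            -math.log2((self.term_counts[label][t] + 1) / denom)
            for t in tokens
        )

    def predict(self, text):
        # Returns None until at least one labeled comment has been seen.
        if not self.term_counts:
            return None
        tokens = self.tokenize(text)
        return min(
            self.term_counts,
            key=lambda label: self.description_length(tokens, label),
        )

    def update(self, text, label):
        # Online step: fold one labeled comment into the class statistics.
        tokens = self.tokenize(text)
        self.term_counts[label].update(tokens)
        self.total_terms[label] += len(tokens)
        self.vocabulary.update(tokens)


if __name__ == "__main__":
    # Made-up stream of (comment, label) pairs, processed one at a time:
    # predict first, then learn from the true label.
    stream = [
        ("check out my channel and subscribe", "spam"),
        ("great song, love the guitar solo", "ham"),
        ("free gift cards, click this link", "spam"),
        ("this video made my day", "ham"),
    ]
    clf = ToyMDLClassifier()
    for comment, label in stream:
        print(clf.predict(comment), "->", label)
        clf.update(comment, label)
```

Because each comment is classified before its label is used for the update, the loop mirrors the online setting the abstract describes, in which the model never needs to be retrained from scratch.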



iSys - Revista Brasileira de Sistemas de Informação - CESI/SBC
Electronic ISSN: 1984-2902