Development and Evaluation of a Sarcasm Detection Algorithm based on Machine Learning and Natural Language Processing Techniques

Julián Amigó Francés. Development and Evaluation of a Sarcasm Detection Algorithm based on Machine Learning and Natural Language Processing Techniques. Final Degree Project (TFG). Universidad Politécnica de Madrid, ETSI Telecomunicación.

Abstract:
Automatic detection of irony in text is an open challenge, and a significant number of publications address it [1]. Because many people express themselves through irony, detecting it is of great interest in Natural Language Processing (NLP) and, more specifically, in sentiment analysis, where it helps a system avoid misinterpreting opinions. Sarcasm (the use of irony to mock or convey contempt) is closely related to irony but serves a different purpose, so detecting it is equally important for this field. The main objective of this project is to create a model for detecting this "cruel irony" in texts. The difficulty is that sarcasm can be quite subjective, even for human readers, when there is no explicit signal marking the text as sarcastic (a typical sarcasm label, for example #sarcasm on Twitter). A new model for sarcasm detection will therefore be studied. As a baseline, we will use the feature evaluation work on the SARC corpus [2]. This corpus has more than 5 million comments, of which 1.3 million are sarcastic statements. It was built from Reddit, a community forum where a wide range of topics is discussed.

Methods: To create this model, Machine Learning and NLP techniques will be used. First, the available dataset will be preprocessed. Then, a machine learning algorithm will be selected. Finally, the developed model will be evaluated against previous work.

Objectives: The main objective is to obtain an automatic sarcasm detection model with the best evaluation results possible. An additional objective is to gain insights into the sarcasm phenomenon that can be used in other NLP-related tasks.

The development and evaluation environment will be based on "Jupyter", a web-based application suitable for capturing the whole computation process.
This environment is easy to use and lets you see the output of the code cell by cell. The language used will be Python, because it provides many libraries (scikit-learn, pandas, numpy, scipy) and tools for Machine Learning and NLP.

References:
1. Cruces, P. (2017). Modelo de detección automática de ironía en textos en español.
2. Khodak, M., Saunshi, N., & Vodrahalli, K. (2017). A large self-annotated corpus for sarcasm. arXiv preprint arXiv:1704.05579.
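
As an illustration only, the three steps described under Methods (preprocessing, algorithm selection, evaluation) could be sketched in Python roughly as follows. This is not the model developed in the project. The toy sentences, the choice of a hand-written multinomial Naive Bayes classifier, and every function name (`preprocess`, `train_naive_bayes`, `predict`, `f1_score`) are assumptions made for the example. It uses only the standard library so that it runs anywhere.

```python
import math
import re
from collections import Counter

# Hypothetical toy data (1 = sarcastic, 0 = not sarcastic).
# The project itself works with the SARC corpus, not with these sentences.
TRAIN = [
    ("Oh great, another Monday. Just what I needed #sarcasm", 1),
    ("Yeah right, because that always works so well", 1),
    ("Wow, what a totally brilliant plan #sarcasm", 1),
    ("The meeting starts at ten tomorrow", 0),
    ("I enjoyed the book and would recommend it", 0),
    ("The plan works well for small teams", 0),
]
TEST = [
    ("Oh great, what a brilliant idea", 1),
    ("The book club meeting starts at ten", 0),
]


def preprocess(text):
    """Step 1: lowercase, drop the explicit #sarcasm marker, split into words."""
    # Removing the marker keeps the model from simply memorising the label.
    text = re.sub(r"#sarcasm\b", " ", text.lower())
    return re.findall(r"[a-z']+", text)


def train_naive_bayes(samples):
    """Step 2: count words per class for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in samples:
        class_counts[label] += 1
        word_counts[label].update(preprocess(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for token in preprocess(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def f1_score(y_true, y_pred):
    """Step 3: F1 for the sarcastic class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


model = train_naive_bayes(TRAIN)
y_true = [label for _, label in TEST]
y_pred = [predict(model, text) for text, _ in TEST]
print("F1 on the toy test set:", f1_score(y_true, y_pred))
```

A score obtained on a handful of invented sentences says nothing about real performance. A meaningful comparison with previous work would require training and testing on the SARC corpus itself, and libraries such as scikit-learn provide ready-made equivalents of the hand-written steps above.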