Design and Development of a Machine Learning Classifier of ISIS Terrorist Texts

de Sotomayor Vergara, M. Á. (2018). Design and Development of a Machine Learning Classifier of ISIS Terrorist Texts. Trabajo Fin de Titulación (TFG). Universidad Politécnica de Madrid, ETSIT, Madrid.

Abstract:
Nowadays, terrorism is one of society’s main concerns. One of the main difficulties encountered by organizations that try to prevent the formation of terrorist cells is controlling a means of communication as extensive and open as the Internet. Additionally, the interaction of people on social networks has been increasing exponentially in recent years, producing an amount of data that can only be processed automatically. Therefore, the main objective of this project is to use technology as an enabling tool to detect and classify the texts cited by ISIS in its publications on social networks. This will help speed up the detection of currents of thought within the group, future forms of action, and possible recruiting techniques. To achieve this objective, we have developed a Machine Learning approach trained on a compilation of 2,685 religious texts cited by ISIS over a 3-year period. This dataset is a collection of all the religious and ideological texts used in ISIS English-language magazines. In addition, the model has been tested on a second, similar problem: ISIS Twitter posts with radical and non-radical content. To implement this classifier, a software system that uses natural language processing (NLP) techniques has been developed in the Python programming language. Regarding feature extraction, we have studied the features that provided relevant information to the model given our type of input data: part-of-speech (PoS) tags, LDA, word embeddings, embedding-based similarity, and domain-specific word selection. To properly evaluate the proposed methods, an extensive evaluation based on cross-validation has been carried out. In conclusion, the scores obtained reached 81% and 92% for the two problems analyzed during the development of the project.
The difference arises because the two cases presented different complexities. The first had a small dataset (2250, 2) and five possible classes, so complex algorithms overfitted and linear classifiers were the ones that adapted best to the input data. The second problem, however, had a larger dataset (34708, 2) and a binary classification, which allowed highly complex algorithms to perform better.
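The cross-validation evaluation mentioned in the abstract can be illustrated with a minimal pure-Python sketch. Everything in it is an assumption made for illustration: the abstract does not state the number of folds, the metric, or the classifiers used, so the function names (`k_fold_indices`, `cross_validate`), the choice of 5 folds, the accuracy metric, the majority-class stand-in classifier, and the toy data are all hypothetical and are not the thesis implementation.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(texts, labels, fit, predict, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and return one accuracy per fold."""
    folds = k_fold_indices(len(texts), k, seed)
    scores = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in range(len(texts)) if i not in held_out_set]
        model = fit([texts[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, texts[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores


# Hypothetical stand-in classifier: always predicts the most frequent training label.
# A real experiment would plug in a feature extractor and a learned model here.
def fit_majority(train_texts, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, text):
    return model


if __name__ == "__main__":
    # Toy data invented for this example only (not from the thesis datasets).
    texts = [f"document {i}" for i in range(20)]
    labels = ["radical" if i % 4 == 0 else "non-radical" for i in range(20)]
    scores = cross_validate(texts, labels, fit_majority, predict_majority, k=5)
    print("mean accuracy:", sum(scores) / len(scores))
```

Because every sample is held out exactly once, the averaged score reflects performance on unseen data, which is how overfitting of the kind described for the smaller five-class dataset would typically show up: a complex model scores well on its training folds but poorly on the held-out one.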