Abstract:
E-mail and SMS are among the most widely used Internet messaging applications, and their biggest problem is spam. Of audio, video, and text spam, text spam is known to be the most common. It is therefore important to detect textual spam with high accuracy and to identify which category the spam content belongs to. This study focuses on the development of TF-IDF (Term Frequency-Inverse Document Frequency) based machine learning methods for textual spam filtering. It discusses TF-IDF as a technique that plays an important role in text mining: the importance of a term in a document is determined by weighing the term's frequency in that document against its frequency across all documents. Within the scope of the study, data sets consisting of SMS and E-mail texts were collected, and the effect on spam filtering performance of combining TF-IDF with Support Vector Machines (SVM), Decision Trees, Random Forests, Naïve Bayes, and k-NN was examined. The results indicate that the developed TF-IDF-based machine learning methods have the potential to yield more effective results in spam filtering systems.
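As a rough illustration of the TF-IDF weighting described above, the following minimal sketch computes term weights for a handful of messages. It is not the study's implementation: the abstract does not say which TF-IDF variant was used, so the length-normalized term frequency, the unsmoothed logarithmic inverse document frequency, the whitespace tokenizer, the function name `tf_idf`, and the three sample messages are all assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (hypothetical helper)."""
    # Assumption: lowercase whitespace tokenization is enough for the sketch.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # tf = share of the document taken up by the term,
        # idf = log(N / df); a term present in every document gets weight 0.
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up messages, used only to exercise the function.
messages = [
    "win a free prize now",
    "free entry win cash",
    "are we meeting for lunch now",
]
weights = tf_idf(messages)
```

In this toy collection, "prize" occurs in only one message and so receives a higher weight in that message than "free", which occurs in two. Weight dictionaries of this kind, once mapped to fixed-length vectors, could serve as input features for any of the classifiers named in the abstract.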