TESLA (TwittEr Spam LeArning) is a course project for CSCE 670: Information Storage and Retrieval at Texas A&M University. The project aims to build a tool that checks whether a Twitter user is a spammer.
Spam can generally be described as unsolicited, repeated actions that negatively impact other people. As Twitter users, we are often followed by accounts we are unfamiliar with, or consider following unfamiliar accounts ourselves. We do not want to follow or be followed by spammers, since they may send us malicious links or hijack our accounts. Therefore, we need a tool that tells us whether these unfamiliar accounts are spammers.
During our project, to distinguish spam users from genuine users, we first extracted a series of account features and tweet features and explored them through data visualization. These features include the count of favorite tweets, account age, and length of description.
Figure 1. Count of Favorite Tweets
Figure 2. Account Age
Figure 3. Length of Description
Other features we used:
For more data visualization results about the features listed above, please go to GitHub.
Based on these features, we built an ensemble training model consisting of two parts: an account model (trained on user-related features) and a tweet model (trained on the text content of tweets). For the account model, we compared the accuracy of several kinds of classifiers, including logistic regression, random forest, SVM, and KNN. Based on the training accuracy results, we finally chose random forest as our account model classifier. For the tweet model, we chose word2vec-based embeddings as the foundation of the classifier. To get the final result, we calculate a weighted probability from the two models; if it exceeds 50%, we classify the user as a spammer.
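The weighted combination described above can be sketched as follows. The equal weights and the 50% threshold are taken from the description; the function names are illustrative, not the project's actual code.

```python
def ensemble_predict(p_account, p_tweet, w_account=0.5, w_tweet=0.5):
    """Combine the account-model and tweet-model spam probabilities.

    The weights here are illustrative assumptions; the project describes
    a weighted probability with a 50% decision threshold.
    """
    p_spam = w_account * p_account + w_tweet * p_tweet
    return "spammer" if p_spam > 0.5 else "legitimate"
```

For example, an account the account model scores at 0.9 and the tweet model at 0.8 would be labeled a spammer, while scores of 0.2 and 0.3 would not.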
Figure 4. TESLA Training Model
Initially, our main data sources came from the Bot Repository:
At first, our aggregated user dataset was imbalanced: we had ~7,000 labeled spammers and ~5,000 labeled legitimate users. To balance it, we did not use oversampling or downsampling; rather, we used the Twitter API again. Our crawler started from President Trump's Twitter account and scraped his friends list, his friends' following lists, and so on, until we had collected 2,000 rows of user data. We assume that President Trump follows real users, and that his friends follow authentic accounts as well.
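The crawl described above is essentially a breadth-first traversal of friend lists. A minimal sketch follows; `fetch_friend_ids` is a hypothetical placeholder for the real Twitter API call (e.g., a rate-limited friends/ids request), and the seed name is illustrative.

```python
from collections import deque

def crawl_legitimate_users(seed_user, fetch_friend_ids, limit=2000):
    """Breadth-first crawl of friend lists starting from a trusted seed.

    fetch_friend_ids(user) is a placeholder for the actual Twitter API
    lookup; it should return the list of accounts the user follows.
    """
    collected, seen = [], {seed_user}
    queue = deque([seed_user])
    while queue and len(collected) < limit:
        user = queue.popleft()
        for friend in fetch_friend_ids(user):
            if friend not in seen:
                seen.add(friend)
                collected.append(friend)
                queue.append(friend)   # expand friends-of-friends next
            if len(collected) >= limit:
                break
    return collected
```

The `seen` set prevents the crawler from visiting the same account twice, and the `limit` stops the crawl once enough rows are gathered.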
The test accuracy results of the different classifiers are shown below (data ranges from 2014 to 2016):
Table 1. Training Result Based on 2014-2016 Data
However, although our classifiers showed high accuracy when tested on the original data, our account-based model did not perform well in the first real-world test. That is because most of the data was old (except for President Trump's 2,000+ records). Thus, we acquired more data using the Twitter API again.
We were able to acquire information on an additional 9,573 spam accounts using the streaming API. We filtered the tweet stream on April 15 and 16, 2018, using the following keywords, under the assumption that whoever sent a tweet containing one of these keywords was a spam account: ['make money from home','enter to win','Credit Card', 'lonely', 'debt','deals','ad', '100% free','Act now','apply online','Click below','Click here', 'Extra cash','Offer expires', 'order now','Save $','Serious cash','Satisfaction guaranteed', 'Supplies are limited', 'trial','Work from home','you are a winner','your income','Weight loss','why pay more']. Sources of the spam keywords: 455 Spam Trigger Words to Avoid in 2018, “SPAM Tweets” – 5 Buzzwords that Attract Spammers.
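The keyword-based labeling rule above can be sketched as a simple case-insensitive substring check. Only a subset of the full keyword list is shown here for brevity; the function name is illustrative.

```python
# Subset of the spam trigger keywords listed above, lowercased for matching.
SPAM_KEYWORDS = ['make money from home', 'enter to win', 'credit card',
                 '100% free', 'click here', 'work from home', 'weight loss']

def is_spam_keyword_hit(tweet_text, keywords=SPAM_KEYWORDS):
    """Label a streamed tweet as spam if it contains any trigger keyword."""
    text = tweet_text.lower()
    return any(kw in text for kw in keywords)
```

In the streaming setup, a function like this would decide whether the tweet's author is added to the spam side of the dataset.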
We also scraped data on 9,464 ham (legitimate) users using the same assumption as for President Trump above, but this time our initial seed was Dr. Philip Guo, Assistant Professor of Cognitive Science at UC San Diego.
In this way, we were able to gather a balanced user dataset with a spammer-to-legitimate-user ratio of roughly 1:1 (15,731 spammers, 16,828 legitimate users). We acknowledge that our aggregated dataset may be subject to biases. If time permits, we will collect a larger dataset that covers as many groups as possible.
The test accuracy results of the different classifiers are shown below (data ranges from 2014 to 2018):
Table 2. Training Result Based on 2014-2018 Data
From Table 2, we observed that the accuracy of the classifiers decreased; we suppose there are the following reasons:
According to the accuracy results above, we finally chose random forest as our account model, since it performed well on both data sets. The real-world test results were also better than before.
For more details, please refer to our final account model (Jupyter notebook) here.
Besides the account model, we also extracted the text field from tweets. For the tweet model, in the first stage, we applied word2vec to the tweet text to embed the word corpus. For special tokens such as URLs and @mentions in the tweet text, we chose to delete them, since they would pollute the overall corpus.
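The URL and @mention removal described above can be sketched with two regular expressions. The exact patterns are assumptions; the actual notebook may tokenize differently.

```python
import re

URL_RE = re.compile(r'https?://\S+')   # matches http/https links
MENTION_RE = re.compile(r'@\w+')       # matches @mention handles

def clean_tweet(text):
    """Strip URLs and @mentions before feeding text to word2vec."""
    text = URL_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    return ' '.join(text.split())      # collapse leftover whitespace
```

For example, `clean_tweet("Hey @bob check https://t.co/xyz now")` yields `"Hey check now"`.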
Then, in the second stage, we used the pre-trained embedding weights from the word2vec model as a trainable input layer to train a 1-dimensional convolutional network with kernel sizes 1-4 to capture phrase-level semantics. In other words, the model considered not only individual words in sequence, but also phrases of up to 4 words.
These four branches were trained in parallel and concatenated at the end to output the probability that the text is spam. The final accuracy was around 88% on our data, and the online real-time test performed well too.
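The parallel-kernel convolution idea above can be illustrated with a small NumPy sketch: each branch slides a window of size 1-4 over the word-embedding sequence, applies a ReLU, max-pools over time, and the pooled features are concatenated. All dimensions and weights here are toy values for illustration; the actual model was a trained network, and a final dense + sigmoid layer (not shown) would map the concatenated features to a spam probability.

```python
import numpy as np

def conv_branch(embeds, kernel_size, weights):
    """Valid 1-D convolution over a word-embedding sequence, then max-pool.

    embeds:  (seq_len, dim) matrix of word embeddings
    weights: (kernel_size * dim, n_filters) filter matrix
    """
    seq_len, _ = embeds.shape
    # Flatten each length-k window of consecutive word vectors into one row.
    windows = np.stack([embeds[i:i + kernel_size].ravel()
                        for i in range(seq_len - kernel_size + 1)])
    feats = np.maximum(windows @ weights, 0.0)   # ReLU activation
    return feats.max(axis=0)                     # global max-pool over time

def multi_kernel_features(embeds, weight_sets):
    """Concatenate max-pooled features from all kernel sizes (here 1-4)."""
    return np.concatenate([conv_branch(embeds, k, w)
                           for k, w in weight_sets.items()])

rng = np.random.default_rng(0)
dim, n_filters = 8, 4
embeds = rng.normal(size=(10, dim))              # 10 words, toy embeddings
weight_sets = {k: rng.normal(size=(k * dim, n_filters)) for k in (1, 2, 3, 4)}
features = multi_kernel_features(embeds, weight_sets)
```

With 4 kernel sizes and 4 filters each, `features` has 16 entries, one pooled activation per filter per branch.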
For more details, please refer to our final text model (Jupyter notebook) here.
Table 3. Training Result for Our Text Model.
Figure 5. ROC Curve for the Two Selected Models.
In this project, we initially trained our model with data gathered from 2014 to 2016. However, when we used the model to classify current spam users, the accuracy was not satisfactory. From this result, we found that Twitter spam users have been changing over the years: as time passes, their features can change considerably. Therefore, spam topic drift is a challenge for our classifier.
We believe this problem of changing data affects many areas of IR, including search engines and recommender systems, which also involve training on data to improve accuracy. To deal with this issue, it is apparent that one-time training is not enough; instead, frequent retraining on the newest data is necessary.
Privacy: Since this tool relies on users' account information as well as their tweets, some people could retrieve others' information without their knowledge.
Spammer's Counter Attack: The system shows the result of our judgment, so it is possible that spammers could use it to figure out better ways to camouflage themselves.
Unexpected Killing: Since the precision of our tool can never reach 100%, if it were adopted by Twitter, legitimate users might be misclassified as spammers and wrongly shut down.
More targets: We believe similar methods to those used in this project can be applied to classify other kinds of malicious users, such as those who may be suspended for various reasons.
Better scalability: The system could be extended to other languages if text features for those languages are available.
Form of the app: Currently the app takes the form of a website. We think it would be more convenient for people to use if it were made into a browser extension.