Let’s first calculate the class frequencies of the response variable, scam type, to see if it is imbalanced.
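A minimal sketch of this check, assuming the REKT data is already loaded into a pandas DataFrame named `df` with the response variable in a column named `scam_type` (both names are assumptions):

```python
import pandas as pd

# df is assumed to hold the REKT Database records, with the response
# variable in a column named "scam_type" (column name assumed)
print(df["scam_type"].value_counts())
```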
If funds_lost is $0, the corresponding observation offers little information to a Naive Bayes model; and if no funds were lost, then no funds can be returned or recovered. Filtering out the $0 funds_lost observations therefore makes sense if we want balanced classes for our response variable, scam type.
After filtering out the $0 funds_lost values, 1861 honeypots, 383 rugpulls, and 29 “other” scams are dropped in the process of treating the class imbalance. These removed records carry no key information about the funds lost or the source of the attack. We are left with the rows where funds_lost != 0, i.e. the crypto projects that offer meaningful information about the extent of the scam, which will be useful in predicting the scam type.
We now have all records where funds_lost != 0 and a response variable with balanced classes.
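The filtering step as a sketch, again assuming the `df` and column names from above:

```python
# Drop the zero-loss records, which carry no information for the model
df = df[df["funds_lost"] != 0]

# Re-check the class balance after filtering
print(df["scam_type"].value_counts())
```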
As per the REKT Database, Honeypot attacks, Rugpull attacks, Abandoned scams, and the Kronos Dao project (classified as “other”) can be pooled together as Exit Scams, and all other attacks can be pooled together as Exploits. We will therefore conduct Naive Bayes with a binary target. After pooling the scam types and treating the class imbalance, the class frequencies of our target variable are n(Exit Scam) = 380 and n(Exploit) = 435.
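One way to express this pooling; the exact label spellings in `scam_type` are assumptions:

```python
# Categories pooled as Exit Scams per the REKT Database grouping
# (exact label spellings are assumed)
exit_scams = {"Honeypot", "Rugpull", "Abandoned", "Other"}

df["scam_class"] = df["scam_type"].apply(
    lambda t: "Exit Scam" if t in exit_scams else "Exploit"
)
print(df["scam_class"].value_counts())  # Exploit: 435, Exit Scam: 380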
We shall also take the log of funds_lost and funds_returned to better satisfy the normality assumption of Gaussian Naive Bayes, as well as for better prediction accuracy.
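A sketch of the transform. Note that funds_returned can still be zero after filtering, so log1p is used for it here; this is an assumption on my part, since the source does not say how those zeros were handled:

```python
import numpy as np

# funds_lost is strictly positive after the zero-loss filter, so log is safe
df["log_funds_lost"] = np.log(df["funds_lost"])

# funds_returned can legitimately be 0, so use log1p to avoid -inf
df["log_funds_returned"] = np.log1p(df["funds_returned"])
```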
With the scam types pooled into the two classes described above and the monetary features log-transformed, the data is ready for the model.
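A minimal sketch of fitting Gaussian Naive Bayes on these features; the feature set and split parameters are assumptions, and the author may have used additional columns:

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

X = df[["log_funds_lost", "log_funds_returned"]]  # assumed feature set
y = df["scam_class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

gnb = GaussianNB().fit(X_train, y_train)
print(classification_report(y_test, gnb.predict(X_test)))
```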
To make the task more interesting, I will also feature-engineer a sentiment score for each piece of text content, tweets and news, to create two labels: Positive and Negative.
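The source does not name the sentiment tool; NLTK’s VADER is a common choice and is assumed here, along with a `tweets` DataFrame and its `text` column:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# VADER's compound score lies in [-1, 1]; label each tweet by its sign
tweets["compound"] = tweets["text"].apply(
    lambda t: sia.polarity_scores(t)["compound"]
)
tweets["sentiment"] = tweets["compound"].apply(
    lambda s: "Positive" if s >= 0 else "Negative"
)

print(round(tweets["compound"].mean(), 2))  # overall tweet sentiment
print(tweets["sentiment"].value_counts())
```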
The average sentiment of all tweets is -0.39 (rounded to 2 d.p.), which implies that tweets about crypto attacks contain more negative words. The value_counts() below also confirms this: we have 177 negative tweets and 111 positive tweets. To conduct Naive Bayes fairly, I would typically want more balanced label classes, so let’s see what the overall sentiment is for news articles and hope it is relatively more positive, which would balance the classes.
As anticipated in the tweets section, the news articles have a more positive overall sentiment. This helps balance our target variable, since we will combine the tweets and news articles into one dataframe for Naive Bayes to predict either Positive or Negative sentiment.
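A sketch of the combined text model, assuming the `tweets` and `news` frames from above and a bag-of-words Multinomial Naive Bayes; the vectorizer, split parameters, and weighted averaging are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stack tweets and news into a single labelled corpus
corpus = pd.concat(
    [tweets[["text", "sentiment"]], news[["text", "sentiment"]]],
    ignore_index=True,
)

X_train, X_test, y_train, y_test = train_test_split(
    corpus["text"], corpus["sentiment"],
    test_size=0.2, random_state=42, stratify=corpus["sentiment"],
)

vec = CountVectorizer(stop_words="english")
model = MultinomialNB().fit(vec.fit_transform(X_train), y_train)

for name, X, y in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = model.predict(vec.transform(X))
    print(f"Accuracy score for {name} Predictions: {accuracy_score(y, pred):.2f}")
    print(f"Recall score for {name} Predictions: "
          f"{recall_score(y, pred, average='weighted'):.2f}")
    print(f"Precision score for {name} Predictions: "
          f"{precision_score(y, pred, average='weighted'):.2f}")
    print(f"F1 score for {name} Predictions: "
          f"{f1_score(y, pred, average='weighted'):.2f}")
```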
Accuracy score for Train Predictions: 0.88
Recall score for Train Predictions: 0.88
Precision score for Train Predictions: 0.88
F1 score for Train Predictions: 0.88
Accuracy score for Test Predictions: 0.75
Recall score for Test Predictions: 0.75
Precision score for Test Predictions: 0.80
F1 score for Test Predictions: 0.74
Accuracy score: 0.82
Recall score: 0.82
Precision score: 0.82
F1 score: 0.82
All of the above are valid reasons to build other classifiers that can outperform the Naive Bayes model. While Naive Bayes works well for spam filtering and recommendation systems, it is probably not the ideal choice for most other applications.
Overall, Naive Bayes is fast, powerful, and explainable. However, its strong dependence on the prior probabilities of the target variable can produce misleading and inaccurate results, and classifiers such as Decision Trees, SVMs, Random Forests, and ensemble methods are often able to outperform it. This does not make Naive Bayes useless as a baseline, but the independence assumption, the inability to model interactions between features, and the normality assumption (for the Gaussian variant) mean its predictions should be treated with caution.