Process Flow of Multinomial Naive Bayes:

Multinomial Naive Bayes in R (REKT Database Record Data)

Let’s first calculate the frequency of the response variable to see if it is imbalanced.

If the funds lost are $0, the respective observation or record offers little information for a Naive Bayes model. Also, if no funds were lost, then no funds can be returned or recovered. Therefore, filtering out the $0 funds_lost observations makes sense if we want balanced classes for our response variable, scam type.
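The original analysis is in R; the pandas sketch below illustrates the equivalent step, checking the class frequencies and dropping the $0 records. The file and column names (funds_lost, scam_type) are assumptions, not the exact REKT Database export.

```python
import pandas as pd

# Hypothetical file and column names; the actual REKT Database export may differ.
rekt = pd.read_csv("rekt_database.csv")

# Class frequencies of the response variable (scam type) before filtering.
print(rekt["scam_type"].value_counts())

# Drop records where no funds were lost: they say little about the extent
# of the scam, and no funds can be returned if none were lost.
rekt = rekt[rekt["funds_lost"] != 0]

# Class frequencies after filtering.
print(rekt["scam_type"].value_counts())
```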

After filtering out the $0 funds_lost values, 1861 honeypots, 383 rugpulls, and 29 “other” scams are dropped in the process of treating class imbalance. These removed records offered no key information in terms of funds lost or the source of the attack. We are left with the rows where funds_lost != 0, in which every crypto project offers some key information about the extent of the scam, which will be useful in predicting the scam type.

Now we have all records where funds_lost != 0 and a response variable with balanced classes.

As per the REKT Database, Honeypot attacks, Rugpull attacks, Abandoned scams, and the Kronos Dao project (classified as “other”) can be pooled together as Exit Scams, and all other attacks can be pooled together as Exploits. We will therefore conduct Naive Bayes with a binary target. After pooling the scam types and treating class imbalance, the class frequencies of our target variable are: n(Exit Scam) = 380 and n(Exploit) = 435.
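A minimal sketch of the pooling step in pandas; the label strings follow the description above and are assumptions about how the REKT categories are spelled.

```python
# Pool scam types into a binary target: Exit Scam vs. Exploit.
# Label spellings are assumptions; the exact REKT Database labels may differ.
exit_scam_types = {"Honeypot", "Rugpull", "Abandoned", "Other"}

rekt["scam_category"] = rekt["scam_type"].apply(
    lambda t: "Exit Scam" if t in exit_scam_types else "Exploit"
)

print(rekt["scam_category"].value_counts())  # roughly 380 Exit Scams vs. 435 Exploits
```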

We shall also take the log of funds_lost and funds_returned, bringing these features closer to the normal distribution assumed by Naive Bayes for numeric features and improving prediction accuracy.
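A one-line sketch of the transform (again in Python rather than the original R); log1p is used here so that any remaining zero values stay finite, which is an implementation choice rather than the post’s stated method.

```python
import numpy as np

# log1p(x) = log(1 + x) keeps zero-valued funds_returned entries finite.
rekt["log_funds_lost"] = np.log1p(rekt["funds_lost"])
rekt["log_funds_returned"] = np.log1p(rekt["funds_returned"])
```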

After pooling the scam types into their respective categories as described above:
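The modelling itself is done in R; as a rough scikit-learn analogue, the sketch below splits the data, fits a Naive Bayes model with a Gaussian likelihood on the continuous log features (an assumption, since the original R model may be configured differently), and prints the train and test confusion matrices reported next.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Assumed feature set; the original model may use additional columns.
features = ["log_funds_lost", "log_funds_returned", "day_of_year"]
X_train, X_test, y_train, y_test = train_test_split(
    rekt[features], rekt["scam_category"], test_size=0.2, random_state=42
)

model = GaussianNB()
model.fit(X_train, y_train)

print(confusion_matrix(y_train, model.predict(X_train)))  # training data
print(confusion_matrix(y_test, model.predict(X_test)))    # test data
```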

Confusion Matrix for Training Data:

Confusion Matrix for Test Data:

Discretizing the log funds features and the day-of-year feature for more accurate predictions:
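A pandas sketch of the discretization step, binning the log funds features and the day-of-year feature into ordinal buckets; the bin counts and column names here are assumptions for illustration.

```python
# Quantile-bin the log funds features and evenly bin day of year,
# so each feature becomes a small set of ordered categories.
rekt["funds_lost_bin"] = pd.qcut(rekt["log_funds_lost"], q=5, labels=False, duplicates="drop")
rekt["funds_returned_bin"] = pd.qcut(rekt["log_funds_returned"], q=5, labels=False, duplicates="drop")
rekt["day_of_year_bin"] = pd.cut(rekt["day_of_year"], bins=12, labels=False)
```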

Confusion Matrix for Discretized Training Data:

Confusion Matrix for Discretized Test Data:

Multinomial Naive Bayes in Python using Text Data from NewsAPI and Twitter API

To make the task more interesting, I will also engineer a Sentiment Score feature for each piece of text content, tweets and news articles alike, to create two labels: Positive and Negative.

Twitter Sentiment Analysis for Feature Engineering
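The post does not name the sentiment tool used; the sketch below shows one common way to engineer a compound sentiment score and a Positive/Negative label for each tweet, using NLTK’s VADER analyzer as an assumed choice. The dataframe tweets_df and its text column are hypothetical names.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def sentiment_label(text: str) -> str:
    # The compound score ranges from -1 (most negative) to +1 (most positive).
    return "Positive" if sia.polarity_scores(text)["compound"] >= 0 else "Negative"

# tweets_df: a DataFrame with one tweet per row and a "text" column (assumed).
tweets_df["sentiment"] = tweets_df["text"].apply(
    lambda t: sia.polarity_scores(t)["compound"]
)
tweets_df["label"] = tweets_df["text"].apply(sentiment_label)

print(round(tweets_df["sentiment"].mean(), 2))  # average sentiment of all tweets
print(tweets_df["label"].value_counts())        # Positive vs. Negative counts
```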

The average sentiment of all tweets is -0.39 (rounded to 2 d.p.), which implies that tweets about crypto attacks contain more negative words. The value_counts() below confirms this: we have 177 negative tweets and 111 positive tweets. To conduct Naive Bayes fairly, I would want more balanced label classes, so let’s check the overall sentiment of the news articles and hope it is relatively more positive, which would balance the classes.

News Sentiment Analysis for Feature Engineering

As hoped in the tweets section, the news articles have a more positive overall sentiment. This helps balance our target variable, since we will combine the tweets and news articles into one dataframe and use Naive Bayes to predict either Positive or Negative sentiment.

Multinomial Naive Bayes on Merged News and Twitter Text
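A minimal scikit-learn sketch of the pipeline described above, assuming tweets_df and news_df each hold a text column and the engineered sentiment label; the vectorizer settings and train/test split are assumptions, not necessarily the post’s exact configuration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Merge tweets and news articles into one corpus with a shared label column.
corpus = pd.concat(
    [tweets_df[["text", "label"]], news_df[["text", "label"]]], ignore_index=True
)

# Bag-of-words counts are the natural input for Multinomial Naive Bayes.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus["text"])
y = corpus["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)

for split, X_split, y_split in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = nb.predict(X_split)
    print(f"{split} accuracy:  {accuracy_score(y_split, pred):.2f}")
    print(f"{split} recall:    {recall_score(y_split, pred, average='weighted'):.2f}")
    print(f"{split} precision: {precision_score(y_split, pred, average='weighted'):.2f}")
    print(f"{split} F1:        {f1_score(y_split, pred, average='weighted'):.2f}")
```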

Accuracy score for Train Predictions: 0.88

Recall score for Train Predictions: 0.88

Precision score for Train Predictions: 0.88

F1 score for Train Predictions: 0.88

Accuracy score for Test Predictions: 0.75

Recall score for Test Predictions: 0.75

Precision score for Test Predictions: 0.80

F1 score for Test Predictions: 0.74

Going Above and Beyond: Linear SVM
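For comparison, a linear SVM can be trained on the same bag-of-words features from the Naive Bayes sketch above; the snippet below uses scikit-learn’s LinearSVC, which is an assumed choice since the post does not state which SVM implementation or settings were used.

```python
from sklearn.svm import LinearSVC

svm = LinearSVC()
svm.fit(X_train, y_train)

svm_pred = svm.predict(X_test)
print(f"Accuracy:  {accuracy_score(y_test, svm_pred):.2f}")
print(f"Recall:    {recall_score(y_test, svm_pred, average='weighted'):.2f}")
print(f"Precision: {precision_score(y_test, svm_pred, average='weighted'):.2f}")
print(f"F1:        {f1_score(y_test, svm_pred, average='weighted'):.2f}")
```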

Accuracy score: 0.82

Recall score: 0.82

Precision score: 0.82

F1 score: 0.82

Discussion and Conclusions

Advantages of Naive Bayes

  1. Can handle missing values. Missing values are ignored while preparing the model and ignored when a probability is calculated for a class value.
  2. Can handle small sample sizes. Naive Bayes does not require a large amount of training data. It merely needs enough data to understand the probabilistic relationship between each attribute, in isolation, and the target variable. When only a little training data is available, Naive Bayes usually performs better than more complex models.
  3. Performs well despite violation of the independence assumption. Even though independence rarely holds for real-world data, the model often still performs well in practice.
  4. Easily interpretable and has fast prediction time compared to other classifiers. Naive Bayes is not a black-box algorithm, and the end result can easily be explained to an audience.
  5. Can handle both numeric and categorical data. Naive Bayes is a classifier and will therefore perform better with categorical data. Numeric data will also suffice, but the model assumes numeric features are normally distributed, which is often untrue of real-world data.

Disadvantages of Naive Bayes

  1. Naive assumption. Naive Bayes assumes that all features are independent of each other. In real life it is almost impossible to obtain a set of predictors that are completely independent of one another.
  2. Cannot incorporate interactions between the features, and its performance is highly sensitive to skewed data. When the training set is not representative of the class distribution of the overall population, the prior estimates will be incorrect.
  3. Zero-frequency problem. A category that appears in the test data but not in the training data is assigned a probability of zero, so the model cannot make a prediction for it. As a solution, a smoothing technique must be applied; one of the simplest and best known is Laplace smoothing, which Python’s scikit-learn applies by default (see the sketch after this list). Correlated features in the dataset must also be removed, or they are effectively counted twice in the model and over-inflate that feature’s importance.
  4. It heavily relies on the prior target class probability for predictions. Inaccurate or unrealistic priors can lead to misleading results. Because Naive Bayes is a probability-based machine learning technique, the prior probability of the target class greatly affects the final prediction.
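To illustrate point 3: Laplace (add-one) smoothing adds a pseudo-count to every word-class combination so that categories unseen in training never receive zero probability. scikit-learn’s MultinomialNB exposes this through its alpha parameter, which defaults to 1.0.

```python
from sklearn.naive_bayes import MultinomialNB

# Laplace smoothing: P(w | c) = (count(w, c) + alpha) / (count(c) + alpha * |V|),
# where |V| is the vocabulary size. With alpha = 1.0 (the default), a word that
# never appeared with class c in training still gets a small non-zero probability.
nb_smoothed = MultinomialNB(alpha=1.0)
```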

All of the above are valid reasons to build other classifiers that may outperform the Naive Bayes model. While Naive Bayes is great for spam filtering and recommendation systems, it is probably not ideal for most other applications.

Overall, Naive Bayes is fast, powerful, and explainable. However, its heavy dependence on the prior probability of the target variable can produce misleading and inaccurate results, and classifiers such as decision trees, SVMs, random forests, and ensemble methods are often able to outperform it. This does not entirely undermine Naive Bayes as a reliable classifier, but the independence assumption, the inability to model interactions between features, and the normality assumption for numeric features mean its predictions should be treated with caution.