Conclusions
Exploratory Data Analysis
From the EDA tab, we saw that log-transforming the funds lost and funds returned variables produced an approximately normal distribution for both by pulling the extreme outliers closer to the rest of the data. The features extracted from the datetime field showed when the attacks took place. Weekdays had a much higher frequency of attacks than weekends: twice as many attacks took place on Mondays, Tuesdays, and Thursdays as on Saturdays and Sundays. Moreover, the majority of all attacks took place in August and September. September is generally viewed as a bearish month for stocks and has been a bearish month for crypto over the past four years, a pattern known as the September Effect.
In terms of attack type, honeypots were the most frequent scams linked to the Ethereum chain. In a honeypot, the user’s money is trapped in a smart contract that appears to be vulnerable but contains a hidden trap, and only the honeypot creator (the attacker) can recover the funds transferred by the victim. Honeypots are followed in frequency by exit scams, in which the promoters of a cryptocurrency disappear with investors’ money during or after an initial coin offering (ICO).
The text data collected from the Twitter API and News API was mainly used to extract and analyze the sentiment of the public and the media, respectively, on the topic of crypto cyber risk.
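The effect of the log transform on skewed amounts can be sketched with the standard library alone. The amounts below are made up for illustration (they are not REKT Database values), and using log1p rather than a plain log is an assumption made so that zero amounts, such as attacks with no funds returned, stay defined.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    mu, sigma = mean(values), pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


# Hypothetical funds-lost amounts in USD, with one extreme outlier.
funds_lost = [0, 12_000, 50_000, 90_000, 400_000, 1_500_000, 8_000_000, 600_000_000]

# log1p(x) = log(1 + x), so a zero amount maps to zero instead of failing.
log_funds_lost = [math.log1p(v) for v in funds_lost]

raw_skew = skewness(funds_lost)
log_skew = skewness(log_funds_lost)
print(f"skewness before: {raw_skew:.2f}, after: {log_skew:.2f}")
```

A skewness closer to zero after the transform is what "squeezing the outliers in" looks like numerically; a histogram or Q-Q plot would still be needed to judge how close to normal the result really is.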
Classification
For binary Naive Bayes classification of the record data (REKT Database), we achieved a test accuracy of 60% with the initial model. Our response variable, scam type, initially had 9 unique values, which were then grouped into exit scams and exploits, as per the REKT Database update. I then discretized the log funds variables and saw the test accuracy improve by 12%. Discretizing numeric variables before running Naive Bayes is therefore worth considering for better results. For the Twitter and news text data, I first extracted the sentiment of each tweet and article using NLTK’s VADER module and then discretized the sentiment scores into Positive or Negative for Multinomial Naive Bayes classification. The test accuracy obtained on the merged text data was 75%.
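As a rough illustration of the discretization step, the sketch below bins a numeric column into equal-width intervals. Equal-width binning, the number of bins, and the sample values are all assumptions made for this example; the binning strategy actually used on the log funds variables may differ.

```python
def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in a single bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical log-transformed funds-lost values.
log_funds = [9.4, 10.8, 11.4, 12.9, 14.2, 15.9, 20.2]

bins = equal_width_bins(log_funds, n_bins=3)
print(bins)
```

The resulting bin indices can be treated as categories, which suits a Naive Bayes model that estimates a separate probability for each category instead of assuming a particular continuous distribution.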
The Decision Tree model was also trained on the record data (REKT Database), using the same binary response variable of scam type (exit scam or exploit). Our expectation was that the Decision Tree would perform better than the Naive Bayes classifiers because it is a more complex and more powerful algorithm than Naive Bayes. The initial Decision Tree scored 100% on all metrics on the training data but only 62% accuracy on the test data. This indicated major overfitting, so hyperparameter tuning was employed to help the tree generalize to unseen data. After tuning, the test accuracy was significantly higher at 76%, which means that the Decision Tree did outperform the Naive Bayes classifier trained on discretized data. Hyperparameter tuning is therefore an important step in the machine learning process, and it offers insight into why and how a model’s performance changes as its parameters are adjusted.
For the SVM (Support Vector Machine) analysis, we examined the performance of four SVM kernels (Linear, Polynomial, RBF, and Sigmoid) on our cleaned text data gathered from the Twitter and News APIs. The combined dataset of tweets and news content contained 380 observations. We tested the support vector machines against a random (baseline) classifier, which had a test accuracy of 54%. The hyperparameter-tuned RBF and Sigmoid SVM models had the highest classification accuracies, at 83% and 82% respectively, while the lowest accuracy, 71%, came from the initial (untuned) Sigmoid SVM model. As with the Decision Tree, tuning made a clear difference: the Sigmoid kernel went from the weakest model to one of the two strongest.
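To put the baseline figure in context, a random classifier on a two-class problem can be sketched as follows. The even class split, the seed, and the function name are assumptions made for this illustration; only the dataset size of 380 comes from the analysis above.

```python
import random


def random_baseline_accuracy(true_labels, classes, seed=0):
    """Accuracy of guessing a class uniformly at random for every observation."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    guesses = [rng.choice(classes) for _ in true_labels]
    correct = sum(g == t for g, t in zip(guesses, true_labels))
    return correct / len(true_labels)


# Hypothetical labels: 380 observations with an assumed even class split.
labels = ["Positive"] * 190 + ["Negative"] * 190

baseline = random_baseline_accuracy(labels, ["Positive", "Negative"])
print(f"random baseline accuracy: {baseline:.2f}")
```

With two classes, uniform guessing lands near 50% whatever the class balance, so any model worth keeping needs to beat that by a clear margin.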
Clustering
The K-Means model produced the clusters of the feature dataset (X) that best matched the target variable (Y), scam_type_grouped. Hierarchical clustering performed second best, and DBSCAN performed the worst. The primary finding was that the right clustering model can help recover the ground-truth labels from the feature dataset alone. Therefore, although simple and easy to execute, clustering models can be powerful for generating accurate insights about the data from the independent features only. My modeling process would have benefited from more numeric features in the dataset, which would probably have led not only to more clusters but also to clusters of different shapes, sizes, and densities. For example, an additional numeric feature not related to funds would have added to the complexity of the dataset, so comparing the resulting clusters with the labelled (Y) groups might have yielded different results. Moreover, we could have visualized the data in 3-D using the three numeric features, adding a new perspective on the dataset.
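One simple way to quantify how well a set of clusters matches known labels is cluster purity, sketched below. Purity is an assumed metric chosen for illustration, not necessarily the comparison used in this analysis, and the cluster assignments and labels are made up.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, true_labels):
    """Fraction of points whose label matches the majority label of their cluster."""
    members = defaultdict(list)
    for cid, label in zip(cluster_ids, true_labels):
        members[cid].append(label)
    # For each cluster, count how many points carry its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)


# Hypothetical cluster assignments and scam_type_grouped labels.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
labels = ["exit", "exit", "exploit", "exploit", "exploit", "exploit", "exit", "exit"]

purity = cluster_purity(clusters, labels)
print(f"purity: {purity:.2f}")
```

Purity rises trivially as the number of clusters grows, so it is most meaningful when models are compared at the same number of clusters.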
ARM and Networking
We found that Exploits (Access Control, Flash Loan Attacks, and Phishing) were generally linked to the Ethereum chain and both lost and recovered funds in the low millions (USD). The CEX cryptocurrency platform was also associated with exploit attacks and was able to recover funds in the low millions (USD). Exit Scams, comprising Honeypot attacks, Rugpull attacks, and Abandoned scams, on the other hand, generally resulted in funds lost in the low millions (USD) with no funds recovered. Exploit attacks were more common on weekends than on weekdays. Our dataset was well suited to the Apriori Algorithm; however, a larger dataset would take more time to train on and may produce more insignificant rules to filter out.
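Association rules such as "Exit Scam implies no funds recovered" are usually judged by their support, confidence, and lift. The sketch below computes those three measures for a single rule; the transactions and item names are hypothetical and only mimic the kind of items described above.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    n_consequent = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    if n_antecedent == 0 or n_consequent == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n                      # share of all transactions matching the rule
    confidence = n_both / n_antecedent        # P(consequent | antecedent)
    lift = confidence / (n_consequent / n)    # above 1 means a positive association
    return support, confidence, lift


# Hypothetical transactions: one set of items per attack.
transactions = [
    {"Exploit", "ETH", "recovered_low_millions"},
    {"Exploit", "ETH", "recovered_none"},
    {"Exit Scam", "BSC", "recovered_none"},
    {"Exit Scam", "ETH", "recovered_none"},
    {"Exploit", "CEX", "recovered_low_millions"},
]

support, confidence, lift = rule_metrics(
    transactions, {"Exit Scam"}, {"recovered_none"}
)
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Filtering on minimum support and lift is what keeps the number of rules manageable, which is why a larger dataset tends to need stricter thresholds.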
Final Thoughts
A crypto attack poses many kinds of threats to a wide range of stakeholders. Taking in millions in funds from the public, including investors and ordinary people, and then either inadvertently losing or deliberately siphoning off those funds lowers public confidence in cryptocurrencies. Although many government agencies in the United States and around the world are tasked with overseeing the industry and hunting down crypto criminals, regulators are often outnumbered and outgunned by tech-savvy scammers and fraudsters.
Cryptocurrency is especially attractive to scammers for three main reasons:
Decentralized Industry: Since crypto assets and applications are part of a decentralized financial (DeFi) system, intended to be used without oversight from a bank or government, there’s no central authority to stop a transaction or flag something if it looks suspicious.
Irreversible Transactions: Because of the way the blockchain works, once you’ve sent a crypto transaction, there’s no way to retrieve your funds.
Anonymity: Crypto users interact through wallet addresses, not legal names, so it’s difficult to track down specific users, especially if they’re trying to stay hidden.
Exit Scams result in the highest losses of swindled money and assets, with the ETH (Ethereum) chain being the most exposed to crypto attacks. In terms of sentiment, our Twitter data was more negative than the news content we collected. Although I wished to analyze the locations from which the attacks took place, doing so would require more data from other sources. Moreover, the “detail” column of the REKT Database could have been analyzed in more depth to possibly obtain the attacker’s location.
Lastly, crypto is here to stay and could see mainstream adoption among retail and institutional investors in the next 10 years. Its popularity or, rather, notoriety among the general public is called into question by the massive funds lost in single incidents, such as the recent case of the cryptocurrency exchange FTX, owned by Sam Bankman-Fried. In most crypto attacks, the institutional investors and members of the general public who invested their funds in the chain are affected the worst, while the owners of the chain get away scot-free. Therefore, the rich get richer and the poor get poorer.
References
“Why September Is Seen as a Bad Month for Cryptocurrencies.” cnbctv18.com, September 22, 2021. https://www.cnbctv18.com/cryptocurrency/why-september-is-seen-as-a-bad-month-for-cryptocurrencies-10823602.htm.
Yang, Ying, and Geoffrey I. Webb. “Discretization for Naive-Bayes Learning: Managing Discretization Bias and Variance - Machine Learning.” SpringerLink. Springer US, September 4, 2008. https://link.springer.com/article/10.1007/s10994-008-5083-5.
“Cryptocurrency Scams Explained.” NerdWallet. Accessed December 5, 2022. https://www.nerdwallet.com/article/investing/cryptocurrency-scams.