After loading the record data and taking a quick look at it, we can move on to cleaning/pre-processing the data. I first stripped the whitespace from the column names. Then I dropped variables that would not assist in our analysis. Most of the variables we drop from our 31 columns are links to external sites, such as the Web Archive, Discord, and GitHub. Moreover, these variables contain more than 2900 NaN values out of a total observation count of 3055. However, the `proof_link` variable could be important if I decide to scrape text data from the linked article about a crypto attack. We shall also get rid of the `technical_issue` field because it contains only 4 non-NaN values and, more importantly, offers no insight. Therefore, I find it sensible to remove these fields entirely, instead of just dropping their NaN values, for EDA and modeling purposes.
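A minimal sketch of these first cleaning steps in pandas (the toy frame stands in for the real 31-column data; all column names other than `technical_issue` are illustrative):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the REKT frame; technical_issue is a real column from
# the text, the other names here are illustrative.
REKT_df = pd.DataFrame({
    " title ": ["a", "b", "c"],
    "webarchive_link": [np.nan, np.nan, "x"],
    "technical_issue": [np.nan, 1.0, np.nan],
})

# Strip whitespace from the column names.
REKT_df.columns = REKT_df.columns.str.strip()

# Drop the external-link column and the near-empty technical_issue field.
REKT_df = REKT_df.drop(columns=["webarchive_link", "technical_issue"])
print(list(REKT_df.columns))  # -> ['title']
```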
Next, we proceed to variable identification and typecasting, an important step to recognize what types of data our variables fit in. For variables identified as Integer type, the summary is as follows:
The `id` variable is a unique, nominal code indicating the token/coin associated with the crypto attack. Converting it to category type would not be beneficial due to the large number of unique tokens present in the database. This variable should be converted to object/string type.
The `active` variable most probably represents whether the crypto project is currently active in the market. I perused the API documentation to try to find this variable's significance, but could not. Moreover, it takes the value 1 for ALL observations, which would mean that all crypto projects present in the database are still active. We could keep this variable for now, but converting it to category type would be better, as it most likely takes on two values: 1 (active) or 0 (inactive).
For variables identified as Object type, the summary is as follows:
Variables like `date`, `funds_lost`, and `funds_returned` are of type object, which means Pandas was not able to recognize the datatype of these three variables. Therefore, we shall convert each of them to its proper datatype.
`funds_lost` and `funds_returned` are converted to floats, and `date` is converted to pandas datetime.
For variables identified as Float type, the summary is as follows:
The `is_verified_source_code` and `is_public_team` variables take on the values of either 0 or 1. Hence, we convert them to category type.
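The typecasting steps above can be sketched as follows (toy values; the column names come from the text):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [101, 102],
    "active": [1, 1],
    "date": ["2021-08-10", "2022-01-03"],
    "funds_lost": ["1000000", "250000.5"],
    "funds_returned": ["0", "50000"],
    "is_verified_source_code": [0.0, 1.0],
    "is_public_team": [1.0, 0.0],
})

df["id"] = df["id"].astype(str)                    # nominal code -> string
df["active"] = df["active"].astype("category")     # 1 (active) / 0 (inactive)
df["date"] = pd.to_datetime(df["date"])            # object -> datetime64
df["funds_lost"] = df["funds_lost"].astype(float)  # object -> float
df["funds_returned"] = df["funds_returned"].astype(float)
df["is_verified_source_code"] = df["is_verified_source_code"].astype("category")
df["is_public_team"] = df["is_public_team"].astype("category")
```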
Data Cleaning for Classification Tasks:
If the funds lost are $0, the respective observation has little information to offer a Naive Bayes model. Also, if no funds were lost, then no funds can be returned/recovered. Therefore, filtering out the $0 `funds_lost` observations makes sense if we want balanced classes for our target, scam type.
After filtering out the $0 `funds_lost` values, 1861 honeypots, 383 rugpulls, and 29 “other” scams are dropped in the process of treating class imbalance. These removed attacks offered no key information in terms of funds lost or source of the attack. We are left with only the rows where funds lost != 0, where every crypto project has some key information to offer regarding the extent of the scam, which will be useful in predicting the scam type.
As per the REKT Database, Honeypot attacks, Rugpull attacks, Abandoned scams, and the Kronos Dao project (classified as “other”) can be pooled together as Exit Scams, and all other attacks can be pooled together as Exploits. We will therefore conduct Naive Bayes with a binary target. After pooling the respective scam types and treating class imbalance, the class frequencies of our desired target variable are n(Exit Scam) = 380 and n(Exploit) = 435.
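A sketch of the filtering and pooling described above (toy rows; the set of Exit Scam labels follows the text, but the exact label strings in the real data may differ):

```python
import pandas as pd

df = pd.DataFrame({
    "funds_lost": [0.0, 5e5, 2e6, 1e4],
    "scam_type": ["Honeypot", "Rugpull", "Flash Loan Attack", "Abandoned"],
})

# Drop records where no funds were lost.
df = df[df["funds_lost"] != 0].copy()

# Pool scam types into a binary target.
exit_scams = {"Honeypot", "Rugpull", "Abandoned", "Other"}
df["target"] = df["scam_type"].apply(
    lambda s: "Exit Scam" if s in exit_scams else "Exploit"
)
print(df["target"].tolist())  # -> ['Exit Scam', 'Exploit', 'Exit Scam']
```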
We shall also take the log of `funds_lost` and `funds_returned` to better satisfy the normality assumption of Naive Bayes, as well as for better prediction accuracy.
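One way to apply the transform (using `np.log1p` here is my assumption, to keep the frequent $0 `funds_returned` values finite; plain `np.log` would produce `-inf` at zero):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"funds_lost": [1e4, 5e6], "funds_returned": [0.0, 2e5]})

# log1p(x) = log(1 + x), which stays finite at x = 0.
df["log_funds_lost"] = np.log1p(df["funds_lost"])
df["log_funds_returned"] = np.log1p(df["funds_returned"])
```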
Extracting important time-based features for a better EDA experience:
The features extracted from the raw date variable are `month_of_attack`, `day_of_week_of_attack`, and `day_of_year_of_attack`.
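With `date` already cast to datetime, the three features fall out of pandas' `.dt` accessor (toy dates):

```python
import pandas as pd

df = pd.DataFrame({"date": pd.to_datetime(["2021-08-10", "2022-01-03"])})

df["month_of_attack"] = df["date"].dt.month
df["day_of_week_of_attack"] = df["date"].dt.day_name()
df["day_of_year_of_attack"] = df["date"].dt.dayofyear

print(df["day_of_year_of_attack"].tolist())  # -> [222, 3]
```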
Removing HTML Tags from Description variable
We also have to clean a vital text variable in our dataset that will be used in NLP tasks later. This is the `description` variable, which verbosely lays out the proof of the attack taking place and verified links where the reader can obtain more information. In order to process this column, I made use of the Beautiful Soup library's `.get_text()` method in a for loop that iterated over the `description` variable and appended the processed outputs back to the `REKT_df` data frame.
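A sketch of that loop, assuming Beautiful Soup (`bs4`) is installed; the HTML snippets are illustrative:

```python
import pandas as pd
from bs4 import BeautifulSoup

REKT_df = pd.DataFrame({
    "description": [
        "<p>Funds were <b>drained</b> overnight.</p>",
        "<div>The team rug-pulled investors.</div>",
    ],
})

# Strip HTML tags row by row and write the plain text back.
cleaned = []
for html in REKT_df["description"]:
    cleaned.append(BeautifulSoup(html, "html.parser").get_text())
REKT_df["description"] = cleaned
```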
HOWEVER, a crucial step was missing: getting rid of the NaNs present in the `description` variable. This variable has 285 missing values, but when we bring the `funds_lost` variable into the mix, only 284 rows are missing both. The `funds_lost` variable is a highly valuable attribute and, possibly, a target variable for modeling, so we cannot drop even the single observation that has a `funds_lost` value but no description. Therefore, as seen in the two screenshots below, we could not get rid of ALL NaNs present in the `funds_lost` and `description` variables.
Those 284 rows were the only ones discarded while cleaning the dataset. Yes, missing values remain in other rows, but we have done well to eliminate most, if not all, unnecessary data for EDA!
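That drop — removing rows missing both `description` and `funds_lost`, while keeping the lone row that has `funds_lost` but no description — can be expressed with `dropna(how="all")` on that column subset (toy rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "description": [np.nan, np.nan, "Exploit via a flash loan."],
    "funds_lost": [np.nan, 1e6, 5e5],
})

# Drop only the rows where BOTH columns are NaN; the row with a
# funds_lost value but no description survives.
df = df.dropna(subset=["description", "funds_lost"], how="all")
print(len(df))  # -> 2
```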
After all these steps are completed, we end up with cleaned data like this:
Count Vectorizer for Analyzing Cosine Similarity
The regex `.str.lower().apply(lambda x: re.sub(r"(?:\@|https?\://)\S+", "", x))`, along with NLTK stopword and tokenization functions, allows us to calculate Perplexity for n-gram models.
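A runnable sketch of that preprocessing (the sample sentence is made up, and a tiny hand-rolled stopword set stands in for NLTK's so the snippet is self-contained):

```python
import re
import pandas as pd

desc = pd.Series(["Check https://rekt.news/report @attacker Funds were STOLEN"])

# The regex from the text: lowercase, then strip @handles and URLs.
cleaned = desc.str.lower().apply(
    lambda x: re.sub(r"(?:\@|https?\://)\S+", "", x)
)

# Stand-in for NLTK stopword removal and tokenization.
stop_words = {"were", "the", "a", "an"}
tokens = [t for t in cleaned.iloc[0].split() if t not in stop_words]
print(tokens)  # -> ['check', 'funds', 'stolen']
```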
Using the text already processed for the Perplexity calculation, we can also run Sentiment Analysis with VADER. See the EDA page for more detailed content!
The news data was cleaned by dropping variables that were not useful for EDA or modeling. Only the `author` variable has missing values. Moreover, this variable does not provide much insight for our analysis, as the newspaper source is more important, so we can drop it without losing useful information for my project. We must also clean the `source` column using `gsub()`, because it contains dictionary values, with 2 keys, in the form of strings. We only need the name of the publishing organization and, hence, string operations will help us parse the column correctly.
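The mention of `gsub()` suggests this step was done in R; an equivalent sketch in pandas, with an illustrative stringified-dict `source` value, might look like:

```python
import pandas as pd

news_df = pd.DataFrame({
    "author": [None, "Jane Doe"],
    "source": [
        "{'id': 'coindesk', 'name': 'CoinDesk'}",
        "{'id': None, 'name': 'Cointelegraph'}",
    ],
})

# Drop the mostly-missing, low-insight author column.
news_df = news_df.drop(columns=["author"])

# Keep only the value of the 'name' key from the stringified dicts.
news_df["source"] = news_df["source"].str.extract(
    r"'name':\s*'([^']+)'", expand=False
)
print(news_df["source"].tolist())  # -> ['CoinDesk', 'Cointelegraph']
```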