For this project, data will be collected using Python, R, Google Chrome Extensions* (see point #2), and Postman through:
1. APIs: Application Programming Interfaces
The term originally referred to libraries used internally within an organization.
Developers now also use it to mean an interface for accessing external data.
Many web-based services have a public-facing API (Twitter, Amazon, Google, Spotify, etc.).
Having used the twitteR package in the Bootcamp, I could easily scrape just over 1,000 tweets with the query “crypto + crime”. I cleaned the data using the same functions from the Bootcamp assignments and was left with a dataset of about 500 observations. I will revisit my data cleaning procedure and, importantly, I have saved my raw tweets file.
The NEWSAPI was easy to follow, and I obtained content from a hundred articles. However, tuning the query parameter to obtain relevant articles was somewhat time-consuming. I tried several queries with logical operators, but they returned no results. I then queried only for the word “crypto” and found several relevant articles that described crypto attacks. I will hold onto this CSV file for now, but I may discard it later because I came across a very potent database tailored exactly to my topic (read below!).
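To compare the broad query with an operator-based one, the two request URLs can be built side by side. This is only a sketch: it assumes the `/v2/everything` endpoint with its `q`, `language`, `pageSize`, and `apiKey` parameters, and both the API key and the operator query shown here are placeholders, not the queries I actually ran.

```python
from urllib.parse import urlencode

NEWSAPI_EVERYTHING = "https://newsapi.org/v2/everything"


def build_newsapi_url(query, api_key, page_size=100, language="en"):
    """Return a GET URL for the /v2/everything endpoint."""
    params = {
        "q": query,             # search expression; may contain AND / OR / NOT
        "language": language,
        "pageSize": page_size,  # number of articles per page
        "apiKey": api_key,      # placeholder; use your own key
    }
    # urlencode percent-encodes quotes and parentheses and turns spaces into '+'
    return f"{NEWSAPI_EVERYTHING}?{urlencode(params)}"


# Broad query (the one that returned relevant articles)
broad = build_newsapi_url("crypto", "YOUR_API_KEY")

# Hypothetical narrower query using logical operators
narrow = build_newsapi_url('crypto AND (hack OR scam OR "rug pull")', "YOUR_API_KEY")

print(broad)
print(narrow)
```

Printing the encoded URLs makes it easier to see exactly what is sent, which helps when an operator query comes back empty and needs to be loosened one term at a time.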
While brainstorming data sources for my project, I only had popular social media and news media sources in mind. Hours of searching the web with keywords pertaining to my topic led me to the REKT Database offered by Defiyield App. According to Defiyield, this is the largest database of crypto hacks, including scams, exploits, and rug pulls. The unfiltered database includes 3049 observations, all of which are linked to some kind of crypto attack. The main fields that I wish to include in my analysis are the dollar funds lost and returned, the description of the crime (a synopsis provided by Defiyield along with links to news sources as proof that the issue occurred), the crypto token(s) and exchanges affected by the issue, and the type of scam. I accessed the API through Postman and got responses for the first 100 observations, but I have yet to write Python code to paginate through all pages of the database.
EDIT: It turns out I did not need to write Python code to paginate the database, as Postman could retrieve all 3049 rows for me! I only had to set the limit parameter to the total number of rows (3049) instead of the default (100). The JSON response has been updated in the tree directory on GitHub.
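Had the limit parameter not accepted the full row count, paging through the database could have looked roughly like the sketch below. The `page` parameter name and the stop condition (a page shorter than `limit` means the end) are assumptions, and `fake_fetch_page` is a stand-in for the real HTTP request so the loop can be tried without network access.

```python
def fetch_all(fetch_page, limit=100):
    """Collect rows page by page until a page comes back short or empty."""
    rows = []
    page = 1
    while True:
        batch = fetch_page(page=page, limit=limit)
        rows.extend(batch)
        if len(batch) < limit:  # last (possibly empty) page reached
            break
        page += 1
    return rows


def fake_fetch_page(page, limit, total=3049):
    """Hypothetical stand-in for the API call: serves `total` dummy rows."""
    start = (page - 1) * limit
    end = min(start + limit, total)
    return [{"id": i} for i in range(start, end)]


rows = fetch_all(fake_fetch_page, limit=100)
print(len(rows))
```

Passing the fetch function in as an argument keeps the paging logic separate from the request itself, so the stand-in can later be swapped for a function that calls the real endpoint.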
2. Instant Web Scraping Extension
I used this extension to experiment with the REKT Database website. I succeeded in obtaining all but one (very important) field of the database: the description of the attack. However, I will soon use the REKT Database API in Python to obtain the complete database. The Instant Web Scraping Extension is only a short-term fix that let me get my hands dirty while obtaining data. With the extension, I produced a CSV file of about 3000 observations and 7 variables.
EDIT: Given the progress with Postman described above, this dataset is most likely redundant. However, I will hang on to it for now, until I start cleaning the Postman JSON response.
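Before cleaning, the JSON response could be flattened into the same tabular shape as the extension's CSV so the two can be compared. The sketch below uses only the standard library. The two records and all field names are made up for illustration and are not the actual REKT Database schema.

```python
import csv
import io
import json

# Made-up sample standing in for the Postman JSON response
raw = json.loads("""
[
  {"project_name": "ExampleSwap", "funds_lost": 1200000, "funds_returned": 0,
   "scam_type": "Rug Pull", "description": "Liquidity removed by the deployer."},
  {"project_name": "DemoBridge", "funds_lost": 500000, "funds_returned": 250000,
   "scam_type": "Exploit"}
]
""")

# Hypothetical column names; replace with the fields in the real response
FIELDS = ["project_name", "funds_lost", "funds_returned", "scam_type", "description"]


def to_csv(records, fields):
    """Write a list of dicts to CSV text, leaving missing fields blank."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", restval=""
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(raw, FIELDS)
print(csv_text)
```

Leaving missing fields blank instead of dropping the row keeps every observation in the table, which matters here because the description field is the one the extension could not capture.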