CoinMarketCal: Scraping Information for Crypto Analysis

Warning:

CoinMarketCal.com has since changed its layout, so the code below no longer works as written. It is still a useful guide and reference.

If you’ve ever been to CoinMarketCal.com, you’ll have noticed an abundance of crowdsourced information about cryptocurrency updates, roadmaps, and other changes. I wanted to use the crowdsourced roadmap data to see whether it had any impact on the coins’ prices.

Scraping the data

Unfortunately, CoinMarketCal does not have an API for gathering the data easily. We can still pull the page in as raw HTML and parse it using regular expressions in Python. We’ll start with our imports and the HTML download:

import re
import urllib.request
import pandas as pd

text = urllib.request.urlopen('http://coinmarketcal.com/').read().decode()

If we printed the text, we would see some messy HTML. After some manual searching, we can work out where the information we want sits. For the purposes of this project, we will scrape the coin name, coin ticker, update date, and update certainty. With these fields we can run some basic tests, such as whether an update raises the coin’s price when its certainty is above a threshold.
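The manual searching goes faster with a small helper that prints the text around a known marker. The sketch below is only an illustration: show_context and the sample HTML are made up for this post, not part of the original script. It uses repr() so that tabs and newlines show up as \t and \n, which matters because the patterns used later depend on the exact whitespace.

```python
def show_context(html, marker, width=60):
    """Return the text surrounding each occurrence of marker in html."""
    snippets = []
    start = html.find(marker)
    while start != -1:
        lo = max(0, start - width)
        hi = min(len(html), start + len(marker) + width)
        # repr() makes tabs and newlines visible as \t and \n
        snippets.append(repr(html[lo:hi]))
        start = html.find(marker, start + 1)
    return snippets


# Made-up HTML standing in for the downloaded page
sample = '<div>\n\t<div class="progress-bar" aria-valuenow="87" role="progressbar"></div>\n</div>'

for snippet in show_context(sample, 'aria-valuenow', width=20):
    print(snippet)
```

On the real page you would pass text instead of sample and try markers such as a coin name you can see in the browser.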

The actual scraping uses regular expressions (more information on them can be found here). It looks like the following:

coins = re.findall('(?<=Coin -->\n\t\t\t\t\t\t\t\t\t<h5><strong>).+(?=</strong>)',text)
dates = re.findall('(?<=\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<h5><strong>).+(?=</strong>)', text)
certainty = re.findall('(?<=aria-valuenow=").+?(?=" role)', text)
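Since the live page has changed, here is a small sketch that runs the same style of pattern against a hand-written HTML fragment. Everything in the fragment (the coins, dates, certainty values, and the tab counts in COIN_INDENT and DATE_INDENT) is an assumption made for illustration, so treat it as a way to see how the lookbehind and lookahead work rather than as a copy of the real markup.

```python
import re

COIN_INDENT = '\t' * 9    # assumed indentation before the coin heading
DATE_INDENT = '\t' * 19   # assumed indentation before the date heading


def fake_event(coin, date, certainty):
    """Build a made-up HTML fragment for one calendar event."""
    return ''.join([
        '<!-- Coin -->\n',
        COIN_INDENT + '<h5><strong>' + coin + '</strong></h5>\n',
        DATE_INDENT + '<h5><strong>' + date + '</strong></h5>\n',
        '<div class="progress-bar" aria-valuenow="' + str(certainty) + '" role="progressbar"></div>\n',
    ])


sample = fake_event('Bitcoin (BTC)', '12 March 2018', 87) + fake_event('Ethereum (ETH)', '20 March 2018', 64)

# (?<=...) is a lookbehind and (?=...) a lookahead: both must match,
# but neither is included in the text that findall returns.
coins = re.findall('(?<=Coin -->\n' + COIN_INDENT + '<h5><strong>).+(?=</strong>)', sample)
dates = re.findall('(?<=' + DATE_INDENT + '<h5><strong>).+(?=</strong>)', sample)
certainty = re.findall('(?<=aria-valuenow=").+?(?=" role)', sample)

print(coins)      # ['Bitcoin (BTC)', 'Ethereum (ETH)']
print(dates)      # ['12 March 2018', '20 March 2018']
print(certainty)  # ['87', '64']
```

Python's re module only accepts fixed-width lookbehinds, which is why the indentation has to be spelled out tab by tab. It is also why a layout change on the site breaks this approach so easily.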

We can put all the lists together, turn them into a DataFrame, and export it as a CSV. The code in its entirety looks like:

import re
import urllib.request
import pandas as pd

text = urllib.request.urlopen('http://coinmarketcal.com/').read().decode()

coins = re.findall('(?<=Coin -->\n\t\t\t\t\t\t\t\t\t<h5><strong>).+(?=</strong>)',text)

parts = [coin.split() for coin in coins]

dates = re.findall('(?<=\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t<h5><strong>).+(?=</strong>)', text)

certainty = re.findall('(?<=aria-valuenow=").+?(?=" role)', text)

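# Build one row per event; zip stops at the shortest list, so a
# mismatch in list lengths cannot raise an IndexError
sheet = []
for part, date, cert in zip(parts, dates, certainty):
    # Keep only entries that split into exactly [name, ticker]
    if len(part) == 2:
        # part[1][1:-1] drops the first and last character around the ticker
        sheet.append([part[0], part[1][1:-1], date, int(cert)])

df = pd.DataFrame(sheet, columns=['coin', 'ticker', 'date', 'certainty'])
df.to_csv('coincal.csv')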
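One limitation worth noting: the two-part check in the loop drops any coin whose name contains a space, because splitting on whitespace then yields more than two pieces. A possible workaround is sketched below. It assumes each label ends with the ticker in parentheses, as in "Name (TICKER)", which is only inferred from the slicing in the script. The helper split_name_ticker is hypothetical and not part of the original code.

```python
def split_name_ticker(label):
    """Split 'Name (TICKER)' into (name, ticker); return None if no ticker is found."""
    # rpartition splits on the LAST space, so multi-word names stay intact
    name, sep, ticker = label.rpartition(' ')
    if not sep or not (ticker.startswith('(') and ticker.endswith(')')):
        return None
    return name, ticker[1:-1]


print(split_name_ticker('Bitcoin (BTC)'))       # ('Bitcoin', 'BTC')
print(split_name_ticker('Bitcoin Cash (BCH)'))  # ('Bitcoin Cash', 'BCH')
print(split_name_ticker('Bitcoin'))             # None
```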

From here we can begin to test price data against these events!
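As a rough idea of what the certainty-threshold test could look like, the sketch below compares the average price change for events at or above a threshold with the average for events below it. The rows, tickers, and the change_pct field are all invented for illustration. Real price changes would have to come from a separate price source, which this post does not cover.

```python
from statistics import mean


def split_by_certainty(events, threshold):
    """Return (mean change at/above threshold, mean change below it); None for an empty group."""
    high = [e['change_pct'] for e in events if e['certainty'] >= threshold]
    low = [e['change_pct'] for e in events if e['certainty'] < threshold]
    return (mean(high) if high else None, mean(low) if low else None)


# Made-up rows for illustration only; these are not real market data
events = [
    {'ticker': 'AAA', 'certainty': 95, 'change_pct': 4.0},
    {'ticker': 'BBB', 'certainty': 90, 'change_pct': 2.0},
    {'ticker': 'CCC', 'certainty': 60, 'change_pct': -1.0},
    {'ticker': 'DDD', 'certainty': 40, 'change_pct': 1.0},
]

high_avg, low_avg = split_by_certainty(events, threshold=80)
print(high_avg, low_avg)  # 3.0 0.0
```

A comparison of two averages like this is only a starting point. With so few events per coin, any difference would need a proper statistical test before reading much into it.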
