In the final, hectic days of a recent Kaggle competition I found myself in want of more GPU power. As one does on such occasions. My own laptop, with its GPU setup, was doing a fine job with various small models and small images, but scaling up both aspects became necessary to climb the leaderboard further. My first-choice options, GCP and AWS, quickly turned out to require quota increases which either didn’t happen or took too long. Thus, I decided to explore the paid options of Google Colab.
I had only ever used the free version of Colab, and found 2 paid subscriptions: Colab Pro and Colab Pro+. The Plus version, at a not insignificant $50/month, was advertising “priority access to faster GPUs” compared to the 10 bucks/month Pro tier. While researching this and other vague points in the feature descriptions on the website, I quickly realised that Plus had only been launched a day or two earlier, and nobody really seemed to know what the exact specs were. So I thought to myself “in for a penny, in for a pound”. Or about 36 pounds at the current exchange rate. Might as well get Plus and figure out what it’s about.
This post describes the features I found in Colab Pro+, alongside some notes on how best to use any version of Colab in a Kaggle competition. After sharing some findings on Twitter, I received a number of very useful suggestions for Colab alternatives, which are compiled in the final part of the post.
Colab Pro+ features
GPU resources: The Plus subscription gave me access to 1 V100 GPU in its “High-RAM” GPU runtime setting. I could only run one of these sessions at a time. Alternatively, the “Standard” RAM runtime option allowed me to run 2 concurrent sessions with 1 P100 GPU each.
The “High-RAM” runtime did justify its name by providing 53GB of RAM, alongside 8 CPU cores. In my “Standard” RAM sessions I got 13GB RAM and 2 CPUs (which I think might be what’s included in the free Colab).
The runtime limit for any session is 24 hours, which was consistent throughout my tests. The Plus subscription advertises that the notebook keeps running even after closing the browser, which can be a useful feature, but I didn’t test this. Be aware that even in Pro+ the runtime still disconnects after a certain period of inactivity (i.e. no cells are running).
With large data, storage is important. Compared to the 100GB of the Pro subscription, Pro+ provided me with 150GB of disk space. This turned out to make a crucial difference, allowing me to copy all of the train and test data in addition to pip installing updated libraries.
What I didn’t test: Colab’s TPU runtime as well as the number of concurrent CPU sessions.
Are those features worth the $50/month? On the one hand, having 24h of V100 power is a notable step up from the free Colab and Kaggle resources. On the other hand, being restricted to 1 session at a time, or 2 sessions with slower P100s, can be a limiting factor in a time crunch.
Also note that the Colab FAQ states that “Resources in Colab Pro and Pro+ are prioritized for subscribers who have recently used less resources, in order to prevent the monopolization of limited resources by a small number of users.” Thus, it seems unlikely that one could use a V100 GPU 24/7 for an entire month. I intend to run more experiments and might encounter this limit sooner or later.
Kaggling on Colab
If you’ve exceeded your (considerable) Kaggle resources for the week or, like me, need a bit more horsepower for a short time, then moving your Kaggle Notebook into Colab is a good option to keep training and experimenting. It can be non-trivial, though, and the 2 main challenges for me were getting the data and setting up the notebook environment. Well, that and Colab’s slightly infuriating choice to forego many of the standard Jupyter keyboard shortcuts (seriously: why?).
Get your data into Colab: by far the best and fastest way here is to copy the data via their GCS_DS_PATH, i.e. Google Cloud Storage path. Since Kaggle was acquired by Google in 2017, there has been significant integration of its framework into Google’s cloud environments. Kaggle datasets and competition data have cloud storage addresses and can be quickly moved to Colab from there.
You can get the GCS_DS_PATH by running the following code in a Kaggle Notebook. Substitute seti-breakthrough-listen with the name of your competition or dataset:
from kaggle_datasets import KaggleDatasets
GCS_DS_PATH = KaggleDatasets().get_gcs_path("seti-breakthrough-listen")
print(GCS_DS_PATH)
Within Colab you can then use the gsutil tool to copy the dataset, or even individual folders, like this:
!gsutil -m cp -r {GCS_DS_PATH}/"train" .
!gsutil -m cp -r {GCS_DS_PATH}/"test" .
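Depending on the bucket’s permissions, you may first need to authenticate your Colab session so that gsutil can read it. A minimal sketch, using Colab’s built-in auth helper (whether this step is needed depends on the competition):
from google.colab import auth
# Opens an interactive sign-in prompt; afterwards gsutil in this session
# can access GCS buckets your Google account is allowed to read.
auth.authenticate_user()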
This retrieves the data significantly faster than copying from Google Drive or downloading via the (otherwise great) Kaggle API. Of course, getting the data counts towards your 24-hour runtime limit. Keep in mind that the data is gone after your session gets disconnected, and you need to repeat the setup in a new session.
The same is true for any files you create in your session (such as trained model weights or submission files) or for custom libraries that you install. Colab has the usual Python and Deep Learning tools installed, but I found the versions to be rather old. You can update via pip like this:
!pip install --upgrade --force-reinstall fastai==2.4.1 -qq
!pip install --upgrade --force-reinstall timm==0.4.12 -qq
!pip install --upgrade --force-reinstall torch -qq
Two things to note: after installing you need to restart your runtime to be able to import the new libraries. No worries: your data will still be there after the restart. Also make sure to leave enough disk space to install everything you need.
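To keep an eye on the remaining disk space, a quick check from Python looks something like this (checking the root filesystem is an assumption; you could just as well run !df -h in a cell):
import shutil
# Report free space on the runtime's root filesystem.
total, used, free = shutil.disk_usage("/")
print(f"Disk: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")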
Save your outputs on Drive: A final note: make sure to copy the results of your experiments - whether they’re trained weights, submission files, or EDA visuals - to your Google Drive account so you don’t lose them when the runtime disconnects. Better to learn from my mistakes than by making them yourself. You can always download files manually, but I found that copying them automatically is more reliable.
You can mount Drive in your Colab notebook like this:
from google.colab import drive
drive.mount('/content/drive')
Then copy files e.g. via Python’s os.system.
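As a small sketch of that last step (the Drive folder and file names are just placeholders for your own outputs):
import os
# Hypothetical output folder on Drive; adjust to your own layout.
drive_dir = "/content/drive/MyDrive/kaggle_outputs"
os.makedirs(drive_dir, exist_ok=True)
# Copy whatever your training run produced, e.g. weights and a submission file.
for fname in ["model_weights.pth", "submission.csv"]:
    os.system(f"cp {fname} {drive_dir}/")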
Other cloud GPU options
While Colab and its Pro versions, as outlined above, have several strong points in their favour, you might want to explore other cloud GPU alternatives that either offer more power (A100s!) or are cheaper or more flexible to use. The options here were contributed by other ML practitioners via Twitter and are listed in no particular order. I’m not including the well-known GCP or AWS here, although someone recommended preemptible (aka “spot”) instances on these platforms.
Paperspace Gradient: the G1 subscription costs $8/month and gives free instances with smaller GPUs and a 6h runtime limit. Beyond that, you pay $2.30/h to run a V100 instance. 200GB of storage is included, plus 5 concurrently running notebooks.
JarvisCloud: a great landing page, plus A100 GPUs at $2.4/h, are attractive features here. They also offer up-to-date PyTorch, fastai, and TensorFlow as pre-installed frameworks. Storage up to 500GB at a maximum of 7 cents per hour.
Vast.ai: a marketplace for people to rent out their GPUs. You can also access GCP, AWS, and Paperspace resources here. Prices vary quite a bit, but some look significantly cheaper than those of the big players at a similar level of reliability.
OracleCloud: seems to be at about $3/h for a V100, which is comparable to AWS. A100s are “available soon”.
OVHcloud: a French provider known for not being very expensive. They have 1 V100 with 400GB storage starting at $1.7/h, which is not bad at all.
Plenty of options available to explore further. Maybe we’ll see prices drop a bit more amidst this healthy competition.
Hope those are useful for your projects on Kaggle or otherwise. Have fun!