The easy answer to this question is: the people. Great people build great communities. Case not quite closed yet, though, because there is more to it. Even the most promising group of individuals needs certain conditions in order to grow into a strong and thriving community. The kind of community that lifts up its members beyond their individual capabilities and becomes more than the sum of their proverbial skills and contributions. I believe that such communities are the cornerstones of all scientific fields, including data science, and that those fields succeed or fail depending on their communities.
Here is the case study that prompted this post: last week I took part in the Kaggle Days meeting in San Francisco. Kaggle Days are a new series of local meetings which aim to bring together members of the international, virtual Kaggle community for in-person workshops, presentations, and competitions. Over the last few years, Kaggle itself has transformed from being “merely” the go-to place for sophisticated machine learning competitions to a multi-faceted online community. The range and depth of user-hosted datasets continue to grow rapidly from month to month, as does a unique repository of machine learning and data science code templates in the form of Kaggle Kernels: reproducible R/Python notebooks or scripts in a self-contained, cloud-based environment. There really is something for everyone.
It is fair to say that Kaggle has been a main catalyst for my career change. Joining the platform in early 2017, two years ago, opened my eyes to the multitude of fascinating challenges and problem-solving strategies beyond my narrow academic field. One year later, I had become the first ever Kaggle Kernels Grandmaster - a journey that I plan to revisit in a future post. What drew me into Kaggle, beyond the fun competitions, was a remarkably friendly and supportive community. It’s a rare occurrence to find people who are both extremely smart and happy to help a newcomer in an approachable and relaxed way. Machine Learning can be intimidating, but the people on Kaggle made it fun. Sooner than expected, I started to feel that I had become part of a community which, despite being very competitive, was remarkably efficient at working together to solve hard problems. Especially considering that we all worked on these challenges in our free time and in different corners of the world.
All this back story might help to illustrate why I felt rather excited to finally encounter many of my fellow Kagglers in person at the Kaggle Days meetup. Excited and a bit nervous as to how the virtual collaboration would translate to the real world. I had high expectations - which were exceeded spectacularly. Kaggle Days was a blast! Almost the entire Kaggle Team was present, including CEO Anthony Goldbloom and Co-Founder Ben Hamner. Top Kagglers such as Bojan Tunguz and Dmitry Larko gave presentations and workshops alongside Machine Learning gurus like François Chollet (Keras) and Quoc Le (Google Brain / AutoML). So many super smart people to listen and talk to! It was great fun.
Interestingly, as a side effect of my initial Kaggle anonymity (I did not use my real name at all during my first year) I quickly found it more useful to introduce myself as “Heads or Tails”, even though my badge had my actual name on it. I only forgot this when I first met almost the entire Kaggle team at once and needed a second take for a more useful introduction. As this blog shows, I still prefer the “Heads or Tails” moniker for my Data Science personality. Let’s hope that no psychologists read this.
Ever the hands-on community, the second (and final) day of the Kaggle Days meetup gave us the opportunity to form small teams to participate in an on-site competition. This was particularly interesting for me, since I had never teamed up with others to tackle a Kaggle problem. In a team, there is much more code sharing and discussion than in the open Kaggle forum. And even though there were a few small hiccups in this particular competition (Want a change in metric plus additional data halfway through? Say no more!), working with my teammates was a lot of fun. Shout-out to Michael and Garrick - you guys rock! As a result of Kaggle Days I’m definitely more motivated to team up with others in future competitions. Not just that: I’m more motivated in general to spend time on Kaggle.
Now, why is that exactly? What makes the Kaggle community such a fun place? For me, there are several different factors that all enhance each other when combined:
Kagglers are smart yet down to earth. Not only are they happy to share their insights, they really make an effort to do so in an accessible way. I recommend that anyone who joins a competition not abandon it as soon as the results are in, but read the write-ups of the top teams, which are usually detailed and of high quality. There is lots to learn from such a post-mortem.
No big egos in the community. This is related to the previous point but touches on a different aspect. Even though our community has its own big names whose opinions (deservedly) have weight in discussions, nobody thinks they are more important than others. This is crucial because it lowers the threshold for beginners (and anyone) to ask questions. Asking questions is what drives the improvement of individuals and the community. As a side note: In my time in academia I have come across some really big egos, although luckily never in immediate collaboration. Although these people are very smart, you really don’t want to be around them for longer than absolutely necessary. No gossip here - moving on:
A common goal. As soon as you meet other Kagglers you have something to talk about, be it an ongoing competition, the new Kernels interface, or the most recent Machine Learning tools. But it goes beyond that: competitions are the best example of creating a specific goal that everyone can focus on and contribute to, to the best of their abilities. And while sharing is encouraged, there is plenty of competition at the highest level. The last days of a competition are one of the most intense examples of a singular focus in an online community of hundreds to tens of thousands of people from all over the world.
Diversity. Speaking of ‘all over the world’: Kagglers come from a large number of countries and have many different backgrounds. It is true that we are still predominantly male and STEM-based, and we are working on becoming more inclusive towards many other groups. You can think about it this way: When doing Machine Learning, diversity is a big advantage. If you average over several models then your results will be better the less similar these models are (i.e. the less correlated their predictions and errors are). An insight that is missed by one model might be picked up by a different method. The likelihood that all models will overfit in the same direction is smaller. And the same is true for communities. Different points of view help us to challenge preconceived beliefs and broaden our horizons. For deeper insights into the diversity of Kagglers you can check out my analysis, and those of many others, of the latest Kaggle Survey which, true to form, is a detailed annual assessment of the state of the community.
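To make the model-averaging analogy a little more concrete, here is a minimal Python sketch. It rests on idealized assumptions that are mine, not anything specific to Kaggle: every model has the same error variance, and every pair of models has the same error correlation rho. Under those assumptions the variance of the averaged error is sigma^2 * (rho + (1 - rho) / n), so it shrinks as the models become less correlated. The function names are purely illustrative.

```python
import random
import statistics


def ensemble_error_variance(n_models: int, sigma: float, rho: float) -> float:
    """Analytic variance of the averaged error of n_models models.

    Assumes (idealized): each model's error has variance sigma**2 and
    every pair of models has the same error correlation rho.
    """
    return sigma ** 2 * (rho + (1.0 - rho) / n_models)


def simulate_ensemble_error_variance(
    n_models: int, rho: float, n_trials: int = 20000, seed: int = 0
) -> float:
    """Estimate the same quantity by simulation (unit error variance)."""
    rng = random.Random(seed)
    averaged_errors = []
    for _ in range(n_trials):
        # A shared component makes the models' errors correlated;
        # the independent component is what "diversity" contributes.
        shared = rng.gauss(0.0, 1.0)
        errors = [
            rho ** 0.5 * shared + (1.0 - rho) ** 0.5 * rng.gauss(0.0, 1.0)
            for _ in range(n_models)
        ]
        averaged_errors.append(sum(errors) / n_models)
    return statistics.pvariance(averaged_errors)


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        analytic = ensemble_error_variance(5, 1.0, rho)
        simulated = simulate_ensemble_error_variance(5, rho)
        print(f"rho={rho}: analytic={analytic:.3f}, simulated={simulated:.3f}")
```

With rho = 1 (all models make identical errors) averaging gains nothing, while with rho = 0 the error variance drops by a factor equal to the number of models. Real ensembles sit somewhere in between.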
People care about the community. This is one of the most important factors. It might somewhat derive from the points above but it’s by no means a given. I have been part of (and witness to) passionate discussions in the Kaggle forums about difficult issues in the community. Often Kagglers themselves have suggested solutions to problems that the administrative team might not have been aware of. And even if tempers flare up, which is more understandable in a competitive context, there is mutual respect and usually a (virtual) handshake once the dust has settled. Kaggle is our community and we care about keeping it friendly and welcoming.
Infrastructure for collaboration and communication. Last but not least, for a community to function well there need to be tools and environments in place that allow for efficient communication. The lower the thresholds are for exchanging information the better it will work. Ideally, the infrastructure should be designed in a way that encourages different ways of interaction for the community members. This further promotes a welcoming and inclusive atmosphere. Kaggle provides all this through discussion forums (general ones and those specific to each competition or data set). In addition, the aforementioned Kernel notebooks have comments enabled, which is a great way to show appreciation for an author’s work or ask for clarification. From my point of view, commenting on Kernels is a great, low-threshold way of starting to actively participate in the community. And I can guarantee that Kernel authors appreciate feedback. My own Kernels have frequently been improved by people kind enough to post their ideas and suggestions.
A special kind of infrastructure is an in-person meetup like Kaggle Days. Remote interaction works fine, but in my experience there is a certain extra factor in face-to-face meetings. During my time in academia I had the privilege to be part of many successful teams and to work with many smart people. The highlights of these collaborations were always our team meetings in which we could brainstorm new ideas and strategies. Sometimes at a hotel pool, sometimes late at night over drinks or pizza; but always with extra energy and creativity. And when your creativity goes into overdrive then you have found a great community.
Finally, this post doesn’t feel complete without highlighting the remarkable way in which the Data Science community (and especially the R community) is responding to the disgraceful case of sexual harassment and the botched attempt at a cover-up at DataCamp. The community is supporting the victim and former employees who spoke out and were fired. Many content creators are pulling their courses from DataCamp to push for necessary change. Here is a community actively working to transform bad practices that those who are primarily responsible are repeatedly failing to address. Because I’m an optimist at heart, I want to close by pointing to one of the most remarkable products of this sorry situation: a free natural language processing (NLP) course plus a great interactive app template built by Ines Montani:
Like many of you, I'm incredibly disappointed by DataCamp. I wanted to make a free version of my spaCy course so you don't have to sign up for their service – and ended up building my own interactive app. Powered by the awesome @mybinderteam & @gatsbyjs 💖 https://t.co/2QOuDPoZEX
Ines Montani 〰️ (@_inesmontani), April 17, 2019