The first six months as a Data Scientist have gone by in a blur of new tools, techniques, and experiences. In my new job, I’m lucky to be (once again) part of a great team of people who are not just smart, but always happy to share their knowledge and feedback in a productive and thoughtful way. A supportive environment helps a lot. Surprising insight - I know. Worth repeating nonetheless. Now it’s time to look back and reflect on the main challenges and lessons learnt during this initial stage in my journey from Academic Astronomy to Data Science in the Real World.
This post is mainly aimed at astronomers who are interested in making the switch. As always, I’m doing a stellar job of choosing the largest possible target audience. Still, I think that my thoughts might also be interesting for other academics, young students, or anyone interested in Data Science or Machine Learning. Yeah… that should cover enough people to propel my fame.
First, let’s start with the tools of the trade:
Programming languages: If you want to learn one language, learn Python. And that’s a hard thing for me to write, because personally I prefer the R language with its tidyverse collection of packages. But Python has a lot going for it: firstly, it has the appeal of being a general-purpose programming language and as such is accessible to a vast audience from many different backgrounds. Secondly, it has an immense repository of specialised libraries. For instance, in astrophysics we have the great astropy package, and for Machine Learning basics there’s the ubiquitous scikit-learn. And thirdly, a lot of cutting-edge methods from Neural Network research are quickly ported into Python frameworks. These factors mean that today most Data Science teams communicate in Python, and this trend is likely to continue in the near future.
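To give a flavour of how low the entry barrier is: a minimal scikit-learn workflow fits into a handful of lines. This is just a sketch on the bundled iris toy dataset, not a serious analysis:

```python
# Minimal scikit-learn workflow: load data, split, fit, evaluate.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print(f"Test accuracy: {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

Swap in your own features and labels and you have a first baseline; the same fit/predict interface carries over to essentially every scikit-learn estimator.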
Straight talk: R is a powerful language but it’s a bit more idiosyncratic (with a somewhat steeper learning curve) and currently less widely used, which gives Python the edge in popularity. And now is a crucial time for popularity to drive adoption: Both R and Python are growing, yet Python’s growth is exceptional.
Objectively, R can do everything that Python can do (including e.g. TensorFlow and Keras wrappers). Many things it can do better, such as beautiful visualisations (e.g. with ggplot2) or all the statistics you could possibly ever need in your life. Sometimes it might do things a little slower. But if you really need speed then native Python won’t cut it either. I digress. More importantly, astronomers: forget about IDL. Outside astronomy, no one even knows it exists - and all that legacy code is a weakness rather than a strength. MATLAB is not very popular either. Forget about Fortran (unless you want to write your own Deep Learning Nets or Gradient Boosted Trees). Some C might come in handy. And certainly forget about IRAF and MIDAS. A lot of 80s things are back in fashion but even nostalgia has its limits. Even if you plan to stay in academia, do your career a favour and learn Python. If you have the resources, learn both R and Python.
Databases: You will need at least basic skills in one data query language - ideally advanced skills in several to boost your productivity. An SQL flavour will be very useful; the flavours are relatively similar, and transitioning between them is easy. NoSQL tools like Elasticsearch may be required, too. Some astrophysics projects have used these frameworks for a while, but in my field we mainly relied on sending FITS files back and forth. Having well-designed database schemata is an immense advantage. Almost all industry jobs will involve the Notorious BFD (Big Freaking Data) - which will be stored in those very databases. The more of the data selection, cleaning, and merging you can do on those servers, the more efficient your work will be; the same logic extends to software like Spark. Successful data preparation is more than half the battle.
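As an illustration of that last point, here is a Python sketch using an in-memory SQLite database as a stand-in for a real server (the table and column names are invented): the selection, join, and aggregation all run inside the database, so only the small summary table ever reaches your session.

```python
# Push filtering, joining, and aggregation into the database
# instead of pulling whole tables into memory.
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for your actual database
conn.executescript("""
    CREATE TABLE customers (id INTEGER, country TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL, placed_at TEXT);
    INSERT INTO customers VALUES (1, 'DE'), (2, 'UK');
    INSERT INTO orders VALUES (1, 99.5, '2018-01-03'), (2, 12.0, '2018-02-14');
""")

query = """
    SELECT c.country, COUNT(*) AS n_orders, SUM(o.amount) AS revenue
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    WHERE o.placed_at >= '2018-01-01'
    GROUP BY c.country
"""
df = pd.read_sql(query, conn)  # only the aggregated result is transferred
print(df)
```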
Environments: Most people in Data Science use Jupyter Notebooks to share Exploratory Data Analysis, code samples, or proof-of-concept software snippets. I know that those tools have already achieved some popularity in Astronomy, but it’s worth emphasising how valuable they are for analysing data as a team. As an R user, in addition to using Jupyter, you can have an even smoother experience with R Markdown - ideally in RStudio, which I can strongly recommend as an IDE. This is also a good place to mention version control: git has become the most popular tool, and places like GitHub or GitLab provide convenient hosting space. These shared spaces are pretty much indispensable when multiple people work on the same project, but even for your individual projects git is well worth the couple of hours it takes to learn.
In the second part, let’s talk about workflow:
Data Science projects can span from a few hours to several weeks. They can also involve only yourself, several data scientists, or other parts of the company such as software engineering or sales. Which aspects dominate depends on your actual position in the team and on your company’s business. Similarly, you can spend different amounts of your time on data cleaning, exploratory analysis, visualisation, or machine learning. In my case, projects typically last weeks - with the occasional request for a couple of hours or days. Coming from an observational astrophysics field focussed on transient sources, I was already used to quickly switching my attention to a new project (say, when an interesting nova eruption was found). There are obvious parallels here in the form of client requests, news stories, IPOs, or similar (semi-)unexpected events that warrant immediate attention. Multi-tasking is important. Month-long projects are rare. The iteration loops are much quicker than what I had known in academia. It’s a bit of a cliché by now, but you really want to fail fast. Time is of the essence, and it rarely seems to move linearly.
Time budgets: Within a typical data science project, expect to spend more of your time on data cleaning and quality control than you would like. Probably upwards of 50%, including several iterations of identifying outliers and fixing data errors. Real-world data is considerably messier than astrophysical data. Yes, that is possible. And yes: I have worked with multi-wavelength, multi-telescope projects that produced messy data. A random set of typical industry projects will have less in common with one another than their astronomical counterparts do, and sources of uncertainty and systematics can be plentiful. Data exploration will take up maybe another 25% of your time, including visualisations. The remaining 25% will likely go towards documenting and communicating your findings. I’m counting here the time it will take to integrate your results into a production environment. If you are aiming to be a Machine Learning engineer, these tasks will take up a larger chunk of your time.
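What does one of those cleaning iterations look like in practice? Here is a small pandas sketch (the column names, values, and outlier threshold are all invented for illustration): first repair impossible values, then flag outliers for manual inspection rather than silently dropping them.

```python
# One cleaning iteration on a toy table: fix obvious data errors,
# then flag outliers for inspection.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [19.9, 21.5, -4.0, 20.1, 999.0, 18.7],  # -4.0 and 999.0 look suspect
    "units": [3, 1, 2, None, 5, 2],
})

# Data errors: negative prices are impossible; missing counts need a decision.
df.loc[df["price"] < 0, "price"] = np.nan
df["units"] = df["units"].fillna(0)

# Outliers: flag values more than ~3 robust standard deviations from the
# median (1.4826 * MAD approximates the standard deviation for normal data).
median = df["price"].median()
mad = (df["price"] - median).abs().median()
df["outlier"] = (df["price"] - median).abs() > 3 * 1.4826 * mad

print(df)
```

In a real project you would write such fixes into a cleaning script, rerun, and repeat until the remaining anomalies are explainable.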
Soft skills: Let me stress that it is a valuable and non-trivial skill to be able to communicate your results clearly and succinctly to a variety of audiences. Your team members will care about which statistical tests you used - management or clients probably will not. At academic conferences, we’ve all sat through those presentations with slides upon slides filled with text, equations, and intricate findings, yet little mention of the bigger picture or what can be learned from it. This is what you want to avoid - both as a speaker and an audience member, incidentally. Adjust your slides to your audience. Together with (well-documented) code samples, the major deliverable of your project is the effective communication of its results.
Thus concludes this communication of my impressions and experiences, at least for the moment. There will be more detailed updates on certain key aspects as time progresses. Let’s see what the next six months will bring.
PS: I’m trying out italics as a sarcasm/humour indicator. Since written text has no tone, and so on. But I’m sure that was abundantly clear from context.