This post continues my series on translations between R and Python code for common data science and machine learning tasks: a documented cheat sheet for those who, like myself, frequently switch between the two languages.
I wrote the first post in this series back in late 2020, full of optimism that it would kick off a whole series of frequent posts on the topic. Alas, other projects took precedence, as they often do. Now I'm back with renewed élan to do more blogging this year, and to write more posts on R & Python.
Let's get the setup out of the way: all code is reproducible in Rmarkdown via the Rstudio IDE (version 1.2 onwards). Python is integrated through reticulate; check out my intro post on reticulate to get started with that. These are our R and Python libraries:
# R
libs <- c('dplyr', 'stringr',       # wrangling
          'palmerpenguins', 'gt',   # data, table styling
          'vroom', 'readr',         # read & write data
          'tidyr', 'purrr',         # wrangle & iterate
          'fs',                     # file system
          'reticulate')             # python support
invisible(lapply(libs, library, character.only = TRUE))
use_python("/usr/bin/python3")

df_orig <- penguins %>%
  mutate_if(is.integer, as.double) %>%
  select(-contains("length"), -contains("depth"))
# Python
import pandas as pd
import glob
pd.set_option('display.max_columns', None)
Once again we will use the Palmer Penguins dataset from the eponymous R package. I will use a smaller subset of columns for reasons of readability.
This dataset contains measurements and characteristics for 3 species of penguin, namely:
df_orig %>%
  count(species) %>%
  gt()
| species | n |
|---|---|
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
In this exercise we will take the data frame apart into different files by species, and then put it back together again.
Briefly: Writing multiple csv files
The breaking-apart bit is not the focus of this post, though. I’m borrowing it from this stackoverflow post. To make the rest of this exercise more compact, we will only write out 1 row per species to a csv file:
# R
df_orig %>%
  group_by(species) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  nest(data = c(-species)) %>%
  pwalk(function(species, data) write_csv(data, file.path(".", str_c(species, "_penguins.csv"))))
For reasons of brevity this code chunk will remain somewhat of a black box; otherwise this post would balloon into something much longer. See it as a sneak peek of future content; and also something you can just use as is or explore for yourself.
Some brief notes only: `slice_head` is a recent addition to the `dplyr` package - an evolution of the earlier `slice` - and it does pretty much what you would expect: it slices off the first n entries (here by group). In `nest` we have a `tidyr` tool that bundles the selected columns into a list column (here everything except `species`). Finally, the `purrr` function `pwalk` iterates over a list, here writing out the different parts of the dataframe with custom names; a minimal toy example below.
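To demystify `pwalk` a little, here is a minimal sketch on an invented tibble (the columns `file` and `rows` exist purely for illustration): `pwalk` calls the supplied function once per row, matching column names to argument names.

# R
# pwalk() iterates row-wise, matching column names to function arguments
toy <- tibble(file = c("a.csv", "b.csv"), rows = c(10, 20))
pwalk(toy, function(file, rows) cat(file, "has", rows, "rows\n"))
## a.csv has 10 rows
## b.csv has 20 rows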
As it stands, `purrr` remains one of my weak points in the tidyverse and I plan to revisit it in a future post. For now, onward with the meat of this episode.
Reading a single csv file
In R, the standard tool for reading csv files is `readr::read_csv`. However, I prefer the `vroom` function from the vroom package. As its onomatopoeic name suggests, this package is fast - faster than `data.table::fread`, which I had preferred until `vroom` came around.

In the Python / pandas world, options are more limited and we use `pd.read_csv`:
# R
single_file <- vroom("Adelie_penguins.csv", col_types = cols())
single_file %>%
  gt()
| island | body_mass_g | sex | year |
|---|---|---|---|
| Torgersen | 3750 | male | 2007 |
# Python
single_file = pd.read_csv("Adelie_penguins.csv")
single_file
## island body_mass_g sex year
## 0 Torgersen 3750 male 2007
Other than pandas adding its characteristic index, we get the same data frame as the result. So far, so straightforward.
Reading multiple csv files
But what I really want to write about here is reading multiple csv files. These approaches are very convenient when you have a directory that contains many csv files with the same schema (perhaps from an automated daily pipeline).
Before we can read those files, we first have to find them. In R, we use the `dir_ls` tool from the `fs` package to search the current directory (`.`) for a glob pattern (`"*penguins.csv"`):
# R
(files <- fs::dir_ls(".", glob = "*penguins.csv"))
## Adelie_penguins.csv Chinstrap_penguins.csv Gentoo_penguins.csv
This gives us the 3 files that we had split our penguin data into. Now we can feed that vector of files directly into `vroom` (or `read_csv`):
# R
df <- vroom(files, col_types = cols(), id = "name")
df %>%
  gt()
| name | island | body_mass_g | sex | year |
|---|---|---|---|---|
| Adelie_penguins.csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins.csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins.csv | Biscoe | 4500 | female | 2007 |
As you can see, the `id` parameter allows us to add the nifty `name` column, which holds the file name. Since we had named our files after the penguin species, this allows us to get that species information back into our table. Other applications would be file names that carry time stamps or other information about your upstream pipeline. If you have control over the naming schema of that pipeline output, or can bribe someone who has, then you can probably make your life quite a bit easier by including useful information in those file names; a hypothetical sketch below.
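For instance, suppose an upstream job drops daily exports named like sales_2021-03-01.csv (the file names and columns here are invented for this sketch); the `id` column then makes the date trivially recoverable:

# R
# hypothetical daily exports: sales_2021-03-01.csv, sales_2021-03-02.csv, ...
daily_files <- fs::dir_ls(".", glob = "sales_*.csv")
daily <- vroom(daily_files, col_types = cols(), id = "name") %>%
  mutate(date = as.Date(str_extract(name, "\\d{4}-\\d{2}-\\d{2}")))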
In Python, we use `glob` to grab the file names:
# Python
files_py = glob.glob("*penguins.csv")
files_py
## ['Adelie_penguins.csv', 'Chinstrap_penguins.csv', 'Gentoo_penguins.csv']
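(As an aside, the standard library's pathlib module offers an object-oriented alternative to `glob` - a sketch, not used further below:)

# Python
from pathlib import Path
# Path.glob yields Path objects lazily; convert and sort for a stable order
files_alt = sorted(str(p) for p in Path('.').glob('*penguins.csv'))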
And then we read like this. What `pd.concat` does is bind together a list of data frames into a single data frame. That list comes from one of Python's favourite constructs: the list comprehension. There's much to say about the flexibility and elegance of list comprehensions, but when it comes down to it they're basically a for loop in a single line that outputs a list.
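(A minimal illustration, with toy values invented for the purpose:)

# Python
# a list comprehension is a one-line for loop that builds a list
squares = [x ** 2 for x in range(4)]   # [0, 1, 4, 9]
# the equivalent long form:
squares_loop = []
for x in range(4):
    squares_loop.append(x ** 2)
assert squares == squares_loop

So here we loop through the 3 file names and read them into a collection of data frames to feed to `pd.concat` (strictly speaking via a generator expression, the list comprehension's lazy sibling):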
# Python
df_py = pd.concat((pd.read_csv(f) for f in files_py))
df_py
## island body_mass_g sex year
## 0 Torgersen 3750 male 2007
## 0 Dream 3500 female 2007
## 0 Biscoe 4500 female 2007
But that is still missing the file name. (It also keeps each file's own 0-based index; `pd.concat(..., ignore_index=True)` would renumber the rows if you prefer.) Luckily, we can get that information through the nifty `assign` function:
# Python
df_py = pd.concat((pd.read_csv(f).assign(name = f) for f in files_py))
df_py
## island body_mass_g sex year name
## 0 Torgersen 3750 male 2007 Adelie_penguins.csv
## 0 Dream 3500 female 2007 Chinstrap_penguins.csv
## 0 Biscoe 4500 female 2007 Gentoo_penguins.csv
Great! Now we have the information that was contained in the file name conveniently in our data frame.

Like I wrote earlier, in the wild it can often happen that file names contain vital information - sometimes so much so that we want to parse this feature out further.
Separating 1 column into multiple
In R, for turning one column into multiple we have the `separate` function from the `tidyr` package. You give it the names that you want to assign to the new columns, along with the separator. (Note that we need to escape the dot here, since the separator is interpreted as a regular expression and an unescaped `.` would match any character.)
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  gt()
| name | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|
| Adelie_penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins | csv | Biscoe | 4500 | female | 2007 |
And then of course we can take the `name` column further apart in the same way:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  gt()
| species | animal | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|---|
| Adelie | penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap | penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo | penguins | csv | Biscoe | 4500 | female | 2007 |
In Python, pandas has no dedicated operation for splitting columns. Instead, this can be accomplished with string functions, of which Python has a set comparable to R's. For pandas data frames, we first need to add the `.str` accessor to indicate that a string operation follows, which is then vectorised over the entire column. Here, we use two `split` calls on the same delimiters as above. The `expand` parameter takes care of the separation into multiple columns (otherwise the result would be a single column holding lists). Afterwards, we need to `drop` the `name` column to get the same result as for the R operations.
# Python
df_py = pd.concat((pd.read_csv(f).assign(name = f) for f in files_py))
df_py[['name', 'filetype']] = df_py['name'].str.split('.', expand=True)
df_py[['species', 'animal']] = df_py['name'].str.split('_', expand=True)
df_py = df_py.drop('name', axis = 'columns')
df_py
## island body_mass_g sex year filetype species animal
## 0 Torgersen 3750 male 2007 csv Adelie penguins
## 0 Dream 3500 female 2007 csv Chinstrap penguins
## 0 Biscoe 4500 female 2007 csv Gentoo penguins
Note that, in contrast to R's `separate`, this process requires intermediate steps and cannot easily be chained; a possible workaround follows.
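(If you insist on a single chain, one workaround - a sketch, and arguably less readable - is `assign` with lambdas, at the cost of repeating the split logic:)

# Python
# a possible chained equivalent, repeating the splits inside assign()
df_chained = (
    pd.concat(pd.read_csv(f).assign(name = f) for f in files_py)
      .assign(species = lambda d: d['name'].str.split('_').str[0],
              animal = lambda d: d['name'].str.split('_').str[1].str.split('.').str[0],
              filetype = lambda d: d['name'].str.split('.').str[1])
      .drop('name', axis = 'columns')
)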
Uniting multiple columns into a single one
As you can guess from my leading headers, I prefer the way the tidyverse handles these operations. In order to join back together what had been put asunder, we use the `tidyr` function `unite`. We start from the separated version:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  gt()
| species | animal | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|---|
| Adelie | penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap | penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo | penguins | csv | Biscoe | 4500 | female | 2007 |
And then we go full circle and put it back together in the same breath. The two separating steps are followed by two uniting steps that reverse them. For `unite`, you specify the columns that you want to paste together, then the name of the new column, and the delimiter character(s) to use:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  unite(species, animal, col = "name", sep = "_") %>%
  unite(name, filetype, col = "name", sep = ".") %>%
  gt()
| name | island | body_mass_g | sex | year |
|---|---|---|---|---|
| Adelie_penguins.csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins.csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins.csv | Biscoe | 4500 | female | 2007 |
Here, the extra columns (i.e. `species`, `animal`, `filetype`) are automatically dropped during the uniting, but you can change that behaviour by specifying `remove = FALSE`. The same applies to `separate`; a quick sketch below.
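(For instance - a minimal sketch, with the column names `species` and `rest` chosen purely for illustration:)

# R
# remove = FALSE keeps the original 'name' column alongside the new ones
df %>%
  separate(name, into = c("species", "rest"), sep = "_", remove = FALSE)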
The pandas table is already separated, which was the hard part. Combining the columns is much more intuitive, making use of Python's knack for simple operations on complex objects: strings can be concatenated with the `+` operator, and constants are automatically broadcast to the length of the column. Thus you can treat those columns the same way you would treat scalar variables.

Then we remove the extra columns again, and voilà: we're back to where we started. Which, in this case, is a good thing:
# Python
df_py['name'] = df_py['species'] + "_" + df_py['animal'] + "." + df_py['filetype']
df_py = df_py.drop(['species', 'animal', 'filetype'], axis = 'columns')
df_py
## island body_mass_g sex year name
## 0 Torgersen 3750 male 2007 Adelie_penguins.csv
## 0 Dream 3500 female 2007 Chinstrap_penguins.csv
## 0 Biscoe 4500 female 2007 Gentoo_penguins.csv
And this is it for today. To recap: we had a look at how to load a single csv file, then multiple csv files, with both R and Python. Then we used those data frames to practice separating and uniting columns.

We thus added some rather important bread-and-butter tools of data analysis to our bilingual repertoire. (See what I did there?) More Rosetta Stone content to come soon.