This post continues my series on translations between R and Python code for common data science and machine learning tasks: a documented cheat sheet for those who, like myself, frequently switch between the two languages.
I wrote the first post in this series back in late 2020, full of optimism that it would kick off a whole series of frequent posts on the topic. Alas, other projects took precedence, as they often do. Now I'm back with renewed élan to do more blogging this year, and to write more posts on R & Python.
Let's get the setup out of the way: all code is reproducible in Rmarkdown via the Rstudio IDE (version 1.2 onwards). Python is integrated through reticulate; check out my intro post on reticulate to get started with that. These are our R and Python libraries:
# R
libs <- c('dplyr', 'stringr',       # wrangling
          'palmerpenguins', 'gt',   # data, table styling
          'vroom', 'readr',         # read & write data
          'tidyr', 'purrr',         # wrangle & iterate
          'fs',                     # file system
          'reticulate')             # python support
invisible(lapply(libs, library, character.only = TRUE))
use_python("/usr/bin/python3")

df_orig <- penguins %>%
  mutate_if(is.integer, as.double) %>%
  select(-contains("length"), -contains("depth"))
# Python
import pandas as pd
import glob
pd.set_option('display.max_columns', None)
Once again we will use the Palmer Penguins dataset from the eponymous R package. I will use a smaller subset of columns for reasons of readability.
This dataset contains measurements and characteristics for 3 species of penguin, namely:
df_orig %>%
  count(species) %>%
  gt()
| species | n |
|---|---|
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
In this exercise we will take the data frame apart into different files by species, and then put it back together again.
Briefly: Writing multiple csv files
The breaking-apart bit is not the focus of this post, though. I’m borrowing it from this stackoverflow post. To make the rest of this exercise more compact, we will only write out 1 row per species to a csv file:
# R
df_orig %>%
  group_by(species) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  nest(data = c(-species)) %>%
  pwalk(function(species, data) write_csv(data, file.path(".", str_c(species, "_penguins.csv"))))
For reasons of brevity this code chunk will remain somewhat of a black box; otherwise this post would balloon into something much longer. See it as a sneak peek of future content; and also something you can just use as is or explore for yourself.
Some brief notes only: `slice_head` is a recent addition to the `dplyr` package - an evolution of the earlier `slice` - and it does pretty much what you would expect: it slices off the first n entries (here by group). In `nest` we have a `tidyr` tool that bundles the selected columns into a list column (here everything except `species`). Finally, the `purrr` function `pwalk` iterates over a list, here writing out the different parts of the dataframe with custom names; a minimal toy example below.
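To demystify `pwalk` a little, here is a minimal sketch on an invented tibble (the columns `file` and `rows` exist purely for illustration): `pwalk` calls the supplied function once per row, matching column names to argument names.

# R
# pwalk() iterates row-wise, matching column names to function arguments
toy <- tibble(file = c("a.csv", "b.csv"), rows = c(10, 20))
pwalk(toy, function(file, rows) cat(file, "has", rows, "rows\n"))
## a.csv has 10 rows
## b.csv has 20 rows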
As it stands, `purrr` remains one of my weak points in the tidyverse and I plan to revisit it in a future post. For now, onward with the meat of this episode.
Reading a single csv file
In R, the standard tool for reading csv files is `readr::read_csv`. However, I prefer the `vroom` function from the vroom package. As its onomatopoeic name suggests, this package is fast - faster than `data.table::fread`, which I had preferred until `vroom` came around.

In the Python / pandas world, options are more limited and we use `pd.read_csv`:
# R
single_file <- vroom("Adelie_penguins.csv", col_types = cols())
single_file %>%
  gt()
| island | body_mass_g | sex | year |
|---|---|---|---|
| Torgersen | 3750 | male | 2007 |
# Python
single_file = pd.read_csv("Adelie_penguins.csv")
single_file
## island body_mass_g sex year
## 0 Torgersen 3750 male 2007
Other than pandas adding its characteristic index, we get the same data frame as the result. So far, so straightforward.
Reading multiple csv files
But what I really want to write about here is reading multiple csv files. These approaches are very convenient when you have a directory that contains many csv files with the same schema (perhaps from an automated daily pipeline).
Before we can read those files, we first have to find them. In R, we use the `dir_ls` tool from the `fs` package to search the current directory (`.`) for a glob pattern (`"*penguins.csv"`):
# R
(files <- fs::dir_ls(".", glob = "*penguins.csv"))
## Adelie_penguins.csv Chinstrap_penguins.csv Gentoo_penguins.csv
This gives us the 3 files that we had split our penguin data into. Now we can feed that vector of files directly into `vroom` (or `read_csv`):
# R
df <- vroom(files, col_types = cols(), id = "name")
df %>%
  gt()
| name | island | body_mass_g | sex | year |
|---|---|---|---|---|
| Adelie_penguins.csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins.csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins.csv | Biscoe | 4500 | female | 2007 |
As you can see, the `id` parameter allows us to add the nifty `name` column, which holds the file name. Since we had named our files after the penguin species, this allows us to get that species information back into our table. Other applications would be file names that carry time stamps or other information about your upstream pipeline. If you have control over the naming schema of that pipeline output, or can bribe someone who has, then you can probably make your life quite a bit easier by including useful information in those file names; a hypothetical sketch below.
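For instance, suppose an upstream job drops daily exports named like sales_2021-03-01.csv (the file names and columns here are invented for this sketch); the `id` column then makes the date trivially recoverable:

# R
# hypothetical daily exports: sales_2021-03-01.csv, sales_2021-03-02.csv, ...
daily_files <- fs::dir_ls(".", glob = "sales_*.csv")
daily <- vroom(daily_files, col_types = cols(), id = "name") %>%
  mutate(date = as.Date(str_extract(name, "\\d{4}-\\d{2}-\\d{2}")))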
In Python, we use `glob` to grab the file names:
# Python
files_py = glob.glob("*penguins.csv")
files_py
## ['Adelie_penguins.csv', 'Chinstrap_penguins.csv', 'Gentoo_penguins.csv']
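(As an aside, the standard library's pathlib module offers an object-oriented alternative to `glob` - a sketch, not used further below:)

# Python
from pathlib import Path
# Path.glob yields Path objects lazily; convert and sort for a stable order
files_alt = sorted(str(p) for p in Path('.').glob('*penguins.csv'))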
And then we read like this. What `pd.concat` does is bind together a list of data frames into a single data frame. That list comes from one of Python's favourite constructs: the list comprehension. There's much to say about the flexibility and elegance of list comprehensions, but when it comes down to it they're basically a for loop in a single line that outputs a list.
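(A minimal illustration, with toy values invented for the purpose:)

# Python
# a list comprehension is a one-line for loop that builds a list
squares = [x ** 2 for x in range(4)]   # [0, 1, 4, 9]
# the equivalent long form:
squares_loop = []
for x in range(4):
    squares_loop.append(x ** 2)
assert squares == squares_loop

So here we loop through the 3 file names and read them into a collection of data frames to feed to `pd.concat` (strictly speaking via a generator expression, the list comprehension's lazy sibling):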
# Python
df_py = pd.concat((pd.read_csv(f) for f in files_py))
df_py
## island body_mass_g sex year
## 0 Torgersen 3750 male 2007
## 0 Dream 3500 female 2007
## 0 Biscoe 4500 female 2007
But that is still missing the file name. (It also keeps each file's own 0-based index; `pd.concat(..., ignore_index=True)` would renumber the rows if you prefer.) Luckily, we can get that information through the nifty `assign` function:
# Python
df_py = pd.concat((pd.read_csv(f).assign(name = f) for f in files_py))
df_py
## island body_mass_g sex year name
## 0 Torgersen 3750 male 2007 Adelie_penguins.csv
## 0 Dream 3500 female 2007 Chinstrap_penguins.csv
## 0 Biscoe 4500 female 2007 Gentoo_penguins.csv
Great! Now we have the information that was contained in the file name conveniently in our data frame.

Like I wrote earlier, in the wild it can often happen that file names contain vital information - sometimes so much so that we want to parse this feature out further.
Separating 1 column into multiple
In R, for turning one column into multiple we have the `separate` function from the `tidyr` package. You give it the names that you want to assign to the new columns, along with the separator. (Note that we need to escape the dot here, since the separator is interpreted as a regular expression and an unescaped `.` would match any character.)
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  gt()
| name | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|
| Adelie_penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins | csv | Biscoe | 4500 | female | 2007 |
And then of course we can take the `name` column further apart in the same way:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  gt()
| species | animal | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|---|
| Adelie | penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap | penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo | penguins | csv | Biscoe | 4500 | female | 2007 |
In Python, pandas has no dedicated operation for splitting columns. Instead, this can be accomplished with string functions, of which Python has a set comparable to R's. For pandas data frames, we first need to add the `.str` accessor to indicate that a string operation follows, which is then vectorised over the entire column. Here, we use two `split` calls on the same delimiters as above. The `expand` parameter takes care of the separation into multiple columns (otherwise the result would be a single column holding lists). Afterwards, we need to `drop` the `name` column to get the same result as for the R operations.
# Python
df_py = pd.concat((pd.read_csv(f).assign(name = f) for f in files_py))
df_py[['name', 'filetype']] = df_py['name'].str.split('.', expand=True)
df_py[['species', 'animal']] = df_py['name'].str.split('_', expand=True)
df_py = df_py.drop('name', axis = 'columns')
df_py
## island body_mass_g sex year filetype species animal
## 0 Torgersen 3750 male 2007 csv Adelie penguins
## 0 Dream 3500 female 2007 csv Chinstrap penguins
## 0 Biscoe 4500 female 2007 csv Gentoo penguins
Note that, in contrast to R's `separate`, this process requires intermediate steps and cannot easily be chained; a possible workaround follows.
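(If you insist on a single chain, one workaround - a sketch, and arguably less readable - is `assign` with lambdas, at the cost of repeating the split logic:)

# Python
# a possible chained equivalent, repeating the splits inside assign()
df_chained = (
    pd.concat(pd.read_csv(f).assign(name = f) for f in files_py)
      .assign(species = lambda d: d['name'].str.split('_').str[0],
              animal = lambda d: d['name'].str.split('_').str[1].str.split('.').str[0],
              filetype = lambda d: d['name'].str.split('.').str[1])
      .drop('name', axis = 'columns')
)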
Uniting multiple columns into a single one
As you can guess from my leading headers, I prefer the way the tidyverse handles these operations. In order to join back together what had been put asunder, we use the `tidyr` function `unite`. We start from the separated version:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  gt()
| species | animal | filetype | island | body_mass_g | sex | year |
|---|---|---|---|---|---|---|
| Adelie | penguins | csv | Torgersen | 3750 | male | 2007 |
| Chinstrap | penguins | csv | Dream | 3500 | female | 2007 |
| Gentoo | penguins | csv | Biscoe | 4500 | female | 2007 |
And then we go full circle and put it back together in the same breath. The two separating steps are followed by two uniting steps that reverse them. For `unite`, you specify the columns that you want to paste together, then the name of the new column, and the delimiter character(s) to use:
# R
df %>%
  separate(name, into = c("name", "filetype"), sep = "\\.") %>%
  separate(name, into = c("species", "animal"), sep = "_") %>%
  unite(species, animal, col = "name", sep = "_") %>%
  unite(name, filetype, col = "name", sep = ".") %>%
  gt()
| name | island | body_mass_g | sex | year |
|---|---|---|---|---|
| Adelie_penguins.csv | Torgersen | 3750 | male | 2007 |
| Chinstrap_penguins.csv | Dream | 3500 | female | 2007 |
| Gentoo_penguins.csv | Biscoe | 4500 | female | 2007 |
Here, the extra columns (i.e. `species`, `animal`, `filetype`) are automatically dropped during the uniting, but you can change that behaviour by specifying `remove = FALSE`. The same applies to `separate`; a quick sketch below.
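(For instance - a minimal sketch, with the column names `species` and `rest` chosen purely for illustration:)

# R
# remove = FALSE keeps the original 'name' column alongside the new ones
df %>%
  separate(name, into = c("species", "rest"), sep = "_", remove = FALSE)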
The pandas table is already separated, which was the hard part. Combining the columns is much more intuitive, making use of Python's knack for simple operations on complex objects: strings can be concatenated with the `+` operator, and constants are automatically broadcast to the length of the column. Thus you can treat those columns the same way you would treat scalar variables.

Then we remove the extra columns again, and voilà: we're back to where we started. Which, in this case, is a good thing:
# Python
df_py['name'] = df_py['species'] + "_" + df_py['animal'] + "." + df_py['filetype']
df_py = df_py.drop(['species', 'animal', 'filetype'], axis = 'columns')
df_py
## island body_mass_g sex year name
## 0 Torgersen 3750 male 2007 Adelie_penguins.csv
## 0 Dream 3500 female 2007 Chinstrap_penguins.csv
## 0 Biscoe 4500 female 2007 Gentoo_penguins.csv
And this is it for today. To recap: we had a look at how to load a single csv file, then multiple csv files, with both R and Python. Then we used those data frames to practice separating and uniting columns.

We thus added some rather important bread-and-butter tools of data analysis to our bilingual repertoire. (See what I did there?) More Rosetta Stone content to come soon.