Cathie Marsh Institute for Social Research

Computational Science using Big Data in R

Date: 15 February 2018

Time: 10am–4.30pm

Instructor: Peter Smyth

Level: Advanced
Fee: £195 (£140 for those from educational, government and charitable institutions). 

We offer up to five subsidised places at a reduced rate of £60 per course day to research staff and students within Humanities at The University of Manchester. These places are awarded in order of application. In some instances, such as for unfunded PhD students, we may be able to offer free or bursary places.

Please note: This is not guaranteed and is considered on a case-by-case basis. Please contact us for more information.


This course will introduce a workflow for working efficiently with large amounts of data in R, using data from the Human Mortality Database (HMD) and Human Fertility Database (HFD). Using both of these large databases in an extended case study, the course will show how the R packages plyr and purrr can be used to automate and speed up all stages of the quantitative social science workflow, from tidying and loading data from multiple sources, to producing dozens of separate analyses and data visualisations through a single chunk of code.

While working through the extended case study, related packages, processes and patterns for working with large-scale and complex data efficiently will be introduced, including packages like stringr, tidyr and dplyr for data management, and ‘piped coding’ approaches for making R code more ‘literate’: easier to write, understand and reason about.
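
As a rough illustration of the difference between 'standard' and 'piped' code, the sketch below writes the same small summary both ways. It assumes the dplyr package is installed; the `deaths` data frame and its column names are made up for the example and are not taken from the HMD or HFD.

```r
library(dplyr)

# Hypothetical toy data (illustrative only, not drawn from the HMD or HFD)
deaths <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2001, 2000, 2001),
  rate    = c(0.010, 0.009, 0.012, 0.011)
)

# 'Standard' nested form: has to be read from the innermost call outwards
nested <- arrange(
  summarise(
    group_by(filter(deaths, year >= 2000), country),
    mean_rate = mean(rate)
  ),
  country
)

# 'Piped' form: reads top to bottom, one step per line
piped <- deaths %>%
  filter(year >= 2000) %>%
  group_by(country) %>%
  summarise(mean_rate = mean(rate)) %>%
  arrange(country)
```

Both versions compute the mean rate per country; the piped form lists the steps in the order in which they happen.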

If you use the HMD and HFD, the code presented will likely be useful right away for your work. Even if you do not, the general patterns, concepts and methods introduced through the case study will help you think about how to manage large amounts of data and automate your own data workflows.


By the end of the course, you will:


  • Understand the difference between ‘piped’ and ‘standard’ R code, and why ‘piped’ expressions are closer to written and spoken language, and so easier to reason about, develop, and debug.
  • Understand the concept of the ‘data to information’ chain, and why you should think carefully about all stages in the sequence linking the acquisition of raw data to the development of new knowledge.
  • Have been introduced to the ‘tidy data’ paradigm for storing and working with standard, rectangular data.
  • Have reasoned through the challenges of loading data from multiple sources, and of arranging and combining them into a single tidy dataset.
  • Have been introduced to and applied the ‘split-apply-combine’ paradigm from plyr, and the functional programming paradigm from purrr, to achieve process automation in two related ways.
  • Have been introduced to the pattern of solving a programming task first for a specific case, and then generalising the solution into a function that can be applied many times.
  • Understand how to automate the production of multiple figures and other outputs using both plyr and purrr.
  • Be able to use a pre-existing efficient data workflow when working with data from the HFD and HMD, and be ready to produce analogous workflows for other tasks and sources of data.
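
The 'solve one case, then generalise' pattern and the two automation styles named above might look something like the following minimal sketch. It assumes the plyr and purrr packages are installed; the `rates` data frame and the `yearly_trend()` helper are hypothetical and exist only for this example.

```r
# Hypothetical toy data (illustrative only, not drawn from the HMD or HFD)
rates <- data.frame(
  country = rep(c("A", "B"), each = 3),
  year    = rep(2000:2002, times = 2),
  rate    = c(0.010, 0.009, 0.008, 0.012, 0.011, 0.010)
)

# Step 1: solve the task for one specific case (country "A")
a_only  <- rates[rates$country == "A", ]
a_trend <- coef(lm(rate ~ year, data = a_only))[["year"]]

# Step 2: generalise the specific solution into a function (hypothetical helper)
yearly_trend <- function(df) {
  coef(lm(rate ~ year, data = df))[["year"]]
}

# Split-apply-combine with plyr: split by country, apply the function,
# combine the pieces back into one data frame
trend_plyr <- plyr::ddply(rates, "country", function(df) {
  data.frame(trend = yearly_trend(df))
})

# Functional style with purrr: split into a named list, then map the function
trend_purrr <- purrr::map_dbl(split(rates, rates$country), yearly_trend)
```

The functions are called with the `plyr::` and `purrr::` prefixes rather than loading the packages with `library()`, which avoids name clashes if dplyr is also attached in the same session.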


You must already be adept and comfortable using R in quantitative research, as well as willing to explore alternative approaches for working with R. Ideally, you should also be familiar with the RStudio integrated development environment for working with R.

About the instructor

Peter Smyth is a Research Associate at the University of Manchester, based in the Cathie Marsh Institute. He spent 35 years working in IT at various large and small commercial organisations before taking an MSc in Big Data Analytics at Sheffield Hallam University and moving into academia. In his previous roles he used whatever programming environment was to hand to solve problems. Now he teaches a variety of programming languages to help others do the same.

He is a qualified Data and Software Carpentry instructor.