Computational Science using Big Data in R
Time: 9am – 5pm
Instructor: Dr Jonathan Minton
Fee: £195 (£140 for those from educational and charitable institutions).
The Cathie Marsh Institute (CMIST) offers five free places to research staff and students within the Faculty of Humanities at The University of Manchester and the North West Doctoral Training Centre.
Postgraduate students requesting a free place will be required to provide a letter of support from their supervisor.
This course will introduce a workflow for working efficiently with large amounts of data in R, using data from the Human Mortality Database (HMD) and Human Fertility Database (HFD). Using both of these large databases in an extended case study, the course will show how the R packages plyr and purrr can be used to automate and speed up all stages of the quantitative social science workflow, from tidying and loading data from multiple sources, to producing dozens of separate analyses and data visualisations through a single chunk of code.
While working through the extended case study, related packages, processes and patterns for working with large-scale and complex data efficiently will be introduced, including packages like stringr, tidyr and dplyr for data management, and ‘piped coding’ approaches for making R code more ‘literate’: easier to write, understand and reason about.
If you use the HMD and HFD, the code presented will likely be useful right away for your work. Even if you do not, the general patterns, concepts and methods introduced through the case study will help you think about how to manage large amounts of data and automate your own data workflows.
By the end of the course, you will:
- Understand the difference between ‘piped’ and ‘standard’ R code, and why ‘piped’ expressions are closer to written and spoken language, and so easier to reason about, develop, and debug.
- Understand the concept of the ‘data to information’ chain, and why you should think carefully about all stages in the sequence linking the acquisition of raw data to the development of new knowledge.
- Have been introduced to the ‘tidy data’ paradigm for storing and working with standard, rectangular data.
- Have reasoned through the challenges of loading data from multiple sources, and of arranging and combining those data into a single tidy dataset.
- Have been introduced to and applied the ‘split-apply-combine’ paradigm from plyr, and the functional programming paradigm from purrr, to achieve process automation in two related ways.
- Have been introduced to the pattern of solving programming tasks first in specific cases, and then generalising these solutions into functions which can be applied many times.
- Understand how to automate the production of multiple figures and other outputs using both plyr and purrr.
- Be able to use a pre-existing efficient data workflow when working with data from the HFD and HMD, and be ready to produce analogous workflows for other tasks and sources of data.
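To give a flavour of these ideas, here is a minimal sketch rather than actual course material. It assumes the dplyr and purrr packages are installed, and it uses a small invented table in place of real HMD data; the names `mortality`, `fit_trend` and `slopes` are hypothetical and chosen only for illustration.

```r
library(dplyr)  # data management verbs; also makes the %>% pipe available
library(purrr)  # functional programming helpers such as map()

# Toy data invented for this sketch (not real HMD values)
mortality <- tibble(
  country    = c("A", "A", "B", "B"),
  year       = c(2000, 2001, 2000, 2001),
  deaths     = c(100, 110, 200, 190),
  population = c(10000, 10000, 25000, 25000)
)

# 'Standard' nested style: has to be read from the inside out
nested <- summarise(
  group_by(mutate(mortality, rate = deaths / population), country),
  mean_rate = mean(rate)
)

# 'Piped' style: the same steps, read from top to bottom
piped <- mortality %>%
  mutate(rate = deaths / population) %>%
  group_by(country) %>%
  summarise(mean_rate = mean(rate))

# Solve the task for one case, then generalise it into a function
fit_trend <- function(df) lm(deaths ~ year, data = df)

# Split by country, apply the function to each piece, combine the results
slopes <- mortality %>%
  split(.$country) %>%
  map(fit_trend) %>%
  map_dbl(~ coef(.x)[["year"]])
```

The nested and piped versions produce the same summary table; the difference is only in how easily the sequence of steps can be read. The final pipeline shows one way the split-apply-combine pattern can be written with purrr, returning one fitted slope per country.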
You must already be adept and comfortable using R in quantitative research, and willing to explore alternative approaches to working with R. Ideally, you should also be familiar with the RStudio integrated development environment (IDE) for working with R.
- The Tidy Data Paradigm
- The Split-Apply-Combine Strategy
- Piping through the Magrittr package
- The tidyr and dplyr packages
- Functions and functional programming with R (online course, free introduction)
About the instructor
Dr Jon Minton is AQMeN Research Fellow at the University of Glasgow, based in Urban Studies, with a broad interest in ‘social data science’, and a special interest in demographic data visualisation for identifying complex patterns, including age-period-cohort effects, in health and fertility data. He has a PhD in Welfare Reform, worked as a health economist, and has been using and misusing R for around ten years.