An unmet need for data science training

This post was motivated partly by Maëlle Salmon’s post about blogging and partly by a discussion with Laura Graham about my long-term mission: to convince the University of Southampton (here in the UK) of the need to provide some minimum data science training and support.

The aim is to define the problem(s) a bit better, and it is also a bit of a cry for help. I appreciate that none of this may be novel, but I needed to get it written down and out of my head.

Some preliminaries

Firstly, a few vital statistics about me at the time of writing: 23 years since I started work in the TV industry, 10 years since finishing my Engineering degree and starting in immunology, 5 years since I finished my PhD, 3 years since I started working in proteomics, and 3 years since I started learning R through the Johns Hopkins courses. (I still haven’t completed the capstone.) And finally, 1 year since I did Data Carpentry instructor training.

Learning R, and learning about the related data science issues through the community, has completely changed the way I work.

Secondly, I’ve currently got four related soapbox issues:

  • establishing if there is an unmet need for data science training
  • establishing best practices in delivering data science training
  • making the case for the data science officer
  • biological data science

The first and second issues are the subject of this post as they are so closely related.

The third issue follows from the first two, but is perhaps even trickier. My proposal is that, just as we should have experimental officers for training and skills development in the laboratory, who are permanent members of staff rather than on fixed contracts and so retain knowledge within the institution, the equivalent roles should exist for data skills. It’s a subject worthy of a separate post. And to be honest, even if I could gather the evidence to convince myself, I’m not sure that in the current political and economic climate of metric chasing I’ll be able to convince those in charge of its merits.

The fourth issue also needs a separate post, but contains issues such as how to communicate between experimenters and informaticians, and how to deal with the fact that (in my experience) experiments evolve with each replicate whereas analysts would like to have all data sets generated in a standard way.

So that’s where I’m coming from. Returning to the matter at hand, I’ve identified a number of issues which I’d love to investigate and get input on, and I expect there are others I’ve not thought of. Ultimately, I want to approach this as a scientist. My working assumption is that there is an unmet need for more data science training and support, but I always consider it possible that I am mistaken, and I’d like to take an evidence-based approach.

Issues

Identification

Pretty much everyone has data and has to analyse data, whether they are an archaeologist, biologist, social scientist, art historian or an engineer. I personally consider writing a formula in an Excel spreadsheet to be coding. Therefore I assume almost everyone in academia codes to one degree or another. But when I turn to my neighbours (biologists) and ask them if they code, they say no. They don’t identify as coding.

Furthermore, when I attend events such as those for the research software community organised by my excellent colleagues in the Software Sustainability Institute, it appears to be predominantly engineers, physicists and computer scientists in attendance. Which also means it is predominantly men. The third thing these events highlight is the difference between those who predominantly code to analyse data, and those who are more concerned with other aspects of coding such as machine control or code efficiency.

Coding for me has been a necessary side effect of becoming a better scientist, not an end in itself, and I suspect, but don’t know, that this would be true of lots of other people, regardless of seniority.

Which leads me to two questions:

  1. How do we reach those who work with data, but who don’t identify as coders?
  2. And how do we establish if there is an unmet need amongst these individuals?

To this end I created a draft Data Analysis Survey Google Form. The aim was to create a survey that could be completed in a couple of minutes and ideally carried out face-to-face. Emails just get deleted, and asking face-to-face may yield a more representative sample of the University population.

Even if this is a good idea, the challenge I currently face is actually carrying out the survey. I really need a group of like-minded individuals to get it done. I’d be interested to know if anyone has done this before and how they did it. Or if there is a better way forward.

Inclusivity

Inclusivity is bound up with identification, and although I’ve not addressed it in my draft survey, I wouldn’t be surprised if it is part of why some people don’t identify as coding. They don’t feel welcome. Following examples such as the Carpentries Code of Conduct, this would need to be part of any data science training set-up.

Evidence base

If we can identify those who would benefit from training/support, the next question is how do we establish what works best?

For example:

  • Do biologists need something different from social scientists?
  • Do post-graduates need something different from post-docs?
  • Can we demonstrate to principal investigators that learning data science skills is not a waste of time?
  • Perhaps most importantly of all, can we demonstrate that it brings value to the University? (They have the money.)

I don’t have any answers at the moment, but as far as I’m aware no-one is gathering information about what works and what doesn’t work here in Southampton. If anyone has ideas about how to assess the efficacy of data science training or knows of studies, please let me know.

Data science skills as academic infrastructure

In the last 10 years I’ve sat in numerous meetings where statistical/informatics training and support have been proposed across biology and medicine, and yet we still don’t have a bioinformatics core or any standardised University-wide training. I don’t know why, but my working assumption is that it is in part because these proposals are ad hoc and based upon anecdote.

In comparison, in Southampton we have a large supercomputer that is key to many researchers. My understanding is that it only exists because of two years’ work to establish a business case throughout the University for such an infrastructure project.

To me, this suggests that we should consider data science skills as part of our infrastructure and approach them as an issue worthy of sufficient resources to investigate systematically. And if the case is made, they should be invested in to a similar degree.

Timing, delivery and cognitive load

Another anecdotal observation, frequently made by postgraduate students, is that when they start they get hours of inductions and workshops before they’ve generated any data or even understand their project. This makes little sense, either in terms of the cognitive load of such a data dump or in terms of telling someone something weeks, or possibly months, in advance of its utility.

As with other issues, I don’t have a clear idea of how to address this, but we need to consider this problem when thinking about training. I subscribe to DataCamp as it provides the flexibility for me to learn when it’s convenient or when a specific problem arises. What it lacks, however, is the invaluable face-to-face discussion of a bespoke situation that occurs in workshops or with a colleague.

My feeling is that there needs to be a range of resources available: online materials, user groups and peer-led workshops, for example. And my favourite idea: the Data Science Officer. It’s another reason why we need an evidence base to work from.

How to get started?

That’s my non-exhaustive list of challenges to getting data science training off the ground here in Southampton. And whilst doing one thing doesn’t exclude doing another, I’m not sure whether we should be starting by trying to get a useR group going to gauge the demand, or focusing on the survey, or doing something else. Time is always limited, and I don’t know what is most effective. One of my other concerns is that doing things for free always lets the University off the hook when they should be investing in their people.

As Maëlle said in her blog, one reason to write is voice amplification, and I’d consider this post a good start if I get the attention of some helpful people, either inside the University or in the community at large.
