August Newsletter

Hi everyone-

Let’s face it, July was a bit lacking in sunshine but at least ended with a couple of scorchers. Possibly not the best conditions to be pulling together a newsletter, but we’d always take the sun over rain!

Following is the August edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity while figuring out whether or not you want to get on a flight, and whether or not you’ll be able to…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science August 2020 Newsletter

RSS Data Science Section

Covid Corner

Just when you thought it was safe to go out, local lockdowns and spikes in positive case numbers around the world remind us that our battle with COVID-19 is, sadly, far from over. As always, numbers, statistics and models are front and centre in all sorts of ways.

  • Although everyone is clear that in an ideal world all schools would re-open at the end of the summer, it is far from easy to assess the risks involved with this course of action, and what steps could be taken to mitigate those risks. DELVE (Data Evaluation and Learning for Viral Epidemics), a multi-disciplinary group convened by the Royal Society, has recently released a very well researched paper on this topic which provides an excellent assessment. Although the evidence is far from conclusive, they reason that the benefits do outweigh the risks, but make some very clear recommendations about how to re-open in as safe a way as possible.
  • The DELVE report calls out the importance of ventilation given the increasing evidence of airborne transmission of the virus. This extensive article in the Atlantic digs into the topic further attempting to uncover why we still have such limited understanding of exactly how the virus spreads.
  • Testing is still absolutely critical, and this post from the Google AI Blog talks through an elegant way of using Bayesian group testing to enable faster screening. Given the amount of testing going on in the world, the improved efficiency from adopting this approach is significant (a simple illustration of why pooling helps follows after this list).
  • Finally, R, the effective reproduction rate of the virus in a given area, seems to be less in the news these days but it is clearly still an important metric. Rt.live does an elegant job of bringing changes and relative differences in R to life.
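To see why pooling samples can help at all, here is a minimal sketch of classic Dorfman pooled testing, the simplest ancestor of the Bayesian scheme described in the post. The prevalence figure and pool sizes are illustrative assumptions, not numbers from Google's work:

```python
# Dorfman pooling: test k samples as a single pool; only if the pool
# is positive are its members retested individually.
def expected_tests_per_person(prevalence: float, k: int) -> float:
    p_pool_positive = 1 - (1 - prevalence) ** k
    return 1 / k + p_pool_positive  # pooled test share + expected retests

prevalence = 0.01  # assume 1% of samples are positive
print("individual testing: 1.000 tests/person")
for k in [2, 5, 10, 20]:
    print(f"pool size {k:2d}: {expected_tests_per_person(prevalence, k):.3f} tests/person")
```

At 1% prevalence, pools of around ten need roughly 0.2 tests per person, a five-fold saving; the Bayesian approach in the post goes further by choosing pools adaptively based on prior infection probabilities.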

Committee Activities

It is still relatively quiet for committee member activities, although we are still playing an active role in joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation and attempting to help define and drive the RSS’s overall Data Science strategy.

Janet Bastiman is running a series of lectures for Barclays on Deepfakes – the second (ethics of deepfakes) and third (how to make one) are on the 5th and 12th August respectively and are very relevant as this technique becomes increasingly accessible and prevalent.

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup – and has been active in lockdown with virtual events. The next event is on August 6th at 6.30pm, “Learning to Continually Learn”, when Nick Cheney, Assistant Professor at the University of Vermont, will talk through one of the best papers from ICLR. All the talks (past and future) are available here – a very useful resource.

Elsewhere in Data Science

Lots of non-Covid data science going on, as always!

What’s this Deep Learning stuff all about?
Deep Learning is an increasingly prevalent machine learning technique, particularly in audio, video and image related fields, but it is easy to forget how quickly it has come to prominence and the steps involved along the way. A couple of useful articles talk through the key historical innovations and also function as useful primers for those new to the technique:

Quick follow up on bias and diversity from last time
As a quick follow-on from our discussion of algorithmic bias and the importance of diversity, we wanted to recommend a great curated Twitter account – ‘Women in Statistics and Data Science’ – definitely worth following.

The continuing GPT-3 saga
We’ve talked about OpenAI’s announcement of the 175 billion parameter GPT-3 model for a couple of issues now but it is still generating news, especially as more and more people are able to use it via the API.

The Verge does a nice job of assessing potential use-cases for the new model, as well as giving a high level view of how it works.

Kevin Lacker decided to give GPT-3 the Turing Test, with some really interesting findings. It can appear incredibly “human-like” and in many cases produces remarkable and accurate answers, but perhaps doesn’t quite know when to say no:

(KL)   Q: How do you sporgle a morgle?
(GPT3) A: You sporgle a morgle by using a sporgle.

All of these Deep Learning based NLP models have to convert text into numeric vector form in some way or another. This post digs into the intriguing way GPT-3 encodes numbers… not obvious at all!
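If you want to poke at this yourself, the sketch below uses the Hugging Face transformers library (our own illustration, not code from the post). GPT-3 uses a close variant of GPT-2’s byte-pair encoding, which chops digit strings into uneven multi-digit chunks rather than into individual digits:

```python
from transformers import GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")

# Similar-looking numbers can be split into very different token chunks,
# part of why number handling in these models is so idiosyncratic.
for s in ["17", "170", "1700", "17000", "170000"]:
    print(f"{s:>6} -> {tok.tokenize(s)}")
```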

This entertaining post highlights more coherent examples:

Omfg, ok so I fed GPT3 the first half of my "How to run an Effective Board Meeting" (first screenshot)

AND IT FUCKIN WROTE UP A 3-STEP PROCESS ON HOW TO RECRUIT BOARD MEMBERS THAT I SHOULD HONESTLY NOW PUT INTO MY DAMN ESSAY

This article steps back a little and tries to assess how much of a breakthrough GPT-3 really is, while this piece in the Guardian also tries to put the new capabilities in perspective.

Finally, Azeem Azhar does a good job of putting the model and approach in historical context, discussing the different approaches (symbolic vs statistical) to solving the NLP problem.

What to work on and how to get it live
Data Science teams are often inundated with requests, and may well have a number of separate research streams they would like to progress. Deciding what to work on to make the most of limited time and resources is always difficult – this post provides a framework for thinking through this tricky topic.

We talked about MLOps last time, and how to optimise the process of getting ML models live. There are various frameworks and solutions that can potentially be very useful for this, if used in the right context (and at the right price…). This article gives one of the more comprehensive summaries of the options available, including those created at the big tech companies (Michelangelo at Uber, Bighead at Airbnb, Metaflow at Netflix, and Flyte at Lyft), the relevant cloud offerings from GCP (in particular TFX), AWS and Azure, as well as some interesting honourable mentions (H2O, MLflow). The distinction between this end-to-end model management and AutoML (building an individual model in the fastest way) is an interesting one, and important to understand when considering options.

Finally, this post is well worth a read. Stitch Fix have historically been transparent and informative about their A/B testing methods, and this post evolves their approach in an interesting way, focusing on finding “winning interventions as quickly as possible in terms of samples used” – a bandit-style goal sketched below.
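For a concrete feel for that goal, here is a minimal Thompson sampling sketch (a generic bandit-style illustration with made-up conversion rates, not Stitch Fix’s actual method). Traffic shifts towards the likely winner as evidence accumulates, so the winner is identified with fewer samples than a fixed 50/50 split would need:

```python
import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.10, 0.12]              # unknown conversion rates for A and B
alpha, beta = np.ones(2), np.ones(2)   # Beta(1, 1) prior on each rate

for _ in range(10_000):
    arm = int(np.argmax(rng.beta(alpha, beta)))  # sample each rate, play the best draw
    reward = rng.random() < true_rates[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("visits per arm: ", alpha + beta - 2)
print("posterior means:", alpha / (alpha + beta))
```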

Practical Projects
As always, here are a few potential practical projects to while away the lockdown hours:

Updates from Members and Contributors

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

– Piers

July Newsletter

Hi everyone-

Well, we have made it to summer (at least the British variety) – and it’s time for the July edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity while enjoying a socially distanced day on the beach with everyone else…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science July 2020 Newsletter

RSS Data Science Section

Covid Corner

COVID-19 is not going anywhere soon, and while talk turns to the relaxing of regulations and local lockdown measures, there are some interesting discussions and evaluations of the initial modelling approaches.

Committee Activities

It continues to be relatively quiet for committee members, although we are still playing an active role in joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation and attempting to help define and drive the RSS’s overall Data Science strategy.

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and has been active in lockdown with virtual events. Last week there was a great discussion of “Cross-Lingual Transfer Learning” by Sebastian Ruder (DeepMind), and coming up next on July 9th is the catchily titled “Make VAEs Great Again” from Max Welling. All the talks (past and future) are available here – a very useful resource.

Elsewhere in Data Science

Lots of non-Covid data science going on, as always!

Algorithmic Bias
As AI and Machine Learning approaches gain traction in more and more facets of society, algorithmic decisions are more directly affecting people’s lives and livelihoods. This makes algorithmic bias an increasingly important topic, with far-reaching implications for individuals and society.

Facial recognition is a case in point. Increasing research has shown that the accuracy of facial recognition systems is far from uniform across different racial and gender groups. This, combined with its increasing use and misuse, has led to recent announcements from leading providers. IBM announced they will no longer offer, develop or research facial recognition technology. Similarly, Amazon has put curbs on who is allowed to use its Rekognition service.

Much of this recent activity has been driven by the publication of objective analysis of the relative performance of the different systems, which can be viewed at gendershades.org. It has sparked ongoing discussion and controversy amongst key luminaries of the data science community regarding the underlying source of the bias, with some interesting coverage in VentureBeat and The Verge.

As far back as 2018 (a long time in “AI” years), the Council of Europe was highlighting potential areas at risk of discrimination from biases in algorithmic decision making. Specific at-risk decision-making systems called out included predictive policing, benefits eligibility, job applicant screening, and loan and credit eligibility. It is well documented how discriminatory biases exist in many walks of life (Invisible Women, by Caroline Criado Perez, is an excellent example), and if models are blindly trained on existing historical data then we should not be surprised if this discrimination perpetuates.

Academia is increasingly addressing the issue, with initiatives at many well known research institutes including Cornell and the Turing Institute. There is also interesting discussion around what responsibilities lie with individual data scientists as they develop these systems. Wired magazine recently advocated a ‘council of citizens’ approach.

In many ways, although some of these machine learning techniques have been around for some time, the packaging of them into automated services is still relatively new. Putting together the right approaches and frameworks for deciding what is acceptable, and for assessing bias and impact, is crucial, and something the Data Science Section is passionate about.

(Many thanks to my colleague Weiting Xu for pulling together this selection of posts on bias.)

GPT-3
We mentioned the OpenAI announcement of the 175 billion parameter GPT-3 model last time. Now the model is available via an API, and there are various commentary pieces exploring its performance. Chuan Li discusses where the model excels, but highlights how the training of these types of models is increasingly out of reach of all but the largest organisations (“Training GPT-3 would cost over $4.6M”).

Intriguingly, OpenAI have found they can use the same model design to create coherent image completions.

Getting stuff into production
We all know that there is a significant difference between getting encouraging results in a prototype, and encapsulating the model in a live production environment. This is a great example talking through some of the trials and tribulations involved.

MLOps is a term that has sprung up to describe these challenges, and the leading cloud providers are attempting to help with automated services that remove some of the painful steps. There are all sorts of tools and frameworks now available attempting to help – this post looked at over 200 tools. GitHub have recently shown how their GitHub Actions can be used in a similar way, which could be an option for engineering teams already leveraging the GitHub ecosystem.

Practical Projects
As always, here are a few potential practical projects to while away the lockdown hours:

Updates from Members and Contributors

  • Kevin O’Brien would like to make readers aware of JuliaCon 2020:
    “JuliaCon 2020 will be held as a virtual conference in late July. The full schedule has been announced and can be found here.
    We are particularly keen to get new users to participate, and are running a series of workshops each day from Friday 24th to Tuesday 28th July. The talks will take place from Wednesday 29th to Friday 31st July.
    The event is 100% free, but you must register in advance with Eventbrite.”
  • Glen Wright Colopy would like to announce:
    “The American Statistical Association (along with KISS, IISA, IET, SSA, and ICSA) is sponsoring a fall video series on “The Philosophy of Data Science”. The series is aimed at incoming statistics and data science students (but will be of significant interest to the general statistics / data science community). The topics will focus on how scientific reasoning is essential to the practice of data science.
    We’ve already confirmed a lineup of top speakers for the first three sessions, and will be adding more shortly.
    The series website is here and the mailing list is here.”

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

– Piers

June Newsletter

Hi everyone-

Another month flies by- somehow lockdown days seem to go slowly but weeks disappear – and it’s time for the June edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science June 2020 Newsletter

RSS Data Science Section

Covid Corner

We can’t not talk about COVID-19, and as always there are plenty of data science related themes to wade through.

Committee Activities

It has been a quieter time for committee members this month, although we are playing an active role in joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation.

  • There is still time to submit to NeurIPS, the conference on Neural Information Processing Systems, which Danielle Belgrave is organising.
  • Magda Woods is writing a paper with her ex-BBC colleagues, trying to understand what is helping some companies thrive during the crisis and would love feedback from readers.

Elsewhere in Data Science

Lots of non-Covid data science going on, as always!

With a little more time at home on our hands (at least for some) we’ve come across some useful primers on relevant data topics:

If you prefer your “brain-food” in audible form, Lex Fridman has had some fantastic conversations recently – they are long, but well worth the time.

  • His conversation with Stephen Wolfram was an epic. Wolfram is the founder and CEO of Wolfram Research, which produces Mathematica, Wolfram Alpha and Wolfram Language amongst other things. His background is in physics, although his work on cellular automata and computation brought him more public recognition.
    • An interesting component of the discussion focused on general intelligence and the work Wolfram has accomplished in pulling together and codifying the underlying semantic knowledge base that drives Wolfram Alpha (which apparently powers Siri and Alexa). Wolfram Language takes a high-level, abstracted approach, but is certainly thought-provoking and worth exploring.
  • His conversation with Ilya Sutskever was very insightful. Sutskever is one of the founders of OpenAI and a co-author on the original AlexNet paper with Hinton, so ‘influential’ in Deep Learning to say the least!
    • Some great topics covered including a definition of Deep Learning as “the geometric mean of physics and biology”
    • A discussion on the “Double Descent” phenomenon in Deep Learning, where model performance on a given data set first increases with model size (number of parameters), then decreases (as over-fitting kicks in), but then increases again! This is one of the drivers of the recently released GPT-3 NLP model, with 175 billion parameters… I definitely need to dig into this more as it’s never happened for me! (A toy illustration follows below.)
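For the curious, double descent can be reproduced at toy scale. The sketch below is our own illustration under simple assumptions (minimum-norm least squares on random ReLU features), not code from the talk; test error typically spikes near the interpolation threshold, where the number of features roughly equals the number of training points, and then falls again as the model keeps growing:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 2000, 20

w_true = rng.normal(size=d)
def make_data(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_true + 0.5 * rng.normal(size=n)

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

def test_mse(n_features):
    # Random ReLU features; pinv gives the minimum-norm least-squares fit,
    # which interpolates the training data once n_features >= n_train.
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    F_train, F_test = np.maximum(X_train @ W, 0), np.maximum(X_test @ W, 0)
    coef = np.linalg.pinv(F_train) @ y_train
    return np.mean((F_test @ coef - y_test) ** 2)

for p in [10, 50, 90, 100, 110, 200, 1000, 5000]:
    print(f"{p:5d} features: test MSE {test_mse(p):10.2f}")
# Test error typically peaks around p ≈ n_train, then descends again.
```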

Is machine learning living up to the hype? There has been some recent commentary that progress in machine learning research, and the commercial application of machine learning, has not been delivering the purported benefits.

A few more practical tips:

For those wanting a bit more of a hands-on project…

  • This (OpenTPOD) must be the simplest way of creating your own deep-learning based object detection system from scratch!
  • Similarly on object detection, if you want to get a little bit more “under the hood”, Facebook have open-sourced another interesting PyTorch application, DETR. This makes use of Transformers, which feel increasingly like the go-to building block for Deep Learning architectures.
  • How about bringing your cartoon characters to life with pose animation from TensorFlow?

Updates from Members and Contributors

  • Kevin O’Brien highlights the great work the R Forwards taskforce is doing in promoting diversity and inclusion in the data science community:
  • Ole Schulz-Trieglaff announces that PyData Cambridge is now running online meetups every Wednesday – more info here.
  • Finally, Glen Wright Colopy asked to include the following:
    • “In June, the American Statistical Association is sponsoring a set of weekly podcasts celebrating precision medicine research at the Statistical and Applied Mathematical Sciences Institute (SAMSI).
      Highlights include (i) machine learning and mathematical modelling of wound healing, (ii) big data squared – combining brain imaging and genomics for Alzheimer’s studies, and (iii) innovative trial design and master trials. You can hear about these episodes as they come out by joining the mailing list (https://www.podofasclepius.com/mail-list) or subscribing to the YouTube channel (https://www.youtube.com/channel/UCkEz2tDR5K6AjlKw-JrV57w)”

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

And this feels like an appropriate way to conclude…
https://xkcd.com/2311/

– Piers

Take care with graphs…

We have all been bombarded with statistics and graphs recently. Both social media and traditional news outlets are showing graphs of infections, changes in death rates, and impacts on the economy and environment. We are being overwhelmed with data, and it’s important that professionals presenting data do so clearly to a wide audience.


May Newsletter

Hi everyone-

Well, April is now behind us, so it is high time to pull together the May edition of our Royal Statistical Society Data Science Section newsletter. Rather than dwell solely on COVID-related data science topics (of which we know there are plenty), we thought we’d try and bring a more balanced assortment of updates and reading materials this month. We hope these prove entertaining, or at least help pass the lockdown hours away.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science May 2020 Newsletter

RSS Data Science Section

Committee Activities

As we mentioned last time, we have unfortunately had to postpone our meetup/discussion meeting plans for the time being, and so have been focused on ‘spreading the word’ of industrial data science in other ways.

Corona Corner

We can’t not talk about COVID-19 and there is certainly plenty to discuss from a data science perspective.

It’s great to see some of the principles Martin highlighted in his post on communicating model uncertainty gaining more widespread airtime:

Elsewhere in Data Science

For those looking for a break from all the COVID talk, there is certainly plenty to keep you occupied:

For those wanting a bit more of a hands-on project…

Updates from Members and Contributors

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

– Piers

Data Science Needs Technical Managers

We believe a lack of data scientists in leadership roles is a major reason why organisations don’t get value from data science. Our recent survey has shown that data scientists with a manager from a data science background are more likely to feel they’re delivering value for an organisation.

Last year we conducted a survey of 150 practising data scientists, to understand the issues they face and to find out how the RSS can help. One of the most interesting results was that half of the respondents didn’t feel their work brought significant value to their employer. We were interested to understand why this seemed to be happening fairly universally across a range of industries and organisations.

In general, these data scientists seemed less worried about the quality of their data, budget, skills or the technology available to them. Instead, the biggest challenges they face are social: their organisation doesn’t understand them, or doesn’t support them, or both, when it comes to delivering value from data science projects. When we asked them what their main obstacle was, their top two answers were “lack of strategy” and “limited support from senior managers”.

[Figure: the main obstacles reported by survey respondents]

In the survey we asked a set of freeform questions to help us get some deeper insight. When respondents talked about “lack of strategy”, one root cause was a lack of common language between data scientists and senior management. For example, respondents said that they were facing a “lack of … appreciation for what data science actually is” and “projects … that have little grounding in technical reality”. It seems clear that there has been a failure to find an overlap between business goals and the things the data science team can practically deliver.

We also found that for data scientists with a non-technical manager, 45% felt their work delivered significant value. This rose to 66% when a data scientist had a manager from a data science or academic research background (see figure below). This difference was statistically significant (p < 0.05) – a rough back-of-envelope check is sketched below the figure. We believe that to improve the value you get from your data science team, you need to ensure it is backed up by a senior manager who understands their work.

[Figure: proportion who felt their work delivered significant value, by manager background]
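As a rough plausibility check on that p-value, here is a standard two-proportion z-test. The write-up above gives the percentages but not the subgroup sizes, so the even 75/75 split below is purely an assumption for illustration:

```python
import numpy as np
from scipy.stats import norm

n1, rate1 = 75, 0.45   # assumed subgroup: non-technical manager
n2, rate2 = 75, 0.66   # assumed subgroup: data science / research manager

x1, x2 = round(n1 * rate1), round(n2 * rate2)
p_pool = (x1 + x2) / (n1 + n2)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (x2 / n2 - x1 / n1) / se
p_value = 2 * (1 - norm.cdf(abs(z)))
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")  # ≈ 0.009 with these sizes
```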

It’s worth noting that our respondents are primarily based in the UK and affiliated with the RSS so may not be totally representative of all industries. But we suspect this is an opportunity for many real organisations today. And we believe this backs up the RSS Data Science Section approach of trying to promote the career development of data scientists as a way to improve the value organisations get from data science.

One of our goals at the Data Science Section of the RSS is to help advance the careers of data scientists. One way to address this “communication gap” in organisations and get more value from data science is to bring more people with data science experience into the high-level conversations about business strategy. The field is relatively young, but we believe that as more people with data science experience progress into senior management roles this will go some way to resolve the challenges data scientists are facing today.

Thank you to everyone who contributed to the survey, the results have helped direct our efforts and we’ll be publishing more articles from the results over the coming months. Please join our mailing list to keep up to date with the work of the RSS Data Science Section.

The effectiveness of cloth masks has been misrepresented by #Masks4All

I recently advised caution about COVID-19 research performed by people without a background in infectious diseases. Some people hated that advice. I’m going to show an example of why it matters.

In recent weeks, entrepreneur Jeremy Howard has led the #Masks4All campaign to make it mandatory to wear cotton face masks in public. Howard claims to have led “the world’s first cross-disciplinary international review of the evidence” for the effectiveness of masks, but he has no formal scientific training. In spite of that, he’s gained coverage from organisations like The Washington Post, The Atlantic, the BBC and the Guardian.

His key claim is that “cotton masks reduce virus emitted during coughing by 96%”, citing a recent South Korean study. He also quotes people like Prof David Heymann of the WHO (for example in his Guardian article):

[Image: a quote attributed to Prof David Heymann, as reproduced in Howard’s article]

Sounds compelling, right?

But the South Korean study did not reference a 96% reduction anywhere. In fact, the paper’s conclusion is ‘Neither surgical nor cotton masks effectively filtered SARS–CoV-2 during coughs by infected patients.’

How does Howard’s “review of the evidence” report this negative finding?

Another relevant (but under-powered, with n=4) study (31) found that a cotton mask blocked 96% (reported as 1.5 log units or about a 36-fold decrease) of viral load on average…

Hold on a second. It simply isn’t true that the Korean group ‘found that a cotton mask blocked 96%’ of viral load. Deliberately misrepresenting the results of a peer-reviewed publication would be academic misconduct. Assuming an honest mistake, I pointed this out by email. Howard issued a justification in a series of tweets:

Whether the Korean team made a mistake or not – and I don’t believe it did – for a literature review to silently ‘correct’ the scientific record is a breach of ethics. To make matters worse, Howard’s ‘correction’ is itself wrong, and distorts the experimental findings. (If you’re interested in the technical details, please head to the appendix below.)

Even more seriously, the quote from Prof Heymann is not accurate. David Heymann has never said those words and his office has asked Howard to stop misquoting him (but not before Howard published the misquote in both the Washington Post and the Guardian).

But that’s not all. The #Masks4All review omitted the central finding of one of its key references, which was that cotton masks filtered out only 3% of particles during testing. Go back and read that again. The researchers found that 97% of particles penetrated through cotton masks. Why would a ‘review of the evidence’ neglect this key finding?

The evidence for mask wearing by the general public is weak, but I’m not claiming that people shouldn’t wear masks: more research may yet emerge. At a time when many are suggesting to cancel lockdown in favour of mandatory mask-wearing, we need to keep a clear view of the scientific evidence. The claims of the #Masks4All campaign should be treated with caution.

Martin Goodson (Chair of the RSS Data Science Section)

Appendix

The #Masks4All review states that the South Korean study ‘found that a cotton mask blocked 96% (reported as 1.5 log units or about a 36-fold decrease) of viral load on average, at eight inches away from a cough from a patient infected with COVID-19.’

The original study reports:

The median viral loads after coughs without a mask, with a surgical mask, and with a cotton mask were 2.56 log copies/mL, 2.42 log copies/mL, and 1.85 log copies/mL, respectively.

The difference between the median viral loads after coughs without a mask and with a cotton mask is 0.71 (or a ratio of about 5, after converting from log units). So about 20% of the virus particles got through the masks. This is a bit rough and ready, because the Korean scientists excluded some of the data points, those marked as ‘ND’ for not detected.
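For anyone wanting to check the arithmetic in that paragraph, the log-unit conversions work out as follows:

```python
no_mask, cotton = 2.56, 1.85   # median log10(copies/mL) from the study

fold_reduction = 10 ** (no_mask - cotton)    # 10^0.71 ≈ 5.1-fold
fraction_through = 10 ** (cotton - no_mask)  # ≈ 0.20, i.e. ~20% got through
print(fold_reduction, fraction_through)

# By contrast, treating the reduction as 1.5 log units (Howard's figure):
print(1 - 10 ** -1.5)   # ≈ 0.968, the claimed "96% blocked"
```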

This is where Howard’s ‘correction’ comes in. His method was to replace all of the results marked as ‘ND’ with zero. This is never done by experimental scientists, because it’s very unlikely that the true values are zero. If nothing is detected, the experimenter must record an ‘ND’ and replace this figure with a standard value when analysing the data.

Every laboratory test has a limit under which detection is unreliable, the limit of detection (LOD). All we know with certainty is that the undetected values must lie somewhere below the LOD.

Here is one authority on this topic (emphasis mine):

…account must be taken of the data below the limit of detection; they cannot be excluded from the calculation nor assumed to be zero without introducing unnecessary and sometimes large errors…. The simplest method is to set all values below the LOD to a half the LOD.

So, what the experimenter must not do is analyse the result as zero. It’s like if you measure your finger for a wedding ring and your finger is smaller than the smallest hole in the ring measurement tool. You know your finger is smaller than the smallest hole but you definitely haven’t measured its size as 0cm.
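A tiny synthetic example (made-up values, not the Korean data) shows why the choice matters: setting non-detects to zero drags the average down much further than the recommended LOD/2 substitution does.

```python
import numpy as np

lod = 1.0                              # limit of detection, log10 copies/mL
detected = np.array([2.4, 1.9, 1.6])   # hypothetical detected values
n_nd = 2                               # samples recorded as 'ND'

mean_with_zeros = np.mean(np.append(detected, np.zeros(n_nd)))
mean_with_half_lod = np.mean(np.append(detected, np.full(n_nd, lod / 2)))
print(mean_with_zeros, mean_with_half_lod)   # 1.18 vs 1.38 log units
# (On a log scale, 0 doesn't even mean "no virus": it means 1 copy/mL.)
```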

I have reanalysed the Korean data using the suggested replacement value of half the LOD and the results don’t change very much, suggesting a reduction of 70% of virus particles when using cloth masks. There might be a million virus particles in a single cough. This is a tiny study with only four participants—one of whom could not produce detectable virus particles even without a mask. The authors were correct to draw their weak conclusion that cotton masks do not effectively filter the COVID-19 virus.

Thanks to Piers Stobbs, who edited an earlier draft of this post.

April Newsletter

Hi everyone-

What a month… the end of February seems like an age ago, and life for everyone has changed beyond comprehension since then.

The dramatic rise of the COVID-19 pandemic has highlighted the crucial underlying importance of rigorous analytical methods, both to understand what is going on and to inform decisions around the best course of action.

Given this, we thought we would dedicate this, the April edition of our Royal Statistical Society Data Science Section newsletter, to highlighting features and articles on the data science of COVID-19.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science April 2020 Newsletter

RSS Data Science Section

Data Science and COVID-19

The data…

One thing that became apparent pretty quickly as COVID-19 started spreading was that all sorts of data were readily available on the extent of the pandemic, and that these figures were being reported in all sorts of ways…

Identifying trusted resources that allow you to cut through the click-enticing headlines and put the figures in context has been crucial. Some of the sites we have found useful follow below:

  • Johns Hopkins University has been at the forefront, and their tracker has become widely used for understanding how COVID-19 has spread globally.
  • Although the Financial Times have been just as guilty as others in their reporting of scientific research, their visualisation of deaths over time by country is a good example of putting the figures in context, allowing for quick comparison of the efficacy of the different actions taken around the world.
  • Another useful resource is Our World in Data, with detailed descriptions of metrics and data sources and clean visualisations of the different exponential growth rates in different countries.

Forecasting and predictions- the story so far…

Scroll back a couple of weeks, and the UK Government’s initial response was focused on containment and ‘herd immunity’. Although this was heavily influenced by scientific experts including (amongst others) an experienced team from Imperial College London, it was at odds with much of the rest of the world. This generated consternation from a wide variety of commentators, with a widely read article (40m views and counting) from Tomás Pueyo perhaps summarising the concerns best. Other trusted sources on the pandemic who are easily followed on Twitter include: @CT_Bergstrom, @mlipsitch, @maiamajumder and @MackayIM.

These concerns were not unchallenged (an interesting counter-point to the Pueyo post is here from Thomas House), but became less relevant on Monday 16th March, when the Imperial College COVID-19 Response Team issued a new paper, apparently based on updated data from Italy, depicting a very different future and urging stronger action. Almost immediately the UK Government began the process of moving the country into the current state of lockdown to attempt to stem the spread of the virus.

‘Model Addiction’ and Best Practices

The UK Government has come in for a good deal of criticism for the decisions made, and the apparent clouding of responsibility behind the banner of ‘science’. Nassim Taleb (of Black Swan and Fooled by Randomness fame) wrote an opinion piece in the Guardian taking the government to task for their over-reliance on forecasting models without thoroughly understanding the underlying assumptions. Coronadaily makes a similar point in a thoughtful post about Model Addiction. (For anyone interested in the basics of how the underlying model works, try out this on YouTube.)

There are other aspects of the models informing policy which do not seem to adhere to best practices from a data science perspective. Code transparency and reproducibility are core components of good data science, and although Neil Ferguson and his team at Imperial are attempting to provide more details, it was disconcerting to hear that the approach was based on “thousands of lines of undocumented C”. A well formulated approach to reproducible research, such as that advocated by Kirstie Whitaker at the Turing Institute would go a long way to help.

Although the models used in the Imperial paper have had success historically (particularly in the developing world, with outbreaks such as Ebola), the area of infectious diseases has, unfortunately, been extremely underfunded. Thus the people working on these models, who are best placed to advise policy, are in a poorly resourced area of academia.

Regardless of the accuracy of a given predictive model, there will always be assumptions and alternatives, and another area in which the combined government/research group have foundered is in communicating this uncertainty. This is certainly far from straightforward, but one entity we could all learn from is the IPCC and the way they assimilate different approaches to modelling climate change impact, producing a number of well articulated alternative scenarios with clearly documented assumptions.

Martin Goodson, the RSS Data Science Section chair, wrote a provocative post bringing together all these threads, advocating six rules for policy-makers, journalists and scientists.

Calls to action and collaboration

The increased attention on, and importance of, ‘experts’ and mathematical modelling in general has driven numerous ways for the community to participate. There are many calls to action and ways to get involved, including:

In addition, a number of Data Science and AI related tools are being made available to the community for free:

Other Posts we Like

It’s sometimes hard to remember, but of course there are other things going on in the world – here are a few posts on data science we enjoyed this month.

Upcoming Events and Section and Member Activities

Sadly, but not surprisingly, we have had to put on hold a number of our upcoming events. However, we are still keen to continue in an adapted way, and are looking to re-work our program in an online format- more details to follow. Many of the Data Science and AI meetups are doing the same so keep checking back to meetup.com for details.

Finally, it was great to see RSS Data Science Committee members Richard Pugh and Jim Weatherall make the DataIQ 100 list.

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

– Piers

All models are wrong, but some are completely wrong


At this critical time in the modern history of the human race, mathematical models have been pushed into the foreground. Epidemic forecasting informs policy and even individual decision-making. Sadly, scientists and journalists have completely failed to communicate how these models work.

Last week the Financial Times published the headline ‘Coronavirus may have infected half of UK population’, reporting on a new mathematical model of COVID-19 epidemic progression. The model produced radically different results when the researchers changed the value of a parameter named ρ – the rate of severe disease amongst the infected. The FT chose to run with an inflammatory headline, assuming an extreme value of ρ that most researchers consider highly implausible.

Since its publication, hundreds of scientists have attacked the work, forcing the original authors to state publicly that they were not trying to make a forecast at all. But the damage had already been done: many other media organisations, such as the BBC, had already broadcast the headline [1].

Epidemiologists are making the same mistakes that the climate science community made a decade ago. A series of crises forced climatologists to learn painful lessons on how (not) to communicate with policy-makers and the public.

In 2010 the 4th IPCC report was attacked for containing a single error – a claim that the Himalayan glaciers would likely have completely melted by 2035 (‘Glacier Gate’). Climate denialists and recalcitrant nations such as Russia and Saudi Arabia seized on this error as a way to discredit the entire 3000 page report, which was otherwise irreproachable.

When the emails of the Climatic Research Unit (CRU) of University of East Anglia were hacked in 2009, doubt arose over the trustworthiness of the entire climate science community. Trust was diminished because the head of the CRU refused to openly share computer code and data. The crisis was to cast a pall over the climate science community for many years.

By the time of the 5th IPCC report, mechanisms had been developed to enforce clear communication about the uncertainty surrounding predictive models, and transparency about models and data. The infectious disease community needs to learn these lessons. And learn them quickly.

Over the last few days, several infectious disease non-experts have gained media coverage for various ‘too good to be true’ (and plain wrong) coronavirus forecasts. Ideologically-driven commentators have used these results to justify the easing of social distancing rules, with potentially devastating consequences.

Scientists and journalists have a moral responsibility to convey the uncertainty inherent in modelling work. There is much at stake. Here we recommend a handful of rules for policy-makers, journalists and scientists.

 

Rule 1. Scientists and journalists should express the level of uncertainty associated with a forecast

All mathematical models contain uncertainty. This should be explicit – researchers should communicate their own certainty that a result is true. A range of plausible results should be provided, not just one extreme result.

Rule 2. Journalists must get quotes from other experts before publishing

The worst cases of poor COVID-19 journalism have broken this simple rule. Other scientists have weighed in after publication. But by then a misleading article has reached an audience of millions and taken hold in the public consciousness.

Rule 3. Scientists should clearly describe the critical inputs and assumptions of their models 

How sensitive is the model to the input parameters? How sure are you of those parameters? Do other researchers disagree?

Rule 4. Be as transparent as possible

Release data and code so that scientific scrutiny can take place. Consider open peer-review so that other experts can quickly give their opinion on a piece of work.

Rule 5. Policy-makers should use multiple models to inform policy

The Imperial College model created by Neil Ferguson has been reported on almost exclusively as the modelling input to UK pandemic policy. Have other models from other groups been considered? What is the degree of agreement between the models?

Rule 6. Indicate when a model was produced by somebody without a background in infectious diseases 

Would we encourage an epidemiologist to apply ‘fresh thinking’ to the design of an electrical substation? Perhaps we should treat with caution the predictions of electrical engineers about pandemic disease outbreaks.

Martin Goodson (Chair of the RSS Data Science Section)

 

Notes

[1] Post-publication, the FT have modified the report text but have left the headline unchanged.

Thanks to Danielle Belgrave, Piers Stobbs, Lucy Hayes and Adam Davison for helpful comments

March Newsletter

Hi everyone-

Time flies- even with the extra day, February felt pretty short…

Anyway, here’s round 2 of the Royal Statistical Society Data Science Section monthly newsletter- any and all feedback most welcome!

If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

Industrial Strength Data Science March 2020 Newsletter

RSS Data Science Section

Section and Member Activities

Jim Weatherall is hosting our next RSS DSS event, which is in Manchester on the 18th March. It will be an expert panel discussion focused on skills and ethics for modern data science – sign up for free tickets here.

Danielle Belgrave has a busy few weeks coming up!
She is co-organising the Advances in Data Science event – more info here – in Manchester (June 22-23) where Anjali Mazumder is a keynote speaker.
In addition she is tutorial chair for NeurIPS – any tutorial proposals from the community would be very welcome.
Finally, she is giving an upcoming talk (March 12th) at an Imperial College diversity event with other women in AI, including two other panellists and speakers from DeepMind (Marta Garnelo and Laura Weidinger). More info here.

Anjali Mazumder is also organising a workshop in Washington DC this week on AI for combating modern slavery, as part of her work with Code 8.7, of which Florian Ostmann is also a committee member.

Janet Bastiman is speaking at an upcoming Women in AI event on the 16th March (tickets here) and also at AI and Big Data World on 12th March. More Women in Data events are highlighted at the end.

Finally, Charles Radclyffe published an article in the MIT Technology Review summarising the findings of his whitepaper on Digital Ethics and the ‘techlash’.

Posts We Like

As we collectively plough on with leaving the EU, it was interesting to see the EU’s take on AI: “Prepare for socio-economic changes brought about by AI”…

On the practical applications of machine learning front, there were a couple of compelling results in the health/pharma area.

In addition, Amazon released some useful insight into how they use Markov Chains (in particular absorbing ones) to help Alexa learn from “her” own mistakes.
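For anyone who hasn’t met absorbing Markov chains, the standard machinery is only a few lines of linear algebra. The sketch below is a generic illustration with made-up numbers, not Amazon’s actual model; think of the transient states as intermediate dialogue states and the absorbing states as ‘success’ and ‘failure’ outcomes:

```python
import numpy as np

# Transition structure: states 0-1 are transient, states 2-3 absorbing.
Q = np.array([[0.2, 0.5],      # transient -> transient
              [0.1, 0.3]])
R = np.array([[0.2, 0.1],      # transient -> absorbing (success, failure)
              [0.4, 0.2]])

N = np.linalg.inv(np.eye(2) - Q)   # fundamental matrix: expected visits
B = N @ R                          # absorption probabilities per start state
print(B)               # row i: P(success), P(failure) starting from state i
print(N.sum(axis=1))   # expected number of steps before absorption
```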

From a tools perspective, some useful recent releases from some of the leading data science companies.

For those into causality (and everything involved…) this was a good read – “In this post we explain a Bayesian approach to inferring the impact of interventions or actions” – although you may need a quiet spot and a bit of time!

Finally, great to see unintended use cases… how about building a chess-playing program using GPT2, one of the best NLP models around!

… and Probabilistic Inference in Bayesian Networks has finally entered the mainstream

Upcoming Events

We’ve already highlighted a number of events our committee members are involved with above.

In addition, there are lots of things going on around International Women’s Day, such as Women in AI on March 9th and Women in Fintech on 31st March.

Finally, a couple of upcoming meetups that look interesting: Data Science London on subgraph matching on March 31st, and PyTorch on March 10th.

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here

– Piers

Our First Newsletter (of many…)

We thought it could be useful (and fun) to pick the collective brains of our Data Science Section committee members (as well as those of our impressive array of subscribers and followers) and put together a monthly newsletter. This will undoubtedly be biased, but will hopefully surface materials that we collectively feel are interesting and relevant to the data science community at large.

So, without further ado, here goes our first attempt, creatively titled…


Industrial Strength Data Science Feb 2020 Newsletter

RSS Data Science Section

To give this some vague attempt at structure, we thought we would roughly break the newsletter down into three sections: Section and Member Activities; Posts We Like; Upcoming Events

Section and Member Activities

Our very own Section Chair, Martin Goodson, has been at his thought-provoking best, wading into the deep-learning vs semantic/symbolic learning debate and taking on the illustrious Gary Marcus. Either way, GPT2 is still pretty impressive!

Jim Weatherall digs into Data Science and AI in biopharma and gives a realistic assessment of where we currently stand.

On a similar theme, Richard Pugh presented on the impact of data science in the pharmaceutical industry.

And Magda Piatkowska, active as always in the community, is helping drive the incredibly important Women in Data agenda with “A Tale of a Girl and her High Tech Genies”.

Posts We Like

It is easy to assume there is always a right way and a wrong way to do data science, and certainly in many instances some approaches are objectively better than others. However, we all know that often it is far more nuanced than non-practitioners might assume – here’s an opinionated guide to Machine Learning we found interesting.

There has been some amazing progress in NLP over the last few years, with the previously mentioned GPT2 from OpenAI bringing an impressively powerful model to anyone’s hands. This is an entertaining read giving some practical tips on utilising GPT2 in Python.
Google, of course, are ever-present in this space and recently made a big announcement of their own.

We may be a little late to the party, but we have recently been binging on Lex Fridman (our Louis Theroux of ML). His podcasts are always provocative and thought-provoking.

Many of the technical skills you learn in academia are useful in the ‘real world’, but others don’t translate very well. Some useful pointers from David Dale on transitioning from academia to industry and business.

Regardless of your views on Facebook as a product, they employ some pretty impressive data scientists and produce some pretty impressive work (e.g. Prophet is great if you’ve not come across it). Reproducibility in machine learning is an increasingly important topic, and is surprisingly (or not so to those who do it…) difficult. While it is key in academia in order to build on the foundations of others, it is also crucial in an industrial setting to make sure you have complete audit trails and can reproduce decisions made in the past. This piece from the Facebook AI group provided some interesting commentary.

Finally, understanding why a machine learning model produces a given output is an increasingly hot topic. Even though the multi-dimensional nature of the underlying models makes them fundamentally complex and hard to “boil down” to a simple explanation, the field of ‘model explainability’ is looking to do just that, and we found this a useful primer on the topic.
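As a small taster (our own minimal example, not taken from the primer itself), permutation importance is one of the simplest model-agnostic explainability tools: shuffle one feature at a time and measure how much the model’s held-out score degrades.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# How much does held-out accuracy drop when each feature is shuffled?
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]:25s} {result.importances_mean[i]:.3f}")
```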

Upcoming Events

This meetup on Feb 12th on detecting violent propaganda could be interesting.
And this looks very useful on Feb 28th – London AI and Deep Learning on Operational AI and best coding practices.

The open source data collection event next week (“Into the Light”, Feb 5th) hosted by The Economist looks like it could be interesting.

In the rest of his spare time, Martin also runs the excellent London Machine Learning Meetup. The Jan 22nd event with David Silver was fantastic.

Other meetups we are enthusiasts of include:
London AI and Deep Learning
Data Science London
Data Kind
Impactful AI


That’s it for now – tell us what you think! We will aim to get a new one out every month and would love to include commentary from followers and subscribers.

If you liked this, do please send on to your friends – we are looking to build a strong community of data science practitioners – and sign up for future updates here

The do’s and don’ts of starting a data science team

Last week, the Prime Minister’s chief strategic adviser – Dominic Cummings – wrote a blog post which attracted a huge amount of media attention. He called for a radical new approach to civil service recruitment – suggesting that data scientists (among others) should play increasingly important roles.

But while data scientists were top of Cummings’ list, it was his call, later on, for more ‘weirdos’ in Whitehall which really caught the media’s imagination. Here, we outline some do’s and don’ts when building a data science team.


For anyone kicking off the year with a new data science initiative, we applaud you! Embedding data and technology into decision making processes can be a wonderful thing. To help you along your way, here are a few do’s and don’ts that have been borne out of experience.

Don’t… Assume R&D is easy
Do… Appoint a technical leader
If you’ve been tasked with managing this initiative, but you’re not an experienced data scientist, then you need someone who is. You need a team leader who lives and breathes selection bias, measurement bias, and knows when a result is meaningless. Without this experience in your team you will at best waste time and resources, and at worst create dangerously unsound technology.

Don’t… Just hire weirdos and misfits
Do… Carefully craft your team
The notion that data scientists are geniuses who can solve all your problems, armed only with a computer and some data, is flattering – but ridiculous. Data scientists come in many flavours, with different interests and experience, and the problems worth solving require a team effort – with the best ideas coming from diverse teams who can communicate well.

Don’t… Trust textbook knowledge alone
Do… Hire for experience too
There is data science knowledge you can glean from a textbook, and then there is the hard-earned stuff you learn from years of building models and algorithms with real data, implemented in the real world. Nothing makes you understand overfitting and the limits of theoretical models like living through that cycle a few (hundred) times.

Don’t… Ignore ethical issues
Do… Take an ethics-first approach
Get ahead of any ethical and legal issues with your work, or the data you are using. Don’t assume it’s OK to do something just because you heard a Silicon Valley start-up does it like that.

Don’t… Obsess on the latest academic papers
Do… Identify questions
Normal rules of business apply to data science; you want a return for your investment. Start by identifying the intersection of high-value business problems and the information contained in the data. You could ‘dart about’, trying out ideas from cool papers you’ve read, to see if anything useful comes out. But such unstructured work is akin to randomly digging for treasure on a beach. Get yourself a metal detector—identify business problems first.

Don’t… Show off
Do… Keep it simple, stupid
Unless you have been specifically asked to build something superficially clever and incomprehensible (and this is a genuine objective for some), you should use interpretable models first. Often this will be good enough. Only introduce complexity if you need to, and use a simple model as a baseline against which you can measure improvements.

Don’t… Propagate hype
Do… Manage expectations
So, you’ve been thrown some resources to set up a data science team and you’re embedded in an organisation that doesn’t necessarily understand what data science is. With such power comes responsibility! Avoid hype. Manage expectations. Help your peers and leaders understand what you are doing, and make sure they have input to it. This is a joint effort and they bring important domain knowledge. Agree on goals, and be transparent about progress.

Don’t… Command and control
Do… Create a scientific culture
Do your team feel they can challenge the scientific views of the leadership—or are they scared of being ‘binned’ if they step out of line? Your team is on a mission to solve a problem, and it is unlikely the path will be an easy one. Your data scientists will spend most of their time stuck, navigating a sea of unknowns, while in pursuit of answers. Scientists need to be able to talk freely about what they do and don’t know, and to share ideas with each other without any sense of one-upmanship.

“We are not unicorns”

Inaugural Industrial Strength Data Science event report

On Thursday May 16th, The Royal Statistical Society’s Data Science Section hosted our inaugural Industrial Strength Data Science event of the year at the RSS headquarters in central London. The event was titled “We are not unicorns” and consisted of a panel discussion on a range of topics centered around the current state of data science in industry today, and how external expectations are affecting the success or failure of data science projects and teams.


We assembled an experienced panel of data science practitioners:

  • Adam Davison, Head of Insight and Data Science at The Economist (AD)
  • Kate Land, Chief Data Scientist at Havelock London (KL)
  • Simon Raper, Founder at Coppelia Machine Learning and Analytics (SR)
  • Magnus Rattray, Director of the Data Science Institute at the University of Manchester (MR)
  • Piers Stobbs, Chief Data Officer at Moneysupermarket (PS)

And the event was very ably hosted by Magda Piatkowska (Head of Data Science Solutions, BBC) and opened by Martin Goodson (CEO, Evolution AI, and chair of the RSS Data Science Section).

We had a lively debate, together with some excellent audience interaction and participation which continued over drinks later in the evening. Some key takeaways include:

  • Data science hype is driving unrealistic expectations both from data scientists (about what they will be working on), and from businesses (about what they will be able to achieve).
  • To mitigate this, data science leaders need to work closely with business stakeholders and sponsors to clearly define the problems to be addressed and the actions to be taken on delivery of data science projects.
  • In addition, they need to recruit for more general skills, including stats and coding, as well as key attributes such as curiosity and pragmatism, and be clear with candidates on the type and variety of work that will be undertaken on a day-to-day basis.
  • Data science leaders need to drive buy-in for efficient data and analytics platforms, and drive self-sufficiency within data teams by leveraging engineering best practice and serverless, cloud-based services.

Below is a more detailed summary of the key discussion points – the full video of the event can be viewed here and below.

“Effects of the hype”

After introductions and quick biographies, we started with some comments around the evolution of data science as a capability, highlighting the positive benefits of bringing together quantitative practitioners from different functional areas of a business to share experiences and approaches. In academia, MR explained how the techniques now found in data science were historically explored predominantly in maths and computer science departments, but that there has been a move towards where the data is generated, into more physics- and biology-based research. This shift left researchers more isolated, and the rise of the cross-functional data science department has helped to reduce that isolation.

We then moved on to questions around the effect of all the data science hype. Firstly we discussed the effects on practitioners: with all the hyperbole in the press, and the breakthroughs released by Google on a regular basis, it is not surprising that many data science practitioners feel they are “not the authentic data scientist” (KL) unless they are uncovering new deep learning architectures or working on petabyte-scale problems. Of course, one of the key purposes of discussions like these is to demystify what actually goes on and to highlight that data science can drive incredibly positive impact in a business setting without needing to push the boundaries of research or reinvent the wheel. A key component of the recruitment process has to be explaining the type and variety of work to candidates and making sure this is aligned with their expectations.

We moved on to discuss the hype effect on business: CEOs and business leaders are feeling pressured to invest in “AI” without really knowing what it is or how it can help. This can be a “recipe for disaster” (PS), as teams of data scientists are hired without a clear remit and without the right infrastructure in place. “You can’t do AI without machine learning, you can’t do machine learning without analytics, and you can’t do analytics without data infrastructure” (PS quoting Hilary Mason): businesses often jump to the top of the tree without building the foundations (pulling the data together in one place, data engineering). “A lot of companies think they are ready for data science but are probably not” (MR).

Are these hype-driven misunderstandings contributing to a perceived lack of success? Likely so. One key component is having senior business leaders (chief data scientists or chief data officers) who understand more than the hype and can help educate decision makers to direct efforts towards tractable problems. Consider the “signal to noise of the problem” (KL): it should be possible to differentiate between cat and dog images, but the signal needed to predict the direction of a stock’s movement might simply not be in the data.

One final discussion point around hype was the benefit of embracing it. Although there was general consensus that true artificial general intelligence (AGI) is still some way off, there are tangible benefits from a marketing and funding perspective in embracing the term. The Turing Institute successfully headed off other “AI”-focused entities by incorporating the term (MR), and it might well be worth data science teams embracing it despite any misgivings, if only to avoid “AI teams” springing up in the same organisation.

“What does good look like?”

An additional consequence of the hype is a recruiting process focused on buzzwords and methods, because the hiring manager doesn’t know what they need: “we want someone who is an expert on Restricted Boltzmann Machines” (SR). There was general agreement that, from a recruiting perspective, you want people who are more interested in problem solving than in algorithm development, although a solid background in probability with strong quantitative fundamentals is important so you can understand how different techniques work, what assumptions are made and where the gotchas lie.

Another theme was the makeup of a good team, whether specifically in data science or more broadly across data in general. The team needs a variety of skills, ranging from business and process understanding to strong statistical methods to production-standard coding (the classic Venn diagram), but although individuals should be encouraged to gain skills in all areas, it is the team that becomes the unicorn, rather than the individual. The classic “T-shaped” profile works well: general capabilities across a broad range of areas combined with deeper knowledge in one or two.

Another area of discussion was self-sufficiency: data science and data teams need to be self-sufficient, with dependencies on tech resources minimised. It is critical to agree with the technology function who is able to do what, and to instil the requisite skills and processes within the team, so that a model doesn’t need to be re-written to go into production. The increasing prevalence of serverless services in AWS and GCP makes this self-sufficiency much more realistic, and data science teams in general much more productive.
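As a hedged illustration of the kind of serverless setup the panel had in mind, here is a minimal AWS Lambda-style prediction handler; the model artefact and feature names are hypothetical, and the request format assumes an API Gateway proxy integration:

```python
# A minimal sketch of a serverless prediction endpoint (AWS Lambda style).
# The model artefact "model.joblib" and its features are hypothetical.
import json

import joblib

# Loaded once per container, outside the handler, so warm invocations reuse it.
model = joblib.load("model.joblib")

def handler(event, context):
    """Parse a JSON request body, score it, and return the prediction."""
    body = json.loads(event["body"])
    features = [[body["feature_a"], body["feature_b"]]]
    prediction = model.predict(features)[0]
    return {
        "statusCode": 200,
        "body": json.dumps({"prediction": float(prediction)}),
    }
```

The appeal for self-sufficiency is that a data scientist can take a trained model from notebook to a live, auto-scaling endpoint without hand-off to a separate engineering team.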

This led into a lively conversation about how to set up good data science projects. A key theme was to focus on the problem and be crystal clear with stakeholders about what the project would produce and how it would be used, not about what methods would be used: SR characterised it elegantly as “solving problems in business with maths”. “Find something you can practically deliver to someone who can take advantage of it” (AD). Stakeholder management, and delivering early and often with feedback on each iteration, was a recurring theme. The comparison to the software development process was made: business stakeholders are now used to software being delivered in an agile and iterative way, and we will hopefully see this approach becoming more accepted and adopted for data science.

We ended with the provocative question – “should all CEOs become chief data scientists?” – which was met with a resounding “No” from the panel: “I’m not very good at golf” (SR).

Audience Q&A

We concluded with an excellent interactive session with the audience, including many relevant questions:

“To what extent should data science be responsible for production?” – general feeling that data science teams should be able to own and manage productionised processes.

“What about role proliferation: research data scientist, product data scientist, machine learning engineer, etc.?” – general feeling to be wary of overly specialised job titles, although a recognition that some specialisation may emerge between automated decision making and operations research/helping people make better decisions.

“What is the best mix of skills for data science teams; what about management skills?” – general agreement that it depends on the scale of the organisation and the team: larger teams in larger, more bureaucratic organisations could well benefit from data product/programme managers to help manage stakeholders and change. In general, though, you want people who “can write production code, who are driven to build stuff – not coding up algorithms” (MG).

“What about standards: what is a data scientist, and should there be a qualification?” – a tricky one: there are definitely core required skills, but because the field and its roles are still evolving, formal qualification might be premature. However, the RSS DSS is keen to shape the discussion, and our next event in July will be focused on this topic. From an education perspective, “we do need some kind of guidelines over what the masters courses need to deliver” (MR).

“Where should ethics sit: should data scientists own ethics, or should it be a separate role?” – there was consensus that the potential for doing bad things with data is high, and that data scientists should strive to maintain high ethical and moral standards. Depending on the organisation, though, there may be specialist roles in compliance or risk departments that should be leveraged and included in the discussion.

“What should be the interaction between data science and behavioural science?” – agreement on a huge overlap between the two, particularly in finance (KL); bring back research teams (SR)!

So, all in all, it felt like a very successful and enjoyable evening. Do check out the full video below, and do let us know in the comments your thoughts on any of these topics, as well as any questions you would like to see discussed in the future.