Let’s face it, July was a bit lacking in sunshine but at least ended with a couple of scorchers. Possibly not the best conditions to be pulling together a newsletter, but we’d always take the sun over rain!
Following is the August edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity while figuring out whether or not you want to get on a flight, and whether or not you’ll be able to…
As always, any and all feedback is most welcome! If you like these, please do send them on to your friends; we are looking to build a strong community of data science practitioners. And sign up for future updates here:
Industrial Strength Data Science August 2020 Newsletter
RSS Data Science Section
Just when you thought it was safe to go out, local lockdowns and spikes in positive case volumes around the world remind us that, sadly, our battle with COVID-19 is far from over. As always, numbers, statistics and models are front and centre in all sorts of ways.
- Although everyone agrees that, in an ideal world, all schools would re-open at the end of the summer, it is far from easy to assess the risks involved in this course of action, or the steps that could be taken to mitigate those risks. DELVE (Data Evaluation and Learning for Viral Epidemics), a multi-disciplinary group convened by the Royal Society, has recently released a well-researched paper on this topic which provides an excellent assessment. Although the evidence is far from conclusive, they reason that the benefits do outweigh the risks, and they make some very clear recommendations about how to re-open in as safe a way as possible.
- The DELVE report calls out the importance of ventilation, given the increasing evidence of airborne transmission of the virus. This extensive article in the Atlantic digs into the topic further, attempting to uncover why we still have such a limited understanding of exactly how the virus spreads.
- Testing is still absolutely critical, and this post from the Google AI Blog talks through an elegant way of using Bayesian group testing to generate faster screening. Given the amount of testing going on in the world, the improved efficiency from adopting this approach is significant.
- Finally, R, the effective reproduction rate of the virus in a given area, seems to be less in the news these days but it is clearly still an important metric. Rt.live does an elegant job of bringing changes and relative differences in R to life.
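For intuition on why pooling samples helps, here is a minimal sketch of the classic two-stage (Dorfman) pooling scheme. The Google post goes well beyond this, using Bayesian, adaptive pool design, so treat this purely as a toy illustration with made-up prevalence numbers:

```python
import random

def dorfman_tests(samples, pool_size):
    """Count tests used by two-stage (Dorfman) pooling: test each pool
    once, then retest members of positive pools individually."""
    tests = 0
    for i in range(0, len(samples), pool_size):
        pool = samples[i:i + pool_size]
        tests += 1                # one test for the whole pool
        if any(pool):             # pool positive -> retest each member
            tests += len(pool)
    return tests

random.seed(0)
prevalence = 0.02  # illustrative: 2% of samples are positive
samples = [random.random() < prevalence for _ in range(10_000)]
print(dorfman_tests(samples, pool_size=10))  # far fewer than the
print(len(samples))                          # 10,000 individual tests
```

At low prevalence, pooling needs a fraction of the tests that individual screening would; adaptively choosing pools with a Bayesian model pushes that efficiency gain further still.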
It is relatively quiet on the committee front, although we are still playing an active role in joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation, and are attempting to help define and drive the RSS’s overall Data Science strategy.
Janet Bastiman is running a series of lectures for Barclays on Deepfakes – the second (ethics of deepfakes) and third (how to make one) are on the 5th and 12th August respectively and are very relevant as this technique becomes increasingly accessible and prevalent.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup – and has been active in lockdown with virtual events. The next event is on August 6th at 6.30pm, “Learning to Continually Learn”, when Nick Cheney, Assistant Professor at the University of Vermont, will talk through one of the best papers from ICLR. All the talks (past and future) are available here – a very useful resource.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
What’s this Deep Learning stuff all about?
Deep Learning is an increasingly prevalent machine learning technique, particularly in audio-, video- and image-related fields, but it is easy to forget how quickly it has come to prominence and the steps involved. A couple of useful articles talk through the key historical innovations and also function as useful primers for those new to the technique:
- In “Deep Learning’s Most Important Ideas”, Denny Britz talks through and explains the key details of the breakthrough ideas, putting them in historical context,
- while “Deep Learning Papers Reading” provides a more topic driven reading list.
Quick follow up on bias and diversity from last time
As a quick follow-on from our discussion of algorithmic bias and the importance of diversity, we wanted to recommend a great curated twitter account – ‘Women in Statistics and Data Science’ – definitely worth following.
The continuing GPT-3 saga
We’ve talked about OpenAI’s announcement of the 175 billion parameter GPT-3 model for a couple of issues now but it is still generating news, especially as more and more people are able to use it via the API.
The Verge does a nice job of assessing potential use-cases for the new model, as well as giving a high level view of how it works.
Kevin Lacker decided to give GPT-3 the Turing Test, with some really interesting findings. It can appear incredibly “human like” and in many cases produces remarkable and accurate answers, but it perhaps doesn’t know when to admit it doesn’t know:
Q (Kevin Lacker): How do you sporgle a morgle?
A (GPT-3): You sporgle a morgle by using a sporgle.
All of these Deep Learning based NLP models have to convert text into numeric vector form in some way or another. This post digs into the intriguing way GPT-3 encodes numbers… not obvious at all!
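To see why number encoding can be so unintuitive, here is a toy greedy longest-match tokenizer with a hypothetical five-entry vocabulary (not GPT-3’s actual BPE vocabulary or its real token splits); it shows how visually similar numbers can end up encoded very differently:

```python
# Toy greedy longest-match tokenizer. The vocabulary below is hypothetical;
# GPT-3's real BPE vocabulary is far larger and learned from data.
VOCAB = {"0", "1", "2", "20", "2020"}

def greedy_tokenize(text, vocab, max_len=4):
    tokens, i = [], 0
    while i < len(text):
        # take the longest vocabulary entry matching at position i
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            raise ValueError(f"cannot tokenize {text!r} at position {i}")
    return tokens

print(greedy_tokenize("2020", VOCAB))  # ['2020'] -- a single token
print(greedy_tokenize("2021", VOCAB))  # ['20', '2', '1'] -- split unevenly
```

A model that sees “2020” as one symbol but “2021” as three has no consistent positional notation to do arithmetic over, which goes some way to explaining the oddities the post uncovers.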
This entertaining post highlights more coherent examples:
Omfg, ok so I fed GPT3 the first half of my "How to run an Effective Board Meeting" (first screenshot) AND IT FUCKIN WROTE UP A 3-STEP PROCESS ON HOW TO RECRUIT BOARD MEMBERS THAT I SHOULD HONESTLY NOW PUT INTO MY DAMN ESSAY
This article steps back a little and tries to assess how much of a breakthrough GPT-3 really is while this piece in the Guardian also tries to put the new capabilities in perspective.
Finally Azeem Azhar does a good job putting the model and approach in historical context and discusses the different approaches (symbolic vs statistical) to solving the NLP problem.
What to work on and how to get it live
Data Science teams are often inundated with requests, and may well have a number of separate research streams they would like to progress. Deciding what to work on, to make the most of limited time and resources, is always difficult; this post provides a framework for thinking through this tricky topic.
We talked about ML Ops last time, and how to optimise the process of getting ML models live. There are various frameworks and solutions that can potentially be very useful here, if used in the right context (and at the right price…). This article gives one of the more comprehensive summaries of the options available, including those created at the big tech companies (Michelangelo at Uber, Bighead at Airbnb, Metaflow at Netflix, and Flyte at Lyft), the relevant cloud offerings from GCP (in particular TFX), AWS and Azure, as well as some interesting honourable mentions (H2O, MLflow). The distinction between this end-to-end model management and AutoML (building an individual model in the fastest way) is an interesting one, and important to understand when considering options.
Finally, this post is well worth a read. Stitch Fix have historically been transparent and informative about their A/B testing methods, and this piece evolves their approach in an interesting way, focusing on finding “winning interventions as quickly as possible in terms of samples used”.
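The “fewest samples used” framing is closely related to adaptive allocation ideas such as Thompson sampling, where traffic shifts towards the winning variant as evidence accumulates. A generic sketch (not Stitch Fix’s actual method, with made-up conversion rates) looks like this:

```python
import random

def thompson_step(successes, failures):
    """Pick the arm whose Beta(successes+1, failures+1) posterior
    sample is largest -- the core of Thompson sampling."""
    draws = [random.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda i: draws[i])

def run_bandit(true_rates, n_rounds, seed=0):
    random.seed(seed)
    k = len(true_rates)
    successes, failures = [0] * k, [0] * k
    for _ in range(n_rounds):
        arm = thompson_step(successes, failures)
        if random.random() < true_rates[arm]:   # simulate a conversion
            successes[arm] += 1
        else:
            failures[arm] += 1
    return successes, failures

succ, fail = run_bandit([0.05, 0.08], n_rounds=5000)
print(succ[1] + fail[1], "of 5000 samples went to the better variant")
```

Unlike a fixed 50/50 split, most samples end up on the better arm, which is the sense in which such schemes find winning interventions with fewer samples.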
As always here are a few potential practical projects to while away the lockdown hours:
- How about a spot of steganography? – check out this neural network that can hide one picture inside another
- Being ‘encouraged’ to learn a new programming language? Why bother, when you can “translate” from one to another?
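The steganography project above hides images with a neural network; for intuition, the classic non-neural version of the same idea swaps the low-order bits of a cover image for the high-order bits of a secret image. A minimal sketch on flat lists of 0–255 pixel values (illustrative numbers only):

```python
def hide(cover, secret, bits=2):
    """Hide `secret` in the low `bits` of `cover` (both lists of 0-255
    ints): keep cover's high bits, store secret's high bits in the rest."""
    assert len(cover) == len(secret)
    mask = (1 << bits) - 1
    return [(c & ~mask) | (s >> (8 - bits)) for c, s in zip(cover, secret)]

def reveal(stego, bits=2):
    """Recover a coarse approximation of the secret from the low bits."""
    mask = (1 << bits) - 1
    return [(p & mask) << (8 - bits) for p in stego]

cover = [200, 130, 57, 16]
secret = [255, 0, 128, 64]
stego = hide(cover, secret)
print(stego)          # [203, 128, 58, 17] -- cover barely changes
print(reveal(stego))  # [192, 0, 128, 64] -- coarse version of the secret
```

The neural approach learns where an image can absorb hidden information far less visibly than this fixed bit-swap, but the underlying trade-off (capacity versus visibility) is the same.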
Updates from Members and Contributors
- David Higgins has recently published a peer-reviewed guide to product development for AI in healthcare, highlighting the importance of bringing data, algorithms, clinicians and regulatory experts together. “From Bit to Bedside: a practical framework for artificial intelligence product development in Healthcare” is published in Advanced Intelligent Systems and has received very positive feedback so far.
- For those interested in exploring NLP, Mani Sarkar has published his NLP Profiler kernel on Kaggle, together with some useful tutorial notebooks.
Again, we hope you found this useful. Please do send it on to your friends; we are looking to build a strong community of data science practitioners. And sign up for future updates here: