October Newsletter

Hi everyone-

As the rain pours down, it definitely feels like winter has arrived: all the more reason to spend some time indoors, huddled up with some good data science reading material!

Following is the October edition of our Royal Statistical Society Data Science Section newsletter. Hopefully there are some interesting topics and titbits to feed your data science curiosity while figuring out the difference between second waves and spikes…

As always, any and all feedback is most welcome! If you like these, do please send them on to your friends (we are looking to build a strong community of data science practitioners) and sign up for future updates here.

Industrial Strength Data Science October 2020 Newsletter

RSS Data Science Section

Covid Corner

As Trump tests positive, the inevitable seems to be happening, with COVID-19 cases on the rise again in many areas of the world. As always, numbers, statistics and models are front and centre in all sorts of ways.

#19 Publish your results in a newspaper first (all criticism
of the study by scientists will be old news and sour grapes
by the time they get a chance to make it, and government policy
will already have been made)

The RSS has been concerned that, during the Covid-19 outbreak,
many new diagnostic tests for SARS-CoV-2 antigen or antibodies
have come to market for use both in clinical practice and for
surveillance without adequate provision for statistical
evaluation of their clinical and analytical performance.
  • Of course, this is all rather undermined when you discover the official national case tracking data is being managed in Excel.
  • Elsewhere, Wired gives a good analysis of the different approaches being taken by the various vaccine research groups to show whether or not their vaccine actually works. A recent paper in the journal Statistics in Biopharmaceutical Research, “Machine Learning for Clinical Trials in the Era of COVID-19”, highlights how machine learning can help with some of these issues.
  • On the epidemiological front, a recent article in Nature highlights how anonymised mobile phone data can be used in innovative ways to track the spread of the virus.
  • Is dispersion (k) the overlooked variable in our quest to understand the spread of the virus? Breaking down the distribution of infection events (rather than using the average, as with R) could help better explain super-spreaders and inform test and trace programs. Really interesting article from the Atlantic.
  • If anyone is keen to roll up their sleeves and dig into the data, the c3.ai COVID-19 Grand Challenge might be of interest…
  • Finally, The Alan Turing Institute is convening a public conference, “AI and Data Science in the Age of COVID-19”, on November 24th. In addition to public discussion there will be a series of closed workshop sessions to assess the response of the UK’s data science and AI community to the current pandemic. If you are interested in participating in the closed sessions, you can apply here.
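
For anyone curious about the dispersion (k) point above, here is a minimal sketch of why the average alone can mislead. It assumes the common modelling choice of a negative binomial distribution for the number of people each case infects, with mean R and dispersion k. The values R = 2.5 and k = 0.1, 1 and 10 are purely illustrative assumptions, not estimates from the Atlantic article, and the function names are made up for this example.

```python
import random


def simulate_secondary_cases(r, k, n_cases, seed=0):
    """Draw secondary-infection counts from a negative binomial (mean r, dispersion k)."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_cases):
        # Gamma-Poisson mixture: each case gets its own infectiousness, with mean r.
        individual_rate = rng.gammavariate(k, r / k)
        # Poisson draw: count unit-rate exponential arrivals before time individual_rate.
        infections = 0
        elapsed = rng.expovariate(1.0)
        while elapsed < individual_rate:
            infections += 1
            elapsed += rng.expovariate(1.0)
        counts.append(infections)
    return counts


def share_from_top(counts, top_fraction=0.2):
    """Fraction of all infections caused by the most infectious top_fraction of cases."""
    ranked = sorted(counts, reverse=True)
    n_top = max(1, int(len(ranked) * top_fraction))
    total = sum(ranked)
    return sum(ranked[:n_top]) / total if total else 0.0


if __name__ == "__main__":
    # Same average (R) each time; only the dispersion k changes.
    for k in (0.1, 1.0, 10.0):
        counts = simulate_secondary_cases(r=2.5, k=k, n_cases=20000)
        mean = sum(counts) / len(counts)
        share = share_from_top(counts)
        print(f"k={k}: mean={mean:.2f}, share of infections from top 20% of cases={share:.0%}")
```

The smaller k is, the more the same average is concentrated in a handful of cases, which is the super-spreader pattern the article describes.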

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can. If there is anything we can do to help those who have been laid off (networking and introductions, advice on development etc.), don’t hesitate to drop us a line.

As previewed in our last newsletter, and our recent release, we are excited to be launching a new initiative: AI Ethics Happy Hours. If you have encountered or witnessed ethical challenges in your professional life as a data scientist that you think would make for an interesting discussion, we would love to hear from you at dss.ethics@gmail.com (deadline October 15th).

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and has been active in lockdown with virtual events. Next up, on Monday October 12th, is “From Machine Learning to Machine Reasoning”, by Drew Hudson from Stanford University. Videos are posted on the meetup YouTube channel, and future events will be posted here.

Anjali Mazumder is helping organise the Turing Institute event mentioned above in Covid Corner.

Elsewhere in Data Science

Lots of non-Covid data science going on, as always!

Bias and more bias…

The more we collectively dig into the underlying models driving our everyday activities, the more issues we uncover…

"I don't trust linear regressions when it's harder to guess the
direction of the correlation from the scatter plot than 
to find new constellations on it"

Recommenders Gone Wild …

One example that Rachel Thomas discussed in the talk above is recommendation systems. With the proliferation of content and product choices now available online, we could all use some help curating and narrowing down the options available. When implemented well, recommendation systems can elegantly assist in this. Many work through some form of collaborative filtering, which really boils down to identifying similar behaviours and extrapolating:

If Alice likes oranges, pineapples and mangos, 
and Bob likes oranges and pineapples, 
maybe Bob will also like mangos...
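
The Alice and Bob example can be sketched in a few lines. This is a minimal illustration of user-based collaborative filtering, assuming Jaccard overlap as the similarity measure (one choice among many). Carol and her kiwis are hypothetical additions, there only to show that dissimilar users contribute nothing.

```python
def jaccard(a, b):
    """Similarity of two sets of liked items: shared items / all items."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(user, likes, top_n=3):
    """Rank items the user has not tried by the similarity of the people who like them."""
    seen = likes[user]
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        similarity = jaccard(seen, items)
        if similarity == 0:
            continue  # no shared tastes, so nothing to extrapolate from
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]


likes = {
    "Alice": {"oranges", "pineapples", "mangos"},
    "Bob": {"oranges", "pineapples"},
    "Carol": {"kiwis"},  # hypothetical extra user with no overlap with Bob
}

if __name__ == "__main__":
    print(recommend("Bob", likes))
```

Note that whatever Bob is shown here depends entirely on what similar users already like, which is where the feedback loops discussed next can creep in.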

However, depending on how these similarities are codified and calculated, it has now been shown that feedback loops can quite easily be generated.

  • Wired dug into the YouTube recommender in 2019 with Guillaume Chaslot, one of the original engineers on the project, highlighting how strongly the metric chosen to optimise (in this case viewing time) drives the material recommended and so consumed.
  • In a recent follow-up, “YouTube’s Plot to Silence Conspiracy Theories”, they highlight some of the changes that have been implemented to reduce the issues identified. Interestingly, the focus seems to be on identifying potentially hazardous material that is then excluded from the recommender, rather than changing the recommender itself.
  • DeepMind recently released research digging into these feedback loops (“Degenerate Feedback Loops in Recommender Systems”) giving a theoretical grounding to the concepts of “echo chambers” and “filter bubbles” and why they occur.
  • In “Overcoming Echo Chambers in Recommender Systems”, Ryan Millar digs into alternative methods using the fabled MovieLens data set, giving an example of how, through different objective functions, you can reduce the feedback loop effects in the recommender system itself. This feels similar to the concepts of “explore” vs “exploit” in Thompson sampling, and an approach well worth considering if you are building a system yourself.
  • Finally, Eugene Yan gives a useful summary of RecSys 2020, highlighting a number of research papers on the topic of feedback loops and bias in recommender systems.
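
For readers who have not met the “explore” vs “exploit” idea mentioned above, here is a minimal sketch of Thompson sampling for choosing which item to show. It is a generic Beta-Bernoulli bandit, not the method from any of the articles linked above, and the click rates are invented for illustration.

```python
import random


def thompson_sampling(true_click_rates, rounds, seed=0):
    """Return how often each item was shown under Beta-Bernoulli Thompson sampling."""
    rng = random.Random(seed)
    n_items = len(true_click_rates)
    clicks = [0] * n_items  # observed successes per item
    skips = [0] * n_items   # observed failures per item
    shown = [0] * n_items
    for _ in range(rounds):
        # Explore: draw a plausible click rate for each item from its Beta posterior
        # (starting from a uniform Beta(1, 1) prior).
        samples = [rng.betavariate(1 + clicks[i], 1 + skips[i]) for i in range(n_items)]
        # Exploit: show the item whose sampled rate is highest this round.
        choice = max(range(n_items), key=samples.__getitem__)
        shown[choice] += 1
        # Simulated user feedback; true_click_rates is hidden from the algorithm.
        if rng.random() < true_click_rates[choice]:
            clicks[choice] += 1
        else:
            skips[choice] += 1
    return shown


if __name__ == "__main__":
    # Hypothetical click rates for three items.
    print(thompson_sampling([0.05, 0.10, 0.30], rounds=5000))
```

Items with little data have wide posteriors, so they still get shown occasionally rather than being locked out by early results. That is the property that makes this family of methods interesting when feedback loops are a concern.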

Yet more GPT-3 …
Continuing our regular feature on GPT-3 (OpenAI’s 175 billion parameter NLP model) as it continues to generate news and commentary.

AI Trends and Business

Practical Projects
As always here are a few potential practical projects to while away the socially distanced hours:

Updates from Members and Contributors

Again, hope you found this useful. Please do send it on to your friends (we are looking to build a strong community of data science practitioners) and sign up for future updates here.

– Piers
