Hi everyone-
It’s a bank holiday weekend, so it’s probably May and another month has flown by… I hope the excitement of venturing out from our cave-like lockdown has not proved too overwhelming … perhaps a few curated data science reading materials might prove relaxing?
Following is the May edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … A particularly strong section from Members and Contributors this month- good reason to read to the end! Also we are moving Covid Corner to the end to change the focus a little.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science May 2021 Newsletter
RSS Data Science Section
Committee Activities
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.
Fresh on the heels of our incredibly successful event with Andrew Ng, we are excited to announce the next instalment in the series at 6.30pm on Thursday May 20th. The RSS Data Science section invites you to a fireside conversation with Anthony Goldbloom – founder and CEO of Kaggle (now a Google company), the world’s largest data science and machine learning community with over 6MM members. Hear Anthony share his thoughts and experiences from the past 10 years at the forefront of competitive Machine Learning – sign up here to attend.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The next event is on 10th May where Noam Brown, research scientist at Facebook AI in New York, will give a talk titled ‘AI for Imperfect-Information Games: Poker and Beyond’. Videos are posted on the meetup YouTube channel – and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Ethically compromised uses of AI continue to hit the news:
- BuzzFeedNews investigates how Clearview Facial recognition has been quietly deployed to police forces across the US.
- The Verge highlights the pitfalls of automatic gender recognition.
- And it’s been a busy month for the regulators…
- The European Commission published its draft proposal for the regulation of AI: if you haven’t taken a look yet, you should, as it could potentially change what AI is used for and how it is used in many ways.
- In addition the EU is apparently considering a ban on AI for mass surveillance and social credit scores.
- Going in a different direction, the UK government has announced plans to ‘green-light’ self driving cars on motorways through specific regulation.
- Meanwhile the FTC in the US published their proposal for how companies should be responsible for their use of AI: “Aiming for truth, fairness, and equity in your company’s use of AI”
- Clearly some form of regulation is important given the increasingly prevalent questionable AI use cases coming to light. However, the EU draft regulations would seem to be too broad- as Benedict Evans commented:
"This is like trying to write a single law covering 'cars', that covers drunk driving, emissions standards, parking, and the tax treatment of highways.."
- In addition, it is very hard to regulate ‘AI’ when it is far from clear we have a good definition of what ‘AI’ actually is, as our very own Martin Goodson points out in his recent blog post.
"The Act has already caused dismay amongst statisticians, who had no idea they were actually doing AI all along."
- In a more humorous take on ‘Big Tech’s’ approach to ethics issues in AI, the MIT Technology Review provides this useful guide on terminology
"accountability (n) - The act of holding someone else responsible for the consequences when your AI system fails."
Developments in Data Science…
As always, lots of new developments…
- We have commented previously on how Deep Learning architectures are now able to aid scientific discovery by estimating solutions to previously intractable physics equations, and progress continues:
- ‘Can Neural Nets learn a Fourier Transform’? – yes they can but the more interesting question is why is this useful?
- Quanta magazine has an excellent article on new developments in solving partial differential equations (such as Navier-Stokes) at scale, with all sorts of practical applications.
- Deep Learning models are getting bigger and bigger… so it’s always useful to understand ways we can reduce their size – this is a good primer on deep learning model compression. In addition this tool looks very useful for attempting to understand how these huge language models actually work – Language Interpretability Tool
- Not sure how useful this will be, but it’s very impressive – researchers have created a physical artificial neural network: no energy is consumed to run the device because it only uses diffraction of light!
- IBM’s Watson initiative may not have proved broadly successful, but ‘Project Debater’ looks innovative, and addresses a topic (automatically generated logical argument) with a wide variety of applications.
- Some interesting developments from Microsoft Research and Peking University around the concept of “knowledge neurons” in transformers which express specific factual information learned from a corpus.
- Finally in this section, some good progress in approaches to building robust ML models with smaller amounts of labeled data, utilising “Self-Supervised learning”.
- Impressive work on self-supervised video object segmentation from the Visual Geometry Group at Oxford University.
- Facebook is bringing self-supervised approaches to computer vision with SEER – innovative and worth a read.
- SEER actually uses a specific self-supervised approach called contrastive learning, an elegant method that encodes what makes two examples similar or different – useful tutorials here and here.
- Of course, another approach is to automatically generate labeled data, something GANs are proving useful for.
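For anyone who wants to see the contrastive learning idea in miniature, here is a rough sketch in plain Python of an InfoNCE-style loss, one common way of scoring "is the anchor closer to its positive example than to the negatives?". To be clear, this is not the loss SEER itself uses, just a generic illustration; the function names, the temperature value and the toy two-dimensional embeddings are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor.

    The loss is close to zero when the anchor is much more similar to the
    positive than to every negative, and grows when a negative is closer.
    The temperature of 0.1 is an illustrative assumption.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_sum - logits[0]


# Toy embeddings (hypothetical): two views of the same image should be close.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

good = info_nce_loss(anchor, positive, negatives)
# Swap roles so that a dissimilar vector is treated as the "positive".
bad = info_nce_loss(anchor, negatives[0], [positive, negatives[1]])
print(f"well-aligned pair: {good:.4f}, badly-aligned pair: {bad:.4f}")
```

In a real system the embeddings come from a neural network and the loss is minimised by gradient descent, which pulls augmented views of the same example together and pushes different examples apart.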
Real world applications of Data Science
Making a difference in the real world
- Promising results for early detection of colon cancer.
- Interesting discussion on how best to utilise the impressive capabilities of language model powered chatbots in health care settings – helping troubled teenagers at The Trevor Project
- Another useful application of the increasing prevalence of regularly updating satellite imagery – mangrove growth and deforestation
- Decoding whale language – wouldn’t you love to be involved in that!
- Fun and games with images and paintings:
- Relighting and colour grading images with machine learning
- Recreating a lost Picasso – the researchers used ‘style transfer’ in this process: to learn more about this approach and experiment with it yourself, check out Starry Cat.
Practical pointers on recommenders and search
Lots of good tips on search and recommendations this month
- First of all, a useful discussion of how and why Netflix is attempting to move beyond recommendations to ‘end scrolling’
- How is search different from recommendations? Good breakdown of some of the underlying concepts from Eugene Yan
- What is similarity search? – most search or recommendation approaches rely on some sort of distance calculation across multiple dimensions and it is important to understand the tradeoffs involved.
- How Flipkart generates auto-suggestions for their search box
- Some useful insight from Pinterest on how they utilise AutoML
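On the similarity search point above, the basic "distance calculation across multiple dimensions" can be sketched in a few lines of Python. This is a brute-force version using cosine similarity; the item names and three-dimensional vectors are invented purely for illustration, and real systems would use learned embeddings with many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query, items, k=2):
    """Return the names of the k items most similar to the query.

    Brute force: every item is scored, so the result is exact but the cost
    grows linearly with the size of the catalogue.
    """
    scored = [(cosine_similarity(query, vec), name) for name, vec in items.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical item embeddings (e.g. "how thriller-ish / romantic / factual").
catalogue = {
    "thriller_a": [0.9, 0.1, 0.0],
    "thriller_b": [0.8, 0.2, 0.1],
    "romcom": [0.1, 0.9, 0.2],
    "documentary": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], catalogue, k=2))
```

The tradeoff mentioned above shows up as soon as the catalogue gets large: scanning everything is exact but slow, whereas approximate nearest-neighbour indexes give up a little accuracy in exchange for much faster lookups.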
How does that work?
A new section on understanding different approaches and techniques
- Where should I start with supervised learning? Kaggle now have a very useful set of supervised learning challenges that are quick to run and focus on helping newcomers to machine learning get up to speed- the Tabular Playground Series
- Useful tutorial on using Google’s new Speech to Text API in Python from Grettel Juarez
- Can you run a gradient boosted tree in TensorFlow? Yes you can, shows Rebecca Nettleship (ok… full disclosure, we work together… great work Rebecca!)
It’s all about the data …
Which is more important… the data or the algorithm?
- Increasing evidence that the data sets that have driven many of the ground breaking developments in image recognition contain underlying flaws
- Facebook is attempting to help, with a new open sourced data set ‘Casual Conversations’ that focuses on fair representation across diverse groups.
- Vicky Boykis discusses how critical good quality data is to the whole analytic process (along with a number of other useful tips she describes as “ghost knowledge”) – excellent post
"Having clean data is in this category of “ghost knowledge” that, if you’ve been working in data for a long time, you know painfully from your own experience."
- And Andrew Ng is a big believer in this as well as he discusses in this interview
"Systematic improvement of data quality on a basic model is better than chasing the state-of-the-art models with low-quality data."
The Art of Visualisation
Making data science look right..
- Google explores how far a map can go as a medium for displaying relevant information
- Ilya Kashnitsky recommends keeping it simple with dot-plots
- Awesome – how to generate cool mathematical animations in python from Khuyen Tran
- An elegant new visual approach to understanding clustering – Clustergram from Martin Fleischmann
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to while away the socially distanced hours:
- A bot that bird watches so you don’t have to!
- Apartment hunting with python
- Building a self-supervised binary emotion classifier from scratch!
- If you want more of a hardware project and aren’t pressed for cash… “How I built a €25K Machine Learning Rig”
- I really want to try this – “Live Plotting Data with Matplotlib and RaspberryPi”
Covid Corner
Again, more positive progress in the UK on the Covid front, with over 35m people now having received their first vaccine dose and other metrics, such as deaths and hospitalisations, all progressing in the right direction.
- The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 1000 so we have come a considerable way from January, when prevalence peaked at around 1 in 50. It is interesting to note though that we are not yet back to where we were last summer, when it dropped to 1 in 2000.
- Some very positive results regarding the efficacy of the various vaccines ‘in the wild’ against the current variants have been recently published in the BMJ.
"Vaccination with a single dose of Oxford-AstraZeneca or Pfizer-BioNTech vaccines, [] significantly reduced new SARS-CoV-2 infections in this large community surveillance study"
- Of course the trajectory in India is tragically different at the moment, with hospitals overwhelmed and deaths from covid passing 200,000. While the UK was suffering some of the worst per capita death rates in the world in January, India seemed to have Covid under control, highlighting how rapidly the situation can change. There is concern that a new variant may be the driving force behind the change, but it is far from clear.
- Really interesting results from recent academic research (Wellcome Sanger Institute, Newcastle University, University College London, the University of Cambridge, EMBL’s European Bioinformatics Institute) – (hat-tip to my mum…):
- Researchers have identified differences in the immune response to COVID-19 between asymptomatic people and those with severe symptoms which could be used to identify potential targets for developing therapies
- Given the critical importance of vaccination programs around the world, it is heartening to know that vaccine development still continues – a new low cost highly targeted approach (NDV-HXP-S) has just gone into clinical trials and shows great promise.
“The new vaccine can be mass-produced in chicken eggs — the same eggs that produce billions of influenza vaccines every year in factories around the world”
- False positives continue to confuse, particularly in connection to Lateral Flow Tests… David Spiegelhalter again does an excellent job of explaining the issue, managing to link Bunhill Fields (where Thomas Bayes is buried) to the prosecutor’s fallacy. If you have not come across the latter, this is well worth a read.
- Finally, on the topic of misuse of statistics, this was a useful reminder about the problem with averages, and how the ‘average human’ doesn’t exist.
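The false positive point above is easier to see with a quick calculation. The sketch below applies Bayes’ theorem to ask "given a positive test, how likely is it that I actually have Covid?" at the two prevalence levels quoted earlier (1 in 1000 now, 1 in 50 in January). The sensitivity and specificity figures are illustrative assumptions only, not measured values for any particular Lateral Flow Test.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Probability of being infected given a positive test (Bayes' theorem)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumed test characteristics, chosen purely for illustration.
SENSITIVITY = 0.70    # share of infected people who test positive
SPECIFICITY = 0.999   # share of uninfected people who test negative

ppv_now = positive_predictive_value(1 / 1000, SENSITIVITY, SPECIFICITY)
ppv_jan = positive_predictive_value(1 / 50, SENSITIVITY, SPECIFICITY)

print(f"Prevalence 1 in 1000: P(infected | positive) = {ppv_now:.2f}")
print(f"Prevalence 1 in 50:   P(infected | positive) = {ppv_jan:.2f}")
```

Under these assumed numbers, fewer than half of the positives at 1 in 1000 prevalence are genuine, even though the test wrongly flags only 1 in 1000 uninfected people, while at January’s prevalence the large majority are genuine. Mixing up "the test is rarely wrong" with "a positive result is rarely wrong" is exactly the prosecutor’s fallacy.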
Updates from Members and Contributors
- Harald Carlens and Eniola Olaleye have recently published an in-depth review of over 100 Machine Learning competitions from last year across Kaggle and other platforms. The summary can be found here highlighting lots of useful trends in terms of ML approaches and libraries (a sign of the times- only a single winner used R…).
- Marco Gorelli will be running his excellent workshop on contributing to Pandas again on 8th May. Sign up is not yet online, but it will be run in collaboration with PyLadies London and is specifically targeting people from underrepresented genders in tech.
- Shirley Coleman announces that the European Network of Business and Industrial Statistics will be running the 2 day online ENBIS Spring Meeting on Data Science in Process Industries, on 17/18th May 2021. Free registration is available here and all are welcome.
- Hillary Juma highlights some excellent new opportunities from the ONS:
- ESRC-ADR UK No.10 data science fellowships 2021: deadline June 2nd – spend a year collaborating with 10 Downing Street’s data science team (10DS) and the Office for National Statistics (ONS)
- The Data Science Campus at the Office for National Statistics have recently launched a cross government Data Science Graduate Programme: a unique opportunity to work at the heart of Data Science in the public sector. Open day for anyone interested on Wednesday 12th May 5pm to 6pm (register here).
- David Higgins has published another excellent article on Medical AI, highlighting the technical challenges involved.
- Jencir Lee is working on a time series terminal project, focused on causal effects in time series- early days but looks intriguing.
- Kevin O’Brien highlights that Professor Jonathan Rougier’s presentation on the analysis of historical volcano data with R hosted by the Why R? foundation is available to watch here. He would also like to thank RSS members who provided assistance in resourcing R workshops in Africa (see footage of an R workshop in Togo that took place last December).
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
– Piers
The views expressed are our own and do not necessarily represent those of the RSS