August Newsletter

Hi everyone-

That was quick, August already, but at least we have had the occasional day when it properly feels like summer- and now we have some Olympics to watch which is always entertaining! … How about a few curated data science materials to read while watching the marathon?

Following is the August edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science August 2021 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are still working on releasing the video and a summary of the latest in our ‘Fireside chat’ series- an engaging and enlightening conversation with Anthony Goldbloom, founder and CEO of Kaggle. Sorry for the delay- we will post a link when it is available.

Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy (If you haven’t already, you can still contribute here). We are passionate about making sure the government focuses on the right things in this area, and are now analysing the results which we will publish shortly.

The full programme for this year’s RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed. The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. Registration is open.

Speaking of the RSS Conference, we are running a session there, and we need your help! We would like to hear stories about your worst mistakes in data science. From these, we will select common themes and topics, and create a crowd-sourced compilation of the deadliest sins of data science. These will be presented – anonymously – to our panel, for a live, interactive discussion in front of an audience, at our session on Tuesday 7 September, 11:40 – 13:00. We hope this will both entertain and inform. Maybe your pain can help save someone else’s (data science) soul… CONFESS YOUR SINS HERE – the survey is anonymous, we won’t embarrass anyone!

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The most recent event was on July 14th when Xavier Bresson, Associate Professor in the Department of Computer Science at the National University of Singapore, discussed “The Transformer Network for the Traveling Salesman Problem”. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"One gave our candidate a high score for English proficiency when she spoke only in German."
  • We talk about bias a fair amount, and it’s always good to define terms – this summary from the ACM (Association for Computing Machinery) gives a good overview. They split biases in AI systems into four sensible high level areas (as well as splitting out more specific types in each area):
    • Data-creation bias
    • Biases related to problem formulation
    • Biases related to the algorithm/data analysis
    • Biases related to evaluation/validation
  • It’s easy to overlook the first area highlighted above – data-creation bias. Often we train supervised learning models based on hand-labeled examples which we assume to be ‘correct’ but may not be. This article from O’Reilly talks through this issue and discusses different approaches (such as semi-supervised learning and weak supervision), while this article (from Sandeep Uttamchandani) gives some practical tips on data set selection for ML model building.
There is no such thing as gold labels: even the most well-known hand labeled datasets have label error rates of at least 5% (ImageNet has a label error rate of 5.8%!).
  • More positively, Apple has released information about their approach for face detection in photos, highlighting positive aspects such as on-device scoring, and fairness.
  • And this analysis charting the ‘data-for-good’ landscape shows it’s not all doom and gloom…

Developments in Data Science…
As always, lots of new developments…

  • When the ‘founding fathers’ of Deep Learning (Bengio, Hinton and LeCun) get together it’s always worth reading… here they discuss the future of Deep Learning and key research directions. They highlight key issues with existing approaches (large volumes of data for supervised learning or large numbers of iterations for reinforcement learning) but are not convinced by hybrid approaches including symbolic learning, believing research into more efficient learning from fewer examples will bear fruit.
“Humans and animals seem to be able to learn massive amounts of background knowledge about the world, largely by observation, in a task-independent manner. This knowledge underpins common sense and allows humans to learn complex tasks, such as driving, with just a few hours of practice.”
Interestingly, the ways that languages categorize color vary widely. Nonindustrialized cultures typically have far fewer words for colors than industrialized cultures. So while English has 11 words that everyone knows, the Papua-New Guinean language Berinmo has only five, and the Bolivian Amazonian language Tsimane’ has only three words that everyone knows, corresponding to black, white and red

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

"we’ve found that other approaches, such as reinforcement learning with human feedback, lead to faster progress in our reinforcement learning research"
"GitHub Copilot has been described as ‘magical’, ‘god send’, ‘seriously incredible work’, et cetera. I agree, it’s a pretty impressive tool, something I see myself using daily ... In my experience, Copilot excels at writing repetitive, tedious, boilerplate-y code. With minimal context, it can whip up a function that slices and dices a dataset, trains and evaluates several ml models, and, if you ask it nicely, also makes a nice batch of french fries"
  • Ok, so maybe not quite so practical, but still great fun – AI driven art out of Berkeley (‘Alien Dreams’)
"this CLIP method is more like a beautifully hacked together trick for using language to steer existing unconditional image generating models"
  • A useful rundown from DoorDash on how they use ML models to balance supply and demand, including some interesting discussion on optimisation approaches, which are often the way to turn an ML model into something that is used in decision making.

How does that work?
A new section on understanding different approaches and techniques

Diffusion models are a new type of generative models that are flexible enough to learn any arbitrarily complex data distribution while tractable to analytically evaluate the distribution

Getting it live
How to drive ML into production

  • Andrew Ng brings to life the challenges of building an AI product…
"Unsurprisingly, things did not go exactly as planned. Thus, this post is about what worked and what didn’t. I have focused on the most challenging aspects of trying to get data scientists to get review from their peers. I hope this helps others who wish to formalize peer review processes in data science"

Correlation or Causation?
A deep dive into causal analysis in machine learning

  • You have a machine learning model and it seems to perform great, not only on the training set, but even on hold out test sets- sorted, right? It’s worth considering how you are going to use the model- if you are making predictions and using the output as is, then maybe you are ok; but if you are going to use the model for scenario planning and counterfactual assessment (‘what-ifs?’), it would be worth thinking about causal analysis. Here’s a good starting point, from Jane Huang.
  • Here’s a useful example – estimating price elasticity
  • The technique often relies on something called ‘Double Machine Learning’
    • Overview here, with different implementations here and here and a worked example here
As any great technology, Double Machine Learning for causal inference has the potential to become pretty ubiquitous. But let’s calm the enthusiasm of this writer down and go back to our task
  • Finally, an intriguing approach for time series and econometrics… causal forests
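
To make the Double Machine Learning idea a little more concrete, here is a minimal sketch of the "partialling-out" recipe applied to a price elasticity problem. It is not taken from any of the linked articles: the simulated data, the variable names (season, log_price, log_qty), the helper functions and the simple polynomial nuisance model are all assumptions chosen for illustration. In practice you would swap in whatever ML regressor suits your data. The idea is to predict both price and quantity from the confounder on held-out folds, then regress the quantity residuals on the price residuals.

```python
import numpy as np

def fit_predict(x_train, y_train, x_test, degree=3):
    """Hypothetical nuisance model: least squares on polynomial features.
    A stand-in for any flexible ML regressor."""
    def features(x):
        return np.column_stack([x ** d for d in range(degree + 1)])
    coef, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)
    return features(x_test) @ coef

def double_ml_elasticity(log_price, log_qty, confounder, n_folds=2, seed=0):
    """Cross-fitted partialling-out estimate of the price coefficient."""
    rng = np.random.default_rng(seed)
    n = len(log_price)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    price_res = np.empty(n)
    qty_res = np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Residualise price and quantity using models fitted on the other fold(s)
        price_res[test] = log_price[test] - fit_predict(
            confounder[train], log_price[train], confounder[test])
        qty_res[test] = log_qty[test] - fit_predict(
            confounder[train], log_qty[train], confounder[test])
    # Final stage: regress quantity residuals on price residuals (no intercept)
    return float(price_res @ qty_res / (price_res @ price_res))

# Simulated data (an assumption for illustration): a seasonal demand driver
# pushes up both price and quantity, which confounds the naive regression.
rng = np.random.default_rng(42)
n = 5000
true_elasticity = -1.5
season = rng.uniform(-1, 1, n)
log_price = 0.5 * season + 0.3 * season**2 + rng.normal(0, 0.3, n)
log_qty = (true_elasticity * log_price + 2.0 * season + season**2
           + rng.normal(0, 0.3, n))

naive_estimate = float(np.polyfit(log_price, log_qty, 1)[0])
dml_estimate = double_ml_elasticity(log_price, log_qty, season)
print(f"naive slope: {naive_estimate:.2f}, double ML estimate: {dml_estimate:.2f}")
```

In this toy setup the naive regression of quantity on price is badly distorted by the seasonal driver, while the residual-on-residual estimate should land close to the simulated elasticity. The cross-fitting (fitting the nuisance models on one fold and predicting on another) is what keeps overfitting in the first stage from leaking into the final estimate.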

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

How to get involved in the IRCAI AI Award 2021?

The International Research Centre in Artificial Intelligence under the auspices of UNESCO is launching an AI Award for individuals who have dedicated their work to solving problems related to the United Nations Sustainable Development Goals (SDGs) by means of the application of Artificial Intelligence.

Covid Corner

Not sure what to say here… vaccinations keep progressing in the UK, which is good news, but we now have what appear to be the highest covid case levels we have seen over the whole of the pandemic due to the Delta variant…

  • The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 65 people, up from 1 in 75 the week before and an almost unbelievable increase from only June, when the estimate was 1 in 1100.
  • More or Less gives an excellent review of the Delta variant and how it has come to dominate other strains of coronavirus the world over
  • One of the core findings about Delta, as discussed by More or Less, is its apparent ability to transmit through vaccinated individuals (or those with antibodies from prior infections) – in other words vaccinations, while still protecting against the worst outcomes, are not as effective at reducing transmission.
  • This definitely raises the stakes of the recent UK governmental re-opening and relaxation of restrictions on July 19th (symbolically welcomed by the prime minister in self-isolation…) which has been roundly condemned by the scientific community
  • In addition, in a recent article in the Guardian, SAGE committee member Professor Robert West states the government’s express intention is to allow infections to rip through the younger population, a very worrying statement.
“What we are seeing is a decision by the government to get as many people infected as possible, as quickly as possible, while using rhetoric about caution as a way of putting the blame on the public for the consequences”

Updates from Members and Contributors

  • Marco Gorelli announces the first official release (1.0.0) of his highly acclaimed nbQA repo, full of very useful code formatting features and pre-commit hooks for Jupyter notebooks
  • Alex Spanos will be presenting TrueLayer’s data science work at the RSS conference in Manchester (“An end-to-end Data Science workflow for building scalable and performant data enrichment APIs in Open Banking”) – another great reason to attend in September!
  • Mark Baillie highlights an upcoming special issue of the Biometrical Journal
    “Data scientists are frequently faced with an array of methods to choose from; often this makes selection difficult especially beyond one’s own particular interests and expertise. Neutral comparison studies are an essential cornerstone towards the improvement of this situation, providing evidence to help guide practitioners. For the special issue of Biometrical Journal we are interested in submissions that define, develop, discuss or illustrate concepts related to practical issues and improvement of neutral method comparison studies, as well as articles reporting well-designed neutral comparison studies of methods”

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

In memoriam

With great sadness I announce the untimely death of Rebecca Nettleship, a valued colleague and talented data scientist, on 22nd July 2021. She will be sorely missed. Our deepest condolences go out to her family and friends.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
