I guess summer is over, what there was of it. I was hoping we might get a bit of autumn sunshine, but it feels like it’s big-coat weather already… definitely time for some tasty data science reading material in front of a warm fire!
Following is the October edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science October 2021 Newsletter
RSS Data Science Section
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.
First of all, we have a new name… Data Science and AI Section! To be honest, we’ve always talked about machine learning and artificial intelligence, and have some very experienced practitioners both on the committee and in our network, so it doesn’t really change our focus. It is nice to have it officially recognised by the RSS though.
Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy. As you may have seen, Martin Goodson, our chair, summarised some of the findings in a recent post, highlighting the significant gaps in the government’s proposed approach based on your comments. Some of these gaps, particularly on open-source, have now been publicly acknowledged, multiple times. In addition, Martin and Jim Weatherall met with Sana Khareghani (Director of the Office for AI) and Tabitha Goldstaub (Chair of the AI Council) to further advocate for our community’s needs. Sana agreed that the Office for AI will run workshops together with the RSS, focused on the technical practitioner community, in order to gain their perspective and identify their needs.
“Confessions of a Data Scientist” seemed to go down very well at the recent RSS conference- massive thanks to Louisa Nolan for making it so successful, and to you all for your contributions.
Of course, the RSS never sleeps… so preparation for next year’s conference, which will take place in Aberdeen, Scotland from 12-15 September 2022, is already underway. The RSS is inviting proposals for invited topic sessions. These are put together by an individual, group of individuals or an organisation with a set of speakers who they invite to speak on a particular topic. The conference provides one of the best opportunities in the UK for anyone interested in statistics and data science to come together to share knowledge and network. Deadline for proposals is November 18th.
Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The last talk was on September 7th, where Thomas Kipf, Research Scientist at Google Research in the Brain Team in Amsterdam, discussed “Relational Structure Discovery“. Videos are posted on the meetup YouTube channel – and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Last month in the ethics section we mentioned the recent Australian court case, where an AI was recognised as the inventor in a patent. This issue is now coming to a head in different jurisdictions, including the US and UK: useful summaries from The Register and The Verge, with a more in-depth look at the legal side from Pinsent Masons here
- Good to see Margaret Mitchell, the former co-head of Google’s Ethical AI research group, has successfully moved on from her controversial ousting (covered in previous newsletters) and taken up a role at Hugging Face, a well-respected open-source AI community
- The UN High Commissioner for Human Rights released an urgent call for action around AI risks to privacy
“Artificial intelligence can be a force for good, helping societies overcome some of the great challenges of our times. But AI technologies can have negative, even catastrophic, effects if they are used without sufficient regard to how they affect people’s human rights”
- Meanwhile, new applications of AI highlight these risks:
- MIT Technology Review reveals how porn deepfakes are now accessible at the click of a button
- Facebook is apparently launching augmented reality glasses, which, not surprisingly, has triggered all sorts of privacy concerns…
- And of course we know how fragile some of these systems are to adversarial attack (tricking the AI with something that is obvious to a human). It seems you can now avoid facial recognition with some simple makeup.
- Although recent research into the attitudes of ML researchers highlights ethical concerns about applications in a number of industries, including the military, this hasn’t stopped rapid development in this space. Indeed, the NY Times reports on the AI-assisted killing of an Iranian nuclear scientist by Israeli agents.
- In a similar vein, this interesting behind the scenes exposé of Deep Mind’s struggle for control within Google, highlights the concerns of ML researchers at the potential applications of their work.
- Medical AI is rapidly developing, and there are questions as to whether and how medical practitioners keep up to speed with the pros and cons of different approaches and the ethical challenges that arise – a good paper talking through the issues
- Recommendation systems are everywhere and increasingly simple to implement.
- They are tuned to optimise particular metrics as we have discussed in previous newsletters- so if engagement or ‘attention’ is the metric of choice (often the case as it drives profitability at many social networks) they will naturally surface more ‘attention grabbing’ material, which is more likely to be contentious
- Facebook has tacitly admitted that engagement might not be the most appropriate measure to optimise – which would be big news if they move in that direction
"Depoliticizing people’s feeds makes sense for a company that is perpetually in hot water for its alleged impact on politics"
- Transparency is a key component to avoiding bias and reducing ethical concerns and we have a couple of positive examples from leading firms this month
- Twitter has come up with a novel approach to identifying algorithmic bias – rewards for identifying bias in their systems (like bug bounties in software)
- And YouTube has released a simple guide to how their recommendation system works with another tacit admission that engagement might not be the best measure to optimise…
"We don’t want viewers regretting the videos they spend time watching and realized we needed to do even more to measure how much value you get from your time on YouTube."
Developments in Data Science…
As always, lots of new developments… thought we’d have a more extended look at some of the new research this month
- Plenty of great arXiv papers out there this month- I know these can be a bit dry, so will try and give a bit of context…
- One theme of research we have been following is “fewer-shot” training of models. Fundamentally, humans don’t need millions of examples of an orange before being able to identify one, so learning from limited examples should be possible. Large language models like GPT-3 have shown great promise in this area, where, given a few “prompts” (question and answer examples), they seem to be able to provide remarkable results on this type of problem. Sadly, this paper, “True Few-Shot Learning”, suggests we need a more standardised approach to example selection, as previous results may have been artificially inflated by biased approaches.
- More positively, “Can you learn an algorithm” talks through recent research showing that simple recurrent neural networks can learn approaches that can be successfully applied to larger scale problems, just as humans can learn from toy examples. Similarly, a new sequence to sequence learning approach from MIT CSAIL includes a component that learns “grammar” across examples.
- Another popular research theme is simplifying architecture and reducing processing. A team at Google Brain have shown (“Pay Attention to MLPs“) that you can almost replicate the performance of transformers (a more complex deep learning architecture) with a simpler approach based on basic building blocks (multi-layer perceptrons)
- GANs (generative adversarial networks) are pretty cool – they generate new similar-looking examples from input data (see here for an intro). A recent paper (GAN’s N’ Roses) takes this to a new level, generating stable video from an input and a theme. (“GAN’s N’ Roses” is clearly a popular meme – this tutorial predates the paper by 4 years!)
- Of course the big industrial research powerhouses (Google/DeepMind, Facebook etc.) keep churning out fantastic work:
- Facebook released textless-NLP, which generates speech directly from raw audio. It is based on an underlying Generative Spoken Language Model, and can thus work on languages without huge text corpora. They have also released a new approach to search, based on what they call ‘neural databases’, which could greatly improve results for complex queries.
- DeepMind released another ground-breaking approach to reinforcement learning called “Collect and Infer”, which dramatically improves the efficiency of RL approaches, requiring less “practice” to get to a solution.
“We would like our agents to leverage knowledge acquired in previous tasks to learn a new task more quickly, in the same way that a cook will have an easier time learning a new recipe than someone who has never prepared a dish before"
- Finally, one paper I encourage everyone to read – “A Farewell to the Bias-Variance Tradeoff?“ – on one of the conundrums I still struggle to fully understand: why is it that over-parameterised models (those which seem to have far too many parameters given the data set they are trained on) are able to generalise so well?
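To make the few-shot prompting idea above a little more concrete, here is a minimal sketch of how example pairs get concatenated into a single prompt for a large language model. Everything here is invented for illustration (the translation examples, the Q/A template) and no model is actually called – the “True Few-Shot Learning” paper’s point is precisely that which examples you pick for this template can bias the results.

```python
# Sketch of few-shot prompt construction: worked examples are simply
# concatenated ahead of the new question, and the model is left to
# continue the pattern. Template and examples are illustrative only.
def build_few_shot_prompt(examples, query):
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")  # model completes after the final "A:"
    return "\n\n".join(parts)

examples = [
    ("Translate 'chat' to English.", "cat"),
    ("Translate 'chien' to English.", "dog"),
]
prompt = build_few_shot_prompt(examples, "Translate 'oiseau' to English.")
```

The prompt string would then be sent to whichever model API you are using; the interesting research question is how sensitive the completion is to the choice and ordering of the example pairs.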
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
- Great article in Wired on the development of large language models outside of the US, and the English language
"What's surprising about these large language models is how much they know about how the world works simply from reading all the stuff that they can find"
- Interesting overview on current state of the art in “creative automation” (“the ability to generate original, high quality content leveraging data and technology”) – lots of fun things to try out!
- Google has released a new approach to “upscaling” photos – removing that pixelated effect – with impressive results.
- A new “magic carpet” developed at MIT can estimate human poses and activity simply from its tactile sensors.
- Quite fitting that DeepMind and the Met Office in the UK are taking rain forecasting to the next level!
- OpenAI has developed an approach that can summarise books of arbitrary length, using an elegant approach to a very complex task (paper here)
- Slightly more mundane…. “The new Roomba uses AI to avoid smearing dog poop all over your house” – good to know – and “AI assisted smoke impact analysis for California Winemakers“…
- SpaceCows …. you had me with the name! Tracking feral cattle and buffalo across Northern Australia (a 25,000 square kilometre area) with 25 nano satellites and ML image recognition
"It is a pioneering program that’s mixing responsible AI and science with indigenous led knowledge and solving complex environmental management problems at spots in Northern Australia"
- We don’t hear much from Amazon about their use of AI, although clearly they have very advanced applications across their business. This was an interesting post digging into the practical problem of how you help delivery workers find the actual entrance to a given residence, from noisy data.
- “In this project, we’ve trained physically simulated humanoids to play a simplified version of 2v2 football” …. and there’s video!
- And the Boston Dynamics robots continue to fascinate/scare in equal measure… they can now do Parkour!
"On the Atlas project, we use parkour as an experimental theme to study problems related to rapid behavior creation, dynamic locomotion, and connections between perception and control that allow the robot to adapt – quite literally – on the fly."
- An historic moment…. scikit-learn reaches version 1.0!
- Finally, really interesting background on developments in protein structure prediction after DeepMind’s AlphaFold announcement, and the concern that the underlying code might not be released.
"Everyone was floored, there was a lot of press, and then it was radio silence, basically. You’re in this weird situation where there’s been this major advance in your field, but you can’t build on it.”
How does that work?
A new section on understanding different approaches and techniques
- Hyper-parameter optimisation can often require more art than science if you don’t have a systematic approach- some useful tips here using Argo
- There are lots of different activation functions (defining the output from given inputs) you can use in neural networks, but which one should you use for a given task? Useful paper here.
- Interesting comparison: using meme search to explore the performance of different image encoders, in particular CLIP from OpenAI vs Google’s Big Transfer
- I’m not a massive fan of media-mix modelling (building models that optimise marketing expenditure based on historic performance) because it always feels like there is so much fundamentally missing in the underlying data sets. However, they can certainly be useful, and using a Bayesian approach would seem to be a good way to go (more detail here)
"The Bayesian approach allows prior knowledge to be elegantly incorporated into the model and quantified with the appropriate mathematical distributions."
- You have your model in production, but you need to make it faster…
- Getting into the nitty-gritty of ML compilers and optimisers
- How about parallelising your python code?
- More pointers from those who have done it! (“Scaling TensorFlow to 300 million predictions per second“)
- Finally, Knowledge Graphs – elegant ways to represent relationships between “all things”
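On the hyper-parameter point above – a systematic approach doesn’t have to mean heavy tooling. Here is a minimal random-search sketch using only the standard library; the objective function is a toy stand-in for a real validation loss, and the parameter names and log-uniform ranges are illustrative choices, not recommendations.

```python
import random

# Toy objective standing in for a model's validation loss as a function
# of two hyper-parameters (learning rate and regularisation strength).
# In practice this would train and evaluate a real model.
def validation_loss(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def random_search(n_trials=200, seed=42):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample log-uniformly: hyper-parameters like learning rates
        # typically matter on a log scale.
        lr = 10 ** rng.uniform(-4, 0)
        reg = 10 ** rng.uniform(-4, 0)
        loss = validation_loss(lr, reg)
        if best is None or loss < best[0]:
            best = (loss, lr, reg)
    return best

loss, lr, reg = random_search()
```

Random search like this is a surprisingly strong baseline before reaching for fancier Bayesian optimisers, and it is trivially parallelisable since every trial is independent.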
How to drive analytics and ML into production
- Useful pointers to bear in mind when you first start on an ML problem: “The First Rule of Machine Learning: Start without Machine Learning”
- Interesting take on how best to apply ML Ops in your organisation
"Companies that are starting with the problem first, improving on a defined metric and reach ML as a solution naturally are the ones that will treat their models as a continuously developing product”
- We’ve talked about “Data-centric AI” previously and are advocates…
- Here’s the story so far – a good summary from Stanford AI Lab
- And here’s more specifics on a key area – incorrect labels in your data sets (also here)
- Excellent summary here on semi-supervised learning, active-learning and human-in-the-loop approaches to enhancing your training data
- And some good pointers from MonteCarloData on how you can be more proactive in identifying underlying data issues
- It’s gathering momentum – there is now a workshop at NeurIPS
- Getting really practical… useful things to learn:
- Not a bad list in “Nine Tools I Wish I Mastered before My PhD in Machine Learning” (although some say pipenv is better than conda…)
- And awk… what’s not to like?
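The active-learning idea mentioned above (picking which examples are worth paying a human to label) can be sketched in a few lines: rank the unlabelled pool by how close the model’s predicted probability is to the decision boundary, and send the most uncertain items for labelling. The probabilities below are invented for illustration – in practice they would come from your current model.

```python
# Pool-based active learning via uncertainty sampling: the items the
# model is least sure about are the most informative to label next.
def most_uncertain(model_probs, k):
    # Uncertainty here = closeness of P(positive) to the 0.5 boundary
    ranked = sorted(range(len(model_probs)),
                    key=lambda i: abs(model_probs[i] - 0.5))
    return ranked[:k]

# Hypothetical predicted probabilities over an unlabelled pool
pool = [0.97, 0.51, 0.03, 0.44, 0.88, 0.50]
to_label = most_uncertain(pool, 2)  # indices to send for human labelling
```

After the new labels come back, you retrain and repeat – each round of labelling budget goes where the model is weakest rather than being spread uniformly.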
Bigger picture ideas
Longer, thought-provoking reads
"the modern data stack isn't enough. We have to create a modern data experience."
- If AI is based on machine learning systems, how can we make them “un-learn” something?
- We thought neurons were the simple building blocks of the brain – but they may be far more complex than we thought
"We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already—on its own—deep"
- Are we reaching diminishing returns in Deep Learning’s conquest of all ML challenges?
- How “Big Data” has driven graph theory to prominence
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
- Examining the use of punctuation in different novels
- Some fun music projects:
- Spotify has open-sourced its audio effects library, Pedalboard
- Going the whole way – music composition with Deep Learning!
- Building a smart robot AI with Hugging Face and Unity
What’s interesting with that system, contrary to classical game development, is that you don’t need to hard-code every interaction. Instead, you use a language model that selects which of the robot’s possible actions is the most appropriate given the user input.
- Feeling like a bigger challenge? You could always submit a blog post to ICLR...
Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers to review past work and summarize the outcomes, develop new intuitions, or highlight some shortcomings.
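The punctuation project above can be prototyped in a few lines with the standard library. The sample sentence here is just a stand-in for a full novel, which you might load from, say, a Project Gutenberg text file.

```python
from collections import Counter
import string

# Sketch of the punctuation-profiling idea: count punctuation marks in
# a text and normalise the counts into a frequency profile, so that
# different novels can be compared on the same scale.
def punctuation_profile(text):
    counts = Counter(ch for ch in text if ch in string.punctuation)
    total = sum(counts.values())
    return {mark: n / total for mark, n in counts.items()}

sample = "Reader, I married him. He was--how shall I put it?--difficult!"
profile = punctuation_profile(sample)
```

Comparing these profiles across authors (lots of dashes vs. lots of semicolons, say) is exactly the kind of stylistic fingerprinting the linked project explores.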
Although life seems to be returning to normal for many people in the UK, there is still lots of uncertainty on the Covid front… vaccinations keep progressing in the UK, which is good news, but we still have high community covid case levels due to the Delta variant…
- The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 85 people, which is still very high, but at least better than the 1 in 70 a month or so ago.
- Still lots of confusion over base-rates and metrics … this is quite a nuanced one, where the issue is with the underlying estimate of the unvaccinated population (because we don’t really know how many people live in the UK…)
- One of the best examples of the use of AI to provide tangible and practical help during the pandemic: reinforcement learning for testing at the Greek border (Nature paper here)
"By comparing Eva’s performance against modelled counterfactual scenarios, we show that Eva identified 1.85 times as many asymptomatic, infected travellers as random surveillance testing, with up to 2-4 times as many during peak travel, and 1.25-1.45 times as many asymptomatic, infected travellers as testing policies that only utilize epidemiological metrics."
Updates from Members and Contributors
- Many congratulations to Prithwis De, whose paper (“An Alternative Approach to Propensity Score Matching Technique in Real-World Evidence“) has been accepted for an upcoming data science publication from Springer.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
The views expressed are our own and do not necessarily represent those of the RSS