I guess summer is over, what there was of it. I was hoping we might get a bit of autumn sunshine, but it feels like it’s big-coat weather already… definitely time for some tasty data science reading material in front of a warm fire!
Following is the October edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science October 2021 Newsletter
RSS Data Science Section
We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.
First of all, we have a new name… Data Science and AI Section! To be honest, we’ve always talked about machine learning and artificial intelligence, and have some very experienced practitioners both on the committee and in our network, so it doesn’t really change our focus. It is nice to have it officially recognised by the RSS though.
Thank you all for taking the time to fill in our survey responding to the UK Government’s proposed AI Strategy. As you may have seen, Martin Goodson, our chair, summarised some of the findings in a recent post, highlighting the significant gaps in the government’s proposed approach based on your comments. Some of these gaps, particularly on open-source, have now been publicly acknowledged, multiple times. In addition, Martin and Jim Weatherall met with Sana Khareghani (Director of the Office for AI) and Tabitha Goldstaub (Chair of the AI Council) to further advocate for our community’s needs. Sana agreed that the Office for AI will run workshops together with the RSS, focused on the technical practitioner community, in order to gain their perspective and identify their needs.
“Confessions of a Data Scientist” seemed to go down very well at the recent RSS conference- massive thanks to Louisa Nolan for making it so successful, and to you all for your contributions.
Of course, the RSS never sleeps… so preparation for next year’s conference, which will take place in Aberdeen, Scotland from 12-15 September 2022, is already underway. The RSS is inviting proposals for invited topic sessions. These are put together by an individual, group of individuals or an organisation with a set of speakers who they invite to speak on a particular topic. The conference provides one of the best opportunities in the UK for anyone interested in statistics and data science to come together to share knowledge and network. Deadline for proposals is November 18th.
Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The last talk was on September 7th, where Thomas Kipf, Research Scientist at Google Research in the Brain Team in Amsterdam, discussed “Relational Structure Discovery“. Videos are posted on the meetup YouTube channel – and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Last month in the ethics section we mentioned the recent Australian court case, where an AI was recognised as the inventor in a patent. This issue is now coming to a head in different jurisdictions, including the US and UK: useful summaries from The Register and The Verge, with a more in-depth look at the legal side from Pinsent Masons here
- Good to see Margaret Mitchell, the former co-head of Google’s Ethical AI research group, has successfully moved on from her controversial ousting (covered in previous newsletters) and taken up a role at Hugging Face, a well-respected open-source AI community
- The UN High Commissioner for Human Rights released an urgent call for action around AI risks to privacy
“Artificial intelligence can be a force for good, helping societies overcome some of the great challenges of our times. But AI technologies can have negative, even catastrophic, effects if they are used without sufficient regard to how they affect people’s human rights”
- Meanwhile, new applications of AI highlight these risks:
- MIT Technology Review reveals how porn deepfakes are now accessible at the click of a button
- Facebook is apparently launching augmented reality glasses, which, not surprisingly, has triggered all sorts of privacy concerns…
- And of course we know how fragile some of these systems are to adversarial attack (tricking the AI with something that is obvious to a human). It seems you can now avoid facial recognition with some simple makeup.
- Although recent research into the attitudes of ML researchers highlights ethical concerns about applications in a number of industries, including the military, this hasn’t stopped rapid development in this space. Indeed, the NY Times reports on the AI-assisted killing of an Iranian nuclear scientist by Israeli agents.
- In a similar vein, this interesting behind the scenes exposé of Deep Mind’s struggle for control within Google, highlights the concerns of ML researchers at the potential applications of their work.
- Medical AI is rapidly developing, and there are questions as to whether and how medical practitioners keep up to speed with the pros and cons of different approaches and the ethical challenges that arise – a good paper talking through the issues
- Recommendation systems are everywhere and increasingly simple to implement.
- They are tuned to optimise particular metrics as we have discussed in previous newsletters- so if engagement or ‘attention’ is the metric of choice (often the case as it drives profitability at many social networks) they will naturally surface more ‘attention grabbing’ material, which is more likely to be contentious
- Facebook has tacitly admitted that engagement might not be the most appropriate measure to optimise – which would be big news if they move in that direction
"Depoliticizing people’s feeds makes sense for a company that is perpetually in hot water for its alleged impact on politics"
- Transparency is a key component to avoiding bias and reducing ethical concerns and we have a couple of positive examples from leading firms this month
- Twitter has come up with a novel approach to identifying algorithmic bias – rewards for identifying bias in their systems (like bug bounties in software)
- And YouTube has released a simple guide to how their recommendation system works with another tacit admission that engagement might not be the best measure to optimise…
"We don’t want viewers regretting the videos they spend time watching and realized we needed to do even more to measure how much value you get from your time on YouTube."
Developments in Data Science…
As always, lots of new developments… thought we’d have a more extended look at some of the new research this month
- Plenty of great arXiv papers out there this month- I know these can be a bit dry, so will try and give a bit of context…
- One theme of research we have been following is “fewer-shot” training of models. Fundamentally, humans don’t need millions of examples of an orange before being able to identify one, so learning from limited examples should be possible. Large language models like GPT-3 have shown great promise in this area, where, given a few “prompts” (question and answer examples), they seem to be able to provide remarkable results on this type of problem. Sadly, this paper, “True Few-Shot Learning”, suggests we need a more standardised approach to example selection, as previous results may have been artificially inflated by biased approaches.
- More positively, “Can you learn an algorithm” talks through recent research showing that simple recurrent neural networks can learn approaches that can be successfully applied to larger scale problems, just as humans can learn from toy examples. Similarly, a new sequence to sequence learning approach from MIT CSAIL includes a component that learns “grammar” across examples.
- Another popular research theme is simplifying architecture and reducing processing. A team at Google Brain have shown (“Pay Attention to MLPs“) that you can almost replicate the performance of transformers (a more complex deep learning architecture) with a simpler approach based on basic building blocks (multi-layer perceptrons)
- GANs (generative adversarial networks) are pretty cool – they generate new similar-looking examples from input data (see here for an intro). A recent paper (GAN’s N’ Roses) takes this to a new level, generating stable video from an input and a theme. (“GAN’s N’ Roses” is clearly a popular meme – this tutorial predates the paper by 4 years!)
- Of course the big industrial research powerhouses (Google/DeepMind, Facebook etc.) keep churning out fantastic work:
- Facebook released textless-NLP, which generates speech directly from raw audio. It is based on an underlying Generative Spoken Language Model, and can thus work on languages without huge text corpora. They have also released a new approach to search, based on what they call ‘neural databases’, which could greatly improve results for complex queries.
- DeepMind released another ground-breaking approach to reinforcement learning called “Collect and Infer”, which dramatically improves the efficiency of RL approaches, requiring less “practice” to get to a solution.
“We would like our agents to leverage knowledge acquired in previous tasks to learn a new task more quickly, in the same way that a cook will have an easier time learning a new recipe than someone who has never prepared a dish before"
- Finally, one paper I encourage everyone to read – “A Farewell to the Bias-Variance Tradeoff?“ – on one of the conundrums I still struggle to fully understand: why is it that over-parameterised models (those which seem to have far too many parameters given the data set they are trained on) are able to generalise so well?
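To make the few-shot prompting idea above a little more concrete, here is a minimal sketch of how example pairs get concatenated into a single prompt for a large language model. Everything here is invented for illustration (the translation examples, the Q/A template) and no model is actually called – the “True Few-Shot Learning” paper’s point is precisely that which examples you pick for this template can bias the results.

```python
# Sketch of few-shot prompt construction: worked examples are simply
# concatenated ahead of the new question, and the model is left to
# continue the pattern. Template and examples are illustrative only.
def build_few_shot_prompt(examples, query):
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")  # model completes after the final "A:"
    return "\n\n".join(parts)

examples = [
    ("Translate 'chat' to English.", "cat"),
    ("Translate 'chien' to English.", "dog"),
]
prompt = build_few_shot_prompt(examples, "Translate 'oiseau' to English.")
```

The prompt string would then be sent to whichever model API you are using; the interesting research question is how sensitive the completion is to the choice and ordering of the example pairs.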
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
- Great article in Wired on the development of large language models outside of the US, and the English language
"What's surprising about these large language models is how much they know about how the world works simply from reading all the stuff that they can find"
- Interesting overview on current state of the art in “creative automation” (“the ability to generate original, high quality content leveraging data and technology”) – lots of fun things to try out!
- Google has released a new approach to “upscaling” photos – removing that pixelated effect – with impressive results.
- A new “magic carpet” developed at MIT can estimate human poses and activity simply from its tactile sensors.
- Quite fitting that DeepMind and the Met Office in the UK are taking rain forecasting to the next level!
- OpenAI has developed an approach that can summarise books of arbitrary length, using an elegant approach to a very complex task (paper here)
- Slightly more mundane…. “The new Roomba uses AI to avoid smearing dog poop all over your house” – good to know – and “AI assisted smoke impact analysis for California Winemakers“…
- SpaceCows …. you had me with the name! Tracking feral cattle and buffalo across Northern Australia (a 25,000 square kilometre area) with 25 nano satellites and ML image recognition
"It is a pioneering program that’s mixing responsible AI and science with indigenous led knowledge and solving complex environmental management problems at spots in Northern Australia"
- We don’t hear much from Amazon about their use of AI, although clearly they have very advanced applications across their business. This was an interesting post digging into the practical problem of how you help delivery workers find the actual entrance to a given residence, from noisy data.
- “In this project, we’ve trained physically simulated humanoids to play a simplified version of 2v2 football” …. and there’s video!
- And the Boston Dynamics robots continue to fascinate/scare in equal measure… they can now do Parkour!
"On the Atlas project, we use parkour as an experimental theme to study problems related to rapid behavior creation, dynamic locomotion, and connections between perception and control that allow the robot to adapt – quite literally – on the fly."
- An historic moment…. scikit-learn reaches version 1.0!
- Finally, really interesting background on developments in protein structure prediction after DeepMind’s AlphaFold announcement, and the concern that the underlying code might not be released.
"Everyone was floored, there was a lot of press, and then it was radio silence, basically. You’re in this weird situation where there’s been this major advance in your field, but you can’t build on it.”
How does that work?
A new section on understanding different approaches and techniques
- Hyper-parameter optimisation can often require more art than science if you don’t have a systematic approach- some useful tips here using Argo
- There are lots of different activation functions (defining the output from given inputs) you can use in neural networks, but which one should you use for a given task? Useful paper here.
- Interesting comparison: using meme search to explore the performance of different image encoders, in particular CLIP from OpenAI vs Google’s Big Transfer
- I’m not a massive fan of media-mix modelling (building models that optimise marketing expenditure based on historic performance) because it always feels like there is so much fundamentally missing in the underlying data sets. However, they can certainly be useful, and using a Bayesian approach would seem to be a good way to go (more detail here)
"The Bayesian approach allows prior knowledge to be elegantly incorporated into the model and quantified with the appropriate mathematical distributions."
- You have your model in production, but you need to make it faster…
- Getting into the nitty-gritty of ML compilers and optimisers
- How about parallelising your python code?
- More pointers from those who have done it! (“Scaling TensorFlow to 300 million predictions per second“)
- Finally, Knowledge Graphs – elegant ways to represent relationships between “all things”
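On the hyper-parameter point above – a systematic approach doesn’t have to mean heavy tooling. Here is a minimal random-search sketch using only the standard library; the objective function is a toy stand-in for a real validation loss, and the parameter names and log-uniform ranges are illustrative choices, not recommendations.

```python
import random

# Toy objective standing in for a model's validation loss as a function
# of two hyper-parameters (learning rate and regularisation strength).
# In practice this would train and evaluate a real model.
def validation_loss(lr, reg):
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

def random_search(n_trials=200, seed=42):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        # Sample log-uniformly: hyper-parameters like learning rates
        # typically matter on a log scale.
        lr = 10 ** rng.uniform(-4, 0)
        reg = 10 ** rng.uniform(-4, 0)
        loss = validation_loss(lr, reg)
        if best is None or loss < best[0]:
            best = (loss, lr, reg)
    return best

loss, lr, reg = random_search()
```

Random search like this is a surprisingly strong baseline before reaching for fancier Bayesian optimisers, and it is trivially parallelisable since every trial is independent.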
How to drive analytics and ML into production
- Useful pointers to bear in mind when you first start on an ML problem: “The First Rule of Machine Learning: Start without Machine Learning”
- Interesting take on how best to apply ML Ops in your organisation
"Companies that are starting with the problem first, improving on a defined metric and reach ML as a solution naturally are the ones that will treat their models as a continuously developing product”
- We’ve talked about “Data-centric AI” previously and are advocates…
- Here’s the story so far – a good summary from Stanford AI Lab
- And here’s more specifics on a key area – incorrect labels in your data sets (also here)
- Excellent summary here on semi-supervised learning, active-learning and human-in-the-loop approaches to enhancing your training data
- And some good pointers from MonteCarloData on how you can be more proactive in identifying underlying data issues
- It’s gathering momentum – there is now a workshop at NeurIPS
- Getting really practical… useful things to learn:
- Not a bad list in “Nine Tools I Wish I Mastered before My PhD in Machine Learning” (although some say pipenv is better than conda…)
- And awk… what’s not to like?
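The active-learning idea mentioned above (picking which examples are worth paying a human to label) can be sketched in a few lines: rank the unlabelled pool by how close the model’s predicted probability is to the decision boundary, and send the most uncertain items for labelling. The probabilities below are invented for illustration – in practice they would come from your current model.

```python
# Pool-based active learning via uncertainty sampling: the items the
# model is least sure about are the most informative to label next.
def most_uncertain(model_probs, k):
    # Uncertainty here = closeness of P(positive) to the 0.5 boundary
    ranked = sorted(range(len(model_probs)),
                    key=lambda i: abs(model_probs[i] - 0.5))
    return ranked[:k]

# Hypothetical predicted probabilities over an unlabelled pool
pool = [0.97, 0.51, 0.03, 0.44, 0.88, 0.50]
to_label = most_uncertain(pool, 2)  # indices to send for human labelling
```

After the new labels come back, you retrain and repeat – each round of labelling budget goes where the model is weakest rather than being spread uniformly.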
Bigger picture ideas
Longer, thought-provoking reads
"the modern data stack isn't enough. We have to create a modern data experience."
- If AI is based on machine learning systems, how can we make them “un-learn” something?
- We thought neurons were the simple building blocks of the brain – but they may be far more complex than we thought
"We call for the replacement of the deep network technology to make it closer to how the brain works by replacing each simple unit in the deep network today with a unit that represents a neuron, which is already—on its own—deep"
- Are we reaching diminishing returns in Deep Learning’s conquest of all ML challenges?
- How “Big Data” has driven graph theory to prominence
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
- Examining the use of punctuation in different novels
- Some fun music projects:
- Spotify has open-sourced its audio effects library, Pedalboard
- Going the whole way – music composition with Deep Learning!
- Building a smart robot AI with Hugging Face and Unity
What’s interesting with that system, contrary to classical game development, is that you don’t need to hard-code every interaction. Instead, you use a language model that selects which of the robot’s possible actions is the most appropriate given the user input.
- Feeling like a bigger challenge? You could always submit a blog post to ICLR...
Our goal is to create a formal call for blog posts at ICLR to incentivize and reward researchers to review past work and summarize the outcomes, develop new intuitions, or highlight some shortcomings.
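The punctuation project above can be prototyped in a few lines with the standard library. The sample sentence here is just a stand-in for a full novel, which you might load from, say, a Project Gutenberg text file.

```python
from collections import Counter
import string

# Sketch of the punctuation-profiling idea: count punctuation marks in
# a text and normalise the counts into a frequency profile, so that
# different novels can be compared on the same scale.
def punctuation_profile(text):
    counts = Counter(ch for ch in text if ch in string.punctuation)
    total = sum(counts.values())
    return {mark: n / total for mark, n in counts.items()}

sample = "Reader, I married him. He was--how shall I put it?--difficult!"
profile = punctuation_profile(sample)
```

Comparing these profiles across authors (lots of dashes vs. lots of semicolons, say) is exactly the kind of stylistic fingerprinting the linked project explores.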
Although life seems to be returning to normal for many people in the UK, there is still lots of uncertainty on the Covid front… vaccinations keep progressing in the UK, which is good news, but we still have high community covid case levels due to the Delta variant…
- The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 85 people, which is still very high, but at least better than the 1 in 70 a month or so ago.
- Still lots of confusion over base-rates and metrics … this is quite a nuanced one, where the issue is with the underlying estimate of the unvaccinated population (because we don’t really know how many people live in the UK…)
- One of the best examples of the use of AI to provide tangible and practical help during the pandemic: reinforcement learning for testing at the Greek border (Nature paper here)
"By comparing Eva’s performance against modelled counterfactual scenarios, we show that Eva identified 1.85 times as many asymptomatic, infected travellers as random surveillance testing, with up to 2-4 times as many during peak travel, and 1.25-1.45 times as many asymptomatic, infected travellers as testing policies that only utilize epidemiological metrics."
Updates from Members and Contributors
- Many congratulations to Prithwis De, whose paper (“An Alternative Approach to Propensity Score Matching Technique in Real-World Evidence“) has been accepted for an upcoming data science publication from Springer.
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
The views expressed are our own and do not necessarily represent those of the RSS