March Newsletter

Hi everyone-

Another month flies by – at least it finally seems to be getting a bit lighter in the mornings although I fear sunny spring days are still a way off… I imagine you are suffering withdrawal from a lack of dramatic Olympics Curling action so perhaps some thought-provoking data science reading materials to fill the void…

Following is the March edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. Check out our new ‘Jobs!’ section, an extra incentive to read to the end!

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science March 2022 Newsletter

RSS Data Science Section

Committee Activities

We have all been shocked and saddened by events in Ukraine and our thoughts and best wishes go out to everyone affected.

The committee is busy planning out our activities for the year with lots of exciting events and even hopefully some in-person socialising… Watch this space for upcoming announcements.

We are very pleased to announce that Jennifer Hall, Senior AI Lab Data Scientist at NHSX, and Will Browne, Associate Partner – Data Science & Analytics at CF Healthcare, are both joining the Data Science and AI Section committee. They bring a wealth of talent and experience in all aspects of data science and we are very much looking forward to their contributions across our various activities.

Florian Ostmann has been involved with recent developments of the AI Standards Hub pilot (led by the Alan Turing Institute, in partnership with BSI and NPL).

Anyone interested in presenting their latest developments and research at the Royal Statistical Society Conference? The organisers of this year’s event – which will take place in Aberdeen from 12-15 September – are calling for submissions for 20-minute and rapid-fire 5-minute talks to include on the programme. Submissions are welcome on any topic related to data science and statistics. Full details can be found here. The deadline for submissions is 5 April.

Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next one is on March 9th when Lucas Beyer, a Researcher at Google Brain Zurich, will discuss his research on “Learning General Visual Representations”. Videos are posted on the meetup YouTube channel – and future events will be posted here.

Help RSS to support the data science community

The Royal Statistical Society (RSS) is developing resources to support everyone working in data science to meet their learning and development goals and career objectives. If you have an interest in data science, we invite you to take part in this survey, whether or not you are a member of RSS. 

The survey should take around 15 minutes to complete. Your responses will be invaluable in helping us to understand and meet the wants and needs of the data science community, and to support your work in this exciting, fast-developing field.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

"The UK Statistics Authority have written to Downing St to advise them that the Prime Minister's claim that there are more people in work now than at the start of the pandemic is wrong. He has now made this claim 7 times but knows it is wrong! When will he correct the record?!."
"Those surveyed were asked: suppose there was a diagnostic test for a virus. The false-positive rate (the proportion of people without the virus who get a positive result) is one in 1,000. You have taken the test and tested positive. What is the probability that you have the virus? Of the politicians surveyed, 16 per cent gave the correct answer that there was not enough information to know."
The truth is AI failures are not a matter of if but when. AI is a human endeavor that combines information about people and the physical world into mathematical constructs. Such technologies typically rely on statistical methods, with the possibility for errors throughout an AI system’s lifespan. As AI systems become more widely used across domains, especially in high-stakes scenarios where people’s safety and wellbeing can be affected, a critical question must be addressed: how trustworthy are AI systems, and how much and when should people trust AI? 
We found two key through lines: Lawmakers and the public lack fundamental access to information about what algorithms their agencies are using, how they’re designed, and how significantly they influence decisions.
Tesla Chief Executive Officer Elon Musk said on Twitter "there were no safety issues" with the function. "The car simply slowed to ~2 mph & continued forward if clear view with no cars or pedestrians," Musk wrote.
To train InstructGPT, OpenAI hired 40 people to rate GPT-3’s responses to a range of prewritten prompts, such as, “Write a story about a wise frog called Julius” or “Write a creative ad for the following product to run on Facebook.” Responses that they judged to be more in line with the apparent intention of the prompt-writer were scored higher. Responses that contained sexual or violent language, denigrated a specific group of people, expressed an opinion, and so on, were marked down. This feedback was then used as the reward in a reinforcement learning algorithm that trained InstructGPT to match responses to prompts in ways that the judges preferred.
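The diagnostic-test question quoted above has no single answer because the false-positive rate alone does not determine the result: Bayes' theorem also needs the prevalence of the virus and the sensitivity of the test. The short Python sketch below illustrates this. The prevalence and sensitivity values are hypothetical, chosen purely for illustration; only the 1 in 1,000 false-positive rate comes from the survey question.

```python
def posterior_probability(prevalence, false_positive_rate, sensitivity):
    """Return P(has virus | positive test) using Bayes' theorem."""
    # Share of the population that is infected AND tests positive.
    true_positives = sensitivity * prevalence
    # Share of the population that is NOT infected but still tests positive.
    false_positives = false_positive_rate * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# The only figure given in the survey question.
FALSE_POSITIVE_RATE = 1 / 1000

# Hypothetical values (assumptions, not from the survey).
ASSUMED_SENSITIVITY = 0.9
ASSUMED_PREVALENCES = (1 / 10_000, 1 / 1_000, 1 / 25)

if __name__ == "__main__":
    for prevalence in ASSUMED_PREVALENCES:
        p = posterior_probability(
            prevalence, FALSE_POSITIVE_RATE, ASSUMED_SENSITIVITY
        )
        print(f"prevalence {prevalence:.4%}: P(virus | positive) = {p:.1%}")
```

With the same false-positive rate, the answer moves from well under a half to close to certainty as the assumed prevalence rises, which is why "not enough information" is the correct response.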

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

"Synthetically generated faces are not just highly photorealistic, they are nearly indistinguishable from real faces and are judged more trustworthy"
  • The researchers at Facebook/Meta have been busy:
    • They have built their own super-computer, dubbed the AI Research Super Cluster...
    • They have developed a Natural Language Processing (NLP) approach that does not use text or labels at all – it is able to learn directly from raw audio signals- pretty astonishing!
"GSLM leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text. It opens the door to a new era of textless NLP applications for potentially every language spoken on Earth—even those without significant text data sets."
“People can flexibly maneuver objects in their physical surroundings to accomplish various goals. One of the grand challenges in robotics is to successfully train robots to do the same, i.e., to develop a general-purpose robot capable of performing a multitude of tasks based on arbitrary user commands”

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

“It’s an incredibly powerful method,” says Jonathan Citrin at the Dutch Institute for Fundamental Energy Research, who was not involved in the work. “It’s an important first step in a very exciting direction.”
“Outracing human drivers so skillfully in a head-to-head competition represents a landmark achievement for AI,” said Chris Gerdes, a professor at Stanford who studies autonomous driving, in an article published on Wednesday alongside the Sony research in the journal Nature.
"To help clinicians avoid remedies that may potentially contribute to a patient’s death, researchers at MIT and elsewhere have developed a machine-learning model that could be used to identify treatments that pose a higher risk than other options"

How does that work?
A new section on understanding different approaches and techniques

"A* is a modification of Dijkstra’s Algorithm that is optimized for a single destination. Dijkstra’s Algorithm can find paths to all locations; A* finds paths to one location, or the closest of several locations. It prioritizes paths that seem to be leading closer to a goal."
"Vector databases are purpose-built to store, index, and query across embedding vectors generated by passing unstructured data through machine learning models."
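The A* description quoted above can be made concrete with a small example. The following is a minimal sketch, assuming a 4-connected grid where every step costs 1 and `#` marks a wall. The grid, the function name `a_star` and the Manhattan-distance heuristic are illustrative choices, not taken from the linked article.

```python
import heapq


def a_star(grid, start, goal):
    """Return a shortest path (list of (row, col) cells) or None if unreachable."""

    def heuristic(cell):
        # Manhattan distance: never overestimates on a 4-connected unit-cost grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    rows, cols = len(grid), len(grid[0])
    # Priority queue ordered by (cost so far + heuristic estimate to the goal).
    frontier = [(heuristic(start), 0, start)]
    came_from = {start: None}
    cost_so_far = {start: 0}

    while frontier:
        _, cost, current = heapq.heappop(frontier)
        if current == goal:
            # Walk back through came_from to rebuild the path.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale queue entry; a cheaper route was found already
        r, c = current
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or grid[nr][nc] == "#":
                continue
            new_cost = cost + 1
            if new_cost < cost_so_far.get((nr, nc), float("inf")):
                cost_so_far[(nr, nc)] = new_cost
                came_from[(nr, nc)] = current
                priority = new_cost + heuristic((nr, nc))
                heapq.heappush(frontier, (priority, new_cost, (nr, nc)))
    return None


# Hypothetical example grid: '.' is open floor, '#' is a wall.
EXAMPLE_GRID = [
    "....",
    ".##.",
    "....",
]

if __name__ == "__main__":
    print(a_star(EXAMPLE_GRID, (0, 0), (2, 3)))
```

If the heuristic always returned 0, the same loop would behave like Dijkstra's Algorithm restricted to a single destination; the heuristic is what makes the search favour cells that appear to lead towards the goal.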

Practical tips
How to drive analytics and ML into production

"Professor: “Yes, outstanding. However, you failed to ask me what metrics I used to grade your model. Your opinion of model quality doesn’t matter. It’s your users’ needs that do.”"

Bigger picture ideas
Longer thought-provoking reads – lean back and pour a drink!

"A lot has happened in the past half century! The eight ideas reviewed below represent a categorization based on our experiences and reading of the literature and are not listed in a chronological order or in order of importance. They are separate concepts capturing different useful and general developments in statistics."
  • There are lots of “here are all the problems with statistical significance” type articles out there, but the visual examples in this one make it more compelling than many
"You can have a minuscule effect size and still have a significant effect. Do we always prefer the (c) to the (a)? Is a meager, but mostly positive benefit necessarily better than a treatment potentially of large benefit to some but harmful to others necessarily? Wouldn’t it be in our interest to understand this spread of outcomes so we could isolate the group of individuals who benefit from the treatment?"
"So consider this Deep Blue’s final gift, 25 years after its famous match. In his defeat, Kasparov spied the real endgame for AI and humans. “We will increasingly become managers of algorithms,” he told me, “and use them to boost our creative output—our adventuresome souls.”"
But in the future, he says, systems will be needed that can handle all other scenarios as well: “It’s not just about the trajectory of a missile or the movement of a robotic arm, which can be modeled through careful mathematics. It’s about everything else, everything we observe in the world: About human behavior, about physical systems that involve collective phenomena like water or branches in a tree, about complex things for which humans can easily develop abstract representations and models,” LeCun said.

Bringing data to life – the art and science of visualisation
Leland Wilkinson, author of The Grammar of Graphics, sadly passed away at the end of last year. Hadley Wickham created ggplot2 as a way to implement the ideas contained in this formative work (gg = grammar of graphics) and I know I, for one, have been heavily influenced by it in how I think about visualisation. In memory of Leland I thought it would be fitting to call out some recent articles of interest in the field.

"The problem with guidelines based on precision is that visualization is not really about precision. Sure, there are cases where precision matters because it allows readers to detect important differences that would otherwise be missed. But visualization is less about precision, and much more about what the visual representation expresses."

Covid Corner

Well, apparently Covid is now all over according to the UK government, or at least there is no need for any more restrictions…

  • Given the government is removing requirements and incentives to test for Covid, the ONS Coronavirus infection survey is now one of the only ways we can tell the prevalence of the virus in our society.
  • The latest results estimate 1 in 25 people (4%) in England have Covid. While this is down from its peak of 1 in 15 in January it is still a long way from the 1 in 1000 we had last summer. Bear in mind in the chart below that the levels we had in February 2021 were enough to drive a national lockdown …

Updates from Members and Contributors

  • Jona Shehu and her colleagues at Helix Data Innovation are hosting what looks to be a high quality and relevant online roundtable on model explainability with leaders across the AI, finance, consumer rights and data governance sectors. The event is on March 15th (11-12.30) and is free to attend. Register here
  • Kevin OBrien highlights the inaugural SciMLCon (of the Scientific Machine Learning Open Source Software Community) taking place online on Wednesday 23rd March 2022. Core topics include: Physics-Informed Model Discovery and Learning, Compiler-Assisted Model Analysis and Sparsity Acceleration, ML-Assisted Tooling for Model Acceleration and many more. SciMLCon is focused on the development and applications of the Julia-based SciML tooling, with expansion into R and Python planned in the near future.
  • Maria Rosario Mestre is CEO of DataQA which offers tools to search, label and organise unstructured documents: sounds very useful! They are currently enrolling beta customers for the first release of the platform which includes a free trial so could be well worth checking out.


Jobs!

A new section highlighting relevant job openings across the Data Science and AI community (let us know if you have anything you’d like to post here…)

  • Holisticai, a startup focused on providing insight, assessment and mitigation of AI risk, has a number of relevant AI related job openings- see here for more details
  • EvolutionAI are looking for a machine learning research engineer to develop their award winning AI-powered data extraction platform, putting state of the art deep learning technology into production use. Strong background in machine learning and statistics required.
  • AstraZeneca are looking for a Data Science Training Developer – more details here
  • Lloyd's Register are looking for a data analyst to work across the Foundation with a broad range of safety data to inform the future direction of challenge areas and provide society with evidence-based information.
  • Cazoo is looking for a number of senior data engineers – great modern stack and really interesting projects!

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
