September Newsletter

Hi everyone-

I hope you have all been enjoying a great summer. Certainly lots to engage with from heat waves, sewage spills, leadership elections, spiralling energy costs… and of course on a much more positive note the Lionesses winning the Euros for the first time (it’s come home…)! Apologies for skipping a month but it does mean we have plenty to talk about so prepare for a somewhat longer than normal read…

Following is the September edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science September 2022 Newsletter

RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in the Alliance for Data Science Professionals, a joint initiative between the RSS and various other relevant organisations in defining standards for data scientist accreditation. The first tranche of data scientists to complete the newly defined standard of professionalism received their awards at a special ceremony at the Royal Society in July. The UK’s National Statistician welcomed the initiative.

Our recent event “From paper to pitch, success in academic/industry collaboration”, which took place on Wednesday 20th July, was very successful with strong attendance and a thought-provoking and interactive discussion- many thanks to Will Browne for organising. We will write up a summary and publish shortly.

We are also excited to announce our next event, catchily titled “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”, which will be held on Wednesday 21 September 2022, 7.00PM – 8.00PM. Sign up here to hear leading figures such as Dr David Barber (Director of the UCL Centre for Artificial Intelligence) and Professor Noam Shemtov (Intellectual Property and Technology Law at Queen Mary University of London) in what should be an excellent discussion.

The RSS 2022 Conference is rapidly approaching (12-15 September in Aberdeen). The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on September 14th when Gwanghyun Kim, Ph.D. student at Seoul National University (SNU), will discuss “Text-Guided Diffusion Models for Robust Image Manipulation”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • We have to acknowledge that many of the new AI tools are astonishing both in their performance and their sophistication and that it is incredibly hard if not impossible to eliminate all mistakes. However, applying best practice and using high quality data sets should be at the core of all work in this area.
"“They were claiming near-perfect accuracy, but we found that in each of these cases, there was an error in the machine-learning pipeline,” says Kapoor."
"The new Vehicle General Safety Regulation starts applying today. It introduces a range of mandatory advanced driver assistant systems to improve road safety and establishes the legal framework for the approval of automated and fully driverless vehicles in the EU"
"Facebook’s stated mission is “to give people the power to build community and bring the world closer together.” But a deeper look at their business model suggests that it is far more profitable to drive us apart. By creating “filter bubbles”—social media algorithms designed to increase engagement and, consequently, create echo chambers where the most inflammatory content achieves the greatest visibility—Facebook profits from the proliferation of extremism, bullying, hate speech, disinformation, conspiracy theory, and rhetorical violence"
“We remain committed to protecting our users against improper government demands for data, and we will continue to oppose demands that are overly broad or otherwise legally objectionable,” Ms. Fitzpatrick wrote.
"He posted again on Twitter later in the day, saying: "apparently, this exploit happened because the gov developer wrote a tech blog on CSDN and accidentally included the credentials", referring to the China Software Developer Network."
"This paper first discusses what humanoid robots are, why and how humans tend to anthropomorphise them, and what the literature says about robots crowding out human relations. It then explains the ideal of becoming “fully human”, which pertains to being particularly moral in character."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

  • Solving proteins…
    • We have previously discussed the groundbreaking work of DeepMind in solving the protein folding problem with AlphaFold, which can generate the estimated 3D structure for any protein. They have now gone a step further and publicly released the structures of over 200m proteins
    • Lots of background and commentary on this groundbreaking step here and here
"Prof Dame Janet Thornton, the group leader and senior scientist at the European Molecular Biology Laboratory’s European Bioinformatics Institute, said: “AlphaFold protein structure predictions are already being used in a myriad of ways. I expect that this latest update will trigger an avalanche of new and exciting discoveries in the months and years ahead, and this is all thanks to the fact that the data are available openly for all to use.”"
Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions
  • Finally, another phenomenon I find pretty extraordinary… “Grokking”, where model performance improves after seemingly over-fitting. Researchers at Apple give the full story
"The grokking phenomenon as reported by Power et al. ( arXiv:2201.02177 ) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

Many real-world machine learning problems can be framed as graph problems. On online platforms, users often share assets (e.g. photos) and interact with each other (e.g. messages, bookings, reviews). These connections between users naturally form edges that can be used to create a graph.

However, in many cases, machine learning practitioners do not leverage these connections when building machine learning models, and instead treat nodes (in this case, users) as completely independent entities. While this does simplify things, leaving out information around a node’s connections may reduce model performance by ignoring where this node is in the context of the overall graph.
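
To make the idea concrete, here is a minimal sketch (not taken from the linked article) of turning an interaction log into a graph and deriving a couple of per-user features from each user's connections. The user names, the `flagged` set and the feature names are all hypothetical and only there for illustration.

```python
from collections import defaultdict

# Hypothetical interaction log: each pair is two users who interacted
# (e.g. a message, booking or review). Names are made up for illustration.
interactions = [
    ("alice", "bob"),
    ("alice", "carol"),
    ("bob", "carol"),
    ("dave", "erin"),
]


def build_graph(edges):
    """Return an undirected adjacency map: user -> set of connected users."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def neighbour_features(graph, flagged):
    """For each user, count connections and how many neighbours are flagged."""
    return {
        user: {
            "degree": len(neighbours),
            "flagged_neighbours": len(neighbours & flagged),
        }
        for user, neighbours in graph.items()
    }


graph = build_graph(interactions)
# "flagged" stands in for any label known for some users (an assumption).
features = neighbour_features(graph, flagged={"carol"})
```

Features like these could then sit alongside the usual per-user columns, so the model is no longer treating each user as a completely independent entity.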

How does that work?
Tutorials and deep dives on different approaches and techniques

"To conclude: we have shown that for in the presence of (many) irrelevant variables, RF performance suffers and something needs to be done. This can be either tuning the RF, most importantly increasing the mtry parameter, or identifying and removing the irrelevant features using the RFE procedure rfe() part of the caret package in R. Selecting only relevant features has the added advantage of providing insight into which features contain the signal."
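
A back-of-the-envelope way to see why mtry matters here: if each split only considers mtry randomly chosen candidate features, the chance that at least one of them carries signal shrinks as irrelevant features pile up. The sketch below is not from the quoted post; it assumes candidates are drawn without replacement at each split, and the feature counts are made up.

```python
from math import comb


def prob_split_sees_signal(n_features, n_relevant, mtry):
    """Probability that at least one relevant feature is among the mtry
    candidates drawn (without replacement) for a single split."""
    n_irrelevant = n_features - n_relevant
    # comb returns 0 when mtry > n_irrelevant, giving probability 1.
    return 1 - comb(n_irrelevant, mtry) / comb(n_features, mtry)


# Hypothetical setting: 5 informative features hidden among 100.
for mtry in (10, 33, 60):
    print(mtry, round(prob_split_sees_signal(100, 5, mtry), 3))
```

Raising mtry, or removing irrelevant features so that the relevant ones make up a larger share, both push this probability up, which is consistent with the two remedies suggested in the quote.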
"Text Embeddings give you the ability to turn unstructured text data into a structured form. With embeddings, you can compare two or more pieces of text, be it single words, sentences, paragraphs, or even longer documents. And since these are sets of numbers, the ways you can process and extract insights from them are limited only by your imagination."

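As a small illustration of comparing texts through their embeddings, the sketch below uses cosine similarity, one common choice of comparison. The vectors are invented four-dimensional stand-ins; in practice they would come from an embedding model, and none of the numbers here come from the quoted article.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings (assumption): real ones have hundreds of dimensions.
embeddings = {
    "cat sat on the mat": [0.9, 0.1, 0.0, 0.3],
    "a kitten rests on a rug": [0.8, 0.2, 0.1, 0.4],
    "quarterly revenue rose": [0.0, 0.9, 0.7, 0.1],
}


def most_similar(query, embeddings):
    """Return the other text whose embedding is closest to the query's."""
    others = (text for text in embeddings if text != query)
    return max(
        others,
        key=lambda text: cosine_similarity(embeddings[query], embeddings[text]),
    )


closest = most_similar("cat sat on the mat", embeddings)
```
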
Practical tips
How to drive analytics and ML into production

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink! …

"As these LLMs become more common and powerful, there seems to be less and less agreement over how we should understand them. These systems have bested many “common sense” linguistic reasoning benchmarks over the years, many which promised to be conquerable only by a machine that “is thinking in the full-bodied sense we usually reserve for people.” Yet these systems rarely seem to have the common sense promised when they defeat the test and are usually still prone to blatant nonsense, non sequiturs and dangerous advice. This leads to a troubling question: how can these systems be so smart, yet also seem so limited?"
"To be sure, there are indeed some ways in which AI truly is making progress—synthetic images look more and more realistic, and speech recognition can often work in noisy environments—but we are still light-years away from general purpose, human-level AI that can understand the true meanings of articles and videos, or deal with unexpected obstacles and interruptions. We are still stuck on precisely the same challenges that academic scientists (including myself) having been pointing out for years: getting AI to be reliable and getting it to cope with unusual circumstances."
"Ongoing debates about whether large pre-trained models understand text and images are complicated by the fact that scientists and philosophers themselves disagree about the nature of linguistic and visual understanding in creatures like us. Many researchers have emphasized the importance of “grounding” for understanding, but this term can encompass a number of different ideas. These might include having appropriate connections between linguistic and perceptual representations, anchoring these in the real world through causal interaction, and modeling communicative intentions. Some also have the intuition that true understanding requires consciousness, while others prefer to think of these as two distinct issues. No surprise there is a looming risk of researchers talking past each other."
Liberating the world’s scientific knowledge from the twin barriers of accessibility and understandability will help drive the transition from a web focused on clicks, views, likes, and attention to one focused on evidence, data, and veracity. Pharma is clearly incentivized to bring this to fruition, hence the growing number of startups identifying potential drug targets using AI — but I believe the public, governments, and anyone using Google might be willing to forgo free searches in an effort for trust and time-saving. The world desperately needs such a system, and it needs it fast
"To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data.  (GPT-3 did it, and everyone else followed.)

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got[6].

People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far Chinchilla did."
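
For anyone who wants to check the arithmetic behind that claim, here is a rough sketch. The quote does not spell out "our equation", so the functional form and the constants below are assumptions: they are the parametric loss fit as I recall it from the Chinchilla paper, along with Chinchilla's roughly 70B parameters and 1.4T training tokens.

```python
# Assumed parametric fit: loss = E + A / N**ALPHA + B / D**BETA,
# where N is the parameter count and D the number of training tokens.
# Constants are quoted from memory of the Chinchilla paper (assumption).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def predicted_loss(n_params, n_tokens):
    """Loss predicted by the assumed fit for a given model and data size."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def loss_floor_for_data(n_tokens):
    """Limit of predicted_loss as the model size grows without bound."""
    return E + B / n_tokens**BETA


# Chinchilla: about 70B parameters trained on about 1.4T tokens (assumption).
chinchilla = predicted_loss(70e9, 1.4e12)
# The best any model could do, under this fit, with only 300B tokens.
floor_300b = loss_floor_for_data(300e9)

print(f"Chinchilla: {chinchilla:.3f}, 300B-token floor: {floor_300b:.3f}")
```

Under these assumed constants the 300B-token floor sits slightly above Chinchilla's predicted loss, which is the sense in which no amount of extra model size could close the gap.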

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

Updates from Members and Contributors

  • Kevin O’Brien highlights the PyData Global 2022 Conference, taking place online between Thurs 1st and Sat 3rd December. Calls for proposals are still open until September 12th, 2022. Submit here.
  • Ole Schulz-Trieglaff also mentions the PyData Cambridge meetup which is running a talk on Sept 14th by Gian Marco Iodice (Tech Lead ML SW Performance Optimizations at ARM)
  • Ronald Richman and colleagues have published a paper on their innovative work using deep neural nets for discrimination free pricing in insurance, when discriminatory characteristics are not known. Well worth a read.
  • Many congratulations to Prithwis De who has published a book on a very relevant topic: “Towards Net-Zero Targets: Usage of Data Science for Long-Term Sustainability Pathways”
  • Mark Marfé and Cerys Wyn Davies recently published an article about data and IP issues in the context of AI deployed on ESG projects which looks interesting and relevant.
  • Finally, more news from The Data Science Campus who are helping organise this year’s UN Big Data Hackathon, November 8-11.
    • The UN Big Data Hackathon is an exciting global competition for data professionals and young people from all around the world to work together on important global challenges.
    • It’s part of this year’s UN Big Data conference in Indonesia. There are two tracks, one for data science professionals and the other for young people and students (under 32 years of age).
    • Registrations should preferably be done as a team of 3 to 5 people, but individual applications can also be accepted. The registration deadline is Sept 15th.


The Job market is a bit quiet over the summer- let us know if you have any openings you’d like to advertise

  • Evolution AI are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

