September Newsletter

Hi everyone-

I hope you have all been enjoying a great summer. Certainly lots to engage with from heat waves, sewage spills, leadership elections, spiralling energy costs… and of course on a much more positive note the Lionesses winning the Euros for the first time (it’s come home…)! Apologies for skipping a month but it does mean we have plenty to talk about so prepare for a somewhat longer than normal read…

Following is the September edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at http://datasciencesection.org/)

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science September 2022 Newsletter
RSS Data Science Section

Committee Activities

Committee members continue to be actively involved in the Alliance for Data Science Professionals, a joint initiative between the RSS and various other relevant organisations in defining standards for data scientist accreditation. The first tranche of data scientists to complete the newly defined standard of professionalism received their awards at a special ceremony at the Royal Society in July. The UK’s National Statistician welcomed the initiative.

Our recent event “From paper to pitch, success in academic/industry collaboration”, which took place on Wednesday 20th July, was very successful with strong attendance and a thought-provoking and interactive discussion- many thanks to Will Browne for organising. We will write up a summary and publish shortly.

We are also excited to announce our next event, catchily titled “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”, which will be held on Wednesday 21 September 2022, 7.00PM – 8.00PM. Sign up here to hear leading figures such as Dr David Barber (Director of the UCL Centre for Artificial Intelligence) and Professor Noam Shemtov (Intellectual Property and Technology Law at Queen Mary University of London) in what should be an excellent discussion.

The RSS 2022 Conference is rapidly approaching (12-15 September in Aberdeen). The Data Science and AI Section is running what will undoubtedly be the best session(!) … ‘The secret sauce of open source’, which will discuss using open source to bridge the gap between academia and industry.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on September 14th when Gwanghyun Kim, Ph.D. student at Seoul National University (SNU), will discuss “Text-Guided Diffusion Models for Robust Image Manipulation”. Videos are posted on the meetup youtube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

We know how good AI is now getting with image recognition, generation and natural language tasks. As always, these technical innovations create opportunities and challenges, as this thoughtful NYTimes piece discusses.
- On the one hand we have identification of anonymous faces in WWII photos, movie studios able to save a fortune using ‘DeepFake’ approaches on last minute editing challenges, musicians incorporating AI tools into their creative processes and even ‘AI-assisted’ authors increasing their output through creative use of language models.
- But all these tools are trained on data- fine when it’s facts, figures and publicly available information, but how should we think about art generated ‘in the style of’ another artist?
- And of course the underlying premise with all of this, is that the AI actually works well… which may not be the case.
  “AI rapper FN Meka dropped by Capitol over racial stereotyping”;
  “Chess robot grabs and breaks finger of seven-year-old opponent“;
  “Cruise’s Robot Car Outages Are Jamming Up San Francisco” (although Waymo results are very impressive to watch…)
  “Meta’s new AI chatbot can’t stop bashing Facebook”

Good morning to everyone, especially the Facebook https://t.co/EkwTpff9OI researchers who are going to have to rein in their Facebook-hating, election denying chatbot today pic.twitter.com/wMRBTkzlyD
— Jeff Horwitz (@JeffHorwitz) August 7, 2022

We have to acknowledge that many of the new AI tools are astonishing both in their performance and their sophistication and that it is incredibly hard if not impossible to eliminate all mistakes. However, applying best practice and using high quality data sets should be at the core of all work in this area.

"“They were claiming near-perfect accuracy, but we found that in each of these cases, there was an error in the machine-learning pipeline,” says Kapoor."

Investment and interest in new applications is still increasing (“NATO launches innovation fund“) – so how do we make sure these increasingly opaque models are built in the right way? Increasingly some sort of regulation and auditing seems essential.
- “Regulating AI: The Horizontal vs Vertical Approach“
- “Microsoft Responsible AI Standard v2” as well as research from Princeton into the “reproducibility crisis“
- AI “Ethical Toolkits”
- Summary of AI policy at the state level in the US
- “UK Government gives the green light for World’s longest drone ‘superhighway’”
- “New rules to improve road safety and enable fully driverless vehicles in the EU”

"The new Vehicle General Safety Regulation starts applying today. It introduces a range of mandatory advanced driver assistant systems to improve road safety and establishes the legal framework for the approval of automated and fully driverless vehicles in the EU"

And it’s not just regulation and auditing of new use cases and approaches that are needed. Meta/Facebook continues to come under scrutiny for current and historic practices
- “United States Attorney Resolves Groundbreaking Suit Against Meta Platforms, Inc To Address Discriminatory Advertising For Housing” which resulted in actual changes at Facebook
- Even the suggestion that Facebook’s business model itself drives unethical behaviour

"Facebook’s stated mission is “to give people the power to build community and bring the world closer together.” But a deeper look at their business model suggests that it is far more profitable to drive us apart. By creating “filter bubbles”—social media algorithms designed to increase engagement and, consequently, create echo chambers where the most inflammatory content achieves the greatest visibility—Facebook profits from the proliferation of extremism, bullying, hate speech, disinformation, conspiracy theory, and rhetorical violence"

The recent overturning of Roe v Wade by the US Supreme Court, and its implications for the legality of abortion at the individual state level, has led to increased focus on the implications of data gathering.
- Automated vehicle license plate readers can track individuals across state boundaries
- While Google has agreed to delete location data when users visit abortion clinics

“We remain committed to protecting our users against improper government demands for data, and we will continue to oppose demands that are overly broad or otherwise legally objectionable,” Ms. Fitzpatrick wrote.

And sadly data leaks continue to happen, at seemingly larger and larger scale…
- “Leak of California gun owners’ private data far wider than originally reported“
- “Hacker claims to have stolen 1 bln records of Chinese citizens from police”

"He posted again on Twitter later in the day, saying: "apparently, this exploit happened because the gov developer wrote a tech blog on CSDN and accidentally included the credentials", referring to the China Software Developer Network."

Finally, a thought provoking paper on the potential implications of replacing human relations with humanoid robots

"This paper first discusses what humanoid robots are, why and how humans tend to anthropomorphise them, and what the literature says about robots crowding out human relations. It then explains the ideal of becoming “fully human”, which pertains to being particularly moral in character."

Developments in Data Science…
As always, lots of new developments on the research front and plenty of arXiv papers to read…

Solving proteins…
- We have previously discussed the groundbreaking work of DeepMind in solving the protein folding problem with AlphaFold, which can generate the estimated 3D structure for any protein. They have now gone a step further and publicly released the structures of over 200m proteins
- Lots of background and commentary on this groundbreaking step here and here

"Prof Dame Janet Thornton, the group leader and senior scientist at the European Molecular Biology Laboratory’s European Bioinformatics Institute, said: “AlphaFold protein structure predictions are already being used in a myriad of ways. I expect that this latest update will trigger an avalanche of new and exciting discoveries in the months and years ahead, and this is all thanks to the fact that the data are available openly for all to use."

How should we think about the relationship between the size of a model, and the size of the data set it is trained on for a given computational budget? Is there some sort of optimal relationship between the two? Researchers at DeepMind think so, and conclude (the ‘Chinchilla’ paper) that the current set of large language models are actually under-trained. (Interesting twitter discussion on scale here)
And the increasing focus on data (both size and quality) rather than model improvement has led to new data-centric benchmarks for AI development
Researchers are always striving to ‘do more’ with the data and compute they have
- Apple in collaboration with the University of British Columbia have proposed a new way to reconstruct a human in a scene from a single “in-the-wild” video
- Elegant research into using large language models to generate synthetic training data in specific use cases: “Language Models Can Teach Themselves to Program Better“
- More self-supervised learning- this time with masked autoencoders in reinforcement learning: “Masked Model Worlds for Visual Control“
- And again, this time for image embedding: “Selfie: Self-supervised Pretraining for Image Embedding”
- What precision do you really need in your weights? Looks like Integers may be enough!
- Ground truth labelling of data is sometimes impossible, with different people labelling data points in different ways. Elegant method for learning from these disagreements (“Jury Learning: Integrating Dissenting Voices into Machine Learning Models”)
- And going even further, using human-in-the-loop pipelines for social policy design
And robustness and generalisation are always hot topics
- Innovative approach using contrastive learning to remove spurious correlations in image recognition
- Re-evaluating Transformers vs CNNs for robustness– no clear winner…
- Good paper digging into Concept Drift and Model Degradation
New research into time series methods:
- Encoding time series as implicit neural representations – HyperTime
- What looks like a good practical approach for anomaly detection in multi-variate time series
Graph Methods can be very successful at utilising features harder to represent in more traditional approaches and their application continues to expand
- A Generalization of Transformer Networks to Graphs
- Molecular representation in graphs
Ever since I found multiple single variable models combined together outperformed a single model using all the variables I’ve been sold on the concept of ensemble models… Model Soup (paper here) looks interesting- averaging weights instead of averaging outputs.
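
To make the distinction concrete, here is a minimal sketch in plain Python (not the paper's code). The toy one-neuron ReLU model, the parameter names `w1`, `b1`, `w2`, the helper functions and the numbers are all assumptions made up for illustration. A uniform "soup" averages the parameters of models that share an architecture, leaving a single model to serve, whereas a classic ensemble averages predictions and has to run every model.

```python
def uniform_soup(param_sets):
    """Average parameters elementwise across models with the same architecture."""
    n = len(param_sets)
    return {name: sum(p[name] for p in param_sets) / n for name in param_sets[0]}


def predict(params, x):
    """Hypothetical toy model: y = w2 * relu(w1 * x + b1)."""
    hidden = max(0.0, params["w1"] * x + params["b1"])
    return params["w2"] * hidden


def ensemble_predict(param_sets, x):
    """Classic output ensemble: run every model, then average the predictions."""
    return sum(predict(p, x) for p in param_sets) / len(param_sets)


# Two "fine-tuned" variants of the same toy model (made-up numbers).
model_a = {"w1": 1.0, "b1": 0.0, "w2": 2.0}
model_b = {"w1": 3.0, "b1": -1.0, "w2": 4.0}

soup = uniform_soup([model_a, model_b])
soup_output = predict(soup, 1.0)                             # one forward pass
ensemble_output = ensemble_predict([model_a, model_b], 1.0)  # two forward passes
```

Because the toy model is non-linear, the two outputs differ, which is the point: weight averaging is a different technique from output averaging, not just a cheaper way of computing the same thing.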
With model complexity continuing to increase, methods to interpret model structure and reasoning are increasingly important, particularly in the context of transparency. Good survey paper here
And yes, I still find this to be the case…: “Why do tree-based models still outperform deep learning on tabular data?“

Results show that tree-based models remain state-of-the-art on medium-sized data (∼10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions

Finally another phenomenon I find pretty extraordinary… “Grokking”, where model performance improves after seemingly over-fitting. Researchers at Apple give the full story

"The grokking phenomenon as reported by Power et al. ( arXiv:2201.02177 ) refers to a regime where a long period of overfitting is followed by a seemingly sudden transition to perfect generalization. In this paper, we attempt to reveal the underpinnings of Grokking via a series of empirical studies. Specifically, we uncover an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism"

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

We talk a lot about Large Language Models like GPT3 and how good they are becoming – but what are the real-life applications for them?
- General summary from an investor gives useful perspective
- Simple and elegant- English to regex (and vice versa) app powered by GPT3
- Using GPT3 to explain how code works
- Amazing… solving quantitative reasoning problems
- And fantastic work from Hugging Face to release BLOOM, a fully open source Large Language Model
Of course Large Language Models are a key component of Multi Modal Models (combining data from different modalities like images and text) such as DALLE which continue to generate lots of interest
- This is well worth listening to- how DALLE Works on The Data Exchange. Really interesting to understand the components (LLM, CLIP, diffusion models to generate the output). (OpenAI have also published commentary on some of their risk reduction techniques mentioned in the discussion)
- If you want to get more hands on, here is a fast minimal port of DALLE Mini to pytorch
- Fun application of DALLE2 generating adventure game graphics
- Impressive application of multi modal learning from tractable.ai for disaster prediction, management and relief
- Who needs DALLE2 when you have Stable Diffusion- an open source text-to-image model released by stability.ai (although not without some controversy). This is a major breakthrough from the open source community (BigScience) and a big step in reducing the lock in of large commercial firms.

A/B testing is a critical tool in any organisation and approaches are becoming more and more sophisticated- some great examples here:
With the proliferation of data readily available, spurious correlations are an increasing problem which explains the growing interest in causal inference
Evolving approaches to recommendation systems:
- General tips- 10 mistakes to avoid with recommendation systems
- Good tutorial on conversational recommendation systems
- Using hybrid features at Yelp
- Reinforcement Learning for Budget Constrained Recommendations at Netflix
Excellent overview of the use of graph neural networks at Airbnb (useful package here to experiment with)

Many real-world machine learning problems can be framed as graph problems. On online platforms, users often share assets (e.g. photos) and interact with each other (e.g. messages, bookings, reviews). These connections between users naturally form edges that can be used to create a graph.

However, in many cases, machine learning practitioners do not leverage these connections when building machine learning models, and instead treat nodes (in this case, users) as completely independent entities. While this does simplify things, leaving out information around a node’s connections may reduce model performance by ignoring where this node is in the context of the overall graph.

Continuous integration at scale: applying machine learning to improve testing efficiency at Mozilla
More applications of AI across increasingly diverse industries:
- Spotting topographic changes over time with satellite imagery
- Improving football (soccer…) scouting
- Automated damage assessment of cars
- Automated electronic discovery in legal cases
- Elegant approach to matching supply and demand in energy markets (the ‘unit commitment problem’) – hugely beneficial for the utilisation of variable green energy sources like solar and wind.
Robots…

How does that work?
Tutorials and deep dives on different approaches and techniques

An impressive resource for all the various transformer based models, from ALBERT to XLNET and all in between.
Wanting to get up to speed on Deep Learning but don’t know where to start? – this looks to be a comprehensive guide from Sebastian Raschka
Great tutorial on self-supervised learning and applying Deep Learning to small data sets
Bayesian approaches are often elegant but can have steep learning curves- a few useful tutorials:
Dimensionality reduction is a critical skill and understanding the different approaches can be very useful:
- Autoencoders, latent space and the curse of high dimensionality
- Singular Value Decomposition
The days of coding up your perceptrons from scratch are behind us (thankfully!) but understanding how differentiable programming helps solve optimisation problems is still very useful – and an interesting library to experiment with (betty)
Optimisation (and algorithms in general) can be an under-represented topic in data science courses- but can be very important in ML implementation
- Great resource for key algorithms
- Excellent service for visualising how sorting algorithms work
- Understanding Paxos, “one of the oldest, simplest, and most versatile algorithms in the field of distributed consensus”
- If you’re looking to explore, evotorch looks interesting
Simple and elegant approach using SALT to better distribute your data when it is skewed
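
For readers who have not met salting before, the idea can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the linked article's code: the function names, the `#` separator and the toy records are assumptions. The salt here is assigned round-robin so the example is reproducible; in practice it is usually a random integer. A skewed key is split into several salted keys, each is aggregated separately, and the partial results are then recombined.

```python
from collections import Counter, defaultdict


def salted_key(key, i, n_salts):
    """Append a salt so that one hot key becomes n_salts smaller keys."""
    return f"{key}#{i % n_salts}"


def two_stage_sum(records, n_salts):
    # Stage 1: partial sums per salted key (this is the work that gets spread out).
    partial = defaultdict(float)
    for i, (key, value) in enumerate(records):
        partial[salted_key(key, i, n_salts)] += value
    # Stage 2: strip the salt and combine the partial sums per original key.
    total = defaultdict(float)
    for skey, subtotal in partial.items():
        total[skey.rsplit("#", 1)[0]] += subtotal
    return dict(partial), dict(total)


# Heavily skewed toy data: 90% of the records share one key.
records = [("hot", 1.0)] * 90 + [("cold", 1.0)] * 10

partial, total = two_stage_sum(records, n_salts=8)
unsalted_sizes = Counter(key for key, _ in records)
salted_sizes = Counter(salted_key(key, i, 8) for i, (key, _) in enumerate(records))
```

The largest group drops from 90 records to 12 while the final totals are unchanged, at the cost of a second, much smaller, aggregation step.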
Useful tutorial on SHAP for model explainability
Excellent practical article on fine tuning random forest models

"To conclude: we have shown that for in the presence of (many) irrelevant variables, RF performance suffers and something needs to be done. This can be either tuning the RF, most importantly increasing the mtry parameter, or identifying and removing the irrelevant features using the RFE procedure rfe() part of the caret package in R. Selecting only relevant features has the added advantage of providing insight into which features contain the signal."

Identifying when your models decay through changing (drifting) data is critical to maintaining model performance
Good overview of real-time machine learning and whether you really need it
Detailed and exhaustive tutorial on Generalised Visual Language Models from Lilian Weng – how to fuse visual information into language models. Well worth a read.
Need to identify ‘topics’ across text documents (basically text clustering)? We have come a long way from LDA Topic modelling – these days it’s all about BERTopic
And an elegant visual explanation of text embeddings which underpin almost all language models

"Text Embeddings give you the ability to turn unstructured text data into a structured form. With embeddings, you can compare two or more pieces of text, be it single words, sentences, paragraphs, or even longer documents. And since these are sets of numbers, the ways you can process and extract insights from them are limited only by your imagination."

Finally, if you’re interested in exploring satellite image processing, this is the place to start
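
Going back to the text embeddings item above, "comparing two pieces of text" usually comes down to something like cosine similarity between their vectors. The sketch below is a toy illustration in plain Python: the three-dimensional vectors are hand-written stand-ins (real embeddings come from a language model and have hundreds of dimensions), and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Return the other text whose embedding is closest to the query's."""
    q = embeddings[query]
    scores = {
        text: cosine_similarity(q, vec)
        for text, vec in embeddings.items()
        if text != query
    }
    return max(scores, key=scores.get)


# Made-up vectors standing in for real model output.
toy_embeddings = {
    "the cat sat on the mat": [0.9, 0.1, 0.0],
    "a kitten rested on the rug": [0.8, 0.2, 0.1],
    "quarterly revenue rose 5%": [0.0, 0.1, 0.9],
}

closest = most_similar("the cat sat on the mat", toy_embeddings)
```

With real embeddings the same comparison is what sits underneath semantic search, deduplication and clustering.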

Practical tips
How to drive analytics and ML into production

ML Ops is still frustratingly vague, with a proliferation of services offering a wide array of capabilities. But what do you really need? Useful set of principles here and Google’s practitioners guide here.
There are so many options now for ML/Data Science platforms that it can be very hard to know where to start if you are looking to evolve how you work. So it’s always useful to see what other innovative companies use:
- Zalando’s machine learning platform
- Monzo’s machine learning stack
One thing that is often touted as best practice is to use as much config/code driven infrastructure and pipelines as possible:
- Configuration Driven Machine Learning pipelines at stitchfix
- ML infrastructure as code with terraform
More learning from how other leading companies do it:
How do you run a large language model in production? Good overview from Cohere.ai
If you are experimenting with the ever expanding list of autoML tools, this could be very useful– a comprehensive way of benchmarking across a variety of different problems
This looks interesting – DeepChecks, “testing and validating your machine learning models and data”
A simple way of using SQL for REST APIs and a few SQL tips and tricks
And finally some fun things to explore in python…
- Blazingly fast data frames with polars
- Packages to improve workflow
- 4 pandas anti-patterns to avoid
- Make things prettier with pretty-jupyter and ipyvizzu story

Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink! …

AI and the Limits of Language from Jacob Browning and Yann LeCun (full 62 pages from LeCun on the path to AI here!)

"As these LLMs become more common and powerful, there seems to be less and less agreement over how we should understand them. These systems have bested many “common sense” linguistic reasoning benchmarks over the years, many which promised to be conquerable only by a machine that “is thinking in the full-bodied sense we usually reserve for people.” Yet these systems rarely seem to have the common sense promised when they defeat the test and are usually still prone to blatant nonsense, non sequiturs and dangerous advice. This leads to a troubling question: how can these systems be so smart, yet also seem so limited?"

And the predicted retort from Gary Marcus in Scientific American – Artificial Intelligence is not as imminent as you might think

"To be sure, there are indeed some ways in which AI truly is making progress—synthetic images look more and more realistic, and speech recognition can often work in noisy environments—but we are still light-years away from general purpose, human-level AI that can understand the true meanings of articles and videos, or deal with unexpected obstacles and interruptions. We are still stuck on precisely the same challenges that academic scientists (including myself) having been pointing out for years: getting AI to be reliable and getting it to cope with unusual circumstances."

Another take from Raphael Milliere

"Ongoing debates about whether large pre-trained models understand text and images are complicated by the fact that scientists and philosophers themselves disagree about the nature of linguistic and visual understanding in creatures like us. Many researchers have emphasized the importance of “grounding” for understanding, but this term can encompass a number of different ideas. These might include having appropriate connections between linguistic and perceptual representations, anchoring these in the real world through causal interaction, and modeling communicative intentions. Some also have the intuition that true understanding requires consciousness, while others prefer to think of these as two distinct issues. No surprise there is a looming risk of researchers talking past each other."

How to build a GPT3 for Science

Liberating the world’s scientific knowledge from the twin barriers of accessibility and understandability will help drive the transition from a web focused on clicks, views, likes, and attention to one focused on evidence, data, and veracity. Pharma is clearly incentivized to bring this to fruition, hence the growing number of startups identifying potential drug targets using AI — but I believe the public, governments, and anyone using Google might be willing to forgo free searches in an effort for trust and time-saving. The world desperately needs such a system, and it needs it fast

Chinchilla’s wild implications

"To put this in context: until this paper, it was conventional to train all large LMs on roughly 300B tokens of data.  (GPT-3 did it, and everyone else followed.)

Insofar as we trust our equation, this entire line of research -- which includes GPT-3, LaMDA, Gopher, Jurassic, and MT-NLG -- could never have beaten Chinchilla, no matter how big the models got[6].

People put immense effort into training models that big, and were working on even bigger ones, and yet none of this, in principle, could ever get as far Chinchilla did."
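
For a rough feel of the numbers, here is a back-of-the-envelope sketch. It rests on two widely quoted approximations that are assumptions here, not taken from the article: training compute C ≈ 6 × N × D for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter (consistent with Chinchilla itself, at about 70B parameters and 1.4T tokens). The function and constant names are hypothetical.

```python
import math

TOKENS_PER_PARAM = 20.0       # assumed rule of thumb, not an exact law
FLOPS_PER_PARAM_TOKEN = 6.0   # assumed approximation: C ~ 6 * N * D


def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def compute_optimal(flops_budget):
    """Split a compute budget into (parameters, tokens) under the assumed ratio."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


# Sanity check: a Chinchilla-sized budget should give back Chinchilla-sized numbers.
budget = training_flops(70e9, 1.4e12)
n_params, n_tokens = compute_optimal(budget)
```

Under the same rule of thumb, the roughly 300B tokens mentioned in the quote would be the compute-optimal amount of data for a model of only around 15B parameters, which gives a sense of how under-trained the much larger models were.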

Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:

The Stanford geospatial model of the Roman World… 47 days from Alexandria to Londinium!
Granular interactive maps of noise levels in London, NY and Paris
Birdsong – you have to check this out!
Build new things from your old bricks…
Fun step by step project with python code- finding the shortest cycling path in the shade

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now

The latest results from the ONS tracking study estimate 1 in 45 people in England have Covid- a little better than last month (1 in 30) and at least down on its peak when it reached 1 in 14… Still a far cry from the 1 in 1000 we had last summer.
The UK has approved the Moderna ‘Dual Strain’ vaccine which protects against original strains of Covid and Omicron.

Updates from Members and Contributors

Kevin O’Brien highlights the PyData Global 2022 Conference, taking place online between Thurs 1st and Sat 3rd December. Calls for proposals are still open until September 12th, 2022. Submit here.
Ole Schulz-Trieglaff also mentions the PyData Cambridge meetup which is running a talk on Sept 14th by Gian Marco Iodice (Tech Lead ML SW Performance Optimizations at ARM)
Ronald Richman and colleagues have published a paper on their innovative work using deep neural nets for discrimination free pricing in insurance, when discriminatory characteristics are not known. Well worth a read.
Many congratulations to Prithwis De who has published a book on a very relevant topic: “Towards Net-Zero Targets: Usage of Data Science for Long-Term Sustainability Pathways”
Mark Marfé and Cerys Wyn Davies recently published an article about data and IP issues in the context of AI deployed on ESG projects which looks interesting and relevant.
Finally, more news from The Data Science Campus who are helping organise this year’s UN Big Data Hackathon, November 8-11.
- The UN Big Data Hackathon is an exciting global competition for data professionals and young people from all around the world to work together on important global challenges.
- It’s part of this year’s UN Big Data conference in Indonesia. There are two tracks, one for data science professionals and the other for young people and students (under 32 years of age).
- Registrations should preferably be done as a team of 3 to 5 people, but individual applications can also be accepted. Registration deadline is Sept 15th.

Jobs!

The Job market is a bit quiet over the summer- let us know if you have any openings you’d like to advertise

EvolutionAI are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
