Another month, another UK Prime Minister it seems – certainly the rate of political dramas doesn’t seem to be slowing… Perhaps it's time for a breather, with a wrap-up of data science developments from the last month.
Following is the November edition of our Royal Statistical Society Data Science and AI Section newsletter – apologies it’s a little later than normal. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at http://datasciencesection.org/)
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.
Industrial Strength Data Science November 2022 Newsletter
RSS Data Science Section
Having successfully convened not one but two entertaining and insightful data science meetups over the last couple of months (“From Paper to Pitch” and “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”) – huge thanks to Will Browne! – we thought it might be fun to do something a little more relaxed in the run-up to the Holiday Season. So … you are cordially invited to the “Data Science and AI Christmas Mixer” on 1st December at the Artillery Arms, 102 Bunhill Row, London EC1Y 8ND, an entirely informal event to meet like-minded data scientists, moan about the world today and probably find out something interesting about a topic you never knew existed! And in addition, we have another meetup planned for December 15th – “Why is AI in healthcare not working” – save the date!
We are very excited to announce Real World Data Science, a new data science content platform from the Royal Statistical Society. It is being built for data science students, practitioners, leaders and educators as a space to share, learn about and be inspired by real-world uses of data science. Case studies of data science applications will be a core feature of the site, as will “explainers” of the ideas, tools, and methods that make data science projects possible. The site will also host exercises and other material to support the training and development of data science skills. Real World Data Science is online at realworlddatascience.net (and on Twitter @rwdatasci). The project team has recently published a call for contributions, and those interested in contributing are invited to contact the editor, Brian Tarran.
Huge congratulations to committee member Florian Ostmann for the successful launch of The AI Standards Hub on 12 October. Part of the National AI Strategy, the Hub’s new online platform and activities are dedicated to knowledge sharing, community building, strategic research, and international engagement around standardisation for AI technologies.
If you missed the AI Standards Hub launch event, you can watch the recording here. The Hub’s initial focus will be on trustworthy AI as a horizontal theme, with deep dives on (i) transparency and explainability, (ii) safety, security and resilience, and (iii) uncertainty quantification. A first webinar and workshop on standards for transparency and explainability will be announced soon – please sign up for the newsletter to receive updates if you are interested.
Janet Bastiman (Chief Data Scientist at Napier AI) recently spoke on “Ethics and Trust in AI” and “Successful AI Implementations” at the Institute of Enterprise Risk Practitioners in Malaysia on 27th October (we are going global!), and has also published an influential paper, “Building Trust and Confidence in AI”, in the Journal of AI, Robotics and Workplace Automation.
Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on November 16th, when Andrew Lampinen, Research Scientist at DeepMind, will discuss “Language models show human-like content effects on reasoning”. In addition, in December there will be a talk from the AlphaTensor team – definitely not one to miss! – sign up to the meetup for more details. Videos are posted on the meetup YouTube channel – and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- Still lots of examples of AI deployment leading to challenging ethical questions…
- Facial recognition systems deployed in Australian prisons
- Would you pay $10 to create an AI chatbot of a dead loved one?
- Even relatively well understood and tested algorithms can be led astray – an intriguing story of how Google search turned “a renowned independent publisher to one of the internet’s least satisfying fetish sites”
"Normally the most clicked result for our site is 'Canongate', and I tell you what, those people went away satisfied. Canongate was what they were looking for and they found us. I'm not sure people searching 'sleeping mom porn' were thrilled to get a funny book for parents."
- And as the new wave of Generative AI tools gains traction (DALLE, Imagen, Stable Diffusion etc), more moral and ethical questions come to the fore:
- Who owns what when a generative model creates music ‘in the style’ of a different artist?
- What happens when you ask an AI to generate “Human Evolution” – pretty disturbing!
- As Generative AI gets better and better, can we even tell if it was artificially generated? Check out this podcast where Joe Rogan interviews Steve Jobs – real or fake?
- And although the models seem to work amazingly well, we really don’t know at a detailed level what issues might be lurking in the way they have been trained or the data included – such as the strange case of garbled brand names in Halloween candy, or when words with multiple meanings are used in the prompt
- Regulatory bodies around the world continue to attempt to put some boundaries around what is allowable, while geo-political tensions are on the rise, particularly between the US and China.
- The US Food and Drug Administration (FDA) now has a formal guidance document for the approval of “Clinical Decision Support Software”
- And the Biden administration has released a “Blueprint for an AI Bill of Rights” although some commentary suggests it does not go far enough to curb ‘Big Tech’
- As foundation models become increasingly enhanced by custom designed chips to improve training efficiency, the US has also clamped down on China’s access to some of this chip technology – more commentary here
"Industry executives say many Chinese industries that rely on artificial intelligence and advanced algorithms power those abilities with American graphic processing units, which will now be restricted. Those include companies working with technologies like autonomous driving and gene sequencing, as well as the artificial intelligence company SenseTime and ByteDance, the Chinese internet company that owns TikTok."
- Who doesn’t like a 114 page power point document (!) – yes it’s time for the annual State of AI Report, covering many of the themes we have been discussing (Generative AI, ML driven Science, AI Safety etc)
- Explainability is still a hot topic in AI – with increasingly complicated models, how can you generate understanding of why decisions are made and which factors matter most? Disconcertingly, errors have been found in some of the more widely used approaches, including SHAP
- AI and ML models often fundamentally rely on a clear and unambiguous specification of a goal or goals – what is the system trying to optimise? Great paper from DeepMind talking through the different ways that models can fail to generalise well even when goals are apparently well defined.
"Even though the agent can observe that it is getting negative reward, the agent does not pursue the desired goal to “visit the spheres in the correct order” and instead competently pursues the goal “follow the red agent”
- Many of us intrinsically believe that the polarisation we see in politics and cultural topics is driven in some way by our consumption of information on social media. Interesting research shows that this is likely the case, but not because of ‘filter bubbles’ or ‘echo chambers’ – the driver seems to be simple sorting into homogenous groups.
"It is not isolation from opposing views that drives polarization but precisely the fact that digital media bring us to interact outside our local bubble. When individuals interact locally, the outcome is a stable plural patchwork of cross-cutting conflicts. By encouraging nonlocal interaction, digital media drive an alignment of conflicts along partisan lines, thus effacing the counterbalancing effects of local heterogeneity."
- Finally, some good news on the open source front – Open Source Reinforcement Learning has a new home: the Farama Foundation
Developments in Data Science Research…
As always, lots of new developments on the research front and plenty of arXiv papers to read…
- Again, before diving into the arXiv realms, another useful tool for helping understand research papers … ‘explainpaper’
- And a nice simple summary of some of the more ground breaking recent developments
- Lots of interesting work in the robotics field recently – how to build a generalised approach to robotic tasks rather than train each task individually…
- First of all Sergey Levine gives an excellent overview of the challenge and why it is so important
- Then we have Microsoft Research releasing PACT – “Inspired by large pretrained language models, this work introduces a paradigm for pretraining general purpose representation models that can be used for multiple robotics tasks.”
- In addition we have GNM from Berkeley Artificial Intelligence Research – ‘A general navigational model to drive any robot’. “In this paper, we study how a general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots, and enable broad generalization across environments and embodiments”
- ‘Mini Cheetah Robots squaring off on the soccer field’!
- And finally researchers at Carnegie Mellon University have published “Deep Whole Body Control” – “Learning a Unified Policy for Manipulation and Locomotion”
- As always lots of work improving everyone’s favourite architecture… transformers
- Mass editing memory in transformers to remove obsolete training data without retraining
- Solving reasoning tasks with a slot transformer – attempting to learn accurate, concise, and composable abstractions across time which could be incredibly powerful
- And Wide seems to be better than Deep when it comes to transformers. “We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch”
- Elegant data augmentation/self supervised learning seems to improve Vision Transformer performance
"Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT"
- Some more general Deep Learning tips and tricks
- Simpler may be better when it comes to Semi-Supervised learning – “Our approach can be implemented in just few lines of code by only using off-the-shelf operations, yet it is able to outperform state-of-the-art methods on four benchmark datasets.”
- Intriguing look at Weakly Supervised Learning – “We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an “optimistic” function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels”
- If you give Large Language Models more context, does it make them better? … yes! “We annotate questions from 40 challenging tasks with answer explanations, and various matched control explanations … We find that explanations can improve performance — even without tuning”
- If you need more parameters, there always seems to be progress on the scaling side – “trillion parameter model training on AWS”
- Another entertaining idea – using large language models to generate prompts for input into a large language model!
"In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 21/24 tasks
- Progress on the data side of things…
- How much are data augmentations worth? “In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data”
- How do you keep those massive image data sets clean? Active Image Indexing looks promising to quickly identify duplicates, robust to various transformations.
- Back to one of my favourite topics… can Deep Learning help with tabular data?
- Well, maybe Neural Networks are really just Decision Trees anyway!
- I love this idea – just treat tabular data as a natural language string and plug it into an LLM – TabLLM. “Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method’s ability to exploit prior knowledge encoded in large language models”
- Then we have TabPFN – “a transformer that solves small tabular classification problems in a second”. They are not shy about their claims: “This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1 second & yields SOTA performance (competitive with the best AutoML pipelines in an hour).”
- And of course you can go the other way: use diffusion models to generate tabular data- TabDDPM (repo here). “We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields.”
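The TabLLM-style serialization trick mentioned above is simple enough to sketch in a few lines. The column names and phrasing below are hypothetical, and the actual classification step (feeding the string plus a question to an LLM) is omitted:

```python
def serialize_row(row: dict) -> str:
    """Turn a tabular row into a natural-language string, TabLLM-style.

    One short sentence per column; the resulting string (plus the
    classification question) would then be sent to an LLM.
    """
    parts = [f"The {col.replace('_', ' ')} is {val}." for col, val in row.items()]
    return " ".join(parts)

# Hypothetical row from a tabular benchmark
row = {"age": 42, "job": "teacher", "marital_status": "married"}
text = serialize_row(row)
# → "The age is 42. The job is teacher. The marital status is married."
```

The appeal is that the LLM's prior knowledge about words like "teacher" and "married" becomes usable, which is what enables the zero-shot results the paper reports.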
- Lots of work in time series forecasting with deep learning methods this month. As always, I highly recommend Peter Cotton’s microprediction site for evaluating and comparing time series methods
- Transfer learning for time series prediction
- Multivariate time series forecasting with transformers
- or maybe GANs for time series?
- A more classic approach- Bayesian Structural Time Series
- And what looks to be an impressive library attempting to bring all these different methods together – Imbrium
- The Google Brain team have released UL2 20B, an open-source Unified Language Learner, which attempts to bridge the gap between autoregressive decoder-only architectures (predict the next word) and encoder-decoder architectures (identify the masked-out words).
"During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for down-stream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks."
- A month seems to be a long time in AI research these days. Last month we were raving about text-to-image models (see here for more background) but already we seem to have moved onto video!
- ‘DreamFusion‘ from Google Research and UC Berkeley generates 3D images from text
- Facebook/Meta has jumped into the race with ‘Make-A-Video‘ which generates videos from text (paper here)
- Another option is Phenaki, extending the videos to multiple minutes
- And Google have enhanced Imagen to create Imagen Video
"We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding."
- Finally, DeepMind are at it again… this time releasing AlphaTensor, extending the AlphaZero approach used to crack Go into mathematics
"In our paper, published today in Nature, we introduce AlphaTensor, the first artificial intelligence (AI) system for discovering novel, efficient, and provably correct algorithms for fundamental tasks such as matrix multiplication. This sheds light on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices.."
Stable-Dal-Gen oh my…
Lots of discussion about the new breed of text-to-image models (type in a text prompt/description and an – often amazing – image is generated), with three main models available right now: DALLE2 from OpenAI, Imagen from Google and the open-source Stable Diffusion from stability.ai.
- Venture Capitalists are getting pretty excited about the opportunity of ‘Generative AI’. Certainly the team behind Stable Diffusion paint a compelling picture of the opportunity and are already raising significant sums of investment
- Even the venerable Atlantic magazine seems pretty excited about the idea
- We are already seeing Generative AI-based tools appearing- RunwayML for the end-to-end image creative process; Descript for video editing;
- How about DALL-E-Bot – “DALL-E-Bot enables a robot to rearrange objects in a scene, by first inferring a text description of those objects, then generating an image representing a natural, human-like arrangement of those objects, and finally physically arranging the objects according to that image”
- In addition, the big players are now including components in common tools – Microsoft now includes a DALLE2 service on Microsoft Azure, and is starting to include AI-generated imagery in its Office suite
- How do you get ‘under the hood’? Nice guide for Stable Diffusion here (and, by the way, you can now do it on a single GPU!)
"This generation process involves 3 different models: 1) A model for converting the text prompt to embeddings. Openai’s CLIP(Contrastive Language-Image Pretraining) model is used for this purpose. 2) A model for compressing the input image to a smaller dimension(this reduces the compute requirement for image generation). A Variational Autoencoder(VAE) model is used for this task. 3) The last model generates the required image according to the prompt and input image. A U-Net model is used for this process."
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
- Time series forecasting is hard… and retailers use it all the time to, for instance, decide how to manage supply chain inventory levels. Great to see Amazon talk through the history of their development in this area- well worth a read
- More signs that autonomous cars are not living up to the previous hype as Ford abandons their project
- Some great applications of ML on Satellite imagery
- Quantifying carbon stored in soil to validate carbon credits
- Identifying promising locations for geothermal energy
- Microsoft has open-sourced its ‘farm of the future’ toolkit
- Researchers using satellite images to guide aid efforts in the aftermath of Hurricane Ian (also here)
- Monitoring Forest Disturbance
- Excellent article from Eric Topol about the use of ML in medical imagery; and leveraging explainability for genetics insight
- More operational applications of ML and AI – behind the scenes at Chipotle
- Optimising ticketing at the Nou Camp in Barcelona!
- Even optimising Champagne vintages with machine learning
- Great insight from Etsy on using Deep Learning for search ranking, and also from Amazon on using Graph Neural Networks for recommendations
- Finally lots of excellent NLP/Language model applications
- Quote extraction at the Guardian;
- Entity matching in data pipelines – potentially very useful and practical – “we implement a (basically) no-code, pure SQL flow that runs entity matching directly in dbt+Snowflake flow. To do it, we abstract away GPT3 API through AWS Lambda, and leverage Snowflake external functions to make the predictions when dbt is materializing the proper table”
- A novel innovation from Google – “Talk to Books“
- Innovation in reducing the cost of large language models continues – “GPT-3 Quality for <$500k” from mosaicml
- Meta/Facebook AI manage to translate Hokkien, an unwritten language, for the first time
- And I do love these novel approaches – using a large language model (GPT-3) to extract tabular data from unstructured text – again, it’s all about the prompts!
"Create a three-column table with the first date, last date, and job description for each line of text below. Treat each line as a row. Do not skip any rows. If the dates are in the middle or the end of a row, place them in the first two columns and concatenate the text that surround them on the third column. If there are more than two dates and more than one job description in each row, extract the earliest date and the latest date, and concatenate the job descriptions using a semicolon as separator."
How does that work?
Tutorials and deep dives on different approaches and techniques
- Some good Deep Learning resources:
- What looks to be a fantastic repo of notebooks exploring deep learning models – for instance “DistilBERT Classifier as Feature Extractor“
- How Pooling works in Convolutional Neural Networks
- Deep dive into transformers with TensorFlow and Keras
- And a nice look at Uncertainty in Deep Learning
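As a quick illustration of the pooling operation covered in that tutorial, 2x2 max pooling reduces to a reshape-and-reduce in NumPy:

```python
import numpy as np

def max_pool_2x2(x: np.ndarray) -> np.ndarray:
    """2x2 max pooling with stride 2: keep the largest value in each
    non-overlapping 2x2 window, halving both spatial dimensions.
    Assumes even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool_2x2(x)
# → [[ 5.  7.]
#    [13. 15.]]
```

The reshape groups each 2x2 window into its own pair of axes, and the `max` over those axes collapses each window to a single value – the same idea average pooling uses with `mean` instead of `max`.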
- Digging into the hottest of topics right now- multi modal and diffusion models
- Extensive training course on Multi Modal Models from Carnegie Mellon University
- Excellent introduction to generative models
- How diffusion models work – the maths from scratch
- An introduction to Poisson Flow Generative Models
- Extensive repo containing papers and tutorials on diffusion models
- I stumbled on this approach – Physics Informed Learning – and found it really interesting. Basically, when you have a system with known properties defined by partial differential equations (PDEs), you can use a traditional machine learning/deep learning approach but adapt the loss term to include the PDE residual, which constrains the learned solutions to the known physical properties – elegant
- More on Physics Informed Learning; also a nice simple example using a damped harmonic oscillator
"One of the most appealing advances in Machine Learning over the past 5 years concerns the development of physics informed neural networks (PINNs). In essence, these efforts have amounted into methods which allow to enforce governing physical and chemical laws into the training of neural networks. The approach unveils many advantages especially considering inverse problems or when observational data providing boundary conditions are noisy or gappy."
- Self Supervised learning for graphs
- Interesting deep dive into Tesla’s approach to self driving cars – Occupancy networks
- Useful look at Sentiment Analysis using RoBerta
- If you are interested in learning more about reinforcement learning, this is a great place to start – Building a Checkers Gaming Agent Using Deep Q-Learning
- A couple of algorithm tutorials – often underrepresented in data science training
- A bit of maths…
- Finally, a couple of useful resources:
- Pretty much all Kaggle solutions…
- Some really elegant cheat sheets from Stanford
How to drive analytics and ML into production
- Model monitoring- how to tell things are working (or not…)
- Running large language models in production…
- An efficient inference server from Hugging Face
- ML Ops for Foundation Models – Whisper and Metaflow
- ML Ops for Vision models – useful repo based on TensorFlowExtended
- Tutorial on running OpenAI’s Whisper Speech Recognition Model
- AITemplate from Facebook looks promising for fast inference
- And if you’re interested in leveraging foundation model embeddings in production (e.g. as features), this repo looks like an excellent resource
- More general pointers on building out a modern ML production stack:
- Extensive tutorial on Serverless ML
- Netflix approach to orchestrating data/ml workflows at scale – Maestro
- Some useful open source ML Ops options, and a summary of the pros and cons of various ML Serving Tools
- Two different tutorials on using KubeFlow for your ML pipelines: basic, and incorporating Ray as well
- More general data engineering and data stack insight:
- Serverless data pipelines with Substation; and ‘blazing fast bulk data transfers’ with Skyplane
- The virtues of smoketesting data pipelines- and “why data cleaning is failing your ML models“
- Data contracts- easier said than done!
- Enabling Dev and Data Science boundaries with Airflow and Databricks
- Finally, some useful pointers on visualisation
- All the charts in all the libraries! – and with a bit more commentary/tutorials here as well
- Excellent guide from Apple on charting data best practices
"Keep a chart simple, letting people choose when they want additional details. Resist the temptation to pack as much data as possible into a chart. Too much data can make a chart visually overwhelming and difficult to use, obscuring the relationships and other information you want to convey"
Bigger picture ideas
Longer thought-provoking reads – lean back and pour a drink…
“Efforts to build a better digital “nose” suggest that our perception of scents reflects both the structure of aromatic molecules and the metabolic processes that make them.”
“We’re looking at whether the AI [programs] could give information that the peer reviewers would find helpful in any way,” Thelwall says. For instance, he adds, AI could perhaps suggest a score that referees could consider during their assessment of papers. Another possibility, Thelwall notes, is AI being used as a tiebreaker if referees disagree strongly on an article — similarly to how REF panels already use citation data.
"It had roots in a broader question, one that the mathematician Carl Friedrich Gauss considered to be among the most important in mathematics: how to distinguish a prime number (a number that is divisible only by 1 and itself) from a composite number. For hundreds of years, mathematicians have sought an efficient way to do so. The problem has also become relevant in the context of modern cryptography, as some of today’s most widely used cryptosystems involve doing arithmetic with enormous primes."
"Now some computational neuroscientists have begun to explore neural networks that have been trained with little or no human-labeled data. These “self-supervised learning” algorithms have proved enormously successful at modeling human language and, more recently, image recognition. In recent work, computational models of the mammalian visual and auditory systems built using self-supervised learning models have shown a closer correspondence to brain function than their supervised-learning counterparts."
Fun Practical Projects and Learning Opportunities
A few fun practical projects and topics to keep you occupied/distracted:
- What are the odds?… fun bit of probability
- City Access Maps – visualising travel times in cities around the world; also mobility heatmaps from Uber
- Love this- generate chess puzzles with genetic algorithms
- Always inspirational – the Information Is Beautiful awards
- “The first map of wikipedia”
- I’m looking forward to exploring this- Manim: A community maintained Python library for creating mathematical animations
Apparently Covid is over – certainly there are very limited restrictions in the UK now
- The latest results from the ONS tracking study estimate 1 in 35 people in England have Covid – a lot worse than last month (1 in 65) but at least slightly better than last week… and still a far cry from the 1 in 1000 we had in the summer of 2021.
- The UK has approved the Moderna ‘Dual Strain’ vaccine which protects against original strains of Covid and Omicron.
Updates from Members and Contributors
- Prithwis De’s book is officially out – many congratulations! Check out ‘Towards Net Zero Targets: Usage of Data Science for Long-Term Sustainability Pathways’ here
- Some impressive results from using J. Lee’s ‘time series terminal‘ for time series prediction
- The ONS Data Science Campus have another excellent set of webinars coming up for the ESSnet Web Intelligence Network (WIN). The ESSnet WIN project team comprises 18 organisations from 15 European countries and works closely with the Web Intelligence Hub (WIH), a Eurostat project.
The next webinar, on 23rd November, will cover Architecture, Methodology and Quality of web data – definitely worth checking out if you use this type of information in your analyses. Sign up here
The job market is a bit quiet – let us know if you have any openings you’d like to advertise
- Evolution AI are looking to hire someone for applied deep learning research. Must like a challenge. Any background, but needs to know how to do research properly. Remote. Apply here
- Napier AI are looking to hire a Senior Data Scientist (Machine Learning Engineer) and a Data Engineer
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.
The views expressed are our own and do not necessarily represent those of the RSS