November Newsletter

Hi everyone-

Another month, another UK Prime Minister it seems – certainly the rate of political dramas doesn’t seem to be slowing… Perhaps it’s time for a breather, with a wrap-up of data science developments over the last month.

Following is the November edition of our Royal Statistical Society Data Science and AI Section newsletter – apologies it’s a little later than normal. Hopefully there are some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at http://datasciencesection.org/)

As always, any and all feedback is most welcome! If you like these, please do send them on to your friends – we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically, you can do so here.

Industrial Strength Data Science November 2022 Newsletter

RSS Data Science Section

Committee Activities

Having successfully convened not one but two entertaining and insightful data science meetups over the last couple of months (“From Paper to Pitch” and “IP Freely, making algorithms pay – Intellectual property in Data Science and AI”) – huge thanks to Will Browne! – we thought it might be fun to do something a little more relaxed in the run-up to the Holiday Season. So… you are cordially invited to the “Data Science and AI Christmas Mixer” on 1st December at the Artillery Arms, 102 Bunhill Row, London EC1Y 8ND: an entirely informal event to meet like-minded data scientists, moan about the world today and probably find out something interesting about a topic you never knew existed! In addition, we have another meetup planned for 15th December – “Why is AI in healthcare not working?” – save the date!

We are very excited to announce Real World Data Science, a new data science content platform from the Royal Statistical Society. It is being built for data science students, practitioners, leaders and educators as a space to share, learn about and be inspired by real-world uses of data science. Case studies of data science applications will be a core feature of the site, as will “explainers” of the ideas, tools, and methods that make data science projects possible. The site will also host exercises and other material to support the training and development of data science skills. Real World Data Science is online at realworlddatascience.net (and on Twitter @rwdatasci). The project team has recently published a call for contributions, and those interested in contributing are invited to contact the editor, Brian Tarran.

Huge congratulations to committee member Florian Ostmann for the successful launch of The AI Standards Hub on 12 October. Part of the National AI Strategy, the Hub’s new online platform and activities are dedicated to knowledge sharing, community building, strategic research, and international engagement around standardisation for AI technologies. 

If you missed the AI Standards Hub launch event, you can watch the recording here. The Hub’s initial focus will be on trustworthy AI as a horizontal theme, with deep dives on (i) transparency and explainability, (ii) safety, security and resilience, and (iii) uncertainty quantification. A first webinar and workshop on standards for transparency and explainability will be announced soon – please sign up for the newsletter to receive updates if you are interested.

Janet Bastiman (Chief Data Scientist at Napier AI) recently spoke on “Ethics and Trust in AI” and “Successful AI Implementations” at the Institute of Enterprise Risk Practitioners in Malaysia on 27th October (we are going global!), and has also published an influential paper, “Building Trust and Confidence in AI”, in the Journal of AI, Robotics and Workplace Automation.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The next event is on 16th November, when Andrew Lampinen, Research Scientist at DeepMind, will discuss “Language models show human-like content effects on reasoning”. In addition, in December there will be a talk from the AlphaTensor team – definitely not one to miss! Sign up to the meetup for more details. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…

Bias, ethics and diversity continue to be hot topics in data science…

"Normally the most clicked result for our site is 'Canongate', and I tell you what, those people went away satisfied. Canongate was what they were looking for and they found us.
I'm not sure people searching 'sleeping mom porn' were thrilled to get a funny book for parents."
"Industry executives say many Chinese industries that rely on artificial intelligence and advanced algorithms power those abilities with American graphic processing units, which will now be restricted. Those include companies working with technologies like autonomous driving and gene sequencing, as well as the artificial intelligence company SenseTime and ByteDance, the Chinese internet company that owns TikTok."
  • Who doesn’t like a 114-page PowerPoint document (!) – yes, it’s time for the annual State of AI Report, covering many of the themes we have been discussing (Generative AI, ML-driven science, AI safety, etc.)
  • Explainability is still a hot topic in AI – with increasingly complicated models, how can you generate understanding of why decisions are made and what the most important factors are? It is disconcerting that errors have been found in some of the more widely used approaches, including SHAP (see the short illustrative sketch at the end of this list)
  • AI and ML models often fundamentally rely on a clear and unambiguous specification of a goal or goals: what is the system trying to optimise? Great paper from DeepMind talking through the different ways that models can fail to generalise well, even when goals are apparently well defined.
"Even though the agent can observe that it is getting negative reward, the agent does not pursue the desired goal to “visit the spheres in the correct order” and instead competently pursues the goal “follow the red agent”
  • Many of us intrinsically believe that the polarisation we see in politics and cultural topics is driven in some way by our consumption of information on social media. Interesting research shows that this is likely the case, but not because of ‘filter bubbles’ or ‘echo chambers’ – the driver seems to be simple sorting into homogeneous groups.
"It is not isolation from opposing views that drives polarization but precisely the fact that digital media bring us to interact outside our local bubble. When individuals interact locally, the outcome is a stable plural patchwork of cross-cutting conflicts. By encouraging nonlocal interaction, digital media drive an alignment of conflicts along partisan lines, thus effacing the counterbalancing effects of local heterogeneity."
  • Finally, some good news on the open source front – Open Source Reinforcement Learning has a new home: the Farama Foundation
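
As flagged in the explainability item above, SHAP remains one of the most widely used tools in this space. A minimal illustrative sketch of typical usage (assuming `pip install shap scikit-learn`; the dataset and model are placeholders, not from any of the papers linked above):

```python
# Illustrative SHAP usage on a tree ensemble (placeholder data and model).
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:200])

# Summary plot: which features drive predictions, and in which direction
shap.summary_plot(shap_values, X.iloc[:200])
```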

Developments in Data Science Research…

As always, lots of new developments on the research front and plenty of arXiv papers to read…

  • Again, before diving into the arXiv realms, another useful tool for helping to understand research papers… ‘explainpaper’
  • And a nice simple summary of some of the more groundbreaking recent developments
  • Lots of interesting work in the robotics field recently – how to build a generalised approach to robotic tasks, rather than training each task individually…
    • First of all Sergey Levine gives an excellent overview of the challenge and why it is so important
    • Then we have Microsoft Research releasing PACT – “Inspired by large pretrained language models, this work introduces a paradigm for pretraining general purpose representation models that can be used for multiple robotics tasks.”
    • In addition we have GNM from Berkeley Artificial Intelligence Research – ‘A general navigational model to drive any robot’. “In this paper, we study how a general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots, and enable broad generalization across environments and embodiments”
    • ‘Mini Cheetah Robots squaring off on the soccer field’!
    • And finally researchers at Carnegie Mellon University have published “Deep Whole Body Control” – “Learning a Unified Policy for Manipulation and Locomotion”
"Our evaluations on Image classification (ImageNet-1k with and without pre-training on ImageNet-21k), transfer learning and semantic segmentation show that our procedure outperforms by a large margin previous fully supervised training recipes for ViT. It also reveals that the performance of our ViT trained with supervision is comparable to that of more recent architectures. Our results could serve as better baselines for recent self-supervised approaches demonstrated on ViT"
  • Some more general Deep Learning tips and tricks
    • Simpler may be better when it comes to Semi-Supervised learning – “Our approach can be implemented in just few lines of code by only using off-the-shelf operations, yet it is able to outperform state-of-the-art methods on four benchmark datasets.”
    • Intriguing look at Weakly Supervised Learning – “We model weak supervision as giving, rather than a unique target, a set of target candidates. We argue that one should look for an “optimistic” function that matches most of the observations. This allows us to derive a principle to disambiguate partial labels”
    • If you give Large Language Models more context, does it make them better? … yes! “We annotate questions from 40 challenging tasks with answer explanations, and various matched control explanations … We find that explanations can improve performance — even without tuning”
    • If you need more parameters, there always seems to be progress on the scaling side – “trillion parameter model training on AWS”
    • Another entertaining idea – using large language models to generate prompts for input into a large language model!
"In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 21/24 tasks
  • Progress on the data side of things…
    • How much are data augmentations worth? “In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data”
    • How do you keep those massive image data sets clean? Active Image Indexing looks promising to quickly identify duplicates, robust to various transformations.
  • Back to one of my favourite topics… can Deep Learning help with tabular data?
    • Well, maybe Neural Networks are really just Decision Trees anyway!
    • I love this idea – just treat tabular data as a natural language string and plug it into an LLM – TabLLM (a small sketch of the serialisation idea follows after these bullets). “Despite its simplicity, we find that this technique outperforms prior deep-learning-based tabular classification methods on several benchmark datasets. In most cases, even zero-shot classification obtains non-trivial performance, illustrating the method’s ability to exploit prior knowledge encoded in large language models”
    • Then we have TabPFN – “a transformer that solves small tabular classification problems in a second”. They are not shy about their claims: “This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1 second & yields SOTA performance (competitive with the best AutoML pipelines in an hour).”
    • And of course you can go the other way: use diffusion models to generate tabular data- TabDDPM (repo here). “We extensively evaluate TabDDPM on a wide set of benchmarks and demonstrate its superiority over existing GAN/VAE alternatives, which is consistent with the advantage of diffusion models in other fields.”
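
As promised above, a toy sketch of the TabLLM serialisation idea – turn a row into a sentence and ask an LLM to classify it zero-shot. The column names, template and `llm` callable are illustrative assumptions:

```python
# Toy sketch of the TabLLM idea: serialise a tabular row as natural language
# and classify it zero-shot with an LLM. All names here are illustrative.
import pandas as pd

def serialize_row(row: pd.Series) -> str:
    # e.g. "The age is 39. The education is Bachelors. The hours_per_week is 40."
    return " ".join(f"The {col} is {val}." for col, val in row.items())

def zero_shot_classify(llm, row: pd.Series, question: str) -> str:
    prompt = f"{serialize_row(row)}\n{question}\nAnswer yes or no:"
    return llm(prompt)  # `llm` is any text-in/text-out function

df = pd.DataFrame({"age": [39], "education": ["Bachelors"], "hours_per_week": [40]})
# zero_shot_classify(my_llm, df.iloc[0], "Does this person earn more than $50K per year?")
```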
  • Lots of work in time series forecasting with deep learning methods this month. As always, I highly recommend Peter Cotton’s microprediction site for evaluating and comparing time series methods
  • The Google Brain team have released UL2 20B, an open source “Unified Language Learner” which attempts to bridge the gap between autoregressive decoder-only architectures (predict the next word) and encoder-decoder architectures (identify the masked-out words).
"During pre-training it uses a novel mixture-of-denoisers that samples from a varied set of such objectives, each with different configurations. We demonstrate that models trained using the UL2 framework perform well in a variety of language domains, including prompt-based few-shot learning and models fine-tuned for down-stream tasks. Additionally, we show that UL2 excels in generation, language understanding, retrieval, long-text understanding and question answering tasks."
  • A month seems to be a long time in AI research these days. Last month we were raving about text-to-image models (see here for more background) but already we seem to have moved on to video!
    • ‘DreamFusion’ from Google Research and UC Berkeley generates 3D models from text
    • Facebook/Meta has jumped into the race with ‘Make-A-Video‘ which generates videos from text (paper here)
    • Another option is Phenaki, extending the videos to multiple minutes
    • And Google have enhanced Imagen to create Imagen Video
"We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding."
  • Finally, DeepMind are at it again… this time releasing AlphaTensor, extending the AlphaZero approach that cracked Go to the domain of mathematics
"In our paper, published today in Nature, we introduce AlphaTensor, the first artificial intelligence (AI) system for discovering novel, efficient, and provably correct algorithms for fundamental tasks such as matrix multiplication. This sheds light on a 50-year-old open question in mathematics about finding the fastest way to multiply two matrices.."

Stable-Dal-Gen oh my…

Lots of discussion about the new breed of text-to-image models (type in a text prompt/description and an – often amazing – image is generated), with three main models available right now: DALLE2 from OpenAI, Imagen from Google and the open source Stable Diffusion from stability.ai.

"This generation process involves 3 different models:

1) A model for converting the text prompt to embeddings. Openai’s CLIP(Contrastive Language-Image Pretraining) model is used for this purpose.
2) A model for compressing the input image to a smaller dimension(this reduces the compute requirement for image generation). A Variational Autoencoder(VAE) model is used for this task.
3) The last model generates the required image according to the prompt and input image. A U-Net model is used for this process."
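
For the curious, this whole three-model pipeline is wrapped up in Hugging Face’s diffusers library. A minimal sketch (assuming `pip install diffusers transformers accelerate`, a CUDA GPU, and access to the model weights; the model ID and prompt are just examples):

```python
# Minimal text-to-image sketch using Hugging Face diffusers.
# Under the hood the pipeline bundles the three models described above:
# a CLIP text encoder, a VAE, and a U-Net denoiser.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",  # example model ID
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")
```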

Real world applications of Data Science

Lots of practical examples making a difference in the real world this month!

"Create a three-column table with the first date, last date, and job description for each line of text below. Treat each line as a row. Do not skip any rows. If the dates are in the middle or the end of a row, place them in the first two columns and concatenate the text that surround them on the third column. If there are more than two dates and more than one job description in each row, extract the earliest date and the latest date, and concatenate the job descriptions using a semicolon as separator."

How does that work?

Tutorials and deep dives on different approaches and techniques

"One of the most appealing advances in Machine Learning over the past 5 years concerns the development of physics informed neural networks (PINNs). In essence, these efforts have amounted into methods which allow to enforce governing physical and chemical laws into the training of neural networks. The approach unveils many advantages especially considering inverse problems or when observational data providing boundary conditions are noisy or gappy."

Practical tips

How to drive analytics and ML into production

"Keep a chart simple, letting people choose when they want additional details. Resist the temptation to pack as much data as possible into a chart. Too much data can make a chart visually overwhelming and difficult to use, obscuring the relationships and other information you want to convey"

Bigger picture ideas

Longer thought-provoking reads – lean back and pour a drink! …

“Efforts to build a better digital “nose” suggest that our perception of scents reflects both the structure of aromatic molecules and the metabolic processes that make them.”
"‘We’re looking at whether the AI [programs] could give information that the peer reviewers would find helpful in any way,’ Thelwall says. For instance, he adds, AI could perhaps suggest a score that referees could consider during their assessment of papers. Another possibility, Thelwall notes, is AI being used as a tiebreaker if referees disagree strongly on an article — similarly to how REF panels already use citation data."
"It had roots in a broader question, one that the mathematician Carl Friedrich Gauss considered to be among the most important in mathematics: how to distinguish a prime number (a number that is divisible only by 1 and itself) from a composite number. For hundreds of years, mathematicians have sought an efficient way to do so. The problem has also become relevant in the context of modern cryptography, as some of today’s most widely used cryptosystems involve doing arithmetic with enormous primes."
"Now some computational neuroscientists have begun to explore neural networks that have been trained with little or no human-labeled data. These “self-supervised learning” algorithms have proved enormously successful at modeling human language and, more recently, image recognition. In recent work, computational models of the mammalian visual and auditory systems built using self-supervised learning models have shown a closer correspondence to brain function than their supervised-learning counterparts."

Fun Practical Projects and Learning Opportunities

A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over – certainly there are very limited restrictions in the UK now.

Updates from Members and Contributors

  • Prithwis De’s book is officially out – many congratulations! Check out ‘Towards Net Zero Targets: Usage of Data Science for Long-Term Sustainability Pathways’ here
  • Some impressive results from using J. Lee’s ‘time series terminal’ for time series prediction
  • The ONS Data Science Campus have another excellent set of webinars coming up for the ESSnet Web Intelligence Network (WIN). The ESSnet WIN project team comprises 18 organisations from 15 European countries and works closely with the Web Intelligence Hub (WIH), a Eurostat project.
    The next webinar, on 23rd November, will cover the architecture, methodology and quality of web data – definitely worth checking out if you use this type of information in your analyses. Sign up here.

Jobs!

The job market is a bit quiet at the moment – let us know if you have any openings you’d like to advertise.

Again, hope you found this useful. Please do send it on to your friends – we are looking to build a strong community of data science practitioners – and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
