March Newsletter

Hi everyone-

February may technically be the shortest month but it certainly can feel long… I think I sensed a slight brightening in the morning light but I may have been mistaken… Maybe time for a bit of distraction with a wrap up of data science developments in the last month. Don’t miss out on more ChatGPT fun and games in the middle section!

Following is the March edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at http://datasciencesection.org/). Note we plan on moving email providers for next month (fingers crossed) so keep an eye out for a different looking version in April…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science March 2023 Newsletter
RSS Data Science Section

handy new quick links: 
committee; ethics; research; generative ai; applications; tutorials; practical tips; big picture ideas; fun; reader updates; jobs

Committee Activities

We are actively planning our activities for the year, and are currently working with the Alliance for Data Science professionals on expanding the previously announced individual accreditation (Advanced Data Science Professional certification) into university course accreditation. Remember also that the RSS is now accepting applications for the Advanced Data Science Professional certification- more details here.

This year’s RSS International Conference will take place in the lovely North Yorkshire spa town of Harrogate from 4-7 September. As usual Data Science is one of the topic streams on the conference programme, and there is currently an opportunity to submit your work for presentation. There are options available for 20-minute talks, 5-minute rapid-fire talks and for poster presentations – for full details visit the conference website. The deadline for talk submissions is 5 April.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The last event was on Feb 15th when Research Scientists from Meta AI, presented “Human-level Play in Diplomacy Through Language Models & Reasoning“. Videos are posted on the meetup youtube channel – and future events will be posted here.

Martin has also compiled a handy list of mastodon handles as the data science and machine learning community migrates away from twitter…

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…

Bias, ethics and diversity continue to be hot topics in data science…

The use of AI in the military is fully of ethical concerns but applications are certainly on the rise…
- An AI agent flew a USAF training aircraft for over 17 hours
- Russia has repurposed it’s AI surveillance operations for use in the Ukraine War

"In September 2022, just after Putin announced additional mobilization for the war against Ukraine, Viktor Kapitonov, a 27-year-old activist who’d protested regularly since 2013, was stopped by two police officers after being flagged by face recognition surveillance while he approached the turnstiles in Moscow’s marble-covered Avtozavdodskaya metro station. The officers took him to the military recruitment office, where around 15 people were waiting to enlist in Putin’s newly announced draft. "

But governments around the world are attempting to reach consensus on this topic
- Government representatives from over 60 countries meeting at the REAIM summit (organised by the Dutch Ministry of Foreign Affairs and Ministry of Defence) have agreed a joint call to action on the responsible development – signatories include the US and China, but not Russia
- China has also published a position paper on Strengthening Ethical Governance of Artificial Intelligence
- And the US has released their Political Declaration on Responsible Military Use of Artificial Intelligence and Autonomy

"A principled approach to the military use of AI should include careful consideration of risks and benefits, and it should also minimize unintended bias and accidents. States should take appropriate measures to ensure the responsible development, deployment, and use of their military AI capabilities, including those enabling autonomous systems.  These measures should be applied across the life cycle of military AI capabilities."

Problematic uses of AI are still widespread…
- Get used to facial recognition in stadiums
- Intriguing story of an AI generated animated Sienfeld-based show suspended from Twitch for generating a transphobic stand-up sketch
- The People Onscreen Are Fake. The Disinformation Is Real. – in depth NYTimes article on Wolf News, an apparently reputable news outlet that uses realistic avatars to generate disinformation at scale

As the current boom in Generative AI is showing (see more later), there are huge ethical concerns around how these tools are applied.
- Chatbots Got Big—and Their Ethical Red Flags Got Bigger – if you need a more extended view there is always this 85 page paper (“Emerging Threats and Potential Mitigations“) from OpenAI, Stanford and Georgetown University!
- Some discussion here about how current approaches on limiting the scope of applications may not be enough: “Containment algorithms won’t stop super-intelligent AI, scientists warn“
- OpenAI’s own take on all this – How should AI systems behave, and who should decide?
- And a thoughtful piece from the Brookings Institute – Early thoughts on regulating generative AI like ChatGPT

"This might include tech companies that provide these models over API (e.g., OpenAI, Stability AI), through cloud services (e.g., the Amazon, Google, and Microsoft clouds), or possibly even through Software-as-a-Service providers (e.g., Adobe Photoshop). These businesses control several levers that might partially prevent malicious use of their AI models. This includes interventions with the input data, the model architecture, review of model outputs, monitoring users during deployment, and post-hoc detection of generated content."

As practitioners we all need to be aware of these risks and concerns and make sure we incorporate current best practices into our development and deployment of models – with new frameworks emerging to help in this space on a regular basis
- MIT spinout Verta offers tools to help companies introduce, monitor, and manage machine-learning models safely and at scale.
- NIST Risk Management Framework Aims to Improve Trustworthiness of Artificial Intelligence

Developments in Data Science Research…

As always, lots of new developments on the research front and plenty of arXiv papers to read…

Efficiency and scalability are still hot topics in research:
- Faster training of deep neural networks through more efficient back propagation (SparseProp)
- Efficient fine tuning of billion scale models form Hugging Face – PEFT
- Although the largest language models have upward of 100b parameters, vision models are typically smaller…. Scaling vision transformers to 22 billion parameters
- Facebook/Meta research released a new Large Language Model that outperforms GPT3 and runs on a single GPU – impressive: LLaMA – more background here
- More great progress on the open source front from LAION with OpenCLIP
Emerging research around optimising prompting of generative models:
- À-la-carte Prompt Tuning (APT) – a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time
- This seems very useful: SwitchPrompt – adapting language models trained on large general datasets to more specific targeted domains

"Using domain-specific keywords with a trainable gated prompt, SwitchPrompt offers domain-oriented prompting, that is, effective guidance on the target domains for general-domain language models. Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt"

Lots of work looking at adapting and improving large language models:
- Augmented Language Models: a Survey – useful summary of works in which language models are augmented with reasoning skills and the ability to use tools
- Speaking of which – Toolformer: Language Models Can Teach Themselves to Use Tools
- AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
- Dreamix – provide an input image to guide generative video creation- very impressive
- MarioGPT – generating Super Mario Bros game levels with a fine tuned GPT2 model – what’s not to like?!
- This looks pretty cool- Chat2VIS: generating data visualisations via natural language- try it out here!

Some more big picture research around reasoning
- How good are the current generative AI chat bots at reasoning- useful survey paper; also a new data set (FOLIO) to help understand and train natural language reasoning tasks
- Is ChatGPT a General-Purpose Natural Language Processing Task Solver?
- Theory of Mind May Have Spontaneously Emerged in Large Language Models
More concerning – research showing how it can be relatively simple to extract training data from generative models
Following on from last month, more research in the AI music domain:
- Predicting Music Skips using Deep Reinforcement Learning
- A interesting review of AI based music generation systems
- RAVE: Realtime Audio Variational autoEncoder
- And have a play with all. this using MusicLM Pytorch
Finally, a few different areas I’m fond of:
- Progress in sequence modelling with simple long convolutions
- Very interesting – Sim2Real– training in lower resolution simulations seems to improve generalisation

"If we want to train robots in simulation before deploying them in reality, it seems natural and almost self-evident to presume that reducing the sim2real gap involves creating simulators of increasing fidelity (since reality is what it is). We challenge this assumption and present a contrary hypothesis – sim2real transfer of robots may be improved with lower (not higher) fidelity simulation"

Generative AI … oh my!

Still such a hot topic it feels in need of it’s own section, for all things DALLE, IMAGEN, Stable Diffusion, ChatGPT…

What a month! Starting off with developments on the business side- we’ve had Google investing $300m in Anthropic, OpenAI offering a ‘premium’ ChatGPT offering at $20 a month, open source versions appearing, Chinese companies entering the fray, and lots of increasingly impressive apps and offerings based on Generative AI

But of course the biggest news was the ups and downs of ‘new search’…
- First of all Microsoft rolled out a shiny new ChatGPT inspired version of Bing (their search engine) to much fanfare and acclaim: “Bing (Yes, Bing) Just Made Search Interesting Again“
- And Google did the same… to rather less acclaim: “Google shares drop $100 billion after its new AI chatbot makes a mistake“
- But then we learned that the Bing version was just as flakey (if not more so) than the Google version, causing more gyrations in the Microsoft stock price
- And this is where it starts to get a bit weird as reports of all sorts of wild and wooly conversations with Bing/ChatGPT started to emerge
- Users became more and more creative in adapting their ‘prompts’ to bypass the rules in place to stop more problematic behaviour: “Amazing “Jailbreak” Bypasses ChatGPT’s Ethics Safeguards“; the now famous ‘DAN’ (Do Anything Now) prompt, and the unmasking of ‘Sydney’ the underlying model (told you it got weird…”Hi Sydney“)
- If you get the chance, do have a listen the the first 30 mins or so of this podcast – an experienced tech journalist gets properly unnerved by his conversations with ChatGPT

“I’m Sydney, and I’m in love with you. 😘”

Stepping back a bit, some musings on what is going on…
- I thoroughly recommend this piece from Stephen Wolfram on how large language models work – I think it’s so interesting that a slight randomisation (the “temperature” parameter) of the output makes them seem more ‘human’
- Why do the factual errors happen?
- ChatGPT Is An Extra-Ordinary Python Programmer
- And of course Gary Marcus is the perennial skeptic – “Inside the Heart of ChatGPT’s Darkness” – with a thoughtful and rather more optimistic take from Rohit.Krishnan here

"In hindsight, ChatGPT may come to be seen as the greatest publicity stunt in AI history, an intoxicating glimpse at a future that may actually take years to realize—kind of like a 2012-vintage driverless car demo, but this time with a foretaste of an ethical guardrail that will take years to perfect."

Finally interesting takes on how ChatGPT and generative AI might affect different areas and processes:
- “Building GPT-3 applications — beyond the prompt“
- In education: “My class required AI. Here’s what I’ve learned so far” and “the AI homework showdown!“
- Really interesting commentary from Azeem Azhar: “Using ChatGPT in the innovation process“
- “How will AI change mathematics?“
- “What ChatGPT and generative AI mean for science“

"In December, computational biologists Casey Greene and Milton Pividori embarked on an unusual experiment: they asked an assistant who was not a scientist to help them improve three of their research papers. Their assiduous aide suggested revisions to sections of documents in seconds; each manuscript took about five minutes to review. In one biology manuscript, their helper even spotted a mistake in a reference to an equation. The trial didn’t always run smoothly, but the final manuscripts were easier to read — and the fees were modest, at less than US$0.50 per document."

Real world applications of Data Science

Lots of practical examples making a difference in the real world this month!

Always good to see real world examples using cutting edge techniques –
- Large Language Models for medical diagnosis feels like a potentially amazing application given the complexity of the material … Microsoft Research Proposes BioGPT (see also MedPALM from DeepMind as mentioned last month)
- AI and drug discovery
- Indian government working on WhatsApp chatbot backed by ChatGPT to aid rural India discover government schemes
- What happens when generative AI hits the legal industry… “Allen & Overy breaks the internet (and new ground) with co-pilot Harvey“

"Legal applications such as contract, conveyancing, or license generation are actually a relatively safe area in which to employ ChatGPT and its cousins,” says Lilian Edwards, professor of law, innovation, and society at Newcastle University. “Automated legal document generation has been a growth area for decades, even in rule-based tech days, because law firms can draw on large amounts of highly standardized templates and precedent banks to scaffold document generation, making the results far more predictable than with most free text outputs.” "

Great practical application!- “Down In the Sewers Tracking Viruses with WastewaterSCAN”
Unleashing ML Innovation at Spotify with Ray
The Secret Sauce of Tik-Tok’s Recommendations

Finally… AI meets the games/gaming industry
- Leela Chess Zero – Open source neural network based chess engine!
- Uh oh, people are now using AI to cheat in Rocket League
- Using Computer Vision To Destroy My Childhood High Score in a DS Game

How does that work?

Tutorials and deep dives on different approaches and techniques

Language models… what are they again? And how do they work?
Adapting language models…
- How to Build a Chatbot with GPT-3
- How does githubs co-pilot (AI coder) work? – more insight here with copilot-explorer
- Generative QA with OpenAI
- Storing OpenAI embeddings in Postgres with pgvector (more useful tools for adapting and using embeddings- GPTIndex and embetter)

"Let's explore an example of text embeddings. Say we have three phrases:

“The cat chases the mouse”
“The kitten hunts rodents”
“I like ham sandwiches”

Your job is to group phrases with similar meaning. If you are a human, this should be obvious. Phrases 1 and 2 are almost identical, while phrase 3 has a completely different meaning.

Although phrases 1 and 2 are similar, they share no common vocabulary (besides “the”). Yet their meanings are nearly identical. How can we teach a computer that these are the same?"

Getting into images…
- Comparing Different Automatic Image Augmentation Methods in PyTorch
- A Dive into Vision-Language Models
- Swift Diffusers: Fast Stable Diffusion for Mac

"Transform your text into stunning images with ease using Diffusers for Mac, a native app powered by state-of-the-art diffusion models. It leverages a bouquet of SoTA Text-to-Image models contributed by the community to the Hugging Face Hub, and converted to Core ML for blazingly fast performance."

And a whole variety of other things…
- Topic modelling with BERTopic
- An elegant bit of operations research – “One queue or two“
- Unsupervised and semi-supervised anomaly detection with data-centric ML – this feels really useful as it deals with the fact that labeled anomalies are hard to come by!

"Discovering a decision boundary for a one-class (normal) distribution (i.e., OCC training) is challenging in fully unsupervised settings as unlabeled training data include two classes (normal and abnormal). The challenge gets further exacerbated as the anomaly ratio gets higher for unlabeled data. To construct a robust OCC with unlabeled data, excluding likely-positive (anomalous) samples from the unlabeled data, the process referred to as data refinement, is critical. The refined data, with a lower anomaly ratio, are shown to yield superior anomaly detection models."

Practical tips

How to drive analytics and ML into production

First of all, some good articles on infrastructure, and MLOps
- Kleiner-Perkins (VC) view on infrastructure trends
- And a decent primer on the ever changing ‘MLOps knot‘
- Useful pointers on managing AI models at scale
- “Building a production ready recsys pipeline in the cloud“
- Blueprints for recommender system architectures: 10th anniversary edition
- Scaling Media Machine Learning at Netflix

"In this post, we will describe some of the challenges of applying machine learning to media assets, and the infrastructure components that we have built to address them. We will then present a case study of using these components in order to optimize, scale, and solidify an existing pipeline. Finally, we’ll conclude with a brief discussion of the opportunities on the horizon."

Some practical tips and tricks
- TFX-Extended has always been a solid option – now easier to work with through TFX-Addons
- Open source python feature engine… called feature engine
- Very useful- Exploring multi-quantile regression with Catboost
- Optimising hyperparameters? Google’s open source Vizier is worth a try
- Beyond Pandas — working with big(ger) data more efficiently using Polars and Parquet
- Exploring data in pandas – this looks useful- pygwalker – “turn your pandas data frame into a tabeau style interactive dashboard for visual analytics”
- If you really have to… productionise and schedule jupyter notebooks
How do you deploy large language model based applications?
- Deploy FLAN-T5 XXL on Amazon SageMaker
- Reducing costs of LLM deployment
Good high level tutorial series on data pipelines: Data cleaning for data sharing, Creating a data cleaning workflow, Cleaning sample data in standardized way
Commentary and thoughts on being an AI researcher

"I’m not a particularly experienced researcher (despite my title being “Senior” Research Scientist), but I’ve worked with some talented collaborators and spent a fair amount of time thinking about how to do research, so I thought I might write about how I go about it.

My perspective is this: doing research is a skill that can be learned through practice, much like sports or music."

Different thoughts on running data teams:
- Data as a Product vs. Data as a Service
- Should You Measure the Value of a Data Team?

“For example, improving the reliability of data pipelines and fixing underlying data quality issues can be the ultimate goal for a data team. You can use that goal as a starting point for aligning on a measurement of value and progress with stakeholders affected by those issues. While those may not have a direct effect on the bottom line, they can help indirectly by improving processes and operational efficiency, saving time or infrastructure costs, and gaining more trust in data and your work. By first writing down what each side expects, you can clarify with stakeholders how data work contributes to incremental process changes that couldn’t have happened without the data team’s involvement."

Bigger picture ideas

Longer thought provoking reads – lean back and pour a drink! …

ChatGPT Is a Blurry JPEG of the Web

"In 2013, workers at a German construction company noticed something odd about their Xerox photocopier: when they made a copy of the floor plan of a house, the copy differed from the original in a subtle but significant way. In the original floor plan, each of the house’s three rooms was accompanied by a rectangle specifying its area: the rooms were 14.13, 21.11, and 17.42 square metres, respectively. However, in the photocopy, all three rooms were labelled as being 14.13 square metres in size. The company contacted the computer scientist David Kriesel to investigate this seemingly inconceivable result. They needed a computer scientist because a modern Xerox photocopier doesn’t use the physical xerographic process popularized in the nineteen-sixties. Instead, it scans the document digitally, and then prints the resulting image file. Combine that with the fact that virtually every digital image file is compressed to save space, and a solution to the mystery begins to suggest itself."

If Dante were a Data Scientist: Inferno & Data

“Midway upon my journey in the realm of data science,
I found myself within a sea of algorithms and code,
For the straightforward path of understanding had been lost.

Ah me! How hard a thing it is to say
What was this chaos of machine learning models and techniques,
Which in the very thought renews the confusion.

So bitter is it, learning is little more;
But of the good to treat, which there I found,
Speak will I of the insights and discoveries I made there.”

Google Research, 2022 & beyond: Robotics

“Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people."

The maze is in the mouse

"Google has 175,000+ capable and well-compensated employees who get very little done quarter over quarter, year over year. Like mice, they are trapped in a maze of approvals, launch processes, legal reviews, performance reviews, exec reviews, documents, meetings, bug reports, triage, OKRs, H1 plans followed by H2 plans, all-hands summits, and inevitable reorgs. The mice are regularly fed their “cheese” (promotions, bonuses, fancy food, fancier perks) and despite many wanting to experience personal satisfaction and impact from their work, the system trains them to quell these inappropriate desires and learn what it actually means to be “Googley” — just don’t rock the boat"

A Catalog of Big Visions for Biology

"I don’t disagree with those points, but if biologists do not dream themselves, then the task of dreaming about biology gets outsourced to people who have little practical experience in it, and the dreams of biology get a bad name. Worse, I think it is hard to inspire most 20-year-olds and 30-year-olds with the promise of curing diseases that will not affect them for 40 years. If we want to maintain the pipeline of brilliant people entering biology, they need to be driven by something bigger than curing a disease they have never heard "

Simpson’s Paradox and Existential Terror

Fun Practical Projects and Learning Opportunities

A few fun practical projects and topics to keep you occupied/distracted:

If in doubt… Ask Seneca …you know you want to give it a try!
How cool is this? – “Codebreakers uncover secrets of lost letters Mary Queen of Scots wrote from jail“
Scratching the sports-x-ai itch .. Querying NBA stats with GPT-3 + Statmuse + Langchain
Random things I stumbled upon which you could while away hours with…
- 120,000 different maps
- Notable properties of specific numbers
And lots of fun visualisation tips and tricks:
- AI Research topics over the last 25 years visualised
- “How I built my own Stock Watchlist Dashboard in Tableau via Google Sheets“
- World data visualisation prize 2023
- I used to play around with this type of thing in D3 – so cool you can now do it in python… pyCirclize

Covid Corner

Apparently Covid is over … but it’s definitely still around

The latest results from the ONS tracking study estimates 1 in 45 people in England have Covid (a negative move from last month’s 1 in 70) … and till a far cry from the 1 in 1000 we had in the summer of 2021.

Updates from Members and Contributors

Friend of the newsletter and veteran pandas contributor Marco Gorelli highlights that the pandas 2.0.0 release candidate is out! If you get the chance do try it out and report any bugs before the final 2.0.0 release in a couple of weeks: to install
$pip install pandas==2.0.0rc0
Dr Ahmed Fetit, Senior Teaching Fellow AI for Healthcare at Imperial College, draws our attention to his recent publication which may be of interest to those working on robustness for medical imaging AI – “Reducing CNN Textural Bias With k-Space Artifacts Improves Robustness“
Jona Shehu draws our attention to what looks like an excellent new podcast series from the SAIS Project (a cross-disciplinary collaboration between King’s College London and Imperial College London, and non-academic partners like Microsoft) on the security of AI assistants: “Always Listening: Can I trust my AI Assistant?“. The first episode is already out. In addition, they have also launched a blog series to disseminate SAIS research findings (see blog 1 and blog 2. )
Fresh from publishing his book, “Towards Net Zero Targets: Usage of Data Science for Long-Term Sustainability PathwayTargets“, Prithwis De is now an award winner – many congratulations!
Harin Sellahewa is pleased to highlight success stories from apprentices who completed the Level 7 Digital and Technology Solutions Specialist (Data Analytics Specialist) degree apprenticeship from Buckingham – see Maddie Fang’s story here. The Level 7 DTSS apprenticeship is an excellent route for individuals to upskill or reskill, and for organisations to develop their capabilities in data science
Sian Fortt at the The Alan Turing Institute highlights the upcoming AI UK 2023 conference on the 21-22 March – “The UK’s national showcase of data science and artificial intelligence (AI)”- tickets here

Jobs!

The Job market is a bit quiet – let us know if you have any openings you’d like to advertise

This looks exciting – C3.ai are hiring Data Scientists and Senior Data Scientists to start ASAP in the London office- check here for more details
EvolutionAI, are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS