Happy New Year! I hope you all had as relaxing a holiday period as possible and enjoyed the fireworks from around the world… London trumps them all as far as I’m concerned, although I’m clearly biased. As we all gear up for 2022, perhaps it’s time for some thought-provoking data science reading material to help guide plans for the year ahead.
Following is the January edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity.
As always, any and all feedback is most welcome! If you like these, please do send them on to your friends: we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically, you can do so here.
Industrial Strength Data Science January 2022 Newsletter
RSS Data Science Section
We are all conscious that times are incredibly hard for many people, and we are keen to help however we can: if there is anything we can do for those who have been laid off (networking and introductions, advice on development etc.), don’t hesitate to drop us a line.
2021 has been a busy and productive year for the RSS Data Science and AI section, focusing on our goals of:
- Supporting the career development of data scientists and AI specialists
- Fostering good practice for professional data scientists
- Providing the voice of the practitioner to policy-makers
A few edited highlights:
- We kicked off our “Fireside Chat” series back in February with an amazing discussion with Andrew Ng attended by over 500 people, followed up with a similarly thought-provoking conversation with Anthony Goldbloom, founder of Kaggle, in May.
- In March we hosted our inaugural Data Science Ethics Happy Hour, discussing a wide range of topics focused on ethical challenges with an experienced panel. We also hosted “Confessions of a Data Scientist” at the annual RSS conference based on contributions from you, our experienced data science practitioner readership.
- Throughout the year we have engaged with various initiatives focused on the accreditation of data science. More recently we have been actively engaged with the UK Government’s AI Roadmap and strategy, first conducting a survey and publishing our findings and critiques (which were publicly acknowledged). We then hosted a well-attended event focused on the implications of the strategy, and will be collaborating with the UK Government’s Office for AI to host a roundtable event on AI Governance and Regulation, one of the three main pillars of the UK AI Strategy.
- … And we’ve managed to produce 12 monthly newsletters, expanding our readership along the way.
Our very own Jim Weatherall has co-authored a paper, “Really Doing Great at Estimating CATE?”, which has been accepted at NeurIPS. Many congrats, Jim!
Meanwhile, Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next talk will be on January 12th, when Alexey Bochkovskiy, research engineer at Intel, will discuss “YOLOv4 and Dense Prediction Transformers“. Videos are posted on the meetup YouTube channel, and future events will be posted here.
This Month in Data Science
Lots of exciting data science going on, as always!
Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…
- It’s not exactly breaking news that the ImageNet data set is very influential in driving image recognition AI research, but new research from the University of California and Google Research highlights the overall importance of these ‘benchmark’ datasets, largely from influential western institutions, and frequently from government organisations.
"[We] find that there is increasing inequality in dataset usage globally, and that more than 50% of all dataset usages in our sample of 43,140 corresponded to datasets introduced by twelve elite, primarily Western, institutions."
- And some of those same organisations are very influential in driving US AI security policy
- All the more reason why AI needs more input from around the world, particularly Africa, as well articulated in this Quartz piece.
- We know that Governments are far from perfect and evidence from around the world continues to come in, this time from South Korea where 170m facial images obtained in the immigration process were passed on to private AI developers.
- Meanwhile, in a worrying development, Chinese scientists have apparently developed an ‘AI prosecutor’ that can identify eight common crimes such as fraud, gambling and dangerous driving. We know, from excellent research such as this from The Markup into PredPol, how prone to biases these types of approaches are…
- TikTok is considered by many to have one of the best recommendation systems, driving phenomenal usage figures amongst its users. The NYTimes obtained an internal company document that offers a new level of detail about how the algorithm works. It’s clear that the algorithm optimises for retention and time-spent much like many other similar systems.
"The company’s edge comes from combining machine learning with fantastic volumes of data, highly engaged users, and a setting where users are amenable to consuming algorithmically recommended content (think how few other settings have all of these characteristics!). Not some algorithmic magic.”
- So TikTok is not doing anything inherently different from Facebook, Twitter, or any other site that recommends content. And in this excellent in-depth article, MIT Technology Review walks through how ‘clickbait farms‘ use these sites to spread misinformation.
“On an average day, a financially motivated clickbait site might be populated with celebrity news, cute animals, or highly emotional stories—all reliable drivers of traffic. Then, when political turmoil strikes, they drift toward hyperpartisan news, misinformation, and outrage bait because it gets more engagement”
- To try to combat this, HalloApp (founded by Neeraj Arora and Michael Donohue, who helped build WhatsApp) is building an algorithmic feed engine that is less prone to these influences; useful commentary here.
"It’s not the most “interesting” stories that make their way to the top of your News Feed (the word “interesting” implying “valuable”), but the most emotional. The most divisive. The ones with the most Likes, Comments, and Shares, and most likely to spark debate, conflict, anger. Either that, or the content a brand was willing to spend the most money sponsoring—all of which reveals a disconcerting conclusion: as a user of these platforms, being forced to see what the algorithm and brands want you to see, you have no rights"
- Not all doom and gloom though…
- US Congress is starting to draft proposals that adapt the infamous Section 230 to focus on amplification not content moderation
- And the UK Central Digital and Data Office has published a new Algorithmic Transparency Standard designed to help public sector organisations provide clear information about the algorithmic tools they use, and why they’re using them.
- The Stanford University Human-Centered AI group has published “A New Direction for Machine Learning in Criminal Law”, proposing the use of ML to analyse decision making in the criminal legal system, not to predict human behaviour but to better understand the factors that led to past decisions.
- The engineering group at Twitch has published an article highlighting how they use ML to combat hate and harassment on their platform.
- And Timnit Gebru has founded Distributed Artificial Intelligence Research (DAIR).
"Instead of fighting from the inside, I want to show a model for an independent institution with a different set of incentive structures.”
- Finally, some entertaining real-world examples from Emily Riederer of how ‘algorithms’ can go wrong, this time in the “fitness” space.
Developments in Data Science…
As always, lots of new developments…
- Not surprisingly given the date, there have been a number of reviews of the year in data science, AI, and ML research in 2021- here are some of the best:
- Louis Bouchard provides a curated list of the latest breakthroughs in AI research with excellent video explanations of each- well worth a read
- Six “outstanding papers” from NeurIPS 2021
- Top trending papers, libraries and datasets of 2021 from Papers With Code
- A more focused dive into the state of Graph ML in 2021 (with a recent addition here)
- A comparison of PyTorch and TensorFlow usage and development in 2021, with PyTorch gaining increasing dominance
- Finally, an interesting audio review of the state of AI, with Murray Shanahan joining Azeem Azhar.
- Some interesting new research looking to improve Deep Learning’s ability to generalise:
- First, a new data set and approach to learn across conversations: “Beyond Goldfish Memory”
- And then applying approaches from collective intelligence (self-organisation, emergent behaviour and swarm optimisation) to Deep Learning
- An intriguing new approach to Reinforcement Learning: abstracting it as a sequence modelling problem.
- As always, DeepMind continues to push the boundaries, this time exploring the potential of machine learning to recognise mathematical structures and patterns.
“It feels like Galileo picking up a telescope and being able to gaze deep into the universe of data and see things never detected before.”
- In addition, DeepMind released Gopher, a new 280 billion parameter model, together with insight into the areas where parameter scaling helps, and where it is less important
"Our research investigated the strengths and weaknesses of those different-sized models, highlighting areas where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language. We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense task"
- Meanwhile Google Research explored how data distillation (as opposed to model distillation) can improve ML efficiency.
- Not to be outdone, OpenAI has also been very busy:
- They released WebGPT which improves the factual accuracy of language models through web browsing
- And also published GLIDE, a scaled-down text-to-image model that rivals the groundbreaking DALL-E’s performance.
Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!
- More scarily efficient robots, this time from ETH Zürich’s Robotic Systems Lab, exploring what’s possible with wheeled-legged approaches.
- Speaking of robots, Nissan is carrying out Japan’s largest demonstration to date of autonomous vehicles in an area to the south of Tokyo
- Lots of promising research in medical applications of ML and AI:
“The results are compelling. It's certainly opening a new class of antimicrobial peptides, and finding them in an unexpected place.”
- Using Gaussian Processes as a first step towards Active Learning in Physics – interpolating between experimental data points and actively guiding the most productive areas for future research.
- Saving seaweed with Machine Learning!
- Automatically detecting weaknesses in sewer pipes from inspection videos
- Facebook has created an elegant approach that brings children’s drawings to life
- Finally Google has released an excellent accessibility tool that helps people with speech impediments communicate
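The Gaussian Process item above, interpolating between experimental data points and querying where uncertainty is highest, can be sketched in a few lines. This is a hypothetical toy illustration (not the paper’s code): a from-scratch GP posterior with an RBF kernel on a made-up 1-D “experiment”, where the next measurement is taken at the point of maximum predictive variance.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel matrix between 1-D point sets a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, jitter=1e-8):
    """Posterior mean and pointwise variance of a zero-mean GP."""
    K = rbf(x_train, x_train) + jitter * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)
    K_inv = np.linalg.inv(K)
    mean = K_s.T @ K_inv @ y_train
    # Diagonal of K_ss - K_s.T K_inv K_s, computed without forming K_ss
    var = 1.0 - np.sum(K_s * (K_inv @ K_s), axis=0)
    return mean, var

# Toy "experiment": three noiseless measurements of an unknown function
f = lambda x: np.sin(3 * x)
x_train = np.array([0.0, 1.0, 2.0])
y_train = f(x_train)
x_grid = np.linspace(0.0, 3.0, 200)

mean, var = gp_posterior(x_train, y_train, x_grid)
next_x = x_grid[np.argmax(var)]  # query where the model is least certain
```

In the active-learning loop, `next_x` would be measured, appended to the training set, and the posterior refitted; the variance naturally peaks far from existing data (here at the unexplored end of the grid).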
How does that work?
A new section on understanding different approaches and techniques
- MCMC (Markov chain Monte Carlo) sampling is notoriously computationally intensive; here is a well-written guide to applying it to large data sets using JAX and the GPU.
- A comprehensive tutorial on semi-supervised learning:
"The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data."
- Given the increasing prevalence of PyTorch, this looks very useful: miniTorch.
MiniTorch is a diy teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code. The project was developed for the course 'Machine Learning Engineering' at Cornell Tech.
- A well-written, step-by-step tutorial for named entity recognition in text samples.
- Finally, a new library that looks interesting: Skippa, for pre-processing pipelines in pandas.
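To make the MCMC item above concrete, here is a minimal random-walk Metropolis sampler targeting a standard normal. This is a hypothetical NumPy sketch, not the linked guide’s code: the guide applies the same accept/reject loop but JIT-compiles it with JAX and vectorises it across chains on the GPU, which is what makes large data sets tractable.

```python
import numpy as np

def log_prob(x):
    # Unnormalised log-density of the target (standard normal here)
    return -0.5 * x * x

def metropolis(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: propose, accept with prob min(1, p'/p)."""
    rng = np.random.default_rng(seed)
    x, lp = 0.0, log_prob(0.0)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        proposal = x + step * rng.normal()
        lp_prop = log_prob(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept / reject
            x, lp = proposal, lp_prop
        samples[i] = x  # rejected proposals repeat the current state
    return samples

samples = metropolis(20_000)
```

After burn-in, the empirical mean and standard deviation of `samples` approach 0 and 1, the moments of the target; the cost of the Python loop is exactly what JIT compilation removes.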
How to drive analytics and ML into production
- Some excellent commentary from Rachel Thomas at Fast.AI on how to avoid “Data Disasters”, highlighting the importance of “data work” – including a case study on the UK Covid Tracking App!
- In a similar vein, new research accentuates how critical relevant data sets are to ML model success: “99% of computer vision (CV) teams have had a machine learning (ML) project canceled due to insufficient training data”.
- With the criticality of relevant clean data sets clear, previous tenets on the “death of feature engineering” are being questioned
- Testing in data science and ML is always tricky; some useful examples here from Peter Baumgartner.
- Finally, an interesting conversation between Ben Lorica (Data Exchange podcast) and Azeem Ahmed on the Shopify data platform and the different components they use.
"We think about three large primitives: the ingest primitive in this chat interface, the transform interface, and the publisher interface. All of these apply to “data sets” – which could be tables, they could be models, they could be reports, dashboards, and all the other things that you mentioned. When you think of ingest, transform, publish, these are all operating on instead of storage. We are building the lakehouse architecture: our storage is GCS, Iceberg table format, plus Parquet. … Trino is our query engine.”
Bigger picture ideas
Longer thought provoking reads – lean back and pour a drink!
- Good article in the MIT Technology Review discussing how artificial intelligence is changing what it means to compute.
"Well, computers haven’t changed much in 40 or 50 years. They’re smaller and faster, but they’re still boxes with processors that run instructions from humans. AI changes that on at least three fronts: how computers are made, how they’re programmed, and how they’re used. Ultimately, it will change what they are for. The core of computing is changing from number-crunching to decision-making."
"This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research."
"In this series, I focus on the third trend [novel computing infrastructure capable of processing large amounts of data at massive scales and/or with fast turnaround times], and specifically, I will give a high-level overview of accelerators for artificial intelligence applications — what they are, and how they became so popular."
Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:
- Anyone for Spreadsheet games?
- NVIDIA Canvas looks fun to play around with (“See how AI can help you paint”)
- Some amazing geospatial visualisations in 30 days and 30 maps – ‘dotted oceans‘ was my favourite
- Entertaining approach to using ML in the creative process
- Perhaps a couple of New Year’s resolutions:
As we head into a new year, there are some depressing similarities with last year. New Omicron cases continue to skyrocket worldwide, with the UK at the forefront… Thank goodness for vaccinations.
- The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be an astonishing 1 in 25 people, by far the largest prevalence we have seen (over 2m people currently with coronavirus)… Back in May the prevalence was less than 1 in 1000.
- As yet the hospitalisation figures have not shown similar dramatic increases, although there are some worrying very recent trends.
- There is increasing frustration and bewilderment in the scientific community at the lack of UK Government action to stem the growth in cases.
Updates from Members and Contributors
- Mani Sarkar has conducted a two-part interview with Ian Ozsvald (PyData London founder) on Kaggling (see Twitter posts here and here, as well as a summary in Ian’s newsletter here).
- David Higgins has been very productive on topics in medical AI, digital health and data-driven business, posting an article a week from September through Christmas. Lots of excellent material here.
Again, we hope you found this useful. Please do send it on to your friends (we are looking to build a strong community of data science practitioners) and sign up for future updates here.
The views expressed are our own and do not necessarily represent those of the RSS