January Newsletter

Hi everyone-

Happy New Year! I hope you all had as relaxing a holiday period as possible and enjoyed the fireworks from around the world… London trumps them all as far as I’m concerned, although I’m clearly biased. As we all gear up for 2022, perhaps it is time for some thought-provoking data science reading material to help guide plans for the year ahead.

Below is the January edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully it contains some interesting topics and titbits to feed your data science curiosity.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science January 2022 Newsletter

RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid off (networking and introductions, advice on development etc.) don’t hesitate to drop us a line.

2021 has been a busy and productive year for the RSS Data Science and AI section, focusing on our goals of:

  • Supporting the career development of data scientists and AI specialists
  • Fostering good practice for professional data scientists
  • Providing the voice of the practitioner to policy-makers

A few edited highlights:

  • We kicked off our “Fireside Chat” series back in February with an amazing discussion with Andrew Ng attended by over 500 people, followed up with a similarly thought-provoking conversation with Anthony Goldbloom, founder of Kaggle, in May.
  • In March we hosted our inaugural Data Science Ethics Happy Hour, discussing a wide range of topics focused on ethical challenges with an experienced panel. We also hosted “Confessions of a Data Scientist” at the annual RSS conference based on contributions from you, our experienced data science practitioner readership.
  • Throughout the year we have engaged with various initiatives focused on the accreditation of data science. More recently we have been actively engaged in the UK Government’s AI Roadmap and strategy, first conducting a survey and publishing our findings and critiques (which were publicly acknowledged). We then hosted a well-attended event focused on the implications of the strategy and will be collaborating with the UK Government’s Office for AI to host a roundtable event on AI Governance and Regulation, one of the 3 main pillars of the UK AI Strategy.
  • … And we’ve managed to produce 12 monthly newsletters, expanding our readership

Our very own Jim Weatherall has co-authored a paper, “Really Doing Great at Estimating CATE?” which has been accepted to NeurIPS- many congrats Jim!

Meanwhile, Martin Goodson continues to run the excellent London Machine Learning meetup and is very active with events. The next talk will be on January 12th, when Alexey Bochkovskiy, research engineer at Intel, will discuss “YOLOv4 and Dense Prediction Transformers”. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

  • It’s not exactly breaking news that the ImageNet data set is very influential in driving image recognition AI research, but new research from the University of California and Google Research highlights the overall importance of these ‘benchmark’ datasets, which come largely from influential Western institutions, and frequently from government organisations.
"[We] find that there is increasing inequality in dataset usage globally, and that more than 50% of all dataset usages in our sample of 43,140 corresponded to datasets introduced by twelve elite, primarily Western, institutions."
  • TikTok is considered by many to have one of the best recommendation systems, driving phenomenal usage figures amongst its users. The NYTimes obtained an internal company document that offers a new level of detail about how the algorithm works. It’s clear that the algorithm optimises for retention and time spent, much like many other similar systems.
"The company’s edge comes from combining machine learning with fantastic volumes of data, highly engaged users, and a setting where users are amenable to consuming algorithmically recommended content (think how few other settings have all of these characteristics!). Not some algorithmic magic."
  • So TikTok is not doing anything inherently different to Facebook, Twitter and any other site that recommends content. And in this excellent in-depth article, MIT Technology Review walks through how ‘clickbait farms’ use these sites to spread misinformation.
"On an average day, a financially motivated clickbait site might be populated with celebrity news, cute animals, or highly emotional stories—all reliable drivers of traffic. Then, when political turmoil strikes, they drift toward hyperpartisan news, misinformation, and outrage bait because it gets more engagement."
"It’s not the most “interesting” stories that make their way to the top of your News Feed (the word “interesting” implying “valuable”), but the most emotional. The most divisive. The ones with the most Likes, Comments, and Shares, and most likely to spark debate, conflict, anger. Either that, or the content a brand was willing to spend the most money sponsoring—all of which reveals a disconcerting conclusion: as a user of these platforms, being forced to see what the algorithm and brands want you to see, you have no rights"
"Instead of fighting from the inside, I want to show a model for an independent institution with a different set of incentive structures."

Developments in Data Science…
As always, lots of new developments…

“It feels like Galileo picking up a telescope and being able to gaze deep into the universe of data and see things never detected before.”
  • In addition, DeepMind released Gopher, a new 280 billion parameter model, together with insight into the areas where parameter scaling helps, and where it is less important
"Our research investigated the strengths and weaknesses of those different-sized models, highlighting areas where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language. We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense tasks."

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

“The results are compelling. It's certainly opening a new class of antimicrobial peptides, and finding them in an unexpected place.”

How does that work?
A new section on understanding different approaches and techniques

"The performance of supervised learning tasks improves with more high-quality labels available. However, it is expensive to collect a large number of labeled samples. There are several paradigms in machine learning to deal with the scenario when the labels are scarce. Semi-supervised learning is one candidate, utilizing a large amount of unlabeled data in conjunction with a small amount of labeled data"
  • Given the increasing prevalence of PyTorch, this looks very useful – MiniTorch
"MiniTorch is a DIY teaching library for machine learning engineers who wish to learn about the internal concepts underlying deep learning systems. It is a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental. The final library can run Torch code. The project was developed for the course 'Machine Learning Engineering' at Cornell Tech."

Practical tips
How to drive analytics and ML into production

"We think about three large primitives: the ingest primitive in this chat interface, the transform interface, and the publisher interface. All of these apply to “data sets” – which could be tables, they could be models, they could be reports, dashboards, and all the other things that you mentioned. When you think of ingest, transform, publish, these are all operating on instead of storage. We are building the lakehouse architecture: our storage is GCS, Iceberg table format, plus Parquet. … Trino is our query engine."

Bigger picture ideas
Longer thought-provoking reads – lean back and pour a drink!

"Well, computers haven’t changed much in 40 or 50 years. They’re smaller and faster, but they’re still boxes with processors that run instructions from humans. AI changes that on at least three fronts: how computers are made, how they’re programmed, and how they’re used. Ultimately, it will change what they are for.
The core of computing is changing from number-crunching to decision-making."
"This post argues that we should develop tools that will allow us to build pre-trained models in the same way that we build open-source software. Specifically, models should be developed by a large community of stakeholders who continually update and improve them. Realizing this goal will require porting many ideas from open-source software development to building and training models, which motivates many threads of interesting research."
"In this series, I focus on the third trend [novel computing infrastructure capable of processing large amounts of data at massive scales and/or with fast turnaround times], and specifically, I will give a high-level overview of accelerators for artificial intelligence applications — what they are, and how they became so popular."

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

As we head into a new year, there are some depressing similarities with last year. The new Omicron variant has caused cases to skyrocket worldwide, with the UK at the forefront… Thank goodness for vaccinations.

  • The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be an astonishing 1 in 25 people, by far the largest prevalence we have seen (over 2m people currently with coronavirus)… Back in May the prevalence was less than 1 in 1000.
  • As yet the hospitalisation figures have not shown similar dramatic increases, although there are some worrying very recent trends.

Updates from Members and Contributors

  • Mani Sarkar has conducted a two-part interview with Ian Ozsvald (PyData London founder) on Kaggling (see Twitter posts here and here, as well as a summary in Ian’s newsletter here)
  • David Higgins has been very productive on topics in medical AI, digital health and data-driven business, posting an article a week from September through Christmas – lots of excellent material here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
