June Newsletter

Hi everyone-

It’s a bank holiday weekend – again – so that means it’s June and hopefully some warmer weather as May has definitely not delivered on that front … perhaps a few curated data science reading materials might prove useful for sunshine in the garden?

Following is the June edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity … We are continuing with our move of Covid Corner to the end to change the focus a little.

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science June 2021 Newsletter
RSS Data Science Section

Committee Activities

We are all conscious that times are incredibly hard for many people and are keen to help however we can- if there is anything we can do to help those who have been laid-off (networking and introductions help, advice on development etc.) don’t hesitate to drop us a line.

We are now ‘two for two’ on our ‘Fireside chat’ series! Following on from our fantastic discussion with Andrew Ng, Giles Pavey hosted an engaging and enlightening conversation with Anthony Goldbloom on May 20th. Anthony is founder and CEO of Kaggle (now a Google company), the world’s largest data science and machine learning community. There was a great deal of insight into the evolution of data science over the 10 years Kaggle has been running as well as lots of audience questions. We will distill the session down and publish a summary shortly.

We will soon be releasing a survey to our readers and members focused on the UK Government’s proposed AI Strategy. We are passionate about making sure the government focuses on the right things in this area, and feel like true Data Science and AI practitioners need to feed into this process. So when you see the survey, do please take the time to fill it out if you can!

The full programme for this year’s RSS Conference, which takes place in Manchester from 6-9 September, has been confirmed. The programme includes keynote talks from the likes of Hadley Wickham, Bin Yu and Tom Chivers. Registration is open with early-bird discounts available until Friday 4 June.
In addition, the RSS now has a new accreditation – Data Analyst.

Data Analyst is a registered form of professional membership status that provides formal recognition of a member’s statistical training and work-based experience at entry level

Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and is very active with virtual events. The last event was on 24th May where Christian Szegedy, machine learning and AI researcher at Google Research, gave a talk titled ‘The Inverse Mindset of Machine Learning’. Videos are posted on the meetup YouTube channel – and future events will be posted here.

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…
Bias, ethics and diversity continue to be hot topics in data science…

Sadly, the stream of AI misuse, bias and ethically questionable use-cases continues to hit the news:
- Police stations in China have been testing AI emotion-detection software on Uyghurs.
- Google released a new dermatology-focused app, but apparently it struggles to work with darker skin tones.
Of course part of the driver of the misuse is the increasingly widespread availability of the underlying functionality:
- The facial recognition site PimEyes allows anyone to search for anyone across the web for free…
- Strikingly realistic ‘AI dubbing technology’ from Flawless: anyone fancy De Niro’s “you talkin’ to me” in flawless German?
- This piece for O’Reilly by Mike Loukides highlights the underlying issue:

The real danger wasn’t “Deep Fakes.” The real danger is cheap fakes, fakes that can be produced quickly, easily, in bulk, and at virtually no cost

Regulators are rightly becoming increasingly active in an attempt to combat these issues. This HBR article helps map out what organisations need to know to be prepared.
We all know how complex ML models are becoming and the scale at which some of them now operate, and so we have to be open to the fact that mistakes will happen. The critical question becomes: what do you do about it when the issue surfaces? Twitter has taken a positive and transparent approach to dealing with some of their previous bias related issues in automated cropping, releasing a detailed and technical analysis about why it was happening and the steps they are taking to remove the bias:

We want to thank you for sharing your open feedback and criticism of this algorithm with us. As we discussed in our recent blog post about our Responsible ML initiatives, Twitter is committed to providing more transparency around the ways we’re investigating and investing in understanding the potential harms that result from the use of algorithmic decision systems like ML.

Really interesting discussion on Kara Swisher’s Sway podcast with Daniel Kahneman (renowned behavioural economist – “Thinking, Fast and Slow”) delving into why we require much higher accuracy from computers and technology than from humans before we are willing to trust them.
And in a similar vein, this is thought provoking– does more data necessarily mean better decision making?
Less specifically focused on bias and ethics, but really interesting commentary from Benedict Evans on Amazon and how much it really knows about what it sells, touching on how much of a responsibility a platform has for moderation of its own recommendation content.

Of Amazon’s top 50 best-sellers in “Children's Vaccination & Immunisation”, close to 20 are by anti-vaccine polemicists, and 5 are novels about fictional pandemics

Developments in Data Science…
As always, lots of new developments…

The combined Google/DeepMind research teams have had a busy month!
- A Google research team found that replacing the self-attention layers in transformers with Fourier Transforms could help improve model efficiency and train times (“92% of BERT accuracy with training times 7 times faster”)
- Intriguing research trying to understand whether Deep Networks learn the same things as Wide Networks ….sometimes!
- Edging Reinforcement Learning closer and closer to ‘real world’ applications – “Eigen Game” from DeepMind – this looks really interesting.
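
For a feel of what the Fourier Transform idea in the first bullet above means in practice, here is a minimal sketch. It assumes (as I understand the paper) that the token-mixing step is simply the real part of a 2D discrete Fourier transform taken over the sequence and hidden dimensions, with no learned weights. The function name `fourier_mixing` and the toy shapes are my own, and a real model would wrap this step in residual connections, layer normalisation and feed-forward layers.

```python
import numpy as np


def fourier_mixing(x: np.ndarray) -> np.ndarray:
    """Mix information across tokens with a parameter-free 2D DFT.

    x has shape (seq_len, hidden_dim); the output has the same shape.
    Only the real part of the transform is kept.
    """
    return np.fft.fft2(x).real


# Toy input: 8 "tokens", each a 4-dimensional embedding (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))

mixed = fourier_mixing(tokens)
print(mixed.shape)
```

The point of the sketch is that every output element depends on every input token, which is the job attention normally does, but there is nothing to train in this step.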
To all those researchers pushing the boundaries out there (who I know read this newsletter!)- $100m up for grabs from OpenAI!
With Hinton at Google Research, LeCun at Facebook Research, and now Bengio at Apple … good to know the Deep Learning ‘OGs’ are all gainfully employed!

Real world applications of Data Science
Lots of practical examples making a difference in the real world this month!

Interesting announcements in the world of health care:
- I’m in awe of these types of real-world applications of AI, that require delicate physical manipulation as well as advanced detection and response… ‘robo-surgeons’.
- Facebook’s AI Research team have announced a method to help accelerate discovery of effective new drug combinations
- Although some useful commentary from the Yale School of Medicine about the morality of some AI use cases in health care.
Helping save the whales…. using drones and ML to track whale populations at scale.
Sorting archaeological pottery fragments with Convolutional Neural Networks.
This would have been a fun project- developing Major League Baseball’s automatic strike detection system!
On the sporting front, DeepMind has been busy as well
Another example of automated processing of satellite imagery- this time for early detection of wildfires.
This is visually amazing- creating endless streaming 3D footage from a single photograph – definitely worth checking out the YouTube footage.
Increasing real world applications built on-top of large NLP models like BERT and GPT2/3:
- Using AI to catch defamation
- Writing computer code
- Predictive shell commands anyone?
- But some useful commentary from Stanford’s Human Centered AI group about where NLP processes still stumble and the implications for use cases in the legal world
Pong, Space Invaders, Chess, Go…. and now crosswords – Fun article digging into ‘Dr. Fill’ the recently victorious crossword puzzle automaton at the US national championship – watch those letters fly!

How does that work?
A new section on understanding different approaches and techniques

Useful step by step blog post from Dropbox on how they do image search – interesting to see the crossover between NLP and Image Processing with the use of Word Vectors.
Useful tutorial on Monte Carlo simulation from Gabriel Carvalho
Causal inference is hard… at least I find it difficult, and struggle to find useful worked through real world examples. This is one of the better ones I have found from Vahe Hakobyan – I hope he carries on with the series!
Use Bayesian Inference to move away from point estimates
How prevalent is Simpson’s Paradox in real life?
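
If Simpson’s Paradox is new to you, a small worked example may help. This is a sketch of my own rather than anything from the linked article: the counts are the widely quoted kidney-stone treatment figures often used to teach the paradox, and the helper names `rate` and `overall` are hypothetical. Treatment A wins within each stone size, yet B looks better once the groups are pooled, because A was given far more of the hard (large stone) cases.

```python
# (successes, patients) for each treatment, split by stone size.
data = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}


def rate(successes: int, total: int) -> float:
    """Success rate for a single group."""
    return successes / total


def overall(groups: dict) -> float:
    """Success rate after pooling all groups for one treatment."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes / total


for size in ("small", "large"):
    a = rate(*data["A"][size])
    b = rate(*data["B"][size])
    print(f"{size} stones: A={a:.3f} B={b:.3f}")

print(f"pooled: A={overall(data['A']):.3f} B={overall(data['B']):.3f}")
```

The takeaway is to check how cases were allocated across subgroups before trusting a pooled comparison.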

Getting it live
How to drive ML into production

We’ve talked before about the concept of MLOps – streamlining the process for getting ML models into production. Doing this well is critical to efficiently driving valuable outcomes from machine learning. Andrew Ng is a big proponent of this, and has just released what looks to be a fantastic MLOps specialisation on Coursera.

"For me, teaching this course was an unusual experience. MLOps standards and tools are still evolving, so it was exciting to survey the field and try to convey to you the cutting edge. I hope you will find it equally exciting to learn about this frontier of ML development, and that the skills you gain from this will help you build and deploy valuable ML systems." Andrew Ng

As Andrew says, the tools available are still evolving, although there are some leading contenders often developed in-house by early pioneers like Google, Facebook, Netflix etc:
- The CNN data science team talk through their implementation of Metaflow – Netflix’s open source MLOps tooling set
- The Spotify team gives an in-depth review of their ‘winding’ path to MLOps and their implementation of TFX (TensorFlow Extended) on Kubeflow (full disclosure: this is the direction we are currently pursuing at my day job). TensorFlow Extended is an open source project originating in Google – useful background here
Keeping track of data within an organisation is notoriously hard. Twitter’s engineering blog has a nice piece on how they use Elasticsearch and neural nets to help find and categorise what data is actually contained in their different environments.
This looks like an interesting project – ‘Flat Data’ – making it easy to work with data in Git and GitHub.
I think this is good acknowledgement of a new role emerging in the data space – that of an Analytics Engineer

The Art of Visualisation
Making data science look right..

Excellent post from The Economist about when it is appropriate to ‘break the rules’ with visualisations, complete with examples.
If you need to create interactive web based visualisations, knowing some javascript is increasingly useful:
- This post from Mike Bostock, the originator of the amazing D3 javascript library, highlights the key benefits and increasing functionality
- Observable looks like a strong new addition to the world of js visualisation libraries
For the purists out there, what font should you use? Michael Li analysed 1000 of the top websites to find out…’sans-serif’ is apparently the way to go.
Less about visualisation, more about how to best get your message across – great article from Tim Harford on why conspiracy theorists are so hard to reason with, with some useful practical tips.
If you like your info in audio form, some interesting commentary on David Spiegelhalter’s Risky Talk podcast on communicating evidence to policymakers

Practical Projects and Learning Opportunities
As always here are a few potential practical projects to keep you busy:

Covid Corner

Again, more positive progress in the UK on the Covid front with over 40m people now having received their first vaccine dose and over 25m fully vaccinated. However, the new variant originating in India is cause for concern.

The latest ONS Coronavirus infection survey estimates the current prevalence of Covid in the community in England to be roughly 1 in 1100. This is still an improvement from last month (1 in 1000) but it is clear that the relaxing of lockdown restrictions combined with the new ‘Indian strain’ are reducing the rate of improvement. Indeed, the government dashboard shows cases, hospitalisations and deaths are in fact slightly up in the last week.
The B.1.617.2 Variant originating in India is spreading rapidly in some areas of the UK, and is proving to be at least 50% more transmissible than the previously dominant ‘Kent strain’. However, the great news is that two doses of the AstraZeneca or Pfizer vaccine do seem to be effective against the new strain.
Recent updates of various epidemiological models, including current vaccine levels as well as transmission rates of the new strain highlight the tenuous balance we tread between opening up and risking another devastating wave. Clearly keeping moving quickly with vaccinations is critical, and the announcement of the approval of the single dose Janssen vaccine could prove useful.
How good have these predictive models proved to be? Not brilliant, that’s for sure, but quite a bit better than ‘experts’:

 Experts gave a median estimate of 30,000 Covid deaths by the end of the year, whereas the non-experts said 20,000. The truth was around 75,000

One of the biggest drivers to getting a handle on the pandemic has been understanding how it actually transmits from person to person. It is now clear that the virus is able to transmit through aerosols – smaller respiratory particles that can float – in addition to droplets – which are expelled from the mouth and quickly fall to the ground. However, this was not thought to be the case originally, and there have been some interesting investigative pieces digging into why it took so long to come to this conclusion. In fact Wired traces it back to a 60-year-old misunderstanding between biologists and physicists!
In case we needed reminding of the devastating cost of Covid, The Economist has attempted to estimate the total global excess deaths so far… 7-13m.
Finally, interesting insight into what it takes to manufacture the Pfizer vaccine.

Updates from Members and Contributors

Harald Carlens has put together a very useful comparison of cloud GPU services and pricing – definitely check it out if you are using deep learning in the cloud.
Lucie Burgess would like to announce an interesting set of discussions around the provenance and legality of automated decisions taking place on June 15th and June 22nd. Helix Data Innovation are running the sessions on behalf of the PLEAD project (King’s College London, University of Southampton, with partners Experian, Roke and Southampton Connect) – sign up here for what should be a good discussion on a very relevant topic.
Kevin O’Brien highlights the upcoming useR! 2021 conference on 5-9th of July – a must see for those R users out there.

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS
