Well, April is now behind us, so it is high time to pull together the May edition of our Royal Statistical Society Data Science Section newsletter. Rather than dwell solely on COVID-related data science topics (of which we know there are plenty), we thought we’d try to bring a more balanced assortment of updates and reading materials this month. We hope these prove entertaining or at least help pass the lockdown hours away.
As always, any and all feedback is most welcome! If you like these, please do send them on to your friends (we are looking to build a strong community of data science practitioners) and sign up for future updates here.
Industrial Strength Data Science May 2020 Newsletter
RSS Data Science Section
As we mentioned last time, we have unfortunately had to postpone our meetup/discussion meeting plans for the time being, and so have been focused on ‘spreading the word’ of industrial data science in other ways.
- Fresh from a widely circulated post about statistical models in which he succinctly articulated key rules for communicating about uncertainty, our irrepressible chair, Martin Goodson, has called out the #masks4all campaign for its questionable use of evidence.
- Martin is also hosting an upcoming London Machine Learning meetup (online): it’s on Wednesday May 13th and includes some of the leading reinforcement learning researchers at DeepMind and UCL.
- Janet Bastiman has released a pair of excellent videos on agile data science best practices, well worth a look (part 1 and part 2).
- The RSS Data Science Section ran an extensive survey of data science practitioners, and Adam Davison has a great post digging into the results. An interesting key finding is the importance of senior managers with technical skills to the success of data science in an organisation, something we are huge advocates of on the DSS committee.
- Danielle Belgrave is organising two sizeable upcoming conferences (ICML and NeurIPS) and is keen to encourage paper submissions from the DSS community.
We can’t not talk about COVID-19 and there is certainly plenty to discuss from a data science perspective.
- First of all on the research front, there have been a number of good papers digging into some of the specifics of how the virus appears to spread.
- ScienceMag gives a nice overview of the prevalent features in infectious disease dynamics.
- “Temperature, Humidity and Latitude Analysis…” attempts to tease out relationships between virus spread and various climate and weather factors and proposes a seasonal model.
- Meanwhile, “Social Distancing Strategies…” attempts to unpick the different components of social distancing based on the various approaches taken around the world.
- Statnews gives some useful guidance on how to interpret the studies now emerging on community spread (serological surveys). It highlights how important it is to understand the underlying sampling and testing methods and how they differ across studies.
- The focus on modelling has led to a certain amount of tension between epidemiological domain experts and more generalist machine learning practitioners. This useful piece in Nature highlights the areas where data scientists can provide the most useful input.
- One area where data science has found success is in identifying existing treatments that might have potential for combating the virus. The NYTimes provides a good summary of BenevolentAI’s work on baricitinib.
- And this short piece from the Smith Institute highlights how fragile some ML-based time series models are in the face of behavioural shocks (such as the one we are experiencing now).
It’s great to see some of the principles Martin highlighted in his post on communicating model uncertainty gaining more widespread airtime:
- The Washington Post has focused commentary on the importance of not relying on a single model and why model diversity is positive.
- We are beginning to see more of the source code of prevalent models being publicly shared, a key component for gaining adequate peer review and trust.
- We are also seeing more resources for combating disinformation and media manipulation, such as this excellent (and freely downloadable) “Verification Handbook”.
- Also great to see the RSS (Royal Statistical Society) providing support and guidance in this area with the launch of the RSS COVID-19 task force.
Elsewhere in Data Science
For those looking for a break from all the COVID talk, there is certainly plenty to keep you occupied:
- We have seen a fair amount of press over the last couple of years about the opportunity to leverage machine learning (and deep learning in particular) for disease diagnosis. This paper in the BMJ presents a systematic review of these studies and finds that while there is promise, they have yet to fully prove themselves in an operational setting, highlighting the disconnect that can often occur between prototypes and fully operational systems.
- Multi-class categorisation problems can be fun in simplified Kaggle-style environments but become increasingly painful at scale.
- Shopify do a nice job of talking through their approach to categorising products at scale, which leverages a little-discussed technique called Kesler’s Construction.
- Meanwhile Shoprunner unveil their open-sourced deep learning framework (Tonks) for tackling the same problem.
- For some lovely perspective on the scientific process, have a read of this piece from a lecture by Richard Hamming at Bell Labs in 1986.
- On a similar historic theme, how many of these female data science pioneers have you heard of? Improving the gender imbalance and general diversity of data science practitioners is something we are passionate about on the committee.
- Speaking of reducing bias, it is good to see that Google have been attempting to reduce gender bias in their translation services by focusing on the scalability and generalisability of their solutions.
- Meanwhile DeepMind has been digging into how to combat the issue of ‘gaming the system’, where machine learning approaches find unintended and unwanted ways of optimising an objective function.
For those wanting a bit more of a hands-on project…
- Check out this newly released trove of free downloadable data science resources, including “the bible”: Elements of Statistical Learning.
- How about building your own speech recognition system based on PyTorch?
- Then maybe plug in your own chatbot based on the newly open-sourced Blender framework from Facebook.
- And after all that hard work, sit back and listen to some AI-generated hip-hop from OpenAI.
Updates from Members and Contributors
- Mani Sarkar has been compiling what looks to be an excellent set of data science resources on GitHub.
- Charles Radclyffe thoroughly recommends the MLOps Community which has been hosting a number of online meetups. They meet every Wednesday on Zoom at 5pm.
- Harald Carlens has pulled together a useful dashboard of ML competitions across the web.
- Finally, Glen Wright Colopy is keen to promote the American Statistical Association’s sponsorship of a new healthcare technology podcast, “The Pod of Asclepius”, where data scientists, statisticians, engineers, and regulatory experts discuss the technical challenges in their healthcare domain.
Again, we hope you found this useful. Please do send it on to your friends (we are looking to build a strong community of data science practitioners) and sign up for future updates here.