Well, we have made it to summer (at least the British variety) – and it’s time for the July edition of our Royal Statistical Society Data Science Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity while enjoying a socially distanced day on the beach with everyone else…
As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here:
Industrial Strength Data Science July 2020 Newsletter
RSS Data Science Section
COVID-19 is not going anywhere soon and while talk turns to relaxing of regulations and local lockdown measures there are some interesting discussions and evaluations of the initial modelling approaches.
- Firstly, we came across this interesting GitHub repository, which attempts to objectively evaluate the various publicly available COVID-19 forecast models out there. It’s great to properly ‘mark the homework’ of some of the experts, although of course it can be hard to realistically evaluate forecasts when actions are taken on the back of them.
- In addition there was this interesting discussion from Andrew Gelman about a Nassim Taleb paper. The paper in question (“on single point forecasts for fat tailed distributions” – exciting, I know) is well worth a quick read (it’s short…) and highlights the almost impossible task of forecasting a mean when the underlying distribution is “fat-tailed”.
- The RSS has a “COVID-19 Task Force” which posts updates and announcements here
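To get a feel for the fat-tailed point above, here is a minimal sketch (ours, not taken from the paper or the discussion). It assumes a Pareto distribution with shape 1.1 as the fat-tailed example and an exponential distribution as the thin-tailed comparison; the function names, sample sizes and seed are illustrative choices. It repeats the same "estimate the mean" experiment several times and reports how much the estimates disagree with each other.

```python
import random
import statistics


def sample_mean(draw, n):
    """Average of n independent draws from the zero-argument function `draw`."""
    return sum(draw() for _ in range(n)) / n


def relative_spread(draw, n=10_000, runs=30):
    """Range of the sample mean across repeated runs, relative to its median."""
    means = [sample_mean(draw, n) for _ in range(runs)]
    return (max(means) - min(means)) / statistics.median(means)


if __name__ == "__main__":
    random.seed(2020)  # arbitrary seed, only for repeatability

    # Assumed fat-tailed example: Pareto with shape alpha = 1.1.
    # Its true mean is alpha / (alpha - 1) = 11, but its variance is infinite.
    alpha = 1.1
    fat = relative_spread(lambda: random.paretovariate(alpha))

    # Thin-tailed comparison: exponential with mean 1.
    thin = relative_spread(lambda: random.expovariate(1.0))

    print(f"fat-tailed  (Pareto, alpha={alpha}): relative spread {fat:.2f}")
    print(f"thin-tailed (exponential):          relative spread {thin:.2f}")
```

With thin tails the repeated estimates should sit close together, while with the fat-tailed distribution a single extreme draw can dominate an entire sample, so the estimates of the "same" mean can differ wildly from run to run.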
It continues to be relatively quiet for committee members, although we are still playing an active role in joint RSS/British Computer Society/Operational Research Society discussions on data science accreditation and attempting to help define and drive the RSS’s overall Data Science strategy.
Martin Goodson, our chair, continues to run the excellent London Machine Learning meetup and has been active in lockdown with virtual events. Last week there was a great discussion of “Cross-Lingual Transfer Learning” by Sebastian Ruder (DeepMind) and coming up next on July 9th is the catchily titled “Make VAEs Great Again” from Max Welling. All the talks (past and future) are available here – a very useful resource.
Elsewhere in Data Science
Lots of non-Covid data science going on, as always!
As AI and Machine Learning approaches gain traction in more and more facets of society, algorithmic decisions are more directly affecting people’s lives and livelihoods. This makes algorithmic bias an increasingly important topic with far-reaching implications for individuals and society.
Facial recognition is a case in point. A growing body of research has shown that the accuracy of facial recognition systems is far from uniform across different racial and gender groups. This, combined with its increasing use and misuse, has led to recent announcements from leading providers. IBM announced they will no longer offer, develop or research facial recognition technology. Similarly, Amazon has put curbs on who is allowed to use its Rekognition service.
Much of this recent activity has been driven by the publication of objective analysis of the relative performance of the different systems, which can be viewed at gendershades.org. It has sparked ongoing discussion and controversy amongst key luminaries of the data science community regarding the underlying source of the bias, with some interesting coverage in VentureBeat and The Verge.
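As a rough illustration of what this kind of per-group evaluation involves, here is a minimal sketch. It is not the Gender Shades methodology: the records, group names and labels below are entirely made up for illustration, and the function names are our own. The point is simply that a single headline accuracy figure can hide a large gap between groups.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Accuracy per group from (group, predicted, actual) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, predicted, actual in records:
        total[group] += 1
        correct[group] += int(predicted == actual)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Difference between the best and worst per-group accuracy."""
    return max(per_group.values()) - min(per_group.values())


# Hypothetical toy data, invented purely for illustration.
toy_records = [
    ("group_a", "f", "f"), ("group_a", "m", "m"),
    ("group_a", "f", "f"), ("group_a", "m", "m"),
    ("group_b", "f", "f"), ("group_b", "m", "f"),
    ("group_b", "m", "m"), ("group_b", "m", "f"),
]

if __name__ == "__main__":
    per_group = accuracy_by_group(toy_records)
    overall = sum(p == a for _, p, a in toy_records) / len(toy_records)
    print("overall accuracy:", overall)
    print("per-group accuracy:", per_group)
    print("gap between best and worst group:", accuracy_gap(per_group))
```

On the toy data the overall figure looks respectable, yet every error falls on one group, which is the pattern a per-group breakdown is designed to expose.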
As far back as 2018 (a long time in “AI” years), the Council of Europe was highlighting potential areas at risk of discrimination from biases in algorithmic decision making. Specific at-risk decision-making systems called out included predictive policing, benefits eligibility, job applicant screening, and loan and credit eligibility. It is well documented how discriminatory biases exist in many walks of life (Invisible Women, by Perez, is an excellent example), and if models are blindly trained on existing historical data then we should not be surprised if this discrimination perpetuates.
Academia is increasingly addressing the issue, with initiatives in many well known research institutes including Cornell and Turing. There is also interesting discussion around what responsibilities lie with individual data scientists as they develop these systems. Wired magazine recently advocated a ‘council of citizens’ approach.
In many ways, although some of these machine learning techniques have been around for some time, the packaging up of them into automated services is still relatively new. Putting together the right approaches and frameworks for how we decide what is acceptable and how we assess bias and impact is crucial and something the Data Science Section is passionate about.
(Many thanks to my colleague Weiting Xu for pulling together this selection of posts on bias.)
We mentioned the OpenAI announcement of the 175 billion parameter GPT-3 model last time. Now the model is available via an API, and there are various commentary pieces exploring its performance. Chuan Li discusses where the model excels but highlights how the training of these types of models is increasingly out of reach of all but the large organisations (“Training GPT-3 would cost over $4.6M”).
Intriguingly, OpenAI have found they can use the same model design to create coherent image completions.
Getting stuff into production
We all know that there is a significant difference between getting encouraging results in a prototype, and encapsulating the model in a live production environment. This is a great example talking through some of the trials and tribulations involved.
MLOps is a term that has sprung up to describe these challenges, and the leading cloud providers are attempting to help with automated services that remove some of the painful steps. There are all sorts of tools and frameworks now available attempting to help- this post looked at over 200 tools. GitHub have recently shown how their “GitHub Actions” can be used in a similar way, which could be an option for engineering teams already leveraging the GitHub ecosystem.
As always here are a few potential practical projects to while away the lockdown hours:
- Lots of home videos to enjoy? Why not build your own searchable video archive.
- How about a hands on approach to cartography: build your own lego relief map.
- Feeling a lack of inspiration in the kitchen? Perhaps a TensorFlow-driven recipe generator will spark your interest?
Updates from Members and Contributors
- Kevin O’Brien would like to make readers aware of JuliaCon 2020:
“JuliaCon 2020 will be held as a virtual conference in late July. The full schedule has been announced and can be found here.
We are particularly keen to get new users to participate, and are running a series of workshops each day from Friday 24th to Tuesday 28th July. The talks will take place from Wednesday 29th to Friday 31st July.
The event is 100% free, but you must register in advance with Eventbrite”
- Glen Wright Colopy would like to announce:
“The American Statistical Association (along with KISS, IISA, IET, SSA, and ICSA) is sponsoring a fall video series on “The Philosophy of Data Science”. The series is aimed at incoming statistics and data science students (but will be of significant interest to the general statistics / data science community). The topics will focus on how scientific reasoning is essential to the practice of data science.
We’ve already confirmed a lineup of top speakers for the first three sessions, and will be adding more shortly.
The series website is here and the mailing list is here”
Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here: