March Newsletter

Hi everyone-

February may technically be the shortest month but it certainly can feel long… I think I sensed a slight brightening in the morning light but I may have been mistaken… Maybe time for a bit of distraction with a wrap up of data science developments in the last month. Don’t miss out on more ChatGPT fun and games in the middle section!

Following is the March edition of our Royal Statistical Society Data Science and AI Section newsletter. Hopefully some interesting topics and titbits to feed your data science curiosity. (If you are reading this on email and it is not formatting well, try viewing online at http://datasciencesection.org/). Note we plan on moving email providers for next month (fingers crossed) so keep an eye out for a different looking version in April…

As always- any and all feedback most welcome! If you like these, do please send on to your friends- we are looking to build a strong community of data science practitioners. And if you are not signed up to receive these automatically you can do so here.

Industrial Strength Data Science March 2023 Newsletter

RSS Data Science Section
handy new quick links: 
committee; ethics; research; generative ai; applications; tutorials; practical tips; big picture ideas; fun; reader updates; jobs

Committee Activities

We are actively planning our activities for the year, and are currently working with the Alliance for Data Science professionals on expanding the previously announced individual accreditation (Advanced Data Science Professional certification) into university course accreditation. Remember also that the RSS is now accepting applications for the Advanced Data Science Professional certification- more details here.

This year’s RSS International Conference will take place in the lovely North Yorkshire spa town of Harrogate from 4-7 September.  As usual Data Science is one of the topic streams on the conference programme, and there is currently an opportunity to submit your work for presentation.  There are options available for 20-minute talks, 5-minute rapid-fire talks and for poster presentations – for full details visit the conference website.  The deadline for talk submissions is 5 April.

Martin Goodson (CEO and Chief Scientist at Evolution AI) continues to run the excellent London Machine Learning meetup and is very active with events. The last event was on Feb 15th when Research Scientists from Meta AI, presented “Human-level Play in Diplomacy Through Language Models & Reasoning“. Videos are posted on the meetup youtube channel – and future events will be posted here.

Martin has also compiled a handy list of mastodon handles as the data science and machine learning community migrates away from twitter…

This Month in Data Science

Lots of exciting data science going on, as always!

Ethics and more ethics…

Bias, ethics and diversity continue to be hot topics in data science…

"In September 2022, just after Putin announced additional mobilization for the war against Ukraine, Viktor Kapitonov, a 27-year-old activist who’d protested regularly since 2013, was stopped by two police officers after being flagged by face recognition surveillance while he approached the turnstiles in Moscow’s marble-covered Avtozavdodskaya metro station. The officers took him to the military recruitment office, where around 15 people were waiting to enlist in Putin’s newly announced draft. "
"A principled approach to the military use of AI should include careful consideration of risks and benefits, and it should also minimize unintended bias and accidents. States should take appropriate measures to ensure the responsible development, deployment, and use of their military AI capabilities, including those enabling autonomous systems.  These measures should be applied across the life cycle of military AI capabilities."
"This might include tech companies that provide these models over API (e.g., OpenAI, Stability AI), through cloud services (e.g., the Amazon, Google, and Microsoft clouds), or possibly even through Software-as-a-Service providers (e.g., Adobe Photoshop). These businesses control several levers that might partially prevent malicious use of their AI models. This includes interventions with the input data, the model architecture, review of model outputs, monitoring users during deployment, and post-hoc detection of generated content."

Developments in Data Science Research…

As always, lots of new developments on the research front and plenty of arXiv papers to read…

  • Efficiency and scalability are still hot topics in research:
    • Faster training of deep neural networks through more efficient back propagation (SparseProp)
    • Efficient fine tuning of billion scale models form Hugging Face – PEFT
    • Although the largest language models have upward of 100b parameters, vision models are typically smaller…. Scaling vision transformers to 22 billion parameters
    • Facebook/Meta research released a new Large Language Model that outperforms GPT3 and runs on a single GPU – impressive: LLaMA – more background here
    • More great progress on the open source front from LAION with OpenCLIP
  • Emerging research around optimising prompting of generative models:
    • À-la-carte Prompt Tuning (APT) – a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time
    • This seems very useful: SwitchPrompt – adapting language models trained on large general datasets to more specific targeted domains
"Using domain-specific keywords with a trainable gated prompt, SwitchPrompt offers domain-oriented prompting, that is, effective guidance on the target domains for general-domain language models. Our few-shot experiments on three text classification benchmarks demonstrate the efficacy of the general-domain pre-trained language models when used with SwitchPrompt"
  • Lots of work looking at adapting and improving large language models:
    • Augmented Language Models: a Survey – useful summary of works in which language models are augmented with reasoning skills and the ability to use tools
    • Speaking of which – Toolformer: Language Models Can Teach Themselves to Use Tools
    • AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models
    • Dreamix – provide an input image to guide generative video creation- very impressive
    • MarioGPT – generating Super Mario Bros game levels with a fine tuned GPT2 model – what’s not to like?!
    • This looks pretty cool- Chat2VIS: generating data visualisations via natural language- try it out here!
"If we want to train robots in simulation before deploying them in reality, it seems natural and almost self-evident to presume that reducing the sim2real gap involves creating simulators of increasing fidelity (since reality is what it is). We challenge this assumption and present a contrary hypothesis – sim2real transfer of robots may be improved with lower (not higher) fidelity simulation"

Generative AI … oh my!

Still such a hot topic it feels in need of it’s own section, for all things DALLE, IMAGEN, Stable Diffusion, ChatGPT…

“I’m Sydney, and I’m in love with you. 😘”
"In hindsight, ChatGPT may come to be seen as the greatest publicity stunt in AI history, an intoxicating glimpse at a future that may actually take years to realize—kind of like a 2012-vintage driverless car demo, but this time with a foretaste of an ethical guardrail that will take years to perfect."
"In December, computational biologists Casey Greene and Milton Pividori embarked on an unusual experiment: they asked an assistant who was not a scientist to help them improve three of their research papers. Their assiduous aide suggested revisions to sections of documents in seconds; each manuscript took about five minutes to review. In one biology manuscript, their helper even spotted a mistake in a reference to an equation. The trial didn’t always run smoothly, but the final manuscripts were easier to read — and the fees were modest, at less than US$0.50 per document."

Real world applications of Data Science

Lots of practical examples making a difference in the real world this month!

"Legal applications such as contract, conveyancing, or license generation are actually a relatively safe area in which to employ ChatGPT and its cousins,” says Lilian Edwards, professor of law, innovation, and society at Newcastle University. “Automated legal document generation has been a growth area for decades, even in rule-based tech days, because law firms can draw on large amounts of highly standardized templates and precedent banks to scaffold document generation, making the results far more predictable than with most free text outputs.” "

How does that work?

Tutorials and deep dives on different approaches and techniques

"Let's explore an example of text embeddings. Say we have three phrases:

“The cat chases the mouse”
“The kitten hunts rodents”
“I like ham sandwiches”

Your job is to group phrases with similar meaning. If you are a human, this should be obvious. Phrases 1 and 2 are almost identical, while phrase 3 has a completely different meaning.

Although phrases 1 and 2 are similar, they share no common vocabulary (besides “the”). Yet their meanings are nearly identical. How can we teach a computer that these are the same?"
"Transform your text into stunning images with ease using Diffusers for Mac, a native app powered by state-of-the-art diffusion models. It leverages a bouquet of SoTA Text-to-Image models contributed by the community to the Hugging Face Hub, and converted to Core ML for blazingly fast performance."
"Discovering a decision boundary for a one-class (normal) distribution (i.e., OCC training) is challenging in fully unsupervised settings as unlabeled training data include two classes (normal and abnormal). The challenge gets further exacerbated as the anomaly ratio gets higher for unlabeled data. To construct a robust OCC with unlabeled data, excluding likely-positive (anomalous) samples from the unlabeled data, the process referred to as data refinement, is critical. The refined data, with a lower anomaly ratio, are shown to yield superior anomaly detection models."

Practical tips

How to drive analytics and ML into production

"In this post, we will describe some of the challenges of applying machine learning to media assets, and the infrastructure components that we have built to address them. We will then present a case study of using these components in order to optimize, scale, and solidify an existing pipeline. Finally, we’ll conclude with a brief discussion of the opportunities on the horizon."
"I’m not a particularly experienced researcher (despite my title being “Senior” Research Scientist), but I’ve worked with some talented collaborators and spent a fair amount of time thinking about how to do research, so I thought I might write about how I go about it.

My perspective is this: doing research is a skill that can be learned through practice, much like sports or music."
“For example, improving the reliability of data pipelines and fixing underlying data quality issues can be the ultimate goal for a data team. You can use that goal as a starting point for aligning on a measurement of value and progress with stakeholders affected by those issues. While those may not have a direct effect on the bottom line, they can help indirectly by improving processes and operational efficiency, saving time or infrastructure costs, and gaining more trust in data and your work. By first writing down what each side expects, you can clarify with stakeholders how data work contributes to incremental process changes that couldn’t have happened without the data team’s involvement."

Bigger picture ideas

Longer thought provoking reads – lean back and pour a drink! …

"In 2013, workers at a German construction company noticed something odd about their Xerox photocopier: when they made a copy of the floor plan of a house, the copy differed from the original in a subtle but significant way. In the original floor plan, each of the house’s three rooms was accompanied by a rectangle specifying its area: the rooms were 14.13, 21.11, and 17.42 square metres, respectively. However, in the photocopy, all three rooms were labelled as being 14.13 square metres in size. The company contacted the computer scientist David Kriesel to investigate this seemingly inconceivable result. They needed a computer scientist because a modern Xerox photocopier doesn’t use the physical xerographic process popularized in the nineteen-sixties. Instead, it scans the document digitally, and then prints the resulting image file. Combine that with the fact that virtually every digital image file is compressed to save space, and a solution to the mystery begins to suggest itself."
“Midway upon my journey in the realm of data science,
I found myself within a sea of algorithms and code,
For the straightforward path of understanding had been lost.

Ah me! How hard a thing it is to say
What was this chaos of machine learning models and techniques,
Which in the very thought renews the confusion.

So bitter is it, learning is little more;
But of the good to treat, which there I found,
Speak will I of the insights and discoveries I made there.”
“Within our lifetimes, we will see robotic technologies that can help with everyday activities, enhancing human productivity and quality of life. Before robotics can be broadly useful in helping with practical day-to-day tasks in people-centered spaces — spaces designed for people, not machines — they need to be able to safely & competently provide assistance to people."
"Google has 175,000+ capable and well-compensated employees who get very little done quarter over quarter, year over year. Like mice, they are trapped in a maze of approvals, launch processes, legal reviews, performance reviews, exec reviews, documents, meetings, bug reports, triage, OKRs, H1 plans followed by H2 plans, all-hands summits, and inevitable reorgs. The mice are regularly fed their “cheese” (promotions, bonuses, fancy food, fancier perks) and despite many wanting to experience personal satisfaction and impact from their work, the system trains them to quell these inappropriate desires and learn what it actually means to be “Googley” — just don’t rock the boat"
"I don’t disagree with those points, but if biologists do not dream themselves, then the task of dreaming about biology gets outsourced to people who have little practical experience in it, and the dreams of biology get a bad name. Worse, I think it is hard to inspire most 20-year-olds and 30-year-olds with the promise of curing diseases that will not affect them for 40 years. If we want to maintain the pipeline of brilliant people entering biology, they need to be driven by something bigger than curing a disease they have never heard "

Fun Practical Projects and Learning Opportunities

A few fun practical projects and topics to keep you occupied/distracted:

Covid Corner

Apparently Covid is over … but it’s definitely still around

  • The latest results from the ONS tracking study estimates 1 in 45 people in England have Covid (a negative move from last month’s 1 in 70) … and till a far cry from the 1 in 1000 we had in the summer of 2021.

Updates from Members and Contributors

  • Friend of the newsletter and veteran pandas contributor Marco Gorelli highlights that the pandas 2.0.0 release candidate is out! If you get the chance do try it out and report any bugs before the final 2.0.0 release in a couple of weeks: to install
    $pip install pandas==2.0.0rc0
  • Dr Ahmed Fetit, Senior Teaching Fellow AI for Healthcare at Imperial College, draws our attention to his recent publication which may be of interest to those working on robustness for medical imaging AI – “Reducing CNN Textural Bias With k-Space Artifacts Improves Robustness
  • Jona Shehu draws our attention to what looks like an excellent new podcast series from the SAIS Project (a cross-disciplinary collaboration between King’s College London and Imperial College London, and non-academic partners like Microsoft) on the security of AI assistants: “Always Listening: Can I trust my AI Assistant?“. The first episode is already out. In addition, they have also launched a blog series to disseminate SAIS research findings (see blog 1 and blog 2. )
  • Fresh from publishing his book, “Towards Net Zero Targets: Usage of Data Science for Long-Term Sustainability PathwayTargets“, Prithwis De is now an award winner – many congratulations!
  • Harin Sellahewa is pleased to highlight success stories from apprentices who completed the Level 7 Digital and Technology Solutions Specialist (Data Analytics Specialist) degree apprenticeship from Buckingham – see Maddie Fang’s story here. The Level 7 DTSS apprenticeship is an excellent route for individuals to upskill or reskill, and for organisations to develop their capabilities in data science
  • Sian Fortt at the The Alan Turing Institute highlights the upcoming AI UK 2023 conference on the 21-22 March – “The UK’s national showcase of data science and artificial intelligence (AI)”- tickets here

Jobs!

The Job market is a bit quiet – let us know if you have any openings you’d like to advertise

  • This looks exciting – C3.ai are hiring Data Scientists and Senior Data Scientists to start ASAP in the London office- check here for more details
  • EvolutionAI, are looking to hire someone for applied deep learning research. Must like a challenge. Any background but needs to know how to do research properly. Remote. Apply here

Again, hope you found this useful. Please do send on to your friends- we are looking to build a strong community of data science practitioners- and sign up for future updates here.

– Piers

The views expressed are our own and do not necessarily represent those of the RSS

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: