SPECIAL SESSION
Project 22HLT05 MAIBAI - Developing a metrological framework for assessment of image-based Artificial Intelligence systems for disease detection
ABSTRACT
The exponential increase in healthcare data, together with fast-paced technological developments, has resulted in promising novel image-based AI systems for disease detection and risk prediction. However, the adoption of AI in clinical settings remains limited, mostly due to limited data quality and interoperability across heterogeneous clinical centers and electronic health records, the absence of robust validation procedures, and distrust of the predictions and decisions generated by AI systems.
Focusing on breast cancer screening and building on the experience of the European Metrology Partnership Project 22HLT05 MAIBAI (https://www.maibaiproject.eu/), this special session aims to discuss the main strategies that can be adopted to address the above issues, through the design of a standardised and impartial framework for assessing the performance, generalisability and suitability of AI systems in the clinical domain. Special attention will be devoted to the development of large-scale, high-quality medical imaging databases, the categorization of data based on clinically relevant subgroups and key image acquisition factors, their integration with synthetic data generated with data-driven approaches, the benchmarking of AI systems in terms of robustness, accuracy and fairness, and the provision of visual explanation techniques that make the decision process more transparent and interpretable.
LIST OF CONTRIBUTIONS
AI in breast cancer screening, where is it coming in?
Speaker/Speakers: Ruben Van Engen and/or Carlijn Roozemond (Dutch Expert Centre for Screening - LRCB, Nijmegen, Netherlands), r.vanengen@lrcb.nl, c.roozemond@lrcb.nl
This presentation will give an overview of the parts of the mammography and breast tomosynthesis imaging chain in which AI is or will be employed, and of the different ways AI software could be used in the breast cancer screening pathway. This includes the use of AI in (1) the acquisition of mammography and breast tomosynthesis images, (2) image processing and/or image reconstruction in 2D mammography and breast tomosynthesis, and (3) the generation of synthetic 2D images in breast tomosynthesis. Beyond imaging, AI may also play a role (4) as cancer detection software aiding radiologists, (5) in patient management procedures, (6) in estimating the breast density of individual women, (7) in estimating the risk of breast cancer for populations or individuals, and (8) as a positioning tool for radiographers. The risk associated with the use of this software varies; depending on this risk and on how the AI tools are implemented, quality assessment should be performed either by extending existing quality assurance schemes or by setting up new quality assurance methods. These methods should incorporate procedures to assess the impact of updates or changes in the AI software, as well as changes in the input images/data that could affect its output. There is an urgent need for evaluation and validation methods and tools, as AI software is currently being introduced into clinical practice while knowledge of its correct and safe use is largely lacking in the medical community. Many in the medical community perceive AI software as ‘plug-and-play’, and it is often installed without proper local validation and/or monitoring.
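To make the idea of monitoring changes in AI software more concrete, the sketch below (not part of the presentation; the scores are simulated and the threshold is an assumption) compares the output score distribution of a detection tool on a fixed local reference set before and after a software update, using a two-sample Kolmogorov–Smirnov test as a simple shift alarm.

```python
# Minimal sketch: flag a shift in AI output scores on a fixed reference set
# after a software update. Scores here are simulated; in practice they would
# come from running both software versions on the same local reference images.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_v1 = rng.beta(2, 5, size=500)    # hypothetical pre-update malignancy scores
scores_v2 = rng.beta(2.3, 5, size=500)  # hypothetical post-update scores

stat, p_value = ks_2samp(scores_v1, scores_v2)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
if p_value < 0.01:  # alert threshold chosen for illustration only
    print("Warning: score distribution shifted; re-validate before clinical use.")
```

A check of this kind only detects distributional drift; it would need to be complemented by clinical performance monitoring before any conclusion about safe use is drawn.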
Trials and tribulations of collecting sufficient mammography images to cover the various sub-groups within a screening population
Speaker: Mark Halling-Brown (Royal Surrey NHS Foundation Trust, Guildford, UK), mhalling-brown@nhs.net
The aim of the talk is to describe the collection of mammographic images and data for the purpose of evaluating AI software. The collection needs to cover the full range of breast appearances, thicknesses, glandularities and cancer types. There are also technical features that can affect the appearance of a mammogram, such as manufacturer, dose, compression paddle, processing factors, and radiographic factors. To achieve the required distribution of cases, a large-scale image collection framework is needed, with the collection automated and ongoing so that the data remain up-to-date and representative of current practice. The collection faces many challenges, including technical obstacles, linking image and clinical data, ensuring compliance with ethics and information governance requirements, and data curation. We have developed a framework for the collection of clinical images and data for use in training and validating artificial intelligence (AI) tools that successfully overcomes these challenges in setting up and running the collection.
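As a purely illustrative sketch of what monitoring sub-group coverage in such a collection might involve (the column names, values and minimum counts below are assumptions, not those of the actual framework), the snippet checks whether clinically relevant strata in the collected metadata meet minimum case counts.

```python
# Minimal sketch: check coverage of sub-groups in a growing image collection.
# Metadata columns (vendor, breast_density, cancer_type) are illustrative only;
# in practice they would be extracted from DICOM headers and linked clinical data.
import pandas as pd

cases = pd.DataFrame({
    "vendor": ["Hologic", "Siemens", "GE", "Hologic", "GE", "Siemens"],
    "breast_density": ["A", "B", "C", "D", "B", "C"],
    "cancer_type": ["none", "IDC", "none", "DCIS", "none", "IDC"],
})

# Hypothetical minimum number of cases each stratum should eventually reach.
target_min = {"vendor": 2, "breast_density": 1, "cancer_type": 2}

for column, minimum in target_min.items():
    counts = cases[column].value_counts()
    under = counts[counts < minimum]
    if under.empty:
        print(f"{column}: all strata meet the minimum of {minimum} cases")
    else:
        print(f"{column}: under-represented strata -> {dict(under)}")
```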
Role of synthetic data in the development of AI models for breast cancer screening
Speaker: Alessandra Manzin (Istituto Nazionale di Ricerca Metrologica - INRiM, Torino, Italy), a.manzin@inrim.it
Synthetic data currently play a key role in the application of Artificial Intelligence (AI) systems to medical imaging, enabling the generation of virtual patient databases for the robust training and accurate validation of AI algorithms. Well-representative medical data are generally hard to obtain, due to high acquisition costs, ethical restrictions, large heterogeneity, and insufficient sampling of certain subsets. In parallel, domain adaptation issues have to be faced: the widespread use of different scanners and acquisition protocols leads to strong variability in image features. Steps forward have recently been made in generative models that expand existing datasets of real medical images via data-driven approaches. In breast imaging, neural style transfer (NST) methods have been used to address domain gap issues, such as those arising from the large diversity of mammogram styles across vendors. Although synthetic data can enhance the robustness and adaptability of AI models, accurate tests need to be performed to assess their realism and quality, by means of proper quantitative metrics. In this talk, we will present a thorough assessment of an NST method based on a cycle generative adversarial network (CycleGAN) architecture, used to transfer the vendor style of mammograms whose features strongly depend on the equipment’s post-processing software. The high level of realism of the generated images is also demonstrated by their successful use in benchmarking a convolutional neural network (CNN) model specifically designed for the detection of cancerous lesions in mammograms.
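As a small illustration of the kind of quantitative realism check mentioned above (a sketch only; the images are random placeholders and the metrics are generic choices, not necessarily those used in the project's evaluation), the snippet below compares a real image patch with a synthetic counterpart using SSIM and PSNR from scikit-image.

```python
# Minimal sketch: full-reference quality metrics between a "real" patch and a
# "synthetic" (e.g. style-transferred) patch. Both patches are simulated here;
# in practice they would be matched mammogram patches before/after NST.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

rng = np.random.default_rng(42)
real_patch = rng.random((256, 256)).astype(np.float32)
synthetic_patch = np.clip(
    real_patch + 0.05 * rng.standard_normal((256, 256)), 0.0, 1.0
).astype(np.float32)

ssim = structural_similarity(real_patch, synthetic_patch, data_range=1.0)
psnr = peak_signal_noise_ratio(real_patch, synthetic_patch, data_range=1.0)
print(f"SSIM = {ssim:.3f}, PSNR = {psnr:.1f} dB")
```

Distribution-level metrics (e.g. comparing feature statistics over many images) would complement such pairwise measures when ground-truth counterparts are unavailable.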
Practical considerations and quality assessment of breast cancer detection models using mammography
Speaker: Danny Panknin (Physikalisch-Technische Bundesanstalt - PTB, Berlin, Germany), danny.panknin@ptb.de
High-quality breast cancer detection models, and the ability to assess their performance, rely on access to sufficiently rich mammography databases. Yet, to date, available mammography databases lack sufficient population heterogeneity as well as complete and correct data and metadata. This negatively impacts the training of sophisticated and highly promising deep learning architectures, as well as the ability to assess a model’s quality dimensions such as fairness. In addition, the quality measures from machine learning and statistics commonly used in academia cannot attest that a model is fit for purpose if the assessment is detached from clinical boundary conditions. In this talk, we will discuss these pitfalls and offer guidance for more successful training and evaluation of breast cancer detection models.
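A minimal sketch of the kind of subgroup-aware, operating-point-based evaluation alluded to above (all data, group labels and the threshold are simulated assumptions, not results from the talk): instead of reporting a single pooled score, sensitivity and specificity are computed per sub-group at a fixed, clinically motivated threshold.

```python
# Minimal sketch: per-subgroup sensitivity/specificity at a fixed operating point,
# as one way to look beyond pooled academic metrics. All values are simulated.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
subgroup = rng.choice(["dense_breast", "non_dense"], size=n)        # hypothetical stratification
label = rng.binomial(1, 0.1, size=n)                                # 1 = cancer present
score = np.clip(0.3 * label + rng.normal(0.3, 0.15, size=n), 0, 1)  # hypothetical model scores

threshold = 0.5  # assumed fixed by a clinical constraint, e.g. an acceptable recall rate
for group in np.unique(subgroup):
    mask = subgroup == group
    pred = score[mask] >= threshold
    y = label[mask].astype(bool)
    sensitivity = (pred & y).sum() / max(y.sum(), 1)
    specificity = (~pred & ~y).sum() / max((~y).sum(), 1)
    print(f"{group}: sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

Large gaps between sub-groups at the same operating point would indicate a fairness problem that a pooled AUC can easily hide.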
Beyond regions of interest: reconciling radiologist expectations with diverse XAI approaches in mammography
Speaker: Aleksander Sadikov (University of Ljubljana, Faculty of Computer and Information Science, Ljubljana, Slovenia), aleksander.sadikov@fri.uni-lj.si
Explainability is an important characteristic to consider when deploying machine learning models, especially in domains where critical decisions are made. The problem with explainability, however, is that users have very different expectations of it. Moreover, it depends on many different aspects, such as: (1) how general or specific the explanation is; (2) who the intended audience is; (3) what the purpose of the explanation is; (4) what the interplay with the model’s performance is. All these aspects further depend on the specific domain and use case of the models. To get an overview of how these aspects are perceived in the domain of mammography screening, we conducted an extensive survey of around 50 breast cancer radiologists. The specific use case presented to the radiologists was an artificial intelligence (AI) reader that prioritises breast scans based on predicted malignancy and the AI’s confidence (certainty) in its prediction. The radiologists were also presented with various types of automated explanations (textual, graphical, certainty information, etc.) and were asked to rank them according to their perceived usefulness. The localised anomaly (tumour) and certainty information were consistently ranked among the top three options. Furthermore, the survey responses showed that AI is viewed favourably (though not without concerns), but that, surprisingly, its perceived benefit lies less in saving time than in increasing the certainty of diagnosis.
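For readers unfamiliar with the prioritisation use case described above, here is a minimal, purely hypothetical sketch (not the surveyed system) of how a worklist could be ordered by predicted malignancy, with the AI’s certainty breaking ties among equally suspicious cases.

```python
# Minimal sketch: order a reading worklist by predicted malignancy, most
# suspicious first; among equal scores, prefer cases the AI is more certain about.
from dataclasses import dataclass

@dataclass
class Case:
    case_id: str
    malignancy_score: float  # hypothetical AI output in [0, 1]
    certainty: float         # hypothetical AI confidence in its own prediction

worklist = [
    Case("A-101", 0.82, 0.70),
    Case("A-102", 0.82, 0.95),
    Case("A-103", 0.15, 0.90),
]

prioritised = sorted(worklist, key=lambda c: (c.malignancy_score, c.certainty), reverse=True)
for case in prioritised:
    print(case.case_id, case.malignancy_score, case.certainty)
```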