1 Introduction

Understanding the healthy development of structures within the paediatric brain is vital for the identification of neuro-developmental disorders. The application of image analysis tools to large-scale MR imaging studies (e.g., the UK Biobank [17]) for automated analysis of healthy structural development has been well described for the adult brain. However, the majority of existing image analysis tools struggle with paediatric brains due to the poor grey matter/white matter differentiation and the rapid growth and change observed in anatomical brain structures. This, combined with the challenges of imaging children, such as the need to minimise the movement of infants, critically limits our ability to understand the structural development of the paediatric brain.

This knowledge gap is further increased when considering populations from Low and Middle Income Countries (LMICs), where high-field MRI systems (1.5T/3T) are rare due to the costs involved. Therefore, few MR imaging studies have considered LMIC populations. To fill this gap, Hyperfine SWOOP scanners are being tested: although at 0.064T the scanners offer much lower image quality, low-field MRI provides portability and cost-effectiveness, and removes the need to sedate children during the scan. This improves the potential to increase the availability of MR imaging in underrepresented research communities [13].

Existing image analysis tools (e.g., FSL, Freesurfer) have been primarily developed for 1.5 and 3T MR images of adults: the large domain shift between these images and the low-field paediatric images we are considering means that the existing tools are unlikely to give high quality results [6]. There is therefore a need to develop domain-specific tools, designed specifically for low-field images and paediatric populations. Deep learning (DL) based methods are generally good candidates for producing analysis tools, such as segmentation methods, for low-field paediatric images, due to their ability to identify underlying discriminative patterns in images and create both local- and global-level associations between voxel values [12]. Accurate segmentations of subcortical brain structures are essential for volumetric and morphological assessment and the monitoring of healthy brain development, but manual segmentation is time-consuming and error-prone [3]. The hippocampus is a grey matter structure located within the medial temporal lobe memory circuit, and is associated with numerous conditions including depression, Alzheimer’s disease, psychosis and epilepsy [11]. Accurate delineation of the hippocampus is therefore essential to enable volumetric and morphological assessment for understanding healthy development and disease. To assess the feasibility of low-field paediatric MR brain imaging, two initial tasks are considered:

  • Task 1: Automated quality assurance (QA) to rate the overall quality of low-field MRI images, to ensure that the acquired MR images meet specific standards.

  • Task 2: Automatic segmentation of the bilateral hippocampi, due to their importance as a subcortical structure linked to cognitive and memory functions.

Here we explore preliminary results from the above two tasks based on T2-weighted MR images by comparing multiple approaches. Our results show that these single-modality images can provide an understanding of overall anatomy and large-scale pathology. However, our results suggest that multiple scan types and quality enhancement approaches might be required for more granular analyses and in-depth exploration of structural development in paediatric populations.

2 Methods

2.1 Classification of Artefacts for Quality Assessment

We aim to perform quality assessment of paediatric brain MR images by assigning scores of 0, 1 or 2 (0 indicating good quality and 2 indicating a very high level of artefact) across seven artefact domains: noise, zipper, positioning, banding, motion, contrast and distortion. One of the major challenges is the imbalance in data samples across the different scores: the majority of cases were of good quality and the proportion of data with artefacts was extremely small, especially those with a score of 2. Hence, as a first step, we performed appearance-informed transformations to simulate artefacts to augment the data (for classes 1 and 2) for training.

Table 1. Appearance-based transformations used for data augmentation (the parameters were selected empirically from held-out data).
Fig. 1. Simulation of artefacts for various artefact domains for training the quality assessment models.

Appearance-Informed Transformations for Simulating Artefacts. We used various appearance-based transformations as specified in Table 1, and a few examples are shown in Fig. 1 for the various artefact domains. We used the TorchIO library [1] for simulating the artefacts, with the parameters listed in the table. Note that these transformations (specified in Table 1) were used for quality assessment (task 1) alone, since they were specific to the artefacts observed in the data.

Automated Quality Assessment Using Deep Learning. For classification of images based on their quality, we compared the following architectures:

  • Multi-headed decoder with single shared encoder [8]

  • DenseNet architecture [9]

Of the two architectures, to the best of our knowledge this study is the first to apply the multi-headed decoder framework to this task, while DenseNet has already been used for similar tasks [2]. The multi-headed decoder was chosen for its capability to learn common attributes of the data through a shared encoder, which helps in learning the perturbations better.

Multi-head Decoder Model for Quality Assessment. Figure 2 shows the network architecture of the multi-head decoder model. The architecture consists of an encoder, shared across the multiple decoder heads, that extracts features distinguishing the input from high-quality images without artefacts. We used the encoder of a 4-layer deep UNet [16] model, and connected the decoders to the bottleneck of the UNet. Each decoder consisted of 3 fully connected layers (with 4096, 512 and 32 nodes respectively), followed by an output Sigmoid layer with 3 nodes (for classes 0, 1 and 2). The model was trained using an AdamW optimiser with a learning rate of \(1\times 10^{-5}\) and a patience value of 5. We used a batch size of 8, and a train:validation split of 80:20. For standard data augmentations (note that these are separate from the transformations in Sect. 2.1), we used MONAI transforms to inflate the training data. We trained with the focal loss [14], given by:

$$\begin{aligned} FL(p_t) = -\alpha _t(1-p_t)^\gamma \log (p_t) \end{aligned}$$
(1)

where \(\alpha \) and \(\gamma \) are weighting and focusing parameters with values of 0.25 and 2 respectively (chosen empirically), and \(p_t \in [0,1]\) is the predicted probability of the true class. During inference, we applied the trained encoder and individual decoders to each test instance, obtained predictions for the 7 artefact domains, and used max-voting to obtain the final artefact prediction.
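A minimal PyTorch sketch of the focal loss in Eq. 1, written via the standard cross-entropy identity \(CE = -\log(p_t)\):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss (Eq. 1): cross-entropy down-weighted for well-classified samples."""
    ce = F.cross_entropy(logits, targets, reduction="none")  # equals -log(p_t)
    p_t = torch.exp(-ce)                                     # probability of the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```

A confident correct prediction contributes almost nothing to the loss, which counteracts the dominance of the good-quality (class 0) samples during training.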

Fig. 2. Architecture of the multi-head decoder model for quality assessment.
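The shared-encoder/multi-head design can be sketched as follows. This is a simplified stand-in: the paper uses a 4-layer UNet encoder, whereas here a small convolutional stack is used for brevity; the heads follow the 4096/512/32 fully connected layout with a 3-node sigmoid output described above.

```python
import torch
import torch.nn as nn

DOMAINS = ["noise", "zipper", "positioning", "banding", "motion", "contrast", "distortion"]

class MultiHeadQA(nn.Module):
    def __init__(self, feat_dim=128):
        super().__init__()
        # Simplified shared encoder (the paper uses the encoder of a 4-layer UNet).
        self.encoder = nn.Sequential(
            nn.Conv3d(1, 8, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(2), nn.Flatten(),
            nn.Linear(16 * 8, feat_dim),
        )
        # One decoder head per artefact domain: FC layers of 4096, 512 and 32 nodes,
        # then a 3-node sigmoid output for classes 0/1/2.
        self.heads = nn.ModuleDict({
            d: nn.Sequential(
                nn.Linear(feat_dim, 4096), nn.ReLU(),
                nn.Linear(4096, 512), nn.ReLU(),
                nn.Linear(512, 32), nn.ReLU(),
                nn.Linear(32, 3), nn.Sigmoid(),
            ) for d in DOMAINS
        })

    def forward(self, x):
        z = self.encoder(x)                               # shared features
        return {d: head(z) for d, head in self.heads.items()}

model = MultiHeadQA()
scores = model(torch.rand(1, 1, 32, 32, 32))  # one 3-class score vector per artefact domain
```

Because all seven heads backpropagate through the same encoder, the shared features are shaped by every artefact domain simultaneously.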

DenseNet Model Architecture for Quality Assessment. We also trained a DenseNet [9] model with a cross-entropy loss, individually for each artefact domain, using the same training parameters as the multi-head decoder model.

2.2 Segmentation of Hippocampus

Preliminary results from a 3D UNet indicated that the low-contrast images were causing challenges for the segmentation of the hippocampus, leading to a large degree of under-segmentation. Therefore, we considered a range of methods based on different paradigms:

  • Out-of-the-box Segmentation Approach (FSL FIRST [15])

  • Linear Registration of an atlas

  • 3D UNet

  • 3D UNet + Prior

Out-of-the-Box (OOB) Segmentation Approach (FSL FIRST). The majority of OOB segmentation tools for the hippocampus have been developed for adults and 1.5/3T MR images. Therefore, their expected performance on our images is unknown. We compared results on a subset of the training dataset for the following methods: FSL FIRST [15], Freesurfer [7], SynthSeg [5] and HippoDeep [18]. Only FSL FIRST was able to reliably locate the hippocampus in the images and so is used as the OOB approach.

Linear Registration. For the linear registration approach we aimed to create a study-specific hippocampus atlas that could then be propagated to the individual subjects. We used FSL FLIRT [10] to register all subjects to the first training subject (id: 0001), then propagated all labels to this subject space and averaged them to produce a study-specific hippocampus atlas. Non-linear registration was not used because, due to the lack of contrast, the algorithm struggled to converge and led to poor registration results. The atlas was then propagated back to the individual subject spaces for the validation samples, and thresholded at 0.1 (chosen empirically from held-out samples) to produce the binary segmentation mask. The average mask can be seen in Fig. 5C.
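The atlas construction and thresholding steps can be sketched with numpy, assuming the masks have already been propagated into the common (subject 0001) space with FLIRT:

```python
import numpy as np

def build_probabilistic_atlas(propagated_masks):
    """Average binary hippocampus masks already aligned to the common space."""
    return np.mean(np.stack(propagated_masks).astype(np.float32), axis=0)

def atlas_to_segmentation(prob_atlas, threshold=0.1):
    """Binarise the atlas after propagation to a subject; 0.1 was chosen empirically."""
    return (prob_atlas >= threshold).astype(np.uint8)
```

The low threshold of 0.1 keeps any voxel labelled as hippocampus in at least 10% of training subjects, which favours larger masks.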

3D UNet. We considered a Vanilla 3D UNet [16] (64 features at the first level and 4 pooling layers), and trained with a Dice Loss function. Given the images were rigidly aligned, we selected an ROI from the training images centred around each hippocampus separately, as shown in Fig. 3. The images were split at the subject level into training and validation sets, and the model was trained using an AdamW optimiser with learning rate \(1\times 10^{-4}\) and a patience value of 15. The problem was treated as a binary segmentation task with both hippocampi being represented by the same label value, and standard augmentation was applied throughout training.
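Because the images were rigidly aligned, a fixed ROI could be cropped around each hippocampus. A minimal sketch follows; the centre coordinates and ROI size here are hypothetical, chosen only for illustration.

```python
import numpy as np

def crop_roi(volume, centre, size):
    """Crop a fixed-size ROI around a voxel coordinate, clipped to the volume bounds."""
    starts = [min(max(0, c - s // 2), dim - s)
              for c, s, dim in zip(centre, size, volume.shape)]
    return volume[tuple(slice(st, st + s) for st, s in zip(starts, size))]

volume = np.zeros((64, 64, 64), dtype=np.float32)
# Hypothetical hippocampus centres in the rigidly aligned space:
left_roi = crop_roi(volume, centre=(20, 32, 24), size=(24, 24, 24))
right_roi = crop_roi(volume, centre=(44, 32, 24), size=(24, 24, 24))
```

Cropping each hippocampus separately keeps the input small and centred, which simplifies the binary segmentation task for the UNet.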

Fig. 3. Selected ROI from the T2 images.

3D UNet + Prior. On the held-out validation set, the 3D UNet was seen to have under-segmented the hippocampi, so a prior was added to encourage the network to create larger segmentations. Specifically, an estimate of the proportion of the ROI that should be hippocampus was calculated from the atlas produced during the linear registration approach: the atlas was binarised and the ratio then calculated. The model was then trained, inspired by [4], using a loss function that directly aims to match the ratio of background to hippocampus:

$$\begin{aligned} L_{total} = L_{dice}(\boldsymbol{Y}, \hat{\boldsymbol{Y}}) + \lambda KL(\tau , \hat{\tau }) \end{aligned}$$
(2)

where \(\boldsymbol{Y}\) is the true segmentation mask, \(\boldsymbol{\hat{Y}}\) is the predicted segmentation mask, KL is the KL divergence between the true tissue ratio \(\tau \) and the predicted tissue ratio \(\hat{\tau }\), and \(\lambda \) is the weighting factor between the two loss terms, set to 5 empirically.
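A PyTorch sketch of Eq. 2, treating the tissue ratio as a Bernoulli parameter for the KL term. This is our reading of the loss; the exact formulation used in [4] may differ.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def bernoulli_kl(tau, tau_hat, eps=1e-6):
    # KL(tau || tau_hat) between two Bernoulli distributions over voxel class.
    tau = torch.as_tensor(tau).clamp(eps, 1 - eps)
    tau_hat = tau_hat.clamp(eps, 1 - eps)
    return tau * torch.log(tau / tau_hat) + (1 - tau) * torch.log((1 - tau) / (1 - tau_hat))

def total_loss(pred, target, tau, lam=5.0):
    """Eq. 2: Dice loss plus lambda-weighted KL between true and predicted tissue ratios."""
    tau_hat = pred.mean()  # predicted fraction of hippocampus voxels in the ROI
    return dice_loss(pred, target) + lam * bernoulli_kl(tau, tau_hat)
```

When the network under-segments, \(\hat{\tau }\) falls below \(\tau \) and the KL term grows, pushing the predictions towards the atlas-derived tissue ratio.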

Datasets Used. The data used were from the LISA Challenge [13]: a low-field MRI dataset of over 300 paediatric T2 scans, acquired using a 0.064T MR scanner with a spin-echo sequence (TR 1.5 s, TE 5 ms, TI 400 ms). The images were split (at the subject level) into training and validation sets with an 80:20 split. Quality assessment scores (values of 0, 1 and 2) were available across the seven artefact domains (noise, zipper, positioning, banding, motion, contrast, distortion) in CSV format. Results are reported (Sect. 3.1) on the 14 validation samples available as part of the Challenge. For the segmentation task (task 2), the models were trained using the 79 T2 MR images with manual labels. No preprocessing was performed other than that detailed in the methods. Results are reported on the 12 validation samples, preprocessed identically to the training data where appropriate.
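The subject-level 80:20 split can be sketched as below; the data loading itself is omitted, and the seed is an arbitrary illustrative choice.

```python
import random

def subject_level_split(subject_ids, train_frac=0.8, seed=42):
    """Split at the subject level so no subject appears in both sets."""
    ids = sorted(set(subject_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    return ids[:n_train], ids[n_train:]
```

Splitting by subject rather than by image prevents scans from the same child leaking between the training and validation sets.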

3 Results and Discussion

3.1 Quality Assessment Results

The results comparing the multi-head decoder with the DenseNet model are shown in Table 2. As mentioned earlier, the main challenge in this task is the heavy class imbalance between class 0 and the others (especially class 2, which has very few samples); this biases the model towards class 0 and, combined with the poorly distinguishable anatomical structures, leads to poor predictive quality. However, performance improved with the addition of simulated data using the transformations described in Table 1. Between the architectures, DenseNet provided much better performance (\(Accuracy_{weighted}: 0.823\)) than the multi-head decoder model (\(Accuracy_{weighted}: 0.741\)), likely due to the ability of the former to extract complex spatial and contextual features that capture the subtle changes in brain structure.

Table 2. Quality assessment results averaged across the dataset for various evaluation metrics. MHD: Multi-head decoder. WOA: without using augmentations, WA: With augmentations.

3.2 Segmentation Results

The results comparing the different segmentation approaches are reported in Table 3. It can be seen that the OOB approach produced poor quality segmentations for the low-field data, as expected due to the large domain shift between this data and the data FSL FIRST was trained on. Of the DL-based approaches, the vanilla UNet produced the better results, although far below the performance that would be expected for hippocampal segmentation on adult high-field data. The addition of the prior reduced the segmentation quality slightly, and the performance was very asymmetric (left 0.52±0.28, right 0.59±0.19), likely due to the significantly different sizes of the manual masks for each hemisphere (left volume: 1160 ± 308, right volume: 1225 ± 338, paired t-test p=0.0008). Registration of the average hippocampal atlas provided the best results across the metrics apart from relative volume error, probably indicating that the atlas overestimates the size of the hippocampus. This demonstrates the lack of signal available in the images, as even very simple DL-based approaches would normally be expected to outperform registration-based approaches.

Fig. 4. Validation example segmentation results comparing the results from the four methods.

Table 3. Segmentation results averaged across left and right hippocampi for the five different metrics considered.
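Two of the evaluation metrics reported in Table 3, Dice and relative volume error, can be computed as follows (a sketch; the remaining metrics used in the comparison are not reproduced here):

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum())

def relative_volume_error(pred, truth):
    """Signed relative volume error: positive values indicate over-segmentation."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    return (pred.sum() - truth.sum()) / truth.sum()
```

Because the relative volume error is signed, it separates over- and under-segmentation, which Dice alone cannot distinguish.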

Figure 4 shows an example segmentation from the validation data (no manual label available), comparing the four approaches. It can clearly be seen that FIRST over-segments the hippocampus, whereas the 3D UNet produces the lowest-volume segmentations. The addition of the prior clearly increases the size of the segmented region. The localisation of the hippocampus is consistent across the linear registration, vanilla 3D UNet and 3D UNet + prior approaches.

Fig. 5. Example hippocampal segmentations. Row 1: MNI T1 2mm template with the Harvard-Oxford probabilistic atlas, thresholded at 0.5. A) A sample on which the 3D UNet repeatedly performed well (DSC > 0.6), with the tail of the atlas touching the ventricle. B) A sample on which the 3D UNet performed poorly (DSC < 0.1), where the localisation of the hippocampus in the target appears different to the MNI template and sample A, with the label not reaching the ventricles. C) Example B with the average hippocampus mask registered to it, showing that the average hippocampus mask is disjoint from the manual segmentation mask.

Training and evaluation of the DL-based models were affected by the quality of the manual masks: unsurprisingly, given the low contrast and relatively poor image quality, the quality of the training masks varied, biasing the results. For instance, some samples appear to have labels which mislocalise the hippocampus (Fig. 5): example B's mask appears shifted lower than the expected location when compared to the other examples and to the average hippocampus mask registered to that subject.

4 Conclusions

In this work, we presented a preliminary analysis of low-field paediatric brain MRI for two tasks, quality assessment and hippocampal segmentation, by comparing multiple approaches. Our results for quality assessment show that simulated artefacts helped to counteract the class imbalance, while a densely connected architecture provided a 10% increase in accuracy. For hippocampal segmentation, we obtained the best results with registration of a study-specific average, surpassing both out-of-the-box methods, which were originally developed mainly for adult brains, and deep learning approaches. As future work, more sophisticated DL models (e.g., transformer-based networks) could be trained using self-supervised learning approaches to increase performance. However, substantial improvement would be required before the segmentation outputs can be used for automated granular analyses of paediatric brains. The Python implementation of our code is publicly available at https://github.com/v-sundaresan/LISA2024_QA.