1 Introduction

The confluence of a global biodiversity crisis and a data revolution in natural history collections has created both an urgent challenge and a significant opportunity for botanical science. On one hand, accelerating biodiversity loss threatens c. 400 000 plant species, many of which may disappear before formal description (Antonelli et al. 2020; Royal Botanic Gardens Kew 2023; Christenhusz and Byng 2016; Royal Botanic Gardens, Kew 2025). On the other hand, the world’s herbaria (collectively holding more than 400 million specimens) are undergoing large-scale digitisation, and over 100 million high-resolution plant specimen images are already accessible online (Thiers 2025; Soltis 2017; Lang et al. 2019). Herbaria are curated repositories of preserved plant specimens, each consisting of a dried, pressed plant mounted on archival paper and accompanied by detailed label metadata (as shown in Fig. 1), and serve as permanent reference libraries for taxonomy, ecology and conservation planning (Bridson and Forman 1998). Each image captures overlapping organs, handwritten labels, and calibration artefacts. Across collections, the data exhibit an extreme long-tail distribution. For example, within the Herbarium 2020 benchmark (the 2020 FGVC7 dataset containing approximately 1.17 million images across 32,000 species), 60% of species have five or fewer training images, whereas a small subset have thousands (FGVC7 Kaggle Team 2020).

This extreme imbalance, where the rarest taxa are under-represented and thus particularly difficult to study and classify, underscores the need for scalable AI methods capable of fine-grained recognition and robust performance on rare species.

Traditional taxonomic workflows rely on expert examination of morphology, dichotomous keys and authoritative literature (Simpson 2019; Stuessy 2009; Winston 1999). Although scientifically rigorous, these methods are time-consuming and subject to inter-expert variation, leading to documented misidentification rates in major collections (Goodwin et al. 2015; Govaerts 2001). It is estimated that many undescribed species already reside in herbarium cabinets, historically requiring an average of 35 years to be detected and formally named (Bebber et al. 2010).

Fig. 1

Annotated herbarium sheet (Royal Botanic Gardens, Edinburgh; CC BY 4.0) highlighting typical visual elements parsed in taxonomic workflows: pressed specimen, collection label with locality and collector data, institutional barcode, and ancillary fragments such as fruits or seeds

To fully appreciate the application of AI to herbarium specimens, it is essential to understand both the nature of the data source and the fundamental concepts from traditional botany and modern computer science that are being applied. In practice, an expert follows a multi-step reasoning process that has changed little since the nineteenth century:

  • The role of type specimens. A “type” is the single specimen to which a scientific name is formally attached. According to the International Code of Nomenclature for algae, fungi and plants, every new species description must designate a type specimen, which serves as the definitive reference for that name (Turland et al. 2018).

  • Determination and revision. Naming a specimen (determination) involves comparing it with verified material and diagnostic characters in the literature. Dichotomous keys guide this process where available; in groups lacking keys, taxonomists construct them during revision. Modern interactive online keys permit unordered character selection and return real-time candidate lists (e.g., ActKey (Brach and Song 2005), KeyBase (Royal Botanic Gardens Victoria 2025), Xper3 (Kerner et al. 2025)). New determinations may supersede earlier ones, and all revisions are affixed to the specimen sheet.

  • Floras and monographs. Floras summarise all species in a defined region, whereas monographs treat a single lineage throughout its distribution (Winston 1999; Judd et al. 2007). Producing such works often requires decades of coordinated field and herbarium study.
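The determination workflow above can be caricatured in code: a dichotomous key is essentially a binary decision tree over character questions. As a minimal sketch, the couplets and taxon names below are invented for illustration, not drawn from any real key:

```python
# Minimal sketch: a dichotomous key as a nested tuple tree.
# Each internal node is (question, yes_branch, no_branch); leaves are names.
# The couplets and taxa below are illustrative, not a real key.
KEY = (
    "Leaves opposite?",
    ("Flowers in umbels?", "Taxon A", "Taxon B"),
    ("Leaf margin serrate?", "Taxon C", "Taxon D"),
)

def identify(node, answer):
    """Walk the key using answer(question) -> bool until a leaf is reached."""
    while isinstance(node, tuple):
        question, yes_branch, no_branch = node
        node = yes_branch if answer(question) else no_branch
    return node

# Example: a specimen with opposite leaves and umbellate flowers.
observed = {"Leaves opposite?": True, "Flowers in umbels?": True}
print(identify(KEY, lambda q: observed.get(q, False)))  # -> Taxon A
```

Interactive keys such as ActKey or Xper3 generalise this rigid traversal by letting users answer characters in any order and returning the set of still-compatible taxa.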

These steps form the foundation onto which AI-driven pipelines now build: image segmentation, text recognition, and automated classification aim to replicate and ultimately accelerate the expert workflow, while retaining an essential stage of human validation.

To accelerate discovery and mitigate these limitations, AI methods offer a data-driven alternative. For computer vision, herbarium images pose a fine-grained recognition problem: inter-species differences are subtle, and background noise is considerable. Recent studies show that deep models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) can learn discriminative features directly from millions of specimen images (Krizhevsky et al. 2012; Dosovitskiy et al. 2021). Carranza-Rojas et al. (2017) demonstrated that an early CNN reached approximately 70% accuracy across 1,000 species, highlighting the feasibility of semi-automated workflows that underpin the concept of the “extended specimen” (Heberling et al. 2021; Wen et al. 2015). A practical workflow now realises this concept: each newly collected sheet is paired with its iNaturalist field photograph, embedding habit, colour, and geo-metadata directly into the specimen lifecycle (Heberling and Isaac 2018).

Surveys on automated plant recognition largely address living plants. Early work reviewed whole-plant images and freshly detached leaves captured in controlled settings (Wäldchen and Mäder 2018; Saranya et al. 2021), and more recent updates extend this line to broad-spectrum plant recognition and crop-disease diagnosis (Barhate et al. 2024; Upadhyay et al. 2025; Yilmaz et al. 2025; Tiwari and Dev 2025). Two prior reviews are closely related to this topic. Hussein et al. (2022) present a systematic survey of tasks, datasets, publication venues, and challenges in digitised herbarium research. Their focus lies in usage trends and broad methodological categories, without detailed analysis of model architectures and training regimes; moreover, the review predates foundation models and open-vocabulary vision methods that now dominate the field. Pearson et al. (2020b) concentrate on phenological annotation, describing a modular processing workflow and infrastructure for extracting seasonal signals from specimen images. Their scope excludes classification models, segmentation and detection benchmarks, or multimodal image-text pipelines.

The present review synthesises recent advances in model architectures and systems for herbarium image analysis, covering convolutional and transformer-based classifiers, training strategies for long-tailed or self-supervised learning, open-set recognition, and multimodal pipelines that align image and label content for end-to-end prediction. To date, no survey has examined herbarium-specimen classification from an AI perspective in comparable depth. Previous overviews omit large parts of the technical landscape, ranging from convolutional and transformer architectures to specialised loss functions, self-supervision strategies, and multimodal pipelines. Against this backdrop, the present work offers four concrete contributions:

  • Catalogue publicly available herbarium specimen image datasets and portals, detailing the complete digitisation workflow from high-resolution scanning to metadata capture.

  • Review and compare the evolution of taxon-recognition models, from traditional machine learning to CNN, transformer, and multimodal Vision-Language architectures, focusing on their performance with fine-grained, long-tailed botanical data.

  • Survey and analyse supporting modules for segmentation, artefact removal, optical/handwriting character recognition, and metadata alignment, and examine their integration within operational collection-management systems.

  • Identify and highlight open challenges, delineating research gaps in few-/zero-shot learning, Open-Set Recognition (OSR), and model and data stewardship, and outline a future research agenda to address them.

Figure 2 summarises the organisation of this paper. Section 2 examines the nature of digitised specimens and describes key datasets and portals. Section 3 details the four paradigms of species-classification research, while Sect. 4 surveys auxiliary techniques that complete the pipeline. Finally, Sect. 5 discusses open challenges, and Sect. 6 presents our roadmap and concludes the paper.

Fig. 2

Overview of the review paper structure and the main topics covered in this survey. Each major section and its subtopics are shown, reflecting the organisation and logical flow of the manuscript

This article is a narrative review that maps recent and influential work on herbarium specimen image analysis. A structured search (2010–2025) covered ACM Digital Library, Scopus, IEEE Xplore, SpringerLink, Google Scholar, and arXiv; benchmark/dataset sources (Fine-Grained Visual Categorization (FGVC)/PlantCLEF and LifeCLEF papers/pages, Kaggle dataset cards) were also consulted. Queries combined domain and method terms (herbarium/plant specimen/botanical image \(\times \) classification/segmentation/detection/layout/image analysis \(\times \) deep learning/transformer/computer vision), and backward/forward snowballing retrieved additional items.

Inclusion required peer-reviewed articles or official benchmark descriptions with a concrete task on herbarium images or labels and reported metrics on public data or reproducible settings; short abstracts, non-archival/grey literature, and non-herbarium targets were excluded. For each study, task, dataset source (portal/competition), code/data availability, metrics, and key findings were recorded; when mirrors existed (e.g., portal vs. Kaggle export), the canonical source was cited to avoid duplication.

2 Data and datasets

Building on the challenges outlined in Sect. 1, this section characterises the raw material that underpins herbarium AI, namely digitised specimen sheets and the large-scale repositories that curate them.

The foundation of modern AI research in botany is the availability of large, well-curated digital datasets. The scale of herbarium collections is immense: the Index Herbariorum lists approximately 400 million curated plant specimens worldwide (Thiers 2025). Global digitisation efforts have made significant progress in bringing these collections online. As of recent estimates, more than 100 million specimens have been imaged at high resolution, driven by national initiatives such as the US National Herbarium, which alone has scanned over 4.9 million sheets (Soltis 2017; Smithsonian Institution 2022). Taxonomic names from these images can be programmatically validated against the continuously updated World Flora Online (WFO) backbone, curated by a global consortium of more than 40 botanical institutions (Borsch et al. 2020; World Flora Online Consortium 2025).

This section surveys public datasets and portals used for herbarium specimen analysis (Tables 2, 3), with an emphasis on widely used and well-documented resources. Sources included Papers with Code, GitHub, Zenodo, Kaggle, and official challenge pages (FGVC/PlantCLEF, LifeCLEF), using the same domain–method keywords as in the literature search. Major portals (Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio)) were queried, and collaborating botanists suggested additions and audited the list. Each entry records the name, year, task, size (images/labels), licence, and a canonical link.

Inclusion favours datasets that provide large-scale sheet imagery under clear licences and stable hosting and that are used by multiple studies or serve as official benchmarks. One well-documented but unreleased dataset is listed for context and clearly flagged. Private or unstable collections are excluded, as are sources consisting only of model-generated outputs (e.g., masks or auto-classified images) that are unsuitable for training. Counts reflect the latest snapshot and are harmonised to consistent units (thousands or millions); where mirrors exist (e.g., portal vs. Kaggle export), the canonical host is cited to avoid duplication.

2.1 Herbarium specimen data

From a data science perspective, a digitised herbarium sheet is a multimodal object that combines high-resolution imagery with structured text metadata. This dual modality underpins the multimodal AI methods reviewed later in Sect. 3.3 and the downstream pipelines in Sect. 4.

  • Image modality. Each sheet is scanned at \({300}\,{\textrm{dpi}}\) to \({600}\,{\textrm{dpi}}\), capturing morphology such as leaf venation, floral parts and fruit details (Simpson 2019; Stuessy 2009).

  • Text modality. One or more labels record the scientific name, collection locality and date, and collector’s name. In addition, collectors often include field notes documenting traits not preserved in a dried specimen, such as plant size, flower colour, and scent. These textual data map to the Darwin Core vocabulary, enabling standardized sharing across institutions (Wieczorek et al. 2012).
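The text modality's mapping onto Darwin Core can be sketched in a few lines. The dwc: terms below (scientificName, recordedBy, eventDate, locality) are standard Darwin Core; the local field names and label values are invented for illustration:

```python
# Illustrative sketch: renaming parsed label fields to Darwin Core terms.
# The dwc: terms are standard Darwin Core; the example values are invented.
DWC_MAP = {
    "name": "dwc:scientificName",
    "collector": "dwc:recordedBy",
    "date": "dwc:eventDate",
    "place": "dwc:locality",
}

def to_darwin_core(label_fields):
    """Rename locally parsed label fields to Darwin Core terms."""
    return {DWC_MAP[k]: v for k, v in label_fields.items() if k in DWC_MAP}

record = to_darwin_core({
    "name": "Ficus carica L.",
    "collector": "J. Smith",
    "date": "1903-05-14",
    "place": "Edinburgh, Scotland",
})
print(record["dwc:scientificName"])  # -> Ficus carica L.
```

In practice this renaming step sits downstream of OCR and named-entity extraction, which must first segment and transcribe the handwritten or typed label.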

Modern digitisation workflows are complex, multi-stage processes designed to convert physical specimens into Findable, Accessible, Interoperable, Reusable (FAIR) digital assets (Davies et al. 2023). The dissemination of these assets is now largely standardised through the International Image Interoperability Framework (IIIF), which exposes a uniform API for serving high-resolution sheet images (IIIF Community 2025).

2.1.1 Digitisation milestones

Digitisation began in the late 1990s when the New York Botanical Garden scanned its type collection at \({600}\,{\hbox {dpi}}\). The African Plants Initiative, launched in 2004 to digitise type specimens across tropical herbaria, later evolved into the Global Plants Initiative. Today, nearly three million high-resolution type images are available via JSTOR Plants (Royal Botanic Gardens, Kew 2015; JSTOR 2015). Major initiatives, such as the launch of the GBIF (2001) (GBIF Secretariat 2025), the establishment of iDigBio as the U.S. hub (2010) (iDigBio Consortium 2025), widespread IIIF adoption after 2015, and the European DiSSCo “Digital Specimen” programme (Hardisty et al. 2020), have collectively made tens of millions of sheets programmatically accessible, setting the stage for AI-driven analysis.

2.1.2 Specimen preparation and scanning protocols

The entire process follows a standardised workflow that encompasses pre-digitisation curation (e.g., specimen mounting and repair), barcoding to create a unique digital identifier for each sheet, and high-quality image capture under controlled lighting conditions (Davies et al. 2023). These standard practices ensure that sheets are mounted on archival card and scanned under fixed illumination to produce consistent digital surrogates (Soltis 2017).

Standard practice includes: (i) ISO Q-14 colour calibration strip; (ii) metric scale bar; (iii) \({300}\,{\textrm{dpi}}\) to \({600}\,{\textrm{dpi}}\) optical resolution; and (iv) lossless TIFF masters with embedded Extensible Metadata Platform (XMP) metadata. Artefacts (overlapping fragments, folded organs, faded pigments) pose challenges later addressed by restoration and segmentation networks (Sect. 4).

2.1.3 Metadata granularity and standards

Table 1 maps common label abbreviations to their Darwin-Core terms, guiding the Natural Language Processing (NLP) pipelines in Sect. 4. Institutions commonly publish records via an Integrated Publishing Toolkit (IPT) server, packaging images and CSV metadata into a DwC-Archive that can be harvested by aggregators such as GBIF.

Table 1 Typical label phrases and corresponding Darwin Core terms

2.1.4 Empirical limits of 2D imagery

While high-resolution scans capture many macromorphological traits, recent work has measured what is lost when physical examination is replaced by images. In a virtual taxonomic account of Madhuca based solely on online specimen images, Phang et al. (2022) reported that fewer than half of the required diagnostic characters could be measured or described from images alone; micromorphological and tactile characters remained inaccessible even at the highest available resolutions, and fewer than half of the images could be confidently assigned to species, necessitating verification on physical specimens with a microscope. A continental-scale comparison between iNaturalist photographs and digitised specimens further demonstrated that herbaria provide disproportionately higher taxonomic and functional diversity, especially for rare taxa (Eckert et al. 2024). Similarly, an automated leaf-mass-per-area pipeline found error rates to rise sharply below three-megapixel resolution or in the absence of physical scale bars (Vasconcelos et al. 2025). These empirical findings motivate the high-dpi and multimodal approaches reviewed in Sects. 4.1 and 3.3.

2.2 Herbarium image datasets

For machine learning practitioners, these resources can be broadly categorized into two types: curated, purpose-built datasets designed for model training and evaluation, and large-scale aggregators exposed via portals and APIs (covered in Sect. 2.3). This section focuses on curated datasets assembled with clean labels and predefined splits, enabling rigorous and reproducible benchmarking across studies. Representative benchmarks and thematic sets are summarised in Table 2, including their task scope, size, label structure, licence, and canonical links.

Table 2 Datasets for herbarium specimen image classification

2.3 Major online data portals

Table 3 Representative online herbarium portals for specimen data access

Portals are live indices that expose millions of imaged specimen records via APIs. They excel at breadth, often listing orders of magnitude more sheets than any single benchmark, but typically require additional scripting, quality filtering, and licence checking before images can be used for machine learning. Global aggregators such as GBIF and iDigBio index well more than one hundred million imaged records, while institutional portals including the New York Botanical Garden Virtual Herbarium (New York Botanical Garden 2025), the Muséum national d’Histoire naturelle (Paris) virtual herbarium (Le Bras et al. 2017), and the Chinese Virtual Herbarium (CVH) (Chinese Virtual Herbarium 2024) provide regionally focused access with specialised metadata. Table 3 summarizes the largest portals, their licence models, and current image counts.

Public herbarium datasets remain geographically biased toward temperate zones and can contain historical misidentifications (Hussein et al. 2022). Extreme class imbalance, heterogeneous scanning protocols, and geo-temporal gaps present ongoing challenges. However, these same limitations offer rich opportunities to explore hierarchical classification, few-shot learning, multimodal fusion, and robust domain transfer, which represent critical frontiers in herbarium AI research.

2.4 Exploratory data analysis

This section analyses the official training splits of five public datasets widely used for herbarium specimen classification: Herbarium 2019 (Kaggle FGVC6 Team 2019), Herbarium 2020 (FGVC7 Kaggle Team 2020), Herbarium 2021 (Half-Earth) (de Lutio et al. 2021), Herbarium 2022 (NAFlora-1M) (Park et al. 2024), and PlantCLEF 2020 (herbarium subset only) (LifeCLEF–INRIA 2020). These datasets are among the largest public collections of herbarium specimen images and provide official metadata and training/test splits, enabling reproducible analysis. They differ in curation policy and taxonomic scope, leading to complementary patterns in class imbalance and resolution. All five have served as the basis for recent benchmarks and shared tasks. PlantCLEF 2022 (Goëau et al. 2022) is excluded because its release includes a substantial number of field photographs; this subsection focuses exclusively on herbarium specimen images.

The analysis highlights two aspects: class imbalance and image resolution, both within and across datasets. These results provide an important context for the evaluation metrics and model design choices discussed in Sect. 3.

2.4.1 Class imbalance

A “class” refers to a species or taxon label in the official metadata. Duplicate entries are removed using a strict (image_id, taxon) key, and class frequency is computed as the number of unique images per class.
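As a minimal sketch, the deduplication and frequency computation can be expressed as follows; the record format and toy values are illustrative, not the actual metadata schema:

```python
from collections import Counter

def class_frequencies(records):
    """Count unique images per class after removing duplicate
    (image_id, taxon) pairs. Records are (image_id, taxon) tuples."""
    unique = set(records)                     # strict dedup key
    return Counter(taxon for _, taxon in unique)

# Toy records with one duplicate entry for img1.
recs = [("img1", "sp_a"), ("img1", "sp_a"), ("img2", "sp_a"), ("img3", "sp_b")]
print(class_frequencies(recs))  # sp_a -> 2, sp_b -> 1
```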

Fig. 3

Per-dataset histograms of images per class (y-axis = %classes; x capped at the 99th percentile for each dataset)

Figure 3 shows per-dataset histograms of image counts per class, where the y-axis indicates the percentage of classes and the x-axis is truncated at the 99th percentile to avoid distortion. Within each dataset, the severity of imbalance varies. Herbarium 2019 (Kaggle FGVC6 Team 2019), Herbarium 2020 (FGVC7 Kaggle Team 2020), Herbarium 2021 (de Lutio et al. 2021), and PlantCLEF 2020 (LifeCLEF–INRIA 2020) exhibit broad distributions with long right tails: most taxa are rare, while some dominant species appear in large numbers. By contrast, Herbarium 2022 (Park et al. 2024) concentrates near its training cap: the official split is approximately 80%/20% (train/test), the training set caps examples at 80 per species, and per-taxon counts range roughly from 7 to 100. This yields a compact, right-capped distribution that suppresses extreme heads and shortens the tail.

A cross-dataset summary is provided in Fig. 4. Here, head 10% refers to the top 10% of classes by frequency after sorting, and tail 50% refers to the bottom half. Coverage denotes the proportion of all images that fall into these subsets. The Gini coefficient measures inequality in class distribution, ranging from 0 (perfectly uniform) to 1 (maximally skewed).

The number of images and classes varies by orders of magnitude across datasets, and a logarithmic scale is used to visualise these differences without letting the largest corpus dominate. In terms of head/tail coverage, PlantCLEF 2020 and Herbarium 2021 allocate most images to a small number of high-frequency classes, leading to high head coverage and low tail coverage. In contrast, Herbarium 2022 distributes more images to mid-ranked and rare classes, consistent with its per-class cap.

This pattern is also reflected in the Gini coefficient: datasets with strong head dominance (e.g., Herbarium 2020 and Herbarium 2021) show Gini values approaching 0.9, indicating high imbalance. Herbarium 2022, by comparison, exhibits a much lower Gini due to its enforced per-class limits. These metrics together provide a comprehensive view of class imbalance and scale across benchmark datasets.
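The head/tail coverage and Gini statistics discussed above can be computed directly from raw class counts. The sketch below uses an invented toy distribution rather than the actual benchmark metadata:

```python
def gini(counts):
    """Gini coefficient of class counts (0 = perfectly uniform, -> 1 = skewed)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

def head_tail_coverage(counts, head_frac=0.10, tail_frac=0.50):
    """Share of all images held by the top head_frac and bottom tail_frac
    of classes, after sorting classes by frequency."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    head_k = max(1, int(len(xs) * head_frac))
    tail_k = max(1, int(len(xs) * tail_frac))
    return sum(xs[:head_k]) / total, sum(xs[-tail_k:]) / total

# Toy long-tailed distribution: one dominant class, nine rare ones.
counts = [500] + [5] * 9
head, tail = head_tail_coverage(counts)
print(round(gini(counts), 3), round(head, 3), round(tail, 3))
```

For this toy distribution the single head class holds over 90% of the images while the bottom half of classes holds under 5%, and the Gini coefficient exceeds 0.8, mirroring the pattern seen in Herbarium 2020 and 2021.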

Fig. 4

Inter-dataset comparison: (left) numbers of images and classes (log scale); (middle) percentage of images covered by the head 10% and tail 50% of classes; (right) Gini coefficient of class counts

These observations motivate our evaluation strategy in Sect. 3: where available, class-balanced splits are used; macro-averaged metrics (e.g., macro-\(\textrm{F}_{1}\)) are prioritized over micro-averaged metrics; and performance is stratified by class frequency to assess robustness across the long tail.

2.4.2 Image resolution

Long-side resolution varies across datasets (Fig. 5). The plot includes a violin density, a P10-P90 box, and 1% jittered points. Herbarium 2019 (Kaggle FGVC6 Team 2019) shows wide variation due to scanner differences, while PlantCLEF 2020 (LifeCLEF–INRIA 2020) centres around \({1,000}\,{\textrm{px}}\) owing to normalization. Herbarium 2020 (FGVC7 Kaggle Team 2020), 2021 (de Lutio et al. 2021), and 2022 (Park et al. 2024) similarly apply a cap of \(\le {1,000}\,{\textrm{px}}\), but per-image pixel sizes for Herbarium 2022 are not released.

Fig. 5

Long-side pixel resolution. Violin shows density; box spans P10-P90; jittered dots (1%) indicate concentration. Herbarium 2019 is variable; others are normalised near 1,000 px

Resolution normalization offers consistent preprocessing, improved efficiency, and cross-study comparability. However, it can remove fine-grained traits (e.g., venation, trichomes). Native-resolution images retain morphological detail but introduce scale variance and memory overhead, often requiring multi-scale or tiling strategies. When transferring across datasets with different resolution policies, mismatches may constitute domain shifts; mitigation techniques include scale-aware augmentation and hierarchical feature architectures.
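The long-side cap reduces to simple arithmetic. The following sketch computes target dimensions under an assumed 1,000 px policy, preserving aspect ratio:

```python
def cap_long_side(width, height, cap=1000):
    """Target (width, height) after capping the long side at `cap` px,
    preserving aspect ratio; images already within the cap are untouched."""
    long_side = max(width, height)
    if long_side <= cap:
        return width, height
    scale = cap / long_side
    return round(width * scale), round(height * scale)

print(cap_long_side(4000, 6000))  # -> (667, 1000)
print(cap_long_side(800, 600))    # -> (800, 600)
```

Note that downscaling a 6,000 px scan to 1,000 px discards roughly 97% of the pixels, which is exactly why fine traits such as venation and trichomes can be lost.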

3 Herbarium image classification

The global digitisation effort has made more than 100 million specimen images available online, yet automatically identifying species from this material confronts three significant challenges. First is the ultra-fine granularity required; sister taxa may differ only by a subtle feature such as leaf-venation patterns or the indumentum on the corolla, demanding models with high discriminative power, a challenge central to the field of fine-grained image categorization (Zheng et al. 2019). Second is the extreme long-tail distribution of species; in recent benchmarks, it is common for more than 60% of species to be represented by five or fewer training images, posing a significant challenge for data-driven models (de Lutio et al. 2021). Third is the presence of heterogeneous artefacts on the sheets, such as mounting labels, colour bars, and faded pigments, which can act as noise and mislead classifiers.

Consequently, modern AI pipelines for herbarium analysis extend far beyond standard image classifiers. They incorporate modules that segment plant tissue, calibrate colours, read label text, fuse geo-metadata, adapt across different domains (e.g., from scans to field photos), and actively query experts for difficult cases. This section traces the evolution of these classification techniques, from early handcrafted features to the sophisticated, multimodal Transformer architectures in use today.

Evaluation metrics for classification

Macro-\({\textrm{F}_{1}}\). The macro-\(\textrm{F}_{1}\) score computes the \(\textrm{F}_{1}\) score (harmonic mean of precision and recall) for each class independently and averages them without weighting. This metric penalises models that perform well on dominant classes but poorly on rare ones, making it well-suited for biodiversity datasets.

Its exact computation for a set of C classes is given by first calculating \(\textrm{Precision}_c\) and \(\textrm{Recall}_c\) for each class c, then \(\hbox {F1}_c\), and finally averaging:

$$\begin{aligned} & \textrm{F}_{1,c} = 2 \cdot \frac{\textrm{Precision}_c \cdot \textrm{Recall}_c}{\textrm{Precision}_c + \textrm{Recall}_c}, \end{aligned}$$
(1)
$$\begin{aligned} & \textrm{F}_{1}^{\text {macro}}= \frac{1}{C} \sum _{c=1}^{C} \textrm{F}_{1,c}. \end{aligned}$$
(2)

This approach gives equal weight to each class, regardless of its size (van Rijsbergen 1979).
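A direct implementation of Eqs. (1)-(2) makes the metric's behaviour concrete. In the toy example below (invented labels), a classifier that ignores the rare class still scores 80% top-1 accuracy but only about 0.44 macro-\(\textrm{F}_{1}\):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 from one-vs-rest counts,
    then an unweighted mean over classes, matching Eqs. (1)-(2)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# A dominant class predicted perfectly, a rare class missed entirely:
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
print(macro_f1(y_true, y_pred, ["a", "b"]))
```

The zero \(\textrm{F}_{1}\) on the missed rare class halves the macro average, which is precisely the penalty that makes the metric suitable for long-tailed biodiversity data.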

Top-1 accuracy. Top-1 accuracy measures the proportion of samples where the predicted label matches the ground truth:

$$\begin{aligned} \mathrm {Top\text {-}1} = \frac{1}{N} \sum _{i=1}^{N} \textbf{1}\!\left[ \arg \max _{y}\, \hat{p}(y \mid x_i) = y_i \right] , \end{aligned}$$
(3)

where \(\hat{p}(y \mid x_i)\) is the model’s predicted probability distribution for sample \(x_i\), \(y_i\) is the true label, and \(\textbf{1}[\cdot ]\) is the indicator function.
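Eq. (3) can likewise be sketched in a few lines; the probability rows below are invented for illustration:

```python
def top1_accuracy(prob_rows, y_true):
    """Eq. (3): fraction of samples whose argmax prediction equals the label.
    prob_rows[i][y] is the predicted probability of class y for sample i."""
    correct = 0
    for probs, y in zip(prob_rows, y_true):
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        correct += int(pred == y)
    return correct / len(y_true)

probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
print(top1_accuracy(probs, [0, 2, 0]))  # -> 2 of 3 correct
```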

As shown in Fig. 6, reported accuracy on major herbarium classification benchmarks has risen steadily as models have evolved from classical pipelines through CNNs to transformer architectures.

Fig. 6

Performance comparison across major herbarium classification models

3.1 Classical and CNN approaches

The journey from manual feature extraction to end-to-end deep learning mirrors the broader history of computer vision, but with unique adaptations driven by the specific challenges of herbarium data.

3.1.1 Classical models with handcrafted features

Early attempts at automated specimen identification relied on hand-engineered descriptors. Researchers would extract pre-defined shape, venation, texture, or colour cues from scans and feed them into conventional classifiers like Support Vector Machines (SVMs). Although predating deep neural networks, such pipelines are AI systems: the representation is handcrafted while the decision function is learned from data.

Studies using shape cues such as SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of Oriented Gradients) seldom generalised beyond small taxonomic scopes: Clark et al. (2012) reported only 43.7% Top-1 accuracy when separating four Tilia species with SVMs. On tropical wood cross-sections, Mata-Montero and Carranza-Rojas (2016) achieved 60.2% genus-level accuracy across 24 species using handcrafted texture filters, far below modern CNN baselines. Recent evaluations on the challenging Piperaceae family confirm the gap: deep features extracted by ViT or VGG16 exceed 80% macro-\(\textrm{F}_{1}\), whereas traditional descriptors such as LBP and SURF remain below 30% (Kajihara et al. 2025). These results underline the brittleness of rule-based features in the face of extreme morphological diversity and varied specimen preparation styles.

Using 54 herbarium leaf images from three morphologically similar Ficus species, a pipeline combining morphological shape descriptors, Hu moment invariants, texture features, and HOG with ANN and SVM classifiers achieved 83.3% accuracy; the ANN yielded a slightly higher Area Under the Curve (AUC) than the SVM (Kho et al. 2017).

3.1.2 CNN-based models

CNNs constitute the classical deep-learning backbone for image tasks (LeCun et al. 1998). A cascade of convolution, pooling and fully connected layers learns increasingly abstract visual features with inductive biases such as locality and translation equivariance. Most herbarium studies therefore adopt a transfer-learning strategy: a backbone pre-trained on a large natural-image corpus such as ImageNet (Deng et al. 2009) is fine-tuned on the target herbarium dataset, which shortens training time and mitigates label scarcity (Yosinski et al. 2014). Milestone architectures, such as AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), Inception (Szegedy et al. 2015) and ResNet (He et al. 2016), remain strong baselines. Dense connectivity patterns such as DenseNet propagate features more efficiently; DenseNet-BC surpasses ResNet on ImageNet with only about 7 million parameters (Huang et al. 2017).

Early botanical experiments focused on cropped leaves or venation textures. Grinblat et al. (2016) reported that a five-layer CNN achieved 97% Top-1 accuracy for three legume crops using only venation images. Lee et al. (2015) enlarged the scope to 44 species, reaching roughly 94% accuracy and showing that CNNs capture fine-grained venation patterns better than hand-engineered features. Parallel studies began constructing benchmark collections for deep learning on herbarium material (Unger et al. 2016; Grimm et al. 2016). In the 2018 ExpertLifeCLEF track, a ResNet and DenseNet ensemble delivered 77% Top-1 accuracy, rivalling human experts (Haupt et al. 2018).

Subsequent work scaled to uncropped, full-resolution sheets. Schuettpelz et al. (2017) first demonstrated feasibility on digitised ferns. Carranza-Rojas et al. (2017) trained Inception-v1 on the Herbarium1K corpus (about 253k images across 1,204 species) and achieved 70.3% Top-1 accuracy on the 1,000 most frequent taxa, doubling the performance of handcrafted descriptors. Training typically minimised the cross-entropy loss

$$\begin{aligned} \mathcal {L}_{\text {CE}} = - \sum _{i=1}^{C} q_i \log (p_i), \end{aligned}$$
(4)

where \(C\) is the number of classes, \(q_i\) the one-hot label and \(p_i\) the predicted probability.
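Eq. (4) for a single sample is straightforward to sketch; the probability vectors below are illustrative:

```python
import math

def cross_entropy(probs, one_hot):
    """Eq. (4): -sum_i q_i * log(p_i) for one sample.
    With a one-hot label this reduces to -log of the target probability."""
    return -sum(q * math.log(p) for q, p in zip(one_hot, probs) if q > 0)

# A confident correct prediction incurs far less loss than an uncertain one:
print(cross_entropy([0.9, 0.05, 0.05], [1, 0, 0]))  # -> about 0.105
print(cross_entropy([0.4, 0.3, 0.3], [1, 0, 0]))    # -> about 0.916
```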

The FGVC6 Herbarium 2019 challenge further pushed accuracy to 89.8% on 683 species with an ensemble of SE-ResNeXt and ResNet variants (Little et al. 2020). Researchers then combined Mask Region-based Convolutional Neural Network (Mask R-CNN) for organ localisation (Wei et al. 2018) and bilinear pooling for second-order texture statistics (Lin et al. 2015), achieving additional gains when augmented by attention modules (Yang et al. 2023). Recent evaluations confirm that deep features markedly outperform traditional descriptors, even in difficult groups such as Piperaceae (Kajihara et al. 2025).

3.1.3 Attention-enhanced CNN models

The release of massive, long-tailed benchmarks like Herbarium 2020 (1.17 M images) (FGVC7 Kaggle Team 2020) and Half-Earth 2021 (2.5 M images) (de Lutio et al. 2021) shifted the research focus. To handle the extreme class imbalance, development pivoted towards attention-augmented backbones (e.g., ResNeSt) and specialised metric-learning loss functions like ArcFace. Other strategies re-weight the loss function itself, such as the class-balanced loss proposed by Cui et al. (2019), which adjusts the contribution of each class based on its effective number of samples. These innovations yielded significant gains; on the Half-Earth dataset of 64,000 species, a TResNet backbone with ArcFace loss achieved 75.7% macro-\(\textrm{F}_{1}\), demonstrating the effectiveness of metric learning for long-tailed herbarium classification (de Lutio et al. 2021). Unlike cross-entropy, the ArcFace loss directly optimises the feature embedding space by introducing an additive angular margin m to enforce higher intra-class compactness and inter-class discrepancy. Its formulation is:

$$\begin{aligned} \mathcal {L}_{\text {ArcFace}} = -\frac{1}{N} \sum _{i=1}^{N} \log \frac{e^{s \cdot \cos (\theta _{y_i} + m)}}{e^{s \cdot \cos (\theta _{y_i} + m)} + \sum \limits _{\begin{array}{c} j=1 \\ j \ne y_i \end{array}}^{C} e^{s \cdot \cos (\theta _j)}}, \end{aligned}$$
(5)

where \(\theta _{y_i}\) is the angle between the deep feature and the target weight for class \(y_i\), s is the feature scale, and m is the additive angular margin penalty (Deng et al. 2019; de Lutio et al. 2021).

To mitigate the dominance of head classes, Cui et al. (2019) proposed the class-balanced loss, which rescales the standard cross-entropy by the effective number of samples:

$$\begin{aligned} \mathcal {L}_{\textrm{CB}} = -\frac{1 - \beta }{1 - \beta ^{n_y}} \log p_y, \end{aligned}$$
(6)

where \(n_y\) is the number of training images for class y, \(p_y\) is the predicted probability, and \(\beta \in (0,1)\) controls the re-weighting strength.
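The effect of Eq. (6) is easiest to see numerically; the sketch below, with an illustrative \(\beta = 0.999\), shows how the weight for a five-image species dwarfs that of a head class:

```python
def cb_weight(n_y, beta=0.999):
    """Class-balanced weight of Eq. (6): (1 - beta) / (1 - beta**n_y),
    the inverse of the 'effective number' of samples for class y."""
    return (1.0 - beta) / (1.0 - beta ** n_y)

w_rare = cb_weight(5)      # rare species: 5 training images
w_head = cb_weight(5000)   # well-sampled head species
# w_rare is roughly two orders of magnitude larger than w_head,
# so rare-species errors dominate the re-weighted loss.
```
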

This era also saw the rise of self-supervised methods, where representation learning on millions of unlabeled sheets was shown to improve rare-class recall (Walker et al. 2022).

In parallel, data augmentation remains a cornerstone strategy. Classic augmentations like rotation and flipping are standard, but studies have shown that more advanced techniques can yield further gains. Related tooling has also matured: Ott et al. (2020) developed the GinJinn pipeline, an open-source tool that uses object detection to automate the extraction of features, such as counting reproductive organs, from herbarium specimens.

Overall, CNN evolution in herbarium classification mirrors general computer vision, but its extreme taxonomic granularity and long-tailed class distributions have prompted early adoption of attention mechanisms, metric losses, and high-resolution inputs, paving the way for the Transformers, domain transfer, and multimodal systems discussed in the following sections.

3.2 ViT approaches

The latest generation of vision models, particularly ViTs, is rapidly gaining momentum in herbarium image classification.

ViTs break the convolution paradigm by treating an image as a sequence of patch tokens processed with self-attention (Vaswani et al. 2017; Dosovitskiy et al. 2021). This global context modelling is advantageous for whole-sheet herbarium images containing dispersed structures and labels.

Since Dosovitskiy et al. (2021) first adapted the Transformer architecture for vision, variants such as DeiT (Touvron et al. 2021) have reduced training cost via knowledge distillation. By replacing the local receptive fields of CNNs with a global self-attention mechanism, ViTs can model long-range spatial relationships across an entire herbarium sheet. The core of this mechanism is Scaled Dot-Product Attention, which allows every image patch to weigh its interaction with every other patch. The output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product similarity of the query with the corresponding key:

$$\begin{aligned} \text {Attention}(\textbf{Q},\textbf{K},\textbf{V}) = \text {softmax}\!\left( \frac{\textbf{Q}\textbf{K}^\top }{\sqrt{d_k}}\right) \textbf{V}, \end{aligned}$$
(7)

where \(\textbf{Q}\), \(\textbf{K}\) and \(\textbf{V}\) are the query, key and value matrices, and \(d_k\) is the key dimension.
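Eq. (7) can be written out directly; the sketch below operates on plain Python lists of patch-token vectors rather than batched tensors:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (Eq. 7) over plain Python lists.

    Q, K, V: lists of d_k-dimensional token vectors (one per patch)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output token = attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights are uniform, so each output token is simply the mean of the value vectors, a useful sanity check.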

However, the standard ViT architecture, which downsamples inputs to a low resolution (e.g., 224\(\times \)224 px), often discards the fine venation and floral details crucial for taxonomic discrimination. Consequently, much of the recent research has focused on hybrid architectures and techniques that preserve high-resolution information.

3.2.1 CNN-transformer hybrid

To overcome the resolution limitations of early ViTs, several hybrid models that combine convolutional principles with Transformer backbones have been proposed. For instance, the Conviformer introduced a convolutional stem to process high-resolution inputs more efficiently before feeding features into a hierarchical self-attention body (Vaishnav et al. 2022). On the Herbarium 2021 public test split, a single Conviformer-B (448 px) scored 72.9% macro-\(\textrm{F}_{1}\), narrowly exceeding the SE-ResNeXt-101 reference (72.6%). For the subsequent Herbarium 2022 challenge, the same architecture reached 82.9% macro-\(\textrm{F}_{1}\) as a single model and 86.8% after a five-model ensemble, clarifying that the previously quoted “78%” had conflated different runs and metrics (Vaishnav et al. 2022). Other approaches focus on hierarchical partitioning to handle larger input sizes. Swin Transformers (Liu et al. 2021) and CSWin Transformers (Dong et al. 2022) use shifted or cross-shaped windows to compute self-attention locally, reducing complexity from quadratic to linear with respect to image size. This allows models to process inputs of 1024\(\times \)1024 px or larger on consumer-grade GPUs, retaining much more detail. Wang et al. (2024) introduced a method called “Cluster-Learngene,” which condenses ancestry-ViT attention heads into adaptive “learngenes” and transfers them to descendant models, trimming 24% training time on the Herbarium 2019 benchmark without sacrificing macro-\(\textrm{F}_{1}\).
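The quadratic-versus-linear distinction is simple arithmetic. The back-of-envelope sketch below, assuming a 16 px patch embedding and a 7×7 = 49-token window as in common Swin configurations, compares pairwise attention interactions at 224 px and 1024 px:

```python
def num_tokens(image_px, patch_px=16):
    """Number of patch tokens for a square image."""
    side = image_px // patch_px
    return side * side

def global_attention_pairs(n):
    """Full self-attention compares every token with every token: O(n^2)."""
    return n * n

def windowed_attention_pairs(n, window=49):
    """Window attention is quadratic only inside each fixed-size window,
    so the total number of interactions grows linearly in n."""
    return n * window

n_224 = num_tokens(224)    # 14 * 14 = 196 tokens
n_1024 = num_tokens(1024)  # 64 * 64 = 4096 tokens
# Going from 224 px to 1024 px multiplies global attention cost by
# (4096/196)^2 ≈ 437x, but windowed attention cost by only ≈ 21x.
```
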

Beyond backbone architectures, attention mechanisms are also being integrated into specific modules to enhance performance. For instance, Ariouat et al. (2024) demonstrated that incorporating an attention-gate mechanism into the YOLOv7 detector could significantly improve the model’s precision in the fine-grained task of plant-organ detection on herbarium sheets. The YOLO family (“You Only Look Once”) refers to one-stage object detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from each grid cell, without using a separate region proposal stage.

The impact of these advanced architectures is now evident in major competitions. In the 2024 NAFlora-1 M competition, the top-performing teams used large ensembles combining Swin V2-L, CSWin-B and DeiT-III, achieving a state-of-the-art 87.7% macro-\(\textrm{F}_{1}\) score across 15 500 species (Park et al. 2024). This result marked a significant milestone, as it was the first time Transformer-based ensembles decisively surpassed the best CNNs on a million-scale herbarium classification benchmark.

3.2.2 High-resolution transformers

The practical application of these large models has been accelerated by hardware-aware engineering. Memory-efficient attention mechanisms such as FlashAttention-2 (Dao et al. 2023) reduce the footprint of self-attention, allowing higher-resolution herbarium images, represented as longer patch sequences, to fit on a single 80 GB GPU (e.g., NVIDIA H100). In parallel, contexts of approximately 4,000 tokens have become routine for large language models, illustrating the scalability of this approach across modalities. When combined with efficient backbones like EfficientViT-M2 (Liu et al. 2023a), this allows real-time processing of high-resolution (\({600}\,{\textrm{dpi}}\)) scans. Beyond classification, Transformer backbones are elevating auxiliary pipeline tasks. In the Hespi pipeline, a Swin V2 detector localises label zones with an impressive 97.9% mean Average Precision (mAP) (Turnbull et al. 2024), while in PENet, CSWin blocks support fine-grained trait segmentation (Zhao et al. 2023).

In summary, while CNNs dominated early leaderboards due to their computational efficiency, the development of hierarchical, hybrid, and memory-efficient ViTs has firmly established them as the new state of the art in herbarium image analysis, particularly as datasets and hardware capabilities continue to scale (Table 4).

Table 4 Model performance in herbarium specimen classification (2017–2024)

3.3 Multimodal models

Expert taxonomists rarely identify a specimen from visual morphology alone; metadata such as collection locality, collector, and date provide strong contextual priors that can resolve ambiguity between visually similar species. Recent progress in reliable text-metadata transcription (Sect. 4) and in multimodal Transformer learning, following early cross-attention models such as LXMERT (Tan and Bansal 2019), has enabled models to incorporate this auxiliary information seamlessly. Building on the two-tower contrastive paradigm popularised by Contrastive Language Image Pre-training (CLIP) and its bioscience variants (e.g., BioCLIP), a herbarium-specific framework encodes the sheet image with a ViT/ResNet tower and the concatenated label text plus a [META] token with a BERT tower (Bidirectional Encoder Representations from Transformers, a widely used text encoder pretrained on large text corpora); the two embedding spaces are aligned with an InfoNCE loss (Fig. 7). The resulting image embeddings can then be used for zero-shot or few-shot species retrieval, effectively mimicking the holistic reasoning process of human experts while remaining fully differentiable.

Fig. 7

Two-tower contrastive framework. A scaled-down herbarium sheet serves as visual input (left), while the right tower consumes N label texts augmented with a [META] token encoding collection year, latitude, and collector ID

3.3.1 Integrating metadata with visual features

Early and effective approaches involve simply concatenating structured metadata with visual features learned by a CNN. For example, Guralnick et al. (2024) demonstrated that appending the collection year and herbarium code as simple features to a ResNet’s image embedding resulted in a six-percentage-point recall gain on rare taxa. This highlights that even minimal, easily parsable metadata can provide a significant performance boost. The increasing availability of datasets with validated, structured metadata, such as the NA Phenology dataset, is lowering the barrier for this type of multimodal research (Park et al. 2023).

More sophisticated methods leverage vision-language pre-training. A recent workshop paper fine-tuned CLIP on a multi-million-image herbarium corpus and reported substantial gains in zero-shot retrieval on PlantNet-300K (Sahraoui et al. 2023), although exact numbers have not yet been released. The result nevertheless highlights the potential of pairing herbarium sheets with their label text to “teach” a model botanical language.

3.3.2 Phenology as a multimodal task

Label dates are a particularly powerful form of metadata that unlocks the study of phenology (the timing of seasonal biological events). Lorieul et al. (2019) constructed a benchmark with phenological stage labels and showed that models using both image and date information outperformed image-only models by four percentage points in \(\textrm{F}_{1}\) score. Pearson et al. (2020a) took this a step further by integrating gridded climate data corresponding to the collection date and location, achieving an impressive 0.82 AUC in predicting flowering time across 25,000 species.

More recently, Ahlstrand et al. (2025) provide a comprehensive phenology-focused survey, highlighting two main themes: (1) leveraging digital specimens to study temporal trends across centuries and (2) testing long-standing ecological hypotheses at continental scales. They review sampling biases, metadata reliability, and emerging ethical considerations, and they forecast that ongoing herbarium digitisation, coupled with AI-driven trait extraction, will unlock unprecedented insights into plant responses to climate change.

3.3.3 Quality control via multimodal models

Multimodal models serve a critical role in data quality control. By flagging records where the species identification from the image conflicts with the geographic or temporal context derived from the label, these systems can help address the surprisingly high rates of taxonomic misidentification in collections, which can exceed 10% in certain lineages (Goodwin et al. 2015).

Multimodal pipelines excel when visual morphology alone is insufficient for a confident identification. While challenges in handwriting recognition and missing GPS data remain, the fusion of image and metadata within Transformer frameworks is poised to become standard practice, moving models beyond simple pattern recognition towards a more contextual, expert-like understanding.

3.4 Transfer learning in cross-domain

A growing body of research addresses the challenge of cross-domain learning, particularly between herbarium images and field photographs of living plants. These two sources of data present a substantial domain gap: herbarium sheets offer a standardized, flat, and often taxonomically verified view, whereas field photos capture plants in situ with cluttered backgrounds, variable illumination, and natural perspective distortion. Leveraging both domains is highly attractive, as herbaria provide expert-verified labels at a massive scale, while field images from platforms like iNaturalist (Van Horn et al. 2018) or Pl@ntNet (Garcin et al. 2021) are crucial for building real-world plant identification applications. The research in this area explores various methods to transfer knowledge from the well-labeled herbarium domain to the less controlled “in-the-wild” domain, often evaluating performance using not only classification accuracy but also retrieval metrics such as Mean Reciprocal Rank (MRR):

$$\begin{aligned} \textrm{MRR} = \frac{1}{N}\sum _{i=1}^{N}\frac{1}{r_i} \text {,} \end{aligned}$$
(8)

where \(r_i\) is the rank position of the first relevant item for query i.
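Eq. (8) translates directly into code; a minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """MRR (Eq. 8): ranks[i] is the 1-based rank of the first
    relevant retrieval for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose first relevant hits appear at ranks 1, 2 and 4:
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 1/2 + 1/4) / 3 ≈ 0.583
```
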

3.4.1 Unsupervised domain adaptation (UDA)

Early approaches often relied on UDA techniques that attempt to align the feature distributions of the two domains without requiring paired images. For example, Wu et al. (2023a) employed a cycle-consistent adversarial network for unsupervised domain adaptation, adapting a ResNet-50 model trained on laboratory images to field conditions for plant disease recognition. Their method achieved a final classification accuracy of 94.5% on the target domain, a significant improvement over the baseline. Other work has focused on more sophisticated loss functions. Chulif and Chang (2021) proposed a two-stream Herbarium-Field Triplet-Loss (HFTL) network that jointly minimises embedding distances for same-species herbarium-field pairs while maximising those for different species. Their NEUON submission to PlantCLEF 2021 achieved an MRR of 0.181 on the full test set and 0.158 on the long-tail subset. Building on this, Chulif et al. (2023) integrated cross-attention ViTs into the HFTL pipeline, reaching an MRR of 0.158 on a difficult subset of the PlantCLEF 2021 challenge, a notable achievement in a cross-domain retrieval scenario.

3.4.2 Self-supervised cross-domain transfer

More recent and powerful methods have shifted towards learning universal feature representations from large, unlabeled datasets. The temperature-scaled InfoNCE loss, which forms the foundation of contrastive representation learning, is formulated as:

$$\begin{aligned} \mathcal {L}_i = - \log \frac{\exp (\textbf{z}_i \cdot \textbf{z}_{i,+} / \tau )}{\sum \limits _{j=0}^{K} \exp (\textbf{z}_i \cdot \textbf{z}_{i,j} / \tau )}, \end{aligned}$$
(9)

where \(\textbf{z}_i\) denotes the embedding of the anchor image, \(\textbf{z}_{i,+}\) is its positive counterpart (e.g., another augmented view of the same specimen), and \(\textbf{z}_{i,j}\) represent K negative samples drawn from the batch. The temperature parameter \(\tau \) controls the sharpness of the softmax distribution. This objective encourages the model to maximise similarity with the positive pair while minimizing similarity with the negatives (van den Oord et al. 2018).
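A single-anchor sketch of Eq. (9) in plain Python, using unnormalised dot-product similarities for brevity (practical implementations usually L2-normalise the embeddings first):

```python
import math

def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, tau=0.07):
    """Temperature-scaled InfoNCE loss (Eq. 9) for one anchor.

    The positive occupies slot 0; the K negatives fill the rest."""
    sims = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)  # stabilise the softmax
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

When the positive is much more similar than every negative the loss approaches zero; when all similarities tie, it equals log(K + 1).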

Walker et al. (2022) demonstrated that contrastive representation learning on approximately 4.3 million unlabeled herbarium images yields features that transfer effectively to field photos without any explicit supervision; a lightweight classifier trained on these features increased Top-1 accuracy on iNaturalist leaf taxa by 6 percentage points.

Large, self-supervised ViTs serve as even more powerful bridges between domains. In a state-of-the-art example, Gustineli et al. (2024) fine-tuned a DINOv2-B model on the PlantCLEF 2024 herbarium dataset. Without using any labeled field photos for training, this model achieved a 23.0% macro-\(\textrm{F}_{1}\) score on the cross-domain, multi-label classification task, more than doubling the performance of the best ResNet-based baseline and showcasing the remarkable transferability of self-supervised features.

Table 5 Representative cross-domain and transfer-learning studies (2020–2024)

3.4.3 Generative domain translation

Another approach involves using generative models to bridge the visual gap. Generative adversarial networks (GANs), particularly those focused on style transfer, can be used to “translate” images from a source domain (e.g., lab) into a target domain (e.g., field) to serve as data augmentation. For instance, Xu et al. (2021) demonstrated this by using StarGAN v2 to translate clean, lab-based images of plant leaves into more realistic field-style images. They reported that using these synthetically generated photos for augmentation improved the final classification accuracy by 4.64 percentage points.

In summary, cross-domain research is rapidly moving from earlier adversarial alignment techniques towards leveraging large-scale, self-supervised foundation models. As these models become more powerful and are pre-trained on ever-larger and more diverse multimodal datasets, the distinction between “herbarium” and “field” features may blur, leading to more robust and universal models for plant identification. Representative studies are summarised in Table 5.

4 Complementary vision tasks

Beyond species classification, AI offers a range of methods that accelerate the entire herbarium digitisation workflow, from isolating the specimen on a sheet to extracting structured data from its label. This section reviews the AI techniques applied to the auxiliary yet critical tasks of segmentation, transcription, and semantic data extraction. These tasks transform a simple image into a rich, machine-readable data record. We examine the evolution of models for each task and discuss how integrated systems and human-in-the-loop (HITL) workflows combine these components into a cohesive, high-throughput pipeline.

4.1 Specimen image segmentation

The quality of image segmentation imposes an upper bound on the performance of the entire digitisation pipeline. By accurately separating plant material, labels, and other components from the sheet background, all subsequent analyses can operate on clean, targeted inputs. This front-loading of quality control at the pixel level is crucial for reducing the propagation of errors. Classical definitions and taxonomy follow standard texts in digital image processing (Gonzalez and Woods 2018).

Segmentation evaluation metrics

Intersection over union (IoU). IoU measures the overlap between the predicted segmentation mask (\(M_{\text {pred}}\)) and the ground-truth mask (\(M_{\text {gt}}\)):

$$\begin{aligned} \text {IoU} = \frac{|M_{\text {pred}} \cap M_{\text {gt}}|}{|M_{\text {pred}} \cup M_{\text {gt}}|}. \end{aligned}$$
(10)

It ranges from 0 (no overlap) to 1 (perfect overlap) and is the primary segmentation metric, widely used in challenges such as PASCAL VOC (Everingham et al. 2010).

Dice score. The Dice score emphasizes small objects and is monotonically related to IoU:

$$\begin{aligned} \textrm{Dice} = \frac{2\,|A\cap B|}{|A| + |B|} = \frac{2\,\textrm{IoU}}{1 + \textrm{IoU}}, \end{aligned}$$
(11)

with A and B denoting the predicted and ground-truth pixel sets, respectively (Dice 1945).
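Both metrics are straightforward to compute once masks are represented as pixel sets; a minimal sketch that also exhibits the monotone relation in Eq. (11):

```python
def iou(pred, gt):
    """IoU (Eq. 10): masks given as sets of (row, col) pixel coordinates."""
    return len(pred & gt) / len(pred | gt)

def dice(pred, gt):
    """Dice score (Eq. 11); equal to 2*IoU / (1 + IoU)."""
    return 2 * len(pred & gt) / (len(pred) + len(gt))

# Toy 2x2 example: ground-truth plant mask vs predicted mask.
plant = {(0, 0), (0, 1), (1, 1)}
predicted = {(0, 1), (1, 1), (1, 0)}
# Two shared pixels out of four in the union: IoU = 0.5, Dice = 2/3.
```
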

Average precision (AP) and mAP. For class c, the average precision is the area under its precision-recall curve:

$$\begin{aligned} \textrm{AP}_c = \int _{0}^{1} P_c(R)\, \textrm{d}R \text {,} \end{aligned}$$
(12)

and the mean Average Precision averages over classes:

$$\begin{aligned} \textrm{mAP} = \frac{1}{C}\sum _{c=1}^{C} \textrm{AP}_c \text {.} \end{aligned}$$
(13)

In instance segmentation and object detection, AP is typically computed at a given IoU threshold (e.g., \(\textrm{AP}@0.50\)). The COCO-style mAP further averages AP across multiple IoU thresholds, typically from 0.50 to 0.95 in 0.05 increments.
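A minimal sketch of Eqs. (12) and (13), approximating the integral with the trapezoidal rule over sampled precision-recall points (COCO implementations instead use interpolated precision at fixed recall levels):

```python
def average_precision(precisions, recalls):
    """AP_c (Eq. 12): trapezoidal area under the precision-recall curve.

    precisions[i] is the precision observed at recall level recalls[i],
    with recalls sorted in increasing order."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * \
              (precisions[i] + precisions[i - 1]) / 2.0
    return ap

def mean_ap(ap_per_class):
    """mAP (Eq. 13): unweighted mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

A perfect detector holds precision 1.0 across all recall levels and therefore scores AP = 1.0.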

4.1.1 Specimen-background segmentation

Modern herbarium segmentation is dominated by encoder-decoder deep networks. For the foundational task of separating the plant from its background, architectures from the U-Net family have become the de facto standard, leveraging an encoder-decoder structure with skip connections to combine high-level semantic context with fine-grained spatial information (Ronneberger et al. 2015). For example, White et al. (2020) first demonstrated that a U-Net could achieve a 0.95 mean IoU on the large Herbarium-120k dataset, and its effectiveness as a preprocessing step has been validated in numerous subsequent pipelines (Kajihara et al. 2025). A multi-collection study by Milleville et al. (2023) confirmed this, reporting a similar IoU of 0.951 with a UNet++ model and further demonstrating that a hybrid cascade (UNet++ for the plant mask followed by YOLOv8 for accessory objects) maintained plant IoU while boosting non-plant artifact precision to 98.5%. Other deep learning approaches have also been proposed, such as the VGG-inspired network by Triki et al. (2022a) for segmenting both leaves and other artifacts on the sheet. These results underscore that precise whole-sheet segmentation of plant pixels can be achieved with high accuracy under controlled conditions.

To cope with the extreme foreground-background imbalance that typically occurs in dense object detection on herbarium sheets, Lin et al. (2017) introduced the focal loss:

$$\begin{aligned} \mathcal {L}_{\text {Focal}} = -(1 - p_t)^{\gamma }\,\log p_t, \end{aligned}$$
(14)

where \(p_t\) denotes the predicted probability of the ground-truth class and \(\gamma >0\) is a focusing parameter; the modulating factor \((1 - p_t)^{\gamma }\) down-weights well-classified examples so that the model focuses on hard, informative instances.
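Eq. (14) in code, showing how the modulating factor suppresses the loss of well-classified examples:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss (Eq. 14): (1 - p_t)^gamma scales the cross-entropy,
    down-weighting examples the model already classifies well."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy example (p_t = 0.9) with gamma = 2 contributes only
# (0.1)^2 = 1% of its plain cross-entropy loss.
easy = focal_loss(0.9)
hard = focal_loss(0.1)  # hard examples retain most of their loss
```
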

4.1.2 Component level segmentation

Beyond binary masks, advanced pipelines perform component-aware segmentation to delineate not just the plant, but also labels, barcodes, scale bars, and other sheet elements. Object detectors trained on multi-institution datasets now routinely reach production-level accuracy. An improved YOLOv3 with a fourth detection scale lifts mAP at IoU threshold 0.50 from 90.1% to 93.2% on 4,000 herbarium specimens from the Herbarium Haussknecht in Jena (HHJ), Germany (Triki et al. 2020). For instance, Thompson et al. (2023) report that a YOLOv5-based model can localise eleven distinct sheet components at 0.983 precision and 0.969 recall. The HESPI pipeline (Turnbull et al. 2024) pushes performance even further: its custom detector achieves near-perfect label localisation (about 99% IoU) and an \(\hbox {F}_{1}\) score of approximately 98% across multiple benchmarks. The qualitative difference in output between these methods can be significant, as shown in the instance-segmentation comparison in Fig. 8.

Fig. 8

Instance-segmentation comparison on herbarium specimens. Left to right: Detectron2, Mask R-CNN, YOLOv8, and Mask2Former. Reproduced from the “herbarium-segmentation” repository [GitHub] https://github.com/kymillev/herbarium-segmentation under the MIT Licence; methodology follows Milleville (2025); Milleville et al. (2023)

To illustrate the qualitative output of a recent Mask2Former-based pipeline, Fig. 9 presents an original sheet, the pixel-wise class map, and the isolated plant mask. These results were obtained by re-running the open-source implementation of Milleville (2025); Milleville et al. (2023) on a representative specimen, reproducing the authors’ settings. As in their report, the component labels are accurate even though the outer silhouette is not yet at truly fine-grained resolution.

Fig. 9

Reproduction of the Mask2Former workflow of Milleville (2025) and Milleville et al. (2023) on a test specimen. Left: original herbarium image. Centre: predicted instance labels (colour-coded by organ). Right: composite plant mask

4.1.3 Plant organ level segmentation

For fine-grained phenotyping, instance segmentation of individual organs (leaves, flowers, etc.) is required. Younis et al. (2020a) pioneered this with Mask R-CNN, training a model to detect six organ categories and publicly releasing the associated annotation dataset on the PANGAEA repository (Younis et al. 2020b). Their results highlighted the complexity of the task, achieving a mAP of approximately 22%, with performance varying significantly across organs, from 37.9% for leaves down to 0% for seeds. Goëau et al. (2020) showed that a Mask R-CNN trained on only 21 Streptanthus sheets correctly enumerated buds, flowers and fruits with 77.9% accuracy, enabling fine-grained phenology at scale. Deep Leaf segments individual leaves with an average relative error of 4.6% in length and 5.7% in width across 800 test sheets (Triki et al. 2021). A refined YOLOv3 variant by Triki et al. (2022b) then lifted overall organ-level precision and recall to 94.2% and 95.5% on 3,400 annotated organs. Hussein et al. (2021a) used a DeepLabv3+ pipeline that first isolates intact leaves and then measures their traits, recording a 96% \(\textrm{F}_{1}\) score on an in-house set and 93% on a public benchmark. Using these organ masks as support, they trained a GAN-based restoration model that reconstructed over 90% of missing leaf area and yielded a four-percentage-point gain in downstream species classification recall (Hussein et al. 2021b).

More recent work has explored attention-gated versions of YOLO and Transformer-based models like Mask2Former to improve accuracy on this complex task (Ariouat et al. 2024; Milleville et al. 2023). This organ-level segmentation is a critical input for integrated systems like LeafMachine2, which automates the measurement of key morphological traits from the resulting masks (Weaver and Smith 2023). Plug-in MLaaS services inside DiSSCo already compute organ area and ruler scale for 203 sheets; a YOLO-11 detector identifies scale bars in 98% of cases and yields centimetre-accurate area estimates (Rajendran et al. 2025).

4.2 Automated label transcription

Accurate transcription of herbarium labels supplies critical metadata to biodiversity databases. Modern pipelines implement a multi-step process: locating and classifying labels, applying Optical Character Recognition (OCR) or Handwritten Text Recognition (HTR), and performing post-processing to correct and format the output.

4.2.1 Label type classification

Given the localised label crops from Sect. 4.1.2, pipelines classify each crop by type (e.g., printed vs. handwritten) so that it can be routed to a specialised transcription model, a critical step for maximising accuracy. In HESPI, a lightweight secondary classifier reaches over 98% accuracy on this task (Turnbull et al. 2024).

4.2.2 OCR and HTR

For machine-printed text, off-the-shelf OCR engines like ABBYY FineReader can exceed 99% character-level accuracy on clear labels (Drinkwater et al. 2014). Throughout this section, we measure transcription quality with the Character Error Rate (CER), a standard metric in automatic speech and text recognition, defined as:

$$\begin{aligned} \text {CER} = \frac{S + D + I}{N}, \end{aligned}$$
(15)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions required to change the predicted text to the ground-truth text, and N is the total number of characters in the ground-truth text (Morris et al. 2004).
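Eq. (15) is typically computed via the Levenshtein distance; a minimal dynamic-programming sketch:

```python
def cer(predicted, truth):
    """Character Error Rate (Eq. 15): Levenshtein edit distance
    (substitutions + deletions + insertions) divided by len(truth)."""
    m, n = len(predicted), len(truth)
    # d[i][j] = edit distance between predicted[:i] and truth[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / n

# One misread character in a seven-character genus name: CER = 1/7.
error = cer("Solanvm", "Solanum")
```
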

However, to handle domain-specific vocabulary, ensemble approaches that combine multiple OCR engines can reduce remaining field error rates by a factor of four (Guralnick et al. 2025). For challenging cursive script, Transformer-based HTR models are rapidly closing the gap. Sadek et al. (2024) demonstrated that fine-tuning a state-of-the-art Transkribus model on historical botanist handwriting can lower the CER to just 3.1%, compared to 8.3% from a generic service like AWS Textract.

4.2.3 Post-processing

After initial OCR/HTR, a post-processing stage cleans and validates the outputs, often using LLMs. Guralnick et al. (2025) used an LLM-based correction step combined with controlled-vocabulary checks to raise the overall Darwin Core field \(\hbox {F}_1\) score to 0.90–0.95 on unseen collections. Weaver et al. (2023) demonstrated that routing multiple OCR engine outputs through a GPT-4-style LLM to consolidate best readings, normalise terms, and emit structured JSON records reduces character-error rates by approximately 45% compared to single-engine OCR baselines.

4.3 NLP for structured data extraction

Once label images have been transcribed, the resulting free text must be converted into semantically structured data that can interoperate with biodiversity portals. This conversion usually follows a four-step cascade.

  1. Token normalisation and abbreviation expansion.

    Herbarium labels often contain standard botanical abbreviations (“Coll.”, “Det.”, “Alt.”). Rule-based lexicons remain effective baseline tools, but to handle the immense variety of terms, automated parsers are crucial. For example, the Salix method, an early semi-automated workflow, successfully parsed and expanded abbreviations for key fields like collector and date with over 95% accuracy on a test set of Salix specimens (Barber et al. 2013).

  2. Scientific-name recognition and parsing.

    Extracting Latin binomials is essential for linking specimens to taxonomic backbones. The open-source package gnfinder combines heuristics with machine learning to reach an \(\hbox {F}_1\) score of 0.94 on the Herbarium-NER benchmark (Mozzherin et al. 2024), while its companion gnparser splits raw strings into structured components with 96% accuracy against IPNI (Mozzherin et al. 2017; Mozzherin 2023).

  3. Locality geoparsing and coordinate cleaning.

    Translating prose locality descriptions into geographic coordinates couples NLP with GIS. Gazetteer-based matchers such as BioGeomancer correctly position about 81% of historical localities within 10 km of expert references (Guralnick et al. 2006). Post-processing tools like CoordinateCleaner then flag implausible points (centroids, oceans, capitals) and standardise uncertainty estimates (Zizka et al. 2019).

  4. Entity linking and consistency checks.

    Recognised names are reconciled with authoritative taxonomies (e.g., WCVP, Plants of the World Online (POWO)) (Hassler et al. 2021; Royal Botanic Gardens, Kew 2025). Automated tools designed for this purpose, such as the Taxonomic Name Resolution Service, have been shown to resolve synonymy with very high precision, often in the 95-98.5% range (Boyle et al. 2013). A final ensemble validation that cross-checks all linked fields, including collector, date, locality, and taxon, can increase end-to-end metadata accuracy above 92% (Guralnick et al. 2025). Unmatched or conflicting records are routed to experts for manual review, closing the loop between automated extraction and expert oversight.
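The first two steps of the cascade can be sketched with a rule-based baseline; the mini-lexicon and binomial regex below are illustrative stand-ins, far simpler than the curated vocabularies of the Salix method (Barber et al. 2013) or the ML-assisted matching in gnfinder:

```python
import re

# Hypothetical mini-lexicon of standard botanical abbreviations;
# production pipelines use much larger, curated vocabularies.
ABBREVIATIONS = {"Coll.": "Collector", "Det.": "Determiner", "Alt.": "Altitude"}

# Loose pattern for a Latin binomial: capitalised genus + lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]+)\b")

def normalise_label(text):
    """Step 1 (sketch): expand standard botanical abbreviations."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

def find_binomials(text):
    """Step 2 (sketch): candidate scientific names; real systems such as
    gnfinder add name dictionaries and ML-based filtering on top."""
    return [" ".join(match) for match in BINOMIAL.findall(text)]
```

A pure regex will over- and under-match on real labels (e.g., capitalised place names followed by lowercase words), which is exactly why production tools layer dictionaries and learned filters on top.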

4.4 Human-assisted workflows

Real-world digitisation pipelines now weave together automated vision modules and targeted human expertise. Most production stacks follow a modular recipe: component detection, text transcription, and NLP structuring, but differ in how they invite expert or volunteer input. The entire end-to-end process, from specimen handling to final data publication, has been mapped in detail to identify bottlenecks and opportunities for automation (Thompson and Birch 2023).

4.4.1 Component-level modular pipelines

Component-level herbarium pipelines interleave automated vision modules with targeted human input at key checkpoints. Low-confidence detections may be routed to volunteers for annotation; OCR or HTR outputs are verified and normalised by experts; and final taxonomic resolution or database triage is typically handled by curators. This hybrid design balances scalability with data quality and is common across production systems. Figure 10 summarises the current herbarium digitisation pipeline, showing how automated vision modules and LLM-driven metadata structuring interact with volunteer input and expert triage before final database integration.

Fig. 10

End-to-end herbarium digitisation workflow integrating computer-vision modules, LLM-based metadata structuring, and human contributions prior to database write-back and GBIF synchronisation

A leading example is LeafMachine2, a modular, open-source pipeline developed through a collaboration of nearly 300 institutions (Weaver and Smith 2023). Rather than a single model, it orchestrates YOLOv5 for object detection and Detectron2 for fine-grained segmentation, processing sheet components, plant parts and labels in parallel (Fig. 11). VoucherVision extends this architecture with automated transcription and parsing (Weaver et al. 2023), while many research prototypes follow the same blueprint. For example, a pipeline comprising a U-Net, ViT and an MLP classifier was evaluated on Piperaceae sheets (Kajihara et al. 2025). Convergent ideas appear in the HESPI pipeline from the University of Melbourne (Turnbull et al. 2024).

Fig. 11

Modular workflow of the LeafMachine2 pipeline. The system uses distinct models for sheet-component detection, leaf segmentation, pseudo-landmark placement, and archival cropping. Figure reproduced from Weaver and Smith (2023), licensed under CC BY 4.0

4.4.2 Integrated human-AI loops and expert tools

Strategic human intervention raises both accuracy and trust. Bounding boxes drawn by a handful of volunteers can guide a detector-OCR ensemble, achieving 93% success in locating primary labels (Guralnick et al. 2024). The same study trained a ResNet-50, not to identify species but to flag likely errors in the Herbarium 2020 dataset. By reviewing only the top 3% most suspicious sheets, experts captured 87% of genuine misidentifications, cutting inspection time from 42 s to 11 s per sheet.

Active learning creates a feedback loop where the AI model actively requests human help to improve itself. The model identifies candidates for human review by ranking their predictive uncertainty. A common method is to use the entropy of the model’s predicted probability distribution \(\textbf{p}\) for a given sample:

$$\begin{aligned} H(\textbf{p}) = - \sum _{c=1}^{C} p_c \log _2(p_c), \end{aligned}$$
(16)

where \(p_c\) is the predicted probability of class \(c\) and \(C\) is the number of classes. Samples with higher entropy are considered more uncertain and are prioritised for expert annotation (Settles 2009).
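Eq. 16 translates directly into a selection routine; the toy probability vectors below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability distribution, Eq. 16."""
    return -sum(pc * math.log2(pc) for pc in p if pc > 0)

def select_for_review(predictions, k=2):
    """Rank samples by predictive entropy and return the indices of
    the k most uncertain, to be sent to expert annotators."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

preds = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # highly uncertain (near-uniform)
    [0.70, 0.20, 0.10],  # moderately uncertain
]
print(select_for_review(preds, k=2))  # [1, 2]
```

In a real loop, the selected sheets would be annotated, added to the training set, and the model retrained before the next round of selection.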

Systems such as VoucherVision display the image, OCR text and predicted taxonomy side-by-side, letting specialists overwrite any step (Weaver et al. 2023). Uncertainty heat-maps and versioned audit trails cut curation time by over 40% while keeping data accuracy above 95%.

Herbaria historically harbour baseline taxonomic error rates (Goodwin et al. 2015), and many new species are discovered by re-examining existing collections (Bebber et al. 2010). Screening networks that flag outliers therefore act as computational microscopes, allowing experts to focus on rare taxa and potential novelties. Walker et al. (2022) used self-supervised learning to surface morphological clusters that botanists later formalised as new species.

The relationship between humans and AI is no longer confined to post-hoc error correction; it is evolving into a collaborative engine for scientific discovery. Comparative studies show that state-of-the-art classifiers routinely outperform human generalists on common, well-represented species, whereas domain experts remain superior at recognising rare taxa and detecting novelties (Bonnet et al. 2018). An optimal division of labour therefore lets models handle bulk identification, while specialists scrutinise outliers and potential new species, turning AI into a computational microscope that accelerates taxonomic insight.

Table 6 summarises milestone HITL initiatives spanning the past decade.

Table 6 HITL initiatives for digitised herbaria

4.4.3 Crowdsourcing at collection scale

Volunteer platforms such as Les Herbonautes (France), Notes from Nature (global), and DigiVol (Australia) routinely deliver high-quality label transcriptions (https://research.mnhn.fr/projects/les-herbonautes, https://www.notesfromnature.org/, https://volunteer.ala.org.au/). An Annonaceae campaign reached 97% field-level accuracy (Streiff et al. 2024), while volunteer-drawn label boxes lowered OCR word-error rates four-fold (Guralnick et al. 2024). Hybrid workflows that interleave crowdsourcing with automated checks (e.g., the filtering approach of Guralnick et al. 2025) illustrate how mass participation and machine intelligence can scale far beyond what either could achieve alone.

Taken together, the shift from component-level detectors to fully integrated, human-AI Transformer pipelines has lifted classification accuracy to the brink of routine deployment. But progress is still capped by input fidelity: blurred scans, imprecise segmentations and noisy transcriptions propagate errors that no downstream model can repair. The next subsection therefore turns to vision language models that fuse image features with textual context, seeking to relieve these input bottlenecks.

4.5 Vision language models

LLMs and Vision Language Models (VLMs) increasingly replace pre-trained NLP modules and even perform zero-shot classification. A single model now ingests pixels plus raw text and outputs taxon names, collection descriptions, structured information or natural-language explanations.

4.5.1 Vision language model pre-training

Large ViTs pre-trained on massive unlabeled datasets using self-supervised objectives have proven to be exceptionally effective feature extractors. Models like DINOv2 and VLMs like SigLIP provide powerful, ready-to-use backbones (Zhai et al. 2023). Fast Language-Image Pre-training (FLIP) masks about 70% of patches so the same GPU hours cover 3\(\times \) more image-text pairs, yielding a 1.4 percentage point (pp) gain in zero-shot ImageNet accuracy versus vanilla CLIP at equal compute (Li et al. 2023a). In a particularly notable study, Gustineli et al. (2024) achieved a 23.0% macro-\(\textrm{F}_{1}\) score on the challenging, multi-label PlantCLEF 2024 cross-domain task by tiling a self-supervised DINOv2-B model over 4k-pixel herbarium sheets, thereby doubling the performance of the best ResNet-based baseline. Combining contrastive self-supervision with language supervision, SLIP (Self-supervision Meets Language-Image Pre-training) lifts ImageNet zero-shot Top-1 by 5.2 pp over CLIP and by 8.1 pp over pure SSL on identical data (Mu et al. 2022). Distilling knowledge from these giant VLMs, TinyCLIP combines weight inheritance with 'affinity mimicking' to halve the parameter count of ViT-B/32 while keeping ImageNet zero-shot accuracy within 0.3 pp (Wu et al. 2023b).
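The FLIP idea of discarding most patches per step is simple to sketch. The patch count below assumes a ViT-B/16 at 224 × 224 input (14 × 14 = 196 patches); the function is an illustrative stand-in, not the paper's implementation:

```python
import random

def mask_patches(num_patches, mask_ratio=0.7, seed=0):
    """FLIP-style random patch masking: keep only a random subset of
    image patches so each contrastive training step is cheaper."""
    rng = random.Random(seed)
    n_keep = int(num_patches * (1 - mask_ratio))
    keep = rng.sample(range(num_patches), n_keep)
    return sorted(keep)

# Masking 70% of 196 patches leaves 58 visible patches per image,
# so the same GPU budget covers roughly 3x more image-text pairs.
visible = mask_patches(196, mask_ratio=0.7)
print(len(visible))  # 58
```

Only the kept patch indices are embedded and fed to the encoder; the mask is resampled every step, so the model eventually sees all regions across epochs.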

4.5.2 Text normalisation and conversational assistance

LLMs fine-tuned for OCR clean-up, abbreviation expansion and date reformatting have begun to replace rule-based post-processing. BLIP-2, for instance, combines a frozen ViT-G/14 image encoder with a 13B language decoder and has demonstrated competitive OCR correction on several public benchmarks (Li et al. 2023b). Early experiments with GPT-4V(ision) likewise report improved Latin-name canonicalisation, although results remain unpublished (OpenAI et al. 2024). Beyond text normalisation, multimodal backbones can answer free-form expert queries: LLaVA\(-\)1.5 and Kosmos-2 both support prompts such as “Which floral parts are present?” and deliver state-of-the-art accuracy on the ScienceQA and VQA benchmarks (Liu et al. 2023b; Peng et al. 2023), suggesting immediate applicability to herbarium curation tasks.

Second-generation systems embed VLMs directly into the vision stack. VoucherVision v2 replaces rule-based string parsing with a GPT-4V module that resolves collector names and locality phrases, reporting higher parsing accuracy in internal tests (Weaver et al. 2023).

HESPI combines BLIP-2 with rule-based checks so that detected geocoordinates must be consistent with the locality text, substantially lowering false positives in its public demo (Turnbull et al. 2024). Early user studies indicate that experts gain confidence when the model explains edits in plain language, although hallucination remains a concern mitigated by retrieval-augmented prompts.

4.5.3 Open-set recognition

OSR assumes that unknown taxa may appear at test time. Zero-shot models, which do not require explicit class labels during training, offer a natural solution to this challenge. Image-text encoders pre-trained on web-scale captions (e.g., SigLIP, CLIP and ImageBind) already display useful zero-shot accuracy on long-tailed vision tasks, and small herbarium trials show the same qualitative trend (Zhai et al. 2023; Girdhar et al. 2023). On PlantCLEF 2024, Gustineli et al. (2024) fine-tuned a self-supervised DINOv2 backbone and achieved the competition’s highest macro-\(\textrm{F}_{1}\), confirming that foundation models can mitigate class imbalance without exhaustive labels.

Generalised-category-discovery (GCD) methods push further. NeighbourGCN refines pseudo-labels on k-Nearest Neighbors (k-NN) sub-graphs and sets a new state-of-the-art 38.6% macro-\(\textrm{F}_{1}\) on Herbarium 2019, 3.2 pp above the previous GCD baseline (Yang et al. 2025). GET-CLIP unlocks CLIP’s multimodal capacity for fine-grained domains: prompt tuning plus geometric alignment improve zero-shot macro-\(\textrm{F}_{1}\) on Herbarium 2019 by up to 5.3 pp (Wang et al. 2025). Hallucinated localities or fabricated author names nevertheless pose risks. Retrieval-augmented generation lowers factual error in dialogue by more than 60% (Shuster et al. 2021); adapting such prompts with Darwin-Core fields and GBIF look-ups to herbarium metadata is an open research avenue, and production systems still route high-impact predictions, such as putative new species, back to the expert dashboards in Sect. 4.4 for manual approval.
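A retrieval-augmented prompt of the kind suggested above might be assembled as follows. The field names follow the Darwin Core standard, but the helper function, record shape, and prompt wording are hypothetical:

```python
def build_rag_prompt(ocr_text, retrieved):
    """Ground an LLM with retrieved Darwin-Core records (e.g., from a
    GBIF lookup) before asking it to structure a label transcription.
    Sketch only: prompt wording and record format are illustrative."""
    context = "\n".join(
        f"- scientificName: {r['scientificName']}; country: {r['country']}"
        for r in retrieved
    )
    return (
        "Use ONLY the retrieved records below as taxonomic context.\n"
        f"Retrieved records:\n{context}\n\n"
        f"Label text:\n{ocr_text}\n\n"
        "Return Darwin-Core fields (scientificName, recordedBy, "
        "eventDate, locality) as JSON; answer 'unknown' if unsure."
    )

records = [{"scientificName": "Quercus robur L.", "country": "France"}]
prompt = build_rag_prompt(
    "Quercus robur, leg. Dupont, 1912, Fontainebleau", records)
print("Quercus robur L." in prompt)  # True
```

Constraining the model to the retrieved records, and allowing an explicit "unknown", is what reduces hallucinated localities and author names in practice.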

Foundation VLMs already match or exceed bespoke pipelines on small testbeds while offering richer, explainable outputs. Coupling them with FAIR digital-specimen APIs and energy-efficient adapter tuning is likely to define the next wave of herbarium AI services.

5 Challenges

Despite the progress reviewed above, several fundamental obstacles still limit the routine deployment of herbarium-AI systems. Fewer in number yet broader in scope, the following four challenges subsume earlier concerns and point to concrete research directions.

5.1 Data imbalance

Herbarium datasets follow a long-tail distribution: 60% of species are represented by fewer than five sheets, which biases empirical risk minimisation toward common taxa and degrades recall on threatened or newly described lineages (de Lutio et al. 2021). Domain gap further compounds the issue: networks fine-tuned on pressed specimens generalise poorly to field photographs or plant sections. Recent solutions include class-balanced focal loss, meta-batch sampling, and few-shot adapters that leverage large self-supervised encoders (Wang et al. 2020; Chen et al. 2020; He et al. 2020). Vision-language pre-training (e.g., BioCLIP) now delivers 18–25% macro-\(\textrm{F}_{1}\) gains in zero-shot transfer between Herbarium 2020 and iNaturalist subsets (Stevens et al. 2024), yet robust calibration under extreme shift remains unsolved.
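One widely used recipe for such imbalance combines focal loss with effective-number class re-weighting. The sketch below shows both ingredients in plain Python; it is an illustration of the general technique, not the exact formulation of the cited works:

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Effective-number re-weighting: w_c proportional to
    (1 - beta) / (1 - beta**n_c), normalised to sum to C,
    so rare classes receive larger weights."""
    raw = [(1 - beta) / (1 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for the true-class probability p_true:
    -w * (1 - p)**gamma * log(p). Easy, high-confidence examples
    are down-weighted by the (1 - p)**gamma factor."""
    return -weight * (1 - p_true) ** gamma * math.log(p_true)

# Long-tailed toy dataset: a common species vs. a rare one.
weights = class_balanced_weights([5000, 5])
print(weights[1] > weights[0])  # True: the rare class is up-weighted
```

During training, each sample's focal loss would be multiplied by the weight of its class, shifting gradient mass toward the sparse tail of the species distribution.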

5.2 Information loss

Flattening, drying and long-term storage inevitably strip herbarium sheets of characters that are conspicuous in living plants. Quantitative metrics such as leaf-mass-per-area exhibit resolution-dependent error once scan size drops below three megapixels (Vasconcelos et al. 2025). Internal anatomy is likewise hidden: sclereid density, vascular bundle shape and seed endosperm remain inaccessible without destructive sectioning or \(\upmu \)CT. Even \({600}\,{\textrm{dpi}}\) scans omit micromorphological features smaller than \({50}\,{\upmu {\textrm{m}}}\), and virtual revisions recover only approximately 80% of diagnostic traits (Phang et al. 2022).

Pressing also distorts shape: a survey of 794 leaf pairs from 22 taxa showed lamina area shrinking by 5–18%, with changes in length-to-width ratio that can mislead morphometric analyses (Tomaszewski and Górzkowska 2016). Reflectance spectra of pressed leaves differ systematically from fresh tissue: drying removes dominant water absorption features (around \({1,450}\,{\textrm{nm}}\) and \({1,930}\,{\textrm{nm}}\)) and shifts the red edge to longer wavelengths, altering which traits can be reliably inferred (Kothari et al. 2023). Floral colours, particularly reds and blues, are known to be poorly preserved during pressing and drying and to fade over time (Bridson and Forman 1998).

Future pipelines must integrate focal stacking, photogrammetric texture maps and low-dose \(\upmu \)CT. Cross-modal contrastive objectives could align these richer 3D/IR modalities with legacy 2D scans, enabling backward compatibility while progressively upgrading trait coverage. Linking pressed sheets to field photographs or iNaturalist observations via shared embeddings would further restore colour and phenological context lost at collection time.

5.3 Model interpretability and explainability

Without explanatory heat-maps, deep classifiers remain black boxes to taxonomists. CAM and Grad-CAM reveal which pixels drive decisions but ignore part structure (Selvaraju et al. 2017; Zhou et al. 2016). Combining AI with organ-level masks (PENet) supports high-throughput measurement of lamina area, petiole angle and vein density (Zhao et al. 2023). The next step is to link attention hotspots directly to formal descriptors (e.g., “pinnate venation present”), closing the loop between image evidence and descriptive terminology.
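The Grad-CAM computation itself is compact: each feature map is weighted by its spatially averaged gradient, the weighted maps are summed, and negatives are clamped to zero. A minimal sketch on toy 2 × 2 activations (the arrays below are invented for illustration):

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM over nested-list tensors shaped
    [channels][h][w]: channel weights are global-average-pooled
    gradients; the heat-map is the ReLU of the weighted sum."""
    h, w = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[max(0.0, sum(wk * activations[k][i][j]
                         for k, wk in enumerate(weights)))
            for j in range(w)] for i in range(h)]
    return cam

acts = [[[1.0, 0.0], [0.0, 2.0]],      # channel 0 activations
        [[0.0, 1.0], [1.0, 0.0]]]      # channel 1 activations
grads = [[[1.0, 1.0], [1.0, 1.0]],     # channel 0: positive gradient
         [[-1.0, -1.0], [-1.0, -1.0]]] # channel 1: negative gradient
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 2.0]]
```

Upsampled to the input resolution and overlaid on the specimen image, such maps let taxonomists see whether the classifier attends to diagnostic organs or to artefacts like labels and rulers.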

5.4 Scalable open-set recognition

Large-scale vision-language encoders such as SigLIP, CLIP and ImageBind already achieve useful zero-shot results on long-tailed benchmarks, and prompt-guided adapters push further: PromptCAL and Targeted Representation Alignment improve discovery accuracy on ImageNet-O and Herbarium 2019, while GET-CLIP raises zero-shot macro-\(\textrm{F}_{1}\) on Herbarium 2019 by 5.3 pp through geometric alignment and prompt tuning (Zhai et al. 2023; Zhang et al. 2023; Liu et al. 2024; Wang et al. 2025). Graph-based methods follow the same agenda: NeighbourGCN refines pseudo-labels on \(k\)-NN sub-graphs and reaches 38.6% macro-\(\textrm{F}_{1}\) on Herbarium 2019, 3.2 pp above the previous state of the art (Yang et al. 2025). Nevertheless, classical open-set filters (OpenMax, POEM) still lose recall when the novel-class rate exceeds 10% (Scheirer et al. 2013), and foundation VLMs process at most \(4\text {k}\times 4\text {k}\) inputs on 80 GB GPUs, which is far below the \({600}\,{\textrm{dpi}}\) masters held by major institutions.
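The simplest member of this open-set family thresholds the maximum softmax probability; OpenMax and POEM refine the idea considerably, but the underlying rejection principle can be sketched in a few lines (the threshold value here is arbitrary):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def open_set_predict(logits, threshold=0.5):
    """Baseline open-set rule: if the maximum softmax probability
    falls below the threshold, reject as 'unknown' instead of
    forcing a known-class label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else "unknown"

print(open_set_predict([4.0, 0.1, 0.2]))  # confident -> class 0
print(open_set_predict([1.0, 0.9, 1.1]))  # ambiguous -> 'unknown'
```

The recall loss noted above arises because, as the novel-class rate grows, genuinely new taxa increasingly produce confident but wrong maxima that pass this threshold.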

Future work must attack four fronts simultaneously: (i) memory-efficient adapters and sparse attention to handle full-resolution TIFFs, (ii) retrieval-augmented prompts that inject Darwin-Core triples and GBIF look-ups to reduce hallucination, (iii) fine-grained open-set detectors able to flag cryptic sister taxa distinguished by tiny morphological differences, and (iv) community benchmarks that evaluate imbalance, novelty and multimodal reasoning in one place. Progress on these fronts is essential before foundation models can be trusted across the world’s 400 million digitised herbarium sheets.

6 Conclusions and future directions

Digitised herbarium sheets now constitute one of the largest fine-grained image repositories in biology, and the past decade has seen a decisive transition from rule-based descriptors to end-to-end learning pipelines. Section 2 detailed how IIIF imaging and Darwin-Core metadata have turned formerly local cabinets into machine-actionable resources comprising more than 100 million high-resolution scans. Section 3 traced the evolution of recognition models, from handcrafted features through convolutional baselines to self-supervised ViTs, while Sect. 4 showed that accurate upstream modules (segmentation, OCR/HTR, entity resolution) are indispensable for maximising downstream taxon accuracy and enabling multimodal fusion.

Three overarching insights recur throughout this survey. First, the fidelity of the input data (including scan resolution, colour calibration, and precise label alignment) continues to dictate the ceiling on model performance, regardless of architectural sophistication. Second, transfer learning from large natural-image datasets remains a remarkably effective catalyst: fine-tuning ImageNet-pre-trained backbones routinely yields double-digit gains, especially on the extreme long-tailed distributions that typify floristic datasets. Finally, workflows that embed human expertise at strategic junctures (such as active learning loops, expert dashboards, and volunteer verification) not only correct residual errors but also preserve the domain-expert trust essential for responsible deployment.

Looking ahead, four research thrusts appear most likely to unlock the full potential of the world’s 400 million preserved specimens:

  • Generative and self-supervised augmentation. Self-distilled ViT encoders trained with large-scale contrastive objectives now match or exceed fully supervised CNNs on multi-label plant benchmarks, yet performance still degrades on the long botanical tail. Class-conditional diffusion models and text-guided image editing can synthesise realistic variants (additional organs, colour channels, or phenophases) for taxa represented by only one or two sheets, broadening intra-class diversity without changing labels. Combined with synthetic-to-real domain adaptation and active-learning loops, these generators can focus expert effort on the most informative gaps and reduce annotation cost. To avoid overfitting or drift, synthetic images should be explicitly flagged and evaluation kept on held-out, real-only test sets.

  • Truly multimodal foundation models. Current CLIP-style resources are image-caption only; species-level identification benefits from a joint embedding of high-resolution scans, transcribed label text, GPS provenance, climate layers and, where available, DNA barcodes. Rank-aware contrastive objectives (“hierarchical contrastive losses”, i.e., losses aligned with botanical classification (family-genus-species) that pull together positives at multiple ranks and separate negatives across ranks), together with graph neural networks linking sheets across space and time, could enable zero-shot recognition of previously unseen or undescribed taxa and provide quantitative novelty scores for taxonomists.

  • Machine-actionable digital specimens. Persistent identifiers, versioned annotations, and open APIs (as exemplified by the DiSSCo blueprint) enable recognition-as-a-service pipelines that write new determinations, organ counts, and trait values directly back to collection databases. Streaming inference on the ingest server, combined with real-time validation dashboards, would shorten the feedback loop between experts, AI models, and biodiversity aggregators such as GBIF and iDigBio.

  • Responsible and sustainable deployment. Herbarium AI must embrace green-AI accounting (tracking GPU hours and carbon cost), federated or privacy-preserving learning to respect sensitive locality data, and the CARE (collective benefit, authority to control, responsibility, and ethics) as well as FAIR principles to ensure equitable benefit-sharing with source countries and Indigenous knowledge holders. Benchmark reports should include energy use and bias audits alongside accuracy metrics.

Taken together, these threads point toward an AI “digital taxonomist”: a modular system that (i) assigns names at family, genus and species rank, (ii) segments reproductive and vegetative organs, (iii) extracts quantitative morphological and phenological traits, (iv) flags outliers for expert review and (v) synchronises structured data with downstream ecological workflows. Such capabilities would vastly accelerate species discovery, reveal shifts in flowering and fruiting phenology under climate change, and inform timely conservation actions that directly tackle the biodiversity crisis outlined in Sect. 1.

Realising this vision will require sustained collaboration among computer scientists, botanists, collection managers and citizen scientists; cross-institutional benchmarks that include under-sampled tropical floras; and rigorous ethical safeguards. Building on the foundations synthesised in this survey, the community can convert centuries-old specimens into a living, quantitative observatory of global plant diversity.