1 Introduction

The confluence of a global biodiversity crisis and a data revolution in natural history collections has created both an urgent challenge and a significant opportunity for botanical science. On one hand, accelerating biodiversity loss threatens c. 400 000 plant species, many of which may disappear before formal description (Antonelli et al. 2020; Royal Botanic Gardens Kew 2023; Christenhusz and Byng 2016; Royal Botanic Gardens, Kew 2025). On the other hand, the world’s herbaria (collectively holding more than 400 million specimens) are undergoing large-scale digitisation, and over 100 million high-resolution plant specimen images are already accessible online (Thiers 2025; Soltis 2017; Lang et al. 2019). Herbaria are curated repositories of preserved plant specimens, each consisting of a dried, pressed plant mounted on archival paper and accompanied by detailed label metadata (as shown in Fig. 1), and serve as permanent reference libraries for taxonomy, ecology and conservation planning (Bridson and Forman 1998). Each image captures overlapping organs, handwritten labels, and calibration artefacts. Across collections, the data exhibit an extreme long-tail distribution. For example, within the Herbarium 2020 benchmark (the 2020 FGVC7 dataset containing approximately 1.17 million images across 32,000 species), 60% of species have five or fewer training images, whereas a small subset have thousands (FGVC7 Kaggle Team 2020).

This extreme imbalance, where the rarest taxa are under-represented and thus particularly difficult to study and classify, underscores the need for scalable AI methods capable of fine-grained recognition and robust performance on rare species.

Traditional taxonomic workflows rely on expert examination of morphology, dichotomous keys and authoritative literature (Simpson 2019; Stuessy 2009; Winston 1999). Although scientifically rigorous, these methods are time-consuming and subject to inter-expert variation, leading to documented misidentification rates in major collections (Goodwin et al. 2015; Govaerts 2001). It is estimated that many undescribed species already reside in herbarium cabinets, historically requiring an average of 35 years to be detected and formally named (Bebber et al. 2010).

Fig. 1

Annotated herbarium sheet (Royal Botanic Gardens, Edinburgh; CC BY 4.0) highlighting typical visual elements parsed in taxonomic workflows: pressed specimen, collection label with locality and collector data, institutional barcode, and ancillary fragments such as fruits or seeds

To fully appreciate the application of AI to herbarium specimens, it is essential to understand both the nature of the data source and the fundamental concepts from traditional botany and modern computer science that are being applied. In practice, an expert follows a multi-step reasoning process that has changed little since the nineteenth century:

  • The role of type specimens. A “type” is the single specimen to which a scientific name is formally attached. According to the International Code of Nomenclature for algae, fungi and plants, every new species description must designate a type specimen, which serves as the definitive reference for that name (Turland et al. 2018).

  • Determination and revision. Naming a specimen (determination) involves comparing it with verified material and diagnostic characters in the literature. Dichotomous keys guide this process where available; in groups lacking keys, taxonomists construct them during revision. Modern interactive online keys permit unordered character selection and return real-time candidate lists (e.g., ActKey (Brach and Song 2005), KeyBase (Royal Botanic Gardens Victoria 2025), Xper3 (Kerner et al. 2025)). New determinations may supersede earlier ones, and all revisions are affixed to the specimen sheet.

  • Floras and monographs. Floras summarise all species in a defined region, whereas monographs treat a single lineage throughout its distribution (Winston 1999; Judd et al. 2007). Producing such works often requires decades of coordinated field and herbarium study.
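The determination workflow above can be caricatured in code: a dichotomous key is essentially a binary decision tree over character questions. As a minimal sketch, the couplets and taxon names below are invented for illustration, not drawn from any real key:

```python
# Minimal sketch: a dichotomous key as a nested tuple tree.
# Each internal node is (question, yes_branch, no_branch); leaves are names.
# The couplets and taxa below are illustrative, not a real key.
KEY = (
    "Leaves opposite?",
    ("Flowers in umbels?", "Taxon A", "Taxon B"),
    ("Leaf margin serrate?", "Taxon C", "Taxon D"),
)

def identify(node, answer):
    """Walk the key using answer(question) -> bool until a leaf is reached."""
    while isinstance(node, tuple):
        question, yes_branch, no_branch = node
        node = yes_branch if answer(question) else no_branch
    return node

# Example: a specimen with opposite leaves and umbellate flowers.
observed = {"Leaves opposite?": True, "Flowers in umbels?": True}
print(identify(KEY, lambda q: observed.get(q, False)))  # -> Taxon A
```

Interactive keys such as ActKey or Xper3 generalise this rigid traversal by letting users answer characters in any order and returning the set of still-compatible taxa.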

These steps form the foundation onto which AI-driven pipelines now build: image segmentation, text recognition, and automated classification aim to replicate and ultimately accelerate the expert workflow, while retaining an essential stage of human validation.

To accelerate discovery and mitigate these limitations, AI methods offer a data-driven alternative. For computer vision, herbarium images pose a fine-grained recognition problem: inter-species differences are subtle, and background noise is considerable. Recent studies show that deep models such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) can learn discriminative features directly from millions of specimen images (Krizhevsky et al. 2012; Dosovitskiy et al. 2021). Carranza-Rojas et al. (2017) demonstrated that an early CNN reached approximately 70% accuracy across 1,000 species, highlighting the feasibility of semi-automated workflows that underpin the concept of the “extended specimen” (Heberling et al. 2021; Wen et al. 2015). A practical workflow now realises this concept: each newly collected sheet is paired with its iNaturalist field photograph, embedding habit, colour, and geo-metadata directly into the specimen lifecycle (Heberling and Isaac 2018).

Surveys on automated plant recognition largely address living plants. Early work reviewed whole-plant images and freshly detached leaves captured in controlled settings (Wäldchen and Mäder 2018; Saranya et al. 2021), and more recent updates extend this line to broad-spectrum plant recognition and crop-disease diagnosis (Barhate et al. 2024; Upadhyay et al. 2025; Yilmaz et al. 2025; Tiwari and Dev 2025). Two prior reviews are closely related to this topic. Hussein et al. (2022) present a systematic survey of tasks, datasets, publication venues, and challenges in digitised herbarium research. Their focus lies in usage trends and broad methodological categories, without detailed analysis of model architectures and training regimes; moreover, the review predates foundation models and open-vocabulary vision methods that now dominate the field. Pearson et al. (2020b) concentrate on phenological annotation, describing a modular processing workflow and infrastructure for extracting seasonal signals from specimen images. Their scope excludes classification models, segmentation and detection benchmarks, or multimodal image-text pipelines.

The present review synthesises recent advances in model architectures and systems for herbarium image analysis, covering convolutional and transformer-based classifiers, training strategies for long-tailed or self-supervised learning, open-set recognition, and multimodal pipelines that align image and label content for end-to-end prediction. To date, no survey has examined herbarium-specimen classification from an AI perspective in comparable depth. Previous overviews omit large parts of the technical landscape, ranging from convolutional and transformer architectures to specialised loss functions, self-supervision strategies, and multimodal pipelines. Against this backdrop, the present work offers four concrete contributions:

  • Catalogue publicly available herbarium specimen image datasets and portals, detailing the complete digitisation workflow from high-resolution scanning to metadata capture.

  • Review and compare the evolution of taxon-recognition models, from traditional machine learning to CNN, transformer, and multimodal Vision-Language architectures, focusing on their performance with fine-grained, long-tailed botanical data.

  • Survey and analyse supporting modules for segmentation, artefact removal, optical/handwriting character recognition, and metadata alignment, and examine their integration within operational collection-management systems.

  • Identify and highlight open challenges, delineating research gaps in few-/zero-shot learning, Open-Set Recognition (OSR), and model and data stewardship, and outline a future research agenda to address them.

Figure 2 summarises the organisation of this paper. Section 2 examines the nature of digitised specimens and describes key datasets and portals. Section 3 details the four paradigms of species-classification research, while Sect. 4 surveys auxiliary techniques that complete the pipeline. Finally, Sect. 5 discusses open challenges, and Sect. 6 presents our roadmap and concludes the paper.

Fig. 2

Overview of the review paper structure and the main topics covered in this survey. Each major section and its subtopics are shown, reflecting the organisation and logical flow of the manuscript

This article is a narrative review that maps recent and influential work on herbarium specimen image analysis. A structured search (2010–2025) covered ACM Digital Library, Scopus, IEEE Xplore, SpringerLink, Google Scholar, and arXiv; benchmark/dataset sources (Fine-Grained Visual Categorization (FGVC)/PlantCLEF and LifeCLEF papers/pages, Kaggle dataset cards) were also consulted. Queries combined domain and method terms (herbarium/plant specimen/botanical image \(\times \) classification/segmentation/detection/layout/image analysis \(\times \) deep learning/transformer/computer vision), and backward/forward snowballing retrieved additional items.

Inclusion required peer-reviewed articles or official benchmark descriptions with a concrete task on herbarium images or labels and reported metrics on public data or reproducible settings; short abstracts, non-archival/grey literature, and non-herbarium targets were excluded. For each study, task, dataset source (portal/competition), code/data availability, metrics, and key findings were recorded; when mirrors existed (e.g., portal vs. Kaggle export), the canonical source was cited to avoid duplication.

2 Data and datasets

Building on the challenges outlined in Sect. 1, this section characterises the raw material that underpins herbarium AI, namely digitised specimen sheets and the large-scale repositories that curate them.

The foundation of modern AI research in botany is the availability of large, well-curated digital datasets. The scale of herbarium collections is immense: the Index Herbariorum lists approximately 400 million curated plant specimens worldwide (Thiers 2025). Global digitisation efforts have made significant progress in bringing these collections online. As of recent estimates, more than 100 million specimens have been imaged at high resolution, driven by national initiatives such as the US National Herbarium, which alone has scanned over 4.9 million sheets (Soltis 2017; Smithsonian Institution 2022). Taxonomic names from these images can be programmatically validated against the continuously updated World Flora Online (WFO) backbone, curated by a global consortium of more than 40 botanical institutions (Borsch et al. 2020; World Flora Online Consortium 2025).

This section surveys public datasets and portals used for herbarium specimen analysis (Tables 2, 3), with an emphasis on widely used and well-documented resources. Sources included Papers with Code, GitHub, Zenodo, Kaggle, and official challenge pages (FGVC/PlantCLEF, LifeCLEF), using the same domain–method keywords as in the literature search. Major portals (Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio)) were queried, and collaborating botanists suggested additions and audited the list. Each entry records the name, year, task, size (images/labels), licence, and a canonical link.

Inclusion favours datasets that provide large-scale sheet imagery under clear licences and stable hosting and that are used by multiple studies or serve as official benchmarks. One well-documented but unreleased dataset is listed for context and clearly flagged. Private or unstable collections are excluded, as are sources consisting only of model-generated outputs (e.g., masks or auto-classified images) that are unsuitable for training. Counts reflect the latest snapshot and are harmonised to consistent units (thousands or millions); where mirrors exist (e.g., portal vs. Kaggle export), the canonical host is cited to avoid duplication.

2.1 Herbarium specimen data

From a data science perspective, a digitised herbarium sheet is a multimodal object that combines high-resolution imagery with structured text metadata. This dual modality underpins the multimodal AI methods reviewed later in Sect. 3.3 and the downstream pipelines in Sect. 4.

  • Image modality. Each sheet is scanned at \({300}\,{\textrm{dpi}}\) to \({600}\,{\textrm{dpi}}\), capturing morphology such as leaf venation, floral parts and fruit details (Simpson 2019; Stuessy 2009).

  • Text modality. One or more labels record the scientific name, collection locality and date, and collector’s name. In addition, collectors often include field notes documenting traits not preserved in a dried specimen, such as plant size, flower colour, and scent. These textual data map to the Darwin Core vocabulary, enabling standardized sharing across institutions (Wieczorek et al. 2012).
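The text modality's mapping onto Darwin Core can be sketched in a few lines. The dwc: terms below (scientificName, recordedBy, eventDate, locality) are standard Darwin Core; the local field names and label values are invented for illustration:

```python
# Illustrative sketch: renaming parsed label fields to Darwin Core terms.
# The dwc: terms are standard Darwin Core; the example values are invented.
DWC_MAP = {
    "name": "dwc:scientificName",
    "collector": "dwc:recordedBy",
    "date": "dwc:eventDate",
    "place": "dwc:locality",
}

def to_darwin_core(label_fields):
    """Rename locally parsed label fields to Darwin Core terms."""
    return {DWC_MAP[k]: v for k, v in label_fields.items() if k in DWC_MAP}

record = to_darwin_core({
    "name": "Ficus carica L.",
    "collector": "J. Smith",
    "date": "1903-05-14",
    "place": "Edinburgh, Scotland",
})
print(record["dwc:scientificName"])  # -> Ficus carica L.
```

In practice this renaming step sits downstream of OCR and named-entity extraction, which must first segment and transcribe the handwritten or typed label.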

Modern digitisation workflows are complex, multi-stage processes designed to convert physical specimens into Findable, Accessible, Interoperable, Reusable (FAIR) digital assets (Davies et al. 2023). The dissemination of these assets is now largely standardised through the International Image Interoperability Framework (IIIF), which exposes a uniform API for serving high-resolution sheet images (IIIF Community 2025).

2.1.1 Digitisation milestones

Digitisation began in the late 1990s when the New York Botanical Garden scanned its type collection at \({600}\,{\hbox {dpi}}\). The African Plants Initiative, launched in 2004 to digitise type specimens across tropical herbaria, later evolved into the Global Plants Initiative. Today, nearly three million high-resolution type images are available via JSTOR Plants (Royal Botanic Gardens, Kew 2015; JSTOR 2015). Major initiatives, such as the launch of the GBIF (2001) (GBIF Secretariat 2025), the establishment of iDigBio as the U.S. hub (2010) (iDigBio Consortium 2025), widespread IIIF adoption after 2015, and the European DiSSCo “Digital Specimen” programme (Hardisty et al. 2020), have collectively made tens of millions of sheets programmatically accessible, setting the stage for AI-driven analysis.

2.1.2 Specimen preparation and scanning protocols

The entire process follows a standardised workflow that encompasses pre-digitisation curation (e.g., specimen mounting and repair), barcoding to create a unique digital identifier for each sheet, and high-quality image capture under controlled lighting conditions (Davies et al. 2023). These standard practices ensure that sheets are mounted on archival card and scanned under fixed illumination to produce consistent digital surrogates (Soltis 2017).

Standard practice includes: (i) ISO Q-14 colour calibration strip; (ii) metric scale bar; (iii) \({300}\,{\textrm{dpi}}\) to \({600}\,{\textrm{dpi}}\) optical resolution; and (iv) lossless TIFF masters with embedded Extensible Metadata Platform (XMP) metadata. Artefacts (overlapping fragments, folded organs, faded pigments) pose challenges later addressed by restoration and segmentation networks (Sect. 4).

2.1.3 Metadata granularity and standards

Table 1 maps common label abbreviations to their Darwin-Core terms, guiding the Natural Language Processing (NLP) pipelines in Sect. 4. Institutions commonly publish records via an Integrated Publishing Toolkit (IPT) server, packaging images and CSV metadata into a DwC-Archive that can be harvested by aggregators such as GBIF.

Table 1 Typical label phrases and corresponding Darwin Core terms

2.1.4 Empirical limits of 2D imagery

While high-resolution scans capture many macromorphological traits, recent work has measured what is lost when physical examination is replaced by images. In a virtual taxonomic account of Madhuca based solely on online specimen images, Phang et al. (2022) reported that fewer than half of the required diagnostic characters could be measured or described from images alone; micromorphological and tactile characters remained inaccessible even at the highest available resolutions, and fewer than half of the images could be confidently assigned to species, necessitating verification on physical specimens with a microscope. A continental-scale comparison between iNaturalist photographs and digitised specimens further demonstrated that herbaria provide disproportionately higher taxonomic and functional diversity, especially for rare taxa (Eckert et al. 2024). Similarly, an automated leaf-mass-per-area pipeline found error rates to rise sharply below three-megapixel resolution or in the absence of physical scale bars (Vasconcelos et al. 2025). These empirical findings motivate the high-dpi and multimodal approaches reviewed in Sects. 4.1 and 3.3.

2.2 Herbarium image datasets

For machine learning practitioners, these resources can be broadly categorized into two types: curated, purpose-built datasets designed for model training and evaluation, and large-scale aggregators exposed via portals and APIs (covered in Sect. 2.3). This section focuses on curated datasets assembled with clean labels and predefined splits, enabling rigorous and reproducible benchmarking across studies. Representative benchmarks and thematic sets are summarised in Table 2, including their task scope, size, label structure, licence, and canonical links.

Table 2 Datasets for herbarium specimen image classification

2.3 Major online data portals

Table 3 Representative online herbarium portals for specimen data access

Portals are live indices that expose millions of imaged specimen records via APIs. They excel at breadth, often listing orders of magnitude more sheets than any single benchmark, but typically require additional scripting, quality filtering, and licence checking before images can be used for machine learning. Global aggregators such as GBIF and iDigBio index well more than one hundred million imaged records, while institutional portals including the New York Botanical Garden Virtual Herbarium (New York Botanical Garden 2025), the Muséum national d’Histoire naturelle (Paris) virtual herbarium (Le Bras et al. 2017), and the Chinese Virtual Herbarium (CVH) (Chinese Virtual Herbarium 2024) provide regionally focused access with specialised metadata. Table 3 summarizes the largest portals, their licence models, and current image counts.

Public herbarium datasets remain geographically biased toward temperate zones and can contain historical misidentifications (Hussein et al. 2022). Extreme class imbalance, heterogeneous scanning protocols, and geo-temporal gaps present ongoing challenges. However, these same limitations offer rich opportunities to explore hierarchical classification, few-shot learning, multimodal fusion, and robust domain transfer, which represent critical frontiers in herbarium AI research.

2.4 Exploratory data analysis

This section analyses the official training splits of five public datasets widely used for herbarium specimen classification: Herbarium 2019 (Kaggle FGVC6 Team 2019), Herbarium 2020 (FGVC7 Kaggle Team 2020), Herbarium 2021 (Half-Earth) (de Lutio et al. 2021), Herbarium 2022 (NAFlora-1M) (Park et al. 2024), and PlantCLEF 2020 (herbarium subset only) (LifeCLEF–INRIA 2020). These datasets are among the largest public collections of herbarium specimen images and provide official metadata and training/test splits, enabling reproducible analysis. They differ in curation policy and taxonomic scope, leading to complementary patterns in class imbalance and resolution. All five have served as the basis for recent benchmarks and shared tasks. PlantCLEF 2022 (Goëau et al. 2022) is excluded because its release includes a substantial number of field photographs; this subsection focuses exclusively on herbarium specimen images.

The analysis highlights two aspects: class imbalance and image resolution, both within and across datasets. These results provide an important context for the evaluation metrics and model design choices discussed in Sect. 3.

2.4.1 Class imbalance

A “class” refers to a species or taxon label in the official metadata. Duplicate entries are removed using a strict (image_id, taxon) key, and class frequency is computed as the number of unique images per class.
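As a minimal sketch, the deduplication and frequency computation can be expressed as follows; the record format and toy values are illustrative, not the actual metadata schema:

```python
from collections import Counter

def class_frequencies(records):
    """Count unique images per class after removing duplicate
    (image_id, taxon) pairs. Records are (image_id, taxon) tuples."""
    unique = set(records)                     # strict dedup key
    return Counter(taxon for _, taxon in unique)

# Toy records with one duplicate entry for img1.
recs = [("img1", "sp_a"), ("img1", "sp_a"), ("img2", "sp_a"), ("img3", "sp_b")]
print(class_frequencies(recs))  # sp_a -> 2, sp_b -> 1
```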

Fig. 3

Per-dataset histograms of images per class (y-axis = %classes; x capped at the 99th percentile for each dataset)

Figure 3 shows per-dataset histograms of image counts per class, where the y-axis indicates the percentage of classes and the x-axis is truncated at the 99th percentile to avoid distortion. Within each dataset, the severity of imbalance varies. Herbarium 2019 (Kaggle FGVC6 Team 2019), Herbarium 2020 (FGVC7 Kaggle Team 2020), Herbarium 2021 (de Lutio et al. 2021), and PlantCLEF 2020 (LifeCLEF–INRIA 2020) exhibit broad distributions with long right tails: most taxa are rare, while some dominant species appear in large numbers. By contrast, Herbarium 2022 (Park et al. 2024) concentrates near its training cap: the official split is approximately 80%/20% (train/test), the training set caps examples at 80 per species, and per-taxon counts range roughly from 7 to 100. This yields a compact, right-capped distribution that suppresses extreme heads and shortens the tail.

A cross-dataset summary is provided in Fig. 4. Here, head 10% refers to the top 10% of classes by frequency after sorting, and tail 50% refers to the bottom half. Coverage denotes the proportion of all images that fall into these subsets. The Gini coefficient measures inequality in class distribution, ranging from 0 (perfectly uniform) to 1 (maximally skewed).

The number of images and classes varies by orders of magnitude across datasets, and a logarithmic scale is used to visualise these differences without letting the largest corpus dominate. In terms of head/tail coverage, PlantCLEF 2020 and Herbarium 2021 allocate most images to a small number of high-frequency classes, leading to high head coverage and low tail coverage. In contrast, Herbarium 2022 distributes more images to mid-ranked and rare classes, consistent with its per-class cap.

This pattern is also reflected in the Gini coefficient: datasets with strong head dominance (e.g., Herbarium 2020 and Herbarium 2021) show Gini values approaching 0.9, indicating high imbalance. Herbarium 2022, by comparison, exhibits a much lower Gini due to its enforced per-class limits. These metrics together provide a comprehensive view of class imbalance and scale across benchmark datasets.
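The head/tail coverage and Gini statistics discussed above can be computed directly from raw class counts. The sketch below uses an invented toy distribution rather than the actual benchmark metadata:

```python
def gini(counts):
    """Gini coefficient of class counts (0 = perfectly uniform, -> 1 = skewed)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

def head_tail_coverage(counts, head_frac=0.10, tail_frac=0.50):
    """Share of all images held by the top head_frac and bottom tail_frac
    of classes, after sorting classes by frequency."""
    xs = sorted(counts, reverse=True)
    total = sum(xs)
    head_k = max(1, int(len(xs) * head_frac))
    tail_k = max(1, int(len(xs) * tail_frac))
    return sum(xs[:head_k]) / total, sum(xs[-tail_k:]) / total

# Toy long-tailed distribution: one dominant class, nine rare ones.
counts = [500] + [5] * 9
head, tail = head_tail_coverage(counts)
print(round(gini(counts), 3), round(head, 3), round(tail, 3))
```

For this toy distribution the single head class holds over 90% of the images while the bottom half of classes holds under 5%, and the Gini coefficient exceeds 0.8, mirroring the pattern seen in Herbarium 2020 and 2021.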

Fig. 4

Inter-dataset comparison: (left) numbers of images and classes (log scale); (middle) percentage of images covered by the head 10% and tail 50% of classes; (right) Gini coefficient of class counts

These observations motivate our evaluation strategy in Sect. 3: where available, class-balanced splits are used; macro-averaged metrics (e.g., macro-\(\textrm{F}_{1}\)) are prioritized over micro-averaged metrics; and performance is stratified by class frequency to assess robustness across the long tail.

2.4.2 Image resolution

Long-side resolution varies across datasets (Fig. 5). The plot includes a violin density, a P10-P90 box, and 1% jittered points. Herbarium 2019 (Kaggle FGVC6 Team 2019) shows wide variation due to scanner differences, while PlantCLEF 2020 (LifeCLEF–INRIA 2020) centres around \({1,000}\,{\textrm{px}}\) owing to normalization. Herbarium 2020 (FGVC7 Kaggle Team 2020), 2021 (de Lutio et al. 2021), and 2022 (Park et al. 2024) similarly apply a cap of \(\le {1,000}\,{\textrm{px}}\), but per-image pixel sizes for Herbarium 2022 are not released.

Fig. 5

Long-side pixel resolution. Violin shows density; box spans P10-P90; jittered dots (1%) indicate concentration. Herbarium 2019 is variable; others are normalised near 1,000 px

Resolution normalization offers consistent preprocessing, improved efficiency, and cross-study comparability. However, it can remove fine-grained traits (e.g., venation, trichomes). Native-resolution images retain morphological detail but introduce scale variance and memory overhead, often requiring multi-scale or tiling strategies. When transferring across datasets with different resolution policies, mismatches may constitute domain shifts; mitigation techniques include scale-aware augmentation and hierarchical feature architectures.
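The long-side cap reduces to simple arithmetic. The following sketch computes target dimensions under an assumed 1,000 px policy, preserving aspect ratio:

```python
def cap_long_side(width, height, cap=1000):
    """Target (width, height) after capping the long side at `cap` px,
    preserving aspect ratio; images already within the cap are untouched."""
    long_side = max(width, height)
    if long_side <= cap:
        return width, height
    scale = cap / long_side
    return round(width * scale), round(height * scale)

print(cap_long_side(4000, 6000))  # -> (667, 1000)
print(cap_long_side(800, 600))    # -> (800, 600)
```

Note that downscaling a 6,000 px scan to 1,000 px discards roughly 97% of the pixels, which is exactly why fine traits such as venation and trichomes can be lost.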

3 Herbarium image classification

The global digitisation effort has made more than 100 million specimen images available online, yet automatically identifying species from this material confronts three significant challenges. First is the ultra-fine granularity required; sister taxa may differ only by a subtle feature such as leaf-venation patterns or the indumentum on the corolla, demanding models with high discriminative power, a challenge central to the field of fine-grained image categorization (Zheng et al. 2019). Second is the extreme long-tail distribution of species; in recent benchmarks, it is common for more than 60% of species to be represented by five or fewer training images, posing a significant challenge for data-driven models (de Lutio et al. 2021). Third is the presence of heterogeneous artefacts on the sheets, such as mounting labels, colour bars, and faded pigments, which can act as noise and mislead classifiers.

Consequently, modern AI pipelines for herbarium analysis extend far beyond standard image classifiers. They incorporate modules that segment plant tissue, calibrate colours, read label text, fuse geo-metadata, adapt across different domains (e.g., from scans to field photos), and actively query experts for difficult cases. This section traces the evolution of these classification techniques, from early handcrafted features to the sophisticated, multimodal Transformer architectures in use today.

Evaluation metrics for classification

Macro-\({\textrm{F}_{1}}\). The macro-\(\textrm{F}_{1}\) score computes the \(\textrm{F}_{1}\) score (harmonic mean of precision and recall) for each class independently and averages them without weighting. This metric penalises models that perform well on dominant classes but poorly on rare ones, making it well-suited for biodiversity datasets.

Its exact computation for a set of C classes is given by first calculating \(\textrm{Precision}_c\) and \(\textrm{Recall}_c\) for each class c, then \(\hbox {F1}_c\), and finally averaging:

$$\begin{aligned} & \textrm{F}_{1,c} = 2 \cdot \frac{\textrm{Precision}_c \cdot \textrm{Recall}_c}{\textrm{Precision}_c + \textrm{Recall}_c}, \end{aligned}$$
(1)
$$\begin{aligned} & \textrm{F}_{1}^{\text {macro}}= \frac{1}{C} \sum _{c=1}^{C} \textrm{F}_{1,c}. \end{aligned}$$
(2)

This approach gives equal weight to each class, regardless of its size (van Rijsbergen 1979).
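A direct implementation of Eqs. (1)-(2) makes the metric's behaviour concrete. In the toy example below (invented labels), a classifier that ignores the rare class still scores 80% top-1 accuracy but only about 0.44 macro-\(\textrm{F}_{1}\):

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: per-class F1 from one-vs-rest counts,
    then an unweighted mean over classes, matching Eqs. (1)-(2)."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# A dominant class predicted perfectly, a rare class missed entirely:
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
print(macro_f1(y_true, y_pred, ["a", "b"]))
```

The zero \(\textrm{F}_{1}\) on the missed rare class halves the macro average, which is precisely the penalty that makes the metric suitable for long-tailed biodiversity data.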

Top-1 accuracy. Top-1 accuracy measures the proportion of samples where the predicted label matches the ground truth:

$$\begin{aligned} \mathrm {Top\text {-}1} = \frac{1}{N} \sum _{i=1}^{N} \textbf{1}\!\left[ \arg \max _{y}\, \hat{p}(y \mid x_i) = y_i \right] , \end{aligned}$$
(3)

where \(\hat{p}(y \mid x_i)\) is the model’s predicted probability distribution for sample \(x_i\), \(y_i\) is the true label, and \(\textbf{1}[\cdot ]\) is the indicator function.
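Eq. (3) can likewise be sketched in a few lines; the probability rows below are invented for illustration:

```python
def top1_accuracy(prob_rows, y_true):
    """Eq. (3): fraction of samples whose argmax prediction equals the label.
    prob_rows[i][y] is the predicted probability of class y for sample i."""
    correct = 0
    for probs, y in zip(prob_rows, y_true):
        pred = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
        correct += int(pred == y)
    return correct / len(y_true)

probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
print(top1_accuracy(probs, [0, 2, 0]))  # -> 2 of 3 correct
```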

As shown in Fig. 6, reported accuracy on major herbarium classification benchmarks has risen steadily as models have evolved from classical pipelines through CNNs to transformer architectures.

Fig. 6

Performance comparison across major herbarium classification models

3.1 Classical and CNN approaches

The journey from manual feature extraction to end-to-end deep learning mirrors the broader history of computer vision, but with unique adaptations driven by the specific challenges of herbarium data.

3.1.1 Classical models with handcrafted features

Early attempts at automated specimen identification relied on hand-engineered descriptors. Researchers would extract pre-defined shape, venation, texture, or colour cues from scans and feed them into conventional classifiers like Support Vector Machines (SVMs). Although predating deep neural networks, such pipelines are AI systems: the representation is handcrafted while the decision function is learned from data.

Studies using shape cues such as SIFT (Scale-Invariant Feature Transform) or HOG (Histogram of Oriented Gradients) seldom generalised beyond small taxonomic scopes: Clark et al. (2012) reported only 43.7% Top-1 accuracy when separating four Tilia species with SVMs. On tropical wood cross-sections, Mata-Montero and Carranza-Rojas (2016) achieved 60.2% genus-level accuracy across 24 species using handcrafted texture filters, far below modern CNN baselines. Recent evaluations on the challenging Piperaceae family confirm the gap: deep features extracted by ViT or VGG16 exceed 80% macro-\(\textrm{F}_{1}\), whereas traditional descriptors such as LBP and SURF remain below 30% (Kajihara et al. 2025). These results underline the brittleness of rule-based features in the face of extreme morphological diversity and varied specimen preparation styles.

Using 54 herbarium leaf images from three morphologically similar Ficus species, a pipeline combining morphological shape descriptors, Hu moment invariants, texture features, and HOG with ANN and SVM classifiers achieved 83.3% accuracy; the ANN yielded a slightly higher Area Under the Curve (AUC) than the SVM (Kho et al. 2017).

3.1.2 CNN-based models

CNNs constitute the classical deep-learning backbone for image tasks (LeCun et al. 1998). A cascade of convolution, pooling and fully connected layers learns increasingly abstract visual features with inductive biases such as locality and translation equivariance. Most herbarium studies therefore adopt a transfer-learning strategy: a backbone pre-trained on a large natural-image corpus such as ImageNet (Deng et al. 2009) is fine-tuned on the target herbarium dataset, which shortens training time and mitigates label scarcity (Yosinski et al. 2014). Milestone architectures, such as AlexNet (Krizhevsky et al. 2012), VGG (Simonyan and Zisserman 2014), Inception (Szegedy et al. 2015) and ResNet (He et al. 2016), remain strong baselines. Dense connectivity patterns such as DenseNet propagate features more efficiently; DenseNet-BC surpasses ResNet on ImageNet with only about 7 million parameters (Huang et al. 2017).

Early botanical experiments focused on cropped leaves or venation textures. Grinblat et al. (2016) reported that a five-layer CNN achieved 97% Top-1 accuracy for three legume crops using only venation images. Lee et al. (2015) enlarged the scope to 44 species, reaching roughly 94% accuracy and showing that CNNs capture fine-grained venation patterns better than hand-engineered features. Parallel studies began constructing benchmark collections for deep learning on herbarium material (Unger et al. 2016; Grimm et al. 2016). In the 2018 ExpertLifeCLEF track, a ResNet and DenseNet ensemble delivered 77% Top-1 accuracy, rivalling human experts (Haupt et al. 2018).

Subsequent work scaled to uncropped, full-resolution sheets. Schuettpelz et al. (2017) first demonstrated feasibility on digitised ferns. Carranza-Rojas et al. (2017) trained Inception-v1 on the Herbarium1K corpus (about 253k images across 1,204 species) and achieved 70.3% Top-1 accuracy on the 1,000 most frequent taxa, doubling the performance of handcrafted descriptors. Training typically minimised the cross-entropy loss

$$\begin{aligned} \mathcal {L}_{\text {CE}} = - \sum _{i=1}^{C} q_i \log (p_i), \end{aligned}$$
(4)

where \(C\) is the number of classes, \(q_i\) the one-hot label and \(p_i\) the predicted probability.
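Eq. (4) for a single sample is straightforward to sketch; the probability vectors below are illustrative:

```python
import math

def cross_entropy(probs, one_hot):
    """Eq. (4): -sum_i q_i * log(p_i) for one sample.
    With a one-hot label this reduces to -log of the target probability."""
    return -sum(q * math.log(p) for q, p in zip(one_hot, probs) if q > 0)

# A confident correct prediction incurs far less loss than an uncertain one:
print(cross_entropy([0.9, 0.05, 0.05], [1, 0, 0]))  # -> about 0.105
print(cross_entropy([0.4, 0.3, 0.3], [1, 0, 0]))    # -> about 0.916
```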

The FGVC6 Herbarium 2019 challenge further pushed accuracy to 89.8% on 683 species with an ensemble of SE-ResNeXt and ResNet variants (Little et al. 2020). Researchers then combined Mask Region-based Convolutional Neural Network (Mask R-CNN) for organ localisation (Wei et al. 2018) and bilinear pooling for second-order texture statistics (Lin et al. 2015), achieving additional gains when augmented by attention modules (Yang et al. 2023). Recent evaluations confirm that deep features markedly outperform traditional descriptors, even in difficult groups such as Piperaceae (Kajihara et al. 2025).

3.1.3 Attention-enhanced CNN models

The release of massive, long-tailed benchmarks like Herbarium 2020 (1.17 M images) (FGVC7 Kaggle Team 2020) and Half-Earth 2021 (2.5 M images) (de Lutio et al. 2021) shifted the research focus. To handle the extreme class imbalance, development pivoted towards attention-augmented backbones (e.g., ResNeSt) and specialised metric-learning loss functions like ArcFace. Other strategies re-weight the loss function itself, such as the class-balanced loss proposed by Cui et al. (2019), which adjusts the contribution of each class based on its effective number of samples. These innovations yielded significant gains; on the Half-Earth dataset of 64,000 species, a TResNet backbone with ArcFace loss achieved 75.7% macro-\(\textrm{F}_{1}\), demonstrating the effectiveness of metric learning for long-tailed herbarium classification (de Lutio et al. 2021). Unlike cross-entropy, the ArcFace loss directly optimises the feature embedding space by introducing an additive angular margin m to enforce higher intra-class compactness and inter-class discrepancy. Its formulation is:

$$\begin{aligned} \mathcal {L}_{\text {ArcFace}} = -\frac{1}{N} \sum _{i=1}^{N} \log \frac{e^{s \cdot \cos (\theta _{y_i} + m)}}{e^{s \cdot \cos (\theta _{y_i} + m)} + \sum \limits _{\begin{array}{c} j=1 \\ j \ne y_i \end{array}}^{C} e^{s \cdot \cos (\theta _j)}}, \end{aligned}$$
(5)

where \(\theta _{y_i}\) is the angle between the deep feature and the target weight for class \(y_i\), s is the feature scale, and m is the additive angular margin penalty (Deng et al. 2019; de Lutio et al. 2021).

To mitigate the dominance of head classes, Cui et al. (2019) proposed the class-balanced loss, which rescales the standard cross-entropy by the effective number of samples:

$$\begin{aligned} \mathcal {L}_{\textrm{CB}} = -\frac{1 - \beta }{1 - \beta ^{n_y}} \log p_y, \end{aligned}$$
(6)

where \(n_y\) is the number of training images for class y, \(p_y\) is the predicted probability, and \(\beta \in (0,1)\) controls the re-weighting strength.
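The effect of Eq. (6) is easiest to see numerically; the sketch below, with an illustrative \(\beta = 0.999\), shows how the weight for a five-image species dwarfs that of a head class:

```python
def cb_weight(n_y, beta=0.999):
    """Class-balanced weight of Eq. (6): (1 - beta) / (1 - beta**n_y),
    the inverse of the 'effective number' of samples for class y."""
    return (1.0 - beta) / (1.0 - beta ** n_y)

w_rare = cb_weight(5)      # rare species: 5 training images
w_head = cb_weight(5000)   # well-sampled head species
# w_rare is roughly two orders of magnitude larger than w_head,
# so rare-species errors dominate the re-weighted loss.
```
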

This era also saw the rise of self-supervised methods, where representation learning on millions of unlabeled sheets was shown to improve rare-class recall (Walker et al. 2022).

In parallel, data augmentation remains a cornerstone strategy. Classic augmentations like rotation and flipping are standard, but studies have shown that more advanced techniques can yield further gains. Related tooling has also matured: Ott et al. (2020) developed the GinJinn pipeline, an open-source tool that uses object detection to automate the extraction of features, such as counting reproductive organs, from herbarium specimens.

Overall, CNN evolution in herbarium classification mirrors general computer vision, but its extreme taxonomic granularity and long-tailed class distributions have prompted early adoption of attention mechanisms, metric losses, and high-resolution inputs, paving the way for the Transformers, domain transfer, and multimodal systems discussed in the following sections.

3.2 ViT approaches

The latest generation of vision models, particularly ViTs, is rapidly gaining momentum in herbarium image classification.

ViTs break the convolution paradigm by treating an image as a sequence of patch tokens processed with self-attention (Vaswani et al. 2017; Dosovitskiy et al. 2021). This global context modelling is advantageous for whole-sheet herbarium images containing dispersed structures and labels.

Since Dosovitskiy et al. (2021) first adapted the Transformer architecture for vision, variants such as DeiT (Touvron et al. 2021) have reduced training cost via knowledge distillation. By replacing the local receptive fields of CNNs with a global self-attention mechanism, ViTs can model long-range spatial relationships across an entire herbarium sheet. The core of this mechanism is Scaled Dot-Product Attention, which allows every image patch to weigh its interaction with every other patch. The output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product similarity of the query with the corresponding key:

$$\begin{aligned} \text {Attention}(\textbf{Q},\textbf{K},\textbf{V}) = \text {softmax}\!\left( \frac{\textbf{Q}\textbf{K}^\top }{\sqrt{d_k}}\right) \textbf{V}, \end{aligned}$$
(7)

where \(\textbf{Q}\), \(\textbf{K}\) and \(\textbf{V}\) are the query, key and value matrices, and \(d_k\) is the key dimension.
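Eq. (7) can be written out directly; the sketch below operates on plain Python lists of patch-token vectors rather than batched tensors:

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention (Eq. 7) over plain Python lists.

    Q, K, V: lists of d_k-dimensional token vectors (one per patch)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output token = attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With identical keys the weights are uniform, so each output token is simply the mean of the value vectors, a useful sanity check.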

However, the standard ViT architecture, which downsamples inputs to a low resolution (e.g., 224\(\times \)224 px), often discards the fine venation and floral details crucial for taxonomic discrimination. Consequently, much of the recent research has focused on hybrid architectures and techniques that preserve high-resolution information.

3.2.1 CNN-transformer hybrid

To overcome the resolution limitations of early ViTs, several hybrid models that combine convolutional principles with Transformer backbones have been proposed. For instance, the Conviformer introduced a convolutional stem to process high-resolution inputs more efficiently before feeding features into a hierarchical self-attention body (Vaishnav et al. 2022). On the Herbarium 2021 public test split, a single Conviformer-B (448 px) scored 72.9% macro-\(\textrm{F}_{1}\), narrowly exceeding the SE-ResNeXt-101 reference (72.6%). For the subsequent Herbarium 2022 challenge, the same architecture reached 82.9% macro-\(\textrm{F}_{1}\) as a single model and 86.8% after a five-model ensemble, clarifying that the previously quoted “78%” had conflated different runs and metrics (Vaishnav et al. 2022). Other approaches focus on hierarchical partitioning to handle larger input sizes. Swin Transformers (Liu et al. 2021) and CSWin Transformers (Dong et al. 2022) use shifted or cross-shaped windows to compute self-attention locally, reducing complexity from quadratic to linear with respect to image size. This allows models to process inputs of 1024\(\times \)1024 px or larger on consumer-grade GPUs, retaining much more detail. Wang et al. (2024) introduced a method called “Cluster-Learngene,” which condenses ancestry-ViT attention heads into adaptive “learngenes” and transfers them to descendant models, trimming 24% training time on the Herbarium 2019 benchmark without sacrificing macro-\(\textrm{F}_{1}\).
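The quadratic-versus-linear distinction is simple arithmetic. The back-of-envelope sketch below, assuming a 16 px patch embedding and a 7×7 = 49-token window as in common Swin configurations, compares pairwise attention interactions at 224 px and 1024 px:

```python
def num_tokens(image_px, patch_px=16):
    """Number of patch tokens for a square image."""
    side = image_px // patch_px
    return side * side

def global_attention_pairs(n):
    """Full self-attention compares every token with every token: O(n^2)."""
    return n * n

def windowed_attention_pairs(n, window=49):
    """Window attention is quadratic only inside each fixed-size window,
    so the total number of interactions grows linearly in n."""
    return n * window

n_224 = num_tokens(224)    # 14 * 14 = 196 tokens
n_1024 = num_tokens(1024)  # 64 * 64 = 4096 tokens
# Going from 224 px to 1024 px multiplies global attention cost by
# (4096/196)^2 ≈ 437x, but windowed attention cost by only ≈ 21x.
```
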

Beyond backbone architectures, attention mechanisms are also being integrated into specific modules to enhance performance. For instance, Ariouat et al. (2024) demonstrated that incorporating an attention-gate mechanism into the YOLOv7 detector could significantly improve the model’s precision in the fine-grained task of plant-organ detection on herbarium sheets. The YOLO family (“You Only Look Once”) refers to one-stage object detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from each grid cell, without using a separate region proposal stage.

The impact of these advanced architectures is now evident in major competitions. In the 2024 NAFlora-1 M competition, the top-performing teams used large ensembles combining Swin V2-L, CSWin-B and DeiT-III, achieving a state-of-the-art 87.7% macro-\(\textrm{F}_{1}\) score across 15 500 species (Park et al. 2024). This result marked a significant milestone, as it was the first time Transformer-based ensembles decisively surpassed the best CNNs on a million-scale herbarium classification benchmark.

3.2.2 High-resolution transformers

The practical application of these large models has been accelerated by hardware-aware engineering. Memory-efficient attention mechanisms such as FlashAttention-2 (Dao et al. 2023) reduce the footprint of self-attention, allowing higher-resolution herbarium images, represented as longer patch sequences, to fit on a single 80 GB GPU (e.g., NVIDIA H100). In parallel, contexts of approximately 4,000 tokens have become routine for large language models, illustrating the scalability of this approach across modalities. When combined with efficient backbones like EfficientViT-M2 (Liu et al. 2023a), this allows real-time processing of high-resolution (\({600}\,{\textrm{dpi}}\)) scans. Beyond classification, Transformer backbones are elevating auxiliary pipeline tasks. In the Hespi pipeline, a Swin V2 detector localises label zones with an impressive 97.9% mean Average Precision (mAP) (Turnbull et al. 2024), while in PENet, CSWin blocks support fine-grained trait segmentation (Zhao et al. 2023).

In summary, while CNNs dominated early leaderboards due to their computational efficiency, the development of hierarchical, hybrid, and memory-efficient ViTs has firmly established them as the new state of the art in herbarium image analysis, particularly as datasets and hardware capabilities continue to scale (Table 4).

Table 4 Model performance in herbarium specimen classification (2017–2024)

3.3 Multimodal models

Expert taxonomists rarely identify a specimen from visual morphology alone; metadata such as collection locality, collector, and date provide strong contextual priors that can resolve ambiguity between visually similar species. Recent progress in reliable text-metadata transcription (Sect. 4) and in multimodal Transformer learning, following early cross-attention models such as LXMERT (Tan and Bansal 2019), has enabled models to incorporate this auxiliary information seamlessly. Building on the two-tower contrastive paradigm popularised by Contrastive Language Image Pre-training (CLIP) and its bioscience variants (e.g., BioCLIP), a herbarium-specific framework encodes the sheet image with a ViT/ResNet tower and the concatenated label text plus a [META] token with a BERT tower (Bidirectional Encoder Representations from Transformers, a widely used text encoder pretrained on large text corpora); the two embedding spaces are aligned with an InfoNCE loss (Fig. 7). The resulting image embeddings can then be used for zero-shot or few-shot species retrieval, effectively mimicking the holistic reasoning process of human experts while remaining fully differentiable.

Fig. 7

Two-tower contrastive framework. A scaled-down herbarium sheet serves as visual input (left), while the right tower consumes N label texts augmented with a [META] token encoding collection year, latitude, and collector ID

3.3.1 Integrating metadata with visual features

Early and effective approaches involve simply concatenating structured metadata with visual features learned by a CNN. For example, Guralnick et al. (2024) demonstrated that appending the collection year and herbarium code as simple features to a ResNet’s image embedding resulted in a six-percentage-point recall gain on rare taxa. This highlights that even minimal, easily parsable metadata can provide a significant performance boost. The increasing availability of datasets with validated, structured metadata, such as the NA Phenology dataset, is lowering the barrier for this type of multimodal research (Park et al. 2023).

More sophisticated methods leverage vision-language pre-training. A recent workshop paper fine-tuned CLIP on a multi-million-image herbarium corpus and reported substantial gains in zero-shot retrieval on PlantNet-300K (Sahraoui et al. 2023), although exact numbers have not yet been released. The result nevertheless highlights the potential of pairing herbarium sheets with their label text to “teach” a model botanical language.

3.3.2 Phenology as a multimodal task

Label dates are a particularly powerful form of metadata that unlocks the study of phenology (the timing of seasonal biological events). Lorieul et al. (2019) constructed a benchmark with phenological stage labels and showed that models using both image and date information outperformed image-only models by four percentage points in \(\textrm{F}_{1}\) score. Pearson et al. (2020a) took this a step further by integrating gridded climate data corresponding to the collection date and location, achieving an impressive 0.82 AUC in predicting flowering time across 25,000 species.

More recently, Ahlstrand et al. (2025) provide a comprehensive phenology-focused survey, highlighting two main themes: (1) leveraging digital specimens to study temporal trends across centuries and (2) testing long-standing ecological hypotheses at continental scales. They review sampling biases, metadata reliability, and emerging ethical considerations, and they forecast that ongoing herbarium digitisation, coupled with AI-driven trait extraction, will unlock unprecedented insights into plant responses to climate change.

3.3.3 Quality control via multimodal models

Multimodal models serve a critical role in data quality control. By flagging records where the species identification from the image conflicts with the geographic or temporal context derived from the label, these systems can help address the surprisingly high rates of taxonomic misidentification in collections, which can exceed 10% in certain lineages (Goodwin et al. 2015).

Multimodal pipelines excel when visual morphology alone is insufficient for a confident identification. While challenges in handwriting recognition and missing GPS data remain, the fusion of image and metadata within Transformer frameworks is poised to become standard practice, moving models beyond simple pattern recognition towards a more contextual, expert-like understanding.

3.4 Transfer learning in cross-domain

A growing body of research addresses the challenge of cross-domain learning, particularly between herbarium images and field photographs of living plants. These two sources of data present a substantial domain gap: herbarium sheets offer a standardized, flat, and often taxonomically verified view, whereas field photos capture plants in situ with cluttered backgrounds, variable illumination, and natural perspective distortion. Leveraging both domains is highly attractive, as herbaria provide expert-verified labels at a massive scale, while field images from platforms like iNaturalist (Van Horn et al. 2018) or Pl@ntNet (Garcin et al. 2021) are crucial for building real-world plant identification applications. The research in this area explores various methods to transfer knowledge from the well-labeled herbarium domain to the less controlled “in-the-wild” domain, often evaluating performance using not only classification accuracy but also retrieval metrics such as Mean Reciprocal Rank (MRR):

$$\begin{aligned} \textrm{MRR} = \frac{1}{N}\sum _{i=1}^{N}\frac{1}{r_i} \text {,} \end{aligned}$$
(8)

where \(r_i\) is the rank position of the first relevant item for query i.
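Eq. (8) translates directly into code; a minimal sketch:

```python
def mean_reciprocal_rank(ranks):
    """MRR (Eq. 8): ranks[i] is the 1-based rank of the first
    relevant retrieval for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Three queries whose first relevant hits appear at ranks 1, 2 and 4:
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 1/2 + 1/4) / 3 ≈ 0.583
```
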

3.4.1 Unsupervised domain adaptation (UDA)

Early approaches often relied on UDA techniques that attempt to align the feature distributions of the two domains without requiring paired images. For example, Wu et al. (2023a) employed a cycle-consistent adversarial network for unsupervised domain adaptation, adapting a ResNet-50 model trained on laboratory images to field conditions for plant disease recognition. Their method achieved a final classification accuracy of 94.5% on the target domain, a significant improvement over the baseline. Other work has focused on more sophisticated loss functions. Chulif and Chang (2021) proposed a two-stream Herbarium-Field Triplet-Loss (HFTL) network that jointly minimises embedding distances for same-species herbarium-field pairs while maximising those for different species. Their NEUON submission to PlantCLEF 2021 achieved an MRR of 0.181 on the full test set and 0.158 on the long-tail subset. Building on this, Chulif et al. (2023) integrated cross-attention ViTs into the HFTL pipeline, reaching an MRR of 0.158 on a difficult subset of the PlantCLEF 2021 challenge, a notable achievement in a cross-domain retrieval scenario.

3.4.2 Self-supervised cross-domain transfer

More recent and powerful methods have shifted towards learning universal feature representations from large, unlabeled datasets. The temperature-scaled InfoNCE loss, which forms the foundation of contrastive representation learning, is formulated as:

$$\begin{aligned} \mathcal {L}_i = - \log \frac{\exp (\textbf{z}_i \cdot \textbf{z}_{i,+} / \tau )}{\sum \limits _{j=0}^{K} \exp (\textbf{z}_i \cdot \textbf{z}_{i,j} / \tau )}, \end{aligned}$$
(9)

where \(\textbf{z}_i\) denotes the embedding of the anchor image, \(\textbf{z}_{i,+}\) is its positive counterpart (e.g., another augmented view of the same specimen), and \(\textbf{z}_{i,j}\) represent K negative samples drawn from the batch. The temperature parameter \(\tau \) controls the sharpness of the softmax distribution. This objective encourages the model to maximise similarity with the positive pair while minimizing similarity with the negatives (van den Oord et al. 2018).
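A single-anchor sketch of Eq. (9) in plain Python, using unnormalised dot-product similarities for brevity (practical implementations usually L2-normalise the embeddings first):

```python
import math

def dot(a, b):
    """Dot-product similarity between two embedding vectors."""
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives, tau=0.07):
    """Temperature-scaled InfoNCE loss (Eq. 9) for one anchor.

    The positive occupies slot 0; the K negatives fill the rest."""
    sims = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    logits = [s / tau for s in sims]
    m = max(logits)  # stabilise the softmax
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))
```

When the positive is much more similar than every negative the loss approaches zero; when all similarities tie, it equals log(K + 1).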

Walker et al. (2022) demonstrated that contrastive representation learning on approximately 4.3 million unlabeled herbarium images yields features that transfer effectively to field photos without any explicit supervision; a lightweight classifier trained on these features increased Top-1 accuracy on iNaturalist leaf taxa by 6 percentage points.

Large, self-supervised ViTs serve as even more powerful bridges between domains. In a state-of-the-art example, Gustineli et al. (2024) fine-tuned a DINOv2-B model on the PlantCLEF 2024 herbarium dataset. Without using any labeled field photos for training, this model achieved a 23.0% macro-\(\textrm{F}_{1}\) score on the cross-domain, multi-label classification task, more than doubling the performance of the best ResNet-based baseline and showcasing the remarkable transferability of self-supervised features.

Table 5 Representative cross-domain and transfer-learning studies (2020–2024)

3.4.3 Generative domain translation

Another approach involves using generative models to bridge the visual gap. Generative adversarial networks (GANs), particularly those focused on style transfer, can be used to “translate” images from a source domain (e.g., lab) into a target domain (e.g., field) to serve as data augmentation. For instance, Xu et al. (2021) demonstrated this by using StarGAN v2 to translate clean, lab-based images of plant leaves into more realistic field-style images. They reported that using these synthetically generated photos for augmentation improved the final classification accuracy by 4.64 percentage points.

In summary, cross-domain research is rapidly moving from earlier adversarial alignment techniques towards leveraging large-scale, self-supervised foundation models. As these models become more powerful and are pre-trained on ever-larger and more diverse multimodal datasets, the distinction between “herbarium” and “field” features may blur, leading to more robust and universal models for plant identification. Representative studies are summarised in Table 5.

4 Complementary vision tasks

Beyond species classification, AI offers a range of methods that accelerate the entire herbarium digitisation workflow, from isolating the specimen on a sheet to extracting structured data from its label. This section reviews the AI techniques applied to the auxiliary yet critical tasks of segmentation, transcription, and semantic data extraction. These tasks transform a simple image into a rich, machine-readable data record. We examine the evolution of models for each task and discuss how integrated systems and human-in-the-loop (HITL) workflows combine these components into a cohesive, high-throughput pipeline.

4.1 Specimen image segmentation

The quality of image segmentation imposes an upper bound on the performance of the entire digitisation pipeline. By accurately separating plant material, labels, and other components from the sheet background, all subsequent analyses can operate on clean, targeted inputs. This front-loading of quality control at the pixel level is crucial for reducing the propagation of errors. Classical definitions and taxonomy follow standard texts in digital image processing (Gonzalez and Woods 2018).

Segmentation evaluation metrics

Intersection over union (IoU). IoU measures the overlap between the predicted segmentation mask (\(M_{\text {pred}}\)) and the ground-truth mask (\(M_{\text {gt}}\)):

$$\begin{aligned} \text {IoU} = \frac{|M_{\text {pred}} \cap M_{\text {gt}}|}{|M_{\text {pred}} \cup M_{\text {gt}}|}. \end{aligned}$$
(10)

It ranges from 0 (no overlap) to 1 (perfect overlap) and is the primary segmentation metric, widely used in challenges such as PASCAL VOC (Everingham et al. 2010).

Dice score. The Dice score emphasizes small objects and is monotonically related to IoU:

$$\begin{aligned} \textrm{Dice} = \frac{2\,|A\cap B|}{|A| + |B|} = \frac{2\,\textrm{IoU}}{1 + \textrm{IoU}}, \end{aligned}$$
(11)

with A and B denoting the predicted and ground-truth pixel sets, respectively (Dice 1945).
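Both metrics are straightforward to compute once masks are represented as pixel sets; a minimal sketch that also exhibits the monotone relation in Eq. (11):

```python
def iou(pred, gt):
    """IoU (Eq. 10): masks given as sets of (row, col) pixel coordinates."""
    return len(pred & gt) / len(pred | gt)

def dice(pred, gt):
    """Dice score (Eq. 11); equal to 2*IoU / (1 + IoU)."""
    return 2 * len(pred & gt) / (len(pred) + len(gt))

# Toy 2x2 example: ground-truth plant mask vs predicted mask.
plant = {(0, 0), (0, 1), (1, 1)}
predicted = {(0, 1), (1, 1), (1, 0)}
# Two shared pixels out of four in the union: IoU = 0.5, Dice = 2/3.
```
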

Average precision (AP) and mAP. For class c, the average precision is the area under its precision-recall curve:

$$\begin{aligned} \textrm{AP}_c = \int _{0}^{1} P_c(R)\, \textrm{d}R \text {,} \end{aligned}$$
(12)

and the mean Average Precision averages over classes:

$$\begin{aligned} \textrm{mAP} = \frac{1}{C}\sum _{c=1}^{C} \textrm{AP}_c \text {.} \end{aligned}$$
(13)

In instance segmentation and object detection, AP is typically computed at a given IoU threshold (e.g., \(\textrm{AP}@0.50\)). The COCO-style mAP further averages AP across multiple IoU thresholds, typically from 0.50 to 0.95 in 0.05 increments.
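A minimal sketch of Eqs. (12) and (13), approximating the integral with the trapezoidal rule over sampled precision-recall points (COCO implementations instead use interpolated precision at fixed recall levels):

```python
def average_precision(precisions, recalls):
    """AP_c (Eq. 12): trapezoidal area under the precision-recall curve.

    precisions[i] is the precision observed at recall level recalls[i],
    with recalls sorted in increasing order."""
    ap = 0.0
    for i in range(1, len(recalls)):
        ap += (recalls[i] - recalls[i - 1]) * \
              (precisions[i] + precisions[i - 1]) / 2.0
    return ap

def mean_ap(ap_per_class):
    """mAP (Eq. 13): unweighted mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

A perfect detector holds precision 1.0 across all recall levels and therefore scores AP = 1.0.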

4.1.1 Specimen-background segmentation

Modern herbarium segmentation is dominated by encoder-decoder deep networks. For the foundational task of separating the plant from its background, architectures from the U-Net family have become the de facto standard, leveraging an encoder-decoder structure with skip connections to combine high-level semantic context with fine-grained spatial information (Ronneberger et al. 2015). For example, White et al. (2020) first demonstrated that a U-Net could achieve a 0.95 mean IoU on the large Herbarium-120k dataset, and its effectiveness as a preprocessing step has been validated in numerous subsequent pipelines (Kajihara et al. 2025). A multi-collection study by Milleville et al. (2023) confirmed this, reporting a similar IoU of 0.951 with a UNet++ model and further demonstrating that a hybrid cascade (UNet++ for the plant mask followed by YOLOv8 for accessory objects) maintained plant IoU while boosting non-plant artifact precision to 98.5%. Other deep learning approaches have also been proposed, such as the VGG-inspired network by Triki et al. (2022a) for segmenting both leaves and other artifacts on the sheet. These results underscore that precise whole-sheet segmentation of plant pixels can be achieved with high accuracy under controlled conditions.

To cope with the extreme foreground-background imbalance that typically occurs in dense object detection on herbarium sheets, Lin et al. (2017) introduced the focal loss:

$$\begin{aligned} \mathcal {L}_{\text {Focal}} = -(1 - p_t)^{\gamma }\,\log p_t, \end{aligned}$$
(14)

where \(p_t\) denotes the predicted probability of the ground-truth class and \(\gamma >0\) is a focusing parameter; the modulating factor \((1 - p_t)^{\gamma }\) down-weights well-classified examples so that the model focuses on hard, informative instances.
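Eq. (14) in code, showing how the modulating factor suppresses the loss of well-classified examples:

```python
import math

def focal_loss(p_t, gamma=2.0):
    """Focal loss (Eq. 14): (1 - p_t)^gamma scales the cross-entropy,
    down-weighting examples the model already classifies well."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# An easy example (p_t = 0.9) with gamma = 2 contributes only
# (0.1)^2 = 1% of its plain cross-entropy loss.
easy = focal_loss(0.9)
hard = focal_loss(0.1)  # hard examples retain most of their loss
```
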

4.1.2 Component level segmentation

Beyond binary masks, advanced pipelines perform component-aware segmentation to delineate not just the plant, but also labels, barcodes, scale bars, and other sheet elements. Object detectors trained on multi-institution datasets now routinely reach production-level accuracy. An improved YOLOv3 with a fourth detection scale lifts mAP at IoU threshold 0.50 from 90.1% to 93.2% on 4,000 herbarium specimens from the Herbarium Haussknecht in Jena (HHJ), Germany (Triki et al. 2020). For instance, Thompson et al. (2023) report that a YOLOv5-based model can localise eleven distinct sheet components at 0.983 precision and 0.969 recall. The HESPI pipeline (Turnbull et al. 2024) pushes performance even further: its custom detector achieves near-perfect label localisation (about 99% IoU) and an \(\hbox {F}_{1}\) score of approximately 98% across multiple benchmarks. The qualitative difference in output between these methods can be significant, as shown in the instance-segmentation comparison in Fig. 8.

Fig. 8

Instance-segmentation comparison on herbarium specimens. Left to right: Detectron2, Mask R-CNN, YOLOv8, and Mask2Former. Reproduced from the “herbarium-segmentation” repository [GitHub] https://github.com/kymillev/herbarium-segmentation under the MIT Licence; methodology follows Milleville (2025); Milleville et al. (2023)

To illustrate the qualitative output of a recent Mask2Former-based pipeline, Fig. 9 presents an original sheet, the pixel-wise class map, and the isolated plant mask. These results were obtained by re-running the open-source implementation of Milleville (2025); Milleville et al. (2023) on a representative specimen, reproducing the authors’ settings. As in their report, the component labels are accurate even though the outer silhouette is not yet at truly fine-grained resolution.

Fig. 9

Reproduction of the Mask2Former workflow of Milleville (2025) and Milleville et al. (2023) on a test specimen. Left: original herbarium image. Centre: predicted instance labels (colour-coded by organ). Right: composite plant mask

4.1.3 Plant organ level segmentation

For fine-grained phenotyping, instance segmentation of individual organs (leaves, flowers, etc.) is required. Younis et al. (2020a) pioneered this with Mask R-CNN, training a model to detect six organ categories and publicly releasing the associated annotation dataset on the PANGAEA repository (Younis et al. 2020b). Their results highlighted the complexity of the task, achieving a mAP of approximately 22%, with performance varying significantly across organs, from 37.9% for leaves down to 0% for seeds. Goëau et al. (2020) showed that a Mask R-CNN trained on only 21 Streptanthus sheets correctly enumerated buds, flowers and fruits with 77.9% accuracy, enabling fine-grained phenology at scale. Deep Leaf segments individual leaves with an average relative error of 4.6% in length and 5.7% in width across 800 test sheets (Triki et al. 2021). A refined YOLOv3 variant by Triki et al. (2022b) then lifted overall organ-level precision and recall to 94.2% and 95.5% on 3,400 annotated organs. Hussein et al. (2021a) used a DeepLabv3+ pipeline that first isolates intact leaves and then measures their traits, recording a 96% \(\textrm{F}_{1}\) score on an in-house set and 93% on a public benchmark. Using these organ masks as support, they trained a GAN-based restoration model that reconstructed over 90% of missing leaf area and yielded a four-percentage-point gain in downstream species classification recall (Hussein et al. 2021b).

More recent work has explored attention-gated versions of YOLO and Transformer-based models like Mask2Former to improve accuracy on this complex task (Ariouat et al. 2024; Milleville et al. 2023). This organ-level segmentation is a critical input for integrated systems like LeafMachine2, which automates the measurement of key morphological traits from the resulting masks (Weaver and Smith 2023). Plug-in MLaaS services inside DiSSCo already compute organ area and ruler scale for 203 sheets; a YOLO-11 detector identifies scale bars in 98% of cases and yields centimetre-accurate area estimates (Rajendran et al. 2025).

4.2 Automated label transcription

Accurate transcription of herbarium labels supplies critical metadata to biodiversity databases. Modern pipelines implement a multi-step process: locating and classifying labels, applying Optical Character Recognition (OCR) or Handwritten Text Recognition (HTR), and performing post-processing to correct and format the output.

4.2.1 Label type classification

Given the localised label crops from Sect. 4.1.2, pipelines classify each crop by type (e.g., printed vs. handwritten) so that it can be routed to a specialised transcription model, a critical step for maximising accuracy. In HESPI, a lightweight secondary classifier reaches over 98% accuracy on this task (Turnbull et al. 2024).

4.2.2 OCR and HTR

For machine-printed text, off-the-shelf OCR engines like ABBYY FineReader can exceed 99% character-level accuracy on clear labels (Drinkwater et al. 2014). Throughout this section, we measure transcription quality with the Character Error Rate (CER), a standard metric in automatic speech and text recognition, defined as:

$$\begin{aligned} \text {CER} = \frac{S + D + I}{N}, \end{aligned}$$
(15)

where S is the number of substitutions, D is the number of deletions, I is the number of insertions required to change the predicted text to the ground-truth text, and N is the total number of characters in the ground-truth text (Morris et al. 2004).
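Eq. (15) is typically computed via the Levenshtein distance; a minimal dynamic-programming sketch:

```python
def cer(predicted, truth):
    """Character Error Rate (Eq. 15): Levenshtein edit distance
    (substitutions + deletions + insertions) divided by len(truth)."""
    m, n = len(predicted), len(truth)
    # d[i][j] = edit distance between predicted[:i] and truth[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if predicted[i - 1] == truth[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n] / n

# One misread character in a seven-character genus name: CER = 1/7.
error = cer("Solanvm", "Solanum")
```
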

However, to handle domain-specific vocabulary, ensemble approaches that combine multiple OCR engines can reduce remaining field error rates by a factor of four (Guralnick et al. 2025). For challenging cursive script, Transformer-based HTR models are rapidly closing the gap. Sadek et al. (2024) demonstrated that fine-tuning a state-of-the-art Transkribus model on historical botanist handwriting can lower the CER to just 3.1%, compared to 8.3% from a generic service like AWS Textract.

4.2.3 Post-processing

After initial OCR/HTR, a post-processing stage cleans and validates the outputs, often using LLMs. Guralnick et al. (2025) used an LLM-based correction step combined with controlled-vocabulary checks to raise the overall Darwin Core field \(\hbox {F}_1\) score to 0.90–0.95 on unseen collections. Weaver et al. (2023) demonstrated that routing multiple OCR engine outputs through a GPT-4-style LLM to consolidate best readings, normalise terms, and emit structured JSON records reduces character-error rates by approximately 45% compared to single-engine OCR baselines.

4.3 NLP for structured data extraction

Once label images have been transcribed, the resulting free text must be converted into semantically structured data that can interoperate with biodiversity portals. This conversion usually follows a four-step cascade.

  1. Token normalisation and abbreviation expansion.

    Herbarium labels often contain standard botanical abbreviations (“Coll.”, “Det.”, “Alt.”). Rule-based lexicons remain effective baseline tools, but to handle the immense variety of terms, automated parsers are crucial. For example, the Salix method, an early semi-automated workflow, successfully parsed and expanded abbreviations for key fields like collector and date with over 95% accuracy on a test set of Salix specimens (Barber et al. 2013).

  2. Scientific-name recognition and parsing.

    Extracting Latin binomials is essential for linking specimens to taxonomic backbones. The open-source package gnfinder combines heuristics with machine learning to reach an \(\hbox {F}_1\) score of 0.94 on the Herbarium-NER benchmark (Mozzherin et al. 2024), while its companion gnparser splits raw strings into structured components with 96% accuracy against IPNI (Mozzherin et al. 2017; Mozzherin 2023).

  3. Locality geoparsing and coordinate cleaning.

    Translating prose locality descriptions into geographic coordinates couples NLP with GIS. Gazetteer-based matchers such as BioGeomancer correctly position about 81% of historical localities within 10 km of expert references (Guralnick et al. 2006). Post-processing tools like CoordinateCleaner then flag implausible points (centroids, oceans, capitals) and standardise uncertainty estimates (Zizka et al. 2019).

  4. Entity linking and consistency checks.

    Recognised names are reconciled with authoritative taxonomies (e.g., WCVP, Plants of the World Online (POWO)) (Hassler et al. 2021; Royal Botanic Gardens, Kew 2025). Automated tools designed for this purpose, such as the Taxonomic Name Resolution Service, have been shown to resolve synonymy with very high precision, often in the 95-98.5% range (Boyle et al. 2013). A final ensemble validation that cross-checks all linked fields, including collector, date, locality, and taxon, can increase end-to-end metadata accuracy above 92% (Guralnick et al. 2025). Unmatched or conflicting records are routed to experts for manual review, closing the loop between automated extraction and expert oversight.
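The first two steps of the cascade can be sketched with a rule-based baseline; the mini-lexicon and binomial regex below are illustrative stand-ins, far simpler than the curated vocabularies of the Salix method (Barber et al. 2013) or the ML-assisted matching in gnfinder:

```python
import re

# Hypothetical mini-lexicon of standard botanical abbreviations;
# production pipelines use much larger, curated vocabularies.
ABBREVIATIONS = {"Coll.": "Collector", "Det.": "Determiner", "Alt.": "Altitude"}

# Loose pattern for a Latin binomial: capitalised genus + lowercase epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s([a-z]+)\b")

def normalise_label(text):
    """Step 1 (sketch): expand standard botanical abbreviations."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text

def find_binomials(text):
    """Step 2 (sketch): candidate scientific names; real systems such as
    gnfinder add name dictionaries and ML-based filtering on top."""
    return [" ".join(match) for match in BINOMIAL.findall(text)]
```

A pure regex will over- and under-match on real labels (e.g., capitalised place names followed by lowercase words), which is exactly why production tools layer dictionaries and learned filters on top.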

4.4 Human-assisted workflows

Real-world digitisation pipelines now weave together automated vision modules and targeted human expertise. Most production stacks follow a modular recipe: component detection, text transcription, and NLP structuring, but differ in how they invite expert or volunteer input. The entire end-to-end process, from specimen handling to final data publication, has been mapped in detail to identify bottlenecks and opportunities for automation (Thompson and Birch 2023).

4.4.1 Component-level modular pipelines

Component-level herbarium pipelines interleave automated vision modules with targeted human input at key checkpoints. Low-confidence detections may be routed to volunteers for annotation; OCR or HTR outputs are verified and normalised by experts; and final taxonomic resolution or database triage is typically handled by curators. This hybrid design balances scalability with data quality and is common across production systems. Figure 10 summarises the current herbarium digitisation pipeline, showing how automated vision modules and LLM-driven metadata structuring interact with volunteer input and expert triage before final database integration.

Fig. 10

End-to-end herbarium digitisation workflow integrating computer-vision modules, LLM-based metadata structuring, and human contributions prior to database write-back and GBIF synchronisation

A leading example is LeafMachine2, a modular, open-source pipeline developed through a collaboration of nearly 300 institutions (Weaver and Smith 2023). Rather than a single model, it orchestrates YOLOv5 for object detection and Detectron2 for fine-grained segmentation, processing sheet components, plant parts and labels in parallel (Fig. 11). VoucherVision extends this architecture with automated transcription and parsing (Weaver et al. 2023), while many research prototypes follow the same blueprint. For example, a pipeline comprising a U-Net, ViT and an MLP classifier was evaluated on Piperaceae sheets (Kajihara et al. 2025). Convergent ideas appear in the HESPI pipeline from the University of Melbourne (Turnbull et al. 2024).

Fig. 11

Modular workflow of the LeafMachine2 pipeline. The system uses distinct models for sheet-component detection, leaf segmentation, pseudo-landmark placement, and archival cropping. Figure reproduced from Weaver and Smith (2023), licensed under CC BY 4.0

4.4.2 Integrated human-AI loops and expert tools

Strategic human intervention raises both accuracy and trust. Bounding boxes drawn by a handful of volunteers can guide a detector-OCR ensemble, achieving 93% success in locating primary labels (Guralnick et al. 2024). The same study trained a ResNet-50, not to identify species but to flag likely errors in the Herbarium 2020 dataset. By reviewing only the top 3% most suspicious sheets, experts captured 87% of genuine misidentifications, cutting inspection time from 42 s to 11 s per sheet.

Active learning creates a feedback loop where the AI model actively requests human help to improve itself. The model identifies candidates for human review by ranking their predictive uncertainty. A common method is to use the entropy of the model’s predicted probability distribution \(\textbf{p}\) for a given sample:

$$\begin{aligned} H(\textbf{p}) = - \sum _{c=1}^{C} p_c \log _2(p_c), \end{aligned}$$
(16)

where \(p_c\) is the predicted probability of class \(c\) and \(C\) is the number of classes. Samples with higher entropy are considered more uncertain and are prioritised for expert annotation (Settles 2009).
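Eq. 16 translates directly into a selection routine; the toy probability vectors below are invented for illustration:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a probability distribution, Eq. 16."""
    return -sum(pc * math.log2(pc) for pc in p if pc > 0)

def select_for_review(predictions, k=2):
    """Rank samples by predictive entropy and return the indices of
    the k most uncertain, to be sent to expert annotators."""
    ranked = sorted(range(len(predictions)),
                    key=lambda i: entropy(predictions[i]),
                    reverse=True)
    return ranked[:k]

preds = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # highly uncertain (near-uniform)
    [0.70, 0.20, 0.10],  # moderately uncertain
]
print(select_for_review(preds, k=2))  # [1, 2]
```

In a real loop, the selected sheets would be annotated, added to the training set, and the model retrained before the next round of selection.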

Systems such as VoucherVision display the image, OCR text and predicted taxonomy side-by-side, letting specialists overwrite any step (Weaver et al. 2023). Uncertainty heat-maps and versioned audit trails cut curation time by over 40% while keeping data accuracy above 95%.

Herbaria historically harbour baseline taxonomic error rates (Goodwin et al. 2015), and many new species are discovered by re-examining existing collections (Bebber et al. 2010). Screening networks that flag outliers therefore act as computational microscopes, allowing experts to focus on rare taxa and potential novelties. Walker et al. (2022) used self-supervised learning to surface morphological clusters that botanists later formalised as new species.

The relationship between humans and AI is no longer confined to post-hoc error correction; it is evolving into a collaborative engine for scientific discovery. Comparative studies show that state-of-the-art classifiers routinely outperform human generalists on common, well-represented species, whereas domain experts remain superior at recognising rare taxa and detecting novelties (Bonnet et al. 2018). An optimal division of labour therefore lets models handle bulk identification, while specialists scrutinise outliers and potential new species, turning AI into a computational microscope that accelerates taxonomic insight.

Table 6 summarises milestone HITL initiatives spanning the past decade.

Table 6 HITL initiatives for digitised herbaria

4.4.3 Crowdsourcing at collection scale

Volunteer platforms such as Les Herbonautes (France), Notes from Nature (global), and DigiVol (Australia) routinely deliver high-quality label transcriptions (https://research.mnhn.fr/projects/les-herbonautes, https://www.notesfromnature.org/, https://volunteer.ala.org.au/). An Annonaceae campaign reached 97% field-level accuracy (Streiff et al. 2024), while volunteer-drawn label boxes lowered OCR word-error rates four-fold (Guralnick et al. 2024). Hybrid workflows that interleave crowdsourcing with automated checks (e.g., the filtering approach of Guralnick et al. 2025) illustrate how mass participation and machine intelligence can scale far beyond what either could achieve alone.

Taken together, the shift from component-level detectors to fully integrated, human-AI Transformer pipelines has lifted classification accuracy to the brink of routine deployment. But progress is still capped by input fidelity: blurred scans, imprecise segmentations and noisy transcriptions propagate errors that no downstream model can repair. The next subsection therefore turns to vision language models that fuse image features with textual context, seeking to relieve these input bottlenecks.

4.5 Vision language models

LLMs and Vision Language Models (VLMs) increasingly replace pre-trained NLP modules and even perform zero-shot classification. A single model now ingests pixels plus raw text and outputs taxon names, collection descriptions, structured information or natural-language explanations.

4.5.1 Vision language model pre-training

Large ViTs pre-trained on massive unlabeled datasets using self-supervised objectives have proven to be exceptionally effective feature extractors. Models like DINOv2 and VLMs like SigLIP provide powerful, ready-to-use backbones (Zhai et al. 2023). Fast Language-Image Pre-training (FLIP) masks about 70% of patches so the same GPU hours cover 3\(\times \) more image-text pairs, yielding a 1.4 percentage point (pp) gain in zero-shot ImageNet accuracy versus vanilla CLIP at equal compute (Li et al. 2023a). In a particularly notable study, Gustineli et al. (2024) achieved a 23.0% macro-\(\textrm{F}_{1}\) score on the challenging, multi-label PlantCLEF 2024 cross-domain task by tiling a self-supervised DINOv2-B model over 4k-pixel herbarium sheets, thereby doubling the performance of the best ResNet-based baseline. Combining contrastive self-supervision with language supervision, SLIP (Self-supervision Meets Language-Image Pre-training) lifts ImageNet zero-shot Top-1 by 5.2 pp over CLIP and by 8.1 pp over pure SSL on identical data (Mu et al. 2022). Distilling knowledge from these giant VLMs, TinyCLIP combines weight inheritance with 'affinity mimicking' to halve the parameter count of ViT-B/32 while keeping ImageNet zero-shot accuracy within 0.3 pp (Wu et al. 2023b).
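The FLIP idea of discarding most patches per step is simple to sketch. The patch count below assumes a ViT-B/16 at 224 × 224 input (14 × 14 = 196 patches); the function is an illustrative stand-in, not the paper's implementation:

```python
import random

def mask_patches(num_patches, mask_ratio=0.7, seed=0):
    """FLIP-style random patch masking: keep only a random subset of
    image patches so each contrastive training step is cheaper."""
    rng = random.Random(seed)
    n_keep = int(num_patches * (1 - mask_ratio))
    keep = rng.sample(range(num_patches), n_keep)
    return sorted(keep)

# Masking 70% of 196 patches leaves 58 visible patches per image,
# so the same GPU budget covers roughly 3x more image-text pairs.
visible = mask_patches(196, mask_ratio=0.7)
print(len(visible))  # 58
```

Only the kept patch indices are embedded and fed to the encoder; the mask is resampled every step, so the model eventually sees all regions across epochs.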

4.5.2 Text normalisation and conversational assistance

LLMs fine-tuned for OCR clean-up, abbreviation expansion and date reformatting have begun to replace rule-based post-processing. BLIP-2, for instance, combines a frozen ViT-G/14 image encoder with a 13B language decoder and has demonstrated competitive OCR correction on several public benchmarks (Li et al. 2023b). Early experiments with GPT-4V(ision) likewise report improved Latin-name canonicalisation, although results remain unpublished (OpenAI et al. 2024). Beyond text normalisation, multimodal backbones can answer free-form expert queries: LLaVA\(-\)1.5 and Kosmos-2 both support prompts such as “Which floral parts are present?” and deliver state-of-the-art accuracy on the ScienceQA and VQA benchmarks (Liu et al. 2023b; Peng et al. 2023), suggesting immediate applicability to herbarium curation tasks.

Second-generation systems embed VLMs directly into the vision stack. VoucherVision v2 replaces rule-based string parsing with a GPT-4V module that resolves collector names and locality phrases, reporting higher parsing accuracy in internal tests (Weaver et al. 2023).

HESPI combines BLIP-2 with rule-based checks so that detected geocoordinates must be consistent with the locality text, substantially lowering false positives in its public demo (Turnbull et al. 2024). Early user studies indicate that experts gain confidence when the model explains edits in plain language, although hallucination remains a concern mitigated by retrieval-augmented prompts.

4.5.3 Open-set recognition

OSR assumes that unknown taxa may appear at test time. Zero-shot models, which do not require explicit class labels during training, offer a natural solution to this challenge. Image-text encoders pre-trained on web-scale captions (e.g., SigLIP, CLIP and ImageBind) already display useful zero-shot accuracy on long-tailed vision tasks, and small herbarium trials show the same qualitative trend (Zhai et al. 2023; Girdhar et al. 2023). On PlantCLEF 2024, Gustineli et al. (2024) fine-tuned a self-supervised DINOv2 backbone and achieved the competition’s highest macro-\(\textrm{F}_{1}\), confirming that foundation models can mitigate class imbalance without exhaustive labels.

Generalised-category-discovery (GCD) methods push further. NeighbourGCN refines pseudo-labels on k-Nearest Neighbors (k-NN) sub-graphs and sets a new state-of-the-art 38.6% macro-\(\textrm{F}_{1}\) on Herbarium 2019, 3.2 pp above the previous GCD baseline (Yang et al. 2025). GET-CLIP unlocks CLIP’s multimodal capacity for fine-grained domains: prompt tuning plus geometric alignment improve zero-shot macro-\(\textrm{F}_{1}\) on Herbarium 2019 by up to 5.3 pp (Wang et al. 2025). Hallucinated localities or fabricated author names nevertheless pose risks. Retrieval-augmented generation lowers factual error in dialogue by more than 60% (Shuster et al. 2021); adapting such prompts with Darwin-Core fields and GBIF look-ups to herbarium metadata is an open research avenue, and production systems still route high-impact predictions, such as putative new species, back to the expert dashboards in Sect. 4.4 for manual approval.
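A retrieval-augmented prompt of the kind suggested above might be assembled as follows. The field names follow the Darwin Core standard, but the helper function, record shape, and prompt wording are hypothetical:

```python
def build_rag_prompt(ocr_text, retrieved):
    """Ground an LLM with retrieved Darwin-Core records (e.g., from a
    GBIF lookup) before asking it to structure a label transcription.
    Sketch only: prompt wording and record format are illustrative."""
    context = "\n".join(
        f"- scientificName: {r['scientificName']}; country: {r['country']}"
        for r in retrieved
    )
    return (
        "Use ONLY the retrieved records below as taxonomic context.\n"
        f"Retrieved records:\n{context}\n\n"
        f"Label text:\n{ocr_text}\n\n"
        "Return Darwin-Core fields (scientificName, recordedBy, "
        "eventDate, locality) as JSON; answer 'unknown' if unsure."
    )

records = [{"scientificName": "Quercus robur L.", "country": "France"}]
prompt = build_rag_prompt(
    "Quercus robur, leg. Dupont, 1912, Fontainebleau", records)
print("Quercus robur L." in prompt)  # True
```

Constraining the model to the retrieved records, and allowing an explicit "unknown", is what reduces hallucinated localities and author names in practice.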

Foundation VLMs already match or exceed bespoke pipelines on small testbeds while offering richer, explainable outputs. Coupling them with FAIR digital-specimen APIs and energy-efficient adapter tuning is likely to define the next wave of herbarium AI services.

5 Challenges

Despite the progress reviewed above, several fundamental obstacles still limit the routine deployment of herbarium-AI systems. Fewer in number yet broader in scope, the following four challenges subsume earlier concerns and point to concrete research directions.

5.1 Data imbalance

Herbarium datasets follow a long-tail distribution: 60% of species are represented by fewer than five sheets, which biases empirical risk minimisation toward common taxa and degrades recall on threatened or newly described lineages (de Lutio et al. 2021). Domain gap further compounds the issue: networks fine-tuned on pressed specimens generalise poorly to field photographs or plant sections. Recent solutions include class-balanced focal loss, meta-batch sampling, and few-shot adapters that leverage large self-supervised encoders (Wang et al. 2020; Chen et al. 2020; He et al. 2020). Vision-language pre-training (e.g., BioCLIP) now delivers 18–25% macro-\(\textrm{F}_{1}\) gains in zero-shot transfer between Herbarium 2020 and iNaturalist subsets (Stevens et al. 2024), yet robust calibration under extreme shift remains unsolved.
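One widely used recipe for such imbalance combines focal loss with effective-number class re-weighting. The sketch below shows both ingredients in plain Python; it is an illustration of the general technique, not the exact formulation of the cited works:

```python
import math

def class_balanced_weights(counts, beta=0.999):
    """Effective-number re-weighting: w_c proportional to
    (1 - beta) / (1 - beta**n_c), normalised to sum to C,
    so rare classes receive larger weights."""
    raw = [(1 - beta) / (1 - beta ** n) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def focal_loss(p_true, gamma=2.0, weight=1.0):
    """Focal loss for the true-class probability p_true:
    -w * (1 - p)**gamma * log(p). Easy, high-confidence examples
    are down-weighted by the (1 - p)**gamma factor."""
    return -weight * (1 - p_true) ** gamma * math.log(p_true)

# Long-tailed toy dataset: a common species vs. a rare one.
weights = class_balanced_weights([5000, 5])
print(weights[1] > weights[0])  # True: the rare class is up-weighted
```

During training, each sample's focal loss would be multiplied by the weight of its class, shifting gradient mass toward the sparse tail of the species distribution.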

5.2 Information loss

Flattening, drying and long-term storage inevitably strip herbarium sheets of characters that are conspicuous in living plants. Quantitative metrics such as leaf-mass-per-area exhibit resolution-dependent error once scan size drops below three megapixels (Vasconcelos et al. 2025). Internal anatomy is likewise hidden: sclereid density, vascular bundle shape and seed endosperm remain inaccessible without destructive sectioning or \(\upmu \)CT. Even \({600}\,{\textrm{dpi}}\) scans omit micromorphological features smaller than \({50}\,{\upmu {\textrm{m}}}\), and virtual revisions recover only approximately 80% of diagnostic traits (Phang et al. 2022).

Pressing also distorts shape: a survey of 794 leaf pairs from 22 taxa showed lamina area shrinking by 5–18%, with changes in length-to-width ratio that can mislead morphometric analyses (Tomaszewski and Górzkowska 2016). Reflectance spectra of pressed leaves differ systematically from fresh tissue: drying removes dominant water absorption features (around \({1,450}\,{\textrm{nm}}\) and \({1,930}\,{\textrm{nm}}\)) and shifts the red edge to longer wavelengths, altering which traits can be reliably inferred (Kothari et al. 2023). Floral colours, particularly reds and blues, are known to be poorly preserved during pressing and drying and to fade over time (Bridson and Forman 1998).

Future pipelines must integrate focal stacking, photogrammetric texture maps and low-dose \(\upmu \)CT. Cross-modal contrastive objectives could align these richer 3D/IR modalities with legacy 2D scans, enabling backward compatibility while progressively upgrading trait coverage. Linking pressed sheets to field photographs or iNaturalist observations via shared embeddings would further restore colour and phenological context lost at collection time.

5.3 Model interpretability and explainability

Without explanatory heat-maps, deep classifiers remain black boxes to taxonomists. CAM and Grad-CAM reveal which pixels drive decisions but ignore part structure (Selvaraju et al. 2017; Zhou et al. 2016). Combining AI with organ-level masks (PENet) supports high-throughput measurement of lamina area, petiole angle and vein density (Zhao et al. 2023). The next step is to link attention hotspots directly to formal descriptors (e.g., “pinnate venation present”), closing the loop between image evidence and descriptive terminology.
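The Grad-CAM computation itself is compact: each feature map is weighted by its spatially averaged gradient, the weighted maps are summed, and negatives are clamped to zero. A minimal sketch on toy 2 × 2 activations (the arrays below are invented for illustration):

```python
def grad_cam(activations, gradients):
    """Minimal Grad-CAM over nested-list tensors shaped
    [channels][h][w]: channel weights are global-average-pooled
    gradients; the heat-map is the ReLU of the weighted sum."""
    h, w = len(activations[0]), len(activations[0][0])
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    cam = [[max(0.0, sum(wk * activations[k][i][j]
                         for k, wk in enumerate(weights)))
            for j in range(w)] for i in range(h)]
    return cam

acts = [[[1.0, 0.0], [0.0, 2.0]],      # channel 0 activations
        [[0.0, 1.0], [1.0, 0.0]]]      # channel 1 activations
grads = [[[1.0, 1.0], [1.0, 1.0]],     # channel 0: positive gradient
         [[-1.0, -1.0], [-1.0, -1.0]]] # channel 1: negative gradient
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 2.0]]
```

Upsampled to the input resolution and overlaid on the specimen image, such maps let taxonomists see whether the classifier attends to diagnostic organs or to artefacts like labels and rulers.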

5.4 Scalable open-set recognition

Large-scale vision-language encoders such as SigLIP, CLIP and ImageBind already achieve useful zero-shot results on long-tailed benchmarks, and prompt-guided adapters push further: PromptCAL and Targeted Representation Alignment improve discovery accuracy on ImageNet-O and Herbarium 2019, while GET-CLIP raises zero-shot macro-\(\textrm{F}_{1}\) on Herbarium 2019 by 5.3 pp through geometric alignment and prompt tuning (Zhai et al. 2023; Zhang et al. 2023; Liu et al. 2024; Wang et al. 2025). Graph-based methods follow the same agenda: NeighbourGCN refines pseudo-labels on \(k\)-NN sub-graphs and reaches 38.6% macro-\(\textrm{F}_{1}\) on Herbarium 2019, 3.2 pp above the previous state of the art (Yang et al. 2025). Nevertheless, classical open-set filters (OpenMax, POEM) still lose recall when the novel-class rate exceeds 10% (Scheirer et al. 2013), and foundation VLMs process at most \(4\text {k}\times 4\text {k}\) inputs on 80 GB GPUs, which is far below the \({600}\,{\textrm{dpi}}\) masters held by major institutions.
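The simplest member of this open-set family thresholds the maximum softmax probability; OpenMax and POEM refine the idea considerably, but the underlying rejection principle can be sketched in a few lines (the threshold value here is arbitrary):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def open_set_predict(logits, threshold=0.5):
    """Baseline open-set rule: if the maximum softmax probability
    falls below the threshold, reject as 'unknown' instead of
    forcing a known-class label."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else "unknown"

print(open_set_predict([4.0, 0.1, 0.2]))  # confident -> class 0
print(open_set_predict([1.0, 0.9, 1.1]))  # ambiguous -> 'unknown'
```

The recall loss noted above arises because, as the novel-class rate grows, genuinely new taxa increasingly produce confident but wrong maxima that pass this threshold.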

Future work must attack four fronts simultaneously: (i) memory-efficient adapters and sparse attention to handle full-resolution TIFFs, (ii) retrieval-augmented prompts that inject Darwin-Core triples and GBIF look-ups to reduce hallucination, (iii) fine-grained open-set detectors able to flag cryptic sister taxa distinguished by tiny morphological differences, and (iv) community benchmarks that evaluate imbalance, novelty and multimodal reasoning in one place. Progress on these fronts is essential before foundation models can be trusted across the world’s 400 million digitised herbarium sheets.

6 Conclusions and future directions

Digitised herbarium sheets now constitute one of the largest fine-grained image repositories in biology, and the past decade has seen a decisive transition from rule-based descriptors to end-to-end learning pipelines. Section 2 detailed how IIIF imaging and Darwin-Core metadata have turned formerly local cabinets into machine-actionable resources comprising more than 100 million high-resolution scans. Section 3 traced the evolution of recognition models, from handcrafted features through convolutional baselines to self-supervised ViTs, while Sect. 4 showed that accurate upstream modules (segmentation, OCR/HTR, entity resolution) are indispensable for maximising downstream taxon accuracy and enabling multimodal fusion.

Three overarching insights recur throughout this survey. First, the fidelity of the input data (including scan resolution, colour calibration, and precise label alignment) continues to dictate the ceiling on model performance, regardless of architectural sophistication. Second, transfer learning from large natural-image datasets remains a remarkably effective catalyst: fine-tuning ImageNet-pre-trained backbones routinely yields double-digit gains, especially on the extreme long-tailed distributions that typify floristic datasets. Finally, workflows that embed human expertise at strategic junctures (such as active learning loops, expert dashboards, and volunteer verification) not only correct residual errors but also preserve the domain-expert trust essential for responsible deployment.

Looking ahead, four research thrusts appear most likely to unlock the full potential of the world’s 400 million preserved specimens:

  • Generative and self-supervised augmentation. Self-distilled ViT encoders trained with large-scale contrastive objectives now match or exceed fully supervised CNNs on multi-label plant benchmarks, yet performance still degrades on the long botanical tail. Class-conditional diffusion models and text-guided image editing can synthesise realistic variants (additional organs, colour channels, or phenophases) for taxa represented by only one or two sheets, broadening intra-class diversity without changing labels. Combined with synthetic-to-real domain adaptation and active-learning loops, these generators can focus expert effort on the most informative gaps and reduce annotation cost. To avoid overfitting or drift, synthetic images should be explicitly flagged and evaluation kept on held-out, real-only test sets.

  • Truly multimodal foundation models. Current CLIP-style resources are image-caption only; species-level identification benefits from a joint embedding of high-resolution scans, transcribed label text, GPS provenance, climate layers and, where available, DNA barcodes. Rank-aware contrastive objectives (“hierarchical contrastive losses”, i.e., losses aligned with botanical classification (family-genus-species) that pull together positives at multiple ranks and separate negatives across ranks), together with graph neural networks linking sheets across space and time, could enable zero-shot recognition of previously unseen or undescribed taxa and provide quantitative novelty scores for taxonomists.

  • Machine-actionable digital specimens. Persistent identifiers, versioned annotations, and open APIs (as exemplified by the DiSSCo blueprint) enable recognition-as-a-service pipelines that write new determinations, organ counts, and trait values directly back to collection databases. Streaming inference on the ingest server, combined with real-time validation dashboards, would shorten the feedback loop between experts, AI models, and biodiversity aggregators such as GBIF and iDigBio.

  • Responsible and sustainable deployment. Herbarium AI must embrace green-AI accounting (tracking GPU hours and carbon cost), federated or privacy-preserving learning to respect sensitive locality data, and the CARE (collective benefit, authority to control, responsibility, and ethics) as well as FAIR principles to ensure equitable benefit-sharing with source countries and Indigenous knowledge holders. Benchmark reports should include energy use and bias audits alongside accuracy metrics.

Taken together, these threads point toward an AI “digital taxonomist”: a modular system that (i) assigns names at family, genus and species rank, (ii) segments reproductive and vegetative organs, (iii) extracts quantitative morphological and phenological traits, (iv) flags outliers for expert review and (v) synchronises structured data with downstream ecological workflows. Such capabilities would vastly accelerate species discovery, reveal shifts in flowering and fruiting phenology under climate change, and inform timely conservation actions that directly tackle the biodiversity crisis outlined in Sect. 1.

Realising this vision will require sustained collaboration among computer scientists, botanists, collection managers and citizen scientists; cross-institutional benchmarks that include under-sampled tropical floras; and rigorous ethical safeguards. Building on the foundations synthesised in this survey, the community can convert centuries-old specimens into a living, quantitative observatory of global plant diversity.