There are many scans of trees; however, they have not been assembled into a data forest.
Species classification from laser scanning data has long been seen as a frontier challenge in forest remote sensing, one with major implications for everything from carbon accounting to biodiversity assessment to sustainable forest management. Accurately identifying tree species from 3D point clouds would enable researchers to automate inventory processes, monitor habitat diversity, and improve ecosystem models. Yet progress toward this goal has been limited.
The main barrier has not been a lack of algorithms but a lack of data. Deep learning relies on large, diverse, and well-annotated datasets, and in forestry, the infrastructure to create such datasets has been lacking. Existing studies often relied on small, homogeneous samples: a few hundred trees from a single forest type or a single sensor. Without sufficient variety, models struggle to generalize beyond their training environments, confining innovation to isolated experiments.
That landscape is now changing. With the release of the FOR-species20K dataset, introduced by Puliti et al. (2025) in Methods in Ecology and Evolution, the field finally has a shared benchmark: a comprehensive, open, and high-quality dataset of more than 20,000 individual trees spanning 33 species and three major biogeographic regions of Europe. Compiled through international collaboration under the COST Action 3DForEcoTech network, it marks a milestone in the evolution of AI for forest ecosystems.
The FOR-species20K dataset represents a collective effort to close the data gap in forest AI through collaboration. It integrates 25 independent datasets, contributed by research groups across Europe and beyond, into a single harmonized resource. The dataset includes over 20,000 individual trees from 33 species, covering dominant genera such as Pinus, Picea, Fagus, and Quercus. Each tree is represented by a precisely segmented 3D point cloud captured with one of several laser scanning platforms: terrestrial (TLS), mobile (MLS), or unmanned aerial (ULS) systems.
Tree species identification is fundamental to forest science. Each species plays a distinct role in ecosystem processes, making species-level data crucial for modeling and management. Traditionally, species identification in forest inventories has been manual and labor-intensive, relying on field experts or visual interpretation of aerial imagery. These approaches are slow, subjective, and costly to scale. As forest monitoring expands under global carbon and biodiversity frameworks, automation becomes not just desirable but necessary. Laser scanning data offer a promising solution.
Beyond its scale, FOR-species20K is notable for its cross-platform nature. By integrating data from different LiDAR systems and acquisition protocols, it captures the variability inherent in real-world forest measurements. This makes it well suited to testing whether models can generalize, a critical step toward sensor-agnostic AI that performs reliably across tools and conditions.
To demonstrate the dataset’s potential, Puliti and colleagues benchmarked seven state-of-the-art deep learning architectures for tree species classification. The benchmark was run as a data science competition made possible by the COST Action network. The models tested included both 3D point cloud–based models, such as PointNet++, MinkNet, and DGCNN, and 2D multi-view convolutional networks, which convert each 3D point cloud into a series of rendered images from multiple angles.
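To make the multi-view idea concrete, here is a minimal sketch of how a single tree's point cloud could be turned into a handful of side-view images. It is an illustration only and not the rendering pipeline used by DetailView or any other benchmarked model: the function name render_views, the orthographic projection, the binary silhouettes, and the default view count and image size are all assumptions.

```python
import math


def render_views(points, n_views=4, size=32):
    """Project a tree point cloud onto `n_views` binary side-view silhouettes.

    points: list of (x, y, z) tuples, with z as height (assumed convention).
    Returns a list of `size` x `size` grids (lists of lists of 0/1).
    Hypothetical helper for illustration; real pipelines typically render
    richer images (depth, density, close-up detail views).
    """
    # Centre the tree horizontally and put its lowest point at height zero.
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    z_min = min(p[2] for p in points)
    centred = [(x - cx, y - cy, z - z_min) for x, y, z in points]

    # One common scale for width and height keeps the tree's proportions.
    radius = max(math.hypot(x, y) for x, y, _ in centred)
    height = max(z for _, _, z in centred)
    extent = max(2 * radius, height) or 1.0  # fall back to 1.0 for a degenerate cloud

    views = []
    for k in range(n_views):
        # Rotate the viewing direction about the vertical axis.
        angle = 2 * math.pi * k / n_views
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        image = [[0] * size for _ in range(size)]
        for x, y, z in centred:
            u = x * cos_a + y * sin_a  # horizontal image coordinate
            col = min(size - 1, int((u / extent + 0.5) * size))
            row = min(size - 1, int((1 - z / extent) * size))  # row 0 is the top
            image[row][col] = 1
        views.append(image)
    return views
```

Each returned grid could then be passed to an ordinary 2D image classifier, with the per-view predictions combined (for example, averaged) into a single species label for the tree.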
The results were illuminating. 2D multi-view models consistently outperformed their 3D counterparts, achieving higher accuracy and better robustness across platforms. The best-performing architecture, DetailView, reached an average accuracy above 80%, even when trained and tested on data from different sensors. The advantage of the 2D models is likely due to a disparity in technology readiness level: image-based methods have benefited from massive amounts of technical development in recent years.
Moreover, the benchmark revealed a significant advance: cross-sensor generalization. Models trained on data from one type of scanner, for example, TLS, could still perform well on ULS or MLS data. This finding moves the field closer to practical deployment, where forest monitoring systems can integrate data from heterogeneous sources without retraining from scratch. Still, challenges remain. While 2D methods currently lead in performance, 3D deep learning continues to evolve rapidly, with new architectures like graph transformers and diffusion-based models on the horizon. FOR-species20K provides the benchmark needed to test these innovations under consistent, reproducible conditions.
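Cross-sensor generalization of this kind can be checked with a hold-one-platform-out evaluation. The sketch below is a hedged illustration and not the benchmark's actual protocol: the record fields "platform" and "species", the function names, and the predict callable are all hypothetical.

```python
from collections import defaultdict


def split_by_platform(samples, held_out):
    """Train on every platform except `held_out`; test only on `held_out`.

    samples: list of dicts with (assumed) keys "platform" and "species".
    """
    train = [s for s in samples if s["platform"] != held_out]
    test = [s for s in samples if s["platform"] == held_out]
    return train, test


def accuracy_by_platform(samples, predict):
    """Return the fraction of correct predictions for each platform.

    predict: any callable mapping a sample dict to a predicted species label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s["platform"]] += 1
        correct[s["platform"]] += int(predict(s) == s["species"])
    return {platform: correct[platform] / total[platform] for platform in total}
```

Reporting accuracy separately for TLS, MLS, and ULS, rather than as one pooled number, makes it visible when a model that looks strong overall is actually weak on the platform it never saw during training.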
One of the most important contributions of FOR-species20K lies not in its size but in its openness. The dataset, accompanying code, and benchmark leaderboards have all been released publicly, inviting researchers worldwide to participate, reproduce results, and build upon the foundation. This commitment to open science addresses a long-standing challenge in forest AI: the lack of shared reference datasets. Without common baselines, comparing algorithms across studies is nearly impossible. By making both data and evaluation protocols openly available, FOR-species20K fosters reproducibility, transparency, and innovation.
The benchmark also encourages cross-lab collaboration. Researchers can submit new models to the public leaderboard, compare performance, and refine approaches collectively, mirroring the collaborative progress seen in computer vision and other AI domains. Such openness not only accelerates technical development but also democratizes access to high-quality forest data, enabling participation from institutions and regions with limited resources.
Ultimately, this initiative lays the groundwork for cross-sensor and cross-region classifiers, capable of identifying tree species anywhere, from managed European forests to tropical or boreal ecosystems. As global forest monitoring becomes increasingly automated, such models could play a pivotal role in carbon verification, biodiversity tracking, and sustainable land-use planning.
The FOR-species20K initiative marks a turning point in the quest for automated, AI-driven forest monitoring. By assembling an unprecedented dataset of individual trees, standardizing data preparation, and establishing open benchmarks for model evaluation, it transforms a fragmented research landscape into a shared platform for progress.
This text is a summary of the following paper:
Puliti, Stefano, et al. “Benchmarking tree species classification from proximally sensed laser scanning data: Introducing the FOR‐species20K dataset.” Methods in Ecology and Evolution 16.4 (2025): 801-818. https://doi.org/10.1111/2041-210X.14503
Text is authored by Henry Cerbone – Department of Biology – University of Oxford