Statistics Explained

Merging statistics and geospatial information, 2020 projects - Austria


Earth observation (EO) for land cover statistics; 2020 project; final report July 2022



This article forms part of Eurostat’s statistical report on the Integration of statistical and geospatial information.


Problem

Investigate the use of Earth observation (EO) data within statistical production processes.

Objectives

Integrate Earth observation (EO) data from the European Space Agency (ESA) Copernicus programme into the statistical production process for use in agriculture, forestry and environmental statistics.

Develop a methodology for land cover classification to gain spatially explicit data on different categories, for example woodland, grassland and cropland.

Method

A ground truth dataset was generated by combining three datasets:

  • The digital cadastral map (DKM) from the Federal Office of Metrology and Surveying (BEV) providing polygons with information on land use. Data were for 2021.
  • Land-parcel identification system (InVeKoS/LPIS) / integrated administration and control system (IACS) from the Agrarmarkt Austria (AMA) providing polygons with information on land cover for each parcel. Data were for 2019.
  • A forest map for Austria from Geoville, providing information on forest types. Data were for 2020.

Figure 1: Workflow of the Statistics Austria Earth observation for land cover statistics project

The data from all three datasets were combined.

  • Data from the Agrarmarkt Austria and from Geoville were preferred when available; when not, the digital cadastral map data were used. Information from these datasets cannot be assumed to be always correct; furthermore, the Geoville data have a coarse resolution.
  • The land cover attributes were aggregated to 10 classes (such as cropland, forest and urban). Some detailed classes covering very small proportions of land, such as gardens and quarrying areas, were dropped because they were difficult to allocate to a broad class.
  • Eight representative test municipalities were selected, each representing one of the eight agricultural production areas of Austria.
  • Within each municipality, 1 400 points were randomly generated. In addition, at least 40 randomly distributed points were created within each municipality for each land cover class. In total, there were 1 600 to 1 800 points per municipality.
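
The point sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: the function name `sample_points`, the use of a class raster as input, the `nodata` value and the choice of exactly 40 additional points per class (the text says "at least 40") are all hypothetical.

```python
import numpy as np

def sample_points(class_raster, n_random=1400, n_per_class=40, nodata=-1, seed=0):
    """Draw random pixels, then add extra pixels for every class present."""
    rng = np.random.default_rng(seed)
    flat = class_raster.ravel()
    valid = np.flatnonzero(flat != nodata)

    # Step 1: simple random sample over all valid pixels of the municipality.
    random_idx = rng.choice(valid, size=min(n_random, valid.size), replace=False)

    # Step 2: additional points per class, drawn from pixels not yet selected.
    extra = []
    for cls in np.unique(flat[valid]):
        candidates = np.setdiff1d(np.flatnonzero(flat == cls), random_idx)
        take = min(n_per_class, candidates.size)
        if take > 0:
            extra.append(rng.choice(candidates, size=take, replace=False))

    idx = np.concatenate([random_idx, *extra])
    rows, cols = np.unravel_index(idx, class_raster.shape)
    return rows, cols, flat[idx]

# Synthetic municipality: 200 x 200 pixels, 10 land cover classes (assumed codes 0-9).
demo_raster = np.random.default_rng(42).integers(0, 10, size=(200, 200))
rows, cols, labels = sample_points(demo_raster)
```

With all 10 classes present this yields 1 400 + 10 × 40 = 1 800 points; municipalities in which some classes are absent end up with fewer, which is consistent with the reported range of 1 600 to 1 800 points.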

Satellite data from March to October 2019 were used.

  • Level 2 scenes from Sentinel-2 with cloud cover not exceeding 20 % were downloaded from the Copernicus Open Access Hub (SciHub). Cloud-covered pixels were masked out using the cloud layer. No imputation was performed at the masked locations, so learners with built-in handling of missing data were needed.
  • Monthly composites from Sentinel-1 scenes were provided by the project partner, the Earth Observation Data Centre (EODC).
  • Eight datasets were compiled: normalised difference vegetation index (NDVI); normalised difference water index (NDWI); tasseled cap greenness; tasseled cap wetness; tasseled cap brightness; fractional vegetation cover (FCOVER); leaf area index (LAI); and cover fraction (FVC).
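
As an illustration of how two of these datasets can be derived while keeping cloud-masked pixels as missing values, the sketch below computes NDVI and NDWI from reflectance arrays. It is not the project's code. The function names are hypothetical, and the NDWI variant shown (green and near-infrared bands) is an assumption, since the report does not state which formulation was used. For Sentinel-2, green, red and near-infrared correspond to bands B03, B04 and B08.

```python
import numpy as np

def normalised_difference(a, b):
    """Return (a - b) / (a + b), with NaN where the denominator is zero."""
    a = a.astype("float64")
    b = b.astype("float64")
    denom = a + b
    with np.errstate(divide="ignore", invalid="ignore"):
        out = (a - b) / denom
    out[denom == 0] = np.nan
    return out

def indices_with_cloud_mask(green, red, nir, cloud_mask):
    """Compute NDVI and NDWI; cloud-covered pixels stay missing (no imputation)."""
    ndvi = normalised_difference(nir, red)
    ndwi = normalised_difference(green, nir)  # assumed variant: green vs NIR
    ndvi[cloud_mask] = np.nan
    ndwi[cloud_mask] = np.nan
    return ndvi, ndwi

# Tiny example: two pixels, the second one flagged as cloud.
green = np.array([[0.10, 0.20]])
red = np.array([[0.10, 0.05]])
nir = np.array([[0.30, 0.45]])
cloud = np.array([[False, True]])
ndvi, ndwi = indices_with_cloud_mask(green, red, nir, cloud)
```

The first pixel gets NDVI = (0.30 − 0.10) / (0.30 + 0.10) = 0.5 and NDWI = −0.5; the cloud-covered pixel remains NaN and is left for the learner to handle.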

For the randomly generated points in each municipality, the data from the ground truth datasets and the satellite data were combined. These data were used for model training.

  • Two training workflows were implemented: Random Forest (RF) in R and Histogram-based Gradient Boosting (HGB) in Python.
  • A land cover map was created from the models.

The RF method resulted in slightly more stable models than the HGB procedure, whereas the HGB method resulted in a higher classification accuracy in every municipality tested. Confusion matrices (showing classification percentages) were compiled for the 10 classes in the eight municipalities. The heathland and shrubs class as well as the sparsely vegetated land class had the lowest classification accuracies.

  • The former is sometimes difficult to separate from forest.
  • The latter is difficult to separate from grassland and from cropland.
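
A confusion matrix with percentages can be computed as in the sketch below. It assumes that percentages are taken per reference class (each row sums to 100 %); the report does not state the normalisation used, and the function name and example labels are hypothetical.

```python
import numpy as np

def confusion_percentages(y_true, y_pred, n_classes):
    """Rows: reference class; columns: predicted class; values in percent per row."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)
    row_totals = counts.sum(axis=1, keepdims=True)
    pct = np.divide(
        100.0 * counts,
        row_totals,
        out=np.zeros(counts.shape, dtype=float),
        where=row_totals > 0,  # classes without reference points stay at 0
    )
    return counts, pct

# Illustrative labels for three classes (not project data).
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 2, 1, 0])

counts, pct = confusion_percentages(y_true, y_pred, n_classes=3)
per_class_accuracy = np.diag(pct)                       # percent correct per class
overall_accuracy = 100.0 * np.trace(counts) / counts.sum()
```

The diagonal of `pct` gives the share of correctly classified points per class, which is the kind of figure in which classes such as heathland and shrubs would stand out with low values.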

Each model consisted of several hundred highly correlated variables. The importance of variables within each model was analysed, based on eight groups of variables. The results varied between municipalities, but classification accuracy was generally highest with four or five groups; the overall improvement from reducing the number of groups was nevertheless small.

Classification can be improved through post-processing, for example with smoothing filters or with reclassification based on the probability with which a pixel is assigned to a class. Pixels classified with a high degree of uncertainty can thus be reclassified depending on their neighbourhood. Uncertainty is typically high at the edges of classes and at transitions between classes, which indicates that mixed pixels (in which two or more land cover types are mapped) are responsible. A reclassification was carried out on the results of the HGB classification using a bilateral smoothing filter. While this can achieve a higher accuracy, one disadvantage is the loss of narrow structures such as roads.
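
The idea of reclassifying uncertain pixels from their neighbourhood can be sketched as follows. Note that this example deliberately uses a plain 3 × 3 mean of the class probabilities, not the bilateral filter applied in the project; the function name, the window size and the confidence threshold of 0.6 are assumptions made for illustration.

```python
import numpy as np

def reclassify_uncertain(proba, threshold=0.6):
    """proba has shape (rows, cols, n_classes). Pixels whose highest class
    probability is below `threshold` take the class with the highest mean
    probability in their 3 x 3 neighbourhood."""
    labels = proba.argmax(axis=2)
    confidence = proba.max(axis=2)

    # 3 x 3 mean of the probabilities, with edge pixels replicated at the border.
    n_rows, n_cols, _ = proba.shape
    padded = np.pad(proba, ((1, 1), (1, 1), (0, 0)), mode="edge")
    smoothed = np.zeros_like(proba, dtype=float)
    for dr in range(3):
        for dc in range(3):
            smoothed += padded[dr:dr + n_rows, dc:dc + n_cols, :]
    smoothed /= 9.0

    uncertain = confidence < threshold
    out = labels.copy()
    out[uncertain] = smoothed.argmax(axis=2)[uncertain]
    return out, uncertain

# 5 x 5 example with two classes: class 0 everywhere with probability 0.9.
demo = np.zeros((5, 5, 2))
demo[..., 0], demo[..., 1] = 0.9, 0.1
demo[2, 2] = [0.45, 0.55]  # uncertain pixel, initially labelled class 1
demo[0, 0] = [0.10, 0.90]  # confident class 1 pixel, must be left unchanged

new_labels, uncertain = reclassify_uncertain(demo)
```

Only the uncertain centre pixel changes class here. Lowering the confidence of isolated or narrow features would cause them to be absorbed by their surroundings, which mirrors the loss of narrow structures such as roads noted above.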

Figure 2: Top: original classification with Histogram-based Gradient Boosting (HGB). Bottom: smoothed results based on a bilateral smoothing function

Results

A prototype was successfully set up for a land cover classification, integrating EO data from the ESA Copernicus programme. In the future, this can be integrated into the statistical production process for use in agriculture, forestry and environmental statistics.

  • The target of 80 % classification accuracy was achieved, especially for the cropland, woodland and grassland classes.
  • The selected HGB method proved suitable for land cover classifications and outperformed RF.
  • The workflow implemented in Python consumed less RAM and CPU and was more efficient than the workflow implemented in R.
