Merging statistics and geospatial information, 2020 projects - the Netherlands

DeepGeoStat; 2020 project; final report 9 January 2023

This article forms part of Eurostat’s statistical report on the Integration of statistical and geospatial information.

Problem

DeepGeoStat explored the potential of analysing Earth observation (EO) image data.

Objectives

To research new methodologies – using convolutional deep learning (CDL) on EO data – that can automatically generate official statistics or make the process of creating official statistics more efficient.

To make proven technology available to end users within the statistical office by way of a user-friendly tool.

Method

Analysing image data may be referred to as object classification, with two main approaches.

Computer vision – the process of carefully handcrafting features to perform the job.
Supervised machine learning – a more data-centric approach where algorithms learn features automatically based on training examples. A training example has two components: the input image and an output label (describing what is in the image).

Deep learning works best with large numbers of training examples. National statistical offices:

have huge amounts of register and administrative data that can be used to create labelled data automatically, for example by labelling aerial images based on geospatial administrative data;
employ many domain experts who can create new training data through so-called annotations.
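
The idea of creating labels automatically from administrative data can be illustrated with a minimal sketch. It assumes that each image corresponds to one grid cell and that a register can be reduced to a lookup from grid cell identifier to class; the function name, the identifiers and the register contents are hypothetical and are not taken from the DeepGeoStat tool.

```python
def label_cells_from_register(cell_ids, register):
    """Pair each grid cell with the class recorded for it in a register.

    cell_ids: iterable of grid cell identifiers (hypothetical format).
    register: dict mapping grid cell identifier -> class label.
    Cells without a register record are skipped rather than guessed.
    """
    labelled = []
    for cell_id in cell_ids:
        label = register.get(cell_id)
        if label is not None:
            labelled.append((cell_id, label))
    return labelled


# Hypothetical register extract: grid cell identifier -> land use class.
register = {
    "E1550N4630": "woodland",
    "E1555N4630": "building site",
}
cells = ["E1550N4630", "E1555N4630", "E1560N4630"]

labelled = label_cells_from_register(cells, register)
unlabelled = [c for c in cells if c not in register]
```

Cells left in `unlabelled` are the natural candidates for manual annotation by domain experts.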

In DeepGeoStat, state-of-the-art deep learning networks were used for image classification. Labels were created automatically from administrative data, and new ways were developed for domain experts to annotate images efficiently (human-machine interaction). DeepGeoStat enforced a stringent workflow for users with four steps:

setting up the project, such as explicitly describing the statistical concept(s) to be modelled,
collecting the data (EO data and labels),
inspecting the combined dataset to be fed into the CDL algorithm,
setting up and running the experiments and then exporting results for further processing by researchers.

The user chooses the statistical concepts to be modelled and the methodology, either direct classification or change detection.

Direct classification uses a supervised convolutional deep learning network to model (in other words, predict) selected classes in images. The user gives the target classes (or labels), such as woodlands, other agricultural usage or building sites as land use classes.
For change detection a supervised Siamese convolutional deep learning network is used, which takes two different images of the same piece of land (for subsequent years) as input. The Siamese network is designed to detect changes. The labels assigned are changed or unchanged.
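
The structure of a Siamese comparison can be sketched as follows. This is a toy illustration only: it is not a convolutional network, nothing is trained, and the weights, threshold and pixel values are hypothetical. It shows only the structural idea: one shared encoder is applied to both images, and the distance between the two embeddings determines the label.

```python
import math


def encode(image, weights):
    """Shared encoder: one weighted sum of pixel values per output feature.

    image: flat list of pixel intensities.
    weights: list of rows, one row of weights per output feature.
    """
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def euclidean(a, b):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def detect_change(image_t0, image_t1, weights, threshold):
    """Label an image pair as 'changed' or 'unchanged'.

    The same weights encode both images; this weight sharing is what
    makes the arrangement Siamese.
    """
    distance = euclidean(encode(image_t0, weights), encode(image_t1, weights))
    label = "changed" if distance > threshold else "unchanged"
    return label, distance


# Hypothetical fixed weights; in a real network these would be learned.
weights = [[0.25, 0.25, 0.25, 0.25], [0.5, -0.5, 0.5, -0.5]]

cell_2015 = [0.2, 0.2, 0.2, 0.2]
cell_2017_same = [0.2, 0.2, 0.2, 0.2]
cell_2017_built = [0.9, 0.1, 0.9, 0.1]

label_same, _ = detect_change(cell_2015, cell_2017_same, weights, threshold=0.1)
label_built, _ = detect_change(cell_2015, cell_2017_built, weights, threshold=0.1)
```

In a supervised setting, the encoder weights would be learned from image pairs labelled as changed or unchanged, rather than fixed by hand as here.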

Two types of data are chosen: (i) aerial or satellite image data (EO data) and (ii) evidence of what label or class the image (or part of it) contains.

Small pieces of land should be analysed, as the deep learning network expects fixed-size images. For the Netherlands, grid cells of 100 m x 100 m and 500 m x 500 m exist. The tool is currently loaded with images that are 500 m x 500 m. Alternatively, images can be loaded that have been masked to show only polygons meeting certain filter criteria, for example based on land use.
The evidence for the labels can come either from classifications loaded from files (such as registers or administrative data) or from annotations made by human observers. Manual annotations are time consuming, so the annotation tool should be intuitive and fast.
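
How a location maps onto fixed-size grid cells can be sketched as below. The sketch assumes projected coordinates in metres and cells aligned to multiples of the cell size; the function names and the example point are hypothetical, and the actual grid definition used in the tool may differ.

```python
def grid_cell(x, y, cell_size=500):
    """Return the lower-left corner (in metres) of the cell containing (x, y)."""
    return (int(x // cell_size) * cell_size, int(y // cell_size) * cell_size)


def fine_cells_in(coarse_origin, coarse_size=500, fine_size=100):
    """List the lower-left corners of the fine cells inside one coarse cell."""
    x0, y0 = coarse_origin
    steps = coarse_size // fine_size
    return [
        (x0 + i * fine_size, y0 + j * fine_size)
        for j in range(steps)
        for i in range(steps)
    ]


# Hypothetical point in projected coordinates (metres).
point = (155_230.0, 463_780.0)

cell_500 = grid_cell(*point, cell_size=500)
cell_100 = grid_cell(*point, cell_size=100)
fine = fine_cells_in(cell_500)
```

Under these assumptions each 500 m x 500 m cell contains 25 cells of 100 m x 100 m, so evidence recorded at the finer level can be attached to the coarser image by its cell origin.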

Having collected all relevant images and evidence, the goal of the consolidation and inspection step is to transform the evidence into final labels.

By using all evidence, the aim is to minimise errors. The combined data, in other words images with labels, makes up the training data for deep learning.
The training examples may be inspected. If a structural flaw is detected, the evidence should be revised.
Summary statistics are provided to users, such as the number of training examples per label. If the labels are judged to be unbalanced, some labels may be under-sampled or more images may be annotated to correct the imbalance.
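
Consolidation and balancing can be sketched as follows. The majority vote is one hypothetical consolidation strategy, chosen here for illustration and not necessarily the one shown in Figure 1. The under-sampling reduces every label at random to the size of the rarest one. All names and data are hypothetical.

```python
import random
from collections import Counter


def consolidate(evidence):
    """Majority vote over the evidence labels for one image.

    Returns None when there is no evidence or the top two labels tie,
    so that the image can be flagged for inspection.
    """
    if not evidence:
        return None
    counts = Counter(evidence).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def undersample(examples, seed=0):
    """Randomly reduce every label to the size of the rarest label.

    examples: list of (image_id, label) pairs.
    """
    by_label = {}
    for image_id, label in examples:
        by_label.setdefault(label, []).append((image_id, label))
    if not by_label:
        return []
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    return balanced


final_label = consolidate(["woodland", "woodland", "building site"])
tied_label = consolidate(["woodland", "building site"])

examples = [(i, "unchanged") for i in range(8)] + [(i, "changed") for i in range(8, 10)]
before = Counter(label for _, label in examples)
after = Counter(label for _, label in undersample(examples))
```

Comparing `before` and `after` gives the kind of per-label count that lets a user judge whether the training data are balanced.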

Figure 1: Visualisation of one example consolidation strategy

In the experiment step, users choose and configure the deep learning model.

Pre-trained networks that have performed well in academic experiments are provided. Some basic hyper-parameter settings (that may be modified) are provided, such as the learning rate, batch size, number of epochs and early stopping.
During training, real-time feedback on the training process is provided to users so that they can determine whether the model is learning without overfitting the data. This feedback includes loss and accuracy, which indicate whether or not the network is making progress. The final test of the model is a confusion matrix computed on an independent dataset, either sampled by the tool (and taken out of the training data) or provided by the user.
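
Two of the elements mentioned above, early stopping and the confusion matrix, can be sketched in a few lines. The patience-based stopping rule is a common convention and is an assumption here; the source does not describe which rule the tool applies. The loss values and labels are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


def confusion_matrix(true_labels, predicted_labels, classes):
    """Count predictions per (true class, predicted class) pair."""
    matrix = {t: {p: 0 for p in classes} for t in classes}
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Invented validation losses: improvement stalls after the third epoch.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch = early_stopping_epoch(val_losses, patience=3)

# Invented labels for an independent test set.
true = ["changed", "changed", "unchanged", "unchanged", "unchanged"]
pred = ["changed", "unchanged", "unchanged", "unchanged", "changed"]
matrix = confusion_matrix(true, pred, classes=["changed", "unchanged"])
correct = sum(matrix[c][c] for c in matrix)
accuracy = correct / len(true)
```

A validation loss that rises while the training loss keeps falling is the usual sign of overfitting, which is what the stopping rule guards against.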

Figure 2: The experiments step in DeepGeoStat, showing real-time learning results

If training is considered to have been successful and completed, the model can make predictions for each grid cell in the Netherlands and the resulting data can be exported. The Python file that includes the deep learning code (a script based on the training data), the methodology and the settings can be downloaded.

Results

The aerial images for the Netherlands from 2015, 2017 and 2020 in RGB format are made available in the tool in 500 m x 500 m grid cells. Satellite images, more years of aerial images, more detailed cells and other spectra are available. The regional divisions (provinces, cities, districts and neighbourhoods) and the ‘Bestand Bodem Gebruik’ (surface usage) for 2015 and 2017 were also loaded.

Three tests were performed: poverty, land use and solar panels.

The income level (poor or not poor) of a grid cell is predicted based on income data of Dutch households. The first step was to create a dataset of images for 2017 classified as residential (according to the surface usage). The poverty labels – computed based on administrative income data – were loaded. The experimental data were balanced to have an equal frequency of poor and non-poor grid cells in residential areas. The network was trained, and the outcome checked. This test showed that the tool could use administrative data.
Changes in land use between two years were detected using a Siamese convolutional neural network. Several projects were created in the tool, one for each type of surface usage studied: building site, woodland and other agricultural usage.
- Datasets for 2015 to 2017 were created for training and included images labelled as building sites in 2015. The images were masked to show only building site polygons. Data were labelled based on the existing surface usage maps of 2015 and 2017 using the register option. The experiment data were balanced by under-sampling the majority class. The best performing algorithm configuration for the building site class was evaluated.
- The network was trained, and the model applied to a dataset for the years 2017 to 2020 to predict the building site changes for 2020. This test showed that the tool could be used to detect changes (rather than to classify).
Grid cells were classified according to whether or not they contained solar panels. This test used 10 m x 10 m aerial images and had many manual annotations. A dataset containing RGB images for 2017 was created for the province of Limburg. The aerial images were aggregated to 500 m x 500 m images and the annotations combined into five classes (none, few, some, many and most solar panels). These data were loaded as labels. The network was trained, and the outcomes checked. The purpose of this test was to load and use the annotations to train the network.
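
The aggregation of fine-grained annotations into five classes, as in the solar panel test, can be sketched as follows. The source does not state how the class boundaries were defined, so the thresholds below are purely hypothetical, as are the function name and the example data. The sketch assumes one yes/no annotation per 10 m x 10 m image, giving 2 500 annotations per 500 m x 500 m cell.

```python
def panel_class(fine_cell_flags, thresholds=(0.0, 0.05, 0.20, 0.50)):
    """Map the share of annotated fine cells to one of five classes.

    fine_cell_flags: iterable of booleans, True where a fine cell was
        annotated as containing solar panels.
    thresholds: hypothetical upper bounds on the share for the classes
        'none', 'few', 'some' and 'many'; anything above is 'most'.
    """
    flags = list(fine_cell_flags)
    if not flags:
        raise ValueError("at least one annotation is required")
    share = sum(flags) / len(flags)
    for name, upper in zip(("none", "few", "some", "many"), thresholds):
        if share <= upper:
            return name
    return "most"


# A 500 m cell holds 50 x 50 = 2500 fine cells of 10 m.
empty_cell = [False] * 2500
sparse_cell = [True] * 100 + [False] * 2400
dense_cell = [True] * 2000 + [False] * 500

classes = [panel_class(c) for c in (empty_cell, sparse_cell, dense_cell)]
```

Any monotone mapping from share to class would serve the same purpose; the point is that many fine annotations collapse into one label per training image.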

A further result of the work was the organisation of a satellite event on Earth observation and machine learning at the 2023 edition of the conference on new techniques and technologies for statistics (NTTS). NTTS is an international biennial scientific conference series, organised by Eurostat. A keynote presentation was given on Earth observation data high-performance computing for automated continental-scale mapping. This was followed by several presentations and discussions focused either on land use and land cover or on Sustainable Development Goals.