Statistics Explained

Archive:European business statistics manual - microdata for researchers

This Statistics Explained article is outdated and has been archived. For updated information, please see the dynamic version of the European Business Statistics Manual. A static full version of the manual was published in February 2021: European Business Statistics Manual — 2021 edition.

This article describes the legal conditions at EU level for releasing and accessing record-level data and the methods used to prepare microdata files for researchers. Its focus is European Statistical System (ESS) microdata released by Eurostat.

Since it is part of the European Business Statistics Manual, the article is concerned only with business-related data. However, the conditions of access to microdata at EU level are identical for social and business data.


European Statistical System and European statistics

This article focuses on access to European Statistical System (ESS) microdata released by Eurostat. However, the main concepts and principles dealt with apply to any data set made available for research purposes by a statistical office.

Regulation (EC) No 223/2009 on European statistics states as follows:

  • The European Statistical System is the partnership between Eurostat and the national statistical institutes (NSIs) and other national authorities (ONAs) responsible for developing, compiling and disseminating European statistics in each EU country. NSIs and ONAs are often known as national statistical authorities (NSAs), a nomenclature we shall use in this article. These authorities are listed on Eurostat’s website [1].
  • European statistics are those required for the European Union to carry out its various activities. They are determined in the European statistical programmes [2].
These statistics are sent to Eurostat in accordance with the regulations governing their subject matter. If national statistical authorities (NSAs) transmit data in the form of microdata, and if they agree to it, Eurostat can provide access to these data for scientific purposes.

Confidential data versus microdata

Definition: confidential data are data that reveal the contribution of individual statistical units (individual persons, households or business entities). One of the fundamental principles of the European Statistical System (ESS) is the obligation of the national statistical authorities (NSAs) and Eurostat to protect confidential data. Confidential data cannot be published.

The methods used to identify and protect confidential data in the different forms of statistical output are called methods of statistical disclosure control (SDC). SDC methods establish the criteria necessary to judge whether or not figures in an output (e.g. cells in a table, records in microdata files, the regression coefficient in a model) are confidential.

If the figures are non-confidential they may be published. If they are confidential, on the other hand, they must be treated appropriately: kept secret, rounded up or down, or aggregated, for instance. Statistical disclosure control (SDC) methods offer a range of techniques designed to ensure optimum protection of confidential figures without forfeiting too much information. A separate chapter of the EBS manual outlines the SDC methods applied to the various types of statistical outputs.

Microdata are sets of records (lines in the file) containing information on statistical units: individual persons, households or business entities. Each record line represents information about respondents and/or statistical units.

Records may be readily identifiable when they contain unique direct identifiers, such as a person’s name, address, social security number or ID number. These confidential records with direct identifiers are available to statistical institutes only, under strict confidentiality protocols. Microdata with direct identifiers (especially a unique ID number) are increasingly important for producing official statistics, as they enable data collected from different sources to be linked. This encourages the use of administrative and similar sources and the derivation of further results from existing data. In longitudinal files, direct identifiers allow individuals to be followed over time. Direct identifiers are often replaced by pseudo codes, which are less identifying but equally effective for following individuals over time and across data collections.

Even microdata without direct identifiers are confidential, as a combination of rare characteristics may enable unique statistical units to be identified (see below). Such microdata are invaluable to the research community, as only they allow deep analysis of relationships in the data, such as causalities, dependencies or convergences.

The conditions of access to microdata are normally set out in legal acts. In the ESS, access to microdata is limited to statistical analysis for scientific purposes. The precise conditions are laid down in a Commission Regulation. In parallel, there are national access systems governed by national statistical institutes. The next section describes the conditions of access to ESS microdata released by Eurostat.

Access to ESS microdata released by Eurostat

Data collections available as microdata files for scientific purposes

As stated above, access to confidential microdata for scientific purposes at European level may be considered for those data sets for which Eurostat receives data at individual or micro level. In 2018 Eurostat granted access to microdata for a wide range of data collections. Four of these are business data collections (indicated by the initials ‘BS’):

  1. Structure of Earnings Survey (SES), BS
  2. Community Innovation Survey (CIS), BS
  3. Continuing Vocational Training Survey (CVTS), BS
  4. Micro-Moments Dataset (MMD) — Linked micro-aggregated data on ICT usage, innovation and economic performance in enterprises, BS
  5. European Community Household Panel (ECHP)
  6. European Union Statistics on Income and Living Conditions (EU-SILC)
  7. Labour Force Survey (LFS)
  8. Adult Education Survey (AES)
  9. European Road Freight Transport Survey (ERFT)
  10. European Health Interview Survey (EHIS)
  11. Community Statistics on the Information Society (CSIS)
  12. Household Budget Survey (HBS).

An up-to-date list and descriptions of the microdata collections available may be found on the Eurostat website.

Criteria for eligible research entities and research proposals

The legal basis for access to ESS microdata is Commission Regulation (EU) No 557/2013 on access to confidential data for scientific purposes [3]. It defines the criteria for eligible research entities and research proposals. It also describes how microdata are to be made available to researchers (modes of access).

Researchers wishing to access microdata must follow a two-step procedure [4]:

  1. Eurostat accredits the organisation to which the researcher is affiliated as a research entity;
  2. A research proposal describing the scientific project and explaining the need for access to confidential data is submitted to Eurostat.

1. Recognition as a research entity

The recognition of research entities (step 1) involves identifying organisations (or particular departments thereof) that conduct research and can be trusted with confidential data. Applicants must meet the following criteria:

  • The statement of the body’s purpose, its mission statement or its articles of association must mention research.
  • The entity must be able to show a record of quality research, such as a list of scientific publications and research projects. Its research results must be in the public domain.
  • It must formulate its scientific conclusions as an independent body.
  • It must have adequate security safeguards.

Eurostat assesses each application on its merits. If a research entity is granted accreditation, its director must sign a commitment that the microdata will be used in accordance with the terms agreed and protected by the organisation’s researchers.

Eurostat publishes the list of recognised research entities on its website [5]. In 2018, the list comprised some 750 organisations both within and outside the EU.

2. Submitting a research proposal

To have access to microdata, researchers affiliated with the recognised research entities must submit a research proposal to Eurostat.

Proposals must describe:

  • the research project for which the microdata are to be used
  • the data and variables to be used
  • the statistical methods to be applied to the data
  • why the project requires access to microdata
  • how the research results are to be published
  • how data security will be guaranteed.

To be considered eligible, the research proposal must specify in sufficient detail:

  • the scientific purpose of the research
  • why the microdata are needed
  • the expected outcomes of the research.

Research outcomes must be made public. All researchers named in the research proposal as potential microdata users must sign an individual confidentiality declaration, in which they must undertake to abide by the specific terms governing the use of confidential data.

The research proposal is examined internally, in cooperation with the Eurostat managers responsible for the data requested and the national statistical authorities that provided them. If a national statistical authority denies access, the data of the country concerned are removed from the relevant microdata file.

If the research proposal is accepted, researchers are given access to the relevant data. They may access the data for the period specified in the research proposal. If requested, new releases of the approved microdata sets are sent to researchers in the course of the project (for up to 5 years).

At the end of the access period, the researchers send Eurostat the resultant publications and destroy any confidential data they have received [6].

Eurostat has published on its website ‘Self-study material for the users of Eurostat microdata sets’, which guides researchers through the applicable procedures and explains how to become a ‘safe researcher’. This material aims to make researchers aware of their responsibilities when they are entrusted with confidential data.

ESS microdata are available to researchers in two formats:

  • Secure use files
  • Scientific use files

Secure use files: access in Eurostat’s safe centre (Luxembourg)

Secure use files contain data on individual statistical units. Usual practice is to remove only direct identifiers, and data are cleaned but not further anonymised.

It is still possible to identify statistical units (companies or persons), and secure use files are considered confidential files. Respondents can be identified by combining basic characteristics or variables. For instance, a company with more than 1000 employees in a NUTS 3-level region is easily recognisable, even if not named.
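The identification risk described above can be illustrated with a minimal Python sketch that counts how often each combination of indirect identifiers occurs in a file. The field names (`nuts3`, `nace`, `size_class`) and the records are invented for illustration and do not come from any ESS data collection.

```python
from collections import Counter

# Hypothetical indirect identifiers (field names are assumptions)
QUASI_IDENTIFIERS = ("nuts3", "nace", "size_class")

# Invented records without any direct identifier
records = [
    {"nuts3": "LU000", "nace": "C24", "size_class": "1000+"},
    {"nuts3": "LU000", "nace": "G47", "size_class": "10-49"},
    {"nuts3": "LU000", "nace": "G47", "size_class": "10-49"},
    {"nuts3": "LU000", "nace": "G47", "size_class": "10-49"},
]


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return the combinations of indirect identifiers shared by exactly one record."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


# The single large company stands out although it is not named
at_risk = unique_combinations(records)
```

A record whose combination appears only once is a candidate for identification by anyone who knows the region, activity and size of the company concerned.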

ESS secure use files

ESS secure use files can only be accessed in Eurostat’s safe centre. Researchers may analyse data there, but nothing can be taken out of the room. Researchers are isolated from ‘the rest of the world’; they cannot, for instance, use the internet or download the data they are working on. They are obliged to work in a dedicated room equipped with a standalone PC.

The following ESS data collections are available as secure use files in Eurostat’s safe centre:

  • Structure of Earnings Survey (SES)
  • Community Innovation Survey (CIS)
  • Micro-Moments Dataset (MMD) — linked micro-aggregated data on ICT usage, innovation and economic performance in enterprises.

Output checking

After working at the safe centre, researchers place the results of their research in the output folder. They must ensure that the output is safe. In addition, the results are checked for confidentiality by the Eurostat data manager. This is known as output checking. The aim is to check that the results contain no confidential data. Safe output is emailed to the researchers concerned.

The general rules on this are set out in the Guidelines on Output Checking. They differentiate between safe output (e.g. regression coefficients) and unsafe output (e.g. tables) and suggest appropriate techniques to check whether or not results are confidential.
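As a rough illustration of how a table might be checked, the sketch below flags cells that rest on too few units or are dominated by a single unit. The threshold of 10 units and the 50 % dominance limit are assumptions chosen for the example; they are not values taken from the Guidelines on Output Checking, and the function name is hypothetical.

```python
MIN_UNITS = 10   # assumed minimum number of units per cell
MAX_SHARE = 0.5  # assumed maximum share of the largest contributor


def unsafe_cells(cells, min_units=MIN_UNITS, max_share=MAX_SHARE):
    """Map each problematic cell label to the reasons it was flagged.

    `cells` maps a cell label to the list of non-negative contributions
    of the individual units behind that cell.
    """
    flagged = {}
    for label, contributions in cells.items():
        reasons = []
        if len(contributions) < min_units:
            reasons.append("too few units")
        total = sum(contributions)
        if total > 0 and max(contributions) / total > max_share:
            reasons.append("dominant unit")
        if reasons:
            flagged[label] = reasons
    return flagged


# Invented table: turnover by activity and region
table = {
    "C24 / region A": [5.0] * 12,
    "C24 / region B": [100.0, 2.0, 3.0],
    "G47 / region A": [900.0] + [10.0] * 11,
}
report = unsafe_cells(table)
```

In this example only the first cell would pass; the other two would need to be suppressed or aggregated before release.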

The rules on output checking also depend on the characteristics and sensitivity of the domain. For example, there are specific safe centre rules at ESS level for the Community Innovation Survey (CIS) [7]. These rules, which were established by representatives of national statistical authorities, include requirements and criteria for safe output produced on the basis of the CIS secure use files.

Remote access

Some statistical institutes in Europe offer access to national secure use files in remote mode [8]. This enables researchers to work on such files without travelling to a safe centre. The key principle of remote access is that secure use files remain in a controlled environment in one place, while the researcher connects from elsewhere. The researcher’s identity can be checked remotely using specific (e.g. biometric) tools. The remote connection enables a researcher to run statistical packages/programs on a server at a distant location. There are two basic types of remote access:

  • ‘real’ remote access where the researcher can see the microdata and work directly on the files
  • remote execution where the researcher cannot see the data but submits codes and routines that are processed on the data by the system.

The remote execution system checks the input codes (unauthorised tasks are blocked) and the output data (confidentiality on the fly [9]).

Eurostat aims to provide remote access to ESS microdata. The first step in this direction is the DARA project (see below).

Decentralised And Remote Access (DARA) to ESS secure use files

Eurostat is in the process of developing a system that will enable researchers rated eligible through the procedure described above (see ‘Criteria for eligible research entities and research proposals’) to work on secure use files at accredited safe centres in national statistical institutes.

Scientific use files (SUFs)

‘Scientific use files means confidential data for scientific purposes to which methods of statistical disclosure control have been applied to reduce (to an appropriate level and in accordance with current best practice) the risk of identification of the statistical unit’ (definition from Commission Regulation (EU) No 557/2013). Scientific use files (SUFs) are more heavily anonymised than secure use files. This entails not only removing direct identifiers, but also grouping together, rounding, swapping or eliminating certain categories of variables. While it is still possible to identify statistical units in SUFs, it is less probable. It may nonetheless be possible to identify a statistical unit in a SUF if that unit has some unusual or specific feature (if it is a very big company, for instance [10]).

Since statistical units are identifiable, SUFs are considered confidential and can be accessed only by authorised researchers (through the procedure described under ‘Criteria for eligible research entities and research proposals’). Unlike secure use files, SUFs can be used outside Eurostat’s secure environment, on the premises of the research body [11].

Table 1. Identification risk levels in different types of microdata

Data | Risk level | How respondents can be identified | To what level of precision can respondents be identified?
Microdata for statistical purposes | Extremely high | By direct identifiers (if available) or by combining indirect identifiers (characteristics such as NUTS level, size class, NACE category) | ‘This is a record referring to company X.’
Secure use files | High | By combining indirect identifiers (characteristics such as NUTS level, size class, NACE category) | The likelihood of identification is much smaller than with microdata for statistical purposes: ‘This is a record that probably refers to company X.’
Scientific use files | Low (reduced) | By combining indirect identifiers (characteristics such as NUTS level, size class, NACE category), but only units with rare characteristics can be identified | The likelihood of identification is much smaller than with secure use files: ‘This is a record that may refer to company X.’
Public use files | Eliminated | NA | NA

Preparing scientific use files

Scientific use files must be developed in such a way as to make it more difficult for the user to identify the statistical unit(s) concerned. However, the data must retain their research value. The following basic SDC methods are applied to make it harder to identify respondents [12]:

  • Removal of direct identifiers
  • Recoding: providing information at a more general level: e.g. at NUTS 2 instead of NUTS 3, size classes instead of precise employment figures.
  • Micro-aggregation
  • Record swapping
  • Rounding off
  • (Local) suppression

SDC methods are applied gradually to produce scientific use files. The actual risk of disclosure and the quality of the data are constantly checked. The more measures are taken to anonymise files, the less detailed they become and the less interesting they are to researchers.

The process of applying SDC methods to microdata continues until the right balance is struck between the disclosure risk (the probability that the respondent will be identified) and the quality of the data. Achieving a judicious balance depends on many conditions and involves expert judgment. It helps if the framework criteria are established at the beginning of the process.

Table 2. Examples of framework criteria for scientific use files

Criteria for | Example criterion
Disclosure risk of scientific use file | The SUF must contain at least X records with the same characteristics (defined by combining variables)
Quality of scientific use file | Indicator X, derived from the SUF, must be close to the same indicator derived from the original data
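The two example criteria in Table 2 could be checked along the following lines. This is a sketch under stated assumptions: the indicator is taken to be a simple mean, ‘close’ is interpreted as a maximum relative deviation, and the parameter values, field names and records are invented for illustration.

```python
from collections import Counter


def smallest_group_size(rows, keys):
    """Size of the smallest group of records sharing the same key values."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return min(counts.values())


def mean_of(rows, variable):
    return sum(row[variable] for row in rows) / len(rows)


def meets_criteria(original, suf, keys, variable, min_records, max_deviation):
    """Return (risk_ok, quality_ok) for a candidate scientific use file."""
    risk_ok = smallest_group_size(suf, keys) >= min_records
    reference = mean_of(original, variable)
    deviation = abs(mean_of(suf, variable) - reference) / abs(reference)
    quality_ok = deviation <= max_deviation
    return risk_ok, quality_ok


# Invented original data: NUTS 3 regions and exact turnover
original = [
    {"region": "DE211", "size_class": "250+", "turnover": 1040},
    {"region": "DE212", "size_class": "250+", "turnover": 980},
    {"region": "DE212", "size_class": "250+", "turnover": 1510},
    {"region": "DE213", "size_class": "250+", "turnover": 1430},
]
# Candidate SUF: regions recoded to NUTS 2, turnover rounded to the nearest 100
suf = [
    {"region": "DE21", "size_class": "250+", "turnover": 1000},
    {"region": "DE21", "size_class": "250+", "turnover": 1000},
    {"region": "DE21", "size_class": "250+", "turnover": 1500},
    {"region": "DE21", "size_class": "250+", "turnover": 1400},
]

keys = ("region", "size_class")
before = meets_criteria(original, original, keys, "turnover", 3, 0.05)
after = meets_criteria(original, suf, keys, "turnover", 3, 0.05)
```

Fixing parameters such as `min_records` and `max_deviation` before protection starts gives the iterative process a clear stopping rule.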

It is important to document the various steps in microdata protection well and to describe the reasoning leading to a particular decision. This not only makes the process transparent, but also enables it to be reproduced for other releases of data or other countries’ data.

ESS scientific use files

Most ESS microdata sets are available as scientific use files. All social surveys are available in this format. In addition, the following ESS business data collections are available as scientific use files released by Eurostat:

  • Structure of Earnings Survey (SES)
  • Community Innovation Survey (CIS)
  • Continuing Vocational Training Survey (CVTS).

It is much more difficult to prepare scientific use files for business data than for social data. This is because it is easier to identify enterprises, even if their detailed characteristics (direct identifiers: name, address, business register number) are not provided. This is particularly true of big companies (see more in the article on SDC).

ESS business microdata released by Eurostat

As mentioned above, ESS business microdata are available on site in Eurostat’s safe centre (as secure use files) and as scientific use files, depending on the data collection. There are fewer users of secure use files because of the costs of travel to the safe centre in Luxembourg. All microdata files for researchers are provided free of charge [13].

Table 3. ESS business microdata available for scientific purposes and number of research proposals submitted between July 2013 and April 2018

Data collection | Secure use files | Scientific use files
Structure of Earnings Survey (SES) | 18 | 133
Community Innovation Survey (CIS) | 67 | 143
Continuing Vocational Training Survey (CVTS) | - | 34
Micro-Moments Dataset (MMD, available since 2015) | 13 | -

Not all EU countries participate in the release of ESS business microdata [14]. In some countries, releasing data on individual businesses is banned by law.

Some countries choose not to release scientific use files, for technical reasons related to the applied SDC method, or because they consider their research value to be insufficient.

Public use files (PUFs)

In 2014 Eurostat launched a project designed to develop methodology for ESS public use files for EU-SILC and EU-LFS. Seven NSIs worked on the methodology for public use files (PUFs) and developed the files themselves. These files are now available on the Eurostat website. In the course of the PUF project it became clear that it would be very difficult to produce PUFs that are both safe and rich in information. These first versions of ESS PUFs are intended mainly for educational and testing uses. There are currently no plans to produce PUFs for European business data [15].

Modes of access available within the ESS

There are various other modes of access in the ESS countries. Source data also range from traditional questionnaires to administrative registers and publicly available sources. The graph below represents the current situation as regards different modes of access to the various data types available.

Figure 1. Modes of access to available microdata types (in bold: modes of access provided by Eurostat)

How Eurostat prepares ESS microdata

To release ESS microdata, national statistical authorities must agree on the mode of access (secure use files or scientific use files) and on how to protect the data adequately. This process is outlined in the ‘Guidelines for the assessment of research entities, research proposals and access facilities’. It covers both SDC methods applied to produce scientific use files and rules for output checking of secure use files. The process can be launched only for the surveys for which NSAs transmit microdata to Eurostat.

The process comprises these stages:

1. The domain-specific ESS working group (e.g. CIS WG):
a) analyses the need for and context of the release of confidential data for scientific purposes
b) identifies researchers’ needs as regards the level of detail of the datasets
c) establishes the order of priority of variables in terms of their interest or importance to researchers
d) documents the most relevant types of analysis in the context of the survey
e) proposes the mode of release (secure use files or scientific use files).
2. The ESS Working Group on Methodology (WGM) is notified of the decision by the domain-specific Working Group on the release of confidential data for scientific purposes.
3. After analysing the disclosure risk, Eurostat, assisted by the Expert Group on SDC, proposes protection methods.
4. The protection method is cross-validated by the domain-specific Working Group against the initial context and objectives, and by the WGM with regard to disclosure risks.
5. The national statistical authorities providing the confidential data notify Eurostat that they have approved the protection method and that their data are included in the release.
6. The list of research datasets and possible modes of access is published on Eurostat’s website.

Anonymising microdata

The term ‘anonymisation’ is often used as a synonym of microdata protection in general. It may refer either to the overall ‘de-confidentialisation’ process or to specific stages of that process. Microdata protection falls into the following stages:

  1. De-identification or pseudonymisation — the process of removing direct identifiers (such as name, ID and address) from confidential data and replacing them with pseudonymous codes [16].
  2. Partial anonymisation - applying the set of SDC methods to de-identified microdata so as to reduce the risk that the statistical unit can be identified. Scientific use files are the result of partial anonymisation.
  3. Complete anonymisation — the application of SDC methods that completely eliminate the risk of identification of the statistical unit (directly or indirectly). Public use files contain fully anonymised records.
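The first stage can be illustrated with a short Python sketch in which direct identifiers are dropped and a pseudonymous code is derived from the register number with a keyed hash. The technique, key and field names are assumptions made for illustration; the source does not state how pseudonymous codes are generated in practice.

```python
import hashlib
import hmac

# Hypothetical secret key; it would be held only by the statistical office
SECRET_KEY = b"example-key-not-for-real-use"

# Illustrative names of fields treated as direct identifiers
DIRECT_IDENTIFIERS = ("name", "address", "register_id")


def pseudonymise(record: dict, key: bytes = SECRET_KEY) -> dict:
    """Return a copy of the record with direct identifiers replaced by a pseudonymous code."""
    # The keyed hash is stable, so the same unit gets the same code in every file
    digest = hmac.new(key, record["register_id"].encode("utf-8"), hashlib.sha256)
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    cleaned["pseudo_id"] = digest.hexdigest()[:16]
    return cleaned


# The same invented company in two survey waves
wave_2016 = {"name": "Company X", "address": "1 Main St",
             "register_id": "LU123", "employees": 1200}
wave_2018 = {"name": "Company X Ltd", "address": "2 New St",
             "register_id": "LU123", "employees": 1350}

p16 = pseudonymise(wave_2016)
p18 = pseudonymise(wave_2018)
```

Because both waves receive the same code, the unit can still be followed over time, but the remaining variables may still identify it indirectly, which is why the later stages are needed.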

The diagram below shows the processes involved in preparing different types of microdata files and other types of outputs.

Figure 2. Processes for preparing different types of statistical outputs.

Organising access to microdata within the ESS

While Eurostat provides access to ESS microdata, most national statistical authorities (NSAs) in Europe provide access to their national microdata. NSAs hold microdata at country level. They decide individually which data are available for scientific purposes and what specific conditions apply. The overview of microdata access systems in the EU drawn up in 2015 as part of the ‘Data without Boundaries’ project is available here: http://www.dwbproject.org/access/accreditation_db.html [17].

National statistical authorities do not always consider providing access to microdata to be part of their core business. In some countries, researchers may have to pay for services relating to access provision.

Many EU countries have data archives which provide researchers with additional services, including:

  • preparing metadata
  • user support
  • training sessions
  • information sessions.

In some countries, data archives release microdata — usually scientific use files — on behalf of NSAs. Data archives have become important partners within the ESS, providing added value to the research community concerned with accessing microdata.

Conclusions and future outlook

Microdata access systems within the European Statistical System are in a process of constant development. They provide access to growing numbers of datasets, which are made available in different ways.

As regards access to European business statistics microdata for research purposes, Eurostat provides a range of modes of access covering data collections for which microdata are available at European level.

At the same time, Eurostat is expanding its services by developing decentralised access to secure use files via safe centres in EU countries. Another area of work in progress is the development of a system that will enable researchers to remotely submit codes and routines to be applied to microdata. This will obviate the need for researchers to have access to such microdata themselves.

Other challenges facing Eurostat include processing and providing access to integrated data or big data. Such processing raises a number of legal issues which will need to be resolved first.

Contacts

Aleksandra Bujnowska (aleksandra.bujnowska@ec.europa.eu)

Wim Kloek (wilhelmus.kloek@ec.europa.eu)

Estat-microdata-access@ec.europa.eu

Direct access to

  • Overview of methodologies of European business statistics: EBS manual
  • Legal provisions related to Microdata service for researchers can be found in the following overview

Notes

  1. Path: Eurostat website/About Eurostat/Our partners/European statistical system.
  2. For information about European statistical programmes, see: http://ec.europa.eu/eurostat/web/european-statistical-system/overview.
  3. https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:32013R0557.
  4. For more details of microdata access procedures, see Eurostat’s dedicated website: http://ec.europa.eu/eurostat/web/microdata/overview.
  5. http://ec.europa.eu/eurostat/documents/203647/771732/Recognised-research-entities.pdf.
  6. Eurostat is currently developing a database of all the publications issued using ESS microdata.
  7. http://ec.europa.eu/eurostat/documents/203647/203701/Note-CIS-researcher-Eurostat-SAFE-Centre.pdf
  8. There are also various collaborative initiatives involving several countries. The Nordic countries have agreed that, in the event of a research project requiring joint Nordic microdata, their data can be pooled via the remote access system of one Nordic country (see: http://nordman.network/).
  9. See more: Methodology for the Automatic Confidentialisation of Statistical Outputs from Remote Servers at the Australian Bureau of Statistics, Gwenda Thompson, Stephen Broadfoot and Daniel Elazar, October 2013.
  10. With both scientific and secure use files, users can identify a statistical unit if they have some real-life knowledge of it. For instance, the user may know where the unit is located, how big it is or what its main activities are.
  11. Specific conditions of use apply.
  12. These SDC techniques are described in detail in the article on SDC.
  13. Access to microdata released by Eurostat used to be subject to fees. Since 2011, access has been free of charge, following a decision of the Dissemination Working Group. The reason for this decision was the inefficiency of the complex cost recovery procedures which Eurostat used to charge for access. Moreover, the charging procedures were slowing down the application process.
  14. For details of countries participating in the different microdata releases, see: http://ec.europa.eu/eurostat/documents/203647/771732/Datasets-availability-table.pdf.
  15. However, some national statistical institutes (such as Finland’s) provide access to public use files for business data.
  16. In some countries, anonymisation is limited to and synonymous with de-identification.
  17. Microdata access systems and conditions may have changed in ESS countries after the completion of this overview. Eurostat plans to update it in the course of 2017.