6 Remote access issues
6.1 Introduction
Although very sophisticated methods have been developed to make safe microdata files, the needs of serious researchers for more detailed data cannot be met by these methods. It is simply impossible to release these very detailed microdata sets to users outside the control of the NSIs without breaching the necessary confidentiality protection. Nevertheless the NSIs recognize the serious and respectable requests by the research community for access to the very rich and valuable microdata sets of the NSIs. Therefore different initiatives have been taken by the NSIs to meet these needs.
The first step was the creation of Research Data Centres (RDCs): special rooms at the NSI where researchers can analyse the data sets but cannot export any information without the consent of the NSI. In parallel to this initiative there are options for remote execution. Remote execution facilities are systems where researchers can submit scripts for SAS, SPSS, etc. to the NSI. Remote access, where users can “log in” to an RDC from a remote desktop, has become commonly used.
As all these options allow the researcher to access unprotected sensitive data in some way, all possible precautions have to be taken. These options are certainly not available to the general public, but only to selected research institutes such as universities. Additionally, strict contracts have to be signed between the NSI and the researcher. Preferably the research institute itself should also sign the contract. This enables the NSI to take action against the institute as well as against the researcher if something goes wrong. A common sanction could be a ban on the whole institute accessing these facilities. So it is in the interest of the institute to ensure that the researcher behaves correctly.
6.2 Research Data Centres (RDCs)
In order to meet the needs of the research community to analyse the rich datasets compiled by the NSIs, while respecting the confidentiality constraints, the first solution was to create special rooms on the premises of the NSIs (RDCs). The NSI makes special computers available to the researchers. On these computers the NSI installs the software needed for the research together with the necessary datasets. Ideally these computers have no connection whatsoever to the internet and there is no email. Drives for removable disks are not available either, and the use of memory sticks has to be blocked. Access to the internal production network of the NSI has to be blocked as well, so that researchers cannot access other sensitive information. Installing a printer is a risk, as is the use of phones. Supervision of the RDC is always needed.
The datasets to be used in the RDC have to be anonymised (i.e. at least the name, address, etc. are removed). It is also advisable to restrict the variables available to the set that is needed for the specific research.
On these computers the researchers should nevertheless be able to fully analyse the data files and complete their analysis. When the research is finished the results have to be released to the researchers. Before this can be done, NSI staff have to check the research results. Unfortunately this is not a straightforward, easy task. This will be discussed in section 6.6.
The concept of RDCs meets many research needs and several NSIs have adopted this idea. RDCs have been implemented in the USA, Canada, the Netherlands, Italy, Germany, Denmark and several other countries, as well as at Eurostat.
The concept of RDCs has proved to be very successful. Many good research papers and theses have been completed for which these centres were indispensable. However there are some drawbacks. The most important one is that the researchers have to come physically to the premises of the NSIs. Even in a small country like the Netherlands, this is seen as a serious problem. The researcher also cannot just try another option when he is back at his normal working place, because he has to travel to the NSI first. The fact that he cannot work in his normal working environment is considered a further drawback.
6.3 Remote execution
As modern communication techniques have become available, the NSIs have investigated the possibilities of using these techniques. The first initiative is remote execution. In this concept the researchers get a full description of all the metadata of the datasets available for research. However the datasets themselves remain on the computers of the NSIs. The researchers prepare scripts for analysing the datasets (with SPSS, SAS, etc.) and send them to the NSI (by email or via some internet page). The NSI then checks the script (e.g. for commands like List Cases, but also for other unwanted actions like printing the residuals of a regression) before running it and, after a second check, sends the results back to the researcher.
For the researcher this system has the advantage that he no longer has to travel to the NSI. He can send a script whenever he wants. On the other hand he cannot directly run the script since this is done by the NSI. Hence correcting errors in a script can take much more time, depending on the turnaround time of the NSI. This process can be sped up if the NSI makes available a fake dataset which corresponds to the original file in terms of structure but not in content. The main objective of this dataset is to avoid unsuccessful submissions due to syntax errors and similar mistakes.
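The first check of a submitted script could be partly supported by a simple screening step. The following is a minimal sketch in Python, not a description of any NSI's actual system: the blocklist patterns, the function name `screen_script` and the SPSS-style example script are all assumptions made for illustration, and a real check would still need review by NSI staff.

```python
import re

# Hypothetical blocklist: the text names "List Cases" and printing regression
# residuals as examples; the patterns themselves are illustrative assumptions.
BLOCKED_PATTERNS = {
    "lists individual records": re.compile(r"^\s*list(\s+cases)?\b", re.IGNORECASE),
    "prints regression residuals": re.compile(r"\bresiduals?\b", re.IGNORECASE),
}


def screen_script(script_text):
    """Return (line_number, reason, line) for every line matching a blocked pattern."""
    findings = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        for reason, pattern in BLOCKED_PATTERNS.items():
            if pattern.search(line):
                findings.append((number, reason, line.strip()))
    return findings


# Made-up SPSS-style script used only to exercise the screening function.
example_script = """\
GET FILE='survey.sav'.
FREQUENCIES VARIABLES=region.
LIST CASES /VARIABLES=income.
REGRESSION /DEPENDENT income /METHOD=ENTER age /RESIDUALS.
"""

for number, reason, line in screen_script(example_script):
    print(f"line {number}: {reason}: {line}")
```

A pattern match only flags a line for attention; it does not replace the manual check of the script or the second check of the results.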
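A fake dataset of this kind could be generated from the metadata alone. The sketch below assumes a hypothetical schema (the variable names, types and value ranges are invented) and draws every variable independently at random, so that the file has the structure of the original but carries none of its content.

```python
import csv
import io
import random

# Hypothetical metadata: variable names, types and value ranges are invented
# for illustration and do not describe any real NSI dataset.
SCHEMA = [
    {"name": "person_id", "type": "int", "min": 1, "max": 999999},
    {"name": "region", "type": "category", "values": ["N", "E", "S", "W"]},
    {"name": "income", "type": "float", "min": 0.0, "max": 100000.0},
]


def make_fake_rows(schema, n_rows, seed=0):
    """Draw values independently per variable, so only the structure is preserved."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        row = {}
        for var in schema:
            if var["type"] == "int":
                row[var["name"]] = rng.randint(var["min"], var["max"])
            elif var["type"] == "float":
                row[var["name"]] = round(rng.uniform(var["min"], var["max"]), 2)
            else:
                row[var["name"]] = rng.choice(var["values"])
        rows.append(row)
    return rows


def to_csv(schema, rows):
    """Serialise the fake rows with the same column order as the schema."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[v["name"] for v in schema])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


fake_rows = make_fake_rows(SCHEMA, n_rows=5)
print(to_csv(SCHEMA, fake_rows))
```

Because the values are random, results computed on such a file are meaningless; it is only suitable for testing that a script runs without syntax errors.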
For the researcher remote execution has several advantages (no need for travel) but also some drawbacks (slow turn around time). For the NSIs it is very time-consuming, as they have to check so many scripts and results. It is not uncommon in statistical analysis that several scripts are submitted and executed. But then the outcome proves to be not the optimal model and a new script is submitted. However the NSI does not know in advance which script is successful and has to check everything. This is very time-consuming if the NSI takes this seriously.
Examples of this kind of system are LISSY, the remote execution system of the Luxembourg Income Study, and the Australian RADL.
6.4 Remote access
Systems for remote access have become common over the years. The aim is to combine the flexibility for researchers to do all their analysis in an RDC while removing the constraint of travelling to the NSI. Modern developments in the internet make it possible to set up a safe, controlled connection: a VPN (virtual private network). A VPN is a technique for setting up a secure connection between the server at the NSI and a computer of the researcher. It uses firewalls and encryption techniques. Additional measures to control the login procedure, such as software tokens or biometrics, can be used to secure the connection further. The best-known product behind this is Citrix, but other systems exist as well. Citrix has been developed to set up safe access to business networks over the internet without giving access to unauthorised persons. This safeguards the confidentiality of the information on the network.
Some NSIs are using Citrix to set up a safe connection between the PC of the researcher and a protected server of the NSI. This approach was followed by Denmark, Sweden and the Netherlands. Slovenia is using Windows remote desktop services (similar to Citrix). Statistics Netherlands is currently using VMware Horizon Client (also similar to Citrix).
The main idea of a remote facility is that it should resemble the ‘traditional’ OnSite RDCs as much as possible, concerning confidentiality aspects.
The following aspects have to be taken into account:
- Only authorized users should be able to make use of this facility,
- Microdata should remain at the NSI,
- Desired output of analyses should be checked on confidentiality,
- Legal measures have to be taken when allowing access.
The key issue is that the microdata set remains in the controlled environment of the NSI, while the researcher can do the analysis in his institute. In fact it is an equivalent of the RDC. The Citrix connection enables the researcher to run SPSS, SAS, etc. on the server of the NSI. The researcher only sees the session on his screen. This allows him to see the results of his analysis but also the microdata itself. This is completely equivalent to what he would see if he were at the RDC.
Citrix only sends the pictures of the screens to the PC of the researcher; no data is sent to him. Even copying the data from the screen to the hard disk is not possible. If the researcher is satisfied with some analysis and wants to use the results in his report, he should make a request to the NSI to release these results to him. The NSI has to check the output for disclosure risks and, if the output is safe, the NSI will send the results to the researcher.
As the researchers will work with very sensitive data, all measures should be taken to ensure the confidentiality of the data. Therefore also legal measures have to be taken, binding not only the researcher himself but also the institute.
6.5 Licensing
Another access option for microdata releases available to NSIs is to release data under licence or access agreements. A spectrum of different data access arrangements can be provided. A variety of factors should be taken into account when granting approval for access – including the purpose of the access, the status of the user, the legal framework, the status of the data, the availability of facilities and the history of access. The levels of control over use and user applied within the licence should be balanced by the level of detail and/or perturbation in the microdata.
6.6 Confidentiality protection of the analysis results
6.6.1 Output checking
Output checking is the process of checking the disclosure risk of research results based on microdata files made available in RDCs. NSIs and other institutions can establish their own rules for output checking.
In 2009, a document 'Guidelines for the checking of output based on microdata research' was prepared within the European project ESSnet SDC. In 2015, this document served as the basis for a document 'Guidelines for Output Checking' prepared within the DwB (Data without Boundaries) project. Both documents provide guidelines and practical advice for output checkers. They describe a principles-based model and a rule-of-thumb model; the former considers the entire context of the output, while the latter is based on strict rules. The overall rule of thumb is defined and its application to different types of output is described. Organisational aspects of output checking are also discussed.
6.6.2 Basic rules concerning the program code: Example from German official statistics
6.6.2.1 Introduction
As in many other countries, German official microdata are subject to strict data protection regulations. Therefore, results produced on the basis of statistical microdata are checked for confidentiality risks and critical values are suppressed or altered. This applies both to publications of the Statistical Offices and to results that are generated by researchers via the Research Data Centres (RDC) of German official statistics.
The RDC offer the scientific community a wide variety of data from different statistics. Since the RDC were established in Germany in 2001, microdata requests have considerably increased and analyses are getting more and more complex. Thus, the checks for statistical disclosure control, which are mostly done manually [1], became a very time-consuming and labour-intensive part of RDC work. Experience from several hundred research projects has shown that those confidentiality checks can be accomplished more easily if the program code and the resulting output structure follow a given set of rules. To make the application of these rules as user friendly as possible, the RDC of German official statistics developed sample program codes for different programming languages (see https://www.forschungsdatenzentrum.de/en/confidentiality).
[1] A mostly manual handling of the confidentiality check has several disadvantages (inefficient, possible mistakes, …) but, in contrast to automated systems, it does not completely block certain procedures and commands but allows individual decisions: when calling up critical procedures, and depending on the data or the specific analysis, the RDC staff can decide to what extent the analysis results can be transmitted to the user or be retained for confidentiality reasons. Necessary changes can easily be communicated to the user.
The purpose of the rules and the sample program codes is
- to make program code easily and quickly understandable,
- to include all necessary information for the output check,
- to clearly indicate the relations between different sets of output,
- to facilitate readability,
- to differentiate between output for release and output that is only created for the confidentiality check,
- to apply uniform standards to projects performed at different locations of the RDC,
- to thus reduce the time it takes to check an output,
- and, where necessary, to enable a smooth change of the project staff at the RDC without causing delays in the project progress.
6.6.2.2 Rules for program design
The rules that have to be applied by the data users are stated below. They are an outline of the RDC’s “Regulations on the analysis of microdata” [2].
[2] Research Data Centres of the Federal Statistical Office and the Statistical Offices of the Federal States: Regulations on the analysis of microdata in the Research Data Centres of the Federal Statistical Office and the Statistical Offices of the Federal States (RDC). Düsseldorf, 2022. See https://www.forschungsdatenzentrum.de/en/confidentiality
Clear Structure
The program code has to be written with a clear structure and all program steps have to be comprehensible. A master program code is to be used if the code is separated into several files. All specifications that have to be adjusted by the RDC to execute the program code (path specification, name of the dataset, …) are to be stated only once and in the header of the (master) program code.
Detailed commentary
All steps for preparation and analysis of the data have to be commented reasonably and in detail. Their content has to be described.
Uniqueness of variable and value labels
Variable and value labels have to be assigned uniquely and with descriptive names. If a new variable is created or an existing variable is adjusted, all related labels have to be reassigned and stated in the header of the program code.
Reproducibility of the output
All output has to be identically reproducible by the associated program code. The logging has to be switched on at the beginning of the code and may not be switched off at any time.
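The structure and logging rules can be illustrated with a skeleton of a master program. This is only a sketch in Python under stated assumptions: the RDC provides its own sample program codes for the languages it supports, and the directory, dataset name, log file name and step functions below are hypothetical placeholders.

```python
import logging
from pathlib import Path

# --- Header: everything the RDC has to adjust is stated once, here. ---
# All three values are hypothetical placeholders.
DATA_DIR = Path("data")
DATASET_NAME = "survey_2020.csv"
LOG_FILE = Path("analysis.log")

# Logging is switched on before any processing and is never switched off.
logging.basicConfig(
    filename=LOG_FILE,
    filemode="w",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    force=True,  # replace any handlers configured earlier (Python 3.8+)
)
log = logging.getLogger("master")


def prepare_data(dataset_path):
    """Step 1: data preparation (placeholder for the real preparation code)."""
    log.info("Preparing data from %s", dataset_path)


def run_analysis(dataset_path):
    """Step 2: analysis (placeholder for the real analysis code)."""
    log.info("Running analysis on %s", dataset_path)


def main():
    # The dataset path is derived from the header settings, never restated.
    dataset_path = DATA_DIR / DATASET_NAME
    log.info("Master program started")
    prepare_data(dataset_path)
    run_analysis(dataset_path)
    log.info("Master program finished")


if __name__ == "__main__":
    main()
```

Keeping the adjustable settings in one place means the RDC staff only have to edit the header before running the code, and the uninterrupted log documents every step that led to the output.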
Specification of the output formats
All tabular and analytical results are to be saved in a processable format so that the RDC are able to conduct the confidentiality check. In contrast, all graphical results have to be saved in a non-processable format to prevent underlying values or numbers of cases from being released.
Marking of output to be released and output for the confidentiality check
The RDC distinguish between output that is to be checked by the RDC staff and released for publication and output that is generated only for conducting the confidentiality check. Both output categories and their relations have to be unambiguously marked.
Output of the underlying numbers of cases and marking of the relations
For all output that is to be released, the underlying unweighted number of cases is to be stated. If diagrams and graphics are to be released, additional tables with the depicted values and the underlying unweighted numbers of cases have to be provided and marked unambiguously.
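As an illustration of this rule, a weighted result can be produced together with the unweighted number of cases behind it. The sketch below uses invented records and a hypothetical function name; it is not taken from the RDC sample codes.

```python
from collections import defaultdict

# Hypothetical records: (group, value, survey weight).
records = [
    ("A", 10.0, 2.0),
    ("A", 20.0, 1.0),
    ("B", 30.0, 1.5),
    ("B", 50.0, 0.5),
    ("B", 40.0, 1.0),
]


def weighted_means_with_counts(rows):
    """Return {group: (weighted mean, unweighted number of cases)}."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # weighted sum, weight total, n
    for group, value, weight in rows:
        acc = sums[group]
        acc[0] += value * weight
        acc[1] += weight
        acc[2] += 1
    return {g: (ws / w, n) for g, (ws, w, n) in sums.items()}


for group, (mean, n) in sorted(weighted_means_with_counts(records).items()):
    print(f"group {group}: weighted mean = {mean:.2f}, unweighted n = {n}")
```

The count is taken per record, independent of the weights, because the confidentiality check depends on how many respondents actually contribute to a result.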
Output of difference groups and marking of relations
If results for one or more associated and non-overlapping subgroup(s) are created in addition to results for the whole population, the results for the remaining (possibly summarised) subgroup(s) always have to be stated as well. Missing values should preferably be stated separately to avoid difference problems in subsequent analyses. In the case of overlapping subgroups the number of cases has to be stated for every intersection.
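The difference-group rule can be sketched as follows: whenever a count is produced for a subgroup and for the whole population, the count of the summarised remaining groups and of the missing values is produced as well, so that the checker can see what could be derived by subtraction. The data, category names and function name are hypothetical.

```python
from collections import Counter

# Hypothetical records: employment status per person; None marks a missing value.
statuses = ["employed", "employed", "unemployed", None, "retired",
            "employed", "retired", None, "unemployed", "employed"]


def counts_with_difference_group(values, groups_of_interest):
    """Count cases for the requested groups, the summarised remainder,
    missing values and the whole population."""
    counts = Counter(values)
    table = {g: counts.get(g, 0) for g in groups_of_interest}
    requested = sum(table.values())
    table["missing"] = counts.get(None, 0)
    table["total"] = len(values)
    # Everything that is neither requested nor missing forms the difference group.
    table["remaining groups"] = table["total"] - table["missing"] - requested
    return table


for label, n in counts_with_difference_group(statuses, ["employed"]).items():
    print(f"{label}: {n}")
```

If the remaining group turned out to be very small, releasing the subgroup result next to the total would disclose it by differencing, which is exactly what the checker needs to be able to see.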
Output of certain values for the check for dominance and marking of relations
If value tables (sums) are created then the number of cases and the highest two individual values have to be stated. [Note: In Germany, this only applies to economic or tax statistics.]
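For a single table cell, the values required by this rule could be collected as in the following sketch. The cell values and the function name are invented for illustration; how the values are then assessed (for example with a dominance rule and its parameters) is left to the RDC and is not shown here.

```python
import heapq

# Hypothetical turnover values of the units contributing to one table cell.
cell_values = [120.0, 15.0, 640.0, 80.0, 45.0]


def dominance_check_values(values):
    """Return the cell sum, the number of cases and the two largest values."""
    largest = heapq.nlargest(2, values)
    return {
        "sum": sum(values),
        "n": len(values),
        "largest": largest[0] if len(largest) > 0 else None,
        "second_largest": largest[1] if len(largest) > 1 else None,
    }


print(dominance_check_values(cell_values))
```

In line with the marking rule above, such a table would be labelled as output for the confidentiality check only, since the largest individual values themselves must not be released.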
Non-redundancy of statistical results
In the course of a project, identical statistical results may only be marked for release once. If results have to be released again in duly substantiated exceptional cases, an exact reference to the corresponding earlier analysis has to be made.
6.7 References
John Coder and Marc Cigrang (2003), LISSY: A system for providing Restricted Access to Survey Microdata from Remote Sites, Monographs of Official Statistics, Luxembourg
Anco Hundepool and Peter-Paul de Wolf (2005), OnSite@Home: Remote Access at Statistics Netherlands, Monographs of Official Statistics, Luxembourg
Lars-Johan Söderberg (2005), MONA - Microdata ON-line Access at Statistics Sweden, Monographs of Official Statistics, Luxembourg
Lars Borchsenius (2005), New developments in the Danish system for access to micro data, Monographs of Official Statistics, Luxembourg
Brandt, M. et al. (2009). Guidelines for the checking of output based on microdata research: https://research.cbs.nl/casc/ESSnet/GuidelinesForOutputChecking_Dec2009.pdf
The DwB project, Work Package 11, extracted from the deliverable D11.8 (2015). Guidelines for Output Checking: https://cros.ec.europa.eu/system/files/2024-02/Output-checking-guidelines.pdf