5  Frequency tables

5.1 Introduction

This chapter discusses disclosure controls for frequency tables, that is tables of counts (or percentages) where each cell value represents the number of respondents in that cell.

Traditionally frequency tables have been the main method of dissemination for census and social data by NSIs. These tables contain counts of people or households with certain social characteristics. Frequency tables are also used for business data where characteristics are counted, such as the number of businesses. Because of their longer history there has been relatively more research on protecting frequency tables, as compared with newer output methods such as microdata.

Section 5.2 of the chapter outlines the common types of disclosure risk and how the consideration of these risks leads to the definition of unsafe cells in frequency tables. The process of aggregating individual records into groups to display in tables reduces the risk of disclosure compared with microdata, but usually some additional SDC protection is needed for unsafe cells in tables. Disclosure control methods are used to reduce the disclosure risk by disguising these unsafe cells.

Advantages and disadvantages of a range of different SDC methods are discussed in general terms in Section 5.3. Well-established SDC methods for frequency tables are the cell key method, introduced in Section 5.4, and rounding, discussed in Section 5.5, which covers alternative techniques such as conventional and random rounding, small cell adjustment, and the mathematically much more demanding controlled rounding. Section 5.5 also provides information on how the software package \(\tau\)‑ARGUS can be used to apply controlled rounding to frequency tables. Section 5.6 introduces targeted record swapping, a pre-tabular method, i.e. one applied to the microdata before generating the table, which is intended as a protection method for tables rather than for microdata.

Section 5.7 addresses a special kind of disclosure risk that can arise when means based on original data are released while the underlying frequencies are protected by a perturbative, non-additive SDC method.

Information loss measures that can be used to evaluate the impact that different disclosure control methods have on the utility of frequency tables are described in Section 5.8. However, when evaluating the results of a disclosure control method, it is of course not enough to look at information loss. When designing a disclosure control method, it is also important to consider residual disclosure risks and to balance risk and utility. The final section, Section 5.9, offers measures that could be useful in that respect.

5.2 Disclosure risks

Disclosure risks for frequency tables primarily relate to ‘unsafe cells’; that is cells in a table which could lead to a statistical disclosure. There are several types of disclosure risk and the associated unsafe cells can vary in terms of their impact. A risk assessment should be undertaken to evaluate the expected outcomes of a disclosure. In order to be explicit about the disclosure risks to be managed one should also consider a range of potentially disclosive situations and use these to develop appropriate confidentiality rules to protect any unsafe cells.

The disclosure risk situations described in this section primarily apply to tables produced from registration processes, administrative sources or censuses, i.e. data sources with complete coverage of the population or sub-population. Where frequency tables are derived from sample surveys, i.e. where the counts in the table are weighted, some protection is provided by the sampling process. The sample a priori introduces uncertainty into the zero counts and other counts through sampling error.

It should be noted that when determining unsafe cells one should take into account the variables that define the population within the table, as well as the variables defining the table. For example, a frequency table may display income by region for males. Although sex does not define a row or column it defines the eligible population for the table and therefore must be considered as an identifying variable when thinking about these disclosive situations.

Disclosure risks are categorised based on how information is revealed. The most common types of disclosure risk in frequency tables are described below.

Identification as a disclosure risk involves finding yourself or another individual or group within a table. Many NSIs will not consider that self-identification alone poses a disclosure risk. An individual who can recall their circumstances at the time of data collection will likely be able to deduce which cell in a published table their information contributes to. In other words, they will be able to identify themselves, but only because they know what attributes were provided in the data collection, along with any other information about themselves which may assist in this detection.

However, identification or self-identification can lead to the discovery of rareness, or even uniqueness, in the population of the statistic, which is something an individual might not have known about themselves before. This is most likely to occur where a cell has a small value, e.g. a 1, or where it becomes in effect a population of 1 through subtraction or deduction using other available information. For certain types of information, rareness or uniqueness may encourage others to seek out the individual. The threat or reality of such a situation could cause harm or distress to the individual, or may lead them to claim that the statistics offer inadequate disclosure protection for them, and therefore others.

Example Identification or self-identification may occur from any cells with a count of 1, i.e. representing one statistical unit. Table 5.1 presents an example of a low-dimensional table in a particular area where identification may occur.

Marital Status Male Female Total
Married 38 17 55
Divorced 7 4 11
Single 3 1 4
Total 48 22 70
Table 5.1: Marital status by sex

The existence of a 1 in the Single/Female cell indicates that the female who is single is at risk of being identified from the table.

Identification itself poses a relatively low disclosure risk, but its tendency to lead to other types of disclosure, together with the perception issues it raises, means that several NSIs choose to protect against identification disclosure. Section 5.3 discusses protection methods, which tend to focus on reducing the number of small cells in tables.

Attribute disclosure involves the uncovering of new information about a person through the use of published data. An individual attribute disclosure occurs when someone who has some information about an individual could, with the help of data from the table (or from a different table with a common attribute), discover details that were not previously known to them. This is most likely to occur where there is a cell containing a 1 in the margin of the table and the corresponding row or column is dominated by zeros. The individual is identified on the basis of some of the variables spanning the table and a new attribute is then revealed about the individual from other variables. Note that identification is a necessary precondition for individual attribute disclosure to occur, and should therefore be avoided.

This type of disclosure is a particular problem when many tables are released from one data set. If an intruder can identify an individual, then additional tables provide more detail about that person. Continuing the example shown in Table 5.1, the cell disclosing the single female as unique becomes a marginal cell in a higher-dimensional table such as Table 5.2 below, and her number of hours worked is revealed.

Example

Marital status  Male: more than 30  Male: 16-30  Male: 15 or less  Female: more than 30  Female: 16-30  Female: 15 or less  Total
Married 30 6 2 14 3 0 55
Divorced 3 4 0 2 2 0 11
Single 2 0 1 0 0 1 4
Total 35 10 3 16 5 1 70
Table 5.2: Marital status and sex by hours worked

The table shows how attribute disclosure arises due to the zeros dominating the column of the single female, and it is learned that she is in the lowest hours-worked band.

The occurrence of a 2 in the table could also lead to identification if one individual contributed to the cell and therefore could identify the other individual in the cell.

Example An example of potential attribute disclosure from the 2001 UK Census data involves 184 persons living in a particular area in the UK. Uniques (frequency counts of 1) were found for males aged 50-59, males aged 85+, and females aged 60-64. An additional table showed these individuals further disaggregated by health variables, and it was learned that the single male aged 50-59 and the single female aged 60-64 had good or fairly good health and no limiting long-term illness, while the single male aged 85+ had poor health and a limiting long-term illness. Without disclosure control, anyone living in this particular area had the potential to learn these health attributes about the unique individuals.

Full coverage sources – like the Census – are a particular concern for disclosure control because they are compulsory, so there is an expectation to find all individuals in the output. Although there may be some missing data and coding errors etc., NSIs work to minimise these, and the data issues are unlikely to be randomly distributed in the output. Certain SDC techniques can be adjusted to target particular variables (or tables) with more or less inherent data error, for example by applying more cell suppression to variables which are known to be of better quality and to have fewer data issues.

Another disclosure risk involves learning a new attribute about an identifiable group, or learning a group does not have a particular attribute. This is termed group attribute disclosure, and it can occur when all respondents fall into a subset of categories for a particular variable, i.e. where a row or column contains mostly zeros and a small number of cells that are non-zero. This type of disclosure is a much neglected threat to the disclosure protection of frequency tables, and in contrast to individual attribute disclosure, it does not require individual identification. In order to protect against group attribute disclosure it is essential to introduce ambiguity in the zeros and ensure that all respondents do not fall into just one or a few categories.

Example Table 5.3 shows respondents in a particular area broken down by hours worked and marital status.

Marital status  Hours worked: Full time  Hours worked: Part time  Total
Married 6 0 6
Divorced 5 1 6
Single 2 2 4
Total 13 3 16
Table 5.3: Marital status by hours worked

From the table we can see that all married individuals work full time, therefore any individual in that area who is married will have their hours worked disclosed.

The table also highlights another type of group attribute disclosure referred to as ‘within-group disclosure’. This occurs for the divorced group and results from all respondents falling into two response categories for a particular variable, where one of these response categories has a cell value of 1. In this case, the divorced person who works part time knows that all other divorced individuals work full time.

Differencing involves an intruder using multiple overlapping tables and subtraction to gather additional information about the differences between them. A disclosure by differencing occurs when this comparison of two or more tables enables a small cell (0, 1, or 2) to be calculated. Disclosures by differencing can result from three different scenarios which will be explained in turn:

Disclosure by geographical differencing may result when there are several published tables from the same dataset and they relate to similar geographical areas. If these tables are compared, they can reveal a new, previously unpublished table for the differenced area. For instance, 2001 Output Areas (OA) are similar in geographical size to 1991 Enumeration Districts (ED), and a new differenced table may be created for the remaining area.

Example A fictitious example of differencing is presented below in Table 5.4, Table 5.5 and Table 5.6.

Hours worked  Single Person Household: Male  Single Person Household: Female
More than 30 50 54
16-30 128 140
15 or less 39 49
Table 5.4: Single Person households and hours worked in Area A (2001 OA definition)
Hours worked  Single Person Household: Male  Single Person Household: Female
More than 30 52 55
16-30 130 140
15 or less 39 49
Table 5.5: Single Person households and hours worked in Area A (1991 ED definition)
Hours worked  Single Person Household: Male  Single Person Household: Female
More than 30 2 1
16-30 2 0
15 or less 0 0
Table 5.6: New differenced table (via geographical differencing)

The above example demonstrates how simple subtraction of the geographical data in Table 5.4 from Table 5.5 can produce disclosive information for the area in Table 5.6.

For the special case of geographical differencing problems arising when spatial breakdowns are used, Costemalle (2019) has proposed a method that detects individuals located in small overlapping areas whose personal information could therefore be disclosed. On a more general level, Möhler et al. (2024) offer guidance on statistical disclosure control methods applied to geo-referenced data, also looking at issues of geographical differencing.

Disclosure by linking can occur when published tables relating to the same base population are linked by common variables. These new linked tables were not published by the NSI and therefore may reveal the statistical disclosure control methods applied and/or unsafe cell counts.

Example A fictitious example of disclosure by linking is provided below in Table 5.7 to Table 5.10, which are linked by employment status and whether or not the respondents live in the area.

Number of Persons
Employed 85
Not employed 17
Total 102
Table 5.7: Employment status in Area A


Number working in area Number living in area Living and working in area
Area A 49 102 22
Table 5.8: Area of Residence or Workplace


Living and working in Area A Living in Area A and working elsewhere Working in Area A and living elsewhere
Males 21 58 23
Table 5.9: Males working and living in Area A


Table 5.10 shows the new data which can be derived by combining and differencing the totals from the existing tables. The linked table discloses the female living and working in Area A as a unique.

Living and working in Area A Living in Area A and working elsewhere Working in Area A and living elsewhere
Males 21 58 23
Females 1 5 4
Total 22 63 27
Table 5.10: New differenced table (via linking)

Importantly, when linked tables are produced from the same dataset it is not sufficient to consider the protection for each table separately. If a cell requires protection in one table then it will require protection in all tables, otherwise the protection in the first table could be undone.

The last type of disclosure by differencing involves differencing of sub-population tables. Sub-populations are specific groups which data may be subset into before a table is produced (e.g. a table of fertility may use a sub-population of females). Differencing can occur when a published table definition corresponds to a sub-population of another published table, resulting in the production of a new, previously unpublished table. If the total population is known and the subpopulation of females is gathered from another table, the number of males can be deduced.

Tables based on categorical variables which have been recoded in different ways may also result in this kind of differencing. To reduce the disclosure risk resulting from having many different versions of variables, most NSIs have a set of standard classifications which they use to release data.

Example An example using the number of hours worked is shown below in Table 5.11 to Table 5.13.

Sex <20 20 - 39 40 - 59 60 - 69 70 or more
Male 6 9 5 8 4
Female 10 38 51 42 32
Table 5.11: Hours worked by sex in Area A


Sex <25 25 - 39 40 - 59 60 - 69 70 or more
Male 7 8 5 8 4
Female 10 38 51 42 32
Table 5.12: Hours worked by sex in Area A


Sex <20 20 - 24 25 - 39 40 - 59 60 - 69 70 or more
Male 6 1 8 5 8 4
Female 10 0 38 51 42 32
Table 5.13: New differenced table (via sub-populations)


The example indicates how a new table can be differenced from the original tables, in particular a new hours worked group (for 20-24 hours) which reveals that the male falling into this derived hours worked group is unique.

More information on disclosure by differencing can be obtained from Brown (2003) and Duke-Williams and Rees (1998).

In addition to providing actual disclosure control protection for sensitive information, NSIs need to be seen to be providing this protection. The public may have a different understanding of disclosure control risks and their perception is likely to be influenced by what they see in tables. If many small cells appear in frequency tables users may perceive that either no SDC, or insufficient SDC methods have been applied to the data. Section 5.3 discusses SDC methods, but generally some methods are more obvious in the output tables than others. To protect against negative perceptions, NSIs should be transparent about the SDC methods applied. Managing perceptions is important to maintain credibility and responsibility towards respondents. Negative perceptions may impact response rates for censuses and surveys if respondents perceive that there is little concern about protecting their confidentiality. More emphasis has been placed on this type of disclosure risk in recent years due to declining response rates and data quality considerations. It is important to provide clear explanations to the public about the protection afforded by the SDC method, as well as guidance on the impact of the SDC methods on the quality and utility of the outputs. Explanations should provide details of the methods used but avoid stating the exact parameters as this may allow intruders to unpick the protection.

Disclosure Risk Description
Identification Identifying an individual in a table
Attribute disclosure (individual and group) Finding out previously unknown information about an individual (or group) from a table
Disclosure by differencing Uncovering new information by comparing more than one table
Perception of disclosure The public’s feeling of risk based on what is seen in released tables
Table 5.14: Summary of disclosure risks associated with frequency tables

5.3 Methods

There are a variety of disclosure control methods which can be applied to tabular data to provide confidentiality protection. The choice of which method to use needs to balance how the data are used, the operational feasibility of the method, and the disclosure control protection it offers. SDC methods can be divided into three categories, which will be discussed in turn below: those that adjust the data before tables are designed (pre-tabular), those that determine the design of the table (table redesign) and those that modify the values in the table (post-tabular). Further information on SDC methods for frequency tables can also be found in Willenborg and de Waal (2001) and Doyle et al. (2001).

Pre-tabular
Pre-tabular disclosure control methods are applied to microdata before it is aggregated and output in frequency tables. These methods include: record swapping, over-imputation, data switching, PRAM, sampling and synthetic microdata (see Section 3.4 or Section 5.6 for details of the methods). A key advantage of pre-tabular methods is that the output tables are consistent and additive, since all outputs are created from protected microdata. Pre-tabular methods by definition only need to be applied once to the microdata, and after they are implemented for a microdata set (often in conjunction with threshold or sparsity rules) they can be used to allow flexible table generation. This is because pre-tabular methods provide some protection against disclosure by differencing, and any uncovered slivers will have already had SDC protection applied.

Disadvantages of pre-tabular techniques are that one must have access to the original microdata and that a high level of perturbation may be required in order to disguise all unsafe cells. Pre-tabular methods have the potential to distort distributions in the data, but the actual impact of this will depend on which method is used and how it is applied. It may be possible to target pre-tabular methods towards particular areas or sensitive variables. Generally, pre-tabular methods are not as transparent to users of the frequency tables, and there is no clear guidance that can be given in order to make adjustments in their statistical analysis for this type of perturbation.

Table redesign
Table redesign is recommended as a simple method that can minimise the number of unsafe cells in a table and preserve original counts. It can be applied alongside post-tabular or pre-tabular disclosure control methods, as well as being applied on its own. As an additional method of protection it has been used by many NSIs, including those of the UK and New Zealand. As table redesign alone provides less disclosure control protection than other methods, it is often used to protect sample data, which already contain some protection from the sampling process.

Table redesign methods used to reduce the risk of disclosure include:

  • aggregating to a higher level geography or to a larger population subgroup
  • applying table thresholds
  • collapsing or grouping categories of variables (reducing the level of detail)
  • applying a minimum average cell size to released tables.

The advantages of table redesign methods are that original counts in the data are not damaged and the tables are additive with consistent totals. In addition, the method is simple to implement and easy to explain to users. However, the detail in the table will be greatly reduced, and if many tables do not pass the release criteria it may lead to user discontent.

Post-tabular
Statistical disclosure control methods that modify cell values within tabular outputs are referred to as post-tabular methods. Such methods are generally clear and transparent to users, and are easier to understand and account for in analyses than pre-tabular methods. However, post-tabular methods suffer from the problem that each table must be individually protected, and it is necessary to ensure that the new protected table cannot be compared against any other existing outputs in such a way that the protection that has been applied may be undone. In addition, post-tabular methods can be cumbersome to apply to large tables. The main post-tabular methods include cell suppression, the cell key method, and rounding.

Pre-Tabular Table Redesign Post-Tabular
Methods applied before tables are created Methods applied as tables are created Methods applied after tables are created
Tables and totals will be additive and consistent Yes Yes No
Methods are visible to users and can be accounted for in analysis No Yes Yes
Methods need to be applied to each table individually No Yes Yes
Flexible table generation is possible Yes No No (for Cell suppression)
Table 5.15: Summary of Tabular Disclosure Control Methods

The main perturbative post-tabular methods of disclosure control are discussed in the two subsequent sections.

Cell suppression Cell suppression is a non-perturbative method of disclosure control (it is described in detail in Chapter 4); the method essentially removes sensitive values and denotes them as missing. Protecting the unsafe cells is called primary suppression, and to ensure these cannot be derived by subtraction from published marginal totals, additional cells are selected for secondary suppression.

Cell suppression cannot be unpicked provided secondary cell suppression is adequate and the same cells in any linked tables are also suppressed. Other advantages are that the method is easy to implement on unlinked tables and it is highly visible to users. The original counts in the data that are not selected for suppression are left unadjusted.

However, cell suppression has several disadvantages as a protection method for frequency tables; in particular, information loss can be high if more than a few suppressions are required. Secondary suppression removes cell values which are not necessarily a disclosure risk, in order to protect other cells which are a risk. Disclosive zeros need to be suppressed, and this method does not protect against disclosure by differencing. This can be a serious problem if more than one table is produced from the same data source (e.g. flexible table generation). When disseminating a large number of tables it is much harder to ensure the consistency of suppressed cells, and care must be taken to ensure that the same cells in linked tables are always suppressed.

A simple instance of a Cell Perturbation method is Barnardisation. Barnardisation modifies each internal cell of every table by +1, 0 or -1, according to probabilities \(p/2\), \(1-p\) and \(p/2\) respectively for a fixed \(p\in(0,1)\). Zeros are not adjusted. The method offers some protection against disclosure by differencing, however table totals are added up from the perturbed internal cells, resulting in inconsistent totals between tables. Typically, the probability \(p\) is quite small and therefore a high proportion of risky cells are not modified. The exact proportion of cells modified is not revealed to the user. This is generally a difficult method to implement for flexible output.
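To make the mechanism concrete, the following base-R sketch applies Barnardisation to a vector of internal cell counts (the function name and interface are purely illustrative and not taken from any package):

```r
# Minimal sketch of Barnardisation: each non-zero internal cell is changed by
# +1, 0 or -1 with probabilities p/2, 1 - p and p/2; zero cells stay unchanged.
barnardise <- function(counts, p = 0.1) {
  noise <- sample(c(-1L, 0L, 1L), length(counts), replace = TRUE,
                  prob = c(p / 2, 1 - p, p / 2))
  noise[counts == 0] <- 0L   # zeros are not adjusted
  counts + noise
}

set.seed(1)
barnardise(c(0L, 1L, 2L, 5L, 12L), p = 0.2)
```

Table totals are then obtained by summing the perturbed internal cells, which is exactly what causes the inconsistent totals mentioned above.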

The Cell Key Method (CKM, described in Section 5.4) is a much more advanced cell perturbation method which was developed by the Australian Bureau of Statistics (hence it used to be known as the ABS Cell Perturbation method) to protect the outputs from their 2006 Census. The method is designed to protect tables by altering potentially all cells by small amounts. The cells are adjusted in such a way that the same cell is perturbed in the same way even when it appears across different tables. This method adds sufficient ‘noise’ to each cell so that if an intruder tried to gather information by differencing, they would not be able to obtain the real data. When integrated into the table generation process, the method provides protection for flexible tables and can be used to produce perturbations for multiple large high dimensional hierarchical tables. It is one of the methods recommended by Eurostat for protection of the Census 2022 output data.

The method is less transparent than other methods, such as, for example, conventional rounding.

Rounding (discussed in Section 5.5) involves adjusting the values in all cells in a table to a specified rounding base so as to create uncertainty about the real value of any cell. There are several alternative rounding methods, including conventional rounding, random rounding and controlled rounding. The properties of these variants vary widely; they are compared in the summary Table 5.22.

5.4 Cell Perturbation - the Cell Key Method

The Cell Key Method is a post-tabular perturbative disclosure control method that adds noise to the original table cell values. Since noise is added to each table cell independently, the resulting table is in general no longer additive. For example, after perturbation, a population table could show that 1000 males, 1100 females and 16 non-binary persons live in an area, while the total count is given as 2109 persons. This non-additivity is part of the protective mechanism of the method and at the same time offers the advantage that the deviation from the original value can be kept as small as possible. It is generally not recommended to form such aggregates subsequently from perturbed values, because this would also add the sum of all noise terms to the aggregate, which can make the deviation undesirably large.

The Cell Key Method is a more informed post-tabular method of disclosure control since it utilizes pre-tabular microdata information during the perturbation stage. This ensures that cells are adjusted in such a way that the same cell is perturbed in the same way even when it appears across different tables. The method is highly dependent on the so-called ‘lookup table’ that is used to determine the noise to be added to each particular cell before dissemination, but it is flexible in that lookup tables can be specifically designed to meet needs, and different lookup tables could potentially be used for different tables. Furthermore, the lookup table can be designed to reflect other post-tabular methods (e.g. small cell adjustments or random rounding). The method provides protection for flexible tables and can be used to produce perturbations for large high dimensional hierarchical tables. As noted above, since perturbation is applied to each table cell independently, additivity is lost. This is similar to the case of rounding, but due to the complexity of the method those inconsistencies in the data are harder to communicate. Theoretically one might add a post-processing stage to restore additivity, using, for example, an iterative fitting algorithm which attempts to balance and minimise absolute distances to the stage one table (although not necessarily producing an optimal solution). However, restoring additivity tends to increase the noise, and may cause different perturbations of the same cell when it appears across different tables. It is therefore not generally recommended.

Note that there is no single definitive way to define the Record Key, the Cell Key and the lookup table. The Australian Bureau of Statistics, for example, relies on integer values for their Record Keys, whereas the Center of Excellence (CoE) on SDC presented an approach where the Record Keys are uniformly distributed between 0 and 1, which should allow for more flexibility regarding noise design. We will focus on the latter approach here, which is also implemented in the software \(\tau\)-ARGUS and the R-package cellKey. In the variant suggested by the CoE on SDC, all digits before the decimal point are removed from the Cell Key (which is the sum of the Record Keys within a table cell), making it another random number that is uniformly distributed between 0 and 1. The lookup table can then be interpreted as the tabular representation of a piecewise constant inverse distribution function. By looking up values that are uniformly distributed, we thus obtain realizations of a random variable with the corresponding coded distribution.

It is possible to create lookup tables, also known as perturbation tables or p-tables, that are tailored to your needs by using the freely accessible R-package ptable. The package allows, among other things, specifying a maximum for the noise you want to add and the probability for the noise to be zero, which is equivalent to retaining the original value. You also have the option to generate the distribution coded inside the perturbation table in such a way that certain values, such as ones or twos, do not occur in the perturbed output at all. The method for creating such tables, implemented in the ptable package, is based on a maximum entropy approach as described, for example, in Giessing (2016) and yields a noise distribution with zero mean. Therefore, adding the noise does not bias the distribution of the data. For more information about the ptable package, please see the vignette or the reference manual on CRAN (https://cran.r-project.org/web/packages/ptable).
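To illustrate the idea (a sketch consistent with the description above, not necessarily the exact formulation used in the package, which may include further conditions such as a fixed noise variance or a prescribed probability of no change), the noise distribution for a given original value \(i\) can be thought of as the solution of a maximum entropy problem of the form

\[ \max_{p_{i\cdot}} \; -\sum_{j \geq 0} p_{ij}\,\log p_{ij} \quad \text{subject to} \quad \sum_{j \geq 0} p_{ij} = 1, \qquad \sum_{j \geq 0} (j-i)\, p_{ij} = 0, \qquad p_{ij} = 0 \ \text{if } |j-i| > D \text{ or } j \in \{1,2\}, \]

where \(p_{ij}\) is the probability of perturbing an original value \(i\) to the value \(j\), \(D\) is the chosen maximum deviation, and the last condition optionally excludes ones and twos from the perturbed output.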

The protection effect arises from the uncertainty of a data attacker about whether and, if so, how much a value has been changed. Therefore, all published figures must be perturbed with the Cell Key Method, even those that do not pose a disclosure risk per se. Before the Cell Key Method can be applied, one has to consider which maximum deviation is still acceptable and how large the variance of the noise should be. But one should always keep in mind that a low maximum deviation also leads to less protection and hence one cannot focus on information loss alone. It is especially risky to publish the maximum deviation, since a data attacker can use this information to draw further conclusions.

Example To illustrate how the Cell Key Method is used, Table 5.16 shows a purely fictional manually created perturbation table with a maximum deviation of 1 and without ones in the results after perturbation.

original value perturbed value probability of occurrence noise lower bound upper bound
0 0 1 0 0 1
1 0 0.5 -1 0 0.5
1 2 0.5 1 0.5 1
2 2 0.8 0 0 0.8
2 3 0.2 1 0.8 1
3 2 0.3 -1 0 0.3
3 3 0.4 0 0.3 0.7
3 4 0.3 1 0.7 1
Table 5.16: Fictional example of a p-table

As you can see, the values in the column ‘original value’ range from 0 to 3. This is because a different distribution is stored in the p-table for each of these values. Otherwise, negative values could arise, for example. This means that within a p-table several probability distributions for the noise are stored, which are used depending on the original value. In the given example, for an original value of 1 the noise \(\textit{v}\) is defined as a uniform distribution on the set \(\lbrace -1,1\rbrace\), whereas for an original value of 2 the noise is 0 with a probability of 80% and 1 with a probability of 20%. For every original value of at least 3, the lowest lines of the p-table are used to define the noise \(\textit{v}\); they encode a symmetric distribution on \(\lbrace -1,0,1\rbrace\).

ID Sex Record Key
A male 0.9
B male 0.3
C male 0.6
Table 5.17: Exemplary Microdata

Now if we have a set of microdata which contains three male respondents with Record Keys 0.9, 0.3 and 0.6 respectively, as shown in Table 5.17, then in a table cell that aggregates those three respondents the corresponding sum of Record Keys is 0.9+0.3+0.6=1.8. Since the digits before the decimal point are irrelevant for the Cell Key, we get a corresponding Cell Key of 0.8. Now to identify the noise \(\textit{v}\) that has to be added to the original count of 3, we have to concentrate on those lines of the p-table for which the original value is 3 and identify the line for which \(\textit{'lower bound'} < 0.8 \leq \textit{'upper bound'}\). This is the last row of our exemplary table, in which the value 1 is given for the noise. Hence the perturbed count for this cell computes as \(\hat{n} = n + v = 3 + 1 = 4\). At this point, it should be noted that if, in addition to frequencies, magnitudes and mean values are also published, the mean values should rather not be shown as original values, since otherwise there is a risk that the corresponding original frequency values can be disclosed. See the discussion in Section 5.7.
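The lookup step of this worked example can be reproduced in a few lines of base R. This is a minimal sketch built around the fictional p-table of Table 5.16; it is not the interface of the cellKey package or of \(\tau\)-ARGUS:

```r
# Fictional p-table of Table 5.16: i = original value (3 stands for "3 or more"),
# v = noise, (lb, ub] = cell key interval that selects this noise value.
ptab <- data.frame(
  i  = c(0, 1, 1, 2, 2, 3, 3, 3),
  v  = c(0, -1, 1, 0, 1, -1, 0, 1),
  lb = c(0, 0, 0.5, 0, 0.8, 0, 0.3, 0.7),
  ub = c(1, 0.5, 1, 0.8, 1, 0.3, 0.7, 1)
)

perturb_cell <- function(n, record_keys, ptab) {
  cell_key <- sum(record_keys) %% 1      # keep only the digits after the decimal point
  if (cell_key == 0) cell_key <- 1       # edge case, so the half-open intervals cover it
  rows <- ptab[ptab$i == min(n, max(ptab$i)), ]
  v <- rows$v[cell_key > rows$lb & cell_key <= rows$ub]
  n + v                                  # perturbed count
}

perturb_cell(3, c(0.9, 0.3, 0.6), ptab)  # cell key 0.8 -> noise +1 -> perturbed count 4
```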

5.4.1 Software implementing the Cell Key Method

For the application of the Cell Key Method, so-called \(p\)-tables describing the distribution of the noise are needed. They should be specified in a certain format. The R-package ptable, available on CRAN, can be used to produce such p-tables for use in the method-specific R-package cellKey as well as for use in the general purpose software \(\tau\)‑ARGUS. For information on how to use the software, we refer to the vignettes of the respective R-packages on CRAN (https://cran.r-project.org/), to the manual of \(\tau\)-ARGUS and to the quick references for CKM in \(\tau\)‑ARGUS on GitHub (https://github.com/sdcTools/tauargus/releases).

5.5 Rounding

Rounding involves adjusting the values in all cells in a table to a specified base so as to create uncertainty about the real value for any cell. It adds a small, but acceptable, amount of distortion to the original data. Rounding is considered to be an effective method for protecting frequency tables, especially when there are many tables produced from one dataset. It provides protection to small frequencies and zero values (e.g. empty cells). The method is simple to implement, and for the user it is easy to understand as the data is visibly perturbed.

Care must be taken when combining rounded tables to create user-defined areas. Cells can be significantly altered by the rounding process and aggregation compounds these rounding differences. Furthermore, the level of association between variables is affected by rounding, and the variance of the cell counts is increased.

There are several alternative rounding methods including: conventional rounding, random rounding, controlled rounding, and semi-controlled rounding, which are outlined below. Each method is flexible in terms of the choice of the base for rounding, although common choices are 3 and 5. All rounded values (other than zeros) will then be integer multiples of 3 or 5, respectively.

Conventional rounding
When using conventional rounding, each cell is rounded to the nearest multiple of the base. The marginal totals and table totals are rounded independently from the internal cells.

Example Table 5.18 shows counts of males and females in different areas, while Table 5.19 shows the same information rounded to a base of 5.

Male Female Total
Area A 1 0 1
Area B 3 3 6
Area C 12 20 32
Total 16 23 39
Table 5.18: Population counts by sex


Male Female Total
Area A 0 0 0
Area B 5 5 5
Area C 10 20 30
Total 15 25 40
Table 5.19: Population counts by sex (conventional rounding)

The example shows the Males unsafe cell in Area A in Table 5.18 is protected by the rounding process in Table 5.19.

The advantages of this method are that the table totals are rounded independently from the internal cells, and therefore consistent table totals will exist within the rounding base. Cells in different tables which represent the same records will always be the same. While this method does provide some confidentiality protection, it is considered less effective than controlled or random rounding. Tables are not additive (e.g. the internal cells of row 2 of Table 5.19 sum to 10 rather than to the rounded row total of 5, and the rounded area totals sum to 35 rather than to the rounded overall total of 40), and the level of information is poor if there are many values of 1 and 2. The method is not suitable for flexible table generation as it can be easily ‘unpicked’ when differencing and linking tables. For these reasons conventional rounding is not recommended as a disclosure control method for frequency tables. Conventional rounding is sometimes used by NSIs for quality reasons (e.g. rounding data from small sample surveys to emphasize the uncertain nature of the data). The distinction between rounding performed for disclosure control reasons and rounding performed for quality reasons should always be made clear to users.
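As a minimal illustration (base R, not a specific software interface), conventional rounding of counts can be written in one line; for integer counts and an odd base such as 3 or 5 no exact .5 ties can occur, so the tie-breaking behaviour of round() does not matter:

```r
# Conventional rounding of counts to the nearest multiple of the base.
round_conventional <- function(x, base = 5) base * round(x / base)

round_conventional(c(1, 0, 1, 3, 3, 6, 12, 20, 32, 16, 23, 39))
# 0 0 0 5 5 5 10 20 30 15 25 40  (the cells of Table 5.18, rounded as in Table 5.19)
```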

Random rounding
Random rounding shifts each cell to one of the two nearest base values in a random manner. Each cell value is rounded independently of other cells, and has a greater probability of being rounded to the nearest multiple of the rounding base. For example, with a base of 5, cell values of 6, 7, 8, or 9 could be rounded to either 5 or 10. Marginal totals are typically rounded separately from the internal cells of the table (i.e. they are not created by adding rounded cell counts), and this means tables are not necessarily additive. Various probability schemes are possible, but an important characteristic is that they should be unbiased. This means there should be no net tendency to round up or down and the average difference from the original counts should be zero.

Example If we are rounding to base 3 the residual of the cell value after dividing by 3 can be either 0, 1 or 2.

  • If the residual is zero no change is made to the original cell value.
  • If the residual is 1, then with a probability of \(2/3\) the cell value is rounded down to the lower multiple of 3 and with a probability of \(1/3\) the cell value is rounded up to the higher multiple of 3.
  • If the residual is 2, the probabilities are \(2/3\) to round up and \(1/3\) to round down.
Original Value Rounded Value (probability)
0 0 (\(1\))
1 0 (\(2/3\)) or 3 (\(1/3\))
2 3 (\(2/3\)) or 0 (\(1/3\))
3 3 (\(1\))
4 3 (\(2/3\)) or 6 (\(1/3\))
5 6 (\(2/3\)) or 3 (\(1/3\))
6 6 (\(1\))
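This scheme generalises to any base \(b\): a cell with residual \(r\) is rounded down with probability \((b-r)/b\) and up with probability \(r/b\), so the expected rounded value equals the original count. A minimal base-R sketch (illustrative only):

```r
# Unbiased random rounding to base b: residual r is rounded up with probability r / b.
random_round <- function(x, b = 3) {
  r  <- x %% b
  up <- runif(length(x)) < r / b
  x - r + up * b
}

set.seed(42)
random_round(c(0, 1, 2, 3, 4, 5, 6), b = 3)
```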

Example As another example, Table 5.20 shows a possible solution for Table 5.18 using random rounding to base 5.

Male Female Total
Area A 0 0 0
Area B 5 0 5
Area C 10 20 35
Total 15 20 40
Table 5.20: Population counts by sex (with random rounding)

The main advantages of random rounding are that it is relatively easy to implement, it is unbiased, and it is clear and transparent to users. Table totals are consistent within the rounding base because the totals are rounded independently from the internal cells. All values of 1 and 2 are removed from the table by rounding, which prevents cases of perceived disclosure as well as actual disclosure. The method may also provide some protection against disclosure by differencing as rounding should obscure most of the exact differences between tables.

However, random rounding has disadvantages, including the increased information loss which results from the fact that all cells (even safe cells) are rounded. In some instances the protection can be ‘unpicked’, and in order to ensure adequate protection the resulting rounded tables need to be audited. Although the method is unbiased, after applying random rounding there may be inconsistencies in data within tables (e.g. rows or columns which do not add up: the internal cells of row 3 of Table 5.20 sum to 30, not to the rounded total of 35) and between tables (e.g. the same cell may be rounded to a different number in different tables).

Controlled rounding
Unlike other rounding methods, controlled rounding yields additive rounded tables. It is the statistical disclosure control method that is generally most effective for frequency tables. The method uses linear programming techniques to round cell values up or down by small amounts, and its strength over other methods is that additivity is maintained in the rounded table, (i.e. it ensures that the rounded values add up to the rounded totals and sub-totals shown in the table). This property not only permits the release of realistic tables which are as close as possible to the original table, but it also makes it impossible to reduce the protection by ‘unpicking’ the original values by exploiting the differences in the sums of the rounded values. Another useful feature is that controlled rounding can achieve specified levels of protection. In other words, the user can specify the degree of ambiguity added to the cells, for example, they may not want a rounded value within 10% of the true value. Controlled rounding can be used to protect flexible tables although the time taken to implement the method may make it unsuitable for this purpose.

Example Table 5.21 shows a possible rounding solution for Table 5.18, using controlled rounding to base 5.

Male Female Total
Area A 5 0 5
Area B 0 5 5
Area C 10 20 30
Total 15 25 40
Table 5.21: Population counts by sex (controlled rounding)

The disadvantages of controlled rounding are that it is a complicated method to implement, and it has difficulty coping with the size, scope and magnitude of census tabular outputs. Controlled rounding is implemented in the software package \(\tau\)‑ARGUS, see Section 5.5.1 below for detailed information. Nevertheless, it is hard to find controlled rounding solutions for sets of linked tables, and in order to find a solution cells may be rounded beyond the nearest rounding base. In this case users will know less about exactly how the table was rounded, and it is also likely to result in differing values for the same internal cells across different tables.

Semi-controlled rounding
Semi-controlled rounding also uses linear programming to round table entries up or down but in this case it controls for the overall total in the table, or it controls for each separate output area total. Other marginal and sub totals will not necessarily be additive. This ensures that either the overall total of the table is preserved (or the output area totals are all preserved), and the utility of this method is increased compared with conventional and random rounding. Consistent totals are provided across linked tables, and therefore the method can be used to protect flexible tables, although the time it takes to implement may make it unsuitable. Disadvantages of semi-controlled rounding relate to the fact that tables are not fully additive, and finding an optimal solution can prove difficult.

Conventional Rounding Controlled (and semi-controlled) Rounding Random rounding
Internal cells add to table totals (additivity) No Yes No
Method provides enough SDC protection (and cannot be unpicked) No Yes In some situations this method can be unpicked
Method is quick and easy to implement Yes It can take time for this method to find a solution Yes
Table 5.22: Summary of SDC rounding methods

There are some more specialised rounding methods which have been used at various times by NSIs to protect census data, one of these methods is described below.

Small cell adjustment was used (in addition to random swapping (a pre-tabular method)) to protect 2001 Census tabular outputs for England, Wales and Northern Ireland. This method was also used by the ABS to protect their tabular outputs from the 2001 Census.

Applying small cell adjustments involves randomly adjusting small cells within tables upwards or downwards to a base using an unbiased prescribed probability scheme. During the process:

  • small counts appearing in table cells are adjusted
  • totals and sub totals are calculated as the sum of the adjusted counts. This means all tables are internally additive.
  • tables are independently adjusted so counts of the same population which appear in two different tables, may not necessarily have the same value.
  • tables for higher geographical levels are independently adjusted, and therefore will not necessarily be the sum of the lower component geographical units.
  • output is produced from one database which has been adjusted for estimated undercount so the tables produced from this one database provide a consistent picture of this one population.

Advantages of this method are that tables are additive, and the elimination of small cells in the table removes cases of perceived as well as actual identity disclosure. In addition, loss of information is lower for standard tables as all other cells remain the same; however, information loss will be high for sparse tables. Other disadvantages include inconsistency of margins between linked tables, since margins are calculated using perturbed internal cells, and this increases the risk of tables being unpicked. Furthermore, this method provides little protection against disclosure by differencing, and is not suitable for flexible table generation.
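A minimal base-R sketch of one possible unbiased adjustment scheme (illustrative only; not the exact scheme used for any particular census) moves internal counts below the base to 0 or to the base with probabilities that preserve the expected value, after which margins are recomputed by summation:

```r
# Unbiased small cell adjustment: only internal cells with 0 < count < base are
# changed; P(round up to base) = count / base keeps the expected value unchanged.
adjust_small_cells <- function(x, base = 3) {
  small    <- x > 0 & x < base
  up       <- runif(length(x)) < x / base
  x[small] <- ifelse(up[small], base, 0)
  x
}

set.seed(7)
adj <- adjust_small_cells(c(0, 1, 2, 8, 15))
adj
sum(adj)   # the margin is recomputed from the adjusted internal cells
```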

5.5.1 Software - How to use Controlled Rounding in \(\tau\)‑ARGUS

\(\tau\)-ARGUS (Hundepool et al., 2014) is a software package which provides tools to protect tables against the risk of statistical disclosure (\(\tau\)-ARGUS is also discussed in Chapter 4). Controlled rounding is easy to use in \(\tau\)-ARGUS; the controlled rounding procedure (CRP) was developed by JJ Salazar. This procedure is based on optimisation techniques similar to the procedure developed for cell suppression. The CRP yields additive rounded tables, where the rounded values add up to the rounded totals and sub-totals shown in the table. This means realistic tables are produced and it makes it impossible to reduce the protection by “unpicking” the original values by exploiting the differences in the sums of the rounded values. The CRP implemented in \(\tau\)-ARGUS also allows the specification of hierarchical structures within the table variables.

Controlled rounding gives sufficient protection to small frequencies and creates uncertainty about the zero values (i.e. empty cells). (This is not the case for cell suppression as it is currently implemented in \(\tau\)-ARGUS.)

In zero-restricted controlled rounding, cell counts are left unchanged if they are multiples of the rounding base or shifted to one of the adjacent multiples of the rounding base. The modified values are chosen so that the sum of the absolute differences between the original values and the rounded ones is minimized (under an additivity constraint). Therefore, some values will be rounded up or down to the more distant of the two adjacent multiples of the base in order to help satisfy these constraints. In most cases a solution can be found, but in some cases it cannot, and the zero-restriction constraint in CRP can then be relaxed to allow cell values to be rounded to a nonadjacent multiple of the base. This relaxation is controlled by allowing the procedure to take a maximum number of steps.

For example, consider rounding a cell value of 7 when the rounding base equals 5. In zero-restricted rounding, the solution can be either 5 or 10. If 1 step is allowed, the solution can be 0, 5, 10 or 15. In general, let \(z\) be the integer to be rounded in base \(b\), then this number can be written as

\(z = u b + r\),

where \(ub\) is the lower adjacent multiple of \(b\) (hence \(u\) is the floor value of \(z/b\)) and \(r\) is the remainder. In the zero-restricted solution the rounded value, \(a\), can take values:

\[\begin{align} a \in \begin{cases} \{ub\} &\text{if}\quad r=0 \\ \{ub,\ (u+1)b\} &\text{if}\quad r\neq0. \end{cases} \end{align}\]

If \(K\) steps are allowed, then \(a\) can take values:

\[\begin{align} a \in \begin{cases} \{ \max(0,\, u+j)\, b \mid j = -K,\ldots,K \} &\text{if}\quad r=0\\ \{ \max(0,\, u+j)\, b \mid j = -K,\ldots,K+1 \} &\text{if}\quad r\neq0 \end{cases} \end{align}\]
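As a small illustration of these formulas (a sketch, not \(\tau\)-ARGUS functionality), the set of feasible rounded values for a count \(z\), base \(b\) and \(K\) allowed steps can be computed as follows:

```r
# Feasible rounded values for original count z, base b and K allowed steps
# (K = 0 corresponds to zero-restricted rounding).
allowed_rounded <- function(z, b, K = 0) {
  u <- z %/% b
  r <- z %% b
  j <- if (r == 0) (-K):K else (-K):(K + 1)
  unique(pmax(0, u + j) * b)
}

allowed_rounded(7, b = 5, K = 0)   # 5 10
allowed_rounded(7, b = 5, K = 1)   # 0 5 10 15
```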

5.5.1.1 Optimal, first feasible and RAPID solutions

For a given table there can exist more than one controlled rounded solution, and any of these solutions is a feasible solution. The Controlled Rounding Program embedded in \(\tau\)‑ARGUS determines the optimal solution by minimising the sum of the absolute distances of the rounded values from the original ones. Denoting the cell values, including the totals and sub-totals, with \(z_i\) and the corresponding rounded values with \(a_i\), the function that is minimised is

\[ \sum\limits_{i = 1}^{N} \left| z_{i}-a_{i} \right| , \]

where \(N\) is the number of cells in a table (including the marginal ones). The optimisation procedure for controlled rounding is a rather complex one (an NP-complete problem), so finding the optimal solution may take a long time for large tables. In fact, the algorithm iteratively builds different rounded tables until it finds the optimal solution. In order to limit the time required to obtain a solution, the algorithm can be stopped when the first feasible solution is found. In many cases, this solution is quite close to the optimal one and it can be found in significantly less time.

The RAPID solution is produced by CRP as an approximated solution when a feasible one cannot be found. This solution is obtained by rounding the internal cells to the closest multiple of the base and then computing the marginal cells by addition. This means that the computed marginal values can be many jumps away from the original value. However, a RAPID solution is produced at each iteration of the search for an optimal solution, and it will improve (in terms of the loss function) over time. \(\tau\)‑ARGUS allows the user to stop CRP after the first RAPID solution is produced, but this is likely to be very far away from the optimal one.

5.5.1.2 Protection provided by controlled rounding

The protection provided by controlled rounding can be assessed by considering the uncertainty about the true values that is achieved when releasing rounded values, that is, the existence interval that an intruder can compute for a rounded value. We assume that the values of the rounding base, \(b\), and the number of steps allowed, \(K\), are known by the user together with the output rounded table. Furthermore, we assume that it is known that the original values are nonnegative frequencies (hence nonnegative integers).

Zero-restricted rounding
Given a rounded value \(a\), an intruder can compute the following existence intervals for the true value \(z\):

\[\begin{align} z \in \begin{cases} [0,b-1] &\text{if} \quad a = 0 \\ [a-b+1,a+b-1] &\text{if} \quad a \neq 0 \end{cases} \end{align}\]

For example, if the rounding base is \(b=5\) and the rounded value is \(a=0\), a user can determine that the original value is between \(0\) and \(4\). If the rounded value is not \(0\), then users can determine that the true value lies within \(\pm 4\) units of the published value.

K-step rounding
As mentioned above, it is assumed that the number of steps allowed is released together with the rounded table. Let \(K\) be the number of steps allowed; then an intruder can compute the following existence intervals for the true value \(z\):

\[\begin{align} z \in \begin{cases} [0,(K+1)b-1] &\text{if} \quad a < (K+1)b \\ [a-(K+1)b+1,a+(K+1)b-1] &\text{if} \quad a \geq (K+1)b \end{cases} \end{align}\]

For example, assume that for controlled rounding with \(b=5\) and \(K=1\), \(a=15\), then a user can determine that \(z \in [ 6,24 ]\).
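A short sketch (illustrative only) that computes these existence intervals reproduces the examples above:

```r
# Existence interval for the true count z, given published rounded value a,
# base b and K allowed steps (K = 0 gives the zero-restricted case).
existence_interval <- function(a, b, K = 0) {
  w <- (K + 1) * b
  if (a < w) c(0, w - 1) else c(a - w + 1, a + w - 1)
}

existence_interval(0,  b = 5)          # 0 4   (zero-restricted, a = 0)
existence_interval(15, b = 5)          # 11 19 (zero-restricted, a != 0)
existence_interval(15, b = 5, K = 1)   # 6 24  (one step allowed)
```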

Very large tables
The procedure implemented in \(\tau\)‑ARGUS is capable of rounding tables up to 150K cells on an average computer. However for larger tables a partitioning procedure is available, which allows much larger tables to be rounded. Tables with over six million cells have been successfully rounded this way.

5.6 Targeted Record Swapping

Targeted Record Swapping (TRS) is a pre-tabular perturbation method. Its intended use is to apply a swapping procedure to the microdata before generating a table. Although it is applied solely to microdata, it is generally considered a protection method for tabular data and is not recommended for protecting microdata as such. TRS can be used for tables with and without spatial characteristics; the former case also includes grid data products or tables created by cross-tabulating with grid cells.
During the TRS the spatial character of the data can be taken into account to some degree.

5.6.1 The TRS noise mechanism

Regardless of the table, be it count data or a magnitude table, the methodology of the TRS does not change. This is a direct consequence of the fact that the method is applied to the underlying micro data before generating any table.

Consider population units \(i = 1, \ldots, N\), where each unit \(i\) has \(p\) characteristics or variables \(\mathbf{x}_{1},\ldots,\mathbf{x}_{p}\), collected in \(\mathbf{X} \in \mathbb{R}^{N\times p}\). Furthermore, there exists a geographic hierarchy \(\mathcal{G}^{1} \succ \mathcal{G}^{2} \succ \ldots \succ \mathcal{G}^{K}\), where each \(\mathcal{G}^{k}\) is a set of disjoint areas \(g_m^{k}\), \(m=1,\ldots,M_k\), and each \(g_m^{k}\) is further disjointly subdivided into smaller areas of the next level \(k+1\):

\[ \mathcal{G}^{k} = \{g_m^{k} \mid g_i^{k}\cap g_j^{k} = \varnothing \text{ for }i\neq j \} \quad \forall k = 1,\ldots,K \]

where

\[ g_m^{k} = \dot{\bigcup}_{l:\; g_l^{k+1}\subseteq g_m^{k}} g_l^{k+1} \quad \forall k = 1,\ldots,K-1,\; m = 1,\ldots,M_k \quad . \]

The notation \(a\dot{\cup}b\) refers to the disjoint union meaning that \(a\cap b = \varnothing\).

With the above definition each unit \(i\) in the population can be assigned to a single area \(g_{m_i}^{k}\) for each geographic hierarchy level \(\mathcal{G}^{k}\), \(k = 1,\ldots,K\). Consider as geographic hierarchy for example the NUTS regions, NUTS1 \(\succ\) NUTS2 \(\succ\) NUTS3, or grid cells, 1000m grid cells \(\succ\) 500m grid cells \(\succ\) 250m grid cells.

Given the geographic hierarchy levels \(\mathcal{G}^{k}\), \(k = 1,\ldots,K\), calculate for each unit \(i = 1, \ldots, N\) risk values \(r_{i,k}\). As an example one can choose \(k\)-anonymity as risk measure and a subset of \(Q\) variables \(\mathbf{x}_{q_1},\ldots,\mathbf{x}_{q_Q}\) to derive the risk values \(r_{i,k}\). They can be defined by counting the number of units \(j\) in the same area \(g_{m_i}^{k}\) which have the same values for the variables \(\mathbf{x}_{q_1},\ldots,\mathbf{x}_{q_Q}\) as unit \(i\), and taking the inverse.

\[ c_{i,k} = \sum\limits_{j=1}^N \mathbf{1}[g_{m_j}^{k} = g_{m_i}^{k},\; x_{i,q_1} = x_{j,q_1},\; x_{i,q_2} = x_{j,q_2},\; \ldots,\; x_{i,q_Q} = x_{j,q_Q}] \]

\[ r_{i,k} = \frac{1}{c_{i,k}} \]
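As a small illustration (a base-R sketch, not the sdcMicro interface), the counts \(c_{i,k}\) and risk values \(r_{i,k}\) at one hierarchy level can be obtained by counting how many units share the same quasi-identifier values within the same area:

```r
# Counts and risk values at one geographic hierarchy level, using area, age
# group and sex as the (illustrative) quasi-identifying variables.
micro <- data.frame(
  area = c("g1", "g1", "g1", "g2", "g2"),
  age  = c("20-29", "20-29", "30-39", "30-39", "30-39"),
  sex  = c("f", "f", "m", "f", "f")
)
key  <- interaction(micro$area, micro$age, micro$sex, drop = TRUE)
c_ik <- ave(rep(1, nrow(micro)), key, FUN = sum)  # units sharing area and key values
r_ik <- 1 / c_ik
cbind(micro, c_ik, r_ik)
```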

Having calculated the risk values \(r_{i,k}\) for each unit \(i\) and each geographic hierarchy level, the TRS can be defined as follows (a simplified code sketch of the core swapping steps is given after the list):

  1. Define initial, use-case specific parameters:
    • A global swap rate \(p\);
    • A risk threshold \(r_{high}\); at each geographic hierarchy level \(k\), all units with \(r_{i,k} \geq r_{high}\) are considered high-risk;
    • A subset of \(T\) variables \(\mathbf{x}_{t_1},\ldots,\mathbf{x}_{t_T}\) which are considered while swapping units;
  2. Begin at the first hierarchy level \(\mathcal{G}^{1}\) and select all units \(j\) for which \(r_{j,1} \geq r_{high}\).
  3. For each \(j\) select all units \(l_1,\ldots,l_L\), which do not belong to the same area \(g_{m_j}^{1}\) and have the same values for variables \(\mathbf{x}_{t_1},\ldots,\mathbf{x}_{t_T}\) as unit \(j\). In addition units \(l_1,\ldots,l_L\) cannot have been swapped already. \[ g_{m_j}^{1} \neq g_{m_l}^{1} \] \[ x_{j,t_1} = x_{l,t_1}, x_{j,t_2} = x_{l,t_2}, \ldots, x_{j,t_T} = x_{l,t_T} \]
  4. Sample for each \(j\) one unit from the set \(\{l \mid g_{m_j}^{1} \neq g_{m_l}^{1} \land x_{j,t_1} = x_{l,t_1} \land, \ldots, \land x_{j,t_T} = x_{l,t_T}\}\) by normalising corresponding risk value \(r_{l,1}\) and using them as sampling probabilities.
    • Previously swapped units should be excuded from this set.
  5. Swap all variables, holding geographic information in \(\mathbf{X}\), between unit \(j\) and the sampled unit.
    • Some implementation of targeted record swapping consider only swapping specific variable values from \(\mathbf{X}\) between \(j\) and the sampled unit.
  6. Iterate through the geographic hierarchies \(k=2,\ldots,K\) and repeat in each of them steps 3. - 5.
  7. At the final geographic hierarchy \(k=K\) if the number of already swapped units is less than \(p\times N\) additional units are swapped to reach \(p\times N\) overall swaps.

If the population units refer to people living in dwellings, and the aim is to swap whole dwellings with each other rather than individuals, it can be useful to set

\[ r_{i,k} = \max_{j \text{ living in same dwelling as }i}r_{j,k} \]

prior to applying the swapping procedure. In addition, the variables \(\mathbf{x}_{q_1},\ldots,\mathbf{x}_{q_Q}\) should be defined such that they refer to variables holding dwelling information.

The procedure described above is implemented in the R package sdcMicro as well as in the software \(\mu\)-ARGUS, alongside a multitude of parameters to fine-tune the procedure.
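For illustration, the following simplified Python sketch outlines the core loop of steps 2–6 above (the top-up to the global swap rate in step 7 is omitted). All names are hypothetical; this is neither the sdcMicro nor the \(\mu\)-ARGUS interface, and real implementations offer many more options:

import random

def targeted_record_swap(units, hierarchy, risk, r_high, swap_vars, geo_vars, seed=0):
    """units: list of dicts; hierarchy: ordered level names G^1 > ... > G^K;
    risk[(i, k)]: risk value r_{i,k}; swap_vars: x_{t_1},...,x_{t_T};
    geo_vars: the variables holding geographic information."""
    rng = random.Random(seed)
    swapped = set()
    for k, level in enumerate(hierarchy, start=1):
        # step 2: units exceeding the risk threshold at this hierarchy level
        at_risk = [i for i in range(len(units))
                   if i not in swapped and risk[(i, k)] >= r_high]
        for i in at_risk:
            if i in swapped:
                continue
            # step 3: donors in a different area with identical swap variables,
            # excluding units that have already been swapped
            cand = [j for j in range(len(units))
                    if j not in swapped and j != i
                    and units[j][level] != units[i][level]
                    and all(units[j][v] == units[i][v] for v in swap_vars)]
            if not cand:
                continue
            # step 4: sample one donor with probability proportional to r_{j,k}
            j = rng.choices(cand, weights=[risk[(jj, k)] for jj in cand], k=1)[0]
            # step 5: swap the geographic variables between unit i and donor j
            for v in geo_vars:
                units[i][v], units[j][v] = units[j][v], units[i][v]
            swapped.update({i, j})
    return units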

5.6.2 Pros and cons of targeted record swapping

As indicated by its name, the TRS swaps microdata records prior to building a table and specifically targets records that have a higher disclosure risk with respect to the final tables. The protection offered by the TRS lies in the uncertainty it introduces: an apparently identified unit \(i\) has a considerable chance of actually being a swapped unit, in which case the information derived from that record does not refer to the original unit \(i\). In general it is recommended to apply the TRS to the microdata set only once and afterwards build the various tables from the same perturbed microdata. This makes the trade-off between the number of records to swap and the utility of the final tables more pronounced. The swapping procedure can indirectly take the structure of the final tables into account through the risk values derived from the subset of variables \(\mathbf{x}_{q_1},\ldots,\mathbf{x}_{q_Q}\) and the choice of the geographic hierarchy. However, a large number of variables \(\mathbf{x}_{q_1},\ldots,\mathbf{x}_{q_Q}\) and a high resolution of the geographic hierarchy can result in a high share of units with high risk values and consequently in many potential swaps. A high swap rate, for instance beyond 10%, can quickly lead to high information loss, because the noise introduced through the swapping is not controlled for while drawing the swapped units. Thus it is not feasible to address all possible disclosure risk scenarios while at the same time maintaining high utility in the final tables.

As with any method it is advised to thoroughly tune parameters to balance information loss and disclosure risk. Possible tuning parameters are:

  • The geographic hierarchy and its depth of granularity;
  • The construction of the risk values \(r_{i,k}\) and the threshold \(r_{high}\);
  • The choice of the variables \(\mathbf{x}_{t_1},\ldots,\mathbf{x}_{t_T}\);
  • The global swap rate \(p\).

Because the perturbation is applied to the microdata before any tables are built, additivity between inner cells and marginal aggregates is always preserved.

5.7 Publication of mean values

In this section we explain why original means should not be published when the Cell Key Method (or any other non-additive, perturbative SDC method, such as rounding) is applied to the frequency counts. We illustrate this with an example and show how mean values can be published in a safer way. For this purpose we consider a certain population group of size \(n\), and for every person \(i\in \{1,\ldots ,n\}\) of this population we denote their age by \(x_i\). The average age of this population group can then be written as \(\frac{1}{n}\sum_{i=1}^{n} x_i\). We now consider the following example scenario, in which both (perturbed) frequencies and (original) mean values are published. Table 5.23 shows the perturbed and the original frequency counts, as well as information about the age.

Example

                                     Cell 1    Cell 2    Marginal
  Original Count (\(n\))                  8        12          20
  Perturbed Count (\(\hat{n}\))           9        14          19
  Original Sum of Ages (\(x\))           90        95         185
  Original Mean of Ages (\(x/n\))     11.25    7.9167        9.25

Table 5.23: An example table with (perturbed) ages and counts

Attackers now have the perturbed frequencies as well as the original mean values at their disposal. In our attack scenario we also assume that attackers know that the maximum deviation of a frequency count is 2. They can therefore conclude that the original counts of the inner cells must lie between 7 and 11 and between 12 and 16, respectively, and that the marginal must originally have had a value between 17 and 21. For an attacker, this results in the following possible combinations:

  • The marginal is originally 19 and the inner cells are
    • 7 and 12
  • The marginal is originally 20 and the inner cells are
    • 7 and 13
    • 8 and 12
  • The marginal is originally 21 and the inner cells are
    • 7 and 14
    • 8 and 13
    • 9 and 12

The attackers can now multiply the original mean values of the table cells known to them with the candidates for the associated frequencies calculated above. In this way they obtain an estimate of the original magnitude value of each cell. Summing the estimates of the inner cells yields a further estimate of the marginal magnitude. This can be used to identify the correct combination of frequency values, since for the correct combination the sum over the inner cell estimates is identical to the estimated marginal value, as shown in Table 5.24.

Example

  Cell 1   Cell 2   Marginal   Est. Cell 1   Est. Cell 2   Sum of Estimates   Est. Marginal
       7       12         19         78.75            95             173.75          175.75
       7       13         20         78.75       102.917            181.667             185
       8       12         20            90            95                185             185
       7       14         21         78.75       110.833            189.583          194.25
       8       13         21            90       102.917            192.917          194.25
       9       12         21        101.25            95             196.25          194.25

Table 5.24: Sample calculation for an attacker

Additionally, through this calculation, the associated magnitude values are now known as well. If these are confidential, a further problem arises. The publication of original mean values is therefore not recommended.
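The attack can be reproduced with a few lines of code. The following Python sketch (using the values of Table 5.23 and the exact mean \(95/12\) for Cell 2) enumerates all candidate counts within the assumed maximum deviation of 2 and keeps the additive combination whose summed inner-cell estimates match the estimated marginal:

from itertools import product

# published original means of Cell 1, Cell 2 and the marginal (Table 5.23)
mean_c1, mean_c2, mean_marg = 11.25, 95 / 12, 9.25
cand_c1 = range(7, 12)      # perturbed count 9,  maximum deviation 2
cand_c2 = range(12, 17)     # perturbed count 14, maximum deviation 2
cand_marg = range(17, 22)   # perturbed count 19, maximum deviation 2

for n1, n2, n in product(cand_c1, cand_c2, cand_marg):
    if n1 + n2 != n:
        continue  # the original counts must be additive
    est_sum = n1 * mean_c1 + n2 * mean_c2   # sum of estimated inner magnitudes
    est_marg = n * mean_marg                # estimated marginal magnitude
    if abs(est_sum - est_marg) < 1e-9:
        print("disclosed counts:", n1, n2, n)   # prints: 8 12 20

Only the original combination 8, 12 and 20 survives this check, exactly as in Table 5.24.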

Hence, when using the Cell Key Method for frequency tables, we recommend using the perturbed counts also when generating mean values: if \(n\) is an original count, \(\hat{n}\) the corresponding perturbed count and \(m\) the corresponding magnitude value as it gets published, then, in order to avoid the disclosure risk described here, it is better to calculate the mean as \(m/\hat{n}\).

5.8 Information loss measures

As described in Sections 5.3 to 5.6 there are a number of different disclosure control methods used to protect frequency tables. Each of these methods modifies the original data in the table in order to reduce the disclosure risk from small cells (0’s, 1’s and 2’s). However, the process of reducing disclosure risk results in information loss. Some quantitative information loss measures have been developed by Shlomo and Young (2005 & 2006) to determine the impact various statistical disclosure control (SDC) methods have on the original tables.

Information loss measures can be split into two classes: measures for data suppliers, used to make informed decisions about optimal SDC methods depending on the characteristics of the tables; and measures for users in order to facilitate adjustments to be made when carrying out statistical analysis on protected tables. Here we focus on measures for data suppliers. Measuring utility and quality for SDC methods is subjective. It depends on the users, the purpose of the statistical analysis, and on the type and format of the data itself. Therefore it is useful to have a range of information loss measures for assessing the impact of the SDC methods.

The focus here is information loss measures for tables containing frequency counts; however, some of these measures can easily be adapted to microdata. Magnitude or weighted sample tables will have the additional element of the number of contributors to each cell of the table.

When evaluating information loss measures for tables protected using cell suppression, one needs to decide on an imputation method for replacing the suppressed cells similar to what one would expect a user to do prior to analysing the data (i.e. we need to measure the difference between the observed and actual values, and for suppressed cells the observed values will be based on user inference about the possible cell values). A naive user might use zeros in place of the suppressed cells whereas a more sophisticated user might replace suppressed cells by some form of averaging of the total information that was suppressed, or by calculating feasibility intervals.
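As a small illustration, the following Python sketch (with made-up cell values) contrasts the two user-side imputation strategies mentioned above for a row with two suppressed cells and a published row total:

row = [15, None, 7, None]          # published row; None marks a suppressed cell
row_total = 30                     # published (unsuppressed) row total

# naive user: replace suppressed cells by zero
naive = [0 if v is None else v for v in row]

# more sophisticated user: spread the suppressed total evenly over suppressed cells
suppressed_total = row_total - sum(v for v in row if v is not None)
n_suppressed = sum(v is None for v in row)
averaged = [suppressed_total / n_suppressed if v is None else v for v in row]

print(naive)     # [15, 0, 7, 0]
print(averaged)  # [15, 4.0, 7, 4.0]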

A number of different information loss measures are described below, and more technical details can be found in Shlomo and Young (2005 & 2006).

  • An exact Binomial Hypothesis Test can be used to check if the realization of a random stochastic perturbation scheme, such as random rounding, follows the expected probabilities (i.e. the parameters of the method). For other SDC methods, a non-parametric signed rank test can be used to check whether the location of the empirical distribution has changed after the application of the SDC method.
  • Information loss measures that measure distortion of distributions are based on distance metrics between the original and perturbed cells. Some useful metrics are also presented in Gomatam and Karr (2003). A distance metric can be calculated for the internal cells of a table; when combining several tables one may want to calculate an overall average across the tables as the information loss measure. These distance metrics can also be calculated for totals or sub-totals of the tables (a small sketch follows after this list).
  • SDC methods will have an impact on the variance of the average cell size for the rows, columns or the entire table. The variance of the average cell size is examined before and after the SDC method has been applied. Another important variance to examine is the “between”-variance when carrying out a one-way ANOVA test based on the table. In ANOVA, we examine the means of a specific target variable within groupings defined by independent categorical variables. The goodness of fit statistic \(R^2\) for testing the null hypothesis that the means are equal across the groupings is based on the variance of the means between the groupings divided by the total variance. The information loss measure therefore examines the impact on the “between”-variance and whether the means of the groupings have become more homogenized or spread apart as a result of the SDC method.
  • Tests for independence between the categorical variables that span a table are another statistical analysis frequently carried out on tabular data. The test for independence for a two-way table is based on the Pearson chi-squared statistic, and the measure of association is Cramer’s V statistic. For multi-way tables, one can examine conditional dependencies and calculate expected cell frequencies based on the theory of log-linear models; the test statistic for the fit of the model is also based on a Pearson chi-squared statistic. SDC methods applied to tables may change the results of such statistical inferences, therefore we examine the impact on the test statistics before and after the application of the SDC method.
  • Another statistical tool for inference is Spearman’s rank correlation, a technique that tests the direction and strength of the relationship between two variables. The statistic is based on ranking both sets of data from the highest to the lowest. One important assessment of the impact of an SDC method on statistical data is therefore whether it distorts the rankings of the cell counts.
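The following Python sketch (with made-up cell counts) illustrates two such distance-based information loss measures: the average absolute distance per cell and the Hellinger distance between the original and perturbed cell distributions:

import math

orig = [8, 12, 3, 1, 0, 6]   # original internal cells
pert = [9, 14, 3, 0, 0, 5]   # perturbed internal cells

# average absolute distance per cell
aad = sum(abs(o - p) for o, p in zip(orig, pert)) / len(orig)

# Hellinger distance between the two cell distributions (on cell proportions)
po = [o / sum(orig) for o in orig]
pp = [p / sum(pert) for p in pert]
hd = math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(po, pp)))

print(round(aad, 3), round(hd, 3))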

In order to allow data suppliers to make informed decisions about optimal disclosure control methods, ONS has developed a user-friendly software application that calculates both disclosure risk measures based on small counts in tables and a wide range of information loss measures (as described above) for disclosure-controlled statistical data (Shlomo and Young, 2006). The software application also outputs R-U confidentiality maps.

5.9 Disclosure risk measures

Section 5.2 has introduced two basic concepts of disclosure risk in frequency tables: on the one hand, risks of identification (resulting from the rareness of an attribute combination, exhibited by a low frequency in a table cell), and, on the other hand, risks of attribute disclosure, in particular of group attribute disclosure. In Section 5.2 it has been observed that “…in order to protect against group attribute disclosure it is essential to introduce ambiguity in the zeros and ensure that all respondents do not fall into just one or a few categories.”

These types of risk are measurable, so we take a closer look at them. Lupp and Langsrud (2021) suggest a formal definition of group attribute disclosure. Assume that an attacker has knowledge of \(k\) contributors to the table (a so-called \(k\)-coalition), where \(k\) is a natural number; for example, the attacker may be one of the individuals contributing to the table and may also know the attributes of \(k-1\) other contributors. In the most basic case an attacker has no background knowledge whatsoever, i.e. \(k=0\). Another typical assumption is \(k=1\); in that case, attackers can use their background knowledge to disclose information about other units by removing themselves from the data and analysing the resulting table. In this formal definition, Lupp and Langsrud (2021) regard a cell \(c\) in a frequency table as directly disclosive with reference to \(k\) if there exists a published marginal cell \(p_c\) within a sensitive variable, such that

  • \(c\) is a cell contributing to \(p_c\), and
  • \(|p_c|-k\le |c|\),

where \(|t|\) denotes the number of units belonging to the cell \(t\). In other words, if a cell is directly disclosive with reference to \(k\), then there exists an attacker with knowledge of \(k\) table contributors who can deduce that all other units contributing to \(p_c\) must be in the cell \(c\). Therefore, the share of directly disclosive cells in the total number of cells can serve as a measure of primary disclosure risk.

The disclosure risk of identification (connected to low frequencies) can be measured as the share of cells with small frequencies in the total number of cells in a table. Antal, Shlomo and Elliot (2014) have proposed the following entropy-based measure of risk for population tables. According to their approach, a high entropy indicates that the distribution across cells is uniform, while a low entropy indicates mainly zeros in a row/column or a table with only a few non-zero cells; the fewer the non-zero cells, the more likely it is that attribute disclosure occurs. Let \(F_l\) be the population frequency of cell \(C_l\), \(l=1,2,\ldots,k\), where \(k\) now denotes the total number of cells. Let \(N\) be the population size, \(N=\sum_{l=1}^k{F_l}\), and let \(D=\{l\in\{1,2,\ldots,k\}:F_l=0\}\). The disclosure risk for a population-based frequency table is then defined as \[ r=w_1\frac{|D|}{k}+w_2\left(1-\frac{H}{\log k}\right)-w_3\frac{1}{\sqrt{N}}\log\frac{1}{e\cdot\sqrt{N}}, \] where \(|A|\) denotes the number of elements of the set \(A\), \(H\) is the entropy defined in (4.6.1) but adjusted to the current situation as \[ H=-\sum_{i=1}^{k}{\frac{F_i}{N}\cdot\log{\frac{F_i}{N}}}, \] and \(\mathbf{w}=(w_1,w_2,w_3)\) is a vector of weights (\(e\) is the base of the natural logarithm). This measure reflects all of the properties mentioned above.
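For illustration, the entropy-based measure \(r\) can be computed as in the following Python sketch (the cell frequencies and equal weights are made up; the convention \(0\cdot\log 0 = 0\) is used for empty cells):

import math

def risk_measure(freq, w1=1/3, w2=1/3, w3=1/3):
    k = len(freq)                 # total number of cells
    N = sum(freq)                 # population size
    D = [l for l, f in enumerate(freq) if f == 0]   # indices of zero cells
    H = -sum((f / N) * math.log(f / N) for f in freq if f > 0)
    return (w1 * len(D) / k
            + w2 * (1 - H / math.log(k))
            - w3 * (1 / math.sqrt(N)) * math.log(1 / (math.e * math.sqrt(N))))

print(round(risk_measure([8, 12, 0, 0, 5, 3]), 3))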

Assume now that the original table has been perturbed and denote the frequency of cell \(C_l\) in the perturbed table by \(F_l^{\#}\), \(l=1,2,\ldots,k\). Because the modified table carries more uncertainty, the proposed measure of disclosure risk in this case is \[ r^{\#}=w_1\left(\frac{|D|}{k}\right)^{\frac{|D\cup Q|}{|D\cap Q|}}+w_2\left(1-\frac{H}{\log k}\right)\left(1-\frac{H^{\#}}{H}\right)-w_3\frac{1}{\sqrt{N}}\log\frac{1}{e\cdot\sqrt{N}} \]

where

  • \(Q\) is the set of zeroes in the perturbed frequency table
  • \(H^{\#}=-\sum_{i=1}^{k}{\frac{F_i^{\#}}{N}\cdot\sum_{j=1}^{k}{\frac{F_{ij}^{\#}}{N\cdot F_j^{\#}}\cdot\log{\frac{F_{ij}^{\#}}{N\cdot F_j^{\#}}}}}\)
  • \(F_{ij}^{\#}\) is the number of units which belong to \(C_i\) before and to \(C_j\) after perturbation.

The disclosure risk of a perturbed table is expected to be lower than that of the original table. \(r^{\#}\) satisfies this requirement.

5.10 References

Antal, L., Shlomo, N., & Elliot, M. (2014) Measuring disclosure risk with entropy in population based frequency tables, In Privacy in Statistical Databases: UNESCO Chair in Data Privacy, International Conference, PSD 2014, Ibiza, Spain, September 17-19, 2014. Proceedings (pp. 62-78). Springer International Publishing.

Antal, L., & Shlomo, N., & Elliot, M. (2015), Disclosure Risk Measurement with Entropy in Sample Based Frequency Tables, New Techniques and Technologies for Statistics (NTTS) Reliable Evidence for a Society in Transition, Brussels, Belgium, 9-13 March 2015, https://cros-legacy.ec.europa.eu/system/files/Antal-etal_NTTS%202015%20abstract%20unblinded%20disclosure%20risk%20measurement.pdf.

Brown, D., (2003) Different approaches to disclosure control problems associated with geography, ONS, United Nations Statistical Commission and Economic Commission for Europe Conference of European Statisticians, Working Paper No. 14.

Costemalle, V. (2019). Detecting geographical differencing problems in the context of spatial data dissemination. Statistical Journal of the IAOS, vol. 35, No. 4, pp. 559-568.

Doyle, P., Lane, J.I., Theeuwes, J.J.M. and Zayatz, L. (2001). Confidentiality, Disclosure and Data Access: Theory and Practical Application for Statistical Agencies. Elsevier Science BV.

Duke-Williams, O. and Rees, P., (1998) Can Census Offices publish statistics for more than one small area geography? An analysis of the differencing problem in statistical disclosure, International Journal of Geographical Information Science, 12, 579-605.

Enderle, T., Giessing, S., Tent, R., (2020) Calculation of Risk Probabilities for the Cell Key Method. In: Domingo-Ferrer, J., Muralidhar, K. (eds) Privacy in Statistical Databases, Lecture Notes in Computer Science, vol 12276. Springer, Cham. https://doi.org/10.1007/978-3-030-57521-2_11

Gomatam, S. and A. Karr (2003), Distortion Measures for Categorical Data Swapping, Technical Report Number 131, National Institute of Statistical Sciences.

Lupp, D. P., & Langsrud, Ø. (2021). Suppression of directly-disclosive cells in frequency tables, In Joint UNECE/Eurostat Expert Meeting on Statistical Data Confidentiality, Poznań, 1-3 December 2021 (pp. 1-3)

Möhler, M., Jamme, J., De Jonge, E., Młodak, A., Gussenbauer, J., De Wolf, P.-P. (2024), Guidelines for Statistical Disclosure Control Methods Applied on Geo-Referenced Data, EU Grant agreement 899218 – 2019-BG-Methodology 'STACE project' , Deliverable D2.9, https://github.com/sdcTools/GeoSpatialGuidelinesSources/releases .

Salazar, J.J., Staggermeier, A. and Bycroft, C. (2005) Controlled rounding implementation, Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva.

Shlomo, N. (2007) Statistical Disclosure Control Methods for Census Frequency Tables. International Statistical Review, Volume 75, Number 2, August 2007, pp. 199-217. Blackwell Publishing.

Shlomo, N. and Young, C. (2005) Information Loss Measures for Frequency Tables, Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva.

Shlomo, N. and Young, C. (2006) Statistical Disclosure Control Methods Through a Risk-Utility Framework, PSD'2006 Privacy in Statistical Databases, Springer LNCS proceedings, to appear.

Willenborg, L. and de Waal, T. (1996) Statistical Disclosure Control in Practice. Lecture Notes in Statistics no 111, Springer-Verlag, New York.