Function makeProblem()
is used to create sdcProblem objects.
makeProblem(
data,
dimList,
dimVarInd = NULL,
freqVarInd = NULL,
numVarInd = NULL,
weightInd = NULL,
sampWeightInd = NULL
)
a data frame featuring at least one column for each desired dimensional variable. Optionally the input data can feature variables that contain information on cell counts, weights that should be used during the cut and branch algorithm, additional numeric variables or variables that hold information on sampling weights.
a (named) list where the names refer to variable names in
input data
. If the list is not named, it is required to specify argument
dimVarInd
. Each list element can be one of:
tree
: generated with hier_*()
functions from package sdcHierarchies
data.frame
: a two column data.frame
containing the full hierarchy of
a dimensional variable using a top-to-bottom approach. The format of this
data.frame
is as follows:
first column: a character vector specifying levels with each vector
element being a string only containing of @
s from length 1 to n.
If a vector element consists of i
-chars, the corresponding code
is of level i
. The code @
(one character) equals the grand
total (level=1), the code @@
(two characters) is of level 2 (directly
below the overall total).
second column: a character vector specifying level codes
path
: absolute or relative path to a .csv
file that
contains two columns seperated by semicolons (;
) having the same structure
as the "@;levelname"
-structure described above
if dimList
is a named list, this argument is
ignored (NULL
). Else either a numeric or character vector
defining the column indices or names of dimensional variables
(specifying the table) within argument data
are expected.
if not NULL
, a scalar numeric or character vector
defining the column index or variable name of a variable holding counts
in data
if not NULL
, a numeric or character vector
defining the column indices or variable names of additional numeric
variables with respect to data
if not NULL
, a scalar numeric or character vector
defining the column index or variable name holding costs within data
that should be used as objective coefficients when solving secondary
cell suppression problems.
if not NULL
, a scalar numeric or character vector
defining the column index or variable name of a variable holding sampling
weights within data
. In case a complete table is provided, this parameter is
ignored.
a sdcProblem object
# loading micro data
utils::data("microdata1", package = "sdcTable")
# we can observe that we have a micro data set consisting
# of two spanning variables ('region' and 'gender') and one
# numeric variable ('val')
# specify structure of hierarchical variable 'region'
# levels 'A' to 'D' sum up to a Total
dim.region <- data.frame(
levels=c('@','@@','@@','@@','@@'),
codes=c('Total', 'A','B','C','D'),
stringsAsFactors=FALSE)
# specify structure of hierarchical variable 'gender'
# using create_node() and add_nodes() (see ?manage_hierarchies)
dim.gender <- hier_create(root = "Total", nodes = c("male", "female"))
hier_display(dim.gender)
#> Total
#> ├─male
#> └─female
# create a named list with each element being a data-frame
# containing information on one dimensional variable and
# the names referring to variables in the input data
dimList <- list(region = dim.region, gender = dim.gender)
# third column containts a numeric variable
numVarInd <- 3
# no variables holding counts, numeric values, weights or sampling
# weights are available in the input data
# creating an problem instance using numeric indices
p1 <- makeProblem(
data = microdata1,
dimList = dimList,
numVarInd = 3 # third variable in `data`
)
# using variable names is also possible
p2 <- makeProblem(
data = microdata1,
dimList = dimList,
numVarInd = "val"
)
# what do we have?
print(class(p1))
#> [1] "sdcProblem"
#> attr(,"package")
#> [1] "sdcTable"
# have a look at the data
df1 <- sdcProb2df(p1, addDups = TRUE,
addNumVars = TRUE, dimCodes = "original")
df2 <- sdcProb2df(p2, addDups=TRUE,
addNumVars = TRUE, dimCodes = "original")
print(df1)
#> strID freq sdcStatus val region gender
#> 1: 0000 100 s 1284 Total Total
#> 2: 0001 55 s 802 Total male
#> 3: 0002 45 s 482 Total female
#> 4: 0100 20 s 198 A Total
#> 5: 0101 18 s 178 A male
#> 6: 0102 2 s 20 A female
#> 7: 0200 33 s 344 B Total
#> 8: 0201 14 s 140 B male
#> 9: 0202 19 s 204 B female
#> 10: 0300 22 s 224 C Total
#> 11: 0301 12 s 118 C male
#> 12: 0302 10 s 106 C female
#> 13: 0400 25 s 518 D Total
#> 14: 0401 11 s 366 D male
#> 15: 0402 14 s 152 D female
identical(df1, df2)
#> [1] TRUE