mlrCPO Core

Introduction

This vignette is a short reference of the primitives and tools supplied by the mlrCPO package.

Lifecycle of a CPO

CPOs are first-class objects in R that represent data manipulation. They can be combined to form networks of operations, they can be attached to mlr Learners, and they have tunable Hyperparameters that influence their behaviour. CPOs go through a lifecycle from CPOConstructor to CPO to a CPOTrained “retrafo” or “inverter” object. The different stages of a CPO-related object can be distinguished using getCPOClass(), which returns one of five values:

getCPOClass(cpoPca)
#> [1] "CPOConstructor"
getCPOClass(cpoPca())
#> [1] "CPO"
getCPOClass(pid.task %>|% cpoPca())
#> [1] "CPORetrafo"
getCPOClass(inverter(bh.task %>>% cpoLogTrafoRegr()))
#> [1] "CPOInverter"
getCPOClass(NULLCPO)
#> [1] "NULLCPO"

CPOConstructor

CPOs are created using CPOConstructors. These are R functions with a print function and many parameters in common.

print(cpoAsNumeric)  # example CPOConstructor
#> <<CPO as.numeric()>>
print(cpoAsNumeric, verbose = TRUE)  # alternative: !cpoAsNumeric
#> <<CPO as.numeric()>>
#> 
#> cpo.retrafo:
#> function (data) 
#> {
#>     as.data.frame(lapply(data, as.numeric), row.names = rownames(data))
#> }
#> <environment: namespace:mlrCPO>
class(cpoAsNumeric)
#> [1] "CPOConstructor" "function"
getCPOName(cpoPca)  # same as getCPOName() of the *constructed* CPO
#> [1] "pca"
getCPOClass(cpoPca)
#> [1] "CPOConstructor"

The function parameters of a CPOConstructor:

set the CPO Hyperparameters
set the CPO id (defaults to the CPO’s name)
restrict the data columns a CPO operates on (affect.* parameters)
control which of the CPO’s hyperparameters are “exported”, i.e. can later be manipulated using setHyperPars().

names(formals(cpoPca))
#>  [1] "center"                     "scale"                     
#>  [3] "tol"                        "rank"                      
#>  [5] "id"                         "export"                    
#>  [7] "affect.type"                "affect.index"              
#>  [9] "affect.names"               "affect.pattern"            
#> [11] "affect.invert"              "affect.pattern.ignore.case"
#> [13] "affect.pattern.perl"        "affect.pattern.fixed"
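
As a minimal sketch of how these constructor parameters combine (assuming the mlr and mlrCPO packages are installed; the id "sc" and the pattern "^Sepal" are arbitrary choices for illustration), a hyperparameter, an id, and a column restriction can all be given in a single constructor call:

```r
library(mlr)
library(mlrCPO)

# Hyperparameter, id, and affect.* arguments are all passed to the constructor.
cpo = cpoScale(center = FALSE, id = "sc", affect.pattern = "^Sepal")

getCPOId(cpo)      # the id given at construction
getHyperPars(cpo)  # exported hyperparameters, prefixed with the id
getCPOAffect(cpo)  # the column restriction that was set
```

The id is used as the prefix of the exported hyperparameter names, which is what allows them to be addressed later through setHyperPars().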

CPO

(cpo = cpoScale()) # construct CPO with default Hyperparameter values
#> scale(center = TRUE, scale = TRUE)
print(cpo, verbose = TRUE)  # detailed printing. Alternative: !cpo
#> Trafo chain of 1 cpos:
#> scale(center = TRUE, scale = TRUE)
#> Operating: feature
#> ParamSet:
#>                 Type len  Def Constr Req Tunable Trafo
#> scale.center logical   - TRUE      -   -    TRUE     -
#> scale.scale  logical   - TRUE      -   -    TRUE     -
class(cpo)  # CPOs that are not compound are "CPOPrimitive"
#> [1] "CPOPrimitive" "CPO"
getCPOClass(cpo)
#> [1] "CPO"

Functions that work on CPOs

The inner “state” of a CPO can be inspected and manipulated using various getters and setters.

getParamSet(cpo)
#>                 Type len  Def Constr Req Tunable Trafo
#> scale.center logical   - TRUE      -   -    TRUE     -
#> scale.scale  logical   - TRUE      -   -    TRUE     -
getHyperPars(cpo)
#> $scale.center
#> [1] TRUE
#> 
#> $scale.scale
#> [1] TRUE
setHyperPars(cpo, scale.center = FALSE)
#> scale(center = FALSE, scale = TRUE)
getCPOId(cpo)
#> [1] "scale"
setCPOId(cpo, "MYID")
#> MYID<scale>(center = TRUE, scale = TRUE)
getCPOName(cpo)
#> [1] "scale"
getCPOAffect(cpo)  # empty, since no affect set
#> named list()
getCPOAffect(cpoPca(affect.pattern = "Width$"))
#> $pattern
#> [1] "Width$"
getCPOConstructor(cpo)  # the constructor used to create the CPO
#> <<CPO scale(center = TRUE, scale = TRUE)>>
getCPOProperties(cpo)  # see properties explanation below
#> $handling
#>  [1] "numerics"   "factors"    "ordered"    "missings"   "cluster"   
#>  [6] "classif"    "multilabel" "regr"       "surv"       "oneclass"  
#> [11] "twoclass"   "multiclass" "prob"       "se"        
#> 
#> $adding
#> character(0)
#> 
#> $needed
#> character(0)
getCPOPredictType(cpo)
#>   response       prob         se 
#> "response"     "prob"       "se"
getCPOClass(cpo)
#> [1] "CPO"
getCPOOperatingType(cpo)  # Operating on feature, target, retrafoless?
#> [1] "feature"

Compare the predict type and operating type of a TOCPO or ROCPO:

getCPOPredictType(cpoResponseFromSE())
#> response       se 
#>     "se"     "se"
getCPOOperatingType(cpoResponseFromSE())
#> [1] "target"
getCPOOperatingType(cpoSample())
#> [1] "retrafoless"

The identicalCPO() function checks whether the underlying operation of two CPOs is identical. In this sense, CPOs with different hyperparameters can still be “identical”.

identicalCPO(cpoScale(scale = TRUE), cpoScale(scale = FALSE))
#> [1] TRUE
identicalCPO(cpoScale(), cpoPca())
#> [1] FALSE

CPO Application

CPOs can be applied to data.frame and Task objects using %>>% or applyCPO.

head(iris) %>>% cpoPca()
#>   Species        PC1          PC2         PC3           PC4
#> 1  setosa -0.1634147  0.017230444 -0.11038321 -0.0231625616
#> 2  setosa  0.3324970 -0.189351624 -0.08152883  0.0005612917
#> 3  setosa  0.3268659  0.101103375 -0.02238439  0.0464537730
#> 4  setosa  0.4202367  0.005523981  0.17106514 -0.0222757931
#> 5  setosa -0.1768684  0.140149101 -0.04185224 -0.0194870755
#> 6  setosa -0.7393165 -0.074655279  0.08508352  0.0179103657
task = applyCPO(cpoPca(), iris.task)
head(getTaskData(task))
#>   Species       PC1        PC2         PC3          PC4
#> 1  setosa -2.684126 -0.3193972  0.02791483  0.002262437
#> 2  setosa -2.714142  0.1770012  0.21046427  0.099026550
#> 3  setosa -2.888991  0.1449494 -0.01790026  0.019968390
#> 4  setosa -2.745343  0.3182990 -0.03155937 -0.075575817
#> 5  setosa -2.728717 -0.3267545 -0.09007924 -0.061258593
#> 6  setosa -2.280860 -0.7413304 -0.16867766 -0.024200858

CPO Composition

CPO composition can be done using %>>% or composeCPO. It results in a new CPO which mostly behaves like a primitive CPO. Exceptions are:

Compound CPOs have no id
Affect of compound CPOs cannot be retrieved

scale = cpoScale()
pca = cpoPca()

compound = scale %>>% pca
composeCPO(scale, pca)  # same
#> (scale >> pca)(scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
class(compound)
#> [1] "CPOPipeline" "CPO"
!compound
#> Trafo chain of 2 cpos:
#> scale(center = TRUE, scale = TRUE)
#> Operating: feature
#> ParamSet:
#>                 Type len  Def Constr Req Tunable Trafo
#> scale.center logical   - TRUE      -   -    TRUE     -
#> scale.scale  logical   - TRUE      -   -    TRUE     -
#>   ====>
#> pca(center = TRUE, scale = FALSE)[not exp'd: tol = <NULL>, rank = <NULL>]
#> Operating: feature
#> ParamSet:
#>               Type len   Def Constr Req Tunable Trafo
#> pca.center logical   -  TRUE      -   -    TRUE     -
#> pca.scale  logical   - FALSE      -   -    TRUE     -
getCPOName(compound)
#> [1] "pca.scale"
getHyperPars(compound)
#> $scale.center
#> [1] TRUE
#> 
#> $scale.scale
#> [1] TRUE
#> 
#> $pca.center
#> [1] TRUE
#> 
#> $pca.scale
#> [1] FALSE
setHyperPars(compound, scale.center = TRUE, pca.center = FALSE)
#> (scale >> pca)(scale.center = TRUE, scale.scale = TRUE, pca.center = FALSE, pca.scale = FALSE)

getCPOId(compound)  # error: no ID for compound CPOs
#> Error in getCPOId.CPO(compound): Compound CPOs have no IDs.
getCPOAffect(compound)  # error: no affect for compound CPOs
#> Error in getCPOAffect.CPO(compound): Compound CPOs have no affect arguments.

getCPOOperatingType() always considers the operating type of the whole CPO chain and may return multiple values:

getCPOOperatingType(NULLCPO)
#> character(0)
getCPOOperatingType(cpoScale())
#> [1] "feature"
getCPOOperatingType(cpoScale() %>>% cpoLogTrafoRegr() %>>% cpoSample())
#> [1] "feature"     "target"      "retrafoless"

Compound CPO Chaining and Decomposition

Composite CPO objects can be broken into their constituent primitive CPOs using as.list(). The inverse of this operation is pipeCPO(), which composes a list of CPOs in the given order.

as.list(compound)
#> [[1]]
#> scale(center = TRUE, scale = TRUE)
#> 
#> [[2]]
#> pca(center = TRUE, scale = FALSE)[not exp'd: tol = <NULL>, rank = <NULL>]
pipeCPO(as.list(compound))  # pipeCPO: (list of CPO) -> CPO
#> (scale >> pca)(scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)
pipeCPO(list())
#> NULLCPO
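
The round trip between a list of CPOs and a compound CPO can be sketched as follows. This is only an illustration, assuming mlr and mlrCPO are installed; the variable names are made up for the example.

```r
library(mlr)
library(mlrCPO)

# Build a pipeline from a list of primitive CPOs, in the given order.
cpos = list(cpoScale(), cpoPca(), cpoAsNumeric())
pipeline = pipeCPO(cpos)

# Decompose it again: one list entry per primitive CPO.
parts = as.list(pipeline)
length(parts)
getCPOName(parts[[1]])
getCPOName(parts[[2]])
```

Building a pipeline from a list is convenient when the sequence of preprocessing steps is assembled programmatically.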

CPOLearner

CPO-Learner attachment works using %>>% or attachCPO.

lrn = makeLearner("classif.logreg")
(cpolrn = cpo %>>% lrn)  # the new learner has the CPO hyperparameters
#> Learner classif.logreg.scale from package stats
#> Type: classif
#> Name: ; Short name: 
#> Class: CPOLearner
#> Properties: numerics,factors,twoclass,prob
#> Predict-Type: response
#> Hyperparameters: model=FALSE
attachCPO(compound, lrn)  # attaching compound CPO
#> Learner classif.logreg.pca.scale from package stats
#> Type: classif
#> Name: ; Short name: 
#> Class: CPOLearner
#> Properties: numerics,factors,twoclass,prob
#> Predict-Type: response
#> Hyperparameters: model=FALSE

The new object is a CPOLearner, which performs the operation given by the CPO before training the Learner.

class(lrn)
#> [1] "classif.logreg"  "RLearnerClassif" "RLearner"        "Learner"

The work performed by a CPOLearner can also be performed manually:

lrn = cpoLogTrafoRegr() %>>% makeLearner("regr.lm")
model = train(lrn, subsetTask(bh.task, 1:300))
predict(model, subsetTask(bh.task, 301:500))
#> Prediction: 200 observations
#> predict.type: response
#> threshold: 
#> time: 0.00
#>     id truth response
#> 301  1  24.8 28.69715
#> 302  2  22.0 27.89821
#> 303  3  26.4 28.33370
#> 304  4  33.1 33.80868
#> 305  5  36.1 34.93957
#> 306  6  28.4 28.77130
#> ... (#rows: 200, #cols: 3)

is equivalent to

trafo = subsetTask(bh.task, 1:300) %>>% cpoLogTrafoRegr()
model = train("regr.lm", trafo)

newdata = subsetTask(bh.task, 301:500) %>>% retrafo(trafo)
pred = predict(model, newdata)
invert(inverter(newdata), pred)
#> Prediction: 200 observations
#> predict.type: response
#> threshold: 
#> time: 0.00
#>     id truth response
#> 301  1  24.8 28.69715
#> 302  2  22.0 27.89821
#> 303  3  26.4 28.33370
#> 304  4  33.1 33.80868
#> 305  5  36.1 34.93957
#> 306  6  28.4 28.77130
#> ... (#rows: 200, #cols: 3)

CPOLearner Decomposition

It is possible to obtain both the underlying Learner and the attached CPO from a CPOLearner. Note that this does not work if a CPOLearner is wrapped by some method (e.g. a TuneWrapper), since mlrCPO cannot probe below the first wrapping layer.

getLearnerCPO(cpolrn)  # the CPO
#> scale(center = TRUE, scale = TRUE)
getLearnerBare(cpolrn)  # the Learner
#> Learner classif.logreg from package stats
#> Type: classif
#> Name: Logistic Regression; Short name: logreg
#> Class: classif.logreg
#> Properties: twoclass,numerics,factors,prob,weights
#> Predict-Type: response
#> Hyperparameters: model=FALSE
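
One possible use of this decomposition, shown here as a sketch under the assumption that mlr and mlrCPO are installed, is to take a CPOLearner apart, change the CPO, and attach it again. The variable names are illustrative only.

```r
library(mlr)
library(mlrCPO)

cpolrn = cpoScale() %>>% makeLearner("classif.logreg")

# Split the CPOLearner into its two parts.
cpo.part = getLearnerCPO(cpolrn)
lrn.part = getLearnerBare(cpolrn)

# Modify the CPO and re-attach it to the bare Learner.
rebuilt = setHyperPars(cpo.part, scale.center = FALSE) %>>% lrn.part
getLearnerCPO(rebuilt)
```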

CPOTrained

CPOs perform data-dependent operations. However, when such an operation becomes part of a machine-learning process, the operation on predict-data must depend only on the training data. A CPORetrafo object represents the re-application of a trained CPO. A CPOInverter object represents the transformation of a prediction made on a transformed task back to the form of the original data.

The CPOTrained objects generated by application of a CPO (or application of another CPOTrained) can be retrieved using the retrafo() or the inverter() function.

transformed = iris %>>% cpoScale()
head(transformed)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa
(ret = retrafo(transformed))
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)]

head(getTaskTargets(bh.task))
#> [1] 24.0 21.6 34.7 33.4 36.2 28.7
transformed = bh.task %>>% cpoLogTrafoRegr()
head(getTaskTargets(transformed))
#> [1] 3.178054 3.072693 3.546740 3.508556 3.589059 3.356897
(inv = inverter(transformed))
#> CPO Inverter chain {type:regr} (able to predict 'response', 'se')
#> [INVERTER fun.apply.regr.target(){type:regr}]
head(invert(inv, getTaskTargets(transformed)))
#> [1] 24.0 21.6 34.7 33.4 36.2 28.7

Retrafos and inverters are stored as attributes:

attributes(transformed)
#> $names
#> [1] "type"        "env"         "weights"     "blocking"    "coordinates"
#> [6] "task.desc"  
#> 
#> $class
#> [1] "RegrTask"       "SupervisedTask" "Task"          
#> 
#> $retrafo
#> CPO Retrafo / Inverter chain {type:regr} (able to predict 'response', 'se')
#> [RETRAFO fun.apply.regr.target(){type:regr}]
#> 
#> $inverter
#> CPO Inverter chain {type:regr} (able to predict 'response', 'se')
#> [INVERTER fun.apply.regr.target(){type:regr}]

It is possible to set the "retrafo" and "inverter" attributes of an object using retrafo() and inverter(). This can be useful for writing elegant scripts, especially since CPOTrained are automatically chained. To delete the CPOTrained attribute of an object, set it to NULL or NULLCPO, or use clearRI().

bh2 = bh.task
retrafo(bh2) = ret
attributes(bh2)
#> $names
#> [1] "type"        "env"         "weights"     "blocking"    "coordinates"
#> [6] "task.desc"  
#> 
#> $class
#> [1] "RegrTask"       "SupervisedTask" "Task"          
#> 
#> $retrafo
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)]

retrafo(bh2) = NULLCPO
# equivalent:
# retrafo(bh2) = NULL
attributes(bh2)
#> $names
#> [1] "type"        "env"         "weights"     "blocking"    "coordinates"
#> [6] "task.desc"  
#> 
#> $class
#> [1] "RegrTask"       "SupervisedTask" "Task"

# clearRI returns the object without retrafo or inverter attributes
bh3 = clearRI(transformed)
attributes(bh3)
#> $names
#> [1] "type"        "env"         "weights"     "blocking"    "coordinates"
#> [6] "task.desc"  
#> 
#> $class
#> [1] "RegrTask"       "SupervisedTask" "Task"

Functions that work on CPOTrained

The following general methods inspect the properties of a CPOTrained object. Many methods that work on a CPO also work on a CPOTrained and give the same result.

getCPOName(ret)
#> [1] "scale"
getParamSet(ret)
#>           Type len  Def Constr Req Tunable Trafo
#> center logical   - TRUE      -   -    TRUE     -
#> scale  logical   - TRUE      -   -    TRUE     -
getHyperPars(ret)
#> $center
#> [1] TRUE
#> 
#> $scale
#> [1] TRUE
getCPOProperties(ret)
#> $handling
#>  [1] "numerics"   "factors"    "ordered"    "missings"   "cluster"   
#>  [6] "classif"    "multilabel" "regr"       "surv"       "oneclass"  
#> [11] "twoclass"   "multiclass" "prob"       "se"        
#> 
#> $adding
#> character(0)
#> 
#> $needed
#> character(0)
getCPOPredictType(ret)
#>   response       prob         se 
#> "response"     "prob"       "se"
getCPOOperatingType(ret)  # Operating on feature, target, both?
#> [1] "feature"
getCPOOperatingType(inv)
#> [1] "target"

A CPOTrained has information about whether it can be used as a CPORetrafo object (and be applied to new data using %>>%), or as a CPOInverter object (and used by invert()), or possibly both. This is given by getCPOTrainedCapability(), which returns a 1 if the object has an effect in the given role, 0 if the object has no effect (but can be used), or -1 if the object can not be used in the role.

getCPOTrainedCapability(ret)
#> retrafo  invert 
#>       1       0
getCPOTrainedCapability(inv)
#> retrafo  invert 
#>      -1       1
getCPOTrainedCapability(NULLCPO)
#> retrafo  invert 
#>       0       0
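
The capability vector can be used to decide programmatically what a CPOTrained may be used for. The helper usableRoles() below is hypothetical (it is not part of mlrCPO) and only sketches the idea, assuming mlr and mlrCPO are installed: every role with a capability of 0 or 1 is usable.

```r
library(mlr)
library(mlrCPO)

# Hypothetical helper: names of the roles a CPOTrained can be used in,
# i.e. those whose capability is not -1.
usableRoles = function(trained) {
  cap = getCPOTrainedCapability(trained)
  names(cap)[cap >= 0]
}

ret = retrafo(iris %>>% cpoScale())
inv = inverter(bh.task %>>% cpoLogTrafoRegr())

usableRoles(ret)  # usable in both roles, although inverting has no effect
usableRoles(inv)  # usable for inversion only
```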

The “CPO class” of a CPOTrained is determined by this as well. A pure inverter is CPOInverter, an object that can be used for retrafo is a CPORetrafo.

getCPOClass(ret)
#> [1] "CPORetrafo"
getCPOClass(inv)
#> [1] "CPOInverter"

The CPO and the CPOConstructor used to create the CPOTrained can be queried.

getCPOTrainedCPO(ret)
#> scale(center = TRUE, scale = TRUE)
getCPOConstructor(ret)
#> <<CPO scale(center = TRUE, scale = TRUE)>>

CPOTrained Inspection

CPOTrained objects can be inspected using getCPOTrainedState(). The state contains the hyperparameters, the control object (CPO-dependent data holding the information needed to re-apply the operation), and information about the Task / data.frame layout used for training (column names, column types) in data$shapeinfo.input and data$shapeinfo.output.

The state can be manipulated and used to create new CPOTraineds, using makeCPOTrainedFromState().

(state = getCPOTrainedState(retrafo(iris %>>% cpoScale())))
#> $center
#> [1] TRUE
#> 
#> $scale
#> [1] TRUE
#> 
#> $control
#> $control$center
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>     5.843333     3.057333     3.758000     1.199333 
#> 
#> $control$scale
#> Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
#>    0.8280661    0.4358663    1.7652982    0.7622377 
#> 
#> 
#> $data
#> $data$shapeinfo.input
#> <ShapeInfo (input) Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num, Species: fac>
#> 
#> $data$shapeinfo.output
#> <ShapeInfo (output)>:
#> numeric:
#> <ShapeInfo Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num>
#> factor:
#> <ShapeInfo Species: fac>
#> other:
#> <ShapeInfo (empty)>
state$control$center[1] = 1000  # will now subtract 1000 from the first column
new.retrafo = makeCPOTrainedFromState(cpoScale, state)
head(iris %>>% new.retrafo)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1    -1201.474  1.01560199    -1.335752   -1.311052  setosa
#> 2    -1201.716 -0.13153881    -1.335752   -1.311052  setosa
#> 3    -1201.957  0.32731751    -1.392399   -1.311052  setosa
#> 4    -1202.078  0.09788935    -1.279104   -1.311052  setosa
#> 5    -1201.595  1.24503015    -1.335752   -1.311052  setosa
#> 6    -1201.112  1.93331463    -1.165809   -1.048667  setosa
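
Without any modification, the state round trip should reproduce the original retrafo. The following sketch (assuming mlr and mlrCPO are installed; variable names are illustrative) rebuilds a CPOTrained from an unchanged state and compares one transformed column:

```r
library(mlr)
library(mlrCPO)

ret = retrafo(iris %>>% cpoScale())

# Extract the state and rebuild a CPOTrained from it, unchanged.
state = getCPOTrainedState(ret)
rebuilt = makeCPOTrainedFromState(cpoScale, state)

original.result = head(iris) %>>% ret
rebuilt.result = head(iris) %>>% rebuilt
all.equal(original.result$Sepal.Length, rebuilt.result$Sepal.Length)
```

This can be useful for storing only the state of a trained preprocessing step and restoring the retrafo later.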

CPOTrained are Automatically Chained

When executing data %>>% CPO, the result has an associated CPORetrafo and CPOInverter object. When applying another CPO, the CPORetrafo and CPOInverter will be chained automatically. This is to make (data %>>% CPO1) %>>% CPO2 work the same as data %>>% (CPO1 %>>% CPO2).

data = head(iris) %>>% cpoPca()
retrafo(data)
#> CPO Retrafo chain
#> [RETRAFO pca(center = TRUE, scale = FALSE)]
data2 = data %>>% cpoScale()

retrafo(data2) is the same as retrafo(data %>>% pca %>>% scale):

retrafo(data2)
#> CPO Retrafo chain
#> [RETRAFO pca(center = TRUE, scale = FALSE)] =>
#> [RETRAFO scale(center = TRUE, scale = TRUE)]
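
The equivalence of stepwise and compound application can be checked with a short sketch like the following (assuming mlr and mlrCPO are installed; variable names are illustrative). Both variants should yield a retrafo chain with two elements:

```r
library(mlr)
library(mlrCPO)

# Apply the CPOs one after the other; the retrafos are chained automatically.
stepwise = (iris %>>% cpoPca()) %>>% cpoScale()

# Apply the compound CPO in one go.
at.once = iris %>>% (cpoPca() %>>% cpoScale())

length(as.list(retrafo(stepwise)))
length(as.list(retrafo(at.once)))
```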

To interrupt this chain, set retrafo to NULL either explicitly, or using clearRI().

data = clearRI(data)
data2 = data %>>% cpoScale()
retrafo(data2)
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)]

This is equivalent to

retrafo(data) = NULL
inverter(data) = NULL
data3 = data %>>% cpoScale()
retrafo(data3)
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)]

CPOTrained Composition, Decomposition, and Chaining

CPOTrained can be composed using %>>% and pipeCPO(), just like CPOs. They can also be split apart into primitive parts using as.list. It is recommended to only chain CPOTrained objects if they were created in the given order by preprocessing operations, since CPOTraineds are very dependent on their position within a preprocessing pipeline.

compound.retrafo = retrafo(head(iris) %>>% compound)
compound.retrafo
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)] =>
#> [RETRAFO pca(center = TRUE, scale = FALSE)]

(retrafolist = as.list(compound.retrafo))
#> [[1]]
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)]
#> 
#> [[2]]
#> CPO Retrafo chain
#> [RETRAFO pca(center = TRUE, scale = FALSE)]

retrafolist[[1]] %>>% retrafolist[[2]]
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)] =>
#> [RETRAFO pca(center = TRUE, scale = FALSE)]
pipeCPO(retrafolist)
#> CPO Retrafo chain
#> [RETRAFO scale(center = TRUE, scale = TRUE)] =>
#> [RETRAFO pca(center = TRUE, scale = FALSE)]

Application of CPOTrained

Similarly to CPOs, CPOTrained objects can be applied to data using %>>%, applyCPO, or predict. This only works with objects that have the "retrafo" capability and hence the CPORetrafo class.

transformed = iris %>>% cpoScale()
head(iris) %>>% retrafo(transformed)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa

This should in general give the same result as head(transformed), since the same data was used:

head(transformed)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1   -0.8976739  1.01560199    -1.335752   -1.311052  setosa
#> 2   -1.1392005 -0.13153881    -1.335752   -1.311052  setosa
#> 3   -1.3807271  0.32731751    -1.392399   -1.311052  setosa
#> 4   -1.5014904  0.09788935    -1.279104   -1.311052  setosa
#> 5   -1.0184372  1.24503015    -1.335752   -1.311052  setosa
#> 6   -0.5353840  1.93331463    -1.165809   -1.048667  setosa

applyCPO() and predict() are synonyms of %>>% when used for CPORetrafo objects:

applyCPO(retrafo(transformed), head(iris))
predict(retrafo(transformed), head(iris))
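
A typical use of a CPORetrafo is to preprocess held-out data in exactly the way the training data was preprocessed. The following is a minimal sketch, assuming mlr and mlrCPO are installed; the split into odd and even rows is arbitrary and only for illustration.

```r
library(mlr)
library(mlrCPO)

# Arbitrary split: odd rows for training, even rows for testing.
train.idx = seq(1, nrow(iris), by = 2)

# Scaling and PCA parameters are estimated on the training rows only.
train.data = iris[train.idx, ] %>>% cpoScale() %>>% cpoPca()

# Re-apply the trained operations to the held-out rows.
test.data = iris[-train.idx, ] %>>% retrafo(train.data)

colnames(test.data)
```

The test rows are transformed using the centers, scales, and rotation obtained from the training rows, so no information from the test data enters the preprocessing.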

Inversion using CPOTrained

To use CPOTrained objects for inversion, the invert() function is used. Besides the CPOTrained, it takes the data to invert, and optionally the predict.type. Typically CPOTrained objects that were retrieved using inverter() from a transformed dataset should be used for inversion. Retrafo CPOTrained objects retrieved from a transformed data set using retrafo() sometimes have both the "retrafo" as well as the "invert" capability (precisely when all TOCPOs used had the constant.invert flag set, see Building Custom CPOs) and can then also be used for inversion. In that case, however, the "truth" column of an inverted prediction is dropped.

transformed = bh.task %>>% cpoLogTrafoRegr()
prediction = predict(train("regr.lm", transformed), transformed)
inv = inverter(transformed)
invert(inv, prediction)
#> Prediction: 506 observations
#> predict.type: response
#> threshold: 
#> time: 0.00
#>   id truth response
#> 1  1  24.0 29.46569
#> 2  2  21.6 24.65039
#> 3  3  34.7 30.48177
#> 4  4  33.4 28.91454
#> 5  5  36.2 27.40745
#> 6  6  28.7 25.77416
#> ... (#rows: 506, #cols: 3)

ret = retrafo(transformed)
invert(ret, prediction)
#> Prediction: 506 observations
#> predict.type: response
#> threshold: 
#> time: 0.00
#>   id response
#> 1  1 29.46569
#> 2  2 24.65039
#> 3  3 30.48177
#> 4  4 28.91454
#> 5  5 27.40745
#> 6  6 25.77416
#> ... (#rows: 506, #cols: 2)

Inversion can be done on predictions given by mlr Learners as well as on plain vectors, data.frames, and matrix objects.

Note that the prediction being inverted must have the form of a prediction made with the predict.type that the inverter expects as input for the predict.type passed to invert(). This mapping can be queried using the getCPOPredictType() function: if invert() is called with predict.type = p, then the prediction must be one made by a Learner that has predict.type set to getCPOPredictType(cpo)[p].
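
The lookup described above can be sketched as follows, assuming mlr and mlrCPO are installed. cpoResponseFromSE() is used as the example because its mapping is not the identity; "regr.lm" is just one Learner that supports the "se" predict type.

```r
library(mlr)
library(mlrCPO)

cpo = cpoResponseFromSE()

# Which predict.type must the Learner deliver so that the
# inverted prediction has predict.type "response"?
type.map = getCPOPredictType(cpo)
needed = type.map[["response"]]
needed

# Configure a Learner accordingly.
lrn = setPredictType(makeLearner("regr.lm"), needed)
lrn$predict.type
```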

NULLCPO

NULLCPO is the neutral element of %>>% and the operations it represents (composeCPO(), applyCPO(), and attachCPO()), i.e. when it is used as an argument of these functions, the data, Learner or CPO is not changed. NULLCPO is also the result of pipeCPO() called with the empty list, and of retrafo() and inverter() when they are called for objects with no CPOTrained objects attached.

pipeCPO(list())
#> NULLCPO
as.list(NULLCPO)  # the inverse of pipeCPO
#> list()
retrafo(bh.task)
#> NULLCPO
inverter(bh.task %>>% cpoPca())  # cpoPca is not a TOCPO, so no inverter is created
#> NULLCPO

Many getters give characteristic results for NULLCPO.

getCPOClass(NULLCPO)
#> [1] "NULLCPO"
getCPOName(NULLCPO)
#> [1] "NULLCPO"
getCPOId(NULLCPO)
#> [1] "NULLCPO"
getHyperPars(NULLCPO)
#> named list()
getParamSet(NULLCPO)
#> [1] "Empty parameter set."
getCPOAffect(NULLCPO)
#> named list()
getCPOOperatingType(NULLCPO)  # operates neither on features nor on targets.
#> character(0)
getCPOProperties(NULLCPO)
#> $handling
#>  [1] "numerics"   "factors"    "ordered"    "missings"   "cluster"   
#>  [6] "classif"    "multilabel" "regr"       "surv"       "oneclass"  
#> [11] "twoclass"   "multiclass" "prob"       "se"        
#> 
#> $adding
#> character(0)
#> 
#> $needed
#> character(0)
# applying NULLCPO leads to a retrafo() of NULLCPO, so it is its own CPOTrainedCPO
getCPOTrainedCPO(NULLCPO)
#> NULLCPO
# NULLCPO has no effect on applyCPO and invert, so NULLCPO's capabilities are 0.
getCPOTrainedCapability(NULLCPO)
#> retrafo  invert 
#>       0       0
getCPOTrainedState(NULLCPO)
#> NULL

Some helper functions convert NULLCPO to NULL and back, while leaving other values as they are.

nullToNullcpo(NULL)
#> NULLCPO
nullcpoToNull(NULLCPO)
#> NULL
nullToNullcpo(10) # not changed
#> [1] 10
nullcpoToNull(10) # ditto
#> [1] 10

CPO Name and ID

A CPO has a “name” which identifies the general operation done by this CPO. For example, it is "pca" for a CPO created using cpoPca(). Furthermore, a CPO has an “ID” which is associated with the particular CPO object at hand. For primitive CPOs, it can be queried and set using getCPOId() and setCPOId(), and it can be set during construction, but it defaults to the CPO’s name. The ID will also be prefixed to the CPO’s hyperparameters after construction, if they are exported. This can help prevent hyperparameter name clashes when composing CPOs with otherwise identical hyperparameter names. It is possible to set the ID to NULL to have no prefix for hyperparameter names.

cpo = cpoPca()
getCPOId(cpo)
#> [1] "pca"

getParamSet(cpo)
#>               Type len   Def Constr Req Tunable Trafo
#> pca.center logical   -  TRUE      -   -    TRUE     -
#> pca.scale  logical   - FALSE      -   -    TRUE     -

getParamSet(setCPOId(cpo, "my.id"))
#>                 Type len   Def Constr Req Tunable Trafo
#> my.id.center logical   -  TRUE      -   -    TRUE     -
#> my.id.scale  logical   - FALSE      -   -    TRUE     -

getParamSet(setCPOId(cpo, NULL))
#>           Type len   Def Constr Req Tunable Trafo
#> center logical   -  TRUE      -   -    TRUE     -
#> scale  logical   - FALSE      -   -    TRUE     -

In the following (silly) example an error is thrown because of a hyperparameter name clash. This can be avoided by setting the ID of one of the constituents to a different value.


cpo %>>% cpo
#> Error in parameterClashAssert(cpo1, cpo2, cpo1$debug.name, cpo2$debug.name): Parameters "pca.center", "pca.scale" occur in both pca and pca
#> Use the id parameter when constructing, or setCPOId, to prevent name collisions.

cpo %>>% setCPOId(cpo, "two")
#> (pca >> two<pca>)(pca.center = TRUE, pca.scale = FALSE, two.center = TRUE, two.scale = FALSE)
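
As the error message suggests, the id can also be given directly at construction. A minimal sketch, assuming mlr and mlrCPO are installed (the ids "first" and "second" are arbitrary):

```r
library(mlr)
library(mlrCPO)

# Distinct ids give distinct hyperparameter prefixes, so composition succeeds.
twice = cpoPca(id = "first") %>>% cpoPca(id = "second")
names(getHyperPars(twice))
```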

CPO Properties

CPOs contain information about the kind of data they can work with, and what kind of data they produce. getCPOProperties() returns a list with the slots handling, adding, needed. properties$handling indicates the kind of data a CPO can handle, properties$needed indicates the kind of data that the data receiver (e.g. an attached learner) must be able to handle, and properties$adding lists the properties it adds to a given learner. An example is cpoDummyEncode(), a CPO that converts factors to numerics: The receiving learner needs to handle numerics, so properties$needed == "numerics", but it adds the ability to handle factors (since they are converted), so properties$adding == c("factors", "ordered").

getCPOProperties(cpoDummyEncode())
#> $handling
#>  [1] "numerics"   "factors"    "ordered"    "missings"   "cluster"   
#>  [6] "classif"    "multilabel" "regr"       "surv"       "oneclass"  
#> [11] "twoclass"   "multiclass" "prob"       "se"        
#> 
#> $adding
#> [1] "factors" "ordered"
#> 
#> $needed
#> [1] "numerics"

As a result, cpoDummyEncode endows a Learner with the ability to train on data with factor variables:

train("classif.fnn", bc.task)  # gives an error
#> Error in checkLearnerBeforeTrain(task, learner, weights): Task 'BreastCancer-example' has factor inputs in 'Cl.thickness, Cell.size, Cell.shape, Marg.adhes...', but learner 'classif.fnn' does not support that!

train(cpoDummyEncode(reference.cat = TRUE) %>>% makeLearner("classif.fnn"), bc.task)
#> Model for learner.id=classif.fnn.dummyencode; learner.class=CPOLearner
#> Trained on: task.id = BreastCancer-example; obs = 683; features = 9
#> Hyperparameters:

getLearnerProperties("classif.fnn")
#> [1] "twoclass"   "multiclass" "numerics"

getLearnerProperties(cpoDummyEncode(TRUE) %>>% makeLearner("classif.fnn"))
#> [1] "numerics"   "factors"    "ordered"    "twoclass"   "multiclass"

`.sometimes`-Properties

As described in more detail in the Building Custom CPOs vignette, CPOs can have properties that are considered only when composing CPOs, or only when checking data returned by CPOs. In short, consider a CPO that does imputation, but only for factorial features. This CPO would need to have "missings" in its $adding properties slot, since it enables a Learner to handle (some) Tasks that have missing values. However, this CPO may under certain circumstances still return data that has missing values. This discrepancy is recorded internally by having two “hidden” sets of properties that can be retrieved with getCPOProperties() with get.internal set to TRUE. These properties are adding.min, the minimal set of properties added, and needed.max, the maximal set of properties needed by consecutive operators. These can be understood as a description of the “worst case” behaviour of the CPO, since behaviour that is out of bounds of these sets causes an error in the mlrCPO framework.

An example is the cpoApplyFun CPO: When it is constructed, it is not known what kind of properties will be added or needed, so adding.min is empty while needed.max is the set of all data properties. When composing CPOs, this CPO is handled as if it magically does exactly the data conversion necessary to make the CPOs or Learner coming after it work with the data. If this ends up not being the case, an error is thrown during application or training by the following CPO or Learner.

getCPOProperties(cpoApplyFun(export = "export.all"), get.internal = TRUE)
#> $handling
#>  [1] "numerics"   "factors"    "ordered"    "missings"   "cluster"   
#>  [6] "classif"    "multilabel" "regr"       "surv"       "oneclass"  
#> [11] "twoclass"   "multiclass" "prob"       "se"        
#> 
#> $adding
#> [1] "numerics" "factors"  "ordered"  "missings"
#> 
#> $needed
#> character(0)
#> 
#> $adding.min
#> character(0)
#> 
#> $needed.max
#> [1] "numerics" "factors"  "ordered"  "missings"

CPO Affect

When constructing a CPO, it is possible to restrict the columns on which the CPO operates using the affect.* parameters of the CPOConstructor. These parameters are:

affect.index: Identify affected columns by a vector of column indices.
affect.names: Identify affected columns by a vector of column names.
affect.pattern: Match column names against a grep() style regex pattern.
affect.pattern.ignore.case: Ignore case when matching by pattern.
affect.pattern.perl: Use “perl” syntax in affect.pattern.
affect.pattern.fixed: Use fixed pattern instead of regex in affect.pattern.
affect.invert: Invert the columns to affect: Only columns not matched by any of the other affect.* parameters are affected.

# only PCA columns that have '.Length' in their name
cpo = cpoPca(affect.pattern = ".Length")
getCPOAffect(cpo)
#> $pattern
#> [1] ".Length"

triris = iris %>>% cpo
head(triris)
#>   Sepal.Width Petal.Width Species       PC1         PC2
#> 1         3.5         0.2  setosa -2.460241 -0.24479165
#> 2         3.0         0.2  setosa -2.538962 -0.06093579
#> 3         3.2         0.2  setosa -2.709611  0.08355948
#> 4         3.1         0.2  setosa -2.565116  0.25420858
#> 5         3.6         0.2  setosa -2.499602 -0.15286372
#> 6         3.9         0.4  setosa -2.066375 -0.40249369
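
The other affect.* parameters select columns in the same way. The following sketch assumes that selecting the two '.Length' columns of iris by name or by index behaves like the pattern match above; the variable names by.name and by.index are illustrative.

```r
library(mlrCPO)

# Select the same two columns explicitly by name ...
by.name = iris %>>% cpoPca(affect.names = c("Sepal.Length", "Petal.Length"))

# ... or by position (columns 1 and 3 of iris).
by.index = iris %>>% cpoPca(affect.index = c(1, 3))

# Columns that are not affected are passed through unchanged.
colnames(by.name)
colnames(by.index)
```

Note that affect.pattern is treated as a regular expression unless affect.pattern.fixed is set, so the "." in ".Length" matches any character; naming the columns explicitly avoids accidental matches.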

CPO Parameter Export

When many CPOs are used together, their hyperparameters can become hard to keep track of. mlrCPO lets the user control which hyperparameters get exported. The parameter “export” can be one of "export.default", "export.set", "export.unset", "export.default.set", "export.default.unset", "export.all", "export.none". “all” and “none” do what one expects; “default” exports the “recommended” parameters; “set” exports only the parameters that were explicitly set in the constructor call, while “unset” exports only the parameters that were not set (and are left at their default). “default.set” and “default.unset” work as “set” and “unset”, but restricted to the default exported parameters.

!cpoScale()
#> Trafo chain of 1 cpos:
#> scale(center = TRUE, scale = TRUE)
#> Operating: feature
#> ParamSet:
#>                 Type len  Def Constr Req Tunable Trafo
#> scale.center logical   - TRUE      -   -    TRUE     -
#> scale.scale  logical   - TRUE      -   -    TRUE     -

!cpoScale(export = "export.none")
#> Trafo chain of 1 cpos:
#> scale()[not exp'd: center = TRUE, scale = TRUE]
#> Operating: feature
#> ParamSet:
#> [1] "Empty parameter set."

!cpoScale(scale = FALSE, export = "export.unset")
#> Trafo chain of 1 cpos:
#> scale(center = TRUE)[not exp'd: scale = FALSE]
#> Operating: feature
#> ParamSet:
#>                 Type len  Def Constr Req Tunable Trafo
#> scale.center logical   - TRUE      -   -    TRUE     -
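
The difference between “set” and “unset” can be checked by listing the exported hyperparameters. This is a minimal sketch, assuming that getHyperPars() on a CPO returns only the exported parameters; the variable names are illustrative.

```r
library(mlrCPO)

# "export.set": only the explicitly set parameter is expected to be exported.
cpo.set = cpoScale(scale = FALSE, export = "export.set")
names(getHyperPars(cpo.set))

# "export.unset": only the parameter left at its default is expected to be exported.
cpo.unset = cpoScale(scale = FALSE, export = "export.unset")
names(getHyperPars(cpo.unset))
```

Exporting only the unset parameters is useful when some values are fixed by design and only the remaining ones should be exposed for tuning.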

Syntactic Sugar

There are some %>>%-related operators that perform similar operations but may be more concise in certain applications. In general these operators are left-associative, i.e. they are evaluated after the expressions to their left were evaluated. Therefore, for example, a %>>% b %<<% c is equivalent to (a %>>% b) %<<% c. Exceptions are the assignment operators, %<>>% and %<<<%, as well as the %>|% operator; see below.

The operators are:

%>>%: The application, composition or attachment operator.
%<<%: The above with exchanged arguments. a %<<% b is equivalent to b %>>% a.
%<>>%: %>>%, followed by assignment to the left. This operator evaluates the arguments to its right before being evaluated itself. a %<>>% b %>>% c is equivalent to a = (a %>>% b %>>% c).
%<<<%: %<<%, followed by assignment to the left. Note this is not the %<>>% operator with its arguments flipped. This operator evaluates the arguments to its right before being evaluated itself. a %<<<% b %>>% c is equivalent to a = (a %<<% (b %>>% c)).
%>|%: %>>%, followed by application of retrafo(). This operator evaluates the arguments to its right before being evaluated itself. a %>|% b %<<% c is equivalent to retrafo(a %>>% (b %<<% c)).
%|<%: The above with exchanged arguments. Like most R operators, this one evaluates arguments to its left before being evaluated itself. a %>>% b %|<% c is equivalent to retrafo((a %>>% b) %<<% c).
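
The operators above can be tried out on a small pipeline. The following is a sketch using cpoScale() and cpoPca(); the variable names forward, backward, pipeline and rt are illustrative.

```r
library(mlrCPO)

# %<<% is %>>% with exchanged arguments: both describe "scale, then PCA".
forward = cpoScale() %>>% cpoPca()
backward = cpoPca() %<<% cpoScale()

# %<>>% composes and assigns the result to the left-hand variable.
pipeline = cpoScale()
pipeline %<>>% cpoPca()  # same as: pipeline = pipeline %>>% cpoPca()

# %>|% applies the CPO and returns the retrafo instead of the transformed data.
rt = pid.task %>|% cpoPca()
getCPOClass(rt)
```

The assignment operators are mainly a convenience for building up a pipeline step by step in an interactive session.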

mlrCPO Core

Martin Binder

2024-02-20

