3 The package
The R package iRegression contains the following functions:
bivar: a function that fits a regression model for interval-valued variables based on the bivariate symbolic regression method (BSRM);
crm: a function that fits a linear regression model based on the center and range method (CRM);
ccrm: a function that fits a linear regression model with inequality constraints on the range parameters (CCRM);
MinMax: a function that fits a linear regression model for interval variables based on the MinMax method;
cm: a function that fits a linear regression model based on the center method.
All five functions require an object of class formula giving the symbolic description of the regression model to be estimated. Methods for analyzing the fitted models are also provided. The functions coef(), fitted() and residuals() extract the estimated coefficients, fitted values and residuals from the fitted models. An object of any of the five classes can be summarized with the function summary().
3.1 Function bivar
The function bivar() can be used to estimate the parameters of a Gaussian bivariate regression model for interval variables. Any pair of interval features can be used for the bivariate random vector \(\pmb Y\), for example the lower and upper interval bounds, or the midpoint and the range of the intervals. The function is used as
bivar(formula1, lig1, formula2, lig2, data, ...)
and considers the following arguments:
formula1: an object of class formula that represents the symbolic description of the first marginal model;
lig1: the link function to be considered for the first model;
formula2: an object of class formula that represents the symbolic description of the second marginal model;
lig2: the link function to be considered for the second model;
data: an optional data frame containing the variables of the models.
Notice that it is possible to choose from different link functions (identity, inverse or log) to connect the random variables \(Y_1\) and \(Y_2\) with the respective linear predictors \(\eta_1\) and \(\eta_2\). The function summary.bivar() returns the following elements, given an object of the class bivar:
Coefficients1 and Coefficients2: the vectors of coefficients for the explanatory variables of models 1 and 2, respectively;
RMSE1 and RMSE2: the root mean square errors for models 1 and 2, respectively;
Rho: the estimate of the correlation coefficient between \(Y_1\) and \(Y_2\);
Phi: the estimate of the dispersion parameter;
D: the deviance, a goodness-of-fit measure for the current model.
The function bivar() uses expression (2.5) to estimate the vectors of coefficients \(\hat{\pmb \beta}_1\) and \(\hat{\pmb \beta}_2\) and expression (2.6) to compute the deviance for the current model, while expressions (2.7) and (2.8) provide, respectively, the estimates of the dispersion parameter \(\hat{\phi}\) and the correlation coefficient \(\hat{\rho}\). Moreover, the function coef.bivar() returns just the estimated coefficients, while the functions fitted.bivar() and residuals.bivar() provide, respectively, the matrices of the fitted values and the residuals for an object of the class bivar. Expression (2.12) is used to obtain the residual deviance. The function bivar() takes the bivariate Gaussian distribution as the probabilistic model for the error of the BSRM, and an object of class bivar contains the following elements:
coefficients1: a named vector of coefficients for the explanatory variables of model 1;
coefficients2: a named vector of coefficients for the explanatory variables of model 2;
fitted.values1: the fitted values for the response variable \(\pmb Y_1\);
fitted.values2: the fitted values for the response variable \(\pmb Y_2\);
residuals1: the ordinary residuals for the response variable \(\pmb Y_1\);
residuals2: the ordinary residuals for the response variable \(\pmb Y_2\);
residual.deviance: the global residual for the bivariate vector \(\pmb Y = [\pmb Y_1, \pmb Y_2]\);
Rho: the estimate of the correlation coefficient between \(\pmb Y_1\) and \(\pmb Y_2\);
Phi: the estimate of the dispersion parameter;
D: the deviance, a goodness-of-fit measure for the current model.
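The role of the link functions and of the correlation estimate can be illustrated with base R alone. The sketch below is not the BSRM estimator of expression (2.5): as a simplifying assumption, it fits the two marginal Gaussian models separately with glm() and then takes the empirical correlation of the two residual vectors as a rough stand-in for \(\hat{\rho}\). The data are simulated and all variable names (x, y_c, y_r) are hypothetical.

```r
# Simulated midpoint (y_c) and range (y_r) of an interval response;
# hypothetical data, not a data set shipped with iRegression.
set.seed(1)
n  <- 50
x  <- runif(n, 1, 5)
e1 <- rnorm(n)
e2 <- 0.6 * e1 + 0.8 * rnorm(n)          # errors correlated with e1
y_c <- 2 + 1.5 * x + 0.3 * e1            # identity link is natural here
y_r <- exp(0.2 + 0.3 * x) + 0.1 * e2     # positive response, log link

# Marginal Gaussian GLMs, one link function per model
# (assumption: separate fits, not the joint BSRM estimation)
fit1 <- glm(y_c ~ x, family = gaussian(link = "identity"))
fit2 <- glm(y_r ~ x, family = gaussian(link = "log"))

# Empirical correlation between the two sets of ordinary residuals
rho_hat <- cor(residuals(fit1, type = "response"),
               residuals(fit2, type = "response"))
```

In this sketch, the log link keeps the fitted ranges positive, which is one practical reason to choose a non-identity link for the second model.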
3.2 Function crm
The function crm() fits two independent linear regression models to the centers and ranges of the interval variables, respectively, minimizing the sum of the squared errors of the centers plus the sum of the squared errors of the ranges (Lima Neto and De Carvalho 2008). The function is used as
crm(formula1, formula2, data, ...)
and considers the following arguments:
formula1: an object of class formula that represents the symbolic description of the center model;
formula2: an object of class formula that represents the symbolic description of the range model;
data: an optional data frame containing the variables of the models.
This function returns an object of class crm including the following elements:
coefficients.C: the vector of coefficients for the explanatory variables of the center model;
coefficients.R: the vector of coefficients for the explanatory variables of the range model;
sigma.C: an estimate of the standard deviation for the center regression model;
sigma.R: an estimate of the standard deviation for the range regression model;
df.C: the degrees of freedom for the center residuals;
df.R: the degrees of freedom for the range residuals;
fitted.values.l: the fitted values of the lower interval bounds;
fitted.values.u: the fitted values of the upper interval bounds;
residuals.l: the residuals of the lower interval bounds;
residuals.u: the residuals of the upper interval bounds.
The function summary.crm() returns the elements RMSE.l (the root mean squared error of the lower bound) and RMSE.u (the root mean squared error of the upper bound), given an object of the class crm. The function coef.crm() returns just the estimated coefficients, while the functions fitted.crm() and residuals.crm() provide, respectively, the matrices of the fitted values and the residuals for an object of the class crm. Notice that the fitted values, residuals and root mean squared errors are expressed in terms of the lower and upper interval bounds to allow easier comparison with the original values of the response variable \(\pmb Y\).
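The center and range decomposition can be illustrated with base R alone. The following is a minimal sketch, not the implementation used by crm(): it assumes simulated interval data, hypothetical variable names, and a range defined as the upper bound minus the lower bound, so that the bounds are recovered as center minus or plus half the range.

```r
# Simulated interval data (hypothetical; not a data set from the package)
set.seed(123)
n   <- 30
x_c <- runif(n, 10, 20)                     # centers of the explanatory interval
x_r <- runif(n, 1, 4)                       # ranges of the explanatory interval
y_c <- 3 + 2 * x_c + rnorm(n)               # centers of the response interval
y_r <- 1 + 0.5 * x_r + rnorm(n, sd = 0.1)   # ranges of the response interval

# Observed bounds implied by center and range (assumes range = upper - lower)
y_l <- y_c - y_r / 2
y_u <- y_c + y_r / 2

# Two independent least squares fits: one for centers, one for ranges
fit_c <- lm(y_c ~ x_c)
fit_r <- lm(y_r ~ x_r)

# Fitted values expressed as lower and upper bounds
fit_l <- fitted(fit_c) - fitted(fit_r) / 2
fit_u <- fitted(fit_c) + fitted(fit_r) / 2

# Root mean squared errors of the lower and upper bounds
rmse_l <- sqrt(mean((y_l - fit_l)^2))
rmse_u <- sqrt(mean((y_u - fit_u)^2))
```

Reporting the errors on the bounds rather than on the centers and ranges mirrors the choice described above, since the bounds are the quantities observed in the original data.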
3.3 Function ccrm
The function ccrm() fits two independent linear regression models to the centers and ranges of the interval variables. However, the range coefficients are estimated subject to inequality constraints, using the function pcls() from the package mgcv. This function solves least squares problems with quadratic penalties subject to linear equality and inequality constraints using quadratic programming. The aim is to guarantee mathematical coherence between the predicted values of the lower and upper bounds of the response interval variable Y, i.e., \(\hat{y}_L < \hat{y}_U\). There are no constraints on the parameter estimates for the center coefficients. For further details about the constrained center and range method, see . The function is used as
ccrm(formula1, formula2, data, ...)
and considers the following arguments:
formula1: an object of class formula that represents the symbolic description of the center model;
formula2: an object of class formula that represents the symbolic description of the range model;
data: an optional data frame containing the variables of the models.
This function returns an object of class ccrm with the following elements: coefficients.C, coefficients.R, sigma.C, sigma.R, df.C, df.R, fitted.values.l, fitted.values.u, residuals.l and residuals.u. All these elements have the same description as for the function crm(). The function summary.ccrm() returns the elements RMSE.l (the root mean squared error of the lower bound) and RMSE.u (the root mean squared error of the upper bound), given an object of the class ccrm. The function coef.ccrm() returns just the estimated coefficients, while the functions fitted.ccrm() and residuals.ccrm() provide, respectively, the matrices of the fitted values and the residuals for an object of the class ccrm.
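The effect of constraining the range coefficients can be sketched without mgcv. The example below is an illustration under stated assumptions, not the procedure used by ccrm(): it assumes that the inequality constraints are non-negativity constraints on the range coefficients, and it swaps pcls() for the box-constrained optimizer optim() with method "L-BFGS-B" from base R. The data are simulated and the names are hypothetical.

```r
# Simulated ranges (hypothetical data); the true slope is negative on purpose
set.seed(2)
n   <- 40
x_r <- runif(n, 0.5, 2)                       # ranges of the explanatory interval
y_r <- 1 - 0.3 * x_r + rnorm(n, sd = 0.05)    # ranges of the response interval

# Unconstrained least squares: the slope estimate can be negative
beta_free <- coef(lm(y_r ~ x_r))

# Constrained least squares: minimize the residual sum of squares subject to
# beta >= 0 (assumed form of the constraints; stand-in for mgcv::pcls)
X   <- cbind(1, x_r)
sse <- function(b) sum((y_r - X %*% b)^2)
fit <- optim(c(1, 1), sse, method = "L-BFGS-B", lower = c(0, 0))
beta_r <- fit$par

# With non-negative coefficients and non-negative regressors,
# every predicted range is non-negative, so lower <= upper always holds
range_hat <- drop(X %*% beta_r)
```

Because the explanatory ranges are non-negative by construction, non-negative coefficients are sufficient for the predicted lower bound never to exceed the predicted upper bound, whatever the center model predicts.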
3.4 Function MinMax
The function MinMax() estimates the lower and upper bounds of the interval variables using two linear regression models with different vectors of parameters. This is equivalent to assuming independence between the values of the lower and upper bounds of the intervals. The function is used as
MinMax(formula1, formula2, data, ...)
and considers the following arguments:
formula1: an object of class formula that represents the symbolic description of the lower bound model;
formula2: an object of class formula that represents the symbolic description of the upper bound model;
data: an optional data frame containing the variables of the models.
The following elements belong to an object of class MinMax: coefficients.l, coefficients.u, sigma.l, sigma.u, df.l, df.u, fitted.values.l, fitted.values.u, residuals.l and residuals.u. The function summary.MinMax() returns the elements RMSE.l (the root mean squared error of the lower bound) and RMSE.u (the root mean squared error of the upper bound), given an object of the class MinMax. The function coef.MinMax() returns just the estimated coefficients, while the functions fitted.MinMax() and residuals.MinMax() provide, respectively, the matrices of the fitted values and the residuals for an object of the class MinMax.
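The MinMax idea of two separate bound models can be sketched with two calls to lm(). This is a minimal illustration under simulated data and hypothetical variable names, not the code behind MinMax().

```r
# Simulated interval bounds (hypothetical data)
set.seed(42)
n   <- 30
x_l <- runif(n, 0, 10)                      # lower bounds of the explanatory interval
x_u <- x_l + runif(n, 1, 3)                 # upper bounds of the explanatory interval
y_l <- 1 + 0.9 * x_l + rnorm(n, sd = 0.3)   # lower bounds of the response
y_u <- 2 + 1.1 * x_u + rnorm(n, sd = 0.3)   # upper bounds of the response

# Two regressions with their own parameter vectors
fit_l <- lm(y_l ~ x_l)   # lower bound model
fit_u <- lm(y_u ~ x_u)   # upper bound model

# Root mean squared errors of the lower and upper bounds
rmse_l <- sqrt(mean(residuals(fit_l)^2))
rmse_u <- sqrt(mean(residuals(fit_u)^2))

# Nothing ties the two fits together, so crossing of the
# fitted bounds has to be checked after the fact
n_crossed <- sum(fitted(fit_l) > fitted(fit_u))
```

Since the two models are estimated independently, the method itself gives no guarantee that a fitted lower bound stays below the corresponding fitted upper bound, which is why the sketch counts such cases explicitly.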
3.5 Function cm
The function cm() implements the first approach proposed to fit a linear regression model to interval variables (Billard and Diday 2000). This approach consists of fitting a linear regression model to the centers of the interval variables, minimizing the sum of the squared errors of the centers. The lower and upper bounds of the response interval variable Y are then predicted, respectively, from the lower and upper bounds of the independent variables using the same vector of parameters \(\pmb{\beta}\). The function is used as
cm(formula1, formula2, data, ...)
and considers the following arguments:
formula1: an object of class formula that represents the symbolic description of the lower bound model;
formula2: an object of class formula that represents the symbolic description of the upper bound model;
data: an optional data frame containing the variables of the models.
The function cm() returns an object of class cm including the following elements: coefficients, sigma, df, fitted.values.l, fitted.values.u, residuals.l and residuals.u. The function summary.cm() returns the elements RMSE.l (the root mean squared error of the lower bound) and RMSE.u (the root mean squared error of the upper bound), given an object of the class cm. The function coef.cm() returns just the estimated coefficients, while the functions fitted.cm() and residuals.cm() provide, respectively, the matrices of the fitted values and the residuals for an object of the class cm.
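The center method can be sketched in a few lines of base R. The example is a minimal illustration with simulated data and hypothetical variable names, not the implementation of cm(): a single model is fitted to the centers, and the same coefficient vector is then applied to the lower and upper bounds of the explanatory variable.

```r
# Simulated interval data (hypothetical)
set.seed(7)
n   <- 30
x_l <- runif(n, 0, 10)                    # lower bounds of the explanatory interval
x_u <- x_l + runif(n, 1, 3)               # upper bounds of the explanatory interval
x_c <- (x_l + x_u) / 2                    # centers of the explanatory interval
y_c <- 1 + 2 * x_c + rnorm(n, sd = 0.5)   # centers of the response interval

# One least squares fit on the centers only
fit  <- lm(y_c ~ x_c)
beta <- unname(coef(fit))

# The same parameter vector predicts both bounds of the response
y_l_hat <- beta[1] + beta[2] * x_l
y_u_hat <- beta[1] + beta[2] * x_u
```

Notice that the ordering of the predicted bounds depends on the sign of the slope: with a positive slope, as in this simulated example, the predicted lower bound stays below the predicted upper bound, whereas a negative slope would reverse them.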
3.6 An example
This section illustrates the functions in the package and demonstrates the differences between the regression methods implemented for interval variables. This is performed using the data set Soccer, illustrated in
References
Lima Neto, Eufrásio Andrade, and Francisco Assis T. De Carvalho. 2008. “Centre and Range Method for Fitting a Linear Regression Model to Symbolic Interval Data.” Computational Statistics & Data Analysis 52 (3): 1500–1515. https://doi.org/10.1016/j.csda.2007.04.014.
Billard, L., and E. Diday. 2000. “Regression Analysis for Interval-Valued Data.” In Data Analysis, Classification, and Related Methods, edited by Henk A. L. Kiers, Jean-Paul Rasson, Patrick J. F. Groenen, and Martin Schader, 369–74. Heidelberg: Springer. https://doi.org/10.1007/978-3-642-59789-3_58.