1 Introduction

Data are now a central source of information. Statistical data analysis techniques evolve continually as new types of data appear in real-life studies and as tools are developed to analyze more complex data. Interval-valued data sets are becoming common in data analysis problems. Such data may represent the uncertainty arising from measurement error or the natural variability present in the data, or may result from aggregating huge databases into a compact number of groups (Bock 2000; Billard and Diday 2007; Diday and Noirhomme-Fraiture 2008). Interval data arise in fields such as engineering, economics and medicine, among others. Technical specifications, temperatures recorded at meteorological stations and daily stock prices are examples of possible interval-valued variables. Statistical tools for analyzing interval-valued data are therefore very much needed.

Several regression methods for modeling interval data have been proposed in the literature. Some of them treat the problem from an optimization point of view (Billard and Diday 2000; Lima Neto and De Carvalho 2008; Maia and de Carvalho 2008; Lima Neto and De Carvalho 2010; Xu 2010; Wang et al. 2012). Others rely on the principles of set arithmetic to obtain a linear regression model for interval data (Gil et al. 2007; Blanco-Fernández, Corral, and González-Rodríguez 2011). In recent years, new regression methods for interval data have adopted a probabilistic framework for the dependent interval variable \(Z\) (Souza, Queiroz, and Cysneiros 2011; Brito and Silva 2012; Fagundes, Souza, and Cysneiros 2013; Sun and Li 2014).

Lima Neto, Cordeiro, and De Carvalho (2011) presented an approach based on the theory of generalized linear models, called the bivariate symbolic regression method (BSRM). They treat the interval response variable \(Z = [Y_{L}, Y_{U}]\) as a bivariate random vector whose joint distribution belongs to the continuous bivariate exponential family (Iwasaki and Tsubaki 2005). As a result, the BSRM supports inference procedures for the parameter estimates, goodness-of-fit tests and residual analyses. This is an advantage of the BSRM over the methods based on an optimization point of view or on set arithmetic principles.

This paper introduces the R (R Core Team 2014) package iRegression (Lima Neto and Vasconcelos 2012), available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=iRegression. The package implements important regression methods for interval-valued variables, namely: the center method (Billard and Diday 2000), the center and range method (Lima Neto and De Carvalho 2008), the constrained center and range method (Lima Neto and De Carvalho 2010), the min-max method and the bivariate symbolic regression method (Lima Neto, Cordeiro, and De Carvalho 2011). Fitted values, residuals and goodness-of-fit measures are available for each method. Real-life data sets are also included in the package.

Other packages for analyzing interval variables currently exist. The package ISDA.R (Queiroz Filho and Fagundes 2012) provides descriptive statistics and visualization techniques for interval variables. The packages RSDA (Rodriguez, Calderon, and Zuniga 2014) and GPCSIV (Brahim and Makosso-Kallyth 2013) offer principal component analysis (PCA) approaches for interval data. The package MAINT.Data (Silva and Brito 2015) provides multivariate analysis of variance (MANOVA) and discriminant analysis for interval data. The package symbolicDA (Dudek, Pelka, and Wilk 2013) implements several multivariate techniques for interval variables (similarity and dissimilarity measures, clustering methods, principal component analysis, kernel discriminant analysis), as well as decision trees and visualization techniques. It is worth noting that iRegression was the first package available on CRAN for analyzing interval variables. Moreover, version 1.2 introduced the function bivar(), which implements the first regression model with a probabilistic background for interval variables and thereby makes it possible to use inferential procedures and residual analysis to validate the model.
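To make the idea behind the center and range method listed above concrete, the following is a minimal sketch in base R. It does not use the iRegression functions: it only illustrates the general idea of fitting one ordinary least squares model to the interval midpoints and another to the interval half-ranges, then rebuilding the predicted bounds. The toy data and all variable names (`x_lower`, `fit_center`, `pred`, and so on) are assumptions made for illustration.

```r
# Hypothetical toy data: every observation is an interval [lower, upper].
set.seed(1)
n <- 30
x_center <- runif(n, 10, 20)
x_half   <- runif(n, 1, 3)
y_center <- 2 + 1.5 * x_center + rnorm(n, sd = 0.5)
y_half   <- 0.5 + 0.8 * x_half + rnorm(n, sd = 0.05)

x_lower <- x_center - x_half
x_upper <- x_center + x_half
y_lower <- y_center - y_half
y_upper <- y_center + y_half

# Step 1: reduce each interval to its midpoint and half-range.
xc <- (x_lower + x_upper) / 2
xr <- (x_upper - x_lower) / 2
yc <- (y_lower + y_upper) / 2
yr <- (y_upper - y_lower) / 2

# Step 2: fit two independent least squares regressions.
fit_center <- lm(yc ~ xc)  # midpoints of y on midpoints of x
fit_range  <- lm(yr ~ xr)  # half-ranges of y on half-ranges of x

# Step 3: rebuild the predicted interval bounds from both fits.
yc_hat <- as.numeric(fitted(fit_center))
yr_hat <- as.numeric(fitted(fit_range))
pred <- data.frame(lower = yc_hat - yr_hat, upper = yc_hat + yr_hat)

head(pred)
```

Because the half-range regression is unconstrained in this sketch, nothing guarantees in general that a predicted lower bound stays below the corresponding upper bound; addressing this is, as its name suggests, the motivation for the constrained center and range method.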

This article is organized as follows. Section 2 details the bivariate symbolic regression method, the parameter estimation algorithm, inference aspects, and goodness-of-fit and residual measures. Section 3 describes the iRegression package. Section 4 demonstrates an application to a real-life data set. Finally, Section 5 concludes the paper with some remarks.


References

Bock, Hans Hermann. 2000. Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Edited by E. Diday. Secaucus, NJ, USA: Springer-Verlag.

Billard, Lynne, and Edwin Diday. 2007. Symbolic Data Analysis: Conceptual Statistics and Data Mining (Wiley Series in Computational Statistics). John Wiley & Sons.

Diday, Edwin, and Monique Noirhomme-Fraiture. 2008. Symbolic Data Analysis and the SODAS Software. New York, NY, USA: Wiley-Interscience.

Billard, L., and E. Diday. 2000. “Regression Analysis for Interval-Valued Data.” In Data Analysis, Classification, and Related Methods, edited by Henk A. L. Kiers, Jean-Paul Rasson, Patrick J. F. Groenen, and Martin Schader, 369–74. Heidelberg: Springer. https://doi.org/10.1007/978-3-642-59789-3_58.

Lima Neto, Eufrásio Andrade, and Francisco Assis T. De Carvalho. 2008. “Centre and Range Method for Fitting a Linear Regression Model to Symbolic Interval Data.” Computational Statistics & Data Analysis 52 (3): 1500–1515. https://doi.org/10.1016/j.csda.2007.04.014.

Maia, André Luis Santiago, and Francisco de A.T. de Carvalho. 2008. “Fitting a Least Absolute Deviation Regression Model on Interval-Valued Data.” In Advances in Artificial Intelligence - SBIA 2008: 19th Brazilian Symposium on Artificial Intelligence, Salvador, Brazil, October 26-30, 2008. Proceedings, edited by Gerson Zaverucha and Augusto Loureiro da Costa, 207–16. Heidelberg: Springer. https://doi.org/10.1007/978-3-540-88190-2_26.

Lima Neto, Eufrásio Andrade, and Francisco Assis T. De Carvalho. 2010. “Constrained Linear Regression Models for Symbolic Interval-Valued Variables.” Computational Statistics & Data Analysis 54 (2): 333–47. https://doi.org/10.1016/j.csda.2009.08.010.

Xu, Wei. 2010. “Symbolic Data Analysis: Interval-Valued Data Regression.” PhD thesis, University of Georgia, USA.

Gil, María Ángeles, Gil González-Rodríguez, Ana Colubi, and Manuel Montenegro. 2007. “Testing Linear Independence in Linear Models with Interval-Valued Data.” Computational Statistics & Data Analysis 51 (6): 3002–15. https://doi.org/10.1016/j.csda.2006.01.015.

Blanco-Fernández, Angela, Norberto Corral, and Gil González-Rodríguez. 2011. “Estimation of a Flexible Simple Linear Model for Interval Data Based on Set Arithmetic.” Computational Statistics & Data Analysis 55 (9): 2568–78. https://doi.org/10.1016/j.csda.2011.03.005.

Souza, Renata M. C. R., Diego C. F. Queiroz, and Francisco José A. Cysneiros. 2011. “Logistic Regression-Based Pattern Classifiers for Symbolic Interval Data.” Pattern Analysis and Applications 14 (3): 273–82. https://doi.org/10.1007/s10044-011-0222-1.

Brito, Paula, and A. Pedro Duarte Silva. 2012. “Modelling Interval Data with Normal and Skew-Normal Distributions.” Journal of Applied Statistics 39 (1): 3–20. https://doi.org/10.1080/02664763.2011.575125.

Fagundes, Roberta A.A., Renata M.C.R. Souza, and Francisco José A. Cysneiros. 2013. “Robust Regression with Application to Symbolic Interval Data.” Engineering Applications of Artificial Intelligence 26 (1): 564–73. https://doi.org/10.1016/j.engappai.2012.05.004.

Sun, Y., and C. Li. 2014. “Linear Regression for Interval-Valued Data: A New and Comprehensive Model.” arXiv e-prints, January. http://arxiv.org/abs/1401.1831.

Lima Neto, Eufrásio Andrade, Gauss M. Cordeiro, and Francisco Assis T. De Carvalho. 2011. “Bivariate Symbolic Regression Models for Interval-Valued Variables.” Journal of Statistical Computation and Simulation 81 (11): 1727–44. https://doi.org/10.1080/00949655.2010.500470.

Iwasaki, Masakazu, and Hiroe Tsubaki. 2005. “A New Bivariate Distribution in Natural Exponential Family.” Metrika 61 (3): 323–36. https://doi.org/10.1007/s001840400348.

R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/.

Lima Neto, Eufrásio Andrade, and Claudio A. Vasconcelos. 2012. iRegression: Regression Methods for Interval-Valued Variables. http://CRAN.R-project.org/package=iRegression.

Queiroz Filho, Ricardo Jorge Almeida, and Roberta Andrade Araujo Fagundes. 2012. ISDA.R: Interval Symbolic Data Analysis for R. http://CRAN.R-project.org/package=ISDA.R.

Rodriguez, Oldemar, Olger Calderon, and Roberto Zuniga. 2014. RSDA: R to Symbolic Data Analysis. http://CRAN.R-project.org/package=RSDA.

Brahim, Brahim, and Sun Makosso-Kallyth. 2013. GPCSIV: Generalized Principal Component of Symbolic Interval Variables. http://CRAN.R-project.org/package=GPCSIV.

Silva, Pedro Duarte, and Paula Brito. 2015. MAINT.Data: Model and Analyse Interval Data. http://CRAN.R-project.org/package=MAINT.Data.

Dudek, Andrzej, Marcin Pelka, and Justyna Wilk. 2013. symbolicDA: Analysis of Symbolic Data. http://CRAN.R-project.org/package=symbolicDA.