About Us

The continuously growing capacity to acquire and store data sets calls for new approaches to process data efficiently and extract relevant information. In fact, large data sets are most interesting when they are actually 'strange' and allow us to learn about the complex mechanisms generating them. In these contexts, it may be difficult or even counterproductive to employ parametric statistical models for the learning process.

The Center for the Discovery of Structures in Complex Data is funded by a grant awarded in 2018 by Iniciativa Científica Milenio of the Chilean Ministry of Economy to a group of statisticians. The Center is based at Pontificia Universidad Católica de Chile. It focuses on new statistical approaches for the efficient identification, reconstruction and classification of relevant structural information in complex data sets.

Associate Researchers

Bevilacqua, Moreno

Full Professor. Department of Statistics, Universidad de Valparaíso.

Jara, Alejandro (Director)

Associate Professor. Department of Statistics, School of Mathematics, Pontificia Universidad Católica de Chile.

Quintana, Fernando (Deputy Director)

Full Professor. Department of Statistics, School of Mathematics, Pontificia Universidad Católica de Chile.

Sing-Long, Carlos

Assistant Professor. Institute of Mathematical and Computational Engineering, School of Mathematics and Engineering, Pontificia Universidad Católica de Chile.

Young Researchers

Beaudry, Isabelle

Assistant Professor. Department of Statistics, School of Mathematics, Pontificia Universidad Católica de Chile.

García-Zattera, María José

Assistant Professor. Department of Statistics, School of Mathematics, Pontificia Universidad Católica de Chile.

Guzman, Cristobal

Ph.D. in Mathematics, Institute for Mathematical and Computational Engineering, School of Mathematics and Engineering, Pontificia Universidad Católica de Chile.

Senior Researchers

MacEachern, Steve

Full Professor. Department of Statistics, The Ohio State University.

Müller, Peter

Full Professor. Department of Mathematics, The University of Texas at Austin.

Porcu, Emilio

Chair of Statistics, Trinity College, Dublin.

Prünster, Igor

Full Professor. Institute of Data Science and Analytics, Bocconi University.

Research Lines

For the period 2018-2021, the Center for the Discovery of Structures in Complex Data will focus on the following aspects of statistical learning in the context of complex data:

(I) The development, study of properties, and the implementation of scalable Bayesian nonparametric approaches for collections of probability measures indexed by predictors, including when both responses and predictors are defined on non-standard spaces,

(II) The development, study of properties, and the implementation of nonparametric approaches for misclassified doubly-interval-censored time-to-event data, and

(III) The development, study of properties, and the implementation of nonparametric approaches for space and time data.

Visitors

Past Visitors

  • Garritt Page, Associate Professor of Statistics, Department of Statistics, Brigham Young University, January 27-31st, 2020
  • Ramses Mena, Associate Professor, Department of Statistics, Departamento de Probabilidad y Estadística, IIMAS, UNAM, November 18 - 23rd, 2019
  • Alicia Carriquiry, Professor of Statistics, Department of Statistics, Iowa State University. October 21-25th, 2019
  • Marc Genton, Professor of Statistics, Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology. October 21-25th, 2019
  • Antonio Lijoi, Professor of Statistics, Institute of Data Science and Analytics, Bocconi University. October 21-25th, 2019
  • Bernardo Nipoti, Associate Professor of Statistics, Department of Economics, Management and Statistics, Università degli Studi di Milano Bicocca. October 21-25th, 2019
  • Igor Prünster, Professor of Statistics, Institute of Data Science and Analytics, Bocconi University. October 21-25th, 2019
  • Daniela Castro, Assistant Professor, School of Mathematics and Statistics, University of Glasgow. June 10-14th, 2019
  • Alessandra Guglielmi, Professor of Statistics, Department of Mathematics, Politecnico di Milano, May 6 - 14, 2019
  • Federico Crudu, Assistant Professor, Department of Economics and Statistics, University of Siena, May 13th - 22nd, 2019
  • Ilias Diakonikolas, Assistant Professor, Department of Computer Science, University of Southern California, April, 1-5, 2019
  • Peter Müller, Professor, Department of Mathematics, The University of Texas at Austin, March 27th - April 7th, 2019
  • Edward Bedrick, Professor, Department of Epidemiology and Biostatistics, The University of Arizona, March 24 - 31, 2019
  • David B. Dahl, Professor, Department of Statistics, Brigham Young University, March, 24 - 31, 2019
  • Carlos Díaz-Avalos. Professor, Departamento de Probabilidad y Estadística, IIMAS, UNAM, March 24 - 31, 2019
  • Michele Guindani, Professor, Department of Statistics, University of California at Irvine, March 24 - 31, 2019
  • Wesley Johnson, Professor, Department of Statistics, University of California at Irvine, March 24 - 31, 2019
  • Gary Rosner, Professor, Biostatistics/Bioinformatics Division, Johns Hopkins University, March 24 - 31, 2019
  • Babak Shahbaba, Associate Professor, Department of Statistics, University of California at Irvine, March 24 - 31, 2019
  • Heejung Shim, Assistant Professor, School of Mathematics and Statistics, University of Melbourne, March 24 - 31, 2019
  • Steve MacEachern, Professor, Department of Statistics, Ohio State University, March 24 - 31, 2019
  • Ramses Mena, Associate Professor, Department of Statistics, Departamento de Probabilidad y Estadística, IIMAS, UNAM, March 24 - 31, 2019
  • Oscar Peralta, Research Associate, School of Mathematical Sciences, The University of Adelaide, March 24 - 31, 2019
  • Jessica Utts, Professor, Department of Statistics, University of California at Irvine, March 24 - 31, 2019
  • Debajyoti Sinha, Professor, Department of Statistics, Florida State University, March, 18 - 22, 2019
  • Amy Herring, Professor, Department of Statistical Sciences, Duke University, January, 14 - 20, 2019
  • David B. Dahl, Professor, Department of Statistics, Brigham Young University, January, 8 - 17th, 2019
  • Tamara Fernández, Research Associate, Gatsby Computational Neuroscience Unit, University College London, December 17th, 2018 - January 14th, 2019
  • Carlos Díaz-Avalos. Professor, Departamento de Probabilidad y Estadística, IIMAS, UNAM, 12 - 26th, 2018
  • Nishant Mehta, Assistant Professor, Department of Computer Science, University of Victoria, November, 12 - 26th, 2018
  • Alejandro Murua. Professor, Department of Statistics, University of Montreal. September, 7 - 16th, 2018
  • Garritt Page. Associate Professor, Department of Statistics, Brigham Young University. August, 7 - 14th, 2018
  • Evan Ray, Assistant Professor, Department of Statistics, Mount Holyoke College. August, 12 - 17th, 2018

Future Research Seminars

Past Research Seminars

2020

Automated learning of t factor analysis models with complete and incomplete data

The t factor analysis (tFA) model is a promising tool for robust reduction of high-dimensional data in the presence of heavy-tailed noise. When determining the number of factors of the tFA model, a two-stage procedure is commonly performed in which parameter estimation is carried out for a number of candidate models, and then the best model is chosen according to certain penalized likelihood indices such as the Bayesian information criterion. However, the computational burden of such a procedure can be extremely high, particularly for very large data sets. In this paper, we develop a novel automated learning method in which parameter estimation and model selection are seamlessly integrated into a one-stage algorithm. This new scheme is called the automated tFA (AtFA) algorithm, and it also works when values are missing. In addition, we derive the Fisher information matrix to approximate the asymptotic covariance matrix associated with the ML estimators of tFA models. Experiments on real and simulated data sets reveal that the AtFA algorithm not only provides fitting results identical to those of traditional two-stage procedures, but also runs much faster, especially when values are missing.

Intertwinings for Markov branching processes

Using a stochastic filtering framework, we devise some intertwining relationships in the setting of Markov branching processes. One of our results turns out to be the basis of an exact simulation method for this kind of process. Also, the population dynamic scheme inherent in the model helps to study the behavior of prolific individuals by observing the total size of the population. Moreover, we study a population with two types of immigration, where only the total immigration is observed and our objective is to study each immigration separately. This result allows us to link continuous-time Markov chains with continuous-state branching (CB) processes.

Bayesian nonparametric hypothesis testing procedures

Scientific knowledge is firmly based on the use of statistical hypothesis testing procedures. A scientific hypothesis can be established by performing one or many statistical tests based on the evidence provided by the data. Given the importance of hypothesis testing in science, these procedures are an essential part of statistics. The literature on hypothesis testing is vast and covers a wide range of practical problems. However, most of the methods are based on restrictive parametric assumptions. In this talk, we will discuss Bayesian nonparametric approaches to construct hypothesis tests in different contexts. Our proposal draws on the model selection literature to define Bayesian tests for multiple samples, paired samples, and longitudinal data analysis. Applications with real-life datasets and illustrations with simulated data will be discussed.

Linking measurements: a Bayesian nonparametric approach

Equating methods are a family of statistical models and methods used to adjust scores on different test forms so that scores can be comparable and used interchangeably. These methods rely on functions that transform scores on two or more versions of a test. Most of the proposed approaches for the estimation of these functions are based on continuous approximations of the score distributions, even though these distributions are, most of the time, discrete. Considering scores as ordinal random variables, we propose a flexible dependent Bayesian nonparametric model for test equating. The new approach avoids continuity assumptions on the score distributions, in contrast to current equating methods. Additionally, it allows the use of covariates in the estimation of the score distribution functions, an approach not explored at all in the equating literature. Applications of the proposed model to real and simulated data under different sampling designs are discussed. Several methods are considered to evaluate the performance of our method and to compare it with current methods of equating. Compared with discrete versions of equated scores obtained from traditional equating methods, the results show that the proposed method performs better.

On modeling and estimating geo-referenced count spatial data

Modeling spatial data is a challenging task in statistics. In many applications, the observed data can be modeled using Gaussian, skew-Gaussian or even restricted random field models. However, in several fields, such as population genetics, epidemiology and aquaculture, the data of interest are often count data, and therefore the mentioned models are not suitable for their analysis. Consequently, there is a need for spatial models that are able to properly describe data coming from counting processes. Three approaches are commonly used to model this type of data: GLMMs with Gaussian random field (GRF) effects, hierarchical models, and copula models. Unfortunately, these approaches do not give an explicit characterization of the count random field, such as its q-dimensional distribution or correlation function. It is important to stress that GLMMs and hierarchical models induce a discontinuity in the path. Therefore, samples located nearby are more dissimilar in value than in the case when the correlation function is continuous at the origin. Moreover, there are cases in which the copula representation for discrete distributions is not unique, so it is unidentifiable. To deal with this, we propose a novel approach to model spatial count data in an efficient and accurate manner. Briefly, starting from independent copies of a “parent” Gaussian random field, a set of transformations can be applied, and the result is a non-Gaussian random field. This approach is based on the characterization of count random fields that inherit the well-known geometric properties from Gaussian random fields.

On the Support of Yao-Based Random Ordered Partitions for Change-Point Analysis

TBA.

Respondent-Driven Sampling: Challenges and Opportunities

Respondent-driven sampling leverages social networks to sample hard-to-reach human populations, including people who inject drugs, sexual minorities, sex workers, and migrants. As with other link-tracing sampling strategies, sampling involves recruiting a small convenience sample, whose members invite their contacts into the sample, who in turn invite their own contacts, until the desired sample size is reached. Typically, the sample is used to estimate prevalence, though multivariable analyses of data collected through respondent-driven sampling are becoming more common. Although respondent-driven sampling may allow for quickly attaining large and varied samples, its reliance on social network contacts, participant recruitment decisions, and self-report of ego-network size makes it subject to several concerns for statistical inference. After introducing respondent-driven sampling, I will discuss how these data are actually being collected and analyzed, and opportunities for statisticians to improve upon this widely adopted method.

FATSO: A family of operators for variable selection in linear models

In linear models it is common to have situations where several regression coefficients are zero. In these situations a common tool to perform regression is a variable selection operator. One of the most common such operators is the LASSO operator, which promotes point estimates that are exactly zero. The LASSO operator and similar approaches, however, offer little in terms of easily interpretable parameters to determine the degree of variable selectivity. In this paper we propose a new family of selection operators that builds on the geometry of LASSO but yields an easily interpretable way to tune selectivity. These operators correspond to Bayesian prior densities and hence are suitable for Bayesian inference. We present some examples using simulated and real data, with promising results.
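
As a small illustration of how LASSO sets coefficients exactly to zero, the sketch below uses the soft-thresholding rule. It is not the FATSO operator from the talk. It assumes an orthonormal design, in which case the LASSO estimate is the soft-thresholded least-squares coefficient vector. The function names and the numbers are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; return exactly 0.0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_orthonormal(ols_coefs, lam):
    """LASSO estimate under an orthonormal design (an assumption of this sketch).

    Minimizes 0.5 * ||y - X b||^2 + lam * ||b||_1 when X^T X = I, in which
    case each coefficient is the soft-thresholded least-squares coefficient.
    """
    return [soft_threshold(b, lam) for b in ols_coefs]


# Hypothetical least-squares coefficients, for illustration only.
ols = [3.0, -0.4, 0.9, -2.5]
estimate = lasso_orthonormal(ols, lam=1.0)
print(estimate)
```

Here the single parameter `lam` controls selectivity: coefficients whose magnitude does not exceed it are dropped, and the remaining ones are shrunk by the same amount.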

2019

Visibility Imputation for Population Size Estimation using Respondent-Driven Sampling

Respondent-driven sampling (RDS) is a network sampling method commonly used to access hidden populations, such as those at high risk for HIV/AIDS and related diseases, in situations where sampling frames do not exist and conventional sampling techniques are not possible. In RDS, participants recruit their peers into the study, which has proven effective as an enrollment strategy but requires careful statistical analysis when making inference about the population. Data from RDS surveys inform key policy and resource allocation decisions, and in particular population size estimates are essential to understand counts of at-risk individuals to develop counseling and treatment programs and monitor health needs and epidemics. Successive sampling population size estimation (SS-PSE) is a commonly used method to estimate population size from RDS surveys, in which the decrease in social network size of participants over the study period is used to gauge the sample fraction. However, SS-PSE relies on self-reported social network sizes, which are subject to missingness, misreporting, and bias, and it is not robust to extreme values. In this talk, we present a modification to the SS-PSE methodology that jointly models the effective social network size of each individual along with the population size in a Bayesian framework. The model for effective network size, which we call visibility to reflect its usage as a proxy for inclusion probability, incorporates a measurement error model for self-reported social network size, as well as the number of recruits an individual was able to enroll and the time they had to recruit. We present and assess the imputed visibility SS-PSE framework, and demonstrate its utility using an RDS study of people who inject drugs (PWID) from Kosovo.

Performance of asymmetric links and correction methods for imbalanced data in binary regression

In binary regression, imbalanced data result from the presence of values equal to zero (or one) in a proportion that is significantly greater than the corresponding real values of one (or zero). In this work, we evaluate two methods developed to deal with imbalanced data and compare them to the use of asymmetric links. The results of a simulation study show that correction methods do not adequately correct bias in the estimation of regression coefficients and that the models with the power and reverse power links considered produce better results for certain types of imbalanced data. Additionally, we present an application for imbalanced data, identifying the best model among the various ones proposed. The parameters are estimated using a Bayesian approach, based on the Hamiltonian Monte Carlo method with the No-U-Turn Sampler algorithm. The models are compared using different criteria for model comparison, predictive evaluation and quantile residuals.

Birnbaum-Saunders Linear Mixed-Effects Models with Censored Data: Bayesian MCMC Inference

Linear mixed-effects models are commonly used in data analysis when the responses are clustered around some random effects. This paper focuses on Bayesian inference for the log-Birnbaum-Saunders linear mixed (log-BSLM) models, previously defined in the literature from a frequentist point of view. We explore the use of Markov chain Monte Carlo (MCMC) methods, which provide an alternative to the marginal maximum likelihood approach, which depends on an approximation of the likelihood. Besides parameter estimation, we develop residual analysis, influence diagnostics, model comparison and Bayesian prediction. We develop two MCMC algorithms, with and without a certain acceleration procedure. Simulation studies are conducted under different scenarios of interest, and they show that the Bayesian approach, in general, provides better results than the frequentist one. In addition, the algorithm with the acceleration procedure proved better, in terms of convergence, than the usual MCMC approach. A real data set is also analyzed, showing that our approach works properly. Finally, some directions toward possible extensions are discussed.

A spliced Gamma-Generalized Pareto model for short-term extreme wind speed probabilistic forecasting

Renewable sources of energy, such as wind power, have become a sustainable alternative to fossil fuel-based energy. However, the uncertainty and fluctuation of wind speed, which stem from its intermittent nature, pose a great threat to the stability of wind power production and to the wind turbines themselves. Lately, much work has been done on developing models to forecast average wind speed values, yet surprisingly little has focused on proposing models to accurately forecast extreme wind speeds, which can damage the turbines. In this work, we develop a flexible spliced Gamma-Generalized Pareto model to forecast extreme and non-extreme wind speeds simultaneously. Our model belongs to the class of latent Gaussian models, for which inference is conveniently performed based on the integrated nested Laplace approximation method. Considering a flexible additive regression structure, we propose two models for the latent linear predictor to capture the spatio-temporal dynamics of wind speeds. Our models are fast to fit and can describe both the bulk and the tail of the wind speed distribution while producing short-term extreme and non-extreme wind speed probabilistic forecasts.

Inference in instrumental variables models with heteroskedasticity and many instruments

In this talk we propose a specification test for instrumental variable models that is robust to the presence of heteroskedasticity. The test can be seen as a generalization of the Anderson-Rubin test. Our approach is based on the jackknife principle. We are able to show that under the null the proposed statistic has a Gaussian limiting distribution. Moreover, a simulation study shows its competitive finite sample properties in terms of size and power.

Determinantal Point Process Mixtures Via Spectral Density Approach

We consider mixture models where location parameters are a priori encouraged to be well separated. We explore a class of determinantal point process (DPP) mixture models, which provide the desired notion of separation or repulsion. Instead of using the rather restrictive case where analytical results are partially available, we adopt a spectral representation from which approximations to the DPP density functions can be readily computed. For the sake of concreteness the presentation focuses on a power exponential spectral density, but the proposed approach is in fact quite general. We later extend our model to incorporate covariate information in the likelihood and also in the assignment to mixture components, yielding a trade-off between repulsiveness of locations in the mixtures and attraction among subjects with similar covariates. We develop full Bayesian inference, and explore model properties and posterior behavior using several simulation scenarios and data illustrations. The talk is based on the following paper: Bianchini, Guglielmi, Quintana (2019), Determinantal Point Process Mixtures Via Spectral Density Approach, Bayesian Analysis.

Algorithmic Questions in High-Dimensional Robust Statistics

Fitting a model to a collection of observations is one of the quintessential questions in statistics. The standard assumption is that the data was generated by a model of a given type (e.g., a mixture model). This simplifying assumption is at best only approximately valid, as real datasets are typically exposed to some source of contamination. Hence, any estimator designed for a particular model must also be robust in the presence of corrupted data. This is the prototypical goal in robust statistics, a field that took shape in the 1960s with the pioneering works of Tukey and Huber. Until recently, even for the basic problem of robustly estimating the mean of a high-dimensional dataset, all known robust estimators were hard to compute. Moreover, the quality of the common heuristics degrades badly as the dimension increases. In this talk, we will survey the recent progress in algorithmic high-dimensional robust statistics. We will describe the first computationally efficient algorithms for robust mean and covariance estimation and the main insights behind them. We will also present practical applications of these estimators to exploratory data analysis and adversarial machine learning. Finally, we will discuss new directions and opportunities for future work. The talk will be based on a number of joint works with (various subsets of) G. Kamath, D. Kane, J. Li, A. Moitra, and A. Stewart.

Semiparametric Bayesian latent variable regression for skewed multivariate data

For many real-life studies with skewed multivariate responses, the level of skewness and association structure assumptions are essential for evaluating the covariate effects on the response and its predictive distribution. We present a novel semiparametric multivariate model and associated Bayesian analysis for multivariate skewed responses. Similar to the multivariate Gaussian, this multivariate model is closed under marginalization, allows a wide class of multivariate associations, and has meaningful physical interpretations of skewness levels and covariate effects on the marginal density. Other desirable properties of our model include the Markov Chain Monte Carlo computation through available statistical software, and the assurance of consistent Bayesian estimates of the parameters and the nonparametric error density under a set of plausible prior assumptions. We illustrate the practical advantages of our methods over existing alternatives via simulation studies, the analysis of a clinical study on periodontal disease and extensions to Bayesian regression trees. This is joint work with Drs. A. Bhingare, S. Lipsitz, D. Bandopadyay and A. Linero.

Summarizing distributions of latent structure

In a typical Bayesian analysis, considerable effort is placed on "fitting the model" (e.g., obtaining samples from the posterior distribution) but this is only half of the inference problem. Meaningful inference usually requires summarizing the posterior distribution of the parameters of interest. Posterior summaries can be especially important in communicating the results and conclusions from a Bayesian analysis to a diverse audience. If the parameters of interest live in R^n, common posterior summaries are means, medians, and modes. Summarizing posterior distributions of parameters with complicated structure is a more difficult problem. For example, the "average" network in the posterior distribution on a network is not easily defined. This paper reviews methods for summarizing distributions of latent structure and then proposes a novel search algorithm for posterior summaries. We apply our method to distributions on variable selection indicators, partitions, feature allocations, and networks. We illustrate our approach in a variety of models for both simulated and real datasets.

2018

RKHS testing for censored data

We introduce kernel-based tests for censored data, where observations may be missing in random time intervals: a common occurrence in clinical trials and industrial life testing. Our approach is based on computing distances between probability distribution embeddings in a reproducing kernel Hilbert space (RKHS). Previously, this approach has been applied in many machine learning and statistical settings with very good results. The main advantages of these methods are the ability of kernels to deal with complex data and high dimensionality. In this talk we revert to the real-line problem in which the complexity of the data is due to censored observations. In particular, we propose an extension of this set of tools to censored data, derive its asymptotic properties and explain its relation to dominant approaches in survival analysis such as the log-rank test. We conclude with an empirical evaluation of our methods, in which they outperform competing approaches in multiple scenarios.

A sequential approach to updating posterior information

In this talk we show the performance of a sequential Monte Carlo (SMC) algorithm. As a prerequisite to understanding it, we discuss the Metropolis-Hastings algorithm and also illustrate the general idea of particle-based methods. The SMC algorithm presented here is a particular case of the sequential methods, where the objective is to update the posterior distribution in "static" models.
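
For readers unfamiliar with the Metropolis-Hastings algorithm mentioned in this abstract, the following is a minimal sketch and not the speaker's implementation. It assumes a symmetric Gaussian random-walk proposal (the Metropolis special case) and a standard normal target. The names `log_target` and `metropolis_hastings`, the step size, and the seed are all illustrative choices.

```python
import math
import random


def log_target(x):
    """Log-density of a standard normal, up to an additive constant (illustrative target)."""
    return -0.5 * x * x


def metropolis_hastings(log_density, n_samples, step=1.0, x0=0.0, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_density(proposal) - log_density(x)
        # Accept with probability min(1, exp(log_ratio)); the proposal
        # densities cancel because the proposal is symmetric.
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        # On rejection the chain stays at x, which is recorded again.
        samples.append(x)
    return samples


samples = metropolis_hastings(log_target, n_samples=20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

With enough iterations the sample mean and variance should be close to 0 and 1, the moments of the target. Sequential Monte Carlo methods typically use moves of this kind to rejuvenate particles as the posterior is updated.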

Spatial point processes as an analysis tool in ecology

Spatial point processes have gained popularity in recent years because of their usefulness in answering a variety of questions in scientific fields. In community ecology, point processes have proved useful for detecting intra- and interspecific interactions in forest ecosystems and for assessing the risk of, and the factors associated with, ecological disturbances such as forest fires. Although estimating model parameters in applications of spatial point processes can be complicated, computational advances have made acceptable numerical approximations possible, which has contributed to their use in diverse fields of human knowledge. This talk presents an overview of the theoretical foundations of spatial point processes and illustrates them with an example of their application to the construction of forest fire risk maps.

Fast Rates for Unbounded Losses: from ERM to Generalized Bayes

I will present new excess risk bounds for randomized and deterministic estimators, discarding boundedness assumptions to handle general unbounded loss functions like log loss and squared loss under heavy tails. These bounds have a PAC-Bayesian flavor in both derivation and form, and their expression in terms of the information complexity forms a natural connection to generalized Bayesian estimators. The bounds hold with high probability and a fast $\tilde{O}(1/n)$ rate in parametric settings, under the recently introduced 'central' condition (or various weakenings of this condition with consequently weaker results) and a type of 'empirical witness of badness' condition. The former conditions are related to the Tsybakov margin condition in classification and the Bernstein condition for bounded losses, and they help control the lower tail of the excess loss. The 'witness' condition is new and suitably controls the upper tail of the excess loss. These conditions and our techniques revolve tightly around a pivotal concept, the generalized reversed information projection, which generalizes the reversed information projection of Li and Barron. Along the way, we connect excess risk (a KL divergence in our language) to a generalized Rényi divergence, generalizing previous results connecting Hellinger distance to KL divergence. This is joint work with Peter Grünwald.

Discovering Interactions Using Covariate Informed Random Partition Models

Combination chemotherapy treatment regimens created for patients diagnosed with childhood acute lymphoblastic leukemia have had great success in improving cure rates. Unfortunately, patients prescribed these types of treatment regimens have displayed susceptibility to the onset of osteonecrosis. Some have suggested that this is due to pharmacokinetic interaction between two agents in the treatment regimen (asparaginase and dexamethasone) and other physiological variables. Determining which physiological variables to consider when searching for interactions in scenarios like these, minus a priori guidance, has proved to be a challenging problem, particularly if interactions influence the response distribution in ways beyond shifts in expectation or dispersion only. In this paper we propose an exploratory technique that is able to discover associations between covariates and responses in a very general way. The procedure connects covariates to responses very flexibly through dependent random partition prior distributions, and then employs machine learning techniques to highlight potential associations found in each cluster. We apply the method to data produced from a study dedicated to learning which physiological predictors influence severity of osteonecrosis multiplicatively.

Cox regression with Potts-driven latent clusters model

We consider a Bayesian nonparametric survival regression model with latent partitions. Our goal is to predict survival, and to cluster survival patients within the context of building prognosis systems. We propose the Potts clustering model as a prior on the covariate space so as to drive cluster formation on individuals and/or Tumor-Node-Metastasis stage system patient blocks. For any given partition, our model assumes an interval-wise Weibull distribution for the baseline hazard rate. The number of intervals is unknown. It is estimated with a lasso-type penalty given by a sequential double exponential prior. Estimation and inference are done with the aid of MCMC. To simplify the computations, we use Laplace's approximation method to estimate some constants, and to propose parameter updates within MCMC. We illustrate the methodology with an application to cancer survival.

A Bayesian Nonparametric Multiple Testing Procedure for Comparing Several Treatments Against a Control

We propose a Bayesian nonparametric strategy to test for differences between a control group and several treatment regimes. Most of the existing tests for this type of comparison are based on the differences between location parameters. In contrast, our approach identifies differences across the entire distribution, avoids strong modeling assumptions over the distributions for each treatment, and accounts for multiple testing through the prior distribution on the space of hypotheses. The proposal is compared to other commonly used hypothesis testing procedures under simulated scenarios. A real application is also analyzed with the proposed methodology.

Temporal and Spatio-Temporal Random Partition Models

Data that are spatially referenced often represent an instantaneous point in time at which the spatial process is measured. Because of this it is becoming more common to monitor spatial processes over time. We propose capturing the temporal evolution of dependent structures by jointly modeling a sequence of partitions indexed by time. We derive a few characteristics from the joint model and show how it impacts dependence at the observation level. Computation strategies are detailed, and we apply the method to Chilean standardized testing scores.

Workshops

MiDaS workshops aim to highlight recent advances in modeling and computation through the lens of applied, domain-driven problems that require flexible statistical models. The workshops bring together leading experts and talented young researchers working on applications and theory of flexible parametric and nonparametric (Bayesian) statistics. The workshops focus on new statistical approaches for the efficient identification, reconstruction and classification of relevant structural information in complex data sets.

The MiDaS 2019 workshop was held at the Hotel Enjoy Viña del Mar, Viña del Mar, Chile, from March 25th to 29th, 2019. For more details please click here.

Conferences

MiDaS conferences aim to highlight recent advances in modeling and computation through the lens of applied, domain-driven problems that require flexible statistical models. The conferences bring together leading experts and talented young researchers working on applications and theory of flexible parametric and nonparametric (Bayesian) statistics. The conferences focus on new statistical approaches for the efficient identification, reconstruction and classification of relevant structural information in complex data sets.

The MiDaS 2019 conference took place in Puerto Varas, Chile, October 21-25, 2019. On this occasion, we hosted the meeting of the Chilean Society of Statistics, XLV Jornadas Nacionales de la Sociedad Chilena de Estadística. This is the largest gathering of statisticians and data scientists held in Chile. The confirmed keynote speakers were Alicia Carriquiry, Marc Genton, Antonio Lijoi, Bernardo Nipoti, and Igor Prünster. For more details please click here.

MiDaS Outreach Videos

MiDaS - Outreach Video 1
MiDaS - Outreach Video 1 (Spanish)

Other Videos about Statistics

General video about statistics (Spanish)
TED talk by Arthur Benjamin
TED talk by Alan Smith
Statistics is for everyone
Statisticians making a difference
Statisticians in other fields

Big DATA Olympiad

This is a contest in which teams of high school students from Chilean schools solve problems of data analysis. The objective of the competition is to stimulate the interest of students in Statistics and Data Science.

The intent of the competition is to allow competitors to ‘get their hands dirty’ by performing in-depth analysis of the data in order to come up with the best recommendation to address the problem.

The competition has two stages. In the pre-selection phase, the teams must prepare a written report using basic statistical techniques and MS Excel. The teams selected in this stage are invited to a week of training at the Faculty of Mathematics of the Pontifical Catholic University of Chile. The training covers modern techniques for the description and visualization of data, and statistical software such as R. After the training, the final competition is carried out. The costs of stay and transfer of selected teams from regions other than the Metropolitan one are covered by the competition.

The Selection Committee is formed by Professors of the Department of Statistics of the Faculty of Mathematics of the UC.

For details on the next version of the competition please click here.

Future Outreach Conferences and Seminars

Women in Data Science Santiago at UC

As part of the 2021 Stanford Women in Data Science (WiDS) conference, MiDaS is proud to host an event celebrating the women of statistics and data science in Santiago.

The WiDS initiative aims to inspire and educate data scientists worldwide, regardless of gender, and support women in the field. WiDS started as a conference at Stanford in November 2015. Now, WiDS includes a global conference, with 150+ regional events worldwide; a datathon, encouraging participants to hone their skills; and a podcast, featuring leaders in the field talking about their work, and their journeys.

We invite all women (and the men who want to support them) to join us for a day of conversation, connection, networking, training and awareness raising. Speakers include industry leaders, shapeshifters and datapreneurs.

For more details please click here.


Past Outreach Conferences and Seminars

Women in Data Science Santiago at UC

As part of the 2020 Stanford Women in Data Science (WiDS) conference, MiDaS was proud to host an event celebrating the women of statistics and data science in Santiago.

The meeting was held online on Saturday, 3rd October 2020.

Women in Data Science Santiago at UC

As part of the 2019 Stanford Women in Data Science (WiDS) conference, MiDaS was proud to host an event celebrating the women of statistics and data science in Santiago.

The meeting took place in Santiago on Monday, 4th March 2019.

The Big Data Revolution in Biomedical Research

In association with 'Congreso Futuro 2019', MiDaS was proud to host this event at the Catholic University of Chile. In this seminar, leading international researchers discussed, for the general public, the revolution that large data sets have generated in biomedical research. The seminar took place on January 15th, 2019. The speakers included Professors Amy Herring of Duke University, Gerd Antes of the University of Freiburg, and Harris Lewin of the University of California at Davis.

Big Data: The revolution of the information in Biomedical Research

MiDaS, along with the School of Medicine of the Catholic University of Chile, co-organized this event at the Catholic University of Chile, where some researchers from MiDaS gave talks to illustrate how the research results obtained in our center can help researchers in Biomedical Sciences to reach better conclusions. The event took place on December 18th, 2018.

Outreach Talks

News

Past News

Our associate researcher Fernando Quintana will be giving a Foundational Lecture in the ISBA 2020 world meeting.

More information here →

Our Director Alejandro Jara has been nominated as a member of the Scientific Committee for the 2020 ISBA world meeting.

More information here →

Our Director Alejandro Jara has been renewed as Associate Editor for the period 2019-2021 for the prestigious journal Bayesian Analysis.

More information here →

Our associate researcher Fernando Quintana has been renewed as Associate Editor for the period 2019-2021 for the prestigious journal Bayesian Analysis.

More information here →

Our associate researcher Fernando Quintana gave a semi-plenary talk in the XV Congreso Latinoamericano de Probabilidad y Estadística.

More information here →

Luz has very successfully defended her PhD thesis and is the newest doctor in our group! Sincere thanks also to her advisor Prof. Luis Gutierrez, and the thesis committee members Prof. Fernando Quintana, Prof. Alejandro Jara and Prof. Felipe Barrientos for making the event a great success.

Our Director Alejandro Jara has been nominated as a member of the selection committee for the Mitchell Prize of ISBA. The Mitchell Prize is awarded in recognition of an outstanding paper that describes how a Bayesian analysis has solved an important applied problem.

More information here →

Our associate researcher Fernando Quintana will be a speaker of the '11th International Conference of the ERCIM WG on Computational and Methodological Statistics (CMStatistics 2018)'.

More information here →

Our young researcher Isabelle Beaudry will be a speaker at the '3era Jornada Franco-Chilena de Estadística' on August 2nd, 2018.

More information here →

Job opportunities

Postdoc position
MiDaS invites applications for up to two postdoctoral fellows to conduct research at the intersection of Bayesian methods, survival analysis, mismeasured data, and/or spatial statistics. The positions will be supervised by MiDaS Associate Researchers, with opportunities to conduct research at the Department of Statistics, Pontificia Universidad Católica de Chile. Opportunities to collaborate with other members of MiDaS will be part of both positions.

Expertise in Bayesian nonparametric statistics, survival analysis or spatial statistics (not necessarily all three) is strongly encouraged, with expectations to contribute to ongoing methodological development in (i) the development, study of properties, and implementation of scalable Bayesian nonparametric approaches for collections of probability measures indexed by predictors, including when both responses and predictors are defined on non-standard spaces, (ii) the development, study of properties, and implementation of nonparametric approaches for misclassified doubly-interval-censored time-to-event data, and (iii) the development, study of properties, and implementation of nonparametric approaches for space and time data. The successful candidates will be encouraged to augment their contribution to ongoing research projects with their own independent research agenda.

Qualifications: PhD in statistics, biostatistics, or other related field. Experience in Bayesian nonparametric statistics, survival analysis or spatial statistics strongly preferred.

Additional Information: To apply, please email a cover letter describing research interests and experience along with a CV and names of three references to: atjara AT uc.cl.

How to Contact Us

Call or email us at

Phone: +56 22 354 4506
Fax: +56 22 354 4506

Send email

Visit us at

Faculty of Mathematics UC,
Campus San Joaquin, Vicuña Mackenna 4860, Macul

View on Google Map

Be social

Twitter: @MiDaS_Chile
Facebook: facebook.com/midas.mat.uc.cl
Instagram: instagram.com/MiDaS_Chile