... a few things done by Liang

Statistical Modeling and Data Mining

Models should be as simple as possible, but not simpler.

-Albert Einstein

Listed here are a few reports and manuscripts related to statistical modeling.

  • Transformed Residual: a New Model Checking Method

Manuscript [in preparation]
Keywords: hierarchical model, marginal distribution, compare distributions, Hellinger distance, estimation inaccuracy, Monte Carlo methods
Abstract: A new model checking procedure based on transformed residuals is proposed. It can handle complex models that involve non-Gaussian distributions or lack a closed-form likelihood function. The method approximates the marginal distribution of the response variables with Monte Carlo methods and uses the Hellinger distance to compare the empirical distribution of the transformed residuals with the standard normal distribution. An additional testing procedure that accounts for estimation inaccuracy is also proposed, which substantially increases the robustness of the method. Generalized linear models and hierarchical models are fitted to both simulated and real data sets, and the performance of the checking method is evaluated.
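The two steps named in the abstract can be illustrated with a minimal sketch. It is not the manuscript's implementation: it assumes the transformation is the probability integral transform followed by the standard normal quantile function, that the response is continuous, and that the Hellinger distance is computed on a fixed set of bins. The function names and the `simulate(i, rng)` callback are hypothetical, and the data are simulated purely for illustration.

```python
import bisect
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def transformed_residuals(y, simulate, n_draws=500, rng=None):
    """Turn each observation into a normal score.

    `simulate(i, rng)` is a hypothetical user-supplied function that draws
    one response for observation i from the fitted model.
    """
    rng = rng or random.Random(0)
    residuals = []
    for i, y_i in enumerate(y):
        draws = [simulate(i, rng) for _ in range(n_draws)]
        # Monte Carlo estimate of the marginal CDF at y_i, kept strictly
        # inside (0, 1) so that the normal quantile is finite.
        u = (sum(d <= y_i for d in draws) + 0.5) / (n_draws + 1)
        residuals.append(STD_NORMAL.inv_cdf(u))
    return residuals


def hellinger_to_std_normal(residuals, n_bins=20, lo=-4.0, hi=4.0):
    """Binned Hellinger distance between the residuals and N(0, 1)."""
    edges = [lo + (hi - lo) * k / n_bins for k in range(n_bins + 1)]
    counts = [0] * (n_bins + 2)  # inner bins plus two open-ended tail bins
    for r in residuals:
        counts[bisect.bisect_left(edges, r)] += 1
    cdf = [STD_NORMAL.cdf(e) for e in edges]
    # Standard normal probability of each bin, tails included
    q = [cdf[0]] + [cdf[k + 1] - cdf[k] for k in range(n_bins)] + [1.0 - cdf[-1]]
    n = len(residuals)
    # Bhattacharyya coefficient; Hellinger distance is sqrt(1 - coefficient)
    bc = sum(math.sqrt((c / n) * q_k) for c, q_k in zip(counts, q))
    return math.sqrt(max(0.0, 1.0 - bc))


# Illustration on simulated data whose true distribution is N(2, 1)
data_rng = random.Random(1)
y = [data_rng.gauss(2.0, 1.0) for _ in range(200)]


def good_model(i, rng):
    return rng.gauss(2.0, 1.0)  # correctly specified model


def bad_model(i, rng):
    return rng.gauss(0.0, 1.0)  # misspecified mean


h_good = hellinger_to_std_normal(transformed_residuals(y, good_model))
h_bad = hellinger_to_std_normal(transformed_residuals(y, bad_model))
print(h_good, h_bad)
```

A distance near 0 suggests the transformed residuals look standard normal, and a larger distance points to misfit. For count data the CDF step would need a randomized or otherwise adjusted version, which this sketch does not cover.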

  • Bayesian Model Checking for Generalized Linear Spatial Models with Count Data

Manuscript [request] | JSM poster [pdf]
Keywords: latent variables, measure of discrepancy, posterior predictive p-value, cross-validation alternative, diagnostic statistic
Abstract: Hierarchical models are increasingly used in a variety of applications. However, checking and selecting hierarchical models remains difficult when the models are complicated by unobservable latent processes ...

  • Development of Automatic Trading System with Machine Learning Algorithms

Report [request]
Keywords: technical indicators, variable selection, neural network, MARS, risk management, trading signal, Monte Carlo simulation
Abstract: A trading system is built to automatically convert historical stock price information into actionable trading signals. A variety of indicators are tested as candidate predictors using variable selection methods as well as the random forest algorithm. Different machine learning algorithms, including neural networks, projection pursuit regression, and multivariate adaptive regression splines, are tested as the prediction generator. The whole system is evaluated on both profit generation and risk management with simulated and real data.

  • Assessment of Data Mining Models with Beijing Transportation Data

Report [pdf] | Data [tran.csv]
Code [func.R] [prep.R] [caseI.R] [caseII.R] [caseIII.R]
Keywords: preprocessing, variable selection, generalized additive model, tree, neural network, cross-validation, bootstrap, overfitting, confusion matrix
Abstract: A data set from an Internet questionnaire about transportation conditions in Beijing is analyzed with data mining techniques. The data set is preprocessed by cleaning the original data and by examining and combining related variables. Several variables are then studied as response variables ...

  • Model Fitting and Comparison for Repeated Measurement Data from Eye Tracking Experiment

Report [pdf]
Code [eye.R] | Data [eye.txt] [id.txt]
Keywords: linear model, random effects, covariance matrix, AR(1), goodness of fit, AIC, log-likelihood, prediction error, residual analysis
Abstract: Data from an eye tracking experiment are introduced. Both linear models and mixed models are fitted to the data, with different covariance structures for the random effects. Goodness of fit is compared across scenarios, and residual analysis is conducted for the chosen models.

  • The Analysis of Car Emissions Data using Bayesian Model Averaging with Comparison to Frequentist Methods

Report [pdf] | Code [R] | Data [txt]
Keywords: generalized linear models, prediction accuracy, model selection, p-value, posterior probability
Abstract: A data set of car emissions is studied. Linear models and generalized linear models are fitted to the data using both Frequentist and Bayesian approaches. Bayesian model averaging techniques are used to search for the best models. The prediction accuracies of all the models are compared, and Frequentist p-values and Bayesian posterior probabilities are calculated for the model coefficients and compared.
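As a rough illustration of how model averaging weights candidate models, the sketch below uses the common BIC approximation to posterior model probabilities with equal prior probability for every model. This is an assumption made for illustration and not necessarily the approach taken in the report. The function names and the BIC values are hypothetical.

```python
import math


def bic_model_weights(bics):
    """Approximate posterior model probabilities from BIC values.

    Assumes equal prior probability for every candidate model, so each
    weight is proportional to exp(-BIC / 2).
    """
    best = min(bics)
    # Subtract the smallest BIC first to avoid underflow in exp()
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]


def averaged_prediction(predictions, weights):
    """Weight each model's prediction by its posterior model probability."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical BIC values for three candidate models
weights = bic_model_weights([100.0, 102.0, 110.0])
print(weights)
```

The model with the smallest BIC receives the largest weight, and models whose BIC is much larger contribute almost nothing to the averaged prediction.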

  • A Short Summary of Performance Evaluation for Employees from A Local Call Center

Report [pdf]
Keywords: information diagram, performance evaluation, sampling strategy, standardization, simulation
Abstract: This is a summary of project information and some proposed methods for a consulting inquiry. The client, a local call center, intends to evaluate its employees' performance and needs help with statistical analysis of historical data and the design of a sampling strategy ...

  • Linear Regression for Air Pollution Data

Report [pdf] | Code [SAS]
Keywords: normality test, correlation analysis, significance test, Bonferroni test, confidence interval
Abstract: Linear models are used to detect the relationship between the concentration of an air pollutant at a specific site and traffic volume as well as meteorological variables. The procedure of model building and validation is demonstrated, along with a variety of coefficient tests.

  • Sample Size Study for Measuring Process Time

Report [pdf]
Keywords: estimation of variance, power function, confidence interval constraint
Abstract: This consulting project helps the client with a sample size study for an engineering experiment. The client wants an experimental design that has enough power to reject a false hypothesis while requiring a small sample size. The estimation of variance and a confidence interval constraint also need to be considered in the study ...
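The trade-off between power and sample size can be sketched with the textbook normal approximation for a two-sided one-sample test of a mean. The test used in the report is not stated here, so this choice is an assumption, as are the function names and the example numbers. The standard deviation is treated as known, whereas the report also deals with estimating the variance.

```python
import math
from statistics import NormalDist


def n_for_power(delta, sigma, alpha=0.05, power=0.80):
    """Smallest n for a two-sided z-test to detect a mean shift `delta`."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


def n_for_ci_half_width(half_width, sigma, conf_level=0.95):
    """Smallest n so the confidence interval half-width is at most `half_width`."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return math.ceil((z_crit * sigma / half_width) ** 2)


# Illustrative numbers only: shift of 0.5, standard deviation of 1.0
n_required = max(n_for_power(0.5, 1.0), n_for_ci_half_width(0.5, 1.0))
print(n_required)
```

Taking the maximum of the two sample sizes satisfies both the power requirement and the confidence interval constraint at once.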


Contact ljing918@gmail.com if you have any questions.