Rob Barton: Cluster Analysis and Special Probability Distributions

Antonenko, P., Toy, S., & Niederhauser, D. (2012). Using cluster analysis for data mining in educational technology research. Educational Technology Research and Development, 60(3), 383-398.

Server log data from online learning environments can be analyzed to examine student behaviors, in terms of pages visited, length of time on a page, order of links clicked, and so on. This analysis is less cognitively taxing to the student than think aloud techniques and to the researcher since there is no coding of behaviors involved. Cluster analysis groups cases such that they are very similar within the cluster and dissimilar to other cases outside the cluster across target variables. It is related to factor analysis, where regression models are created based on a set of variables across cases, but in cluster analysis, cases are then grouped. Proximity indices (squared Euclidean distance or sum of the squared differences across variables) are calculated for every pair of cases. Squaring makes them all positive and accentuates the outliers. Various clustering algorithms are available to then group similar cases. Ward’s is a hierarchical clustering technique that combines cases one at a time from n clusters to 1 cluster and determines which minimizes the standard error, and is used when there is no preconceived idea about the likely number of clusters. Using k-means clustering, a non-hierarchical techniques, an empirical rationale for a predetermined number of clusters is tested. It may also be used when there is a large sample size in order to increase efficiency; if no empirical basis exists, the model is run on 3, 4, and 5 clusters. The method calculates k centroids and associates cases with the closest centroid, repeating until the standard error is minimized by allowing cases to move to a different centroid. It may also be possible to use two different kinds of techniques, for example, a Ward’s cluster analysis on a small sample followed by a k-means cluster analysis based on the findings from Ward’s. After determining the clusters, the characteristics of each cluster should be compared to ensure there is a meaningful difference among them and that there is a meaningful difference in the outcome based on their behaviors, since cluster analysis can find structures in data where none exists. ANOVA may then be used to determine for each cluster how much each variable contributes to variation in the dependent variable. It may be useful to use more than one technique and compare or average them, as different techniques may result in a variation in the results.

Bain, L.J. & Englehardt, M. (1991). Special probability distributions. In Introduction to probability and mathematical statistics (2nd Edition). Belmont, CA: Duxberry Press.

A Bernoulli trial has two discrete outcomes whose probabilities add up to 1. A series of independent Bernoulli trials forms a Binomial distribution, where the number of successes (or failures) are determined for n trials. A Hypergeometric distribution occurs when n samples are taken from a population of N+M without replacement. It can be useful for testing a batch of manufactured products for defects in order to accept or reject the batch. The Geometric Binomial distribution determines the minimum number of Bernoulli trials that must occur to achieve a success. The Negative Binomial distribution determines the minimum number of Bernoulli trials that must occur to achieve n successes. The Poisson distribution describes the probability of n independent successes occurring over a certain number of trials. The discrete uniform distribution allows for n possible values, each with equal probability of occurrence.

Blau, B.M., Brough, T.J., & Thomas, D.W. (2013). Corporate lobbying, political connections, and the bailout of banks. Unpublished manuscript, Department of Finance and Economics, Utah State University, Logan, UT.

When measuring a dependent variable with discrete values, an appropriate count regression framework must be used. Poisson, negative binomial, and OLS are possible models to use. Poisson regression uses a distribution where the mean is equal to its variance. If the distribution is over-dispersed or significantly greater than 0, Poisson will not work. No discussion of when negative binomial or OLS work.

Collins, L.M. & Lanza, S.T. (2010). Latent class and latent transition analysis for the social, behavioral, and health sciences. New York: Wiley. Latent variables are unobserved but predicted by the observation of multiple observed variables. The latent variable is presumed to cause the observed indicator variables. Different models are used, depending on whether the observed and latent variables are discrete or continuous. Using a discrete latent variable helps organize complex arrays of categorical data. A given construct may be measured using either continuous or discrete variables, so the method used when there is a choice should be based on which best helps address the research questions. When cases are placed into classes, the classes are named by the researcher based on their similar characteristics.

Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.

Grouping or clustering is a useful tool for distinguishing sets of cases based either on prior theory of what the groups should entail or with no initial structure in mind. Combining the groups has a goal of minimizing the variance or error sum of squares. For some small cases, a visual inspection of data may allow the researcher to come up with the clusters. In large data sets with evenly dispersed data, this is difficult or impossible.

Francis, B. (2010). Latent class analysis methods and software. Presented at 4th Economic and Social Research Council Research Methods Festival, 5 - 8 July 2010, St. Catherine’s College, Oxford, UK.

Latent class cluster analysis assigns cases to groups based on statistical likelihood; they do not have to be assigned to discrete classes. K-means clustering is problematic, since the number of groups has to be specified a priori, cases are assigned to unique clusters, and only allows continuous data.

Gardner, W., Mulvey, E.P., & Shaw, E.C. (1995). Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118(3).

Researchers often use suboptimal strategies when analyzing count data, such as artificially breaking down counts into categories of 5 or 10, but this loses data and statistical power. Another ineffective strategy is to use regular linear regression or OLS. Using OLS, illogical values, such as negatives will be predicted, and the model’s variance of values around the mean is not likely to fit well. Another problem with OLS is heteroscedastic error terms, where larger values will have larger variances and smaller values small variances. Nonlinear models that allow for only positive values and describe likely dispersion about the mean must be used. Poisson places restrictive assumptions on the size of the variance. The Overdispersed Poisson model corrects for the large variances that are common. The negative binomial is another option. In the regular Poisson model, truncated extreme tail values could lead to underdispersion and a large number of high values could lead to overdispersion. An overdispersion parameter is calculated by dividing Pearson’s chi-squared by the degrees of freedom and then the overdisperson parameter is multiplied by the mean. The negative binomial model includes a random component that accounts for individual variances. The negative binomial model allows one to estimate the probability distribution, where the overdispersed Poisson does not.

Osgood, D.W. (2000). Poisson-based regression analysis of aggregate crime rates. Journal of Quantitative Criminology 16(1).

The normal approach to analyze per capita rates of occurrence is to use the OLS model. However, OLS does not provide an effective model when recording a small number of events. For large populations, OLS may work, but for a small number of events in a small population, the results is an overestimated rate of occurrence. Often small counts will be skewed with a floor of 0. The Poisson model corrects for many of these issues with OLS; however, the unlikely assumption of the Poisson’s mean equaling the variance must hold. Due to individual variations and correlation between observed values and variance, overdispersion is common. Adjusting the standard errors and thus t-test results for the overdispersion helps correct the model. The negative binomial model combines the Poisson distribution with a gamma distribution that accounts for unexplained variation.

Romesburg, H.C. (1990). Cluster Analysis for Researchers. Malabar, FL: Robert E. Krieger Publishing Company.

The steps in doing cluster analysis begin with creating the data matrix, including objects and their attributes. The objective is to determine which objects are most similar based on those attributes. An optional step is to standardize the data matrix. A resemblance matrix is then calculated, showing for each pair of objects a similarity coefficient, such as the Euclidean distance. Based on the similarity coefficient, a tree is created by combining similar objects and comparing their average to the other existing objects. Then rearrange objects in the data matrix to show the closest objects next to each other.

Velasquez, N.F., Sabherwal, R., & Durcikova, A. (2011). Adoption of an electronic knowledge repository: A feature-based approach. Presented at 44th Hawaii International Conference on System Sciences, 4-7 January 2011, Kauai, HI.

This article discusses the types of use for knowledge base users. It utilizes a cluster analysis to come up with three types of users. Clustering methods compared were Ward’s, between-groups linkage, within-groups linkage, centroid clustering, and median clustering and the one with the best fit was used.

Wang, W. & Famoye, F. (1997). Modeling household fertility decisions with generalized Poisson regression. Journal of Population Economics 10. Poisson and negative binomial models account for non-negative counts of discrete occurences. The Poisson model requires that the mean and variance of the dependent variable are equal, which is rarely true. This leads to a consistent model but invalid standard errors. The negative binomial model handles counts with overdispersion. When underdispersion is present, a generalized Poisson regression model may be used. Generalized Poisson handles both overdispersion and underdispersion.

Ward, J. H., Jr. (1963), Hierarchical Grouping to Optimize an Objective Function, Journal of the American Statistical Association, 48, 236–244.

Ward describes a clustering technique that allows for grouping with respect to many variables in such a way that minimizes the loss in each group. Traditional statistics would take a group of numbers, find the mean, and then calculate the error sum of squares for all cases and the one mean. By grouping, the ESS will be minimized as they are compared to the group means. The appropriate number of groups can be determined in the grouping process rather than needing to specify it in advance.

Rob Barton

Thursday, December 4, 2014

Cluster Analysis and Special Probability Distributions - An Annotated Bibliography

No comments: