
Thursday, December 4, 2014

Cluster Analysis and Special Probability Distributions - An Annotated Bibliography

Antonenko, P., Toy, S., & Niederhauser, D. (2012). Using cluster analysis for data mining in educational technology research. Educational Technology Research and Development, 60(3), 383-398.

Server log data from online learning environments can be analyzed to examine student behaviors: pages visited, time spent on a page, order of links clicked, and so on. This analysis is less cognitively taxing for the student than think-aloud techniques, and less taxing for the researcher, since no coding of behaviors is involved. Cluster analysis groups cases so that they are very similar to other cases within the cluster and dissimilar to cases outside it across the target variables. It is related to factor analysis, where regression models are created based on a set of variables across cases, but in cluster analysis the cases themselves are then grouped. Proximity indices (squared Euclidean distances, i.e., the sum of the squared differences across variables) are calculated for every pair of cases; squaring makes them all positive and accentuates outliers. Various clustering algorithms are then available to group similar cases. Ward’s is a hierarchical clustering technique that merges cases one step at a time from n clusters down to 1 cluster, choosing at each step the merge that minimizes the increase in the error sum of squares; it is used when there is no preconceived idea about the likely number of clusters. K-means clustering, a non-hierarchical technique, tests an empirical rationale for a predetermined number of clusters. It may also be used to increase efficiency when the sample size is large; if no empirical basis exists, the model is run on 3, 4, and 5 clusters. The method calculates k centroids and associates each case with its closest centroid, repeating, and allowing cases to move to a different centroid, until the error is minimized. The two kinds of techniques can also be combined: for example, a Ward’s cluster analysis on a small sample followed by a k-means cluster analysis based on the findings from Ward’s.
After determining the clusters, the characteristics of each cluster should be compared to ensure there is a meaningful difference among them, and that there is a meaningful difference in the outcome based on their behaviors, since cluster analysis can find structure in data where none exists. ANOVA may then be used to determine, for each cluster, how much each variable contributes to variation in the dependent variable. It may be useful to apply more than one technique and compare or average the results, as different techniques may produce different clusterings.
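As a concrete illustration of the k-means procedure described in this annotation (assign each case to the nearest of k centroids, recompute the centroids, and repeat until assignments stabilize), here is a minimal Python sketch; the data and initial centroids are invented for demonstration, not taken from the article:

```python
# Minimal k-means sketch: assign each case to its nearest centroid, recompute
# centroids as cluster means, and repeat until the assignments stabilize.
def kmeans(cases, centroids):
    while True:
        # Assign each case the index of its closest centroid, using the
        # squared Euclidean distance (the proximity index described above).
        labels = [min(range(len(centroids)),
                      key=lambda k: sum((x - c) ** 2
                                        for x, c in zip(case, centroids[k])))
                  for case in cases]
        # Recompute each centroid as the mean of its assigned cases.
        new_centroids = []
        for k in range(len(centroids)):
            members = [case for case, lab in zip(cases, labels) if lab == k]
            if members:
                new_centroids.append(tuple(sum(col) / len(members)
                                           for col in zip(*members)))
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:              # assignments have stabilized
            return labels, centroids
        centroids = new_centroids

# Two obvious groups of 2-D cases; data and initial centroids are made up.
cases = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8.5, 7.5)]
labels, centroids = kmeans(cases, [(0.0, 0.0), (10.0, 10.0)])
```

Real implementations add multiple random restarts and a convergence tolerance, but the loop above is the core of the method.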

Bain, L.J. & Engelhardt, M. (1991). Special probability distributions. In Introduction to probability and mathematical statistics (2nd ed.). Belmont, CA: Duxbury Press.

A Bernoulli trial has two discrete outcomes whose probabilities add up to 1. A series of independent Bernoulli trials forms a Binomial distribution, where the number of successes (or failures) is counted over n trials. A Hypergeometric distribution occurs when n samples are taken from a population of N+M without replacement; it can be useful for testing a batch of manufactured products for defects in order to accept or reject the batch. The Geometric distribution gives the number of Bernoulli trials needed to achieve the first success. The Negative Binomial distribution gives the number of Bernoulli trials needed to achieve a fixed number of successes. The Poisson distribution describes the probability of a given number of independent events occurring over a fixed interval, given a constant average rate. The discrete uniform distribution allows for n possible values, each with equal probability of occurrence.
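As a concrete supplement (not from the chapter itself), the probability mass functions of several of these distributions can be computed directly in Python with the standard library:

```python
import math

# Binomial: probability of x successes in n Bernoulli trials with success prob p.
def binomial_pmf(x, n, p):
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

# Geometric: probability the first success occurs on trial x (x = 1, 2, ...).
def geometric_pmf(x, p):
    return (1 - p) ** (x - 1) * p

# Poisson: probability of x events when the expected count over the interval is mu.
def poisson_pmf(x, mu):
    return math.exp(-mu) * mu ** x / math.factorial(x)

# Hypergeometric: x successes in n draws, without replacement, from a
# population of N items containing K successes.
def hypergeom_pmf(x, N, K, n):
    return math.comb(K, x) * math.comb(N - K, n - x) / math.comb(N, n)

print(binomial_pmf(2, 10, 0.5))  # probability of exactly 2 heads in 10 fair flips
```

For example, `hypergeom_pmf(1, 10, 4, 3)` gives the chance of finding exactly one defective item when drawing 3 from a batch of 10 that contains 4 defects, which matches the acceptance-sampling use described above.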

Blau, B.M., Brough, T.J., & Thomas, D.W. (2013). Corporate lobbying, political connections, and the bailout of banks. Unpublished manuscript, Department of Finance and Economics, Utah State University, Logan, UT.

When measuring a dependent variable with discrete values, an appropriate count regression framework must be used; Poisson, negative binomial, and OLS are possible models. Poisson regression assumes a distribution whose mean equals its variance, so if the data are overdispersed (the variance significantly exceeds the mean), Poisson will not work. The paper does not discuss when the negative binomial or OLS models are appropriate.

Collins, L.M. & Lanza, S.T. (2010). Latent class and latent transition analysis for the social, behavioral, and health sciences. New York: Wiley.

Latent variables are unobserved but predicted by the observation of multiple observed variables. The latent variable is presumed to cause the observed indicator variables. Different models are used, depending on whether the observed and latent variables are discrete or continuous. Using a discrete latent variable helps organize complex arrays of categorical data. A given construct may be measured using either continuous or discrete variables, so when there is a choice, the method used should be based on which best helps address the research questions. When cases are placed into classes, the classes are named by the researcher based on their shared characteristics.

Fisher, W.D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53, 789-798.

Grouping or clustering is a useful tool for distinguishing sets of cases, based either on a prior theory of what the groups should entail or with no initial structure in mind. Groups are combined with the goal of minimizing the variance, or error sum of squares. In some small cases, a visual inspection of the data may allow the researcher to identify the clusters; in large data sets with evenly dispersed data, this is difficult or impossible.

Francis, B. (2010). Latent class analysis methods and software. Presented at 4th Economic and Social Research Council Research Methods Festival, 5 - 8 July 2010, St. Catherine’s College, Oxford, UK.

Latent class cluster analysis assigns cases to groups based on statistical likelihood; cases do not have to be assigned to discrete classes. K-means clustering is problematic by comparison, since the number of groups has to be specified a priori, cases are assigned to unique clusters, and only continuous data are allowed.

Gardner, W., Mulvey, E.P., & Shaw, E.C. (1995). Regression analyses of counts and rates: Poisson, overdispersed Poisson, and negative binomial models. Psychological Bulletin 118(3).

Researchers often use suboptimal strategies when analyzing count data, such as artificially breaking counts into categories of 5 or 10, but this loses data and statistical power. Another ineffective strategy is to use ordinary linear regression (OLS). OLS will predict illogical values, such as negative counts, and the model’s variance around the mean is not likely to fit well. Another problem with OLS is heteroscedastic error terms: larger values will have larger variances and smaller values smaller variances. Nonlinear models that allow only positive values and describe the likely dispersion about the mean must be used. Poisson places restrictive assumptions on the size of the variance. The overdispersed Poisson model corrects for the large variances that are common; the negative binomial is another option. In the regular Poisson model, truncated extreme tail values could lead to underdispersion, and a large number of high values could lead to overdispersion. An overdispersion parameter is calculated by dividing Pearson’s chi-squared by the degrees of freedom; the variance is then modeled as the overdispersion parameter multiplied by the mean. The negative binomial model instead includes a random component that accounts for individual variances. The negative binomial model allows one to estimate the probability distribution, where the overdispersed Poisson does not.
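The overdispersion check described in this annotation can be sketched in a few lines of Python. The observed counts and fitted Poisson means below are made up for illustration; only the calculation itself (Pearson chi-squared over residual degrees of freedom) follows the article:

```python
# Overdispersion parameter for a Poisson model: Pearson chi-squared divided
# by the residual degrees of freedom. Values well above 1 signal overdispersion.
def overdispersion(observed, fitted, n_params):
    # Pearson chi-squared: sum of (observed - fitted)^2 / fitted, since under
    # the Poisson assumption the variance of each count equals its fitted mean.
    chi2 = sum((o - f) ** 2 / f for o, f in zip(observed, fitted))
    df = len(observed) - n_params  # residual degrees of freedom
    return chi2 / df

# Hypothetical counts and fitted means from a two-parameter Poisson model.
observed = [0, 2, 1, 7, 4, 12, 3, 9]
fitted = [1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 4.5, 6.0]
phi = overdispersion(observed, fitted, n_params=2)

# In the overdispersed Poisson model, each count's variance is then modeled
# as phi times its mean rather than as the mean itself.
```

Here `phi` comes out well above 1, so the variance exceeds what plain Poisson allows, which is exactly the situation where the overdispersed Poisson or negative binomial models are recommended.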

Osgood, D.W. (2000). Poisson-based regression analysis of aggregate crime rates. Journal of Quantitative Criminology 16(1).

The usual approach to analyzing per capita rates of occurrence is to use the OLS model. However, OLS does not provide an effective model when recording a small number of events. For large populations OLS may work, but for a small number of events in a small population, the result is an overestimated rate of occurrence. Small counts will often be skewed, with a floor of 0. The Poisson model corrects for many of these issues with OLS; however, the unlikely assumption that the Poisson’s mean equals its variance must hold. Due to individual variations and correlation between observed values and variance, overdispersion is common. Adjusting the standard errors, and thus the t-test results, for the overdispersion helps correct the model. The negative binomial model combines the Poisson distribution with a gamma distribution that accounts for unexplained variation.

Romesburg, H.C. (1990). Cluster Analysis for Researchers. Malabar, FL: Robert E. Krieger Publishing Company.

The steps in doing cluster analysis begin with creating the data matrix, including objects and their attributes. The objective is to determine which objects are most similar based on those attributes. An optional step is to standardize the data matrix. A resemblance matrix is then calculated, showing for each pair of objects a similarity coefficient, such as the Euclidean distance. Based on the similarity coefficients, a tree is created by combining similar objects and comparing their average to the other existing objects. Finally, the objects in the data matrix are rearranged so that the closest objects appear next to each other.
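The first steps above (data matrix, optional standardization, resemblance matrix of Euclidean distances) might be sketched as follows; the objects and attribute values are invented for illustration:

```python
import math

# Data matrix: rows are objects, columns are attributes (invented values).
data = {
    "A": [1.0, 2.0],
    "B": [1.5, 1.8],
    "C": [8.0, 8.0],
}

# Optional step: standardize each attribute to mean 0, standard deviation 1.
def standardize(matrix):
    cols = list(zip(*matrix.values()))
    means = [sum(col) / len(col) for col in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(cols, means)]
    return {name: [(v - m) / s for v, m, s in zip(row, means, stds)]
            for name, row in matrix.items()}

# Resemblance matrix: Euclidean distance for every pair of objects.
def resemblance(matrix):
    names = list(matrix)
    return {(a, b): math.sqrt(sum((x - y) ** 2
                                  for x, y in zip(matrix[a], matrix[b])))
            for i, a in enumerate(names) for b in names[i + 1:]}

std_data = standardize(data)  # each attribute now has mean 0 across objects
dist = resemblance(data)
# A and B are far more similar to each other than either is to C,
# so a tree-building step would join them first.
```

From a resemblance matrix like `dist`, the tree is built by repeatedly merging the pair with the smallest distance, which is the step the annotation describes next.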

Velasquez, N.F., Sabherwal, R., & Durcikova, A. (2011). Adoption of an electronic knowledge repository: A feature-based approach. Presented at 44th Hawaii International Conference on System Sciences, 4-7 January 2011, Kauai, HI.

This article discusses the types of use among knowledge base users, applying a cluster analysis to identify three types of users. The clustering methods compared were Ward’s, between-groups linkage, within-groups linkage, centroid clustering, and median clustering; the one with the best fit was used.

Wang, W. & Famoye, F. (1997). Modeling household fertility decisions with generalized Poisson regression. Journal of Population Economics 10.

Poisson and negative binomial models account for non-negative counts of discrete occurrences. The Poisson model requires that the mean and variance of the dependent variable be equal, which is rarely true; this leads to a consistent model but invalid standard errors. The negative binomial model handles counts with overdispersion. When underdispersion is present, a generalized Poisson regression model may be used, as it handles both overdispersion and underdispersion.

Ward, J.H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.

Ward describes a clustering technique that allows grouping with respect to many variables in a way that minimizes the loss within each group. Traditional statistics would take a group of numbers, find the mean, and then calculate the error sum of squares (ESS) of all cases around that one mean. By grouping, the ESS is reduced, since cases are compared to their own group means. The appropriate number of groups can be determined during the grouping process rather than needing to be specified in advance.
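The error-sum-of-squares idea can be made concrete with made-up numbers: compare the ESS of all cases around a single grand mean against the ESS around separate group means:

```python
# Error sum of squares of a set of values around their own mean.
def ess(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

values = [2, 3, 4, 20, 21, 22]  # two obvious groups (invented data)

one_group = ess(values)                          # ESS around the grand mean
two_groups = ess(values[:3]) + ess(values[3:])   # ESS around group means

# Grouping drastically reduces the ESS. Ward's method chooses, at each merge
# step, the pair of clusters whose union increases the total ESS the least.
```

With these numbers, the single-group ESS is 490 while the two-group ESS is only 4, which is why comparing cases to their group means, as the annotation describes, captures so much more of the structure.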

Tuesday, November 22, 2011

Meta Blogging

In September, I started a series of blog posts on college courses that I've taken. I wrote a post every day and got through about four or five semesters' worth, depending on how you count it (one year was quarters, plus some AP classes from high school). I haven't stopped but have slowed way down. Between not wanting to spam my (3) loyal readers and not having the time to keep up a post a day, I couldn't maintain that kind of production.

I don't know how the NaNoWriMo people do it. Well, I do know that of the hundreds of thousands of people who sign up, the average number of words written per person is just under 15,000. They're supposed to write 50,000. About one in five finish, which means about two in three write nothing.

I did write a post for every day that one month, which is something I wanted to try, and I have kept up my streak of at least one post a month for the past four years. Interestingly enough, I pulled in a little over 15,000 words in September, which means I beat a lot of NaNoWriMo people.

Something else I have let slip is my RSS reader. I had close to a thousand unread posts in there from all around the web. I unsubscribed from the feeds for a MOOC I stopped participating in, which dropped a few hundred unread posts off. I marked the posts from the You Are Not a Photographer blog as read, because everything is in there twice, and they keep doing weird things with their feed that make old stuff I've already read show up as unread again. I may end up just unsubscribing, since it looks like they stopped including the picture in their RSS feed, so you have to actually visit the site to make fun of how bad people are at photography. I've been considering unsubscribing from the Freakonomics blog for a while now, but every once in a while a post comes along that makes it all worth it.

I finally unsubscribed from Larry Ferlazzo's blog. I'm sorry, Larry. I tried to keep up. I really did. I subscribed when I found several interesting posts related to Bloom's Taxonomy, which I was reading about at the time. Seven posts a day is too much for me, especially if I get a couple days behind. To give you an idea of the volume here, he has well over 500 "most popular" posts. I have no idea how many unpopular posts he has. I was going to maybe suggest that he try Twitter, since his blog posts are mainly lists of interesting sites related to teaching a given topic, and Twitter is great at sending out links to people. Of course, I should have known; he's got more than 30,000 tweets. That means that over 3 years on Twitter, he averages 27 tweets a day. Given an average of about 12 words per tweet, that's almost 10,000 words per month, so he's not far off the NaNoWriMo average and doing better than most would-be authors just on Twitter.

One thing I do need to do is go back and fix the pictures in my September posts. I didn't add a picture to every post, but for the ones that I did, I got lazy. I just randomly googled images and grabbed stuff wherever I found it. Normally I use photos licensed with Creative Commons on Flickr. When I use their photo, I will link back to their Flickr stream and leave a comment on the photo I used with my thanks for their sharing and a link to the post where I used it. The bus up there is just a random openly licensed photo I found that kind of popped out at me. Thanks for sharing but not sharing too much.

photo by didbygraham

Thursday, December 23, 2010

We can do this the easy way or the hard way

There's an obscure change that was made in Office 2007, which I only noticed because of a chance experience. They changed the terminology from labeling the X and Y axes of a chart to labeling the horizontal and vertical axes. What's the difference, you might ask? The horizontal axis is the X axis, so who cares?

I'll get to that, but first how I even became aware of this issue. I happened to be helping a student who wanted help getting ready for a retake of a test on Excel. She missed part of the chart, because she mixed up the X and Y axis labels. As we were looking at what she had done, it was very obvious that the labels did not match. I don't remember the exact topic, but it would have been analogous to having a label that said "States" next to the axis with a range of numbers and a label that said "Population" next to the axis with the names of several states. You look at it and have to wonder if something is messed up, but then again, I have written before about how students will consciously choose to answer a question incorrectly with the idea in their heads that our tests were constructed by idiots and therefore the wrong answer will likely be scored as correct.

I showed her how the labels obviously didn't match, and I opened her spreadsheet file, and showed her how the box labeled X-axis had the text she was supposed to put in the Y-axis box and vice versa. The problem? I have to admit there was some logic to her decision to switch them, albeit based on a possible problem in our educational system, which is where I'm headed with this.


Among the various chart types in Excel are the bar chart and the column chart. I don't want to get into the difference between the types of charts, where you'd use a histogram vs a bar chart vs a line chart, etc. Perhaps another post. Suffice it to say that Excel doesn't really do a histogram without a lot of work on your part, and it's beyond the scope of this post.

So a column chart and a bar chart in Excel are actually both bar charts, with Excel's bar chart rotated 90 degrees. What ends up as the vertical axis, since it is rotated, is actually the X-axis. The reason it is the X-axis is that it is the independent variable; the dependent variable is the Y. I still remember in middle school missing a quiz question, because I hadn't read the chapter for that day and had to guess whether it was the X or Y that was vertical and horizontal. It turns out that, while convention does generally put the X horizontally, it doesn't have to be that way. There is a greater law. Unfortunately, we are taught the simplistic version of the law. If we were to take the advice of some and teach more statistics rather than calculus in school, perhaps there would be some importance to knowing the difference between a dependent and independent variable, and thus we might be taught the greater law.

So, what the girl had done based on this "fact" that had been so ingrained in her throughout years of math classes was specifically decide to put the labels in the wrong boxes just so the X label would be on the horizontal axis, even if that meant having the X label in the properties box labeled Y and next to data that didn't make sense. After mistakes like this and others by a multitude of students, I started putting notes like "if something looks wrong, it probably is" on most test versions that I would write.

Apparently, Microsoft must have gotten some feedback from other people getting confused, and so rather than leave it as technically correct but difficult to understand, they punted. They just changed the labels to be called the horizontal and vertical axis labels in Office 2007. Now there is no question. And the three people per year that had a problem with this now don't learn anything, because it never comes up. In case you're wondering, OpenOffice still labels the vertical axis on the bar chart as the X-axis, because that's what it is.

So at what point do we switch from teaching the easy rule to teaching the more complicated but correct rule? Is there ever a reason to teach the easy rule? Wouldn't we perhaps see fewer line charts that should actually be histograms, etc. if we taught people assuming they were capable of understanding an advanced concept? Is this an advanced concept?

Wednesday, December 8, 2010

TMI

The acronym TMI is often used as a way of expressing that someone just shared too much information with you, generally something embarrassing or private, or something the listener just doesn't care about. I believe some of this comes from an overload of information always flowing around us through computers and mobile devices, so we lose the ability to filter out extraneous or private information from that which should be communicated.

Given the large amount of information put out there, since people do seem to just braindump it all onto various social media sites (or vetted news sites) in a way that's easily accessible by others, those who learn to actually mine the vast data fields will do very well for themselves in our information-based society.

While some people see these vast data fields as a wasteland, like the Abominable Snowman in Monsters, Inc., I would say, "I think you mean wonderland!" We should be able to take the plethora of information created by others and turn it into something useful. There should be no such thing as TMI, because the more data out there, the better we can harness it for the good of ourselves or others.

I'm occasionally made fun of for looking up information on my phone or a computer. Someone asks a question or wants a clarification about a statement made by someone else, and everyone just sits there thinking, yeah, someone should find that out. Then the conversation goes a different direction and everyone forgets there was something they wanted to know. So I look it up before forgetting what the question was. And I get strange looks for providing the answer. I'm the smarty pants because I googled it, when half the room could have also pulled out their iPhone or Droid and looked it up themselves. They're using their phones to text, so it's not like they're put away to be polite to the present company.

Isn't that why we have smartphones? I mean, it's cool that you can use your iPhone as a digital rattle to keep your kids occupied, but isn't access to Wikipedia, Google, Yahoo Answers, Twitter, etc. the best reason to have a smartphone? Nielsen claims that 25% of smartphone users don't even access any data on their phones and that 6% of users consume over half the mobile phone data. The rest of us are on that logarithmic continuum somewhere.

What other information is out there that we aren't using? If you have a blog (or any website for that matter), do you use Google Analytics to see who is coming to your site, from where, and for what? How long do they stay, what do they do while visiting, and do they actually find what they were looking for? Is that just another type of TMI that you don't care about?

Over the past 2-3 years, my two most consistently popular posts are on Pedagogy vs Andragogy and Surf the Channel. Coming in a distant third place is a post related to Cognitive Load Theory.

It's interesting, since I spent a lot more time writing the cognitive load post than I did the other two. Maybe the third place post was too long or too academic or too focused on the specific situation in which I was using it. Something I have been able to figure out, though, is that if you can find something that is an interesting or upcoming topic that not many people are blogging about, you'll get a lot of hits. That may seem obvious, but I only figured it out because I had data that told me. If you think I should have known that already, guess which of your blog posts are the most popular and then turn on Google Analytics and tell me how close you were after a month or two of collecting data.

I've been thinking for awhile now that I need to write another post on Andragogy, because I'm afraid that the one that several hundred people a month find just wasn't all that well written or informative. It was just a quick recap of an experience I had and a few comments on some of the basics of that area. I know a lot more about the topic now and knowing people are looking for that information and having a hard time finding it, I feel it my duty to help others make some sense of it.

They could find academic articles on it or take a class on the subject. They could go to conferences and talk to people using techniques based on these principles. But they don't. At least they're googling it, which, as I already mentioned, many don't even have the motivation to do. So if that many people come asking the question, how many more are out there who don't even ask, because they can't be bothered? There's probably not much I can do to reach out to them, since I can't force my blog down everyone's throats. I should at least try to answer the question that people who do find my site are asking.

So what does Google Analytics tell you? And what are you going to do about it?