Thursday, April 30, 2009

Collaborative Filtering and Web 2.0 Technologies

Filtering Methods

Collaborative filtering systems connect a person's needs with content based on ratings by others with similar interests and needs. Depending on the system, filtering may be based on human or machine analysis of content or a hybrid approach (Herlocker, Konstan, & Riedl, 2000). An example of the hybrid approach is Google Images, which uses a machine analysis of file names and text content in the page around images in combination with its Image Labeler. The Image Labeler is a game of sorts where users accumulate points for matching key words with a randomly selected partner, with more points awarded for providing more specific terms. The two sets of data are combined and then used in searching (Google, 2009).
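To make the core idea concrete, here is a minimal sketch of user-based collaborative filtering: predict how one person would rate an item from the ratings of people whose past ratings resemble theirs. The data, the names (`ratings`, `cosine_similarity`, `predict`), and the choice of cosine similarity are my own assumptions for illustration; none of the systems cited above are known to work exactly this way.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}
ratings = {
    "ann": {"a": 5, "b": 4, "c": 1},
    "bob": {"a": 5, "b": 5, "d": 4},
    "cat": {"a": 1, "c": 5, "d": 2},
}

def cosine_similarity(r1, r2):
    """Cosine similarity over the items both users rated (0.0 if none)."""
    shared = set(r1) & set(r2)
    if not shared:
        return 0.0
    dot = sum(r1[i] * r2[i] for i in shared)
    norm1 = sqrt(sum(r1[i] ** 2 for i in shared))
    norm2 = sqrt(sum(r2[i] ** 2 for i in shared))
    return dot / (norm1 * norm2)

def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = cosine_similarity(ratings[user], theirs)
        num += sim * theirs[item]
        den += sim
    # None means nobody comparable has rated the item yet
    return num / den if den else None

print(predict("ann", "d", ratings))
```

In this toy data, ann's tastes track bob's much more closely than cat's, so the prediction for item "d" lands nearer to bob's rating of 4 than to cat's rating of 2.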

Doctorow (2001) claims that observational metadata by a machine is more reliable than that created by humans, listing several obstacles to dependable human-created metadata including people's inability to fully report their own behavior and the ambiguities and non-neutral nature of many measurement and reporting techniques. Avery & Zeckhauser (1997) suggest that some incentive to evaluate content is necessary to avoid issues where the majority of users wait for others to evaluate content for them.

Some of the problems with the lack of human-created metadata may be due to the types of tools available to catalog resources. As I've mentioned before, complex metadata standards like LOM were designed by engineers and simply take too much time to implement; with the advent of many Web 2.0 tools, however, there is an abundance of tagged resources and RSS feeds that easily work together. Still, an abundance of tags does not necessarily solve problems without causing new ones. Tagging with a common or ambiguous word may cause unrelated content to be displayed together, and spammers may mark their garbage so that it displays alongside legitimate content (Walker, 2005). A closed community might help keep these ambiguities under control, but restrictions would likely lead to lower participation.


Communities use social interaction to combine existing knowledge with new knowledge to meet their needs. One piece of content may mean different things, depending on the context in which it is used (Burnett, Dickey, Kazmer, & Chudoba, 2003). The question is how to make open tools such as Twitter and Flickr work to facilitate individual communities without blending them all together or limiting access. It may be ideal to build or expand collaborative filtering capabilities that work in conjunction with manual tagging and machine analysis of content. To be successful, such a collaborative filter should filter out irrelevant information and provide a means for community members to access relevant information at the appropriate time, based on the behavior of others in the community (Walker, 2002).

De Souza & Preece (2004) point out two components by which an online community can be assessed: sociability and usability. The sociability component applies to any community, whether online or offline, and includes the people, purposes, and policies involved. The usability component focuses on the technical and HCI issues of the software used. In their framework, these two components have to be aligned to produce success. Web 2.0 tools do well in terms of usability, judging by the large numbers of people blogging, tagging, editing wikis, and otherwise collaborating. In terms of sociability, there is still work to do. It is easy to set up whitelists of content producers or tags once you know about them, but finding that content to begin with is difficult.

Web 2.0 Technologies

Walker (2005) lists Flickr tags that are related to the tag "bush" including: protest, election, politics, kerry, president, graffiti, snow, war, vote, iraq, tree, winter, cameraphone, cheney, and antibush. These associations among terms are then described as "sheep paths in the mountains" that have just formed over time, with no systematic approach. Over the past few years, clusters and pools of related content have made it a little easier to find what one is looking for. Now when searching for that same term, instead of just listing a few related tags, Flickr will prompt the user to see the clusters of related tags such as bush/green/nature/tree or bush/protest/war/iraq. These clusters help bring the sociability level of Flickr up towards its usability level, which has been high for a while now. However, they are still based on manual tagging of content.
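One way such clusters could be derived from manual tags is by looking at which tags appear together on the same photos. The sketch below is only an illustration under my own assumptions: the photo data and the function name `tag_clusters` are hypothetical, and grouping by connected co-occurrence is a simple stand-in, not a description of how Flickr actually builds its clusters.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tagged photos; the tag names come from the examples above.
photos = [
    {"bush", "tree", "nature", "green"},
    {"bush", "tree", "snow", "winter"},
    {"bush", "protest", "war", "iraq"},
    {"bush", "election", "kerry", "vote"},
    {"bush", "protest", "election"},
]

def tag_clusters(target, photos):
    """Group tags that co-occur with `target` into connected clusters."""
    neighbors = defaultdict(set)
    for tags in photos:
        if target not in tags:
            continue
        others = tags - {target}
        for t in others:
            neighbors[t]  # register the tag even if it has no partner
        for a, b in combinations(sorted(others), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    clusters, seen = [], set()
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over tags linked by shared photos
        stack, cluster = [start], set()
        while stack:
            tag = stack.pop()
            if tag in cluster:
                continue
            cluster.add(tag)
            stack.extend(neighbors[tag] - cluster)
        seen |= cluster
        clusters.append(sorted(cluster))
    return clusters

for cluster in tag_clusters("bush", photos):
    print("bush/" + "/".join(cluster))
```

With this toy data the political tags and the nature tags fall into two separate groups, which is the kind of disambiguation the clusters provide for an ambiguous word.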

So one tool has begun to work on becoming a little bit more community-friendly, but how are the rest doing? Digg does well at quickly floating news stories in and out of the spotlight, based on their popularity within certain categories, but it is done by manual voting and categorization. YouTube videos can be associated with channels, contests, groups, categories, and tags, in addition to being rated by viewers. Videos can also be prioritized based on the number of overall views, but not by views of those similar to the user, which would be ideal. Wikipedia allows users to collaborate on documents and hold behind-the-page discussions before doing so, but in order to find a page that might be interesting to the user, a text-based search engine is used. WordPress and Blogspot seem to follow the same pattern as these other popular tools, using RSS, tagging, and linking, but not following a truly dynamic model that builds rules based on behavior and interests rather than cataloging by humans. Much of the human-generated data is good data, but it is simply not enough to narrow down the results by removing false positives. Combining it with observational data and machine-generated contextual data will help triangulate the most accurate results for each individual user. Twitter and the third-party tools built on its API may be the closest to success, with their ability to bring together both spontaneous and organized groups of people in real time for any given event.

Good Examples

For an example of non-Web 2.0 collaborative filtering, we can look to TiVo (Ali & van Stam, 2004). TiVo still depends on users giving the shows they watch a thumbs up or thumbs down rating, but it has a few additional features most current Web 2.0 tools do not. It recommends shows the user might like, based on other shows they have watched and rated, using correlated pairs of shows. It can also predict a "thumbs level" for unrated shows based on other characteristics.
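The idea of correlated pairs of shows can be sketched as item-to-item collaborative filtering: measure how similarly two shows are rated by the viewers who rated both, then predict a score for an unrated show from the viewer's own ratings of correlated shows. This is a rough sketch under my own assumptions, not the architecture Ali & van Stam describe; the viewers, shows, score scale, and function names are all hypothetical.

```python
from math import sqrt

# Hypothetical thumbs ratings: viewer -> {show: score}, negative = thumbs down
thumbs = {
    "v1": {"news": 3, "drama": 2, "cartoon": -2},
    "v2": {"news": 2, "drama": 3, "cartoon": -3},
    "v3": {"news": -1, "drama": -2, "cartoon": 3},
    "v4": {"news": 1, "cartoon": -1},
}

def show_correlation(a, b, thumbs):
    """Pearson correlation between two shows over viewers who rated both."""
    pairs = [(r[a], r[b]) for r in thumbs.values() if a in r and b in r]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict_thumbs(viewer, show, thumbs):
    """Correlation-weighted average of the viewer's own ratings."""
    num = den = 0.0
    for rated, score in thumbs[viewer].items():
        w = show_correlation(show, rated, thumbs)
        num += w * score
        den += abs(w)
    # None means no rated show is correlated with the target
    return num / den if den else None

print(predict_thumbs("v4", "drama", thumbs))
```

Here viewer v4 has never rated the drama, but liked the news and disliked the cartoon. Because the drama correlates positively with the news and negatively with the cartoon, both ratings push the prediction toward a thumbs up.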

For another example, Google tracks the searches and site visits of users who are logged in to Google while they surf. Users can view statistics on their surfing habits and receive recommendations from Google for searches, web pages, videos, and gadgets they might like, based on their searches.

What's Next?

So if Google and TiVo can utilize a combination of factors to pinpoint content that would be appropriate for a user's general searching or entertainment needs, how do we harness those algorithms to extend the widely available Web 2.0 tools so they are more effective in the classroom or in business environments? Setting up a closed system is an option, but as mentioned above, a more open system should encourage more participation. With several of the tools such as wikis and blogs designed for teamwork and collaboration, it seems that the most useful collaborative filters would be those that perform well with newly created, unrated content that is identified by RSS feeds and then quickly react to the actions of users.

I'm not really sure how all these pieces ultimately fit together, but I am interested in further study on the topic. As I have been reading about virtual communities and open content lately and using several of these Web 2.0 tools for various projects, I am drawn to the power that is given to the masses to create content and influence politics, education, and many more aspects of our lives that were not open before. Traditional newspapers have new competition. There are free alternatives to the content traditionally provided by textbook publishers.

With my background in business, I believe that a reasonable amount of competition can be a very good thing. Enabling teams to more efficiently communicate with each other prevents duplication of effort and miscommunications within the group, as well as allowing the group to meet synchronously or asynchronously as schedules allow. Collaborative filtering seems to be an important next step in enabling virtual communities to better utilize the resources currently available to them. The tools for generating new content within a well-known context seem to be well developed, but an essential component of successful teamwork is better organization and dissemination of content and culture that already exist in order to maintain order when certain dynamics of the group change.


Ali, K., & van Stam, W. (2004). TiVo: Making show recommendations using a distributed collaborative filtering architecture. Proceedings of the 2004 ACM Conference on Knowledge Discovery and Data Mining.

Avery, C., & Zeckhauser, R. (1997). Recommender systems for evaluating computer messages. Communications of the ACM, 40(3).

Burnett, G., Dickey, M. H., Kazmer, M. M., & Chudoba, K. M. (2003). Inscription and interpretation of text: A cultural hermeneutic examination of virtual community. Information Research, 9(4).

de Souza, C. S., & Preece, J. (2004). A framework for analyzing and understanding online communities. Interacting with Computers, 16(3), 579-610.

Doctorow, C. (2001). Metacrap: Putting the torch to seven straw-men of the meta-utopia. Retrieved from

Google (2009). Google Image Labeler. Retrieved from

Herlocker, J., Konstan, J., & Riedl, J. (2000). Explaining collaborative filtering recommendations. Proceedings of the 2000 ACM Conference on Computer Supported Cooperative Work.

Walker, A. (2002). An educational recommender system: New territory for collaborative filtering (Doctoral Dissertation, Utah State University).

Walker, J. (2005). Feral hypertext: When hypertext literature escapes control. Proceedings of the 2005 ACM Conference on Hypertext and Hypermedia, 46-53.
