smsociety14
Saturday, September 27 • 14:56 - 15:15
"Information-source extraction for inferring communities"


Background: As primary evidence of influence and impact (Garfield 1963), citation and hyperlink relations have been used to build social networks that provide insights into academic, professional, and online political communities (Newman 2004, Adamic 2005, Fowler 2007). In this paper we explore networks constructed by generalizing citation and hyperlink relations to include any informal reference to an entity as a source of information (songs, books, organizations, biblical verses), whether viewed as credible or not (for example, “How a Handful of Scientists Obscured the Truth”). We hypothesize that strong similarity of source citation patterns implies community affiliation.

Objective: In contrast to a citation network, which directly links authors or works, source extraction will in general produce a bipartite graph; the two types of nodes are sources (which may be any kind of information source) and group members (websites in our data).  We use the similarity matrix of the group members to construct a K-Nearest Neighbor (KNN) graph (positing a link between any two sites such that either is one of the K nearest neighbours of the other).  We show that in cases where direct citation links are known, the KNN graph can function as well as the primary citation graph in helping to infer communities within the group, and that the KNN graph can generate reasonable community proposals when the direct graph is sparse or unavailable.
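The KNN graph construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity over source counts is an assumption, and the site and source names are hypothetical toy data.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse count vectors stored as dicts.
    (Assumed measure; the abstract does not specify one.)"""
    dot = sum(w * v[s] for s, w in u.items() if s in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_graph(site_sources, k):
    """Link two sites if either is among the k nearest neighbours of the other.
    Returns a set of undirected edges as sorted (site, site) tuples."""
    sites = sorted(site_sources)
    edges = set()
    for a in sites:
        # Rank all other sites by similarity to a; ties are broken by name.
        ranked = sorted(
            (b for b in sites if b != a),
            key=lambda b: (-cosine(site_sources[a], site_sources[b]), b),
        )
        for b in ranked[:k]:
            edges.add(tuple(sorted((a, b))))
    return edges

# Hypothetical bipartite data: each site maps to the sources it mentions.
site_sources = {
    "siteA": {"source1": 3, "source2": 1},
    "siteB": {"source1": 2, "source2": 2},
    "siteC": {"source3": 2, "source4": 1},
    "siteD": {"source3": 1, "source4": 2},
}
edges = knn_graph(site_sources, k=1)
print(sorted(edges))
```

Taking the union of each site's neighbour list makes the graph symmetric, which matches the "either is one of the K nearest neighbours of the other" condition in the text.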

Methods: We first demonstrate the KNN graph concept on the political blog data of Adamic (2005), comparing the performance of the Louvain community discovery algorithm (Blondel 2008) on the primary citation graph with that of a spectral clustering algorithm (Ng et al. 2001) on the KNN graph in recovering the liberal and conservative blogging communities. Next we use a source extraction system based on the architecture of Choi et al. (2005) to extract information sources from 700 web pages discussing climate change, and attempt to recover the pro-climate change and anti-climate change groups in the data. In the climate change data, direct page-to-page links are rare, so we compare the performance of KNN-graph-based clustering using only the information sources to that of a bisective clustering algorithm using the entire vocabulary, the so-called Bag-of-Words (BOW) model.
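To give a feel for how a spectral method can split a KNN graph into two groups, here is a simplified two-way sketch. It is not the Ng et al. (2001) algorithm itself, which embeds nodes using several leading eigenvectors and then runs k-means; this version only takes the sign of the second eigenvector of the normalized adjacency matrix, found by power iteration. It assumes a connected graph with more than two nodes in which every node has at least one edge, and the toy graph is hypothetical.

```python
from math import sqrt

def spectral_bipartition(adj, iterations=200):
    """Two-way split from the second eigenvector of D^-1/2 A D^-1/2.
    adj is a symmetric list-of-lists adjacency matrix (assumed connected,
    more than two nodes, no isolated nodes)."""
    n = len(adj)
    root = [sqrt(sum(row)) for row in adj]
    # The leading eigenvector of the normalized matrix is proportional
    # to the square roots of the degrees.
    norm = sqrt(sum(r * r for r in root))
    top = [r / norm for r in root]

    def step(x):
        # Multiply by (M + I) / 2 so that all eigenvalues lie in [0, 1].
        y = [
            0.5 * (x[i] + sum(adj[i][j] * x[j] / (root[i] * root[j])
                              for j in range(n)))
            for i in range(n)
        ]
        # Deflate: remove the component along the leading eigenvector.
        proj = sum(a * b for a, b in zip(y, top))
        y = [a - proj * b for a, b in zip(y, top)]
        length = sqrt(sum(a * a for a in y))
        return [a / length for a in y]

    x = [float(i + 1) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iterations):
        x = step(x)
    # The sign of each entry assigns the node to one of two groups.
    return [1 if value > 0 else 0 for value in x]

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by edge 2-3.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
labels = spectral_bipartition(adj)
print(labels)
```

Which group receives label 0 and which receives label 1 is arbitrary; only the grouping itself is meaningful.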

Results: On the political blog data, both the primary citation graph method and the KNN graph method discovered clusters in which over 95% of members came from one political community (cluster purity). On the climate change data, the best KNN-clustering system outperformed the best BOW system in discovering pro- and anti-climate change websites (cluster purity: 72% vs. 65%).
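Cluster purity, as used above, can be computed along these lines. This is a sketch under the common definition (the share of items that belong to the majority class of their cluster); the abstract does not state whether its figures are per-cluster or aggregate, so both are shown, and the labels are hypothetical toy data.

```python
from collections import Counter

def per_cluster_purity(cluster_ids, true_labels):
    """Fraction of each cluster taken up by its most common true label."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    return {
        c: counts.most_common(1)[0][1] / sum(counts.values())
        for c, counts in by_cluster.items()
    }

def overall_purity(cluster_ids, true_labels):
    """Share of all items that carry the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, true_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority / len(true_labels)

# Hypothetical example: six sites, two discovered clusters.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["pro", "pro", "anti", "anti", "anti", "anti"]
purity_by_cluster = per_cluster_purity(cluster_ids, true_labels)
purity = overall_purity(cluster_ids, true_labels)
print(purity_by_cluster, purity)
```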

Conclusions:  The political blog experiment shows that similarity of citation pattern can be an effective proxy for direct citation links in inferring communities. The climate change experiment shows that similarity of source citation pattern is a better indicator of orientation on the issue of climate change (a key indicator of community affiliation) than similarity of word usage patterns in general.  These results suggest that further generalizing the similarity relations considered might improve community detection.



Adamic, L. A., & Glance, N. (2005, August). The political blogosphere and the 2004 US election: divided they blog. In Proceedings of the 3rd international workshop on Link discovery (pp. 36-43). ACM.

Blondel, V. D., Guillaume, J. L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008.

Choi, Y., Cardie, C., Riloff, E., & Patwardhan, S. (2005, October). Identifying sources of opinions with conditional random fields and extraction patterns. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing (pp. 355-362). Association for Computational Linguistics.

Fowler, J. H., Johnson, T. R., Spriggs, J. F., Jeon, S., & Wahlbeck, P. J. (2007). Network analysis and the law: Measuring the legal importance of precedents at the US Supreme Court. Political Analysis, 15(3), 324-346.

Garfield, E. (1963). Citation indexes in sociological and historical research. American Documentation, 14(4), 289-291.

Newman, M. E. (2004). Who is the best connected scientist? A study of scientific coauthorship networks. In Complex networks (pp. 337-370). Springer Berlin Heidelberg.

Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. In G. Tesauro, D. Touretzky, & T. Leen (Eds.), Advances in neural information processing systems. Cambridge, MA: MIT Press.


Li An

San Diego State University

Alex Dodge

San Diego State University

Jean Mark Gawron

Honcho, San Diego State University
Computational linguistics, homophily via language, studying social groups via language.

Dipak Gupta

San Diego State University

Kathleen Preddy

San Diego State University

Brian Spitzberg

San Diego State University

Ming-Hsiang Tsou

Professor, San Diego State University
Dr. Ming-Hsiang (Ming) Tsou is a Professor in the Department of Geography, San Diego State University (SDSU) and the Director of Center for Human Dynamics in the Mobile Age (HDMA). His research interests are in Human Dynamics, Social Media, Big Data, Visualization, Internet Mapping...

Saturday September 27, 2014 14:56 - 15:15
TRS 1-149 Ted Rogers School of Management
