2012年11月28日星期三

Social Network Analysis & PageRank, HITS


Social Network Analysis

These two weeks, we have learned an effective method to find the pattern of interaction between actors in a social network, named social network analysis.

Social network analysis is the mapping and measuring of relationships and flows between people, groups, organizations, computers, URLs, and other connected information/knowledge entities. The nodes in the network are the people and groups while the links show relationships or flows between the nodes. SNA provides both a visual and a mathematical analysis of human relationships. To understand networks and their participants, we evaluate the location of actors in the network. Measuring the network location is finding the centrality of a node. These measures give us insight into the various roles and groupings in a network -- who are the connectors, mavens, leaders, bridges, isolates, where are the clusters and who is in them, who is in the core of the network, and who is on the periphery.

Two core notions in network analysis are centrality and prestige—as we will see, both are used in the link analysis for the Web. Informally, a node (or page or actor or resource) is central if it is involved in many links.
To have an actual understanding of the different categories of centrality, we see a sociogram first:

We look at a social network -- the "Kite Network" above -- developed by David Krackhardt, a leading researcher in social networks. Two nodes are connected if they regularly talk to each other, or interact in some way. Andre regularly interacts with Carol, but not with Ike. Therefore Andre and Carol are connected, but there is no link drawn between Andre and Ike. This network effectively shows the distinction between the three most popular individual centrality measures: Degree Centrality, Betweenness Centrality, and Closeness Centrality.

Degree Centrality: Social network researchers measure network activity for a node by using the concept of degrees -- the number of direct connections a node has. In the kite network above, Diane has the most direct connections in the network, making hers the most active node in the network. She is a 'connector' or 'hub' in this network. Common wisdom in personal networks is "the more connections, the better." This is not always so. What really matters is where those connections lead to -- and how they connect the otherwise unconnected! Here Diane has connections only to others in her immediate cluster -- her clique. She connects only those who are already connected to each other.
Betweenness Centrality: While Diane has many direct ties, Heather has few direct connections -- fewer than the average in the network. Yet, in may ways, she has one of the best locations in the network -- she is between two important constituencies. She plays a 'broker' role in the network. The good news is that she plays a powerful role in the network, the bad news is that she is a single point of failure. Without her, Ike and Jane would be cut off from information and knowledge in Diane's cluster. A node with high betweenness has great influence over what flows -- and does not -- in the network. Heather may control the outcomes in a network. That is why I say, "As in Real Estate, the golden rule of networks is: Location, Location, Location."
Closeness Centrality: Fernando and Garth have fewer connections than Diane, yet the pattern of their direct and indirect ties allow them to access all the nodes in the network more quickly than anyone else. They have the shortest paths to all others -- they are close to everyone else. They are in an excellent position to monitor the information flow in the network -- they have the best visibility into what is happening in the network.
Network Centralization: Individual network centralities provide insight into the individual's location in the network. The relationship between the centralities of all nodes can reveal much about the overall network structure.

A very centralized network is dominated by one or a few very central nodes. If these nodes are removed or damaged, the network quickly fragments into unconnected sub-networks. A highly central node can become a single point of failure. A network centralized around a well-connected hub can fail abruptly if that hub is disabled or removed. Hubs are nodes with high degree and betweeness centrality.

A less centralized network has no single points of failure. It is resilient in the face of many intentional attacks or random failures -- many nodes or links can fail while allowing the remaining nodes to still reach each other over other network paths. Networks of low centralization fail gracefully.

Besides, I take a simple description about the basic concept of prestige.
Degree prestige: A page is prestigious if it receives many in-links. The idea is that actors who are prestigious tend to receive many nominations or choices.
Proximity Prestige: Proximity prestige considers how proximate is to the actors in its influence domain. Define proximity as closeness that focuses on distances to rather than from each actor.
Rank prestige: An important aspect that we have ignored so far in the notions of prestige is the importance of the pages that link to i. Surely, if i receives links from important pages, it is probably important itself.

I am very interested in the ranking algorithm in information retrieval, such as PageRank or Hits. When analyzing social network, we will always deal with the web pages. So there are lots of similar methods or ideas as link analysis in information retrieval.
Link-based techniques for analyzing social networks enhance text-based retrieval and ranking strategies. As we shall see, social network analysis was well established long before the Web, in fact, long before graph theory and algorithms became mainstream computer science. Therefore, later developments in evolution models and properties of random walks, mixing rates, and eigensystems (Motwani & Raghavan, 1995) may make valuable contributions to social network analysis, especially in the context of the Web.
First I want to talk about PageRank, not in detail of the algorithm, just some core ideas. While PageRank and HITS were first presented in the same year (1998), PageRank has emerged as the dominant link analysis model for web search and mining. Like rank prestige, PageRank looks at the number of inlinks, together with the importance of these inlinks. It yields a static ranking, that is computed off-line, for each page, and does not depend on the search queries.
From the perspective of prestige, the following intuitions are used to derive the PageRank algorithm:
  • A link from a page pointing to another page is an implicit conveyance of authority to the target page. Hence, the more in-links a page i has, the more prestige it has.
  • Pages that link to page i have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i.
  • Since a page may point to many other pages, its prestige score should be shared among all the pages that it points to.
In Google, the crawled graph is first used to pre-compute and store the PageRank of each page. Note that the PageRank is independent of any query or textual content. When a query is submitted, a text index is used to first make a selection of possible response pages. Then an undisclosed ranking scheme that combines PageRank with textual match is used to produce a final ordering of response URLs.  All this makes Google comparable in speed, at query time, to conventional text- based search engines.

PageRank is an important ranking mechanism at the heart of Google, but it is not the only one: keywords, phrase matches, and match proximity are also taken into account, as is anchor text on pages linking to a given page. Search Engine Watch (www.searchenginewatch.com) reports that during some weeks in 1999, Google’s top hit to the query “more evil than Satan” returned www.microsoft.com, probably because of anchor text spamming. This embarrassment was fixed within a few weeks. The next incident occurred around November 2000, when Google’s top response to a rather offensive query was www.georgewbushstore.com. This was traced to www.hugedisk.com, which hosted a page that had the offensive query words as anchor text for a hyperlink to www.georgewbushstore.com.Although the details of Google’s combined ranking strategy are unpublished, such anecdotes suggest that the combined ranking strategy is tuned using many empirical parameters and checked for problems using human effort and regression testing. The strongest criticism of PageRank is that it defines prestige via a single random walk uninfluenced by a specific query. A related criticism is of the artificial decoupling between relevance and quality, and the ad hoc manner in which the two are brought together at query time, for the sake of efficiency.

There is something different of HITs from PageRank. HITS stands for hypertext induced topic search. Unlike PageRank, HITS depends on a query. When the user submits a query, HITS first expands the list of pages returned by the search engine and then produces two rankings of the expanded set of pages: authority ranking and hub ranking.

An authority is a page with many in-links; it may have good content on some topic, and hence it is linked to by many. A hub is a page with many out-links, and serves as a good organizer of the information on a certain topic. The idea behind HITS is that a good hub points to many authorities, and a good authority is pointed to by many good hubs.

How does HITS collect pages to be ranked? Given a search query q, HITS collects a set of pages as follows:
1. Submit the query to a search engine. Collect the t (typically, t = 200) highest ranked pages, which we assume to be highly relevant to q. This set is called the root set W.
2. Expand W by including any page pointed to by a page in W and any page that points to a page in W. The result is a larger set called S. As the set can be very large, the algorithm restricts its size by allowing each page in W to import at most m pages pointing to it into S. The set S is called the base set.

The entire process is generically called topic distillation. User studies (Chakrabarti et al., 1999) have shown that reporting hubs is useful over and above reporting authorities, because they provide useful annotations and starting points for users to start exploring a topic.

9 条评论:

  1. Both Diane and Heather are very important in your map of relationship. But I do wonder who is more important in this model of social network?

    回复删除
  2. Thank you for your blog that tells a history about PageRank. As you said, PageRank is an important ranking mechanism at the heart of Google. And I am not surprised that Google has 82.80% market share of searching engine in May 2011. Searching engine has changed our life a lot!

    回复删除
  3. Gosh...this could be the blog presenting most comprehensive introduction and understanding of SNA among all ones I've ever read. The part of centrality and prestige is probably easier to grasp and I believe Professor has given detailed explanation. So I would more like to talk about the ranking algorithm. As you mentioned, it is the core of a search engine such as Google since how to display the results on the webpage is quite important. Of course there are more than one way to sort the outcomes but rank as a simple and intuitive algorithm actually offers the most rational choice to users.

    回复删除
  4. Really a comprehensive introduction. As you mentioned, the importance of page ranking algorithm has been regarded as the most influential factor that affects the users' experience with web and search engine services. As Professor introduced in the previous lecture, we could also notice that the ranking algorithm is also the core technology we have to improve in order to transfer of Web 2.0 to Web 3.0 with semantic technologies.

    回复删除
  5. Thank you for sharing this informative post. The PageRank and HIT part is quite interesting.

    回复删除
  6. Long blog and good blog,it gives me lots of knowledge related to PageRank. The detail you show about how HITS collects pages to be ranked is really interesting.

    回复删除
  7. woo...This blog is so exhaustive to explain the degree and centrality of the social network. Actually, my blot is also written about the same topics. But compared with your blog, mine missed the explaination of Rank prestige. Your blog almost concluded all the factors of social network analysis. Truly thank you for your explaination.

    回复删除
  8. after i read Hit part, now i understand how hit works, thank you :)

    回复删除
  9. This blog is soooo helpful for me to understand those new concepts!! Thank you so much~ you give a detail statement on pagerank & HITS. I agree that Link-based techniques for analyzing social networks enhance text-based retrieval and ranking strategies. and the history of pagerank & HITS is attracted to me. the thoery background of HITS is also interesting!

    回复删除