Social Network Analysis
Over the past two weeks, we have learned an effective method for finding patterns of interaction between actors in a social network, called social network analysis (SNA).
Social network
analysis is the mapping and measuring of relationships and flows between
people, groups, organizations, computers, URLs, and other connected
information/knowledge entities. The nodes in the network are the people and
groups while the links show relationships or flows between the nodes. SNA
provides both a visual and a mathematical analysis of human relationships. To
understand networks and their participants, we evaluate the location of actors
in the network. Measuring network location means finding the centrality of a node. These centrality measures give us insight into the various roles and groupings in a
network -- who are the connectors, mavens, leaders, bridges, isolates, where
are the clusters and who is in them, who is in the core of the network, and who
is on the periphery.
Two core notions
in network analysis are centrality and prestige -- as we will see, both are used in link analysis for the Web. Informally, a node (or page, actor, or
resource) is central if it is involved in many links.
To get a concrete understanding of the different categories of centrality, let us first look at a sociogram:
We look at a
social network -- the "Kite Network" above -- developed by David
Krackhardt, a leading researcher in social networks. Two nodes are connected if
they regularly talk to each other, or interact in some way. Andre regularly
interacts with Carol, but not with Ike. Therefore Andre and Carol are
connected, but there is no link drawn between Andre and Ike. This network
effectively shows the distinction between the three most popular individual
centrality measures: Degree Centrality, Betweenness Centrality, and Closeness
Centrality.
Degree Centrality: Social network researchers measure network activity
for a node by using the concept of degrees -- the number of direct connections
a node has. In the kite network above, Diane has the most direct connections, making her the most active node in the network. She is a 'connector'
or 'hub' in this network. Common wisdom in personal networks is "the more
connections, the better." This is not always so. What really matters is
where those connections lead to -- and how they connect the otherwise
unconnected! Here Diane has connections only to others in her immediate cluster
-- her clique. She connects only those who are already connected to each other.
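To make this concrete, here is a small sketch using networkx (one possible tool, not part of the course material). It relies on networkx's built-in Krackhardt kite graph, whose numeric labels 0-9 I map to the names above; the mapping is my assumption, following the usual presentation of the kite network.

```python
import networkx as nx

# Krackhardt kite graph shipped with networkx; the 0-9 labels are mapped
# to the names used above (my assumed mapping).
names = {0: "Andre", 1: "Beverly", 2: "Carol", 3: "Diane", 4: "Ed",
         5: "Fernando", 6: "Garth", 7: "Heather", 8: "Ike", 9: "Jane"}
G = nx.relabel_nodes(nx.krackhardt_kite_graph(), names)

# Degree centrality: number of direct connections, normalized by n - 1.
for node, score in sorted(nx.degree_centrality(G).items(), key=lambda x: -x[1]):
    print(f"{node:10s} {score:.2f}")   # Diane comes out on top (6 of 9 possible ties)
```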
Betweenness Centrality: While Diane has many direct ties, Heather has few
direct connections -- fewer than the average in the network. Yet, in many ways,
she has one of the best locations in the network -- she is between two
important constituencies. She plays a 'broker' role in the network. The good
news is that she plays a powerful role in the network, the bad news is that she
is a single point of failure. Without her, Ike and Jane would be cut off from
information and knowledge in Diane's cluster. A node with high betweenness has
great influence over what flows -- and what does not -- in the network. Heather may
control the outcomes in a network. That is why I say, "As in Real Estate,
the golden rule of networks is: Location, Location, Location."
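A corresponding sketch for betweenness, again on networkx's kite graph (node 7 is Heather under the mapping used in the degree sketch):

```python
import networkx as nx

# Betweenness: how often a node lies on shortest paths between other pairs.
G = nx.krackhardt_kite_graph()      # node 7 is Heather in the mapping above
bc = nx.betweenness_centrality(G)
print(max(bc, key=bc.get))          # 7 -- Heather, the broker between the two groups
```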
Closeness Centrality: Fernando and Garth have fewer connections than Diane,
yet the pattern of their direct and indirect ties allows them to access all the
nodes in the network more quickly than anyone else. They have the shortest
paths to all others -- they are close to everyone else. They are in an
excellent position to monitor the information flow in the network -- they have
the best visibility into what is happening in the network.
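And the same idea for closeness (nodes 5 and 6 are Fernando and Garth under that mapping):

```python
import networkx as nx

# Closeness: inverse of the average shortest-path distance to all other nodes.
G = nx.krackhardt_kite_graph()      # nodes 5 and 6 are Fernando and Garth
cc = nx.closeness_centrality(G)
for node, score in sorted(cc.items(), key=lambda x: -x[1])[:3]:
    print(node, round(score, 3))    # Fernando and Garth share the top spot
```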
Network Centralization: Individual network centralities provide insight into
the individual's location in the network. The relationship between the
centralities of all nodes can reveal much about the overall network structure.
A very centralized
network is dominated by one or a few very central nodes. If these nodes are
removed or damaged, the network quickly fragments into unconnected
sub-networks. A highly central node can become a single point of failure. A network
centralized around a well-connected hub can fail abruptly if that hub is
disabled or removed. Hubs are nodes with high degree and betweenness centrality.
A less centralized
network has no single points of failure. It is resilient in the face of many
intentional attacks or random failures -- many nodes or links can fail while
allowing the remaining nodes to still reach each other over other network
paths. Networks of low centralization fail gracefully.
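One common way to quantify this is Freeman's centralization index. The sketch below computes the degree-based version and contrasts a star (maximally centralized) with a ring (not centralized at all); the helper function is my own illustration, not a networkx built-in.

```python
import networkx as nx

def degree_centralization(G):
    """Freeman's degree centralization: 1.0 for a perfect star, 0.0 when
    every node has the same degree (my own helper, for illustration)."""
    n = G.number_of_nodes()
    degrees = [d for _, d in G.degree()]
    max_deg = max(degrees)
    # Sum of gaps to the most central node, normalized by the maximum
    # possible sum, which a star graph attains: (n - 1) * (n - 2).
    return sum(max_deg - d for d in degrees) / ((n - 1) * (n - 2))

star = nx.star_graph(9)      # one hub plus nine leaves: single point of failure
ring = nx.cycle_graph(10)    # no hub at all: fails gracefully
kite = nx.krackhardt_kite_graph()

print(degree_centralization(star))   # 1.0
print(degree_centralization(ring))   # 0.0
print(degree_centralization(kite))   # roughly 0.33 -- somewhere in between
```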
In addition, here is a brief description of the basic concepts of prestige.
Degree prestige: A page is prestigious if it receives many in-links. The
idea is that actors who are prestigious tend to receive many nominations or
choices.
Proximity Prestige: Proximity prestige considers how proximate a node is to the actors in its influence domain (the set of actors that can reach it). Here proximity is a form of closeness that focuses on distances to, rather than from, each actor.
Rank prestige: An important aspect that we have ignored so far in
the notions of prestige is the importance of the pages that link to a page i. Surely, if i receives links from important pages, it is probably important itself.
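To keep these notions apart, here is a small sketch on a made-up directed "who links to whom" graph; it computes degree prestige (normalized in-degree) and proximity prestige following the standard definitions. Rank prestige is deferred to the PageRank discussion below, which formalizes exactly that idea.

```python
import networkx as nx

# A hypothetical directed graph used only for illustration.
G = nx.DiGraph([("a", "d"), ("b", "d"), ("c", "d"), ("d", "e"),
                ("e", "d"), ("b", "c")])
n = G.number_of_nodes()

# Degree prestige: fraction of other nodes that link directly to v.
degree_prestige = {v: G.in_degree(v) / (n - 1) for v in G}

def proximity_prestige(G, v):
    """Fraction of nodes that can reach v, divided by their average
    distance to v (0 if nobody can reach v)."""
    dists = nx.shortest_path_length(G, target=v)     # distances *to* v
    influence = {u: d for u, d in dists.items() if u != v}
    if not influence:
        return 0.0
    frac_reachable = len(influence) / (n - 1)
    avg_dist = sum(influence.values()) / len(influence)
    return frac_reachable / avg_dist

for v in G:
    print(v, round(degree_prestige[v], 2), round(proximity_prestige(G, v), 2))
```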
I am very interested in the ranking algorithms used in information retrieval, such as PageRank and HITS. When analyzing social networks on the Web, we inevitably deal with web pages, so many of the methods and ideas are shared with link analysis in information retrieval.
Link-based
techniques for analyzing social networks enhance text-based retrieval and ranking
strategies. As we shall see, social network analysis was well established long
before the Web, in fact, long before graph theory and algorithms became
mainstream computer science. Therefore, later developments in evolution models
and properties of random walks, mixing rates, and eigensystems (Motwani &
Raghavan, 1995) may make valuable contributions to social network analysis,
especially in the context of the Web.
First I want to talk about PageRank -- not the details of the algorithm, just some core ideas. While
PageRank and HITS were first presented in the same year (1998), PageRank has emerged
as the dominant link analysis model for web search and mining. Like rank
prestige, PageRank looks at the number of inlinks, together with the importance
of these in-links. It yields a static ranking that is computed off-line for each page and does not depend on the search query.
From the
perspective of prestige, the following intuitions are used to derive the
PageRank algorithm (a minimal sketch after the list shows the corresponding computation):
- A link from a page pointing to another page is an implicit conveyance of authority to the target page. Hence, the more in-links a page i has, the more prestige it has.
- Pages that link to page i have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i.
- Since a page may point to many other pages, its prestige score should be shared among all the pages that it points to.
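The sketch below turns those three intuitions into the familiar power-iteration computation on a tiny made-up link graph. The damping factor alpha follows the usual choice of 0.85, and the code is a simplified illustration of the idea, not what Google actually runs.

```python
def pagerank(out_links, alpha=0.85, iters=50):
    """Plain-Python PageRank sketch: prestige flows along in-links, scaled
    by the prestige of the source and shared among its out-links."""
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}               # start from a uniform score
    for _ in range(iters):
        new_rank = {p: (1 - alpha) / n for p in pages}
        for p, targets in out_links.items():
            if not targets:                          # dangling page: spread its score evenly
                for q in pages:
                    new_rank[q] += alpha * rank[p] / n
            else:
                share = rank[p] / len(targets)       # prestige shared among out-links
                for q in targets:
                    new_rank[q] += alpha * share     # each in-link conveys that share
        rank = new_rank
    return rank

# Hypothetical four-page web: C receives the most (and the best) in-links.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```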
In Google, the crawled graph is first used to pre-compute and store the
PageRank of each page. Note that the PageRank is independent of any query or
textual content. When a query is submitted, a text index is used to first make
a selection of possible response pages. Then an undisclosed ranking scheme that
combines PageRank with textual match is used to produce a final ordering of
response URLs. All this makes Google
comparable in speed, at query time, to conventional text-based search engines.
PageRank is an important ranking mechanism at the heart of Google, but
it is not the only one: keywords, phrase matches, and match proximity are also
taken into account, as is anchor text on pages linking to a given page. Search
Engine Watch (www.searchenginewatch.com) reports that during some weeks in
1999, Google’s top hit to the query “more evil than Satan” returned
www.microsoft.com, probably because of anchor text spamming. This embarrassment
was fixed within a few weeks. The next incident occurred around November 2000,
when Google’s top response to a rather offensive query was
www.georgewbushstore.com. This was traced to www.hugedisk.com, which hosted a
page that had the offensive query words as anchor text for a hyperlink to
www.georgewbushstore.com. Although the details of Google’s combined ranking
strategy are unpublished, such anecdotes suggest that the combined ranking
strategy is tuned using many empirical parameters and checked for problems using
human effort and regression testing. The strongest criticism of PageRank is
that it defines prestige via a single random walk uninfluenced by a specific
query. A related criticism is of the artificial decoupling between relevance
and quality, and the ad hoc manner in which the two are brought together at
query time, for the sake of efficiency.
HITS differs from PageRank in several ways. HITS stands for Hypertext Induced Topic Search.
Unlike PageRank, HITS depends on a query. When the user submits a query, HITS
first expands the list of pages returned by the search engine and then produces
two rankings of the expanded set of pages: authority ranking and hub ranking.
An authority is a
page with many in-links; it may have good content on some topic, and hence it
is linked to by many. A hub is a page with many out-links, and serves as a good
organizer of the information on a certain topic. The idea behind HITS is that a
good hub points to many authorities, and a good authority is pointed to by many
good hubs.
How does HITS
collect pages to be ranked? Given a search query q, HITS collects a set of
pages as follows:
1. Submit the
query to a search engine. Collect the t (typically, t = 200) highest ranked pages,
which we assume to be highly relevant to q. This set is called the root set W.
2. Expand W by
including any page pointed to by a page in W and any page that points to a page
in W. The result is a larger set called S. As the set can be very large, the algorithm
restricts its size by allowing each page in W to import at most m pages pointing
to it into S. The set S is called the base set.
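Once the base set S is collected (which requires a search engine and a link index, both assumed given here), the hub and authority scores are computed by mutual reinforcement. The sketch below runs that iteration on a small made-up base-set graph; it illustrates the idea rather than the exact HITS implementation.

```python
def hits(out_links, iters=50):
    """Sketch of the HITS iteration: authorities are pointed to by good hubs,
    hubs point to good authorities; scores are re-normalized each round."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iters):
        new_auth = {p: 0.0 for p in pages}
        for p, targets in out_links.items():
            for q in targets:
                new_auth[q] += hub[p]                # authority: sum of hub scores pointing in
        new_hub = {p: sum(new_auth[q] for q in out_links.get(p, [])) for p in pages}
        a_norm = sum(v * v for v in new_auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in new_hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return hub, auth

# Hypothetical base set: h1 and h2 act as hubs, a1 and a2 as authorities.
S = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "a1": [], "a2": ["a3"]}
hub, auth = hits(S)
print(sorted(auth, key=auth.get, reverse=True))      # a1 and a2 come out on top
```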
The entire process
is generically called topic distillation. User studies (Chakrabarti et al.,
1999) have shown that reporting hubs is useful over and above reporting authorities,
because they provide useful annotations and starting points for users exploring a topic.


