9

NETWORK ANALYSIS

Network analysis differs from more traditional content analysis techniques in that instead of individual words or actors, it is the patterns of connections between those entities that are the focus. Increasingly sophisticated network analysis methodologies allow very advanced exploitation of this information, ranging from simple analysis of who's connected to whom to more advanced structural characterization able to indicate how important a given node is to the network as a whole. In many ways, network analysis represents the future of where content analysis is heading as a discipline. The techniques in this chapter are relatively advanced and not every content analysis project will have data that is easily represented in network form, but for those needing to explore the deeper relationships encoded in their data, network analysis offers a number of unique capabilities.

Understanding Network Analysis

A network is about representing connections and consists of a set of nodes connected by edges. A node can be an isolate, with no edges connecting it to the rest of the network, a pendant, connected to only a single other node, or have any number of edges connecting it to any number of nodes. A node can be connected to just one or two other nodes, or have edges linking it to every other node in the entire network. Nodes can also have self-ties in which a circular edge connects them back to themselves in a self-referencing fashion. An edge can have a weight associated with it that indicates the strength of the underlying connection and can be unidirectional (connecting A to B, but not B to A) or bidirectional (A and B are connected in both directions).

To understand how a network works, think of a city and its road system. Each building (node) is connected by a set of roads (edges). Some roads are small one-way streets that allow travel only in a single direction (a unidirectional edge with a low weight). Other roads are 12-lane highways allowing vast numbers of vehicles to travel simultaneously in both directions (a bidirectional edge with a high weight). Two buildings located on the same two-way street might require a very short travel distance. If the buildings are located on a one-way street, with B down the street from A, travel from A to B is easy, but traveling from B to A requires turning onto side streets to loop back around to the beginning of the street. Traveling cross-country requires traversing many different roads, with multiple possible paths to choose from, some shorter or faster than others. A building in the central business district of a major city has a large number of possible routes leading to it along major highways, while a building in a rural area might have only a few small roads connecting it to the rest of the transportation system, leading to an uneven landscape of access.

Network analysis is about understanding patterns in the relationships that connect objects and using those relationships to characterize actors based on the roles they play in the overall network environment. Two characters in a play might appear in equal numbers of scenes, but one may spend the majority of their appearances with just two other characters, while the other has one scene each with every character in the play. A traditional vocabulary analysis would be hard-pressed to differentiate these two characters, since their differences lie in their relationships with others, rather than their own actions. These are the kinds of research questions that network methods are designed to answer.

Network Content Analysis

Content analysts are used to working with measures of individual objects: the only relationships a typical analyst explores are basic patterns of correlation and co-occurrence. How then might network analysis be integrated into a content analysis project? In particular, since most content analyses revolve around large archives of text, how can a body of text be transformed into a network representation?

Some content collections include metadata that allow connections to be drawn between documents. For example, each record in an email archive contains both a sender and a list of recipients. To model this archive as a network, each address is treated as a node and each email establishes a set of edges between the sender and recipient addresses. Similarly, an academic journal paper contains a list of references linking it to the broader body of literature cited by its author. Many scholarly rankings measure the importance of a paper by the number of subsequent papers citing it and the highest-ranked journals in most fields are those whose papers are the most heavily cited. Such uses of network analysis are known as citation analysis, part of the larger field of bibliometrics.
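As a minimal sketch of the email example, assume each message is available as a record with a `sender` string and a `recipients` list. Those field names and the addresses are hypothetical, chosen only for illustration, and are not tied to any particular archive format. Under that assumption, the Python below turns an archive into weighted directed edges:

```python
from collections import Counter

# Hypothetical, hand-made records standing in for a parsed email archive.
emails = [
    {"sender": "ann@example.org", "recipients": ["bob@example.org", "cat@example.org"]},
    {"sender": "bob@example.org", "recipients": ["ann@example.org"]},
    {"sender": "ann@example.org", "recipients": ["bob@example.org"]},
]

def email_edges(messages):
    """Count one directed edge (sender -> recipient) per message and recipient."""
    weights = Counter()
    for msg in messages:
        for recipient in msg["recipients"]:
            weights[(msg["sender"], recipient)] += 1
    return weights

edges = email_edges(emails)
# Every address that appears at either end of an edge is a node.
nodes = {address for pair in edges for address in pair}

for (sender, recipient), weight in sorted(edges.items()):
    print(sender, "->", recipient, "weight", weight)
```

Counting repeated messages as edge weight is one reasonable choice; an analyst interested only in who has ever written to whom could discard the counts and keep the set of pairs.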

In the absence of metadata, or as a complement to it, entity extraction and co-occurrence information may be used to construct a network directly from the text itself. For example, a proximity correlation table, introduced in Chapter 4, can be considered a type of network model, in which the words or actors from the text are the nodes, and each co-occurrence of a pair of words and their distance in that document become the network edges and their weights. The benefit of a network representation in this case is that instead of measuring just direct correlations between words, it becomes possible to examine transitive correlations, in which word A tends to co-occur with B, which co-occurs with C. A and C may never co-occur and thus would not be connected by a traditional correlation table, while a network analysis would immediately reveal the transitive nature of their relatedness.

Character interactions in a theatrical play are another area where network analysis can play a role in textual content analysis. Plays revolve around a fixed set of characters and the interactions between them, lending themselves readily to a network representation of nodes representing characters and edges formed from co-occurrences of those characters in a scene. A traditional correlation analysis would show which actors are the most related in terms of co-appearances. Only a network analysis, however, makes it possible to identify those characters that connect any pair of other characters in the play. Similarly, when working with online news or blog web sites, the pages each site links to can be used to measure the overlap of their shared interests, while similar quotes or “memes” in news coverage can be used to track the flow of information through a news system. In essence, network techniques will be useful to those analysts working with content that can be broken down into discrete entities and for which the relationships between them are of interest.

Representing Network Data

Network data is most commonly represented as a table or matrix, with the nodes being repeated as both the columns and rows. Each cell indicates whether a connection exists between the two nodes and its strength. An adjacency matrix reports a 0 if there is no connection between the nodes and a 1 if there is a connection of any strength, while a distance matrix records the actual strength of the connection. Ordinarily, a node will have a 0 in the cell connecting it to itself, but some networks may have situations where nodes connect back to themselves. For example, email users often carbon-copy themselves on certain emails for reporting or records compliance, generating a self-edge that could be recorded.

TABLE 9.1 Network represented as a table

          Node 1    Node 2    Node 3
Node 1      0
Node 2                0
Node 3                          0
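To make the two matrix forms concrete, here is a small sketch that builds both from a weighted edge list. The edges and weights are invented for illustration. Following the description above, the first matrix holds only 0 or 1, while the second stores the recorded strength of each connection.

```python
nodes = ["Node 1", "Node 2", "Node 3"]

# Hypothetical weighted, directed edges: (source, target) -> weight.
weighted_edges = {
    ("Node 1", "Node 2"): 3,
    ("Node 2", "Node 1"): 1,
    ("Node 2", "Node 3"): 5,
}

index = {name: i for i, name in enumerate(nodes)}
size = len(nodes)
adjacency = [[0] * size for _ in range(size)]
strength = [[0] * size for _ in range(size)]

for (source, target), weight in weighted_edges.items():
    row, col = index[source], index[target]
    adjacency[row][col] = 1       # a connection of any strength
    strength[row][col] = weight   # the recorded strength of that connection

for name, row in zip(nodes, adjacency):
    print(name, row)
```

Rows are read as "from" and columns as "to" in this sketch, so a unidirectional edge fills only one of the two mirrored cells. A self-tie would appear as a non-zero value on the diagonal.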

Constructing the Network

Network analysis can be applied to data already in network format, or may require the extraction of network structure from a text archive. In the earlier example of an email archive, the To and From fields of each email would be used to record the links between email users. In a collection of web pages, each page and each hyperlink in those pages would be represented as nodes, and the connections between a page and the hyperlinks it contains would be the edges. In both cases the resulting network would allow the larger-scale communication patterns to be explored: who communicates the most with whom, and how many degrees of separation exist between any pair of users?

Constructing a network from a text archive is more complex and can rely either on complex grammatical parsing or simple textual co-occurrence. Grammatical systems attempt to produce a network that encodes as much semantic information as possible from the text. A phrase like "John is Mary's father" would be converted by a grammatical parser to a network representation with John and Mary as nodes and father as an edge connecting the two, indicating that father is a kind of relationship between them. As with other grammatical parsing systems, accuracy can vary significantly based on the quality of the input text.

Simple co-occurrence linking is a more common method. Each word or phrase extracted from the text becomes a node, and an edge is recorded for each document in which two words co-occur, with the weight being the minimal distance in letters or words between them. In the example above, John, Mary, is, and father would all be nodes, and all four would be connected to each other. By itself, this yields less information than the grammatically generated network, but if aggregated across a large archive, this information could be used to fill in ties between John and Mary and other entities they are connected to. In particular, this approach does not require pre-existing language models or other tools customized for the text being analyzed and works regardless of how grammatically correct the documents are.
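The co-occurrence approach can be sketched in a few lines of Python. This is an illustrative sketch only: the tokenizer is deliberately crude (lowercasing, stripping "'s", splitting on whitespace), distance is measured in words rather than letters, and the function name `cooccurrence_edges` is made up for this example.

```python
from itertools import combinations

def cooccurrence_edges(documents):
    """Link every pair of distinct words that share a document.

    The weight of an edge is the smallest distance, in words, observed
    between the two words in any document.
    """
    edges = {}
    for text in documents:
        # Crude tokenizer, assumed for illustration only.
        words = text.lower().replace("'s", "").split()
        positions = {}
        for i, word in enumerate(words):
            positions.setdefault(word, []).append(i)
        for a, b in combinations(sorted(positions), 2):
            gap = min(abs(i - j) for i in positions[a] for j in positions[b])
            edges[(a, b)] = min(gap, edges.get((a, b), gap))
    return edges

edges = cooccurrence_edges(["John is Mary's father"])
for pair, gap in sorted(edges.items()):
    print(pair, gap)
```

For the single example sentence this links all four words to each other, giving six undirected edges. Passing a larger list of documents aggregates the evidence, since each pair keeps the smallest distance seen anywhere in the archive.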

Network Structure

Every network has an underlying structure formed by the particular arrangement of its nodes and edges. Different networks can be composed of vastly different structures, while different structures can exhibit similar behaviors. Most network analyses therefore begin with an examination of the micro- and macro-level structures formed by a particular collection of nodes and edges, reducing their complex spatial structure into a set of simple quantitative measurements.

For a single node, the most common structural indicator is its centrality, which measures its overall importance to the network and reflects its role in connecting the other nodes. There are several common types of centrality measurements:

•   Betweenness centrality How often a node is part of the shortest route between two other nodes. In an email network a high betweenness centrality would suggest a person who plays a central role in relaying information between other users. This user may not have direct connections to many other users in the network, but occupies a location in the network where they facilitate a lot of communication.

•   Closeness centrality The average distance between a node and all other nodes in the network, so that a node with high closeness centrality sits a short distance from everyone else. It is measured in terms of the number and total weight of all edges that must be traversed to reach every other node in the network. In an email network a user with high closeness centrality would be someone who can reach most other users directly or through only a few intermediaries, such as an office manager.

•   Degree centrality The total number of edges connected to a node. This is broken further into indegree and outdegree centrality, measuring the total number of edges pointing toward the node and the total pointing away from the node, respectively. In an email network a user with high degree centrality is someone who emails or is emailed by many other users.
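The three centrality measures can be computed by hand on a toy network. The sketch below makes several assumptions: the email network is invented, edges are treated as undirected and unweighted, closeness is reported as the inverse of the average distance (a common convention, so that larger values mean a more central node), and betweenness counts each unordered pair of nodes once.

```python
from collections import deque

# Hypothetical undirected email network: who has exchanged mail with whom.
graph = {
    "ann": {"bob"},
    "bob": {"ann", "cat"},
    "cat": {"bob", "dan", "eve"},
    "dan": {"cat"},
    "eve": {"cat"},
}

def degree(adj):
    """Number of edges attached to each node."""
    return {node: len(neighbors) for node, neighbors in adj.items()}

def hop_distances(adj, source):
    """Breadth-first search: fewest edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness(adj):
    """Inverse of the average distance to the nodes each node can reach."""
    scores = {}
    for node in adj:
        dist = hop_distances(adj, node)
        total = sum(dist.values())
        scores[node] = (len(dist) - 1) / total if total else 0.0
    return scores

def betweenness(adj):
    """Shortest-path betweenness, accumulated one source node at a time."""
    scores = dict.fromkeys(adj, 0.0)
    for source in adj:
        order, preds = [], {v: [] for v in adj}
        paths = dict.fromkeys(adj, 0)
        paths[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v lies on a shortest path to w
                    paths[w] += paths[v]
                    preds[w].append(v)
        share = dict.fromkeys(adj, 0.0)
        for w in reversed(order):            # farthest nodes first
            for v in preds[w]:
                share[v] += paths[v] / paths[w] * (1 + share[w])
            if w != source:
                scores[w] += share[w]
    # Each unordered pair was counted from both ends in an undirected graph.
    return {node: value / 2 for node, value in scores.items()}

print(degree(graph))
print(closeness(graph))
print(betweenness(graph))
```

In this toy network "cat" scores highest on all three measures, but "bob" has a high betweenness despite only two connections, because every path from "ann" to the rest of the network passes through him.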

A pair of nodes can be compared by examining their respective centrality scores, or through a set of pair-wise network measures:

•   Adjacency Whether the two nodes are directly connected to each other.

•   Connectivity The number of nodes that would have to be removed from the network such that there no longer exist any paths connecting the two nodes.

•   Distance The total number of edges and the total sum of the weights of those edges along a given path between the two nodes.

•   Maximum flow The number of nodes directly connected to the first node that have at least one path connecting them to the second node. If node A is linked to B and B has many paths to C, the connection between A and C still hinges on B. If A is connected to B and D and E, each of which has a connection to C, then even in the loss of one of them, A still has a way of reaching C.

•   Reachability Whether a path exists, regardless of length, that connects the two nodes. In a network with unidirectional edges, large sections of the network may not be reachable from other sections.

•   Structural equivalence Whether two nodes are both connected to the same set of other nodes (even if they are not connected to each other).
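Several of these pair-wise measures reduce to simple graph searches. The sketch below uses an invented directed network and makes simplifying assumptions: distance is the fewest edges along any directed path (weights are ignored), adjacency counts an edge in either direction, and structural equivalence compares incoming and outgoing ties separately.

```python
from collections import deque

# Hypothetical directed network: an edge A -> B means A sends to B.
out_edges = {
    "A": {"B", "C"},
    "B": {"D"},
    "C": {"D"},
    "D": set(),
}

def adjacent(adj, a, b):
    """True if an edge runs directly between a and b in either direction."""
    return b in adj[a] or a in adj[b]

def distance(adj, source, target):
    """Fewest edges on a directed path, or None if target is unreachable."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None

def reachable(adj, source, target):
    """True if any directed path leads from source to target."""
    return distance(adj, source, target) is not None

def structurally_equivalent(adj, a, b):
    """True if a and b have the same ties to all other nodes."""
    def ties(node):
        outgoing = adj[node] - {a, b}
        incoming = {n for n, targets in adj.items() if node in targets} - {a, b}
        return outgoing, incoming
    return ties(a) == ties(b)

print(distance(out_edges, "A", "D"))
print(reachable(out_edges, "D", "A"))
print(structurally_equivalent(out_edges, "B", "C"))
```

Here B and C are structurally equivalent even though they are not adjacent: both receive from A and both send to D. Because every edge is unidirectional, D can be reached from A but A cannot be reached from D.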

Often, a network as a whole must be characterized to allow comparisons between collections, such as differences in the overall connectivity of two email archives. Several whole-network measures offer a general summarization of the network's structure:

•   Density A fully dense network would have an edge connecting every node to every other node. A network's density is therefore the percentage of possible edges that are actually present.

•   Triad census The overall connectivity profile of a network. This measure is important enough to be addressed in its own section in a moment.

•   Size The total number of nodes and edges in the network.
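Density and size are straightforward to compute once the nodes and edges are counted. The sketch below assumes self-ties are excluded from the count of possible edges, so a directed network of N nodes has N(N-1) possible edges and an undirected one has half that. The example network is invented.

```python
def density(node_count, edge_count, directed=True):
    """Share of possible edges (excluding self-ties) that are present."""
    possible = node_count * (node_count - 1)
    if not directed:
        possible //= 2
    return edge_count / possible if possible else 0.0

# Hypothetical directed network with four nodes and three edges.
nodes = {"A", "B", "C", "D"}
edges = {("A", "B"), ("B", "A"), ("B", "C")}

size = (len(nodes), len(edges))
print("size:", size)
print("density:", density(len(nodes), len(edges)))
```

Because density is a proportion, it allows two archives of different sizes to be compared directly, which raw edge counts do not.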

The Triad Census

A triad census measures the connectivity profile of a network, classifying the overall well-connectedness of the network. A network in which most nodes are connected to only a few other nodes has a weaker aggregate structure (and thus is more vulnerable to loss) than one in which each node is connected to every other node (and thus will still have paths between most nodes even if many nodes are removed). Overall reciprocity is also captured, indicating whether, on average, if A is connected to B, B is also connected back to A.

To conduct a triad census, all nodes of a network are grouped together in every possible combination of three (whether or not they are actually connected in the network). For example, a network with nodes A, B, C, and D would have triads ABC, ABD, ACD, and BCD. A network with directed edges and N nodes has a total of (N(N-1)(N-2))/6 triads. Each triad is classified into one of 64 possible states, ranging from completely unconnected (none of the nodes has an edge connecting it to the other two) to fully connected (all three nodes have edges connecting them to each other). The output of the triad census is simply a list of the 64 states, along with a count of how many triads fell into each category. Given that many of the 64 possible configurations are structurally equivalent to each other, there are actually only 16 structurally distinct states (isomorphism classes), and many software programs will output only those 16 states, rather than the full set of 64 states.
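The counting described above can be checked with a brute-force sketch. It is an illustration rather than how production software works: each triad is reduced to a canonical form by trying all six relabelings of its three nodes, and the classes are identified by a six-digit tuple instead of the conventional class labels. The function names and the example network are invented.

```python
from collections import Counter
from itertools import combinations, permutations

# The six ordered pairs among three positions: one slot per possible directed edge.
PAIRS = [(i, j) for i in range(3) for j in range(3) if i != j]

def canonical_state(edge_set):
    """Smallest 0/1 code over all relabelings, so equivalent triads match."""
    best = None
    for perm in permutations(range(3)):
        relabeled = {(perm[i], perm[j]) for i, j in edge_set}
        code = tuple(int(pair in relabeled) for pair in PAIRS)
        if best is None or code < best:
            best = code
    return best

def triad_census(nodes, edges):
    """Count how many triads fall into each structurally distinct state."""
    edges = set(edges)
    census = Counter()
    for triple in combinations(sorted(nodes), 3):
        local = {(i, j) for i, j in PAIRS if (triple[i], triple[j]) in edges}
        census[canonical_state(local)] += 1
    return census

# All 64 raw states collapse to 16 structurally distinct ones.
all_states = {
    canonical_state({PAIRS[k] for k in range(6) if mask >> k & 1})
    for mask in range(64)
}
print(len(all_states))

# Hypothetical directed network with nodes A, B, C, D.
nodes = ["A", "B", "C", "D"]
edges = [("A", "B"), ("B", "A"), ("B", "C")]
census = triad_census(nodes, edges)
triad_count = len(nodes) * (len(nodes) - 1) * (len(nodes) - 2) // 6
print(triad_count, dict(census))
```

Enumerating every triad takes time proportional to the cube of the number of nodes, which is fine for a small example but is one reason dedicated network packages use more efficient algorithms.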

Network Evolution

Most network methods are like traditional vocabulary analysis in that they measure only the current state of the data, not change over time. An area of emerging interest in network scholarship is the concept of transformation over time or dynamic network evolution. Put simply, such analysis moves beyond the question of which nodes are most closely connected today to which nodes, possibly with no direct connections, have evolved over time to form connections with their neighbors in similar fashion or may do so in the future. Evolutionary analyses are more concerned with the ordering of network connections than with the aggregation of those connections. For example, all characters in a play may eventually share a scene with each other, but may be differentiated by the order in which they co-appeared with the other characters. Such indicators measure the diffusion of connections through a network.

A common approach to examining change in a network is to break it into specific time increments and compare network indicators for each time period. For example, an email archive could be divided into monthly snapshots and the triad census of each month compared to explore patterns in the interconnectedness of users through time. Centrality scores for individual users or pair-wise measures could also be compared at each time step to measure network velocity.
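A minimal sketch of the snapshot approach follows, assuming each email is available as a hypothetical (date, sender, recipient) record. For brevity it tracks outdegree, the number of distinct people each user wrote to in a month, rather than a full triad census; any other indicator could be computed per snapshot in the same way.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical timestamped emails: (date sent, sender, recipient).
messages = [
    (date(2020, 1, 5), "ann", "bob"),
    (date(2020, 1, 20), "bob", "cat"),
    (date(2020, 2, 3), "ann", "bob"),
    (date(2020, 2, 9), "bob", "ann"),
    (date(2020, 2, 17), "cat", "ann"),
    (date(2020, 2, 28), "cat", "bob"),
]

def monthly_snapshots(records):
    """Group the distinct sender -> recipient edges by (year, month)."""
    snapshots = defaultdict(set)
    for sent, sender, recipient in records:
        snapshots[(sent.year, sent.month)].add((sender, recipient))
    return dict(snapshots)

def outdegree_by_month(records):
    """For each month, count the distinct recipients each sender wrote to."""
    snapshots = monthly_snapshots(records)
    return {
        month: Counter(sender for sender, _ in edges)
        for month, edges in sorted(snapshots.items())
    }

trend = outdegree_by_month(messages)
for month, scores in trend.items():
    print(month, dict(scores))
```

Comparing a user's score from one month to the next gives a simple picture of change; in this invented data "cat" sends nothing in January but writes to two people in February.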

Beyond such simple metrics, there is a growing body of simulation and game theoretic modeling approaches being applied to leverage the relationship information in a network to forecast its future development. For example, epidemic models have been widely used to simulate the flow of information through a network. Instead of the spread of disease, such a model applied to a news archive would simulate the spread of a news story across outlets. Information in the real world rarely spreads uniformly outward from a source. Instead, it tends to follow an irregular and seemingly complex series of turns and cycles as it propagates. For example, a blog, A, might only pick up a news story if it appears in two news outlets, B and C, first. Outlet B, in turn, may only cover stories that outlets D and E do. Thus, even if A is connected to all four outlets, only stories that first appeared in outlets D and E and then in B and C will have a chance of appearing in A, even though structurally it would seem that coverage in any of these outlets would make a story available to A. Such graph dependency problems are very common in the real world, as actors must continually evaluate their environments and make decisions on how they interact with their neighbors. Game theoretic simulations using network data are one of the few mechanisms available to explore such interdependencies.
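The outlet example can be expressed as a very small simulation. This is a deterministic dependency rule written for illustration, not an epidemic or game theoretic model: an outlet runs a story only once every outlet it depends on has run it, and the rules dictionary below simply encodes the A, B, C, D, E example.

```python
# Hypothetical pickup rules: an outlet runs a story only after every outlet
# it depends on has run it. Outlets without a rule only carry stories they
# are seeded with.
requires = {
    "A": {"B", "C"},
    "B": {"D", "E"},
}

def spread(initial_outlets, rules):
    """Repeat the pickup rules until no further outlet adopts the story."""
    covered = set(initial_outlets)
    changed = True
    while changed:
        changed = False
        for outlet, needed in rules.items():
            if outlet not in covered and needed <= covered:
                covered.add(outlet)
                changed = True
    return covered

print(sorted(spread({"C", "D", "E"}, requires)))  # B follows D and E, then A follows B and C
print(sorted(spread({"C", "D"}, requires)))       # B never runs the story, so neither does A
```

Even in this tiny case, whether A ever sees a story depends on the full chain of prerequisites rather than on which outlets A is linked to, which is the dependency problem described above.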

Visualization and Clustering

The human eye is an extremely powerful pattern-detection tool, and one of the first network methods applied to a new dataset is the creation of a series of visualizations exposing its underlying structure for human inspection. The following three figures show a small sample network of documents (light gray) drawn from a historical document archive and mentions of years (dark gray) found in them. Each mention of a year in a document establishes an edge between the two nodes, connecting the documents based on shared mentions of years. All three displays show the same network, but use different layout algorithms.

The first uses a circular arrangement, in which all nodes are arranged in a circle with connective links drawn through the center. This display makes it easy to identify nodes with large numbers of connections, but makes it hard to discern the overall structure of the network. The second display makes use of a random-scatter arrangement in which the location of each node is selected at random. This makes the network structure slightly more visible than a circular display, but still makes it hard to identify the larger patterns in the network. Finally, the last figure shows a clustered arrangement in which each link is treated like a “spring” that pulls its two nodes together, and the overall set of nodes is treated as a spring system and nodes arranged to minimize the pull between each pair of nodes. The end result of this clustering process is that nodes with high connectivity tend to clump together with their most heavily linked brethren, while bridge nodes that bring together clusters stand out very clearly. Only in this final display does the overall structure of the network become clear.

FIGURE 9.1 Circular network layout

FIGURE 9.2 Random-scatter network layout

FIGURE 9.3 Clustered network layout
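For readers curious how such layouts are computed, the sketch below places nodes on a circle and then runs a simple spring-style layout. It is an assumption-laden illustration, not the algorithm used to produce the figures: linked nodes attract, all pairs repel, movement per step is capped and gradually reduced, and the document and year names are invented. A real project would normally rely on an established graph-drawing package.

```python
import math
import random

def circular_layout(nodes):
    """Place the nodes evenly around a circle of radius 1."""
    ordered = sorted(nodes)
    step = 2 * math.pi / len(ordered)
    return {n: (math.cos(i * step), math.sin(i * step)) for i, n in enumerate(ordered)}

def spring_layout(nodes, edges, iterations=200, seed=0):
    """Start from a random scatter, then let links pull and all pairs push."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in sorted(nodes)}
    linked = {frozenset(edge) for edge in edges}
    ideal = 1.0 / math.sqrt(len(pos))            # preferred edge length
    names = list(pos)
    for step in range(iterations):
        limit = 0.1 * (1 - step / iterations)    # allow smaller moves over time
        move = {n: [0.0, 0.0] for n in pos}
        for i, a in enumerate(names):
            for b in names[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                gap = max(math.hypot(dx, dy), 1e-9)
                force = ideal * ideal / gap      # every pair repels
                if frozenset((a, b)) in linked:
                    force -= gap * gap / ideal   # linked pairs attract
                move[a][0] += dx / gap * force
                move[a][1] += dy / gap * force
                move[b][0] -= dx / gap * force
                move[b][1] -= dy / gap * force
        for n in pos:
            mx, my = move[n]
            length = max(math.hypot(mx, my), 1e-9)
            scale = min(length, limit) / length  # cap the distance moved
            pos[n][0] += mx * scale
            pos[n][1] += my * scale
    return {n: tuple(p) for n, p in pos.items()}

# Hypothetical documents linked to the years they mention.
edges = [("doc1", "1914"), ("doc2", "1914"), ("doc2", "1918"),
         ("doc3", "1918"), ("doc4", "1945")]
nodes = {name for edge in edges for name in edge}

circle = circular_layout(nodes)
springs = spring_layout(nodes, edges)
for name in sorted(springs):
    print(name, springs[name])
```

The random starting positions correspond to the random-scatter display, so the same function illustrates how a clustered layout can be reached by iterating from a scatter. Fixing the seed keeps the output repeatable from run to run.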

Chapter in Summary

Network analysis is a highly specialized form of content analysis that examines the relationships between the words and actors in a document archive, rather than the objects themselves. Some document archives may include metadata that can be used directly to establish edges (such as the To and From fields in an email archive), while entity extraction and co-occurrences can be used to construct a network from freeform text. A variety of measures can be used to quantify the role each node plays in the network, from its centrality to the distance between a pair of nodes. Triad censuses allow the overall connectedness of a network as a whole to be quantified, while evolutionary measures offer insight into how a network is changing over time. Finally, visualization and clustering techniques make it easier for human analysts to make sense of the large-scale structural patterns in a network.
