Chapter 6

Criminal Identity Resolution Using Personal and Social Identity Attributes

A Collective Resolution Approach

Jiexun Li and Alan Wang G, College of Information Science and Technology, Drexel University, Philadelphia, Pennsylvania, USA, Department of Business Information Technology, Virginia Tech, Blacksburg, Virginia, USA

Chapter Outline

6.1 Introduction

Identity resolution is a semantic reconciliation process that determines if a single identity is the same when being described differently [1]. An effective resolution technique is especially important in fighting criminals and terrorists and ensuring national security. The Assistant Director of the FBI Counterterrorism Division testified in a congressional testimony, “We must be able to determine whether an individual is who they purport to be. This is essential in our mission to identify potential terrorists, locate their means of financial support, and prevent acts of terrorism from occurring” [2]. However, this is not a trivial task because a criminal may assume multiple identities using either fraudulent or legitimate means. Terrorists are known to commit identity crimes, namely identity fraud, identity theft, and identity deception, in order to enhance the chances of success in their missions. A report released by the Office of the Coordinator for Counterterrorism describes many cases where terrorists in different countries use false passports and other identifications to facilitate their financial operations and the execution of attacks [3]. The 9/11 Commission Report also describes in detail how terrorists fraudulently acquire travel documents using, for example, photo-substituted passports or blank baptismal certificates. In record management systems that manage massive amounts of identities, duplicated identity references for an individual may also be introduced due to data entry errors. The problem of an individual having multiple identities can easily mislead intelligence and law enforcement investigations and diminish their efforts.

In database and data-mining communities many identity resolution techniques have been developed to tackle this issue. Most traditional techniques of identity resolution only use personal identity attributes such as name and identification numbers to find matching identities because those attributes are often used to describe a person in most record management systems [4,5]. However, personal identity attributes are subject to problems such as deception, fraud, and data quality issues. Their availability and reliability vary across different record management systems. Recent development in generic entity resolution techniques, such as the Probabilistic Relational Model [2] and the Collective Entity Resolution model [6], has prompted the use of contextual information in addition to descriptive attributes of entities. In the context of identity resolution contextual information may refer to one’s social acquaintance and relationships such as one’s roommates and coworkers. Compared to personal identity attributes, social characteristics are difficult to falsify because they may require extraordinary efforts for collaborative deceit [7].

In this chapter we introduce a novel identity resolution technique that takes into account both personal identity and social identity attributes in a collective fashion. Compared to other identity resolution proposals, this technique has the following advantages. Firstly, it can utilize current record management systems without introducing an infrastructure overhaul and costly hardware. Secondly, by linking known identities that supposedly co-refer to one individual, it provides stronger evidence to intelligence and law enforcement investigators and improves the efficacy and efficiency of investigations. Thirdly, it reduces the search load by deduplicating records in a database. Lastly, it will enable new identity resolution techniques by exploring the rich information behind one’s de facto identity.

This chapter is organized as follows. We first review identity theories and existing entity resolution techniques, followed by an introduction to our proposed collective identity resolution technique. We report an empirical evaluation of the performance of our proposed technique that combines both personal and social identity attributes.

6.2 Related Work

In this section we review identity theories from the social science literature and review identity attributes and analytical techniques used in existing identity resolution studies.

6.2.1 Identity Theories

The concept of identity has long been studied in philosophy, psychology, and sociology. It has a complex and changing subjective notion [8]. We consider an individual’s identity to have two basic components, namely a personal identity and a social identity.

A personal identity is one’s self-perception as an individual [9]. It deals with those necessary and sufficient conditions under which self persists over time. For example, people often ask common questions about their personal identities such as “Who am I” and “Where did I come from.” A personal identity may include personal information given at birth (e.g. name, date and place of birth), personal identifiers (e.g. social security number, passport number), physical descriptions (e.g. height, weight), and biometric information (e.g. fingerprint, DNA). A study conducted by the United Kingdom Home Office [10] suggests that identity crimes usually involve the illegal use or alteration of those personal identity components.

A social identity is one’s biographical history that builds up over time. It reflects how a person interacts with the society that he or she belongs to. Social identity theories consist of psychological and sociological views. The psychological view defines a social identity as one’s self-perception as a member of certain social groups such as nation, culture, gender, and employment [11,12]. Based on the psychological view, people within a social group may share some common characteristics that can distinguish themselves from people in other groups. On the other hand, the sociological view focuses on “the relationships between social actors who perform mutually complementary roles (e.g. employer–employee, doctor–patient)” [13]. The emphasis is on the interpersonal relationships among people and the social structure formed based on the relationships [14]. In addition, one’s social context determines the specific roles that the individual takes. For example, a man can have different roles in his family: the father of his children, the son of his parents, and the husband of his wife. An individual’s social identity, in this sense, is defined as the role-based interactions between the individual and people in his or her surrounding social networks. Deaux and Martin [13] integrated the psychological view and sociological view into one framework by regarding them as different levels of social context. The psychological view deals with large-scale social groups, whereas the sociological view deals with the proximate social groups in which members interact with each other. In this framework, a social identity becomes a multi-level concept that involves understanding one’s social groups at various scales.

6.2.2 Identity Attributes Used in Identity Resolution

For identity resolution, we are interested in determining features that can reliably identify an individual. Personal identity attributes such as names are often subject to data quality problems such as deception and data entry errors [5]. The use of identification cards and biometrics has been touted to be reliable identifiers by their proponents [15,16]. However, not only are there public concerns over its privacy implications, but a study also suggests that the connection between the use of national identity cards and antiterrorism is largely intuitive [17]. They investigated 25 countries that had been most adversely affected by terrorism since 1986. Those countries that employed an identity card system and biometric techniques did not show lessened terrorist activities in terms of number of attacks and deaths. O’Neil [18] discussed the shortcomings in biometrics as a counter-terrorist tool. He found that biometrics were not as reliable and fraud-proof as people expected. The lack of a reliable personal identifier prompts us to look for a good set of attributes for more accurate identity resolution than any single identifier.

The identity theories reviewed in the previous section suggest that personal and social identity attributes may complement each other for the purpose of identity resolution. Social identity attributes are derived from people’s social activities and relationships with other individuals. Such attributes exist implicitly in identity management databases and can be extracted for identity resolution. For example, we may infer relationships among identities based on the facts that some identities share the same address or get involved in the same incident. These inferred social attributes might be more reliable that personal identity attributes in that they cannot be easily altered or falsified by an individual. Taking into account social identity features can provide additional evidence when distinguishing one individual from others.

6.2.3 Existing Resolution Techniques

Identity resolution techniques are one type of entity resolution technique specializing in identity management. Entity resolution is also known as record linkage and deduplication in the areas of statistics and database management respectively. Record linkage, originated in the statistics community, identifies records in the same or different databases that refer to the same real-world entity [19]. The very same task is studied as record deduplication in database and artificial intelligence communities [2022]. These techniques consider records comprised of multiple fields. To determine whether or not two records match, they first compare corresponding individual fields between two records. A decision model determines whether or not the two records refer to the same real-world entity by combining all comparison results. Elmagarmid et al. [23] provide a survey on individual field-matching techniques and models that detect duplicate records.

Compared to general entity resolution, identity resolution has its own challenges. Firstly, identity resolution needs to be efficient, especially for security-related tasks. Secondly, the evaluation of an identity resolution technique may be subject to different criteria under different circumstances. For example, the false-positive rate may be less tolerable than the false-negative rate for identity authentication that grants access to a critical infrastructure facility. However, a high false-positive rate may not be of much concern when a detective searches for a suspect with limited information. Lastly, identity matching in intelligence and law enforcement communities suffers greatly from the missing data problem [4]. Those record linkage and deduplication techniques tend to work poorly if many fields are missing from a record. In the following we briefly review existing techniques most appropriate for identity resolution. We categorize them into rule-based and machine learning approaches.

In the literature there have been several rule-based approaches for identity resolution by encoding matching rules specified by human experts. For instance, in a study on cross-jurisdictional data integration, Marshall et al. [24] encode experts’ heuristics into a simple rule: two identity records match only if their first name, last name, and date-of-birth (DOB) values are all identical. Obviously, in the case of data quality problems including missing values, such exact-match heuristics will cause many false-negative decisions and cannot effectively identify matching identity records. Some other approaches allow partial matching to reduce false negatives. The IBM DB2 Identity Resolution included in the Entity Analytic Solutions (EAS) is a leading commercial product designed to manage identity records [1]. It provides a rule-based identity-matching system that associates identity records representing the same person. For a pair of identity records, a matching score is calculated by following a set of rules predefined by human experts. For example, given two identity records with identical dates of birth and last names, the system will resolve them into one if the matching score of their first names is above 70%. The score can be calculated using any of the string comparison techniques reviewed in Ref. [23]. For these rule-based approaches, the rule creation process can be quite time-consuming, and the rules tend to be domain dependent and not portable across different contexts.

An alternative to manual rule coding is machine learning, which can automatically build identity resolution models by learning from a training dataset with annotated matching pairs. Identity resolution techniques based on machine learning can be further categorized into distance-based and probabilistic methods. Distance-based methods define distance/similarity measures for different types of descriptive attributes and combine them into a weighted average distance score. Two identity records whose overall distance is below a predefined threshold will be regarded as a match. Brown and Hagen [25] proposed a data association method for linking criminal records that possibly refer to the same suspect. This method compares two records and calculates a total similarity measure as a weighted sum of the similarity measures of all corresponding feature values. The features used by their method include suspect and modus operandi (MO) descriptions. The method lacks a matching decision model and only provides a list of matching candidates. Wang et al. [5] proposed a record comparison algorithm for detecting deceptive identities by comparing four personal features (name, DOB, social security number (SSN), and address) and combining them into an overall similarity score. A supervised learning process determines a threshold for match decisions using a set of identity pairs labeled by an expert. However, missing values could significantly affect the performance of the record comparison algorithm [4]. Probabilistic methods for identity matching are rooted in the seminal work of Ref. [19]. By posing record linkage as a probabilistic classification problem, they propose a formal framework to label pairs of identities from two different datasets as “match” or “non-match” based on the agreement among different features. Assuming conditional independence among features given the match class, the framework estimates the probabilistic parameters of the record linkage model in an unsupervised fashion. Many later studies were built based upon this work to enrich the probabilistic model [2629]. These studies have shown that the probabilistic models achieve good performance for identity matching. However, the parameters of the probabilistic models may not be accurately estimated in the absence of sufficient training data [30].

Most identity resolution techniques reviewed above rely on personal identity attributes. Although these attributes provide a good basis for match decisions, they are prone to various issues, including entry errors, identity fraud and deception, and missing values. Identity resolution, especially in the crime/terrorist fighting context, is not simply a data quality problem as existing resolution techniques suggest. Solving such a complex problem requires a combination of multiple techniques and needs to be viewed from a network perspective [2]. Individuals are not isolated but interrelated to each other in a society. The social contexts associated with an individual can provide evidence that reveals his/her undeniable identity. Therefore, there is a need to develop identity resolution techniques that also incorporate the social perspective of identities.

6.2.4 Resolution Techniques Enabled by Social Contexts

Many recent studies have started to take into account linkage and contextual information for identity resolution. Köpcke and Rahm [31] categorized entity resolution approaches into attribute value matchers and context matchers. Unlike attribute value matchers, which solely consider descriptive attribute values, context matchers rely on information inferred through social interactions represented in a graph structure. Distance-based methods can combine the attribute similarity and relational similarity into one aggregated score for more reliable matching performance. Ananthakrishna et al. [32] introduced a deduplication method using a dimensional hierarchy (e.g. city–state–country) over the link relations in a customer data warehouse. This method enhances the personal feature comparison technique by only comparing feature values that have the same foreign key dependency. For example, the similarity of two people’s names will be assessed only when both live in the same city. However, that method requires a hierarchy among attribute relationships. In addition, it does not address the impact of the existence of missing values in a database. In a study on reference disambiguation, Kalashnikov et al. [33] incorporated inter-object relationships such as co-affiliation and co-authorship to disambiguate references using a nonlinear optimization model. However, their method cannot easily work as a plug-and-play solution in other entity resolution applications. In addition, probabilistic models that capture social contexts have been proposed for entity resolution. Culotta and McCallum [34] constructed a conditional random field model (CRF) for record deduplication that captures interdependencies between different types of entities. This method, however, fails to model the explicit links among the same type of entities. Pasula et al. [35] proposed a generic probabilistic relational model (PRM) for citation matching. It captures the dependences among entities over a relational database structure through foreign key relationships. This model is built upon the existing relational database structure and can be used as a plug-and-play solution because most current record management systems store records in relational databases. This approach, however, suffers from computational intractability. Bhattacharya and Getoor [36] proposed a graph-based method for entity resolution. It defines a similarity measure that combines corresponding attribute similarities with graph-based relational similarity between each entity reference pair. Their experiments show that the new similarity measure improves performance over the resolution methods that consider only attribute similarities. Furthermore, Bhattacharya and Getoor [6] extended their relational resolution approach to match entities in a collective fashion. Rather than making pairwise entity comparisons, their method collectively merges records into clusters. Through this process, new relational information can be derived and help to support further matching decisions.

The recent developments in entity resolution techniques shows that the relationships among entities can be used to improve resolution performance in addition to attribute distances/similarities. As we reviewed earlier, social relationships play an even more important role in personal identities than other types of entities because they affect one’s psychological and sociological perception of one’s own identity. It is our intention, based on the identity theories, to define social identity attributes that can be used to effectively enhance the performance of identity resolution.

6.3 A Collective Identity Resolution Approach

In this section we introduce an identity resolution technique that utilizes both personal identity attributes and social identity attributes. We define two types of social identity attributes, namely social behavior attributes and social relationship attributes. We will explore different matching strategies when combining personal identity attributes and social attributes. We will determine the optimal matching strategy using an empirical evaluation.

6.3.1 Problem Formulation

We formulate the identity resolution problem as follows. Given a set of identity references, R = {ri}, where each reference r is described by a set of personal identity attributes rA1, rA2, …, rAk. These references correspond to a set of unknown entities E = {ei}. Due to duplicates and deception, multiple references {image } may be co-referent to the same underlying entity e. We use rE to refer to the entity to which reference r refers to. Also, each reference is involved with at least one incident and each incident may involve one or multiple references. We represent each incident as a hyper-edge in which multiple references may co-occur. We describe the co-occurrence with a set of hyper-edges H = {hi}. Each hyper-edge h can also be described by a set of attributes hB1, hB2, …, hBl, and we use hR to denote the set of references involved. A reference r can belong to zero or any number of hyper-edges and we use rH to denote the set of hyper-edges in which r participates. The problem is to recover the hidden set of entities E = {ei} and the entity labels rE for individual references given the observed attributes of the references and their involved hyper-edges. Figure 6.1 illustrates the problem of identity resolution in a graph of five reference nodes connected by three hyper-edges.

image

Figure 6.1 A graphical view of the identity resolution problem.

6.3.2 Identity Resolution Approaches

Each identity resolution approach includes two major components: an attribute similarity measure and a matching strategy. A similarity function measures how similar two references are to each other based on certain attributes. A matching strategy specifies the algorithmic procedure of recovering the underlying entities from the given set of references. In this section we first define the identity attributes that can be used in identity resolution. We also discuss how we measure the similarity for each type of identity attribute. Lastly, we introduce matching strategies that make matching decisions based on the identity attribute similarities.

Identity attributes and similarity measures

To determine whether two references are co-referent, we need to measure their similarity. A greater similarity score implies a higher probability of co-reference or matching, and vice versa. We describe each type of identity attributes and their similarity measures as follows.

Personal identity attributes

Personal identity attributes are personal identifiers that are commonly used to distinguish one person from others. Examples include, but are not limited to, name, date of birth, social security number (SSN), and address. The similarity between these textual attributes can be calculated by editing distance measures such as the Levenstein Distance [37]. We use A to refer to the identity attribute-based approach. If the similarity score simA(ri, rj) is above a threshold, references ri and rj are considered as co-referent. These approaches tend to perform well for references with duplicates and typos, but may not be able to detect intentional deceptions.

Social behavior attributes

Social behavior attributes represent the common characteristics of the social group that one belongs to. They reflect the psychological view of personal identities. Most identity management systems involve a certain type of transaction (e.g. crime incidents, credit card transactions). Therefore, we choose to use transaction-based behavioral patterns to describe the common characteristics of a social group. We denote the set of hyper-edges involving reference r as rH. Then, each hyper-edge rih involving ri is compared with each hyper-edge rjh′ of rj. The overall behavioral similarity between ri and rj is computed by the average of all the hyper-edge pairs’ similarity scores. Thus, the behavioral similarity between two references ri and rj can be defined as:

image (6.1)

where |rH| denotes the number of hyper-edges involving r.

Taking into account both the personal identity similarity and social behavioral similarity, we use B to refer to this particular resolution approach. The overall similarity between two references ri and rj is defined as:

image (6.2)

where 0 ≤ α ≤ 1 and it can be changed to control the weights of the two similarity scores.

Social relationship attributes

Social relationship attributes capture the social contacts that one often interacts with. These attributes reflect the sociological view of personal identities. If two references ri and rj are both related to the same reference rk (e.g. ri and rj co-occur with rk in different hyper-edges), this can be regarded as evidence that these two references are co-referent. We denote the neighborhood of a reference r as Nbr(r). Then, the neighborhood similarity between two references ri and rj can be defined as:

image (6.3)

To compute the neighborhood similarity between references ri and rj, we define hyper-edge neighborhood similarity simN(hi, hj) between two hyper-edges hi and hj as a pairwise match between their references.

For each rhi, the best match to hj is defined as:

image (6.4)

Similarly, for each r′ in hj, the best match to hi is defined as:

image (6.5)

Here, the relational similarity between two hyper-edges hi and hj is computed as the maximum of the similarity score between a reference rhi and a reference r′ ∈ hj. Furthermore, the neighborhood similarity between two references ri and rj is defined as:

image (6.6)

where |rH| still denotes the number of hyper-edges involving r. It is worth noting that the neighborhood of reference r contains r. Thus, if two references ri and rj do not share any common neighbor, their neighborhood similarity simN(ri, rj) is equal to their personal identity attribute similarity simA(ri, rj). In other words, two references having the same/similar neighbors can be regarded as positive evidence to support that they are more likely to be co-referent, whereas two references not having a common neighbor will not be treated as evidence of them not being co-referent.

In addition, we can define a negative constraint based on social relationships: two references that co-occur in the same hyper-edge (e.g. a criminal incident) cannot refer to the same entity.

Furthermore, the relational similarity between two references ri and rj can be defined as a linear combination of the three components personal identity attribute similarity, social behavioral similarity, and social neighborhood similarity:

image (6.7)

where 0 ≤ α, β ≤ α + β ≤ 1 and they can be adjusted to control the importance of the three similarity scores.

Matching strategies

Given a collection of references, our goal is to determine which references co-refer to the same underlying identity and which do not. We find three commonly used strategies in existing resolution techniques, namely pairwise comparison, transitive closure, and collective clustering.

Pairwise comparison

Pairwise comparison is a basic and simple strategy for entity resolution. For each pair of references ri and rj, we can compute the similarity score using one of the above-mentioned functions. If the similarity score sim(ri, rj) is greater than a predefined threshold θ, we conclude that ri and rj are co-referent. We use A, B, and R to denote pairwise comparison approaches using personal identity attributes only, personal identity attributes and social behavior attributes, and all three types of identity attributes respectively.

Transitive closure

The outcome of pairwise comparison tells us whether two references match or not. Strictly speaking, this still has not yet uncovered the unknown entities for the reference collection. Consider a simple example where references ri and rj are considered as a match, and rj and rk are also considered a match, while ri and rk are determined to be a non-match. In this case, the pairwise comparison may produce conflicting resolution outcomes. A simple strategy to uncover the underlying entities is to use transitive closure. In the previous example, even though ri and rk are not sufficiently similar, we still consider them as a co-referent pair because ri matches rj and rj matches rk. This process should be performed iteratively until all transitive closures are reached. Finally, each transitive closure is considered as a distinct entity. Given the pairwise comparison results, computing transitive closures is straightforward and efficient. Obviously, such a merging process lowers the threshold of the matching criteria and therefore tends to reduce false negatives but introduce more false positives. We use A∗, B∗, and R∗ to denote transitive closure approaches using personal identity attributes alone, personal identity attributes and social behavior attributes, and all three types of identity attributes respectively.

Collective clustering

Unlike computing transitive closure, a different strategy of uncovering underlying identities given pairwise similarity scores of references is collective clustering [6]. In particular, a greedy agglomerative clustering algorithm can be used to find the most similar references (or clusters) and merge them. Figure 6.2 shows the pseudo-code for the algorithm.

image

Figure 6.2 Pseudo-code of collective clustering.

A fundamental difference between this clustering algorithm and the other two matching strategies is a similarity function defined on clusters of references. Here, we take an average linkage approach and define the similarity between two clusters ci and cj as the average similarity between each reference in ci and each reference in cj:

image (6.8)

where |cR| represents the number of references in cluster c.

At first, clusters are initialized and each reference is assigned to a different cluster. As clusters merge one cluster may contain multiple references that are determined to be co-referents. The algorithm merges the most similar cluster pair iteratively until the similarity drops below the threshold. Every time two clusters are merged, similarity scores between clusters need to be recomputed to find the new closest cluster pairs. Specifically, merging two clusters ci and cj into one entity should be regarded as new evidence for computing the similarity between other references that co-occur with references in ci and cj. Such an iterative computational procedure distinguishes the clustering algorithm with the other pairwise comparison-based approaches. We use CA, CB, and CR to denote collective clustering approaches using personal identity attributes alone, personal identity attributes and social behavioral attributes, and all three types of identity attributes respectively.

Complexity analysis

Given a collection of n references, a complete pairwise comparison approach considers all possible reference pairs as potential candidates for merging. The iterative clustering approaches require a lot more computation. Apart from the scaling issue, in reality most pairs will be rejected while often only less than 1% of all pairs are true matches. Hence, certain blocking steps can be performed before matching so as to screen out highly unlikely candidate pairs for matching. There are various blocking techniques but they often employ one simple and computationally cheap function to group references into a number of buckets. Buckets can overlap where one reference can belong to multiple buckets. Only reference pairs within the same bucket are considered candidate pairs for comparing and merging, whereas a pair of references from two different buckets is not considered as co-referents. For collective clustering, two clusters must have all of their references belonging to the same bucket to make them a candidate pair to merge. Furthermore, in the agglomerative clustering process, it is not necessary to recompute the similarity scores for every single cluster pair. More strategies for reducing complexity of collective clustering can be found in Ref. [6].

6.4 Experiments

In this study, for illustration purposes, we evaluate our identity resolution approach with different matching strategies using computer-generated synthetic data. We describe our experiments and discuss our findings in this section.

6.4.1 Synthetic Data Generation

We adopted a two-stage data-generation method described in Ref. [6] with some modifications, as described in Figure 6.3.

image

Figure 6.3 Pseudo-code of synthetic data generation.

In the creation stage the algorithm first creates N entities with a personal identity attribute x and a behavior attribute y. We differentiate these two attributes because personal identity attributes can be easily modified or falsified while social behavioral attributes tend to be more consistent and less likely to change significantly over time. Next, we create M linkages among these N entities to represent their underlying social relationships. When we create these links, two entities with a similar social behavioral attribute y are more likely to be related to each other with a probability of image, i.e. two individuals with similar interests are more likely to be related. In the generation stage, we created R hyper-edges that included a set of related entities’ references. An entity may join a hyper-edge of its neighbor with a probability of Pc. Whenever an entity is to join a hyper-edge, we either choose an existing reference of this entity with a probability Pa or create a new reference with its attribute x value following a Gaussian distribution of N(ex, 1). Each reference r joins at least one hyper-edge. Each hyper-edge is also assigned an attribute y, which is equal to the average of all participating entities’ behavior attribute y.

It is worth noting that this synthetic data generator has two major differences compared to the one described in Ref. [6]. Firstly, for each entity, we define an additional attribute y to represent one’s behavioral characteristics that can be reflected and observed in its involved hyper-edges. Secondly, our synthetic data allow a reference to join more than one hyper-edge, while in Ref. [6] each reference only joins a single hyper-edge. This is more realistic in real identity management scenarios and requires more cost in computing similarities between references.

6.4.2 Experiment Design

In our experiments, we compared nine entity resolution approaches for identity resolution based on the categorization we described in the previous section (see Table 6.1). For similarity computation, we defined three different similarity functions that use personal identity attributes alone, personal identity attributes and social behavior attributes, and all three types of identity attributes respectively. For each similarity function, we employed it using one of the three different matching strategies: pairwise comparison, transitive closure, and collective clustering.

Table 6.1

Experiment Design: Identity Attributes vs. Matching Strategy

Image

In our experiments we chose the following parameter values for generating the synthetic data: N = 100, M = 200, R = 500, Pa = 0.8, Pb = 0.9, Pc = 0.5, and the value ranges of x and y were set from 0 to 500. With such parameter settings, we generated 50 sets of synthetic data and compared the nine different resolution approaches.

The personal identity similarity between two references is defined as:

image (6.9)

The similarity between two hyper-edges is defined as:

image (6.10)

The two parameters that control the weights of the three types of similarity functions are set as α = 0.5 and β = 0.25. The threshold of overall similarity score is set as 0.99.

6.4.3 Evaluation Metrics

We evaluated the performance of each resolution approach by checking the correctness of the matching decisions for each reference pair. We followed most identity resolution studies and chose precision, recall and F-measure as our evaluation metrics [6]. As illustrated in Table 6.2, each matching decision on a pair of references is either a “match” (positive) or a “non-match” (negative). A decision can be either true or false.

Table 6.2

Possible Outcomes of Matching Decisions

Image

Based on the decision outcomes, precision, recall, and F-measure are defined as follows:

image (6.11)

image (6.12)

image (6.13)

6.4.4 Results and Discussion

Figure 6.4 summarizes the performance of the nine resolution approaches. Among the nine approaches, CB achieved the highest precision (92.92%) with the lowest recall (78.66%), A∗ achieved highest recall (99.49%) but the lowest precision (18.81%) and lowest F-measure (31.57%). The collective relational clustering approach (CR) achieved the highest F-measure of 89.39% because it considered all three types of identity attributes and matched references in a collective and iterative manner.

image

Figure 6.4 Experiment results of nine identity resolution approaches.

With the same matching strategy, the results show that approaches using personal identity attributes alone always achieved higher recall but at the cost of very low precision and F-measure. As social behavioral similarity is taken into account, we observed a significant boost in both precision and F-measure. This shows that social behavior attributes do contribute to the performance improvement of identity resolution by reducing false positives. Furthermore, when social relational attributes were also considered, there were some minor drops in precision but with generally higher F-measure. In particular, the approaches using all three types of identity attributes (R, R∗, and CR) outperformed its baselines in terms of F-measure.

6.5 Concluding Remarks

In this chapter we introduced an identity resolution technique that utilizes both personal identity attributes and social identity attributes. Guided by existing identity theories, we defined and examined three types of identity cues, namely personal identity attributes, social behavior attributes, and social relationship attributes, for identity resolution. We also evaluated three different matching strategies: pairwise comparison, transitive closure, and collective clustering. Our experimental results show that both social behavior and relationship attributes improved the performance of identity resolution as compared to using personal identity attributes alone. The collective relational clustering algorithm achieved the best overall performance in terms of F-measure among all approaches. In order to validate, improve, and operationalize these identity resolution techniques, we still need to test them on large-scale real-world identity datasets.

References

1. Jonas J. Identity resolution: 23 years of practical experience and observations at scale. Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data New York, USA: ACM Press; 2006; p. 718.

2. Mumford E. Problems, knowledge, solutions: Solving complex problems. J Strateg Inf Syst. 1998;79(4):255–269.

3. US Department of State. Country reports on terrorism 2006. <http://www.state.gov/s/ct/rls/crt/2007/index.htm>; 2007.

4. Wang GA, Chen HC, Xu JJ, Atabakhsh H. Automatically detecting criminal identity deception: An adaptive detection algorithm. IEEE Trans Syst Man Cybern A Syst Hum. 2006;36(5):988–999.

5. Wang G, Chen H, Atabakhsh H. Automatically detecting deceptive criminal identities. Commun ACM. 2004;47(3):70–76.

6. Bhattacharya I, Getoor L. Collective entity resolution in relational data. ACM Trans Knowl Discov Data (TKDD). 2007;1(1):5.

7. Donaldson T. A position paper on collaborative deceit. AAAI-94 Workshop on Planning for Interagent Communication, 1994.

8. Finch E. What a Tangled Web We Weave: Identity Theft and the Internet. Collompton, UK: Willan; 2003; 86–104.

9. Cheek JM, Briggs SR. Self-consciousness and aspects of identity. Journal of Research in Personality. 1982;16(4):401–408.

10. United Kingdom Home Office. Identity fraud: A study. In: <http://www.homeoffice.gov.uk/cpd/id_fraud-report.pdf>; 2002.

11. Tajfel H, Turner JC. The Social Identity Theory of Inter-Group Behavior. Chicago: Nelson-Hall; 1986.

12. Turner JC. Some Current Issues in Research on Social Identity and Self-Categorization Theories. Oxford: Blackwell; 1999.

13. Deaux K, Martin D. Interpersonal networks and social categories: Specifying levels of context in identity processes. Soc Psychol Q. 2003;66(2):101–117.

14. Stryker S, Serpe RT. Commitment, Identity Salience, and Role Behavior: Theory and Research Example. New York: Springer; 1982.

15. Conyers R, Sensenbrenner F. Real ID act of 2005. Congressional Record House. 2005;151.

16. Kent ST, Millett LI. IDs – Not That Easy: Questions about Nationwide Identity Systems. Washington, DC: National Academy Press; 2002.

17. Privacy International. Mistaken Identity: Exploring the Relationship between National Identity Cards and the Prevention of Terrorism. London, UK: Privacy International; 2004.

18. O’Neil PH. Complexity and counterterrorism: thinking about biometrics. Stud Confl Terror. 2005;28(6):547–566.

19. Fellegi IP, Sunter AB. A theory for record linkage. J Am Stat Assoc. 1969;64(328):1183–1210.

20. Bilenko M, Mooney R, Cohen W, Ravikumar P, Fienberg S. Adaptive name matching in information integration. IEEE Intell Syst. 2003;18(5):16–23.

21. Hernández MA, Stolfo SJ. The merge/purge problem for large databases. In: Carey MJ, Schneider DA, eds. SIGMOD ’95. San Jose, CA, ACM Press: Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data; 1995;127–138.

22. Monge AE. Matching algorithms within a duplicate detection system. IEEE Data Eng Bull. 2000;23(4):14–20.

23. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Trans Knowl Data Eng. 2007;19(1):1–16.

24. Marshall B, Kaza S, Xu J, et al. Cross-jurisdictional criminal activity networks to support border and transportation security. Washington, DC: Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems; 2004.

25. Brown DE, Hagen SC. Data association methods with applications to law enforcement. Decis Support Syst. 2003;34(4):369–378.

26. Dey D, Sarkar S, De P. A probabilistic decision model for entity matching in heterogeneous databases. Manag Sci. 1998;44(10):1379–1395.

27. Ravikumar P, Cohen WW. A hierarchical graphical model for record linkage, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. Arlington, VA: AUAI Press; 2004.

28. Wang GA, Chen H, Atabakhsh H. A multi-layer naïve Bayes model for approximate identity matching, Lecture Notes in Computer Science 3975. San Diego, CA: Springer; 2006; pp. 479–484.

29. Winkler WE. Methods for Record Linkage and Bayesian Networks, Statistical Research Division. Washington, DC: US Census Bureau; 2002; Available at <http://www.amstat.org/Sections/Srms/Proceedings/y2002/Files/JSM2002-000648.pdf>.

30. Nigam K, McCallum AK, Thrun S, Mitchell T. Text classification from labeled and unlabeled documents using EM. Mach Learn. 2000;39(2/3):103–134.

31. Köpcke H, Rahm E. Frameworks for entity matching: A comparison. Data Knowl Eng. 2010;69(2):197–210.

32. Ananthakrishna R, Chaudhuri S, Ganti V. Eliminating fuzzy duplicates in data warehouses. Hong Kong, China: Proceedings of the 28th International Conference on Very Large Data Bases; 2002; pp. 586–597.

33. Kalashnikov DV, Mehrotra S, Chen Z. Exploiting relationships for domain-independent data cleaning. Newport Beach, CA: Proceedings of SIAM International Conference on Data Mining (SDM); 2005.

34. Culotta A, McCallum A. Joint Deduplication of Multiple Record Types in Relational Data. Bremen, Germany: ACM; 2005;. doi 10.1145/1099554.1099615.

35. Pasula H, Marthi B, Milch B, Russell S, Shpitser I. Identity uncertainty and citation matching. Adv Neural Inf Process Syst. 2003;22(3):1425–1432.

36. Bhattacharya I, Getoor L. A latent Dirichlet model for unsupervised entity resolution. Bethesda, MD: SIAM International Conference on Data Mining; 2006; pp. 47–58.

37. Levenshtein VI. Binary codes capable of correcting deletions, insertions, and reversals. Sov Phys Dokl. 1966;10(8):707–710.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset