Till Blesik, Matthias Murawski, Murat Vurucu, and Markus Bick

1 Applying big data analytics to psychometric micro-targeting

Till Blesik, Matthias Murawski, Murat Vurucu, Markus Bick, ESCP Europe Business School, Heubnerweg 8–10, 14059 Berlin, Germany, e-mails: {tblesik, mmurawski, mbick}@escpeurope.eu, [email protected]

Abstract: In this chapter we link two recent phenomena. First, innovations in technology have lowered the cost of data storage and enabled scalable parallel computing. Combined with social media, Internet of Things applications and other sources, large data sets can easily be collected. These data sets are the basis for greatly improving our understanding of individuals and group dynamics. Second, events such as the election of Donald J. Trump as President of the United States of America and the United Kingdom's vote to leave the European Union have shaped public debates on the influence of psychometric micro-targeting of voters. Public authorities, but also other organizations, have a very high demand for information about individuals.

We combine these two streams, namely the enormous amounts of data available and the demand for micro-targeting, aiming to answer the following question: How can big data analytics be used for psychometric profiling? We develop a conceptual framework of how Facebook data might be used to derive the psychometric traits of an individual user. Our conceptual framework includes the Facebook Graph API, a NoSQL MongoDB database for information storage and R scripts that reduce the dimensionality of large data sets by applying latent Dirichlet allocation and determine correlations between the reduced information and psychologically relevant words.

In this chapter we provide a hands-on introduction to psychometric trait analysis and present a scalable infrastructure solution as a proof of concept for the concepts presented here. We discuss two use cases and show how psychometric information, which could, for example, be used for targeted political messages, can be derived from Facebook data. Finally, potential further developments are outlined that could serve as starting points for future research.

Keywords: Big data, Big Five personality traits, Facebook, Politics, Psychometrics, Latent Dirichlet allocation

1.1 Introduction

Technological innovations in the twentieth and twenty-first centuries have had immense impacts on society. The emergence of the Internet and the resulting permanent connectivity of individuals has changed not only the economy but also the way society functions in general [1]. Before online actions became pervasively measurable, data were scarce. That is why statistical inference, analysing a sample drawn from a larger population, was and still is very important: it helps make the best of scarce and expensive data.

Now, in the age of social media, the Internet of Things, e-commerce, online financial services, search engines, navigation systems and cloud computing, data are collected from individuals and machines alike and processed through machine-to-machine interactions [2]. These data can be analysed in terms of joint correlations, which generates even more data, so-called meta-data. A single smartphone alone is already able to provide information about the purchasing habits, transportation preferences and routes, personal preferences and social surroundings of individual users.

It is not far-fetched to imagine public authorities using their applications to listen to spoken words and translate them into written and, therefore, searchable text, mapping all movements and patterns of an individual and using correlations between behaviour and preferences collected from social networks to identify, for instance, potential threats to the state. The international invasions of privacy carried out by the US National Security Agency (NSA) and similar organizations have shaped the public debate and our perception of technology dramatically since the Edward Snowden leaks in 2013. An Orwellian fantasy of mass surveillance seems to have become a reality in the shape of modern government.

Secret services are, however, also arguably relevant institutions within democracies. Based on a judicial system that follows a democratic constitution and aims at protecting a nation’s interests, secret services build their operations on judges and laws and usually use their applications in a political context of checks and balances to categorize and control enemies of the state. The nation’s “interest” and the nation’s “enemies” are terms whose definitions greatly depend on the ideas of the currently elected government. In fact, given the technology already in place, today’s governments are potentially able to build “psychometric” profiles of every single voter so as to influence their voting habits [3].

Psychometrics is a scientific approach to assessing the psychological traits of people. It has its roots in sociobiology [4]. The goal is to obtain a distribution within the population of each of the personality traits of the Big Five personality test. The Big Five traits are extraversion, agreeableness, openness to experience, conscientiousness and emotional stability/neuroticism [5, 6]. There are various ways to explore individual traits, for example by analysing the Facebook likes of a user and other forms of written texts such as status messages. This information can be connected to other demographic data such as age or gender. In the context of politics, messages sent to voters can potentially be adapted, based on the results of a psychometric analysis. A governor advocating lax gun laws would address a young mother in a different way than a gun enthusiast in the National Rifle Association (NRA). The young mother might receive a message advocating lax gun laws so that teachers can carry guns in educational settings to protect her children, while an NRA member would receive a message demonstrating the newest features of a military weapon that should be legalized.

Understanding the psychometric traits of each recipient can help define the content of a message. Psychometric analysis has been used for over a century, but in the context of big data and mass surveillance, it has gained new importance [4].

Based on these thoughts, we will investigate the opportunities for deriving the psychometric traits of individual users from Facebook data. More precisely, the research question can be summarized as follows:

How can Facebook data be extracted, stored, analysed, presented and used for micro-targeting within the Big Five model?

To answer this question, several data sources and calculations are used, as shown in Figure 1.1.

Fig. 1.1: Conceptual flowchart of this chapter

This chapter is structured as follows. The second section introduces the theoretical and historical foundations of psychometric analysis. The third section presents the research methodology while placing a specific focus on both the underlying statistical and technical infrastructure, including a presentation of how Facebook can be used as a data source for our study. We test our conceptual framework in Section 1.4, which includes two use cases covering the preparation of data, the extraction of patterns and corresponding final results. This chapter ends with a discussion of our results, the limitations of our approach and some concluding remarks.

1.2 Psychometrics

This section contains a theoretical overview of the topic of psychometrics. We present its historical emergence and briefly mention some ethical issues related to psychometrics before two general schools, the functional and the trait schools, are discussed. Then the concept of the Big Five personality traits will be introduced. We will show how the Big Five traits are linked to politics and provide some recent research results on this linkage. This section ends with a presentation of new opportunities for psychometric assessment in an increasingly digital world.

1.2.1 Historical emergence and ethical issues

Psychometrics is the science of psychological assessment. In this chapter, we mostly refer to the book by Rust and Golombok (2009), Modern Psychometrics – The Science of Psychological Assessment [4]. This book provides a comprehensive overview of psychometrics as well as a discussion of several practical aspects of the topic. Furthermore, important historical and ethical issues are presented. We summarize them in this subsection.

Generally, psychological assessment has diverse goals. Tests can aim at recruiting the ideal candidates for a job or at creating equality in educational settings by identifying learning disorders. Another, controversial, function is the use of psychometric profiling to build micro-targeted advertising that influences voting habits in democratic elections [3].

The roots of psychometrics reach back long before Darwin’s famous publications On the Origin of Species and The Descent of Man. Talent was assumed to be a divine gift that depends on the mere judgment and plan of God [4]. However, Darwin’s discovery of evolution had a great impact on the human sciences and launched a scientific project with the goal of revealing the impact of nature on human beings. Ever since, a key ambition has been the goal of measuring individual intelligence. “Intelligence is not education but educability. It was perceived as being part of a person’s make-up, rather than socially determined, and by implication their genetic make-up. Intelligence when defined in this way is necessarily genetic in origin” [4, p. 8]. Thus, “for socio-biologists, intelligence test scores reflect more than the mere ability to solve problems: they are related to Darwin’s concepts of ‘survival of the fittest’. And fitness tends to be perceived in terms of images of human perfection. [...] Intelligence viewed from this perspective appears to be a general quality reflecting the person’s moral and human worth, and has been unashamedly related to ethnic differences” [4, p. 16].

The attempts to find the common denominator of intelligence in the genetic make-up of an individual led scientists to the field of eugenics. The central hypothesis of eugenics is degeneration. A given population is degenerating if organisms with undesirable characteristics reproduce more quickly than those with desirable characteristics. Based on this view, eugenicists stated that humans, “by caring for the sick and ‘unfit’, are undergoing dysgenic degeneration, and that scientists should become involved in damage limitation” [4, p. 10]. Eugenics and its goal of selectively breeding humans drew on concepts from the theory of evolution. “The intelligence testing movement at the beginning of the 20th century was not simply like Nazism in its racist aspects – it was its ideological progenitor.” These “[...] ideas entered into the evolutionary theory and were used to dress-up dubious political beliefs in an attempt to give them a pseudo-scientific respectability” [4, p. 17]. After the Second World War and the racist crimes of the Nazis, eugenics was shunned by society. In today’s world, the topic of racism is treated with far more sensitivity, and it is safe to say that the originators of psychometrics did not share this sensitivity.

The sheer endless drive of scientists to find and define measures for intelligence led to the development of statistical methods, including correlation, normalization, standard deviation and factor analysis, which can all be used as methods for psychological assessment. It led further to the development of sets of items that are used for testing and can be compared to each other. These methods have been the foundation of standardized testing, influencing academic and career assessments on a daily basis. Therefore, the function of testing is determining its use, and this function derives from the need in any society to select and assess individuals within it. Given that selection and assessments exist, it is important that they be carried out as properly as possible and that they be studied and understood.

Psychometrics can be defined as the scientific process of selecting and evaluating human beings. But in the modern age we must realize that the ethics, ideology and politics of these selections and assessments are integral parts of psychometrics, as well as statistics and psychology. This concern arises in particular because “any science promoting selection is also by default dealing with rejection, and is therefore intrinsically political” [4, p. 25].

1.2.2 Psychometric schools and testing

In general, there are two schools within psychometrics: the trait school and the functional school [4].

The tests of the functional school are built in a linear way, which means that content areas on the x-axis are mapped against different levels of manifestations on the y-axis. Content areas can be political geography, for example, and manifestations are, for example, ratings on a scale from 1, bad, to 4, very good. An item in this case is basically the question that leads to the manifestation/answer of the content area function. Functional tests are commonly used for the assessment of job applicants or for the selection of candidates for programmes in higher education. The function changes depending on the goal of assessment.

The tests of the trait school, however, try to separate themselves from a purely goal-driven approach by generalizing answers into notions of human intellect and personality. This leads to the belief that personality types are not binary and exclusive; an individual’s personality is rather a mix of many traits, each falling somewhere between that trait’s extremes [4]. The most fascinating difference between the functional school and the trait school is therefore the attempt of the trait school to find the degrees of manifestation of personality types, while the functional school assesses the suitability of an individual for a given task.

Our study examines the trait school’s implications for the assessment of personality traits. We investigate how an individual’s behaviour correlates with that of a larger population with respect to personality traits, and how those traits influence the decisions the individual makes.

A serious problem of testing in both schools is the theory of true scores. It is assumed that an observed score is the sum of the true score and an added error. This error can be based on biases. Bias can be caused by the construction of the test questions. Item biases are rather simple to identify. A test in the USA might be formulated with dollars and cents and would therefore not be appropriate for use in the UK. Linguistic forms of item bias are the most common ones [4]. Another bias is item offensiveness. Offensive items include racism and sexism. Intrinsic test bias exists when the test itself is constructed for a certain group and does not give adequate chances to a group that the test was not constructed for. A simple example is a test made for native English speakers that non-native speakers also have to take. Extrinsic test bias is found when there are actual differences between the social standings of the two groups mentioned in the example of intrinsic test bias. Certain biases are regulated by law, for example racial bias. For instance, in Germany, it is not allowed to select candidates based on their ethnicity. Intrinsic test biases can be regulated by positive discrimination; it is much harder to regulate extrinsic test biases.

Biases are ubiquitous. Facebook, which we will use as our data source in this study, can be biased too. From a statistical point of view, measured scores could be subject to bias, for example when content recommendations are built upon things users have already liked within their so-called filter bubble [7]. Item bias based on linguistic differences also poses a problem.

Differential item functioning (DIF) tests analyse the deviations of answers within and between groups. DIF tests can therefore provide information about the existing intrinsic bias of a test, meaning that DIF tests point to differences between test takers. DIF tests help to explain differences, for example between cultures and sexes.

Finally, it is important to construct a test based on characteristics that help to make measures and items comparable. To achieve this goal, not only is the true score relevant, so too are the reliability, validity, potential standardization and normalization of a test. Constructing a functional test seems to be fairly straightforward. There is a clear goal to achieve, and item sets can be directed towards that goal. Designing a trait-based test, however, seems to be much more difficult. Not only is it important to define measurable personality traits, but the definitions of the traits and the selection of the items to serve the test’s purpose have a lot of potential for biases.

1.2.3 The Big Five traits and politics

One of the biggest challenges of psychometric trait analysis, as described in the previous section, is to define the personality traits to be measured. “The origins of trait theory can be traced back to the development of the IQ testing movement, particularly to the work of Galton and Spearman. From the perspective of trait theory, variation in personality is viewed as continuous, i.e., for a specific personality, characteristics vary along a continuum. The advantage of a trait approach is that a person can be described according to the extent to which he or she shows a particular set of characteristics” [4, p. 150].

The definition of the term personality is still very much debated. An encompassing definition of personality does not exist, nor is one ever likely to emerge. Each definition is based on a different theory that is trying to explain human behaviour in a certain context and therefore contributes to a better understanding of what personality is. In the context of psychological testing, personality can be defined as an individual’s unique constellation of psychological traits and states [8].

In the more specific context of micro-targeting of individuals based on psychometrics within social media, we extend the previously mentioned definition as follows:

Psychometric micro-targeting is the adaptation of content (pictures, videos, sounds and texts) based on an individual’s unique constellation of psychological traits and states, to trigger certain favourable (to the content creator) actions of the content receiver.

A common approach to measuring and defining personality is to use factor analysis [9]. Factor analysis is a vector-based method built on correlations between the manifestations of individuals’ responses to items; it uses vectors to describe influencing factors that, each weighted by a certain multiple, produce the measured output. Factors in psychometrics are basically the hidden influencers of human decisions. It is important to have as few factors as possible and to have common factors throughout psychometrics for the description of human behaviours, to make measured outcomes more comparable.
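As a compact illustration of this idea, the common factor model (a standard textbook formulation, not taken from the cited sources) expresses the vector of observed item responses x as a weighted combination of latent factors plus error:

    x = \Lambda f + \varepsilon

where f is the vector of latent factors (e.g., the Big Five), \Lambda is the matrix of factor loadings (the “certain multiples”) and \varepsilon captures measurement error.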

Progress in the field of psychometrics was made possible by the adoption of the Big Five model as the unifying force of the field of personality.

Donald Winslow Fiske “was the first who noticed that with five factors it was possible to obtain similar factor definitions when different assessment techniques, such as self-ratings, peer-ratings and observer ratings, were used” [4, p. 166]. The Big Five traits are extraversion, agreeableness, conscientiousness, emotional stability/neuroticism and openness to experience [5]. A short overview of these traits is presented in Table 1.1.

Tab. 1.1: The Big Five traits [5, p. 267]

Trait | Definition
Extraversion | ...energetic approach towards the social and material world
Agreeableness | Contrasts a prosocial and communal orientation towards others with antagonism...
Conscientiousness | ...socially prescribed impulse control that facilitates task- and goal-directed behaviour...
Emotional stability | Contrasts...even-temperedness with negative emotionality...
Openness to experience | ...the breadth, depth, originality and complexity of an individual’s mental and experiential life

The Big Five model is a thoroughly argued standard within the psychometric community. Four main reasons support its acceptance. The first is that the five traits show high stability. Secondly, the traits are compatible with a wide range of psychological theories. Thirdly, the five traits occur in many different cultures. Finally, the five traits have a biological basis [4, 10].

Obviously, there are various options with which micro-targeting, applying the Big Five traits, could take place. One might think of opportunities for companies in the field of marketing, for example. However, in this chapter, we consider the context of politics, especially because of its current importance (refer to the examples of election campaigns mentioned in the introductory section).

Generally, making use of the Big Five traits when analysing elections is not a new idea. To name just a few studies, Vecchione et al. (2011) showed for Italy, Spain, Germany, Greece and Poland that the Big Five are linked to party preference. The traits have substantial effects on voting, while socio-demographic characteristics (gender, age, income and educational level) have less influence. The openness trait has been shown to be the most generalizable predictor of party preference across the examined countries. Conscientiousness was also a valid predictor, but its effect was less robust and replicable [11].

Dennison (2015) draws a similar picture for the 2015 general election in the UK: “Undoubtedly the two most consistently found relationships are the positive effect of conscientiousness on right-wing voting and the positive effect of openness to experience on left-wing voting” [12]. The rationales behind this might be that very conscientious people, for whom socially prescribed norms and rules are more important, are rather conservative. In contrast, open-minded people could be characterized as open to unconventional and even unorthodox political approaches, which is generally more associated with left-wing parties. Dennison (2015) also emphasizes that emotional instability tends to have an influence in favour of left-wing parties: “Emotionally unstable individuals are more anxious about their economic future, more desirous of state control, and are less likely to view the status quo in positive terms – all of which theoretically increases the chance of left-wing attitudes” [12]. Figure 1.2 shows the analyses of Dennison (2015) for all Big Five traits. He applies z-scores, which are a numerical measure of a value’s relationship to the mean in a group of values. For example, if a z-score is 0, the value is identical to the mean value. A positive z-score indicates the value is above the mean and a negative score indicates it is below the mean. Considering the case of openness, voters of the right-wing UK Independence Party (UKIP) have the most negative value, which means that they are the most closed ones. In contrast, Green voters have the highest positive openness value, which indicates that they are the most open-minded people in this sample.
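For reference, the z-score of a value x in a group with mean \bar{x} and standard deviation s is defined (a standard formula, not specific to [12]) as

    z = \frac{x - \bar{x}}{s}

so that, for example, a z-score of 1 means the value lies one standard deviation above the group mean.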

Fig. 1.2: Personality traits and party choice in 2015 in UK [12]

Based on these examples, we conclude that there seems to be a link between individual psychometric traits and voting behaviour. The aforementioned studies mainly refer to self-assessment of individuals, meaning that the trait scores are derived from questions asked in surveys. We believe that in times when social media services are one of the main communication channels, these Big Five traits could be extracted from social media behaviour. We outline some general remarks on psychometrics in the digital age in the following subsection before presenting our conceptual approach in Section 1.3.

1.2.4 Psychometrics in the information technology age

The computerization of psychometrics has revolutionary implications. Most mathematical problems of psychometric analysis are based on matrix algebra. Computers are able to perform massive numbers of calculations, for example the matrix inversions essential to factor analysis, simultaneously, repetitively and iteratively on large data sets [4].

Applying psychometrics to Facebook is a challenging project. The underlying mathematical methods are complicated and need to consider potentially each of the more than 1.3 billion users on the Facebook Graph. The collected data need to be related to lexical psychometrics in real time, self-adjusting and self-learning, while working for a purpose such as micro-targeted content delivery. It is therefore plausible to suppose that substantial resources are needed to perform psychometric micro-targeting and that those resources can quickly become the boundary of what is actually possible.

Classical psychometric tests were sometimes conducted by experts to shape the structure of a questionnaire, depending on previously given answers. This procedure can now be automated. “If the decision of which question to present depends on conditions, e.g., utilize the response to item x only if there is a certain response to item y, then the model is non-linear” [4, pp. 203–204]. Non-linearity in mathematics adds a tremendous amount of complexity to the applied algorithms. A non-linear solution, however, can produce the same questionnaire as a linear solution if the same solution is the optimal path. The underlying mathematics of the non-linear and linear structures are equal in terms of statistics; the complexity derives from the decision-making abilities of the non-linear system.

“A neural network trained to recognize the possibility of diverse pathways to the same standards of excellence could potentially outperform any paradigm from classical psychometrics that was by its nature restricted to linear prediction” [4, p. 205]. Thus, it is possible to use a Bayesian approach that always chooses the next content to be shown in a way that maximizes the probability that the assessed individual will make the preferred decision. If the assessed individual makes the preferred, and therefore predicted, decision, the algorithm adjusts itself with a certain success factor. If the assessed individual denies the decision, the algorithm will understand the assessed individual better and adapt the presented content.

While neural network programs can learn from experience to make excellent behavioural predictions, the internal procedures they follow are often much too complicated for any human to understand. A characteristic of non-neural psychometrics is that the models try to identify latent traits and adjust the estimated personality traits of the assessed individual accordingly. The neural psychometric approach does not rely on latent traits; its algorithm constantly screens patterns and changes the underlying assumptions whenever it succeeds. The predictions are purely actuarial. A good neural network has strong predictive power; the disadvantage, though, is that there is hardly a human being who can understand how the predictions are made. “Unlike expert systems, neural networks include no explicit rules and have no justification other than their success in prediction” [4, p. 207].

A machine that is able to predict real-world decision outcomes of individuals has tremendous value. However, even an imperfect solution that describes psychometric profiles is interesting to “personnel and credit agencies, the insurance and marketing industry, social security, the police and intelligence services” [4, p. 198]. That is why it is important to take special care during data collection and analysis. Computer systems are also able to generate reports. “Many computerized testing or scoring programs no longer report mere numbers, to be interpreted by experts, but are able to produce narrative reports in a form that is suitable for the respondent or other end users. Where the test is a profile battery, the computer is able to identify extremes, to interpret these in the light of other subscale scores, and to make recommendations” [4, p. 201].

1.3 Methodological framework

In this section, the underlying statistical methods and a technical framework to integrate all aspects into one system that is able to calculate predicted probabilities for personal traits are presented.

First, the latent Dirichlet allocation (LDA) [13, 14], which is used in topic modelling, is introduced.

Then the implementation of the algorithms in the programming language R is described. The code and a detailed description of the implemented algorithms as well as alternative approaches can be found in Mining Big Data to Extract Patterns and Predict Real-Life Outcomes by Kosinski et al. (2016) [15] and on the related project website http://mypersonality.org [16]. The data available from the myPersonality project are used to calculate the prediction models, which serve as the foundation for later estimations.

Then a set of software components is presented that can be combined to build a working prediction environment. Because there is a multitude of possible webserver software and corresponding plug-ins, the goal is to show one setting in detail and briefly introduce alternatives that can be used to customize the system and adjust it to the needs and prerequisites of different technological landscapes.

Finally, we describe how Facebook data can be integrated in our model while presenting corresponding coding examples.

1.3.1 Latent Dirichlet allocation

The LDA is used for topic modelling [13]. A topic is a collection of words that have different probabilities of appearing in passages discussing the topic. If the topic producing the words in a collection is known, it is possible to guess and assign new words that relate to the given topic. This is done by considering the number of times the word occurs in the discussion of the topic and how common the topic is in the rest of the documents. Human memory capacities are limited. While a human being can understand latent structures in a limited amount of text, the LDA can process a collection of texts far larger than a human being would be able to work through in a reasonably short period of time [14, 17].

One way to explain the underlying mathematics is to visualize it in a simple and reduced Bayesian statistical formula. Figure 1.3 depicts a linear approach in which for each topic Z, the frequency of a word type W in the topic Z is multiplied by the number of other words in document D that already belong to Z. The result represents the probability that the word W came from the topic Z. Depending on which one has the highest probability, the word will be sorted into one of the given topics. This expression is a basic Bayesian formula that describes the conditional probability that a word W belongs to topic Z [17].

P(Z \mid W, D) = \frac{\#(\text{word } W \text{ in topic } Z) + \beta_W}{\text{total tokens in topic } Z + \beta} \times \left(\#(\text{words in } D \text{ that already belong to } Z) + \alpha\right)
Fig. 1.3: Simplified (linear) LDA algorithm [17]

The connection between a word and a topic Z influences the total probability a priori. Based on this, the machine learns by applying topics and the topic-constructing words on more documents. The complexity of the LDA therefore grows when the Bayesian model is translated into vectors, resulting in the need to define functions for the probabilistic distributions relating to the prior assumptions, the indicator function and the a-priori probability. Further complexity is added by expanding a usually two-level Bayesian model into a three-level hierarchical Bayesian model, in which each word is a mixture of underlying topics and each topic is a mixture over an underlying set of topic probabilities. The goal is to assure that the essential statistical relationships between each layer are preserved [13]. The preservation of statistical relationships potentially allows a set of applications that can enhance the model with external information.

Fig. 1.4: Graphical model representation of LDA [13, p. 997]

Figure 1.4 describes the variables defining the LDA function in a graphical representation. A word w is defined as an item from a vocabulary, represented by a unit vector over that vocabulary with a single component equal to one and all other components equal to zero. A document is a sequence of N words. A corpus is a collection of M documents. The variables α and β are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variable θ is a document-level variable, sampled once per document. As indicated in Figure 1.4, there are three levels in the LDA representation: the word, the document and the corpus. The LDA algorithm chooses the number of words N from a Poisson distribution. Each of the N words is assigned to a topic drawn from a multinomial distribution over θ, which itself follows a Dirichlet distribution, and the word is then drawn from a multinomial probability conditioned on the chosen topic [13].
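Written compactly in the notation of [13], the generative process sketched above consists of the following sampling steps for each document:

    N \sim \mathrm{Poisson}(\xi)
    \theta \sim \mathrm{Dir}(\alpha)
    z_n \sim \mathrm{Multinomial}(\theta) \quad \text{for } n = 1, \dots, N
    w_n \sim p(w_n \mid z_n, \beta)

where z_n is the topic assigned to the n-th word and w_n is the word itself.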

Identifying the probability densities based on the multinomial distribution of words relating to topics helps to cluster the relevant words into topics and ideally to generate a “term-by-document matrix [...] that reduces documents of arbitrary length to fixed-length list of numbers” [13, p. 994]. These numbers therefore help to reduce the dimensionality of matrices to relevant “keywords” and “topics”. Concerning the goal of the Big Five personality assessment, it is plausible to expect k = 5 topics. All words in the given Facebook information are clustered iteratively. Words that cannot be assigned to a topic with a certain minimum probability can be deleted, reducing the complexity of the initial user-like matrix that our study, based on the myPersonality project, uses to execute the algorithms.

The LDA allows one to build topics. The resulting topic databases contain words that can be correlated against psychological lexica like the Linguistic Inquiry and Word Count (LIWC) [18]. The correlation coefficients can be used to build Big Five personality trait models in an additional database.

1.3.2 Statistical programming

As the first step, R, a language for statistical computing, should be installed; this will enable users to follow the presented instructions and implement the code on their own. A current version can be downloaded from the R project website: https://www.r-project.org/. As R itself only provides a rudimentary interface, the open-source software RStudio can be installed from https://www.rstudio.com/. It provides a graphical user interface that integrates a code editor, debugging and visualization tools, a documentation browser and additional functions that make it easier to use R.

To begin with, the data sets provided by the myPersonality project must be downloaded. These data were collected by the myPersonality Facebook application “that allowed users to take real psychometric tests, and allowed [us] to record (with consent!) their psychological and Facebook profiles” [16]. The data sets contain the scores of the psychometric tests, the records of the users’ Facebook profiles and item-level data. After saving the .csv files to the R project folder, they can be loaded into the data environment.
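A minimal sketch of this loading step might look as follows; the file names are assumptions and should be adjusted to the names of the downloaded sample files.

    # Load the myPersonality sample files from the R project folder
    users <- read.csv("users.csv")        # userid, gender, age, political views, Big Five scores
    likes <- read.csv("likes.csv")        # likeid and like name
    ul    <- read.csv("users-likes.csv")  # userid/likeid pairs connecting users and likes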

Now, users contains the anonymized userid, gender, age, political views and scores for the five personality traits. In likes, the name and ID of all likes made by users are stored. To be able to connect the two, ul specifies which user made which like by providing pairs of userid and likeid. The next goal is to create a digital user footprint matrix, a matrix in which each like is represented as a column, each user is represented as a row, and the entries contain the value 1 or 0 to indicate the existence or nonexistence of the corresponding like. Therefore, in the ul object, the userid and likeid are matched with their respective rows in users and likes. Afterwards, the sparseMatrix() function is used to create the desired footprint matrix M, where the row names are set to the corresponding userids and the column names are set to the corresponding like names.
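One way this matching and matrix construction might be written is sketched below; column names such as userid, likeid and name are assumptions about the sample files.

    library(Matrix)

    # Match each userid/likeid pair to its row index in users and likes
    ul$user_row <- match(ul$userid, users$userid)
    ul$like_row <- match(ul$likeid, likes$likeid)

    # Build the sparse footprint matrix: rows = users, columns = likes, 1 = like exists
    M <- sparseMatrix(i = ul$user_row, j = ul$like_row, x = 1,
                      dims = c(nrow(users), nrow(likes)))
    rownames(M) <- users$userid
    colnames(M) <- likes$name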

As the resulting matrix contains a large number of unique likes, its density is comparatively low, and the matrix should be trimmed. There are no universally correct thresholds. In general, users with few likes or likes made by only a few users do not significantly contribute to the model’s explanatory power but require more computing time. Using relatively high thresholds, all users with fewer than 50 likes and all likes that were made by fewer than 150 users are removed. Users that are removed from M must also be removed from users.
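A possible implementation of this trimming is sketched below; because removing users changes the like counts (and vice versa), the thresholds are applied repeatedly until both hold.

    # Repeat until every remaining user has >= 50 likes and every like has >= 150 users
    repeat {
      keep_users <- rowSums(M) >= 50
      keep_likes <- colSums(M) >= 150
      if (all(keep_users) && all(keep_likes)) break
      M <- M[keep_users, keep_likes]
    }

    # Keep the users data frame aligned with the trimmed matrix
    users <- users[users$userid %in% rownames(M), ]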

After these steps, a user footprint matrix is available and can be used to find patterns or, stated better, extract patterns and build prediction models. In the field of LDA, clusters are called topics, as the method originates from language processing, where it is used to identify topics [13]. Using the created sparse matrix M as input for the LDA() function from the topicmodels library, an LDA model is computed. To ensure that the same results are obtained as presented in this chapter, R’s random number generator must be seeded with 68. Mlda now contains 50 clusters that were identified based on the user footprint matrix.
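The following sketch shows one way this model fit might be written. LDA() expects a document-term structure with integer counts, so the binary footprint matrix is converted explicitly first; the exact conversion route is an assumption and may differ from the original scripts.

    library(topicmodels)
    library(slam)

    # Convert the binary footprint matrix into a simple triplet matrix with integer counts
    # (note: as.matrix() densifies M, which is acceptable after trimming)
    M_counts <- as.matrix(M)
    storage.mode(M_counts) <- "integer"
    dtm <- as.simple_triplet_matrix(M_counts)

    set.seed(68)                                    # reproducibility, as described in the text
    Mlda <- LDA(dtm, k = 50, control = list(seed = 68))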

To better understand the LDA algorithm and its results, it is helpful to run the preceding computation with k = 5, meaning the algorithm will extract five clusters from the data set, and to take a closer look at the results. While this is helpful for understanding the approach, having more clusters usually increases the precision of the derived prediction model. A detailed discussion of the influence of varying k can be found in Kosinski et al. (2016) [15].

Several strategies can be applied to interpret clusters. Exploring footprints that are associated with the clusters is the first step. The second step is to identify the relationships between dimensions and clusters with user information like demographic information, psychological traits, collected questionnaires or a lexical database like the LIWC [18].

Figure 1.5 shows a heat map of the correlations between the users’ scores on the LDA clusters (the R object gamma) and their psychodemographic traits; it reveals that gender, age, political views and the personality trait of openness (“ope”) are the ones most strongly correlated with the extracted LDA clusters.

Fig. 1.5: Heat map showing correlations between users’ membership in LDA clusters and their traits [15, p. 500]
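A rough sketch of how such a correlation matrix might be computed and plotted is given below; the trait column names are assumptions based on the abbreviations used in this chapter, and the published figure may have been produced with a more elaborate plotting routine.

    # Users' cluster memberships: one row per user, one column per LDA cluster
    gamma <- Mlda@gamma

    # Correlate cluster memberships with (numerically coded) psychodemographic variables
    traits <- data.matrix(users[, c("gender", "age", "political", "ope", "con", "ext", "agr")])
    cors   <- cor(gamma, traits, use = "pairwise.complete.obs")

    # Simple base-R heat map of the correlation matrix
    heatmap(cors, Rowv = NA, Colv = NA, scale = "none")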

To estimate the predictors for an extracted topic, the conditional distribution of the proportions given the data and the prior distribution must be calculated, which is done using the posterior() function. The calculated posterior distribution for the topic is then stored as a data frame in predictors.
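In code, this step might look as follows (a sketch; the topic column names are introduced here only for readability):

    # Posterior topic proportions for each user, used later as regression predictors
    predictors <- as.data.frame(posterior(Mlda)$topics)
    colnames(predictors) <- paste0("topic", seq_len(ncol(predictors)))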

Having the predictors, it is now possible to create linear regression models that specify the factors with which to multiply the users’ topic memberships when calculating the predicted probabilities for variables. The glm() function is used to calculate generalized linear regression models, which are then stored in fit. For the calculation, var must be replaced by one of the abbreviations of the variables (gender, age, political, ope, con, ext, agr), or it can be wrapped in a loop to calculate the models for all variables. To estimate the predicted probabilities for user traits, the predict() function needs a fitted model, in this case a linear regression model, and the predictors as input.
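A hedged sketch of this fitting loop is shown below. It assumes that gender and political views are coded as 0/1 variables, so that a binomial model yields predicted probabilities for them, while the remaining traits are treated as continuous.

    # One generalized linear model per variable: trait ~ all topic-membership columns
    vars  <- c("gender", "age", "political", "ope", "con", "ext", "agr")
    fit   <- list()
    preds <- list()
    for (var in vars) {
      fam          <- if (var %in% c("gender", "political")) binomial() else gaussian()
      fit[[var]]   <- glm(users[[var]] ~ ., data = predictors, family = fam)
      preds[[var]] <- predict(fit[[var]], newdata = predictors, type = "response")
    }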

To acquire reliable information about the quality of an estimation, split the data set into a training set and a testing set. Then the fitted model must be trained independently from the users for which the predicted probabilities are to be calculated. For a detailed description of this process, we refer the reader to the original paper [15].
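As an illustration of such a split (the 80/20 partition is an arbitrary choice), the openness model could be trained and evaluated as follows:

    # Hold out a random 20% of users for testing and train on the remaining 80%
    set.seed(68)
    test_idx   <- sample(nrow(predictors), size = round(0.2 * nrow(predictors)))
    train_pred <- predictors[-test_idx, ]
    test_pred  <- predictors[test_idx, ]

    fit_ope  <- glm(users$ope[-test_idx] ~ ., data = train_pred)
    pred_ope <- predict(fit_ope, newdata = test_pred)

    # Simple accuracy check: correlation between predicted and observed openness
    cor(pred_ope, users$ope[test_idx], use = "complete.obs")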

After these steps, the predicted probabilities for all variables are now stored in preds. To obtain probabilities for a specific variable for a specific user, the variable abbreviation and userid must be specified.
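With the list structure sketched above, such a lookup might be written as follows; the userid shown is a placeholder.

    # Predicted openness score for one specific user
    preds[["ope"]][users$userid == "1234567"]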

1.3.3 Infrastructure

In this subsection, we introduce a possible deployment model that describes the devices and execution environments in each tier. The purpose is to give an example of the configuration of and the possible alternatives for the server infrastructure, including all devices, execution environments, processing elements and connectors of the system’s parts. The prototype system consists of a server infrastructure with a client tier, a web tier, a business tier and a database tier.

Fig. 1.6: Model of conceptual deployment of prototype

Figure 1.6 presents the deployment model of an environment that can collect, evaluate and present information that characterizes an individual’s personality traits in relation to her Facebook likes. As seen in Figure 1.6, the client tier can consist of various devices. It is recommended to use a framework that can provide well-designed interfaces fitting the requesting device (e.g., browser, smartphone or tablet). The information is encoded in JSON to be passed between the different elements within the system. Initially, a device sends a request and the included information is pushed to the web, business and database tiers. Facebook, for example, offers different API calls for user information for browser, smartphone and tablet devices and the functions that are implemented in their respective operating systems.

Once received by the Rest API of the Django Rest Framework, the information is pushed through a Python script that persists and checks the data for accurate formatting. If the format of the incoming data does not match the expectation of the script, an error message is sent back. This step is necessary to be able to react to possible changes in the Facebook API without having to refactor the underlying business logic. To run the Django Rest Framework, an Apache Server is used. The processed user data are stored to the file system and added to the user and user-like .csv files. The R execution environment has access to the .csv files, allowing the system to run the required algorithms, which are implemented as R scripts. The results can be written to the file system, for example using the R function write.csv(). Since the file system should be accessible by other elements of the system, the results can be used to generate output concordant with the user’s personal traits that were predicted by the R environment.

1.3.3.1 Technologies

The technologies used and implemented in the prototype system were chosen according to various functional and non-functional requirements. This subsection gives an overview of the implemented technologies, which will be extended by potential future developments and suggestions for future research.

Python was chosen as the preferred script language for data analysis and as the foundation of the web framework that manages the Rest API. As one of the most widely used languages for scientific applications, Python has a reputation of being easy to learn, is minimalistic and is free of charge. There is a growing international community of scientists and professionals who contribute libraries and frameworks on a regular basis that help build efficient scientific applications. Libraries such as SciPy and the topicmodels package can help to reach a higher level of automation with the LDA algorithm [15]. However, the used LDA algorithm is built in R. Python is only applied within the data approval script that lies within the Django framework.

The Django framework is a Python framework that helps developers build and use fast, secure and scalable interface solutions for the external communications of a server. This helps developers focus on the web application, rather than the development of the basic infrastructure.

Apache Web Server is an open-source HTTP server for operating systems including *NIX and Windows. It is one of the most used web servers. Basically, the server is in charge of delivering all kinds of content. Its functionalities can be extended with modules. The source code of the Django framework is interpreted by the Apache Web Server and then delivered to the client.

R is a programming language for statistical computing. Its extensive libraries and core functionalities make R one of the programming languages most used by statisticians. It is open source, and a large community of mathematicians and software developers has formed around it. Therefore, there are many plug-ins, APIs, modules and manuals, making it a good fit for almost every environment without the need for a large financial investment.

JSON is a lightweight data-interchange format and has become a standard for the transportation of data between different architectures and software interfaces. It is built on two structures: firstly, a collection of name/value pairs, realized as objects, and secondly, an ordered list of values, which can be understood as a vector. From the perspective of algebraic mathematics, relational databases form large-scale matrices, and JSON objects form vectors.

1.3.4 Facebook integration

Facebook’s Graph API Explorer (https://developers.facebook.com/tools/explorer/) is an amazing tool for testing and understanding the functionality and possibilities that the Graph API offers. As shown in Figure 1.7, one can easily request the likes of a user, who must consent to this action, and have them reported as JSON output. In this example, the likes from the account of one of the authors are pulled by GET /v2.9/me?fields=likes, which means that Graph API version 2.9 is used to query the object me for the values of the field likes.

Fig. 1.7: Example of Facebook Graph Explorer tool

Testing commands is an essential step in designing Facebook logins for applications that pull user data in order to drive the analysis presented in this chapter. Online surveys or questionnaires can be integrated directly in Facebook as HTML code, including a Facebook login button that allows researchers to pull and record Facebook user information with the consent of the user [15]. For the purpose of personality trait prediction, the JSON information received from the Graph API can then be preprocessed, formatted and passed to the R execution environment. An example of a software development kit (SDK) that uses the Facebook social Graph APIs is the React Native SDK (https://developers.facebook.com/docs/react-native). React Native is an SDK that was developed by Facebook to create native applications for the iOS and Android mobile operating systems with congruent JavaScript code. React Native saves developers time by allowing one code base to be published for two operating systems. The React Native SDK can be used to pull profile information from mobile application sign-ups. This step is especially interesting for businesses, since sign-ups with Facebook could help them to collect a large amount of essential information about their customers.

The first step in accessing the Graph API is to define a function that allows users to enter their login information and that authorizes the application, granting it access permission in the case of a successful login.

The final step is to remodel the login button. Once permission has been granted, the access token can be used to request the data of the user and parse it as JSON. If the permission is not granted, the function throws an error.
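On the data-processing side, the following R sketch illustrates how such a request might be issued and parsed once a valid access token is available; the token string is a placeholder, and the endpoint follows the Graph API call shown in Figure 1.7.

    library(httr)
    library(jsonlite)

    # Placeholder access token obtained through the login flow described above
    access_token <- "EAAB...placeholder"

    # Request the current user's likes from the Graph API (version 2.9, as in Figure 1.7)
    resp <- GET("https://graph.facebook.com/v2.9/me",
                query = list(fields = "likes", access_token = access_token))
    likes_json <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))

    # The parsed result can now be reshaped and appended to the users/likes .csv files
    str(likes_json)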

1.4 Testing the framework

By now, the theoretical, methodological and technological foundations have been laid. Two cases that demonstrate the actual application will now be presented. In addition, we will discuss the link with the political context, i.e., what political preferences can be assumed and how parties could make use of this information.

1.4.1 Use case 1: I am what I like

Assuming that the likes of a new Facebook user who has not been included in the original data are available, the goal is to predict the probabilities of this user’s personality traits. For this example, a user called Peter was created. A script was then used to randomly assign between 300 and 400 likes from the original likes data set to Peter, resulting in a total of 347 likes. For example, some of the assigned likes are

http://www.facebook.com/pages/West-Coast-Port-Shutdown/100856866698826

Telling Rush Limbaugh he’s Full of Crap (by Leftake.com)

Super Nintendo Entertainment System

As the first step, each new user’s information must be included as a digital footprint in the sparse matrix M_user. Therefore, the new user’s information is added to users.csv and the user-likes .csv file before starting to process the data. To save computation time, it is further recommended to take a look at the newly created sparse matrix and extract a partial sparse matrix that contains only the new users. tail(M_user) shows the end of the matrix and M_user[startrow:endrow,] selects only the specified range of rows. dim(M_user) is helpful for determining the required number of rows.
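A small sketch of this step is given below; it assumes that M_user was rebuilt from the augmented .csv files with the same sparseMatrix() steps as in Section 1.3.2 and that Peter’s row was appended at the end.

    dim(M_user)    # total number of rows and columns after rebuilding the matrix

    # Keep only the newly appended row(s); here, the last row is assumed to be Peter
    M_new <- M_user[nrow(M_user), , drop = FALSE]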

To calculate the predicted probability for a new user, the fitted model that was trained as described in Section 1.3.2 is used. The formula users$var ~ . specifies that the variable var is estimated using all independent variables, i.e., all topic-membership columns. Additionally, predictors for the new user(s) are created. Then both are used as input for predict().
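One way these two steps might be written is sketched below; it reuses the objects introduced earlier and assumes that M_new contains exactly the same like columns, in the same order, as the matrix used to fit Mlda.

    # Topic memberships (predictors) for the new user, inferred from the fitted LDA model
    new_counts <- as.matrix(M_new)
    storage.mode(new_counts) <- "integer"
    new_dtm <- as.simple_triplet_matrix(new_counts)

    predictors_new <- as.data.frame(posterior(Mlda, newdata = new_dtm)$topics)
    colnames(predictors_new) <- colnames(predictors)

    # Predicted trait values for Peter, one per variable, from the previously fitted models
    sapply(vars, function(var) predict(fit[[var]], newdata = predictors_new, type = "response"))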

After running the calculations, predicted values for Peter’s demographic attributes, political views and Big Five traits are obtained.

Based on the randomly assigned likes, the model estimated Peter to be a 38-year-old male Democrat. Taking a closer look at his psychodemographic traits, he can be classified as a relatively open-minded person, leading to the assumption that his political views are left wing. Even though the relationship is not yet as well established in the literature, his relatively low emotional stability (neuroticism) indicates a left-wing orientation as well. For the interpretation of the predicted values, it is helpful to use the previously presented heat map function as well as summary(users) to get an understanding of the possible range of values for each trait and its distributions.

Given the derived political preferences, parties could make use of this information. Considering a scenario in which a left-wing party is able to identify Peter’s political orientation, the party could try to strengthen his political preferences, for example by providing studies that underline the need for social equity. On the other hand, in a scenario in which a right-wing party knows about Peter’s left-wing orientation, the party could try to provide information that emphasizes the need for more conservative actions, thereby trying to weaken Peter’s current commitments. Besides the approach of strengthening those preferences of an individual that seem favourable, a party could be more direct and present information that stresses how the target’s preferences align with those of the party, implying that a vote for this party would be a good choice.

Influencing a person based on individual analysis is not a new phenomenon. Micro-targeting is, for example, very popular and sophisticated when used for marketing purposes. All Internet users have experience with personalized pop-up advertisements that are based on website visits and other previously taken online actions. However, coming back to the example of Peter, there is a very high risk of fake news. In our context, this would mean that a political party uses intentionally misleading or even wrong information to influence Peter.

1.4.2 Use case 2: It started with a like

Besides the possibility of using users’ likes to estimate their predicted user traits, one could start with a like that has been shown to be strongly correlated with a relevant trait and target all users with that specific like. Certainly, this is a very broad-brush approach and will lead to the targeting of many ill-suited individuals. However, taking the constraints of computation time into consideration and, even more importantly, situations in which user consent for the retrieval of personal data such as complete like histories is lacking, this might turn out to be a feasible approach.

Again, this use case is based on extracted LDA clusters. For this description, k = 5 is used to extract five LDA clusters. While more clusters improve the precision when building predictive models, fewer clusters are easier to interpret and their commonalities are easier to understand. The LDA analyses used in this example are the same as those used to create the heat map in Section 1.3.2. Therefore, it is recommended to keep the heat map and the explanation of the clusters in mind while looking at the most strongly correlated likes shown here.

The strength of the correlation between the extracted LDA clusters and the likes contained within these clusters is stored in Mlda@beta. The list top is created and used to store the likes with the highest correlations, found in a for loop using order() to order the likes by the strength of their correlation and then tail() to get the end of the list.
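A sketch of this extraction is shown below; the choice of the ten highest-weighted likes per cluster is illustrative.

    # Mlda@beta holds, for each cluster, the (log) weight of every like in that cluster
    top <- list()
    for (z in 1:nrow(Mlda@beta)) {
      idx      <- tail(order(Mlda@beta[z, ]), 10)  # indices of the ten strongest likes
      top[[z]] <- Mlda@terms[idx]                  # the corresponding like names
    }

    top[[3]]   # e.g., likes most strongly associated with cluster 3 (openness, see heat map)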

The heat map shows that LDA cluster 3 is strongly positively correlated with openness. As openness tends to be a trait of more liberal/left-wing people, one could target Facebook users who like The Beatles, Queen, ... with suitable messages. At the other end, cluster 5 is strongly positively correlated with conscientiousness and, consonant with the theory, negatively with openness. This could imply that users who like Lil Wayne, Eminem, ... are more receptive to conservative/right-wing messages.

The potential use of this information is, however, the same as discussed in use case 1. Parties could present information that strengthens the traits associated with a certain like to all persons with that like or even customize an advertisement that incorporates the topic of the like and shows how the party itself views the topic and embodies the trait in question.

1.5 Discussion and conclusion

In this section we discuss the main contributions of our approach. In addition, its limitations, potential future developments and some concluding remarks are presented.

1.5.1 Contributions to research and practice

Our study makes three main contributions to research and to practice. First, it represents an easy and comprehensive introduction to Big Five psychometric trait assessment; second, it provides a conceptualization and prototyping of an infrastructure that supports a scalable solution for real-time data collection and evaluation; and third, it provides a proof of concept that psychometric traits can be actually used to predict real-life outcomes with a scalable infrastructure.

Our chapter gave a comprehensive introduction to computational psychometric trait analysis based on the Big Five model. It describes the conceptual frameworks of psychometric analysis, the role of word meanings in lexical analysis and how Facebook can be used as a research tool.

Furthermore, an instantiation of a prototype infrastructure was presented, with deployment diagrams and algorithmic methods connecting all the devices, execution environments, files and databases. This prototype describes what a scalable analytical solution for Big Five personality trait assessment could look like and has the potential for further automation and scalability. The prototype could be used by other researchers to set up a server environment for their own future research.

Finally, our chapter laid down a foundation for an objective understanding of computational psychometric trait analysis and can therefore be used as an argument to address issues in the field of privacy infringement and potential issues in the field of election fraud.

1.5.2 Limitations and suggestions for future research

One limitation of our study is its focus on Facebook as the primary data source. Several other social media services might be used to derive psychometric traits. However, we believe that our general approach could easily be adapted to other data sources.

The presented prototype is far from being optimized for speed and would require a fast computation environment to perform just-in-time calculations. In particular, the LDA algorithm and the training of the GLM model take a long time. Both algorithmic optimizations and an improvement of the technological implementation (e.g., in-memory storage of computed models) would be required to establish an efficient and feasible production system.

Having a large data set that combines psychodemographic traits with additional information, such as that provided by the myPersonality project, makes it possible to perform a variety of statistical analyses and prediction models. As the data set is static, it is not evolving, and newly created likes are not included. Given the volatility of today’s technology and online trends in social media, important aspects and newly arising influence factors might not be included in the models.

The main goal of our chapter was to build an initial proof of concept that enables further research and development. Having built a prototype, we consider three main suggestions to be important.

Algorithmic suggestions. Scientists should pay attention to the methods used to establish correlations between the LDA [14] clusters and the psychodemographic traits of users. While we used a simple log-likelihood function, it is advisable to extend the method and to build machine learning algorithms [19], ideally based on neural networks, that are able to include user actions in the interactive re-ranking of trait scores [20]. When adding machine learning concepts for a more automated assessment of individual personality traits [21], it is also advisable to rethink the importance of the priors within the Bayesian statistics [22].
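As a hedged sketch of this suggestion, the following R code predicts an openness score from LDA topic proportions, first with a generalized linear model and then with a small single-hidden-layer neural network; the objects lda_model and big5 and the assumption of six topics are hypothetical placeholders rather than the prototype's actual variables.

library(topicmodels)
library(nnet)

# Training data: per-user topic proportions plus an openness score (assumed)
train <- data.frame(posterior(lda_model)$topics, openness = big5$openness)
names(train)[1:6] <- paste0("topic", 1:6)   # assuming six LDA topics

# Baseline: generalized linear model, as in the prototype's GLM step
glm_fit <- glm(openness ~ ., data = train, family = gaussian())

# Alternative: single-hidden-layer neural network with linear output
nn_fit <- nnet(openness ~ ., data = train, size = 5, linout = TRUE,
               decay = 0.01, maxit = 500, trace = FALSE)

# Compare in-sample mean squared error (held-out data should be used in practice)
mean((train$openness - predict(glm_fit))^2)
mean((train$openness - predict(nn_fit, train))^2)

In a production setting, user actions could be added as further predictors and the models retrained incrementally to support the interactive re-ranking of trait scores.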

Architectural suggestions. The main architectural suggestions in Figure 1.8 concern the elimination of the .csv file system currently used in the prototype. A system based on .csv files is difficult to scale, not least because of the limitations of access rights to such files. The next step in the development of the system would be to separate the database tier from the web and business tiers and to use connectors between these tiers to push and pull data. MongoDB is the most widely used NoSQL database. NoSQL systems were developed to meet the requirements of modern applications that create and process massive volumes of data. One strength of MongoDB is that it easily handles rapidly changing data types and can therefore operate with structured, semi-structured, unstructured or polymorphic data; application code built on MongoDB can also be developed in an agile manner and released multiple times a day. Relational databases, by contrast, were not built to manage the volume, variety and velocity of modern applications [23]. A MongoDB database tier would allow multiple users to access the contained data, and it would be easier to replicate databases or to split the data across different databases and servers. Separating the business tier from the database tier also facilitates the use of tools such as Hadoop and Spark for parallel computing and of methods such as MapReduce [24]. This separation would make the infrastructure scalable and would therefore address the problems of data volume, velocity and volatility. A minimal sketch of such a database connector is given after Figure 1.8.

Fig. 1.8: Potential future deployment model of the next prototype
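The following minimal mongolite sketch indicates how the web and business tiers could push and pull data from such a MongoDB database tier instead of reading and writing shared .csv files; the database, collection, server and field names are hypothetical.

library(mongolite)

# Connector from the business tier to a hypothetical MongoDB database tier
likes <- mongo(collection = "likes",
               db = "psychometrics",
               url = "mongodb://db-server:27017")

# Push data that was previously appended to a .csv file
likes$insert(data.frame(userid = "u123", like_name = "The Beatles"))

# Pull data for analysis instead of reading a shared .csv file
beatles_fans <- likes$find('{"like_name": "The Beatles"}')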

Evaluative suggestions. The proposed architecture is an initial design that is part of an ongoing search process and represents a proof of concept. The next iteration might result in another proof of concept, but a second iteration of the infrastructure could easily include software to measure computing time as well as the efficiency and effectiveness of the algorithms. MongoDB, for example, provides a guide to query optimization that can reduce computing time [23]. The efficiency of an algorithm can be measured during computation. The effectiveness of an algorithm, especially one built on inferential statistics, can additionally be measured with standard tools such as error means, distributions and deviations. Many further possibilities arise from applying different algorithmic models to existing data sets to see which ones best achieve a predefined goal. The effectiveness of an algorithm can be compromised by the unsuitability of the underlying methods, but also by bias in the data. Whatever the output might be, it is clearly preferable to measure the effect of each modification so that the infrastructure and the algorithms can be optimized.
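A minimal R sketch of such measurements could time the model training with system.time() and compare the effectiveness of two candidate models via their root mean squared error on held-out data; the objects train, test and nn_fit are hypothetical and correspond to the earlier algorithmic sketch.

# Measure computing time of the training step
timing <- system.time(
  glm_fit <- glm(openness ~ ., data = train, family = gaussian())
)
print(timing["elapsed"])   # wall-clock seconds for model training

# Effectiveness on held-out data: lower RMSE indicates a better model
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$openness, predict(glm_fit, newdata = test))
rmse(test$openness, predict(nn_fit,  newdata = test))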

1.5.3Concluding remarks

Big data are a mirror of personal identity. They might seem mystical, sometimes intangible, and, for most people, something inconveniently hidden in technologies they only superficially understand. Constantly collected and assessed, data are unavoidably becoming a core asset of any organization or public authority. The goal of our chapter was to demonstrate to the reader that data-driven economies and governments have the ability to collect private information about citizens and to build psychometric analytical models that predict human behaviour. Such analyses enable, for example, political parties to send targeted messages to individuals that vary according to each individual's characteristics, which might result in the manipulation of citizens' will. So far, communities have lacked the judiciousness, adequate jurisprudence and law enforcement structures needed to take effective countermeasures against threats that result from modern technological innovations such as psychometric micro-targeting.

Bibliography

[1]McAfee A and Brynjolfsson E. Machine, platform, crowd: Harnessing our digital future. W. W. Norton & Company, New York, 1st ed., 2017.

[2]Fosso Wamba S, Akter S, Edwards A, Chopin G, and Gnanzou D. How ‘big data’ can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics, 165:234–246, 2015.

[3]Kranish M. Trump’s plan for a comeback includes building a ‘psychographic’ profile of every voter; https://www.washingtonpost.com/politics/trumps-plan-for-a-comeback-includes-building-a-psychographic-profile-of-every-voter/2016/10/27/ (accessed 19.12.2017), 2016.

[4]Rust J and Golombok S. Modern psychometrics: The science of psychological assessment. Routledge, Hove, East Sussex and New York, 3rd ed., 2009.

[5]Gerber AS, Huber GA, Doherty D, and Dowling CM. The Big Five personality traits in the political arena. Annual Review of Political Science, 14(1):265–287, 2011.

[6]Goldberg LR. An alternative “description of personality”: The Big-Five factor structure. Journal of Personality and Social Psychology, 59(6):1216–1229, 1990.

[7]Wong JC, Levin S, and Solon O. Bursting the Facebook bubble: we asked voters on the left and right to swap feeds; https://www.theguardian.com/us-news/2016/nov/16/facebook-bias-bubble-us-election-conservative-liberal-news-feed (accessed 19.12.2017), 2016.

[8]Cohen RJ, Swerdlik ME, and Sturman ED. Psychological testing and assessment: An introduction to tests and measurement. McGraw Hill, New York, 8th ed., international student ed., 2013.

[9]Eysenck HJ. The structure of human personality. Methuen, New York, 1953.

[10]Gerber AS, Huber GA, Doherty D, Dowling CM, and Panagopoulos C. Big Five personality traits and responses to persuasive appeals: Results from voter turnout experiments. Political Behavior, 35(4):687–728, 2013.

[11]Vecchione M, Schoen H, Castro JLG, Cieciuch J, Pavlopoulos V, and Caprara GV. Personality correlates of party preference: The Big Five in five big European countries. Personality and Individual Differences, 51(6):737–742, 2011.

[12]Dennison J. Populist personalities? The Big Five personality traits and party choice in the 2015 UK general election; http://blogs.lse.ac.uk/politicsandpolicy/populist-personalities-the-big-five-personality-traits-and-party-choice-in-the-2015-uk-general-election/ (accessed 19.12.2017), 2015.

[13]Blei DM, Ng AY, Jordan MI, and Lafferty J. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022, 2003.

[14]Blei DM. Probabilistic topic models. Communications of the ACM, 55(4):77, 2012.

[15]Kosinski M, Wang Y, Lakkaraju H, and Leskovec J. Mining big data to extract patterns and predict real-life outcomes. Psychological Methods, 21(4):493, 2016.

[16]myPersonality; http://mypersonality.org/wiki/doku.php (accessed 19.12.2017), 2016.

[17]Underwood T. The Stone and the Shell; https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/ (accessed 19.12.2017), 2012.

[18]Pennebaker Conglomerates, Inc. Linguistic Inquiry and Word Count (LIWC); https://liwc.wpengine.com/ (accessed 19.12.2017), 2017.

[19]Farnadi G, Zoghbi S, Moens MF, and De Cock M. Recognising personality traits using Facebook status updates. In Proceedings of the Workshop on Computational Personality Recognition (WCPR13) at the 7th International AAAI Conference on Weblogs and Social Media (ICWSM13). AAAI, 2013.

[20]Long B and Chang Y. Relevance ranking for vertical search engines. Newnes, Waltham, Massachusetts, USA, 2014.

[21]Park G, Schwartz HA, Eichstaedt JC, Kern ML, Kosinski M, Stillwell DJ, Ungar LH, and Seligman MEP. Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6):934, 2015.

[22]Wallach HM, Mimno DM, and McCallum A. Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems, pages 1973–1981, 2009.

[23]MongoDB, Inc.; https://www.mongodb.com/nosql-explained (accessed 19.12.2017), 2017.

[24]Dean J and Ghemawat S. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
