APPENDIX D
How to Start Understanding What the Annual Reports Say About Digital with the Help of Natural Language Processing (NLP)

To translate the planned NLP concepts into the empirical approach, I developed a new concept of what I call "quantification levels." The idea behind this concept is to assign higher "levels" the more specific/concrete the outcome is assumed to be. For this book, I ended up using only the two extreme ends (Level 1 and Level 3), but in the original research, a careful second-stage preprocessing of all reports built the foundation for the analysis along all three levels. (See Figure D.1.)

Level 1 analyzed, separately for each 10-K filing date, the number of occurrences (frequency) of digital transformation language dictionary terms per category in the "raw" reports and, for normalization purposes, the relative percentage of occurrences versus total words as a proxy for digital transformation outcomes. In simple words, it counts how often digital terms appear in a report and makes this number comparable across reports by adjusting it for document length.
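As an illustration of this counting logic, the following minimal sketch counts dictionary hits and normalizes them by document length. The term list, the regular-expression tokenizer, and the percentage scaling are illustrative assumptions, not the book's actual dictionary, category structure, or exact normalization.

import re

# Hypothetical subset of digital transformation dictionary terms.
DIGITAL_TERMS = {"digital", "digitization", "e-commerce", "cloud"}

def digital_proxy(report_text: str) -> float:
    """Dictionary hits relative to total words in the report."""
    words = re.findall(r"[a-z'-]+", report_text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in DIGITAL_TERMS)
    return 100.0 * hits / len(words)  # relative frequency in percent

print(digital_proxy("Our digital strategy and cloud migration accelerated."))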

Level 2 (not used for this book) analyzed, separately for each 10-K filing date, the number of occurrences of digital transformation language dictionary terms per category in relationship to explicit statements on monetary or timing impacts, and the relative percentage of these occurrences (versus total Level 1 occurrences) as a proxy for digital transformation outcomes. In simple words, it counts the percentage of Level 1 terms that are further specified by dates or monetary terms in far proximity to the terms (measured in so-called arcs, that is, dependency steps in the sentence syntax until you find the specification). In Level 2, only syntax dependencies of more than four arcs within the same sentence are counted.

FIGURE D.1 Replicable references on three analysis levels. Level 1 (DIGITALPROXY): dictionary occurrences/frequency per report, normalized per total words in each report. Level 2 (D_FAR_M/D): dictionary occurrences in far dependency to temporal (D) or monetary (M) statements, as a percentage of Level 1 occurrences. Level 3 (D_CLOSE_M/D): dictionary occurrences in close dependency to temporal (D) or monetary (M) statements, as a percentage of Level 1 occurrences.

Level 3 analyzed, separately for each 10-K filing date, the number of occurrences of digital transformation language dictionary terms per category in relationship to explicit statements on monetary or timing impacts, and the relative percentage of these occurrences (versus total Level 1 occurrences) as a proxy for digital transformation outcomes. In simple words, it counts the percentage of Level 1 terms that are further specified by dates or monetary terms in close proximity to the terms (measured in so-called arcs, that is, dependency steps in the sentence syntax until you find the specification). In Level 3, only close syntax dependencies of fewer than five arcs within the same sentence are counted.
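The following sketch illustrates how such arc distances can be measured with spaCy's dependency parse: it walks up the ancestor chain of a dictionary term and of the root token of a MONEY or DATE entity and counts the shortest dependency path between them, keeping only hits within four arcs (the Level 3 logic; the Level 2 variant would keep distances above four). The model name, the placeholder term list, and the use of spaCy's pretrained entities are illustrative assumptions, not the book's exact implementation.

import spacy

# Assumes the pretrained model is installed:
#   pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
DIGITAL_TERMS = {"digital", "digitization", "cloud"}  # hypothetical subset

def arc_distance(tok_a, tok_b):
    """Shortest number of dependency arcs between two tokens of one sentence."""
    path_a = [tok_a.i] + [t.i for t in tok_a.ancestors]
    path_b = [tok_b.i] + [t.i for t in tok_b.ancestors]
    for steps_a, idx in enumerate(path_a):
        if idx in path_b:
            return steps_a + path_b.index(idx)
    return None  # no common ancestor found

def close_hits(text, max_arcs=4):
    """Count dictionary terms within max_arcs of a MONEY or DATE entity."""
    doc = nlp(text)
    count = 0
    for sent in doc.sents:
        anchors = [ent.root for ent in doc.ents
                   if ent.label_ in ("MONEY", "DATE")
                   and sent.start <= ent.root.i < sent.end]
        for tok in sent:
            if tok.lower_ not in DIGITAL_TERMS:
                continue
            distances = [arc_distance(tok, a) for a in anchors]
            if any(d is not None and d <= max_arcs for d in distances):
                count += 1
    return count

print(close_hits("We expect digital investments of $2 billion by 2025."))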

For the sake of simplicity, in this book the sublevels of framework categories that would be available in the dataset have been discarded, and only aggregate figures across all framework categories are applied.

After random checks, it became clear that programmatically eliminating obvious company names would be beneficial. This was implemented with an entity search algorithm.
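A minimal sketch of such an entity-based cleanup, assuming spaCy's pretrained named-entity recognizer, might blank out tokens tagged as organizations before counting; the book's actual entity search algorithm is not detailed here.

import spacy

nlp = spacy.load("en_core_web_sm")

def remove_org_names(text: str) -> str:
    """Drop tokens that the NER tags as part of an ORG entity."""
    doc = nlp(text)
    return "".join(tok.text_with_ws for tok in doc if tok.ent_type_ != "ORG")

print(remove_org_names("Digital Realty Trust expanded its digital platform."))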

Following the preceding application of NLP methods via custom-automated Python code, it turned out to be very efficient to leverage further supplementary analysis to generate additional information on the assessed texts. This builds on a recent trend in the literature to look deeper into text sentiments to derive further conclusions and/or controls for subjectivity/objectivity and negative/positive statements in SEC 10-K reports versus their financial impact, usually confirming a relationship between financials and sentiment data (Chouliaras 2015; Li 2006). To generate a first operationalization of the desired text sentiments, custom-developed Python code was applied, strongly building on similar lexical development work (Haritash 2018). The general idea of this pragmatic approach is simply counting the words from a "negative" dictionary as the negative score N (for example, "annulments," "annuls," "anomalies," "anomalous"), counting the words from a "positive" dictionary as the positive score P (for example, "able," "abundance," "acclaimed," "accomplish") (UND 2020), and then calculating a so-called "polarity" score from the relation of these two scores. This score determines whether a given text is positive or negative in nature. It is calculated using the following formula (range is from –1 to +1):

$$\text{polarity} = \frac{P - N}{P + N}$$
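A minimal sketch of this dictionary-based polarity calculation, using only the example words quoted above (the full positive/negative word lists from UND 2020 are not reproduced here):

import re

NEGATIVE = {"annulments", "annuls", "anomalies", "anomalous"}
POSITIVE = {"able", "abundance", "acclaimed", "accomplish"}

def polarity(text: str) -> float:
    """(positive - negative) / (positive + negative), ranging from -1 to +1."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)

print(polarity("We are able to accomplish growth despite anomalies."))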

In addition to the custom-developed approach, the underlying research for this book also used "TextBlob," which ended up as the only tool for all sentiment analysis in this book. TextBlob is a Python library for processing textual data. "It provides API for natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation …" (Loria 2018, website). This work, however, only leverages the prepackaged sentiment functionality to calculate subjectivity and polarity scores for the generated "raw" texts.
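For reference, the prepackaged TextBlob sentiment call that yields both scores looks roughly like this (assuming pip install textblob); the input sentence is purely illustrative.

from textblob import TextBlob

blob = TextBlob("We are confident our digital strategy will deliver strong growth.")
print(blob.sentiment.polarity)      # roughly -1 (negative) to +1 (positive)
print(blob.sentiment.subjectivity)  # 0 (objective) to 1 (subjective)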

TABLE D.1 Overall sample textual analysis summary statistics.

Textual Variables    N        Mean        SD          Min           Max
DIGITALPROXY         30544    2.480327    2.819965     0            30.91132
D_CLOSE_M            30540    0.002627    0.008951     0             0.193548
D_CLOSE_D            30540    0.002943    0.009139     0             0.166667
POLARITY             30544    0.051735    0.013579    -0.0197173     0.141112
SUBJECTIVITY         30544    0.374910    0.018039     0.2806215     0.459227

In summary, the applied textual analysis produced several variables (see Table D.1), sourced via custom-developed Python code directly from the preprocessed 10-Ks.

Obviously, depending on the availability of data points for each analysis, the regressions include only a subset of these observations. For example, because the models leverage the lagged ROA (L1ROA), the number of observations naturally drops closer to the 20,000 range.
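As an illustration of why the lag shrinks the sample, the following pandas sketch (with hypothetical column names) creates L1ROA by shifting ROA within each firm, which leaves the first filing year of every firm without a lagged value.

import pandas as pd

df = pd.DataFrame({
    "cik":  [1, 1, 1, 2, 2],
    "year": [2017, 2018, 2019, 2018, 2019],
    "roa":  [0.05, 0.06, 0.04, 0.10, 0.12],
})
df = df.sort_values(["cik", "year"])
df["L1ROA"] = df.groupby("cik")["roa"].shift(1)
print(df.dropna(subset=["L1ROA"]))  # the first filing year of each firm drops out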
