Preface

Around 2013, natural language processing and chatbots began dominating our lives. At first Google Search had seemed more like an index, a tool that required a little skill to find what you were looking for. But it soon got smarter and began accepting more and more natural language searches. Then smartphone autocomplete began to get sophisticated. The middle button suggestion was often exactly the word you were looking for.[1]

[1] Hit the middle button (https://www.reddit.com/r/ftm/comments/2zkwrs/middle_button_game/) repeatedly on a smartphone predictive-text keyboard to learn what your phone thinks you want to say next. The game was first introduced on Reddit as the “SwiftKey game” (https://blog.swiftkey.com/swiftkey-game-winning-is/) in 2013.

In late 2014, Thunder Shiviah and I were collaborating on a Hack Oregon project to mine natural language campaign finance data. We were trying to find connections between political donors. It seemed politicians were hiding their donors’ identities behind obfuscating language in their campaign finance filings. The interesting thing wasn’t that we were able to use simple natural language processing techniques to uncover these connections. What surprised me the most was that Thunder would often respond to my rambling emails with a succinct but apt reply seconds after I hit send. He was using Smart Reply, a Gmail Inbox “assistant” that composes replies faster than you can read your email.

So I dug deeper, to learn the tricks behind the magic. The more I learned, the more these impressive natural language processing feats seemed doable, understandable. And nearly every machine learning project I took on seemed to involve natural language processing.

Perhaps this was because of my fondness for words and fascination with their role in human intelligence. I would spend hours with John Kowalski, my information theorist boss at Sharp Labs, debating whether words even have “meaning.” As I gained confidence, and learned more and more from my mentors and mentees, it seemed like I might be able to build something new and magical myself.

One of the tricks I learned was to iterate through a collection of documents and count how often words like “War” and “Hunger” are followed by words like “Games” or “III.” If you do that for a large collection of texts, you can get pretty good at guessing the next word in a “chain” of words, such as a phrase or sentence. This classical approach to language processing was intuitive to me.

Professors and bosses called this a Markov chain, but to me it was just a table of probabilities: a list of the counts of each word, based on the preceding word. Professors would call this a conditional distribution, the probabilities of words conditioned on the preceding word. The spelling corrector that Peter Norvig built for Google showed how this approach scales well and takes very little Python code.[2] All you need is a lot of natural language text. I couldn’t help but get excited as I thought about the possibilities for doing such a thing on massive free collections of text like Wikipedia or Project Gutenberg.[3]

[2] See the web page titled “How to Write a Spelling Corrector” by Peter Norvig (http://www.norvig.com/spell-correct.html).

[3] If you appreciate the importance of having freely accessible books of natural language, you may want to keep abreast of the international effort to extend copyrights far beyond their original “use by” date: gutenberg.org (http://www.gutenberg.org) and gutenbergnews.org (http://www.gutenbergnews.org/20150208/copyright-term-extensions-are-looming).
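
The counting trick behind that Markov chain fits in a few lines of Python. Here’s a minimal sketch of the “table of probabilities” I just described, using only the standard library; the toy corpus is made up, and any large text collection would work far better:

    from collections import Counter, defaultdict

    # Toy corpus; in practice, iterate over Wikipedia or Project Gutenberg.
    corpus = "the hunger games and the war games led to war and hunger"
    words = corpus.split()

    # Conditional counts: for each word, how often each next word follows it.
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1

    def predict_next(word):
        """Guess the most frequent follower of `word` in the corpus."""
        counts = following[word]
        return counts.most_common(1)[0][0] if counts else None

    print(predict_next("hunger"))  # -> 'games'

Normalize each row of counts by its total and you have exactly the conditional distribution the professors were talking about.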

Then I heard about latent semantic analysis (LSA). It seemed to be just a fancy way of describing some linear algebra operations I’d learned in college. If you keep track of which words occur together in the same documents, you can use linear algebra to group those words into “topics.” LSA could compress the meaning of an entire sentence or even a long document into a single vector. And, when used in a search engine, LSA seemed to have an uncanny ability to return documents that were exactly what I was looking for. Good search engines would do this even when I couldn’t think of the words that might be in those documents!
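
Here’s a minimal sketch of that idea, assuming scikit-learn is installed; the three-document corpus and the choice of two “topic” dimensions are illustrative assumptions, not a recipe from later chapters:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "The hunger games pit the districts against the capitol.",
        "War games and war strategy occupied the generals.",
        "The senate debated funding for the war.",
    ]

    tfidf = TfidfVectorizer()
    term_doc = tfidf.fit_transform(docs)  # sparse document x vocabulary matrix

    svd = TruncatedSVD(n_components=2)    # the linear algebra behind LSA
    topic_vectors = svd.fit_transform(term_doc)
    print(topic_vectors.shape)            # (3, 2): one compact topic vector per doc

Documents about similar things end up with similar topic vectors, even when they share few exact words, which is why LSA-powered search felt so uncanny.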

Then gensim released a Python implementation of Word2vec, making it possible to do semantic math with individual word vectors. And it turned out that this fancy neural network math was equivalent to the old LSA technique if you just split up the documents into smaller chunks. This was an eye-opener. It gave me hope that I might be able to contribute to the field. I’d been thinking about hierarchical semantic vectors for years: how books are made of chapters of paragraphs of sentences of phrases of words of characters. Tomas Mikolov, the Word2vec inventor, had the insight that the dominant semantics of text could be found in the connection between two layers of the hierarchy, between words and 10-word phrases. For decades, NLP researchers had been thinking of words as having components, like niceness and emotional intensity. And these sentiment scores, these components, could be added and subtracted to combine the meanings of multiple words. But Mikolov had figured out how to create these vectors without hand-crafting them, or even defining what the components should be. This made NLP fun!
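
A taste of that semantic math, using gensim’s downloader to fetch the pretrained GoogleNews vectors (a roughly 1.6 GB download, so this sketch assumes you have the bandwidth and patience):

    import gensim.downloader as api

    # Load pretrained Word2vec KeyedVectors (downloads on first use).
    wv = api.load("word2vec-google-news-300")

    # king - man + woman lands near queen
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
    # -> [('queen', 0.71...)]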

About that time, Thunder introduced me to his mentee, Cole. And later others introduced me to Hannes. So the three of us began to “divide and conquer” the field of NLP. I was intrigued by the possibility of building an intelligent-sounding chatbot. Cole and Hannes were inspired by the powerful black boxes of neural nets. Before long they were opening up the black box, looking inside, and describing what they found to me. Cole even used neural nets to build chatbots, to help me out in my NLP journey.

Each time we dug into some amazing new NLP approach it seemed like something I could understand and use. And there seemed to be a Python implementation for each new technique almost as soon as it came out. The data and pretrained models we needed were often included with these Python packages. “There’s a package for that” became a common refrain on Sunday afternoons at Floyd’s Coffee Shop where Hannes, Cole, and I would brainstorm with friends or play Go and the “middle button game.” So we made rapid progress and started giving talks and lectures to Hack Oregon classes and teams.

In 2015 and 2016 things got more serious. As Microsoft’s Tay and other bots began to run amok, it became clear that natural language bots were influencing society. In 2016 I was busy testing a bot that vacuumed up tweets in an attempt to forecast elections. At the same time, news stories were beginning to surface about the effect of Twitter bots on the US presidential election. In 2015 I had learned of a system used to predict economic trends and trigger large financial transactions based only on the “judgment” of algorithms about natural language text.[4] These economy-influencing and society-shifting algorithms had created an amplifying feedback loop. “Survival of the fittest” for these algorithms appeared to favor those that generated the most profit. And those profits often came at the expense of the structural foundations of democracy. Machines were influencing humans, and we humans were training them to use natural language to increase their influence. Obviously these machines were under the control of thinking and introspective humans, but when you realize that those humans are being influenced by the bots, the mind begins to boggle. Could those bots result in a runaway chain reaction of escalating feedback? Perhaps the initial conditions of those bots could have a big effect on whether that chain reaction was favorable or unfavorable to human values and concerns.

[4] See the web page titled “Why Banjo Is the Most Important Social Media Company You’ve Never Heard Of” (https://www.inc.com/magazine/201504/will-bourne/banjo-the-gods-eye-view.html).

Then Brian Sawyer at Manning Publications came calling. I knew immediately what I wanted to write about and who I wanted to help me. The pace of development in NLP algorithms and the aggregation of natural language data continued to accelerate as Cole, Hannes, and I raced to keep up.

The firehose of unstructured natural language data about politics and economics helped NLP become a critical tool in any campaign or finance manager’s toolbox. It’s unnerving to realize that some of the articles whose sentiment is driving those predictions are being written by other bots. These bots are often unaware of each other. The bots are literally talking to each other and attempting to manipulate each other, while the health of humans and society as a whole seems to be an afterthought. We’re just along for the ride.

One example of this cycle of bots talking to bots is illustrated by the rise of fintech startup Banjo in 2015.[5] By monitoring Twitter, Banjo’s NLP engine could detect newsworthy events 30 minutes to an hour before the first Reuters or CNN reporter filed a story. Many of the tweets it was using to detect those events would almost certainly have been favorited and retweeted by several other bots with the intent of catching the “eye” of Banjo’s NLP bot. And the tweets being favorited by bots and monitored by Banjo weren’t just curated, promoted, or meted out according to machine learning algorithms driven by analytics. Many of these tweets were written entirely by NLP engines.[6]

[5]

[6] The 2014 financial report by Twitter revealed that more than 8% of tweets were composed by bots, and in 2015 DARPA held a competition (https://arxiv.org/ftp/arxiv/papers/1601/1601.05140.pdf) to try to detect these bots and reduce their influence on society in the US.

More and more entertainment, advertisement, and financial reporting content generation can happen without requiring a human to lift a finger. NLP bots compose entire movie scripts.[7] Video games and virtual worlds contain bots that converse with us, sometimes talking about bots and AI themselves. This “play within a play” will get ever more “meta” as movies about video games and then bots in the real world write reviews to help us decide which movies to watch. Authorship attribution will become harder and harder as natural language processing can dissect natural language style and generate text in that style.[8]

[7]

[8] NLP has been used successfully to help quantify the style of 16th-century authors like Shakespeare (https://pdfs.semanticscholar.org/3973/ff27eb173412ce532c8684b950f4cd9b0dc8.pdf).

NLP influences society in other, less straightforward ways. NLP enables efficient information retrieval (search), and whatever filters or promotes the pages we see shapes the information we consume. Search was the first commercially successful application of NLP. Search powered faster and faster development of NLP algorithms, which then improved search technology itself. We help you contribute to this virtuous cycle of increasing collective brain power by showing you some of the natural language indexing and prediction techniques behind web search. We show you how to index this book so that you can free your brain to do higher-level thinking, allowing machines to take care of memorizing the terminology, facts, and Python snippets here. Perhaps then you can influence your own culture, for yourself and your friends, with your own natural language search tools.
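
As a small taste of what’s coming, here’s a sketch of the inverted index at the heart of such search tools; the three toy “documents” are made up for illustration:

    from collections import defaultdict

    docs = {
        1: "natural language processing in action",
        2: "machine learning with python",
        3: "processing natural language with python",
    }

    index = defaultdict(set)          # token -> ids of docs containing it
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)

    def search(query):
        """Return ids of documents containing every token in the query."""
        hits = [index[token] for token in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    print(search("natural language"))  # -> {1, 3}

Real search engines layer ranking, stemming, and semantics on top, but at its core the lookup structure is this simple.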

The development of NLP systems has built to a crescendo of information flow and computation through and among human brains. We can now type only a few characters into a search bar, and often retrieve the exact piece of information we need to complete whatever task we’re working on, like writing the software for a textbook on NLP. The top few autocomplete options are often so uncannily appropriate that we feel like we have a human assisting us with our search. Of course we authors used various search engines throughout the writing of this textbook. In some cases these search results included social posts and articles curated or written by bots, which in turn inspired many of the NLP explanations and applications in the following pages.

What is driving NLP advances?

  • A new appreciation for the ever-widening web of unstructured data?
  • Increases in processing power catching up with researchers’ ideas?
  • The efficiency of interacting with a machine in our own language?

It’s all of the above and much more. You can enter the question “Why is natural language processing so important right now?” into any search engine,[9] and find the Wikipedia article full of good reasons.[10]

[9]

[10] See the Wikipedia article “Natural language processing” (https://en.wikipedia.org/wiki/Natural_language_processing).

There are also some deeper reasons. One such reason is the accelerating pursuit of artificial general intelligence (AGI), or Deep AI. Human intelligence may only be possible because we are able to collect thoughts into discrete packets of meaning that we can store (remember) and share efficiently. This allows us to extend our intelligence across time and geography, connecting our brains to form a collective intelligence.

One of the ideas in Steven Pinker’s The Stuff of Thought is that we actually think in natural language.[11] It’s not called an “inner dialog” without reason. Facebook, Google, and Elon Musk are betting that words will be the default communication protocol for thought. They have all invested in projects that attempt to translate thoughts, brain waves, and electrical signals into words.[12] In addition, the Sapir-Whorf hypothesis holds that words affect the way we think.[13] And natural language certainly is the communication medium of culture and the collective consciousness.

[11]

[12] See the Wired magazine article “We are Entering the Era of the Brain Machine Interface” (https://backchannel.com/we-are-entering-the-era-of-the-brain-machine-interface-75a3a1a37fd3).

[13] See the web page titled “Linguistic relativity” (https://en.wikipedia.org/wiki/Linguistic_relativity).

So if natural language is good enough for human brains, and we’d like to emulate or simulate human thought in a machine, then natural language processing is likely to be critical. Plus, there may be important clues to intelligence hidden in the data structures and nested connections between words that you’re going to learn about in this book. After all, you’re going to use these structures and connection networks, which make it possible for an inanimate system to digest, store, retrieve, and generate natural language in ways that sometimes appear human.

And there’s another even more important reason why you might want to learn how to program a system that uses natural language well... you might just save the world. Hopefully you’ve been following the discussion among movers and shakers about the AI Control Problem and the challenge of developing “Friendly AI.”[14] Nick Bostrom,[15] Calum Chace,[16] Elon Musk,[17] and many others believe that the future of humanity rests on our ability to develop friendly machines. And natural language is going to be an important connection between humans and machines for the foreseeable future.

[14]

[15] Nick Bostrom, home page (http://nickbostrom.com/).

[16]

[17] See the web page titled “Why Elon Musk Spent $10 Million To Keep Artificial Intelligence Friendly” (http://www.forbes.com/sites/ericmack/2015/01/15/elon-musk-puts-down-10-million-to-fight-skynet/#17f7ee7b4bd0).

Even once we are able to “think” directly to/with machines, those thoughts will likely be shaped by natural words and languages within our brains. The line between natural and machine language will be blurred just as the separation between man and machine fades. In fact this line began to blur in 1984. That’s the year of the Cyborg Manifesto,[18] making George Orwell’s dystopian predictions both more likely and easier for us to accept.[19], [20]

[18]

[19] Wikipedia on George Orwell’s Nineteen Eighty-Four (https://en.wikipedia.org/wiki/Nineteen_Eighty-Four).

[20] Wikipedia on the year 1984 (https://en.wikipedia.org/wiki/1984).

Hopefully the phrase “help save the world” didn’t leave you incredulous. As you progress through this book, we show you how to build and connect several lobes of a chatbot “brain.” As you do this, you’ll notice that very small nudges to the social feedback loops between humans and machines can have a profound effect, both on the machines and on humans. Like a butterfly flapping its wings in China, one small decimal place adjustment to your chatbot’s “selfishness” gain can result in a chaotic storm of antagonistic chatbot behavior and conflict.[21] And you’ll also notice how a few kind, altruistic systems will quickly gather a loyal following of supporters that help quell the chaos wreaked by shortsighted bots—bots that pursue “objective functions” targeting the financial gain of their owners. Prosocial, cooperative chatbots can have an outsized impact on the world, because of the network effect of prosocial behavior.[22]

[21] A chatbot’s main tool is to mimic the humans it is conversing with, so dialog participants can use that influence to engender both prosocial and antisocial behavior in bots. See the TechRepublic article “Why Microsoft’s Tay AI Bot Went Wrong” (http://www.techrepublic.com/article/why-microsofts-tay-ai-bot-went-wrong).

[22] An example of autonomous machines “infecting” humans with their measured behavior can be found in studies of the impact self-driving cars are likely to have on rush-hour traffic (https://www.enotrans.org/wp-content/uploads/AV-paper.pdf). In some studies, as few as 1 in 10 autonomous vehicles among the cars around you on the freeway can help moderate human driving behavior, reducing congestion and producing smoother, safer traffic flow.

This is how and why the authors of this book came together. A supportive community emerged through open, honest, prosocial communication over the internet using the language that came naturally to us. And we’re using our collective intelligence to help build and support other semi-intelligent actors (machines).[23] We hope that our words will leave their impression in your mind and propagate like a meme through the world of chatbots, infecting others with passion for building prosocial NLP systems. And we hope that when superintelligence does eventually emerge, it will be nudged, ever so slightly, by this prosocial ethos.

[23] Toby Segaran’s Programming Collective Intelligence kicked off my adventure with machine learning in 2010 (https://www.goodreads.com/book/show/1741472.Programming_Collective_Intelligence).
