Chapter 4

Analyzing Music and Video

IN THIS CHAPTER

  • Discovering how to imitate creativity
  • Understanding that deep learning can’t create
  • Developing art and music based on established styles
  • Using GANs to generate art based on existing styles

You can find considerable discussion online about whether computers can be creative by employing deep learning. The dialogue goes to the very essence of what it means to be creative. Philosophers and others have discussed the topic endlessly throughout human history without arriving at a conclusion as to what, precisely, creativity means. Consequently, a single chapter in a book written in just a few months won’t solve the problem for you.

However, to provide a basis for the discussions in this chapter, this book defines creativity as the ability to define new ideas, patterns, relationships, and so on. The emphasis is on new: the originality, progressiveness, and imagination that humans provide. It doesn’t include copying someone else’s style and calling it one’s own. Of course, this definition will almost certainly raise the ire of some while garnering the accepting nods of others, but to make the discussion work at all, you need a definition. Mind you, this definition doesn’t exclude creativity by nonhumans. For example, some people can make a case for creative apes (see http://www.bbc.com/future/story/20140723-are-we-the-only-creative-species for more details).

Creativity and computers can definitely come together in a fascinating collaboration. As you know, computers rely on math to do everything, and their association with art and music is no exception. A computer can transfer existing art or music patterns to a neural network and use the result to generate something that looks new but actually relies on the existing pattern. Generative Adversarial Networks (GANs) are the best available technology for this task of transferring patterns to neural networks today, but you can count on other technologies appearing in the future.

Computers don’t perform the tasks involved in outputting art on their own; they rely on a human to provide the means to accomplish such tasks. For example, a human designs the algorithm that the computer uses to perform the statistical analysis of the patterns. Moreover, a human decides which artistic style to mimic, and a human defines what sort of output might prove aesthetically pleasing. In short, the computer ends up being a tool in the hands of an exceptionally smart human to automate the process of creating what could be deemed as new, but really isn’t.

As part of exploring why some people see a computer as creative, this chapter also explains how computers mimic an established style. You can see for yourself that deep learning relies on math to perform a task generally not associated with math at all. An artist or musician doesn’t rely on calculations to create something new, but could rely on calculations to see how others performed their task. When an artist or musician employs math to study another style, the process is called learning, not creating. Of course, this entire minibook (part of a larger discussion on data science programming) is about how deep learning performs learning tasks, and even that process differs greatly from how humans learn.

Learning to Imitate Art and Life

You have likely seen interesting visions of AI art, such as those mentioned in the article at https://news.artnet.com/art-world/ai-art-comes-to-market-is-it-worth-the-hype-1352011. The art undeniably has aesthetic appeal. In fact, the article mentions that Christie’s, one of the most famous auction houses in the world, originally expected to sell the piece of art for $7,000 to $10,000, but the piece actually sold for $432,000, according to the Guardian (https://www.theguardian.com/artanddesign/shortcuts/2018/oct/26/call-that-art-can-a-computer-be-a-painter) and the New York Times (https://www.nytimes.com/2018/10/25/arts/design/ai-art-sold-christies.html). So not only is this type of art appealing, but it can also generate a lot of money. However, in every unbiased story you read, the question remains as to whether the AI art actually is art at all. The following sections help you understand that computer generation doesn’t correlate to creativity; instead, it translates to amazing algorithms employing the latest in statistics.

Transferring an artistic style

One of the differentiators of art is the artistic style. Even when someone takes a photograph and displays it as art (https://www.wallartprints.com.au/blog/artistic-photography/), the method in which the photograph is taken, processed, and optionally touched up all define a particular style. In many cases, depending on the skill of the artist, you can’t even tell that you’re looking at a photograph because of its artistic elements (https://www.pinterest.com/lorimcneeartist/artistic-photography/?lp=true).

Some artists become so famous for their particular style that others take time to study it in depth to improve their own technique. For example, Vincent van Gogh’s unique style is often mimicked (https://www.artble.com/artists/vincent_van_gogh/more_information/style_and_technique). Van Gogh’s style — his use of colors, methods, media, subject matter, and a wealth of other considerations — requires intense study for humans to replicate. Humans also improvise when they copy, which is why the suffix -esque often appears as a descriptor of a person’s style. A critic might say that a particular artist uses a van Goghesque methodology.

Remember To create art, the computer relies on a particular artistic style to modify the appearance of a source picture. In contrast to a human, a computer can perfectly replicate a particular style given enough consistent examples. Of course, you could create a sort of mixed style by using examples from various periods in the artist’s life. The point is that the computer isn’t creating a new style, nor is it improvising. The source image isn’t new, either. You see a perfectly copied style and a perfectly copied source image when working with a computer, and you transfer the style to the source image to create something that looks a little like both.

The process used to transfer the style to the source picture and produce an output is complex and generates a lot of discussion. For example, considering where source code ends and elements such as training begin is important. The article at https://www.theverge.com/2018/10/23/18013190/ai-art-portrait-auction-christies-belamy-obvious-robbie-barrat-gans discusses one such situation that involves the use of existing code but different training from the original implementation, which has people wondering about issues such as attribution when a computer generates the art. Mind you, all the discussion focuses on the humans who create the code and perform the training of the computer; the computer itself doesn’t figure into the discussion because the computer is simply crunching numbers.

Reducing the problem to statistics

Computers can’t actually see anything, so they can’t analyze images the way humans do; you must solve the problem in another way. Someone takes a digital image of a real-world object or creates a fanciful drawing like the one in Figure 4-1, and each pixel in that image appears as a tuple of numbers representing the red, green, and blue values of that pixel, as shown in Figure 4-2. These numbers, in turn, are what the computer interacts with by using an algorithm. The computer doesn’t understand that the numbers form a tuple; that’s a human convention. All it knows is that the algorithm defines the operations that must take place on the series of numbers. In short, the art becomes a matter of manipulating numbers using a variety of methods, including statistics.


FIGURE 4-1: A human might see a fanciful drawing.


FIGURE 4-2: The computer sees a series of numbers.
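To make the idea concrete, the following minimal sketch shows how a picture becomes nothing more than an array of numbers once the computer loads it. The sketch assumes the Pillow and NumPy libraries and a hypothetical image file named drawing.png; any image file would do.

from PIL import Image
import numpy as np

image = Image.open("drawing.png").convert("RGB")   # hypothetical file name
pixels = np.array(image)                           # shape: (height, width, 3)

# Each pixel is a (red, green, blue) tuple of integers from 0 to 255.
print(pixels.shape)
print(pixels[0, 0])   # for example, [255 255 255] for a pure white pixel

Everything that follows in this chapter, from style transfer to GANs, operates on arrays of numbers such as this one.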

Remember Deep learning relies on a number of algorithms to manipulate the pixels in a source drawing in a variety of ways to reflect the particular style you want to use. In fact, you can find a dizzying array of such algorithms because everyone appears to have a different idea of how to force a computer to create particular kinds of art. The point is that all these methods rely on algorithms that act on a series of numbers to perform the task; the computer never takes brush in hand to actually create something new. Two methods appear to drive the current strategies, though: neural style transfer based on Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs), both of which are described later in this chapter.

Understanding that deep learning doesn’t create

For art created by deep learning, the images are borrowed, the computer doesn’t understand them at all, and the computer relies on algorithms to perform the task of modifying the images. Deep learning doesn’t even choose the method of learning about the images — a human does that. In short, deep learning is an interesting method of manipulating images created by someone else using a style that another person also created.

Remember Whether deep learning can create something isn’t the real question to ask. The question that matters is whether humans can appreciate the result of the deep learning output. Despite its incapacity to understand or create, deep learning can deliver some amazing results. Consequently, creativity is best left to humans, but deep learning can give everyone an expressive tool — even people who aren’t artistic. For example, you could use deep learning to create a van Gogh version of a loved one to hang on your wall. The fact that you participated in the process and that you have something that looks professionally drawn is the point to consider — not whether the computer is creative.

Deep learning is also about automation. A human may lack the ability to translate a vision into reality. However, by using the automation that deep learning provides, such translation may become possible, even predictable. Humans have always relied on tools to overcome deficiencies, and deep learning is just another in a very long line of tools. In addition, the automation that deep learning provides also makes repetition possible, supplying consistent and predictable output from even less skilled humans.

Mimicking an Artist

Deep learning helps you mimic a particular artist. You can mimic any artist you want because the computer doesn’t understand anything about style or drawing. The deep learning algorithm will faithfully reproduce a style based on the inputs you provide (even if you can’t reproduce the style on your own). Consequently, mimicking is a flexible way to produce a particular output, as described in the following sections.

Defining a new piece based on a single artist

Convolutional Neural Networks (CNNs) appear in a number of uses for deep learning applications. For example, they’re used for self-driving cars and facial recognition systems. Book 4, Chapter 3 provides some additional examples of how CNNs do their job, but the point is that a CNN can perform recognition tasks well given enough training.

Interestingly, CNNs work particularly well at recognizing artistic style, so you can combine two pieces of art into a single piece. However, those two pieces supply two different kinds of input for the CNN:

  • Content: The image that defines the desired output. For example, if you provide a content image of a cat, the output will look like a cat. It won’t be the same cat you started with, but the content defines the desired output with regard to what a human will see.
  • Style: The image that defines the desired modification. For example, if you provide an example of a van Gogh painting, the output will reflect that style.

Tip In general, you see CNNs that rely on a single content image and a single style image. Using just the two images like this lets you see how content and style work together to produce a particular output. The example at https://medium.com/mlreview/making-ai-art-with-style-transfer-using-keras-8bb5fa44b216 provides a method for combining two images in this manner.

Of course, you need to decide how to combine the images. In fact, this is where the statistics of deep learning come into play. To perform this task, you use a neural style transfer, as outlined in the paper “A Neural Algorithm of Artistic Style,” by Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge (https://arxiv.org/pdf/1508.06576.pdf or https://www.robots.ox.ac.uk/~vgg/rg/papers/1508.06576v2.pdf).

The algorithm works with three kinds of images: a content image, which depicts the object you want to represent; a style image, which provides the art style you want to mimic; and an input image, which is the image to transform. The input image is usually a random image or a copy of the content image. Transferring the style implies preserving the content (that is, if you start with a photo of a dog, the result still depicts a dog), while the transformed input image moves nearer to the style image in presentation. The algorithm defines two loss measures (a minimal code sketch of both appears after this list):

  • Content loss: Measures how far the generated output drifts from the original content image. Allowing a greater loss here means that the output better reflects the style you provide. However, you can reach a point at which the loss is so great that you can no longer recognize the content.
  • Style loss: Measures how far the generated output remains from the chosen style. A higher level of loss means that the output retains more of its original look and less of the new style, so the style loss must be low enough for you to end up with a new piece of art that reflects the desired style.
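In practice, both losses boil down to arithmetic on the feature maps that the CNN produces for each image. The following minimal sketch, written in plain NumPy rather than any particular deep learning framework, shows one common way to compute them: content loss as a mean squared difference between feature maps, and style loss as a mean squared difference between Gram matrices (the channel-to-channel correlations that capture style). The array shapes and the default loss weights are illustrative assumptions, not values taken from the paper.

import numpy as np

def content_loss(content_features, generated_features):
    # Mean squared difference between the feature maps the CNN produces
    # for the content image and for the generated image at one layer.
    return np.mean((content_features - generated_features) ** 2)

def gram_matrix(features):
    # features: a (height, width, channels) activation map from one layer.
    h, w, c = features.shape
    flat = features.reshape(h * w, c)
    # Channel-to-channel correlations capture style independently of layout.
    return flat.T @ flat / (h * w)

def style_loss(style_features, generated_features):
    # Mean squared difference between the two Gram matrices.
    return np.mean((gram_matrix(style_features) -
                    gram_matrix(generated_features)) ** 2)

# Hypothetical total loss: alpha weights the content, beta weights the style.
def total_loss(content_feat, style_feat, generated_feat, alpha=1.0, beta=1e3):
    return (alpha * content_loss(content_feat, generated_feat) +
            beta * style_loss(style_feat, generated_feat))

Raising beta relative to alpha pushes the output toward the style image; raising alpha preserves more of the content.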

Having just two images doesn’t allow for extensive training, so you use a pretrained deep learning network, such as VGG-19 (a top performer in the 2014 ImageNet challenge, created by the Visual Geometry Group, VGG, at Oxford University). The pretrained deep learning network already knows how to process an image into image features of different complexity. The algorithm for neural style transfer picks the CNN of a VGG-19, excluding the final, fully connected layers. In this way, you have a network that acts as a processing filter for images. When you send in an image, VGG-19 transforms it into a neural network representation, which could be completely different from the original. However, when you use only the top layers of the network as image filters, the network transforms the resulting image but doesn’t completely change it.
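As a concrete illustration, here is a minimal sketch of how you might load a pretrained VGG-19 without its fully connected layers and expose a few intermediate layers as a feature extractor. It assumes a working TensorFlow/Keras installation; the layer names follow the stock Keras VGG-19 implementation, and the choice of which layers serve content and which serve style is just one common convention, not the only possibility.

from tensorflow.keras.applications import VGG19
from tensorflow.keras.models import Model

# Load VGG-19 pretrained on ImageNet, dropping the final fully connected layers.
vgg = VGG19(weights="imagenet", include_top=False)
vgg.trainable = False   # the network acts only as a fixed image filter

# One deeper layer for content, several earlier layers for style.
content_layers = ["block5_conv2"]
style_layers = ["block1_conv1", "block2_conv1", "block3_conv1",
                "block4_conv1", "block5_conv1"]

outputs = [vgg.get_layer(name).output for name in content_layers + style_layers]
feature_extractor = Model(inputs=vgg.input, outputs=outputs)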

Taking advantage of such transformative neural network properties, neural style transfer doesn’t use all the convolutions in the VGG-19. Instead, it monitors them using the two loss measures to ensure that, in spite of the transformations applied to the image, the network maintains the content and applies the style. In this way, when you pass the input image through VGG-19 several times, the pixels of the input image adjust (the pretrained VGG-19 weights stay frozen) to accomplish the double task of content preservation and style learning. After a number of iterations, which actually require a lot of computations and pixel updates, the algorithm transforms your input image into the anticipated combination of content and art style.
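The iterative adjustment just described can be expressed as a short optimization loop. The sketch below is illustrative only: it assumes the feature_extractor from the previous sketch, a content_image tensor that you have already loaded and preprocessed, and a hypothetical compute_total_loss helper that combines the content and style losses with your chosen weights. The point it demonstrates is that the optimizer updates the pixels of the input image, never the VGG-19 weights.

import tensorflow as tf

input_image = tf.Variable(content_image)        # start from the content image
optimizer = tf.optimizers.Adam(learning_rate=0.02)

for step in range(1000):                        # iteration count is illustrative
    with tf.GradientTape() as tape:
        features = feature_extractor(input_image)
        loss = compute_total_loss(features)     # hypothetical helper: content
                                                # loss plus weighted style loss
    # Only the pixels of the input image change; the VGG-19 weights stay frozen.
    grads = tape.gradient(loss, input_image)
    optimizer.apply_gradients([(grads, input_image)])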

Tip You often see the output of a CNN referred to as a pastiche. It’s a fancy word that generally means an artistic piece composed of elements borrowed from motifs or techniques of other artists. Given the nature of deep learning art, the term is appropriate.

Combining styles to create new art

If you really want to get fancy, you can create a pastiche based on multiple style images. For example, you could train the CNN using multiple Monet works so that the pastiche looks more like a Monet piece in general. Of course, you could just as easily combine the styles of multiple impressionist painters to create what appears to be a unique piece of art that reflects the impressionist style in general. The actual method for performing this task varies, but the article at https://ai.googleblog.com/2016/10/supercharging-style-transfer.html offers ideas for accomplishing the task.

Visualizing how neural networks dream

Using a CNN is essentially a manual process with regard to choosing the loss functions. The success or failure of a CNN depends on the human setting the various values. A GAN takes a different approach: It relies on two interacting deep networks to automatically adjust the values and provide better output. These two deep networks have the following names:

  • Generator: Creates an image based on the inputs you provide. The image needs to retain the original content, but with the appropriate level of style to produce a pastiche that is hard to distinguish from an original.
  • Discriminator: Determines whether the generator output is real enough to pass as an original. If not, the discriminator provides feedback telling the generator what is wrong with the pastiche.

To make this setup work, you actually train two models: one for the generator and another for the discriminator. The two act in concert, with the generator creating new samples and the discriminator telling the generator what is wrong with each sample. The process goes back and forth between generator and discriminator until the pastiche achieves a specific level of perfection. In the “Moving toward GANs” section, later in this chapter, you can find an even more detailed explanation about how GANs work.

Tip This approach is advantageous because it provides a greater level of automation and a higher probability of good results than using a CNN. The disadvantage is that this approach also requires considerable time to implement, and the processing requirements are much greater. Consequently, using the CNN approach is often better to achieve a result that’s good enough. You can see an example of the GAN approach at https://towardsdatascience.com/gan-by-example-using-keras-on-tensorflow-backend-1a6d515a60d0.

Using a network to compose music

This chapter focuses mainly on visual art because you can easily judge the subtle changes that occur to it. However, the same techniques also work with music. You can use CNNs and GANs to create music based on a specific style. Computers can’t see visual art, nor can they hear music. The musical tones become numbers that the computer manipulates just as it manipulates the numbers associated with pixels. The computer doesn’t see any difference at all.

However, deep learning does detect a difference. Yes, you use the same algorithms for music as for visual art, but the settings you use are different, and the training is unique as well. In addition, some sources say that training for music is a lot harder than for art (see https://motherboard.vice.com/en_us/article/qvq54v/why-is-ai-generated-music-still-so-bad for details). Of course, part of the difficulty stems from the differences among the humans listening to the music. As a group, humans seem to have a hard time defining aesthetically pleasing music, and even people who like a particular style or particular artists rarely like everything those artists produce.

In some respects, the tools used to compose music using AI are more formalized and mature than those used for visual art. This doesn’t mean that the music composition tools always produce great results, but it does mean that you can easily buy a package to perform music composition tasks; the article referenced in the next paragraph describes some of the most popular offerings available today.

Remember AI music composition is different from visual art generation because the music tools have been around for a longer time, according to the article at https://www.theverge.com/2018/8/31/17777008/artificial-intelligence-taryn-southern-amper-music. The late songwriter and performer David Bowie used an older application called Verbasizer (https://motherboard.vice.com/en_us/article/xygxpn/the-verbasizer-was-david-bowies-1995-lyric-writing-mac-app) in 1995 to aid in his work. The key idea here is that this tool aided in, rather than produced, work. The human being is the creative talent; the AI serves as a creative tool to produce better music. Consequently, music takes on a collaborative feel, rather than giving the AI center stage.

Other creative avenues

One of the more interesting demonstrations of the fact that computers can’t create is in writing. The article at https://medium.com/deep-writing/how-to-write-with-artificial-intelligence-45747ed073c describes a deep learning network used to generate text based on a particular writing style. Although the technique is interesting, the text that the computer generates is nonsense. The computer can’t generate new text based on a given style because the computer doesn’t actually understand anything.

The article at https://www.grammarly.com/blog/transforming-writing-style-with-ai/ provides a more promising avenue of interaction between human and AI. In this case, a human writes the text and the computer analyzes the style to generate something more appropriate to a given situation. The problem is that the computer still doesn’t understand the text. Consequently, the results will require cleanup by a human to ensure reliable results.

Warning To realize just how severe the problem can become when using an AI in certain creative fields, consider the problems that occurred when the New York Times decided to favor technology over humans (see the article at https://www.chronicle.com/blogs/linguafranca/2018/06/14/new-york-times-gets-rid-of-copy-editors-mistakes-ensue/). Without copy editors to verify the text, the resulting paper contains more errors. Of course, you’ve likely seen this problem when a spell checker or a grammar checker fixes your perfectly acceptable prose in a manner that makes it incorrect. Relying on technology to the exclusion of human aid seems like a less than useful solution to the problem of creating truly inspiring text.

Eventually, most humans will augment their creativity using various AI-driven tools. In fact, we’re probably there now. This book benefits from the use of a spelling and grammar checker, along with various other aids. However, the writer is still human, and the book would never make it into print without an entire staff of humans to check the accuracy and readability of the text. When you think of deep learning and its effect on creativity, think augmentation, not replacement.

Moving toward GANs

In 2014, at the Département d’informatique et de recherche opérationnelle at the Université de Montréal, Ian Goodfellow and other researchers (among whom is Yoshua Bengio, one of Canada’s most noted scientists working on artificial neural networks and deep learning) published the first paper on GANs. You can read the work at https://arxiv.org/pdf/1406.2661v1.pdf or https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf. In the following months, the paper attracted attention and was deemed innovative for its proposed mix of deep learning and game theory. The idea became widespread because of its accessibility in terms of neural network architecture: You can train a working GAN using a standard computer. (The technique works better if you can invest a lot of computational power.)

Contrary to other deep learning neural networks that classify images or sequences, the specialty of GANs is their capability to generate new data by deriving inspiration from training data. This capability becomes particularly impressive when dealing with image data, because well-trained GANs can generate new pieces of art that people sell at auctions (such as the artwork sold at Christie’s for nearly half a million dollars, mentioned earlier in this chapter). This feat is even more incredible because previous results obtained using other mathematical and statistical techniques were far from credible or usable.

Finding the key in the competition

The GAN name contains the term adversarial because the key idea behind GANs is the competition between two networks, which play as adversaries against each other. Ian Goodfellow, the principal author of the original paper on GANs, used a simple metaphor to describe how everything works. Goodfellow described the process as an endless challenge between a forger and a detective: The forger has to create a fake piece of art by copying some real art masterpiece, so he starts painting something. After the forger completes the fake painting, a detective examines it and decides whether the forger created a real piece of art or simply a fake. If the detective sees a fake, the forger receives notice that something is wrong with the work (but not where the fault lies). If the forger instead manages to pass the fake off as real, the detective receives notice of the mistake and changes the detection technique to avoid failure during the next attempt. As the forger continues attempts to fool the detective, both the forger and the detective grow in expertise in their respective duties. Given time, the art produced by the forger becomes extremely high in quality and is almost indistinguishable from the real thing except by someone with an expert eye.

Figure 4-3 illustrates the story of GANs as a simple schema, in which inputs and neural architectures interact in a closed loop of reciprocal feedback. The generator network plays the part of the forger, and the discriminator network plays the detective. GANs use the term discriminator because of the similarity in purpose to electronic circuits that accept or reject signals based on their characteristics. The discriminator in a GAN accepts (wrongly) or refuses (correctly) the work created by the generator. The interesting aspect of this architecture is that the generator never sees a single training example; only the discriminator accesses such data in its training. The generator receives random inputs (noise) to provide a random starting point each time, which forces it to produce a different result.

The generator may seem to take all the glory (after all, it generates the data product). However, the real powerhouse of the architecture is the discriminator. The discriminator computes errors that are backpropagated to its own network to learn how best to distinguish between real and fake data. The errors also propagate to the generator, which optimizes itself to cause the discriminator to fail during the next round.


FIGURE 4-3: How a GAN operates.
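To see how the feedback loop in Figure 4-3 translates into code, consider the following minimal training-step sketch using Keras. Everything here is an illustrative assumption chosen for brevity: the tiny dense networks, the flattened 28 x 28 images, and the batch size. A real project would use convolutional architectures and far more training, but the division of labor between the two networks is the same.

import numpy as np
from tensorflow.keras import layers, models

latent_dim = 100   # size of the random noise vector fed to the generator

# Generator: turns random noise into a flattened 28 x 28 image (the forger).
generator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    layers.Dense(28 * 28, activation="sigmoid"),
])

# Discriminator: judges whether a flattened image is real or fake (the detective).
discriminator = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(28 * 28,)),
    layers.Dense(1, activation="sigmoid"),
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

# The combined model trains the generator to fool the (frozen) discriminator.
discriminator.trainable = False
gan = models.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

def train_step(real_images, batch_size=64):
    # The generator never sees real images; it starts from random noise.
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_images = generator.predict(noise, verbose=0)
    # 1. The discriminator learns to label real images 1 and fakes 0.
    discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
    discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
    # 2. The generator updates so that the discriminator calls its fakes real.
    gan.train_on_batch(noise, np.ones((batch_size, 1)))

Calling train_step repeatedly over batches of real images plays out the forger-and-detective contest described earlier: each side improves because of the other’s feedback.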

Remember GANs may seem creative. However, a more correct term would be that they are generative: They learn from examples how data varies, and they can generate new samples as if they were taken from the same data. A GAN learns to mimic a previously existing data distribution; it can’t create something new. As stated earlier in this chapter, deep learning isn’t creative.

Considering a growing field

Starting from the plain-vanilla implementation just described, researchers have grown the GAN idea into a large number of variants that achieve tasks more complex than simply creating new images. The list of GANs and their applications grows every day, and keeping up is difficult. Avinash Hindupur tracks all the variants in a “GAN Zoo,” a task that’s becoming more demanding daily. (You can see the most recent updates at https://github.com/hindupuravinash/the-gan-zoo.) Zheng Liu favors a historical approach instead, and you can see the GAN timeline he maintains at https://github.com/dongb5/GAN-timeline. No matter how you approach GANs, seeing how each new idea sprouts from previous ones is a useful exercise.

Inventing realistic pictures of celebrities

The chief application of GANs is to create images. The first GAN architecture to evolve from the original paper by Goodfellow and others is the DCGAN (Deep Convolutional GAN), which is based on convolutional layers.

DCGANs greatly improved the generative capabilities of the original GANs, and they soon impressed everyone by creating fake images of faces learned from photos of celebrities. Of course, not all the DCGAN-created faces were realistic, but the effort was just the starting point of a rush to create more realistic images. EBGAN-PT, BEGAN, and Progressive GAN are all improvements that achieve a higher degree of realism. You can read the NVidia paper on Progressive GANs to gain a more precise idea of the quality reached by such state-of-the-art techniques: https://research.nvidia.com/publication/2017-10_Progressive-Growing-of.

Another great enhancement to GANs is the conditional GAN (CGAN). Although having a network produce realistic images of all kinds is interesting, it’s of little use when you can’t control the type of output you receive in some way. CGANs manipulate the input and the network to suggest to the GAN what it should produce. Now, for instance, you have networks that produce images of faces of persons who don’t exist, based on your preferences of how hair, eyes, and other details appear, as shown by this demonstrative video by NVidia: https://www.youtube.com/watch?v=kSLJriaOumA.

Enhancing details and image translation

Producing images of higher quality and possibly controlling the output generated has opened the way to more applications. This chapter doesn’t have room to discuss them all, but the following list offers an overview of what you can find:

  • Cycle GAN: Applied to neural transfer style. For example, you can turn a horse into a zebra or a Monet painting into one that appears to come from van Gogh. By exploring the project at https://github.com/junyanz/CycleGAN, you can see how it works and consider the kind of transformations it can apply to images.
  • Super Resolution GAN (SRGAN): Transforms images by making blurred, low-resolution images into clear, high-resolution ones. The application of this technique to photography and cinema is interesting because it improves low-quality images at nearly no cost. You can find the paper describing the technique and results here: https://arxiv.org/pdf/1609.04802.pdf.
  • Pose Guided Person Image Generation: Controls the pose of the person depicted in the created image. The paper at https://arxiv.org/pdf/1705.09368.pdf describes practical uses in the fashion industry to generate more poses of a model, but you might be surprised to know that the same approach can create videos of one person dancing exactly the same as another one: https://www.youtube.com/watch?v=PCBTZh41Ris.
  • Pix2Pix: Translates sketches and maps into real images and vice versa. You can use this application to transform architectural sketches into a picture of a real building or to convert a satellite photo into a drawn map. The paper at https://arxiv.org/pdf/1611.07004.pdf discusses more of the possibilities offered by the Pix2Pix network.
  • Image repairing: Repairs or modifies an existing image by determining what’s missing, deleted, or obscured: https://github.com/pathak22/context-encoder.
  • Face Aging: Determines how a face will age. You can read about it at https://arxiv.org/pdf/1702.01983.pdf.
  • Midi Net: Creates music in your favorite style, as described at https://arxiv.org/pdf/1703.10847.pdf.