CNN Sprite Generator

Generating new 2D game assets using a DCGAN.

Summary

A recurring problem in video game development is the amount of novel content that needs to be generated. Inspired by recent work from Horsley and Perez-Liebana, we developed a deep convolutional generative adversarial network (DCGAN) to generate 2D video game sprites. Our model was augmented with a convolutional autoencoder that was pre-trained to render sprites. Our GAN would then learn the latent feature stack of the autoencoder to produce novel sprites. While we observed a decrease in the training time needed to reach initial reasonable results, the approach traded stability for that speed. Overall, our results show promise at creating novel sprites, but further work is needed to solve the stability issue.

Project

Motivation

Developing visual content for games, even simplified 2D games, proves to be a hurdle for indie studios and solo game developers who have neither the time nor the capital to develop captivating environments, engaging character designs, and remarkable particle effects. So, these developers turn to procedural algorithms to assist them in this endeavor. Current general-use algorithms either iterate over the equivalent of building blocks to construct sprites that lack the detail of a hand-crafted image or explore a crafted algorithm using a pseudo-random seed [1]. Alternatively, a developer can construct a deep neural network, such as a generative adversarial network, to produce novel sprites for their game [2]. However, a neural network is not a panacea; it requires a large dataset of images. Additionally, the developer must rely on the network converging to an optimal weight space while simultaneously not overfitting.

Approach

Our solution takes significant inspiration from Lewis Horsley and Diego Perez-Liebana's 2017 paper, Building an Automatic Sprite Generator with Deep Convolutional Generative Adversarial Networks [2]. The rationale for using a GAN over hand-crafting is to generate sprites with little human intervention after model training. Note that further refinement and animation would still need to be done to tailor these sprites to a specific game; as a future step, the sprite pipeline may include animation generation [5]. Our motivation is drawn partially from personal experience, where the hard part of making games often comes down to creating the art, and partially from the literature [1].

One valid critique of the GAN method concerns the sprite's art style. That is, a generated sprite will be a combination of features from the dataset the network was trained on and thus may not have consistent artistic qualities. Our solution will handle this in two parts. First, the dataset will be curated with a rough theme in mind, analogous to a game studio's genre. Second, themes will be taught to an autoencoder, which can be thought of as a pre-trained style domain whose decoder has learned the visual metaphor for sprite generation.

Implementation

Our solution will first train an autoencoder, then flip the encoder and decoder so that the hidden central feature structure serves as both input and output, see Figure 1. After this we will hold the autoencoder's weights constant while training a generator and discriminator. This means the GAN only needs to learn which features to select and discriminate on instead of having to learn the entire image space. Note that a possible limitation of our solution is a potential loss of novelty, with the worst case being mode collapse [6].

Figure 1: High-level network design with autoencoder.
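To make this procedure concrete, below is a minimal Keras sketch of the two-step flow, assuming hypothetical builder functions (build_encoder, build_decoder, build_generator, build_discriminator) that correspond to the sub-network figures in the following sections; the optimizer and loss choices shown are illustrative assumptions rather than our exact settings.

```python
# Sketch of the high-level flow in Figure 1; shapes, optimizers, and losses
# are assumptions for illustration.
from tensorflow.keras import Input, Model

encoder = build_encoder()    # 32x32x4 sprite -> feature stack
decoder = build_decoder()    # feature stack  -> 32x32x4 sprite

# Step 1: pre-train the autoencoder to reconstruct real sprites.
sprite = Input(shape=(32, 32, 4))
autoencoder = Model(sprite, decoder(encoder(sprite)))
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(x_train, x_train, ...)

# Step 2: freeze the pre-trained weights and "flip" the autoencoder so the
# feature stack sits at the GAN's input (via the decoder) and output (via
# the encoder).
encoder.trainable = False
decoder.trainable = False

generator = build_generator()          # 100-d noise -> feature stack
discriminator = build_discriminator()  # feature stack -> real/fake score
discriminator.compile(optimizer='adam', loss='binary_crossentropy')

# Combined model used to update only the generator: noise -> features ->
# rendered sprite -> features -> real/fake.
noise = Input(shape=(100,))
fake_sprite = decoder(generator(noise))
discriminator.trainable = False
combined = Model(noise, discriminator(encoder(fake_sprite)))
combined.compile(optimizer='adam', loss='binary_crossentropy')
```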

Drawing on Horsley and Perez-Liebana's paper [2], we will use a similar DCGAN [4] model, including their use of strided convolution in lieu of max pooling and fractionally strided convolution instead of upsampling, and we will implement batch normalization. Additionally, we are making the following changes:

1.   Our network will generate 32x32x4 pixel images.

2.   We will adjust hyperparameters such as the learning rate and network structure to get the best possible results from our dataset.

3.   We will pre-train the GAN model through autoencoders. Our network will use successive decoders trained through the normal unsupervised process. This should populate the weights with a better chance of generalizing from the input. A potential drawback is a lack of novelty in the output images if the autoencoder overfits during training.

The following images detail the model architecture as sub-networks that correspond to the high-level diagram. The generator feeds a 100-dimensional input vector into a fully connected layer whose output is reshaped into the feature stack shown. The next two layers are transpose convolutional layers that learn to upsample the features to the form that the decoder takes in. Activation functions are ReLU.

Figure 2: Generator sub-network.
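A minimal Keras sketch of this generator is shown below; the initial 4x4x64 stack, the filter counts, the 16x16x32 output shape, and the placement of batch normalization are assumptions made for illustration.

```python
# Sketch of the generator sub-network; layer sizes are assumptions.
from tensorflow.keras import Sequential, layers

def build_generator(latent_dim=100, base_shape=(4, 4, 64)):
    h, w, c = base_shape
    return Sequential([
        # Fully connected layer reshaped into a small feature stack.
        layers.Dense(h * w * c, activation='relu', input_dim=latent_dim),
        layers.Reshape(base_shape),
        # Two transpose convolutions upsample to the (assumed) 16x16x32
        # feature stack that the decoder takes in.
        layers.Conv2DTranspose(64, kernel_size=4, strides=2,
                               padding='same', activation='relu'),
        layers.BatchNormalization(),
        layers.Conv2DTranspose(32, kernel_size=4, strides=2,
                               padding='same', activation='relu'),
    ], name='generator')
```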

The decoder performs transpose convolution to move from the feature space to the final image. This sub-network has its weights frozen while training the GAN, as it is pre-trained in the autoencoder step. Activation functions are ReLU.

Figure 3: Decoder sub-network.
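A corresponding sketch of the decoder, assuming the same 16x16x32 feature stack and illustrative filter counts:

```python
# Sketch of the decoder sub-network; frozen during GAN training.
from tensorflow.keras import Sequential, layers

def build_decoder(feature_shape=(16, 16, 32)):
    return Sequential([
        layers.Conv2DTranspose(32, kernel_size=4, strides=2, padding='same',
                               activation='relu', input_shape=feature_shape),
        layers.Conv2DTranspose(4, kernel_size=3, strides=1, padding='same',
                               activation='relu'),  # 32x32x4 RGBA output
    ], name='decoder')
```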

Likewise, the encoder has its weights frozen while training the GAN, as it too comes from the autoencoder step. The goal of this sub-network is to learn a feature stack from an image using convolutional layers. Activation functions are ReLU, except the last layer, which has a sigmoid activation function.

Figure 4: Encoder sub-network.
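And a sketch of the encoder, mapping a 32x32x4 sprite down to the same assumed feature-stack shape, with the sigmoid on the final layer as described above:

```python
# Sketch of the encoder sub-network; frozen during GAN training.
from tensorflow.keras import Sequential, layers

def build_encoder():
    return Sequential([
        layers.Conv2D(64, kernel_size=3, strides=1, padding='same',
                      activation='relu', input_shape=(32, 32, 4)),
        layers.Conv2D(32, kernel_size=4, strides=2, padding='same',
                      activation='sigmoid'),  # 16x16x32 feature stack
    ], name='encoder')
```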

Finally, the discriminator sub-network takes the feature stack generated by the encoder and, through convolutional layers, determines whether the image is real or fake. The final output is a dense layer with a single element. Activation functions are leaky ReLU with 0.2 leakage, except the last layer, which has a sigmoid activation function.

Figure 5: Discriminator sub-network.
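A sketch of the discriminator over the assumed feature-stack shape; the filter counts are again placeholders:

```python
# Sketch of the discriminator sub-network; leaky ReLU (0.2) throughout,
# sigmoid on the final single-element dense layer.
from tensorflow.keras import Sequential, layers

def build_discriminator(feature_shape=(16, 16, 32)):
    return Sequential([
        layers.Conv2D(64, kernel_size=4, strides=2, padding='same',
                      input_shape=feature_shape),
        layers.LeakyReLU(alpha=0.2),
        layers.Conv2D(128, kernel_size=4, strides=2, padding='same'),
        layers.LeakyReLU(alpha=0.2),
        layers.Flatten(),
        layers.Dense(1, activation='sigmoid'),  # real/fake probability
    ], name='discriminator')
```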

Comparison Architecture

For comparison, we will develop a standard DCGAN that closely follows Horsley and Perez-Liebana's architecture (without our autoencoder). This network will still have its hyperparameters adjusted to produce the best possible output given the constraint of time.
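For contrast, the comparison DCGAN is assembled directly in image space, with no frozen autoencoder in between; build_image_generator and build_image_discriminator below are hypothetical stand-ins for the standard DCGAN sub-networks.

```python
# Sketch of the comparison DCGAN assembly; optimizers and losses are
# illustrative assumptions.
from tensorflow.keras import Input, Model

image_generator = build_image_generator()          # noise -> 32x32x4 sprite
image_discriminator = build_image_discriminator()  # 32x32x4 sprite -> real/fake
image_discriminator.compile(optimizer='adam', loss='binary_crossentropy')

noise = Input(shape=(100,))
image_discriminator.trainable = False
comparison_gan = Model(noise, image_discriminator(image_generator(noise)))
comparison_gan.compile(optimizer='adam', loss='binary_crossentropy')
```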

Datasets

Regarding the dataset, we will be collecting sprites from open-source resources, specifically OpenGameArt.org. This should provide enough data for an initial system. If necessary, we can perform simple manipulations such as changing the color palette and single-pixel adjustments. Our team recognizes that the largest risk to the success of the project is collecting enough quality data for our network to generalize. Our models will be trained on constrained topics, as subsets of items, entities, and environments. While this will reduce our training set, it will also reduce the variance, thereby reducing the number of cases the network must learn.

All sprite data will be filtered to only include sprites of 32x32 pixels; some sprites may be cropped or padded (sprites that lose their original meaning when manipulated, such as terrain, will be discarded). Sprites may either originate from a sprite sheet, from which they are extracted, or may already be provided individually. Each sprite is then normalized to a four-channel image (RGBA) before being manually classified into a hierarchy. In the future, a classification neural network could be developed for sorting, but that is outside the scope of our project. Finally, we package the sorted image data as serialized Numpy arrays along with labels generated from the hierarchical structure, as sketched below.
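A rough sketch of this packaging step follows, assuming the sprites have already been extracted to individual PNG files in a per-class folder hierarchy; the directory layout, padding strategy, and label scheme are assumptions.

```python
# Sketch of sprite preprocessing and serialization; paths and labels are
# assumptions based on a per-class folder hierarchy.
import numpy as np
from pathlib import Path
from PIL import Image

def pack_sprites(sprite_dir, out_file='sprites.npz'):
    images, labels = [], []
    for path in Path(sprite_dir).rglob('*.png'):
        sprite = Image.open(path).convert('RGBA')   # normalize to four channels
        if sprite.size != (32, 32):
            continue  # alternatively crop/pad; discard sprites that lose meaning
        images.append(np.asarray(sprite, dtype=np.float32) / 255.0)
        labels.append(path.parent.name)             # label from the folder hierarchy
    np.savez(out_file, images=np.stack(images), labels=np.array(labels))

pack_sprites('dataset/items')
```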

Development and Training

We are using Keras to develop both our autoencoder architecture and the comparison architecture. Keras was chosen because it is a Python library that makes it easy to leverage TensorFlow. For each subset of images, the class of interest, we will train a new variant of the network. For training, we chose Google Colab, as it provides free GPU access for a Jupyter notebook.
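The training itself follows the usual alternating GAN loop, sketched below using the models assembled earlier; the batch size, epoch count, and logging interval are placeholders rather than our exact settings.

```python
# Sketch of the alternating training loop for the autoencoder GAN; assumes
# 'encoder', 'decoder', 'generator', 'discriminator', and 'combined' from
# the assembly sketch above.
import numpy as np

def train(images, epochs=2000, batch_size=32, latent_dim=100):
    real_labels = np.ones((batch_size, 1))
    fake_labels = np.zeros((batch_size, 1))
    for epoch in range(epochs):
        # Discriminator step: encoded real sprites vs. re-encoded generated sprites.
        idx = np.random.randint(0, images.shape[0], batch_size)
        real_feats = encoder.predict(images[idx])
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        fake_sprites = decoder.predict(generator.predict(noise))
        fake_feats = encoder.predict(fake_sprites)
        d_loss_real = discriminator.train_on_batch(real_feats, real_labels)
        d_loss_fake = discriminator.train_on_batch(fake_feats, fake_labels)
        # Generator step: update through the frozen decoder and encoder.
        noise = np.random.normal(0, 1, (batch_size, latent_dim))
        g_loss = combined.train_on_batch(noise, real_labels)
        if epoch % 100 == 0:
            print(epoch, d_loss_real, d_loss_fake, g_loss)
```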

Results

In the following sections, we will cover our development process, starting with the GAN created to generate handwritten digits, then the comparison GAN, and finally the autoencoder GAN. Each section will detail the design decisions and inspiration behind it along with several result images.

GAN - MNIST Handwritten Digits

As a proof of concept while collecting our sprite dataset, we developed a GAN to generate handwritten digits. This allowed us to get familiar with Keras and develop the initial GAN architecture. Figure 6 displays several digits generated by the trained network. These results were generated after training the network for 20 epochs using the full MNIST dataset.

Figure 6: MNIST handwritten digits generated by GAN.

Comparison GAN

Using an exploratory GAN based on the model architecture developed for the MNIST problem, we trained a network that generated terrain tiles, Figure 7. Our network was trained for approximately 2000 epochs to achieve these results. The most interesting sprites are the various block patterns that emerge as a feature. Note that some of the artifacts are exaggerated because the sprites have been scaled up; however, there is still much work to be done in generating valid sprites.

Figure 7: Environment sprites generated with the network. Note that some of the blurring artifacts are due to scaling up the 32x32 pixel sprites.

Using a network structure similar to the model that generated environmental sprites, we trained a GAN to generate items; specifically, our largest subset was various weapons such as swords, staffs, and maces. Figure 8 shows what our network learned after approximately 500 epochs. A thousand epochs later, the results are shown in Figure 9. In both cases, the network has difficulty learning the relevant details but has a general understanding of shape and shadow.

Figure 8: Weapon sprites generated with GAN. Note results are poor but understandable.

Figure 9: Results generated after further training of the model. Improvement is shown, but quality, particularly in terms of color, is still low.

Further modification to the DCGAN, along with training the network for 3000 epochs, resulted in a model that seemed to produce some valid results. Figure 10 shows a sample of these generated images; note that the network did not generalize well, instead learning only the exemplars in the dataset. For example, the sword generated in the center is almost identical to one training sprite. We attempted to address this issue by increasing the size of our dataset and thus its variance.

Figure 10: Potential overfitting on the limited weapons dataset after modifying the network structure. Note that this set was randomly selected; the duplicates appeared by chance.

Training

Below are some of the recent training results captured for the comparison GAN, in which we trained on entities and armor. While the color is off after training, and partial mode collapse occurred, the results display new variants of the sprites within the dataset.

Figure 11: Human-like entities being learned.

Figure 12: Humanoid being learned.

Figure 13: Torso armor being learned.

Autoencoder GAN

We started with a single-layer autoencoder, which produced good results, as shown in Figure 14. However, the network could not learn with a second autoencoder layer at any reasonable accuracy. Thus, for our architecture, we pre-train one autoencoder layer before swapping the order of the encoder and decoder.

Figure 14: Original vs. autoencoder-generated sprites.

Using the pre-trained autoencoder, we trained our architecture's GAN in the same manner as the comparison network. The GAN with the pre-trained autoencoder suffered from partial mode collapse. The generator outputs lacked diversity because the generator jumped from one single optimal point to another.

Figure 15: Mode collapse. The generated sprites exhibit low variance.

The generator outputs showed little variance in structure; however, the variance in color was more pronounced. Comparing the generator and discriminator losses between our autoencoder and comparison GANs revealed another issue with our autoencoder GAN.

Figure 16: Loss plot for comparison GAN.

Figure 17: Loss plot for Autoencoder GAN.

The discriminator and generator losses in the comparison GAN indicate that both are able to optimize their loss functions without one overpowering the other. However, in our autoencoder GAN, the discriminator loss remains close to zero throughout training. With the discriminator giving values close to 0 or 1, the generator struggles to extract a useful gradient. One theory as to why this happens is that the encoder, already knowing the feature patterns, may make it easier for the discriminator to determine whether a sprite is 'real' or 'fake'.

The autoencoder GAN loss plot confirmed the network was suffering from mode collapse. The sharp drops in generator loss and slight increases in discriminator loss may signify the generator converging to a single optimal point. The discriminator then learns to classify these generated images as fake, and the generator collapses to the next optimal point.

The simplest solution to implement in this case was to increase the batch size while training the GAN. This worked to some extent when training on large datasets: the resulting outputs had more variance and novelty.


Figure 18: Autoencoder GAN results.

This, however, did not prove to be a great solution, as the autoencoder GAN did not perform well on limited datasets. Also, even on large datasets, once the GAN experienced partial mode collapse, the training process had to be restarted. Nevertheless, the autoencoder GAN did show promise through the novelty of the outputs and a relatively lower training time.

A final attempt to handle the discriminator loss approaching zero was to use the comparison GAN's discriminator with our autoencoder model's generator and decoder. This produced the following sprites, which were a minor improvement over our previous autoencoder models. Furthermore, it proved to be stable for a longer period of time, but it still eventually resulted in partial mode collapse.

Figure 19: Sprites generated from decoder only with comparison discriminator.

Discussion

In order to make further progress on the topic of sprite generation, we implemented a generative adversarial network that was augmented with a pre-trained autoencoder. Thus, instead of learning to draw images, the GAN was responsible for learning the feature space of an autoencoder that was pre-trained to render sprites. This resulted in a model that could learn novel patterns with the relatively small dataset we collected. Furthermore, we found that using an autoencoder decreased the time to get initial results, but at a major trade-off of stability during training, observed as the GAN's discriminator loss approaching zero early in the training process. Several possible solutions we have found (though not yet implemented) include minibatch discrimination, unrolled GANs, and collecting a larger dataset of sprites.
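As an illustration of one of those candidate fixes, below is a sketch of a minibatch discrimination layer (in the style of Salimans et al., 2016) that could be inserted between the discriminator's Flatten layer and its final dense layer; we have not trained with it, and the kernel counts are arbitrary.

```python
# Sketch of a minibatch discrimination layer; num_kernels and kernel_dim
# are arbitrary, and this layer has not been used in our trained models.
import tensorflow as tf
from tensorflow.keras import layers

class MinibatchDiscrimination(layers.Layer):
    def __init__(self, num_kernels=50, kernel_dim=5, **kwargs):
        super().__init__(**kwargs)
        self.num_kernels = num_kernels
        self.kernel_dim = kernel_dim

    def build(self, input_shape):
        # T projects each sample's features into num_kernels rows of
        # kernel_dim values that are compared across the batch.
        self.T = self.add_weight(
            name='T',
            shape=(int(input_shape[-1]), self.num_kernels * self.kernel_dim),
            initializer='glorot_uniform', trainable=True)

    def call(self, x):
        m = tf.reshape(tf.matmul(x, self.T),
                       (-1, self.num_kernels, self.kernel_dim))
        # L1 distance between every pair of samples in the batch, per kernel.
        diffs = tf.expand_dims(m, 3) - tf.expand_dims(tf.transpose(m, [1, 2, 0]), 0)
        similarity = tf.reduce_sum(
            tf.exp(-tf.reduce_sum(tf.abs(diffs), axis=2)), axis=2)
        # Append batch-similarity statistics to the original features so the
        # discriminator can detect collapsed (near-identical) batches.
        return tf.concat([x, similarity], axis=1)
```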

A second challenge encountered during development was generating images with correct coloring and alpha channel. For example, the autoencoder trained on a small dataset could produce the correct shape but would apply a new color palette to the resulting image. Likewise, both the comparison GAN and the autoencoder GAN would fail to learn color information while training. For the autoencoder GAN, this resulted in sprites with an inconsistent color palette, and in the comparison model, the sprites would have color noise. These pixel errors jeopardize the visual metaphor that the sprite is supposed to serve if used in a game. We believe the first step to addressing this problem is to build a larger dataset with controlled variance.

The next step of the project is to explore alternate implementations and strategies to further test the autoencoder concept. Near the end of the term, we started work on a conditional GAN with the hope that it would handle the variance in the dataset. However, we repeatedly had the discriminator loss approach zero early in training (even without the autoencoder), so we abandoned that aspect of the project scope to focus on what was working. It would be a logical step to attempt this again. Another route for the autoencoder GAN is to generate grayscale images, with a separate model learning to recolor the sprite. This would leverage the strength of the autoencoder we observed (learning sprite structure) without the observed weakness.

While the setbacks due to stability and color challenge the benefit of the autoencoder, they do not necessarily invalidate the approach. We believe the potential benefit as a method to reduce learning time and increase novelty is still suggested by the results we were able to generate. Further work is needed to triangulate our findings.

Acknowledgements

Website Template

1.   Jeromelachaud. Grayscale-theme. License: Apache-2.0. Online: https://github.com/jeromelachaud/grayscale-theme


Sprite Dataset Contributions

1.   Henrique Laxarini (7Soul1). 496 Pixel Art Icons for Medieval/Fantasy RPG. License: CC0 - Public Domain. Online: https://opengameart.org/content/496-pixel-art-icons-for-medievalfantasy-rpg

2.   MedicineStorm. Dungeon Crawl 32x32 Tiles Supplemental. License: CC0 - Public Domain. Online: https://opengameart.org/content/dungeon-crawl-32x32-tiles-supplemental

3.   David E. Gervais. Roguelike Tiles (Large Collection). License: CC-BY 3.0. Online: https://opengameart.org/content/roguelike-tiles-large-collection

4.   Lanea Zimmerman (Sharm). Tiny 16: Basic. License: CC-BY 3.0. Online: https://opengameart.org/content/tiny-16-basic

5.   ArMM1998. Zelda-like Tilesets and Sprites. License: CC0. Online: https://opengameart.org/content/zelda-like-tilesets-and-sprites

6.   Zabin, Hyptosis, Danial Cook. Castle Tiles for RPG's. License: CC-BY 3.0. Online: https://opengameart.org/content/castle-tiles-for-rpgs

7.   Zabin, Daneeklu, Jetrel, Hyptosis, Redshrike, Bertram. RPG Tiles: Cobble Stone Paths & Town Objects. License: CC-BY-SA 3.0. Online: https://opengameart.org/content/rpg-tiles-cobble-stone-paths-town-objects

References

1.   Hendrikx M., Meijer S., Velden J. V. D., Iosup A. (2013) Procedural Content Generation for Games: A Survey. ACM Transactions on Multimedia Computing, Communications, and Applications, v.9, n.1, p.1-22. Online Access: https://dl.acm.org/citation.cfm?id=2422957

2.   Horsley L., Perez-Liebana D. (2017) Building an Automatic Sprite Generator with Deep Convolutional Generative Adversarial Networks. IEEE Conference on Computational Intelligence and Games, p.134-141. Online Access: https://ieeexplore.ieee.org/document/8080426/?part=1

3.   Mirza M., Osindero S. (2014) Conditional Generative Adversarial Nets. CoRR. Online Access: https://arxiv.org/pdf/1411.1784.pdf

4.   Radford A., Metz L., Chintala S. (2015) Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR. Online Access: https://arxiv.org/pdf/1511.06434.pdf

5.   Reed S. E., Zhang Y., Zhang Y., Lee H. (2015) Deep Visual Analogy-Making. Advances in Neural Information Processing Systems 28. P.1252-1260. Online Access: http://papers.nips.cc/paper/5845-deep-visual-analogy-making.pdf

6.   Thanh-Tung H., Tran T., Venkatesh S. (2018) On catastrophic forgetting and mode collapse in Generative Adversarial Networks. ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models. Online Access: https://arxiv.org/pdf/1807.04015.pdf

Team

Our team is composed of students from the CS 534 - Computational Photography fall 2018 course at the University of Wisconsin-Madison.



Curt Henrichs (Grad Student), Sayem Wani (Undergrad Student), Saheen Feroz (Undergrad Student)