Recently there was a paper [1] that claimed the following about Deep Neural Networks (DNNs):

DNNs fail to recognize the negative images and classify them randomly into other classes.

In this post I’ll try to verify those results for the MNIST dataset (handwritten digits), by training a neural network to accurately classify the digits, and then testing it with negated input images.



For my implementation, I will be using Keras with the TensorFlow backend. To make life easier, I’ll base my code off the Keras MNIST CNN example.

The Model

The authors describe their model architecture as the following:

The network has two convolutional layers with 32 and 64 filters and with a kernel size of 3 × 3, followed by a max-pooling layer. It is then followed by two fully-connected layers with 200 rectified linear units. The classification is made by a softmax layer

I think there is some ambiguity so I’m making the following assumptions:

  • The convolutional layers use ReLU activations
  • The max-pooling layer is 2x2
  • There is no dropout
  • I’ll use the Adam optimizer

Translated to Keras code, we have the following model:

# assumed imports, matching the Keras MNIST CNN example
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())  # flatten the feature maps before the fully-connected layers
model.add(Dense(200, activation='relu'))
model.add(Dense(200, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])


Training & testing

My data is split into 60,000 training instances and 10,000 test instances, and I am training for 20 epochs. The paper normalizes the data values to [0, 1] and defines the negative image as 1 - X (same range, but inverted values), where X is the original image. For fun, I'll also test the accuracy with -X, which maps the input range to [-1, 0].
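The three test variants above are just pixel-wise transforms. A minimal sketch in NumPy, using a tiny hypothetical batch in place of the real MNIST images:

```python
import numpy as np

# hypothetical stand-in for MNIST images (uint8 pixel values 0-255)
x = np.array([[0, 128, 255]], dtype=np.float64)

# normalize to [0, 1], as in the paper
x = x / 255.0

# negative image: same range, inverted values
x_neg = 1.0 - x

# sign-flipped variant: range becomes [-1, 0]
x_flip = -x
```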

For measuring my results, I’ll be taking note of the classification accuracy of the test set X, as well as 1 - X and -X.
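Classification accuracy here is just the fraction of test labels predicted correctly; a one-line helper (the function name is mine) makes that concrete:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```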

Robust Training

The authors of the paper also performed several tests in which they trained a “robust” model using adversarial examples [2], which I understood to be the same training data with some noise added in order to improve the model's ability to generalize. In the paper, they use a parameter ε to control the magnitude of the noise and test with ε = 0.05, 0.1, 0.2, and 0.5; I will just test with ε = 0.5.

# add uniform noise in [-0.5, 0.5) to the training data (epsilon = 0.5)
x_noise = np.random.random(x_train.shape) - 0.5
x_train = x_train + x_noise
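The snippet above hard-codes ε = 0.5. A small sketch that parameterizes it, so the paper's other ε values could be tried too (the helper name is mine, not from the paper):

```python
import numpy as np

def add_uniform_noise(x, epsilon, rng=None):
    """Return x with uniform noise drawn from [-epsilon, epsilon) added."""
    if rng is None:
        rng = np.random.default_rng(0)  # fixed seed for reproducibility
    noise = epsilon * (2.0 * rng.random(x.shape) - 1.0)
    return x + noise
```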



Regular DNN

These results show that with negated (1 - X) inputs, the model does perform worse, having about 47% accuracy, versus ~99% accuracy with the normal inputs. The errors are greater still with the -X input, with only ~13% accuracy.

Below the table are the confusion matrices for each test, which should help visualize how each test set caused the classifier to behave.
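A confusion matrix just counts, for every true class, how the classifier distributed its predictions. A minimal NumPy sketch (the helper name is mine; scikit-learn's sklearn.metrics.confusion_matrix does the same job):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=10):
    """cm[i, j] counts test instances of true class i predicted as class j."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```

A perfect classifier produces a purely diagonal matrix; the negative-image tests should instead scatter counts off the diagonal.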

For some reason, my negative test results, although not that great, are much better than the paper’s results. This isn’t a comprehensive review of the work, so I might have wrong assumptions or missed some implementation details.

| Test Set | My Accuracy | Hosseini & Poovendran |
|----------|-------------|-----------------------|
| X        | 99.10%      | 99.20%                |
| 1 - X    | 47.64%      | 6.58%                 |
| -X       | 13.49%      | N/A                   |


“Robust” DNN

In the paper, using adversarial training examples with ε = 0.5 brought the accuracy of the negative test set up to about 16%, which is slightly better than random.

In this version, my results are still quite different from the paper's, but adding the adversarial examples had a similar effect: my normal test set accuracy went down slightly (99.10% to 98.86%), while performance on the negative examples improved a bit (47.64% to 54.97%).

| Test Set | My Accuracy | Hosseini & Poovendran |
|----------|-------------|-----------------------|
| X        | 98.86%      | 98.74%                |
| 1 - X    | 54.97%      | 16.69%                |
| -X       | 10.02%      | N/A                   |



I think that my (very brief) tests somewhat support the conclusion that neural networks seem to fail on negative examples, although not to the degree that the paper claims.

**That’s also probably not the point of the paper**, which the authors touch upon in their conclusion:

“The inability of recognizing the transformed inputs shows the shortcoming of current training methods, which is that learning models fail to semantically generalize.”

The real story here is that currently, neural networks will most likely behave in unexpected ways when asked to process a new input that is completely outside the range of their training examples, even if that new input represents the same concepts.

This is illustrated by the inverted images: to a human, an inverted picture still represents the same contents as a regular one, but a neural network only sees weird new numbers and features it has never had to deal with before. It processes them exactly as it would a regular input, and ends up producing a meaningless output.

Both are penguins, but to a neural net, the one on the right might as well be a pineapple. Hopefully this gets solved before it gets used in anything too important.


[1] Hosseini, Hossein, and Radha Poovendran. “Deep Neural Networks Do Not Recognize Negative Images.” arXiv preprint arXiv:1703.06857 (2017).
[2] Goodfellow, Ian J., Jonathon Shlens, and Christian Szegedy. “Explaining and harnessing adversarial examples.” arXiv preprint arXiv:1412.6572 (2014).