Deep visual embedding for image classification

This paper proposes a new visual embedding method for image classification. It takes the analogy with textual data further and allows us to read visual sentences in a given order, as in the case of text. The proposed method considers the spatial relations between visual words and uses the very popular text analysis method 'word2vec'. We learn visual dictionaries based on the filters of the convolution layers of a convolutional neural network (CNN), which is used to capture the visual context of images, and we employ visual embedding to convert words to real vectors. We evaluate several designs of dictionary building methods. To assess the performance of the proposed method, we use the CIFAR10 and MNIST datasets. The experimental results show that the proposed visual embedding method outperforms several image classification methods. Experiments also show that our method can improve image classification regardless of the structure of the CNN.

I. INTRODUCTION
Several methods have been proposed in the literature for classifying, clustering and indexing text documents. Those methods are mature and very effective at dealing with huge numbers of classes. Textual data consists of words, so it can be processed using information about those words, and it is always possible to learn a word representation from the occurrences of a given word across different documents. This idea can also be applied to images: any image can be considered a document containing a set of patches. However, the analogy does not extend further on its own, because text words are discrete values while local image patches are represented by high-dimensional, real-valued descriptors. To obtain discrete visual words, we need a way to project image patches or patch descriptors to a single integer per patch or descriptor. Local patches or descriptor vectors can then be represented by the region (i.e., discrete word) they belong to. In text processing, vector representations of words (word embedding, or word2vec) have recently been proposed to capture the context-based relationships between words. These representations achieve state-of-the-art results in different text processing tasks. The main obstacle to applying them to images is the above-mentioned nature of image patches. In this paper, we build a bridge between images and text documents that applies to images in full analogy with textual word embedding. In an end-to-end fashion, visual embedding is done in the same manner as in natural language processing (NLP). The proposed method uses the very popular text analysis method 'word2vec'. We learn visual dictionaries based on the filters of the convolution layers of a convolutional neural network (CNN), which is used to capture the visual context of images, and we employ visual embedding to convert words to real vectors. In addition, we assess different designs of dictionary building methods. The rest of the paper is organized as follows. Section II presents the related works. Section III discusses word embedding. Section IV explains the proposed method. The experimental results are provided in Section V. The conclusions and future work are provided in Section VI.

II. RELATED WORKS
Word embeddings are real-valued vectors that capture semantic similarity. Such vectors are used as representations of words in NLP tasks, such as sentiment analysis, text classification and document clustering. These representations have been learned using neural networks [1], [2] and then used to initialize multi-layer network models [1], [2]. Like these approaches, we learn word embeddings from images online and fine-tune them to predict the visual class of an image. Xu et al. [5] and Lazaridou et al. [6] use visual appearance to improve the word2vec representation by predicting real image representations from word2vec. While they focus on capturing appearance, we focus on capturing the visual words inside the image. Another group of researchers uses visual and textual attributes to learn distributional models of word meaning [7], [8]. In contrast, our set of visual words is learned during the training of the CNN. Other researchers use word embeddings inside larger models for complex tasks, such as image captioning [9], [10] and image retrieval [9]. These works are multi-modal (i.e., textual-visual); in contrast, we perform purely visual embedding. The bag-of-visual-words (BOVW) model was proposed by Sivic et al. [21] and became very popular due to its efficiency and high performance. Other works, such as probabilistic latent semantic analysis (pLSA) [23] and latent Dirichlet allocation (LDA) [24], use unsupervised classification: they compute latent concepts in images using the co-occurrences of visual words. These techniques obtain their best results when the number of categories is known. Random forest classifiers have been used in some works, e.g., [24]; however, the state-of-the-art results were obtained with support vector machine (SVM) classifiers. The authors of [25] combine local matching of the features with specific kernels based on the χ2 distance to obtain better results. Most BOVW methods build their visual words in a clustering step; in contrast, we learn our dictionary from the data using a CNN, or build it using local binary patterns (LBP) or Gabor filters.

Vision and NLP: Recently, various problems at the intersection of NLP and vision have attracted researchers' interest [19], [20]. Significant breakthroughs have been achieved in several tasks, such as visual question answering [10]-[12], image captioning [13]-[15], aligning text and vision [16], [17] and video description [18]. Unlike these works, our approach is generic, i.e., it can also be used for multiple tasks.

III. WORD EMBEDDING
In NLP, word embedding refers to a family of language modeling techniques in which words and phrases are mapped to vectors of real numbers. Mathematically, an embedding is a projection from a sparse space with one dimension per word to a continuous vector space of much lower dimension. The authors of [29], [33] showed that this mapping can be learned using neural networks. Other word embedding models have been proposed in the literature, such as probabilistic models [29], models based on dimensionality reduction [31], and word-context based models [31]. Experiments have shown that word and phrase embeddings boost performance in NLP tasks (for example, syntactic parsing and sentiment analysis) when used as the underlying input representation. Following this intuition, and supposing that patterns from a given (or learned) dictionary occur in images, any image can be converted into an image of visual words instead of an image of colors. In this work, we follow [33] for the embedding step. Below, we discuss some options for building the dictionaries.

Simple pixel dictionary: The simplest dictionary that can be built consists of the gray-scale values of the image pixels. Its main disadvantages are:
• It captures no information about the context; it describes only a single pixel.
• It is sensitive to noise.

Local binary pattern (LBP): The LBP operator was proposed by Ojala et al. [26]. LBP is one of the most successful descriptors thanks to advantages such as robustness to illumination changes and low computational and implementation complexity. LBP characterizes the structure of a local image patch: it thresholds the differences between the value of the central pixel and those of its neighbors, and the resulting binary pattern is used to label the central pixel. The LBP response is calculated as follows:
LBP_P = Σ_{p=0}^{P−1} s(g_p − g_c) · 2^p,  with s(x) = 1 if x ≥ 0 and s(x) = 0 otherwise  (1)

where P is the number of neighbor pixels, g_c is the value of the center pixel and g_p are the values of the neighbor pixels. As shown in Eq. 1, for each neighbor, if the center pixel's value is greater than the neighbor's value, s = 0 is assigned to that neighbor; otherwise, s = 1 is assigned. For P = 8 this gives an 8-digit binary number, which is usually converted to a decimal value.
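The thresholding in Eq. 1 can be sketched in a few lines of Python. `lbp_pixel` is a hypothetical helper, not code from the paper, and the clockwise neighbour ordering starting at the top-left corner is an assumption:

```python
import numpy as np

def lbp_pixel(patch):
    """LBP code of the centre pixel of a 3x3 patch (P = 8 neighbours).

    Hypothetical helper; the clockwise neighbour order starting at the
    top-left corner is an assumption, not taken from the paper.
    """
    gc = patch[1, 1]
    neighbours = [patch[0, 0], patch[0, 1], patch[0, 2], patch[1, 2],
                  patch[2, 2], patch[2, 1], patch[2, 0], patch[1, 0]]
    code = 0
    for p, gp in enumerate(neighbours):
        s = 0 if gc > gp else 1   # s = 0 when centre > neighbour (Eq. 1)
        code += s << p            # weight neighbour p by 2^p
    return code
```

A flat patch (all pixels equal) yields the all-ones code 255, since no neighbour is strictly smaller than the centre.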

Fig. 1: Gabor filters
Gabor filters: Fig. 1 shows a set of Gabor filters with various scales and rotations. Jones and Palmer explained in [26] that the real part of the complex Gabor function is a good approximation of the receptive field weight functions found in simple cells of the mammalian striate cortex. In other words, image analysis using Gabor filters resembles the human visual system. The impulse response of a Gabor filter is defined analytically as a Gaussian function multiplied by a sinusoidal wave, or a plane wave in the case of 2D Gabor filters. As shown in Eq. 2, the filter has real and imaginary components representing orthogonal directions; the components can be combined into a complex number or used individually. The complex form of the Gabor filter can be given as:

g(x, y; λ, θ, ψ, ω, γ) = exp(−(x′² + γ²y′²) / (2ω²)) · exp(i(2π x′/λ + ψ))  (2)

and its real part as:

g_real(x, y; λ, θ, ψ, ω, γ) = exp(−(x′² + γ²y′²) / (2ω²)) · cos(2π x′/λ + ψ)  (3)

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ.

Convolving an image I with a bank of N Gabor filters produces N images. We stack the resulting images and, for every (x, y) in the output image, write the index of the filter with the highest response at those coordinates (Fig. 2: embedding procedure for an analytically given dictionary). It is always possible to recover the word at a given location using the same policy (see Fig. 2).

IV. PROPOSED METHOD
In this section, we show how to compute the visual words and find them inside the images. The proposed method is independent of the dictionary building step as long as the resulting dictionary contains a finite number of visual words. Each patch in the image is converted to the index of the closest visual word in the dictionary. This step produces a 'words image' (see Fig. 3), which is passed to a CNN that captures the spatial context of the words. After that, every word in the words image is embedded into a real-valued vector of embedding size d, resulting in an image with d channels.
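A minimal numpy sketch of Eq. 3 and the per-location word-assignment step described above. The bank parameters (λ, ψ, ω, γ, kernel size, number of orientations) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def gabor_real(x, y, lam, theta, psi, omega, gamma):
    """Real part of the Gabor function (Eq. 3)."""
    xr =  x * np.cos(theta) + y * np.sin(theta)   # x' = x cos(theta) + y sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)   # y' = -x sin(theta) + y cos(theta)
    return (np.exp(-(xr**2 + gamma**2 * yr**2) / (2 * omega**2))
            * np.cos(2 * np.pi * xr / lam + psi))

# Small bank of 4 orientations; all parameter values here are illustrative.
ks = 7
yy, xx = np.mgrid[-(ks // 2):ks // 2 + 1, -(ks // 2):ks // 2 + 1]
bank = [gabor_real(xx, yy, lam=4.0, theta=t, psi=0.0, omega=2.0, gamma=0.5)
        for t in np.linspace(0.0, np.pi, 4, endpoint=False)]

def word_image(image, bank, ks):
    """At each location, the index of the filter with the highest response."""
    h, w = image.shape
    out = np.zeros((h - ks + 1, w - ks + 1), dtype=int)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + ks, j:j + ks]
            out[i, j] = int(np.argmax([np.sum(f * patch) for f in bank]))
    return out
```

Each entry of the returned array is a discrete visual word, i.e., an integer in [0, N).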
Then, the d-channel image passes through the layers of the CNN, and the process continues until we obtain a stable embedding vector.
As shown in Fig. 4, the proposed method uses filters learned by the CNN to build the dictionary. The first layer of the CNN is a convolution layer with N filters. After passing through this layer, we obtain N feature maps. At each location we choose the index of the filter with the highest response, and the resulting responses are passed to the next layers. In this way it is possible to learn the filters, i.e., to build a dictionary. The size of the dictionary equals the number of filters in the first convolutional layer. In this paper, we adopt learned filters to build the dictionary. We also believe that a Gabor dictionary is a special case of learned filters.
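The conversion from first-layer feature maps to a words image is a per-location argmax over the filter axis. A sketch with random stand-in activations (the shapes, not the values, are the point):

```python
import numpy as np

rng = np.random.default_rng(0)
N, H, W = 128, 8, 8                            # dictionary size = number of filters

feature_maps = rng.standard_normal((N, H, W))  # stand-in for first-layer conv outputs
words = feature_maps.argmax(axis=0)            # per pixel: index of strongest filter
```

The result is a single-channel integer image whose values index the learned dictionary.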

V. EXPERIMENTS
A. Datasets
To assess the performance of the proposed method, we use two standard image classification datasets: MNIST and CIFAR-10. The MNIST dataset contains 70000 28x28 gray-scale images of hand-written digits from 0 to 9, split into 60000 training images and 10000 testing images. The CIFAR-10 dataset contains 60000 32x32 RGB images from 10 classes, split into 50000 training images and 10000 testing images.

B. Model Setup
In our proposed model, the first layer of the CNN is a convolution layer with 128 filters (so the size of the dictionary in our model is 128 words). The output of the first layer is an image with 128 channels. To convert this multi-channel image into a words image, we take, at each location, the index of the channel with the highest response. The intuition is that the word (filter) with the highest response is the closest to the patch at that location. This yields a single-channel image containing the indexes of the closest words. The next, embedding, layer converts each pixel of the words image into a vector of real values. In our experiments, an embedding size of 16 is used, so each words image is converted into a 16-channel image of real values. The next layer is a convolutional layer in which the number of filters is set to the visual dictionary size. Afterwards, the outputs of the first and second convolution layers are concatenated to learn the visual words, which are the filters of the first convolutional layer. Table I shows that the performance of the CNN can be improved by adding embedding in the bottom layers of the model. Our custom CNN model gives an error of 1.0 on the MNIST dataset without embedding and 0.5 with embedding, which is a considerable improvement. Fig. 5 shows the change of accuracy across training and testing epochs of the CNN, while Fig. 6 presents the confusion matrix of the proposed method on the MNIST dataset. Applying the same model to the CIFAR10 dataset gives an accuracy of 0.81. Fig. 7 shows the change of accuracy across training and testing epochs, while Fig. 8 presents the confusion matrix of the proposed method on CIFAR10. We avoid putting embedding at the bottom of state-of-the-art models, such as VGG16 and ResNet, in order to show the contribution of the visual embedding itself. Table II demonstrates that learned filters used as a dictionary give much better performance than an LBP dictionary.
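The embedding layer described above is a table lookup. A sketch with the sizes used in the paper (a 128-word dictionary and embedding size 16) and a random stand-in words image; in the actual model the table entries are learned, not random:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, H, W = 128, 16, 8, 8                    # dictionary size 128, embedding size 16

word_image = rng.integers(0, N, size=(H, W))  # stand-in single-channel words image
E = rng.standard_normal((N, d))               # embedding table; learned in practice
embedded = E[word_image]                      # (H, W, d): each word index -> 16-vector
```

Advanced integer indexing replaces every word index with its row of the table, producing the 16-channel real-valued image that is fed to the next convolutional layer.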
Table III shows that the proposed method gives a promising performance on the MNIST and CIFAR10 datasets: quite good performance on MNIST (an error of 0.5) and promising behavior on CIFAR10 (an accuracy of 0.81). The weaker result on CIFAR10 is due to two factors: the difficulty of the classification problem on this dataset and the simplicity of our model. Table III also shows that the proposed method is on par with several recently published image classification methods, e.g., [35] (0.71), Adversarial Examples [36] (0.78) and PCANet [37] (0.62 on MNIST, 78.67 on CIFAR10).

ACKNOWLEDGEMENT
This paper is partially funded by project DPI2016-77415-R.

VI. CONCLUSION AND FUTURE WORK
In this paper, we have presented a CNN model that embeds visual words in images. We have treated images as textual documents, built visual words, and embedded them to capture the spatial context surrounding them. The learned embedding performed better than the same CNN model on the original images. The main goal of the proposed method is to provide a framework that shows how to build a bridge between embedding in text data and its adaptation to visual data and a wide range of generic image and video tasks. In future work, we will apply image embedding at the bottom of state-of-the-art CNN models, such as VGG16 and ResNet. In addition, we will apply it to action recognition in videos.