Symbols in engineering drawings (SiED): an imbalanced dataset benchmarked by convolutional neural networks.

Engineering drawings are common across different domains such as Oil & Gas, construction, and mechanical engineering. Automatic processing and analysis of these drawings is a challenging task. This is partly due to the complexity of these documents and partly due to the lack of publicly available datasets that could help push research in this area. In this paper, we present a multiclass imbalanced dataset for the research community made of 2432 instances of engineering symbols. These symbols were extracted from a collection of complex engineering drawings known as Piping and Instrumentation Diagrams (P&IDs). By providing such a dataset to the research community, we anticipate that this will help attract more attention to an important, yet overlooked, industrial problem, and will also advance research on this important and timely topic. We discuss the dataset's characteristics in detail, and we also show how Convolutional Neural Networks (CNNs) perform on such an extremely imbalanced dataset. Finally, conclusions and future directions are discussed.


Introduction
Engineering drawings are known to be one of the most complex types of documents to process and analyse. They are widely used in different industries such as construction and city planning (e.g. floor plan diagrams [2]), Oil & Gas (e.g. P&IDs [9]), Mechanical Engineering [33], AutoCAD Drawing Exchange Format (DXF) [13] and others. Interpreting these drawings requires highly skilled people, and in some cases long hours of work. Processing and analysing these drawings is becoming increasingly important. This is partly due to the urgent need to improve business practices such as inventory, asset management, risk analysis, safety checks and other types of applications, and partly due to recent advances in machine vision and image understanding. Deep Learning (DL) [15], in particular, has significantly improved performance in many domains such as Gaming and AI [17], Natural Language Processing [36], Health [12], Cyber Security [28], and others.
Convolutional Neural Networks (CNNs) [16] have made significant progress in recent years in many image-related tasks. They have been successfully applied to several fields such as hand-written digit recognition [22], image classification [30,20], and face recognition & biometrics [27], amongst others. Before CNNs, improvements in image classification, segmentation, and object detection were marginal and incremental. CNNs revolutionised this field. For example, DeepFace [31], a face recognition system first proposed by Facebook in 2014, achieved an accuracy of 97.35%, reducing the error of the then state-of-the-art by more than 27%.
Despite extensive progress in the field of image processing and analysis, very little progress has been made in the area of analysing complex engineering drawings, and extracting information from these diagrams is still considered a challenging problem [5]. Consider for example the case of the Piping and Instrumentation Diagram (P&ID), which is a schematic engineering drawing, commonly used in the Oil and Gas industry [9,24]. This type of diagram, as can be seen in Figure 1, is made of symbols, connectivity information (lines, dashed lines, combinations of lines), text, and other graphical elements. Identification of the symbols within this kind of diagram would appear to be an ideal problem which could be easily solved by convolutional neural networks. However, a recent review on the subject [7] showed that publicly available datasets are not common in this area, with research commonly applied to small, proprietary datasets. To take full advantage of the recent advances in machine vision, and to facilitate reproducible experiments, a sizeable, labelled dataset in the public domain is required.
Several factors make processing and analysing engineering drawings a challenging task. First, the quality of the images/scanned documents is sometimes poor enough to require the application of various image-enhancement methods. Second, the nature of these diagrams, where various types of elements might overlap (e.g. text overlaid on a symbol), in addition to possible data annotations and other graphic elements, makes accurate localisation of individual elements more challenging. It is difficult to isolate one particular symbol from its neighbours. Another inherent problem is the imbalanced distribution of various symbols within these drawings. Handling all related challenges is beyond the scope of this paper. The reader is referred to [24] for a more detailed description of the inherent characteristics and challenges of these types of drawings.
In this paper, we present a new multiclass dataset of symbols extracted from engineering drawings to the research community. Realistically reflecting the problem, this dataset is subject to class imbalance. The remaining parts of this paper are organised as follows: In Section 2 we discuss literature relevant to the digitisation of engineering drawings and to class imbalance. In Section 3 we present our methods, which include a detailed discussion of the dataset and our approach for classifying engineering symbols. Benchmarking experiments and results are presented in Section 4, and finally, conclusions and future directions are discussed in Section 5.

Related Work
Attempts to process and analyse symbolic drawings date back to at least the early 1990s. These include: analysis of musical notes [6]; processing mechanical drawings [19]; and optical character recognition (OCR) [21,23,26]. In recent years, digitising engineering drawings has become increasingly important as they are widely used in different domains [9,2,33,13]. However, the literature is still limited. To the best of our knowledge, there is no large, publicly available dataset that would allow researchers to take advantage of modern, data-hungry CNNs. A recent review [7] detailed the whole process of digitisation and contextualisation of the three main shapes contained in engineering drawings (i.e. text, lines and symbols). The authors identified that, typically, symbols are located within the drawing either in a specific or a holistic way. In specific localisation, the system has a predefined symbol description/template, and an algorithm recursively looks for such a symbol. In contrast, holistic methods require differentiation of the three shapes in order to split the drawing into layers. One of the most widely-used frameworks in this regard is text-graphics separation [32], a family of algorithms which distinguish text from lines and symbols based on properties such as height-to-width ratio and stroke, amongst others. CNNs could be applied to both of these, given sufficient labelled data.
One type of engineering drawing, namely the P&ID, has attracted more research attention in recent years. Typical examples, presented in [9,18,25], aimed at detecting and recognising symbols within these diagrams. It can be argued, however, that most of the existing literature followed a traditional image processing approach [14], which requires feature extraction [8], feature representation [37], and classification to determine the class of objects (i.e. symbols, digits, ...) [1].
Most recently in [9], authors presented a first step towards creating a symbol repository for engineering drawings. A total of 1187 symbols split into 37 different classes was compiled. The repository was then processed by means of class decomposition [10,11], resulting in a total of 57 sub-classes. Classification accuracy was calculated using three different classification frameworks: Random Forests (RF), Support Vector Machine (SVM) and a CNN. Class decomposition demonstrated a slight improvement in classification results for SVM and CNN, with a more considerable improvement in RF.
Overall, there is a growing interest in the research community in digitising and analysing engineering drawings. Yet, the lack of public domain datasets is considered one of the main obstacles to pushing the research boundaries in this area. In addition, the class-imbalance problem can be considered another challenge, in particular when certain types of symbols either dominate, or rarely appear in, the dataset. Class imbalance is common across different domains, and is not limited to engineering drawings [35,34]. This problem is often handled by means of data resampling, where majority classes are undersampled to reduce their dominance, or minority classes are oversampled [35]. In addition to this, Generative Adversarial Neural Networks [15] were successfully applied recently to augment an imbalanced dataset and improve learning algorithm performance [3,4].
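The resampling idea mentioned above can be illustrated with a minimal sketch of random oversampling. This is not a step used in the benchmark reported in this paper; the function name random_oversample, the toy samples and the choice to match the largest class are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until every class
    matches the size of the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra samples with replacement to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Toy data: placeholder strings stand in for symbol images.
samples = ["s1", "s2", "s3", "s4", "r1", "r2", "b1"]
labels = ["Sensor"] * 4 + ["Reducer"] * 2 + ["Barred Tee"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind should only be applied to the training portion of the data; duplicating samples before a train/test split would place copies of the same symbol on both sides of the split.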

Methodology
This section presents our novel dataset of Symbols in Engineering Drawings (SiED). First, we give a brief description of how this dataset was constructed. This is followed by a detailed description of the dataset and its class distribution. Finally, we briefly discuss the classification method used to benchmark this dataset.

Data Extraction
A collection of P&ID sheets was provided by an industrial partner. Following the work in [25], a thresholding method was first applied to reduce noise. Areas of interest were then identified interactively to discard boundaries, text and annotation outside the border of each drawing. A traditional machine-vision approach was then used to extract a set of symbols. A set of heuristic-based methods were developed and applied sequentially to localise symbols within each P&ID drawing. Figure 2 shows a random selection of typical symbols that appear in P&IDs.

Fig. 2: A random selection of typical symbols that appear in P&IDs
The methods proved to be stable enough to provide a list of extracted and well-defined symbols. However, a key limitation of such heuristic-based methods is that they require extensive feature engineering, and they need fine-tuning and customisation before they can generalise to unseen symbols or different types of diagrams [7].

Dataset
Using the method presented above, a series of P&IDs have been processed and analysed. This resulted in a collection of symbols that represent different types of equipment within the drawings. In total, a dataset of 2432 instances representing 39 different types of symbols was compiled. All symbols have been scaled to a standard size of 100 × 100 pixels. The dataset provides a rich source of information for evaluating various supervised machine learning algorithms. However, as can be seen in Figure 3, the dataset is hugely imbalanced. Some symbols, such as sensors, dominate the dataset, while others appear only once or are vastly underrepresented.
The imbalance between symbols is huge in some cases. For example, symbols of type Sensor appear 392 times in the dataset, while symbols such as Barred Tee and Ultrasonic Flow Meter appear only once. Similarly, Reducer appears in the dataset 285 times, while Control Valve Angle Choke appears only once.
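To make the scale of this imbalance concrete, the short sketch below computes the ratio between the largest and smallest classes using only the counts quoted above. It also derives inverse-frequency class weights with the common "balanced" heuristic (total instances divided by the number of classes times the class count). That weighting is an illustrative assumption and was not used in the experiments reported here.

```python
# Counts quoted in the text; the remaining classes are omitted.
counts = {
    "Sensor": 392,
    "Reducer": 285,
    "Barred Tee": 1,
    "Ultrasonic Flow Meter": 1,
    "Control Valve Angle Choke": 1,
}
total_instances = 2432
num_classes = 39

# Ratio between the most and least frequent of the listed classes.
imbalance_ratio = max(counts.values()) / min(counts.values())

# Share of the whole dataset taken by the dominant class.
sensor_share = counts["Sensor"] / total_instances

# Assumed heuristic: weight = n_samples / (n_classes * class_count),
# so rare classes receive proportionally larger weights.
class_weights = {
    name: total_instances / (num_classes * n) for name, n in counts.items()
}

print(f"imbalance ratio: {imbalance_ratio:.0f}:1")
print(f"Sensor share of dataset: {sensor_share:.1%}")
for name, weight in class_weights.items():
    print(f"{name}: weight {weight:.3f}")
```

Under this heuristic a class with a single instance receives a weight 392 times larger than that of Sensor, which shows how strongly a weighted loss would have to compensate for the rarest symbols.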

Classification Method
To provide base-line results on this imbalanced dataset of symbols, we use CNNs. CNNs [16] have made significant progress in recent years in many image-related tasks and in particular in image classification [30,20].
The network architecture used in this paper takes as input the 100 × 100 raw pixel values of the symbol, which are passed to a convolutional layer with 32 filters of size 3 × 3, followed by a 2 × 2 max pooling layer. Next come two convolutional layers followed by a 2 × 2 max pooling layer. This structure is then repeated twice, each time with two convolutional layers of 64 filters of size 3 × 3 followed by a max pooling layer. Finally, the network ends with a fully connected part composed of two hidden layers and an output layer of 39 units (one per class) with a softmax activation function. All convolutional layers in the network use ReLU activations. Dropout [29] was used in the fully connected layers with a rate of 0.1.
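As a rough aid to reading this description, the sketch below traces the feature-map size through one possible interpretation of the convolutional part of the network. Several details are assumptions rather than facts from this paper: stride-1 convolutions, the padding mode (both "same" and "valid" are shown), 32 filters in the second block, a single input channel, and the helper names conv_out, pool_out and trace_shapes.

```python
def conv_out(size, kernel=3, padding="same"):
    """Spatial size after a stride-1 convolution."""
    return size if padding == "same" else size - kernel + 1


def pool_out(size, window=2):
    """Spatial size after non-overlapping max pooling (floor division)."""
    return size // window


# One reading of the description: (layer type, number of filters).
# The filter count of the second block (32) is an assumption.
layers = [
    ("conv", 32), ("pool", None),
    ("conv", 32), ("conv", 32), ("pool", None),
    ("conv", 64), ("conv", 64), ("pool", None),
    ("conv", 64), ("conv", 64), ("pool", None),
]


def trace_shapes(input_size=100, channels=1, padding="same"):
    """Return (layer type, height, width, channels) after each layer."""
    size, shapes = input_size, []
    for kind, filters in layers:
        if kind == "conv":
            size, channels = conv_out(size, padding=padding), filters
        else:
            size = pool_out(size)
        shapes.append((kind, size, size, channels))
    return shapes


for shape in trace_shapes():
    print(shape)
```

Under the "same" padding assumption the last pooling layer yields a 6 × 6 × 64 feature map, whereas "valid" padding would shrink it to 2 × 2 × 64, so the padding choice noticeably changes the size of the input to the fully connected layers.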

Experiments and Results
A series of experiments were carried out to establish the validity and stability of the proposed CNN architecture.

Set up
The dataset was split into disjoint training, validation and testing sets. First, the dataset was split into training and testing sets where 80% of the data was used for training and the remaining 20% for testing. The training set was then split into training and validation sets with ratios of 90% and 10% of the remaining training set respectively. The CNN model was trained with a batch size of 64 for 25 epochs. These parameters were set empirically.
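The two-stage split can be sketched as follows, to show the approximate subset sizes it produces for 2432 instances. The shuffling seed, the rounding rule and the function name split_indices are assumptions, and the sketch uses a plain random split because the text does not state whether the split was stratified by class.

```python
import random


def split_indices(n, test_frac=0.2, val_frac=0.1, seed=0):
    """Shuffle indices, hold out a test set, then carve a validation
    set out of the remaining training data (hypothetical helper)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)

    n_test = round(n * test_frac)
    test_idx, remaining = indices[:n_test], indices[n_test:]

    n_val = round(len(remaining) * val_frac)
    val_idx, train_idx = remaining[:n_val], remaining[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(2432)
print(len(train_idx), len(val_idx), len(test_idx))
```

With these assumptions the test set holds 486 symbols. Note that any class with a single instance can fall into only one of the three subsets, so such a class is either never seen during training or never evaluated.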

Results & Discussion
On the training set, an accuracy of 99.8% was recorded, with only 2 symbols incorrectly classified. On the test set, results were slightly lower, with an accuracy of 95.3%. In other words, 23 symbols were incorrectly identified. Table 1 provides more details about performance across the different symbols using three different metrics: Precision, Recall, and F1-Score. A closer look at the results shows, as expected, that some of the minority class instances went completely undetected. For example, for the control valve symbol, which has only five instances in the whole dataset, the corresponding F1-score is zero. The same score can also be seen in Table 1 for the symbols 'Flange + Triangle' (17 instances in the whole dataset), 'Line Blindspacer' (4 instances only), and 'Valve Gate Through Conduit' (also only 4 instances in the whole dataset). Conversely, well represented symbols in the dataset were correctly classified with relatively high precision and recall. For example, the 'Reducer' F1-score is 1. Notice that 285 instances of reducers are present in the dataset. A similar performance can be observed for other majority class instances such as 'Sensor' (392 instances), 'Valve Ball' (173 instances in the dataset), and others.
These results are consistent with the literature and show clearly that the learning algorithm tends to be biased toward majority class instances. Despite this, it can be said that the CNN performed extremely well on the testing set, with an overall accuracy reaching 95.3%, and an average precision, recall, and F1-score of 0.785, 0.822, and 0.784 respectively across all symbols in the dataset.
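The gap between overall accuracy and the averaged per-class scores can be reproduced on a toy example. The sketch below computes per-class precision, recall and F1-score and then an unweighted (macro) average. The labels are invented for illustration and are not the paper's data, the helper names are hypothetical, and macro averaging is an assumption about how the averages above were obtained.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {class: (precision, recall, f1)}; undefined ratios are 0."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    metrics = {}
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of each metric over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy labels: the single 'Control Valve' instance is missed.
y_true = ["Sensor"] * 8 + ["Reducer"] * 3 + ["Control Valve"]
y_pred = ["Sensor"] * 8 + ["Reducer"] * 3 + ["Sensor"]

metrics = per_class_metrics(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f1 = macro_average(metrics)
print(f"accuracy={accuracy:.3f} macro-F1={macro_f1:.3f}")
```

In this toy case a single missed minority instance barely lowers accuracy but pulls the macro-averaged F1-score down sharply, because every class contributes equally to the average regardless of its size.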

Conclusions
In this paper, we presented a new multiclass imbalanced dataset for the research community. The dataset represents a collection of symbols extracted from P&IDs. Despite the importance of processing and analysing engineering drawings, to the best of our knowledge no such dataset exists in the public domain. We anticipate that donating this dataset to the research community will help researchers in the domain of machine learning, in particular those working on imbalanced-class classification, as well as researchers in the machine vision domain who are interested in processing and analysing engineering drawings. Future work will focus on handling this multiclass imbalance problem, where advanced methods such as GANs and other data augmentation techniques might be utilised to improve learning performance.