Deep Recurrent Neural Networks with Attention Mechanisms for Respiratory Anomaly Classification

In recent years, a variety of deep learning techniques have been adopted to provide AI solutions to problems in the medical field, one specific area being audio-based classification of medical datasets. This research aims to create a novel deep learning architecture for this purpose and explores several layer structures for audio classification. Specifically, bidirectional Long Short-Term Memory (BiLSTM) and Gated Recurrent Unit (GRU) networks, in conjunction with an attention mechanism, are implemented for chronic and non-chronic lung disease and COVID-19 diagnosis. We employ two audio datasets, i.e. the Respiratory Sound and the Coswara datasets, to evaluate the proposed model architectures for lung disease classification. The Respiratory Sound Database contains audio data for lung conditions such as Chronic Obstructive Pulmonary Disease (COPD) and asthma, while the Coswara dataset contains coughing audio samples associated with COVID-19. After comprehensive experimentation, the best-performing architecture, the proposed attention BiLSTM network (A-BiLSTM), achieves accuracy rates of 96.2% and 96.8% for the Respiratory Sound and the Coswara datasets, respectively. Our research indicates that the combination of the BiLSTM and the attention mechanism was effective in improving audio classification performance for the diagnosis of various lung conditions.


I. INTRODUCTION
Deep learning methods are now among the most prominent approaches in computing for tasks such as audio, video, and image classification [1]. In the field of medical imaging, a large number of deep learning methods have been demonstrated in a variety of studies. Example deployments in medical image classification include melanoma identification, diabetic retinopathy screening, and blood cancer detection. Such deep learning research has resulted in significant performance enhancement with respect to medical diagnosis [2]. There have also been studies on the audio classification of medical datasets. As an example, Convolutional Neural Networks (CNNs) and traditional machine learning methods, such as the Support Vector Machine (SVM), have been adopted in [3] for lung condition diagnosis using audio datasets. As one specific type of Recurrent Neural Network (RNN), the Long Short-Term Memory (LSTM) network is widely adopted for time series forecasting [4]. In a recent study, Kumar et al. [5] employed an LSTM model for heartbeat audio classification, which yielded an accuracy rate of 80% and outperformed all other machine learning methods utilised in their experiments. A variety of other existing studies have also indicated the efficiency of deep learning methods for audio classification tasks [6,7,8,9].
Motivated by the aforementioned studies, in this research we explore the use of deep learning models, specifically RNN architectures with attention mechanisms, for the classification of medical audio datasets for chronic and non-chronic lung diseases, as well as COVID-19 diagnosis. We evaluate the proposed attention RNN models using both the Respiratory Sound [6] and the Coswara [7] datasets. Specifically, the Respiratory Sound Database contains audio data for a total of six lung conditions such as Chronic Obstructive Pulmonary Disease (COPD) and asthma, while the Coswara dataset contains coughing audio samples associated with COVID-19. The empirical results indicate that the proposed models with attention mechanisms outperform the other deep learning architectures tested for diverse lung condition classification.

A. Related Studies for Audio Classification
Several existing studies have explored the combination of CNN and LSTM for audio classification. In particular, music genre classification has been intensively studied, with impressive results obtained using deep learning methods. As an example, Choi et al. [8] employed a Convolutional Recurrent Neural Network (CRNN), which is described as a CNN model with the last layers replaced with an RNN. The Million Song Dataset, consisting of numerous song clips, was employed for model evaluation for the classification of categories such as genre, mood, era, and instrument [9]. Their CRNN model was composed of six layers, namely four conv2d layers and two RNN layers. To utilise the dataset for model evaluation, features were extracted from the audio files using the Python package Librosa. The features extracted from the audio files are known as Mel-Frequency Cepstral Coefficients (MFCC), which are a logarithmic measure of the Mel magnitude spectrum and contain sufficient discriminating properties. This makes them efficient features for classifying audio datasets [10]. The empirical results of their study indicated that the proposed CRNN model consistently outperformed all three baseline CNN models for audio classification. In particular, for music tagging, the CRNN model outperformed the CNN model with 5 convolutional layers and 2 fully-connected layers on 44 of the 50 tags in the dataset, based on the AUC-ROC (Area Under the Receiver Operating Characteristic Curve) metric. However, their proposed CRNN model had the highest number of model parameters and was computationally costly.
Zheng et al. [11] demonstrated another CRNN model for gastrointestinal (GI) sound event detection. Their work employed a gastrointestinal sound dataset that includes 6 different types of body sounds, i.e. bowel sound, speech, snore, cough, groan, and rub. As in the study above, MFCC features were extracted from the audio files using the Python package Librosa. Their proposed CRNN model was made up of a 5-layer CNN, followed by a bidirectional Gated Recurrent Unit (BiGRU) layer and fully connected layers. Their work achieved promising results, with a mean F1 score of 81.06% for the detection of the aforementioned 6 categories of sounds and with two of the classes, i.e. speech and snore, yielding F1 scores over 90%.

B. Bidirectional RNN Architectures
In the above-mentioned existing studies, a BiGRU was utilised as a part of the implemented model. Since the goal of this research is to implement an RNN model in conjunction with attention mechanisms, the concept of a bidirectional RNN is worth exploring further. While a GRU is not an LSTM, it is very similar in functionality and design, with the main difference being that the GRU combines the "forget" and "input" gates into an "update" gate and adds a "reset" gate. This results in the GRU model having fewer parameters and generally a simpler architecture than that of an LSTM network [12].
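The parameter saving mentioned above can be illustrated with a short calculation. The sketch below assumes the textbook formulation in which each gate block has one input weight matrix, one recurrent weight matrix, and one bias vector; the input size of 13 MFCC coefficients is a hypothetical value chosen only for illustration, and some library implementations keep an extra bias vector, so exact counts may differ slightly.

```python
def recurrent_params(input_dim: int, units: int, gate_blocks: int) -> int:
    """Trainable weights for a gated recurrent layer.

    Each gate block holds input weights (units x input_dim),
    recurrent weights (units x units), and one bias vector (units).
    """
    return gate_blocks * (units * (input_dim + units) + units)


def lstm_params(input_dim: int, units: int) -> int:
    # Four blocks: input gate, forget gate, output gate, candidate cell state.
    return recurrent_params(input_dim, units, gate_blocks=4)


def gru_params(input_dim: int, units: int) -> int:
    # Three blocks: update gate, reset gate, candidate hidden state.
    return recurrent_params(input_dim, units, gate_blocks=3)


if __name__ == "__main__":
    n_mfcc, units = 13, 512  # n_mfcc is an assumed input size
    print("LSTM:", lstm_params(n_mfcc, units))
    print("GRU: ", gru_params(n_mfcc, units))
```

Under these assumptions a GRU layer carries three quarters of the parameters of an LSTM layer of the same width.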
However, one key distinction of our research is the use of bidirectional design for the LSTM and GRU models. A bidirectional RNN model has two RNN layers of the same type, for example, having two LSTM layers. These two layers ensure that the input features can be processed in both forward and backward directions. This enables the model to better obtain the relations among elements in the input sequence by using the information in both forward and backward directions [13].
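To make the forward and backward passes concrete, the following toy sketch runs a single-unit tanh RNN over a sequence in both directions and pairs the two states at every time step. The cell and its weights (`w_in`, `w_rec`) are hypothetical simplifications, not the LSTM or GRU cells used in this research.

```python
import math


def run_rnn(sequence, w_in=0.5, w_rec=0.8):
    """Run a toy one-unit tanh RNN and return its state at every step."""
    state, states = 0.0, []
    for x in sequence:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def run_bidirectional(sequence):
    """Pair forward and backward states for each time step."""
    forward = run_rnn(sequence)
    # Process the reversed sequence, then flip back to the original time order.
    backward = run_rnn(sequence[::-1])[::-1]
    return list(zip(forward, backward))


if __name__ == "__main__":
    # The only non-zero input arrives at the last step.
    for step, (fwd, bwd) in enumerate(run_bidirectional([0.0, 0.0, 1.0])):
        print(step, round(fwd, 4), round(bwd, 4))
```

At the first time step the forward state is still zero because the event lies in the future, whereas the backward state is already non-zero; this is the additional context that a bidirectional layer provides.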
In addition, Chen and Li [14] demonstrated a CNN-BiLSTM model for the classification of emotions embedded in music. The dataset adopted in their work consisted of 2000 audio song samples from the Last.fm tag subset of the Million Song Dataset [9], with 500 song samples for each of the following emotion classes, i.e. anger, happiness, relaxation, and sadness. As in the aforementioned studies, MFCC features were extracted from the audio files and adopted for model training. Their study indicated that the proposed CNN-BiLSTM method achieved an average accuracy rate of 68% across the four classes, while the baseline models, i.e. CNN-LSTM, CNN, and LSTM, achieved 63%, 59%, and 50%, respectively, for music emotion classification. This again suggests that the use of BiLSTM layers can increase classification performance. As the use of bidirectional RNN methods was shown to be advantageous in [11] and [14], the concept has been further explored in our research.

C. Attention Mechanisms
Another concept that has been explored intensively in various studies as part of the RNN architectures is the attention mechanism, which has shown very positive results in areas such as speech recognition and natural language processing (NLP). The attention mechanism provides an adaptive ability to learn the relationship of each of the input features at several time steps to predict the current time step [15].
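As an illustration of the idea, the sketch below implements a generic dot-product attention over a sequence of hidden states: each time step is scored against a query vector, the scores are normalised with a softmax, and the states are combined using the resulting weights. This is one common formulation and an assumption on our part; it is not necessarily the exact mechanism of [15] or of the models proposed later, and the example vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Return (weights, context) for dot-product attention.

    hidden_states: list of per-time-step vectors.
    query: vector with the same dimensionality as each hidden state.
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]
    return weights, context


if __name__ == "__main__":
    states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three time steps
    weights, context = attention_pool(states, query=[2.0, 0.0])
    print(weights, context)
```

The time step whose state aligns best with the query receives the largest weight and therefore dominates the context vector.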
In the work of Zhang et al. [16], a convolutional RNN architecture with an attention mechanism, namely ACRNN, was proposed. Attention for both CNN and RNN layers was investigated. In particular, their work focused on determining both the effectiveness of the attention mechanism and the position at which it should reside within the model. Their work was evaluated using the environmental audio datasets ESC-50 and ESC-10, which consist of 50 and 10 classes, respectively [17]. The empirical results indicated that the attention mechanism provided a significant increase in accuracy, with an increase of over 2% for both datasets. It was also found that the attention mechanism was best suited to increasing classification accuracy when placed at layers 2 and 10 of their CRNN network. These findings indicate that an attention mechanism implemented within a deep CRNN model is beneficial for audio classification and is worth further investigation.

III. THE PROPOSED DEEP NETWORKS WITH ATTENTION MECHANISMS
In this research, we propose two bidirectional LSTM and GRU networks incorporating attention mechanisms, namely A-BiLSTM and A-BiGRU, for chronic and non-chronic lung conditions and COVID-19 diagnosis. We introduce the key procedures, such as the feature extraction from audio inputs and the proposed deep learning models, in detail below.

A. Feature Extraction
As indicated in the aforementioned existing studies, the first step is to extract the features from audio inputs. The method of choice is the extraction of the MFCC features, which again can be achieved through the use of the Python package Librosa [18].
Several aspects need to be taken into account in the pre-processing stage. The first is to ensure that the model has enough input features, which can be achieved by splitting the audio files into segments. The number and length of the segments depend on the sample rate and the length of each audio clip, both of which need to be determined before the audio input is split.
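The splitting step can be sketched as follows. The function name, the choice to drop a short remainder at the end of a clip, and the example sample rate and segment length are all assumptions made for illustration; the actual values used in the experiments are not restated here.

```python
def split_into_segments(samples, sample_rate, segment_seconds):
    """Cut a signal into equal-length segments.

    A trailing remainder shorter than one segment is dropped so that
    every segment yields the same number of feature frames.
    """
    seg_len = int(sample_rate * segment_seconds)
    if seg_len <= 0:
        raise ValueError("segment length must cover at least one sample")
    n_segments = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segments)]


if __name__ == "__main__":
    # Hypothetical clip: 12 s of silence at 22,050 Hz, split into 5 s segments.
    clip = [0.0] * (22050 * 12)
    segments = split_into_segments(clip, sample_rate=22050, segment_seconds=5)
    print(len(segments), len(segments[0]))
```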
Following the splitting stage, the MFCC features are extracted from each of the segments and then appended to a dictionary together with the class label. To achieve this, certain variables need to be decided upon, such as the window size for the Fast Fourier Transform (FFT) algorithm and the hop length. The FFT algorithm is typically used to convert a signal from its original domain, which in this case is time, to a representation in the frequency domain. In the context of MFCC, FFT is applied to every frame to calculate the frequency spectrum. This is conducted through the process called the Short-Time Fourier Transform (STFT), from which the power spectrum is then calculated. Once the power spectrum is calculated, triangular filters spaced on the Mel scale are applied to it to extract frequency bands. Using these frequency bands, the Mel frequency is computed through the use of the formula below [20].
The formula in Equation (1) converts a frequency f in hertz to the Mel scale, i.e.
$$m = 1127 \ln\left(1 + \frac{f}{700}\right) \qquad (1)$$
The constant 1127 follows from the use of the natural logarithm (ln) together with the corner frequency of 700 hertz; for this type of formula the corner frequency is typically between 600 and 1000 hertz. This constant multiplies the natural logarithm of 1 plus the frequency in hertz (f) divided by the corner frequency of 700.
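The mapping described above, a constant of 1127 multiplied by the natural logarithm of 1 plus f divided by 700, can be checked numerically. The function names below are our own, and the inverse mapping is included only as a convenience; it follows by solving the same relation for f.

```python
import math

CORNER_HZ = 700.0   # corner frequency from the text
SCALE = 1127.0      # constant paired with the natural logarithm


def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to the Mel scale."""
    return SCALE * math.log(1.0 + f_hz / CORNER_HZ)


def mel_to_hz(mel: float) -> float:
    """Invert hz_to_mel by solving the same relation for the frequency."""
    return CORNER_HZ * (math.exp(mel / SCALE) - 1.0)


if __name__ == "__main__":
    for f in (0.0, 440.0, 1000.0, 4000.0):
        print(f, round(hz_to_mel(f), 2))
```

With these constants, 1000 hertz maps to approximately 1000 mel, which is the usual anchor point of the scale. The same curve is often written as 2595 multiplied by the base-10 logarithm of the same argument.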
The hop length, combined with the FFT window size, determines how many frames are taken from each segment. The default values of these variables are 2048 for the FFT window size and 512 for the hop length, which, for simplicity, are used in this research. Following the extraction of the MFCC features from the audio files, the features of each segment are appended to a JSON file, which is used as the input file for training the model architectures.
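The extraction pipeline described in this subsection can be sketched as below. The FFT window size of 2048 and the hop length of 512 come from the text; the sample rate, segment length, number of coefficients, record layout, and all function names are assumptions for illustration. Librosa is imported inside the extraction function so that the frame-count helper can be used without it.

```python
import json


def frames_per_segment(n_samples: int, hop_length: int = 512) -> int:
    """Frame count for a centred STFT, the convention Librosa uses by default."""
    return 1 + n_samples // hop_length


def extract_mfcc_records(path, label, sample_rate=22050, segment_seconds=5.0,
                         n_mfcc=13, n_fft=2048, hop_length=512):
    """Split one audio file into segments and return labelled MFCC records."""
    import librosa  # third-party dependency, imported lazily

    signal, sr = librosa.load(path, sr=sample_rate)
    seg_len = int(sr * segment_seconds)
    records = []
    # Step through the signal one full segment at a time; a short tail is skipped.
    for start in range(0, len(signal) - seg_len + 1, seg_len):
        segment = signal[start:start + seg_len]
        mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=n_mfcc,
                                    n_fft=n_fft, hop_length=hop_length)
        # Transpose to (frames, coefficients) so that time is the first axis.
        records.append({"label": label, "mfcc": mfcc.T.tolist()})
    return records


def save_records(records, json_path):
    """Write all records to a single JSON file used as training input."""
    with open(json_path, "w") as handle:
        json.dump(records, handle)


if __name__ == "__main__":
    # A hypothetical 5 s segment at 22,050 Hz.
    print(frames_per_segment(22050 * 5))
```

Under these assumed settings each 5-second segment yields 216 frames, so every record has the same shape when it reaches the recurrent layers.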

B. The Proposed Model Architectures
In this research, we propose bidirectional RNN models with an attention mechanism for audio classification, i.e., chronic and non-chronic lung conditions and COVID-19 identification via breathing, coughing, and voice recordings. The two specific types of RNN chosen for model construction are LSTM and GRU. Determining the specific structure of the networks involves rigorous testing of the settings of each layer's parameters, as well as of the hyperparameters used during the training process. Table I below describes the architectures of the proposed BiLSTM and BiGRU models with the attention mechanisms. As shown in Table I, the A-BiLSTM and A-BiGRU architectures have the same structure, with the only difference being the type of RNN, i.e., LSTM or GRU, implemented in layers 1 and 3. The choice of implementing the attention mechanism in the second layer of each model was influenced by the suggestion in [16], which demonstrated that the attention mechanism was best suited to increasing model accuracy when placed at layer 2 or 10 of the network. Because the final layer of both architectures is a dense layer, the only remaining option was to implement the mechanism at layer 2.
In addition, the first layer of the model is the aforementioned BiLSTM or BiGRU layer, with 512 hidden units. Because the layer is bidirectional, the effective number of hidden units is doubled to 1024. The number of hidden units was determined through rigorous testing, which involved experimenting with different numbers of neurons.
Following the attention layer, a unidirectional LSTM or GRU layer is implemented with 512 units, i.e. half the effective width of the first layer. Following the third layer, a regular dense layer with the 'relu' activation function was found to be the most effective choice. The next layer is a dropout layer, which reduces the overfitting that may occur during the training of neural networks [21].
The two final layers are dense layers, the first having 128 units and the last being the fully connected output layer consisting of 6 or 2 units, i.e. the number of classes in the employed dataset. The specific classes for both datasets are explained in the next section. The fully connected output layer uses the 'softmax' activation function. To determine the effectiveness of the attention layer, testing also involved training both model architectures without the attention mechanism, as shown in Table II.
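The layer ordering described above can be sketched with the Keras functional API. This is an illustrative reconstruction under several assumptions rather than the exact configuration of Table I: the built-in dot-product `Attention` layer applied as self-attention stands in for the attention mechanism, the 128-unit dense layer is read as the 'relu' layer that precedes dropout, and the dropout rate and function names are hypothetical. TensorFlow is imported inside the builder so that the small helper runs without it.

```python
def output_units(dataset: str) -> int:
    """Number of softmax units: six lung conditions or two COVID-19 classes."""
    return {"respiratory": 6, "coswara": 2}[dataset]


def build_attention_birnn(n_frames, n_mfcc, n_classes, cell="lstm", dropout_rate=0.3):
    """Sketch of A-BiLSTM (cell='lstm') or A-BiGRU (cell='gru')."""
    from tensorflow.keras import Model, layers  # third-party, imported lazily

    rnn = layers.LSTM if cell == "lstm" else layers.GRU
    inputs = layers.Input(shape=(n_frames, n_mfcc))
    # Layer 1: bidirectional RNN, 512 units per direction (1024 outputs per step).
    x = layers.Bidirectional(rnn(512, return_sequences=True))(inputs)
    # Layer 2: attention; here the sequence attends to itself (assumed variant).
    x = layers.Attention()([x, x])
    # Layer 3: unidirectional RNN that keeps only its final state.
    x = rnn(512)(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)  # rate is a hypothetical value
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inputs, outputs)


if __name__ == "__main__":
    print(output_units("respiratory"), output_units("coswara"))
```

A model for the Coswara cough data would then be created with, for example, `build_attention_birnn(216, 13, output_units("coswara"))`, where 216 frames and 13 coefficients are assumed input dimensions. Removing the attention line gives the counterpart without attention used for comparison.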

C. Model Training
As mentioned previously, the layers of the models and the various training hyperparameters were optimised through rigorous experimentation.
The first choice to be made for the training and test processes is the train, validation, and test split of the dataset. Several divisions were tested, and the splits finally chosen are shown in Table III below. These splits were found to be the most effective for optimising model performance and accuracy on the respective datasets.
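A three-way split of this kind can be sketched as below. The fractions used here are hypothetical placeholders rather than the values of Table III, and splitting each class separately (stratification) is our assumption; it keeps class proportions similar across the three subsets.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac=0.15, test_frac=0.15, seed=0):
    """Return (train, val, test) index lists, splitting each class separately."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_frac)
        n_val = round(len(indices) * val_frac)
        test += indices[:n_test]
        val += indices[n_test:n_test + n_val]
        train += indices[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Class counts taken from the Coswara subset described in the evaluation.
    labels = ["positive"] * 95 + ["negative"] * 100
    train, val, test = stratified_split(labels)
    print(len(train), len(val), len(test))
```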
Because the two networks, i.e. A-BiLSTM and A-BiGRU, are fundamentally different, with the GRU being a simpler variation of the LSTM, the hyperparameters identified as optimal for the two networks differed considerably. Tables IV-V below present the identified optimal model settings. As shown in Tables IV-V, the GRU models required far fewer training epochs, as well as a comparatively larger learning rate, to achieve optimal performance, which reflects the comparatively simple nature of the networks themselves.
Moreover, Figs. 1-2 show examples of the training and validation losses for the Coswara dataset, with respect to A-BiLSTM and A-BiGRU, respectively. As demonstrated in Figs. 1-2, no notable overfitting occurred during the training process, which suggests that the models generalise well to the validation data. We discuss the evaluation details in the following section.

IV. EVALUATION
To ensure a comprehensive model evaluation, both the Respiratory Sound Database and the Coswara dataset are used in our experiments to give a good indication of the proficiency and effectiveness of the proposed models.
The Respiratory Sound Database was chosen for investigation in this research mainly because three of the top ten leading causes of death globally are respiratory diseases [21]. With early detection being crucial for preventing deaths from such diseases, a more convenient and accurate method of diagnosis could offer a vital tool to medical professionals working in the respiratory field [22]. Specifically, the dataset contains 920 recordings from 126 subjects whose conditions include Healthy, Upper Respiratory Tract Infections (URTI), Chronic Obstructive Pulmonary Disease (COPD), Bronchiolitis, Pneumonia and Bronchiectasis. In other words, a total of six lung conditions are taken into account for model evaluation. All 920 audio recordings have been employed in our experiments. With numerous classes to distinguish and the sounds being recorded with a range of stethoscopes, achieving high classification accuracy on this dataset is a challenging task.
We also employ the Coswara dataset to test model efficiency. As the contents of the dataset are strictly COVID-19 related, the primary reason for selecting it is to research and provide a possible solution that could help alleviate the current worldwide pandemic.
The most widely used method of testing for COVID-19 is currently the RT-PCR test. While it is the most effective method of testing at this moment in time, it also has several issues, such as cost, scalability, and the nature of the test violating social distancing [23]. Providing a more convenient, cost-effective, and scalable method of diagnosis would deliver a crucial service, allowing more people to be tested daily and ultimately helping to control the pandemic.
The Coswara dataset itself is open access and consists of a growing number of respiratory audio recording classes that include coughing, breathing, and voice recordings. The subsection focussed on in this research is the cough class. This subsection can be classified into three categories, i.e. healthy subjects, subjects who have COVID-19, and subjects who have a respiratory disease that is not COVID-19.
The two classes that are included in this study are healthy subjects and subjects who have COVID-19, in other words, positive and negative cases for COVID-19 diagnosis. A total of 95 positive and 100 negative cases are employed for model evaluation.
Tables VI-VII below illustrate the performance of the four RNN models on both datasets. The results indicate that the best performing model is the A-BiLSTM network for undertaking audio classification for both datasets.

V. DISCUSSIONS
From the inspection of the empirical results, several observations can be made. The main finding is that for both datasets, as indicated in Tables VI-VII, the A-BiLSTM model was the best performing method in terms of accuracy, i.e., it obtains accuracy rates of 96.2% and 96.8% for the Respiratory Sound and the Coswara datasets, respectively. Similarly, for both datasets, the LSTM models outperform the respective GRU models for each of the experiments.
One possible explanation for both of the above observations is that the more complex nature of the LSTM, with its more elaborate gating structure, gives the network a better capability to learn bidirectional temporal dependencies on large MFCC datasets. Moreover, as seen in Figs. 1-2, the number of epochs required to achieve optimal performance on the Coswara dataset was far lower for the A-BiGRU model than for the A-BiLSTM model.
As discussed previously, this is most likely owing to the simpler nature of the A-BiGRU model. Therefore, if the efficiency or size of the model is an essential criterion, for example when deploying a model on a mobile device, then the GRU would be preferable.
Another key observation concerns the impact of the attention mechanism on performance. Not only does the highest performing model for both datasets include the attention mechanism, but all models with an attention mechanism show higher performance than their counterparts without it, with the difference in test accuracy ranging from 1.8% to 3% for the Respiratory Sound Database and from 1.4% to 2.2% for the Coswara dataset.
Although the improvements are not transformative, they are notable and significant enough to be worthy of implementation, especially considering that the test accuracy of 96.8% for the Coswara dataset was not reached by any model without the attention mechanism. Although no models tested in this particular study consist purely of unidirectional LSTM/GRU layers, we can refer to the results from [5], which demonstrated that a unidirectional LSTM model tested on a heartbeat audio dataset achieved an accuracy rate of 80%.
Although the dataset tested in that study is not the same as those in this research, the datasets are similar in nature, as they are all medical audio datasets. Therefore, our experimental results of around 96% on both selected datasets indicate that bidirectional RNNs may be better suited to achieving higher performance on audio classification tasks.
Another conclusion that can be drawn from Figs. 3-6 and Tables VIII-XI is that both the confusion matrices and the detailed experimental results illustrate how the A-BiLSTM model significantly outperforms A-BiGRU for the classification of each class with respect to both datasets.
Specifically, the results for the Respiratory Sound Database in Tables VIII-IX show that the A-BiLSTM model outperforms the A-BiGRU network for the classification of nearly all the categories. In particular, the A-BiLSTM model yielded substantially better results for categories such as Healthy (94.3%), Pneumonia (97.2%) and URTI (94.8%), when compared with the A-BiGRU model, which yielded 87.3%, 93.3% and 88.8% for Healthy, Pneumonia and URTI, respectively.
Similarly, for the Coswara dataset, according to the experimental results shown in Tables X-XI, the A-BiLSTM model again outperforms the A-BiGRU model for the prediction of both positive and negative classes.
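Per-class comparisons of this kind are typically derived from a confusion matrix. The sketch below shows one way to compute the matrix, the overall accuracy, and the per-class recall; the labels and predictions are made-up toy values, and whether the per-class figures reported above are recall or another metric is not assumed here.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes and columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_recall(matrix):
    """Correct predictions for each class divided by that class's sample count."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


if __name__ == "__main__":
    classes = ["positive", "negative"]
    y_true = ["positive", "positive", "negative", "negative"]  # toy labels
    y_pred = ["positive", "negative", "negative", "negative"]  # toy predictions
    cm = confusion_matrix(y_true, y_pred, classes)
    print(cm, accuracy(cm), per_class_recall(cm))
```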
Ultimately, our conclusion is that the specific architecture implemented for the A-BiLSTM model has achieved impressive results and is certainly worth exploring and experimenting with further.

VI. CONCLUSION
In this research, we have implemented novel RNN architectures with attention mechanisms for the classification of medical audio datasets, i.e., the Respiratory Sound and the Coswara datasets. The experimental results on both datasets show that the proposed A-BiLSTM network, which consists of a BiLSTM layer, an attention layer, a unidirectional LSTM layer, and several dense layers, is the best-performing of all the tested methods. The A-BiLSTM model achieves test accuracy rates of over 96% for both datasets, which indicates that the combination of a BiLSTM network and an attention mechanism is beneficial for improving audio classification performance.