research-article
Open Access

An Expert System for Indian Sign Language Recognition Using Spatial Attention–based Feature and Temporal Feature

Published: 09 March 2024


Abstract

Sign Language (SL) is the only means of communication for hearing-impaired people. Hearing people have difficulty understanding SL, resulting in a communication barrier between the hearing-impaired and the hearing community. However, the Sign Language Recognition System (SLRS) has helped to bridge this communication gap. Many SLRS have been proposed for recognizing SL; however, a limited number of works have been reported for Indian Sign Language (ISL). Most of the existing SLRS focus on global features rather than the Region of Interest (ROI). Focusing more on the hand region and extracting local features from the ROI improves system accuracy. The attention mechanism is a widely used technique for emphasizing the ROI. However, only a few SLRS have used attention methods: they employed the Convolution Block Attention Module and temporal attention, but Spatial Attention (SA) has not been utilized in previous SLRS. Therefore, a novel SA-based SLRS named the Spatial Attention-based Sign Language Recognition Module (SASLRM) is proposed to recognize ISL words for emergency situations. SASLRM recognizes ISL words by combining convolution features from a pretrained VGG-19 model and attention features from an SA module. The proposed model achieved an average accuracy of 95.627% on the ISL dataset. The proposed SASLRM is further validated on the LSA64, WLASL, and Cambridge Hand Gesture Recognition datasets, where it reached accuracies of 97.84%, 98.86%, and 98.22%, respectively. The results indicate the effectiveness of the proposed SLRS in comparison with existing SLRS.


1 INTRODUCTION

Sign Language (SL) is the only means of communication for hearing-impaired people. According to the World Health Organization, approximately 63 million Indians are hearing impaired [16]. However, very few people are well trained in or can understand SL, which creates a communication barrier between the hearing and deaf communities. Employing a human interpreter is a traditional approach to resolving this issue. However, well-trained professional human interpreters are not easily available, and they are not cost-effective. Besides this, the signer must compromise confidentiality. Therefore, a Sign Language Recognition System (SLRS) can be an effective choice for reducing the communication barrier between the two communities. The objective of an SLRS is to provide a platform that facilitates the communication process. Since the early 2010s, considerable advancements have taken place in the evolution of SLRS. Nevertheless, these systems lack a unified standard due to the existence of numerous variations within sign language recognition tasks. In broad terms, these tasks can be categorized into two main types, static and dynamic sign language recognition, depending on the representation of gestures. The static sign language recognition task revolves around identifying gestures without any associated motion, whereas dynamic sign language recognition involves the identification of gestures that incorporate motion. The dynamic category can be further divided into isolated word recognition and continuous sentence recognition [13]. State-of-the-art approaches for static SL recognition have achieved remarkable performance. In contrast, dynamic SL recognition is still a largely unexplored area of research, specifically for Indian Sign Language (ISL). Dynamic sign language recognition operates on a video containing a series of sequential gestures, aiming to produce an output of either a single word or a sequence of words.

In recent times, numerous SLRS have been introduced to address dynamic SL recognition tasks, with a predominant focus on leveraging deep learning methodologies. Recent contributions in the realm of deep learning encompass works by Cihan et al. [11], Cui et al. [12], Das et al. [15], and Koller [29]. Modern techniques for dynamic SL recognition encompass the utilization of Convolutional Neural Networks (CNN) and transformer-based SLRS. A prevalent approach within SLRS involves the application of Transfer Learning (TL), which entails the utilization of pretrained CNN models to classify individual frames or to serve as feature extractors. This involves adapting a model previously trained on a related task to the current context. The efficacy of this approach depends on various strategies, including determining the number of layers to transfer, fine-tune, or freeze. Some existing SLRS have effectively incorporated TL principles using pretrained networks. For instance, Adithya et al. [57] proposed an SLRS model that employs a pretrained GoogleNet as a feature extractor in conjunction with a Bi-directional Long Short-Term Memory (BiLSTM) network to classify ISL words in the agricultural domain. Similarly, Aparna et al. [7] utilized an InceptionV3 network for feature extraction and employed a stacked Long Short-Term Memory (LSTM) network for classifying ISL words. In contrast with the previous two approaches, the SLRS model proposed in Reference [58] used the features from the final convolution layer of VGG-16 instead of the dense layer, because the last pooling layers extract the most salient features from the sign frames, which contain local information. All the aforementioned SLRS have demonstrated acceptable performance. Nevertheless, they exhibit notable limitations, notably in handling a substantial volume of frames that often encompass redundant and extraneous data. Sign language videos are characterized by a continuous stream of frames, frequently captured at a high sampling rate by cameras. Consequently, the elevated sampling rate results in a considerable volume of frames within a sign video, necessitating substantial hardware resources for processing. Furthermore, the presence of redundant or irrelevant data between successive frames impairs the accuracy of the system. Another problem with the existing SLRS [7, 57, 58] is that they focus on global features rather than the Region of Interest (ROI). In SL frames, most of the frame consists of background information and only a small part covers the hand region. Thus, focusing more on the hand region and extracting local features from the ROI can increase the performance of the system. Therefore, an SLRS named Spatial Attention-based Sign Language Recognition Module (SASLRM) is proposed in this article to address the aforementioned issues. SASLRM incorporates a novel architecture that uses a Histogram Difference (HD)–based keyframe selection module and a Convolution with Spatial Attention (CSA) module. In addition, the proposed model uses a BiLSTM network to incorporate temporal features for classifying the sign words. The keyframe selection module resolves the problem of excessive processing by eliminating redundant or useless frames. The CSA module extracts features from the keyframes, in which convolution along with Spatial Attention (SA) extracts salient local features.
SASLRM uses the last pooling layer of VGG-19 as a convolution base, which has higher discriminability and is faster to train in deep learning tasks [31, 52]. At the same time, the attention module, as a deeper layer, assists in capturing spatial relationships during the training process for better discrimination. The main contributions of this proposed work are as follows:

A novel SLRS combining VGG-19 and an SA module. Since the proposed model uses the features from the last convolution layer together with an SA module, it can capture more salient local features from sign frames.

A Histogram Difference–based keyframe selection technique that eliminates redundant and useless frames, which enhances the effectiveness of the proposed model.

The rest of the article is organized as follows. Section 2 presents related work on SL recognition. Section 3 discusses the proposed methodology for the ISL recognition task. Section 4 presents the experimental analysis. Finally, Section 5 concludes the article.


2 RELATED WORK

SL is not a uniform language; it varies according to country and region. There are around 300 SLs in use worldwide, including American Sign Language (ASL), Chinese Sign Language, British Sign Language, ISL, and so on [1]. An SLRS recognizes SL and converts it into meaningful words or expressions for the hearing community. An SL word consists of a sequence of gestures. Thus, SLRS are inextricably linked to the gesture recognition or action recognition problem. Researchers have developed several models for recognizing SL and hand gestures over the years, and these models are classified into hardware-based and vision-based models. Models based on hardware are less complex than models based on vision. However, they have many disadvantages, including the fact that the signer faces difficulty performing gestures, and hardware-based SLRS are not cost-effective, making them unsuitable for real-world sign language recognition tasks. Therefore, vision-based solutions are preferred for SL recognition to address these issues. The proposed work focuses on vision-based techniques. This section discusses recent advances in SL recognition systems.

2.1 SLRS Models without CNN

The traditional approach to SL recognition without CNN consists of several basic steps, such as hand detection, hand tracking, segmentation, feature extraction, and classification. Hand motion, shape, orientation, posture, and location are some of the basic features used for SL recognition in traditional SLRS models. In SLRS, relevant feature extraction is critical, because irrelevant features lead to misclassification [5]. Some of the widely used feature extraction methods are the Scale-Invariant Feature Transform, Speeded-Up Robust Features, Histogram of Oriented Gradients, Principal Component Analysis (PCA), Linear Discriminant Analysis, convex hull, convexity defects, Local Binary Patterns, and so on [27, 30, 34, 56, 62]. The feature extraction step helps the classifier by reducing computation and overfitting. The most commonly used classifiers in SLRS are the Support Vector Machine (SVM), K-Nearest Neighbour, Random Forest (RF), and Hidden Markov Model (HMM) [21, 59]. HMM-based classifiers perform better in the case of dynamic SL recognition tasks. Ong et al. [42] proposed a Sequential Pattern Tree-based multiclass classifier for recognizing Greek Sign Language, and their proposed model outperformed HMM. Yang et al. [61] suggested a CRF- and SVM-based SLRS for recognizing ASL with manual and non-manual features. BoostMap embeddings are used for hand shape, segmentation is done using a hierarchical CRF, and recognition is done using SVM. Lim et al. [33] proposed an SLRS for isolated sign word recognition that uses an optical flow feature with the absolute difference between training and test samples for classification. Chai et al. [10] suggested another SLRS in which they proposed a novel Grassmann Covariance Matrix to encode long sequences and an HMM for classification. It has been observed that single features like shape or motion are insufficient for SL recognition. Thus, a combination of features can be utilized to improve recognition performance. Zaki et al. [63] proposed an SLRS that incorporates hand shape, hand orientation, place of articulation, and hand movement. They used PCA to determine hand shape and orientation, kurtosis position to determine articulation, and the Motion Chain Code to determine hand movement. For classification, they used an HMM. Traditional SLRS models have performed satisfactorily in some cases. However, selecting appropriate features is challenging, and the performance of the SLRS depends on various factors such as background and efficient and accurate hand detection and segmentation. Therefore, researchers have moved toward CNN-based models for dynamic SL recognition tasks.

2.2 SLRS Models with CNN

Deep learning (DL) models eliminate the extensive image pre-processing steps and overcome the obstacles of traditional SLRS models without CNN [48]. Deep neural network architectures, such as CNN, learn high-level features automatically from image sequences and remove the problems associated with traditional feature extraction steps. DL models extract salient features from each object present in an image. Researchers have proposed numerous DL-based SLR models over the years, including three-dimensional (3D) CNN with keyframes [22], a combination of 2D and 3D CNN [64], an attention-based model [19], and a DL architecture with skeleton features [47]. Researchers have used various feature extraction techniques with various models, and the encoder-decoder model performed best, with a CNN as the encoder and an LSTM as the decoder. Likhar et al. [32] proposed a continuous SLR system that employs the CNN-LSTM architecture. They evaluated the proposed model with 10 dynamic gestures and reported an F1 measure of 99.20%. The SLRS proposed by Jayadeep et al. [25] is capable of recognizing dynamic ISL words in the banking sector. They used a pretrained InceptionV3 as a feature extractor and employed an LSTM network for gesture classification. The model achieved an 85% recognition rate. Many researchers have employed custom LSTM networks to improve model accuracy. Mittal et al. [40] proposed a continuous SLR system based on a custom LSTM. The model can recognize ISL sentences with 72.3% accuracy and isolated ISL words with 89.5% accuracy. Aly et al. [6] developed a new SLR system for dynamic SL recognition that uses a BiLSTM network. They showed that the BiLSTM network outperforms the standard LSTM network.

CNN models have performed better than SLRS without CNN. However, they fail in some cases where the background is nonuniform and complex. In such cases, accuracy can be enhanced by narrowing the focus to the region of interest. Trainable attention mechanisms are one way to achieve this. For SL recognition, many researchers have proposed attention-based SLRS. Zhang et al. [65] proposed an SLRS that used global-local attention with a C3D network and achieved an accuracy of 91% with better generalization performance. Huang et al. [24] proposed another method for improving generalization ability by combining a Convolution Block Attention Module with a convolutional self-encoding network and achieved an accuracy of 89.90%. Pan et al. [43] suggested an attention-based SLR network for signer-independent SL recognition using keyframe sampling and skeleton features. However, they used temporal attention in their SLRS. Rahim et al. [44] introduced an innovative approach within the domain of SLRS. Their system incorporated ROI identification for individual sign frames. The feature extraction phase involved extracting features from segmented and original images, followed by their concatenation into a unified feature vector. The classification aspect leveraged a CNN. Sing [55] proposed a dynamic SL recognition solution centered around a 3D CNN network. This network comprised three 3D convolution layers, three pooling layers, fully connected layers, and dropout layers. Achieving an overall accuracy of 88.24%, this model showcased promising results. Adhikary et al. [2] devised a pose-based SLRS methodology employing MediaPipe for pose feature extraction. Various classifiers, including Decision Tree, RF, and Gradient Boosting, were employed for classification. Notably, the highest accuracy of 97.4% is attained using the RF classifier. Nonetheless, a drawback is noted, as MediaPipe encountered difficulties in detecting landmarks for overlapping hands. Muneer et al. [4] introduced an optimized C3D network for SLRS. Their model employed two separate C3D networks for distinct tasks. The first network concentrated on learning features from the hand region, while the second network captured features from the entire body. These features are concatenated, and a Multi-Layer Perceptron is utilized for classification. Inefficiency of the model and the absence of temporal modeling were highlighted as drawbacks. Aparna et al. [7] adopted a hybrid CNN-LSTM approach to develop an SLRS for ISL. Their methodology involved utilizing InceptionV3 as a pretrained CNN model to extract spatial features from sequential frames. Sequence learning is achieved using a stacked LSTM network. Areeb et al. [8] developed a DL-based SLRS with a focus on recognizing ISL words relevant to emergency situations. Three distinct models were proposed: two for classification and one for detection. The first classification model employed a 3D CNN, while the second model combined VGG-16 for feature extraction and LSTM for classification. Their detection model integrated YOLOv5. Notably, accuracy rates of 82% and 98% were reported for the first and second classification models, respectively, while the detection model reached 99% accuracy. However, a limitation is observed in their evaluation approach, as they used only a random subset of data for testing and employed random sampling for frame selection, potentially resulting in information loss.

From the literature, it can be concluded that SLRS with CNN outperform SLRS without CNN, and that a CNN with an attention mechanism can improve model performance. Existing SLRS have utilized various attention mechanisms, such as channel attention, global-local attention, and temporal attention. However, the spatial attention mechanism, which can improve the recognition accuracy of an SLRS, has not been explored in previous models.


3 METHODOLOGY OF THE PROPOSED SASLRM

The proposed SASLRM is based on a pretrained VGG-19 DL model. VGG-19 is preferred in the proposed model because of its low-level feature extraction capability using a small kernel size, which is appropriate for the SL recognition task. The proposed work utilizes the concept of fine-tuning, which is one of the TL techniques. The VGG-19 model is initialized with weights pretrained on ImageNet for the fine-tuning process because of the limited data [18]. This helps to reduce the overfitting issue and the random weight initialization problem. The SASLRM model comprises three essential components: (i) keyframe selection, (ii) feature extraction, and (iii) classification. The comprehensive design of the SASLRM is visually represented in Figure 1. The subsequent sections provide an in-depth exploration of each core module.


Fig. 1. Architecture of proposed SASLRM.

3.1 Keyframe Selection

The selection of keyframes stands as a fundamental principle within the SASLRM model. SL videos inherently encompass a multitude of frames, with only a limited subset harboring significant information. The majority of frames derived from SL videos contain superfluous and repetitive data. The integration of keyframe selection serves to eliminate frames bearing such redundant details, thereby improving both the accuracy and efficiency of the system. To this end, the proposed SLRS incorporates a method for keyframe selection that employs the HD technique. The operational protocol of this algorithm unfolds as follows. Initially, the HD between successive frames is computed, and the threshold value is determined from the resultant distance sequence. In the next phase, the first frame of the sequence is chosen as the candidate keyframe, and the HD between the candidate keyframe and the subsequent frame is computed and compared against the threshold. If the computed value surpasses the threshold, then the current frame is selected as a keyframe and becomes the new candidate keyframe. This iterative process spans the entire sequence of frames. The outlined procedure is encapsulated in Algorithm 1.
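For concreteness, the following is a minimal Python sketch of the procedure above, assuming OpenCV grayscale histograms and a mean-plus-standard-deviation threshold rule; the article states only that the threshold is derived from the distance sequence, so the exact threshold formula here is an assumption.

```python
import cv2
import numpy as np

def histogram_difference(frame_a, frame_b, bins=256):
    """Sum of absolute differences between grayscale intensity histograms."""
    hist_a = cv2.calcHist([cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)], [0], None, [bins], [0, 256])
    hist_b = cv2.calcHist([cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)], [0], None, [bins], [0, 256])
    return float(np.sum(np.abs(hist_a - hist_b)))

def select_keyframes(frames):
    """HD-based keyframe selection: derive a threshold from the successive
    frame differences, then keep frames whose HD from the current candidate
    keyframe exceeds that threshold."""
    diffs = [histogram_difference(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    threshold = np.mean(diffs) + np.std(diffs)  # assumed rule; the article only says the threshold comes from this sequence
    candidate = frames[0]                       # first frame is the initial candidate keyframe
    keyframes = [candidate]
    for frame in frames[1:]:
        if histogram_difference(candidate, frame) > threshold:
            keyframes.append(frame)
            candidate = frame                   # the selected keyframe becomes the new candidate
    return keyframes
```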

3.2 Feature Extraction

Feature extraction is an essential part of the proposed SASLRM. At this stage, the CSA module is used for feature extraction. The CSA module consists of the convolution module and the SA module. Figure 2 represents the CSA module. A pretrained VGG-19 network is employed as the convolution module, where the final pooling layer of VGG-19 serves as the convolutional base, and the SA module is added afterwards to extract more salient features from the sign images. The convolutional base, together with the SA module and dense layers, is trained on the SL dataset. After that, the trained network is used as a feature extractor in the proposed SASLRM model. The detailed architecture is given in Table 1.

Layer type | Output shape
VGG-19 model | 7 \(\times\) 7 \(\times\) 512
Lambda (average pooling) layer | 7 \(\times\) 7 \(\times\) 1
Lambda (max pooling) layer | 7 \(\times\) 7 \(\times\) 1
Concatenation layer | 7 \(\times\) 7 \(\times\) 2
Convolution layer | 7 \(\times\) 7 \(\times\) 1
Concatenation layer | 7 \(\times\) 7 \(\times\) 513
Flatten layer | 25,137
Dense layer | 4,096
Dense layer | 4,096
Dense layer | 8
Total number of parameters: 119,578,730
Trainable parameters: 119,578,730
Nontrainable parameters: 0

Table 1. Detailed Architecture of Feature Extraction Network


Fig. 2. Architecture of CSA module.

3.2.1 Convolution Module.

According to the literature, VGG-19 has demonstrated outstanding performance on various benchmark datasets, including the ImageNet dataset. As a feature extractor, the VGG-19 network has consistently yielded superior results compared to other models across most tasks [17]. VGG-19 adopts a uniform structure and employs small 3 \(\times\) 3 convolution filters, which are instrumental in extracting features from local regions and enhancing the ability to learn the spatial features required for SL recognition tasks. Furthermore, the proposed model utilizes features from the convolution layer, facilitating the extraction of features from local regions [52]. Compared to other pretrained CNN models like ResNet50, InceptionV3, and GoogleNet, VGG-19 has a comparatively lower dimension in its convolution-layer features. This characteristic makes the VGG-19-based model more effective for the SL recognition task. Therefore, the proposed SASLRM uses the pretrained VGG-19 model as the convolution module, where all the layers of VGG-19 are frozen except for the last convolution layer. The last convolution layer of VGG-19 extracts important low-level features that are useful for SL recognition tasks. During the training process, individual video frames are resized to 224 \(\times\) 224 and fed to the pretrained VGG-19. The last pooling layer of VGG-19 gives an intermediate output of size 7 \(\times\) 7 \(\times\) 512, which is passed to the CSA module. The CSA module concatenates the output of the SA module with this intermediate output, giving a feature map of size 7 \(\times\) 7 \(\times\) 513 for each frame. This is then passed to a flatten layer, which produces a feature vector of size 25,137 for each frame. Finally, the feature vector sequences of the respective videos are forwarded to the classification stage for classifying sign words.
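As an illustration of the fine-tuning setup described above, the sketch below loads the ImageNet-pretrained VGG-19 in Keras (matching the TensorFlow/Keras environment reported in Section 4.2) and freezes every layer except the last convolution layer; block5_conv4 and block5_pool are the Keras names for that layer and the final pooling layer.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19

# VGG-19 convolution base: ImageNet weights, no dense head, 224x224x3 input.
conv_base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze all layers except the last convolution layer (block5_conv4),
# as described for the convolution module.
for layer in conv_base.layers:
    layer.trainable = (layer.name == "block5_conv4")

# The network ends at the final pooling layer (block5_pool), so its output is
# the 7x7x512 intermediate feature map that is passed on to the CSA module.
frame = tf.random.uniform((1, 224, 224, 3))   # one resized, rescaled frame
features = conv_base(frame)                   # shape: (1, 7, 7, 512)
```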

3.2.2 Spatial Attention Module.

This module is used in the proposed model to focus on the ROI of the sign frames. The SA module attends to the informative parts of the sign frames, which makes SA different from channel attention and temporal attention. The SA module follows the SA concept proposed in Reference [60]. The SA module performs average-pooling and max-pooling operations on the output of the last pooling layer of the VGG-19 model. The two resulting vectors from max pooling and average pooling are concatenated and passed to a convolution layer with a sigmoid activation function, where the kernel size is set to 7 \(\times\) 7. The resulting attention map \(M_s(F)\) is defined in Equation (1), (1) \(\begin{equation} M_s(F) = \sigma \left(f^{7\times 7}\left(\left[F_{\text{avg}}^s;F_{\text{max}}^s \right] \right) \right), \end{equation}\) where \(f^{7 \times 7}\) denotes the convolution operation with a filter size of 7 \(\times\) 7, \(F_{\text{avg}}^s\) denotes the 2D feature vector from the average pooling operation, and \(F_{\text{max}}^s\) denotes the 2D feature vector from the max pooling operation. \(F_{\text{avg}}^s \in \mathbb {R}^{1 \times H \times W}\) and \(F_{\text{max}}^s \in \mathbb {R}^{1 \times H \times W}\), where \(F\), \(H\), and \(W\) denote the input feature map, its height, and its width, respectively. The incorporation of SA into the CSA module significantly enhances the performance of the SASLRM. SA enables the model to selectively focus on essential regions or features within the input data while disregarding irrelevant information. It effectively suppresses noise and distractions, making the model more robust, especially in cluttered or distracting input scenarios. Moreover, SA reduces the computational burden by directing resources to where they are most needed, which improves efficiency. This attention mechanism also aids generalization, allowing the model to identify common patterns across various examples and thereby enhancing its accuracy when dealing with new and unseen data. In summary, SA's integration into the CSA module elevates the SASLRM's performance by improving focus, robustness, efficiency, and generalization capabilities. Figure 3 illustrates the SA mechanism.
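A minimal Keras sketch of Equation (1) and of the concatenation and flattening steps listed in Table 1 is shown below; the use of Lambda layers for channel-wise pooling follows Table 1, while details such as "same" padding are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def spatial_attention(feature_map, kernel_size=7):
    """Spatial attention of Equation (1): channel-wise average and max pooling,
    concatenation, then a 7x7 convolution with a sigmoid activation."""
    avg_pool = layers.Lambda(lambda x: tf.reduce_mean(x, axis=-1, keepdims=True))(feature_map)  # 7x7x1
    max_pool = layers.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(feature_map)   # 7x7x1
    pooled = layers.Concatenate(axis=-1)([avg_pool, max_pool])                                  # 7x7x2
    return layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(pooled)          # attention map M_s(F), 7x7x1

# CSA module: concatenate the attention map with the VGG-19 feature map,
# giving the 7x7x513 tensor that is flattened into a 25,137-dimensional vector.
vgg_features = layers.Input(shape=(7, 7, 512))
csa = layers.Concatenate(axis=-1)([vgg_features, spatial_attention(vgg_features)])  # 7x7x513
flattened = layers.Flatten()(csa)                                                   # 25,137
csa_module = models.Model(vgg_features, flattened)
```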


Fig. 3. Spatial Attention module.

3.3 Classification

The features from the CSA module are given as input to the classification module, which consists of a BiLSTM layer, a fully connected layer, and a softmax classifier. Table 2 gives the detailed architecture of the classification module.

Layer type | Output shape
CSA model | 25,137
BiLSTM layer | 1,024
Dense layer | 256
Dense layer | 8
Total number of parameters: 105,126,152
Trainable parameters: 105,126,152
Nontrainable parameters: 0

Table 2. Detailed Architecture of Classification Module

3.3.1 BiLSTM Layer.

Following the extraction of features, the subsequent objective revolves around harnessing this extracted information to capture temporal insights from the sequential sign frames. Traditional deep neural networks fall short in achieving this due to their incapacity to grasp sequential patterns. The challenge at hand is effectively tackled by a Recurrent Neural Network (RNN), as demonstrated by Senthil et al. [53]. The feedback mechanism of RNNs equips them to retain historical data. A specialized form of RNN, known as LSTM, is extensively employed for addressing sequential input problems. One of LSTM's core strengths lies in mitigating two key issues of standard RNNs: vanishing and exploding gradients. It excels in preserving extensive patterns across sequences. The architecture of an LSTM module, depicted in Figure 4, underscores its distinctive ability to incorporate various gates that facilitate operations such as the addition or removal of information.


Fig. 4. Block diagram of the LSTM cell.

Equations (2) through (7) provide an elaborate exposition of the operations within the LSTM module, (2) \(\begin{equation} F_k = \sigma (w_F [h_{k-1}, x_k] + b_F), \end{equation}\) (3) \(\begin{equation} I_k = \sigma (w_I [h_{k-1}, x_k] + b_I), \end{equation}\) (4) \(\begin{equation} \acute{c_k} = \tanh (w_C [h_{k-1}, x_k] + b_C), \end{equation}\) (5) \(\begin{equation} c_k = F_k \cdot c_{k-1} + I_k \cdot \acute{c_k}, \end{equation}\) (6) \(\begin{equation} o_k = \sigma (w_O [h_{k-1}, x_k] + b_O), \end{equation}\) (7) \(\begin{equation} h_k = o_k \cdot \tanh {c_k}, \end{equation}\) where \(F_k\) signifies the output of the forget gate, \(I_k\) represents the output of the input gate, \(\acute{c_k}\) and \(c_k\) denote the candidate and updated cell states, \(o_k\) is the output of the output gate, and \(h_k\) denotes the final output (hidden state). \(w_F\), \(w_I\), \(w_O\), and \(w_C\) correspond to the weight matrices, while \(b_F\), \(b_I\), \(b_O\), and \(b_C\) pertain to the bias terms.
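The following NumPy sketch transcribes Equations (2) through (7) directly; the input and hidden dimensions and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_k, h_prev, c_prev, w, b):
    """One LSTM step following Equations (2)-(7); w and b hold the weight
    matrices and bias vectors for the forget (F), input (I), candidate (C),
    and output (O) paths."""
    z = np.concatenate([h_prev, x_k])          # [h_{k-1}, x_k]
    F_k = sigmoid(w["F"] @ z + b["F"])         # forget gate, Eq. (2)
    I_k = sigmoid(w["I"] @ z + b["I"])         # input gate, Eq. (3)
    c_cand = np.tanh(w["C"] @ z + b["C"])      # candidate cell state, Eq. (4)
    c_k = F_k * c_prev + I_k * c_cand          # cell state update, Eq. (5)
    o_k = sigmoid(w["O"] @ z + b["O"])         # output gate, Eq. (6)
    h_k = o_k * np.tanh(c_k)                   # hidden state, Eq. (7)
    return h_k, c_k

# Illustrative dimensions: 4-dimensional input, 3 hidden units.
rng = np.random.default_rng(0)
w = {gate: rng.standard_normal((3, 7)) for gate in "FICO"}   # 7 = 3 hidden + 4 input
b = {gate: np.zeros(3) for gate in "FICO"}
h_k, c_k = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), w, b)
```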

Different types of LSTMs, such as BiLSTM, ConvLSTM, LSTM, and stacked LSTM, are employed for the task of sequence learning. The choice of a particular LSTM adaptation depends on the requirements of the specific application. Drawing from prior research and our own experimentation, the BiLSTM network emerges as the most effective option for SL recognition. This is attributed to BiLSTM’s capability to process input sequences both forwards and backwards, enabling predictions based on future context. Thus, in SASLRM, the BiLSTM layer is utilized to grasp sequential information. The suggested model incorporates a BiLSTM layer featuring 512 neurons.

3.3.2 Fully Connected Layer.

Before the final classification layer, a dense layer is used with 256 neurons as shown in Table 2.

3.3.3 Softmax Layer.

The softmax layer is used to classify the sign words based on probability scores. The number of categories in the dataset determines the number of units in the softmax layer. For classification, the softmax layer generates a multinomial distribution of probability scores. Equation (8) is used to calculate the probability score, (8) \(\begin{equation} p(A=C|B) = \frac{e^{B_k}}{\sum _j e^{B_j}}, \end{equation}\) where \(B_k\) denotes the input to the softmax unit of class \(k\).
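A minimal Keras sketch of the classification module in Table 2 follows, using the optimizer, learning rate, and loss from Table 4; the ReLU activation of the 256-unit dense layer is an assumption, since the article does not specify it.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8        # eight ISL word categories
FEATURE_DIM = 25137    # flattened CSA feature vector per keyframe

# Classification module (Table 2): BiLSTM over the per-keyframe CSA features,
# a 256-unit dense layer, and a softmax output layer (Equation (8)).
classifier = models.Sequential([
    layers.Input(shape=(None, FEATURE_DIM)),          # variable number of keyframes per video
    layers.Bidirectional(layers.LSTM(512)),           # 512 units per direction -> 1,024-dim output
    layers.Dense(256, activation="relu"),             # activation assumed
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

classifier.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
```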


4 EXPERIMENTAL ANALYSIS

4.1 Dataset

The experimentation is conducted utilizing the ISL dataset, which is openly accessible [3]. This section provides a concise overview of the dataset employed in this study.

Dataset description: The dataset comprises ISL words associated with dynamic gestures pertinent to emergency scenarios. It encompasses a total of eight categories: “Accident,” “Call,” “Doctor,” “Help,” “Hot,” “Lose,” “Pain,” and “Thief.” These terms are frequently used to convey urgent messages or seek aid during critical situations. Each class contains 52 videos, with the exception of the “Lose” and “Thief” categories, which contain 50 videos each. A diverse set of contributors, comprising 26 adults (12 males and 14 females aged between 22 and 26 years), participated in generating the SL video clips. This variation is introduced deliberately to enhance the diversity of the dataset. Sample frames from the dataset are presented in Figure 5. Further insights into the dataset's composition are provided in Table 3.

Category | Participant number | Total instances
Accident | 26 | 52
Call | 26 | 52
Doctor | 26 | 52
Help | 26 | 52
Hot | 26 | 52
Lose | 25 | 50
Pain | 26 | 52
Thief | 25 | 50
Total number of instances: 412
Total frames: 31,264
Longest video duration: 4 s
Shortest video duration: 1.53 s
Resolution: 500 \(\times\) 600

Table 3. Overview of the Dataset


Fig. 5. Extracted frame samples of ISL dataset for categories (a) Accident, (b) Call, and (c) Lose.

4.2 Experimental Setup and Implementation Details

The experimentation procedure is conducted using Python 3.9, TensorFlow, and Keras on a workstation featuring an Intel Xeon processor and 64 GB of DDR5 RAM. The fundamental parameters for training the model are outlined in Table 4.

Parameters | Settings
Image size | 224 \(\times\) 224
Batch size | 32
Epochs | 100, 20
Optimizer | Adam
Learning rate | 0.001
Loss function | Categorical cross-entropy
Rescale | 1/255

Table 4. Parameters Settings

To select hyperparameters, we conducted several experiments. However, due to limited computational resources, we utilized standard values for certain hyperparameters. For instance, the learning rate for the Adam optimizer is set to the default value of 0.001. Additionally, the number of neurons in the fully connected layer and the number of epochs are set to 256 and 20, respectively [8, 57]. To determine the optimal batch size and number of neurons in the BiLSTM layer for the SASLRM model, we performed further experiments. Our observations revealed that using a batch size of 32 and 512 neurons in the BiLSTM layer resulted in the best outcomes. Table 5 provides the outcomes of these experiments.

Batch size | Number of neurons in BiLSTM layer | Accuracy (%)
8 | 256 | 93.68
8 | 512 | 95.14
8 | 1,024 | 95.14
16 | 512 | 96.11
32 | 512 | 96.60
64 | 512 | 94.65

Table 5. Accuracy for Different Batch Sizes and Number of Neurons

From the results, it can be observed that, for the same batch size, increasing the number of neurons beyond 512 does not improve performance; thus, the number of hidden units is fixed at 512 for the subsequent batch sizes.
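For illustration, a hedged sketch of the search summarized in Table 5 is given below; `build_classifier` and the training and validation arrays are hypothetical placeholders for the classification-network constructor and the extracted CSA feature sequences.

```python
# Hyperparameter search over the (batch size, BiLSTM width) pairs of Table 5.
# `build_classifier`, `train_x`, `train_y`, `val_x`, and `val_y` are placeholders.
configs = [(8, 256), (8, 512), (8, 1024), (16, 512), (32, 512), (64, 512)]

results = {}
for batch_size, units in configs:
    model = build_classifier(units)                     # BiLSTM width under test
    history = model.fit(train_x, train_y,
                        batch_size=batch_size, epochs=20,
                        validation_data=(val_x, val_y))
    results[(batch_size, units)] = max(history.history["val_accuracy"])

best_batch_size, best_units = max(results, key=results.get)  # reported best: 32 and 512
```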

4.3 Results and Discussion

4.3.1 Results of Keyframe Selection Algorithm.

The actual frames and extracted keyframes for the “Call” sign are shown in Figure 6. The “Call” sign consists of a total of 75 frames, but only 10 frames were obtained after executing the keyframe selection algorithm. Figure 7 provides a comparison between the actual number of frames and the number of keyframes extracted for each class. The HD values used to calculate the threshold for a “Call” sign video sample are presented in Table 6, where the obtained threshold value is 6082.4980.

Value of \(n\) | \(d = F_n - F_{n+1}\) | Value of \(n\) | \(d = F_n - F_{n+1}\) | Value of \(n\) | \(d = F_n - F_{n+1}\)
\(n=1\) | 1217.764 | \(n=26\) | 3086.269 | \(n=51\) | 3944.813
\(n=2\) | 2574.648 | \(n=27\) | 2651.318 | \(n=52\) | 4319.377
\(n=3\) | 2538.348 | \(n=28\) | 4093.014 | \(n=53\) | 4114.301
\(n=4\) | 4402.682 | \(n=29\) | 6486.309 | \(n=54\) | 1859.803
\(n=5\) | 5469.165 | \(n=30\) | 6443.643 | \(n=55\) | 5143.418
\(n=6\) | 2804.609 | \(n=31\) | 5239.489 | \(n=56\) | 5951.384
\(n=7\) | 1995.546 | \(n=32\) | 1381.968 | \(n=57\) | 7916.884
\(n=8\) | 3243.938 | \(n=33\) | 15371.96 | \(n=58\) | 4767.201
\(n=9\) | 6265.107 | \(n=34\) | 3845.832 | \(n=59\) | 5206.903
\(n=10\) | 6228.337 | \(n=35\) | 2949.526 | \(n=60\) | 2978.777
\(n=11\) | 5221.185 | \(n=36\) | 4468.579 | \(n=61\) | 2952.717
\(n=12\) | 4803.998 | \(n=37\) | 1883.7 | \(n=62\) | 3885.778
\(n=13\) | 1683.613 | \(n=38\) | 1807.858 | \(n=63\) | 2823.877
\(n=14\) | 3340.464 | \(n=39\) | 7349.98 | \(n=64\) | 4358.893
\(n=15\) | 4035.786 | \(n=40\) | 5706.246 | \(n=65\) | 5372.001
\(n=16\) | 3388.099 | \(n=41\) | 4770.412 | \(n=66\) | 1945.707
\(n=17\) | 4827.764 | \(n=42\) | 2423.231 | \(n=67\) | 1256.903
\(n=18\) | 3347.618 | \(n=43\) | 1943.406 | \(n=68\) | 1469.536
\(n=19\) | 2907.632 | \(n=44\) | 1532.749 | \(n=69\) | 5468.828
\(n=20\) | 1300.703 | \(n=45\) | 4699.361 | \(n=70\) | 6333.01
\(n=21\) | 3035.459 | \(n=46\) | 7206.831 | \(n=71\) | 3773.547
\(n=22\) | 2897.902 | \(n=47\) | 5837.176 | \(n=72\) | 2950.8
\(n=23\) | 2809.976 | \(n=48\) | 3770 | \(n=73\) | 1595.268
\(n=24\) | 4238.255 | \(n=49\) | 3883.96 | \(n=74\) | 4624.214
\(n=25\) | 1683.664 | \(n=50\) | 2875.263 | |

Table 6. Frame Difference Values to Calculate Threshold for Sample “Call” Sign Video


Fig. 6. Actual frames and keyframes from the “Call” sign video.


Fig. 7. Comparative analysis between the actual number of frames and the extracted keyframes.

The dataset contains a total of 31,264 frames extracted from 412 videos. After execution of the keyframe selection algorithm, this number reduces to 4,956.

4.3.2 Results of the Proposed SASLRM.

For model evaluation, a repeated twofold cross-validation (CV) methodology is employed. The ISL dataset forms the foundation for assessing the effectiveness of the proposed SASLRM model. Initially, the dataset is evenly partitioned into two halves, with 50% allocated for training and the remaining half for testing. Following this initial arrangement, the roles of the training and testing sets are swapped, and the average performance is computed over this exchange. This procedure is repeated five times to derive a final average performance. The proposed model's performance is gauged through classification outcomes on the ISL dataset. In the process of selecting the appropriate classification network, various options, including LSTM, BiLSTM, stacked BiLSTM, and ConvLSTM, are assessed alongside the base VGG-19 model. Among these alternatives, the BiLSTM network displayed superior performance. This is ascribed to the BiLSTM's capability to process input sequences in both forward and backward directions, thereby enabling predictions grounded in future context. A comparative analysis of the outcomes is provided in Table 7.
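A minimal sketch of this protocol using scikit-learn's RepeatedKFold (two splits, five repetitions, so each half serves once as the test set in every repetition) is shown below; `build_saslrm`, `features`, and `labels` are hypothetical placeholders for the full model constructor and the prepared keyframe feature sequences and one-hot labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Repeated twofold cross-validation: a 50-50 split whose halves are swapped,
# repeated five times; the reported accuracy is the mean over all folds.
rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=42)
accuracies = []
for train_idx, test_idx in rkf.split(features):
    model = build_saslrm()                                        # placeholder constructor
    model.fit(features[train_idx], labels[train_idx], epochs=20, batch_size=32)
    _, acc = model.evaluate(features[test_idx], labels[test_idx], verbose=0)
    accuracies.append(acc)

print(f"Average accuracy: {np.mean(accuracies):.4f} (SD {np.std(accuracies):.4f})")
```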

Method | Precision | Recall | F1-score | Accuracy (%) | SD (\(\sigma\)) of accuracy
Keyframe + VGG-19 + LSTM | 0.9357 | 0.9334 | 0.9332 | 93.289 | 0.8719
Keyframe + VGG-19 + BiLSTM | 0.9497 | 0.9470 | 0.9465 | 94.654 | 0.4354
Keyframe + VGG-19 + Stacked BiLSTM | 0.9143 | 0.9059 | 0.9050 | 91.594 | 1.09
Keyframe + VGG-19 + ConvLSTM | 0.9489 | 0.9455 | 0.9434 | 94.508 | 0.8894

Table 7. Average Results from Five Repetitions for Various Classification Networks with the VGG-19 Model

The findings clearly indicate that the BiLSTM network combined with VGG-19 exhibited superior performance compared to other models in various metrics such as precision, recall, F1-score, and accuracy. Consequently, the BiLSTM network is selected for the proposed SASLRM. The achieved performance metrics for the proposed model include an average precision of 0.9591, recall of 0.9556, F1-score of 0.9558, and an average accuracy of 95.627%. A comprehensive depiction of the proposed SASLRM’s performance using the ISL dataset is presented in Table 8.

Repetition | Precision | Recall | F1-score | Accuracy (%) | Average accuracy (%) | SD (\(\sigma\)) of accuracy
1st | 0.9574 | 0.9493 | 0.9540 | 95.38 | 95.627 | 0.6335
2nd | 0.9675 | 0.9663 | 0.9660 | 96.60 | |
3rd | 0.9538 | 0.9495 | 0.9491 | 94.90 | |
4th | 0.9540 | 0.9518 | 0.9515 | 95.14 | |
5th | 0.9630 | 0.9614 | 0.9586 | 96.11 | |

Table 8. Performance of the Proposed SASLRM on ISL Dataset

Confusion matrices pertaining to the first and second folds of the initial repetition are illustrated in Figure 8. A careful examination of these matrices reveals that the proposed model demonstrates accurate classification for the “Lose,” “Pain,” and “Thief” classes. However, the model does encounter challenges in accurately classifying the “Call” and “Hot” signs, leading to frequent misclassifications in these cases.


Fig. 8. Confusion matrices of the first and second folds of the first repetition.

4.3.3 Comparison with Existing Approaches on ISL Dataset.

The outcomes of the proposed SASLRM are contrasted with those of several existing models on the ISL dataset. The comparative results are summarized in Table 9. Upon analysis, it is evident that the model presented in Reference [7] attained the lowest accuracy of 69%, as it solely relies on global features derived from a pretrained InceptionV3 network. In contrast, the model introduced in Reference [3] achieved an accuracy of 90% by employing the Discrete Wavelet Transform (DWT), which is adept at extracting local features due to its localization property. Another deep learning model proposed in the same study [3] reached an accuracy of 96.25%. This model employed GoogleNet for feature extraction and BiLSTM for classification, with 2,000 neurons in the BiLSTM network. It is essential to emphasize that the model proposed by Adithya et al. [3] reported an accuracy of 96.25%. However, it is crucial to note the substantial difference in the evaluation criteria employed for this model. Adithya et al. conducted their evaluation using a random 50-50 train-test split, while the proposed SASLRM employed a repeated twofold cross-validation technique. When both models are assessed under the same experimental setup, the model proposed by Adithya et al. achieves an average accuracy of 93.88%, which is lower than the average accuracy of 95.627% attained by the proposed SASLRM. Moreover, the model proposed by Adithya et al. used data augmentation on the complete dataset; unlike image data augmentation, video data augmentation is limited by the temporal nature of video, can generate redundant data, and requires a significant amount of computational resources. Thus, the proposed model does not incorporate any data augmentation technique. Besides, the model proposed in Reference [3] used 2,000 neurons in the BiLSTM layer, which makes it inefficient. The proposed SASLRM achieves satisfactory performance without any data augmentation and with only 512 neurons in the BiLSTM layer, which gives the suggested approach an advantage. The 3D CNN model presented in Reference [8] attained an accuracy of 82%, which is comparatively lower than various CNN-LSTM-based models. This can be attributed to the fact that the 3D CNN model is trained from scratch. On another note, a distinct model introduced in the same study [8] achieved an accuracy of 98%; however, the evaluation criteria employed for that model differ significantly. It was evaluated using a random 80-20 train-test split, where the number of test samples is very small, while the proposed model is evaluated using a repeated twofold cross-validation technique. When both models are implemented and evaluated under the same experimental setup, the model proposed in Reference [8] achieves an average accuracy of 94.21%, which is lower than the average accuracy of 95.627% attained by the proposed SASLRM. Furthermore, the model proposed in Reference [8] used uniform sampling for keyframe selection, where frames are extracted from the video at fixed time intervals. Therefore, there is always a chance of information loss, specifically if the video contains a large number of key gestures. In contrast, the proposed SASLRM uses a Histogram Difference–based keyframe selection technique, which makes it more advantageous. In Reference [14], an average accuracy of 94.42% has been achieved. This accomplishment is realized through a blend of local handcrafted features and CNN features. The classification aspect employed a stacked BiLSTM. Nevertheless, the inclusion of the stacked BiLSTM results in a decrease in accuracy.

Author | Technique | Precision | Recall | F1-score | Accuracy (%)
Adithya et al. [3] | DWT + multiclass SVM (50-50 random split) | 0.9127 | 0.8999 | 0.9025 | 90
Adithya et al. [3] | GoogleNet + BiLSTM (50-50 random split) | 0.9638 | 0.9624 | 0.9625 | 96.25
Aparna et al. [7] | InceptionV3 + stacked LSTM (50-50 split) | — | — | — | 69
Areeb et al. [8] | 3D CNN (80-20 random split) | 0.85 | 0.8362 | 0.8262 | 82
Areeb et al. [8] | VGG-16 + LSTM (80-20 split) | 0.9750 | 0.9775 | 0.9737 | 98
Das et al. [14] | VGG-19 + handcrafted features + stacked BiLSTM (50-50 split) | 0.9281 | 0.9199 | 0.9218 | 94.42
Proposed | VGG-19 with spatial attention + BiLSTM (repeated twofold CV) | 0.9591 | 0.9556 | 0.9558 | 95.627

Table 9. Performance Comparison with Previous Approaches on ISL Dataset

The SASLRM model produces an average accuracy of 95.627%, where the average is taken over five repetitions of twofold cross-validation. The average results are compared with those of the existing SLRS, and the proposed model outperforms most of them. The suggested approach acquires better accuracy than the existing models because the SASLRM model uses the CSA module, which helps to extract more salient features from the local hand region by combining SA and convolution features.

4.3.4 Classwise Analysis.

The classwise analysis of the proposed SASLRM on the ISL dataset is discussed in this section. For this analysis, precision, recall, and F-score values are used. Equations (9), (10), and (11) calculate the precision, recall, and F-score, respectively, (9) \(\begin{equation} \text{Precision} = \frac{{T_P}}{{T_P + F_P}}, \end{equation}\) (10) \(\begin{equation} \text{Recall} = \frac{{T_P}}{{T_P + F_N}}, \end{equation}\) (11) \(\begin{equation} \text{F-Score} = \frac{{2 \times (\text{Precision} \times \text{Recall})}}{{\text{Precision} + \text{Recall}}}, \end{equation}\) where \(T_P\), \(F_P\), and \(F_N\) are the true-positive, false-positive, and false-negative counts, respectively. The average values are reported in Table 10. A low precision value indicates many false-positive cases, whereas a low recall value indicates many false-negative cases. It can be observed that the class “Call” has the lowest precision value, meaning it has the most false-positive cases, which indicates that many other gesture classes are misclassified as “Call.” The “Call” class also has the lowest recall value, which indicates that the word “Call” is frequently misclassified as other gesture classes. Nevertheless, the proposed method overall classifies each class with a high recognition rate.

Class name | Precision | Recall | F1-score
Accident | 0.9922 | 0.9768 | 0.9843
Call | 0.9141 | 0.8931 | 0.8983
Doctor | 0.9630 | 0.9502 | 0.9517
Help | 0.9528 | 0.9620 | 0.9559
Hot | 0.9817 | 0.9197 | 0.9422
Lose | 0.9810 | 0.9885 | 0.9861
Pain | 0.9449 | 0.9768 | 0.9604
Thief | 0.9466 | 0.9892 | 0.9717

Table 10. Detailed Classification Reports on ISL Dataset
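For reference, the per-class metrics of Equations (9) through (11) can be reproduced with scikit-learn, as in the brief sketch below; `y_true` and `y_pred` are placeholders for the ground-truth and predicted class indices collected over the test folds.

```python
from sklearn.metrics import classification_report

CLASS_NAMES = ["Accident", "Call", "Doctor", "Help", "Hot", "Lose", "Pain", "Thief"]

# Per-class precision, recall, and F1-score (Equations (9)-(11)).
print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=4))
```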

4.3.5 Error Analysis.

From the average classification report, it can be observed that the “Call” sign has the lowest F1-score. This section discusses the misclassified samples of the “Call” sign.

Most of the “Call” sign video samples suffer from the self-occlusion problem, which is one of the major reasons for the high misclassification rate of the “Call” sign. Figure 9 shows the keyframes of some of the misclassified video samples of the “Call” sign. Furthermore, it has been observed that the model gets confused due to inter-class similarity and intra-class dissimilarity. From Figure 10, it can be observed that the “Lose” and “Pain” signs have similarities with the “Call” sign in terms of gesture position. Thus, most of the “Call” sign videos are misclassified as “Lose” and “Pain,” and some of the samples are misclassified as “Thief.” Specifically, for video samples 2 and 3, it can be observed that the keyframe selection algorithm fails to extract all the key gesture positions, which may be one of the probable reasons for misclassification.


Fig. 9. Keyframes of misclassified video samples for the “Call” sign.


Fig. 10. Keyframes of “Lose,” “Pain,” and “Thief” video samples.

4.3.6 Validation of SASLRM with Standard Datasets.

Regrettably, there is a lack of standard, accessible ISL datasets in the public domain. Numerous researchers have crafted their own datasets for the task; however, these datasets are either only partially accessible or not publicly available. In light of this, the proposed model undergoes validation via testing against three benchmark datasets.

Dataset 1: The LSA64 dataset stands out as a publicly accessible, extensive dataset used for the recognition of Argentine SL. This dataset is widely used for evaluating SLRS. The dataset encompasses a total of 3,200 videos spanning 64 classes, derived from 10 non-expert subjects. Within these classes, 22 are performed with both hands, while the remaining classes employ a single hand for performing the SL gestures. To evaluate the proposed model, the dataset is partitioned in an 80-20 ratio, and the average outcome across five repetitions is computed. The LSA64 dataset has been widely adopted by various researchers for the evaluation of their proposed SLRS. The outcome of the proposed model is depicted in Table 11 alongside a comparison with certain existing methods on the LSA64 dataset. It is discernible from the results that the proposed method achieves enhanced results in comparison to existing methods on the LSA64 dataset, with the exception of Reference [20]. Nonetheless, it is crucial to highlight that Reference [20] utilized a subset of the dataset comprising 40 classes.

Author | Technique | Number of classes | Accuracy (%)
Ramos et al. [41] | 3D CNN | 64 | 93.9 (avg. of 3 repetitions)
Ronchetti et al. [50] | Random transform with HMM+GMM | 64 | 95.95 (avg. of 30 repetitions)
Rodriguez et al. [49] | Cumulative shape difference + SVM | 64 | 85
Masood et al. [39] | Inception CNN + LSTM | 64 | 95.20
Shah et al. [54] | Inception CNN + BiLSTM | 64 | 96
Elsayed et al. [20] | 3D CNN + LSTM | 40 | 98.50
Proposed | Keyframe + VGG-19 with spatial attention + BiLSTM | 64 | 97.84 (avg. of 5 repetitions)

Table 11. Comparison with Existing Methods on LSA64

Dataset 2: Dynamic Hand Gesture Recognition (HGR) is very similar to this task. Thus, the SASLRM is also validated on a benchmark HGR dataset, the Cambridge Hand Gesture dataset [28]. The dataset consists of 900 dynamic gesture videos of nine classes, and each class contains 100 samples. There are three hand shapes in the dataset (V, flat, and spread) as well as three motions (left, right, and contract). The dataset was prepared with the help of two individuals under different illumination conditions. The dataset is split into two halves, with 450 video samples used for training and the remaining 450 video samples used for testing. The experiment is repeated twice with different training and test data, and the average results are reported. Table 12 shows a comparison with previous works on the same dataset. The proposed model uses the pretrained VGG-19 model with the SA module, which extracts salient features from the frame sequences. In contrast with existing methods, the proposed model uses a keyframe selection method that enhances the accuracy and efficiency of the SLRS by discarding unwanted frames. Thus, the proposed SASLRM reached a mean accuracy of 98.22%, which is significantly better than the results in References [36] and [35]. However, the method proposed in Reference [57] reached an accuracy of 97.22%, which is closest to the proposed model, because both models follow the CNN-BiLSTM architecture, which incorporates the spatial and temporal features needed for dynamic hand gesture recognition.

Author | Method | Accuracy (%)
Lui et al. [37] | Tangent bundle on Grassmann manifold | 91
Lui et al. [36] | Least squares regression method | 88
Harandi et al. [51] | Weighted Riemannian manifolds and 3D covariance descriptors | 93
Liu et al. [35] | Primitive 3D operators with genetic algorithm | 85
Baraldi et al. [9] | Dense trajectories with hand segmentation | 94
John et al. [26] | Keyframe with CNN model | 91
Adithya et al. [57] | GoogleNet + BiLSTM | 97.22
Proposed | SASLRM | 98.22

Table 12. Comparative Analysis on Cambridge HGR Dataset

Dataset 3: The proposed framework is also evaluated on the WLASL-2000 dataset, which is a benchmark dataset for isolated ASL words. For evaluation, top-5 accuracy is considered. The performance of the proposed method is compared with some of the prominent works in this domain, such as Fusion-3 [23], multi-stream [38], HNN-ISLR [45], and hDNN-SLR [46]. Table 13 shows the comparative results. From the results, it can be observed that the proposed SASLRM performs better than most of the existing works.

Author | Technique | Accuracy (%)
A. Hosain et al. [23] | Fusion-3 | 77.51
M. Maruyama et al. [38] | Multi-stream | 87.47
E. Rajalakshmi et al. [46] | HNN-ISLR | 99.85
E. Rajalakshmi et al. [45] | hDNN-SLR | 97.54
Proposed | SASLRM | 98.86

Table 13. Comparative Analysis on WLASL Dataset

4.3.7 Significance of the Components of SASLRM.

To check the importance of each module of SASLRM, an ability analysis is performed. This study investigates the significance of the convolution module and the SA module, as well as their combination in the proposed SASLRM. The average classification accuracy on the ISL dataset is used to determine the contribution of each component of the proposed method. The average accuracy comparison is listed in Table 14. The results clearly show that the combination of the SA and convolution modules performs best. The SA module alone is not sufficient for better performance; however, it helps to increase the accuracy of the proposed model. Further, to investigate the contribution of the keyframe extraction module, the proposed model is executed without the keyframe selection module. The detailed results are given in Table 15. From the results, it can be observed that keyframe selection reduces the execution time significantly.

Modules | Accuracy (%)
VGG-19 with SA module | 76.69
VGG-19 | 94.654
VGG-19 with combination of SA and convolution modules | 95.627

Table 14. Ability Analysis of Each Module of the Proposed SASLRM on ISL Dataset

Method | Keyframe selection time (sec/video) | Feature extraction time (sec/video) | Training time, classification network (sec) | Training time, feature extraction network (sec) | Prediction time (sec/video) | Total execution time (sec)
Without keyframe | — | 4.67 | 1820 | 2150 | 0.0939 | 74.76
With keyframe | 2.73 | 0.745 | 692 | 850 | 0.0715 | 45.54

Table 15. Comparison of Average Execution Times per Video on ISL Dataset


5 CONCLUSION AND FUTURE SCOPE

This article proposes a novel vision-based SLRS named SASLRM for recognizing ISL words that are frequently used in emergency situations. The proposed SASLRM uses a keyframe selection module to eliminate redundant and useless frames, which helps to enhance accuracy. The proposed model extracts features using the SA module on top of VGG-19 and classifies sign words using the BiLSTM model. The evaluation results indicate that the proposed method outperforms the existing SLRS. Besides this, the importance of each module is investigated, and the outcomes show that the fusion of the SA and convolution modules outperforms either module individually. The results show that the proposed SASLRM is well suited to the SL recognition task, and that the combination of the convolution and SA modules improves SLRS performance. However, during error analysis, it is observed that the proposed model encounters challenges when attempting to classify sign words in scenarios involving self-occlusion. The issue of self-occlusion poses a significant hurdle within the domain of sign language recognition. In this context, self-occlusion occurs when a specific body part or object involved in making a gesture is hidden or obstructed from view by another part of the same body or object. Consequently, this phenomenon leads to misclassifications. Thus, as part of future work, an SLRS can be developed to address the self-occlusion problem and improve the performance of the SLRS.

ACKNOWLEDGMENTS

The authors thank the Computer Science Department of NIT Silchar for providing the research facility.

REFERENCES

  1. [1] 2018. World Health Organization. Retrieved from https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-lossGoogle ScholarGoogle Scholar
  2. [2] Adhikary Subhangi, Talukdar Anjan Kumar, and Sarma Kandarpa Kumar. 2021. A vision-based system for recognition of words used in indian sign language using mediapipe. In Proceedings of the 6th International Conference on Image Information Processing (ICIIP’21), Vol. 6. IEEE, 390394.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Adithya V. and Rajesh R.. 2020. Hand gestures for emergency situations: A video dataset based on words from indian sign language. Data Brief 31 (2020), 106016.Google ScholarGoogle ScholarCross RefCross Ref
  4. [4] Al-Hammadi Muneer, Muhammad Ghulam, Abdul Wadood, Alsulaiman Mansour, Bencherif Mohammed A., Alrayes Tareq S., Mathkour Hassan, and Mekhtiche Mohamed Amine. 2020. Deep learning-based approach for sign language gesture recognition with efficient hand gesture representation. IEEE Access 8 (2020), 192527192542.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Al-Hammadi Muneer, Muhammad Ghulam, Abdul Wadood, Alsulaiman Mansour, Bencherif Mohamed A., and Mekhtiche Mohamed Amine. 2020. Hand gesture recognition for sign language using 3DCNN. IEEE Access 8 (2020), 7949179509.Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Aly Saleh and Aly Walaa. 2020. DeepArSLR: A novel signer-independent deep learning framework for isolated arabic sign language gestures recognition. IEEE Access 8 (2020), 8319983212.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Aparna C. and Geetha M.. 2020. CNN and stacked LSTM model for Indian sign language recognition. In Machine Learning and Metaheuristics Algorithms, and Applications: First Symposium (SoMMA’19), Revised Selected Papers 1. Springer, 126134.Google ScholarGoogle Scholar
  8. [8] Areeb Qazi Mohammad, Nadeem Mohammad, Alroobaea Roobaea, Anwer Faisal, et al. 2022. Helping hearing-impaired in emergency situations: A deep learning-based approach. IEEE Access 10 (2022), 85028517.Google ScholarGoogle ScholarCross RefCross Ref
[9] Baraldi Lorenzo, Paci Francesco, Serra Giuseppe, Benini Luca, and Cucchiara Rita. 2014. Gesture recognition in ego-centric videos using dense trajectories and hand segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 688–693.
[10] Chai Xiujuan, Wang Hanjie, Yin Fang, and Chen Xilin. 2015. Communication tool for the hard of hearings. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII'15). 1–3.
[11] Camgoz Necati Cihan, Hadfield Simon, Koller Oscar, and Bowden Richard. 2017. SubUNets: End-to-end hand shape and continuous sign language recognition. In Proceedings of the IEEE International Conference on Computer Vision. 3056–3065.
[12] Cui Runpeng, Liu Hu, and Zhang Changshui. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7361–7369.
[13] Das Soumen, Biswas Saroj Kr, Chakraborty Manomita, and Purkayastha Biswajit. 2022. Intelligent Indian sign language recognition systems: A critical review. In ICT Systems and Sustainability: Proceedings of ICT4SD 2021, Volume 1 (2022), 703–713.
[14] Das Soumen, Biswas Saroj Kr, and Purkayastha Biswajit. 2023. Automated Indian sign language recognition system by fusing deep and handcrafted feature. Multimedia Tools Appl. 82, 11 (2023), 16905–16927.
[15] Das Soumen, Biswas Saroj Kr, and Purkayastha Biswajit. 2023. A deep sign language recognition system for Indian sign language. Neural Comput. Appl. 35, 2 (2023), 1469–1481.
[16] Das Soumen, Chakraborty Manomita, Purkayastha Biswajit, et al. 2021. A review on sign language recognition (SLR) system: ML and DL for SLR. In Proceedings of the IEEE International Conference on Intelligent Systems, Smart and Green Technologies (ICISSGT'21). IEEE, 177–182.
[17] Das Sunanda, Imtiaz Md Samir, Neom Nieb Hasan, Siddique Nazmul, and Wang Hui. 2023. A hybrid approach for Bangla sign language recognition using deep transfer learning model with random forest classifier. Expert Syst. Appl. 213 (2023), 118914.
[18] Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[19] Dhingra Naina and Kunz Andreas. 2019. Res3ATN: Deep 3D residual attention network for hand gesture recognition in videos. In Proceedings of the International Conference on 3D Vision (3DV'19). IEEE, 491–501.
[20] Elsayed Eman K. and Fathy Doaa R. 2021. Semantic deep learning to translate dynamic sign language. Int. J. Intell. Eng. Syst. 14 (2021).
[21] Gurbuz Sevgi Z., Gurbuz Ali Cafer, Malaia Evie A., Griffin Darrin J., Crawford Chris S., Rahman Mohammad Mahbubur, Kurtoglu Emre, Aksu Ridvan, Macks Trevor, and Mdrafi Robiulhossain. 2020. American sign language recognition using RF sensing. IEEE Sens. J. 21, 3 (2020), 3763–3775.
[22] Hoang Nguyen Ngoc, Lee Guee-Sang, Kim Soo-Hyung, and Yang Hyung-Jeong. 2018. A real-time multimodal hand gesture recognition via 3D convolutional neural network and key frame extraction. In Proceedings of the International Conference on Machine Learning and Machine Intelligence. 32–37.
[23] Hosain Al Amin, Santhalingam Panneer Selvam, Pathak Parth, Rangwala Huzefa, and Kosecka Jana. 2021. Hand pose guided 3D pooling for word-level sign language recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3429–3439.
[24] Huang Yanglai, Huang Jing, Wu Xiaoyue, and Jia Yu. 2022. Dynamic sign language recognition based on CBAM with autoencoder time series neural network. Mobile Inf. Syst. 2022 (2022).
[25] Jayadeep Gautham, Vishnupriya N. V., Venugopal Vyshnavi, Vishnu S., and Geetha M. 2020. Mudra: Convolutional neural network based Indian sign language translator for banks. In Proceedings of the 4th International Conference on Intelligent Computing and Control Systems (ICICCS'20). IEEE, 1228–1232.
[26] John Vijay, Boyali Ali, Mita Seiichi, Imanishi Masayuki, and Sanma Norio. 2016. Deep learning-based fast hand gesture recognition using representative frames. In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA'16). IEEE, 1–8.
[27] Khalid Samina, Khalil Tehmina, and Nasreen Shamila. 2014. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the Science and Information Conference. IEEE, 372–378.
[28] Kim Tae-Kyun, Wong Shu-Fai, and Cipolla Roberto. 2007. Tensor canonical correlation analysis for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[29] Koller Oscar, Zargaran Sepehr, and Ney Hermann. 2017. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4297–4305.
[30] Kumar Naresh. 2017. Sign language recognition for hearing impaired people based on hands symbols classification. In Proceedings of the International Conference on Computing, Communication and Automation (ICCCA'17). IEEE, 244–249.
[31] Kumar Naresh and Sukavanam Nagarajan. 2017. Deep network architecture for large scale visual detection and recognition issues. J. Inf. Assur. Secur. 12, 6 (2017).
[32] Likhar Pratik, Bhagat Neel Kamal, and Rathna G. N. 2020. Deep learning methods for Indian sign language recognition. In Proceedings of the IEEE 10th International Conference on Consumer Electronics (ICCE-Berlin'20). IEEE, 1–6.
[33] Lim Kian Ming, Tan Alan W. C., and Tan Shing Chiang. 2016. Block-based histogram of optical flow for isolated sign language recognition. J. Vis. Commun. Image Represent. 40 (2016), 538–545.
[34] Lin Wei-Syun, Wu Yi-Leh, Hung Wei-Chih, and Tang Cheng-Yuan. 2013. A study of real-time hand gesture recognition using SIFT on binary images. In Advances in Intelligent Systems and Applications-Volume 2: Proceedings of the International Computer Symposium (ICS'12). Springer, 235–246.
[35] Liu Li and Shao Ling. 2013. Synthesis of spatio-temporal descriptors for dynamic hand gesture recognition using genetic programming. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG'13). IEEE, 1–7.
[36] Lui Yui Man. 2012. Human gesture recognition on product manifolds. J. Mach. Learn. Res. 13, 1 (2012), 3297–3321.
[37] Lui Yui Man and Beveridge J. Ross. 2011. Tangent bundle for human action recognition. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG'11). IEEE, 97–102.
[38] Maruyama Mizuki, Ghose Shuvozit, Inoue Katsufumi, Roy Partha Pratim, Iwamura Masakazu, and Yoshioka Michifumi. 2021. Word-level sign language recognition with multi-stream neural networks focusing on local regions. arXiv preprint arXiv:2106.15989 (2021).
[39] Masood Sarfaraz, Srivastava Adhyan, Thuwal Harish Chandra, and Ahmad Musheer. 2018. Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics: Proceedings of the 6th International Conference on FICTA. Springer, 623–632.
[40] Mittal Anshul, Kumar Pradeep, Roy Partha Pratim, Balasubramanian Raman, and Chaudhuri Bidyut B. 2019. A modified LSTM model for continuous sign language recognition using leap motion. IEEE Sens. J. 19, 16 (2019), 7056–7063.
[41] Neto Geovane M. Ramos, Junior Geraldo Braz, Almeida João Dallyson Sousa de, and Paiva Anselmo Cardoso de. 2018. Sign language recognition based on 3D convolutional neural networks. In Image Analysis and Recognition: 15th International Conference (ICIAR'18). Springer, 399–407.
[42] Ong Eng-Jon, Cooper Helen, Pugeault Nicolas, and Bowden Richard. 2012. Sign language recognition using sequential pattern trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2200–2207.
[43] Pan Wei, Zhang Xiongquan, and Ye Zhongfu. 2020. Attention-based sign language recognition network utilizing keyframe sampling and skeletal features. IEEE Access 8 (2020), 215592–215602.
[44] Rahim Md Abdur, Shin Jungpil, and Islam Md Rashedul. 2019. Dynamic hand gesture based sign word recognition using convolutional neural network with feature fusion. In Proceedings of the IEEE 2nd International Conference on Knowledge Innovation and Invention (ICKII'19). IEEE, 221–224.
[45] Rajalakshmi E., Elakkiya R., Prikhodko Alexey L., Grif M. G., Bakaev Maxim A., Saini Jatinderkumar R., Kotecha Ketan, and Subramaniyaswamy V. 2022. Static and dynamic isolated Indian and Russian sign language recognition with spatial and temporal feature detection using hybrid neural network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22, 1 (2022), 1–23.
[46] Rajalakshmi E., Elakkiya R., Subramaniyaswamy V., Alexey L. Prikhodko, Mikhail Grif, Bakaev Maxim, Kotecha Ketan, Gabralla Lubna Abdelkareim, and Abraham Ajith. 2023. Multi-semantic discriminative feature learning for sign gesture recognition using hybrid deep neural architecture. IEEE Access 11 (2023), 2226–2238.
[47] Rastgoo Razieh, Kiani Kourosh, and Escalera Sergio. 2020. Hand sign language recognition using multi-view hand skeleton. Expert Syst. Appl. 150 (2020), 113336.
[48] Rastgoo Razieh, Kiani Kourosh, and Escalera Sergio. 2021. Sign language recognition: A deep survey. Expert Syst. Appl. 164 (2021), 113794.
[49] Rodríguez Jefferson and Martínez Fabio. 2018. Towards on-line sign language recognition using cumulative SD-VLAD descriptors. In Advances in Computing: 13th Colombian Conference (CCC'18). Springer, 371–385.
[50] Ronchetti Franco, Quiroga Facundo, Estrebou César Armando, Lanzarini Laura Cristina, and Rosete Alejandro. 2016. LSA64: An Argentinian sign language dataset. In XXII Congreso Argentino de Ciencias de la Computación (CACIC'16).
[51] Sanin Andres, Sanderson Conrad, Harandi Mehrtash T., and Lovell Brian C. 2013. Spatio-temporal covariance descriptors for action and gesture recognition. In Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV'13). IEEE, 103–110.
[52] Saraee Elham, Jalal Mona, and Betke Margrit. 2020. Visual complexity analysis using deep intermediate-layer features. Comput. Vis. Image Understand. 195 (2020), 102949.
[53] Kumar N. K. Senthil and Malarvizhi N. 2020. Bi-directional LSTM–CNN combined method for sentiment analysis in part of speech tagging (PoS). Int. J. Speech Technol. 23 (2020), 373–380.
[54] Shah Jai Amrish et al. 2018. Deepsign: A Deep-learning Architecture for Sign Language. Ph.D. Dissertation.
[55] Singh Dushyant Kumar. 2021. 3D-CNN based dynamic gesture recognition for Indian sign language modeling. Proc. Comput. Sci. 189 (2021), 76–83.
[56] Tariq Memona, Iqbal Ayesha, Zahid Aysha, Iqbal Zainab, and Akhtar Junaid. 2012. Sign language localization: Learning to eliminate language dialects. In Proceedings of the 15th International Multitopic Conference (INMIC'12). IEEE, 17–22.
[57] Venugopalan Adithya and Reghunadhan Rajesh. 2021. Applying deep neural networks for the automatic recognition of sign language words: A communication aid to deaf agriculturists. Expert Syst. Appl. 185 (2021), 115601.
[58] Venugopalan Adithya and Reghunadhan Rajesh. 2023. Applying hybrid deep neural network for the recognition of sign language words used by the deaf COVID-19 patients. Arab. J. Sci. Eng. 48, 2 (2023), 1349–1362.
[59] Agris Ulrich Von, Zieren Jörg, Canzler Ulrich, Bauer Britta, and Kraiss Karl-Friedrich. 2008. Recent developments in visual sign language recognition. Univ. Access Inf. Soc. 6 (2008), 323–362.
[60] Woo Sanghyun, Park Jongchan, Lee Joon-Young, and Kweon In So. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV'18). 3–19.
[61] Yang Hee-Deok and Lee Seong-Whan. 2013. Robust sign language recognition by combining manual and non-manual features based on conditional random field and support vector machine. Pattern Recogn. Lett. 34, 16 (2013), 2051–2056.
[62] Yao Yi and Li Chang-Tsun. 2012. Hand posture recognition using SURF with adaptive boosting. In British Machine Vision Conference.
[63] Zaki Mahmoud M. and Shaheen Samir I. 2011. Sign language recognition using a combination of new vision based features. Pattern Recogn. Lett. 32, 4 (2011), 572–577.
[64] Zhang Erhu, Xue Botao, Cao Fangzhou, Duan Jinghong, Lin Guangfeng, and Lei Yifei. 2019. Fusion of 2D CNN and 3D DenseNet for dynamic gesture recognition. Electronics 8, 12 (2019), 1511.
[65] Zhang Shujun and Zhang Qun. 2021. Sign language recognition based on global-local attention. J. Vis. Commun. Image Represent. 80 (2021), 103280.

• Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 3 (March 2024), 277 pages. ISSN 2375-4699, EISSN 2375-4702. DOI: 10.1145/3613569.

• Publisher: Association for Computing Machinery, New York, NY, United States.

• Publication History
  • Published: 9 March 2024
  • Online AM: 3 February 2024
  • Accepted: 23 January 2024
  • Revised: 20 November 2023
  • Received: 4 March 2023
