research-article
Open Access

An Expert System for Indian Sign Language Recognition Using Spatial Attention–based Feature and Temporal Feature

Published: 09 March 2024


Abstract

Sign Language (SL) is the only means of communication for hearing-impaired people. Hearing people have difficulty understanding SL, resulting in a communication barrier between the hearing-impaired and the hearing community. However, the Sign Language Recognition System (SLRS) has helped to bridge this communication gap. Many SLRS have been proposed for recognizing SL; however, a limited number of works have been reported for Indian Sign Language (ISL). Most of the existing SLRS focus on global features rather than the Region of Interest (ROI). Focusing more on the hand region and extracting local features from the ROI improves system accuracy. The attention mechanism is a widely used technique for emphasizing the ROI. However, only a few SLRS have used attention methods: they employed the Convolution Block Attention Module and temporal attention, but Spatial Attention (SA) has not been utilized in previous SLRS. Therefore, a novel SA-based SLRS named the Spatial Attention-based Sign Language Recognition Module (SASLRM) is proposed to recognize ISL words for emergency situations. SASLRM recognizes ISL words by combining convolution features from a pretrained VGG-19 model and attention features from an SA module. The proposed model achieved an average accuracy of 95.627% on the ISL dataset. The proposed SASLRM is further validated on the LSA64, WLASL, and Cambridge Hand Gesture Recognition datasets, where it reached accuracies of 97.84%, 98.86%, and 98.22%, respectively. The results indicate the effectiveness of the proposed SLRS in comparison with existing SLRS.


1 INTRODUCTION

Sign Language (SL) is the only means of communication for hearing-impaired people. According to the World Health Organization, approximately 63 million Indians are hearing impaired [16]. However, very few people are well trained in or can understand SL, which creates a communication barrier between the hearing and deaf communities. Employing a human interpreter is a traditional approach to resolving this issue. However, well-trained professional human interpreters are not easily available, and they are not cost-effective. Besides this, the signer must compromise confidentiality. Therefore, a Sign Language Recognition System (SLRS) can be an effective choice for reducing the communication barrier between the two communities. The objective of an SLRS is to provide a platform that facilitates the communication process. Since the early 2010s, considerable advancements have taken place in the evolution of SLRS. Nevertheless, these systems lack a unified standard due to the existence of numerous variations within sign language recognition tasks. In broad terms, these tasks can be categorized into two main types, static and dynamic sign language recognition, depending on the representation of gestures. The static sign language recognition task revolves around identifying gestures without any associated motion, whereas dynamic sign language recognition involves the identification of gestures that incorporate motion. The dynamic category can be further divided into isolated word recognition and continuous sentence recognition [13]. State-of-the-art approaches for static SL recognition have achieved remarkable performance. In contrast, dynamic SL recognition is still a largely unexplored area of research, specifically for Indian Sign Language (ISL). Dynamic sign language recognition operates on a video containing a series of sequential gestures, aiming to produce an output of either a single word or a sequence of words.

In recent times, numerous SLRS have been introduced to address dynamic SL recognition tasks, with a predominant focus on leveraging deep learning methodologies. Recent contributions in the realm of deep learning encompass works by Cihan et al. [11], Cui et al. [12], Das et al. [15], and Koller [29]. Modern techniques for dynamic SL recognition encompass the utilization of Convolutional Neural Networks (CNN) and transformer-based SLRS. A prevalent approach within SLRS involves the application of Transfer Learning (TL), which entails the utilization of pretrained CNN models to classify individual frames or to serve as feature extractors. This involves adapting a model previously trained on a related task to the current context. The efficacy of this approach depends on various strategies, including determining the number of layers to transfer, fine-tune, or freeze. Some existing SLRS have effectively incorporated TL principles using pretrained networks. For instance, Adithya et al. [57] proposed an SLRS model that employs a pretrained GoogleNet as a feature extractor in conjunction with a Bi-directional Long Short-Term Memory (BiLSTM) network to classify ISL words in the agricultural domain. Similarly, Aparna et al. [7] utilized an InceptionV3 network for feature extraction and employed a stacked Long Short-Term Memory (LSTM) network for classifying ISL words. In contrast with the previous two approaches, the SLRS model proposed in Reference [58] used the features from the final convolution layer of VGG-16 instead of the dense layer, because the last pooling layers extract the most salient features from the sign frames, which contain local information. All the aforementioned SLRS have demonstrated acceptable performance. Nevertheless, they exhibit notable limitations, notably in handling a substantial volume of frames that often encompass redundant and extraneous data. Sign language videos are characterized by a continuous stream of frames, frequently captured at a high sampling rate by cameras. Consequently, the elevated sampling rate results in a considerable volume of frames within a sign video, necessitating substantial hardware resources for processing. Furthermore, the presence of redundant or irrelevant data between successive frames impairs the accuracy of the system. Another problem with the existing SLRS [7, 57, 58] is that they focus on global features rather than the Region of Interest (ROI). In SL frames, most of the frame consists of background information and only a small part covers the hand region. Thus, focusing more on the hand region and extracting local features from the ROI can increase the performance of the system. Therefore, an SLRS named Spatial Attention-based Sign Language Recognition Module (SASLRM) is proposed in this article to address the aforementioned issues. SASLRM incorporates a novel architecture that uses a Histogram Difference (HD)–based keyframe selection module and a Convolution with Spatial Attention (CSA) module. In addition, the proposed model uses a BiLSTM network to incorporate temporal features for classifying the sign words. The keyframe selection module resolves the problem of excessive processing by eliminating redundant or useless frames. The CSA module extracts features from the keyframes, in which convolution along with Spatial Attention (SA) extracts salient local features.
SASLRM uses the last pooling layer of VGG-19 as a convolution base, which has higher discriminability and is faster to train in deep learning tasks [31, 52]. At the same time, the attention module, as a deeper layer, assists in capturing spatial relationships during the training process for better discrimination. The main contributions of this proposed work are as follows:

A novel SLRS combining VGG-19 and an SA module. Since the proposed model uses the features from the last convolution layer together with an SA module, it can capture more salient local features from sign frames.

A Histogram Difference–based keyframe selection technique that eliminates redundant and useless frames, which enhances the effectiveness of the proposed model.

The rest of the article is organized as follows. Section 2 presents related work on SL recognition. Section 3 discusses the proposed methodology for the ISL recognition task. Section 4 presents the experimental analysis. Finally, Section 5 concludes the article.


2 RELATED WORK

SL is not a uniform language; it varies according to country and region. There are around 300 SLs in use worldwide, including American Sign Language (ASL), Chinese Sign Language, British Sign Language, ISL, and so on [1]. An SLRS recognizes SL and converts it into meaningful words or expressions for the hearing community. An SL word consists of a sequence of gestures. Thus, SLRS are inextricably linked to the gesture recognition or action recognition problem. Researchers have developed several models for recognizing SL and hand gestures over the years, and these models are classified into hardware-based and vision-based models. Models based on hardware are less complex than models based on vision. However, they have many disadvantages, including the fact that the signer faces difficulty performing gestures, and hardware-based SLRS are not cost-effective, making them unsuitable for real-world sign language recognition tasks. Therefore, vision-based solutions are preferred for SL recognition to address these issues. The proposed work focuses on vision-based techniques. This section discusses recent advances in SL recognition systems.

2.1 SLRS Models without CNN

The traditional approach to SL recognition without CNN consists of several basic steps, such as hand detection, hand tracking, segmentation, feature extraction, and classification. Hand motion, shape, orientation, posture, and location are some of the basic features used for SL recognition in traditional SLRS models. In SLRS, relevant feature extraction is critical, because irrelevant features lead to misclassification [5]. Some of the widely used feature extraction methods are the Scale-Invariant Feature Transform, Speeded-Up Robust Features, Histogram of Oriented Gradients, Principal Component Analysis (PCA), Linear Discriminant Analysis, convex hull, convexity defects, Local Binary Patterns, and so on [27, 30, 34, 56, 62]. The feature extraction step helps the classifier by reducing computation and overfitting. The most commonly used classifiers in SLRS are the Support Vector Machine (SVM), K-Nearest Neighbour, Random Forest (RF), and Hidden Markov Model (HMM) [21, 59]. HMM-based classifiers perform better in the case of dynamic SL recognition tasks. Ong et al. [42] proposed a Sequential Pattern Tree-based multiclass classifier for recognizing Greek Sign Language, and their proposed model outperformed HMM. Yang et al. [61] suggested a CRF- and SVM-based SLRS for recognizing ASL with manual and non-manual features. BoostMap embeddings are used for hand shape, segmentation is done using a hierarchical CRF, and recognition is done using SVM. Lim et al. [33] proposed an SLRS for isolated sign word recognition that uses an optical flow feature with the absolute difference between training and test samples for classification. Chai et al. [10] suggested another SLRS in which they proposed a novel Grassmann Covariance Matrix to encode long sequences and an HMM for classification. It has been observed that single features like shape or motion are insufficient for SL recognition. Thus, a combination of features can be utilized to improve recognition performance. Zaki et al. [63] proposed an SLRS that incorporates hand shape, hand orientation, place of articulation, and hand movement. They used PCA to determine hand shape and orientation, kurtosis position to determine articulation, and the Motion Chain Code to determine hand movement. For classification, they used an HMM. Traditional SLRS models have performed satisfactorily in some cases. However, selecting appropriate features is challenging, and the performance of the SLRS depends on various factors such as background and efficient and accurate hand detection and segmentation. Therefore, researchers have moved toward CNN-based models for dynamic SL recognition tasks.

2.2 SLRS Models with CNN

Deep learning (DL) models eliminate the extensive image pre-processing steps and overcome the obstacles of traditional SLRS models without CNN [48]. Deep neural network architectures, such as CNN, learn high-level features automatically from image sequences and remove the problems associated with traditional feature extraction steps. DL models extract salient features from each object present in an image. Researchers have proposed numerous DL-based SLR models over the years, including three-dimensional (3D) CNN with keyframes [22], a combination of 2D and 3D CNN [64], an attention-based model [19], and a DL architecture with skeleton features [47]. Researchers have used various feature extraction techniques with various models, and the encoder-decoder model performed best, with a CNN as the encoder and an LSTM as the decoder. Likhar et al. [32] proposed a continuous SLR system that employs the CNN-LSTM architecture. They evaluated the proposed model with 10 dynamic gestures and reported an F1 measure of 99.20%. The SLRS proposed by Jayadeep et al. [25] is capable of recognizing dynamic ISL words in the banking sector. They used a pretrained InceptionV3 as a feature extractor and employed an LSTM network for gesture classification. The model achieved an 85% recognition rate. Many researchers have employed custom LSTM networks to improve model accuracy. Mittal et al. [40] proposed a continuous SLR system based on a custom LSTM. The model can recognize ISL sentences with 72.3% accuracy and isolated ISL words with 89.5% accuracy. Aly et al. [6] developed a new SLR system for dynamic SL recognition that uses a BiLSTM network. They showed that the BiLSTM network outperforms the standard LSTM network.

CNN models have performed better than SLRS without CNN. However, they fail in some cases where the background is nonuniform and complex. In such cases, accuracy can be enhanced by narrowing the focus to the region of interest. Trainable attention mechanisms are one way to achieve this. For SL recognition, many researchers have proposed attention-based SLRS. Zhang et al. [65] proposed an SLRS that used global-local attention with a C3D network and achieved an accuracy of 91% with better generalization performance. Huang et al. [24] proposed another method for improving generalization ability by combining a Convolution Block Attention Module with a convolutional self-encoding network and achieved an accuracy of 89.90%. Pan et al. [43] suggested an attention-based SLR network for signer-independent SL recognition using keyframe sampling and skeleton features. However, they used temporal attention in their SLRS. Rahim et al. [44] introduced an innovative approach within the domain of SLRS. Their system incorporated ROI identification for individual sign frames. The feature extraction phase involved extracting features from segmented and original images, followed by their concatenation into a unified feature vector. The classification aspect leveraged a CNN. Sing [55] proposed a dynamic SL recognition solution centered around a 3D CNN network. This network comprised three 3D convolution layers, three pooling layers, fully connected layers, and dropout layers. Achieving an overall accuracy of 88.24%, this model showcased promising results. Adhikary et al. [2] devised a pose-based SLRS methodology employing MediaPipe for pose feature extraction. Various classifiers, including Decision Tree, RF, and Gradient Boosting, were employed for classification. Notably, the highest accuracy of 97.4% is attained using the RF classifier. Nonetheless, a drawback is noted, as MediaPipe encountered difficulties in detecting landmarks for overlapping hands. Muneer et al. [4] introduced an optimized C3D network for SLRS. Their model employed two separate C3D networks for distinct tasks. The first network concentrated on learning features from the hand region, while the second network captured features from the entire body. These features are concatenated, and a Multi-Layer Perceptron is utilized for classification. Inefficiency of the model and the absence of temporal modeling were highlighted as drawbacks. Aparna et al. [7] adopted a hybrid CNN-LSTM approach to develop an SLRS for ISL. Their methodology involved utilizing InceptionV3 as a pretrained CNN model to extract spatial features from sequential frames. Sequence learning is achieved using a stacked LSTM network. Areeb et al. [8] developed a DL-based SLRS with a focus on recognizing ISL words relevant to emergency situations. Three distinct models were proposed: two for classification and one for detection. The first classification model employed a 3D CNN, while the second model combined VGG-16 for feature extraction and LSTM for classification. Their detection model integrated YOLOv5. Notably, accuracy rates of 82% and 98% were reported for the first and second classification models, respectively, while the detection model reached 99% accuracy. However, a limitation is observed in their evaluation approach, as they used only a random subset of data for testing and employed random sampling for frame selection, potentially resulting in information loss.

From the literature, it can be concluded that SLRS with CNN outperform SLRS without CNN, and that a CNN with an attention mechanism can improve model performance. Existing SLRS have utilized various attention mechanisms, such as channel attention, global-local attention, and temporal attention. However, the spatial attention mechanism, which can improve the recognition accuracy of an SLRS, has not been explored in previous models.


3 METHODOLOGY OF THE PROPOSED SASLRM

The proposed SASLRM is based on a pretrained VGG-19 DL model. VGG-19 is preferred in the proposed model because of its low-level feature extraction capability using a small kernel size, which is appropriate for the SL recognition task. The proposed work utilizes the concept of fine-tuning, which is one of the TL techniques. The VGG-19 model is initialized with weights pretrained on ImageNet for the fine-tuning process because of the limited data [18]. This helps to reduce the overfitting issue and the random weight initialization problem. The SASLRM model comprises three essential components: (i) keyframe selection, (ii) feature extraction, and (iii) classification. The comprehensive design of the SASLRM is visually represented in Figure 1. The subsequent sections provide an in-depth exploration of each core module.


Fig. 1. Architecture of proposed SASLRM.

3.1 Keyframe Selection

The selection of keyframes stands as a fundamental principle within the SASLRM model. SL videos inherently encompass a multitude of frames, with only a limited subset harboring significant information. The majority of frames derived from SL videos contain superfluous and repetitive data. The integration of keyframe selection serves to eliminate frames bearing such redundant details, thereby improving both the accuracy and efficiency of the system. To this end, the proposed SLRS incorporates a method for keyframe selection that employs the HD technique. The operational protocol of this algorithm unfolds as follows. Initially, the HD between successive frames is computed, and the threshold value is determined from the resultant distance sequence. In the next phase, the first frame of the sequence is chosen as the candidate keyframe, and the HD between the candidate keyframe and the subsequent frame is computed and compared against the threshold. If the computed value surpasses the threshold, then the current frame is selected as a keyframe and becomes the new candidate keyframe. This iterative process spans the entire sequence of frames. The outlined procedure is encapsulated in Algorithm 1.
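For concreteness, the following is a minimal Python sketch of the procedure above, assuming OpenCV grayscale histograms and a mean-plus-standard-deviation threshold rule; the article states only that the threshold is derived from the distance sequence, so the exact threshold formula here is an assumption.

```python
import cv2
import numpy as np

def histogram_difference(frame_a, frame_b, bins=256):
    """Sum of absolute differences between grayscale intensity histograms."""
    hist_a = cv2.calcHist([cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)], [0], None, [bins], [0, 256])
    hist_b = cv2.calcHist([cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)], [0], None, [bins], [0, 256])
    return float(np.sum(np.abs(hist_a - hist_b)))

def select_keyframes(frames):
    """HD-based keyframe selection: derive a threshold from the successive
    frame differences, then keep frames whose HD from the current candidate
    keyframe exceeds that threshold."""
    diffs = [histogram_difference(frames[i], frames[i + 1]) for i in range(len(frames) - 1)]
    threshold = np.mean(diffs) + np.std(diffs)  # assumed rule; the article only says the threshold comes from this sequence
    candidate = frames[0]                       # first frame is the initial candidate keyframe
    keyframes = [candidate]
    for frame in frames[1:]:
        if histogram_difference(candidate, frame) > threshold:
            keyframes.append(frame)
            candidate = frame                   # the selected keyframe becomes the new candidate
    return keyframes
```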

3.2 Feature Extraction

Feature extraction is an essential part of the proposed SASLRM. At this stage, the CSA module is used for feature extraction. The CSA module consists of the convolution module and the SA module. Figure 2 represents the CSA module. A pretrained VGG-19 network is employed as the convolution module, where the final pooling layer of VGG-19 serves as the convolutional base, and the SA module is added afterwards to extract more salient features from the sign images. The convolutional base, together with the SA module and dense layers, is trained on the SL dataset. After that, the trained network is used as a feature extractor in the proposed SASLRM model. The detailed architecture is given in Table 1.

Layer type | Output shape
VGG-19 model | 7 \(\times\) 7 \(\times\) 512
Lambda (average pooling) layer | 7 \(\times\) 7 \(\times\) 1
Lambda (max pooling) layer | 7 \(\times\) 7 \(\times\) 1
Concatenation layer | 7 \(\times\) 7 \(\times\) 2
Convolution layer | 7 \(\times\) 7 \(\times\) 1
Concatenation layer | 7 \(\times\) 7 \(\times\) 513
Flatten layer | 25,137
Dense layer | 4,096
Dense layer | 4,096
Dense layer | 8
Total number of parameters: 119,578,730
Trainable parameters: 119,578,730
Nontrainable parameters: 0

Table 1. Detailed Architecture of Feature Extraction Network


Fig. 2. Architecture of CSA module.

3.2.1 Convolution Module.

According to the literature, VGG-19 has demonstrated outstanding performance on various benchmark datasets, including the ImageNet dataset. As a feature extractor, the VGG-19 network has consistently yielded superior results compared to other models across most tasks [17]. VGG-19 adopts a uniform structure and employs small 3 \(\times\) 3 convolution filters, which are instrumental in extracting features from local regions and enhancing the ability to learn the spatial features required for SL recognition tasks. Furthermore, the proposed model utilizes features from the convolution layer, facilitating the extraction of features from local regions [52]. Compared to other pretrained CNN models like ResNet50, InceptionV3, and GoogleNet, VGG-19 has a comparatively lower dimension in its convolution-layer features. This characteristic makes the VGG-19-based model more effective for the SL recognition task. Therefore, the proposed SASLRM uses the pretrained VGG-19 model as the convolution module, where all the layers of VGG-19 are frozen except for the last convolution layer. The last convolution layer of VGG-19 extracts important low-level features that are useful for SL recognition tasks. During the training process, individual video frames are resized to 224 \(\times\) 224 and fed to the pretrained VGG-19. The last pooling layer of VGG-19 gives an intermediate output of size 7 \(\times\) 7 \(\times\) 512, which is passed to the CSA module. The CSA module concatenates the output of the SA module with this intermediate output, giving a feature map of size 7 \(\times\) 7 \(\times\) 513 for each frame. This is then passed to a flatten layer, which produces a feature vector of size 25,137 for each frame. Finally, the feature vector sequences of the respective videos are forwarded to the classification stage for classifying sign words.
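As an illustration of the fine-tuning setup described above, the sketch below loads the ImageNet-pretrained VGG-19 in Keras (matching the TensorFlow/Keras environment reported in Section 4.2) and freezes every layer except the last convolution layer; block5_conv4 and block5_pool are the Keras names for that layer and the final pooling layer.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG19

# VGG-19 convolution base: ImageNet weights, no dense head, 224x224x3 input.
conv_base = VGG19(weights="imagenet", include_top=False, input_shape=(224, 224, 3))

# Freeze all layers except the last convolution layer (block5_conv4),
# as described for the convolution module.
for layer in conv_base.layers:
    layer.trainable = (layer.name == "block5_conv4")

# The network ends at the final pooling layer (block5_pool), so its output is
# the 7x7x512 intermediate feature map that is passed on to the CSA module.
frame = tf.random.uniform((1, 224, 224, 3))   # one resized, rescaled frame
features = conv_base(frame)                   # shape: (1, 7, 7, 512)
```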

3.2.2 Spatial Attention Module.

This module is used in the proposed model to focus on the ROI of the sign frames. The SA module attends to the informative parts of the sign frames, which makes SA different from channel attention and temporal attention. The SA module follows the SA concept proposed in Reference [60]. The SA module performs average-pooling and max-pooling operations on the output of the last pooling layer of the VGG-19 model. The two resulting vectors from max pooling and average pooling are concatenated and passed to a convolution layer with a sigmoid activation function, where the kernel size is set to 7 \(\times\) 7. The resulting attention map \(M_s(F)\) is defined in Equation (1), (1) \(\begin{equation} M_s(F) = \sigma \left(f^{7\times 7}\left(\left[F_{\text{avg}}^s;F_{\text{max}}^s \right] \right) \right), \end{equation}\) where \(f^{7 \times 7}\) denotes the convolution operation with a filter size of 7 \(\times\) 7, \(F_{\text{avg}}^s\) denotes the 2D feature vector from the average pooling operation, and \(F_{\text{max}}^s\) denotes the 2D feature vector from the max pooling operation. \(F_{\text{avg}}^s \in \mathbb {R}^{1 \times H \times W}\) and \(F_{\text{max}}^s \in \mathbb {R}^{1 \times H \times W}\), where \(F\), \(H\), and \(W\) denote the input feature map, its height, and its width, respectively. The incorporation of SA into the CSA module significantly enhances the performance of the SASLRM. SA enables the model to selectively focus on essential regions or features within the input data while disregarding irrelevant information. It effectively suppresses noise and distractions, making the model more robust, especially in cluttered or distracting input scenarios. Moreover, SA reduces the computational burden by directing resources to where they are most needed, which improves efficiency. This attention mechanism also aids generalization, allowing the model to identify common patterns across various examples and thereby enhancing its accuracy when dealing with new and unseen data. In summary, SA's integration into the CSA module elevates the SASLRM's performance by improving focus, robustness, efficiency, and generalization capabilities. Figure 3 illustrates the SA mechanism.
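A minimal Keras sketch of Equation (1) and of the concatenation and flattening steps listed in Table 1 is shown below; the use of Lambda layers for channel-wise pooling follows Table 1, while details such as "same" padding are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def spatial_attention(feature_map, kernel_size=7):
    """Spatial attention of Equation (1): channel-wise average and max pooling,
    concatenation, then a 7x7 convolution with a sigmoid activation."""
    avg_pool = layers.Lambda(lambda x: tf.reduce_mean(x, axis=-1, keepdims=True))(feature_map)  # 7x7x1
    max_pool = layers.Lambda(lambda x: tf.reduce_max(x, axis=-1, keepdims=True))(feature_map)   # 7x7x1
    pooled = layers.Concatenate(axis=-1)([avg_pool, max_pool])                                  # 7x7x2
    return layers.Conv2D(1, kernel_size, padding="same", activation="sigmoid")(pooled)          # attention map M_s(F), 7x7x1

# CSA module: concatenate the attention map with the VGG-19 feature map,
# giving the 7x7x513 tensor that is flattened into a 25,137-dimensional vector.
vgg_features = layers.Input(shape=(7, 7, 512))
csa = layers.Concatenate(axis=-1)([vgg_features, spatial_attention(vgg_features)])  # 7x7x513
flattened = layers.Flatten()(csa)                                                   # 25,137
csa_module = models.Model(vgg_features, flattened)
```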


Fig. 3. Spatial Attention module.

3.3 Classification

The features from the CSA module are given as input to the classification module, which consists of a BiLSTM layer, a fully connected layer, and a softmax classifier. Table 2 gives the detailed architecture of the classification module.

Layer type | Output shape
CSA model | 25,137
BiLSTM layer | 1,024
Dense layer | 256
Dense layer | 8
Total number of parameters: 105,126,152
Trainable parameters: 105,126,152
Nontrainable parameters: 0

Table 2. Detailed Architecture of Classification Module

3.3.1 BiLSTM Layer.

Following the extraction of features, the subsequent objective revolves around harnessing this extracted information to capture temporal insights from the sequential sign frames. Traditional deep neural networks fall short in achieving this due to their incapacity to grasp sequential patterns. The challenge at hand is effectively tackled by a Recurrent Neural Network (RNN), as demonstrated by Senthil et al. [53]. The feedback mechanism of RNNs equips them to retain historical data. A specialized form of RNN, known as LSTM, is extensively employed for addressing sequential input problems. One of LSTM's core strengths lies in mitigating two key issues of standard RNNs: vanishing and exploding gradients. It excels in preserving extensive patterns across sequences. The architecture of an LSTM module, depicted in Figure 4, underscores its distinctive ability to incorporate various gates that facilitate operations such as the addition or removal of information.


Fig. 4. Block diagram of the LSTM cell.

Equations (2) through (7) provide an elaborate exposition of the operations within the LSTM module, (2) \(\begin{equation} F_k = \sigma (w_F [h_{k-1}, x_k] + b_F), \end{equation}\) (3) \(\begin{equation} I_k = \sigma (w_I [h_{k-1}, x_k] + b_I), \end{equation}\) (4) \(\begin{equation} \acute{c_k} = \tanh (w_C [h_{k-1}, x_k] + b_C), \end{equation}\) (5) \(\begin{equation} c_k = F_k \cdot c_{k-1} + I_k \cdot \acute{c_k}, \end{equation}\) (6) \(\begin{equation} o_k = \sigma (w_O [h_{k-1}, x_k] + b_O), \end{equation}\) (7) \(\begin{equation} h_k = o_k \cdot \tanh {c_k}, \end{equation}\) where \(F_k\) signifies the output of the forget gate, \(I_k\) represents the output of the input gate, \(\acute{c_k}\) and \(c_k\) denote the candidate and updated cell states, \(o_k\) is the output of the output gate, and \(h_k\) denotes the final output (hidden state). \(w_F\), \(w_I\), \(w_O\), and \(w_C\) correspond to the weight matrices, while \(b_F\), \(b_I\), \(b_O\), and \(b_C\) pertain to the bias terms.
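The following NumPy sketch transcribes Equations (2) through (7) directly; the input and hidden dimensions and the random weights are illustrative only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_k, h_prev, c_prev, w, b):
    """One LSTM step following Equations (2)-(7); w and b hold the weight
    matrices and bias vectors for the forget (F), input (I), candidate (C),
    and output (O) paths."""
    z = np.concatenate([h_prev, x_k])          # [h_{k-1}, x_k]
    F_k = sigmoid(w["F"] @ z + b["F"])         # forget gate, Eq. (2)
    I_k = sigmoid(w["I"] @ z + b["I"])         # input gate, Eq. (3)
    c_cand = np.tanh(w["C"] @ z + b["C"])      # candidate cell state, Eq. (4)
    c_k = F_k * c_prev + I_k * c_cand          # cell state update, Eq. (5)
    o_k = sigmoid(w["O"] @ z + b["O"])         # output gate, Eq. (6)
    h_k = o_k * np.tanh(c_k)                   # hidden state, Eq. (7)
    return h_k, c_k

# Illustrative dimensions: 4-dimensional input, 3 hidden units.
rng = np.random.default_rng(0)
w = {gate: rng.standard_normal((3, 7)) for gate in "FICO"}   # 7 = 3 hidden + 4 input
b = {gate: np.zeros(3) for gate in "FICO"}
h_k, c_k = lstm_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), w, b)
```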

Different types of LSTMs, such as BiLSTM, ConvLSTM, LSTM, and stacked LSTM, are employed for the task of sequence learning. The choice of a particular LSTM adaptation depends on the requirements of the specific application. Drawing from prior research and our own experimentation, the BiLSTM network emerges as the most effective option for SL recognition. This is attributed to BiLSTM’s capability to process input sequences both forwards and backwards, enabling predictions based on future context. Thus, in SASLRM, the BiLSTM layer is utilized to grasp sequential information. The suggested model incorporates a BiLSTM layer featuring 512 neurons.

3.3.2 Fully Connected Layer.

Before the final classification layer, a dense layer is used with 256 neurons as shown in Table 2.

3.3.3 Softmax Layer.

The softmax layer is used to classify the sign words based on probability scores. The number of categories in the dataset determines the number of units in the softmax layer. For classification, the softmax layer generates a multinomial distribution of probability scores. Equation (8) is used to calculate the probability score, (8) \(\begin{equation} p(A=C|B) = \frac{e^{B_k}}{\sum _j e^{B_j}}, \end{equation}\) where \(B_k\) denotes the input to the softmax unit of class \(k\).
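A minimal Keras sketch of the classification module in Table 2 follows, using the optimizer, learning rate, and loss from Table 4; the ReLU activation of the 256-unit dense layer is an assumption, since the article does not specify it.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 8        # eight ISL word categories
FEATURE_DIM = 25137    # flattened CSA feature vector per keyframe

# Classification module (Table 2): BiLSTM over the per-keyframe CSA features,
# a 256-unit dense layer, and a softmax output layer (Equation (8)).
classifier = models.Sequential([
    layers.Input(shape=(None, FEATURE_DIM)),          # variable number of keyframes per video
    layers.Bidirectional(layers.LSTM(512)),           # 512 units per direction -> 1,024-dim output
    layers.Dense(256, activation="relu"),             # activation assumed
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

classifier.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                   loss="categorical_crossentropy",
                   metrics=["accuracy"])
```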


4 EXPERIMENTAL ANALYSIS

4.1 Dataset

The experimentation is conducted utilizing the ISL dataset, which is openly accessible [3]. This section provides a concise overview of the dataset employed in this study.

Dataset description: The dataset comprises ISL words associated with dynamic gestures pertinent to emergency scenarios. It encompasses a total of eight categories: “Accident,” “Call,” “Doctor,” “Help,” “Hot,” “Lose,” “Pain,” and “Thief.” These terms are frequently used to convey urgent messages or seek aid during critical situations. Each class contains 52 videos, with the exception of the “Lose” and “Thief” categories, which contain 50 videos each. A diverse set of contributors, comprising 26 adults (12 males and 14 females aged between 22 and 26 years), participated in generating the SL video clips. This variation is introduced deliberately to enhance the diversity of the dataset. Sample frames from the dataset are presented in Figure 5. Further insights into the dataset's composition are provided in Table 3.

Category | Participant number | Total instances
Accident | 26 | 52
Call | 26 | 52
Doctor | 26 | 52
Help | 26 | 52
Hot | 26 | 52
Lose | 25 | 50
Pain | 26 | 52
Thief | 25 | 50
Total number of instances: 412
Total frames: 31,264
Longest video duration: 4 s
Shortest video duration: 1.53 s
Resolution: 500 \(\times\) 600

Table 3. Overview of the Dataset


Fig. 5. Extracted frame samples of ISL dataset for categories (a) Accident, (b) Call, and (c) Lose.

4.2 Experimental Setup and Implementation Details

The experimentation procedure is conducted using Python 3.9, TensorFlow, and Keras on a workstation featuring an Intel Xeon processor and 64 GB of DDR5 RAM. The fundamental parameters for training the model are outlined in Table 4.

Parameters | Settings
Image size | 224 \(\times\) 224
Batch size | 32
Epochs | 100, 20
Optimizer | Adam
Learning rate | 0.001
Loss function | Categorical cross-entropy
Rescale | 1/255

Table 4. Parameters Settings

To select hyperparameters, we conducted several experiments. However, due to limited computational resources, we utilized standard values for certain hyperparameters. For instance, the learning rate for the Adam optimizer is set to the default value of 0.001. Additionally, the number of neurons in the fully connected layer and the number of epochs are set to 256 and 20, respectively [8, 57]. To determine the optimal batch size and number of neurons in the BiLSTM layer for the SASLRM model, we performed further experiments. Our observations revealed that using a batch size of 32 and 512 neurons in the BiLSTM layer resulted in the best outcomes. Table 5 provides the outcomes of these experiments.

Batch size | Number of neurons in BiLSTM layer | Accuracy (%)
8 | 256 | 93.68
8 | 512 | 95.14
8 | 1,024 | 95.14
16 | 512 | 96.11
32 | 512 | 96.60
64 | 512 | 94.65

Table 5. Accuracy for Different Batch Sizes and Number of Neurons

From the results, it can be observed that, for the same batch size, increasing the number of neurons beyond 512 does not improve performance; thus, the number of hidden units is fixed at 512 for the subsequent batch sizes.
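For illustration, a hedged sketch of the search summarized in Table 5 is given below; `build_classifier` and the training and validation arrays are hypothetical placeholders for the classification-network constructor and the extracted CSA feature sequences.

```python
# Hyperparameter search over the (batch size, BiLSTM width) pairs of Table 5.
# `build_classifier`, `train_x`, `train_y`, `val_x`, and `val_y` are placeholders.
configs = [(8, 256), (8, 512), (8, 1024), (16, 512), (32, 512), (64, 512)]

results = {}
for batch_size, units in configs:
    model = build_classifier(units)                     # BiLSTM width under test
    history = model.fit(train_x, train_y,
                        batch_size=batch_size, epochs=20,
                        validation_data=(val_x, val_y))
    results[(batch_size, units)] = max(history.history["val_accuracy"])

best_batch_size, best_units = max(results, key=results.get)  # reported best: 32 and 512
```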

4.3 Results and Discussion

4.3.1 Results of Keyframe Selection Algorithm.

The actual frames and extracted keyframes for the “Call” sign are shown in Figure 6. The “Call” sign consists of a total of 75 frames, but only 10 frames were obtained after executing the keyframe selection algorithm. Figure 7 provides a comparison between the actual number of frames and the number of keyframes extracted for each class. The HD values used to calculate the threshold for a “Call” sign video sample are presented in Table 6, where the obtained threshold value is 6082.4980.

Value of \(n\) | \(d = F_n - F_{n+1}\) | Value of \(n\) | \(d = F_n - F_{n+1}\) | Value of \(n\) | \(d = F_n - F_{n+1}\)
\(n=1\) | 1217.764 | \(n=26\) | 3086.269 | \(n=51\) | 3944.813
\(n=2\) | 2574.648 | \(n=27\) | 2651.318 | \(n=52\) | 4319.377
\(n=3\) | 2538.348 | \(n=28\) | 4093.014 | \(n=53\) | 4114.301
\(n=4\) | 4402.682 | \(n=29\) | 6486.309 | \(n=54\) | 1859.803
\(n=5\) | 5469.165 | \(n=30\) | 6443.643 | \(n=55\) | 5143.418
\(n=6\) | 2804.609 | \(n=31\) | 5239.489 | \(n=56\) | 5951.384
\(n=7\) | 1995.546 | \(n=32\) | 1381.968 | \(n=57\) | 7916.884
\(n=8\) | 3243.938 | \(n=33\) | 15371.96 | \(n=58\) | 4767.201
\(n=9\) | 6265.107 | \(n=34\) | 3845.832 | \(n=59\) | 5206.903
\(n=10\) | 6228.337 | \(n=35\) | 2949.526 | \(n=60\) | 2978.777
\(n=11\) | 5221.185 | \(n=36\) | 4468.579 | \(n=61\) | 2952.717
\(n=12\) | 4803.998 | \(n=37\) | 1883.7 | \(n=62\) | 3885.778
\(n=13\) | 1683.613 | \(n=38\) | 1807.858 | \(n=63\) | 2823.877
\(n=14\) | 3340.464 | \(n=39\) | 7349.98 | \(n=64\) | 4358.893
\(n=15\) | 4035.786 | \(n=40\) | 5706.246 | \(n=65\) | 5372.001
\(n=16\) | 3388.099 | \(n=41\) | 4770.412 | \(n=66\) | 1945.707
\(n=17\) | 4827.764 | \(n=42\) | 2423.231 | \(n=67\) | 1256.903
\(n=18\) | 3347.618 | \(n=43\) | 1943.406 | \(n=68\) | 1469.536
\(n=19\) | 2907.632 | \(n=44\) | 1532.749 | \(n=69\) | 5468.828
\(n=20\) | 1300.703 | \(n=45\) | 4699.361 | \(n=70\) | 6333.01
\(n=21\) | 3035.459 | \(n=46\) | 7206.831 | \(n=71\) | 3773.547
\(n=22\) | 2897.902 | \(n=47\) | 5837.176 | \(n=72\) | 2950.8
\(n=23\) | 2809.976 | \(n=48\) | 3770 | \(n=73\) | 1595.268
\(n=24\) | 4238.255 | \(n=49\) | 3883.96 | \(n=74\) | 4624.214
\(n=25\) | 1683.664 | \(n=50\) | 2875.263 | |

Table 6. Frame Difference Values to Calculate Threshold for Sample “Call” Sign Video


Fig. 6. Actual frames and keyframes from the “Call” sign video.


Fig. 7. Comparative analysis between the actual number of frames and the extracted keyframes.

The dataset contains a total of 31,264 frames extracted from 412 videos. After execution of the keyframe selection algorithm, this number reduces to 4,956.

4.3.2 Results of the Proposed SASLRM.

For model evaluation, a repeated twofold cross-validation (CV) methodology is employed. The ISL dataset forms the foundation for assessing the effectiveness of the proposed SASLRM model. Initially, the dataset is evenly partitioned into two halves, with 50% allocated for training and the remaining half for testing. Following this initial arrangement, the roles of the training and testing sets are swapped, and the average performance is computed over this exchange. This procedure is repeated five times to derive a final average performance. The proposed model's performance is gauged through classification outcomes on the ISL dataset. In the process of selecting the appropriate classification network, various options, including LSTM, BiLSTM, stacked BiLSTM, and ConvLSTM, are assessed alongside the base VGG-19 model. Among these alternatives, the BiLSTM network displayed superior performance. This is ascribed to the BiLSTM's capability to process input sequences in both forward and backward directions, thereby enabling predictions grounded in future context. A comparative analysis of the outcomes is provided in Table 7.
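A minimal sketch of this protocol using scikit-learn's RepeatedKFold (two splits, five repetitions, so each half serves once as the test set in every repetition) is shown below; `build_saslrm`, `features`, and `labels` are hypothetical placeholders for the full model constructor and the prepared keyframe feature sequences and one-hot labels.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

# Repeated twofold cross-validation: a 50-50 split whose halves are swapped,
# repeated five times; the reported accuracy is the mean over all folds.
rkf = RepeatedKFold(n_splits=2, n_repeats=5, random_state=42)
accuracies = []
for train_idx, test_idx in rkf.split(features):
    model = build_saslrm()                                        # placeholder constructor
    model.fit(features[train_idx], labels[train_idx], epochs=20, batch_size=32)
    _, acc = model.evaluate(features[test_idx], labels[test_idx], verbose=0)
    accuracies.append(acc)

print(f"Average accuracy: {np.mean(accuracies):.4f} (SD {np.std(accuracies):.4f})")
```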

Method | Precision | Recall | F1-score | Accuracy (%) | SD (\(\sigma\)) of accuracy
Keyframe + VGG-19 + LSTM | 0.9357 | 0.9334 | 0.9332 | 93.289 | 0.8719
Keyframe + VGG-19 + BiLSTM | 0.9497 | 0.9470 | 0.9465 | 94.654 | 0.4354
Keyframe + VGG-19 + Stacked BiLSTM | 0.9143 | 0.9059 | 0.9050 | 91.594 | 1.09
Keyframe + VGG-19 + ConvLSTM | 0.9489 | 0.9455 | 0.9434 | 94.508 | 0.8894

Table 7. Average Results from Five Repetitions for Various Classification Networks with the VGG-19 Model

The findings clearly indicate that the BiLSTM network combined with VGG-19 exhibited superior performance compared to other models in various metrics such as precision, recall, F1-score, and accuracy. Consequently, the BiLSTM network is selected for the proposed SASLRM. The achieved performance metrics for the proposed model include an average precision of 0.9591, recall of 0.9556, F1-score of 0.9558, and an average accuracy of 95.627%. A comprehensive depiction of the proposed SASLRM’s performance using the ISL dataset is presented in Table 8.

Repetition | Precision | Recall | F1-score | Accuracy (%) | Average accuracy (%) | SD (\(\sigma\)) of accuracy
1st | 0.9574 | 0.9493 | 0.9540 | 95.38 | 95.627 | 0.6335
2nd | 0.9675 | 0.9663 | 0.9660 | 96.60 | |
3rd | 0.9538 | 0.9495 | 0.9491 | 94.90 | |
4th | 0.9540 | 0.9518 | 0.9515 | 95.14 | |
5th | 0.9630 | 0.9614 | 0.9586 | 96.11 | |

Table 8. Performance of the Proposed SASLRM on ISL Dataset

Confusion matrices pertaining to the first and second folds of the initial repetition are illustrated in Figure 8. A careful examination of these matrices reveals that the proposed model demonstrates accurate classification for the “Lose,” “Pain,” and “Thief” classes. However, the model does encounter challenges in accurately classifying the “Call” and “Hot” signs, leading to frequent misclassifications in these cases.


Fig. 8. Confusion matrices of the first and second folds of the first repetition.

4.3.3 Comparison with Existing Approaches on ISL Dataset.

The outcomes of the proposed SASLRM are contrasted with those of several existing models on the ISL dataset. The comparative results are summarized in Table 9. Upon analysis, it is evident that the model presented in Reference [7] attained the lowest accuracy of 69%, as it solely relies on global features derived from a pretrained InceptionV3 network. In contrast, the model introduced in Reference [3] achieved an accuracy of 90% by employing the Discrete Wavelet Transform (DWT), which is adept at extracting local features due to its localization property. Another deep learning model proposed in the same study [3] reached an accuracy of 96.25%. This model employed GoogleNet for feature extraction and BiLSTM for classification, with 2,000 neurons in the BiLSTM network. It is essential to emphasize that the model proposed by Adithya et al. [3] reported an accuracy of 96.25%. However, it is crucial to note the substantial difference in the evaluation criteria employed for this model. Adithya et al. conducted their evaluation using a random 50-50 train-test split, while the proposed SASLRM employed a repeated twofold cross-validation technique. When both models are assessed under the same experimental setup, the model proposed by Adithya et al. achieves an average accuracy of 93.88%, which is lower than the average accuracy of 95.627% attained by the proposed SASLRM. Moreover, the model proposed by Adithya et al. used data augmentation on the complete dataset; unlike image data augmentation, video data augmentation is limited by the temporal nature of video, can generate redundant data, and requires a significant amount of computational resources. Thus, the proposed model does not incorporate any data augmentation technique. Besides, the model proposed in Reference [3] used 2,000 neurons in the BiLSTM layer, which makes it inefficient. The proposed SASLRM achieves satisfactory performance without any data augmentation and with only 512 neurons in the BiLSTM layer, which gives the suggested approach an advantage. The 3D CNN model presented in Reference [8] attained an accuracy of 82%, which is comparatively lower than various CNN-LSTM-based models. This can be attributed to the fact that the 3D CNN model is trained from scratch. On another note, a distinct model introduced in the same study [8] achieved an accuracy of 98%; however, the evaluation criteria employed for that model differ significantly. It was evaluated using a random 80-20 train-test split, where the number of test samples is very small, while the proposed model is evaluated using a repeated twofold cross-validation technique. When both models are implemented and evaluated under the same experimental setup, the model proposed in Reference [8] achieves an average accuracy of 94.21%, which is lower than the average accuracy of 95.627% attained by the proposed SASLRM. Furthermore, the model proposed in Reference [8] used uniform sampling for keyframe selection, where frames are extracted from the video at fixed time intervals. Therefore, there is always a chance of information loss, specifically if the video contains a large number of key gestures. In contrast, the proposed SASLRM uses a Histogram Difference–based keyframe selection technique, which makes it more advantageous. In Reference [14], an average accuracy of 94.42% has been achieved. This accomplishment is realized through a blend of local handcrafted features and CNN features. The classification aspect employed a stacked BiLSTM. Nevertheless, the inclusion of the stacked BiLSTM results in a decrease in accuracy.

Author | Technique | Precision | Recall | F1-score | Accuracy (%)
Adithya et al. [3] | DWT + multiclass SVM (50-50 random split) | 0.9127 | 0.8999 | 0.9025 | 90
Adithya et al. [3] | GoogleNet + BiLSTM (50-50 random split) | 0.9638 | 0.9624 | 0.9625 | 96.25
Aparna et al. [7] | InceptionV3 + stacked LSTM (50-50 split) | — | — | — | 69
Areeb et al. [8] | 3D CNN (80-20 random split) | 0.85 | 0.8362 | 0.8262 | 82
Areeb et al. [8] | VGG-16 + LSTM (80-20 split) | 0.9750 | 0.9775 | 0.9737 | 98
Das et al. [14] | VGG-19 + handcrafted features + stacked BiLSTM (50-50 split) | 0.9281 | 0.9199 | 0.9218 | 94.42
Proposed | VGG-19 with spatial attention + BiLSTM (repeated twofold CV) | 0.9591 | 0.9556 | 0.9558 | 95.627

Table 9. Performance Comparison with Previous Approaches on ISL Dataset

The SASLRM model produces an average accuracy of 95.627%, where the average is taken over five repetitions of twofold cross-validation. The average results are compared with those of the existing SLRS, and the proposed model outperforms most of them. The suggested approach acquires better accuracy than the existing models because the SASLRM model uses the CSA module, which helps to extract more salient features from the local hand region by combining SA and convolution features.

4.3.4 Classwise Analysis.

The classwise analysis of the proposed SASLRM on the ISL dataset is discussed in this section. For this analysis, precision, recall, and F-score values are used. Equations (9), (10), and (11) calculate the precision, recall, and F-score, respectively, (9) \(\begin{equation} \text{Precision} = \frac{{T_P}}{{T_P + F_P}}, \end{equation}\) (10) \(\begin{equation} \text{Recall} = \frac{{T_P}}{{T_P + F_N}}, \end{equation}\) (11) \(\begin{equation} \text{F-Score} = \frac{{2 \times (\text{Precision} \times \text{Recall})}}{{\text{Precision} + \text{Recall}}}, \end{equation}\) where \(T_P\), \(F_P\), and \(F_N\) are the true-positive, false-positive, and false-negative counts, respectively. The average values are reported in Table 10. A low precision value indicates many false-positive cases, whereas a low recall value indicates many false-negative cases. It can be observed that the class “Call” has the lowest precision value, meaning it has the most false-positive cases, which indicates that many other gesture classes are misclassified as “Call.” The “Call” class also has the lowest recall value, which indicates that the word “Call” is frequently misclassified as other gesture classes. Nevertheless, the proposed method overall classifies each class with a high recognition rate.

Class name | Precision | Recall | F1-score
Accident | 0.9922 | 0.9768 | 0.9843
Call | 0.9141 | 0.8931 | 0.8983
Doctor | 0.9630 | 0.9502 | 0.9517
Help | 0.9528 | 0.9620 | 0.9559
Hot | 0.9817 | 0.9197 | 0.9422
Lose | 0.9810 | 0.9885 | 0.9861
Pain | 0.9449 | 0.9768 | 0.9604
Thief | 0.9466 | 0.9892 | 0.9717

Table 10. Detailed Classification Reports on ISL Dataset
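For reference, the per-class metrics of Equations (9) through (11) can be reproduced with scikit-learn, as in the brief sketch below; `y_true` and `y_pred` are placeholders for the ground-truth and predicted class indices collected over the test folds.

```python
from sklearn.metrics import classification_report

CLASS_NAMES = ["Accident", "Call", "Doctor", "Help", "Hot", "Lose", "Pain", "Thief"]

# Per-class precision, recall, and F1-score (Equations (9)-(11)).
print(classification_report(y_true, y_pred, target_names=CLASS_NAMES, digits=4))
```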

4.3.5 Error Analysis.

From the average classification report, it can be observed that the “Call” sign has the lowest F1-score. This section discusses the misclassified samples of the “Call” sign.

Most of the “Call” sign video samples suffer from the self-occlusion problem, which is one of the major reasons for the high misclassification rate of the “Call” sign. Figure 9 shows the keyframes of some of the misclassified video samples of the “Call” sign. Furthermore, it has been observed that the model gets confused due to inter-class similarity and intra-class dissimilarity. From Figure 10, it can be observed that the “Lose” and “Pain” signs have similarities with the “Call” sign in terms of gesture position. Thus, most of the “Call” sign videos are misclassified as “Lose” and “Pain,” and some of the samples are misclassified as “Thief.” Specifically, for video samples 2 and 3, it can be observed that the keyframe selection algorithm fails to extract all the key gesture positions, which may be one of the probable reasons for misclassification.


Fig. 9. Keyframes of misclassified video samples for the “Call” sign.


Fig. 10. Keyframes of “Lose,” “Pain,” and “Thief” video samples.

4.3.6 Validation of SASLRM with Standard Datasets.

Regrettably, there is a lack of standard, accessible ISL datasets in the public domain. Numerous researchers have crafted their own datasets for the task; however, these datasets are either only partially accessible or not publicly available. In light of this, the proposed model undergoes validation via testing against three benchmark datasets.

Dataset 1: The LSA64 dataset stands out as a publicly accessible, extensive dataset used for the recognition of Argentine SL. This dataset is widely used for evaluating SLRS. The dataset encompasses a total of 3,200 videos spanning 64 classes, derived from 10 non-expert subjects. Within these classes, 22 are performed with both hands, while the remaining classes employ a single hand for performing the SL gestures. To evaluate the proposed model, the dataset is partitioned in an 80-20 ratio, and the average outcome across five repetitions is computed. The LSA64 dataset has been widely adopted by various researchers for the evaluation of their proposed SLRS. The outcome of the proposed model is depicted in Table 11 alongside a comparison with certain existing methods on the LSA64 dataset. It is discernible from the results that the proposed method achieves enhanced results in comparison to existing methods on the LSA64 dataset, with the exception of Reference [20]. Nonetheless, it is crucial to highlight that Reference [20] utilized a subset of the dataset comprising 40 classes.

Author | Technique | Number of classes | Accuracy (%)
Ramos et al. [41] | 3D CNN | 64 | 93.9 (avg. of 3 repetitions)
Ronchetti et al. [50] | Random transform with HMM+GMM | 64 | 95.95 (avg. of 30 repetitions)
Rodriguez et al. [49] | Cumulative shape difference + SVM | 64 | 85
Masood et al. [39] | Inception CNN + LSTM | 64 | 95.20
Shah et al. [54] | Inception CNN + BiLSTM | 64 | 96
Elsayed et al. [20] | 3D CNN + LSTM | 40 | 98.50
Proposed | Keyframe + VGG-19 with spatial attention + BiLSTM | 64 | 97.84 (avg. of 5 repetitions)

Table 11. Comparison with Existing Methods on LSA64

Dataset 2: Dynamic Hand Gesture Recognition (HGR) is very similar to this task. Thus, the SASLRM is also validated on a benchmark HGR dataset, the Cambridge Hand Gesture dataset [28]. The dataset consists of 900 dynamic gesture videos of nine classes, and each class contains 100 samples. There are three hand shapes in the dataset (V, flat, and spread) as well as three motions (left, right, and contract). The dataset was prepared with the help of two individuals under different illumination conditions. The dataset is split into two halves, with 450 video samples used for training and the remaining 450 video samples used for testing. The experiment is repeated twice with different training and test data, and the average results are reported. Table 12 shows a comparison with previous works on the same dataset. The proposed model uses the pretrained VGG-19 model with the SA module, which extracts salient features from the frame sequences. In contrast with existing methods, the proposed model uses a keyframe selection method that enhances the accuracy and efficiency of the SLRS by discarding unwanted frames. Thus, the proposed SASLRM reached a mean accuracy of 98.22%, which is significantly better than the results in References [36] and [35]. However, the method proposed in Reference [57] reached an accuracy of 97.22%, which is closest to the proposed model, because both models follow the CNN-BiLSTM architecture, which incorporates the spatial and temporal features needed for dynamic hand gesture recognition.

Author | Method | Accuracy (%)
Lui et al. [37] | Tangent bundle on Grassmann manifold | 91
Lui et al. [36] | Least squares regression method | 88
Harandi et al. [51] | Weighted Riemannian manifolds and 3D covariance descriptors | 93
Liu et al. [35] | Primitive 3D operators with genetic algorithm | 85
Baraldi et al. [9] | Dense trajectories with hand segmentation | 94
John et al. [26] | Keyframe with CNN model | 91
Adithya et al. [57] | GoogleNet + BiLSTM | 97.22
Proposed | SASLRM | 98.22

Table 12. Comparative Analysis on Cambridge HGR Dataset

Dataset 3: The proposed framework is also evaluated on the WLASL-2000 dataset, which is a benchmark dataset for isolated ASL words. For evaluation, top-5 accuracy is considered. The performance of the proposed method is compared with some of the prominent works in this domain, such as Fusion-3 [23], multi-stream [38], HNN-ISLR [45], and hDNN-SLR [46]. Table 13 shows the comparative results. From the results, it can be observed that the proposed SASLRM performs better than most of the existing works.

Author | Technique | Accuracy (%)
A. Hosain et al. [23] | Fusion-3 | 77.51
M. Maruyama et al. [38] | Multi-stream | 87.47
E. Rajalakshmi et al. [46] | HNN-ISLR | 99.85
E. Rajalakshmi et al. [45] | hDNN-SLR | 97.54
Proposed | SASLRM | 98.86

Table 13. Comparative Analysis on WLASL Dataset

4.3.7 Significance of the Components of SASLRM.

To check the importance of each module of SASLRM, an ability analysis is performed. This study investigates the significance of the convolution module and the SA module, as well as their combination in the proposed SASLRM. The average classification accuracy on the ISL dataset is used to determine the contribution of each component of the proposed method. The average accuracy comparison is listed in Table 14. The results clearly show that the combination of the SA and convolution modules performs best. The SA module alone is not sufficient for better performance; however, it helps to increase the accuracy of the proposed model. Further, to investigate the contribution of the keyframe extraction module, the proposed model is executed without the keyframe selection module. The detailed results are given in Table 15. From the results, it can be observed that keyframe selection reduces the execution time significantly.

Modules | Accuracy (%)
VGG-19 with SA module | 76.69
VGG-19 | 94.654
VGG-19 with combination of SA and convolution modules | 95.627

Table 14. Ability Analysis of Each Module of the Proposed SASLRM on ISL Dataset

Method | Keyframe selection time (sec/video) | Feature extraction time (sec/video) | Training time, classification network (sec) | Training time, feature extraction network (sec) | Prediction time (sec/video) | Total execution time (sec)
Without keyframe | — | 4.67 | 1820 | 2150 | 0.0939 | 74.76
With keyframe | 2.73 | 0.745 | 692 | 850 | 0.0715 | 45.54

Table 15. Comparison of Average Execution Times per Video on ISL Dataset


5 CONCLUSION AND FUTURE SCOPE

This article proposes a novel vision-based SLRS named SASLRM for recognizing ISL words that are frequently used in emergency situations. The proposed SASLRM uses a keyframe selection module to eliminate redundant and useless frames, which helps to enhance accuracy. The proposed model extracts features using the SA module on top of VGG-19 and classifies sign words using the BiLSTM model. The evaluation results indicate that the proposed method outperforms the existing SLRS. Besides this, the importance of each module is investigated, and the outcomes show that the fusion of the SA and convolution modules outperforms either module individually. The results show that the proposed SASLRM is well suited to the SL recognition task, and that the combination of the convolution and SA modules improves SLRS performance. However, during error analysis, it is observed that the proposed model encounters challenges when attempting to classify sign words in scenarios involving self-occlusion. The issue of self-occlusion poses a significant hurdle within the domain of sign language recognition. In this context, self-occlusion occurs when a specific body part or object involved in making a gesture is hidden or obstructed from view by another part of the same body or object. Consequently, this phenomenon leads to misclassifications. Thus, as part of future work, an SLRS can be developed to address the self-occlusion problem and improve the performance of the SLRS.

ACKNOWLEDGMENTS

The authors thank the Computer Science Department of NIT Silchar for providing the research facility.

REFERENCES

  1. [1] 2018. World Health Organization. Retrieved from https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-lossGoogle ScholarGoogle Scholar
  2. [2] Adhikary Subhangi, Talukdar Anjan Kumar, and Sarma Kandarpa Kumar. 2021. A vision-based system for recognition of words used in indian sign language using mediapipe. In Proceedings of the 6th International Conference on Image Information Processing (ICIIP’21), Vol. 6. IEEE, 390394.Google ScholarGoogle ScholarCross RefCross Ref
  3. [3] Adithya V. and Rajesh R.. 2020. Hand gestures for emergency situations: A video dataset based on words from indian sign language. Data Brief 31 (2020), 106016.Google ScholarGoogle ScholarCross RefCross Ref
  4. [4] Al-Hammadi Muneer, Muhammad Ghulam, Abdul Wadood, Alsulaiman Mansour, Bencherif Mohammed A., Alrayes Tareq S., Mathkour Hassan, and Mekhtiche Mohamed Amine. 2020. Deep learning-based approach for sign language gesture recognition with efficient hand gesture representation. IEEE Access 8 (2020), 192527192542.Google ScholarGoogle ScholarCross RefCross Ref
  5. [5] Al-Hammadi Muneer, Muhammad Ghulam, Abdul Wadood, Alsulaiman Mansour, Bencherif Mohamed A., and Mekhtiche Mohamed Amine. 2020. Hand gesture recognition for sign language using 3DCNN. IEEE Access 8 (2020), 7949179509.Google ScholarGoogle ScholarCross RefCross Ref
  6. [6] Aly Saleh and Aly Walaa. 2020. DeepArSLR: A novel signer-independent deep learning framework for isolated arabic sign language gestures recognition. IEEE Access 8 (2020), 8319983212.Google ScholarGoogle ScholarCross RefCross Ref
  7. [7] Aparna C. and Geetha M.. 2020. CNN and stacked LSTM model for Indian sign language recognition. In Machine Learning and Metaheuristics Algorithms, and Applications: First Symposium (SoMMA’19), Revised Selected Papers 1. Springer, 126134.Google ScholarGoogle Scholar
  8. [8] Areeb Qazi Mohammad, Nadeem Mohammad, Alroobaea Roobaea, Anwer Faisal, et al. 2022. Helping hearing-impaired in emergency situations: A deep learning-based approach. IEEE Access 10 (2022), 85028517.Google ScholarGoogle ScholarCross RefCross Ref
[9] Baraldi Lorenzo, Paci Francesco, Serra Giuseppe, Benini Luca, and Cucchiara Rita. 2014. Gesture recognition in ego-centric videos using dense trajectories and hand segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 688–693.
[10] Chai Xiujuan, Wang Hanjie, Yin Fang, and Chen Xilin. 2015. Communication tool for the hard of hearings. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII'15). 1–3.
[11] Camgoz Necati Cihan, Hadfield Simon, Koller Oscar, and Bowden Richard. 2017. SubUNets: End-to-end hand shape and continuous sign language recognition. In Proceedings of the IEEE International Conference on Computer Vision. 3056–3065.
[12] Cui Runpeng, Liu Hu, and Zhang Changshui. 2017. Recurrent convolutional neural networks for continuous sign language recognition by staged optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 7361–7369.
[13] Das Soumen, Biswas Saroj Kr, Chakraborty Manomita, and Purkayastha Biswajit. 2022. Intelligent Indian sign language recognition systems: A critical review. In ICT Systems and Sustainability: Proceedings of ICT4SD 2021, Volume 1 (2022), 703–713.
[14] Das Soumen, Biswas Saroj Kr, and Purkayastha Biswajit. 2023. Automated Indian sign language recognition system by fusing deep and handcrafted feature. Multimedia Tools Appl. 82, 11 (2023), 16905–16927.
[15] Das Soumen, Biswas Saroj Kr, and Purkayastha Biswajit. 2023. A deep sign language recognition system for Indian sign language. Neural Comput. Appl. 35, 2 (2023), 1469–1481.
[16] Das Soumen, Chakraborty Manomita, Purkayastha Biswajit, et al. 2021. A review on sign language recognition (SLR) system: ML and DL for SLR. In Proceedings of the IEEE International Conference on Intelligent Systems, Smart and Green Technologies (ICISSGT'21). IEEE, 177–182.
[17] Das Sunanda, Imtiaz Md Samir, Neom Nieb Hasan, Siddique Nazmul, and Wang Hui. 2023. A hybrid approach for Bangla sign language recognition using deep transfer learning model with random forest classifier. Expert Syst. Appl. 213 (2023), 118914.
[18] Deng Jia, Dong Wei, Socher Richard, Li Li-Jia, Li Kai, and Fei-Fei Li. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 248–255.
[19] Dhingra Naina and Kunz Andreas. 2019. Res3ATN: Deep 3D residual attention network for hand gesture recognition in videos. In Proceedings of the International Conference on 3D Vision (3DV'19). IEEE, 491–501.
[20] Elsayed Eman K. and Fathy Doaa R. 2021. Semantic deep learning to translate dynamic sign language. Int. J. Intell. Eng. Syst. 14 (2021).
[21] Gurbuz Sevgi Z., Gurbuz Ali Cafer, Malaia Evie A., Griffin Darrin J., Crawford Chris S., Rahman Mohammad Mahbubur, Kurtoglu Emre, Aksu Ridvan, Macks Trevor, and Mdrafi Robiulhossain. 2020. American sign language recognition using RF sensing. IEEE Sens. J. 21, 3 (2020), 3763–3775.
[22] Hoang Nguyen Ngoc, Lee Guee-Sang, Kim Soo-Hyung, and Yang Hyung-Jeong. 2018. A real-time multimodal hand gesture recognition via 3D convolutional neural network and key frame extraction. In Proceedings of the International Conference on Machine Learning and Machine Intelligence. 32–37.
[23] Hosain Al Amin, Santhalingam Panneer Selvam, Pathak Parth, Rangwala Huzefa, and Kosecka Jana. 2021. Hand pose guided 3D pooling for word-level sign language recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 3429–3439.
[24] Huang Yanglai, Huang Jing, Wu Xiaoyue, and Jia Yu. 2022. Dynamic sign language recognition based on CBAM with autoencoder time series neural network. Mobile Inf. Syst. 2022 (2022).
[25] Jayadeep Gautham, Vishnupriya N. V., Venugopal Vyshnavi, Vishnu S., and Geetha M. 2020. Mudra: Convolutional neural network based Indian sign language translator for banks. In Proceedings of the 4th International Conference on Intelligent Computing and Control Systems (ICICCS'20). IEEE, 1228–1232.
[26] John Vijay, Boyali Ali, Mita Seiichi, Imanishi Masayuki, and Sanma Norio. 2016. Deep learning-based fast hand gesture recognition using representative frames. In Proceedings of the International Conference on Digital Image Computing: Techniques and Applications (DICTA'16). IEEE, 1–8.
[27] Khalid Samina, Khalil Tehmina, and Nasreen Shamila. 2014. A survey of feature selection and feature extraction techniques in machine learning. In Proceedings of the Science and Information Conference. IEEE, 372–378.
[28] Kim Tae-Kyun, Wong Shu-Fai, and Cipolla Roberto. 2007. Tensor canonical correlation analysis for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[29] Koller Oscar, Zargaran Sepehr, and Ney Hermann. 2017. Re-sign: Re-aligned end-to-end sequence modelling with deep recurrent CNN-HMMs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4297–4305.
[30] Kumar Naresh. 2017. Sign language recognition for hearing impaired people based on hands symbols classification. In Proceedings of the International Conference on Computing, Communication and Automation (ICCCA'17). IEEE, 244–249.
[31] Kumar Naresh and Sukavanam Nagarajan. 2017. Deep network architecture for large scale visual detection and recognition issues. J. Inf. Assur. Secur. 12, 6 (2017).
[32] Likhar Pratik, Bhagat Neel Kamal, and Rathna G. N. 2020. Deep learning methods for Indian sign language recognition. In Proceedings of the IEEE 10th International Conference on Consumer Electronics (ICCE-Berlin'20). IEEE, 1–6.
[33] Lim Kian Ming, Tan Alan W. C., and Tan Shing Chiang. 2016. Block-based histogram of optical flow for isolated sign language recognition. J. Vis. Commun. Image Represent. 40 (2016), 538–545.
[34] Lin Wei-Syun, Wu Yi-Leh, Hung Wei-Chih, and Tang Cheng-Yuan. 2013. A study of real-time hand gesture recognition using SIFT on binary images. In Advances in Intelligent Systems and Applications-Volume 2: Proceedings of the International Computer Symposium (ICS'12). Springer, 235–246.
[35] Liu Li and Shao Ling. 2013. Synthesis of spatio-temporal descriptors for dynamic hand gesture recognition using genetic programming. In Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG'13). IEEE, 1–7.
[36] Lui Yui Man. 2012. Human gesture recognition on product manifolds. J. Mach. Learn. Res. 13, 1 (2012), 3297–3321.
[37] Lui Yui Man and Beveridge J. Ross. 2011. Tangent bundle for human action recognition. In Proceedings of the IEEE International Conference on Automatic Face & Gesture Recognition (FG'11). IEEE, 97–102.
[38] Maruyama Mizuki, Ghose Shuvozit, Inoue Katsufumi, Roy Partha Pratim, Iwamura Masakazu, and Yoshioka Michifumi. 2021. Word-level sign language recognition with multi-stream neural networks focusing on local regions. arXiv preprint arXiv:2106.15989 (2021).
[39] Masood Sarfaraz, Srivastava Adhyan, Thuwal Harish Chandra, and Ahmad Musheer. 2018. Real-time sign language gesture (word) recognition from video sequences using CNN and RNN. In Intelligent Engineering Informatics: Proceedings of the 6th International Conference on FICTA. Springer, 623–632.
[40] Mittal Anshul, Kumar Pradeep, Roy Partha Pratim, Balasubramanian Raman, and Chaudhuri Bidyut B. 2019. A modified LSTM model for continuous sign language recognition using leap motion. IEEE Sens. J. 19, 16 (2019), 7056–7063.
[41] Neto Geovane M. Ramos, Junior Geraldo Braz, Almeida João Dallyson Sousa de, and Paiva Anselmo Cardoso de. 2018. Sign language recognition based on 3D convolutional neural networks. In Image Analysis and Recognition: 15th International Conference (ICIAR'18). Springer, 399–407.
[42] Ong Eng-Jon, Cooper Helen, Pugeault Nicolas, and Bowden Richard. 2012. Sign language recognition using sequential pattern trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2200–2207.
[43] Pan Wei, Zhang Xiongquan, and Ye Zhongfu. 2020. Attention-based sign language recognition network utilizing keyframe sampling and skeletal features. IEEE Access 8 (2020), 215592–215602.
[44] Rahim Md Abdur, Shin Jungpil, and Islam Md Rashedul. 2019. Dynamic hand gesture based sign word recognition using convolutional neural network with feature fusion. In Proceedings of the IEEE 2nd International Conference on Knowledge Innovation and Invention (ICKII'19). IEEE, 221–224.
[45] Rajalakshmi E., Elakkiya R., Prikhodko Alexey L., Grif M. G., Bakaev Maxim A., Saini Jatinderkumar R., Kotecha Ketan, and Subramaniyaswamy V. 2022. Static and dynamic isolated Indian and Russian sign language recognition with spatial and temporal feature detection using hybrid neural network. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22, 1 (2022), 1–23.
[46] Rajalakshmi E., Elakkiya R., Subramaniyaswamy V., Alexey L. Prikhodko, Mikhail Grif, Bakaev Maxim, Kotecha Ketan, Gabralla Lubna Abdelkareim, and Abraham Ajith. 2023. Multi-semantic discriminative feature learning for sign gesture recognition using hybrid deep neural architecture. IEEE Access 11 (2023), 2226–2238.
[47] Rastgoo Razieh, Kiani Kourosh, and Escalera Sergio. 2020. Hand sign language recognition using multi-view hand skeleton. Expert Syst. Appl. 150 (2020), 113336.
[48] Rastgoo Razieh, Kiani Kourosh, and Escalera Sergio. 2021. Sign language recognition: A deep survey. Expert Syst. Appl. 164 (2021), 113794.
[49] Rodríguez Jefferson and Martínez Fabio. 2018. Towards on-line sign language recognition using cumulative SD-VLAD descriptors. In Advances in Computing: 13th Colombian Conference (CCC'18). Springer, 371–385.
[50] Ronchetti Franco, Quiroga Facundo, Estrebou César Armando, Lanzarini Laura Cristina, and Rosete Alejandro. 2016. LSA64: An Argentinian sign language dataset. In XXII Congreso Argentino de Ciencias de la Computación (CACIC'16).
[51] Sanin Andres, Sanderson Conrad, Harandi Mehrtash T., and Lovell Brian C. 2013. Spatio-temporal covariance descriptors for action and gesture recognition. In Proceedings of the IEEE Workshop on Applications of Computer Vision (WACV'13). IEEE, 103–110.
[52] Saraee Elham, Jalal Mona, and Betke Margrit. 2020. Visual complexity analysis using deep intermediate-layer features. Comput. Vis. Image Understand. 195 (2020), 102949.
[53] Kumar N. K. Senthil and Malarvizhi N. 2020. Bi-directional LSTM–CNN combined method for sentiment analysis in part of speech tagging (PoS). Int. J. Speech Technol. 23 (2020), 373–380.
[54] Shah Jai Amrish et al. 2018. Deepsign: A Deep-learning Architecture for Sign Language. Ph.D. Dissertation.
[55] Singh Dushyant Kumar. 2021. 3D-CNN based dynamic gesture recognition for Indian sign language modeling. Proc. Comput. Sci. 189 (2021), 76–83.
[56] Tariq Memona, Iqbal Ayesha, Zahid Aysha, Iqbal Zainab, and Akhtar Junaid. 2012. Sign language localization: Learning to eliminate language dialects. In Proceedings of the 15th International Multitopic Conference (INMIC'12). IEEE, 17–22.
[57] Venugopalan Adithya and Reghunadhan Rajesh. 2021. Applying deep neural networks for the automatic recognition of sign language words: A communication aid to deaf agriculturists. Expert Syst. Appl. 185 (2021), 115601.
[58] Venugopalan Adithya and Reghunadhan Rajesh. 2023. Applying hybrid deep neural network for the recognition of sign language words used by the deaf COVID-19 patients. Arab. J. Sci. Eng. 48, 2 (2023), 1349–1362.
[59] Agris Ulrich Von, Zieren Jörg, Canzler Ulrich, Bauer Britta, and Kraiss Karl-Friedrich. 2008. Recent developments in visual sign language recognition. Univ. Access Inf. Soc. 6 (2008), 323–362.
[60] Woo Sanghyun, Park Jongchan, Lee Joon-Young, and Kweon In So. 2018. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV'18). 3–19.
[61] Yang Hee-Deok and Lee Seong-Whan. 2013. Robust sign language recognition by combining manual and non-manual features based on conditional random field and support vector machine. Pattern Recogn. Lett. 34, 16 (2013), 2051–2056.
[62] Yao Yi and Li Chang-Tsun. 2012. Hand posture recognition using SURF with adaptive boosting. In British Machine Vision Conference.
[63] Zaki Mahmoud M. and Shaheen Samir I. 2011. Sign language recognition using a combination of new vision based features. Pattern Recogn. Lett. 32, 4 (2011), 572–577.
[64] Zhang Erhu, Xue Botao, Cao Fangzhou, Duan Jinghong, Lin Guangfeng, and Lei Yifei. 2019. Fusion of 2D CNN and 3D DenseNet for dynamic gesture recognition. Electronics 8, 12 (2019), 1511.
[65] Zhang Shujun and Zhang Qun. 2021. Sign language recognition based on global-local attention. J. Vis. Commun. Image Represent. 80 (2021), 103280.

• Published in: ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 3 (March 2024), 277 pages. ISSN 2375-4699, EISSN 2375-4702. DOI: 10.1145/3613569.

• Publisher: Association for Computing Machinery, New York, NY, United States.

• Publication History
  • Published: 9 March 2024
  • Online AM: 3 February 2024
  • Accepted: 23 January 2024
  • Revised: 20 November 2023
  • Received: 4 March 2023
