IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 69, NO. 7, JULY 2020

Smartphone Sensor-Based Human Activity Recognition Using Feature Fusion and Maximum Full a Posteriori

Zhenghua Chen, Chaoyang Jiang, Shili Xiang, Jie Ding, Min Wu, and Xiaoli Li

Abstract— Human activity recognition (HAR) using smartphone sensors has attracted great attention due to its wide range of applications. A standard solution for HAR is to first generate features defined based on domain knowledge (handcrafted features) and then to train an activity classification model on these features. More recently, deep learning with automatic feature learning from raw sensory data has also achieved strong performance on the HAR task. We believe that handcrafted features and learned features may each convey unique information that can complement the other for HAR. In this article, we first propose a feature fusion framework that combines handcrafted features with features automatically learned by a deep algorithm for HAR. Then, taking the regular dynamics of human behavior into consideration, we develop a maximum full a posteriori algorithm to further enhance the performance of HAR. Our extensive experimental results show that the proposed approach achieves superior performance compared with state-of-the-art methods on both a public data set and a self-collected data set.

Index Terms— Deep learning, feature fusion, human activity recognition (HAR), maximum full a posteriori (MFAP), smartphone sensors.

I. INTRODUCTION

Human activity recognition (HAR) is of great importance for many applications in health-care services, smart homes, and pervasive and mobile computing [1], [2]. With the development of computer vision techniques, camera-based HAR has been well developed [3]. However, it can only monitor a specific space with adequate illumination conditions. In addition, it suffers from privacy concerns.
Wearable sensors, such as accelerometers and gyroscopes, are also popular for HAR [4], [5]. However, they require special hardware to be worn by users, which is inconvenient. In the past decade, smartphones have become increasingly powerful, with many embedded sensors, including an accelerometer, gyroscope, barometer, temperature sensor, and so on. Since most people carry smartphones in their daily life, smartphone-based HAR is thus a practical option [6], [7].

Recently developed smartphone sensor-based HAR methods can generally be divided into two categories: shallow and deep algorithms. Shallow algorithms consist of two steps, feature extraction and activity inference [2], [8]. Since raw smartphone sensor data are not very representative of distinct activities, a standard procedure is to extract informative features, a process known as feature extraction/engineering. For instance, the magnitude of acceleration should be helpful in separating activities such as walking and running. As such, predefined statistical features, known as handcrafted features, are first extracted from the raw smartphone sensor data. Note that these handcrafted features are also generated automatically, by programs written based on their definitions. Machine learning algorithms, such as neural networks, support vector machines (SVMs), and random forests (RFs), can then be applied to the handcrafted features to identify different human activities. Deep algorithm-based HAR, on the other hand, is a one-step approach that automatically learns representative features from the raw sensory data without human intervention and performs activity inference simultaneously [9]–[11].

We observe that both shallow learning algorithms with handcrafted features and deep learning algorithms with automatically learned features have achieved great success in the task of HAR [12], [13].

Manuscript received July 10, 2019; revised September 3, 2019; accepted September 22, 2019. Date of publication October 3, 2019; date of current version June 9, 2020. This work was supported in part by the Ministry of National Development, Singapore, through the Sustainable Urban Living Program under Grant SUL2013-5 and in part by the Beijing Institute of Technology Research Fund Program for Young Scholars. The Associate Editor coordinating the review process was Dr. Alessio De Angelis. (Corresponding authors: Chaoyang Jiang; Min Wu.)

Z. Chen, S. Xiang, J. Ding, M. Wu, and X. Li are with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore 138632 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

C. Jiang is with the Science and Technology on Vehicle Transmission Laboratory, School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIM.2019.2945467
We believe that handcrafted features and features automatically learned by deep algorithms may each convey unique information that can complement the other to boost the performance of smartphone sensor-based HAR. In this article, in the first stage, we propose a feature fusion framework that integrates handcrafted features with a deep algorithm, i.e., deep long short-term memory (LSTM), to boost the performance of HAR. In the second stage, considering the dynamics (frequent activity changes) of human behavior, we propose a maximum full a posteriori (MFAP) algorithm, which exploits all the past information and the current a posteriori probability obtained from the feature fusion framework to give an optimal estimation of human activities. The main contributions of this article are summarized as follows.

1) We propose a novel feature fusion framework that effectively combines handcrafted features with a deep learning algorithm to boost the performance of smartphone sensor-based HAR.
2) Taking the dynamics of human behavior into consideration, we formulate an MFAP algorithm that exploits all the past information and the current a posteriori information obtained from the feature fusion framework to give an optimal estimation of human activities.
3) We use a public data set and a self-collected data set to evaluate the effectiveness of the proposed approach. Our comprehensive experimental results demonstrate that the proposed approach significantly outperforms existing advanced learning algorithms and the state of the art.

0018-9456 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: The Chinese University of Hong Kong CUHK(Shenzhen). Downloaded on November 10, 2022 at 01:31:38 UTC from IEEE Xplore. Restrictions apply.
The remainder of this article is organized as follows. Section II reviews related works using handcrafted features and automatic feature learning by deep algorithms for HAR. Section III briefly introduces the handcrafted features and a deep algorithm for automatic feature learning, followed by the proposed feature fusion framework. Section IV presents the proposed MFAP algorithm. Section V first describes the data used for evaluation, followed by the experimental setup; the experimental results are then presented and discussed. Section VI concludes this article and outlines potential future work.

II. RELATED WORKS

In this section, related works on HAR using different learning algorithms are reviewed. We divide this section into two parts: shallow and deep algorithms.

A. Shallow Algorithms

Shallow algorithms normally consist of feature engineering and activity inference. Since raw smartphone sensor data are noisy and not representative of different human activities, more informative features can be extracted with domain knowledge. Shallow learning algorithms can then be applied to these handcrafted features for HAR. For example, Wang et al. [14] investigated the effectiveness of the smartphone accelerometer and gyroscope for HAR. First, they extracted a large number of statistical features from both the time and frequency domains of the 3-D acceleration and gyroscope data. Then, they proposed a hybrid filter-and-wrapper method, known as FW, to select the best features from all the handcrafted features. Finally, machine learning algorithms, namely, k-nearest neighbors (KNN) and naive Bayes (NB), were employed to classify different activities. Eastwood and Jayne [15] evaluated different extensions of the hyperbox neural network (HNN), which is built upon different modes of learning, for HAR. In addition, Anguita et al. [12] proposed a hardware-friendly SVM (HF-SVM) algorithm based on fixed-point arithmetic for HAR using smartphone sensors.
The experimental results showed that HF-SVM has performance comparable to the conventional SVM, but with much lower computational complexity. Ronao and Cho [16] presented two-stage continuous hidden Markov models (CHMMs) for HAR. The first-stage CHMM was used to separate static and dynamic activities; the second-stage CHMM was then applied to identify the exact activity within each of the two types. Rana et al. [17] enhanced the sparse random classifier with singular value decomposition (SRC-SVD) for HAR, where the SVD was leveraged to construct the random projection matrix for SRC. Seera et al. [18] proposed a hybrid of the fuzzy min–max (FMM) neural network and the classification and regression tree (CART) to recognize human activities. In their system, the FMM was mainly used for incremental data learning, and the CART was used to provide interpretations for the classification.

B. Deep Algorithms

Owing to their powerful feature learning ability, deep algorithms have achieved remarkable performance for HAR using smartphone sensors. Li et al. [19] presented a sparse autoencoder (SAE) to automatically learn representative features from raw smartphone accelerometer and gyroscope data for the task of HAR. The 3-D acceleration, the gyroscope data, and their magnitudes are treated as different channels on which the SAE is applied for feature learning. Ronao and Cho [20] presented a convolutional neural network (convnet) that learns representative features from raw smartphone sensor data for HAR. They also explored the use of the temporal fast Fourier transform (tFFT) on the raw sensory data with the convnet. In another work, they applied handcrafted features, instead of the raw smartphone sensor data, as the inputs of the convnet for HAR [21]. Tao et al. [22] presented an ensemble bidirectional long short-term memory (BLSTM) approach for HAR.
They applied the raw sensory data, the magnitude of the raw sensory data, and two-directional features as inputs to different BLSTMs. Experiments indicate the effectiveness of their proposed approach. Chen et al. [13] proposed a knowledge distilling strategy that uses well-designed handcrafted features to guide deep algorithms toward better generalization for smartphone sensor-based HAR. A comprehensive survey on deep learning-based HAR can be found in [23].

In real applications, both handcrafted features based on domain knowledge and features automatically learned by deep algorithms may convey unique information for HAR. In this article, we build a feature fusion framework to combine these two types of features and make good use of all the useful information, which should boost the performance of HAR. Taking the dynamics of human behavior into consideration, we further improve the performance of HAR by formulating an MFAP algorithm that exploits all the past information together with the current a posteriori information obtained from the feature fusion framework to give an optimal estimation of human activities.

III. PROPOSED FEATURE FUSION FRAMEWORK

In this section, we first briefly introduce handcrafted features and automatic feature learning, and subsequently elaborate the two key innovations in our proposed methods.

TABLE I
HANDCRAFTED FEATURES

A. Handcrafted Features

Feature engineering is a widely used technique for data preprocessing and underlies the success of shallow machine learning algorithms [24]. For HAR using smartphone sensors, the raw sensory data are not representative of different human activities.
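As a preview of what such features look like in practice, a few representative time- and frequency-domain statistics of the kind listed in Table I can be computed from one segmented window of tri-axial accelerometer data as follows. This is a minimal sketch rather than the full Table I feature set, and the function and feature choices here are our own illustrative ones:

```python
import numpy as np

def handcrafted_features(window):
    """Compute a few Table-I-style statistics for one window.

    window: array of shape (T, 3) holding T samples of 3-D acceleration.
    Returns a 1-D feature vector.
    """
    feats = []
    magnitude = np.linalg.norm(window, axis=1)      # acceleration magnitude
    for signal in list(window.T) + [magnitude]:     # x, y, z, and |a| channels
        feats += [signal.mean(), signal.std(),      # time-domain statistics
                  signal.min(), signal.max()]
        spectrum = np.abs(np.fft.rfft(signal))      # frequency-domain view
        feats += [spectrum.mean(),                  # mean spectral magnitude
                  float(np.argmax(spectrum))]       # dominant frequency bin
    return np.array(feats)

# A 2.56 s window at 50 Hz (as used in Section V) contains 128 samples.
window = np.random.randn(128, 3)
print(handcrafted_features(window).shape)  # (24,)
```

In a full pipeline, one such vector is computed per sliding window (2.56 s with 50% overlap in this article) for both the accelerometer and the gyroscope.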
To achieve better performance for HAR, more representative features can be extracted based on domain knowledge. For example, walking and running yield different magnitudes of acceleration, so the magnitude of acceleration can be extracted as a feature to separate these two activities. In addition, the variance of smartphone sensor readings can be used to distinguish static activities from dynamic ones. As such, a set of statistical features from the time and frequency domains has been shown to be effective for smartphone sensor-based HAR [12]; these features are presented in Table I. All of these handcrafted features are extracted from both the 3-D acceleration and the gyroscope data of smartphones.

B. Automatic Feature Learning

Deep learning has achieved great success in many challenging research areas, such as image recognition [25] and natural language processing [26]. The biggest merit of deep learning is its ability to learn features automatically from raw sensory data without human intervention. For HAR using smartphone sensors, the raw sensory data are typical time series with temporal dependence [27]. While a recurrent neural network (RNN) is naturally suitable for time series data, the conventional RNN suffers from vanishing and exploding gradients, which degrade its performance in modeling long-term dependencies in sequential data [28]. To solve this problem, Hochreiter and Schmidhuber proposed a new RNN named long short-term memory (LSTM), which uses memory cells to preserve information over long-term dependencies [29]. A typical LSTM structure is shown in Fig. 1, where x_t is the input at time step t, h_t is the hidden state, C_{t-1} is the memory cell state, w_f, w_i, w_C, and w_o are the weights, b_f, b_i, b_C, and b_o are the biases, and σ(·) and tanh(·) are the sigmoid and tanh functions, respectively.

Fig. 1. Structure of the LSTM network.
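One step of this cell, as formalized in (1)–(6) below, can be sketched in NumPy. This is a minimal illustration with randomly initialized weights standing in for trained parameters; real implementations use trained weights and optimized kernels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM cell update; W and b hold the four gate parameters."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])              # forget gate, Eq. (1)
    i = sigmoid(W["i"] @ z + b["i"])              # input gate, Eq. (2)
    C_tilde = np.tanh(W["C"] @ z + b["C"])        # candidate state, Eq. (3)
    C = f * C_prev + i * C_tilde                  # cell state, Eq. (4)
    o = sigmoid(W["o"] @ z + b["o"])              # output gate, Eq. (5)
    h = o * np.tanh(C)                            # hidden output, Eq. (6)
    return h, C

d_in, d_hid = 6, 4                                # e.g., 6 sensor channels
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_hid, d_in + d_hid)) for k in "fiCo"}
b = {k: np.zeros(d_hid) for k in "fiCo"}
h, C = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(128, d_in)):          # one 2.56 s window at 50 Hz
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape)  # (4,)
```

The hidden state after the last time step plays the role of the learned feature vector for the window.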
In the LSTM network, the first step is to determine which information should be discarded from the previous memory cell state C_{t-1} by using a forget gate, which can be formulated as

f_t = σ(w_f [h_{t-1}, x_t] + b_f).   (1)

Here, f_t = 1 means keeping all the information from the previous step, and f_t = 0 means totally removing it. The next step is to determine which new information should be stored based on the current input. It consists of two components. The first is an input gate that decides what shall be updated:

i_t = σ(w_i [h_{t-1}, x_t] + b_i).   (2)

The second produces a candidate state value C̃_t by using a tanh function:

C̃_t = tanh(w_C [h_{t-1}, x_t] + b_C).   (3)

After that, the current cell state C_t is computed as

C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t.   (4)

Finally, the hidden output h_t is a filtered version of the compressed cell state tanh(C_t). The output of the sigmoid layer, o_t, determines which part of the information is preserved:

o_t = σ(w_o [h_{t-1}, x_t] + b_o).   (5)

The final hidden output h_t ∈ R^d, where d is the feature dimension, can be expressed as

h_t = o_t ∗ tanh(C_t).   (6)

Deep architectures have been shown to be effective for representation learning [30]. Therefore, in this article, we stack multiple LSTM layers, known as a deep LSTM, for deep representation learning in the task of smartphone sensor-based HAR. Specifically, the output of the i-th LSTM layer is the input of the (i+1)-th LSTM layer; as a special case, the input of the first LSTM layer is the raw sequential smartphone sensor data.

Fig. 2. Proposed feature fusion framework.

C. Proposed Feature Fusion

Both the handcrafted features based on domain knowledge and the features learned by deep algorithms may contain unique information for HAR. To make good use of both, we propose a feature fusion framework that combines them for better recognition of human activities using smartphone sensors. The proposed feature fusion framework is shown in Fig. 2. Here, we choose the deep LSTM for feature learning, which is naturally suitable for our sequential data analysis problem. The raw sequential smartphone sensor data are fed into two stacked LSTM layers for feature learning. The learned features at the last time instance are fed into a fully connected layer (FCL) to obtain more abstract features. At the same time, the handcrafted features in Table I, extracted from the raw smartphone sensor data, are fed into another FCL to obtain more abstract features. After that, we combine the two types of features using a concatenation layer. Finally, the combined features are fed into a softmax layer for activity classification.

More specifically, given the smartphone sensor input o_t, which is a window of sensory data, the automatically learned features and the handcrafted features can be expressed as v_t = Φ(o_t) and h_t = Ψ(o_t), respectively, where Φ(·) denotes the LSTM-based feature learning and Ψ(·) the handcrafted feature extraction based on domain knowledge. Note that the LSTM is able to encode temporal dependencies within the sample (window) during feature learning. These two types of features can be viewed as processings of the raw sensory data from two distinct perspectives, both of which have been shown to be effective for HAR. The complete feature set is the concatenation of the two types of features, which can be expressed as l_t = v_t ∪ h_t. This concatenation makes full use of both types of features and may also lead to a more comprehensive understanding of the raw sensory data.
Hence, better performance can be expected. The final outputs of the proposed feature fusion framework are the probabilities of all activities, obtained by applying the softmax layer to these features, which can be expressed as softmax(l_t).

Training the proposed feature fusion framework amounts to optimizing the network parameters with a backpropagation algorithm on the training data. Specifically, given the training data and targets, the network outputs on the training data are calculated; the errors between the network outputs and the given targets are obtained, and the gradient of these errors is used to update the network parameters with gradient-based optimization methods. In this article, we use the RMSprop optimizer, which uses the magnitude of recent gradients to normalize the current gradient [31]. To prevent overfitting, dropout layers and a batch normalization (BN) layer are employed, as shown in Fig. 2. The dropout rates of the two dropout layers are set to 0.5.

After the network has been trained, the outputs of the proposed feature fusion framework are the probabilities of all activities given the current sensor measurements o_t, which can be expressed as p(z_t|o_t). This is also known as the a posteriori probability. In general, the current activity would be determined by the maximal a posteriori probability, known as maximum a posteriori (MAP) estimation. However, the current human activity is also related to the past activity sequence and previous sensor observations, which the MAP estimate does not consider. In other words, the LSTM network in the proposed feature fusion framework can only encode temporal dependencies within a sample; it cannot model the temporal dynamics among samples (the activity sequence).
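The fusion and classification stage described above reduces, at inference time, to a concatenation followed by a softmax. A minimal sketch, with made-up dimensions and random weights standing in for the trained FCL and softmax-layer parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

n_classes = 6                     # six activities, as in both data sets
v_t = rng.normal(size=100)        # abstract learned features (LSTM branch FCL)
h_t = rng.normal(size=100)        # abstract handcrafted features (other FCL)
l_t = np.concatenate([v_t, h_t])  # fused feature vector l_t = v_t ∪ h_t

W_out = rng.normal(size=(n_classes, l_t.size)) * 0.01   # softmax layer weights
b_out = np.zeros(n_classes)
posterior = softmax(W_out @ l_t + b_out)                # p(z_t | o_t)

print(round(posterior.sum(), 6))   # 1.0
print(int(np.argmax(posterior)))   # plain MAP activity estimate
```

The last line is exactly the MAP decision criticized in the text: it uses only the current window, which is what the MFAP recursion of Section IV improves upon.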
To further improve the performance of HAR, we propose an MFAP approach that combines the past information with the current a posteriori probability to give an optimal estimation of human activities.

IV. MAXIMUM FULL A POSTERIORI ESTIMATION

In real life, humans normally carry on one activity for a while and then transfer to another. This important property should be considered when designing HAR systems; however, to the best of our knowledge, no previous works have exploited it. Conventional data-driven approaches estimate human activities based only on the current sensor observations. In this article, to take the dynamics of human behavior into consideration, we propose an MFAP algorithm that considers both the past information and the current a posteriori information obtained from the proposed feature fusion framework. The MFAP can be formulated as

ẑ_t = arg max_{z_t} p(z_t|o_{1:t})   (7)

where z_t is the human activity at time instance t and o_{1:t} are the observations from time instance 1 to t. Here, we make
Therefore, we can assume that human activities follow a first-order Markov property, and the first assumption is considered valid. The observation relies on real human activity. Once the current activity is known, the current observation is independent of the previous observations. Hence, the second assumption, which states that the current observation of activity is conditional independence of the previous observations is also considered valid. According to Bayes rules, the full a posteriori probability for HAR, p(zt |o1:t ), can be expressed as p(zt |o1:t ) = p(o1:t |zt )p(zt )p(o1:t ) = p(ot , o1:t−1|zt )p(zt ) p(ot , o1:t−1) = p(ot |o1:t−1, zt )p(o1:t−1|zt )p(zt ) p(ot |o1:t−1)p(o1:t−1) = p(ot |o1:t−1, zt )p(zt |o1:t−1)p(o1:t−1)p(zt ) p(ot |o1:t−1)p(o1:t−1)p(zt ) = p(ot |zt )p(zt |o1:t−1) p(ot |o1:t−1) = p(zt |ot )p(ot)p(zt |o1:t−1) p(zt )p(ot |o1:t−1) . (8) Given observations o1:t from time step 1 to t , the probability of (p(ot)/p(ot |o1:t−1)) is deterministic, which can be treated as a normalization factor. Therefore, the full a posteriori probability can be further expressed as p(zt |o1:t) ∝ p(zt |ot)p(zt |o1:t−1)p(zt ) (9) In (9), p(zt |ot ) is the a posteriori probability of the human activity. Compared with p(zt |ot ), full observation information is involved in p(zt |o1:t ). Hence, we call the estimation in (7) MFAP estimation. We can find from (9) that the full a posteriori probability, i.e., p(zt |o1:t), is determined by the following three components. 1) p(zt |o1:t−1) = ∑ i p(zt |zt−1 = li )p(zt−1 = li |o1:t−1) (10) where li is the i th activity and p(zt |zt−1) is the transition probability for the first-order Markov chain model. 2) p(zt |ot ): the current a posteriori which can be obtained from the proposed feature fusion framework. 3) p(zt ): the prior distribution for different activities. To get p(zt |o1:t−1) from (10), we need to obtain the transition probability p(zt |zt−1) for the first-order Markov chain model. 
Here, we model the human activity sequence as a Markov chain, which describes the transitions from one activity to another. Given the n activities {l_1, l_2, ..., l_n}, the entry in the i-th row and j-th column of the transition probability matrix, A ∈ R^{n×n}, can be expressed as

a_ij = p(z_t = l_i|z_{t-1} = l_j),  i, j = 1, 2, ..., n.   (11)

We calculate the transition probability matrix from the training data. Given an m-step human activity sequence, the transition probability from state l_j to state l_i, denoted a_ij, can be calculated as

a_ij = Σ_{t=2}^{m} δ(z_t − l_i) δ(z_{t-1} − l_j) / Σ_{t=2}^{m} δ(z_{t-1} − l_j)   (12)

where δ(α) = 1 if α = 0, and δ(α) = 0 otherwise.

Next, the probability p(z_t|o_t) is obtained from the proposed feature fusion framework. Since the last layer of the framework is a softmax layer, it produces a probability for each activity given the current smartphone sensor measurements. Specifically, the current a posteriori probability can be expressed as

p(z_t|o_t) = softmax(l_t).   (13)

Finally, the prior p(z_t) can easily be counted from the training data as

p(z_t = l_i) = (1/m) Σ_{t=1}^{m} δ(z_t − l_i).   (14)

The implementation of the proposed MFAP for HAR is shown in Algorithm 1.

Algorithm 1 Proposed MFAP for HAR
Input: A = {a_ij}, b_t = {b_t^i} = {p(z_t = l_i|o_t)}, c = {c_i} = {p(z_t = l_i)}, i, j = 1, 2, ..., n, t = 1, 2, ..., T.
Output: full a posteriori r_t = p(z_t|o_{1:t}); predicted activities O.
Initialization: t = 1
1: r_1 = {r_1^i} = b_1
2: O_1 = arg max_{l_i} r_1
Recursion:
3: for t = 2 to T do
4:   for i = 1 to n do
5:     r_t^i = b_t^i (Σ_j a_ij r_{t-1}^j) / c_i, based on (9)
6:   end for
7:   O_t = arg max_{l_i} r_t
8: end for
9: return O

V. EXPERIMENTS

A.
Data Description

To evaluate the performance of the proposed approaches for HAR using smartphone sensors, we first use a public data set from UCI [12]. A Samsung Galaxy S II smartphone, attached to the waist of the subjects with a fixed orientation, was used for data collection. Both 3-D acceleration and gyroscope data were collected. The data set contains six activities: walking, walking upstairs, walking downstairs, standing, sitting, and laying. The sampling frequency is 50 Hz. A sliding window of 2.56 s (one sample) with a 50% overlap is used for data segmentation. In total, 10 299 samples were collected from 30 participants.

We also collected our own data set using a recently released Huawei P20 Pro smartphone. For this data set, instead of attaching the smartphone to a fixed position, which may not be realistic, we freely put the smartphone in three common positions, i.e., pants pocket, shirt pocket, and backpack, without any restrictions during data collection. Here, we consider a different set of activities: walking, fast walking, running, walking upstairs, walking downstairs, and static. Similarly, we collected both 3-D acceleration and gyroscope data at a sampling rate of 50 Hz, and again used a sliding window of 2.56 s with a 50% overlap for data segmentation. In total, 4752 samples were collected from 12 volunteers.

The two data sets differ in several aspects: 1) the smartphones used are different; 2) the placements of the smartphones are different; and 3) due to the different placements, the explored activities are different. Since the smartphone is attached to the waist with a fixed orientation in the public data set, it is possible to detect the activities of "Standing" and "Sitting" based on the slight variations in smartphone orientation.
Meanwhile, the orientation for "Laying" is totally different from that of the other two static activities, "Standing" and "Sitting"; thus, various algorithms achieve very high recognition accuracy for "Laying," as shown in Table II. These three activities, i.e., "Standing," "Sitting," and "Laying," can therefore be distinguished in the public data set based on orientation information. However, in our own data set, the smartphone is freely placed in three common positions without any restriction on its orientation, so we are not able to distinguish these three activities based on orientation information. For this reason, we explore other common activities, such as "Fast walking," "Running," and "Static," in our own data set. For both the public data set and our own data set, we randomly select 70% of the data to train the different algorithms and use the remainder for testing.

B. Experimental Setup

To verify the performance of the proposed approaches, we compare them with several advanced learning algorithms for HAR, including shallow learning algorithms with handcrafted features, such as an artificial neural network (ANN), SVM [33], extreme learning machine (ELM) [34], and RF, as well as the deep learning algorithm of deep LSTM [35]. The parameters of all the benchmark approaches and the proposed approach are carefully tuned using a validation set. For ANN and ELM, the number of hidden nodes is determined by grid search on the validation set. The popular radial basis function (RBF) kernel is chosen for SVM, with its parameters also determined by grid search. For RF, the number of decision trees is set to 500 for ensemble learning. The deep LSTM consists of two LSTM layers with sizes of 32 and 64, an FCL with a size of 100, and a softmax layer for classification. For the proposed fusion framework, two LSTM layers with sizes of 32 and 64 are used, and the FCLs in Fig. 2 both have 100 hidden nodes.

C.
Experimental Results

1) Results on the Public Data Set: The experimental results on the public data set are shown in Table II. With expert knowledge, the conventional machine learning approaches of ANN, ELM, and SVM with handcrafted features slightly outperform the deep LSTM with automatic feature learning on the public data set. This indicates that the handcrafted features are more representative of these activities. The proposed feature fusion framework, which combines handcrafted features with features automatically learned by the deep algorithm, outperforms both the shallow and deep benchmarks. This indicates that the two types of features contain unique information for HAR and complement each other, leading to better performance. By taking the dynamics of human behavior into consideration, the proposed MFAP achieves the best performance, with an overall accuracy as high as 98.85%.

We now zoom into the recognition of specific activities. Among all the activities, "Laying" has the highest recognition accuracy, due to its distinct smartphone orientation compared with the other five activities. The activities of "Sitting" and "Standing" produce very similar patterns in the smartphone sensor readings, so their recognition accuracies are relatively low. Similarly, the recognition performance for "Walking Upstairs" and "Walking Downstairs" is also limited because of their similar sensory patterns. Owing to the proposed feature fusion framework and the consideration of the dynamics of human behavior, the proposed MFAP has the highest recognition accuracy for all six activities. The activity recognition results of the proposed feature fusion framework and the proposed MFAP on the test data are shown in Fig. 3.
It can be observed that the activities of “Standing” and “Sitting” are difficult to separate, due to their similar sensory patterns. The activities of “Walking,” “Walking Upstairs,” and “Walking Downstairs” suffer from the same issue. By taking the dynamics of human behavior into consideration, the proposed MFAP algorithm dramatically improves the results, which clearly indicates its effectiveness for HAR. Fig. 4 shows the confusion matrices of the proposed feature fusion framework and the proposed MFAP on the public data set. The general conclusion is the same: by considering human dynamics, the proposed MFAP improves the recognition accuracies for all six activities.

Fig. 3. Recognition results of the proposed feature fusion framework and the proposed MFAP on the public data set.
Fig. 4. Confusion matrices of the proposed feature fusion framework and the proposed MFAP on the public data set. (a) The proposed feature fusion framework. (b) The proposed MFAP.
Fig. 5. Recognition results of the proposed feature fusion framework and the proposed MFAP on our own data set.

2) Results on Our Own Data Set: The experimental results on our own data set are shown in Table III. In general, all the approaches perform better on our own data set than on the public data set. One possible reason for the distinct results is that the explored activities differ between the two data sets. Based on Table II, we can find that the activities of “Standing” and “Sitting” are difficult to separate, due to their similar sensory patterns (no movement and similar smartphone orientation), while the activities in Table III are relatively easier to separate.
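The effect of incorporating human dynamics can be illustrated with a small sketch. The code below is only an approximation of the MFAP idea, not the authors' algorithm: it combines the per-window class posteriors from a classifier with a hypothetical “sticky” transition prior (the `stay_prob` value is an assumption; in practice the transition prior would be estimated from training data) and decodes the most likely activity sequence, which suppresses isolated single-window misclassifications of the kind visible as spikes in Fig. 3.

```python
import numpy as np

def smooth_with_dynamics(posteriors, stay_prob=0.9):
    """Viterbi-style decoding over per-window class posteriors.

    posteriors: (T, K) array of class probabilities from the classifier.
    stay_prob:  prior probability that the activity persists between
                consecutive windows (illustrative value).
    Returns the most likely activity index sequence under the sticky prior.
    """
    T, K = posteriors.shape
    trans = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(trans, stay_prob)
    log_post = np.log(posteriors + 1e-12)
    log_trans = np.log(trans)

    score = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev class, next class)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]

    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

if __name__ == "__main__":
    # A run of activity 0 with one spurious spike toward activity 1.
    p = np.array([[0.9, 0.1]] * 3 + [[0.2, 0.8]] + [[0.9, 0.1]] * 3)
    print(smooth_with_dynamics(p))  # prints an all-zero path: spike corrected
```

The single window favoring class 1 is overruled because two transitions under the sticky prior cost more log-probability than the momentary evidence gains, which is precisely the behavior that removes spike errors.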
Moreover, the different devices and the different ways in which the data were collected for the two data sets may also contribute. In contrast to the results on the public data set, the deep LSTM with automatically learned features outperforms the conventional machine learning approaches with handcrafted features. This means that the features automatically learned by the deep algorithm are more representative for HAR on this data set. Similarly, the proposed fusion framework, which combines the handcrafted features and the features automatically learned by the deep algorithm, outperforms the deep LSTM and the conventional machine learning approaches with handcrafted features, i.e., ANN, ELM, SVM, and RF. We can conclude that the handcrafted features and the features learned by the deep algorithm have unique merits, resulting in distinct performances on different data sets. With the proposed feature fusion framework, we can make good use of the merits of these two types of features to boost the performance of HAR using smartphone sensors. In addition, the proposed MFAP is able to take the dynamics of human behavior into consideration, further improving the performance of the proposed feature fusion algorithm; the overall accuracy is as high as 99.58% on our own data set. For our own data set, we consider some different activities due to the different placement of smartphones in the two data sets.

TABLE II. RECOGNITION ACCURACIES OF ALL THE APPROACHES ON THE PUBLIC DATA SET
TABLE III. RECOGNITION ACCURACIES OF ALL THE APPROACHES ON OUR OWN DATA SET
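The per-class accuracies reported in Tables II and III can be derived from a confusion matrix as the row-normalized diagonal. A minimal sketch (the labels and counts below are illustrative only, not data from the paper):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true activities, columns are predicted activities."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal of the row-normalized confusion matrix."""
    return cm.diagonal() / cm.sum(axis=1)

if __name__ == "__main__":
    y_true = [0, 0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
    cm = confusion_matrix(y_true, y_pred, 3)
    print(cm)
    print(per_class_accuracy(cm))  # per-class: 2/3, 1.0, 2/3
```

The overall accuracy in the tables is the trace of the matrix divided by the total sample count, which is why a classifier can score high overall while individual confusable pairs such as “Sitting”/“Standing” remain low.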
It can be found that the activities of “Fast Walking,” “Running,” and “Static” contain distinct movement patterns and can thus be easily identified with high recognition accuracies. However, the activities of “Walking,” “Walking Upstairs,” and “Walking Downstairs” have very similar movement patterns and thus confuse most of the algorithms. Owing to the proposed feature fusion framework and the consideration of the dynamics of human behavior, the final recognition accuracies of the proposed MFAP are higher than 99% for all the activities. Fig. 5 shows the recognition results of the proposed feature fusion framework and the proposed MFAP on the test portion of our own data set. Even though the proposed feature fusion framework has already achieved a very high recognition accuracy, i.e., 98.67%, it still produces some wrong estimations, visible as many spikes (see the green line in Fig. 5), which are harmful to real applications, such as home automation. With the proposed MFAP, which takes the dynamics of human behavior into consideration, most of the wrong estimations can be corrected. We also show the confusion matrices of the proposed feature fusion framework and the proposed MFAP on our own data set in Fig. 6. It can be found that the proposed MFAP corrects most of the wrong predictions of the proposed feature fusion framework, owing to the consideration of human dynamics.

3) Comparison With the State of the Art: We have also compared our approach with some state-of-the-art approaches in the literature, including HNN [15], FW KNN [14], FW NB [14], HF-SVM [33], two-stage CHMM [16], SRC-SVD [17], FMM-CART [18], SAEs-c [19], Convnet [20], HCF Convnet [21], tFFT Convnet [20], and Knowledge Distilling [13], using the public data set. Detailed reviews of all these approaches can be found in Section II.

Fig. 6. Confusion matrices of the proposed feature fusion framework and the proposed MFAP on our own data set. (a) Proposed feature fusion framework. (b) Proposed MFAP.
Table IV presents the experimental results of these state-of-the-art approaches and the proposed approach. It can be found that our proposed approach achieves superior performance over these state-of-the-art methods.

TABLE IV. COMPARISON WITH THE STATE OF THE ART

VI. CONCLUSION

In this article, we first propose a feature fusion framework, which combines handcrafted features based on domain knowledge with features automatically learned by a deep algorithm, for human activity recognition (HAR). By taking the dynamics of human behavior into consideration, we then formulate an MFAP with the past information and the current a posteriori information obtained from the proposed feature fusion framework to give an optimal estimation of human activities. We employ a public data set and a self-collected data set to evaluate the performance of the proposed approaches. Extensive experiments show that the proposed feature fusion framework outperforms five benchmark approaches, and the proposed MFAP can further improve the performance of HAR. We also compared our approach with some state-of-the-art methodologies on the public data set; the proposed MFAP achieves the best performance, indicating that our method is practical for real-world applications. In future work, we intend to focus on the recognition of more complex activities [36]. Moreover, considering that variation in smartphone orientation may degrade recognition performance, how to enhance the performance of smartphone-based HAR with varying device orientations is another direction of our future work.

REFERENCES

[1] Y. Zhang, G. Tian, S. Zhang, and C.
Li, “A knowledge-based approach for multiagent collaboration in smart home: From activity recognition to guidance service,” IEEE Trans. Instrum. Meas., to be published.
[2] O. D. Lara and M. A. Labrador, “A survey on human activity recognition using wearable sensors,” IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1192–1209, 3rd Quart., 2013.
[3] B. Ni, G. Wang, and P. Moulin, “RGBD-HuDaAct: A color-depth video database for human daily activity recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV Workshops), Nov. 2011, pp. 1147–1153.
[4] G. Panahandeh, N. Mohammadiha, A. Leijon, and P. Händel, “Continuous hidden Markov model for pedestrian activity classification and gait analysis,” IEEE Trans. Instrum. Meas., vol. 62, no. 5, pp. 1073–1083, May 2013.
[5] S. C. Mukhopadhyay, “Wearable sensors for human activity monitoring: A review,” IEEE Sensors J., vol. 15, no. 3, pp. 1321–1330, Mar. 2015.
[6] Z. Chen, Q. Zhu, S. Y. Chai, and L. Zhang, “Robust human activity recognition using smartphone sensors via CT-PCA and online SVM,” IEEE Trans. Ind. Informat., vol. 13, no. 6, pp. 3070–3080, Dec. 2017.
[7] Q. Zhu, Z. Chen, and Y. C. Soh, “A novel semisupervised deep learning method for human activity recognition,” IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 3821–3830, Jul. 2019.
[8] Z. Chen, C. Jiang, and L. Xie, “A novel ensemble ELM for human activity recognition using smartphone sensors,” IEEE Trans. Ind. Informat., vol. 15, no. 5, pp. 2691–2699, May 2019.
[9] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in Proc. IJCAI, vol. 15, 2015, pp. 3995–4001.
[10] M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H.-P. Tan, “Deep activity recognition models with triaxial accelerometers,” in Proc. AAAI Workshop, Artif. Intell. Appl. Assistive Technol. Smart Environ., 2016, pp. 8–13.
[11] N. Y. Hammerla, S. Halloran, and T.
Ploetz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” 2016, arXiv:1604.08880. [Online]. Available: https://arxiv.org/abs/1604.08880
[12] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” in Proc. ESANN, 2013, pp. 437–442.
[13] Z. Chen, L. Zhang, Z. Cao, and J. Guo, “Distilling the knowledge from handcrafted features for human activity recognition,” IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4334–4342, Oct. 2018.
[14] A. Wang, G. Chen, J. Yang, S. Zhao, and C.-Y. Chang, “A comparative study on human activity recognition using inertial sensors in a smartphone,” IEEE Sensors J., vol. 16, no. 11, pp. 4566–4578, Jun. 2016.
[15] M. Eastwood and C. Jayne, “Evaluation of hyperbox neural network learning for classification,” Neurocomputing, vol. 133, pp. 249–257, Jun. 2014.
[16] C. A. Ronao and S.-B. Cho, “Human activity recognition using smartphone sensors with two-stage continuous hidden Markov models,” in Proc. 10th Int. Conf. Natural Comput. (ICNC), Aug. 2014, pp. 681–686.
[17] R. Rana, B. Kusy, J. Wall, and W. Hu, “Novel activity classification and occupancy estimation methods for intelligent HVAC (heating, ventilation and air conditioning) systems,” Energy, vol. 93, pp. 245–255, Dec. 2015.
[18] M. Seera, C. K. Loo, and C. P. Lim, “A hybrid FMM-CART model for human activity recognition,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2014, pp. 182–187.
[19] Y. Li, D. Shi, B. Ding, and D. Liu, “Unsupervised feature learning for human activity recognition using smartphone sensors,” in Mining Intelligence and Knowledge Exploration. Cham, Switzerland: Springer, 2014, pp. 99–107.
[20] C. A. Ronao and S.-B. Cho, “Human activity recognition with smartphone sensors using deep learning neural networks,” Expert Syst. Appl., vol. 59, pp. 235–244, Oct. 2016.
[21] C. A. Ronao and S.-B.
Cho, “Deep convolutional neural networks for human activity recognition with smartphone sensors,” in Proc. Int. Conf. Neural Inf. Process. Cham, Switzerland: Springer, 2015, pp. 46–53.
[22] D. Tao, Y. Wen, and R. Hong, “Multicolumn bidirectional long short-term memory for mobile devices-based human activity recognition,” IEEE Internet Things J., vol. 3, no. 6, pp. 1124–1134, Dec. 2016.
[23] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learning for sensor-based activity recognition: A survey,” Pattern Recognit. Lett., vol. 119, pp. 3–11, Mar. 2019.
[24] H. Qian, S. J. Pan, and C. Miao, “Sensor-based activity recognition via learning from distributions,” in Proc. AAAI, 2018, pp. 6262–6269.
[25] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: A simple deep learning baseline for image classification?” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5017–5032, Dec. 2015.
[26] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
[27] Y. Liu, L. Nie, L. Liu, and D. S. Rosenblum, “From action to activity: Sensor-based activity recognition,” Neurocomputing, vol. 181, pp. 108–115, Mar. 2016.
[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[30] G. E. Hinton, “Learning multiple layers of representation,” Trends Cognit. Sci., vol. 11, no. 10, pp. 428–434, Oct. 2007.
[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA, Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, 2012.
[32] T. V. Duong, H. H. Bui, D. Q. Phung, and S.
Venkatesh, “Activity recognition and abnormality detection with the switching hidden semi-Markov model,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 838–845.
[33] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine,” in Proc. Int. Workshop Ambient Assist. Living. Berlin, Germany: Springer, 2012, pp. 216–223.
[34] Y. Chen, Z. Zhao, S. Wang, and Z. Chen, “Extreme learning machine-based device displacement free activity recognition model,” Soft Comput., vol. 16, no. 9, pp. 1617–1625, 2012.
[35] W. Zhu et al., “Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks,” in Proc. AAAI, vol. 2, Mar. 2016, p. 8.
[36] L. Liu, L. Cheng, Y. Liu, Y. Jia, and D. S. Rosenblum, “Recognizing complex activities by a probabilistic interval-based model,” in Proc. AAAI, vol. 30, 2016, pp. 1266–1272.

Zhenghua Chen received the B.Eng. degree in mechatronics engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2011, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2017. He is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His research interests include data analytics in smart buildings, ubiquitous computing, the Internet of Things, machine learning, and deep learning.

Chaoyang Jiang received the B.E. degree in electrical engineering and automation from the China University of Mining and Technology, Xuzhou, China, in 2009, the M.E.
degree in control science and engineering from the Harbin Institute of Technology, Harbin, China, in 2011, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2017. He is currently an Associate Professor with the School of Mechanical Engineering, Beijing Institute of Technology. His current research interests include statistical signal processing, sparse sensing, machine learning, and information fusion.

Shili Xiang received the B.S. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2003, and the Ph.D. degree in computer science from the National University of Singapore, Singapore, in 2011. She is currently a Scientist and a Principal Investigator with the Data Analytics Department, Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. Her current research interests include smart mobility, ubiquitous computing, data mining, and machine learning.

Jie Ding received the B.Eng. degree in automation from Harbin Engineering University, Harbin, China, in 2012, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2018. She is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. Her current research interests include machine learning, pattern recognition, control and optimization, and complex networks.

Min Wu received the B.S. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2006, and the Ph.D. degree in computer science from Nanyang Technological University, Singapore, in 2011. He is currently a Senior Scientist with the Data Analytics Department, Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include machine learning, data mining, and bioinformatics. Dr.
Wu was a recipient of the Best Paper Award at InCoB 2016 and DASFAA 2015. He also won the IJCAI Competition on repeated buyers' prediction in 2015.

Xiaoli Li is currently a Principal Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. He also holds an adjunct professor position at Nanyang Technological University. He has authored or coauthored more than 180 high-quality articles. His current research interests include data mining, machine learning, AI, and bioinformatics. Dr. Li was a recipient of numerous best paper/benchmark competition awards. He has been serving as a (senior) PC member/workshop chair/session chair in leading data mining and AI related conferences, including KDD, ICDM, SDM, PKDD/ECML, WWW, IJCAI, AAAI, ACL, and CIKM.