IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 69, NO. 7, JULY 2020

Smartphone Sensor-Based Human Activity Recognition Using Feature Fusion and Maximum Full a Posteriori

Zhenghua Chen, Chaoyang Jiang, Shili Xiang, Jie Ding, Min Wu, and Xiaoli Li

Abstract— Human activity recognition (HAR) using smartphone sensors has attracted great attention due to its wide range of applications. A standard solution for HAR is to first generate features defined based on domain knowledge (handcrafted features) and then to train an activity classification model on these features. More recently, deep learning with automatic feature learning from raw sensory data has also achieved strong performance on the HAR task. We believe that handcrafted features and learned features may each convey unique information that can complement the other for HAR. In this article, we first propose a feature fusion framework that combines handcrafted features with features automatically learned by a deep algorithm for HAR. Then, taking the regular dynamics of human behavior into consideration, we develop a maximum full a posteriori algorithm to further enhance the performance of HAR. Our extensive experimental results show that the proposed approach achieves superior performance compared with state-of-the-art methods on both a public data set and a self-collected data set.

Index Terms— Deep learning, feature fusion, human activity recognition (HAR), maximum full a posteriori (MFAP), smartphone sensors.

I. INTRODUCTION

Human activity recognition (HAR) is of great importance for many applications in health-care services, smart homes, and pervasive and mobile computing [1], [2]. With the development of computer vision techniques, camera-based HAR has been well developed [3]. However, it can only monitor a specific space with adequate illumination conditions. In addition, it suffers from privacy concerns.
Wearable sensors, such as accelerometers and gyroscopes, are also popular for HAR [4], [5]. However, they require special hardware to be worn by users, which is inconvenient. In the past decade, smartphones have become increasingly powerful, with many embedded sensors, including an accelerometer, gyroscope, barometer, temperature sensor, and so on. Since most people carry smartphones in their daily life, smartphone-based HAR is thus a practical option [6], [7].

Recently developed smartphone sensor-based HAR methods can generally be divided into two categories: shallow and deep algorithms. Shallow algorithms consist of two steps, feature extraction and activity inference [2], [8]. Since raw smartphone sensor data are not very representative of distinct activities, a standard procedure is to extract informative features, a process known as feature extraction/engineering. For instance, the magnitude of acceleration should be helpful in separating activities such as walking and running. As such, predefined statistical features, known as handcrafted features, are first extracted from the raw smartphone sensor data. Note that these handcrafted features are also generated automatically, by programs written based on their definitions. Machine learning algorithms, such as neural networks, support vector machines (SVMs), and random forests (RFs), can then be applied to the handcrafted features to identify different human activities. Deep algorithm-based HAR, on the other hand, is a one-step approach that automatically learns representative features from the raw sensory data without human intervention and performs activity inference simultaneously [9]–[11].

We observe that both shallow learning algorithms with handcrafted features and deep learning algorithms with automatically learned features have achieved great success in the task of HAR [12], [13].

Manuscript received July 10, 2019; revised September 3, 2019; accepted September 22, 2019. Date of publication October 3, 2019; date of current version June 9, 2020. This work was supported in part by the Ministry of National Development, Singapore, through the Sustainable Urban Living Program under Grant SUL2013-5 and in part by the Beijing Institute of Technology Research Fund Program for Young Scholars. The Associate Editor coordinating the review process was Dr. Alessio De Angelis. (Corresponding authors: Chaoyang Jiang; Min Wu.)

Z. Chen, S. Xiang, J. Ding, M. Wu, and X. Li are with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore 138632 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

C. Jiang is with the Science and Technology on Vehicle Transmission Laboratory, School of Mechanical Engineering, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).

Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIM.2019.2945467
We believe that handcrafted features and features automatically learned by deep algorithms may each convey unique information that can complement the other to boost the performance of smartphone sensor-based HAR. In this article, in the first stage, we propose a feature fusion framework that integrates handcrafted features with a deep algorithm, i.e., deep long short-term memory (LSTM), to boost the performance of HAR. In the second stage, considering the dynamics (frequent activity changes) of human behavior, we propose a maximum full a posteriori (MFAP) algorithm, which exploits all the past information and the current a posteriori probability obtained from the feature fusion framework to give an optimal estimation of human activities. The main contributions of this article are summarized as follows.

1) We propose a novel feature fusion framework that effectively combines handcrafted features with a deep learning algorithm to boost the performance of smartphone sensor-based HAR.
2) Taking the dynamics of human behavior into consideration, we formulate an MFAP algorithm that exploits all the past information and the current a posteriori information obtained from the feature fusion framework to give an optimal estimation of human activities.
3) We use a public data set and a self-collected data set to evaluate the effectiveness of the proposed approach. Our comprehensive experimental results demonstrate that the proposed approach significantly outperforms existing advanced learning algorithms and the state of the art.

0018-9456 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: The Chinese University of Hong Kong CUHK(Shenzhen). Downloaded on November 10, 2022 at 01:31:38 UTC from IEEE Xplore. Restrictions apply.
The remainder of this article is organized as follows. Section II reviews related works using handcrafted features and automatic feature learning by deep algorithms for HAR. Section III briefly introduces the handcrafted features and a deep algorithm for automatic feature learning, followed by the proposed feature fusion framework. Section IV presents the proposed MFAP algorithm. Section V first describes the data used for evaluation, followed by the experimental setup; the experimental results are then presented and discussed. Section VI concludes this article and outlines potential future work.

II. RELATED WORKS

In this section, related works on HAR using different learning algorithms are reviewed. We divide this section into two parts: shallow and deep algorithms.

A. Shallow Algorithms

Shallow algorithms normally consist of feature engineering and activity inference. Since raw smartphone sensor data are noisy and not representative of different human activities, more informative features can be extracted with domain knowledge. Shallow learning algorithms can then be applied to these handcrafted features for HAR. For example, Wang et al. [14] investigated the effectiveness of the smartphone accelerometer and gyroscope for HAR. First, they extracted a large number of statistical features from both the time and frequency domains of the 3-D acceleration and gyroscope data. Then, they proposed a hybrid filter-and-wrapper method, known as FW, to select the best features from all the handcrafted features. Finally, machine learning algorithms, namely, k-nearest neighbors (KNN) and naive Bayes (NB), were employed to classify different activities. Eastwood and Jayne [15] evaluated different extensions of the hyperbox neural network (HNN), which is built upon different modes of learning, for HAR. In addition, Anguita et al. [12] proposed a hardware-friendly SVM (HF-SVM) algorithm based on fixed-point arithmetic for HAR using smartphone sensors.
The experimental results showed that HF-SVM has performance comparable to the conventional SVM, but with much lower computational complexity. Ronao and Cho [16] presented two-stage continuous hidden Markov models (CHMMs) for HAR. The first-stage CHMM was used to separate static and dynamic activities; the second-stage CHMM was then applied to identify the exact activity within each of the two types. Rana et al. [17] enhanced the sparse random classifier with singular value decomposition (SRC-SVD) for HAR, where the SVD was leveraged to construct the random projection matrix for SRC. Seera et al. [18] proposed a hybrid of the fuzzy min–max (FMM) neural network and the classification and regression tree (CART) to recognize human activities. In their system, the FMM was mainly used for incremental data learning, and the CART was used to provide interpretations for the classification.

B. Deep Algorithms

Owing to their powerful feature learning ability, deep algorithms have achieved remarkable performance for HAR using smartphone sensors. Li et al. [19] presented a sparse autoencoder (SAE) to automatically learn representative features from raw smartphone accelerometer and gyroscope data for the task of HAR. The 3-D acceleration, the gyroscope data, and their magnitudes are treated as different channels on which the SAE is applied for feature learning. Ronao and Cho [20] presented a convolutional neural network (convnet) that learns representative features from raw smartphone sensor data for HAR. They also explored the use of the temporal fast Fourier transform (tFFT) on the raw sensory data with the convnet. In another work, they applied handcrafted features, instead of the raw smartphone sensor data, as the inputs of the convnet for HAR [21]. Tao et al. [22] presented an ensemble bidirectional long short-term memory (BLSTM) approach for HAR.
They applied the raw sensory data, the magnitude of the raw sensory data, and two-directional features as inputs to different BLSTMs. Experiments indicate the effectiveness of their proposed approach. Chen et al. [13] proposed a knowledge distilling strategy that uses well-designed handcrafted features to guide deep algorithms toward better generalization for smartphone sensor-based HAR. A comprehensive survey on deep learning-based HAR can be found in [23].

In real applications, both handcrafted features based on domain knowledge and features automatically learned by deep algorithms may convey unique information for HAR. In this article, we build a feature fusion framework to combine these two types of features and make good use of all the useful information, which should boost the performance of HAR. Taking the dynamics of human behavior into consideration, we further improve the performance of HAR by formulating an MFAP algorithm that exploits all the past information together with the current a posteriori information obtained from the feature fusion framework to give an optimal estimation of human activities.

III. PROPOSED FEATURE FUSION FRAMEWORK

In this section, we first briefly introduce handcrafted features and automatic feature learning, and subsequently elaborate the two key innovations in our proposed methods.

TABLE I
HANDCRAFTED FEATURES

A. Handcrafted Features

Feature engineering is a widely used technique for data preprocessing and underlies the success of shallow machine learning algorithms [24]. For HAR using smartphone sensors, the raw sensory data are not representative of different human activities.
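As a preview of what such features look like in practice, a few representative time- and frequency-domain statistics of the kind listed in Table I can be computed from one segmented window of tri-axial accelerometer data as follows. This is a minimal sketch rather than the full Table I feature set, and the function and feature choices here are our own illustrative ones:

```python
import numpy as np

def handcrafted_features(window):
    """Compute a few Table-I-style statistics for one window.

    window: array of shape (T, 3) holding T samples of 3-D acceleration.
    Returns a 1-D feature vector.
    """
    feats = []
    magnitude = np.linalg.norm(window, axis=1)      # acceleration magnitude
    for signal in list(window.T) + [magnitude]:     # x, y, z, and |a| channels
        feats += [signal.mean(), signal.std(),      # time-domain statistics
                  signal.min(), signal.max()]
        spectrum = np.abs(np.fft.rfft(signal))      # frequency-domain view
        feats += [spectrum.mean(),                  # mean spectral magnitude
                  float(np.argmax(spectrum))]       # dominant frequency bin
    return np.array(feats)

# A 2.56 s window at 50 Hz (as used in Section V) contains 128 samples.
window = np.random.randn(128, 3)
print(handcrafted_features(window).shape)  # (24,)
```

In a full pipeline, one such vector is computed per sliding window (2.56 s with 50% overlap in this article) for both the accelerometer and the gyroscope.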
To achieve better performance for HAR, more representative features can be extracted based on domain knowledge. For example, walking and running yield different magnitudes of acceleration, so the magnitude of acceleration can be extracted as a feature to separate these two activities. In addition, the variance of smartphone sensor readings can be used to distinguish static activities from dynamic ones. As such, a set of statistical features from the time and frequency domains has been shown to be effective for smartphone sensor-based HAR [12]; these features are presented in Table I. All of these handcrafted features are extracted from both the 3-D acceleration and the gyroscope data of smartphones.

B. Automatic Feature Learning

Deep learning has achieved great success in many challenging research areas, such as image recognition [25] and natural language processing [26]. The biggest merit of deep learning is its ability to learn features automatically from raw sensory data without human intervention. For HAR using smartphone sensors, the raw sensory data are typical time series with temporal dependence [27]. While a recurrent neural network (RNN) is naturally suitable for time series data, the conventional RNN suffers from vanishing and exploding gradients, which degrade its performance in modeling long-term dependencies in sequential data [28]. To solve this problem, Hochreiter and Schmidhuber proposed a new RNN named long short-term memory (LSTM), which uses memory cells to preserve information over long-term dependencies [29]. A typical LSTM structure is shown in Fig. 1, where x_t is the input at time step t, h_t is the hidden state, C_{t-1} is the memory cell state, w_f, w_i, w_C, and w_o are the weights, b_f, b_i, b_C, and b_o are the biases, and σ(·) and tanh(·) are the sigmoid and tanh functions, respectively.

Fig. 1. Structure of the LSTM network.
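One step of this cell, as formalized in (1)–(6) below, can be sketched in NumPy. This is a minimal illustration with randomly initialized weights standing in for trained parameters; real implementations use trained weights and optimized kernels:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM cell update; W and b hold the four gate parameters."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])              # forget gate, Eq. (1)
    i = sigmoid(W["i"] @ z + b["i"])              # input gate, Eq. (2)
    C_tilde = np.tanh(W["C"] @ z + b["C"])        # candidate state, Eq. (3)
    C = f * C_prev + i * C_tilde                  # cell state, Eq. (4)
    o = sigmoid(W["o"] @ z + b["o"])              # output gate, Eq. (5)
    h = o * np.tanh(C)                            # hidden output, Eq. (6)
    return h, C

d_in, d_hid = 6, 4                                # e.g., 6 sensor channels
rng = np.random.default_rng(0)
W = {k: rng.normal(size=(d_hid, d_in + d_hid)) for k in "fiCo"}
b = {k: np.zeros(d_hid) for k in "fiCo"}
h, C = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(128, d_in)):          # one 2.56 s window at 50 Hz
    h, C = lstm_step(x_t, h, C, W, b)
print(h.shape)  # (4,)
```

The hidden state after the last time step plays the role of the learned feature vector for the window.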
In the LSTM network, the first step is to determine which information should be discarded from the previous memory cell state C_{t-1} by using a forget gate, which can be formulated as

f_t = σ(w_f [h_{t-1}, x_t] + b_f).   (1)

Here, f_t = 1 means keeping all the information from the previous step, and f_t = 0 means totally removing it. The next step is to determine which new information should be stored based on the current input. It consists of two components. The first is an input gate that decides what shall be updated:

i_t = σ(w_i [h_{t-1}, x_t] + b_i).   (2)

The second produces a candidate state value C̃_t by using a tanh function:

C̃_t = tanh(w_C [h_{t-1}, x_t] + b_C).   (3)

After that, the current cell state C_t is computed as

C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t.   (4)

Finally, the hidden output h_t is a filtered version of the compressed cell state tanh(C_t). The output of the sigmoid layer, o_t, determines which part of the information is preserved:

o_t = σ(w_o [h_{t-1}, x_t] + b_o).   (5)

The final hidden output h_t ∈ R^d, where d is the feature dimension, can be expressed as

h_t = o_t ∗ tanh(C_t).   (6)

Deep architectures have been shown to be effective for representation learning [30]. Therefore, in this article, we stack multiple LSTM layers, known as a deep LSTM, for deep representation learning in the task of smartphone sensor-based HAR. Specifically, the output of the i-th LSTM layer is the input of the (i+1)-th LSTM layer; as a special case, the input of the first LSTM layer is the raw sequential smartphone sensor data.

Fig. 2. Proposed feature fusion framework.

C. Proposed Feature Fusion

Both the handcrafted features based on domain knowledge and the features learned by deep algorithms may contain unique information for HAR. To make good use of both, we propose a feature fusion framework that combines them for better recognition of human activities using smartphone sensors. The proposed feature fusion framework is shown in Fig. 2. Here, we choose the deep LSTM for feature learning, which is naturally suitable for our sequential data analysis problem. The raw sequential smartphone sensor data are fed into two stacked LSTM layers for feature learning. The learned features at the last time instance are fed into a fully connected layer (FCL) to obtain more abstract features. At the same time, the handcrafted features in Table I, extracted from the raw smartphone sensor data, are fed into another FCL to obtain more abstract features. After that, we combine the two types of features using a concatenation layer. Finally, the combined features are fed into a softmax layer for activity classification.

More specifically, given the smartphone sensor input o_t, which is a window of sensory data, the automatically learned features and the handcrafted features can be expressed as v_t = Φ(o_t) and h_t = Ψ(o_t), respectively, where Φ(·) denotes the LSTM-based feature learning and Ψ(·) the handcrafted feature extraction based on domain knowledge. Note that the LSTM is able to encode temporal dependencies within the sample (window) during feature learning. These two types of features can be viewed as processings of the raw sensory data from two distinct perspectives, both of which have been shown to be effective for HAR. The complete feature set is the concatenation of the two types of features, which can be expressed as l_t = v_t ∪ h_t. This concatenation makes full use of both types of features and may also lead to a more comprehensive understanding of the raw sensory data.
Hence, better performance can be expected. The final outputs of the proposed feature fusion framework are the probabilities of all activities, obtained by applying the softmax layer to these features, which can be expressed as softmax(l_t).

Training the proposed feature fusion framework amounts to optimizing the network parameters with a backpropagation algorithm on the training data. Specifically, given the training data and targets, the network outputs on the training data are calculated; the errors between the network outputs and the given targets are obtained, and the gradient of these errors is used to update the network parameters with gradient-based optimization methods. In this article, we use the RMSprop optimizer, which uses the magnitude of recent gradients to normalize the current gradient [31]. To prevent overfitting, dropout layers and a batch normalization (BN) layer are employed, as shown in Fig. 2. The dropout rates of the two dropout layers are set to 0.5.

After the network has been trained, the outputs of the proposed feature fusion framework are the probabilities of all activities given the current sensor measurements o_t, which can be expressed as p(z_t|o_t). This is also known as the a posteriori probability. In general, the current activity would be determined by the maximal a posteriori probability, known as maximum a posteriori (MAP) estimation. However, the current human activity is also related to the past activity sequence and previous sensor observations, which the MAP estimate does not consider. In other words, the LSTM network in the proposed feature fusion framework can only encode temporal dependencies within a sample; it cannot model the temporal dynamics among samples (the activity sequence).
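The fusion and classification stage described above reduces, at inference time, to a concatenation followed by a softmax. A minimal sketch, with made-up dimensions and random weights standing in for the trained FCL and softmax-layer parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())       # subtract max for numerical stability
    return e / e.sum()

n_classes = 6                     # six activities, as in both data sets
v_t = rng.normal(size=100)        # abstract learned features (LSTM branch FCL)
h_t = rng.normal(size=100)        # abstract handcrafted features (other FCL)
l_t = np.concatenate([v_t, h_t])  # fused feature vector l_t = v_t ∪ h_t

W_out = rng.normal(size=(n_classes, l_t.size)) * 0.01   # softmax layer weights
b_out = np.zeros(n_classes)
posterior = softmax(W_out @ l_t + b_out)                # p(z_t | o_t)

print(round(posterior.sum(), 6))   # 1.0
print(int(np.argmax(posterior)))   # plain MAP activity estimate
```

The last line is exactly the MAP decision criticized in the text: it uses only the current window, which is what the MFAP recursion of Section IV improves upon.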
To further improve the performance of HAR, we propose an MFAP approach that combines the past information with the current a posteriori probability to give an optimal estimation of human activities.

IV. MAXIMUM FULL A POSTERIORI ESTIMATION

In real life, humans normally carry on one activity for a while and then transfer to another. This important property should be considered when designing HAR systems; however, to the best of our knowledge, no previous works have exploited it. Conventional data-driven approaches estimate human activities based only on the current sensor observations. In this article, to take the dynamics of human behavior into consideration, we propose an MFAP algorithm that considers both the past information and the current a posteriori information obtained from the proposed feature fusion framework. The MFAP can be formulated as

ẑ_t = arg max_{z_t} p(z_t|o_{1:t})   (7)

where z_t is the human activity at time instance t and o_{1:t} are the observations from time instance 1 to t. Here, we make
Therefore, we can assume that human activities follow a first-order Markov property, and the first assumption is considered valid. The observation relies on real human activity. Once the current activity is known, the current observation is independent of the previous observations. Hence, the second assumption, which states that the current observation of activity is conditional independence of the previous observations is also considered valid. According to Bayes rules, the full a posteriori probability for HAR, p(zt |o1:t ), can be expressed as p(zt |o1:t ) = p(o1:t |zt )p(zt )p(o1:t ) = p(ot , o1:t−1|zt )p(zt ) p(ot , o1:t−1) = p(ot |o1:t−1, zt )p(o1:t−1|zt )p(zt ) p(ot |o1:t−1)p(o1:t−1) = p(ot |o1:t−1, zt )p(zt |o1:t−1)p(o1:t−1)p(zt ) p(ot |o1:t−1)p(o1:t−1)p(zt ) = p(ot |zt )p(zt |o1:t−1) p(ot |o1:t−1) = p(zt |ot )p(ot)p(zt |o1:t−1) p(zt )p(ot |o1:t−1) . (8) Given observations o1:t from time step 1 to t , the probability of (p(ot)/p(ot |o1:t−1)) is deterministic, which can be treated as a normalization factor. Therefore, the full a posteriori probability can be further expressed as p(zt |o1:t) ∝ p(zt |ot)p(zt |o1:t−1)p(zt ) (9) In (9), p(zt |ot ) is the a posteriori probability of the human activity. Compared with p(zt |ot ), full observation information is involved in p(zt |o1:t ). Hence, we call the estimation in (7) MFAP estimation. We can find from (9) that the full a posteriori probability, i.e., p(zt |o1:t), is determined by the following three components. 1) p(zt |o1:t−1) = ∑ i p(zt |zt−1 = li )p(zt−1 = li |o1:t−1) (10) where li is the i th activity and p(zt |zt−1) is the transition probability for the first-order Markov chain model. 2) p(zt |ot ): the current a posteriori which can be obtained from the proposed feature fusion framework. 3) p(zt ): the prior distribution for different activities. To get p(zt |o1:t−1) from (10), we need to obtain the transition probability p(zt |zt−1) for the first-order Markov chain model. 
Here, we model the human activity sequence as a Markov chain, which describes the transitions from one activity to another. Given the n activities {l_1, l_2, ..., l_n}, the entry in the i-th row and j-th column of the transition probability matrix, A ∈ R^{n×n}, can be expressed as

a_ij = p(z_t = l_i|z_{t-1} = l_j),  i, j = 1, 2, ..., n.   (11)

We calculate the transition probability matrix from the training data. Given an m-step human activity sequence, the transition probability from state l_j to state l_i, denoted a_ij, can be calculated as

a_ij = Σ_{t=2}^{m} δ(z_t − l_i) δ(z_{t-1} − l_j) / Σ_{t=2}^{m} δ(z_{t-1} − l_j)   (12)

where δ(α) = 1 if α = 0, and δ(α) = 0 otherwise.

Next, the probability p(z_t|o_t) is obtained from the proposed feature fusion framework. Since the last layer of the framework is a softmax layer, it produces a probability for each activity given the current smartphone sensor measurements. Specifically, the current a posteriori probability can be expressed as

p(z_t|o_t) = softmax(l_t).   (13)

Finally, the prior p(z_t) can easily be counted from the training data as

p(z_t = l_i) = (1/m) Σ_{t=1}^{m} δ(z_t − l_i).   (14)

The implementation of the proposed MFAP for HAR is shown in Algorithm 1.

Algorithm 1 Proposed MFAP for HAR
Input: A = {a_ij}, b_t = {b_t^i} = {p(z_t = l_i|o_t)}, c = {c_i} = {p(z_t = l_i)}, i, j = 1, 2, ..., n, t = 1, 2, ..., T.
Output: full a posteriori r_t = p(z_t|o_{1:t}); predicted activities O.
Initialization: t = 1
1: r_1 = {r_1^i} = b_1
2: O_1 = arg max_{l_i} r_1
Recursion:
3: for t = 2 to T do
4:   for i = 1 to n do
5:     r_t^i = b_t^i (Σ_j a_ij r_{t-1}^j) / c_i, based on (9)
6:   end for
7:   O_t = arg max_{l_i} r_t
8: end for
9: return O

V. EXPERIMENTS

A.
Data Description

To evaluate the performance of the proposed approaches for HAR using smartphone sensors, we first use a public data set from UCI [12]. A Samsung Galaxy S II smartphone, attached to the waist of the subjects with a fixed orientation, was used for data collection. Both 3-D acceleration and gyroscope data were collected. The data set contains six activities: walking, walking upstairs, walking downstairs, standing, sitting, and laying. The sampling frequency is 50 Hz. A sliding window of 2.56 s (one sample) with a 50% overlap is used for data segmentation. In total, 10 299 samples were collected from 30 participants.

We also collected our own data set using a recently released Huawei P20 Pro smartphone. For this data set, instead of attaching the smartphone to a fixed position, which may not be realistic, we freely put the smartphone in three common positions, i.e., pants pocket, shirt pocket, and backpack, without any restrictions during data collection. Here, we consider a different set of activities: walking, fast walking, running, walking upstairs, walking downstairs, and static. Similarly, we collected both 3-D acceleration and gyroscope data at a sampling rate of 50 Hz, and again used a sliding window of 2.56 s with a 50% overlap for data segmentation. In total, 4752 samples were collected from 12 volunteers.

The two data sets differ in several aspects: 1) the smartphones used are different; 2) the placements of the smartphones are different; and 3) due to the different placements, the explored activities are different. Since the smartphone is attached to the waist with a fixed orientation in the public data set, it is possible to detect the activities of "Standing" and "Sitting" based on the slight variations in smartphone orientation.
Meanwhile, the orientation for "Laying" is totally different from that of the other two static activities, "Standing" and "Sitting"; thus, various algorithms achieve very high recognition accuracy for "Laying," as shown in Table II. These three activities, i.e., "Standing," "Sitting," and "Laying," can therefore be distinguished in the public data set based on orientation information. However, in our own data set, the smartphone is freely placed in three common positions without any restriction on its orientation, so we are not able to distinguish these three activities based on orientation information. For this reason, we explore other common activities, such as "Fast walking," "Running," and "Static," in our own data set. For both the public data set and our own data set, we randomly select 70% of the data to train the different algorithms and use the remainder for testing.

B. Experimental Setup

To verify the performance of the proposed approaches, we compare them with several advanced learning algorithms for HAR, including shallow learning algorithms with handcrafted features, such as an artificial neural network (ANN), SVM [33], extreme learning machine (ELM) [34], and RF, as well as the deep learning algorithm of deep LSTM [35]. The parameters of all the benchmark approaches and the proposed approach are carefully tuned using a validation set. For ANN and ELM, the number of hidden nodes is determined by grid search on the validation set. The popular radial basis function (RBF) kernel is chosen for SVM, with its parameters also determined by grid search. For RF, the number of decision trees is set to 500 for ensemble learning. The deep LSTM consists of two LSTM layers with sizes of 32 and 64, an FCL with a size of 100, and a softmax layer for classification. For the proposed fusion framework, two LSTM layers with sizes of 32 and 64 are used, and the FCLs in Fig. 2 both have 100 hidden nodes.

C.
Experimental Results

1) Results on the Public Data Set: The experimental results on the public data set are shown in Table II. With expert knowledge, the conventional machine learning approaches of ANN, ELM, and SVM with handcrafted features slightly outperform the deep LSTM with automatic feature learning on the public data set. This indicates that the handcrafted features are more representative of these activities. The proposed feature fusion framework, which combines handcrafted features with features automatically learned by the deep algorithm, outperforms both the shallow and deep benchmarks. This indicates that the two types of features contain unique information for HAR and complement each other, leading to better performance. By taking the dynamics of human behavior into consideration, the proposed MFAP achieves the best performance, with an overall accuracy as high as 98.85%.

We now zoom into the recognition of specific activities. Among all the activities, "Laying" has the highest recognition accuracy, due to its distinct smartphone orientation compared with the other five activities. The activities of "Sitting" and "Standing" produce very similar patterns in the smartphone sensor readings, so their recognition accuracies are relatively low. Similarly, the recognition performance for "Walking Upstairs" and "Walking Downstairs" is also limited because of their similar sensory patterns. Owing to the proposed feature fusion framework and the consideration of the dynamics of human behavior, the proposed MFAP has the highest recognition accuracy for all six activities. The activity recognition results of the proposed feature fusion framework and the proposed MFAP on the test data are shown in Fig. 3.
It can be observed that the activities of “Standing” and “Sitting” are difficult to separate, due to their similar sensory patterns. The activities of “Walking,” “Walking Upstairs,” and “Walking Downstairs” suffer from the same issue. By taking the dynamics of human behavior into consideration, the proposed MFAP algorithm dramatically improves the results, which clearly indicates its effectiveness for HAR. Fig. 4 shows the confusion matrices of the proposed feature fusion framework and the proposed MFAP on the public data set. The general conclusion is the same: by considering human dynamics, the proposed MFAP improves the recognition accuracies for all six activities.

Fig. 3. Recognition results of the proposed feature fusion framework and the proposed MFAP on the public data set.
Fig. 4. Confusion matrices of the proposed feature fusion framework and the proposed MFAP on the public data set. (a) The proposed feature fusion framework. (b) The proposed MFAP.
Fig. 5. Recognition results of the proposed feature fusion framework and the proposed MFAP on our own data set.

2) Results on Our Own Data Set: The experimental results on our own data set are shown in Table III. In general, all the approaches perform better on our own data set than on the public data set. One possible reason for the distinct results is that the explored activities differ between the two data sets. Based on Table II, we can find that the activities of “Standing” and “Sitting” are difficult to separate, due to their similar sensory patterns (no movement and similar smartphone orientation), while the activities in Table III are relatively easier to separate.
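The effect of incorporating human dynamics can be illustrated with a small sketch. The code below is only an approximation of the MFAP idea, not the authors' algorithm: it combines the per-window class posteriors from a classifier with a hypothetical “sticky” transition prior (the `stay_prob` value is an assumption; in practice the transition prior would be estimated from training data) and decodes the most likely activity sequence, which suppresses isolated single-window misclassifications of the kind visible as spikes in Fig. 3.

```python
import numpy as np

def smooth_with_dynamics(posteriors, stay_prob=0.9):
    """Viterbi-style decoding over per-window class posteriors.

    posteriors: (T, K) array of class probabilities from the classifier.
    stay_prob:  prior probability that the activity persists between
                consecutive windows (illustrative value).
    Returns the most likely activity index sequence under the sticky prior.
    """
    T, K = posteriors.shape
    trans = np.full((K, K), (1.0 - stay_prob) / (K - 1))
    np.fill_diagonal(trans, stay_prob)
    log_post = np.log(posteriors + 1e-12)
    log_trans = np.log(trans)

    score = log_post[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans      # (prev class, next class)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_post[t]

    path = np.empty(T, dtype=int)
    path[-1] = score.argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

if __name__ == "__main__":
    # A run of activity 0 with one spurious spike toward activity 1.
    p = np.array([[0.9, 0.1]] * 3 + [[0.2, 0.8]] + [[0.9, 0.1]] * 3)
    print(smooth_with_dynamics(p))  # prints an all-zero path: spike corrected
```

The single window favoring class 1 is overruled because two transitions under the sticky prior cost more log-probability than the momentary evidence gains, which is precisely the behavior that removes spike errors.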
Moreover, the different devices and the different ways in which the data were collected for the two data sets may also contribute. In contrast to the results on the public data set, the deep LSTM with automatically learned features outperforms the conventional machine learning approaches with handcrafted features. This means that the features automatically learned by the deep algorithm are more representative for HAR on this data set. Similarly, the proposed fusion framework, which combines the handcrafted features and the features automatically learned by the deep algorithm, outperforms the deep LSTM and the conventional machine learning approaches with handcrafted features, i.e., ANN, ELM, SVM, and RF. We can conclude that the handcrafted features and the features learned by the deep algorithm have unique merits, resulting in distinct performances on different data sets. With the proposed feature fusion framework, we can make good use of the merits of these two types of features to boost the performance of HAR using smartphone sensors. In addition, the proposed MFAP is able to take the dynamics of human behavior into consideration, further improving the performance of the proposed feature fusion algorithm; the overall accuracy is as high as 99.58% on our own data set. For our own data set, we consider some different activities due to the different placement of smartphones in the two data sets.

TABLE II. RECOGNITION ACCURACIES OF ALL THE APPROACHES ON THE PUBLIC DATA SET
TABLE III. RECOGNITION ACCURACIES OF ALL THE APPROACHES ON OUR OWN DATA SET
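The per-class accuracies reported in Tables II and III can be derived from a confusion matrix as the row-normalized diagonal. A minimal sketch (the labels and counts below are illustrative only, not data from the paper):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true activities, columns are predicted activities."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_accuracy(cm):
    """Diagonal of the row-normalized confusion matrix."""
    return cm.diagonal() / cm.sum(axis=1)

if __name__ == "__main__":
    y_true = [0, 0, 0, 1, 1, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 2, 0]
    cm = confusion_matrix(y_true, y_pred, 3)
    print(cm)
    print(per_class_accuracy(cm))  # per-class: 2/3, 1.0, 2/3
```

The overall accuracy in the tables is the trace of the matrix divided by the total sample count, which is why a classifier can score high overall while individual confusable pairs such as “Sitting”/“Standing” remain low.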
It can be found that the activities of “Fast Walking,” “Running,” and “Static” contain distinct movement patterns and can thus be easily identified with high recognition accuracies. However, the activities of “Walking,” “Walking Upstairs,” and “Walking Downstairs” have very similar movement patterns and thus confuse most of the algorithms. Owing to the proposed feature fusion framework and the consideration of the dynamics of human behavior, the final recognition accuracies of the proposed MFAP are higher than 99% for all the activities. Fig. 5 shows the recognition results of the proposed feature fusion framework and the proposed MFAP on the test portion of our own data set. Even though the proposed feature fusion framework has already achieved a very high recognition accuracy, i.e., 98.67%, it still produces some wrong estimations, visible as many spikes (see the green line in Fig. 5), which are harmful to real applications, such as home automation. With the proposed MFAP, which takes the dynamics of human behavior into consideration, most of the wrong estimations can be corrected. We also show the confusion matrices of the proposed feature fusion framework and the proposed MFAP on our own data set in Fig. 6. It can be found that the proposed MFAP corrects most of the wrong predictions of the proposed feature fusion framework, owing to the consideration of human dynamics.

3) Comparison With the State of the Art: We have also compared our approach with some state-of-the-art approaches in the literature, including HNN [15], FW KNN [14], FW NB [14], HF-SVM [33], two-stage CHMM [16], SRC-SVD [17], FMM-CART [18], SAEs-c [19], Convnet [20], HCF Convnet [21], tFFT Convnet [20], and Knowledge Distilling [13], using the public data set. Detailed reviews of all these approaches can be found in Section II.

Fig. 6. Confusion matrices of the proposed feature fusion framework and the proposed MFAP on our own data set. (a) Proposed feature fusion framework. (b) Proposed MFAP.
Table IV presents the experimental results of these state-of-the-art approaches and the proposed approach. It can be found that our proposed approach achieves superior performance over these state-of-the-art methods.

TABLE IV. COMPARISON WITH THE STATE OF THE ART

VI. CONCLUSION

In this article, we first propose a feature fusion framework, which combines handcrafted features based on domain knowledge with features automatically learned by a deep algorithm, for human activity recognition (HAR). By taking the dynamics of human behavior into consideration, we then formulate an MFAP with the past information and the current a posteriori information obtained from the proposed feature fusion framework to give an optimal estimation of human activities. We employ a public data set and a self-collected data set to evaluate the performance of the proposed approaches. Extensive experiments show that the proposed feature fusion framework outperforms five benchmark approaches, and the proposed MFAP can further improve the performance of HAR. We also compared our approach with some state-of-the-art methodologies on the public data set; the proposed MFAP achieves the best performance, indicating that our method is practical for real-world applications. In future work, we intend to focus on the recognition of more complex activities [36]. Moreover, considering that variation in smartphone orientation may degrade recognition performance, how to enhance the performance of smartphone-based HAR with varying device orientations is another direction of our future work.

REFERENCES

[1] Y. Zhang, G. Tian, S. Zhang, and C.
Li, “A knowledge-based approach for multiagent collaboration in smart home: From activity recognition to guidance service,” IEEE Trans. Instrum. Meas., to be published.
[2] O. D. Lara and M. A. Labrador, “A survey on human activity recognition using wearable sensors,” IEEE Commun. Surveys Tuts., vol. 15, no. 3, pp. 1192–1209, 3rd Quart., 2013.
[3] B. Ni, G. Wang, and P. Moulin, “RGBD-HuDaAct: A color-depth video database for human daily activity recognition,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops (ICCV Workshops), Nov. 2011, pp. 1147–1153.
[4] G. Panahandeh, N. Mohammadiha, A. Leijon, and P. Händel, “Continuous hidden Markov model for pedestrian activity classification and gait analysis,” IEEE Trans. Instrum. Meas., vol. 62, no. 5, pp. 1073–1083, May 2013.
[5] S. C. Mukhopadhyay, “Wearable sensors for human activity monitoring: A review,” IEEE Sensors J., vol. 15, no. 3, pp. 1321–1330, Mar. 2015.
[6] Z. Chen, Q. Zhu, S. Y. Chai, and L. Zhang, “Robust human activity recognition using smartphone sensors via CT-PCA and online SVM,” IEEE Trans. Ind. Informat., vol. 13, no. 6, pp. 3070–3080, Dec. 2017.
[7] Q. Zhu, Z. Chen, and Y. C. Soh, “A novel semisupervised deep learning method for human activity recognition,” IEEE Trans. Ind. Informat., vol. 15, no. 7, pp. 3821–3830, Jul. 2019.
[8] Z. Chen, C. Jiang, and L. Xie, “A novel ensemble ELM for human activity recognition using smartphone sensors,” IEEE Trans. Ind. Informat., vol. 15, no. 5, pp. 2691–2699, May 2019.
[9] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in Proc. IJCAI, vol. 15, 2015, pp. 3995–4001.
[10] M. A. Alsheikh, A. Selim, D. Niyato, L. Doyle, S. Lin, and H.-P. Tan, “Deep activity recognition models with triaxial accelerometers,” in Proc. AAAI Workshop, Artif. Intell. Appl. Assistive Technol. Smart Environ., 2016, pp. 8–13.
[11] N. Y. Hammerla, S. Halloran, and T.
Ploetz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” 2016, arXiv:1604.08880. [Online]. Available: https://arxiv.org/abs/1604.08880
[12] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “A public domain dataset for human activity recognition using smartphones,” in Proc. ESANN, 2013, pp. 437–442.
[13] Z. Chen, L. Zhang, Z. Cao, and J. Guo, “Distilling the knowledge from handcrafted features for human activity recognition,” IEEE Trans. Ind. Informat., vol. 14, no. 10, pp. 4334–4342, Oct. 2018.
[14] A. Wang, G. Chen, J. Yang, S. Zhao, and C.-Y. Chang, “A comparative study on human activity recognition using inertial sensors in a smartphone,” IEEE Sensors J., vol. 16, no. 11, pp. 4566–4578, Jun. 2016.
[15] M. Eastwood and C. Jayne, “Evaluation of hyperbox neural network learning for classification,” Neurocomputing, vol. 133, pp. 249–257, Jun. 2014.
[16] C. A. Ronao and S.-B. Cho, “Human activity recognition using smartphone sensors with two-stage continuous hidden Markov models,” in Proc. 10th Int. Conf. Natural Comput. (ICNC), Aug. 2014, pp. 681–686.
[17] R. Rana, B. Kusy, J. Wall, and W. Hu, “Novel activity classification and occupancy estimation methods for intelligent HVAC (heating, ventilation and air conditioning) systems,” Energy, vol. 93, pp. 245–255, Dec. 2015.
[18] M. Seera, C. K. Loo, and C. P. Lim, “A hybrid FMM-CART model for human activity recognition,” in Proc. IEEE Int. Conf. Syst., Man, Cybern. (SMC), Oct. 2014, pp. 182–187.
[19] Y. Li, D. Shi, B. Ding, and D. Liu, “Unsupervised feature learning for human activity recognition using smartphone sensors,” in Mining Intelligence and Knowledge Exploration. Cham, Switzerland: Springer, 2014, pp. 99–107.
[20] C. A. Ronao and S.-B. Cho, “Human activity recognition with smartphone sensors using deep learning neural networks,” Expert Syst. Appl., vol. 59, pp. 235–244, Oct. 2016.
[21] C. A. Ronao and S.-B.
Cho, “Deep convolutional neural networks for human activity recognition with smartphone sensors,” in Proc. Int. Conf. Neural Inf. Process. Cham, Switzerland: Springer, 2015, pp. 46–53.
[22] D. Tao, Y. Wen, and R. Hong, “Multicolumn bidirectional long short-term memory for mobile devices-based human activity recognition,” IEEE Internet Things J., vol. 3, no. 6, pp. 1124–1134, Dec. 2016.
[23] J. Wang, Y. Chen, S. Hao, X. Peng, and L. Hu, “Deep learning for sensor-based activity recognition: A survey,” Pattern Recognit. Lett., vol. 119, pp. 3–11, Mar. 2019.
[24] H. Qian, S. J. Pan, and C. Miao, “Sensor-based activity recognition via learning from distributions,” in Proc. AAAI, 2018, pp. 6262–6269.
[25] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: A simple deep learning baseline for image classification?” IEEE Trans. Image Process., vol. 24, no. 12, pp. 5017–5032, Dec. 2015.
[26] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing,” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
[27] Y. Liu, L. Nie, L. Liu, and D. S. Rosenblum, “From action to activity: Sensor-based activity recognition,” Neurocomputing, vol. 181, pp. 108–115, Mar. 2016.
[28] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 157–166, Mar. 1994.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[30] G. E. Hinton, “Learning multiple layers of representation,” Trends Cognit. Sci., vol. 11, no. 10, pp. 428–434, Oct. 2007.
[31] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA, Neural Netw. Mach. Learn., vol. 4, no. 2, pp. 26–31, 2012.
[32] T. V. Duong, H. H. Bui, D. Q. Phung, and S.
Venkatesh, “Activity recognition and abnormality detection with the switching hidden semi-Markov model,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (CVPR), vol. 1, Jun. 2005, pp. 838–845.
[33] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz, “Human activity recognition on smartphones using a multiclass hardware-friendly support vector machine,” in Proc. Int. Workshop Ambient Assist. Living. Berlin, Germany: Springer, 2012, pp. 216–223.
[34] Y. Chen, Z. Zhao, S. Wang, and Z. Chen, “Extreme learning machine-based device displacement free activity recognition model,” Soft Comput., vol. 16, no. 9, pp. 1617–1625, 2012.
[35] W. Zhu et al., “Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks,” in Proc. AAAI, vol. 2, Mar. 2016, p. 8.
[36] L. Liu, L. Cheng, Y. Liu, Y. Jia, and D. S. Rosenblum, “Recognizing complex activities by a probabilistic interval-based model,” in Proc. AAAI, vol. 30, 2016, pp. 1266–1272.

Zhenghua Chen received the B.Eng. degree in mechatronics engineering from the University of Electronic Science and Technology of China, Chengdu, China, in 2011, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2017. He is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His research interests include data analytics in smart buildings, ubiquitous computing, the Internet of Things, machine learning, and deep learning.

Chaoyang Jiang received the B.E. degree in electrical engineering and automation from the China University of Mining and Technology, Xuzhou, China, in 2009, the M.E.
degree in control science and engineering from the Harbin Institute of Technology, Harbin, China, in 2011, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2017. He is currently an Associate Professor with the School of Mechanical Engineering, Beijing Institute of Technology. His current research interests include statistical signal processing, sparse sensing, machine learning, and information fusion.

Shili Xiang received the B.S. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2003, and the Ph.D. degree in computer science from the National University of Singapore, Singapore, in 2011. She is currently a Scientist and a Principal Investigator with the Data Analytics Department, Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. Her current research interests include smart mobility, ubiquitous computing, data mining, and machine learning.

Jie Ding received the B.Eng. degree in automation from Harbin Engineering University, Harbin, China, in 2012, and the Ph.D. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2018. She is currently a Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. Her current research interests include machine learning, pattern recognition, control and optimization, and complex networks.

Min Wu received the B.S. degree in computer science from the University of Science and Technology of China, Hefei, China, in 2006, and the Ph.D. degree in computer science from Nanyang Technological University, Singapore, in 2011. He is currently a Senior Scientist with the Data Analytics Department, Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. His current research interests include machine learning, data mining, and bioinformatics. Dr.
Wu was a recipient of the Best Paper Award at InCoB 2016 and DASFAA 2015. He also won the IJCAI Competition on repeated buyers' prediction in 2015.

Xiaoli Li is currently a Principal Scientist with the Institute for Infocomm Research, Agency for Science, Technology and Research, Singapore. He also holds an adjunct professor position at Nanyang Technological University. He has authored or coauthored more than 180 high-quality articles. His current research interests include data mining, machine learning, AI, and bioinformatics. Dr. Li was a recipient of numerous best paper/benchmark competition awards. He has been serving as a (senior) PC member/workshop chair/session chair in leading data mining and AI related conferences, including KDD, ICDM, SDM, PKDD/ECML, WWW, IJCAI, AAAI, ACL, and CIKM.