Background & Summary

Sensor-based Human Activity Recognition (HAR) is a rapidly growing research field that focuses on identifying and interpreting human movements and behaviors from inertial measurements, such as acceleration and angular velocity. This technology has numerous applications in healthcare, security, and sports analysis1. One of the key challenges in HAR is the development of effective algorithms that can accurately and reliably recognize human movements and behaviors in real-world scenarios. To address this challenge, researchers have developed several datasets that capture various types of human activities in different contexts.

Some datasets in the field of sensor-based HAR with Inertial Measurement Units (IMU) are listed in Table 1. Many existing datasets consist of activities of daily living (ADL), such as walking, running, and standing. Collecting data for these activities is relatively easy as they can be performed by many individuals and do not require specialized equipment or settings. As a result, many HAR datasets are focused on these specific activities. It is worth noting that there is also a growing amount of research and datasets focused on more complex activities, such as nursing. Specialized data on nursing activities can provide valuable insights into the nurses’ tasks and help identify areas needing improvement. Analyzing nursing activities can reveal patterns indicating a need for additional training or support. This information can also help nurses tailor their care to specific patient conditions. Moreover, nursing documentation is one factor contributing to an increase in perceived nursing workload2. Nurses are estimated to spend between 26.2% to 41% of their time on documentation, which can significantly burden their workload3. Classification of the performed activities can facilitate documentation by generating accurate and complete records, thus, reducing errors and omissions. This can then save time and allow nurses to focus on patient care. Overall, using specialized data on nursing activities can enhance the efficiency and effectiveness of nursing care.

Table 1 Overview of selected datasets for sensor-based HAR, including sensor type and number, recording time, number of subjects, type of activities, and number of unique activities.

Inoue et al. conducted a comprehensive study of nursing activities where they collected data under controlled laboratory conditions and real-world settings. It should be noted that this study stands out as a singular and robust dataset in the field of nursing research. When evaluating Inoue et al.‘s data collection methods, several factors should be taken into account:

  1. 1.

    The data was primarily collected in a hospital environment, potentially limiting its representativeness for other healthcare settings4. Furthermore, the unavailability of the data to the public could restrict accessibility and hinder further research or analysis.

  2. 2.

    The data was collected under controlled laboratory conditions, which might not accurately reflect the actual working conditions of nurses and may fail to capture the full spectrum of factors influencing their performance5.

  3. 3.

    The data derived from a larger study6 was published in segments to facilitate multiple competitions aimed at determining daily nurse care activities7,8,9. This approach may limit its utility for broader research purposes and pose challenges for comprehensive data analysis.

SONAR (Sensor-Oriented Nursing Activity Recognition) differs from other datasets in several ways.

  • The dataset was collected in a real-world setting at an elderly care facility, where experienced professional nurses (spanning between the ages 24–59 and representing a mix of genders) provided care to residents who relied on assistance due to physical or mental limitations. These residents, no longer capable of living independently, resided in single rooms.

  • It encompasses a wide array of nursing activities that reflect the multifaceted responsibilities of healthcare professionals. These activities span from critical tasks such as administering medication, taking vital signs, and assisting patients with mobility to important routines such as hygiene care and documentation. This comprehensive coverage enables the development of algorithms that cater to the intricate and diverse demands of the nursing profession. A detailed description of the recorded nursing activities, corresponding actions, and time allocation is provided in Table 2.

    Table 2 Overview of nursing activities captured in the dataset: Each row corresponds to a distinct activity, with accompanying details including activity description, total recording time in minutes, number of recordings, and statistical measures of activity duration in seconds.
  • The dataset was collected using five sensors placed on each participant’s body as standalone IMUs. This allows for more comprehensive and detailed views of human movements and behaviors. This methodological choice also facilitates the exploration of diverse sensor combinations and optimal placements, thus, yielding outcomes that are both specialized and incite new research questions. Furthermore, integrating multiple sensors allows for analyzing subtle interplays and trade-offs between singular and multiple sensors.

  • The data was initially labeled by an external human observer who walked alongside the nurses. To ensure accurate subsequent labeling and protect privacy, we used synchronized pose estimations of the nurse’s body. The pose estimation data was obtained from parallel video conversion.

Overall, SONAR’s combination of real-world, comprehensive sensor data, and detailed labeling make it a valuable resource for researchers in the field of HAR.

Methods

In this section, we begin by addressing the ethical considerations that guided our research process. Following that, we provide a comprehensive account of our study’s methodology. This includes detailing the technical specifications of the sensors we employed and describing their placement on study participants. We will also discuss the methods used for data collection and annotation. Additionally, we will explore the study’s design and protocol, followed by an overview of the properties of the recorded data.

Ethics approval

The study was carried out with the utmost regard for the well-being and rights of the participants. Approval for the study was obtained from the University of Potsdam Ethics Committee, under the reference number 51/2021. All participants willingly contributed to the study after providing informed consent, including consent to publish the data. The participants were thanked for their time and effort, and their contributions were greatly appreciated. The data collected from the sensors were treated with the utmost confidentiality, and appropriate measures were taken to ensure the participants’ privacy was protected throughout the study.

Equipment

We employed Xsens DOT v2 Bluetooth wearable sensors with accompanying straps throughout this study. These sensors are compact and lightweight: dimensions of 36 × 30 × 11 mm and a weight of 10.8 grams. The design enables the attachment of multiple sensors, thereby, enhancing the capability to capture subtle movement patterns. Upon connecting to the mobile phone, the sensors start to measure and record data. The sensors have a recording frequency and a real-time streaming frequency of up to 120 Hz and 60 Hz, respectively. Xsens DOT features up to 6 hours of continuous data measurement. They use a right-handed Cartesian coordinate system, as illustrated in Fig. 2, with x, y, and z axes. The sensors output data in various formats, including Euler angles, quaternions, delta quantities (acceleration and angular velocity), rate quantities (acceleration and angular velocity), high fidelity data, and custom data combinations. The data format can be selected in the Xsens DOT app or in the SDK. For our study, the sensors were connected to a smartphone via Bluetooth 5.0 devices. Five sensors were paired with a Google Pixel 6 phone for data collection. Output data from the sensors included orientation (in quaternions or Euler angles), free acceleration, angular velocity, magnetic field, timestamp, and status.

Labeling

Our dataset, shown in Table 2, consists of 23 different nursing activities that involve frequent changes in labeling. To address the lack of simultaneous real-time recording and labeling in the Xsens DOT App, we developed our own application10, as depicted in Fig. 1. The application not only allows synchronized sensor recordings, but also facilitates real-time pose estimation using a parallel video stream of the nursing activities via the mobile phone’s camera. The outcome is a skeleton model with 17 keypoints of the human body, as shown in Fig. 1c. By converting the video stream into pose estimations, we ensure an accurate relabeling process and secure anonymization where privacy is pivotal, such as in nursing facilities where patients are being washed, dressed, or fed.

Fig. 1
figure 1

Different functionality screens of the application for data recording.

Connection screen for connecting and synchronizing the Xsens DOT sensors. Recording interface for saving inertial and pose estimation data along labeling activities in each timestamp. Pose estimation view of recorded data. (Re-)labeling screen.

Study design

Participants were recruited from a retirement and assisted living facility in Potsdam, Germany. The facility’s primary goal is to provide care and support for elderly people, more specifically, focus on helping them maintain or regain their abilities and skills required for daily life. This is achieved through individualized support which considers each person’s unique needs and circumstances. Prior to data collection, participants received an explanation of the project via mail. Eligibility criteria required individuals to work as nurses. Upon obtaining written informed consent, the nurses were equipped with sensors for data collection. Data was collected using five Xsens DOT sensors with a 60 Hz output rate on the following body locations: left wrist (LW), right wrist (RW), pelvis (ST), left ankle (LF), and right ankle (RF). The sensors LW, RW, LF, and RF were attached to the body by using straps of different lengths with a velcro fastener. The ST sensor was attached on the waistband using a rubber clip. Figure 2 illustrates the placement of the sensors on the nurse. The placement of IMUs was selected based on prior studies demonstrating the effectiveness of these locations in providing robust and accurate information for HAR and pose estimation tasks11. The sensors were positioned face down when the character was in an A-pose. In other words, the negative x-axis of the character’s pose was aligned with the direction of gravity. The A-pose is a standard reference pose used in animation and computer graphics. In this pose, the character stands upright with its arms extended out to the sides and its palms facing forward, creating a shape that resembles the letter A:

  • Outer wrist: The positive z-axis is positioned towards the inner wrist, while the negative x-axis is positioned towards the hand.

  • Pelvis - central on the back waistband: The positive z-axis is positioned towards the body, while the positive x-axis is positioned towards the head.

  • Inner ankle: The positive z-axis is positioned towards the outer wrist, while the negative x-axis is positioned towards the foot.

Fig. 2
figure 2

Sensor placement on nurses.

After synchronizing the sensors and verifying that the data was being transmitted correctly and completely, nurses were asked to carry out their usual duties while accompanied by an external observer who performed data labeling. The observer recorded the type of nursing activity performed from start to end, and the data was transferred to a mobile device via Bluetooth in real-time. Throughout the study, the observer obtained information from the nurse through verbal communication to ensure that the correct labels were selected. To ensure reliability, each nurse was recorded multiple times performing the same activity on different residents. At the end of each day, the data was transferred to a hard drive and deleted from the phone. Each recording consists of 14 measurements, including, four-dimensional quaternion values, four-dimensional angular velocity calculated from the derivative of the quaternion values, three-dimensional acceleration values, and three-dimensional magnetic field values.

Dataset properties

A total of 14 nurses, comprising of nine females and five males, aged between 24 and 59 years, participated in the study and provided written informed consent to be recorded while performing their daily work duties. The recording process began with the observer pressing the start button after verifying that the data stream was functioning correctly. One or multiple activities were recorded during each session, and the process ended when the observer pressed the stop button. Throughout the study, 254 recording files with an overall of 5673 recordings were collected; this included 23 different activities that are commonly performed by nurses in the assisted living facility.

The recorded activities were typical nursing duties, such as changing a patient’s clothes or washing their hair and involved similar procedures or movement patterns. These activities were grouped together according to nursing rules and classification criteria aimed to capture the essential elements of each activity for documentary purposes. This approach ensured that the data collected was representative of the different types of activities performed by the nurses and provided a comprehensive overview of the movements involved in each activity.

In total, 13319475 data points per sensor stream were recorded, representing approximately 3700 minutes (~61.7 hours) of recording time. Figures 3, 4 and Table 2 display the data distribution among subjects and activities, providing an overview that can serve as a reference for further analysis and comparison with other studies. We employed a systematic approach to ensure accurate labeling and prevent overlapping activities in the recordings. Each recording was either deliberately stopped and categorized as a null activity before commencing the next activity, or alternatively, recordings were split based on the subsequently recorded pose estimation (label refinement). This methodology guarantees distinct and precise labeling for each activity.

Fig. 3
figure 3

The heatmap visually represents the distribution of data on the number of minutes spent by each subject on various activities. The y-axis displays subjects sorted by their total duration spent across all activities.

Fig. 4
figure 4

Heatmap Visualization of Average Activity Durations by Subject, Age, and Gender (in seconds): This heatmap illustrates the average duration of various activities recorded by different subjects, with each subject’s age and gender also indicated. The x-axis represents different activities, while the y-axis shows the subjects’ identities along with their corresponding ages and genders, sorted in ascending order by age. The color intensity in each cell reflects the average duration of each activity and subject, providing a visual representation of activity patterns and differences across subjects.

Data Records

The dataset is accessible via Zenodo, allowing easy collaboration and data sharing among researchers12. Additionally, there is a preprocessed version of the dataset available, organized into a single folder containing all data records, specifically designed for machine learning purposes13. It comprises data gathered from 14 voluntary participants, each providing informed consent to contribute to this study. As shown in Fig. 5, the dataset is organized in 14 folders, one for each participant. Each folder contains the recordings stored as CSV files. The dataset contains recordings from five synchronized Xsens DOT sensors, with each IMU containing 14 measurements. The recordings are stored as CSV files, with each file containing 72 columns. The size of the CSV files varies based on the length of the recording. The first 70 columns correspond to the 14 measurements from the five IMUs, including orientation, acceleration, and calibrated local magnetic field. The two remaining columns in each file contain the timestamp in microseconds and the activity label at each timestamp. The 14 measurements captured by the IMUs are as follows:

  • The output orientation is presented as quaternion values, with the real part represented by Quat_W and the imaginary parts represented by Quat_X, Quat_Y, and Quat_Z respectively. The name of the sensor is also included in the column names. The Delta_q values represent the orientation change over a specified interval, which is 16.67 ms (60 Hz) for Xsens DOT sensors. The column name is composed of three parts, separated by an underscore: dq, indicating the angle increment; W, X, Y or Z, indicating the real/imaginary part/direction; and the name of the sensor.

  • The Delta_v values represent the change in velocity or acceleration over the same interval as the Delta_q values. The column name is divided into two parts, separated by an underscore: dv, indicating the velocity increment; and a number between 1 and 3 in square brackets, denoting the axis (1 corresponds to X, 2 corresponds to Y and 3 corresponds to Z). The name of the sensor is also included in the column name.

  • The next three columns contain the calibrated local magnetic field in the X, Y, and Z directions, respectively, and are denoted as Mag_X, Mag_Y, and Mag_Z, followed by the name of the sensor

  • The sequence of column names is repeated for all five sensors. The penultimate column, SampleTimeFine, displays the timestamp in microseconds.

  • The final column contains the label for the activity performed at each timestamp.

Fig. 5
figure 5

Organization of the SONAR dataset.

Technical Validation

In this section, we focus on validating the quality and soundness of the dataset deposited at Zenodo by using three different deep learning models as tools to demonstrate the fitness of the dataset for activity recognition tasks. The models were trained and evaluated using the dataset, and their performance serves as evidence of the dataset’s reliability and suitability for further research.

Data preprocessing

Preprocessing raw data is an important initial step in the deep learning workflow as it prepares the data in a format that can be easily understood and processed by the network. In this study, the following preprocessing steps were applied to the raw data in order to eliminate any unwanted distortions and improve specific qualities.

  1. 1.

    Imputation: Missing data values can be a major issue in real-world datasets, making it difficult to effectively train the network. In this study, the missing data values were handled through imputation using linear interpolation to fill NaN values.

  2. 2.

    Standardization: The dataset consists of 14 different features recorded in different units. To avoid varying scales and distributions issues, the data was standardized prior to being input into the deep learning algorithm. Rescaling of the values involved standardizing the data to have a mean of 0 and a standard deviation of 1.

  3. 3.

    Windowing: In order to better understand the relationships between the features, the sensor data was divided into non-overlapping windows of 600 data points (equivalent to 10-second windows at a recording frequency of 60 Hz). This process provides a broader understanding of the underlying activity measured.

Deep learning architectures

We trained three different deep learning models incorporating a combination of convolutional neural network (CNN)14 and long short-term memory (LSTM)15 network components or only a CNN. These models were trained using the Adam optimizer with default settings in Tensorflow16. Hyperparameter optimization resulted in a learning rate of 1 × 10−4 and an input size of 600 × 70. The input size corresponds to a window size of 600 (equal to 10 s with 60 Hz) and 14 features from each sensor (14·5 = 70). All models include a preprocessing step for filling missing values and a batch-normalization layer to standardize the inputs in each feature row. The output of each network is a dense softmax layer for activity classification. The models were trained using the categorical cross-entropy loss function, which is defined as:

$$L=-log\left(\frac{{e}^{{s}_{p}}}{{\sum }_{j}^{C}{e}^{{s}_{j}}}\right)$$

where C denotes the set of classes, s is the vector of predictions, and sp is the prediction for the target class. The architecture of the three models is as follows:

  1. 1.

    The CNN-LSTM model is composed of six layers. The input layer is followed by two convolutional layers, two LSTM17 layers, and the output layer.

  2. 2.

    The ResNet model is composed of 11 layers. Next to the input and output layer, it has a repeated sequence of three convolution layer followed by a batch normalization layer and an activation layer18.

  3. 3.

    The DeepConvLSTM model is composed of eight layers. After the input layer, it is followed by four consecutive convolution layers and two LSTM layers before the softmax classifier15.

Dataset validation

This section describes the evaluation process, which encompasses model validation methods and performance metrics, collectively contributing to an understanding of the dataset’s reliability. The selection of evaluation metrics depends on the specific machine learning task and is crucial for quantifying the performance of the model. Figure 6 illustrates the evaluation strategy used for the benchmark results. The subsequent subsections delve into the validation techniques and performance metrics in greater detail.

Fig. 6
figure 6

Evaluation strategy.

Model validation

To state the performance of our models on unseen data, we used the following three different cross-validation techniques:

  1. 1.

    k-fold cross-validation: This method divides the entire dataset into k equal-length segments, also known as folds. One fold is designated as the test set and is used for final evaluation. In this study, we used five folds, with each fold representing a different time window. This method provides a good balance between having enough data for training and enough data for testing.

  2. 2.

    leave-recordings-out cross-validation: This method evaluates the performance of the models on individual recordings. A recording refers to the time frame between pressing the start and stop buttons during activity labeling. In this study, there were 254 recordings, and we used a 80:20 train-test ratio, with 203 recordings for training and 51 recordings for testing. This method evaluates the models’ ability to generalize to new recordings, which is important in this specific task, as recordings can contain a single activity or multiple activities performed multiple times.

  3. 3.

    leave-one-subject-out cross-validation: This method involves training the model on all subjects except for one and then evaluating the model on the held-out subject. This process is repeated until each subject has been held out for evaluation once. This method evaluates the models’ ability to generalize to new subjects and maximizes the use of available data. However, it can be time-consuming to train and evaluate the models multiple times.

Performance metrics

There are various evaluation metrics that are derived from different ratios of the values true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The following commonly used performance metrics were used for comparison:

$${F}_{1}=2\cdot \frac{Precision\cdot Recall}{Precision+Recall}=\frac{TP}{TP+\frac{1}{2}\left(FP+FN\right)}$$

As shown in Table 2, the dataset is imbalanced with respect to the different classes. To provide greater weight to classes with more examples in the dataset, we chose to use the weighted-averaged F1 score. This means that the proportion of each label in the dataset is weighted by its size. Additionally, we calculated the micro-averaged F1 score, which represents the proportion of correctly classified observations out of all observations, also known as accuracy Acc, by considering the total number of TPs, FNs and FPs observations.

Validation results

Tables 37 show the weighted-F1 scores and accuracy alongside the standard deviation in brackets (i) from using different sensor placement combinations, (ii) under different validation setups, and using (iii) three different deep learning models. The best performing model for each sensor and validation setup is highlighted in bold.

Table 3 Comparison of different cross-validation methods and different performance metrics (column) for different models on a single sensor (row).
Table 4 Comparison of different cross-validation methods and different performance metrics (column) for different models on two sensor combinations (row).
Table 5 Comparison of different cross-validation methods and different performance metrics (column) for different models on three sensor combinations (row).
Table 6 Comparison of different cross-validation methods and different performance metrics (column) for different models on four sensor combinations (row).
Table 7 Comparison of different cross-validation methods and different performance metrics (column) for different models on all five sensors combined (row).

Across all scenarios (two, three, four, and five sensor combinations), the CNN-LSTM model consistently outperforms the ResNet and DeepConvLSTM models regarding the accuracy and F1 score for all cross-validation methods. The wrist sensors LW and RW appear to be crucial in achieving better performance, while the addition of the ST (pelvis) sensor further improves the results. Including ankle sensors (LF and RF) shows additional performance gains, however, the improvement diminishes when moving from four to five sensors. In conclusion, the CNN-LSTM model is the most suitable choice for the given problem among the compared models. The combination of wrist, pelvis, and ankle sensors proves beneficial in improving the model’s performance, with wrist sensors being the most critical in providing valuable information. The performance of the deep learning models on the dataset demonstrates its reliability and suitability for activity recognition tasks. The dataset’s fitness for purpose is further evidenced by the consistent improvement in model performance when using combinations of different sensors, which suggests that the dataset effectively captures the relevant information needed for activity recognition. The preprocessed dataset with an adapted folder structure for machine learning applications can be easily accessed through Zenodo, enabling effortless collaboration and sharing of data amongst researchers13.

Usage Notes

The presence of multiple activities in a single recording and the proper ordering of the recording files may prompt researchers to examine the sequential behavior of nurses. Hidden Markov models (HMMs), which include transition probabilities, are a structured probabilistic model that can be used to analyze sequential behavior by forming a probability distribution of sequences.