BK-SAD: A Large Scale Dataset for Student Activity Recognition
Abstract
Skeleton-based human action recognition has emerged as a prominent research topic in artificial intelligence due to its applicability across domains including healthcare, security and surveillance, entertainment, and intelligent environments. In this paper, we propose a novel data collection methodology and present the BK-Student Activity Dataset (BK-SAD), a new 2D dataset for student activity recognition in smart classrooms that outperforms existing datasets such as NTU RGB+D 120 and the SBU Kinect Interaction dataset. Our dataset covers three classes: hand raising, dozing off, and normal activities. It was collected with cameras placed in real classroom environments and consists of video data captured from multiple viewpoints: over 2,700 videos of students raising their hands, over 1,700 videos of students dozing off during class, and over 8,500 videos of normal activities. To evaluate the effectiveness of the proposed dataset, we report baseline results for neural network architectures trained and tested for student activity recognition on BK-SAD; these ConvNet architectures achieve strong recognition performance on the proposed dataset. The proposed data collection methodology and the BK-SAD dataset will enable further research and development of activity recognition models for classroom environments, with potential applications in smart education and intelligent classroom management systems. BK-SAD is available at https://visedu.vn/en/bk-sad-dataset
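To make the dataset description concrete, the sketch below shows how clips for the three activity classes might be indexed before training a baseline classifier. This is a minimal sketch, assuming a hypothetical <root>/<class_name>/<clip>.mp4 directory layout; only the class names come from the abstract, and the released dataset at https://visedu.vn/en/bk-sad-dataset may be organized differently.

```python
# Minimal sketch of indexing a BK-SAD-style dataset for a 3-class baseline.
# The folder layout, file naming, and .mp4 clip format are assumptions for
# illustration only; only the class names are taken from the paper's abstract.

from pathlib import Path
from typing import List, Tuple

# The three activity classes described in the abstract.
CLASSES = ["hand_raising", "dozing_off", "normal"]
CLASS_TO_IDX = {name: idx for idx, name in enumerate(CLASSES)}


def index_clips(root: str) -> List[Tuple[Path, int]]:
    """Collect (clip_path, label) pairs from an assumed
    <root>/<class_name>/<clip>.mp4 layout."""
    samples: List[Tuple[Path, int]] = []
    for class_name, label in CLASS_TO_IDX.items():
        class_dir = Path(root) / class_name
        if not class_dir.is_dir():
            continue  # skip classes that are not present in this copy
        for clip_path in sorted(class_dir.glob("*.mp4")):
            samples.append((clip_path, label))
    return samples


if __name__ == "__main__":
    # Example usage with a hypothetical local copy of the dataset.
    samples = index_clips("bk_sad")
    print(f"indexed {len(samples)} clips across {len(CLASSES)} classes")
```

Such an index can then be wrapped in the data-loading utility of any deep learning framework to train the kind of ConvNet baselines reported in the paper.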
Keywords
Dataset, Action recognition, Skeleton pose, Student activity recognition, Smart classroom
Article Details

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.