Online Engagement using Computer Vision
Detecting Student Engagement through live Video Stream
Detecting student engagement from video is a difficult task. This paper describes in detail the methods and processes used to train deep learning models (CNNs and RNNs) to predict engagement using cloud infrastructure, and the practical deployment of these models to an NVIDIA Jetson device. Although the models tend to overfit to the training data, the overall system is designed so that alternative models can be integrated as they are developed. Attention is also given to code reuse, usability and the centralized storage of data and results.
Introduction
More and more students are engaged in online learning, either out of necessity due to the recent COVID-19 pandemic or voluntarily through remote learning courses. For some students, the shift from in-classroom to online attendance can be difficult, and communicating over video is less immersive. Teachers also have to contend with ensuring students are engaged in the learning process, which can be difficult when the primary view of student response is a small video window in an online meeting.
This work uses video capture to identify when students become bored and provides real-time and offline feedback to the teacher. It is a proof of concept, examining the possibility of creating models to classify engagement from video and then applying these models to low-powered edge devices.
High Level Solution
The solution combines cloud and edge technology to train deep learning neural network models from previously labeled videos. Individual frames are captured from the videos and used to train various models that predict boredom levels from 0 (not bored) to 3 (most bored). Models are trained against full images and also against images of just the students' faces, to reduce the influence of background data. Images are captured from webcam video streams.
Technical Architecture
The project is split into two sections: data preparation and model training, shown on the left of the dotted line in the architecture diagram, and live video capture and inference on the Jetson device, shown on the right. Docker is used extensively throughout.
Training
Purpose | Type | GPU | vCPUs | Mem (GB) |
---|---|---|---|---|
Image Extraction | c4.8xlarge | None | 36 | 60 |
Face Detection | p3.2xlarge | 1 x NVIDIA Tesla V100 (16 GB) | 8 | 61 |
Storage | EFS | - | - | - |
AWS is used for the cloud infrastructure that supports training. EFS is used as the back-end storage, providing persistence of data and saved models between environments. Docker is used to containerize the code, using the tensorflow:latest-gpu-jupyter base image. TensorFlow 2 is used as the deep learning platform, along with Keras and OpenCV for image processing.
Data preparation and image extraction are performed on a compute-optimized c4.8xlarge instance. FFmpeg is used to extract images from the training video streams. Two extraction rates were used: 1 FPS for CNN-based models and 2 FPS for RNN-based models.
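A minimal sketch of the extraction step, assuming FFmpeg is installed and using illustrative input and output paths (the RNN variant only changes the fps filter value to 2):

import subprocess
from pathlib import Path

def extract_frames(video_path: Path, out_dir: Path, fps: int = 1) -> None:
    """Extract frames from one video snippet at the given frame rate."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video_path),
         "-vf", f"fps={fps}",  # 1 FPS for CNN models, 2 FPS for RNN models
         str(out_dir / f"{video_path.stem}_%03d.jpg")],
        check=True)

# e.g. extract_frames(Path("DataSet/Test/500044/5000441001/5000441001.avi"),
#                     Path("1FPS/data/Test/b0"), fps=1)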
Face detection is performed using a pre-trained DNN model with weights trained using Caffe; the DNN module is available in OpenCV. The model training scripts are designed to be flexible and modular, and can run with a variety of inputs (a minimal sketch of how these inputs are used follows the list):
- Model: e.g. Xception, MobileNetV2, EfficientNet, Inception
- Image size: e.g. 224, 299
- Data type: e.g. whole images, faces only, augmented data
- Weights: e.g. imagenet
- Epochs
- Batch size
- Learning rate
- Learning rate decay
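The training scripts could assemble the chosen backbone roughly as in the following sketch (function and argument names are hypothetical; the frozen-layer count corresponds to the [:126] setting in the results table below):

import tensorflow as tf

BASE_MODELS = {
    "MobileNetV2": tf.keras.applications.MobileNetV2,
    "Xception": tf.keras.applications.Xception,
    "InceptionV3": tf.keras.applications.InceptionV3,
}

def build_base(model_name="MobileNetV2", image_size=224,
               weights="imagenet", frozen_layers=126):
    """Instantiate the chosen backbone and freeze its first N layers."""
    base = BASE_MODELS[model_name](include_top=False,
                                   weights=weights,
                                   input_shape=(image_size, image_size, 3))
    for layer in base.layers[:frozen_layers]:
        layer.trainable = False
    return base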
Inference
Purpose | Type | GPU | CPU | Mem (GB) |
---|---|---|---|---|
Inference | NVIDIA Jetson Xavier | 512-Core Volta | 8-Core ARM v8.2 64-Bit | 32 |
Inference is performed on an NVIDIA Jetson with an attached webcam. Model files are copied to the Jetson with SCP. The Jetson uses the ARM computing architecture, so some code must be compiled from source for inference to work, specifically newer versions of OpenCV (>3.2). For inference the Docker base image nvcr.io/nvidia/l4t-ml:r32.4.3-py3 is used, with TensorFlow, Keras and OpenCV installed. Inference is done in real time against a video stream, with individual and running totals of engagement presented on screen. At the end of the meeting a report is saved and optionally sent to AWS S3 using MQTT for offline analysis. The client demo scripts provide some options when run (a sketch of the argument parsing follows the list):
-r --record, whether to record the video
-m --messaging, whether to send report to AWS using MQTT
-p --path, directory to record the video, default is home
-f --filename, name of the video recording file, default is infer<datetime>.avi
-c --codec, video codec, default is MJPG
-fps --fps, frames per second, default is 2
-hg --height, height of video to record, default is 480
-w --width, width of video to record, default is 640
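A minimal argparse sketch of how these options might be wired up, with defaults taken from the list above (the exact flag handling in the repository may differ):

import argparse
from datetime import datetime
from pathlib import Path

parser = argparse.ArgumentParser(description="Live engagement inference demo")
parser.add_argument("-r", "--record", action="store_true", help="record the video")
parser.add_argument("-m", "--messaging", action="store_true",
                    help="send report to AWS using MQTT")
parser.add_argument("-p", "--path", default=str(Path.home()),
                    help="directory to record the video")
parser.add_argument("-f", "--filename",
                    default=f"infer{datetime.now():%Y%m%d%H%M%S}.avi",
                    help="name of the video recording file")
parser.add_argument("-c", "--codec", default="MJPG", help="video codec")
parser.add_argument("-fps", "--fps", type=int, default=2, help="frames per second")
parser.add_argument("-hg", "--height", type=int, default=480,
                    help="height of video to record")
parser.add_argument("-w", "--width", type=int, default=640,
                    help="width of video to record")
args = parser.parse_args()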
Data Acquisition and Processing
DAiSEE (Dataset for Affective States in E-Environments), publicly available on request, is used to provide ground truth for the classification models. It contains 8,925 multi-label, 10-second video snippets captured from 112 users. Each snippet has been scored from 0 to 3 for boredom, confusion, engagement and frustration. The DAiSEE data is already organized into train, validation and test datasets. The videos are stored in a deep directory structure, i.e. Purpose -> Person -> Snippet -> Video file, and a separate label CSV file contains the file path and labels for each 10-second snippet (a minimal parsing sketch follows the directory listing below). All videos associated with one person are contained within one purpose, i.e. there cannot be videos from person A in Train and also in Validation (or Test).
|___Test
|___500044
|___500067
|___500095
|___5000441001
|___5000441002
|___5000441003
|___5000441003.avi
|___Train ...
|___Validation ...
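A minimal sketch of pairing each snippet with its boredom label, assuming the label CSV provides a clip identifier plus one column per affective state (the column names used here are assumptions and may differ from the actual DAiSEE files):

import pandas as pd
from pathlib import Path

labels = pd.read_csv("Labels/TrainLabels.csv")   # one row per 10-second snippet

# Index every training video by file name so it can be joined to the label rows.
videos = {p.name: p for p in Path("DataSet/Train").rglob("*.avi")}

samples = [(videos[row.ClipID], int(row.Boredom))   # (video path, boredom level 0-3)
           for row in labels.itertuples()
           if row.ClipID in videos]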
The distribution of data between the different classes (0, 1, 2, 3) is shown in the following table, for the different engagement types (Boredom, Engagement, Confusion, Frustration).
Label | Boredom | Engagement | Confusion | Frustration |
---|---|---|---|---|
0 | 3822 | 61 | 5951 | 6887 |
1 | 2850 | 455 | 2133 | 1613 |
2 | 1923 | 4422 | 741 | 338 |
3 | 330 | 3987 | 100 | 87 |
Total | 8925 | 8925 | 8925 | 8925 |
Average | 0.86 | 2.38 | 0.44 | 0.29 |
% labeled 0 | 43% | 1% | 67% | 77% |
The distribution for Boredom is more even than for the other affective states, and it serves as a good proxy for (dis)engagement: when a student is labeled Engagement = 0 they are more likely to also be labeled Boredom = 3.
Individual images are extracted from the video files and stored in a structured format suitable for boredom classification, depending on the FPS used. Additional storage areas are created for additional image manipulation, i.e. face detection and augmentation:
Directory Structure
Directory | Contents |
---|---|
data / DAiSEE / | Contains all the data based on the DAiSEE-provided data |
… DataSet | The original videos, stored as described above |
… Labels | CSVs containing labels for the videos |
… 1FPS, … 2FPS | Separate directories for images captured at 1 or 2 frames per second |
… … data -> Test, Train, Validation -> b0, b1, b2, b3 | Whole images extracted at the parent FPS, organized by Test/Train/Validation and boredom class (b0, b1, b2, b3) |
… … data -> Test, Train, Validation -> b0, b1, b2, b3 | Face-only images extracted at the parent FPS, organized by Test/Train/Validation and boredom class (b0, b1, b2, b3) |
… … dataFaces -> Test, Train, Validation -> b0, b1, b2, b3 | Augmented images extracted at the parent FPS, organized by Test/Train/Validation and boredom class (b0, b1, b2, b3) |
… … dataImags -> Test, Train, Validation | Whole images extracted at the parent FPS, organized by Test/Train/Validation but containing all images for all engagement types (for multi-task modelling) |
Augmented images are created using the Keras ImageDataGenerator.
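A minimal augmentation sketch, assuming augmented copies are generated only for the under-represented boredom 3 (b3) class; the transform ranges and directory names are illustrative:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(rotation_range=15,
                               width_shift_range=0.1,
                               height_shift_range=0.1,
                               zoom_range=0.1,
                               horizontal_flip=True)

# Write augmented copies of the b3 training faces (the output directory must exist).
flow = augmenter.flow_from_directory("1FPS/dataFaces/Train",
                                     classes=["b3"],
                                     target_size=(224, 224),
                                     batch_size=32,
                                     save_to_dir="1FPS/augmented/Train/b3",
                                     save_format="jpg")
for _ in range(len(flow)):   # one pass over the b3 images
    next(flow)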
Face Capture
Both the Haar cascade and DNN face detectors were tested, with the DNN capturing faces from more image snippets than the Haar cascade. Examining images extracted at 1 FPS:
Split | Original | Haar cascade | DNN | DNN difference |
---|---|---|---|---|
Test | 17844 | 17443 | 17830 | 387 |
Validation | 14294 | 14062 | 14289 | 227 |
Train | 53584 | 53352 | 53566 | 214 |
Total | 85722 | 84857 | 85685 | 828 |
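A minimal sketch of the OpenCV DNN face detector used to crop faces, assuming the widely distributed ResNet-10 SSD Caffe model files (the file names are assumptions):

import cv2

# Pre-trained Caffe face detector loaded through OpenCV's DNN module.
net = cv2.dnn.readNetFromCaffe("deploy.prototxt",
                               "res10_300x300_ssd_iter_140000.caffemodel")

def detect_face(image, conf_threshold=0.5):
    """Return the most confident face crop, or None if no face is found."""
    h, w = image.shape[:2]
    blob = cv2.dnn.blobFromImage(cv2.resize(image, (300, 300)), 1.0,
                                 (300, 300), (104.0, 177.0, 123.0))
    net.setInput(blob)
    detections = net.forward()                 # shape (1, 1, N, 7)
    best = detections[0, 0, detections[0, 0, :, 2].argmax()]
    if best[2] < conf_threshold:
        return None
    x1, y1, x2, y2 = (best[3:7] * [w, h, w, h]).astype(int)
    return image[max(y1, 0):y2, max(x1, 0):x2]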
Modeling
Over 30 experiments were performed, divided into three general areas: CNN models, recurrent models (CNN -> LSTM and ConvLSTM) and multi-task models. The best results from these approaches are given below:
Common parameters:
- Learning Rate: 0.0001
- Learning Rate Decay: 1.00E-06
- Weights: Imagenet
- Optimizer: Adam (the multi-task model uses a SparseCategoricalCrossentropy loss for each output head)
Model | Data type | FPS | Base Model | Frozen layers | Epochs | Batch size | Accuracy | Val acc | Test acc |
---|---|---|---|---|---|---|---|---|---|
CNN | Augmented Faces | 1 | MobileNetV2 | [:126] | 100 | 32 | 0.88 | 0.34 | 0.36 |
CONVLSTM | Whole Images | 2 | ConvLSTM2D | All | 100 | 16 | 0.96 | 0.32 | 0.34 |
CNN -> LSTM | Whole Images | 2 | MobileNetV2 -> LSTM | All | 100 | 32 | 0.88 | 0.31 | 0.38 |
Multi Task (CNN) | Whole Images | 1 | Xception | All | 10 | 16 | 0.59, 0.65, 0.70, 0.59 | 0.39, 0.46, 0.64, 0.39 | 42.48%, 56.01%, 66.92%, 42.77% |
Multi-task results are presented in the order Boredom, Engagement, Confusion, Frustration.
All of the models had a tendency to overfit; this is likely due to the nuanced nature of the problem. The facial differences between the four levels of boredom are slight and difficult for a model to distinguish.
CNN Models
For the CNN models, the classification layer is removed and two fully connected Dense layers are added, with a dropout of 0.2, before a four-class softmax output layer:
from tensorflow.keras import layers, models

# base_model is the pre-trained backbone (e.g. MobileNetV2) loaded with include_top=False
x = layers.Flatten()(base_model.output)
x = layers.Dense(1024, activation='relu')(x)
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dense(4, activation='softmax')(x)
model = models.Model(base_model.input, x)
The best CNN results all used a frame rate of 1 FPS: the videos do not contain a lot of action, and using a higher frame rate does not improve performance. Faces rather than whole images are used as the classification input, and the addition of augmented images for the under-represented boredom 3 class helped the model find more instances of that class.
The model overfits and over-predicts class b0. It still gives the best results during inference (better than the LSTM and multi-task models), but more work needs to be done.
Multi Task Models
For the multi-task model there are four classification heads, one for each affective state:
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

x = GlobalAveragePooling2D()(base_model.output)
x = Dense(128, activation="relu", name="fc1")(x)
x = Dense(64, activation="relu", name="fc2")(x)
boredom = Dense(4, activation='softmax', name="y1")(x)
engagement = Dense(4, activation='softmax', name="y2")(x)
confusion = Dense(4, activation='softmax', name="y3")(x)
frustration = Dense(4, activation='softmax', name="y4")(x)
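The four heads are then combined into a single model and trained with a SparseCategoricalCrossentropy loss per head; a minimal sketch of that assembly (equal, implicit loss weights are assumed):

from tensorflow.keras import Model
from tensorflow.keras.losses import SparseCategoricalCrossentropy

model = Model(inputs=base_model.input,
              outputs=[boredom, engagement, confusion, frustration])
model.compile(optimizer="adam",
              loss={name: SparseCategoricalCrossentropy()
                    for name in ("y1", "y2", "y3", "y4")},
              metrics=["accuracy"])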
CNN -> LSTM Models
First, features are extracted from the base model (e.g. MobileNetV2) at the global_average_pooling2d layer. The features are reshaped for input into the LSTM model. Whole images are used, captured at 2 FPS to provide a sequence of 20 frames per snippet (a feature-extraction sketch follows the model definition below).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(units=2048, input_shape=(20, 1280),
               return_sequences=False,
               dropout=0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(4, activation='softmax'))
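The feature-extraction step that produces the (20, 1280) input sequences could look like the following sketch, assuming the 20 frames of each snippet have already been resized to the backbone's input size:

import tensorflow as tf

# MobileNetV2 with global average pooling yields a 1280-dimensional vector per frame.
extractor = tf.keras.applications.MobileNetV2(include_top=False,
                                              weights="imagenet",
                                              pooling="avg",
                                              input_shape=(224, 224, 3))

def snippet_features(frames):
    """frames: array of shape (20, 224, 224, 3), one 10-second snippet at 2 FPS."""
    frames = tf.keras.applications.mobilenet_v2.preprocess_input(
        frames.astype("float32"))
    return extractor.predict(frames)   # shape (20, 1280), the LSTM input sequence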
CONVLSTM Models
Whole images are used, captured at 2 FPS to provide a sequence of 20 frames.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import ConvLSTM2D, Dropout, Flatten, Dense

# seq_len is 20 (2 FPS over a 10-second snippet); img_height and img_width match
# the size of the extracted frames.
model = Sequential()
model.add(ConvLSTM2D(filters=16,
                     kernel_size=(3, 3),
                     return_sequences=False,
                     data_format="channels_last",
                     input_shape=(seq_len, img_height, img_width, 3)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.3))
model.add(Dense(4, activation="softmax"))
Deployment
Models are deployed to the Jetson using SCP. A separate Python program is written for each of the model types:
- infer_cnn_dnn.py
- Infer_muti_task.py
- Infer_cnn_lstm.py
- infer_CONVlstm.py
infer_cnn_dnn.py is the most complete demo script and incorporates the following features:
- Options for setting frame rates and codecs for image capture
- Options for recording video
- Options for using MQTT messaging for the final report (a publishing sketch follows this list)
- Displays a running count of each class
- Displays a "recording" message if the video capture is being recorded
- Saves summary text report of video
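A minimal sketch of the MQTT publishing step using paho-mqtt 1.x; the broker address, topic and report fields are assumptions, and in practice the broker forwards the report on to S3:

import json
import paho.mqtt.publish as publish

def send_report(report: dict,
                broker: str = "broker.example.com",
                topic: str = "engagement/report") -> None:
    """Publish the end-of-class summary report as JSON to the MQTT broker."""
    publish.single(topic, payload=json.dumps(report), hostname=broker, qos=1)

# e.g. send_report({"b0": 120, "b1": 45, "b2": 12, "b3": 3})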
Example Results from CNN
Example Results from Multi-Class
Challenges
Challenges exist in ensuring that library versions are consistent between the AWS x86 architecture and the Jetson ARM architecture.
Issue | Mitigation |
---|---|
Overfitting | Add additional meta-inputs for modelling, such as gaze direction and head angle. Add more fundamental features (eyes shut, student at desk). Consider other operational approaches to the problem |
No docker image support for DNN | Compile and install OpenCV from source |
Model sizes too big for Jetson | Use MobileNetV2 |
General memory errors | Configure TensorFlow to allocate GPU memory as needed (see the sketch after this table) |
Unbalanced training set | Augment images for class three; reduce the number of images for class zero |
Base models not designed for engagement | Unfreeze later layers for training |
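The memory mitigation above usually amounts to enabling GPU memory growth, so TensorFlow allocates memory incrementally instead of reserving it all at start-up; a minimal sketch:

import tensorflow as tf

# Allocate GPU memory on demand, which helps avoid out-of-memory errors on the
# Jetson's shared CPU/GPU memory.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)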
Future Improvements
- Package up the code, create modules for reuse
- Feedback Loop for improved models. A system whereby teachers or students express their engagement during classes through a simple interface.
- Incorporate gaze detection into the training process: if a student looks away from the screen for too long, identify this as losing attention.
- Identify the difference between working (looking down) and disengagement (looking away, unfocussed gaze).
- Include the timing of boredom in the report, aligned with the duration of the class, for correlation analysis.
- Consider the ethics of the application: do not allow recording of video, and only report classroom-level summaries.
- Capture video from multiple students in online recording (e.g. from Zoom or MS Teams)
References
[1] Automatic Recognition of Student Engagement using Deep Learning and Facial Expression, Omid Mohamad Nezami, Mark Dras, Len Hamey, Deborah Richards, Stephen Wan, and Cécile Paris, 2018, https://arxiv.org/abs/1808.02324
[2] Prediction and Localization of Student Engagement in the Wild, Amanjot Kaur, Aamir Mustafa, Love Mehta, Abhinav Dhall, 2018, https://arxiv.org/abs/1804.00858
[3] DAiSEE: Towards User Engagement Recognition in the Wild, Abhay Gupta, Arjun D'Cunha, Kamal Awasthi, Vineeth Balasubramanian, 2016, https://arxiv.org/abs/1609.01885
[4] Gaze360: Physically Unconstrained Gaze Estimation in the Wild, Petr Kellnhofer, Adrià Recasens, Simon Stent, Wojciech Matusik, and Antonio Torralba, 2019, http://gaze360.csail.mit.edu/iccv2019_gaze360.pdf
[5] Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings, Seonwook Park, Xucong Zhang, Andreas Bulling, Otmar Hilliges, 2018, https://ait.ethz.ch/projects/2018/landmarks-gaze/
Code available on GitHub