Thumb Up or Down: take a survey with a hand gesture

Aug 28, 2015 · python, hack

In the latest SurveyMonkey hackathon, held on August 13 – 14, 2015, I stayed up all night to build Thumb Up or Down, a computer vision prototype that lets you take a survey with hand gestures. It was really fun, and I’d like to share the experience with you.

A video is worth millions of words:

Conceptually, the hack consists of four components:

A motion detection module, which starts and terminates the state machine.
An image recognition module, which identifies the thumb-up or thumb-down gesture.
A socketio server, which renders the web page and facilitates bi-directional communication between the client and the server.
An HTML5 front end, which provides feedback to the survey taker.

Of course, I simplified the design and made lots of trade-offs to meet the 24-hour deadline.
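To make the role of the state machine concrete, here is a minimal Python sketch of how the motion detection and image recognition modules might drive it. The state and event names (idle, watching, motion_start, and so on) are hypothetical: the post does not list the actual states, and in the hack the state machine lives on the client side.

```python
# Hypothetical states and events; the real hack may use different ones.
TRANSITIONS = {
    ("idle", "motion_start"): "watching",   # someone steps in front of the camera
    ("watching", "gesture"): "answered",    # thumb up or thumb down recognized
    ("watching", "motion_stop"): "idle",    # the person left without answering
    ("answered", "motion_stop"): "idle",    # the person left; wait for the next one
}


class SurveyStateMachine:
    def __init__(self):
        self.state = "idle"

    def handle(self, event):
        """Apply an event; events that do not fit the current state are ignored."""
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return self.state


machine = SurveyStateMachine()
states = [machine.handle(e) for e in ("motion_start", "gesture", "motion_stop")]
```

Keeping the transitions in one table makes it easy to see that a gesture only counts while someone is in front of the camera.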

Motion Detection

Computer vision is hard, but standing on the shoulders of the community's accumulated effort, OpenCV, some problems are solvable even on a hackathon timeline.

The idea behind the motion detection is to capture a background image as the baseline, diff each frame against that baseline, find the contours of the changed regions, and trigger the event if the largest contour area exceeds a preset threshold.

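import cv2
import imutils


def motion_detected(camera, first_frame, min_area=5000):
    """Return True if the current frame differs enough from first_frame.

    first_frame is the baseline: a background shot that went through the
    same resize, grayscale, and blur steps as below.
    """
    # Grab the picture from the camera
    grabbed, frame = camera.read()
    if not grabbed:
        return False

    # Resize, normalize (imutils.resize keeps the aspect ratio, so the
    # width alone determines the output size)
    frame = imutils.resize(frame, width=400)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)

    # Diff against the baseline and keep only the pixels that changed noticeably
    frame_delta = cv2.absdiff(first_frame, gray)
    thresh = cv2.threshold(frame_delta, 25, 255, cv2.THRESH_BINARY)[1]

    # dilate the thresholded image to fill in holes, then find contours
    # on thresholded image
    thresh = cv2.dilate(thresh, None, iterations=2)
    # findContours returns (contours, hierarchy) in OpenCV 2.x and 4.x but
    # (image, contours, hierarchy) in 3.x; [-2] picks the contours in all of them
    cnts = cv2.findContours(thresh.copy(),
            cv2.RETR_EXTERNAL,
            cv2.CHAIN_APPROX_SIMPLE)[-2]

    # We only pay attention to the largest contour
    largest_area = max(cv2.contourArea(c) for c in cnts) if len(cnts) else 0
    return largest_area > min_area


camera = cv2.VideoCapture(0)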
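The same idea can be tried without a camera. The sketch below is a pure-Python illustration on tiny hand-written grayscale grids. It counts changed pixels rather than measuring contour areas as the OpenCV code does, and the helper names (changed_pixels, has_motion) and thresholds are assumptions chosen for the toy data.

```python
def changed_pixels(baseline, frame, pixel_threshold=25):
    """Count pixels whose brightness differs from the baseline by more than pixel_threshold."""
    count = 0
    for base_row, frame_row in zip(baseline, frame):
        for base, current in zip(base_row, frame_row):
            if abs(base - current) > pixel_threshold:
                count += 1
    return count


def has_motion(baseline, frame, min_changed=4):
    """Report motion when enough pixels changed (a stand-in for the contour area test)."""
    return changed_pixels(baseline, frame) >= min_changed


# A 4x4 empty background, the same scene with slight sensor noise,
# and the scene with a bright 2x2 object in the lower-right corner.
background = [[10] * 4 for _ in range(4)]
noisy = [[12] * 4 for _ in range(4)]
with_object = [row[:] for row in background]
for y in (2, 3):
    for x in (2, 3):
        with_object[y][x] = 200

print(has_motion(background, noisy))        # False
print(has_motion(background, with_object))  # True
```

The per-pixel threshold plays the role of cv2.threshold, and the minimum count plays the role of the contour area cutoff: small noise is ignored, a sizeable object is not.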

Clearly, I assumed that the backend and the browser run on the same physical machine, so I could avoid implementing video stream upload from the front end. This assumption is reasonable if you consider that the app might be packaged as a complete solution running on a Raspberry Pi.

Image Recognition

Image recognition is a much harder problem. In theory, the solution involves hand detection, which might be solved by skin detection or convex hull detection with some trial and error. For the hackathon, I simplified the problem to deciding whether the input looks more like the thumb-up or the thumb-down pattern.

First, I tried the matchTemplate method. It performed poorly, because the method basically slides the template image (the thumb-up or thumb-down image) across the input image and compares the two at every position, much like a 2D convolution. The method does not take scale or rotation into account.
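To see why scale matters, here is a minimal pure-Python sketch of sliding-window matching. It scores each position with the sum of squared differences, which is one of the comparison methods matchTemplate offers (cv2.TM_SQDIFF). The function names and the toy images are assumptions for illustration.

```python
def best_match(image, template):
    """Slide template over image and return (score, x, y) for the lowest
    sum of squared differences. A score of 0 is a perfect match."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    best = None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            score = sum(
                (image[y + dy][x + dx] - template[dy][dx]) ** 2
                for dy in range(th)
                for dx in range(tw)
            )
            if best is None or score < best[0]:
                best = (score, x, y)
    return best


template = [[9, 0, 9],
            [0, 9, 0],
            [9, 0, 9]]

# The template pasted unchanged into a blank 6x6 image at x=2, y=1.
same_size = [[0] * 6 for _ in range(6)]
for dy in range(3):
    for dx in range(3):
        same_size[1 + dy][2 + dx] = template[dy][dx]

# The same pattern drawn twice as large: every template cell becomes 2x2 pixels.
double_size = [[template[y // 2][x // 2] for x in range(6)] for y in range(6)]

exact = best_match(same_size, template)
scaled = best_match(double_size, template)
```

Here exact comes back with a score of 0 at x=2, y=1, while scaled never reaches 0 at any position, even though the enlarged image shows the very same pattern.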

Then I tried feature detection. It extracts the key points of the template and of the input, and scores each pair of corresponding key points by the distance between them. I ran feature matching of the input image against both template images, and determined the gesture based on which template yielded more good matches. This method performed reasonably well, at least well enough to pull off the hackathon demo¹.
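The voting step can be sketched without OpenCV. The example below assumes each key point of the input comes with the distances to its best and second-best candidate in a template, and counts a match as good when it passes Lowe's ratio test with a 0.75 cutoff. Both the ratio test and the cutoff are assumptions here, since the post does not say how good matches were selected, and the distances are made-up numbers.

```python
def count_good_matches(distance_pairs, ratio=0.75):
    """Count matches whose best distance is clearly smaller than the second best."""
    return sum(1 for best, second in distance_pairs if best < ratio * second)


def classify_gesture(up_pairs, down_pairs):
    """Pick the template with more good matches; return None on a tie."""
    up_votes = count_good_matches(up_pairs)
    down_votes = count_good_matches(down_pairs)
    if up_votes == down_votes:
        return None
    return "thumb_up" if up_votes > down_votes else "thumb_down"


# Made-up (best, second_best) descriptor distances for one input image.
against_thumb_up = [(20, 60), (35, 50), (18, 70), (40, 45)]
against_thumb_down = [(48, 52), (30, 80), (55, 58), (61, 63)]

gesture = classify_gesture(against_thumb_up, against_thumb_down)
```

With these numbers the thumb-up template collects three good matches against one for thumb-down, so gesture is "thumb_up". In OpenCV, pairs like these would typically come from a k-nearest-neighbour match with k=2, for example BFMatcher.knnMatch.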

Socketio Server

The current architecture demands a bi-directional communication channel between the client and the server, which is exactly what socketio is designed for. gevent-socketio implements the socketio protocol with gevent, and the pyramid integration example gave me a quick start².

In __init__.py, the socketio upgrade path is bound to the route name socketio:

config.add_route('socketio', 'socket.io/*remaining')

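Then in views.py, we initialize the socketio namespace, /thumbup:

from pyramid.response import Response
from pyramid.view import view_config
from socketio import socketio_manage

@view_config(route_name='socketio')
def socket_io(request):
    # Let gevent-socketio route /thumbup events to CameraNamespace
    socketio_manage(request.environ,
            {'/thumbup': CameraNamespace},
            request=request)
    return Response('')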

On the client side, the socketio client MUST connect to the /thumbup namespace to establish the two-way communication:

$(document).ready(function() {
    var socket = io.connect('/thumbup');
    socket.on('action', function() {...});
});

The CameraNamespace then gets event notifications from the client, and can send packets and emit events to the client over this socket. In my example, CameraNamespace spawns the motion_detect method and passes it the socketio handle, self, so motion_detect can trigger the state machine on the client side.

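import gevent
from socketio.namespace import BaseNamespace

class CameraNamespace(BaseNamespace):
    def initialize(self):
        camera = Camera()
        # Run motion detection in its own greenlet and hand it this
        # namespace so it can emit events to the client
        gevent.spawn(camera.motion_detect, self)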

HTML5 Front end

The front end MUST provide some feedback once the survey taker is detected. Luckily with WebRTC, this is pretty straightforward:

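// Legacy callback-style API from 2015; current browsers expose
// navigator.mediaDevices.getUserMedia() and video.srcObject instead.
navigator.getUserMedia({ video: true, audio: false }, function (stream) {
  var vendorURL = window.URL || window.webkitURL;
  var video = document.querySelector("video");
  video.src = vendorURL.createObjectURL(stream);
  video.play();
}, function (err) {
  console.error("getUserMedia failed:", err);
});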

See Mozilla’s example for more details about the vendor extension detection.

You may also take a picture, and send it back to the server side:

var canvas = document.querySelector("canvas");
canvas.width = $(video).width();
canvas.height = $(video).height();
var context = canvas.getContext("2d");
context.drawImage(video, 0, 0, canvas.width, canvas.height);
var data = canvas.toDataURL("image/png");

The data is the Base64-encoded PNG, prefixed with the data:image/png;base64, header. We can easily decode it into an OpenCV image:

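import base64

import cv2
import numpy as np
from pyramid.view import view_config

@view_config(route_name='detect')
def detect(request):
    # Skip the 22-character "data:image/png;base64," header
    img_str = base64.b64decode(request.body[22:])
    nparr = np.frombuffer(img_str, np.uint8)
    # cv2.CV_LOAD_IMAGE_GRAYSCALE is gone in OpenCV 3+; use IMREAD_GRAYSCALE
    img = cv2.imdecode(nparr, cv2.IMREAD_GRAYSCALE)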
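The fixed offset of 22 only works for the exact data:image/png;base64, header. A slightly more defensive variant splits the data URL at the first comma instead. The sketch below uses only the standard library and expects the URL as text; the helper name decode_data_url is an assumption, and the payload is a few stand-in bytes rather than a real image.

```python
import base64


def decode_data_url(data_url):
    """Split a base64 data URL (a str) into (mime_type, raw_bytes)."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URL")
    mime_type = header[len("data:"):-len(";base64")]
    return mime_type, base64.b64decode(payload)


# Build a sample URL in the format canvas.toDataURL("image/png") produces.
# The bytes are the 8-byte PNG signature, standing in for a full image.
png_signature = b"\x89PNG\r\n\x1a\n"
sample = "data:image/png;base64," + base64.b64encode(png_signature).decode("ascii")

mime_type, raw = decode_data_url(sample)
```

The returned bytes can then be converted to a NumPy array and handed to cv2.imdecode as in the view above.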

Closing thoughts

OpenCV is a versatile and powerful Swiss Army knife for image processing and computer vision. Especially with the cv2 Python binding, it empowers us to hack together a meaningful prototype in a sprint.

WebRTC and other HTML5 technologies make it possible to build a web app instead of a native app for more sophisticated applications like this.

One more thing, SurveyMonkey is hiring! We are looking for talented python developers to help the world make better decisions.

¹ Further inspection shows that the methodology is flawed. As the thumb up and thumb down are relatively similar, we also need to evaluate the perspective transform matrix to determine the gesture: more concretely, if the gesture is identified as a thumb up with a 180-degree rotation, it is in fact a thumb-down sign, and vice versa for the rotated thumb-down sign.
² If you use Python 3, you may want to try aiopyramid.
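The rotation check described in the first note could look roughly like this. The sketch assumes the perspective transform has already been estimated as a 3x3 matrix mapping template points to input points (for example by cv2.findHomography), and it reads the in-plane rotation from the upper-left 2x2 block, which is only a fair approximation when the perspective distortion is small. The function names and the 90-degree cutoff are assumptions.

```python
import math


def rotation_degrees(homography):
    """In-plane rotation implied by the upper-left 2x2 block of a 3x3 matrix."""
    return math.degrees(math.atan2(homography[1][0], homography[0][0]))


def resolve_gesture(matched_template, homography):
    """Flip the answer when the template matched upside down."""
    upside_down = abs(rotation_degrees(homography)) > 90
    if not upside_down:
        return matched_template
    return "thumb_down" if matched_template == "thumb_up" else "thumb_up"


def rotation_matrix(degrees):
    """A pure rotation written as a 3x3 homogeneous matrix, for the demo."""
    c, s = math.cos(math.radians(degrees)), math.sin(math.radians(degrees))
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


upright = resolve_gesture("thumb_up", rotation_matrix(10))
flipped = resolve_gesture("thumb_up", rotation_matrix(180))
```

A thumb-up template that matches at roughly 10 degrees stays a thumb up, while the same template matching at 180 degrees is reported as a thumb down.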