As a custom-rigged vehicle drove around, music and visuals were generated in real time based on people and objects detected in the streets of New York. The entire experience was captured in real time with multiple cameras to create the final commercial, which launched at the GRAMMYs. Besides designing and developing the custom tracking software to create unique songs and generate visuals, we also handled on-site support during the video shoot. Quite a challenging installation with a very short turnaround.
Hardware set-up
To build the installation, we used two computers (with a decent GPU), a GoPro camera and some extra hardware to glue everything together. This illustration shows the full setup and how it is configured to run the applications.
The first machine runs the video server and the computer vision application. The second computer is used to generate the augmented reality layer and music. In total, we developed four applications tying this all together. Let’s have a look at the different parts!
A video server captures the real-time video feed from a GoPro camera, using a Blackmagic capture card, and streams it to a computer vision application running on the same machine. Simultaneously, it also streams the video to the augmented reality application on the second machine. This stream was used as a base layer, which we augmented with illustrations.
Next, computer vision. The incoming video stream was analysed using the YOLO detector, which allowed us to track people & objects. Each time an object of interest was detected, an OSC message was sent to the augmented reality application and the MIDI controller application.
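As a rough illustration of those messages, here is a minimal sketch using the ofxOsc addon that ships with openFrameworks. The address ("/detection"), argument layout, host and port are assumptions made for the example, not the exact messages we used.

```cpp
#include "ofMain.h"
#include "ofxOsc.h"

// Sketch: forwarding one detection to another application over OSC.
// Address, argument order, host and port are illustrative assumptions.
class DetectionSender {
public:
    void setup() {
        // One sender per receiving app; a second one would target the MIDI controller app.
        sender.setup("192.168.1.11", 9000); // hypothetical host and port
    }

    void sendDetection(const std::string& label, const ofRectangle& box) {
        ofxOscMessage m;
        m.setAddress("/detection");
        m.addStringArg(label);    // e.g. "person", "fire hydrant"
        m.addFloatArg(box.x);     // bounding box in video pixel coordinates
        m.addFloatArg(box.y);
        m.addFloatArg(box.width);
        m.addFloatArg(box.height);
        sender.sendMessage(m, false);
    }

private:
    ofxOscSender sender;
};
```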
Then comes the augmented reality part. This application captures the incoming video stream and augments it with custom artwork by Musketon, based on the received OSC messages. The final visual output was then pushed to a display mounted outside the car.
The MIDI controller application transforms incoming data into MIDI notes and MIDI CC messages. These were then sent to Ableton Live to trigger different (pre-recorded) instruments & loops and to control the volume of each individual track.
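A minimal sketch of that OSC-to-MIDI bridge, built on ofxMidi; the OSC addresses, MIDI channels and CC numbers below are placeholders we picked for illustration, and the actual routing into Ableton Live was more involved.

```cpp
#include "ofMain.h"
#include "ofxOsc.h"
#include "ofxMidi.h"

// Sketch: receiving OSC events and forwarding them to Ableton Live as MIDI.
// Addresses, channels and CC numbers are illustrative placeholders.
class MidiBridge {
public:
    void setup() {
        receiver.setup(9001); // hypothetical OSC port
        midiOut.openPort(0);  // MIDI port routed into Ableton Live
    }

    void update() {
        while (receiver.hasWaitingMessages()) {
            ofxOscMessage m;
            receiver.getNextMessage(m);
            if (m.getAddress() == "/note") {
                int pitch    = m.getArgAsInt32(0);
                int velocity = m.getArgAsInt32(1);
                midiOut.sendNoteOn(1, pitch, velocity);  // channel 1: the 'people' instrument
            } else if (m.getAddress() == "/volume") {
                int track = m.getArgAsInt32(0);          // which base track (0..2)
                int value = m.getArgAsInt32(1);          // 0..127
                midiOut.sendControlChange(2, 20 + track, value); // CC 20-22, mapped in Live
            }
        }
    }

private:
    ofxOscReceiver receiver;
    ofxMidiOut midiOut;
};
```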
All applications were built using openFrameworks, an open source C++ toolkit for creative coding. We also used several add-ons developed by the community. Without these open source projects, we would not have been able to build this installation in such a short time.
The video server takes the live stream from the camera and distributes it to the other applications. Communication between different applications was handled with Spout (for apps on the same machine) and NDI (distributing the video to the other machine).
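Structurally, that share step looked roughly like the sketch below: the captured frame is drawn into an FBO once and the resulting texture is handed to both senders. shareOverSpout() and shareOverNDI() are hypothetical wrappers standing in for the ofxSpout2 and NDI sender calls, whose exact signatures we are not reproducing here; the addons' example projects show the real API.

```cpp
#include "ofMain.h"

// Sketch: the video server shares one captured frame with two receivers.
// shareOverSpout()/shareOverNDI() are hypothetical stand-ins for the actual
// ofxSpout2 / NDI sender calls.
class VideoServer {
public:
    void setup(int w, int h) {
        frameFbo.allocate(w, h, GL_RGBA);
    }

    void update(ofTexture& capturedFrame) {
        // Draw the latest Blackmagic frame into an FBO so both senders
        // publish exactly the same texture.
        frameFbo.begin();
        capturedFrame.draw(0, 0, frameFbo.getWidth(), frameFbo.getHeight());
        frameFbo.end();

        shareOverSpout(frameFbo.getTexture()); // same machine: computer vision app
        shareOverNDI(frameFbo.getTexture());   // second machine: augmented reality app
    }

private:
    void shareOverSpout(ofTexture& tex) { /* ofxSpout2 sender call */ }
    void shareOverNDI(ofTexture& tex)   { /* NDI sender call */ }

    ofFbo frameFbo;
};
```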
First tests were promising, but one of the bottlenecks seemed to be the standard openFrameworks video capture class. We started out by using a Logitech HD Pro Webcam C920, which sends out two video streams:
1. Raw video feed
2. H264 feed
It turns out the openFrameworks ‘ofVideoGrabber’ capture class does not support taking in the H264 encoded stream from the webcam. This forced us to use the raw feed, which resulted in serious frame drops. There are workarounds to deal with this, but we didn’t have time to explore them.
So on to plan B… using a GoPro Hero 5 and a Blackmagic Design DeckLink Mini Recorder 4K PCIe capture card. To get the captured video feed into openFrameworks, we used the ofxBlackmagic2 addon by Elliot Woods.
During development, we also used the Blackmagic Design Intensity Shuttle because we didn’t have full-time access to the main setup.
The most important challenge of this project was detecting people & objects in a live video stream. To do this we used Darknet by Joseph Redmon, which includes YOLO, a state-of-the-art real-time object detector.
To get the detector up & running in openFrameworks we used ofxDarknet, developed by Marcel Schwittlick, with a model trained on the COCO dataset.
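Wiring the detector up then looks roughly like this sketch. The method and struct names (init(), yolo(), detected_object) follow the ofxDarknet examples as we recall them, and the config, weight and names paths as well as the threshold are placeholders.

```cpp
#include "ofMain.h"
#include "ofxDarknet.h"

// Sketch: running YOLO on incoming frames with ofxDarknet.
// API names are recalled from the addon's examples; paths and threshold are placeholders.
class Detector {
public:
    void setup() {
        darknet.init(ofToDataPath("cfg/yolo.cfg"),
                     ofToDataPath("yolo.weights"),
                     ofToDataPath("cfg/coco.names")); // COCO class labels
    }

    void onNewFrame(ofPixels& pixels) {
        std::vector<detected_object> detections = darknet.yolo(pixels, 0.25f);
        for (auto& d : detections) {
            // d.label is the COCO class name, d.rect the bounding box,
            // d.probability the detection confidence.
            if (d.label == "person" || d.label == "fire hydrant") {
                // hand the detection to the playhead / OSC logic
            }
        }
    }

private:
    ofxDarknet darknet;
};
```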
On top of the object detection we implemented a ‘playhead’ system, which gave us more fine-grained control over when & how many times signals were sent through the system. We used OSC to send out these messages.
These ‘playheads’ could be placed anywhere inside the feed, each controlling different animations, instruments or global parameters (volume). The ‘region of interest’ of a certain playhead was defined by its width and position.
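Conceptually, a playhead boiled down to a vertical band over the frame that fires when a detection's bounding box overlaps it and a cooldown has elapsed. The sketch below is a simplified reconstruction of that idea; the names, the cooldown and the full-height band are our assumptions, not the production code.

```cpp
#include "ofMain.h"

// Sketch: a 'playhead' as a vertical region of interest over the video frame.
// Simplified reconstruction; field names and cooldown logic are illustrative.
struct Playhead {
    float x;           // horizontal position of the region (pixels)
    float width;       // width of the region of interest
    float frameHeight; // height of the video frame
    float cooldown;    // seconds before this playhead may fire again
    float lastFired = -1000;

    bool contains(const ofRectangle& box) const {
        ofRectangle region(x, 0, width, frameHeight); // full-height band
        return region.intersects(box);
    }

    // Returns true when a detection inside the region should trigger a signal,
    // e.g. an OSC message to the AR and MIDI controller apps.
    bool tryFire(const ofRectangle& box) {
        float now = ofGetElapsedTimef();
        if (contains(box) && now - lastFired > cooldown) {
            lastFired = now;
            return true;
        }
        return false;
    }
};
```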
The augmented reality app created the visual output for the experience, taking in the video stream from the video server and augmenting it with custom illustrations based on OSC messages received from the computer vision application.
Each OSC message contained the type of object that was detected, its position on screen and the size of its bounding box. Each detected object was then paired with an animation that played back a sequence of PNG images.
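Playing back such a PNG sequence in openFrameworks is straightforward; the sketch below is an assumed, minimal version of pairing an animation with a detection, with the folder layout, frame rate and anchoring chosen for illustration.

```cpp
#include "ofMain.h"

// Sketch: a PNG-sequence animation drawn at a detected object's position.
// Folder layout, frame rate and anchoring are illustrative assumptions.
class PngSequence {
public:
    void load(const std::string& folder) {
        ofDirectory dir(folder);
        dir.allowExt("png");
        dir.listDir();
        dir.sort(); // frames named so they sort in playback order
        for (auto& file : dir.getFiles()) {
            frames.emplace_back();
            frames.back().load(file.path());
        }
    }

    void draw(const ofRectangle& box, float fps = 30) {
        if (frames.empty()) return;
        int index = int(ofGetElapsedTimef() * fps) % (int)frames.size();
        // Anchor the illustration to the detection's bounding box.
        frames[index].draw(box.x, box.y, box.width, box.height);
    }

private:
    std::vector<ofImage> frames;
};
```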
Based on visual references from the agency, we brought in Musketon to create the artwork and combined it with generated design elements.
An exception was made for cars, trucks and buses: these were not paired with illustrations but instead controlled a ribbon that started dancing across the musical notation bars each time one was detected.
In the end, the incoming video and the augmented reality layer were composited and shown on the display mounted on the outside of the car.
All music was composed by Eddie Alonso and consisted of different instruments and loops, layered in Ableton Live.
The initial idea was to use a different instrument for each type of detected object, with each playing different notes in different scales. However, this became pretty complex and didn’t sound like the musical soundtrack we envisioned for this experience. For the final experience we ended up with a single instrument, used whenever people were detected, paired with audio loops for other objects (e.g. fire hydrants, bicycles, traffic lights, stop signs, ...). Again, a special exception was made for cars, trucks and buses: just like the central ribbon in the visual output, these controlled the volume of one of the three base tracks of the composition.
For the instrument attached to people, we laid out some ground rules:
1. The vertical position of a person defines the pitch of a note
2. People positioned at the top of the screen play higher notes
3. All notes should be in the scale of C-Minor
The musical notes being played are based on where an object is detected inside the video. We mapped the height of the video to two octaves, meaning notes could be anything between C2 and B3. Playing all these notes randomly in a composition would sound horrible, so to make it more musical we used the Scale MIDI effect in Ableton Live, which alters the pitch of incoming notes based on a predefined scale map. To facilitate all of this we created a MIDI controller app which received the OSC messages, handled the processing and sent the results to Ableton Live through MIDI and/or OSC. For this we used ofxMidi by Dan Wilcox, and we also added a quantizer for incoming notes and control over the playback of audio loops.
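A simplified version of that mapping as a sketch: the vertical position of a person becomes a note in a two-octave range (C2 to B3), snapped to C minor before it goes out over ofxMidi. The in-code quantizer here only stands in for what the Scale MIDI effect did inside Ableton Live, and the note numbers assume scientific pitch notation (C2 = 36).

```cpp
#include "ofMain.h"
#include "ofxMidi.h"
#include <cstdlib>

// Sketch: map a person's vertical position to a C-minor note across two octaves.
// MIDI note numbers use scientific pitch notation (C2 = 36); in the real project
// the snapping was done by Ableton's Scale MIDI effect rather than in code.
class PersonInstrument {
public:
    void setup() {
        midiOut.openPort(0); // MIDI port routed into Ableton Live
    }

    // y: vertical centre of the detection (pixels), frameHeight: video height.
    void play(float y, float frameHeight, int velocity = 100) {
        // Top of the frame -> higher notes, bottom -> lower notes.
        int semitone = int(ofMap(y, frameHeight, 0, 0, 23, true)); // 0..23 = two octaves
        int pitch = quantizeToCMinor(36 + semitone);               // 36 = C2, 59 = B3
        midiOut.sendNoteOn(1, pitch, velocity);
    }

private:
    // Snap a note to the nearest degree of C natural minor: C D Eb F G Ab Bb.
    int quantizeToCMinor(int note) {
        static const std::vector<int> scale = {0, 2, 3, 5, 7, 8, 10};
        int octave = note / 12;
        int degree = note % 12;
        int best = scale[0];
        for (int s : scale) {
            if (std::abs(s - degree) < std::abs(best - degree)) best = s;
        }
        return octave * 12 + best;
    }

    ofxMidiOut midiOut;
};
```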
- tbwachiatdayla.com, Agency — TBWA/Chiat/Day LA
- toolofna.com, Production — Tool Of North America
- musketon.com, Illustrations — Musketon
This would not have been possible without the help of the open-source community behind the following projects. Big thank you!
- openFrameworks, Creative Coding Toolkit & Community
- github.com/danomatika/ofxMidi
- github.com/mrzl/ofxDarknet
- github.com/Kj1/ofxSpout2
- github.com/elliotwoods/ofxBlackmagic2
- github.com/kylemcdonald/ofxCv
- pjreddie.com/darknet/yolo
- spout.zeal.co
- MaxForLive
- Ableton