Coca-Cola - KWAVE

August 2, 2024

Client

Coca-Cola

Agency

Unbound

Date

March 1, 2024

Scalable serverless backend for a global campaign & music video, leveraging AI and 3D rendering to generate custom videos in less than 60 seconds.

Coca-Cola Creations were looking for a way for K-pop fans to insert themselves into a bespoke music video, as part of a new KWAVE limited-edition flavour and the release of a single by a K-pop supergroup. My role was to handle the backend (API, AI model inference, compositing, 3D rendering and assembling the output video), help out with the frontend, and ensure the music video contained shots we could dynamically composite elements into later.

Originally I pitched a photorealistic approach based on cutting-edge 2D generative AI models, with ControlNets to handle temporal consistency (this was before generative video or character-cloning models existed. Things move fast in AI!). Some early prototypes were built, with strong results.

Eventually the creative direction changed to a more anime/3D character-based approach, where the user would be less 'deepfaked' and more caricatured, and inserted alongside their voice and name. This required a change of tack: using AI models to describe/detect the user's basic characteristics and connect them to a huge library of 3D assets. I started with multimodal LLMs but quickly realised that their limited training data made it unlikely I'd find a pretrained one that could describe people in sufficient detail, especially for important but loosely defined attributes like hairstyle, or 'big' vs 'small' noses.

In the end I settled on a collection of many lightweight custom-trained recognition models (using slightly dated architectures). Basic attributes like glasses/no glasses/sunglasses were easy to train with existing datasets, but for hairstyles and other elements we ended up with a hybrid approach: finding examples of each type (in some cases generating them via diffusion models), training a LoRA, then training the recognition model on output from that LoRA. This gave us an effective way to differentiate the hairstyles and identifiable characteristics we needed, and allowed us to emphasise accuracy for culturally-significant or ethnically-specific styles - a vital component of a global campaign.
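To give a flavour of how that fits together at inference time, here's a rough sketch (not the production code) of running a handful of small classifiers in parallel over a single face crop and merging the results into an attribute profile. It assumes the models are exported to ONNX and run via onnxruntime-node; the model paths and label sets are placeholders.

```typescript
// Rough sketch: run several lightweight attribute classifiers in parallel
// over one preprocessed face crop and merge the results into a profile.
// Assumes the custom models are exported to ONNX; names/labels are placeholders.
import * as ort from 'onnxruntime-node';

const CLASSIFIERS = [
  { name: 'glasses', path: 'models/glasses.onnx', labels: ['none', 'glasses', 'sunglasses'] },
  { name: 'hairstyle', path: 'models/hairstyle.onnx', labels: ['short', 'long', 'braids', 'afro', 'ponytail'] },
  { name: 'facialHair', path: 'models/facial_hair.onnx', labels: ['none', 'moustache', 'beard'] },
];

async function classify(
  model: { name: string; path: string; labels: string[] },
  input: Float32Array, // preprocessed 1x3x224x224 face crop
): Promise<[string, string]> {
  // For brevity a session is created per call; in practice you'd cache these.
  const session = await ort.InferenceSession.create(model.path);
  const tensor = new ort.Tensor('float32', input, [1, 3, 224, 224]);
  const output = await session.run({ [session.inputNames[0]]: tensor });
  const scores = output[session.outputNames[0]].data as Float32Array;
  // Pick the highest-scoring label.
  const best = scores.reduce((bi, s, i) => (s > scores[bi] ? i : bi), 0);
  return [model.name, model.labels[best]];
}

// Run every classifier concurrently and build the attribute profile that the
// asset-matching step maps onto the 3D character library.
export async function buildProfile(faceCrop: Float32Array) {
  const results = await Promise.all(CLASSIFIERS.map((m) => classify(m, faceCrop)));
  return Object.fromEntries(results); // e.g. { glasses: 'sunglasses', hairstyle: 'braids', ... }
}
```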

In amongst this I liaised with the music video production and VFX teams to ensure the storyboards were adjusted and scenes were shot in a way that allowed us to composite the user in where needed.

To get the user's voice into the music track, I originally implemented an AI 'autotune' model, but it turned out we needed greater control over the output to get approval from all stakeholders. This led to a deep dive into audio on Linux, and the end result was a combination of an open-source pitch-correction library and a stack of LADSPA plugins, alongside FFmpeg. I also used multilingual speech-to-text models to censor inappropriate content across multiple languages.
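As a rough sketch of how that kind of chain can be driven from job code, here's an FFmpeg filter graph invoked via child_process. The LADSPA plugin names and the exact filter chain are illustrative placeholders (and FFmpeg needs to be built with --enable-ladspa), not the production pipeline.

```typescript
// Illustrative sketch: mix the user's (already pitch-corrected) vocal under
// the instrumental via an FFmpeg filter graph. Plugin names are examples only,
// and the ladspa filter requires an FFmpeg build with --enable-ladspa.
import { execFile } from 'node:child_process';
import { promisify } from 'node:util';

const run = promisify(execFile);

export async function renderVocalMix(vocal: string, instrumental: string, out: string) {
  const filter =
    // Effects chain on the treated vocal (LADSPA reverb + loudness normalisation).
    '[0:a]ladspa=file=tap_reverb:plugin=tap_reverb,loudnorm[vox];' +
    // Mix the treated vocal with the backing track.
    '[vox][1:a]amix=inputs=2:duration=shortest[mix]';

  await run('ffmpeg', [
    '-y',
    '-i', vocal,          // user vocal, pitch-corrected upstream
    '-i', instrumental,   // backing track
    '-filter_complex', filter,
    '-map', '[mix]',
    '-c:a', 'aac',
    out,
  ]);
}
```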

For 3D rendering I looked into a range of options, but in a large organisation like Coke it's hard to use anything with a licence attached (or to get it through the approval process). I looked at Blender and a few other 3D rendering projects, but in the end settled on WebGL in a headless browser, so we could share 3D code & shaders with the front-end site. Getting Chrome to render hardware-accelerated WebGL, and getting hardware-accelerated video encoding working on servers intended for AI-model training, was a long enough process to warrant its own post, but we got there eventually.
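For reference, the launch configuration ends up looking something like the sketch below, with a sanity check that WebGL hasn't silently fallen back to software rendering. The exact flags depend heavily on the Chrome version and GPU drivers, so treat these as a starting point rather than a recipe.

```typescript
// Rough sketch: launch headless Chrome with GPU flags and verify that WebGL
// is hardware-accelerated before rendering. Flags vary by Chrome version and
// driver stack; these are a starting point, not a recipe.
import puppeteer from 'puppeteer';

export async function launchGpuBrowser() {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--ignore-gpu-blocklist',
      '--enable-gpu-rasterization',
      '--use-gl=egl', // use the GPU's EGL driver rather than SwiftShader
    ],
  });

  // Sanity check: read the unmasked renderer string from a WebGL context.
  const page = await browser.newPage();
  const renderer = await page.evaluate(() => {
    const gl = document.createElement('canvas').getContext('webgl');
    if (!gl) return 'no webgl';
    const info = gl.getExtension('WEBGL_debug_renderer_info');
    return info ? gl.getParameter(info.UNMASKED_RENDERER_WEBGL) : 'unknown';
  });
  if (/swiftshader|llvmpipe/i.test(String(renderer))) {
    throw new Error(`WebGL fell back to software rendering: ${renderer}`);
  }
  await page.close();
  return browser;
}
```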

The compositing also ended up being handled in headless Chrome: I wrote a script to export the 2D/3D layers and camera motion from After Effects, then wrote a simple compositor in TypeScript using three.js & the postprocessing lib. Assembly of the final video file was handled via careful encoding and splicing of clips & audio, to minimise the time spent on each video. We also rendered some extra images & stickers to create a 'fan-kit' the user could download at the end.
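A stripped-down sketch of that compositor is below; the camera-track format and layer setup are simplified placeholders for the real After Effects export, not the production code.

```typescript
// Simplified sketch of the headless compositor: a three.js scene whose camera
// follows keyframes exported from After Effects, with footage and the user's
// character as textured layers, rendered frame-by-frame through the
// postprocessing library. The camera-track format here is a placeholder.
import * as THREE from 'three';
import { EffectComposer, RenderPass } from 'postprocessing';

interface CameraKeyframe {
  position: [number, number, number];
  rotation: [number, number, number];
  fov: number;
}

export function createCompositor(
  canvas: HTMLCanvasElement,
  cameraTrack: CameraKeyframe[], // one entry per frame, exported from AE
  layers: THREE.Texture[],       // background plate, character render, overlays...
) {
  const renderer = new THREE.WebGLRenderer({ canvas, antialias: false });
  const scene = new THREE.Scene();
  const camera = new THREE.PerspectiveCamera(50, 16 / 9, 0.1, 10000);

  // Each layer becomes a textured plane, stacked back to front.
  layers.forEach((texture, i) => {
    const plane = new THREE.Mesh(
      new THREE.PlaneGeometry(1920, 1080),
      new THREE.MeshBasicMaterial({ map: texture, transparent: true }),
    );
    plane.position.z = -1000 + i * 10;
    scene.add(plane);
  });

  const composer = new EffectComposer(renderer);
  composer.addPass(new RenderPass(scene, camera));

  // Render a single frame; the headless harness steps through frames,
  // captures the canvas, and pipes it to the encoder.
  return function renderFrame(frame: number) {
    const key = cameraTrack[Math.min(frame, cameraTrack.length - 1)];
    camera.position.set(...key.position);
    camera.rotation.set(...key.rotation);
    camera.fov = key.fov;
    camera.updateProjectionMatrix();
    composer.render();
  };
}
```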

I had a target of 60-90 seconds per user (the time it took for them to capture their face, sing, and play a brief game), which necessitated parallelising the recognition, 3D rendering, audio, compositing and assembly jobs as far as possible. I used a combination of AWS Lambda (running a basic Laravel app for the API and job management) and Beam's serverless GPU API, with some aggressive warming to ensure we always raced ahead of demand without overspending on expensive GPUs. We ended up coming in well under time and cost budgets, despite hitting 3x the expected 'max' traffic, which was a big success for the architecture.
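The shape of the job graph is what makes that window achievable: recognition, audio processing and 3D rendering don't depend on each other, so only compositing and assembly have to wait. A purely illustrative sketch of that shape (the real orchestration lived in Laravel queues and Beam's API, and every function below is a placeholder):

```typescript
// Hypothetical sketch of the job graph: independent stages run concurrently,
// and only compositing/assembly waits on all of them. Every function here is
// a placeholder for a queued job in the real system.
declare function recogniseAttributes(userId: string): Promise<Record<string, string>>;
declare function processAudio(userId: string): Promise<string>;
declare function renderCharacterShots(userId: string): Promise<string[]>;
declare function compositeShots(shots: string[], profile: Record<string, string>): Promise<string[]>;
declare function assembleFinalVideo(shots: string[], vocal: string): Promise<string>;

export async function generateVideo(userId: string) {
  // These three stages have no dependencies on each other, so run them together.
  const [profile, vocalTrack, characterRenders] = await Promise.all([
    recogniseAttributes(userId),   // lightweight recognition models
    processAudio(userId),          // pitch correction + LADSPA chain + STT censoring
    renderCharacterShots(userId),  // headless-Chrome WebGL renders
  ]);

  // Compositing needs the character renders; assembly needs everything.
  const compositedShots = await compositeShots(characterRenders, profile);
  return assembleFinalVideo(compositedShots, vocalTrack);
}
```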

Towards the end of the project the front-end had gotten quite heavy on assets, and I was brought in to help the team get some out-of-memory issues under control. We profiled the webapp, identified the bottlenecks, and via some re-organising and GLTF/texture compression magic, got the OOM errors tamed.
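For anyone hitting similar issues, the kind of change involved looks roughly like the loader setup below: KTX2/Basis textures that stay compressed in GPU memory, plus Draco-compressed geometry. The transcoder/decoder paths and asset names are placeholders.

```typescript
// Illustrative sketch of the asset-loading change: KTX2/Basis textures stay
// compressed on the GPU and Draco shrinks geometry, which is what reins in
// memory on low-end devices. Transcoder/decoder paths are placeholders.
import * as THREE from 'three';
import { GLTFLoader } from 'three/examples/jsm/loaders/GLTFLoader.js';
import { KTX2Loader } from 'three/examples/jsm/loaders/KTX2Loader.js';
import { DRACOLoader } from 'three/examples/jsm/loaders/DRACOLoader.js';

export function createCompressedLoader(renderer: THREE.WebGLRenderer) {
  const ktx2 = new KTX2Loader()
    .setTranscoderPath('/basis/')   // basis_transcoder.{js,wasm}
    .detectSupport(renderer);       // picks the best GPU format (ASTC/ETC/BC)

  const draco = new DRACOLoader().setDecoderPath('/draco/');

  const loader = new GLTFLoader();
  loader.setKTX2Loader(ktx2);
  loader.setDRACOLoader(draco);
  return loader;
}

// Usage: loader.load('/assets/stage.glb', (gltf) => scene.add(gltf.scene));
// Dispose geometries/textures of off-screen scenes to release GPU memory, e.g.
//   mesh.geometry.dispose(); (mesh.material as THREE.MeshStandardMaterial).map?.dispose();
```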

In all, many terabytes of custom videos were generated by K-pop fans without any downtime or hiccups, despite some massive spikes as each K-pop band published the video and link to their socials. It's always difficult to build complex systems that work at scale for an event (hard deadline, unknown traffic numbers, and often no way to simulate the full stack under load!), but I was particularly proud of this one. It took some wrangling to deal with an organisation the size of Coke, with its (understandably!) heavy processes distributed across regions and functions like legal, IT and privacy, plus the added novel element of AI policy development. Some elements of the architecture took some pushing on my end, and debates with whole teams over the necessity of things like scalable 3rd-party GPU servers vs standing up our own hardware, but the focus was always on the best outcome, which I think we achieved.
