AI Video Production: No cameras, studios or actors required. (Part I of III)

10 min read · Dec 17, 2020

Synthetic media — text, image, audio or video content that has been fully or partially generated by AI — is evolving out of the lab and into commercial products.

In this three-part series I’ll outline how the media production pipeline is being reinvented, driven by the ability of neural networks to imitate the real world.

I’ll be focussing on practical use cases today and in the short term. I’ll leave out the more philosophical questions for now, as I’ve covered those in depth in the past.

Part II and III coming soon!

Part 1: No cameras, studios or actors required.

Part 2: Programmable Video — Automated Generation at Scale (coming soon)

Part 3: GANs & GPT-3 — will computers be able to make their own content? (coming soon)

—

When we founded Synthesia almost three years ago we had a vision to make it easier to create video content for everyone.

But not just a little easier; our goal was to introduce a paradigm shift in how we think about media production from something we record with cameras to something we program with computers.

Why?

We’re in the early days of a major paradigm shift that will radically change how we communicate.

Imagine the world went back to using pen and paper. We’d all generate 100x less text content than we do today via email, SMS and IM.

How will we communicate when it’s as easy to create video as it is to write an email or a tweet?

Humans have always been drawn towards more visual and interactive modes of communication. And we’ve come far since the first text-only newspaper was published in 1665.

I spent my early childhood on internet forums. Back then the internet was mostly text-based, with the occasional GIF, and could barely handle audio streaming, let alone video.

Today, the majority of our online experiences are video driven. We do video calls on Zoom, learn new skills on Udemy, entertain ourselves on YouTube, Twitch or Instagram and the list goes on.

Today’s fastest growing social network, TikTok, is arguably the first major social platform that isn’t just video-first, but more or less video-only — there’s barely any text in the interface.

From research we know that many people respond better to video than they do to text-based information. More senses are activated and our understanding, engagement and retention of the content increases. Conveying ideas and emotions is often easier to do visually than in text.

Yet, only a fraction of the world’s information and ideas are available as video today. Companies are struggling to keep up with the demand for video content. The speed of consumption is massively outpacing the speed and cost of production.

We have made tremendous progress on making it easier to capture video — we literally all walk around with high-definition cameras in our pockets. But our technology for synthesising video via digital processes hasn’t advanced much outside of Hollywood.

Video production is still costly, complex and tied to the physical process of recording.

Our mission is to reduce the entire video production process of cameras, actors and studios to code — enabling humans and algorithms to easily express themselves in video.

Once media production has been abstracted away to an application layer, used via a simple markup language, human creativity will re-wire the online (and offline) experience. Likely in ways we haven’t even thought of yet.

CSS enabled beautiful layouts and graphics. JavaScript gave us interactive websites. Smartphones made everything accessible at all times and equipped us with sensors to capture the world around us. ML analyzes and personalizes our digital experiences at scale.

Synthetic, programmable media will be woven into the fabric of every digital experience in the future, from the personalized news we’ll be watching, to our interactive AI teachers, and eventually to the Hollywood blockbuster made by your favourite YouTuber from his or her bedroom.

There’s obviously still a long way to go before we’re making Hollywood films on laptops. Enabling anyone, everywhere to create compelling visual content is an enormous technical challenge. But we’ll get there, and probably faster than we think.

Just how far are we? Here’s a demo of Synthesia, the world’s first text-to-video platform that works in more than 50 languages.

Simply select one of our built-in avatars (or upload yourself), type in your script, change the graphics and click generate. Your video is ready in minutes.

This new, AI-driven workflow dramatically speeds up the video creation process and cuts costs by a factor of roughly 1,000 compared to camera-based production.

Synthetic video in the next three years

I think of synthetically generated content in three layers of automation, each with its own use cases, scalability and degree of human/machine interaction. Today most tools operate in the first layer, but as companies release public APIs (we just did!) we’ll see use cases proliferate rapidly.

Linear Video [accelerating video production as we know it]
Video content as we know it today, but produced using AI. The price, complexity and time required to make content will drop massively. Making simple video content for training, corporate communications and marketing will become a desk job. Videos are composed by humans using AI tools to synthesize video and audio in any language.
Programmable Video [scripted, scalable video production]
Algorithms enable the production of videos at scale. The templates will be designed by humans, but algorithms will compose videos automatically in the millions each day. Personalized videos, data-driven news bulletins and scalable video chatbots will emerge as videos can be automatically generated via an API.
Fully generative video [generative content production]
As text-generation models like GPT-3 and other ‘creative’ AI tools mature and become useful, algorithms will be able to create content somewhat autonomously. People love to speculate on the sci-fi end of the spectrum here, and I’m sure we’ll see some crazy things, but I think the actual business impact will be less exciting. For example, imagine an AI bot that automatically converts written articles into short videos by summarizing the text, transforming it into spoken language and pulling in matching images. And does so in 40 languages.

This post focuses on the first layer: massively scaling up the production process for the type of video content we are familiar with today.

Why traditional video production will never scale

Video production is an extremely fragmented and multidisciplinary process. It can be divided into roughly four steps: creative (script) + camera capture (recording session) + post production (video editing) + distribution (video hosting).

Even a simple corporate video production can easily cost $5,000 or more. While the cost is easy to understand, the complex workflow around video creation is less well understood by non-practitioners.

1 — Cost, time and complexity

Writing scripts, renting a studio, hiring actors, renting cameras, lighting and sound equipment, and doing post-production make the video recording process expensive, time-consuming and complex. It involves many stakeholders, and it takes weeks, if not months, to make even very simple video content.

Even a bare-bones video, say of an executive giving a monthly update, requires you to get the right people in the room, finalize the script, record several takes, upload the footage from the camera, do some post-production and render before you have your video. It’s a long, drawn-out process that requires lots of compute, storage and working hours.

2 — Video is linear and can’t be edited

Once you’ve made a video the traditional way, it’s set in stone: if you misspoke or want to add or remove something, you’re out of luck. So you need to have your script nailed down before you start recording. You can’t creatively experiment or update the messaging once it’s been shot, unlike text and images, which can be dynamically altered.

3 — Camera anxiety

Most people don’t like being on camera. They want to look good, sound confident and not come off as stiff. Performing even a simple message to the camera can be hard without doing many takes and can take up a lot of mental space for whoever is due to be recorded.

If you do have someone internally who likes to be on camera you need to schedule recordings, creating time lag and additional complexity. And what do you do if that person stops working for the company?

4 — Difficult to translate

If you are a global company creating multilingual content you also have the added complexity of translation and subtitles/dubbing/reshoot. You’re faced with the tradeoff between using a presenter, which is much more engaging, or sticking to text or voiceover which is easier to translate but not as engaging.

—

The communication chasm

Companies create much less video than they’d like to. Due to the complexity of production, video is still seen as a project-based activity rather than a continuous process like writing email, documents or working with slide decks.

This creates a communication chasm. In 2020 text-based communication just isn’t the ideal way to communicate with employees or customers.

In the retail sector, for example, you are dealing with high employee turnover, many part-time workers and a workforce that might be less literate than average. In these cases, sending a four-page PDF document over email is not an effective way to communicate.

A client reported that store workers either didn’t read the documents or skimmed them quickly without understanding them. Now, using videos that are shown at the team meeting and sent via email, there’s a much higher degree of information retention amongst the workforce.

It’s deeply interesting to me just how much the medium shapes the message. I predict we’ll see a fundamental shift in how we communicate over the next decade, and while it might seem strange today, I truly believe we will see an explosion of video/audio based communication when it’s as easy to create as writing an email.

Enter synthetic video: digital and programmable

Synthetic video offers an entirely different and fully digital workflow. Reducing the entire production process to a few clicks or an API call dramatically simplifies the workflow and opens up entirely new opportunities.

In August we released Synthesia CREATE, the world’s first text-to-video platform that works in 39 languages.

Simply select one of our built-in avatars (or upload yourself through a simple recording process), type in your script, change the graphics and click generate.

Your video is ready in minutes, and no one has to leave their computer or depend on physical processes or other people’s calendars. With pricing that starts at $30/month, synthetic video is a truly accessible tool.

Make your own video for free here!

2020: Linear video production at scale.

This new, AI-driven workflow dramatically speeds up the video creation process and cuts costs by a factor of roughly 1,000 compared to camera-based production.

Some of our clients use the Synthesia platform to create videos end-to-end and others use Synthesia as a ‘virtual camera’ to generate live-action sequences that are then further edited in Adobe Premiere or other traditional editing suites.

For linear video production (the kind of video we are familiar with today), the implications of synthetic video are far-reaching, particularly in marketing, corporate communications and training:

Create synthetic videos on-demand as events happen.
Multilingual corporations can make videos in 39 different languages.
Not happy with your script? Simply edit and re-render your video in a few minutes.
Want to communicate in foreign languages? Click a button and we’ll translate your video.
Create videos at the speed with which you create slide decks.
The whole production process can be handled by one person.
New policy or regulation? Load up your original video, edit the script and re-render the video to reflect changes.
Make senior executives into AI presenters and relieve them of having to take time out of their calendars for tedious video shoots.
Batch-generate videos via our API.
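
The "edit the script and re-render" items above work because a synthetic video is described by data rather than by footage. Here is a minimal sketch of that idea in Python. It assumes a hypothetical VideoSpec record: the field names, avatar name and video ids are made up for illustration and are not Synthesia's actual API. Each spec is hashed, and only the videos whose spec has changed since the last render are queued again.

```python
import hashlib
import json
from dataclasses import dataclass, asdict, replace


@dataclass(frozen=True)
class VideoSpec:
    """Hypothetical description of one video; field names are illustrative."""
    avatar: str
    language: str
    script: str


def fingerprint(spec: VideoSpec) -> str:
    """Stable hash of a spec, so any change to it can be detected."""
    payload = json.dumps(asdict(spec), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def needs_rerender(specs: dict[str, VideoSpec], rendered: dict[str, str]) -> list[str]:
    """Return ids of videos whose spec differs from the last rendered version."""
    return sorted(
        video_id
        for video_id, spec in specs.items()
        if rendered.get(video_id) != fingerprint(spec)
    )


# Two videos that have already been rendered once.
specs = {
    "policy-en": VideoSpec("anna", "en", "Our returns policy lasts 30 days."),
    "welcome-en": VideoSpec("anna", "en", "Welcome to the team!"),
}
rendered = {video_id: fingerprint(spec) for video_id, spec in specs.items()}

# The policy changes: edit the script text and keep everything else.
specs["policy-en"] = replace(specs["policy-en"], script="Our returns policy lasts 60 days.")

print(needs_rerender(specs, rendered))  # ['policy-en']
```

In a real pipeline the ids returned by needs_rerender would be handed to whatever rendering service is in use. That step is left out here because it depends on the provider.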

Turning text into video

While synthetic video can certainly replace some of the content you would otherwise have recorded with a camera, we don’t view synthetic video as a replacement for normal video.

It’s a fundamentally new medium that will expand, rather than replace, the video market. As the price to create a video drops to a few dollars it will allow companies to create much more content than they otherwise would have done.

What would otherwise have been text or slides is now being made into video with very little additional overhead. And this is actually where synthetic video shines: not as a replacement for normal video production, but as a replacement for text.

Some of our customers are accelerating their existing video production and are now creating 10x as much content as they did before, with synthetic content sitting alongside their traditional video content.

Others have never produced video before they started using Synthesia and are now publishing entirely synthetic content.

For example we’ve helped a local youth educator in Brazil create a video course for software development and helped another customer train restaurant workers who are currently restricted from doing in-person training due to the pandemic.

Next up: Programmable, personalized video content.

In the next part of this series I will cover the power of creating videos through an API, moving from making videos easier to create to making the process entirely hands-off. Once scripts are in control of making videos, the scale is unprecedented. With a little bit of programming, everything can be turned into a personalized video.
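
To make "a little bit of programming" concrete, here is a rough sketch of how personalization could work: a human writes one script template, and code fills it in once per row of data. The request shape (avatar, language, script) and the sample rows are assumptions made up for this example, not a real API payload. Only Python's standard string.Template is used.

```python
from string import Template

# Hypothetical script template; the $-placeholders are filled per recipient.
SCRIPT = Template("Hi $name, your $product order ships on $date.")


def build_requests(rows, avatar="anna", language="en"):
    """Turn data rows into one render request (a plain dict) per recipient.

    The request shape is an assumption for illustration, not a real API payload.
    """
    return [
        {"avatar": avatar, "language": language, "script": SCRIPT.substitute(row)}
        for row in rows
    ]


# Made-up sample data; in practice this would come from a CRM or database.
rows = [
    {"name": "Ada", "product": "laptop", "date": "Monday"},
    {"name": "Linus", "product": "keyboard", "date": "Friday"},
]

for request in build_requests(rows):
    print(request["script"])
```

Each dict would then be sent to a video generation endpoint. The same loop handles two rows or two million, which is where the scale comes from.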

https://www.youtube.com/watch?v=1ogbUbzuYUY

Follow me on Twitter or subscribe to our newsletter to get notified when the next part is released.