
7 Generative AI Startups That Are Building the Jasper for Audio/Video 📹

Text is cool, but audio & video unlock even more value

In the past few weeks, the new category of generative AI has justifiably been put in the spotlight. The ability of unsupervised AI/ML models to generate new types of content is unprecedented and marks a significant leap forward. Most articles have focused on the initial use cases of text and image generation (GPT-3, DALL-E 2, Imagen, Stable Diffusion). Less has been said about the effects of ML on audio and video.


Is the hype around Generative AI justified?

The popularity of generative AI in the startup world is driven both by new AI models announced in recent weeks and by the success of the first wave of companies built on top of AI models:

  • Jasper announced $100M in revenue and a $1.5B valuation for their AI marketing tool, all achieved in 18 months. They built a text-to-text application that generates marketing content for text-based channels (sales emails, marketing copy) by building on top of the GPT-3 engine. This shows that a unicorn company can be built without owning the underlying “black box”.
  • Stability AI raised $101M for their AI platform for image generation. This platform will allow the next wave of companies to leverage their model to produce text-to-image output. More companies like Jasper, which build out a vertical use-case application, will emerge.

VCs are declaring these technologies the next big technological breakthrough that will unlock new opportunities, claims which I completely agree are warranted. See the footnotes to read manifestos published by Bessemer, Sequoia, NFX and Elad Gil with their predictions. We’ll see many more successes pop up around this in the next 5 years.

My top 7 bets on companies that are well positioned to become the “Jasper for audio/video” and realize similar success:

1. WellSaid (text-to-audio)

Founded in 2018 | Status: Series A | Raised $10M

WellSaid Labs is working on text-to-audio conversion for B2B voiceover usage. They offer a catalog of unique voices that companies can use to create new audio pieces. Customers can also upload their own custom voices and use an API to access these capabilities programmatically. This tech has the potential to disrupt the traditional voiceover and dubbing markets.

Examples of WellSaid voices that are generated automatically

2. Resemble.AI (text-to-audio)

Founded in 2018 | Status: Seed | Raised $4M

Resemble are focusing their text-to-audio capabilities on capturing the full range of human emotion (their whispering voice is pretty good!). Another product focus is localization: record a single voice and hear it played back in any language through their neural text-to-speech engine. One of their interesting projects is the voiceover in Netflix's documentary, The Andy Warhol Diaries.

Examples of Resemble.AI voices that are generated automatically

3. Runway (text-to-video)

Founded in 2018 | Status: Series B | Raised $45M

Runway is an interesting ML company working on various content creation challenges in the text, photo and video world. In the video world, they are working on a text-to-video model (waitlist) and a tool that lets users mask objects in videos. Masking is a great example of a day-to-day task for video creators that can be easily automated.

Example of masking an object in a video - extracting and pasting on a new background

4. HourOne (text-to-video)

Founded in 2019 | Status: Series A | Raised $25M

HourOne is building a text-to-video model using virtual presenters, like a newscaster in a studio reading out the news. This replaces the need for humans to record video while achieving a similar effect to presenter-led videos. This can create a lot of value for training courses and news publishers.

Example of HourOne dynamically created videos

5. Synthesia (text-to-video)

Founded in 2017 | Status: Series B | Raised $66M

Likely one of the largest players in the text-to-video space, Synthesia are challenging traditional video production with their AI content generation platform. Their main use cases are training videos, how-to’s and marketing videos. They offer multiple avatars and voices and promise results within minutes, not weeks.

Behind the scenes on how Synthesia works

6. AudioLabs (audio-to-video)

Founded in 2021 | Status: Pre-Seed | Undisclosed funding

The first one in this category, AudioLabs is working on audio-to-video models. What’s different is that you don’t start with a blank canvas. Their application connects to existing audio content, such as podcasts and audiobooks, which is used as input. Video storyboards are dynamically created with AI, using underlying models such as DALL-E 2 and Meta’s Make-A-Video to generate unique assets for each scene. There is strong demand for videos that are optimized for the “recommendation-media” world as more players enter short-form content marketing (TikTok, YouTube Shorts, Reels).

Disclaimer: I work at this company, yet I stand behind my recommendations 😉

7. Descript (text-to-audio)

Founded in 2017 | Status: Series B | Raised $50M

Descript is one of the most interesting companies in this space. Their product was among the first to make editing audio as easy as deleting a word. It has since evolved to cover both video and audio editing, with dabbles of AI magic. One of their key features is text-to-audio “overdub” for filling in words or fixing mistakes made in podcasts. Their products are mostly used for post-production of audio and video content in e-learning and podcasting.

There’s a lot to be optimistic about

Generative AI is one of those macro tech shifts that really is 10x better than the alternatives. Where tech has so far given us zero-to-one solutions, this wave of ML can unlock zero-to-ten solutions that don’t just make a process a bit better, but make it orders of magnitude better. Over the next few years we’ll see these companies mature and disrupt existing solutions that don’t have these AI models embedded. Furthermore, these models are built to improve over time, giving built-in network effects to the companies that create them. To be continued.

References: