Dylan Fox is the CEO and Founder of AssemblyAI, a platform that automatically converts audio, video, and live audio streams to text using AssemblyAI's speech-to-text APIs.
What initially attracted you to machine learning?
I started by learning how to code and attended Python meetups in Washington, DC, where I went to college. Through my college courses, I found myself gravitating toward algorithmic programming problems, which naturally led me to machine learning and natural language processing.
Before founding AssemblyAI, you were a Senior Software Engineer at Cisco. What were you working on there?
At Cisco, I was a senior software engineer focusing on machine learning for their collaboration products.
How did your work at Cisco, and the problems you encountered with speech recognition technology, inspire you to launch AssemblyAI?
In previous jobs, I had the opportunity to work on a lot of AI projects, including several that required speech recognition. But all the companies offering speech recognition as a service were insanely outdated: they were hard to buy from and powered by legacy AI technology.
As I became more and more interested in AI research, I noticed how much work was being done in speech recognition and how quickly the research was improving. So it was a combination of factors that inspired me to think: "What if you could build an API company, similar to Twilio, around the latest AI research? It would make it much easier for developers to access the latest AI models for speech recognition, with a much better developer experience."
From there, the idea for AssemblyAI was born.
What is the biggest challenge behind building accurate and reliable speech recognition technology?
Cost and talent are the biggest challenges any company must face when creating accurate and reliable speech recognition technology.
Obtaining data is expensive, and you typically need hundreds of thousands of hours of audio to build a robust speech recognition system. Not only that, the compute requirements for training are enormous. Serving these models in production is also expensive, and it requires specialized talent to optimize them and make them economical.
Building these technologies also requires a hard-to-find, specialized skill set. This is a big reason why customers come to us: we research, train, and deploy powerful AI models in-house, giving them access to years of research into the latest ASR and NLP models, all through a simple API.
Beyond pure transcription of audio and video content, AssemblyAI offers additional models. Can you discuss what those models are?
Our suite of AI models extends beyond real-time and asynchronous transcription. We refer to these additional models as Audio Intelligence models, because they help customers better analyze and understand voice data.
Our Summarization model provides a comprehensive summary, as well as time-coded summaries that automatically break a conversation into "chapters" as topics change and produce a summary for each one (similar to YouTube chapters).
Our Sentiment Analysis model detects the sentiment of each sentence of speech in audio files. Each sentence in the transcript can be marked as positive, negative, or neutral.
Our Entity Detection model identifies a wide range of entities spoken in audio files, such as the names of people or companies, email addresses, dates, and locations.
Our Topic Detection model labels the topics spoken about in audio and video files. The predicted topic labels follow the standard IAB taxonomy, making them suitable for content targeting.
Our Content Moderation model detects sensitive content in audio and video files, such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.
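As a rough illustration, the models described above are typically switched on with flags in a single transcription request. The endpoint and parameter names in this sketch follow AssemblyAI's public v2 API documentation as I understand it; treat them as assumptions and verify against the current docs:

```python
import json

# Hypothetical sketch: build the JSON body for an AssemblyAI v2 transcript
# request with the Audio Intelligence models enabled. Flag names are taken
# from the public v2 API docs and should be verified before use.
API_URL = "https://api.assemblyai.com/v2/transcript"  # assumed endpoint

def build_transcript_request(audio_url: str) -> dict:
    """Return a request body enabling transcription plus intelligence models."""
    return {
        "audio_url": audio_url,
        "auto_chapters": True,       # time-coded "chapter" summaries
        "sentiment_analysis": True,  # per-sentence positive/negative/neutral
        "entity_detection": True,    # people, companies, emails, dates, locations
        "iab_categories": True,      # topic labels from the IAB taxonomy
        "content_safety": True,      # hate speech, violence, drugs, etc.
    }

payload = build_transcript_request("https://example.com/call.mp3")
print(json.dumps(payload, indent=2))
```

The appeal of this design is that one POST returns the transcript along with every enabled analysis, rather than requiring a separate pipeline per model.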
What are some of the biggest use cases for companies using AssemblyAI?
The largest use cases that companies have for AssemblyAI span four categories: telephony, video, virtual meetings, and media.
CallRail is a great example of a customer in the telephony space. It leverages AssemblyAI's AI models, including Core Transcription, Automatic Transcript Highlights, and PII Redaction, to deliver a conversation intelligence solution to its customers.
Essentially, CallRail can now automatically surface and identify key content in its customers' phone calls at scale, such as specific customer requests, frequently asked questions, and keywords and phrases. Our PII Redaction model helps detect and remove sensitive data from the transcript (such as Social Security numbers, credit card numbers, personal addresses, and more).
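A hedged sketch of what enabling PII redaction on a transcript request might look like. The flag and policy names mirror AssemblyAI's v2 API documentation as I recall it; treat them as assumptions and check the current docs before relying on them:

```python
# Hypothetical sketch: extend a transcript request with PII redaction.
# "redact_pii" and the policy names are assumptions based on AssemblyAI's
# public v2 API docs; verify them against the current documentation.
def add_pii_redaction(request: dict) -> dict:
    redacted = dict(request)  # copy so the original request is untouched
    redacted["redact_pii"] = True
    redacted["redact_pii_policies"] = [
        "us_social_security_number",  # e.g. 111-22-3333
        "credit_card_number",
        "location",                   # personal addresses
    ]
    return redacted

req = add_pii_redaction({"audio_url": "https://example.com/call.mp3"})
print(sorted(req["redact_pii_policies"]))
```

With flags like these, the returned transcript would contain placeholder tokens in place of the detected sensitive spans, so downstream systems never see the raw values.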
Video use cases range from video streaming platforms to video editors like Veed, which uses AssemblyAI's Core Transcription models to simplify the video editing process for its users. Veed lets users transcribe their videos and edit them directly with captions.
In virtual meetings, meeting transcription software companies such as Fathom use AssemblyAI to build smart features that help their users transcribe and highlight key moments from their Zoom calls, promote better meeting engagement, and eliminate tedious tasks during and after meetings (such as taking notes).
In media, we see podcast hosting platforms, for example, use our Content Moderation and Topic Detection models so they can offer better advertising tools for brand-safety use cases and monetize user-generated content with dynamic ads.
AssemblyAI recently raised a $30 million Series B round. How will this accelerate AssemblyAI's mission?
The progress being made in artificial intelligence is very exciting. Our goal is to make this progress available to every developer and product team online, via a simple set of APIs. As we continue to research and train the latest AI models for ASR and NLP tasks (such as speech recognition, summarization, language identification, and many others), we will keep offering these models to developers and product teams through simple, easily accessible APIs.
AssemblyAI is a place where developers and product teams can easily access the advanced AI models they need to build exciting new products, services, and businesses.
Over the past six months, we've launched ASR support for 15 new languages, including Spanish, German, French, Italian, Hindi, and Japanese; released significant improvements to our summarization, real-time ASR, and content moderation models; and shipped countless other product updates.
We have barely dipped into our Series A funding, but this new round will give us the ability to aggressively ramp up our efforts without compromising our runway.
With this new funding, we will be able to accelerate our product roadmap, build better AI infrastructure to speed up AI research and inference engines, and grow our AI research team, which today includes researchers from DeepMind, Google Brain, Meta AI, BMW, and Cisco.
Is there anything else you’d like to share about AssemblyAI?
Our mission is to make modern AI models available to developers and product teams at a very large scale through a simple API.
Thank you for the great interview. Readers who want to learn more should visit AssemblyAI.