How to use GPT-4 to chat with YouTube videos?

In an era defined by information overload, the ability to extract knowledge effectively from huge data sources is becoming increasingly important. With the breakthroughs in artificial intelligence, OpenAI's GPT-4 has emerged as a game-changer, promising to revolutionize how we process information. In this blog post, we look into the world of YouTube, a goldmine of content, and explore how GPT-4 can be harnessed to answer any question based on the source data provided by YouTube videos.

YouTube has become a hub for sharing knowledge, with an abundance of content ranging from tutorials and lectures to interviews and panel discussions. Its massive user base and extensive video library make it an ideal platform for exploring various topics. However, unlike text, a video is hard to skim - which makes it even more difficult to find the right information in a pool of thousands and thousands of videos. Integrating GPT-4's advanced language processing capabilities with this treasure trove of YouTube data would provide tremendous productivity gains.

This is exactly the main objective of this blog post: to demonstrate how GPT-4 can leverage the vast amount of YouTube data to answer intricate questions, ultimately leading to enriched knowledge acquisition and discovery.

The potential of using GPT-4 to chat with YouTube data

YouTube has transformed into an expansive library of a wide variety of content. With billions of videos hosted on the platform, it offers an unparalleled wealth of information waiting to be tapped. Whether you're seeking tutorials on coding, in-depth lectures on quantum physics, or expert interviews on entrepreneurship, chances are you'll find it all on YouTube.

One of YouTube's distinguishing features is its incredible diversity of content types. From step-by-step tutorials that guide you through knitting patterns to university lectures unraveling complex philosophical concepts, YouTube caters to a broad range of topics and learning styles. Additionally, interviews with industry experts, panel discussions, and educational documentaries provide valuable insights and perspectives on various subjects.

Furthermore, YouTube's content creators span individuals, educational institutions, researchers, artists, and more. Each contributor brings their unique expertise, experiences, and perspectives, enabling users to access a vibrant tapestry of knowledge.

Moreover, YouTube fosters a global community, transcending geographical and cultural boundaries. Its accessibility promotes the sharing of knowledge from individuals worldwide, resulting in a rich blend of diverse perspectives. This collective knowledge, deeply rooted in personal experiences, adds depth and authenticity to the information available on YouTube.

And here lies the problem with using YouTube as a knowledge source: it's hard to extract specific information from a video compared to text. With text sources, we can skim the whole text and rather quickly extract the information we need. With videos, that's not possible. To answer a simple question, we often have to watch minutes of unrelated parts of the video.

If we use an LLM like GPT-4 on top of YouTube videos, we could provide a simple chat interface that allows us to ask the video specific questions. Each question would be sent to our LLM, which then uses the YouTube video's information to provide a summarized answer.

Connect GPT to YouTube videos

Using LangChain to connect YouTube videos to our Large Language Model

As already demonstrated in previous posts about how to connect LLMs to all sorts of real-world tools and information, we are going to use LangChain for this task.

The process is as follows:

Preparation:

  1. Create a transcript of our YouTube video
  2. Split the transcript into chunks of text (as the full video transcript will not fit the LLM's context window)
  3. Create text embeddings from the text chunks (what embeddings are and why they are useful is introduced in this introductory post)
  4. Store these embeddings in a vector store - for convenience we are using Chroma DB as our local vector store.

Query:

  1. The user asks a specific question that could be answered by the YouTube video.
  2. Create an embedding from this question and use similarity search to find the relevant text chunks of the transcript
  3. Send these chunks as well as the original prompt to GPT-3.5 or GPT-4 (or any LLM)
  4. The LLM then uses these chunks to find a potential answer for the query

Memory:

  1. Finally, we also want to add chat memory to our application - so instead of only asking "one-off" questions like "What is xyz?", we can engage in a conversation, asking follow-up questions and providing feedback to the LLM.
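
To make these steps concrete, here is a rough, minimal sketch of what this pipeline could look like if built by hand against the (pre-1.0) OpenAI Python client. The chunk sizes, prompts, and the plain Python list standing in for a real vector store are illustrative assumptions only, not the implementation we build later:

# Hand-rolled sketch of the pipeline (illustrative only): chunk a transcript,
# embed the chunks, retrieve the most similar ones for a question, ask the model.
# Assumes the pre-1.0 openai package and an OPENAI_API_KEY environment variable.
import numpy as np
import openai

transcript = "... full video transcript obtained in the preparation step ..."

# Preparation: split into overlapping chunks and embed each chunk
chunks = [transcript[i:i + 1500] for i in range(0, len(transcript), 1300)]

def embed(text: str) -> np.ndarray:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(response["data"][0]["embedding"])

chunk_vectors = [embed(chunk) for chunk in chunks]  # a real app would persist these

# Query: embed the question and select the most similar chunks
question = "What is discussed in this video?"
q_vec = embed(question)
similarity = [float(v @ q_vec / (np.linalg.norm(v) * np.linalg.norm(q_vec))) for v in chunk_vectors]
top_chunks = [chunks[i] for i in np.argsort(similarity)[::-1][:3]]

# Send the relevant chunks together with the question to the chat model
answer = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context:\n" + "\n\n".join(top_chunks) + "\n\nQuestion: " + question},
    ],
)["choices"][0]["message"]["content"]
print(answer)

The rest of this post replaces each of these hand-written pieces with a LangChain component.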

While we could develop all of that ourselves, we'll make our lives easier by using LangChain. LangChain is a Python library that aims to assist in the development of applications using large language models. It allows us to connect LLMs with other sources of computation or knowledge. LangChain helps with various tasks such as question answering, chatbots, agents, and data-augmented generation. It provides a standard interface for working with LLMs and chains, integration with other tools, memory management, and evaluation support. The library offers comprehensive documentation and examples to guide users through the process of building applications with LLMs. In short, it makes integrating LLMs with real-world tools, data, and applications quite easy.

LangChain architecture to connect LLMs to video sources

Transcribing the video

As the above sketch outlines, transcribing the video is at the heart of how this system works. We have multiple options here - we could use the YouTube transcript API or basically any transcription service out there.
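
If a video already has captions, a lightweight alternative is the youtube-transcript-api package. A minimal sketch, using one of the video IDs from later in this guide (note that the exact API surface may differ between package versions):

# Fetch existing captions instead of transcribing the audio ourselves.
# Requires: pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi

video_id = "_v_fgW2SkkQ"  # the part after "youtu.be/" in the video URL
segments = YouTubeTranscriptApi.get_transcript(video_id)  # list of {"text", "start", "duration"}
transcript = " ".join(segment["text"] for segment in segments)
print(transcript[:500])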

However, as we'll use OpenAI's GPT-3.5 as our LLM, we're also using their Whisper model for transcription. I found it to be one of the best transcription services currently available.

Step-by-step guide to connect YouTube video sources to LLMs

  1. Install the dependencies

    sudo apt-get install youtube-dl ffmpeg
    pip install --upgrade langchain openai yt_dlp pydub chromadb
  2. Get the YouTube video URL of one or many videos you want to ask questions about. For the sake of this guide, we are using two videos explaining LangChain (because GPT itself knows nothing about LangChain - so we can verify whether our data is actually used as the source for GPT's answers):

    • https://youtu.be/_v_fgW2SkkQ
    • https://youtu.be/2xxziIWmaSA
  3. Load required Python modules and set up the video loading infrastructure

    from langchain.document_loaders.generic import GenericLoader
    from langchain.document_loaders.parsers import OpenAIWhisperParser
    from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

    from langchain.vectorstores import Chroma
    from langchain.chat_models import ChatOpenAI
    from langchain.embeddings.openai import OpenAIEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.chains import ConversationalRetrievalChain
    import os

    os.environ['OPENAI_API_KEY'] = "sk-secret"

    # URLs of the videos to use as data source
    video_urls = ["https://youtu.be/_v_fgW2SkkQ", "https://youtu.be/2xxziIWmaSA"]

    # Directory where to save the video audio files
    download_folder = "/home/andreas/Downloads/temp/YT/"
  4. Transcribe the videos into chunks of text. This step is hilariously easy and really shows where LangChain shines - a very smart abstraction of complexity. Under the hood, these two lines of code download the videos, extract the audio, split it into chunks that fit Whisper's input limits, and use OpenAI's Whisper model to transcribe the audio.

    # Transcribe the videos to text
    loader = GenericLoader(YoutubeAudioLoader(video_urls, download_folder), OpenAIWhisperParser())
    docs = loader.load()

    docs contains the transcribed video documents.

  5. Combine the split text chunks and use the RecursiveCharacterTextSplitter to split them again into chunks which we can use for our GPT inference. Why combine them first and split again? Whisper and our LLM have different input limits, so we combine the transcribed text chunks and then split them again - this lets us control the chunk sizes we send to the LLM.

    # Combine docs, as they were split during the transcription process to fit Whisper's input limit
    combined_docs = [doc.page_content for doc in docs]
    full_transcript = " ".join(combined_docs)

    # Split the text into smaller chunks for indexing
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
    pages = text_splitter.split_text(full_transcript)
  6. Create a Chroma DB vector store with local persistence and create a LangChain retriever object. The search_kwargs parameter defines how many document chunks we want to add to the prompt.

    # Create an index with local persistence
    directory = 'index_store'
    vector_index = Chroma.from_texts(pages, OpenAIEmbeddings(), persist_directory=directory)
    vector_index.persist()

    # Expose the persisted Chroma index as a retriever
    retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})
  7. Initialize the chat history list (which we'll use later to store our chat history, or memory) and the conversation chain - which is the LangChain interface we'll use to ask GPT-3.5-turbo questions about our videos.

    # Initialize chat history
    chat_history = []
    conv_interface = ConversationalRetrievalChain.from_llm(ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0), retriever=retriever)

That's all the preparation work we need. From now on, it's query time. The ConversationalRetrievalChain abstracts the complexities of:

  • creating embeddings from the query/prompt
  • executing a similarity search on our vector index to find the chunks of text that are most relevant to answering our question
  • sending a refined prompt to GPT-3.5 and returning the answer
query = "What is langchain?"
result = conv_interface({"question": query, "chat_history": chat_history})
result["answer"]

Note: Why do we need these embeddings, vector stores and retrievers? Why not send the full text to GPT? Well, GPT-3.5 currently has a context window of 16,000 tokens - meaning we can only send texts of up to 16,000 tokens to the model (keeping in mind that this limit covers both the question and the answer, so we can send even less). Splitting the source text into reasonable chunks and using embeddings + similarity search allows us to send only the most relevant chunks of text to the model, thereby kind of circumventing the context window.
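
To get a feeling for these numbers, we can count tokens with OpenAI's tiktoken library. A small sketch, assuming the full_transcript and pages variables from the steps above (tiktoken needs to be installed separately):

# Rough token count of the full transcript vs. the chunks actually sent to the model
# (sketch; assumes `full_transcript` and `pages` from the steps above)
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
print("Full transcript tokens:", len(encoding.encode(full_transcript)))

# With k=6 chunks of ~1,500 characters each, the retrieved context stays well below the limit
selected = pages[:6]
print("Tokens of 6 chunks:", sum(len(encoding.encode(p)) for p in selected))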

The output of our query:

'Langchain is a framework for developing applications powered by language models. It helps make the complicated parts of working and building with AI models easier by providing integration with external data and allowing language models to interact with their environment through decision-making. It offers components and tools to work with language models and customize chains, and it has a fast speed and a supportive community.'

As we can see, GPT-3.5 suddenly knows a lot about LangChain - despite LangChain not being part of its training data.

Let's add this conversation to the chat history and engage in a conversation:

chat_history = [(query, result["answer"])]
query = "Can you provide a longer answer to the previous question?"
result = conv_interface({"question": query, "chat_history": chat_history}, include_run_info=True)
result["answer"]

The output (shortened):

1"Langchain is a framework for developing applications powered by language models. It aims to make the complicated parts of working and building with AI models easier. It does this through integration, allowing you to bring in external data such as files, other applications, and API data to your language models. [...] in their applications."

And again, a perfect answer. This time we make use of the chat history: instead of typing out our question again, we simply ask GPT-3.5 to refine its answer.

Last but not least, let's query another, slightly more complex topic:

chat_history.append((query, result["answer"]))
query = "What are chains in LangChain?"
result = conv_interface({"question": query, "chat_history": chat_history})
result["answer"]

The output:

'Chains in LangChain refer to a sequence of steps or actions that are executed in a specific order. These chains can be predefined, where the steps are predetermined, or they can be dynamic, where the steps depend on user input or other factors. Chains allow for the customization and flexibility of language models in LangChain, enabling developers to create complex and interactive applications.'

Perfect answer for a quite complex topic!

What happened under the hood?

The ConversationalRetrievalChain provides a very powerful interface, as it allows us to seamlessly add historical context, or memory, to our chain. Under the hood, LangChain executes two prompts and a vector store retrieval:

  • The first prompt is used to summarize the whole memory as well as the new query into a single standalone question.
  • This new question is then used to find the most relevant pieces of information in the vector store.
  • Finally, the relevant chunks of information, as well as the refined question, are sent to the LLM for the final answer.

This not only allows us to answer questions about documents - it also lets us refine those questions and hold a full conversation based on them, as well as on previous answers.
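
To make this more tangible, here is a simplified, hand-rolled sketch of the same flow, reusing the retriever and chat_history objects from above. The prompt wording is an illustrative stand-in for LangChain's internal prompt templates, not the exact ones the chain uses:

# Sketch of what ConversationalRetrievalChain does internally (simplified).
# Assumes `retriever` and `chat_history` from the steps above.
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

history_text = "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)
follow_up = "Can you provide a longer answer to the previous question?"

# 1. Condense the chat history and the follow-up into a single standalone question
standalone_question = llm.predict(
    f"Given the following conversation:\n{history_text}\n\n"
    f"Rephrase this follow-up as a standalone question: {follow_up}"
)

# 2. Retrieve the most relevant transcript chunks for the standalone question
relevant_docs = retriever.get_relevant_documents(standalone_question)
context = "\n\n".join(doc.page_content for doc in relevant_docs)

# 3. Answer the standalone question using only the retrieved context
answer = llm.predict(
    f"Answer the question based on the context below.\n\n"
    f"Context:\n{context}\n\nQuestion: {standalone_question}"
)
print(answer)

The ConversationalRetrievalChain wires these three steps together for us and reduces the interface to a single call.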

Addendum: Loading the vector store from disk

In the previous section, we stored our embeddings in a Chroma DB vector store and persisted it to disk. To load the vector store in a later session, we can execute the following code snippet:

vector_index = Chroma(persist_directory=directory, embedding_function=OpenAIEmbeddings())
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})

Summary

This blog post explores the potential of using OpenAI's GPT to extract knowledge from YouTube videos effectively. While YouTube is a vast source of diverse content, extracting specific information from videos is challenging. By integrating GPT with YouTube data, users can ask questions and receive summarized answers, leading to optimized knowledge acquisition and discovery.

The post highlights YouTube's extensive video library, catering to various topics and learning styles, ranging from tutorials and lectures to interviews and panel discussions. YouTube's global community fosters knowledge sharing, providing a wide range of perspectives and authentic information.

To connect YouTube videos with GPT, the blog post introduces the use of LangChain, a Python library that simplifies the integration of large language models with real-world tools. LangChain allows the creation of a chat interface to query specific questions from YouTube videos. The process includes transcribing the videos, splitting the transcript into text chunks, creating embeddings from the chunks, and storing them in a vector store.

Using LangChain, the blog post presents a step-by-step guide to connect YouTube videos to GPT. It covers the installation of dependencies, obtaining video URLs, setting up the video loading infrastructure, transcribing the videos using the Whisper AI for speech-to-text, creating text chunks from the transcripts, and configuring the system to interact with GPT. The guide demonstrates querying the videos with questions, refining prompts, and engaging in a conversation with GPT.

The full code example is found below:

from langchain.document_loaders.generic import GenericLoader
from langchain.document_loaders.parsers import OpenAIWhisperParser
from langchain.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader

from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import ConversationalRetrievalChain
import os

os.environ['OPENAI_API_KEY'] = "sk-secret"

# URLs of the videos to use as data source
video_urls = ["https://youtu.be/_v_fgW2SkkQ", "https://youtu.be/2xxziIWmaSA"]

# Directory where to save the video audio files
download_folder = "/home/andreas/Downloads/temp/YT/"

# Transcribe the videos to text
loader = GenericLoader(YoutubeAudioLoader(video_urls, download_folder), OpenAIWhisperParser())
docs = loader.load()

# Combine docs, as they were split during the transcription process to fit Whisper's input limit
combined_docs = [doc.page_content for doc in docs]
full_transcript = " ".join(combined_docs)

# Split the text into smaller chunks for indexing
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
pages = text_splitter.split_text(full_transcript)

# Create an index with local persistence
directory = 'index_store'
vector_index = Chroma.from_texts(pages, OpenAIEmbeddings(), persist_directory=directory)
vector_index.persist()

# Expose the persisted Chroma index as a retriever
retriever = vector_index.as_retriever(search_type="similarity", search_kwargs={"k":6})

# Initialize chat history
chat_history = []
conv_interface = ConversationalRetrievalChain.from_llm(ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0), retriever=retriever)

# Query time
query = "What is langchain?"
result = conv_interface({"question": query, "chat_history": chat_history})
print(result["answer"])

chat_history = [(query, result["answer"])]
query = "Can you provide a longer answer to the previous question?"
result = conv_interface({"question": query, "chat_history": chat_history}, include_run_info=True)
print(result["answer"])

What's next?

If you want to use Language Models to chat with your BigQuery data, have a look at this next post.

If you are more interested in how to connect your very own PDF files to GPT and ask questions about them, have a look at my LangChain PDF tutorial.

And if you are interested in how to utilize GPT-3.5/4 to automate your data analytics, have a look at this CSV analytics guide.

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, such as:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use

Need assistance?

Do you have any questions about the topic presented here? Or do you need someone to assist in implementing these areas? Do not hesitate to contact me.