Using LangChain and GPT to chat with your BigQuery data


We previously introduced how to use LangChain and GPT to analyze CSV data. That setup let us simply ask ordinary, human-like questions, and the LLM would write code to analyze the CSV file.

LangChain was used to implement an agent concept: the agent took our questions, reformulated them, sent them to GPT, and asked GPT to write code to analyze the file.

Now, what if we don't want to export CSV files anymore, but instead want to ask questions directly against our database, e.g. our modern data warehouse?

Why, you might ask? Well, the main use case is quite big: if you have business data in your central warehouse, or in any database, your business users are most likely very interested in that data. Today, one needs to know SQL to interact with it, making it necessary for business users to contact the data team even for very rudimentary questions. A tool that allows business users to interact with databases in simple, human-like language would drastically reduce the data team's workload and increase business productivity.

Well, LangChain conveniently provides an agent for exactly this kind of task, allowing us to implement such a function with very little Python code.

Using LangChain and GPT to query BigQuery data

As previously discussed, LangChain is a set of tools for connecting Large Language Models (LLMs) like GPT to real-world tools. These might be Python code, calculators, documents, or the internet.

Additionally, it provides a powerful agent concept: agents take our prompts and questions and ask the LLM which tools to use. So we don't need to create our own fixed chains of questions to the LLM; the agent itself, sparring with the LLM, decides which tools to use and which prompts to send.

Simplified LangChain agent concept

The agent therefore executes all the back and forth between itself and the LLM until our question is answered. (Note that an iteration limit is included to prevent endless loops: if the answer is not found after 10 iterations, execution stops.)
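If you want to control that limit yourself, here is a minimal sketch. It jumps ahead to the agent we build later in this guide, and whether create_sql_agent forwards max_iterations to the underlying AgentExecutor depends on your LangChain version, so treat the parameter name as an assumption and check your version's signature.

# Sketch: capping the agent's iterations explicitly (assumes your LangChain
# version lets create_sql_agent forward max_iterations to the AgentExecutor).
agent_executor = create_sql_agent(
    llm=llm,                # the LLM we set up later in this guide
    toolkit=toolkit,        # the SQLDatabaseToolkit we set up later in this guide
    verbose=True,
    max_iterations=10,      # stop after at most 10 reasoning/tool steps
)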

LangChain offers basic, general-purpose agents, but also several agents optimized for specific use cases. Conveniently for us, it provides an SQL Database Agent, which is optimized for building database-access chats. This database agent is equipped with the following tools out of the box:

  • QuerySQLDataBaseTool: Tool for querying the database
  • InfoSQLDatabaseTool: Tool for getting metadata about a SQL database
  • QueryCheckerTool: Tool to check whether a query is correct. This is actually quite a nice tool: it asks the LLM to check the query for validity and potentially correct it
  • ListSQLDatabaseTool: Tool for getting tables names

It is worth noting that these tools use SQLAlchemy as the database abstraction layer, so any database supported by SQLAlchemy can in theory be queried with the SQL Database Agent.
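Because the abstraction layer is SQLAlchemy, the only BigQuery-specific part of our setup will be the connection string. For illustration, here are a few connection URL patterns; the BigQuery form is the one we use below, the others are standard SQLAlchemy formats you would pair with the matching driver package, and the hosts and credentials are placeholders.

# Placeholder connection strings: swap in your own projects, hosts and credentials.
bigquery_url = "bigquery://my-project/my_dataset?credentials_path=/path/to/key.json"
postgres_url = "postgresql+psycopg2://user:password@db-host:5432/analytics"
sqlite_url = "sqlite:///local_copy.db"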

One note before we start: the context window of LLMs, and especially of GPT-3.5, is quite limited. While the agents seldom send the data itself to the LLM, they quite often need to inspect the dataset and table schemas. So if you have a lot of tables in the dataset, or a very wide table you want to query, it might be too much for the context window. I therefore suggest creating a dedicated dataset as well as a dedicated, well-groomed, and narrow table to use for interaction with LLMs.
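If you want to follow that advice, one way is to expose only the relevant columns through a view in a dedicated dataset. Below is a minimal sketch using the google-cloud-bigquery client; the view name churn_slim and the column list are hypothetical, so adjust them to your own data.

from google.cloud import bigquery

# Hypothetical example: expose a narrow view for the LLM to query.
client = bigquery.Client.from_service_account_json(
    "/path/to/your/service-account-key.json"
)

view = bigquery.Table("devopsandmore.langchain_test.churn_slim")  # hypothetical view name
view.view_query = """
    SELECT customerID, gender, InternetService, Contract, MonthlyCharges, Churn
    FROM `devopsandmore.langchain_test.churn_table`
"""
client.create_table(view, exists_ok=True)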

(By the way, this limitation also shows that Data Engineers have quite a bright future ahead. While analyzing data using LLMs and agents will get stronger and stronger, you still need very well-prepared data in the first place. Good for our Data Engineers.)

Step by step guide for using LangChain and GPT to chat with BigQuery

Now that we have established the basic architecture of what we want to build, this section contains detailed instructions on how to build this chat-with-BigQuery solution.

Prerequisites

This article and guide is mainly about how to create a chat interface for Google BigQuery, rather than about Google BigQuery and Google Cloud management itself. We therefore assume you have set up a Google Cloud project and activated billing for it.

Furthermore, you should have enabled the following Google Cloud Project APIs:

  1. IAM API
  2. BigQuery API

For authentication:

  1. Create a service account with the following BigQuery roles:

    • BigQuery User
    • BigQuery Data Viewer
    • BigQuery Job User
  2. Download a service account key (JSON)

If you want to follow this guide step-by-step:

  1. Download the dataset from my website. The dataset contains information about churned users of a telecom provider.
  2. Create a new BigQuery dataset and import the CSV file as a BigQuery table (a programmatic sketch is shown after the note below).

Note: These last two steps are optional, as you may simply use your own data to follow this guide.
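If you prefer to do the import programmatically rather than through the BigQuery console, here is a small sketch using the google-cloud-bigquery client; the local file name churn.csv is an assumption, so use whatever you downloaded.

from google.cloud import bigquery

client = bigquery.Client.from_service_account_json(
    "/path/to/your/service-account-key.json"
)

# Create the dataset if it does not exist yet.
client.create_dataset("langchain_test", exists_ok=True)

# Load the downloaded CSV into a new table, letting BigQuery infer the schema.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row
    autodetect=True,
)
with open("churn.csv", "rb") as f:  # hypothetical local file name
    load_job = client.load_table_from_file(
        f, "devopsandmore.langchain_test.churn_table", job_config=job_config
    )
load_job.result()  # wait for the load job to finish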

Implementation

  1. Install the required dependencies:

    pip install langchain openai chromadb tiktoken tabulate sqlalchemy sqlalchemy-bigquery google-cloud-bigquery
  2. Let's import the modules, define the table you want to read, and create an SQLAlchemy connection string:

    from google.cloud import bigquery
    from sqlalchemy import *
    from sqlalchemy.engine import create_engine
    from sqlalchemy.schema import *
    import os

    from langchain.agents import create_sql_agent
    from langchain.agents.agent_toolkits import SQLDatabaseToolkit
    from langchain.sql_database import SQLDatabase
    from langchain.llms.openai import OpenAI
    from langchain.agents import AgentExecutor

    service_account_file = "/path/to/your/service-account-key.json" # Change to where your service account key file is located
    project = "devopsandmore"
    dataset = "langchain_test"
    table = "churn_table"
    sqlalchemy_url = f'bigquery://{project}/{dataset}?credentials_path={service_account_file}'
  3. Get an OpenAI key from the OpenAI platform and set it as an environment variable:

    os.environ["OPENAI_API_KEY"] = "sk-secretxxxxx"
  4. Set up the LangChain agent:

    db = SQLDatabase.from_uri(sqlalchemy_url)
    llm = OpenAI(temperature=0, model="text-davinci-003")

    toolkit = SQLDatabaseToolkit(db=db, llm=llm)

    agent_executor = create_sql_agent(
        llm=llm,
        toolkit=toolkit,
        verbose=True,
        top_k=1000,
    )

    Setting verbose=True will provide us with the agent's chain of thought, so we can see all the intermediate steps it takes. This is quite interesting for learning what's going on. If you only want the result, set this parameter to False.

    Setting top_k to 1000 will limit the SQL results to 1000 rows.

  5. Run your query

    agent_executor.run("How many male users churned? ")

Result:

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: churn_table
Thought: I should check the schema of the churn_table to see what columns I can query.
Action: schema_sql_db
Action Input: "churn_table"
Observation:
CREATE TABLE `churn_table` (
 `customerID` STRING,
 `gender` STRING,
 `SeniorCitizen` INT64,
 `Partner` BOOL,
 `Dependents` BOOL,
 `tenure` INT64,
 `PhoneService` BOOL,
 `MultipleLines` STRING,
 `InternetService` STRING,
 `OnlineSecurity` STRING,
 `OnlineBackup` STRING,
 `DeviceProtection` STRING,
 `TechSupport` STRING,
 `StreamingTV` STRING,
 `StreamingMovies` STRING,
 `Contract` STRING,
 `PaperlessBilling` BOOL,
 `PaymentMethod` STRING,
 `MonthlyCharges` FLOAT64,
 `TotalCharges` STRING,
 `Churn` BOOL
)

/*
3 rows from churn_table table:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
9732-OUYRN Female 0 True False 49 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Credit card (automatic) 19.0 918.7 False
0661-KQHNK Female 0 True True 6 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Credit card (automatic) 19.0 105.5 False
4709-LKHYG Female 0 True True 29 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Electronic check 20.0 540.05 False
*/
Thought: I should query the churn_table for the number of male users who churned.
Action: query_sql_db
Action Input: SELECT COUNT(*) FROM churn_table WHERE gender = 'Male' AND Churn = True
Observation: [(930,)]
Thought: I now know the final answer
Final Answer: 930 male users churned.

> Finished chain.

Isn't that something? We can see that the agent did some quite logical things. Keep in mind: we only asked the agent how many male users churned. No table name, no columns, etc. It was able to figure it out on its own. The output also shows us what it did:

  1. It checked the available tables
  2. Then it checked the columns as well as the first 3 rows of the table to use (as there is only one table, there isn't much to figure out, but I did some tests with multiple tables, and as long as the table names are reasonably descriptive, the agent will figure out which table to use!)
  3. Finally, it figured out which query to use and provided the correct answer!
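One practical note: if you only need the answer programmatically, run() returns the final answer as a plain string, so you can capture it instead of reading the verbose trace:

# run() returns the agent's final answer as a string.
answer = agent_executor.run("How many male users churned?")
print(answer)  # e.g. "930 male users churned."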

Let's try a slightly more advanced prompt, asking for more information at once:

agent_executor.run("""How many users churned which had internet service?
How many churned which had no internet service?
And for each of the groups, how many did not churn?""")

Result:

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input: ""
Observation: churn_table
Thought: I should check the schema of the churn_table to see what columns I can use.
Action: schema_sql_db
Action Input: "churn_table"
Observation:
CREATE TABLE `churn_table` (
 `customerID` STRING,
 `gender` STRING,
 `SeniorCitizen` INT64,
 `Partner` BOOL,
 `Dependents` BOOL,
 `tenure` INT64,
 `PhoneService` BOOL,
 `MultipleLines` STRING,
 `InternetService` STRING,
 `OnlineSecurity` STRING,
 `OnlineBackup` STRING,
 `DeviceProtection` STRING,
 `TechSupport` STRING,
 `StreamingTV` STRING,
 `StreamingMovies` STRING,
 `Contract` STRING,
 `PaperlessBilling` BOOL,
 `PaymentMethod` STRING,
 `MonthlyCharges` FLOAT64,
 `TotalCharges` STRING,
 `Churn` BOOL
)

/*
3 rows from churn_table table:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
9732-OUYRN Female 0 True False 49 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Credit card (automatic) 19.0 918.7 False
0661-KQHNK Female 0 True True 6 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Credit card (automatic) 19.0 105.5 False
4709-LKHYG Female 0 True True 29 True No No No internet service No internet service No internet service No internet service No internet service No internet service One year False Electronic check 20.0 540.05 False
*/
Thought: I can use the Churn and InternetService columns to answer the question.
Action: query_checker_sql_db
Action Input: SELECT COUNT(*) AS num_users, InternetService, Churn FROM churn_table GROUP BY InternetService, Churn LIMIT 1000
Observation:

SELECT COUNT(*) AS num_users, InternetService, Churn
FROM churn_table
WHERE InternetService IS NOT NULL
AND Churn IS NOT NULL
GROUP BY InternetService, Churn
LIMIT 1000
Thought: This query looks correct, I can now execute it.
Action: query_sql_db
Action Input: SELECT COUNT(*) AS num_users, InternetService, Churn FROM churn_table WHERE InternetService IS NOT NULL AND Churn IS NOT NULL GROUP BY InternetService, Churn LIMIT 1000
Observation: [(1413, 'No', False), (1962, 'DSL', False), (1799, 'Fiber optic', False), (459, 'DSL', True), (1297, 'Fiber optic', True), (113, 'No', True)]
Thought: I now know the final answer
Final Answer: There were 1413 users with no internet service who did not churn, 1962 users with DSL who did not churn, 1799 users with fiber optic who did not churn, 459 users with DSL who churned, 1297 users with fiber optic who churned, and 113 users with no internet service who churned.

> Finished chain.

I mean, holy cow, look at that! On the first try we get this quite elaborate result. And we can also see the query that was used to produce it:

SELECT COUNT(*) AS num_users, InternetService, Churn
FROM churn_table
WHERE InternetService IS NOT NULL
AND Churn IS NOT NULL
GROUP BY InternetService, Churn
LIMIT 1000

Considering that we simply used "words" to describe the output we want, I'm very impressed by how far LangChain and GPT were able to get!
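If you want to sanity-check the agent's numbers yourself, you can run essentially the same SQL directly through the SQLAlchemy engine the agent sits on. Here is a minimal sketch reusing the sqlalchemy_url defined earlier:

from sqlalchemy import create_engine, text

# Re-run the aggregation directly against BigQuery to verify the agent's answer.
engine = create_engine(sqlalchemy_url)
with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT COUNT(*) AS num_users, InternetService, Churn
        FROM churn_table
        GROUP BY InternetService, Churn
    """)).fetchall()
print(rows)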

From here, I'd suggest you experiment with different prompts and find out what works and what doesn't.

Summary

This guide demonstrates how to use LangChain and GPT to interact with your BigQuery data. LangChain agents create queries by analyzing natural-language questions and iterating back and forth with GPT until the answer is found. The article includes a step-by-step guide for setting up the required modules, importing data, and executing queries through the LangChain agent. By following it, users can query data and find insights with GPT without having to write SQL themselves. This is especially useful for enabling business users to interact with (well-prepared) databases.

It is very exciting to experience how far the GPT and LLM ecosystem has come and how easily we can use tools like LangChain to connect LLMs to real-world applications.

For reference, this is all the code we needed to create our chat with BigQuery application:

from google.cloud import bigquery
from sqlalchemy import *
from sqlalchemy.engine import create_engine
from sqlalchemy.schema import *
import os

from langchain.agents import create_sql_agent
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.llms.openai import OpenAI
from langchain.agents import AgentExecutor

service_account_file = "/path/to/your/service-account-key.json" # Change to where your service account key file is located
project = "devopsandmore"
dataset = "langchain_test"
table = "churn_table"
sqlalchemy_url = f'bigquery://{project}/{dataset}?credentials_path={service_account_file}'

# OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-secretxxxxx"

# Set up langchain
db = SQLDatabase.from_uri(sqlalchemy_url)
llm = OpenAI(temperature=0, model="text-davinci-003")
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(
    llm=llm,
    toolkit=toolkit,
    verbose=True,
    top_k=1000,
)

# First query
agent_executor.run("How many male users churned? ")

# Second query
agent_executor.run("""How many users churned which had internet service?
How many churned which had no internet service?
And for each of the groups, how many did not churn?""")
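And if you want this to feel more like an actual chat, a tiny read-eval-print loop around the agent is enough. Here is a minimal sketch, with error handling kept deliberately simple:

# Minimal interactive loop: type a question, get the agent's answer,
# submit an empty line to stop.
while True:
    question = input("Ask your BigQuery data: ").strip()
    if not question:
        break
    try:
        print(agent_executor.run(question))
    except Exception as exc:  # e.g. the LLM produced output the agent could not parse
        print(f"Agent error: {exc}")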

What's next?

If you don't want to only "chat" with your BigQuery data, but also want to analyze CSV files and even create plots from your CSVs, have a look at this next post in this LangChain blogging series.

If you are more interested in how to connect your very own PDF files to GPT and ask questions about them, have a look at my LangChain PDF tutorial.

And if you want to know how to rent a beefy GPU for your own LLM experiments, check out this guide.

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, such as:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use

Need assistance?

Do you have any questions about the topic presented here? Or do you need someone to assist in implementing these areas? Do not hesitate to contact me.