How to analyze CSV data using GPT-4


Large Language Models (LLMs), and especially GPT-4, took the world by storm. This technology is the first of its kind that allows humans to interact with an advanced artificial intelligence system simply by using natural language. Breathtaking results have been achieved across many industries.

The tooling around these LLMs has also become quite intriguing. LLMs are by definition very good at interpreting language and producing the answer that is most likely to fit the prompt or question - so they are, as the name suggests, language models. This is especially true for English, but also for many other "human" languages and - interestingly - for programming languages. OpenAI and many others have done a tremendous job of training these models on code, which makes them quite well versed at answering coding questions. At their core, however, they are just language models - which makes them not very good at mathematical problems and data analytics.

You know what is good at math and data analytics? Python. So... we have a technology that is great at understanding human input and language. And it can write code. And we have a programming language that is good for data analytics. What if... we could ask our LLM to use code to do the data analytics for us?

How to enable GPT-4 to use Python?

As mentioned above, we'd like to use LLMs - GPT-4 in this example - to simply ask questions in human language (like "How many users churned last month?") about data in a CSV file. ChatGPT nowadays supports uploading CSV files but - privacy concerns aside - it simply analyzes the text in the file and produces an answer based on its training as a language model. Which is not what we need for data analytics (LLMs are bad at data analytics!). However, we know that GPT-4 is actually quite good at writing Python.

So what we want is a system where we interact with GPT-4, GPT-4 creates and executes some Python code that analyzes the CSV file, and GPT-4 finally returns the result to us. We never actually need to write or interact with any Python ourselves. Furthermore, we also don't want to send the CSV file to GPT because - first - it would definitely exceed GPT's context limit and - second - for privacy reasons.
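
To make this a bit more tangible, the core idea looks roughly like this: the CSV is read locally with pandas, and only the question plus a tiny preview of the dataframe ends up in the prompt sent to the LLM. The following is just a conceptual sketch of that idea (not LangChain's actual internals; the build_prompt helper is purely illustrative):

import pandas as pd

# The CSV is read locally - the file itself never leaves our machine.
df = pd.read_csv("churn-dataset.csv")

def build_prompt(df: pd.DataFrame, question: str) -> str:
    # The LLM only sees the column names and a few sample rows.
    # It answers with Python code, which is then executed locally on the full dataframe.
    return (
        "You are working with a pandas dataframe called `df`.\n"
        f"This is the result of `df.head()`:\n{df.head().to_markdown()}\n\n"
        f"Write Python code to answer: {question}"
    )

print(build_prompt(df, "How many rows and columns are there in the dataframe?"))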

Asking GPT-4 to analyze a CSV file

While we could theoretically build this system on our own, there is an amazing tool out there called LangChain that does almost all of the required "gluing". In short, LangChain is a library that helps connect LLMs with data sources and tools. It essentially enables GPT-4 to use tools like web search, calculators or - yes - Python.

As you'll see in the next chapter, LangChain is one hell of a tool - it abstracts away almost all of the complexity involved in connecting LLMs with real-world tools.
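
To give you a feel for how generic this tool concept is, here is a minimal sketch using the LangChain APIs available at the time of writing (newer LangChain releases have since changed these interfaces): an agent that is given a calculator tool and decides on its own when to use it.

from langchain.agents import initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

# "llm-math" is LangChain's built-in calculator tool
tools = load_tools(["llm-math"], llm=llm)

agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)
agent.run("What is 7043 multiplied by 21?")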

Step-by-step guide to using LangChain to analyze CSV data with GPT-4

Now that the basics are covered, let's see how we can "teach" GPT-4 to analyze a real-world dataset. We are going to use a well-known dataset about churned users of a telecom provider. It contains a good amount of data and plenty of interesting attributes to explore.

  1. Let's install the required dependencies

    pip install langchain openai chromadb tiktoken tabulate matplotlib pandas

    Execute the following steps in a Jupyter notebook or Python script.

  2. Import the required modules

    import os
    import pandas as pd

    from langchain.agents import create_csv_agent
    from langchain.llms import OpenAI
  3. Get an OpenAI API key from the OpenAI platform and set it as the OPENAI_API_KEY environment variable (shown in the full code listing at the end of this post).

  4. Download the dataset from my website. The dataset contains information about churned users of a telecom provider.

  5. Create a LangChain agent. Agents in LangChain are responsible for deciding which tool (in our case Python) to use and how to use it. The agent reformulates the user input (our prompt) and sends a prompt to an LLM, asking which tool would best serve the request. The agent then uses the selected tool - Python in our case - and again asks the LLM how to use it. The LLM responds with instructions - Python code - for using the tool. This is all highly fascinating and requires almost no code; all the business logic is actually provided by the LLM.

    agent = create_csv_agent(OpenAI(temperature=0), 'churn-dataset.csv', verbose=True)

    Note: Setting verbose to True prints the "thought chain" of the agent, so we can follow what the agent decided to do.

  6. That was actually all the heavy lifting required. The agent knows which dataset it should use and which LLM to call (LangChain's OpenAI wrapper, which at the time of writing defaults to text-davinci-003, one of OpenAI's most capable completion models). We can start asking questions. Let's start with an easy one:

    agent.run("How many rows and columns are there in the dataframe?")

    Result/Output:

    > Entering new AgentExecutor chain...
    Thought: I need to find the shape of the dataframe
    Action: python_repl_ast
    Action Input: df.shape
    Observation: (7043, 21)
    Thought: I now know the final answer
    Final Answer: The dataframe has 7043 rows and 21 columns.

    > Finished chain.
    'The dataframe has 7043 rows and 21 columns.'

Isn't that something? By simply calling agent.run with our question, we instructed our agent to ask the LLM how to find the number of rows and columns of our dataframe.

What's amazing here is that it's not the LLM itself reading the file - that would exceed the LLM's context limit, since we have far more than 8k tokens of data. The LLM just provided the Python code, which is then run on our machine and determines the correct shape of the dataframe.
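
If you want to verify that claim yourself, tiktoken (which we installed above) can count the tokens of the raw CSV. A quick sketch, assuming the GPT-4 tokenizer cl100k_base:

import tiktoken

# Count how many tokens the raw CSV would consume in a prompt
with open("churn-dataset.csv") as f:
    raw_text = f.read()

encoding = tiktoken.get_encoding("cl100k_base")
print(len(encoding.encode(raw_text)))  # far beyond an 8k-token context window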

The output of the agent also shows the steps it needed to execute. First, it asked the LLM to interpret our input - the LLM told the agent that it needs to find the shape of the dataframe and should use the Python tool. Then the agent sent the result back to the LLM, and the LLM concluded that this is the answer. So the final answer was: The dataframe has 7043 rows and 21 columns.
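
For comparison, this is roughly the pandas code the agent executed locally for this first question (written out by hand here, nothing more):

import pandas as pd

df = pd.read_csv("churn-dataset.csv")
print(df.shape)  # (7043, 21)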

Let's try something more advanced.

agent.run("How many users in the dataset did churn? Provide the answer as percentage.")

Result/Output:

> Entering new AgentExecutor chain...
Thought: I need to count the number of users who churned and divide it by the total number of users.
Action: python_repl_ast
Action Input: df['Churn'].value_counts(normalize=True)
Observation: Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64
Thought: I now know the final answer
Final Answer: 26.54%

> Finished chain.
'26.54%'

And again, a perfect answer - factually correct and cleanly derived.
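
Written out by hand, the agent's approach boils down to the following pandas snippet - shown here only to make the comparison with the trace above easier:

import pandas as pd

df = pd.read_csv("churn-dataset.csv")

# Share of churned users, formatted as a percentage
churned_share = df["Churn"].value_counts(normalize=True)["Yes"]
print(f"{churned_share:.2%}")  # 26.54%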

Ok - now it would be great to know how many of our users had an Internet Service and which of these churned. Why not create a bar chart for that?

agent.run("Please use matplotlib to create a bar chart showing how many of the customers had Internet Service and which portion of these churned. The bar chart should be a stacked bar chart, where the top portion of the bar shows the churned users, the bottom portion shows the non-churned ones.")

As you can see above, I was quite specific about what I wanted. And guess what - the result was perfect.

Users with or without Internet Service
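
For reference, the code the agent generates for such a request could look roughly like the sketch below. Note that this is an assumption on my part: the exact code differs from run to run, and the column name InternetService is taken from the telecom churn dataset used here.

import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("churn-dataset.csv")

# Group customers by whether they have any internet service at all
has_internet = df["InternetService"].ne("No").map(
    {True: "Has Internet Service", False: "No Internet Service"}
)
counts = pd.crosstab(has_internet, df["Churn"])

# Stacked bars: non-churned customers at the bottom, churned on top
fig, ax = plt.subplots()
ax.bar(counts.index, counts["No"], label="Not churned")
ax.bar(counts.index, counts["Yes"], bottom=counts["No"], label="Churned")
ax.set_ylabel("Number of customers")
ax.legend()
plt.show()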

With just three lines of text we were able to start an exploratory data analysis that would otherwise take quite a few lines of Python code. From here you could simply continue with further queries.

Summary

This small guide demonstrated the tremendous capabilities of LangChain in helping LLMs connect to real-world tools like Python. With just a few lines of code we were able to analyze data in a CSV file and even generate plots - simply by asking our agent for it.

It is exciting to see how far the LLM ecosystem has come and how easily these models can be extended by giving them the power of tools.

While I must admit that knowing Python helps to "guide" the agent towards the desired outcome, it is incredibly helpful to be able to create basic analytics without knowing every in and out of the Python programming language. And let's be honest: with the rate of development in this field, it's to be expected that this whole process will only get more and more powerful.

For reference, this is all the code we needed to create our analysis:

import os
import pandas as pd

from langchain.agents import create_csv_agent
from langchain.llms import OpenAI

# set your OpenAI key here
os.environ["OPENAI_API_KEY"] = "sk-Wcxxxxxxx"

agent = create_csv_agent(OpenAI(temperature=0), 'churn-dataset.csv', verbose=True)

# Start our "interrogation"
agent.run("How many rows and columns are there in the dataframe?")

agent.run("How many users in the dataset did churn? Provide the answer as percentage.")

agent.run("Please use matplotlib to create a bar chart showing how many of the customers had Internet Service and which portion of these churned. The bar chart should be a stacked bar chart, where the top portion of the bar shows the churned users, the bottom portion shows the non-churned ones.")

What's next?

If you don't want to export your data as CSV to analyze it, but instead want to "chat" with your BigQuery data directly, have a look at the next post in this LangChain blogging series.

If you are more interested in how to connect your very own PDF files to GPT and ask questions about them, have a look at my LangChain PDF tutorial.

And if you want to know how to rent a beefy GPU for your own LLM experiments, check out this guide.

------------------

Interested in how to train your very own Large Language Model?

We prepared a well-researched guide on how to use the latest advancements in open-source technology to fine-tune your own LLM. This has many advantages, such as:

  • Cost control
  • Data privacy
  • Excellent performance - adjusted specifically for your intended use

Need assistance?

Do you have any questions about the topic presented here? Or do you need someone to assist in implementing a solution like this? Do not hesitate to contact me.