can-chatgpt-replace-data-analyst
can-chatgpt-replace-data-analyst
can-chatgpt-replace-data-analyst

Mar 27, 2022

6 min read

Will ChatGPT Replace Data Analysts?

Intro

This blogpost was inspired by the Medium post by Marie Truong, where she tried to answer the question “Can ChatGPT Write Better SQL than a Data Analyst?".  To answer the question Marie completed 3 SQL challenges and then used ChatGPT to do the same. As a result, SQL queries prepared by a human were more accurate and Marie won 3-0.

I’ll be using the Metatable.ai SQL module to solve the same SQL challenges and see if my attempts at using AI are any better.

What's The Difference In Using ChatGPT vs Metatable.ai

We’ll be using Metatable.ai - an SQL powered data platform which can translate natural language to sql. Metatable does use OpenAI's API for the GPT model. So the natural question is why do I need Metatable.ai if I could use ChatGPT and ask for SQL?

There are two major differences:

  1. Metatable uses an AI pipeline to validate an SQL query created by AI on the SQL database engine. If the SQL query has errors, the pipeline reports them back to the AI and requests GPT to fix the error. This process happens automatically without the user's knowledge, and GPT tries multiple times to come up with the correct solution. From the user's perspective, it appears as if they typed a prompt and received a valid SQL query.

  2. Better UX, you type only your request, you don't have to worry about prompt engineering in order to prepare enough context to ChatGPT. Metatable does provide full context of the database schema to the GPT model.

Check out the full article on the current version of Metatable.ai

Text to SQL with Metatable.ai

Methodology of Transforming Natural Language Text to SQL

To solve the SQL challenges using Metatable.ai, I will follow these steps:

  1. Inside the Metatable.ai I will create an SQL database with a table that is the same as described in the Challenge.

  2. I will use the Challenge description as a prompt to generate an SQL query using AI.

  3. Submit the AI-generated SQL query to LeetCode and adjust the prompt as necessary if any issues arise.

LeetCode provides a benchmark on efficiency of the code, so the SQL query by GPT will be measured for performance. Let's get started!  

Challenge 1 - Easy Task: Customer Placing the Largest Number of Orders

SQL_challenge_1_for_AI


Step 1. Let's create a table orders, to do so we copy the SQL statement from the puzzle and paste it into Metatable’s prompt. This is almost SQL query so it should be easy to GPT to format it:

CREATE TABLE IF NOT EXISTS orders (
order_number INT,
customer_number INT
);

Step 2. Now let’s copy and paste the task text into Metatable’s prompt field:

Write an SQL query to find the customer_number for the customer who has placed the largest number of orders.
And in few seconds AI generated an SQL query:

SELECT customer_number
FROM orders
GROUP BY customer_number
ORDER BY COUNT(*) DESC
LIMIT 1;

This SQL is correct, readable and beats 81.6%. Not sure if the efficiency number represents true efficiency, since the same SQL query showed quite a different number the other day.

challenge 1: Text to SQL with AI


Challenge 2 - Medium Task: Tree Node

Tree Node SQL Challenge


Step 1: We create a Table in Metatable.ai by typing: 'Create table tree with columns id and p_id'

GPT-3 and Metatable did provide the SQL:
CREATE TABLE tree(
id SERIAL PRIMARY KEY,
p_id INTEGER
);

Which created us the SQL table tree with columns id and p_id

Step 2: The Challenge provides extensive explanation on what is Leaf, Root and Inner for a Node. Let's see if GPT can figure it out just by their names so our prompt is:

'Write an SQL query to find the type of each node in tree table. Types are: Leaf, Inner and Root'

AI did figure out the logic behind types for Root, Inner and Leaf and did create the following SQL.

Metatable.ai User interface

After applying this SQL to the Challenge console, we need to do minor tweaks:

  • change node_type to node

  • add column id to the SELECT statement

The final SQL for the Challenge 2 is:

SELECT t1.id,
CASE
WHEN t1.p_id IS NULL THEN 'Root'
WHEN EXISTS (SELECT 1 FROM tree t2 WHERE t2.p_id = t1.id) THEN 'Inner
ELSE 'Leaf'
END AS type
FROM tree t1;

I'm very impressed how GPT picks up the logic between Root, Inner and Leaf simply from a single word. I guess the concept for calling Nodes that way is quite common, so GPT definitely saw that while training.

challenge 2: Text to SQL with AI


Challenge 3 - Hard Task: Capital Gain/Loss

Capital Gain/Loss SQL challenge

Step1: Creating table stocks.

Step2: Writing prompt to solve the Challenge: 'Write an SQL query to report the Capital gain/loss for each stock. The Capital gain/loss of a stock is the total gain or loss after buying and selling the stock one or many times. Return the result table in any order'

We've got a text to SQL in action by GPT and Metatable:

Metatable.ai-user interface

When tried to validate the SQL it appears to be incorrect.

SQL challenge result

AI was wrong with the sign here 'Buy' THEN - price ELSE price END , I changed it so the final SQL is:

SELECT
stock_name,
SUM (CASE WHEN operation = 'Buy' THEN - price ELSE price END) AS capital_gain_loss
FROM stocks
GROUP BY stock_name;

It's correct, quite readable and beats 96.31%, but again the other day the same SQL showed about 7%.

challenge 3: Text to SQL with AI


Conclusions

AI currently cannot fully replace a data analyst (at least not yet 😉). In my opinion, the optimal strategy would be to implement an approach that incorporates humans in the loop when using AI. With the right approach and adequate human supervision, AI can be a potent tool for anyone working with SQL, including data analysts.

Challenge performance:

  • Success: Metatable.AI was able to complete all 3 challenges.

  • To tackle each Challenge, I created a table in Metatable and then used the Challenge description as a prompt to create SQL with AI. This process was essentially a 'natural language to SQL exercise' and the output was nearly a complete solution. However, for 2 out of 3 challenges, the initial SQL code generated was slightly off.

  • In Challenge 2, renaming columns and adding additional fields were necessary. However, if the task description had included the required field names and naming conventions, the initial prompt would have been sufficient. As for Challenge 3, GPT made a minor mistake by interpreting 'Buy' as a plus and 'Sell' as a minus, which I quickly identified.

Regarding Metatable's performance, I think the AI pipeline does the job,  it validates the generated SQL with SQL database engine and if errors send it back to AI as a combined prompt. There is a room for improvement for natural language to SQL but I think we are on the right track.

If you're interested in exploring the use of AI in your own work, Metatable.ai has a beta release available for free. Give it a try and see how it can improve your efficiency and accuracy in SQL query writing. With the right approach and proper human oversight, AI can be a powerful tool for data analysts and anyone else working with SQL. Sign up for free beta version at beta.metatable.ai

P.S. When I wrote this post, I wanted to use Chat GPT to assist me with copy, but it was ‘At full capacity’ so there’s another point to using Metatable.ai, since it is connected to OpenAI via paid API.

Other articles