How to Extract Data From a ChatGPT Response Using the Python LangChain Framework?

You can use the create_extraction_chain() function in LangChain to extract structured JSON data from a ChatGPT response. To do so, make sure you meet the following prerequisites first:

  1. Create an OpenAI API key;
  2. Install the necessary Python packages (as shown below).
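
For example, assuming you're using pip, you can install the langchain and openai packages like so:

pip install langchain openai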

Once you've done those, you can extract data from a ChatGPT response using LangChain by following these steps:

  1. Define the model/schema that describes the data to extract;
  2. Define the LLM and chain to use;
  3. Execute the chain on some input.

You can apply the steps above to extract data:

  1. Into a single entity;
  2. Into multiple entities;
  3. With extra information.

Extracting Data Into a Single Entity

You can simply define the attributes of the entity you would like to extract in a model/schema. Under the hood, LangChain uses OpenAI function calling to do the job for you.

For example, to extract attributes of a "person" from the input query, you can do something like the following:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# 1. define the schema to extract data based on
schema = {
  "properties": {
    "name": {"type": "string"},
    "age": {"type": "integer"},
    "occupation": {"type": "string"},
  },
  "required": ["name", "age"],
}

# 2. define the LLM and chain to use
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# 3. execute the chain on some input
query = """Bruce is 24 years old and works at Wayne Enterprises. Celina is three years younger than Bruce and does bar tending."""
result = chain.run(query)

print(result)

When executed, this will return a list of dictionaries (JSON-like objects) like the following:

# [
#   {'name': 'Bruce', 'age': 24, 'occupation': 'works at Wayne Enterprises'},
#   {'name': 'Celina', 'age': 21, 'occupation': 'does bar tending'}
# ]
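
Since the result is a plain Python list of dictionaries, you can post-process it with regular Python. For example, assuming the output shown above, you could print a short summary line per person:

# iterate over the extracted dictionaries (assuming the output shown above)
for person in result:
    print(f"{person['name']} ({person['age']}): {person.get('occupation', 'unknown')}")

# Bruce (24): works at Wayne Enterprises
# Celina (21): does bar tending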

Extracting Data Into Multiple Entities

If you want to extract multiple entity types, you can do so by modifying your model/schema to include their attributes. The entities are distinguished by giving all attributes of a given entity the same prefix (e.g. person_ and pet_).

For example, suppose you define two distinct entities that describe the attributes of a "person" and their "pet"; the schema would look something like the following:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# 1. define the schema to extract data based on
schema = {
  "properties": {
    "person_name": {"type": "string"},
    "person_age": {"type": "integer"},
    "pet_name": {"type": "string"},
    "pet_type": {"type": "string"},
  },
  "required": ["person_name", "person_age"],
}

# 2. define the LLM and chain to use
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# 3. execute the chain on some input
query = """Bruce is 24 years old, and Celina is three years younger than Bruce. Bruce owns a German Shepherd called Ace, and Celina owns three cats called Otto, Slinky and Kitty."""
result = chain.run(query)

print(result)

When executed, this will produce a result like the following:

# [
#   {'person_name': 'Bruce', 'person_age': 24, 'pet_name': 'Ace', 'pet_type': 'German Shepherd'},
#   {'person_name': 'Celina', 'person_age': 21, 'pet_name': 'Otto', 'pet_type': 'cat'},
#   {'person_name': 'Celina', 'person_age': 21, 'pet_name': 'Slinky', 'pet_type': 'cat'},
#   {'person_name': 'Celina', 'person_age': 21, 'pet_name': 'Kitty', 'pet_type': 'cat'}
# ]
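
Note that each pet is returned as its own object, with the owner's attributes repeated. If you need the data grouped per person, you can regroup the flat result yourself; for example, assuming the output shown above:

from collections import defaultdict

# group pet names by owner (assuming the output shown above)
pets_by_owner = defaultdict(list)
for row in result:
    pets_by_owner[row["person_name"]].append(row["pet_name"])

print(dict(pets_by_owner))
# {'Bruce': ['Ace'], 'Celina': ['Otto', 'Slinky', 'Kitty']}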

If instead of "pets", you had "animals" that had no connection or relation to the people, then they will be extracted as separate entities.

For example, consider the following model/schema with attributes of "person" and "animal" entities, where the query specifies no relation between them:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# 1. define the schema to extract data based on
schema = {
  "properties": {
    "person_name": {"type": "string"},
    "person_age": {"type": "integer"},
    "animal_name": {"type": "string"},
    "animal_type": {"type": "string"},
  },
  "required": ["person_name"],
}

# 2. define the LLM and chain to use
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# 3. execute the chain on some input
query = """Bruce is 24 years old, and Celina is three years younger than Bruce.
Ace is a dog who loves to chase after the neighbor's cat Otto, and the stray rabbit that comes over occasionally called Bugs."""
result = chain.run(query)

print(result)

When executed, this will produce an output like the following:

# [
#   {'person_name': 'Bruce', 'person_age': 24},
#   {'person_name': 'Celina', 'person_age': 21},
#   {'animal_name': 'Ace', 'animal_type': 'dog'},
#   {'animal_name': 'Otto', 'animal_type': 'cat'},
#   {'animal_name': 'Bugs', 'animal_type': 'rabbit'}
# ]
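
Because unrelated entities come back as separate objects, you can tell them apart by checking which keys are present. For example, assuming the output shown above, you could split the result into people and animals like so:

# split the flat result into people and animals based on the keys present
people = [row for row in result if "person_name" in row]
animals = [row for row in result if "animal_name" in row]

print(people)   # [{'person_name': 'Bruce', 'person_age': 24}, {'person_name': 'Celina', 'person_age': 21}]
print(animals)  # [{'animal_name': 'Ace', 'animal_type': 'dog'}, ...]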

Extracting Data With Extra Information

If you wish to extract extra information from the response that might not be covered by the attributes you define, you can create an additional placeholder property (for unstructured extraction).

For example, to extract additional information about the "person" entity, you can add a "person_extra_info" placeholder to the model/schema:

import os
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_extraction_chain

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"

# 1. define the schema to extract data based on
schema = {
  "properties": {
    "person_name": {"type": "string"},
    "person_age": {"type": "integer"},
    "person_extra_info": {"type": "string"},
    "pet_name": {"type": "string"},
    "pet_type": {"type": "string"},
  },
  "required": ["person_name", "person_age"],
}

# 2. define the LLM and chain to use
llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")
chain = create_extraction_chain(schema, llm)

# 3. execute the chain on some input
query = """Bruce is 24 years old, and Celina is three years younger than Bruce. Bruce works as the CEO of Wayne Enterprises, while Celina works at a bar called the Iceberg Lounge. Ace the dog lives with Bruce, and Otto the cat lives with Celina."""
result = chain.run(query)

print(result)

When executed, this will produce an output like the following:

# [
#   {'person_name': 'Bruce', 'person_age': 24, 'person_extra_info': 'CEO of Wayne Enterprises', 'pet_name': 'Ace', 'pet_type': 'dog'},
#   {'person_name': 'Celina', 'person_age': 21, 'person_extra_info': 'works at the Iceberg Lounge', 'pet_name': 'Otto', 'pet_type': 'cat'}
# ]
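
If you need the extracted data as an actual JSON string (for example, to store it in a file or send it to another service), you can serialize the returned list with Python's standard json module:

import json

# serialize the extracted list of dictionaries to a JSON string
print(json.dumps(result, indent=2))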
