Every workplace, from buzzing construction sites to quiet offices, has its share of risks. When accidents happen—whether it’s a slip, a fall, or repetitive strain—workers’ compensation is there to help. This essential system provides support for injured employees while helping businesses manage claims efficiently.
Now, you get to step into the action! 🏗️📊
Your task is to analyze workers’ compensation claims data and retrieve cases that match specific injury types based on query prompts. Whether it’s identifying patterns in back injuries, falls, or repetitive motion strains, your work will shine a light on how injuries occur and how similar claims are handled. These insights can drive better safety practices and smarter resource allocation.
Data
The data used in this project are synthetically generated workers' compensation insurance policies, each of which involves an accident. Each record contains demographic and worker-related information, as well as a free-text description of the accident. The dataset is designed to simulate real-world insurance claims while adhering to data privacy standards. To limit the burden on your OpenAI API credits, we have restricted the scope to the first 100 records of the original data sourced from Kaggle.
insurance_claims_top_100.csv
| Column | Description |
|---|---|
| 'ClaimNumber' | Unique policy identifier. Each policy has a single claim in this synthetically generated data set. |
| 'DateTimeOfAccident' | Date and time when the accident occurred (MM/DD/YYYY HH:MM:SS). |
| 'DateReported' | Date the accident was reported to the insurer (MM/DD/YYYY). |
| 'Age' | Age of the worker involved in the claim. |
| 'Gender' | Gender of the worker: M for Male, F for Female, or U for Unknown. |
| 'MaritalStatus' | Marital status of the worker: Married, Single, or Unknown. |
| 'DependentChildren' | Number of dependent children. |
| 'DependentsOther' | Number of dependents excluding children. |
| 'WeeklyWages' | Total weekly wage of the worker. |
| 'PartTimeFullTime' | Employment type: P for Part-time or F for Full-time. |
| 'HoursWorkedPerWeek' | Total hours worked per week by the worker. |
| 'DaysWorkedPerWeek' | Number of days worked per week by the worker. |
| 'ClaimDescription' | Free-text description of the claim, providing details about the incident. |
| 'InitialIncurredClaimCost' | Initial cost estimate for the claim made by the insurer. |
| 'UltimateIncurredClaimCost' | Total claim payments made by the insurance company. This was the target variable for prediction in the original Kaggle competition. |
Citation
Ali. Actuarial loss prediction. https://kaggle.com/competitions/actuarial-loss-estimation, 2020. Kaggle.
Before you start
To complete the project you will need to create developer accounts with OpenAI and Pinecone and store your API keys as environment variables. Instructions for these steps are outlined below.
Create a developer account with OpenAI
- Go to the API signup page.
- Create your account (you'll need to provide your email address and your phone number).
- Go to the API keys page.
- Create a new secret key.
- Take a copy of it. (If you lose it, delete the key and create a new one.)
Add a payment method
OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details.
This project should cost much less than 1 US cent using the text-embedding-3-small model (but if you rerun tasks, you will be charged every time).
- Go to the Payment Methods page.
- Click Add payment method.
- Fill in your card details.
Create a starter account with Pinecone
- Go to pinecone.io.
- Create a free Starter account.
- Head to the API keys section and copy your API key.
Add environment variables for your API keys
- In the workbook, click on "Environment" in the top toolbar and select "Environment variables".
- Click "Add" to add environment variables.
- In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key. Do the same for your Pinecone API key, assigning it to "PINECONE_API_KEY".
- Click "Create", then click "Connect" in the pop-up window that appears, and wait 5-10 seconds for the kernel to restart, or restart it manually from the Run menu.
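Once the variables are saved, a quick check like the minimal sketch below (it only assumes the variable names set above) confirms the keys are visible to the kernel:
import os

# Sanity check: confirm both API keys are visible to the kernel
# (uses the names OPENAI_API_KEY and PINECONE_API_KEY set above)
for key_name in ("OPENAI_API_KEY", "PINECONE_API_KEY"):
    status = "set" if os.environ.get(key_name) else "missing"
    print(f"{key_name} is {status}")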
# Install the Pinecone Python SDK
!pip install pinecone

# Import the relevant Python libraries
import pandas as pd
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec

# Here we are loading the insurance claims dataset into Pandas
df = pd.read_csv('insurance_claims_top_100.csv')
print(df.shape)

# Here we are initializing OpenAI and Pinecone
# Be sure to use your own API keys
import os
openai_api_key = os.environ["OPENAI_API_KEY"]
pinecone_api_key = os.environ['PINECONE_API_KEY']
client = OpenAI(api_key = openai_api_key)
pc = Pinecone(api_key = pinecone_api_key)
# Create a name for the Pinecone index
INDEX_NAME = "insurance-claims"
# Start coding here
# Use as many cells as you like

Project Instructions
Embed and store the insurance_claims_top_100.csv dataset in a Pinecone vector database and use your findings to answer the following questions:
What is the ClaimNumber of the claim that is closest to the example query, "Car accident with rear-end collision"? Store as a string variable called closest_claim_id.
What is the ClaimDescription of the claim that is closest to the example query? Save as a string variable called closest_claim_description.
Find the description of the claim most similar to the query "Worker developed carpal tunnel syndrome from repetitive typing" and save it as a string variable called closest_claim_description_carpal_tunnel.
Guides
How to approach the project
- Create a Pinecone index. To store and search embeddings, you'll need to create a Pinecone index. This index will serve as the database where embeddings for the insurance claims will be stored and queried.
  - Creating the Pinecone Index
  - Connecting to the Pinecone index
- Create a function to generate the embeddings. To create embeddings for claim descriptions, you can define a function that sends a request to the OpenAI API.
  - Generating Text Embeddings
- Insert the embeddings into Pinecone. After generating embeddings for claim descriptions, insert them into the Pinecone index so each embedding will be stored alongside its unique claim number.
  - Storing Embeddings in Pinecone
- Create a function to find similar claims. To retrieve claims similar to a given query, you can create a function that generates an embedding from the query and searches the Pinecone index for the closest matches.
  - Querying Pinecone for Similar Claims
- Find the claim number of the claim that is closest to the example query. You'll need to access the query results from Pinecone.
  - Accessing the query results
- Find the claim description of the claim that is closest to the example query. Filter the DataFrame using the relevant identifier to find the claim description of the claim that is closest to the example query.
  - Subsetting the DataFrame with the claim ID
- Locate the description of the claim most similar to the query, "Worker developed carpal tunnel syndrome from repetitive typing".
"""
Create a Pinecone index
To store and search embeddings, you'll need to create a Pinecone index. This index will serve as the database where embeddings for the insurance claims will be stored and queried.
Creating the Pinecone Index
Connecting to the Pinecone index
"""
if INDEX_NAME in [index.name for index in pc.list_indexes()]:
    pc.delete_index(INDEX_NAME)

# Create your Pinecone index
pc.create_index(
    name=INDEX_NAME,
    dimension=1536,
    metric='euclidean',
    spec=ServerlessSpec(
        cloud='aws',
        region='us-east-1'
    )
)
# Connect to your index
index = pc.Index(INDEX_NAME)
print(index)
print(index.describe_index_stats())

"""
Creating a function to create the embeddings
To create embeddings for claim descriptions, you can define a function that sends a request to the OpenAI API.
Generating Text Embeddings
Use the OpenAI client.embeddings.create() method to generate embeddings.
Replace any newline characters in the text with spaces for compatibility.
Pass the input text and the desired embedding model (e.g., "text-embedding-3-small") to the API.
Return the embedding from the API response using .data[0].embedding.
"""
def generate_embeddings(query, emb_model="text-embedding-3-small"):
    # Encode the input query using OpenAI
    query_response = client.embeddings.create(
        input=query.replace('\n', ' '),
        model=emb_model
    )
    query_emb = query_response.data[0].embedding
    return query_emb

import time
# Create a function to create embeddings with rate limit handling
from openai import RateLimitError

def generate_embeddings(query, emb_model="text-embedding-3-small"):
    text = query.replace('\n', ' ')
    while True:
        try:
            # Encode the input query using OpenAI
            query_response = client.embeddings.create(
                input=[text],
                model=emb_model
            )
            return query_response.data[0].embedding
        except RateLimitError:
            print("Rate limit exceeded. Retrying in 5 seconds...")
            time.sleep(5)

# sample_df = df.head(100)
from tqdm.auto import tqdm
prepped = []
for i, row in tqdm(df.iterrows(), total=df.shape[0]):
    claim_num = row['ClaimNumber']
    claim_desc = row['ClaimDescription']
    desc_embed = generate_embeddings(claim_desc)
    prepped.append({'id': claim_num,
                    'values': desc_embed})
    if len(prepped) >= 10:
        index.upsert(prepped)
        prepped = []
if len(prepped):
    index.upsert(prepped)
    prepped = []

# 4. Create a function to find similar claims
def find_similar_claims(query, top_k=5):
    query_embedding = generate_embeddings(query)
    results = index.query(vector=query_embedding, top_k=top_k, include_metadata=True)
    return results
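With the helper functions in place, the remaining guide steps come down to querying the index and subsetting the DataFrame. The sketch below is one possible approach, not the only one; it assumes the objects defined above, Pinecone's standard query response format (matches exposed under 'matches' with an 'id' field), and that ClaimNumber values in the DataFrame match the string IDs stored in the index.
# A minimal sketch of the final steps (one possible approach)

# 1. Claim number of the claim closest to the example query
example_query = "Car accident with rear-end collision"
results = find_similar_claims(example_query, top_k=1)
closest_claim_id = results['matches'][0]['id']
print(closest_claim_id)

# 2. Claim description for that claim number
# (assumes ClaimNumber values in df match the IDs stored in the index)
closest_claim_description = df.loc[df['ClaimNumber'] == closest_claim_id, 'ClaimDescription'].iloc[0]
print(closest_claim_description)

# 3. Description of the claim closest to the carpal tunnel query
carpal_query = "Worker developed carpal tunnel syndrome from repetitive typing"
carpal_results = find_similar_claims(carpal_query, top_k=1)
carpal_claim_id = carpal_results['matches'][0]['id']
closest_claim_description_carpal_tunnel = df.loc[df['ClaimNumber'] == carpal_claim_id, 'ClaimDescription'].iloc[0]
print(closest_claim_description_carpal_tunnel)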