Project: Insurance Claim Processing with Pinecone

Every workplace, from buzzing construction sites to quiet offices, has its share of risks. When accidents happen—whether it’s a slip, a fall, or repetitive strain—workers’ compensation is there to help. This essential system provides support for injured employees while helping businesses manage claims efficiently.

Now, you get to step into the action! 🏗️📊

Your task is to analyze workers’ compensation claims data and retrieve cases that match specific injury types based on query prompts. Whether it’s identifying patterns in back injuries, falls, or repetitive motion strains, your work will shine a light on how injuries occur and how similar claims are handled. These insights can drive better safety practices and smarter resource allocation.

Data

The data used in this project are synthetically generated workers’ compensation insurance policies, each of which has had an accident. Each record contains demographic and worker-related information, as well as a text description of the accident. The dataset is designed to simulate real-world insurance claims while adhering to data privacy standards. We have limited the data to the first 100 records of the original dataset, sourced from Kaggle, to limit the burden on your OpenAI API credits.

insurance_claims_top_100.csv

Column	Description
`'ClaimNumber'`	Unique policy identifier. Each policy has a single claim in this synthetically generated data set.
`'DateTimeOfAccident'`	Date and time when the accident occurred (MM/DD/YYYY HH:MM:SS).
`'DateReported'`	Date the accident was reported to the insurer (MM/DD/YYYY).
`'Age'`	Age of the worker involved in the claim.
`'Gender'`	Gender of the worker: `M` for Male, `F` for Female, or `U` for Unknown.
`'MaritalStatus'`	Marital status of the worker: Married, Single, or Unknown.
`'DependentChildren'`	Number of dependent children.
`'DependentsOther'`	Number of dependents excluding children.
`'WeeklyWages'`	Total weekly wage of the worker.
`'PartTimeFullTime'`	Employment type: `P` for Part-time or `F` for Full-time.
`'HoursWorkedPerWeek'`	Total hours worked per week by the worker.
`'DaysWorkedPerWeek'`	Number of days worked per week by the worker.
`'ClaimDescription'`	Free-text description of the claim, providing details about the incident.
`'InitialIncurredClaimCost'`	Initial cost estimate for the claim made by the insurer.
`'UltimateIncurredClaimCost'`	Total claims payments by the insurance company. This is the target variable for prediction.
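The date formats listed in the table can be parsed explicitly with pandas. The sketch below uses two made-up rows rather than the real file, and `ReportingLagDays` is a hypothetical helper column that is not part of the dataset.

```python
import io
import pandas as pd

# Two made-up rows that follow the column layout above (not real data).
sample_csv = io.StringIO(
    "ClaimNumber,DateTimeOfAccident,DateReported,ClaimDescription\n"
    "WC0000001,01/15/2020 09:30:00,01/20/2020,STRAINED LOWER BACK LIFTING BOX\n"
    "WC0000002,02/03/2020 14:05:00,02/04/2020,SLIPPED ON WET FLOOR BRUISED KNEE\n"
)
sample = pd.read_csv(sample_csv)

# Parse the two date columns using the formats given in the table.
sample["DateTimeOfAccident"] = pd.to_datetime(
    sample["DateTimeOfAccident"], format="%m/%d/%Y %H:%M:%S"
)
sample["DateReported"] = pd.to_datetime(sample["DateReported"], format="%m/%d/%Y")

# Hypothetical column: whole days between the accident date and the report date.
sample["ReportingLagDays"] = (
    sample["DateReported"] - sample["DateTimeOfAccident"].dt.normalize()
).dt.days
print(sample[["ClaimNumber", "ReportingLagDays"]])
```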

Citation

Ali. Actuarial loss prediction. https://kaggle.com/competitions/actuarial-loss-estimation, 2020. Kaggle.

Before you start

To complete the project, you will need to create developer accounts with OpenAI and Pinecone and store each API key as an environment variable. Instructions for these steps are outlined below.

Create a developer account with OpenAI

Go to the API signup page.
Create your account (you'll need to provide your email address and your phone number).
Go to the API keys page.
Create a new secret key.

Take a copy of it. (If you lose it, delete the key and create a new one.)

Add a payment method

OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details.

This project should cost much less than 1 US cent with gpt-4o-mini (but if you rerun tasks, you will be charged each time).

Go to the Payment Methods page.
Click Add payment method.

Fill in your card details.

Create a starter account with Pinecone

Go to pinecone.io.
Create a free Starter account.
Head to the API keys section and copy your API key.

Add environment variables for your API keys

In the workbook, click "Environment" in the top toolbar and select "Environment variables".
Click "Add" to add environment variables.
In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key. Do the same for your Pinecone API key, assigning it to "PINECONE_API_KEY".

Click "Create", then you'll see the following pop-up window. Click "Connect," then wait 5-10 seconds for the kernel to restart, or restart it manually in the Run menu.
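Once the kernel has restarted, it may be worth confirming that both keys are visible to Python before any API call is made. This is a minimal sketch; `REQUIRED_KEYS` and `missing_api_keys` are illustrative names, not part of either SDK.

```python
import os

# The two variable names used in the steps above.
REQUIRED_KEYS = ("OPENAI_API_KEY", "PINECONE_API_KEY")

def missing_api_keys(env=os.environ):
    """Return the names of required keys that are unset or empty."""
    return [name for name in REQUIRED_KEYS if not env.get(name)]

missing = missing_api_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both API keys are available.")
```

If a name is reported as missing, re-check the spelling in the "Name" field and reconnect the environment.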

# Install the Pinecone Python SDK
!pip install pinecone

# Import the relevant Python libraries
import pandas as pd
from openai import OpenAI
from pinecone import Pinecone, ServerlessSpec
import os
import time

# Here we are loading the insurance claims dataset into Pandas
df = pd.read_csv('insurance_claims_top_100.csv')

# Here we are initializing OpenAI and Pinecone
# Be sure to use your own API keys
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# Create a name for the Pinecone index
index_name = "insurance-claims"

# Start coding here
# Use as many cells as you like

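# Delete any existing index with this name so the notebook can be rerun safely
if index_name in pc.list_indexes().names():
    pc.delete_index(index_name)

# Create a serverless index; dimension 1536 matches text-embedding-3-small
pc.create_index(name=index_name, dimension=1536, metric='cosine', spec=ServerlessSpec(cloud='aws', region='us-east-1'))
print(pc.list_indexes())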
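A newly created index can take a few seconds to become available. The helper below is a sketch of a polling loop with a timeout; `wait_until_ready` is an illustrative name, and the commented usage assumes the `pc` client from the cells above and that `describe_index(...).status["ready"]` reports readiness in the installed SDK version.

```python
import time

def wait_until_ready(is_ready, timeout_s=60.0, poll_s=1.0):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_s)
    return False

# Intended use in the notebook, where `name` is the index created above:
# wait_until_ready(lambda: pc.describe_index(name).status["ready"])
```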


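# Embed each claim description with text-embedding-3-small (one API request per row; 1536 dimensions)
df['ClaimDescriptionEmbedded'] = df['ClaimDescription'].apply(
    lambda text: client.embeddings.create(model='text-embedding-3-small', input=text).data[0].embedding
)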
#print(df[['ClaimDescription', 'ClaimDescriptionEmbedded']].head())
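Embedding one description per request works for 100 rows, but the embeddings endpoint also accepts a list of strings, so the calls can be grouped. The following is a sketch under that assumption; `embed_in_batches` and `openai_embed_batch` are illustrative helpers, and the batch size of 50 is arbitrary.

```python
def embed_in_batches(texts, embed_batch, batch_size=50):
    """Embed texts in chunks; embed_batch maps a list of strings to a list of vectors."""
    embeddings = []
    for start in range(0, len(texts), batch_size):
        embeddings.extend(embed_batch(texts[start:start + batch_size]))
    return embeddings

def openai_embed_batch(client, model="text-embedding-3-small"):
    """Build an embed_batch callable from an OpenAI client (one request per batch)."""
    def _embed(batch):
        response = client.embeddings.create(model=model, input=batch)
        return [item.embedding for item in response.data]
    return _embed

# Intended use in the notebook (assumes `client` and `df` from the cells above):
# df["ClaimDescriptionEmbedded"] = embed_in_batches(
#     df["ClaimDescription"].tolist(), openai_embed_batch(client)
# )
```

Passing the embedding function in as an argument keeps the batching logic testable without spending API credits.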

# Build Pinecone records: claim number as ID, embedding as values, description as metadata
ids = df['ClaimNumber'].astype(str).tolist()  # Pinecone record IDs must be strings
values = df['ClaimDescriptionEmbedded'].tolist()
metadata = [{'claim_description' : c} for c in df['ClaimDescription']]
#print(ids[:3], '\n', values[:3], '\n', metadata[:3])
vectors = [{'id' : i, 'values' : v, 'metadata' : m} for i, v, m in zip(ids, values, metadata)]
print(vectors[:2])
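With the records assembled, the remaining steps are to upsert them and then query the index with an embedded prompt. The sketch below shows one possible shape for those steps; `upsert_in_batches` and `find_similar_claims` are illustrative helpers, `index` is assumed to be a handle obtained with `pc.Index(...)` for the index created above, and the query result is assumed to support dictionary-style access to `matches`, `id`, `score`, and `metadata`.

```python
def upsert_in_batches(index, vectors, batch_size=50):
    """Send records to the index in chunks and return how many were sent."""
    sent = 0
    for start in range(0, len(vectors), batch_size):
        batch = vectors[start:start + batch_size]
        index.upsert(vectors=batch)
        sent += len(batch)
    return sent

def find_similar_claims(index, query_vector, top_k=3):
    """Return (id, score, claim_description) tuples for the closest claims."""
    result = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [
        (match["id"], match["score"], match["metadata"]["claim_description"])
        for match in result["matches"]
    ]

# Intended use in the notebook (assumes `client`, `index`, and `vectors` exist):
# upsert_in_batches(index, vectors)
# query = client.embeddings.create(
#     model="text-embedding-3-small", input="lower back strain from lifting"
# ).data[0].embedding
# for claim_id, score, description in find_similar_claims(index, query):
#     print(claim_id, round(score, 3), description)
```

The query must be embedded with the same model as the stored descriptions; otherwise the cosine scores are not meaningful.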