Medical professionals often summarize patient encounters in natural-language transcripts that include details about symptoms, diagnoses, and treatments. These transcripts can feed other medical documentation, such as insurance paperwork, but because they are densely packed with medical information, extracting the key data accurately can be challenging.
You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts and match it to the appropriate ICD-10 codes automatically. ICD-10 is a standardized coding system used worldwide for classifying diagnoses and for billing purposes, such as insurance claims processing.
The Data
The dataset contains anonymized medical transcriptions categorized by specialty.
transcriptions.csv
| Column | Description |
|---|---|
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json

# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

# Initialize the OpenAI client
client = OpenAI()

# Function to extract age and recommended treatment
def extract_age_treatment(transcription):
    """Extract age and recommended treatment from a transcript using the OpenAI API."""
    # The system and user messages must be separate dictionaries; putting both
    # in one dictionary silently overwrites the system prompt.
    messages = [
        {
            "role": "system",
            "content": "You are a healthcare professional extracting patient data. Always return both the age and recommended treatment. If the information is missing, still create the field and specify 'Unknown'.",
        },
        {
            "role": "user",
            "content": f"Please extract and return both the patient's age and recommended treatment from the following transcription. Transcription: {transcription}.",
        },
    ]
    function_definition = [
        {
            "type": "function",
            "function": {
                "name": "extract_medical_info",
                "description": "Extract the age and recommended treatment from the input text. Always return both age and recommended treatment.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "Age": {
                            "type": "integer",
                            "description": "Age of the patient"
                        },
                        "Recommended Treatment": {
                            "type": "string",
                            "description": "Recommended Treatment for the patient"
                        }
                    }
                }
            }
        }
    ]
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=function_definition,
        # Force the function call so that tool_calls is not None when the
        # model would otherwise answer in plain text.
        tool_choice={
            "type": "function",
            "function": {"name": "extract_medical_info"},
        },
    )
    return json.loads(response.choices[0].message.tool_calls[0].function.arguments)

# Function to get ICD codes for a given treatment
def get_icd_codes(treatment):
    """Retrieve ICD codes for a given treatment using the OpenAI API."""
    if treatment != 'Unknown':
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"Provide the ICD codes for the following treatment or procedure: {treatment}. Return the answer as a list of codes. Please only include the codes and no other information."
            }],
            temperature=0.3
        )
        output = response.choices[0].message.content
    else:
        # Skip the API call when there is no treatment to look up
        output = "Unknown"
    return output

# Process each row of the dataframe containing the patient data
processed_data = []
for index, row in df.iterrows():
    medical_speciality = row['medical_specialty']
    extracted_data = extract_age_treatment(row['transcription'])
    # Fall back to "Unknown" when the model did not return a treatment
    icd_code = get_icd_codes(extracted_data['Recommended Treatment']) if 'Recommended Treatment' in extracted_data else "Unknown"
    extracted_data['Medical Speciality'] = medical_speciality
    extracted_data["ICD Code"] = icd_code
    processed_data.append(extracted_data)

# Convert the list into a DataFrame
df_structured = pd.DataFrame(processed_data)
print(df_structured.columns)
print(df_structured.head(2))
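
The loop above stores whatever text the model returns for the ICD codes and assumes the tool-call arguments always parse into a dictionary with both fields. The following is a minimal sketch of normalizing both outputs before they reach the DataFrame. The helper names `normalize_extraction` and `parse_icd_codes` are hypothetical and not part of the original project, and the regular expression only approximates the shape of an ICD-10-CM diagnosis code; it does not check codes against the official code set and will not match procedure codes that use a different format.

```python
import json
import re

# Approximate shape of an ICD-10-CM code: a letter, a digit, an alphanumeric
# character, then an optional dot followed by 1-4 alphanumeric characters.
# Assumption: this is a format check only, not a lookup in the official code set.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def normalize_extraction(raw_arguments):
    """Parse tool-call arguments and make sure both expected fields exist."""
    try:
        data = json.loads(raw_arguments)
    except (TypeError, json.JSONDecodeError):
        # Missing or malformed arguments: start from an empty record
        data = {}
    if not isinstance(data, dict):
        data = {}
    data.setdefault("Age", "Unknown")
    data.setdefault("Recommended Treatment", "Unknown")
    return data


def parse_icd_codes(raw_text):
    """Return ICD-10-like codes found in a model response, in order, without duplicates."""
    if not raw_text or raw_text == "Unknown":
        return []
    codes = []
    for code in ICD10_PATTERN.findall(raw_text):
        if code not in codes:
            codes.append(code)
    return codes
```

In the processing loop, `normalize_extraction` could wrap the arguments string before the treatment lookup, and `parse_icd_codes(icd_code)` could be applied before the value is stored, so that the "ICD Code" column holds a list of codes rather than free text.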