Medical professionals often summarize patient encounters in natural-language transcripts that include details about symptoms, diagnoses, and treatments. These transcripts can be reused for other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.
You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts automatically and match it to the appropriate ICD-10 codes. ICD-10 is a standardized coding system used worldwide for recording diagnoses and for billing purposes, such as insurance claims processing.
The Data
The dataset contains anonymized medical transcriptions categorized by specialty.
transcriptions.csv
| Column | Description |
|---|---|
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |
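
Before sending anything to the API, it can help to confirm that the file really has the two columns described above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the sample row are made up for illustration and are not part of the dataset.

```python
import csv
import io

# Column names taken from the table above
EXPECTED_COLUMNS = ("medical_specialty", "transcription")

def missing_columns(csv_file, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # fieldnames is None for an empty file
    return [name for name in expected if name not in header]

# Made-up sample standing in for data/transcriptions.csv
sample = io.StringIO(
    "medical_specialty,transcription\n"
    "Cardiology,Illustrative text only\n"
)
print(missing_columns(sample))
```

With the real file, the same helper could be called on a handle opened with `open("data/transcriptions.csv", newline="")`; an empty list means both columns are present.
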
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json
import uuid

# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

# Initialize the OpenAI client
client = OpenAI()
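
`OpenAI()` called without arguments reads the API key from the `OPENAI_API_KEY` environment variable. A small check like the sketch below can surface a missing key before any request is made. The helper name `api_key_configured` is hypothetical, and the sketch assumes the key is supplied through the environment rather than passed to the client explicitly.

```python
import os

def api_key_configured(env=None):
    """Return True if a non-empty OPENAI_API_KEY is present in the given mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())

if not api_key_configured():
    print("OPENAI_API_KEY is not set; the client will not be able to authenticate.")
```
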
## Start coding here, use as many cells as you need

# Tool (function) schema describing the fields to extract from each transcription
function_definition = [
    {
        'type': 'function',
        'function': {
            'name': 'infor_extractor',
            'description': 'Extracts information from a medical transcription',
            'parameters': {
                'type': 'object',
                'properties': {
                    'age': {
                        'type': 'string',
                        'description': "The patient's age as stated in the transcription"
                    },
                    'medical_speciality': {
                        'type': 'string',
                        'description': 'The medical specialty of the transcription'
                    },
                    'treatment_recommendation': {
                        'type': 'string',
                        'description': 'The treatment recommended in the transcription'
                    },
                    'ICD Code': {
                        'type': 'string',
                        'description': 'International Classification of Diseases (ICD) code to be matched to each recommended treatment'
                    }
                }
            }
        }
    }
]

# System prompt shared by every request
system_message = {
    'role': 'system',
    'content': 'You are a medical AI assistant. Your task is to extract the requested pieces of information from each transcription. Make sure to match each recommended treatment with the corresponding International Classification of Diseases (ICD) code.'
}

# Column names must match the property names in function_definition exactly
columns = ['age', 'medical_speciality', 'treatment_recommendation', 'ICD Code']

# Send one request per transcription so that each row gets its own tool call
rows = []
for specialty, transcription in zip(df['medical_specialty'], df['transcription']):
    messages = [
        system_message,
        {'role': 'user', 'content': f'medical specialty: {specialty} transcription: {transcription}'}
    ]
    response = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=messages,
        tools=function_definition,
        # Force a call to the extraction function instead of a plain-text reply
        tool_choice={'type': 'function', 'function': {'name': function_definition[0]['function']['name']}}
    )
    function_call = response.choices[0].message.tool_calls[0].function
    arguments = json.loads(function_call.arguments)
    rows.append({col: arguments.get(col) for col in columns})

# Build the structured DataFrame once, after all rows have been collected
df_structured = pd.DataFrame(rows, columns=columns)
print(df_structured)
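
Because the ICD codes come from a language model, it may be worth checking that each returned value at least has the shape of an ICD-10 code before using it for billing. The sketch below is one possible check, assuming ICD-10-CM-style codes; the helper name `looks_like_icd10` is hypothetical. It only tests the format and does not confirm that a code exists or fits the case.

```python
import re

# Assumed shape of an ICD-10-CM-style code: a letter, a digit, one alphanumeric
# character, then optionally a dot followed by one to four alphanumeric characters.
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")

def looks_like_icd10(code):
    """Return True if the value has the general shape of an ICD-10 code."""
    if not isinstance(code, str):
        return False
    return ICD10_SHAPE.fullmatch(code.strip().upper()) is not None

for candidate in ["J45.909", "E11", "unknown", None]:
    print(candidate, looks_like_icd10(candidate))
```

Applying the helper to the ICD column of `df_structured` with `.map(looks_like_icd10)` would flag rows whose code is missing or malformed and should be reviewed by hand.
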