Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnoses, and treatments. These transcripts can feed other medical documentation, such as insurance paperwork, but because they are densely packed with medical information, extracting the key data accurately can be challenging.
You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts automatically and match it to the appropriate ICD-10 codes. ICD-10 is a standardized coding system used worldwide for recording diagnoses and for billing, including insurance claims processing.
The Data
The dataset contains anonymized medical transcriptions categorized by specialty.
transcriptions.csv
| Column | Description |
|---|---|
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json

# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

# Initialize the OpenAI client
client = OpenAI()
## Start coding here, use as many cells as you need
def extract_medical_info(transcription: str, original_specialty: str) -> dict:
    """
    Use the OpenAI API to extract:
    - age
    - medical_specialty
    - recommended_treatment
    - icd10_code
    from a single transcription.
    """
    system_message = (
        "You are a medical coding assistant. "
        "Read the medical transcription and extract the patient's age (if stated), "
        "the medical specialty, the recommended treatment or plan, and the most "
        "appropriate primary ICD-10 diagnosis code.\n\n"
        "Return ONLY valid JSON with the following keys:\n"
        " - age: integer or null if unknown\n"
        " - medical_specialty: string (if uncertain, use the provided original specialty)\n"
        " - recommended_treatment: short string summarizing the recommended treatment or plan\n"
        " - icd10_code: string with the primary ICD-10 code (e.g., 'J20.9'). "
        "If no reasonable code can be inferred, use 'R69'."
    )
    user_message = (
        "Here is the transcription and metadata.\n\n"
        f"Original medical specialty: {original_specialty}\n\n"
        f"Transcription:\n{transcription}"
    )
    # Call the Chat Completions API in JSON mode
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # or the model specified in the project instructions
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_message},
        ],
        temperature=0,
    )
    content = response.choices[0].message.content
    # Try to parse the JSON returned by the model
    try:
        data = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Fallback if the content is not valid JSON, or is None (TypeError)
        data = {
            "age": None,
            "medical_specialty": original_specialty,
            "recommended_treatment": "",
            "icd10_code": "R69",
        }
    # Ensure all required keys exist
    data.setdefault("age", None)
    data.setdefault("medical_specialty", original_specialty)
    data.setdefault("recommended_treatment", "")
    data.setdefault("icd10_code", "R69")
    return data
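
The model's JSON is not guaranteed to match the requested types, so a light post-processing step may help. The following is a minimal sketch and not part of the original solution: `normalize_record` and `ICD10_SHAPE` are hypothetical names, and the regular expression assumes ICD-10-CM style codes (a letter, a digit, an alphanumeric character, then an optional dot followed by up to four alphanumeric characters). It checks the shape only and does not confirm that a code exists in the official code set.

```python
import math
import re

# Shape check only (assumption: ICD-10-CM style codes such as "J20.9").
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def normalize_record(record: dict, fallback_code: str = "R69") -> dict:
    """Return a copy of `record` with a cleaned age and ICD-10 code."""
    cleaned = dict(record)

    # Coerce age to int when possible; anything else becomes None.
    age = cleaned.get("age")
    if isinstance(age, bool):
        age = None
    elif isinstance(age, str):
        match = re.search(r"\d+", age)
        age = int(match.group()) if match else None
    elif isinstance(age, float):
        age = int(age) if math.isfinite(age) else None
    elif not isinstance(age, int):
        age = None
    cleaned["age"] = age

    # Strip and upper-case the code; fall back if the shape looks wrong.
    code = str(cleaned.get("icd10_code") or "").strip().upper()
    cleaned["icd10_code"] = code if ICD10_SHAPE.fullmatch(code) else fallback_code
    return cleaned
```

It could be applied to each result, for example `info = normalize_record(extract_medical_info(transcription, original_specialty))`.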
# Build a list of structured records by iterating over the DataFrame
structured_records = []
for _, row in df.iterrows():
    transcription = row["transcription"]
    original_specialty = row["medical_specialty"]
    info = extract_medical_info(transcription, original_specialty)
    structured_records.append(info)
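
As written, a single failed API call (a timeout or a rate limit, for instance) would abort the whole loop. One possible mitigation is a small retry wrapper with exponential backoff. This is a sketch under assumptions: `call_with_retries` is a hypothetical helper, the default attempt count and delay are arbitrary, and catching every `Exception` is a simplification that would normally be narrowed to the client's transient error types.

```python
import time


def call_with_retries(fn, *args, attempts: int = 3, base_delay: float = 1.0, **kwargs):
    """Call fn(*args, **kwargs), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Wait base_delay, then 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** attempt)
```

Inside the loop, the call would then read `info = call_with_retries(extract_medical_info, transcription, original_specialty)`.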
# Create the final structured DataFrame
df_structured = pd.DataFrame(structured_records)
# Make sure the required columns exist and are ordered nicely
required_cols = ["age", "medical_specialty", "recommended_treatment", "icd10_code"]
for col in required_cols:
    if col not in df_structured.columns:
        df_structured[col] = None
df_structured = df_structured[required_cols]
df_structured.head()
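
Because `R69` doubles as the fallback code, its share of the results gives a rough upper bound on how often extraction failed or no code could be inferred. The sketch below tallies codes from a list of record dictionaries; `summarize_codes` is a hypothetical helper, not part of the original solution, and it cannot tell a fallback apart from a case where the model chose `R69` deliberately.

```python
from collections import Counter


def summarize_codes(records: list[dict], fallback_code: str = "R69") -> dict:
    """Count ICD-10 codes and report the share equal to the fallback code."""
    counts = Counter(r.get("icd10_code") for r in records)
    total = len(records)
    fallback_share = counts[fallback_code] / total if total else 0.0
    return {"counts": dict(counts), "fallback_share": fallback_share}
```

With the DataFrame above, the input could be produced with `df_structured.to_dict("records")`.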