Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnoses, and treatments. These transcripts can feed other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.
You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts automatically and match it to the appropriate ICD-10 codes. ICD-10 is a standardized coding system used worldwide for diagnosis and billing purposes, such as insurance claims processing.
The Data
The dataset contains anonymized medical transcriptions categorized by specialty.
transcriptions.csv
| Column | Description |
|---|---|
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |
# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json
import re

# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()

# Initialize the OpenAI client
client = OpenAI()

## Start coding here, use as many cells as you need
test = """
SUBJECTIVE:, This 23-year-old white female presents with complaint of allergies. She used to have allergies when she lived in Seattle but she thinks they are worse here. In the past, she has tried Claritin, and Zyrtec. Both worked for short time but then seemed to lose effectiveness. She has used Allegra also. She used that last summer and she began using it again two weeks ago. It does not appear to be working very well. She has used over-the-counter sprays but no prescription nasal sprays. She does have asthma but doest not require daily medication for this and does not think it is flaring up.,MEDICATIONS: , Her only medication currently is Ortho Tri-Cyclen and the Allegra.,ALLERGIES: , She has no known medicine allergies.,OBJECTIVE:,Vitals: Weight was 130 pounds and blood pressure 124/78.,HEENT: Her throat was mildly erythematous without exudate. Nasal mucosa was erythematous and swollen. Only clear drainage was seen. TMs were clear.,Neck: Supple without adenopathy.,Lungs: Clear.,ASSESSMENT:, Allergic rhinitis.,PLAN:,1. She will try Zyrtec instead of Allegra again. Another option will be to use loratadine. She does not think she has prescription coverage so that might be cheaper.,2. Samples of Nasonex two sprays in each nostril given for three weeks. A prescription was written as well.
"""
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role":"system","content":"You are a medical scribe. Your task is to extract three values from the text given to you. These values are: age of patient, medical specialty, and the recommended treatment. You are to return these as a dictionary with keys 'age','medical_specialty','recommended_treatment'"},
#               {"role":"user","content":test}]
# )
def summarize_med_data(text):
    """Extract age, specialty, recommended treatment, and ICD code from one transcription."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": """You are a medical scribe. Your task is to extract four values from the text given to you. These values are: the age of the patient, the medical specialty, the recommended treatment, and the ICD-10 (International Classification of Diseases) code corresponding to the recommended treatment. Return these as a JSON object with the keys 'age', 'medical_specialty', 'recommended_treatment', and 'icd_code', so your response must be in the format:
{"age":value,"medical_specialty":value,"recommended_treatment":value,"icd_code":value}
"""},
            {"role": "user", "content": text},
        ],
    )
    # Strip newlines and tabs so the reply parses as a single JSON string
    data = re.sub(r'[\n\t]', '', response.choices[0].message.content)
    data_dict = json.loads(data)
    return data_dict

summarize_med_data(text=test)

# Apply the extraction to every transcription (one API call per row)
df['resps'] = df['transcription'].apply(summarize_med_data)

# Expand the list of dictionaries into one column per extracted field
df_structured = pd.DataFrame(df['resps'].tolist())
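
The function above passes the model's reply straight to json.loads, so a single reply that wraps the JSON in extra prose, or that is not valid JSON at all, would raise an exception and stop the whole apply. The following is a minimal sketch of a more forgiving parsing step, not part of the original notebook. The names parse_model_reply, looks_like_icd10, EXPECTED_KEYS, and ICD10_PATTERN are hypothetical, and the regular expression only approximates the general shape of an ICD-10-CM code (for example "J30.9"); it does not confirm that a code exists or that it matches the transcript.

```python
import json
import re

# Keys requested in the system prompt (hypothetical constant for this sketch)
EXPECTED_KEYS = ("age", "medical_specialty", "recommended_treatment", "icd_code")

# Approximate shape of an ICD-10-CM code: letter, digit, alphanumeric,
# then an optional dot followed by one to four alphanumerics.
# Assumption: this is a format check only, not a lookup against the code set.
ICD10_PATTERN = re.compile(r"^[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?$")


def parse_model_reply(raw):
    """Return a dict with EXPECTED_KEYS; every value is None if parsing fails."""
    result = dict.fromkeys(EXPECTED_KEYS)
    # Take the span from the first "{" to the last "}" so that any
    # surrounding prose in the reply is ignored.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return result
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return result
    if isinstance(parsed, dict):
        for key in EXPECTED_KEYS:
            result[key] = parsed.get(key)
    return result


def looks_like_icd10(code):
    """Check the format of a code only, not whether it is a real ICD-10 entry."""
    if not isinstance(code, str):
        return False
    return ICD10_PATTERN.match(code.strip().upper()) is not None
```

One way to use this would be to call parse_model_reply on response.choices[0].message.content in place of the re.sub and json.loads lines inside summarize_med_data, then flag rows where looks_like_icd10 returns False for manual review. Rows that fail to parse then show up as missing values in df_structured instead of aborting the run.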