
Medical professionals often summarize patient encounters in natural-language transcripts that record symptoms, diagnoses, and treatments. These transcripts feed other medical documentation, such as insurance claims, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts and match it to the appropriate ICD-10 codes automatically. ICD-10 is a standardized classification system used worldwide to code diagnoses, including for billing purposes such as insurance claims processing.

The Data

The dataset contains anonymized medical transcriptions categorized by specialty.

transcriptions.csv

| Column | Description |
| --- | --- |
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |

# Import the necessary libraries
import pandas as pd
from openai import OpenAI
import json
# Load the data
df = pd.read_csv("data/transcriptions.csv")
df.head()
# Serialize the DataFrame to CSV text so it can be embedded in the prompt below
df_str = df.to_csv(index=False)
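
Embedding the whole dataset in a single prompt can exceed the model's context window when the file is large. A minimal sketch of one way around that, assuming the rows are available as a list of dictionaries (with pandas, `df.to_dict("records")` would produce one): split the rows into small CSV batches and send one batch per request. The helper name `rows_to_csv_batches`, the default batch size, and the sample rows are hypothetical and not part of the original notebook.

```python
import csv
import io


def rows_to_csv_batches(rows, batch_size=5):
    """Yield CSV strings, each holding a header plus at most `batch_size` rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not rows:
        return
    fieldnames = list(rows[0].keys())
    for start in range(0, len(rows), batch_size):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows[start:start + batch_size])
        yield buffer.getvalue()


# Hypothetical placeholder rows, standing in for df.to_dict("records")
sample_rows = [
    {"medical_specialty": "Cardiology", "transcription": "placeholder note 1"},
    {"medical_specialty": "Neurology", "transcription": "placeholder note 2"},
    {"medical_specialty": "Urology", "transcription": "placeholder note 3"},
]

batches = list(rows_to_csv_batches(sample_rows, batch_size=2))
print(len(batches))
```

Each batch string could then take the place of `df_str` in the user message, with the per-batch results concatenated afterwards.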
# Initialize the OpenAI client (the API key is read from the OPENAI_API_KEY environment variable)
# Fields to extract from each transcription:
# age, recommended treatment, ICD code, and medical specialty
client = OpenAI()

messages = [
    {'role':'system', 'content': '''
    You have been provided an anonymized dataset of medical transcriptions organized by specialty. You need to extract the data fields from EACH transcription and return them as a JSON array under the key "data".
    '''},
    {'role':'user', 'content': f'''
    The file containing the data to be extracted is delimited by triple backticks: ```{df_str}```. Perform the following tasks to extract the data from each transcription of this CSV File:
    - EXTRACT 'age' and 'medical_specialty', plus a new data field, 'recommended_treatment', that stores the treatment recommended in each transcription.
    - Match each recommended treatment with the corresponding International Classification of Diseases (ICD) code.
    '''}
]

# Tool schema: a single "data" array whose items hold the four extracted fields
function_definition = [{
    'type': 'function',
    'function': {
        'name': 'extract_data',
        'description': 'Extracts the age, medical_specialty, recommended_treatment, and ICD code from each transcription',
        'parameters': {
            'type': 'object',
            'properties': {
                'data': {
                    'type': 'array',
                    'items': {
                        'type': 'object',
                        'properties': {
                            'age': {'type': 'string',
                                    'description': 'Age of the patient as stated in the transcription'},
                            'recommended_treatment': {'type': 'string',
                                                      'description': 'Recommended treatment extracted from the transcription'},
                            'ICD_code': {'type': 'string',
                                         'description': 'International Classification of Diseases code'},
                            'medical_specialty': {'type': 'string',
                                                  'description': 'Medical specialty of the transcription'},
                        }
                    }
                }
            }
        }
    }
}]
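
The schema declares four string fields per record, but nothing in it marks them as required, so a returned record may omit one or carry another type. A small check along these lines could flag such records before they reach a DataFrame. This is only a sketch: the helper name `find_invalid_records` and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("age", "recommended_treatment", "ICD_code", "medical_specialty")


def find_invalid_records(records):
    """Return (index, problem) pairs for records that miss a field or hold a non-string value."""
    problems = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append((index, "not an object"))
            continue
        for field in REQUIRED_FIELDS:
            if field not in record:
                problems.append((index, f"missing {field}"))
            elif not isinstance(record[field], str):
                problems.append((index, f"{field} is not a string"))
    return problems


# Hypothetical records: the first is well formed, the second is not
sample_records = [
    {"age": "58", "recommended_treatment": "placeholder", "ICD_code": "I10",
     "medical_specialty": "Cardiology"},
    {"age": 58, "recommended_treatment": "placeholder",
     "medical_specialty": "Cardiology"},
]

for index, problem in find_invalid_records(sample_records):
    print(f"record {index}: {problem}")
```

An empty result means every record has all four fields as strings; it says nothing about whether the extracted values are medically correct.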
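# Force the model to call extract_data so the reply always carries tool-call arguments
response = client.chat.completions.create(
    model='gpt-4o-mini',
    messages=messages,
    tools=function_definition,
    tool_choice={'type': 'function', 'function': {'name': 'extract_data'}}
)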
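
Indexing `tool_calls[0]` directly assumes the model actually called the tool and returned well-formed JSON. The sketch below shows one defensive way to read the records from a response object. The helper names `tool_call_records` and `fake_response` are hypothetical, and `fake_response` only imitates the attribute layout of a chat completion so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def tool_call_records(response, key="data"):
    """Return the list stored under `key` in the first tool call's arguments."""
    message = response.choices[0].message
    tool_calls = getattr(message, "tool_calls", None)
    if not tool_calls:
        raise ValueError("the model replied without a tool call")
    try:
        arguments = json.loads(tool_calls[0].function.arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool call arguments are not valid JSON: {exc}") from exc
    records = arguments.get(key) if isinstance(arguments, dict) else None
    if not isinstance(records, list):
        raise ValueError(f"expected a list under the key {key!r}")
    return records


def fake_response(arguments=None):
    """Build a stand-in object shaped like a chat completion (for illustration only)."""
    tool_calls = None
    if arguments is not None:
        function = SimpleNamespace(arguments=arguments)
        tool_calls = [SimpleNamespace(function=function)]
    message = SimpleNamespace(tool_calls=tool_calls)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


records = tool_call_records(fake_response('{"data": [{"age": "58"}]}'))
print(records)
```

With a real response, `tool_call_records(response)` would return the same list that `json.loads(...)['data']` yields, while raising a descriptive `ValueError` in the failure cases.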

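# The tool-call arguments arrive as a JSON string
output = response.choices[0].message.tool_calls[0].function.arguments
print(output)

# Parse the JSON and load the extracted records into a DataFrame
parsed_output = json.loads(output)
data = parsed_output['data']
df_structured = pd.DataFrame(data)
print(df_structured.head())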
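
Because the ICD codes come from a language model, it may be worth checking that each one at least has the shape of an ICD-10 code before using it downstream. The sketch below assumes ICD-10-CM style formatting (a letter, a digit, a third alphanumeric character, then an optional decimal point followed by up to four alphanumeric characters). It checks the format only and cannot tell whether a code exists or fits the treatment; the function name `looks_like_icd10` is hypothetical.

```python
import re

# Assumed ICD-10-CM shape: letter, digit, alphanumeric, optional ".XXXX" suffix
ICD10_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?")


def looks_like_icd10(code):
    """Return True when `code` has the general shape of an ICD-10 code."""
    if not isinstance(code, str):
        return False
    return ICD10_PATTERN.fullmatch(code.strip().upper()) is not None


for candidate in ["I10", "E11.9", "S72.001A", "123", "E11.", ""]:
    print(repr(candidate), looks_like_icd10(candidate))
```

Applied to the structured output, something like `df_structured['ICD_code'].map(looks_like_icd10)` would give a boolean column marking rows that need manual review.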