Project: Organizing Medical Transcriptions with the OpenAI API with gpt-4o-mini
    Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnoses, and treatments. These transcripts can be reused for other medical documentation, such as insurance paperwork, but because they are densely packed with medical information, extracting the key data accurately can be challenging.

    You and your team at Lakeside Healthcare Network have decided to use the OpenAI API to extract medical information from these transcripts automatically and match it with the appropriate ICD-10 codes. ICD-10 is a standardized coding system used worldwide for diagnosis and billing purposes, such as insurance claims processing.

    The Data

    The dataset contains anonymized medical transcriptions categorized by specialty.

    transcriptions.csv

    Column                 Description
    "medical_specialty"    The medical specialty associated with each transcription.
    "transcription"        Detailed medical transcription texts, with insights into the medical case.

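Before sending anything to the API, it can be worth confirming that the file really has the two columns described above. The following is a minimal sketch under stated assumptions: `missing_columns` is a hypothetical helper, the sample row is made up for illustration, and the standard-library `csv` module is used only to keep the example dependency-free (the project itself loads the data with pandas).

```python
import csv
import io

# Column names taken from the dataset description above
EXPECTED_COLUMNS = {"medical_specialty", "transcription"}


def missing_columns(csv_file):
    """Return the expected columns that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    # fieldnames is None for an empty file, so fall back to an empty list
    return EXPECTED_COLUMNS - set(reader.fieldnames or [])


# Made-up sample standing in for data/transcriptions.csv
sample = io.StringIO(
    'medical_specialty,transcription\n'
    'Cardiology,"Illustrative transcription text."\n'
)
print(missing_columns(sample))  # an empty set means both columns are present
```

To check the real file, the same helper could be called on `open("data/transcriptions.csv", newline="")`.
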
    Before you start

    In order to complete the project you will need to create a developer account with OpenAI and store your API key as a secure environment variable. Instructions for these steps are outlined below.

    Create a developer account with OpenAI

    1. Go to the API signup page.

    2. Create your account (you'll need to provide your email address and your phone number).

    3. Go to the API keys page.

    4. Create a new secret key.

    5. Copy the key and store it somewhere safe. (If you lose it, delete the key and create a new one.)

    Add a payment method

    OpenAI sometimes provides free credits for the API, but this can vary depending on geography. You may need to add debit/credit card details.

    This project should cost less than 10 US cents in total with GPT-3.5-Turbo. Note that you are charged every time you rerun a task.

    1. Go to the Payment Methods page.

    2. Click Add payment method.

    3. Fill in your card details.
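
The cost note above can be turned into a rough back-of-the-envelope estimate before running anything. This is only a sketch: `estimate_cost_usd` is a hypothetical helper, the four-characters-per-token ratio is a common rule of thumb rather than an exact tokenizer count, the price passed in is a placeholder that should be replaced with the current figure from OpenAI's pricing page, and output tokens are ignored.

```python
def estimate_cost_usd(texts, usd_per_million_input_tokens, chars_per_token=4.0):
    """Roughly estimate input-token cost for sending `texts` to the API once."""
    total_chars = sum(len(text) for text in texts)
    # Approximate token count; a real tokenizer would give a more accurate figure
    estimated_tokens = total_chars / chars_per_token
    return estimated_tokens / 1_000_000 * usd_per_million_input_tokens


# Placeholder inputs: five made-up transcriptions of 4,000 characters each
# and a hypothetical price of 1.00 USD per million input tokens
example_texts = ["x" * 4000] * 5
print(f"Estimated cost: ${estimate_cost_usd(example_texts, 1.00):.4f}")
```

Because every rerun is billed again, multiplying the result by the number of planned runs gives a more realistic upper bound.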

    Add an environment variable with your OpenAI key

    1. In the workbook, click on "Environment" in the left sidebar.

    2. Click on the plus button next to "Environment variables" to add environment variables.

    3. In the "Name" field, type "OPENAI_API_KEY". In the "Value" field, paste in your secret key.

    4. Click "Create"; a pop-up window will appear. Click "Connect", then wait 5-10 seconds for the kernel to restart, or restart it manually from the Run menu.
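
Once the kernel has restarted, a quick check can confirm that the variable from the steps above is visible to Python before any API call is made. This is a minimal sketch, and `api_key_is_set` is a hypothetical helper; it only tests that the variable exists and is non-empty, not that the key is valid.

```python
import os


def api_key_is_set(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty in the environment."""
    # Accept a custom mapping so the check can be exercised without touching os.environ
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if not api_key_is_set():
    print("OPENAI_API_KEY is not set; add it under Environment variables and reconnect.")
```

The `OpenAI()` client reads `OPENAI_API_KEY` from the environment by default, which is why the key never needs to appear in the notebook itself.
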
    # Import the necessary libraries
    import pandas as pd
    from openai import OpenAI
    import json
    # Load the data
    df = pd.read_csv("data/transcriptions.csv")
    df.head()
    ## Start coding here, use as many cells as you need
    # Initialize the OpenAI client: make sure you have a valid API key named OPENAI_API_KEY in your Environment Variables
    client = OpenAI()
    
    
    def extract_info_with_openai(transcription):
        """Extracts age and recommended treatment or procedure from a transcription using OpenAI."""
        # Each message needs its own dictionary: repeating "role" and "content"
        # inside a single dictionary silently drops the system message
        messages = [
            {
                "role": "system",
                "content": "You are a healthcare professional and need to get the age and recommended treatment or procedure from a medical record transcript. Always return both age and recommended treatment or procedure: if any of the fields is missing in the transcript, return Not Found."
            },
            {
                "role": "user",
                "content": f"Return the age and recommended treatment or procedure for the patients from the body of the following transcription: {transcription}. "
            }
        ]
        function_definition = [
            {
                'type': 'function',
                'function': {
                    'name': 'extract_medical_data',
                    'description': 'Get the age and recommended treatment or procedure from the input text. Always return both age and recommended treatment or procedure: if any of the fields is missing in the transcript, return Not Found.',
                    'parameters': {
                        'type': 'object',
                        'properties': {
                            'Age': {
                                'type': 'integer',
                                'description': 'Age of the patient'
                            },
                            'Recommended Treatment/Procedure': {
                                'type': 'string',
                                'description': 'Recommended treatment or procedure for the patient'
                            }
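                        },
                        # Mark both fields as required so the model always returns them
                        'required': ['Age', 'Recommended Treatment/Procedure']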
                    }
                }
            }
        ]
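        response = client.chat.completions.create(
            model="gpt-4o-mini", # gpt-3.5-turbo
            messages=messages,
            tools=function_definition,
            # Force the function call so that tool_calls is always populated
            tool_choice={"type": "function", "function": {"name": "extract_medical_data"}}
        )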
        return json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    
    def get_icd_codes(treatment):
        """Retrieves ICD codes for a given treatment using OpenAI."""
        response = client.chat.completions.create(
            model="gpt-4o-mini", # gpt-3.5-turbo
            messages=[{
                "role": "user",
                "content": f"Provide the ICD codes for the following treatment or procedure: {treatment}. Return the answer as a list of codes with corresponding definition."
            }],
            temperature=0.3
        )
        return response.choices[0].message.content
    
    processed_data = []
    
    for index, row in df.iterrows():
        transcription = row['transcription']
        medical_specialty = row['medical_specialty']
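        extracted_data = extract_info_with_openai(transcription)
        # Fall back to "Not Found" if the model omitted the treatment field
        treatment = extracted_data.get("Recommended Treatment/Procedure", "Not Found")
        icd_code = get_icd_codes(treatment)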
        extracted_data["Medical Specialty"] = medical_specialty
        extracted_data["ICD Code"] = icd_code
        
        processed_data.append(extracted_data)
    
    df_structured = pd.DataFrame(processed_data)
    
    df_structured.head()
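
Because `get_icd_codes` returns the model's reply as free text, it can help to pull the codes out into a list before storing them. The following is a minimal sketch and not part of the original solution: `extract_icd10_codes` is a hypothetical helper, the sample reply is made up, and the pattern assumes ICD-10-CM-style codes (a letter, a digit, an alphanumeric character, then an optional dot and up to four more characters). Seven-character ICD-10-PCS procedure codes would need a different pattern, and matching the format does not confirm that a code actually exists.

```python
import re

# Assumed shape of an ICD-10-CM-style code, e.g. J45.909 or E11.9
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9][0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")


def extract_icd10_codes(reply_text):
    """Return the distinct code-like strings in reply_text, in order of appearance."""
    codes = []
    for code in ICD10_PATTERN.findall(reply_text):
        # Skip repeats so each code is listed once
        if code not in codes:
            codes.append(code)
    return codes


# Made-up reply in the "list of codes with corresponding definition" style
sample_reply = (
    "1. J45.909 - Unspecified asthma, uncomplicated\n"
    "2. E11.9 - Type 2 diabetes mellitus without complications"
)
print(extract_icd10_codes(sample_reply))
```

In the loop above, the helper could be applied to `icd_code` before it is stored, which would give one list of codes per row instead of a block of text.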