Medical professionals often summarize patient encounters in transcripts written in natural language, which include details about symptoms, diagnosis, and treatments. These transcripts can be used for other medical documentation, such as for insurance purposes, but as they are densely packed with medical information, extracting the key data accurately can be challenging.

You and your team at Lakeside Healthcare Network have decided to leverage the OpenAI API to automatically extract medical information from these transcripts and automate the matching with the appropriate ICD-10 codes. ICD-10 codes are a standardized system used worldwide for diagnosing and billing purposes, such as insurance claims processing.

The Data

The dataset contains anonymized medical transcriptions categorized by specialty.

transcriptions.csv

| Column | Description |
| --- | --- |
| "medical_specialty" | The medical specialty associated with each transcription. |
| "transcription" | Detailed medical transcription texts, with insights into the medical case. |

First, I'll import the necessary libraries and classes, including typing to annotate the data extractor assistant's function and logging for error observability. I'll also import the openai, pandas, and json libraries.

# --- Import the necessary libraries ---
import pandas as pd
from openai import OpenAI, APIError   
from typing import Optional
import json
import logging

After that, I'll configure the logger to display messages from the INFO level up, adding the date/time information, the message level, and the message itself.

# --- Logging Config ---
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%d-%m-%Y %H:%M:%S'
)
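To see what this format produces, here is a minimal sketch that applies the same format and date strings to a separate, hypothetical logger (format_demo) writing to an in-memory buffer, so it does not interfere with the configuration above. The logger name and the messages are made up for illustration.

```python
import io
import logging

# Hypothetical demo logger that writes to an in-memory buffer instead of the console
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(
    fmt='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%d-%m-%Y %H:%M:%S',
))

demo_logger = logging.getLogger("format_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(handler)
demo_logger.propagate = False  # Keep the demo messages out of the root logger

demo_logger.debug("Hidden: DEBUG is below the INFO threshold")
demo_logger.info("Data loaded")
demo_logger.warning("Empty response")

# Each line has the shape "<dd-mm-YYYY HH:MM:SS> - <LEVEL> - <message>"
log_lines = buffer.getvalue().splitlines()
for line in log_lines:
    print(line)
```

Only the INFO and WARNING messages are written, since DEBUG falls below the configured level.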

Then I'll load the data by reading the transcriptions.csv file into a DataFrame object.

# --- Load the data ---
df = pd.read_csv("data/transcriptions.csv")
df.head()

Here I'll initialize the OpenAI client and set the DataFrame display option to an unlimited column width, so that long transcriptions aren't truncated when the DataFrame is displayed. This setting only changes how pandas prints the data; the model receives the full transcription string either way. After that, I'll build the data extractor assistant as a function that calls the OpenAI API, so that I can apply it to the transcription column, df['transcription'], instead of the whole DataFrame.
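As a quick check of what display.max_colwidth does, the sketch below builds a made-up one-row DataFrame with a long string and compares the printed view with the stored value. The variable names and the text are illustrative only.

```python
import pandas as pd

# Made-up long text standing in for a transcription
long_text = "x" * 200
demo = pd.DataFrame({"transcription": [long_text]})

# With a 50-character limit, the printed view shortens the cell with "..."
with pd.option_context("display.max_colwidth", 50):
    truncated_view = repr(demo)

# With no limit (None), the printed view keeps the cell text whole
with pd.option_context("display.max_colwidth", None):
    full_view = repr(demo)

# The stored value is the same full string regardless of the display option
stored_length = len(demo.loc[0, "transcription"])
print(stored_length)
print(long_text in truncated_view)
```

The option affects only the printed representation, so anything read from df['transcription'] is the complete text.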

The data extractor assistant is instructed to extract the information from the transcription and return it as a JSON object with the corresponding keys. The API's reply is actually a string formatted as JSON, so I'll use the json.loads() function to parse it into a Python dictionary, and later turn the collected dictionaries into a DataFrame.
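The parsing step can be sketched on its own, without calling the API. The reply string below is invented for illustration and is not real model output; it only mimics the four keys requested from the assistant.

```python
import json
import pandas as pd

# Invented example reply: a JSON-formatted *string*, not yet a Python object
raw_reply = (
    '{"age": 23, "medical_specialty": "Allergy / Immunology", '
    '"recommended_treatment": "Nasal spray", "icd_10_code": null}'
)

record = json.loads(raw_reply)       # str -> dict; JSON null becomes Python None
records_df = pd.DataFrame([record])  # list of dicts -> one row per dict

print(type(record).__name__)
print(records_df.columns.tolist())
```

Each dictionary becomes one row, and the dictionary keys become the column names.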

# --- Initialize the OpenAI client ---
client = OpenAI()

# --- Setting the display of the Pandas DataFrame ---
pd.set_option('display.max_colwidth', None) # Prevents text from being truncated

# --- Data Extractor Assistant Function ---
def extract_data_from_text(transcription_text: str, model: str = "gpt-5") -> Optional[dict]:
    """
    AI data extractor that extracts patient information from transcription text.

    Args:
        transcription_text (str): Transcription text of the patient being analyzed.
        model (str): OpenAI model being prompted. Default is "gpt-5".

    Returns:
        dict with the extracted data, or None if the request or the parsing fails.
    """

    # --- System message ---
    system_message = """
    You are an AI assistant specialized in the medical field. Your purpose is to extract information from patient transcriptions. Return data as a JSON object with the keys: 'age', 'medical_specialty', 'recommended_treatment', and 'icd_10_code'.
    """
    
    # --- User prompt ---
    prompt = f"""
    Analyze the medical transcription delimited by triple backticks and extract the following data:
    1. Patient age (age)
    2. Medical specialty involved (medical_specialty)
    3. Treatment or procedure involved (recommended_treatment)
    4. ICD-10 code corresponding to the main condition (icd_10_code)

    If a piece of information is not found in the transcription, use the null value in the corresponding field.
    
    ```{transcription_text}```
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role":"system", "content": system_message},
                {"role":"user","content":prompt}
            ],
            response_format={"type":"json_object"}
        )
    
        # Guard against an empty choices list before indexing into it
        if response.choices and response.choices[0].message.content:
            return json.loads(response.choices[0].message.content)
        else:
            logging.warning("Received an empty response from the OpenAI API.")
            return None

    except APIError as e:
        logging.error(f"An OpenAI API error has occurred: {e}")
    except json.JSONDecodeError as e:
        logging.error(f"Error while decoding JSON response: {e}")
    except Exception as e:
        logging.critical(f"An unexpected error has occurred: {e}")
    return None  # Reached only after one of the errors above was logged
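Since the model's reply is free-form JSON, the returned dictionary may miss a key or carry extra ones. A minimal sketch of one way to guard against that is shown below; normalize_record and EXPECTED_KEYS are hypothetical names introduced here and are not part of the function above.

```python
from typing import Optional

# The four keys requested in the system message
EXPECTED_KEYS = ("age", "medical_specialty", "recommended_treatment", "icd_10_code")

def normalize_record(record: Optional[dict]) -> Optional[dict]:
    """Hypothetical helper: keep only the expected keys, filling gaps with None."""
    if not isinstance(record, dict):
        return None  # Failed extraction or unexpected reply type
    return {key: record.get(key) for key in EXPECTED_KEYS}

# Made-up record with two missing keys and one unexpected key
normalized = normalize_record({"age": 42, "icd_10_code": None, "extra": "ignored"})
print(normalized)
```

Applying a helper like this to each result would give every row the same set of columns, at the cost of silently dropping any unexpected fields.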

Now I can apply the function to the df['transcription'] column and store the result in extracted_data. I'll then drop the failed (None) results and convert the remaining dictionaries into a list with the tolist() function to build a DataFrame. Lastly, I'll concatenate the original DataFrame with the extracted data DataFrame and call the result df_structured.

# --- Applying the data extractor assistant and generating the final DataFrame ---
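extracted_data = df['transcription'].apply(extract_data_from_text)
valid = extracted_data.dropna()  # Drop rows where the extraction failed
extracted_df = pd.DataFrame(valid.tolist(), index=valid.index)  # Keep the original row labels
df_structured = pd.concat([df, extracted_df], axis=1)  # Columns are aligned on the row index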
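One thing worth checking in this step is row alignment when some extractions return None. The sketch below uses made-up data (df_demo, extracted_demo) to show how building the extracted DataFrame with and without the original index changes what pd.concat produces.

```python
import pandas as pd

# Made-up stand-ins: three transcriptions, the second extraction failed
df_demo = pd.DataFrame({"transcription": ["note A", "note B", "note C"]})
extracted_demo = pd.Series([{"age": 30}, None, {"age": 55}])

valid = extracted_demo.dropna()  # Remaining index labels: 0 and 2

# Without the index, rows are renumbered 0, 1 and shift up after the gap
misaligned = pd.concat([df_demo, pd.DataFrame(valid.tolist())], axis=1)

# With the original index, each record stays on its own transcription row
aligned = pd.concat([df_demo, pd.DataFrame(valid.tolist(), index=valid.index)], axis=1)

print(misaligned)
print(aligned)
```

In the misaligned result, the record extracted from "note C" ends up next to "note B"; passing index=valid.index keeps it on the third row and leaves the failed row empty.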

I've successfully organized the medical transcriptions into structured data using the OpenAI API.

# --- Visualizing Results ---
df_structured.head()
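Because the ICD-10 codes come from a language model, a basic sanity check on their shape can help flag obviously malformed values. The sketch below is only a rough format check under an assumed pattern (a letter, a digit, one alphanumeric character, then an optional dot followed by one to four alphanumerics); it does not confirm that a code exists or matches the condition. The helper looks_like_icd10 and the sample values are hypothetical.

```python
import re

# Assumed shape only: letter, digit, alphanumeric, optional "." plus 1-4 alphanumerics
ICD10_SHAPE = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")

def looks_like_icd10(code) -> bool:
    """Hypothetical helper: True if `code` is a string with a plausible ICD-10 shape."""
    if not isinstance(code, str):
        return False  # Covers None and other non-string values
    return ICD10_SHAPE.fullmatch(code.strip().upper()) is not None

# Sample values chosen only to exercise the pattern
samples = ["J30.1", "e11.9", "Z00", "12345", "", None]
checks = [looks_like_icd10(code) for code in samples]
print(checks)
```

A check like this could be applied to the icd_10_code column with .apply() to list the rows that deserve a manual review.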