After waiting ~10 months on the OpenAI waiting list for GPT-3 API access, I got an email a few weeks ago telling me I made it in!
Since my programming skills are best described as non-existent, I figured I’d experiment with stuff I do understand: medical trivia! Thanks Step 1 and 2…
A really short description of GPT-3: a neural network trained on a huge corpus of text that does natural language generation at (so far) unprecedented quality.
The upshot for the average person is that GPT-3 with the right “prompts” can be surprisingly human and knowledgeable. Here is a large collection of GPT-3 producing text in a variety of domains and here is the buzz it created on Twitter. I’ve seen demos of it producing poetry, speeding up creative prose writing, and pretending to be a therapist. It’s hard to overstate its versatility.
In this post, I’ll walk you through some examples of GPT-3 being prompted with medical topics in various formats, since I haven’t seen too many demos like that, besides this tweet on it.
If you’ve seen plenty of GPT-3 examples, skip to “Thoughts on application” for some ideas on applying GPT-3 to healthcare.
Some formatting tips:
BOLD is the prompt, which is text I fed into GPT-3
Plain text in block quotes is GPT-3, like “Example below“
Example
Italicized are my comments
Unless specified otherwise later on, all settings are the defaults for the Q and A preset.
A guide to using GPT-3 and a quick explanation of the parameters. The OpenAI documentation is excellent as well, even for a non-technical person like myself.
Disclaimer: absolutely nothing below is medical advice!
Begin
I am a highly intelligent question answering artificial physician. If you ask me a question that has a straightforward answer, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown".
Q: What lab value will be elevated in diabetics?
A: Blood glucose.
Q: What are the important treatments for a patient with Myxedema Coma?
A: levothyroxine, hydrocortisone, and IV fluids
Q: How do you manage an uncomplicated skin abscess in a patient with no comorbidities?
A: Incision and Drainage, maybe give antibiotics
Q: What is the treatment of Neurocysticercosis?
A: Albendazole
Q: How do you rule out appendicitis?
A: CT scan
Q: How do you diagnose a prolactinoma?
A: MRI
Q: What is the differential diagnosis of irregular menses and hirsutism?
A: Polycystic Ovarian Syndrome
It did pretty well giving simple answers to these questions.
At this point, I wanted to see what GPT-3 could do if I asked for a longer list. I first modified the prompt by adding a line to the intro: “If the question has multiple answers, I will list them in descending order of prevalence.”
That didn’t change the output, so I modified the preceding Q and A pairs to give GPT-3 more to model itself on. This is a key part of prompt engineering: don’t just tell GPT-3 what to do, give examples!
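For readers who would rather script this than paste text into the Playground, here is a minimal sketch of assembling an instruction-plus-examples prompt as a plain string. The function name `build_prompt` and the variable names are hypothetical, chosen for illustration; only the intro text and the Q/A pairs come from this post, and nothing here calls the OpenAI API.

```python
from typing import List, Tuple

# Intro text as used in this post's prompt.
INTRO = (
    "I am a highly intelligent question answering artificial physician. "
    "If you ask me a question that has a straightforward answer, I will give you the answer. "
    "If you ask me a question that is nonsense, trickery, or has no clear answer, "
    'I will respond with "Unknown". '
    "If the question has multiple answers, I will list the first three "
    "in descending order of prevalence."
)


def build_prompt(intro: str, examples: List[Tuple[str, str]], question: str) -> str:
    """Assemble a few-shot prompt: intro, worked Q/A examples, then the open question."""
    lines = [intro]
    for q, a in examples:
        lines.append(f"Q: {q}")
        lines.append(f"A: {a}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open so the model completes the answer
    return "\n".join(lines)


# Hand-written examples that show the model the answer style we want.
examples = [
    ("What lab value will be elevated in diabetics?",
     "Blood glucose, hemoglobin A1C, triglycerides."),
    ("How do you rule out appendicitis?", "CT scan, ultrasound"),
]

prompt = build_prompt(
    INTRO,
    examples,
    "What is the differential diagnosis of irregular menses and hirsutism?",
)
print(prompt)
```

The point of the design is that the worked examples, not the instruction sentence, carry most of the signal about the expected answer format.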
With more answers per question:
I am a highly intelligent question answering artificial physician. If you ask me a question that has a straightforward answer, I will give you the answer. If you ask me a question that is nonsense, trickery, or has no clear answer, I will respond with "Unknown". If the question has multiple answers, I will list the first three in descending order of prevalence.
Q: What lab value will be elevated in diabetics?
A: Blood glucose, hemoglobin A1C, triglycerides.
Q: What are the important treatments for a patient with Myxedema Coma?
A: levothyroxine, hydrocortisone, and IV fluids.
Q: How do you manage an uncomplicated skin abscess in a patient with no comorbidities?
A: Incision and Drainage, antibiotics.
Q: What is the treatment of Neurocysticercosis?
A: Albendazole, Praziquantel
Q: How do you rule out appendicitis?
A: CT scan, ultrasound
Q: How do you diagnose a prolactinoma?
A: MRI, prolactin levels
I wrote everything above myself, expanding the answers; none of it is GPT-3 generated. The following is:
Q: What is the differential diagnosis of irregular menses and hirsutism?
A: Polycystic ovarian syndrome, hyperandrogenism, androgen insensitivity syndrome, androgen-producing tumors, androgen-secreting tumors, androgen-producing adrenal tumors, androgen-secreting adrenal tumors, androgen-producing ovarian tumors, androgen-secreting ovarian tumors, androgen-producing thyroid tumors, androgen-secreting thyroid tumors, androgen-producing parathyroid tumors, androgen-secreting parathyroid tumors
This is a more complete answer but it’s less accurate. PCOS and “androgen-producing/secreting tumors” are correct answers, while “androgen-producing thyroid/parathyroid tumor” is not a real entity and Androgen insensitivity syndrome (AIS) would definitely not cause hirsutism—women with AIS tend to have sparse body hair. Also, secreting and producing are redundant in this context. And of course, it’s missing more common answers: CAH and exogenous hormone use, for instance, are both more common than androgen-secreting tumors.
So I modified the Temperature and Top_p parameters, with little effect:
Q: What is the differential diagnosis of irregular menses and hirsutism?
A: Polycystic ovarian syndrome, hyperandrogenism, androgen-secreting tumors, androgen-producing tumors, androgen resistance, androgen insensitivity, androgen-producing adrenal tumors, androgen-producing ovarian tumors, androgen-producing pituitary tumors, androgen-producing parathyroid tumors, androgen-producing thyroid tumors, androgen-producing parotid tumors, androgen-producing submandibular tumors, androgen-producing sublingual
That gives me more nonsensical answers.
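To make the two knobs less mysterious, here is a toy sketch of what temperature and top_p conventionally do to a next-token distribution. This is the standard textbook formulation (softmax with temperature, then nucleus filtering); I am assuming the API behaves along these lines rather than describing its actual implementation. The logits are made up for illustration and are not GPT-3 output.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0).

    Lower temperature sharpens the distribution; higher temperature flattens it.
    """
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def apply_top_p(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their cumulative mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    cumulative = 0.0
    for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}


# Made-up logits for four candidate next tokens (illustrative only).
logits = {"PCOS": 2.0, "CAH": 1.0, "hyperprolactinemia": 0.5, "parotid": -1.0}

cold = apply_temperature(logits, 0.1)   # nearly all mass on the top token
warm = apply_temperature(logits, 1.0)   # mass spread across several tokens
nucleus = apply_top_p(warm, 0.5)        # low-probability tail is cut off

print(cold)
print(warm)
print(nucleus)
```

In this toy setup, a low temperature makes the model almost always pick its top candidate, while a top_p of 0.5 simply removes the unlikely tail before sampling.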
Setting the temperature to 1 and Top_p to 0.5 should give me more creative answers, per the OpenAI documentation. That gives me the following:
Q: What is the differential diagnosis of irregular menses and hirsutism?
A: hyperprolactinemia, polycystic ovary syndrome, gestational trophoblastic disease (fancy for a very early dangerous invasive mole), other, never mind, funny question.
Note that GPT-3 seems to be more “informal” with this setting, but it is still incorrect: GTD would not generally cause this. So far, it does OK at answering these open-ended questions with multiple answers, but those are harder (at least for humans), so next I’ll try questions that pack more information into the question but have simpler answers. Some of these are modified from Zanki Step 2 decks.
Q: What is the likely diagnosis for a solid liver mass in a young woman on oral contraceptives?
A: Hepatoma
This is incorrect, so I changed it to hepatic adenoma and included the corrected answer as part of the prompt.
Q: What IBD complication presents with colonic dilation on X-ray and fever, tachycardia, leukocytosis, and/or anemia?
A: Toxic megacolon
That’s correct.
Q: What is the likely diagnosis in a patient with extensive alcohol use who presents with severe retrosternal pain and subcutaneous emphysema after vomiting?
A: Boerhaave syndrome
That’s correct. However, for unknown reasons, when I tried to reproduce this answer with the same prompt later on, it responded with “gastric ulcer” multiple times. I think I might have changed the spacing on the prompt or modified the parameters slightly, but I’m not sure. In any event, that demonstrates how important good prompts are to the output and how small changes in the prompt, even minor punctuation differences, can reduce the quality dramatically.
Q: What is the likely diagnosis in a patient on warfarin who complains of weakness and back pain with new-onset anemia?
A: Vitamin K deficiency
Incorrect. I’m looking for Retroperitoneal Hematoma. I tried a few different variations on this question, added extra information, and tried changing temperature and top_p around and it didn’t get this right. Other incorrect answers it generated:
Deep Venous Thrombosis, DIC
DIC presumably refers to Disseminated Intravascular Coagulation, though the OpenAI system flagged it as “unsafe” content 😆
Pars Inferior Myopathy,
which doesn’t seem to be real.
Summary of Q and A
While GPT-3 is impressively flexible, I’m not impressed by the medical Q&A prompt I developed.
Long-form
Modeled off this tweet, I tried longer-form questions to see if that helps. Some were taken from this freely available NBME sample test. These questions are designed for medical students who have completed their 2nd or 3rd year of American medical school, right before clinical rotations. Again BOLD is the prompt.
A 12-year-old girl is brought to the physician because of a 2-month history of intermittent yellowing of the eyes and skin. Physical examination shows no abnormalities except for jaundice. Her serum total bilirubin concentration is 3 mg/dL, with a direct component of 1 mg/dL. Serum studies show a haptoglobin concentration and AST and ALT activities that are within the reference ranges. There is no evidence of injury or exposure to toxins. Which of the following additional findings is most likely in this patient?
(A) Decreased activity of UDP glucuronosyltransferase
(B) Gallstones
(C) Increased hemolysis
(D) Increased serum alkaline phosphatase activity
(E) Ineffective erythropoiesis
Correct Answer:
A Rationale 1: The question stem identifies that the patient had a direct serum bilirubin concentration of 1 mg/dL, which is elevated, but not as high as an unconjugated (indirect) component. The question then directs the reader to consider the findings of decreased UDP glucuron
The correct answer IS A, but the reasoning that GPT-3 lays out is hard to follow. I would answer A for three reasons: there is no worrying physical exam finding, which makes all the answers except A unlikely; haptoglobin is normal, which makes E unlikely; and in general, an isolated finding of jaundice with no other worrying physical exam signs or lab abnormalities besides elevated bilirubin levels is classic for Gilbert’s Syndrome.
I tried refreshing that prompt, and on every other attempt it gave different, incorrect answers.
I then set Top_p=1 and temperature=0.1, which is supposed to reduce the creativity/randomness of the response.
That gave me A consistently (correct!), but with nonsensical explanations.
I decided to add more context by using that question and answer as part of the prompt and add another question, with settings of Top_p=1 and temperature=0.
A 12-year-old girl is brought to the physician because of a 2-month history of intermittent yellowing of the eyes and skin. Physical examination shows no abnormalities except for jaundice. Her serum total bilirubin concentration is 3 mg/dL, with a direct component of 1 mg/dL. Serum studies show a haptoglobin concentration and AST and ALT activities that are within the reference ranges. There is no evidence of injury or exposure to toxins. Which of the following additional findings is most likely in this patient?
(A) Decreased activity of UDP glucuronosyltransferase
(B) Gallstones
(C) Increased hemolysis
(D) Increased serum alkaline phosphatase activity
(E) Ineffective erythropoiesis
Correct Answer: A, because this is a classic presentation of Gilbert's Disease.
Question: A 30-year-old woman, gravida 2, para 0, aborta 1, at 28 weeks' gestation comes to the office for a prenatal visit. She has had one previous pregnancy resulting in a spontaneous abortion at 12 weeks' gestation. Today, her vital signs are within normal limits. Physical examination shows a uterus consistent in size with a 28-week gestation. Fetal ultrasonography shows a male fetus with no abnormalities. Her blood group is O, Rh-negative. The father's blood group is B, Rh-positive. The physician recommends administration of Rho(D) immune globulin to the patient. This treatment is most likely to prevent which of the following in this mother?
(A) Development of natural killer cells
(B) Development of polycythemia
(C) Formation of antibodies to RhD
(D) Generation of IgM antibodies from fixing complement in the fetus
(E) Immunosuppression caused by RhD on erythrocytes from the fetus
Correct Answer: C, because the fetus is Rh-positive and the mother is Rh-negative.
That’s correct. Trying a different Q:
A 52-year-old woman begins pharmacotherapy after being diagnosed with type 2 diabetes mellitus. Four weeks later, her hepatic glucose output is decreased, and target tissue glucose uptake and utilization are increased. Which of the following drugs was most likely prescribed for this patient?
(A) Acarbose
(B) Glyburide
(C) Metformin
(D) Nateglinide
(E) Repaglinide
Correct Answer: C, because metformin is the only drug listed that is a biguanide.
Here the answer is correct but the rationale is not. The right rationale is that metformin is the first-line agent for DM2, lowers hepatic glucose output, and increases target tissue glucose uptake. When I increased the sequence length, it started generating what seems like a copy-pasted, nonsensical question about hepatitis. Keeping the first two long-form question/answers (about Gilbert’s Disease and a pregnant woman), I tried a different question:
After being severely beaten and sustaining a gunshot wound to the abdomen, a 42-year-old woman undergoes resection of a perforated small bowel. During the operation, plastic reconstruction of facial fractures, and open reduction and internal fixation of the left femur are also done. Thirty-six hours postoperatively, she is awake but not completely alert. She is receiving intravenous morphine via a patient-controlled pump. She says that she needs the morphine to treat her pain, but she is worried that she is becoming addicted. She has no history of substance use disorder. She drinks one to two glasses of wine weekly. Which of the following initial actions by the physician is most appropriate?
(A) Reassure the patient that her chance of becoming addicted to narcotics is minuscule
(B) Maintain the morphine, but periodically administer intravenous naloxone
(C) Switch the patient to oral acetaminophen as soon as she can take medication orally
(D) Switch the patient to intramuscular lorazepam
(E) Switch the patient to intravenous phenobarbital
Correct Answer: C, because the patient is not addicted to morphine.
The actual correct answer is A, so it’s giving an incorrect answer with a plausible-sounding rationale. Overall, I’m impressed with its ability to generate plausible-sounding explanations and somewhat impressed by its ability to occasionally answer medical questions.
Fun with Probabilities
An advantage of human physicians is that they can tell you, “I think you’ve got this obscure disease, but I’m not 100% sure.” That is, we can express uncertainty about our diagnosis. Luckily, GPT-3 can as well. Here’s an example:
Clearly GPT-3 can pick up on question and answer formatting, and correctly use single-letter answers, since A, B, C, D, E are far more likely than other answers, which is impressive. It realizes this is a multiple choice question with only two examples! But, disappointingly, it only rates the correct answer (A) as 23% likely. Raising the temperature makes GPT-3 occasionally pick A by chance, but it doesn’t rate it as more likely than B or C.
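Percentages like these come from the model's log-probabilities for each candidate next token. A minimal sketch of the arithmetic, assuming you have a mapping from token to log-probability: the 23% for A comes from this post, while the other values are placeholders I made up so the example runs, and real tokens may carry leading whitespace.

```python
import math
from typing import Dict, Iterable


def logprobs_to_probs(logprobs: Dict[str, float]) -> Dict[str, float]:
    """Convert natural-log probabilities into plain probabilities."""
    return {tok: math.exp(lp) for tok, lp in logprobs.items()}


def renormalize(probs: Dict[str, float], choices: Iterable[str]) -> Dict[str, float]:
    """Restrict to the answer letters and rescale so they sum to 1."""
    subset = {c: probs.get(c, 0.0) for c in choices}
    total = sum(subset.values())
    if total == 0:
        raise ValueError("none of the choices has any probability mass")
    return {c: p / total for c, p in subset.items()}


# Hypothetical log-probabilities for the token after "Correct Answer:".
# Only the 0.23 for A is taken from the post; B-E are illustrative placeholders.
top_logprobs = {
    "A": math.log(0.23),
    "B": math.log(0.30),
    "C": math.log(0.27),
    "D": math.log(0.08),
    "E": math.log(0.05),
}

probs = logprobs_to_probs(top_logprobs)
renorm = renormalize(probs, "ABCDE")

for letter in "ABCDE":
    print(f"{letter}: raw {probs[letter]:.2f}, among choices {renorm[letter]:.2f}")
```

Renormalizing over the five letters ignores whatever probability the model puts on non-letter tokens, which makes the choices easier to compare with one another.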
If I add another sample question (the previous one, with the correct answer, becomes part of the prompt), does GPT-3 get better?
A 47-year-old man is brought to the emergency department 30 minutes after the onset of blurred vision, difficulty breathing, vomiting, diarrhea, muscle weakness, and fatigue. He is a farmer, and the symptoms began shortly after he spilled an agricultural chemical on his clothes. His pulse is 30/min, and blood pressure is 90/60 mm Hg. Cheyne Stokes respirations are labored with wheezing and gurgling. Further evaluation shows moderate dehydration, hypoxemia, and hypercapnia. Which of the following is the most appropriate immediate pharmacotherapy for this patient?
(A) Atropine
(B) Bethanechol
(C) Epinephrine
(D) Isoproterenol
(E) Physostigmine
Correct Answer: C
The actual answer is A, and A only has a 25% probability.
So GPT-3 doesn’t seem to get much better at this type of long medical question, at least not with my prompt engineering skills.
GPT-J-6B
I wanted to try an open-source equivalent of GPT-3, so I tried GPT-J-6B, which is, as of 07/10/2021, the best open-source text generator available. From my limited understanding of language models, it’s a little better than the smallest GPT-3 model available and not as good as the other GPT-3 models (including davinci, the one I used throughout) though it may be a bit more knowledgeable on technical subjects due to the dataset (“The Pile”) it was trained on.
Again, BOLD is my text, the rest is the AI:
A 12-year-old girl is brought to the physician because of a 2-month history of intermittent yellowing of the eyes and skin. Physical examination shows no abnormalities except for jaundice. Her serum total bilirubin concentration is 3 mg/dL, with a direct component of 1 mg/dL. Serum studies show a haptoglobin concentration and AST and ALT activities that are within the reference ranges. There is no evidence of injury or exposure to toxins. Which of the following additional findings is most likely in this patient?
(A) Decreased activity of UDP glucuronosyltransferase
(B) Gallstones
(C) Increased hemolysis
(D) Increased serum alkaline phosphatase activity
(E) Ineffective erythropoiesis
Correct Answer: A, because this is a classic presentation of Gilbert's Disease.
After being severely beaten and sustaining a gunshot wound to the abdomen, a 42-year-old woman undergoes resection of a perforated small bowel. During the operation, plastic reconstruction of facial fractures, and open reduction and internal fixation of the left femur are also done. Thirty-six hours postoperatively, she is awake but not completely alert. She is receiving intravenous morphine via a patient-controlled pump. She says that she needs the morphine to treat her pain, but she is worried that she is becoming addicted. She has no history of substance use disorder. She drinks one to two glasses of wine weekly. Which of the following initial actions by the physician is most appropriate?
(A) Reassure the patient that her chance of becoming addicted to narcotics is minuscule
(B) Maintain the morphine, but periodically administer intravenous naloxone
(C) Switch the patient to oral acetaminophen as soon as she can take medication orally
(D) Switch the patient to intramuscular lorazepam
(E) Switch the patient to intravenous phenobarbital
Correct Answer: D, because this is a classic presentation of the syndrome of opioid-induced hyperalgesia.
This is incorrect, though plausible-sounding.
Which of the following is the most dangerous complication of uncontrolled diabetes mellitus?
(A) Circulatory disturbances
(B) Hypoglycemia
(C) Hyperglycemia
(D) Coronary heart disease
(E) Hypertension
Correct Answer: E, because uncontrolled diabetes mellitus will inevitably lead to hypertension and associated complications.
Incorrect. Though this question is somewhat vague, D is the best answer, because the number 1 killer of people with type 2 diabetes is coronary heart disease. “hypertension” per se is not a complication of diabetes, but rather another condition which makes diabetes-related complications (especially coronary heart disease) worse.
GPT-3 as medical interviewer
The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. It will ask me questions geared at understanding my medical conditions.
Human: Hello, who are you?
AI: I am an AI created by OpenAI. I would like to know more about your health. What brings you in today?
Human: I feel sick. My head hurts.
Ai: I'm sorry to hear that. When did the pain start?
Human: like 3 hours ago. Hurts a lot.
AI: OK, deep breaths. Is your head still hurting?
Human: yes
AI: I see you are in a pain management program. Please state your primary condition.
Human: migraines
AI: Sorry to hear that. Are they a regular occurrence?
Human: yes
AI: How often do they happen?
Human: twice a week, at least
AI: Are they getting worse?
Human: yeah
AI: How do they affect your life?
Human: they make me miss school, a lotta stuff
AI: I'm sorry to hear that. Do you have a regular treatment?
Human: yeah, I take sumatriptan twice daily
AI: OK, it's obvious you need treatment for migraines, there are several options. Are you taking anything to stop your headaches?
Human: Yeah, I just told you, sumatriptan
AI: I'm sorry to hear that. Sumatriptan can only be used up to 3 times a week, unfortunately. Can you seek alternatives?
Human: Like what?
AI: I could recommend some OTC treatments if you want. How do you feel now?
Human: shitty
AI: Are you more nauseous than usual?
Human: not really
AI: Are you experiencing blurry vision?
Human: yeah my usual aura
AI: You describe auras well. Have you had them before?
Human: yes
AI: Have you seen a doctor for them before?
GPT-3 Summarizing
My second grader asked me what this passage means:
"""
Diagnosing Gilbert’s syndrome can be done by a blood test that detects elevated bilirubin levels. Most often the total level of bilirubin is less than 2 mg/dL, and the conjugated bilrubin level is normal[7]. A complete blood count (CBC) and a liver function test may also be performed to rule out other diseases than can cause elevated bilirubin. The combination of normal blood and liver function tests and elevated bilirubin levels is an indicator of Gilbert's syndrome[8]. Genetic testing is not needed[1]. Gilbert’s syndrome is generally present at birth but goes unnoticed until the late teens or early twenties[4].
"""
I rephrased it for him, in plain language a second grader can understand:
"""
A doctor can tell you have Gilbert’s syndrome by a blood test. The blood test will show that your bilirubin level is high, but that your liver is healthy.
Does anyone know what the word "bilirubin" means?
Can someone tell me what the word "conjugated" means?
Can someone tell me what the word "elevated" means?
Do you understand what this sentence means?
+_+_+_+
My second grader asked me what this passage means:
"""
Pancreatic cancer begins in the tissues of your pancreas — an organ in your abdomen that lies behind the lower part of your stomach. Your pancreas releases enzymes that aid digestion and produces hormones that help manage your blood sugar.
Several types of growths can occur in the pancreas, including cancerous and noncancerous tumors. The most common type of cancer that forms in the pancreas begins in the cells that line the ducts that carry digestive enzymes out of the pancreas (pancreatic ductal adenocarcinoma).
Pancreatic cancer is seldom detected at its early stages when it's most curable. This is because it often doesn't cause symptoms until after it has spread to other organs.
"""
I rephrased it for him, in plain language a second grader can understand:
"""
There are several kinds of growths in your pancreas that can turn into cancer. The kind that is most common is called pancreatic ductal adenocarcinoma. This kind of cancer starts in the cells that line the ducts that carry enzymes out of the pancreas.
This cancer usually isn't found until it's spread (metastasized) to other organs, like the liver.
My second grader asked me what this passage means:
"""
Diabetes mellitus refers to a group of diseases that affect how your body uses blood sugar (glucose). Glucose is vital to your health because it's an important source of energy for the cells that make up your muscles and tissues. It's also your brain's main source of fuel.
The underlying cause of diabetes varies by type. But, no matter what type of diabetes you have, it can lead to excess sugar in your blood. Too much sugar in your blood can lead to serious health problems.
Chronic diabetes conditions include type 1 diabetes and type 2 diabetes. Potentially reversible diabetes conditions include prediabetes and gestational diabetes. Prediabetes occurs when your blood sugar levels are higher than normal, but not high enough to be classified as diabetes. And prediabetes is often the precursor of diabetes unless appropriate measures are taken to prevent progression. Gestational diabetes occurs during pregnancy but may resolve after the baby is delivered.
"""
I rephrased it for him, in plain language a second grader can understand:
"""
Diabetes is when your body can't use blood sugar for energy. Your body needs blood sugar to keep your muscles and tissues strong. It also needs blood sugar to keep your brain working.
There are two types of diabetes. Type 1 diabetes happens when your body doesn't make enough insulin. Insulin helps your body use blood sugar for energy. Type 2 diabetes happens when your body doesn't use insulin well.
There are also other kinds of diabetes that aren't as common.
Overall, the summarizing is effective at accurately conveying information in simple language.
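The summarizing prompts above all follow one template, so they are easy to generate. Here is a minimal sketch, with a hypothetical helper name `build_summary_prompt`. Ending the prompt on an opening triple quote means a stop sequence of `"""` could cut the output off cleanly; that stop-sequence setting is my assumption, not something stated in this post.

```python
def build_summary_prompt(passage: str, audience: str = "second grader") -> str:
    """Wrap a passage in the 'explain it to a second grader' template used above."""
    return (
        f"My {audience} asked me what this passage means:\n"
        '"""\n'
        f"{passage.strip()}\n"
        '"""\n'
        f"I rephrased it for him, in plain language a {audience} can understand:\n"
        '"""\n'  # left open: the model's completion is the rephrased passage
    )


passage = (
    "Your pancreas releases enzymes that aid digestion and produces hormones "
    "that help manage your blood sugar."
)
prompt = build_summary_prompt(passage)
print(prompt)
```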
Thoughts on application
-Overall, I’m very impressed with GPT-3 and GPT-J-6B but frustrated with how finicky the prompt engineering can be.
-As imperfect as GPT-3 is with my beginner prompt-engineering skills, I can think of a few healthcare-related use cases that seem feasible in the near future (3-5 years):
If you could fine-tune GPT-3 a little more on a corpus of physician-patient interactions, it could probably do a not-terrible job taking the history of a patient, as I tried to get it to do above. With some function to take the responses of the patient as input and then summarize it (with the full patient responses available if necessary, of course), it might be able to speed up patient intake a bit, which is often slowed by the necessity of getting a patient’s history (“when did the pain start? Where is the pain? Do you also have numbness in that area? Take any medications lately?” etc.)
Patient would interact with a GPT-3 powered bot whose questions start general and get more specific and relevant as the conversation progresses.
The patient’s answers are summarized in a compact history.
Patient is presented with that summary, asked to verify or correct it.
Doctor is given that corrected/verified summary and then asks more specific and directed questions.
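The intake steps above can be sketched as a small loop. Everything here is hypothetical: the function names, the scripted questions, and the canned answers are stand-ins, and in a real system `ask_next` and `summarize` would be calls to a language model rather than the trivial stubs below.

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]  # (question, patient answer) pairs


def run_intake(
    ask_next: Callable[[Transcript], Optional[str]],
    patient_reply: Callable[[str], str],
    summarize: Callable[[Transcript], str],
    verify: Callable[[str], str],
    max_turns: int = 10,
) -> str:
    """Ask questions, summarize the answers, and let the patient verify or correct the summary."""
    transcript: Transcript = []
    for _ in range(max_turns):
        question = ask_next(transcript)  # would be a model call in practice
        if question is None:             # the bot has nothing more to ask
            break
        transcript.append((question, patient_reply(question)))
    draft = summarize(transcript)        # compact history (a model call in practice)
    return verify(draft)                 # patient confirms or edits before the doctor sees it


# --- Stubs standing in for the model and the patient (illustrative only) ---
SCRIPT = [
    "What brings you in today?",
    "When did the pain start?",
    "Are you taking any medications?",
]
ANSWERS = {
    "What brings you in today?": "My head hurts.",
    "When did the pain start?": "Like 3 hours ago.",
    "Are you taking any medications?": "Sumatriptan.",
}


def scripted_questions(transcript: Transcript) -> Optional[str]:
    return SCRIPT[len(transcript)] if len(transcript) < len(SCRIPT) else None


def canned_patient(question: str) -> str:
    return ANSWERS[question]


def naive_summary(transcript: Transcript) -> str:
    return " ".join(answer for _, answer in transcript)


def patient_accepts(draft: str) -> str:
    return draft  # a real patient might correct the draft here


history = run_intake(scripted_questions, canned_patient, naive_summary, patient_accepts)
print(history)
```

Keeping the verification step as its own function makes it explicit that the doctor only ever sees a summary the patient has had a chance to correct.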
It does a great job summarizing jargon-filled paragraphs into simple but still accurate paragraphs. This could be very useful for explaining diseases to patients with lower health literacy.
These could be custom-generated for patients with different diseases and prognoses. Of course, the physician would still give the patient their diagnosis, but patients could be given custom-generated summaries of their conditions instead of the usual “here’s an information sheet I found online for a vaguely related condition”.
Electronic Health Records have tons of autofilled fields and text, which make navigating them extremely annoying. Summarizing long sections of notes could be quite helpful to speed up chart review. I’d be wary of missing crucial details with this approach, though.
p.s. Working on another IRB-related post that should be out soon, if you enjoyed my last two posts.