Bias in Healthcare Data Science

Not to sound like an alarmist, but most people don’t really understand the algorithms and data science that control many parts of our lives. It’s an intimidating subject to start learning about, but it’s an important one. Whether you understand it or not, data science is impacting your finances, what content you get served on social media platforms, and yes, even your healthcare. We want to ensure that every person is able to understand the basics of how these things work, and how we’re doing things differently at Spora to usher in a new era of equitable, data-driven healthcare.

The Basics of Data Science

Every single day you have hundreds of interactions and make thousands of decisions, each of which generates data points out in the world: the time, your location, the money you spent, and more. Data science (DS) is the process of making sense of all of that information, using models to predict your next move so life can be easier, safer, healthier, or less expensive. For example, when you take an Uber, data is collected on where you got picked up, where you’re going, the time of day, traffic, cost, and a number of other factors. Uber uses DS to help drivers understand where to be, how much to charge, and the best route to take. Data science and the models built to make predictions are powerful tools with the potential to make our lives better, but they are also vulnerable to bias, which can be introduced in a number of ways.

Bias in Healthcare Data Science

There are several different kinds of bias that can affect data science, but sampling bias is one that impacts many healthcare decisions specifically. Sampling bias occurs when some members of a population are more likely to be selected for a sample than others (in the medical field this is also known as ‘ascertainment bias’).

Take a moment to do an internet search for “dry face skin” and scroll through the images section. What do you see? A lot of white skin! An overwhelming majority of the images show what dry facial skin looks like on people with very light complexions. If you learned from Google’s image search results, you’d probably get pretty good at identifying dry facial skin on people with that complexion, but what about people with darker skin tones? With roughly 40% of the U.S. population identifying as non-White, Google’s images are not a representative dataset for studying what dry facial skin looks like; if we used this data, we’d be introducing sampling bias. Now imagine what that bias looks like when machine learning (ML) algorithms are trained on datasets like these and then used to automate many of our healthcare decisions. If models are built on data that doesn’t represent the population they’re going to be used on, that bias can have dire consequences for underrepresented people.

Researchers at the University of Chicago found that pervasive algorithmic bias is infecting countless daily decisions about how patients are treated by hospitals, insurers, and other businesses. Their report points to a gaping hole in oversight that is allowing deeply flawed products to seep into care with little or no vetting, in some cases perpetuating inequitable treatment for more than a decade before being discovered.
— Casey Ross, ‘Nobody is catching it’: Algorithms used in health care nationwide are rife with bias

The Centers for Disease Control and Prevention (CDC) works to reduce this bias in the way it collects data, acknowledging that machine learning systems are only as good as the datasets they’re trained on. The CDC intentionally over-samples underrepresented populations and publishes those datasets publicly after anonymizing them. In theory, these datasets allow us to create more equitable machine learning models, if used properly.

How to Create Equitable Machine Learning Models

1. Get your data right (Random Over/Under sampling)

As mentioned previously, more often than not the datasets that models are built on don’t quite represent the populations they’re meant to make predictions about. The result is an imbalanced dataset made up of two groups, which we’ll call the majority and the minority.

Let’s say we tried to build a model that could help us detect diabetes. Roughly 10.5% of the U.S. population has diabetes (the minority), meaning 89.5% does not (the majority). If we modeled the likelihood of having diabetes off of the entire population as-is, it would be very easy to create a bad model that is 89.5% accurate simply by predicting that nobody has diabetes. That model is technically right most of the time, yet it is wrong for every single person who actually has diabetes.
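
To make that concrete, here’s a minimal sketch (an illustration only, not Spora’s code) of the “predict nobody has diabetes” model and why its 89.5% accuracy is misleading:

```python
# Illustration: a model that predicts "no diabetes" for everyone scores
# ~89.5% accuracy yet catches zero actual cases.
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.random(100_000) < 0.105   # True = has diabetes (~10.5% of people)
y_pred = np.zeros_like(y_true)         # the lazy model: predict False for everyone

accuracy = (y_pred == y_true).mean()   # ~0.895 -- looks impressive
recall = y_pred[y_true].mean()         # 0.0 -- misses every person with diabetes

print(f"accuracy: {accuracy:.1%}, recall: {recall:.1%}")
```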

How do we fix this so the model works for the minority population that has diabetes? We could either randomly under-sample the majority group or over-sample the minority group. Either way, we’d end up with an equal number of diabetic and non-diabetic folks, ensuring that our dataset is balanced 50/50. A small sketch of both approaches follows.
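
Here’s one way random under- and over-sampling might look in pandas; the column name has_diabetes is made up for the example:

```python
# Minimal sketch with pandas; "has_diabetes" is a hypothetical column name.
import pandas as pd

def undersample_majority(df, label, seed=0):
    """Randomly drop majority rows until both classes are the same size."""
    n = df[label].value_counts().min()
    return df.groupby(label, group_keys=False).apply(
        lambda g: g.sample(n=n, random_state=seed))

def oversample_minority(df, label, seed=0):
    """Randomly duplicate minority rows until both classes are the same size."""
    n = df[label].value_counts().max()
    return df.groupby(label, group_keys=False).apply(
        lambda g: g.sample(n=n, replace=True, random_state=seed))

# 895 non-diabetic rows vs. 105 diabetic rows -> both return a 50/50 dataset
df = pd.DataFrame({"has_diabetes": [False] * 895 + [True] * 105})
print(undersample_majority(df, "has_diabetes")["has_diabetes"].value_counts())
print(oversample_minority(df, "has_diabetes")["has_diabetes"].value_counts())
```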

This is clearly an oversimplified example, but it can get complex quickly. Let’s say we also want to balance this dataset on folks’ sex at birth, which takes us from two groups (diabetic / non-diabetic) to four groups (female diabetics / male diabetics / female non-diabetics / male non-diabetics) that we’d want to balance in our sample. Now add in ethnicity, and you can see how quickly the complexity escalates. It’s tedious, and not the sexy modeling that data scientists love to do, but it’s essential for building a model that’s equitable.
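
The same idea extends to several strata at once. A hypothetical sketch, grouping on a few made-up columns and shrinking every group to the size of the smallest:

```python
# Hypothetical sketch: balance across diabetes status, sex at birth, and
# ethnicity at once. All column names are made up for illustration.
import pandas as pd

def balance_strata(df, columns, seed=0):
    smallest = df.groupby(columns).size().min()
    return df.groupby(columns, group_keys=False).apply(
        lambda g: g.sample(n=smallest, random_state=seed))

# balanced = balance_strata(raw_df, ["has_diabetes", "sex_at_birth", "ethnicity"])
```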

2. Modeling made simple (Model Complexity)


When you hear artificial intelligence (AI) or machine learning talked about, it’s usually in relation to self-driving cars, facial recognition, or a chatbot. These are all impressive technologies, but they require very complex models that oftentimes make decisions that are not equitable. When people don’t fully understand how or why a model makes its predictions, all kinds of unintentional bias can be introduced without a data scientist ever realizing it.

The solution to this is easier said than done, but it’s doable. Instead of spending weeks creating a complex model of the data, we spend those weeks understanding the relationships between the data points and take a simpler approach. Using that understanding in combination with probability theory, we can oftentimes get similar, if not better, results that are less biased than those from an advanced AI model, because we took the time to understand every step of the process.
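
As one illustration of what a simpler, inspectable model can look like (not our production model, just a sketch), consider a logistic regression whose coefficients can be read feature by feature:

```python
# Illustration only: a logistic regression's coefficients show exactly how
# each input pushes a prediction up or down, unlike an opaque deep model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2_000, n_features=5, n_informative=3,
                           random_state=0)

model = LogisticRegression().fit(X, y)

# Each coefficient is inspectable, so a surprising or inequitable
# relationship can be spotted and questioned before the model ships.
for i, coef in enumerate(model.coef_[0]):
    print(f"feature_{i}: {coef:+.2f}")
```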

3. Put it to the test (Analyzing the Test set)

When creating a model to help us make sense of large datasets and predict outcomes, the full dataset is split into two fundamental groups: the training set and the testing set.

The training set is the part of the dataset used to teach the machine learning algorithm how to predict the outcome, as in the diabetes example from earlier. Oftentimes this is a large percentage of the data (up to 75%) that ‘teaches’ the model how to effectively identify people who have diabetes.

The testing set is the remaining portion of the dataset used to test the effectiveness of the model we built with the training set.  
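
Here’s a minimal sketch of that split using scikit-learn; the 75/25 proportion mirrors the figure above, and the data itself is synthetic:

```python
# Sketch: split one synthetic dataset into a 75% training set and a 25%
# testing set, keeping the positive / negative ratio equal in both halves.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, weights=[0.895], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, stratify=y, random_state=0)

print(len(X_train), len(X_test))  # 750 250
```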

When a model is being measured for effectiveness, we look at the results it produces on the testing set. To evaluate those results we look at five measurements derived from what is called “the confusion matrix” (a quick sketch computing them follows the list). They are:

  1. Accuracy: Percentage of correct predictions. This is how often we get it right overall.

  2. Precision: Percentage of predicted positive cases that are actually positive. This is how often we’re right when we say someone has the target condition (ex: diabetes).

  3. Negative Predictive Value: Percentage of predicted negative cases that are actually negative. This is how often we’re right when we say someone does not have the target condition (ex: non-diabetic).

  4. Recall: Proportion of actual positive cases correctly identified. This is how many of the positive cases we got right out of the total number of positive cases.

  5. Specificity: Proportion of actual negative cases correctly identified. This is how many of the negative cases we got right out of the total number of negative cases.
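
Here’s a small sketch showing how those five measurements fall out of the four cells of a confusion matrix on a tiny toy test set:

```python
# Sketch: the five measurements computed from the four cells of a confusion
# matrix (true/false positives and negatives) on made-up predictions.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # 1 = diabetic, 0 = non-diabetic
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # the model's predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # correct predictions overall
precision   = tp / (tp + fp)                   # predicted positives that were right
npv         = tn / (tn + fn)                   # predicted negatives that were right
recall      = tp / (tp + fn)                   # actual positives we caught
specificity = tn / (tn + fp)                   # actual negatives we caught

print(accuracy, precision, npv, recall, specificity)
```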

Looking at the testing set as a whole, you may see some great values, but a key step is making sure that your model works for the people it’s going to be used on. Once you break the results out by demographic group, you can start to see severe differences in how well your model performs.
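
To see why group-level checks matter, here’s a toy sketch (the names and numbers are made up) where overall accuracy looks fine while one group is badly underserved:

```python
# Toy sketch: overall accuracy hides a large gap between demographic
# groups until the results are broken out.
import pandas as pd

results = pd.DataFrame({
    "ethnicity": ["White"] * 6 + ["Black"] * 4,
    "y_true":    [1, 0, 1, 0, 0, 1,  1, 0, 1, 1],
    "y_pred":    [1, 0, 1, 0, 0, 1,  0, 0, 0, 0],
})

results["correct"] = results["y_true"] == results["y_pred"]

print(f"overall accuracy: {results['correct'].mean():.0%}")  # 70% -- looks okay
print(results.groupby("ethnicity")["correct"].mean())        # Black 0.25, White 1.00
```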


There are biases that are difficult to detect without careful auditing by an independent party. “If you're not doing this audit—if you're not looking for bias—then you can pretty much guarantee that you're releasing biased algorithms,” said Chris Hemphill, vice president of applied AI at SymphonyRM. “You can have a model that's performing really well overall, but then when you start breaking it down by gender and ethnicity, you start seeing different levels of performance.”

Back to Google’s image search: imagine training a model on the images from the “dry face skin” search, where a large percentage of the images show very light-skinned individuals. Suppose that on the test set we get an accuracy of, say, 90%, meaning we got it right 90% of the time. That looks like a great model, but knowing the training set is heavily biased towards light-skinned folks, it would be unreasonable to expect 90% accuracy for darker-skinned folks. The precision for darker-skinned individuals might be close to 0% because the model has never learned what dry skin looks like on darker skin. Is this an equitable model? No, but without proper due diligence in testing, models like this make it out into the world and affect diagnoses, treatment plans, and healthcare costs.

How we do it at Spora Health

Data science can very easily go south if it isn’t handled appropriately, so you’re probably wondering how we do things differently here at Spora Health. We’ve built proprietary, equitable machine learning technology that helps people of color in the US understand their risks for certain chronic diseases. The purpose is to help each Spora Health member visualize and understand their risks while working with a healthcare provider to reduce those risks through tailored treatment plans.

How do we keep bias out of the process?

First, we use publicly available datasets curated by the CDC that oversample underrepresented populations, which minimizes sampling bias. From the start, our models are built on data representative of the populations we’re treating.

Next, we focus on diseases that disproportionately impact the Black American population. Our Founding Physician and Chief Medical Officer decides what these diseases are to ensure that we can impact the largest number of folks and help prevent the chronic conditions that most often impact communities of color.

Then we select the data points that our machine learning algorithms will be based on to make sure that every question can be answered quickly and easily. Oftentimes data questionnaires are filled with questions that are difficult to answer or understand, which can skew results. To that end, we only look at data from questions that are straightforward.

After we build our selection of questions, we start looking at the demographics of those who answered. When we talk about demographics, we define this as ethnicity and sex at birth. We do this because we want to ensure that our models don’t just work for one subsection of the population, but all of them. When we build our initial dataset we look at the breakout of demographics to determine what steps we need to take to ensure we’re going to have an equitable model (once again, over/under-sampling). 

Once we’ve ensured that our training dataset is equitably constructed, we pick simple machine learning algorithms that help prevent unintentional bias. We’re able to use these simple models because we spend so much of our time preparing the data for modeling and balancing the dataset so that there’s an even distribution across all groups of people.

Finally, when we feel we have a usable model, we analyze how the trained model performs on the testing set, evaluating the five confusion-matrix measurements for each demographic group. If it works well for one demographic but is ineffective for another, we go back, re-analyze the dataset, and build a better model. If that re-analysis doesn’t work, which sometimes happens, we won’t use the model at all: if it doesn’t work for everyone, it doesn’t work for us.

The data science team at Spora Health acknowledges that data science has not been equitable since its inception, but we also see it as a powerful tool with the potential to transform healthcare, if we take the time to do it correctly. Helping people better understand and access the information that affects their healthcare is the work we’re committed to doing.

If you enjoyed this content and would like to see a different topic related to data science covered please feel free to email me directly at waco@sporahealth.com and I’ll be sure to reply and let you know where that topic sits in the blog workflow!


If you are interested in using our proprietary machine learning technology to better understand your health risks, sign up at sporahealth.com

Waco Holve

A solution-oriented data scientist with heart, Waco has been examining solutions through math, algorithms and systems in the fintech and healthcare fields for some time. When it comes to Spora’s vision and mandate, Waco realizes his privilege as a white person and aims to do what he can to help foster a space of health equity that translates into a seamless, data-based and intuitive experience for those wanting to improve their health–and ultimately their lives.
