http://www.code4pa.tech/
Code4PA is a codeathon that encourages learning, collaboration, growth, innovation, and fun among PA’s network of technical talent. Through a series of collaborative events, teams will utilize state and local data to generate ideas, designs, prototypes and/or apps to increase transparency and efficiency for public engagement with the government.
This year’s theme is to help Pennsylvania turn data into insights for the Opioid Epidemic.
Throughout this article, I will discuss Team Oracular Cypher’s goals, the project’s details and results, and further recommendations on how to scale this project. We welcome anyone with user-interface experience to use the code.
Project Overview
In this challenge, my team – Oracular Cypher – used a deep learning neural network model to predict the survival of opioid users who overdosed, based on publicly available overdose data collected by emergency personnel.
Data
We used the Overdose Information Network Data collected by the Pennsylvania State Police. The data was obtained from the data.pa.gov website, which hosts a large number of publicly available data sets.
While we understand that the data is reported voluntarily and may not be representative of the totality of overdose incidents, we still think it is incredibly valuable and gives us a close enough approximation to run a viable model.
Additionally, the data paints an alarming picture of overdoses in Pennsylvania: more than 4,500 incidents, caused by various substances, within the last year alone.
This data included various demographic information, suspected overdose drug information, naloxone administration, and survival information (Y/N).
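For readers who want to follow along in Python, a minimal loading step might look like the sketch below; the filename is illustrative and assumes the CSV export has been downloaded from data.pa.gov.

```python
import pandas as pd

# Assumes the Overdose Information Network CSV export has been downloaded
# from data.pa.gov; the filename here is illustrative, not the exact one.
df = pd.read_csv("overdose_information_network.csv")

print(df.shape)             # on the order of 4,500 incidents
print(df.columns.tolist())  # demographics, suspected drug, naloxone, survival
```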
Aim and Intended Audience
The goal of this project was to provide local lawmakers and members of the community with an assessment tool that could also function as an early detection mechanism to help mitigate the insidious effects of this disease. The results of the model could help local representatives and community members understand which populations in their community are most acutely affected and, in turn, allocate more money, provide more preventative care, and supply better treatment options.
Exploratory Data Analysis
Before exploring the data in depth and building our neural network, we wanted to take a look at it and see what information could be gleaned. Below you can see various visualizations created using Tableau.
In the Tableau tabs below, you will find the following:
- County Rank by Number of Opioid Induced Deaths
- Overdose Incidents by County
- Naloxone Administration per County
- Overdose Incidents by Race
- Suspected Overdose Drugs by Age and Number of Survivors
- Naloxone Administration per Suspected Overdose Drug
**NOTE** Naloxone is a synthetic drug, structurally similar to morphine, that blocks the opioid receptors in the nervous system; it is most often used to reverse an overdose and prevent death. Read more about it here.
Insight from our Exploratory Data Analysis:
- Allegheny County ranks number one as the county with the highest number of overdose incidents in the state of Pennsylvania.
- Sullivan, Cameron, Warren, and Venango Counties did not report any information, making it impossible to draw insights for them.
- In PA counties, naloxone is, on average, more likely to be administered than not; this could be due to various factors such as emergency personnel readiness and overall response time.
- On average, white residents of Pennsylvania experience more overdose incidents than any other race.
- Black residents of Pennsylvania are the second largest affected population.
- Heroin appears to be the most common suspected overdose drug among people in their mid-thirties. However, it also has the highest survival rate. This could be due to naloxone administration and an overall education campaign.
- The data shows that people who overdose on Heroin do, in fact, receive Naloxone care the majority of the time. In the last year, 2,215 people received Naloxone as a result of overdosing on Heroin, compared to the 481 Heroin overdose patients who did not.
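Our charts were built in Tableau, but the underlying counts can be reproduced in pandas. Here is a sketch, continuing from the loading step above; the column names are assumptions based on the dataset description, not the exact headers.

```python
# Overdose incidents by county, highest first (Allegheny tops the list).
incidents_by_county = df["incident_county"].value_counts()
print(incidents_by_county.head())

# Naloxone administration per suspected overdose drug.
naloxone_by_drug = (
    df.groupby("suspected_drug")["naloxone_administered"]
      .value_counts()
      .unstack(fill_value=0)
)
print(naloxone_by_drug.loc["Heroin"])  # e.g., Y: 2,215 vs. N: 481
```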
Data Wrangling in Python
The data needed some cleaning to be workable; here are some of the steps we took (a pandas sketch of these steps follows the list).
- Create a new Average Age column from the Age Start and Age End columns.
- Extract the Incident Month from the Incident Date.
- Drop columns that contribute little to our analysis or that contain substantial numbers of null values.
- Check the number of null values in the remaining columns and examine the data shape to prevent too much loss to the data set.
- Rename columns according to Python conventions for easier handling.
- Visualize some of the data to get a clearer picture.
- Remove ‘unknown’ from the Survive column to train the model better.
- Although removing the ‘unknown’ rows cost us about 250 data points, we still had more than 4,000 overall.
- Check for relationships among the variables using correlation matrices.
**Note** All of the steps and the code can be seen in the notebook below.
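For a flavor of what those steps look like in pandas, here is a sketch, continuing from the snippets above; the column names and label values are assumptions based on the dataset description, not the exact notebook code.

```python
import pandas as pd

# Rename columns to snake_case for easier handling.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# New average-age column from the age range boundaries.
df["average_age"] = (df["age_range_start"] + df["age_range_end"]) / 2

# Extract the incident month from the incident date.
df["incident_month"] = pd.to_datetime(df["incident_date"]).dt.month_name()

# Drop columns with little analytic value or substantial null counts
# (the column list here is illustrative).
df = df.drop(columns=["incident_id", "response_desc"])

# Check the remaining nulls and the data shape before discarding rows.
print(df.isnull().sum())
print(df.shape)

# Remove 'unknown' survival outcomes so the model trains on definite labels
# (costs roughly 250 rows, leaving 4,000+).
df = df[df["survive"].isin(["Y", "N"])]
```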
Neural Network Modeling
“Artificial Neural Network (ANN) uses the processing of the brain as a basis to develop algorithms that can be used to model complex patterns and prediction problems.” Read more about it in this excellent blog post.
Deep learning models typically require very large amounts of data, but we were only able to supply ours with approximately 4,000 data points. Because no single raw feature correlated with overdose survival or death, we needed machine learning to recognize patterns and produce insights. For even more accurate insights, we recommend gathering more data and feeding it into the model.
The model was built using Keras, which is a high-level neural networks API, written in Python, that can run on top of TensorFlow.
The Architecture of the Neural Network Model:
- A three-layer sequential model.
- The inner layers used the Rectified Linear Unit (ReLU) activation function.
- ReLU is an activation function that returns 0 for any negative input and the input value itself for any positive input, i.e. ReLU(x) = max(0, x). This function helps our model capture non-linear relationships.
- The last layer used a softmax activation function.
- The softmax function is used for multi-class classification problems; it squashes the output of each unit to be between 0 and 1, which is what we need for this model.
- I chose categorical cross-entropy for the model loss and Adam as the optimizer.
- Categorical cross-entropy, or log loss, measures the performance of a classification model. We want our cross-entropy to be as close to zero as possible.
- Adam (Adaptive Moment Estimation) computes adaptive learning rates for each parameter and can handle sparse gradients on noisy problems.
- Our data was split into 70% for training and 30% for testing.
**Note** More information about the model and the code can be seen in the notebook below.
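Here is a minimal sketch of a model matching the description above, built with Keras; the hidden-layer widths, epoch count, and encoding details are illustrative assumptions rather than the exact notebook values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers

# One-hot encode the categorical features and the survival label,
# continuing from the wrangled DataFrame above.
X = pd.get_dummies(df.drop(columns=["survive"])).astype("float32")
y = pd.get_dummies(df["survive"]).astype("float32")  # columns: N, Y

# 70% training, 30% testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Three-layer sequential model: two ReLU hidden layers, softmax output.
model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    layers.Dense(64, activation="relu"),
    layers.Dense(32, activation="relu"),
    layers.Dense(2, activation="softmax"),
])
model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

model.fit(X_train, y_train, epochs=50, batch_size=32,
          validation_data=(X_test, y_test))
print(model.evaluate(X_test, y_test))  # [loss, accuracy]
```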
Model Results
Model Accuracy = 82%
Our model yielded better than 80% accuracy, a score strong enough to support predictions on completely new data inputs.
Predicting New Input
We then used the trained model to make predictions on new inputs, estimating whether a given person would survive.
Person 1
We used the following demographic information:
- Incident Month: January
- Average Age: 35
- Day of the Incident: Monday
- Incident County Name: Delaware
- Gender: Male
- Race: White
- Ethnicity: Not Hispanic
- Victim’s State: Pennsylvania
- Victim’s County: Delaware
- Accidental Exposure: No
- Suspected Overdose Drug: Cocaine/Crack
- Naloxone Administered: Yes
Prediction: This person will Survive! This is great news!
Person 2
- Incident Month: July
- Average Age: 25
- Day of the Incident: Monday
- Incident County Name: Allegheny
- Gender: Female
- Race: Black
- Ethnicity: Not Hispanic
- Victim’s State: Pennsylvania
- Victim’s County: Philadelphia
- Accidental Exposure: No
- Suspected Overdose Drug: Heroin
- Naloxone Administered: No
Prediction: This person will Survive! Also great news!
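Here is a sketch of how such a new record can be scored, continuing from the model above; `predict_survival` is a hypothetical helper, and the field names must line up with the training columns.

```python
import pandas as pd

# Hypothetical helper: encode one record the same way the training data
# was encoded, align it to the training columns, and score it.
def predict_survival(record: dict) -> str:
    row = pd.get_dummies(pd.DataFrame([record]))
    row = row.reindex(columns=X.columns, fill_value=0).astype("float32")
    probs = model.predict(row)[0]
    y_index = list(y.columns).index("Y")
    return "Survive" if probs[y_index] > 0.5 else "Not survive"

person_1 = {
    "incident_month": "January",
    "average_age": 35,
    "incident_day": "Monday",
    "incident_county": "Delaware",
    "gender": "Male",
    "race": "White",
    "ethnicity": "Not Hispanic",
    "victim_state": "Pennsylvania",
    "victim_county": "Delaware",
    "accidental_exposure": "No",
    "suspected_drug": "Cocaine/Crack",
    "naloxone_administered": "Yes",
}
print(predict_survival(person_1))  # -> "Survive"
```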
Full Code in Python
Further Recommendations
- We highly recommend building a user interface for policy makers and families to input information and interact with the model themselves.
- We recommend that emergency responders continue to collect data to grow the data sets and help maximize the model’s accuracy.
- We also recommend that more data regarding response time, response description, and revive action description be collected in order to test for the significance of those inputs.
This initiative was incredibly important. My team and I are grateful to have had the chance to help people in whatever way we can. Civic engagement no longer has to be restricted to traditional forms such as canvassing; now, technical skills can be harnessed for the advancement of advocacy efforts and humanity as a whole.
Thanks to the organizers of Code4PA, Oracular Cypher was able to participate and bring our data skills to the codeathon. Special thanks to an amazing teammate, Mr. Frank Guzman, for making this submission a reality.