Unit 3: Self-Check Assignment 3: Diabetes Forecasting
This assignment builds on all of our previous work and introduces you to predictive analytics through a forecasting method called a binary classifier. We will then work on how to visualize and understand a binary classifier.
In this assignment, you will:
● Receive an introduction to binary classifiers, logistic regression, and their results, including true-positive, false-positive, true-negative, and false-negative results
● Run a binary classification algorithm on our diabetes data
● Visualize the results in Tableau
For this assignment, follow these steps:
1) Download the diabetes dataset if you need it
2) Learn about binary classifiers
3) Perform binary classification using a logistic regression in Python (this has been written for you; all you need to do is press ‘run’ in Colab)
4) Download the results
5) Visualize the results in Tableau
Attachments:
· Diabetes_Classifier.ipynb
· Diabetes.csv dataset
Download the Diabetes Dataset
If you need to download the dataset again, click on the following link:
Pima Indians Diabetes Database
(We just used this dataset in a previous assignment, so you very well may already have it handy.)
Learn About Binary Classifiers
The word “binary” in this context means “just two options.” Some common binary outcomes could be whether a consumer will respond to direct marketing outreach (binary outcomes: they buy or they don’t buy), whether a streaming subscriber will like a certain movie (binary outcomes: they give it thumbs-up or thumbs-down), or whether an attempted financial transaction is legitimate (binary outcomes: it’s legitimate, or it’s a fraud). The important part of a binary outcome is that there are exactly two options.
A classifier is an algorithm that takes one or more input variables and predicts the value of a different variable. The predicted value must come from a pre-selected list of categories.
A binary classifier, then, is an algorithm that takes one or more input variables and assigns each case to one of two mutually exclusive categories:
Problem Domain | Possible Input Variables (can have lots) | Binary Output Variable (2 values only)
Direct marketing | Age, income, gender of the consumer | Consumer buys or does not buy
Streaming subscriptions | Other movies they like, age of streamer | Thumbs-up or thumbs-down for this movie
Financial transactions | Dollar amount of transaction, country of
2. Upload two files:
a. Upload the “Diabetes_Classifier.ipynb” as a notebook:
b. Upload the “diabetes.csv” as a file uploaded to session storage:
Alt text: Google Colab
3. Run the first cell, the classifier model. You can ask ChatGPT to explain this to you more fully, but basically what we are doing here with this code is:
a. Importing a bunch of other code written by other people to help us build the model
b. Reading in the diabetes.csv dataset
c. Splitting the data into a training dataset (which we will use to build our logistic regression prediction model) and a testing dataset (which we will use to tell how good our model really was)
d. Running the model on our training data
e. Evaluating the model on our testing data
4. When the code in this cell has finished running, it gives a little confusion matrix. (Note this confusion matrix has its labels switched from the way StatQuest did them. If you are keeping close track of these things, you will notice that the matrix printed from this code has the actual values on the left and the predicted values on the top. If you are not keeping close track of these things, you don’t need to keep close track of this switch either.)
Alt text: StatQuest
5. Run the next cell to generate the output file we will use to visualize the results in Tableau. Your output should look something like this, and you should have a “diabetes_predicted.csv” file available for download. It may take a minute or two to run and another minute or two to refresh, and you can click the “refresh” icon if you want to see the output file the very minute it is available:
Alt text: Classifier
6. Let’s just look at the “diabetes_predicted.csv” file before we download it:
Alt text: csv file
a. Here, let’s look at the first row, Patient_ID 767. This person has a glucose of 126, BMI of 30.1, and an age of 47. This person also had an actual outcome of Diabetes (fourth column) but was predicted to have Not Diabetes (fifth column). The Model Results column classified this as a False Negative for this person (sixth column).
Question 6: Interpreting the Output File
Look further through the diabetes_predicted.csv file. For Patient_ID 526, what was their outcome?
A True positive
B False positive
C False negative
D True negative
7. Download the diabetes_predicted.csv file to your computer. We are now ready to visualize it using Tableau.
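The notebook's own code is not reproduced here, so the following is only a rough sketch of what steps 3c through 3e and the Model Results column are doing. Everything in it is an assumption made for illustration: the toy data stands in for diabetes.csv, the 70/30 split is a guess, and the logistic regression is fitted with plain gradient descent from the Python standard library rather than with whatever library the notebook imports.

```python
import math
import random

random.seed(0)  # fixed seed so the toy data and the split are repeatable

# Hypothetical toy data: two already-scaled inputs and a 0/1 outcome.
rows = []
for _ in range(200):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    outcome = 1 if x1 + 0.5 * x2 + random.gauss(0, 0.5) > 0.3 else 0
    rows.append(((x1, x2), outcome))

# Step c: split into training (70%) and testing (30%) rows (assumed ratio).
random.shuffle(rows)
cut = len(rows) * 7 // 10
train, test = rows[:cut], rows[cut:]

def probability(weights, bias, x):
    """Logistic function applied to a weighted sum of the inputs."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    z = max(-30.0, min(30.0, z))  # keep math.exp in a safe range
    return 1.0 / (1.0 + math.exp(-z))

# Step d: fit on the training rows with plain batch gradient descent.
weights, bias, rate = [0.0, 0.0], 0.0, 0.5
for _ in range(300):
    grad_w, grad_b = [0.0, 0.0], 0.0
    for x, y in train:
        error = probability(weights, bias, x) - y
        grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
        grad_b += error
    weights = [w - rate * g / len(train) for w, g in zip(weights, grad_w)]
    bias -= rate * grad_b / len(train)

def model_result(actual, predicted):
    """Name a row the way the Model Results column does."""
    truth = "True" if actual == predicted else "False"
    sign = "Positive" if predicted == 1 else "Negative"
    return f"{truth} {sign}"

# Step e: evaluate on the testing rows, which the model has never seen.
actuals = [y for _, y in test]
predictions = [1 if probability(weights, bias, x) >= 0.5 else 0 for x, _ in test]
results = [model_result(a, p) for a, p in zip(actuals, predictions)]

# Confusion matrix with actual values on the rows, predicted on the columns.
matrix = [[0, 0], [0, 0]]
for a, p in zip(actuals, predictions):
    matrix[a][p] += 1

print("          pred 0  pred 1")
for a in (0, 1):
    print(f"actual {a}{matrix[a][0]:8d}{matrix[a][1]:8d}")
```

Only the testing rows are labeled and counted, which is why the output file has fewer rows than the dataset you uploaded.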
Visualize the Results in Tableau
We can see that these sorts of output files can be difficult to interpret. Let’s use Tableau to help visualize them.
1. Fire up Tableau and import your diabetes_predicted.csv data file to Tableau. Be sure the file you import has both Actual Outcome Text and Predicted Outcome Text fields in it.
2. Check: You should have 231 total rows in this data source.
3. First, let’s make a basic bar graph: How many model results were true positives? False positives? Other values?
a. Drag the Model Results to the Columns bar and the diabetes_predicted.csv (Count) to the Rows. It should look a little bit like the skeleton below, but you should have bar charts here.
Alt text: csv file
Question 7: Interpreting the Output File
How did the model do? Of the 231 people in this dataset, what was the most frequent model result?
A True positive: 49% of the results were true positive
B False positive: 18 people had a false-positive result
C False negative: 32% of the results were a false negative
D True negative: 132 people had a true-negative result
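If you want to double-check the bar chart outside Tableau, the same tally can be sketched in a few lines of Python. The five-row sample below is made up, and the header text "Model Results" is assumed to match the column name in your file; swap in the real file to get real numbers.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for diabetes_predicted.csv. The real file has more
# columns and rows, and its exact header text is an assumption here.
sample = io.StringIO(
    "Patient_ID,Model Results\n"
    "1,True Negative\n"
    "2,True Negative\n"
    "3,True Positive\n"
    "4,False Negative\n"
    "5,True Negative\n"
)

counts = Counter(row["Model Results"] for row in csv.DictReader(sample))
total = sum(counts.values())

# One line per bar: the count (bar height) and its share of all rows.
for label, n in counts.most_common():
    print(f"{label:<15}{n:>4}{n / total:>6.0%}")
```

Printing both the count and the share is deliberate: answer choices like the ones above mix head counts with percentages, and the two are easy to confuse.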
4. Let’s take another look at these results, this time in a layout that is more akin to the confusion matrix we saw earlier.
a. Go to another worksheet
b. Put the Actual Outcome Text in the Rows area, and the Predicted Outcome Text in the Columns area:
Alt text: outcome
c. Then drag the diabetes_predicted.csv (Count) to the area with the “Abc” in it:
Alt text: csv file
d. You will now have the numbers of the actual and predicted outcomes summed up for you:
Alt text: predicted outcomes
e. Let’s get the Marks a bit fancier: Drag the diabetes_predicted.csv (Count) to the Size, and then drag diabetes_predicted.csv (Count) once again to the Label. Drag the Model Results to the Label as well, and expand your graphics so you can see the whole thing. You will get something that should look like this:
Alt text: predicted csv
Question 8: Interpreting the Visual Confusion Matrix
Look at your visual matrix. Which statements would you agree with? Select all that apply.
A If a person actually has diabetes, their results would be found on the top row.
B If a person actually does not have diabetes, their results would be found on the bottom row.
C If the model predicts diabetes, the majority of the people in this category will turn out to have diabetes.
D If the model predicts not diabetes, the majority of the people in this category will not turn out to have diabetes.
E If a person has diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.
F If a person does not have diabetes, the model is not great at predicting this; there will be a lot of incorrect predictions given.
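The visual matrix can be read in two directions, and it helps to see both spelled out. The sketch below uses made-up counts (not the ones in your file) to show the difference between reading along a row (of the people who actually have an outcome, how many were predicted correctly?) and reading down a column (of the people given a prediction, how many turned out that way?). The function names are just illustrative.

```python
from collections import Counter

labels = ["Diabetes", "Not Diabetes"]

# Hypothetical (actual, predicted) pairs; real ones come from the output file.
pairs = (
    [("Diabetes", "Diabetes")] * 3
    + [("Diabetes", "Not Diabetes")] * 2
    + [("Not Diabetes", "Diabetes")] * 1
    + [("Not Diabetes", "Not Diabetes")] * 6
)
cells = Counter(pairs)

def row_share(actual):
    """Of the people who actually have this outcome, the share predicted correctly."""
    row_total = sum(cells[(actual, predicted)] for predicted in labels)
    return cells[(actual, actual)] / row_total

def column_share(predicted):
    """Of the people given this prediction, the share for whom it was correct."""
    column_total = sum(cells[(actual, predicted)] for actual in labels)
    return cells[(predicted, predicted)] / column_total

# Same layout as the worksheet: actual on the rows, predicted on the columns.
for actual in labels:
    counts = "  ".join(f"{cells[(actual, predicted)]:3d}" for predicted in labels)
    print(f"{actual:<13}{counts}   correct along row: {row_share(actual):.0%}")
for predicted in labels:
    print(f"predicted {predicted}: correct down column {column_share(predicted):.0%}")
```

A model can look fine in one direction and weak in the other, so it is worth checking which direction a statement is actually describing.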
5. Sometimes we want to see how a model’s predictions vary as certain variables change. Does this model predict differently for people of different ages?
a. Go to a new worksheet and make a histogram of the age. Set the bin size to 10. It should look like this:
Alt text: bar graph
b. Add the Predicted Outcome Text in front of the Age (bin). You will now see histograms, but they are split by predictions:
Alt text: bar graph
Question 9: Interpreting the Split Histograms
Look at these two histograms. Which statements would you agree with? Select all that apply.
A Among those who are predicted not to have diabetes, the age distribution has a lot of younger people in it.
B In the age group 40–49, the model is predicting approximately the same number of people with and without diabetes.
C In the age group 40–49, the model is predicting approximately the same percentage of people with and without diabetes.
D In the group which is predicted to have diabetes, the ages are relatively evenly distributed between people in their 20s, 30s, 40s, and 50s, with a sharp drop-off at age 60 and older.
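For anyone curious what the split histogram is counting, here is a small sketch with made-up ages. It assumes a bin is labeled by its lower edge, so an age of 47 lands in the 40–49 bin; the helper name age_bin is only for illustration.

```python
from collections import Counter

def age_bin(age, size=10):
    """Lower edge of the bin an age falls in, for example 47 -> 40."""
    return age // size * size

# Hypothetical (age, predicted outcome) pairs.
people = [
    (21, "Not Diabetes"), (24, "Not Diabetes"), (29, "Diabetes"),
    (33, "Not Diabetes"), (47, "Diabetes"), (45, "Not Diabetes"),
    (62, "Diabetes"),
]

# One counter keyed by (prediction, bin) gives both histograms at once.
histograms = Counter((predicted, age_bin(age)) for age, predicted in people)

for predicted in ("Diabetes", "Not Diabetes"):
    print(predicted)
    for (label, start), n in sorted(histograms.items()):
        if label == predicted:
            print(f"  {start}-{start + 9}: {'#' * n} ({n})")
```

These are head counts only. Two bars of the same height in the two histograms mean the same number of people, not the same percentage of each predicted group.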
6. Sometimes the total head count does not give the whole picture, and a percentage is a better way to go. Let’s try to get our histograms to show us percentages of total.
a. Duplicate your paired Age histograms to a new sheet.
b. Under the Rows, CNT(Age), pull down the right arrow and Add Table Calculation.
Alt text: histogram
c. For your Table Calculation, choose Percent of Total, and have it compute using Table(down):
Alt text: table
d. Then put the Model Results on the Color so you can see what percentage of each age group has what sorts of model results:
Alt text: graph
e. The final touch: Often, culturally, we see green as “good/correct” and red as “bad/error.” Let’s go through and set the colors so the “true” outcomes are in the green family and the “false” outcomes are in the red family.
Alt text: graph
f. Now we can look at, for example, a person in their 20s who is predicted not to have diabetes. Do they need to worry?
i. The prediction is not diabetes, so we want the graph on the right (blue and red).
ii. Find the bar which represents people in their 20s who are not predicted to have diabetes
Alt text: graph
iii. Let’s look at this bar a little more closely. We can drag the diabetes_predicted.csv (Count) onto the labels to have it show us the total number of people here. We can see that it does pretty well (lots of true model outcomes) for people in their 20s who are predicted not to have diabetes.
Alt text: graph
Question 10: Interpreting the Stacked Percentage Bar Charts
Look at these charts. Which statements are accurate? Select all that apply.
A For people in their 40s (age 40–49), a model prediction of “no diabetes” is very good news because the model is nearly always correct, and they probably don’t have diabetes.
B For very elderly people (age 80–89), there is only one person in the dataset of this age. Because the model predicts “diabetes” for this person, it will always predict “diabetes” for all people in this age group, regardless of their BMI, glucose, or other variables.
C Say you have 10 people in their 20s who receive a model prediction of “diabetes.” Approximately 7 of those people will actually have diabetes, but 3 will be incorrectly predicted to have diabetes.
D Say you have 10 people in their 20s who receive a model prediction of “diabetes.” Approximately 4 of those people will actually have diabetes, and these are the false positives.
E There are relatively few people in either category (predicted diabetes, predicted no diabetes) who are age 60–69, so we should be cautious about interpreting these percentages for a broader population.
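The stacked percentage bars can also be sketched in Python. This assumes the Percent of Total calculation is totaling within each bar, so that the colored segments of one bar (one prediction, one age bin) add up to 100%. The rows are made up, and the group size is printed next to each bar because a percentage built on one or two people deserves caution.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (age, predicted outcome, model result).
rows = [
    (22, "Not Diabetes", "True Negative"),
    (25, "Not Diabetes", "True Negative"),
    (27, "Not Diabetes", "True Negative"),
    (29, "Not Diabetes", "False Negative"),
    (24, "Diabetes", "True Positive"),
    (28, "Diabetes", "False Positive"),
    (45, "Diabetes", "True Positive"),
]

# One bar per (prediction, age bin); each bar tallies its model results.
bars = defaultdict(Counter)
for age, predicted, result in rows:
    bars[(predicted, age // 10 * 10)][result] += 1

def shares(bar):
    """Each model result as a fraction of the bar it belongs to."""
    total = sum(bar.values())
    return {result: n / total for result, n in bar.items()}

for (predicted, start), bar in sorted(bars.items()):
    size = sum(bar.values())
    parts = ", ".join(f"{r} {s:.0%}" for r, s in sorted(shares(bar).items()))
    print(f"predicted {predicted}, age {start}-{start + 9} (n={size}): {parts}")
```

In this toy data the 40–49 "Diabetes" bar is 100% true positive, but it contains a single person, which is exactly the situation where the percentage alone can mislead.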