How high is your income?
Income matters because it shapes the choices we make in our lives. Our health and lifestyle look different depending on whether our income is higher or lower. I think we should always strive for a higher income, because it frees up time for the other things that make us happy. However, it is sometimes hard to pinpoint which variables lead to a higher income.
…but — WHAT IF WE CAN?
This dataset comes from a United States census. It describes people’s backgrounds and whether they make above or below $50,000.
Here is a sample of the Income dataset:
Because many things affect income, I want to find the top 5 features that influence whether income is above or below $50,000. To do that, I will build statistical and machine learning models on this dataset.
I will separate columns 1 (age) through 10 (hours-per-week), which will be the features dataset, from column 11 (income_>50K), which will be the target dataset.
I will also split the features dataset into a training dataset and a validation dataset.
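The two splits described above can be sketched without any libraries. This is only an illustration on made-up rows: the values, the 80/20 ratio, and the seed are my assumptions, and only the column names age, hours-per-week, and income_>50K come from the dataset. In practice, scikit-learn's train_test_split does the same job.

```python
import random

TARGET = "income_>50K"

def split_features_target(rows, target=TARGET):
    """Separate each row into a feature dict and a target value."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y

def train_val_split(X, y, val_fraction=0.2, seed=42):
    """Shuffle row indices, then hold out val_fraction of them for validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(X) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    y_train = [y[i] for i in train_idx]
    y_val = [y[i] for i in val_idx]
    return X_train, X_val, y_train, y_val

# Made-up rows for illustration only (not taken from the census data).
rows = [
    {"age": 39, "hours-per-week": 40, "income_>50K": 0},
    {"age": 50, "hours-per-week": 13, "income_>50K": 0},
    {"age": 38, "hours-per-week": 40, "income_>50K": 0},
    {"age": 53, "hours-per-week": 40, "income_>50K": 0},
    {"age": 28, "hours-per-week": 40, "income_>50K": 0},
    {"age": 37, "hours-per-week": 40, "income_>50K": 0},
    {"age": 49, "hours-per-week": 16, "income_>50K": 0},
    {"age": 52, "hours-per-week": 45, "income_>50K": 1},
    {"age": 31, "hours-per-week": 50, "income_>50K": 1},
    {"age": 42, "hours-per-week": 40, "income_>50K": 1},
]

X, y = split_features_target(rows)
X_train, X_val, y_train, y_val = train_val_split(X, y)
print(len(X_train), "training rows,", len(X_val), "validation rows")
```

Fixing the seed keeps the split reproducible, so the models that follow are always compared on the same validation rows.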
With the data split, I can create models that predict whether someone's income is above or below $50,000.
Before I start the models, I need to establish a baseline, so that I can tell whether the models make the prediction more accurate.
Based on the training dataset, the baseline is 76%.
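Here is a minimal sketch of one common way to get a baseline, assuming it is the majority-class accuracy: the share of correct answers you would get by always predicting the most frequent class. The toy labels below are made up so that 76 out of 100 fall in the below-$50,000 class, mirroring the 76% above.

```python
from collections import Counter

def majority_baseline(y_train):
    """Return the most common class and the accuracy of always predicting it."""
    label, count = Counter(y_train).most_common(1)[0]
    return label, count / len(y_train)

# Toy labels (assumption): 0 = income at or below $50,000, 1 = income above $50,000.
y_train = [0] * 76 + [1] * 24

label, accuracy = majority_baseline(y_train)
print(f"Always predicting class {label} is right {accuracy:.0%} of the time")
```

Any model worth keeping has to beat this number on the validation dataset.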
Now that I have established a baseline, I can start the models. I chose two models so I can compare which one gives the more accurate prediction.
- Logistic Regression Model
- Random Forest Classifier Model
Logistic Regression Model:
Training Accuracy: 0.8435376084174605
Validation Accuracy: 0.8445177434030937
Random Forest Classifier Model:
Training Accuracy: 0.9528224086449595
Validation Accuracy: 0.8409918107370337
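A comparison like the one above can be sketched with scikit-learn. The census data is not loaded here: make_classification generates a synthetic stand-in, so the printed accuracies will not match the ones reported in this post. The sample size, class weight, max_iter, n_estimators, and random seeds are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data: 10 features, about 76% in class 0.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.76], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = {
        "train": accuracy_score(y_train, model.predict(X_train)),
        "val": accuracy_score(y_val, model.predict(X_val)),
    }
    print(f"{name}: train={scores[name]['train']:.3f}, val={scores[name]['val']:.3f}")
```

A training accuracy far above the validation accuracy, like the Random Forest's 0.95 versus 0.84 above, usually means the model is overfitting the training dataset.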
Since the accuracy scores are very close, we will plot the ROC curve for each model to see which one has the larger area under the curve.
Even with the graph, it is still hard to tell which model is better, so we will do one more test.
The ROC-AUC score is the area under the ROC curve.
Logistic Regression ROC-AUC: 0.7501116955668395
Random Forest Classification ROC-AUC: 0.7582744751152407
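To make the score less abstract: the area under the ROC curve equals the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, with ties counting as half. Below is a small sketch of that definition. The function name and the toy numbers are mine, not from the original analysis. scikit-learn's roc_auc_score computes the same quantity, and it is usually given predicted probabilities rather than hard 0/1 predictions.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 1 = income above $50,000, scores are predicted probabilities.
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_prob))  # 3 of the 4 pairs are ranked correctly -> 0.75
```

A score of 0.5 means the model ranks people no better than chance, and 1.0 means every person above $50,000 is ranked ahead of every person below it.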
Because the Random Forest has the higher ROC-AUC score, I will tune its hyperparameters to try to improve accuracy, and then check whether the tuned model gives a more accurate income prediction.
Here are the accuracy scores after the tuning:
Training Accuracy: 0.8718043509171051
Validation Accuracy: 0.8620336669699727
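One way to do this kind of tuning is a randomized search with cross-validation. The sketch below uses scikit-learn's RandomizedSearchCV on synthetic data. The search space, the number of iterations, and the number of folds are assumptions for illustration, not the settings behind the scores above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the census data (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.76], random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Hypothetical search space: shallower trees and larger leaves reduce overfitting.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 12, None],
    "min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # try 5 random combinations
    cv=3,                # 3-fold cross-validation on the training dataset
    scoring="accuracy",
    random_state=0,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
train_accuracy = best_model.score(X_train, y_train)
val_accuracy = best_model.score(X_val, y_val)
print("Best parameters:", search.best_params_)
print("Training accuracy:", train_accuracy)
print("Validation accuracy:", val_accuracy)
```

The search only ever sees the training dataset, so the validation accuracy stays an honest check of the tuned model.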
The tuned model's validation accuracy is higher (0.862 versus 0.841), so I will use it to predict income.
Earlier, I wanted to check the top 5 features that affect income, and this is the order of importance for the features:
It shows that age, education, capital gains, and hours worked are among the top 5. However, I can’t dismiss the other features, because several of them contribute almost as much to the prediction. With that said, I think it is helpful to look at the whole picture.
I think this shows that education can really help us be competitive and perform the tasks required in different job positions. Our income also tends to grow with time and experience. It is wise to manage our assets properly in order to have capital gains. Of course, it is also important to use our time effectively while working. Our relationships and responsibilities are also a great driving force for higher income. With all of this information, I think time is on our side. We have to use our resources effectively to improve ourselves, our decisions, and our wellbeing.
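A ranking like this can be read from a fitted random forest through its feature_importances_ attribute. The sketch below shows the idea on synthetic data. The feature names are hypothetical placeholders, so the ranking it prints says nothing about the census columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical placeholder names; the real columns are age through hours-per-week.
feature_names = [f"feature_{i}" for i in range(10)]

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
top_5 = ranked[:5]
for name, importance in top_5:
    print(f"{name}: {importance:.3f}")
```

The importances sum to 1, so each value can be read as a share of the model's total, which makes it easy to see when the sixth or seventh feature is nearly tied with the fifth.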