Education is one of the most important factors affecting an individual's salary. According to the Bureau of Labor Statistics (BLS), people with higher levels of education tend to earn more than those with lower levels of education. In this project, I examine the relationship between level of education and salary in Saudi Arabia, and how education affects salary.
In this project I used the Python programming language with its main libraries for data science and machine learning.
Kaggle is rich with datasets, and the one used here was imported from the Kaggle site, link.
Importing Libraries
We explored the dataset, and as we can see, the data is well-balanced, with 6 columns and 504 rows.
In this dataset, the average salary in Saudi Arabia is 8,950, while the minimum and maximum salaries are 1,331 and 35,622 respectively, which is a considerable spread.
Since the dataset has object columns, we convert them to dummy variables so that statistical analysis is easier to perform.
We computed the correlation between each feature and salary.
As seen in the graph, the features most positively correlated with salary are Degree Level Doctorate (coefficient 0.7) and Nationality Saudi (coefficient 0.36).
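The correlation step can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's data, so the coefficients it prints will differ from the ones quoted above; the variable names `dummies` and `salary_corr` are placeholders.

```python
import pandas as pd

# Toy rows standing in for the real dataset (all values are made up).
df = pd.DataFrame({
    "Degree Level": ["Doctorate", "Primary", "Master", "Primary", "Doctorate", "Bachelor"],
    "Nationality": ["Saudi", "Non-Saudi", "Saudi", "Non-Saudi", "Saudi", "Non-Saudi"],
    "Salary": [30000, 1500, 18000, 2000, 32000, 9000],
})

# Turn the object columns into 0/1 dummy columns so corr() can use them.
dummies = pd.get_dummies(df, columns=["Degree Level", "Nationality"], dtype=float)

# Pearson correlation of every dummy column with Salary, highest first.
salary_corr = dummies.corr()["Salary"].drop("Salary").sort_values(ascending=False)
print(salary_corr)
```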
According to this data, the lowest salary, 1,331, belonged to a non-Saudi female with a primary degree in 2020, while the highest salary, 35,622, belonged to a Saudi male with a doctorate degree.
We converted the date column from quarters to months with a small helper function. Then we can check the head of the data below.
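A conversion of this kind might look like the sketch below. It is not necessarily the function used in the project: the label format '2020 Q1' is an assumption about the raw 'Year Quarter' column, and the helper name `quarter_to_date` is hypothetical. Each quarter is mapped to the first day of its first month.

```python
import pandas as pd

# Hypothetical sample; the "2020 Q1" label format is an assumption.
df = pd.DataFrame({"Year Quarter": ["2020 Q1", "2020 Q2", "2021 Q4"]})

def quarter_to_date(labels: pd.Series) -> pd.Series:
    """Map labels such as '2020 Q1' to the first day of that quarter."""
    parts = labels.str.extract(r"(?P<year>\d{4})\s*Q(?P<quarter>[1-4])").astype(int)
    month = (parts["quarter"] - 1) * 3 + 1  # Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
    return pd.to_datetime(pd.DataFrame({"year": parts["year"], "month": month, "day": 1}))

df["Date"] = quarter_to_date(df["Year Quarter"])
print(df.head())
```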
The two graphs show an approximately normal distribution for the Saudi and Male groups in the dataset.
Using the Plotly library, we can visualize the mean salary over time for the Gender and Nationality features, and at first glance we can see the disparity in both graphs. Although the salary for females is rising, it appears to have taken a hit during the Covid-19 pandemic before recovering in 2021. The salary for males followed a similar pattern, but it recovered more quickly and climbed higher before it dipped again. In general, salaries are steadily increasing with time, particularly in the Saudi and male populations.
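The numbers behind such a chart come from a group-by on date and category. The sketch below shows only that aggregation step on made-up rows and leaves the plotting out; the column names follow the dataset, while `mean_by_gender` is a placeholder name.

```python
import pandas as pd

# Toy rows (made-up salaries) with a Date column like the one built earlier.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-01",
                            "2020-04-01", "2020-04-01", "2020-04-01"]),
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Salary": [9000, 6000, 7000, 9500, 10500, 6800],
})

# Mean salary per date and gender; unstack puts one column per gender,
# which is the shape a line chart over time needs.
mean_by_gender = df.groupby(["Date", "Gender"])["Salary"].mean().unstack("Gender")
print(mean_by_gender)
```

The same pattern with "Nationality" in place of "Gender" would give the second chart's data.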
This graph shows that education has a significant impact on salary: holding a doctoral degree is associated with the highest salaries, followed by a Master's degree.
In this section, we performed feature engineering and feature selection before feeding the final data to the machine learning algorithms.
As stated earlier, the dataset is well-balanced, with no missing values and 504 rows; it now has 7 columns, including the Date column added above.
We removed three columns: 'Currency', 'Year Quarter', and 'Date'. We then one-hot encoded the categorical variables 'Nationality', 'Degree Level', and 'Gender' to convert them to numeric form. We did that because many machine learning algorithms cannot operate on label data directly; they require all input and output variables to be numeric.
# One-hot encode the categorical columns with get_dummies
df_d = pd.get_dummies(df, columns=['Nationality','Degree Level', 'Gender'])
With this one line of code, we converted these categorical variables to numeric.
The final result is 12 numeric columns.
Before training, we need to select the independent variables and the target variable. The independent features are Nationality, Gender, and Degree Level, while the target, or dependent variable, is Salary. Then we split the data into train and test sets, and since the dataset is small, the test set is 10% of the data and the train set is 90%.
X_train
y_train
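The selection and split described above can be sketched as follows. The toy frame and its column names are stand-ins for the encoded dataset, and `random_state=42` is an assumption added only to make the split reproducible.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the encoded dataset (all values are made up).
df_d = pd.DataFrame({
    "Salary": [4000, 5200, 9000, 12000, 30000, 3500, 8000, 15000, 28000, 6000],
    "Nationality_Saudi": [0, 1, 1, 0, 1, 0, 1, 1, 1, 0],
    "Gender_Male": [1, 0, 1, 1, 1, 0, 0, 1, 0, 1],
    "Degree Level_Doctorate": [0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
})

X = df_d.drop(columns="Salary")  # independent features
y = df_d["Salary"]               # target variable

# Hold out 10% of the rows for testing and train on the remaining 90%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42
)
print(X_train.shape, X_test.shape)
```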
In the next section, we used multiple machine learning algorithms to predict salaries based on the features.
The Algorithms:
Linear Regression
Polynomial Regression
SVM
K-Neighbors
Decision Tree
Random Forest
AdaBoost Regression
Gradient Boosting Regression
For the sake of brevity, I will show only three algorithms that did well on the train and test data.
We created a function that trains the model, makes predictions, and evaluates them. We used two popular evaluation metrics, Mean Absolute Error and Mean Squared Error, and then visualized the results.
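A helper of this kind might look like the sketch below. It is an illustration and not the project's actual function: the name `evaluate_model` is hypothetical, the data is synthetic, and the visualization step is left out.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Fit the model, predict on the test set, and return MAE and MSE."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, y_pred),
        "MSE": mean_squared_error(y_test, y_pred),
    }

# Synthetic data: three 0/1 features and a salary that is exactly linear in them.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3)).astype(float)
y = 5000 + 3000 * X[:, 0] + 1500 * X[:, 1] + 8000 * X[:, 2]

scores = evaluate_model(LinearRegression(), X[:90], y[:90], X[90:], y[90:])
print(scores)
```

Because the helper takes any scikit-learn regressor, the same call works with `KNeighborsRegressor()` or `DecisionTreeRegressor()` in place of `LinearRegression()`.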
Linear Regression
We imported Linear Regression from the Sklearn library, then ran the function we created earlier and evaluated the model. The metrics show that the predicted values were off by 1,308 in Mean Absolute Error and by 1,805 on the second metric. While these numbers are not encouraging, we will look at other algorithms for comparison.
K_Neighbors Regression
Our second machine learning algorithm is K_Neighbors Regression. After importing it and running the same function, we can see the improvement at first glance: the MAE is 436, which is better than the previous model, linear regression.
Decision Tree Regression
Decision Tree Regression is a powerful algorithm that uses a tree structure to predict a continuous value. As we can see from the image, this model has the lowest mean error.
Also, a Decision Tree lets you calculate how much each feature contributes to the model's predictions through feature importance, and as we can see here, the most important feature is Degree Level_Doctorate, with an importance of 0.53.
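Reading feature importances from a fitted tree can be sketched as follows. The data here is synthetic and built so that the doctorate flag dominates, so the printed values will not match the 0.53 reported above; the feature names mirror the encoded columns, and the coefficients in `y` are made up.

```python
import itertools

import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Every combination of three 0/1 features (synthetic, 8 rows).
X = pd.DataFrame(
    list(itertools.product([0, 1], repeat=3)),
    columns=["Degree Level_Doctorate", "Nationality_Saudi", "Gender_Male"],
)
# Made-up salary rule in which the doctorate flag has by far the largest effect.
y = (5000
     + 20000 * X["Degree Level_Doctorate"]
     + 2000 * X["Nationality_Saudi"]
     + 500 * X["Gender_Male"])

tree = DecisionTreeRegressor(random_state=0).fit(X, y)

# feature_importances_ is normalized to sum to 1 across features.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```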
In conclusion, there is a strong correlation between education level and salary. In general, those with more education earn higher incomes. This is because education provides people with the skills and knowledge that are in demand in the workforce. Finally, this dataset lacked important features that may help determine someone's salary, such as years of experience, age, major area of study, and work sector (private or public).
* For more about the project, check here on GitHub.