Higher Education Outcomes Analysis
This analysis explores the relationships among variables in a higher-education course dataset, focusing on average UCAS points, median pay, and other course-related factors. We use Python to preprocess, merge, filter, and visualize the data, and apply machine learning techniques to estimate which factors carry the most weight. The goal is to understand how these variables are correlated and to provide useful information to educational institutions, students, and policymakers.
Data Preprocessing and Merging:
First, we need to import the necessary libraries and load the data:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Next, we preprocess the data by cleaning and merging the DataFrames. We calculate the average points achieved by students and sum the values of the numeric columns; a bar plot of these sums gives a quick overview of the totals.
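The cleaning, merging, averaging, and summing steps can be sketched as follows. This is a minimal illustration rather than the original pipeline: the two small tables, and the column names TITLE, POINTS, and N_STUDENTS, are assumptions made up for the example. Only PUBUKPRN and AV_POINTS appear in the analysis itself.

```python
import pandas as pd

# Hypothetical stand-ins for the real tables (values invented for illustration).
courses = pd.DataFrame({
    "PUBUKPRN": [1001, 1001, 1002],
    "KISCOURSEID": ["A1", "A2", "B1"],
    "TITLE": [" BA History", "BSc Physics ", "BA English"],
})
entrants = pd.DataFrame({
    "KISCOURSEID": ["A1", "A1", "A2", "B1", "B1", "B1"],
    "POINTS": [96, 120, 144, 112, None, 128],
})

# Cleaning: trim stray whitespace and drop rows with no recorded points.
courses["TITLE"] = courses["TITLE"].str.strip()
entrants = entrants.dropna(subset=["POINTS"])

# Average points per course, plus the number of students behind each average.
av_points = entrants.groupby("KISCOURSEID", as_index=False).agg(
    AV_POINTS=("POINTS", "mean"),
    N_STUDENTS=("POINTS", "count"),
)

# Merge the averages back onto the course table.
merged = courses.merge(av_points, on="KISCOURSEID", how="inner")

# Column sums: these are the values a bar plot of the totals would display.
column_totals = merged[["AV_POINTS", "N_STUDENTS"]].sum()
print(column_totals)
```

Passing `column_totals` to a bar plot (for example `sns.barplot(x=column_totals.index, y=column_totals.values)`) would then draw one bar per summed column.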
Data Exploration and Visualization:
We add categorical columns that band values as Low, Average, High, or Very High, and group the data by 'PUBUKPRN', the provider identifier. We then scale the data using scikit-learn's StandardScaler and MinMaxScaler and calculate the correlation matrix. To visualize the correlations between different variables, we create a heatmap:
sns.heatmap(corr_heat, annot=True, cmap='coolwarm')
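The banding, grouping, scaling, and correlation steps behind a matrix like `corr_heat` can be sketched as below. The data is invented, and the use of quartiles to define the four bands is an assumption, since the cut-offs are not stated here. The column POINTS_BAND is likewise a hypothetical name.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical course-level data (values invented for illustration).
df = pd.DataFrame({
    "PUBUKPRN": [1001, 1001, 1002, 1002, 1003, 1003, 1004, 1004],
    "AV_POINTS": [90, 110, 120, 140, 150, 170, 180, 200],
    "GOINSTMED": [20000, 22000, 23000, 25000, 26000, 28000, 30000, 34000],
})
numeric_cols = ["AV_POINTS", "GOINSTMED"]

# One row per provider: mean of each numeric column.
grouped = df.groupby("PUBUKPRN", as_index=False)[numeric_cols].mean()

# Band average points into four categories (quartile cut-offs are an assumption).
grouped["POINTS_BAND"] = pd.qcut(
    grouped["AV_POINTS"], q=4, labels=["Low", "Average", "High", "Very High"]
)

# StandardScaler: zero mean, unit variance. MinMaxScaler: rescale to [0, 1].
standardized = pd.DataFrame(
    StandardScaler().fit_transform(grouped[numeric_cols]), columns=numeric_cols
)
normalized = pd.DataFrame(
    MinMaxScaler().fit_transform(grouped[numeric_cols]), columns=numeric_cols
)

# Pearson correlation matrix, ready to pass to sns.heatmap.
corr_heat = standardized.corr()
print(corr_heat)
```

Pearson correlation is unchanged by either of these linear rescalings, so the heatmap would look the same on raw, standardized, or min-max-scaled columns. Scaling mainly matters for later steps that are sensitive to the magnitude of each variable.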
Filtering and Visualizing the Relationship between Average Points and Median Pay:
We perform further preprocessing and merge DataFrames to create a reduced version of the KISCOURSE table. From it we build a filtered DataFrame containing only BA and BSc courses and calculate the average points. To visualize the relationship between average points and median pay, we create scatter and regression plots:
sns.scatterplot(x='AV_POINTS', y='GOINSTMED', data=groupedBA_BSC_SAL3)
sns.regplot(x='AV_POINTS', y='GOINSTMED', data=groupedBA_BSC_SAL3)
plt.xlabel('Average UCAS points achieved by student')
plt.ylabel('Median wage 15 months after graduating')
plt.show()
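To put numbers on what the regression plot shows, the filter and a straight-line fit can be sketched as follows. The data is invented, and filtering on a TITLE column with a regular expression is an assumption about how BA and BSc courses might be selected, not necessarily how `groupedBA_BSC_SAL3` was built.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced course table (values invented for illustration).
courses = pd.DataFrame({
    "TITLE": ["BA History", "BSc Physics", "MEng Civil Engineering",
              "BA (Hons) English", "BSc Mathematics", "LLB Law"],
    "AV_POINTS": [100.0, 140.0, 150.0, 120.0, 160.0, 145.0],
    "GOINSTMED": [21000.0, 27500.0, 30000.0, 23500.0, 30000.0, 26000.0],
})

# Keep only titles that start with "BA" or "BSc" as a whole word.
is_ba_bsc = courses["TITLE"].str.match(r"(BA|BSc)\b", case=False)
ba_bsc = courses[is_ba_bsc]

# Pearson correlation between average points and median pay.
r = ba_bsc["AV_POINTS"].corr(ba_bsc["GOINSTMED"])

# Least-squares line: GOINSTMED is approximated by slope * AV_POINTS + intercept.
slope, intercept = np.polyfit(ba_bsc["AV_POINTS"], ba_bsc["GOINSTMED"], deg=1)

print(f"r = {r:.3f}, slope = {slope:.1f} per point, intercept = {intercept:.0f}")
```

The slope reads as the change in median pay associated with one extra average point, which is the same line that `sns.regplot` draws through the scatter.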
Visualizing the Top Importance Values:
Lastly, we create a new DataFrame holding the five highest feature-importance values and the questions they correspond to. We visualize these using a bar plot:
sns.barplot(x='feature', y='importance', hue='questions', dodge=False, data=five_top_importance)
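One way a table shaped like `five_top_importance` could be produced is sketched below. The model that generated the importance values is not stated here, so the random forest is an assumption, as are the synthetic data, the question codes Q01 to Q08, and the placeholder question text.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical survey question codes and placeholder text.
features = [f"Q{i:02d}" for i in range(1, 9)]
question_text = {code: f"Hypothetical question text for {code}" for code in features}

# Synthetic scores; the target depends mainly on Q01, then Q02.
X = pd.DataFrame(rng.uniform(50, 100, size=(200, len(features))), columns=features)
y = 3.0 * X["Q01"] + 1.5 * X["Q02"] + rng.normal(0, 1, size=200)

# Assumption: a random forest supplies the importance values.
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

importance = pd.DataFrame({
    "feature": features,
    "importance": model.feature_importances_,
})

# Keep the five largest importances and attach the matching question text.
five_top_importance = importance.nlargest(5, "importance").reset_index(drop=True)
five_top_importance["questions"] = five_top_importance["feature"].map(question_text)
print(five_top_importance)
```

The resulting columns (feature, importance, questions) match the ones the bar plot call expects. Impurity-based importances from a random forest are relative scores that sum to one across all features, so they rank features rather than measure effect sizes.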
By analyzing the relationships among average points, median pay, and other course-related factors, we built a clearer picture of how these variables relate. The correlation heatmap summarizes the pairwise relationships, the scatter and regression plots show how median pay varies with average points, and the importance ranking highlights the factors that carry the most weight. Educational institutions, students, and policymakers can use this information to make informed decisions about course offerings, resource allocation, and future policies.