This analysis focuses on understanding poverty rates across U.S. counties and their relationship with population density, education levels, and other key demographic factors. The goal is to explore the geographic and socioeconomic trends that explain variations in poverty rates and to build a predictive model to better understand these factors.
The analysis uses a dataset that contains county-level statistics across the U.S. The key variables include:
Several preprocessing steps were performed to clean and process the data:
The first step in the analysis was to calculate the average poverty rate by state, providing a broad view of geographic poverty distribution. By grouping the data by state and averaging the poverty rates, the analysis highlights which states are most affected by poverty.
The resulting bar chart shows that average poverty rates vary considerably between states, pointing to regional disparities in poverty levels.
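The state-level aggregation can be sketched in base R as follows. This is a minimal illustration, not the original script: the `counties` data frame, the `state` column name, and the values themselves are synthetic stand-ins, and only the `poverty` column name comes from the model formula used later in this analysis.

```r
# Synthetic stand-in for the county-level data (all values are made up).
set.seed(42)
counties <- data.frame(
  state   = rep(c("AL", "CA", "NY", "TX", "WA"), each = 20),  # hypothetical column
  poverty = round(runif(100, min = 5, max = 30), 1)           # poverty rate in percent
)

# Average poverty rate per state.
state_avg <- aggregate(poverty ~ state, data = counties, FUN = mean)

# Sort from highest to lowest so the most affected states come first.
state_avg <- state_avg[order(-state_avg$poverty), ]

# Bar chart of the state averages.
barplot(
  state_avg$poverty,
  names.arg = state_avg$state,
  xlab = "State",
  ylab = "Average poverty rate (%)",
  main = "Average county poverty rate by state"
)
```

Note that this is an unweighted mean of county rates, so a small county counts as much as a large one. Weighting by county population would give a different, person-level figure.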
To investigate the relationship between poverty rate and population density, a scatter plot was created in which each point represents a county. The plot helps show whether poverty rates are associated with population density, on the working assumption that densely populated areas may follow different poverty trends than rural counties.
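A scatter plot of this kind might look like the sketch below. The data are simulated, and the log-scaled x-axis is an assumption made here because county densities are often heavily right-skewed; the original plot may have used a linear axis. The correlation test is included only as one simple way to check whether an association is present.

```r
# Simulated counties: right-skewed density and a weak negative link to poverty.
set.seed(1)
n <- 200
counties <- data.frame(popdensity = rlnorm(n, meanlog = 4, sdlog = 1.5))
counties$poverty <- pmax(
  0,
  18 - 1.2 * log10(counties$popdensity) + rnorm(n, sd = 4)
)

# One point per county; log x-axis spreads out the many low-density counties.
plot(
  counties$popdensity, counties$poverty,
  log  = "x",
  pch  = 19,
  col  = rgb(0, 0, 1, alpha = 0.4),
  xlab = "Population density (log scale)",
  ylab = "Poverty rate (%)",
  main = "Poverty rate vs. population density"
)

# Pearson correlation between log density and poverty, with a p-value.
ct <- cor.test(log10(counties$popdensity), counties$poverty)
print(ct$estimate)
print(ct$p.value)
```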
A linear regression model was built to predict poverty rates based on population density. The dataset was split into training (80%) and testing (20%) sets so that the model's ability to generalize to unseen data could be evaluated.
poverty_model <- lm(poverty ~ popdensity, data = train_data)
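For context, the 80/20 split that produces `train_data` could be done as in the sketch below. The `counties` data frame, the seed, and the simulated values are assumptions for illustration; only the `lm()` call mirrors the line above.

```r
# Simulated stand-in for the county data (values are made up).
set.seed(123)
n <- 500
counties <- data.frame(popdensity = rlnorm(n, meanlog = 4, sdlog = 1.5))
counties$poverty <- 15 - 0.001 * counties$popdensity + rnorm(n, sd = 4)

# Randomly assign 80% of the rows to training; the rest form the test set.
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- counties[train_idx, ]
test_data  <- counties[-train_idx, ]

# Fit the simple linear regression on the training rows only.
poverty_model <- lm(poverty ~ popdensity, data = train_data)

# Coefficients, standard errors, and R-squared for the fitted model.
print(summary(poverty_model))
```

Setting a seed before `sample()` keeps the split reproducible between runs.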
The performance of the model was evaluated using the Mean Squared Error (MSE), the average squared difference between actual and predicted poverty rates. A regression line was plotted on a scatter plot of population density vs. poverty rate to compare actual and predicted values.
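The evaluation step can be sketched as follows, again on simulated data with hypothetical object names (`counties`, `test_data`, `predictions`). The comparison against a mean-only baseline is an addition made here to give the MSE a reference point; it is not described in the original analysis.

```r
# Simulated data, split, and model (values are made up).
set.seed(123)
n <- 500
counties <- data.frame(popdensity = rlnorm(n, meanlog = 4, sdlog = 1.5))
counties$poverty <- 15 - 0.001 * counties$popdensity + rnorm(n, sd = 4)
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- counties[train_idx, ]
test_data  <- counties[-train_idx, ]
poverty_model <- lm(poverty ~ popdensity, data = train_data)

# Predict poverty rates for the held-out counties.
predictions <- predict(poverty_model, newdata = test_data)

# MSE: mean of the squared differences between actual and predicted values.
mse <- mean((test_data$poverty - predictions)^2)

# Baseline: always predict the training-set mean poverty rate.
baseline_mse <- mean((test_data$poverty - mean(train_data$poverty))^2)
print(c(model = mse, baseline = baseline_mse))

# Actual test values as points, fitted regression line on top.
plot(
  test_data$popdensity, test_data$poverty,
  pch = 19, col = "grey40",
  xlab = "Population density", ylab = "Poverty rate (%)",
  main = "Actual values and fitted regression line"
)
abline(poverty_model, col = "red", lwd = 2)
```

Because MSE is in squared units of the poverty rate, taking its square root (RMSE) gives an error in percentage points that is easier to read.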
To further explore relationships between multiple variables, a correlation matrix was computed for the numerical columns, including population density, total population, poverty rate, and educational attainment. A heatmap of the matrix shows the strength of the correlations between these variables.
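A correlation matrix and heatmap of this kind can be produced in base R as sketched below. The column names `poptotal` and `pct_college` are hypothetical placeholders for total population and educational attainment, the data are simulated, and the original heatmap may well have been drawn with a different plotting package.

```r
# Simulated county data with one non-numeric column (values are made up).
set.seed(7)
n <- 300
counties <- data.frame(
  state       = sample(c("AL", "CA", "NY", "TX", "WA"), n, replace = TRUE),
  popdensity  = rlnorm(n, meanlog = 4, sdlog = 1.5),
  pct_college = runif(n, min = 10, max = 50)      # hypothetical column name
)
counties$poptotal <- round(counties$popdensity * runif(n, 50, 500))  # hypothetical
counties$poverty  <- 25 - 0.3 * counties$pct_college + rnorm(n, sd = 3)

# Keep only the numeric columns, then compute pairwise Pearson correlations.
numeric_cols <- counties[sapply(counties, is.numeric)]
cor_matrix   <- cor(numeric_cols, use = "complete.obs")
print(round(cor_matrix, 2))

# Heatmap of the matrix without clustering or rescaling.
heatmap(
  cor_matrix,
  Rowv = NA, Colv = NA,
  scale = "none",
  margins = c(8, 8),
  main = "Correlation matrix"
)
```

Pearson correlation only captures linear association, so skewed variables such as density or total population may show weaker correlations than a log-transformed version would.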