
USA, USA, PCA

Kaggle hosts a wide variety of open-source datasets, and its collection of election data is a trove of great material. One of its most robust files covers every county in the US, with over 50 demographic features ranging from land usage to socioeconomic measures. Given such a high-dimensional dataset, I wanted to use principal component analysis (PCA) to plot all the counties and reveal their similarities and dissimilarities.
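For readers who want to see the mechanics, here is a minimal sketch of PCA in plain Python. It is not the code behind the plots in this post: the six rows and four columns are made-up stand-ins for counties and demographic features, and the function names are my own. It standardizes each feature, builds the covariance matrix, and pulls out the top two eigenvectors by power iteration, which a library routine would normally do for you.

```python
import math

def standardize(rows):
    """Scale each column to zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(z):
    """Covariance matrix of already-centered data."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_eigenvector(c, iters=500):
    """Power iteration: multiply by c and renormalize until it settles."""
    d = len(c)
    v = [1.0 + 0.1 * i for i in range(d)]  # arbitrary non-uniform start
    for _ in range(iters):
        w = [sum(c[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(c[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v

def pca(rows, k=2):
    """Return per-row scores, the k components, and their variances."""
    z = standardize(rows)
    c = covariance(z)
    d = len(c)
    components, variances = [], []
    for _ in range(k):
        lam, v = top_eigenvector(c)
        components.append(v)
        variances.append(lam)
        # Deflate: remove the direction just found before looking for the next.
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    scores = [[sum(r[j] * comp[j] for j in range(d)) for comp in components]
              for r in z]
    return scores, components, variances

# Made-up values: each row is a "county", each column a "feature".
rows = [
    [12.0, 48.0, 31.0, 5.0],
    [15.0, 52.0, 28.0, 7.0],
    [40.0, 75.0, 12.0, 22.0],
    [38.0, 71.0, 15.0, 19.0],
    [11.0, 45.0, 33.0, 4.0],
    [90.0, 60.0, 2.0, 60.0],
]
scores, components, variances = pca(rows, k=2)
for s in scores:
    print(round(s[0], 2), round(s[1], 2))  # (PC1, PC2) coordinates to plot
```

Each printed pair is one county's position on the two-dimensional plot. The sign of a component is arbitrary, so a plot may come out mirrored without changing the distances between points.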

The plot below shows the distance between counties based on the similarity of these core demographics, with each county color-coded by state. Hawaii, with its high native islander population, is the clear outlier in the US. Los Angeles County also sits far from the rest. The defining eigenvector for LA is its complete lack of wholesale trade, summarized by census.gov as industry output such as agriculture, mining, manufacturing,

Drawing the eigenvectors shows the distance and reasoning behind each placement on the PCA axes (refer to the table at the end for code definitions).
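To make the eigenvector arrows a little more concrete, here is a small sketch of how they can be read. The feature codes and loading values are hypothetical placeholders, not the real codes or results from this dataset. Each feature becomes an arrow from the origin to its loading on the first two components, and a county's score on a component is the sum of its standardized feature values weighted by those loadings.

```python
# Hypothetical feature codes and loadings, for illustration only.
features = ["FEATURE_A", "FEATURE_B", "FEATURE_C", "FEATURE_D"]
pc1 = [0.58, 0.21, -0.55, 0.56]
pc2 = [-0.10, 0.95, 0.25, 0.15]

def biplot_arrows(names, comp_x, comp_y):
    """Arrow tip for each feature: (loading on PC1, loading on PC2)."""
    return {name: (x, y) for name, x, y in zip(names, comp_x, comp_y)}

def dominant_features(names, component, top=2):
    """Rank features by the absolute size of their loading."""
    ranked = sorted(zip(names, component), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:top]

def contributions(names, z_row, component):
    """Split one standardized row's score into per-feature terms."""
    return {name: z * w for name, z, w in zip(names, z_row, component)}

arrows = biplot_arrows(features, pc1, pc2)
print(arrows["FEATURE_B"])               # long along PC2, short along PC1
print(dominant_features(features, pc1))  # features that drive PC1 the most

# A made-up standardized county: well above average on A and D, below on C.
z_row = [1.2, -0.4, -0.9, 1.1]
parts = contributions(features, z_row, pc1)
print(sum(parts.values()))               # this county's PC1 score
```

A county lands far out along an arrow when it is unusually high on that feature, and far in the opposite direction when it is unusually low.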

Given these naturally formed deviations, we can apply a basic k-means segmentation to cluster each grouping.
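A bare-bones version of k-means on two-dimensional PCA coordinates might look like the sketch below. The points are invented, and I am assuming a simple farthest-point rule to pick the starting centroids so the result is repeatable; real implementations usually use random or k-means++ starts instead.

```python
import math

def kmeans(points, k, iters=100):
    """Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    # Assumed initialization: start from the first point, then keep adding
    # the point farthest from all centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # leave an empty cluster's centroid where it is
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Invented (PC1, PC2) coordinates forming three loose groups.
points = [
    (-2.1, 0.3), (-1.9, 0.1), (-2.0, -0.2),
    (1.5, 1.8), (1.7, 2.1), (1.4, 2.0),
    (3.9, -1.6), (4.2, -1.4), (4.0, -1.8),
]
labels, centroids = kmeans(points, k=3)
print(labels)
```

The number of clusters `k` has to be chosen up front; here it is simply read off the groups visible in the toy data.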

This short exercise allows for a quick understanding of disparate locations using core analytical methods.

 
