Thomas added snippets and dense vectors (08211f38) · Commits · Ranthony Clark / NC_Clustering_Districts

data-munging-by-thomas.ipynb

0 → 100644

+44 −0

		%% Cell type:code id:a144fa2a-d7a5-4c29-9223-bc17d8a09f42 tags:

		``` python
		import numpy as np
		import pandas as pd
		from tqdm import tqdm
		import geopandas as gpd
		import matplotlib.pyplot as plt
		import networkx as nx
		```

		%% Cell type:markdown id:46a6719b-5757-4068-8f6b-b185a4595de2 tags:

		## Reformatting cluster labels

		Currently, the clustering algorithm outputs a dataframe with the columns plan_index, district, row_index, population, and cluster_label.

		For more efficient storage, we prefer a shorter dataframe with one row per (plan_index, district) pair and just the columns plan_index, district, and cluster_label.
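As a minimal sketch of the reshaping described above, the toy example below reduces a long dataframe to one row per (plan_index, district) pair. The values are made up for illustration, and it uses `drop_duplicates` rather than the notebook's `groupby(...).max()`. Both give the same result only under the assumption that cluster_label is constant within each pair.

```python
import pandas as pd

# Toy stand-in for the clustering output (hypothetical values, not real data).
long_df = pd.DataFrame({
    "plan_index":    [0, 0, 0, 1, 1],
    "district":      [1, 1, 2, 1, 2],
    "row_index":     [0, 1, 2, 0, 1],
    "population":    [100, 150, 200, 120, 180],
    "cluster_label": [3, 3, 7, 3, 5],
})

# Keep only the three columns of interest, then collapse repeated rows.
# Assumes every row in a (plan_index, district) pair carries the same label.
short_df = (
    long_df[["plan_index", "district", "cluster_label"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(short_df)
```

The five input rows collapse to four, one per (plan_index, district) pair.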

		%% Cell type:code id:73993582-f1a7-4daa-b3f8-f9d1920bd4f4 tags:

		``` python
		num_clusters = 30
		input_filename = 'data/processed/centroids/ensemble_with_cluster_labels_k30.csv'
		output_filename = 'cluster_labels_k30.csv'
		```

		%% Cell type:code id:5a03e2da-f6ee-425a-b396-b56c3d9a3685 tags:

		``` python
		df = pd.read_csv(input_filename)  # load the clustering output
		```

		%% Cell type:code id:28d0ff7d-f02c-4937-af6c-3f0264dffdeb tags:

		``` python
		# One row per (plan_index, district); max() retains cluster_label, assuming it is constant within each group.
		newdf = df.groupby(by=['plan_index', 'district']).max()
		```

		%% Cell type:code id:4a154a8f-7580-40d7-a93b-35a13a0c0220 tags:

		``` python
		# Write only cluster_label; plan_index and district are saved as the index columns.
		newdf[['cluster_label']].to_csv(output_filename)
		```
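The `groupby(...).max()` step only retains the cluster label faithfully if each (plan_index, district) group carries a single label. The following is a hedged sketch of a check that could be run before aggregating. The helper name `labels_are_consistent` and the toy frames are hypothetical and not part of the notebook.

```python
import pandas as pd

def labels_are_consistent(df: pd.DataFrame) -> bool:
    """Return True if every (plan_index, district) group has exactly one distinct cluster_label."""
    n_labels = df.groupby(["plan_index", "district"])["cluster_label"].nunique()
    return bool((n_labels == 1).all())

# Toy frames (made-up values): one consistent, one with conflicting labels.
good = pd.DataFrame({
    "plan_index":    [0, 0, 1],
    "district":      [1, 1, 1],
    "cluster_label": [4, 4, 2],
})
bad = pd.DataFrame({
    "plan_index":    [0, 0, 1],
    "district":      [1, 1, 1],
    "cluster_label": [4, 9, 2],  # plan 0, district 1 has two different labels
})

print(labels_are_consistent(good))
print(labels_are_consistent(bad))
```

If the check fails, `max()` would silently pick the largest label in the group, so it may be worth inspecting the offending groups before writing the output file.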