Schelling's model of Segregation Python implementation with Geopandas (Follow-up)

Question

This is the repost of the following question as suggested by @HoboProber .

Again, if you don't know what is Schelling's model of segregation, you can read it here.

The Schelling model of segregation is an agent-based model that illustrates how individual tendencies regarding neighbors can lead to segregation. In the Schelling model, agents occupy cells of rectangular space. A cell can be occupied by a single agent only. Agents belong to one of two groups and are able to relocate according to the fraction of friends (i.e., agents of their own group) within a neighborhood around their location. The model's basic assumption is as follows: an agent, located in the center of a neighborhood where the fraction of friends f is less than a predefined tolerance threshold F (i.e., f < F), will try to relocate to a neighborhood for which the fraction of friends is at least f (i.e., f ≥ F)

You can get the shapefile here if you want to try it.

I have implemented most of the changes suggested and also did split up my code in 4 different classes. Here is my complete code.

geo_schelling_populate.py

import numpy as np from shapely.geometry import Point import geopandas as gp import pandas as pd class geo_schelling_populate: """ Generate the coordinates in a polygon (In this case a map of the state) on the basis of the given spacing and then randomly assign coordiantes to different races and as empty houses. It also takes ratio and empty_ratio in consideration. Parameters ---------- shapefile : str It is a string pointing to a geospatial vector data file which has information for an american state's geometry data. spacing : float It defines the distance between the points in coordinate units. empty_ratio : float What percent of houses need to be kept empty. ratio : float What is the real ratio between the majority and minority in reality in a given state. ratio : int Number of races. Currently the model is tested on 2 races, but maybe in future support fot more races will be added. Attributes ---------- shapefile : str Same as in parameter section spacing : float Same as in parameter section empty_ratio : float Same as in parameter section demographic_ratio : float Same as ratio in the parameter section races : int Same as in parameter section shape_file : geopandas.GeoDataFrame Pandas DataFrame with a geometry column generated from the .shp file df_allhouses : pandas.DataFrame Pandas DataFrame which contains all the different coordinates generated inside a specific state. df_emptyhouses : pandas.DataFrame Pandas DataFrame which contains all the coordiantes which are emptyhouses. df_agenthouses : pandas.DataFrame Pandas DataFrame which contains all the coordinates associated with a race. Methods ------- _generate_points() Private function! It generates the coordinates inside a given shape. Returns a dataframe with all the coordiantes generated. populate() It populates the coordinates with either a race or denotes it as an empty house. Returns a tuple with pandas dataframe of empty houses and agent houses. Again agents are races in this case. """ def __init__( self, shapefile, spacing, empty_ratio, ratio, races=2, ): self.shapefile = shapefile self.spacing = spacing self.empty_ratio = empty_ratio self.demographic_ratio = ratio self.races = races self.shape_file = \ gp.read_file(r"%s"%shapefile).explode().reset_index().drop(columns=['level_0' , 'level_1']) self.df_allhouses = pd.DataFrame([]) self.df_emptyhouses = pd.DataFrame([]) self.df_agenthouses = pd.DataFrame([]) def _generate_points(self, polygon): """It returns a DataFrame with all the coordiantes inside a certain shape passed in as an parameter. Parameters ---------- polygon : shapely.geometry.Polygon A polygon object which contains the geometry of a county in a state. Returns ------- A pandas DataFrame with all the coordiantes generated inside the polygon object. """ (minx, miny, maxx, maxy) = polygon.bounds x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)), self.spacing) y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)), self.spacing) grid = np.column_stack((np.meshgrid(x_coords, y_coords)[0].flatten(), np.meshgrid(x_coords, y_coords)[1].flatten())) df_points = pd.DataFrame.from_records(grid, columns=['X', 'Y']) df_points = df_points[df_points[['X', 'Y']].apply(lambda x: \ Point(x[0], x[1]).within(polygon), axis=1)] return df_points.round(2) def populate(self): """ Populates the coordinates by assigning them a certain race or by assigning them as empty houses. Parameters ---------- No parameters Returns ------- A tuple which consist two pandas DataFrames. One contains the coordiantes who have empty houses and the other contains the coordiante with a race assigned to them. """ pd.set_option('mode.chained_assignment', None) self.df_allhouses = \ pd.concat(iter(self.shape_file.geometry.apply(lambda x: \ self._generate_points(x))), ignore_index=True) empty_ratio_seperator = round(self.empty_ratio * len(self.df_allhouses)) self.df_allhouses['Agent/Empty'] = \ np.random.permutation(np.asarray(['Empty'] * empty_ratio_seperator + ['Agent'] * (len(self.df_allhouses) - empty_ratio_seperator))) self.df_emptyhouses = \ self.df_allhouses[self.df_allhouses['Agent/Empty'] == 'Empty'][['X', 'Y' ]].reset_index(drop=True) self.df_agenthouses = \ self.df_allhouses[self.df_allhouses['Agent/Empty'] == 'Agent'][['X', 'Y' ]].reset_index(drop=True) demographic_ratio_seperator = round(self.demographic_ratio * len(self.df_agenthouses)) self.df_agenthouses['Race'] = \ np.random.permutation(np.asarray([1] * demographic_ratio_seperator + [2] * (len(self.df_agenthouses) - demographic_ratio_seperator))) return (self.df_emptyhouses, self.df_agenthouses)

geo_schelling_update.py

import numpy as np import math class geo_schelling_update: """Updates the position of the races and the empty houses on the basis of similarity threshold. Agents change positions if they have enough agents from their own races in a certain neighbourhood. In this scenario, neighbourhood is defined as the eight nearest coordinate around the coordinate we are checking satisfaction for. Parameters ---------- n_iterations : int Maximum number of times, update method should run. spacing : float It defines the distance between the points in coordinate units. np_agenthouses : numpy array df_agenthouses converted to the numpy array for faster computations. np_emptyhouses : numpy array df_emptyhouses converted to the numpy array for faster computations. similarity_threshold : float What percent of similarity people want from their neighbours. Attributes ---------- spacing : float Same as in parameter section n_iterations : int Same as in parameter section np_agenthouses : numpy array Same as in parameter section np_emptyhouses : numpy array Same as in parameter section similarity_threshold : float Same as in parameter section Methods ------- _is_unsatisfied() Private function! It checks if an agent is satisfied or not at a certain position. _move_to_empty() Private function! Moves an unsatisified agent to a random empty house. _update_helper() Private functions! A helper function to help update method with numpy's apply_along_axis method. update() It updates array as long as it reaches maximum number of iterations or reaches a point where no agent is unsatisfied. Getter ------ get_agenthouses() Returns an array with all the agents at a certain coordinate. """ def __init__( self, n_iterations, spacing, np_agenthouses, np_emptyhouses, similarity_threshold, ): self.spacing = spacing self.n_iterations = n_iterations self.np_emptyhouses = np_emptyhouses self.np_agenthouses = np_agenthouses self.similarity_threshold = similarity_threshold def _is_unsatisfied(self, x, y): """ Checks if an agent is unsatisfied at a certain position. Parameters ---------- x : float x coordinate of the agent being checked y : float y coordinate of the agent being checked Returns ------- True or False based on if the agent is satisfied or not. """ race = np.extract(np.logical_and(np.equal(self.np_agenthouses[: , 0], x), np.equal(self.np_agenthouses[:, 1], y)), self.np_agenthouses[:, 2])[0] euclid_distance1 = round(math.hypot(self.spacing, self.spacing), 4) euclid_distance2 = self.spacing total_agents = \ np.extract(np.logical_or(np.equal(np.round(np.hypot(self.np_agenthouses[: , 0] - x, self.np_agenthouses[:, 1] - y), 4), euclid_distance1), np.equal(np.round(np.hypot(self.np_agenthouses[: , 0] - x, self.np_agenthouses[:, 1] - y), 4), euclid_distance2)), self.np_agenthouses[:, 2]) if total_agents.size == 0: return False else: return total_agents[total_agents == race].size \ / total_agents.size < self.similarity_threshold def _move_to_empty(self, x, y): """Moves the agent to a new position if the agent is unsatisfied. Parameters ---------- x : float x coordinate of the agent being checked y : float y coordinate of the agent being checked Returns ------- None """ race = np.extract(np.logical_and(np.equal(self.np_agenthouses[: , 0], x), np.equal(self.np_agenthouses[:, 1], y)), self.np_agenthouses[:, 2])[0] (x_new, y_new) = \ self.np_emptyhouses[np.random.choice(self.np_emptyhouses.shape[0], 1), :][0] self.np_agenthouses = \ self.np_agenthouses[~np.logical_and(self.np_agenthouses[:, 0] == x, self.np_agenthouses[:, 1] == y)] self.np_agenthouses = np.vstack([self.np_agenthouses, [x_new, y_new, race]]) self.np_emptyhouses = \ self.np_emptyhouses[~np.logical_and(self.np_emptyhouses[:, 0] == x_new, self.np_emptyhouses[:, 1] == y_new)] self.np_emptyhouses = np.vstack([self.np_emptyhouses, [x, y]]) def _update_helper(self, agent): """Helps the update function with number of changes made in every iterations. Parameters ---------- agent : tuple x and y coordinates for the agent's position. Returns ------- 1 if the position of the agent is changed, 0 if not. """ if self._is_unsatisfied(agent[0], agent[1]): self._move_to_empty(agent[0], agent[1]) return 1 else: return 0 def update(self): """Main player in updating the array with all the agents' position. It updates the array until it reaches the iteration limit or the number of changes become 0. Parameters ---------- None Returns ------- None """ for i in np.arange(self.n_iterations): np_oldagenthouses = self.np_agenthouses.copy() n_changes = np.sum(np.apply_along_axis(self._update_helper, 1, np_oldagenthouses)) print('n Changes---->' + str(n_changes)) print(i) if n_changes == 0: break def get_agenthouses(self): return self.np_agenthouses

geo_schelling_data.py

from shapely.geometry import Point import pandas as pd class geo_schelling_data: """Get the important data from the simulation required for the further analysis. Following data is obtained: 1) County Name 2) Majority Population 3) Minority Population 4) Total Population 5) Percentage of Majority Population 6) Percentage of Minority Population Parameters ---------- np_agenthouses : numpy array df_agenthouses converted to the numpy array for faster computations. pd_shapefile : Pandas DataFrame Pandas DataFrame with a geometry column generated from the .shp file. Attributes ---------- np_agenthouses : numpy array Same as in parameter section pd_shapefile : Pandas DataFrame Same as in parameter section df_county_data : Pandas DataFrame Pandas DataFrame with important data mentioned above. Methods ------- _get_number_by_county() Private Function! It returns the number of majority and minority in a polygon. get_county_data() Returns the pandas DataFrame with the important data required for the further analysis as mentioned above. """ def __init__(self, np_agenthouses, pd_shapefile): self.np_agenthouses = np_agenthouses self.pd_shapefile = pd_shapefile self.df_county_data = pd.DataFrame([]) def _get_number_by_county(self, geometry): """Returns the number of minority agents and majority agents within a polygon. Parameters ---------- geometry : shapely.geometry.Polygon A polygon object which contains the geometry of a county in a state. Returns ------- A tuple with of minority agents and majority agents """ num_majority = len([x[2] for x in list(self.np_agenthouses) if Point(x[0], x[1]).within(geometry) and x[2] == 1.0]) num_minority = len([x[2] for x in list(self.np_agenthouses) if Point(x[0], x[1]).within(geometry) and x[2] == 2.0]) return (num_majority, num_minority) def get_county_data(self): """Does calculation on Minority and Majority agents' number and returns a pandas dataframe for further analysis. Parameters ---------- None Returns ------- Pandas DataFrame """ self.df_county_data['CountyName'] = self.pd_shapefile.NAMELSAD self.df_county_data['MajorityPop'] = \ self.pd_shapefile.geometry.apply(lambda x: \ self._get_number_by_county(x)[0]) self.df_county_data['MinorityPop'] = \ self.pd_shapefile.geometry.apply(lambda x: \ self._get_number_by_county(x)[1]) self.df_county_data['TotalPop'] = \ self.df_county_data['MajorityPop'] \ + self.df_county_data['MinorityPop'] self.df_county_data['MajorityPopPercent'] = \ self.df_county_data[['TotalPop', 'MajorityPop' ]].apply(lambda x: (0 if x['TotalPop'] == 0 else x['MajorityPop'] / x['TotalPop']), axis=1) self.df_county_data['MinorityPopPercent'] = \ self.df_county_data[['TotalPop', 'MinorityPop' ]].apply(lambda x: (0 if x['TotalPop'] == 0 else x['MinorityPop'] / x['TotalPop']), axis=1) return self.df_county_data

geo_schelling_plot.py

from matplotlib import pyplot as plt class geo_schelling_plot: """Visualize the simulation as it helps us interpret and analyze the results. Parameters ---------- np_agenthouses : numpy array df_agenthouses converted to the numpy array for faster computations. pd_shapefile : Pandas Series Pandas Series of the geometry column generated from the .shp file. Attributes ---------- np_agenthouses : numpy array Same as in parameter section pd_shapefile : Pandas Series Same as in parameter section Methods ------- plot() Plots the agents and shape on the same graph. """ def __init__(self, np_agenthouses, pd_geometry): self.np_agenthouses = np_agenthouses self.pd_geometry = pd_geometry def plot(self): """Plots the agents and shape on the same graph so that one can see where the different agents lies. It is very helpful in the case when the simulation is finished and you can see the changes occured in the plot. Yellow is majority Violet is minority Parameters ---------- None Returns ------- None """ fig, ax = plt.subplots(figsize=(10, 10)) self.pd_geometry.plot(ax=ax, color='white', edgecolor='black',linewidth=4) ax.scatter(self.np_agenthouses[:, 0], self.np_agenthouses[:, 1], c=self.np_agenthouses[:, 2]) ax.set_xticks([]) ax.set_yticks([])

run.py

from geo_optimized.geo_schelling_data import geo_schelling_data from geo_optimized.geo_schelling_plot import geo_schelling_plot from geo_optimized.geo_schelling_populate import geo_schelling_populate from geo_optimized.geo_schelling_update import geo_schelling_update shapefile = "Path to the shapefile" spacing = 0.05 empty_ratio = 0.2 similarity_threshhold = 0.65 n_iterations = 100 ratio = 0.41 if __name__ == '__main__': schelling_populate = geo_schelling_populate(shapefile, spacing, empty_ratio, ratio) df_emptyhouses, df_agenthouses = schelling_populate.populate() np_agenthouses = df_agenthouses.to_numpy() np_emptyhouses = df_emptyhouses.to_numpy() pd_geometry = schelling_populate.shape_file.geometry schelling_update = geo_schelling_update(n_iterations, spacing, np_agenthouses, np_emptyhouses,similarity_threshhold) schelling_update.update() np_agenthouses = schelling_update.get_agenthouses() pd_shapefile = schelling_populate.shape_file schelling_plot = geo_schelling_plot(np_agenthouses,pd_geometry) schelling_plot.plot() schelling_get_data = geo_schelling_data(np_agenthouses,pd_shapefile) df_county_data = schelling_get_data.get_county_data()

I got rid of most of the for loops and replace the code with pandas and numpy to do some faster looping and also some vectorize operations. I believe this code is much more cleaner and faster than the previous one but still some improvements in speed and documentation can be made.

I believe I can still make some operations vectorize but don't know how to do it. If someone can suggest those that would be great. Also if someone can help me with the better documentation of the code, that would be great too. Currently I am using numpy style docstrings.

Also if someone can help me optimize my code with numba to achieve C level speed that would be great. I believe that geo_schelling_update.py can be speed up with numba but I am unable to do so.

that is only the .shp. There are some associated files which are needed gis.stackexchange.com/a/262509 — Maarten Fabré
– Maarten Fabré, Commented Sep 18, 2019 at 12:39

Maarten Fabré · Accepted Answer · 2019-09-19 09:37:53Z

Code style

Try to use an IDE which integrates with linters (Pycodestyle, Pylama, Mypy,...). This alone found some 97 warning, ranging from no whitespaces after a comma, trailing whitespace, redundant backslashes, closing brackets not matching the indentation,...

All of these are no big issues, but they make the code look messy, and are easy to fix. I use a code formatter (black, but there is also yapf) with a maximum line length of 79 to take care of these smaller issues for me

Classes should be in CapWords

The np_ prefix in some variable names is not helpful

Python is not JAVA

Not everything needs to be in a class, and not every class needs to be in a separate file.

pd.set_option('mode.chained_assignment', None)

This is a sign that you are doing something dangerous with views or copies, and data might be lost when you change a subset. It is better to use .loc then to make sure you get a copy, and not a view

ratio

You have 2 ratio's, so better call the second demographic_ratio or something

keyword-only arguments

Methods with a lot of arguments can cause confusion, and are called with the wrong order from time to time. To prevent this, use keyword-only arguments if there are a lot, especially if they are of the same type, so the code does not trow an error immediately, but just gives a garbage answer

occupation

df_allhouses["Agent/Empty"] lists whether a property is occupied. Since this is simply a flag, you can use 0 and 1 of True and False instead of "Empty" or "Agent" This will simplify a lot of the further processing. There is also no need to make this a column in df_allhouses.

I would also extract the method to provide this random population to a separate method:

def random_population(size, ratio): samples = np.zeros(size, dtype=np.bool) samples[: round(size * ratio)] = 1 return np.random.permutation(samples)

So the houses that are ocupied are defined by occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio)

architecture

What geo_schelling_populate does is provide an empty simulation, that you later populate. You never do anything else with this object, apart from getting the shapefile. A more logic architecture would be a method to read the shapefile, and another method to deliver a populated simulation. No need for the class, and no need for the extra file

This is an interesting talk: Stop writing classes by Jack Diederich

I will convert the example of the populate here, but the same way of working can be done for the other parts. No need for a class, with just an __init__ that populates the object variables, and then 1 action method that uses those object variables. It is better to just pass those as argument to a function

There is still a lot of other stuff to improve, especially on vectorisation, but I don't have time for that at this moment

from pathlib import Path import geopandas as gp import numpy as np import pandas as pd from shapely.geometry import Point def _generate_points(polygon, spacing): """It returns a DataFrame with all the coordiantes inside a certain shape passed in as an parameter. Parameters ---------- polygon : shapely.geometry.Polygon A polygon object which contains the geometry of a county in a state. Returns ------- A pandas DataFrame with all the coordiantes generated inside the polygon object. """ (minx, miny, maxx, maxy) = polygon.bounds x_coords = np.arange(np.floor(minx), int(np.ceil(maxx)), spacing) y_coords = np.arange(np.floor(miny), int(np.ceil(maxy)), spacing) grid = np.column_stack( ( np.meshgrid(x_coords, y_coords)[0].flatten(), np.meshgrid(x_coords, y_coords)[1].flatten(), ) ) df_points = pd.DataFrame.from_records(grid, columns=["X", "Y"]) df_points = df_points[ df_points[["X", "Y"]].apply( lambda x: Point(x[0], x[1]).within(polygon), axis=1 ) ] return df_points.round(2) def random_population(size, ratio): samples = np.zeros(size, dtype=np.bool) samples[: round(size * ratio)] = 1 return np.random.permutation(samples) def populate_simulation( *, shape_file: gp.GeoDataFrame, spacing: float, empty_ratio: float, demographic_ratio: float, races=2 ): """provides a random populated simulation ... """ all_houses = pd.concat( iter( shape_file.geometry.apply(lambda x: _generate_points(x, spacing)) ), ignore_index=True, ) occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio) agent_houses = all_houses.loc[occupied, ["X", "Y"]].reset_index(drop=True) empty_houses = all_houses.loc[~occupied, ["X", "Y"]].reset_index(drop=True) agent_houses["Race"] = random_population( len(agent_houses), demographic_ratio ) # +1 if you need it to be 1 and 2 return empty_houses, agent_houses if __name__ == "__main__": shapefilename = Path(r"../../data/CA_Counties_TIGER.shp") shape_file = gp.read_file(shapefilename) # no need spacing = 0.05 empty_ratio = 0.2 similarity_threshhold = 0.65 n_iterations = 100 demographic_ratio = 0.41 empty_houses, agent_houses = populate_simulation( shape_file=shape_file, spacing=spacing, empty_ratio=empty_ratio, demographic_ratio=demographic_ratio, races=2, ) ...

Part 2:

random seed

When simulating, always give the possibility to give a random seed, so you can repeat the same simulation to verify certain results.

Geopandas

You do a lot of distance calculations, and checks whether coordinates are withing Polygons semi-manually. It is a lot cleaner if you can let GeoPandas do that for you. If you keep the coordinates as Points, instead of x and y columns:

def generate_points(polygon, spacing): (minx, miny, maxx, maxy) = polygon.bounds x_coords = np.arange(np.floor(minx), (np.ceil(maxx)), spacing) y_coords = np.arange(np.floor(miny), (np.ceil(maxy)), spacing) grid_x, grid_y = map(np.ndarray.flatten, np.meshgrid(x_coords, y_coords)) grid = gp.GeoSeries([Point(x, y) for x, y in zip(grid_x, grid_y)]) return grid[grid.intersects(polygon)].reset_index(drop=True)

To get the houses in Los Angeles County:

generate_points(shape_file.loc[5, "geometry"], .05).plot()

grid creation

Best would be to extract the grid creation. This way you could in the future reuse the raw grid, with different simulation characteristics.

def create_grid(counties: gp.GeoSeries, spacing: float, random_seed=None): return gp.GeoDataFrame( pd.concat( { county.NAME: generate_points(county.geometry, spacing) for county in counties.itertuples() }, names=["county"], ) .rename("geometry") .reset_index(level="county") .reset_index(drop=True) .astype({"county": "category"}), geometry="geometry", )

all_houses = create_grid(counties=shape_file, spacing=spacing, random_seed=0)

def populate_simulation( *, all_houses, empty_ratio: float, demographic_ratio: float, races=2, random_seed=None, ): """provides a random populated simulation ... """ if random_seed is not None: np.random.seed(random_seed) occupied = random_population(size=len(all_houses), ratio=1 - empty_ratio) race = random_population(size=int(occupied.sum()), ratio=demographic_ratio) agent_houses = gp.GeoDataFrame( { "race": race.astype(int), "county": all_houses.loc[occupied, "county"], }, geometry=all_houses.loc[occupied, "geometry"].reset_index(drop=True), ) empty_houses = all_houses[~occupied].reset_index(drop=True) return empty_houses, agent_houses

empty_houses, agent_houses = populate_simulation( all_houses=all_houses, empty_ratio=.1, demographic_ratio=.3, races=2, random_seed=0, )

The agent_houses then looks like this:

 Race geometry 0 0 POINT (-4 -9) 1 0 POINT (-3 -9) 2 0 POINT (-2 -9) 3 1 POINT (-1 -9) 4 0 POINT (0 -9)

To plot this:

 agent_houses.plot(column="race", categorical=True, figsize=(10, 10))

To check the neighbours who live within a perimeter of an agent is simple:

def get_neighbours(agent, agent_houses: gp.GeoDataFrame, spacing): """ returns all the agents that live within `perimeter` of the `agent` The `agent` excluding""" surroundings = agent.geometry.buffer(spacing * 1.5) return agent_houses.loc[ agent_houses.intersects(surroundings) & (agent_houses != agent).any(axis=1) ]

this can be tested:

agent = agent_houses.loc[4] neighbours = get_neighbours(agent, agent_houses, radius=0.05 * 5) neighbours

 race county geometry 5 0 Sierra POINT (-120.4500000000001 39.44999999999997) 17 1 Sierra POINT (-120.5500000000001 39.49999999999997) 18 0 Sierra POINT (-120.5000000000001 39.49999999999997) 19 0 Sierra POINT (-120.4500000000001 39.49999999999997) 12321 0 Nevada POINT (-120.5500000000001 39.39999999999998) 12322 0 Nevada POINT (-120.5000000000001 39.39999999999998) 12323 0 Nevada POINT (-120.4500000000001 39.39999999999998) 12334 0 Nevada POINT (-120.5500000000001 39.44999999999997)

This is rather slow (500ms for a search among 15510 occupied houses)

Of you add x and y columns to a copy of the original:

agent_houses_b = agent_houses.assign(x=agent_houses.geometry.x, y=agent_houses.geometry.y)

and then use these column:

def get_neighbours3(agent, agent_houses: gp.GeoDataFrame, spacing): """ returns all the agents that live within `perimeter` of the `agent` The `agent` excluding""" close_x = (agent_houses.x - agent.geometry.x).abs() < spacing * 1.1 close_y = (agent_houses.y - agent.geometry.y).abs() < spacing * 1.1 return agent_houses.loc[ (agent_houses.index != agent.name) # skip the original agent & (close_x) & (close_y) ]

get_neighbours3(agent, agent_houses_b, spacing)

returns in about 3ms

This will possibly go even faster of you do it per county, and use DataFrame.groupby.transform if the people on the border of one county don't count as neighbours for people of a neighbouring county

To find out who is satisfied:

satisfied_agents = pd.Series( { id_: is_satisfied( agent=agent, agent_houses=agent_houses_b, spacing=spacing, similarity_threshold=similarity_threshold, ) for id_, agent in agent_houses.iterrows() }, name="satisfied", )

value_counts

Checking whether an agent is satisfied becomes simple, just using pd.Series.value_counts

def is_satisfied(*, agent, agent_houses, spacing, similarity_threshold): neighbours = get_neighbours3(agent, agent_houses, spacing=spacing) if neighbours.empty: return False group_counts = neighbours["race"].value_counts() return group_counts.get(agent["race"], 0) / len(neighbours) < similarity_threshold

To get the count per race:

agent_houses.groupby(["race"])["geometry"].count().rename("count")

To get the count in a county:

agent_houses.groupby(["county", "race"])["geometry"].count().rename("count")

update

One iteration in the update can then be described as:

def update(agent_houses, empty_houses, spacing, similarity_threshold): agent_houses_b = agent_houses.assign( x=agent_houses.geometry.x, y=agent_houses.geometry.y ) satisfied_agents = pd.Series( { id_: is_satisfied( agent=agent, agent_houses=agent_houses_b, spacing=spacing, similarity_threshold=similarity_threshold, ) for id_, agent in agent_houses.iterrows() }, name="satisfied", ) open_houses = pd.concat( ( agent_houses.loc[~satisfied_agents, ["county", "geometry"]], empty_houses, ), ignore_index=True, ) new_picks = np.random.choice( open_houses.index, size=(~satisfied_agents).sum(), replace=False ) new_agent_houses = agent_houses.copy() new_agent_houses.loc[ ~satisfied_agents, ["county", "geometry"] ] = open_houses.loc[new_picks] new_empty_houses = open_houses.drop(new_picks).reset_index(drop=True) return new_empty_houses, new_agent_houses

This redistributes all the empty houses and the houses of the people unsatisfied, simulating a instantaneous move of all the unsatisfied people

part 3 spatial datastructures.

There are some datastructures which are explicitly meant for spatial data, and finding nearest neighbours.

a scipy.spatial.KDTree (or implemented in cython cKDTree) is specifically meant for stuff like this, and will speed up searches a lot when going to large grid.

from scipy.spatial import cKDTree grid_x = np.array([grid["geometry"].x, grid["geometry"].y, ]).T tree = cKDTree(grid_xy)

To query:

tree.query_ball_point((-120.4, 35.7), spacing * 1.5)

This query only takes 70µs for those 17233 grid points, which is 30 times faster than get_neighbours3.

You can even look for all neighbour pairs with tree.query_pairs(spacing * 1.5). This takes about as much time as 1 neighbour lookup in neighbours3

This means you can prepopulate a dict with all neighbours:

all_neighbours = defaultdict(list) for i, j in tree.query_pairs(spacing * 1.5): all_neighbours[i].append(j) all_neighbours[j].append(i)

If you now keep the information on occupation and race in 2 separate numpy arrays, you can quickly look for all satisfied people:

occupied = random_population(size=len(grid), ratio=1 - empty_ratio) race = random_population(size=int(occupied.sum()), ratio=demographic_ratio) def is_satisfied2(agent, *, all_neighbours, occupied, race, similarity_index): if not occupied[agent] or agent not in all_neighbours: return False neighbours = all_neighbours[agent] neighbours_occupied = occupied[neighbours].sum() neighbours_same_race = ( occupied[neighbours] & (race[neighbours] == race[agent]) ).sum() return (neighbours_same_race / neighbours_occupied) > similarity_index

and all the satisfied people:

satisfied_agents = np.array( [ is_satisfied2( agent, all_neighbours=all_neighbours, occupied=occupied, race=race, similarity_index=similarity_index, ) for agent in np.arange(len(grid)) ] )

The people who want to move:

on_the_move = ~satisfied_agents & occupied

And the houses that are free is either free_houses = ~satisfied_agents or free_houses = ~occupied depending on your definition.

So update becomes as simple as:

def update(*, occupied, race, all_neighbours, similarity_index): satisfied_agents = np.array( [ is_satisfied2( agent, all_neighbours=all_neighbours, occupied=occupied, race=race, similarity_index=similarity_index, ) for agent in np.arange(len(grid)) ] ) on_the_move = ~satisfied_agents & occupied free_houses = ~satisfied_agents # or # free_houses = ~occupied assert len(on_the_move) <= len(free_houses) # they need a place to go new_houses = np.random.choice( free_houses, size=len(on_the_move), replace=False ) new_occupied = occupied[:] new_occupied[on_the_move] = False new_occupied[new_houses] = True new_race = race[:] new_race[new_houses] = race[on_the_move] return new_occupied, new_race

Thank you for this. I really appreciate your answer. But np.random.binomial doesn't give me the exact number of 1s and 2s as I really need that — Kartikeya Sharma
– Kartikeya Sharma, Commented Sep 17, 2019 at 18:04
Thanks a lot for giving me all the different ideas here. I will definitely implement them in my code and hopefully make it run faster and also make it cleaner but I think that one thing you didn't understand about the simulation is that the update method can't work the way you are trying it to work. When an agent is unsatisfied, it needed to be moved to an randomly chosen empty space right away because lets say if we get 5000 unsatisfied agents and only have 2000 empty houses, where would be the other 3000 agents would go. An empty house need to be available all the time. — Kartikeya Sharma
– Kartikeya Sharma, Commented Sep 27, 2019 at 20:51
Which is why I decided the free houses as the ones not occupied by satisfied people, so a discontent person can move to a horse of another discontent — Maarten Fabré
– Maarten Fabré, Commented Sep 28, 2019 at 18:27
Also in your function populate_simulation, its not really populating Santa Barbara, LA and Ventura mainland — Kartikeya Sharma
– Kartikeya Sharma, Commented Sep 30, 2019 at 5:45

Reinderien · Accepted Answer · 2019-09-17 13:20:26Z

Documentation

The amount of documentation you've written is ambitious, but its arrangement is slightly unhelpful for a few reasons.

When documenting the "parameters to a class", you're really documenting the parameters to __init__. As such, this block:

 """ Parameters ---------- shapefile : str It is a string pointing to a geospatial vector data file which has information for an american state's geometry data. spacing : float It defines the distance between the points in coordinate units. empty_ratio : float What percent of houses need to be kept empty. ratio : float What is the real ratio between the majority and minority in reality in a given state. ratio : int Number of races. Currently the model is tested on 2 races, but maybe in future support fot more races will be added. """

should be moved to the docstring of __init__.

Attributes generally aren't documented at all because they should mostly be private (prefixed with an underscore) and discouraged from public access except through functions. Regardless, such documentation should probably go into comments against the actual initialization of members in __init__.

Move the documentation for "Methods" to docstrings on each method.

Typo

coordiantes = coordinates

Modern IDEs have spell checking support, such as PyCharm.

Indentation

Perhaps moreso than in any other language, clear indentation in Python is critical. This:

 self.df_county_data['MinorityPopPercent'] = \ self.df_county_data[['TotalPop', 'MinorityPop' ]].apply(lambda x: (0 if x['TotalPop'] == 0 else x['MinorityPop'] / x['TotalPop']), axis=1)

could be better represented like so:

self.df_county_data['MinorityPopPercent'] = ( self.df_county_data[ ['TotalPop', 'MinorityPop'] ].apply( lambda x: ( 0 if x['TotalPop'] == 0 else x['MinorityPop'] / x['TotalPop'] ), axis=1 ) )

That being said, this is somewhat over-extending the usefulness of a lambda; you're probably better off just writing a function to pass to apply.

Stack Exchange Network

Schelling's model of Segregation Python implementation with Geopandas (Follow-up)

2 Answers 2

Code style

Python is not JAVA

pd.set_option('mode.chained_assignment', None)

ratio

keyword-only arguments

occupation

architecture

random seed

Geopandas

grid creation

value_counts

update

part 3

spatial datastructures.

Documentation

Typo

Indentation

You must log in to answer this question.

Linked

Hot Network Questions

Schelling's model of Segregation Python implementation with Geopandas (Follow-up)

2 Answers 2

Code style

Python is not JAVA

pd.set_option('mode.chained_assignment', None)

ratio

keyword-only arguments

occupation

architecture

random seed

Geopandas

grid creation

value_counts

update

part 3

spatial datastructures.

Documentation

Typo

Indentation

You must log in to answer this question.

Linked

Related

Hot Network Questions