Collecting Neighborhood Data

Since deploying my latest web application, a Los Angeles Neighborhood Ranker, I’ve wanted to explain the process of gathering the data. The first step was to decide which neighborhoods to use, what they’re called, and how they’re defined geographically.

My first stop was the LA Times Mapping LA project, which maps many Los Angeles neighborhoods. Each neighborhood has an individual page where its boundary is plotted on a map, and inspecting the page’s source code reveals the coordinates of that boundary:

(From http://maps.latimes.com/neighborhoods/neighborhood/santa-monica/)

"geometry": { "type": "MultiPolygon", "coordinates": 
[ [ [ [ -118.483981, 34.041635 ], [ -118.483766, 34.041430 ]...

This information can be collected and stored for later use via web scraping: a Python library like urllib2 opens the page, and a regular expression finds the coordinates. The following shows the essential steps:

import urllib2
import re

# Example page from the LA Times Mapping LA project
fullurl = "http://maps.latimes.com/neighborhoods/neighborhood/santa-monica/"

webpage = urllib2.urlopen(fullurl)
for line in webpage:
    # Capture the coordinate block from the embedded GeoJSON
    d = re.search(r"geometry.+(\[ \[ \[ \[.+\] \] \] \])", line)
    if d:
        coord_string = d.group(1)  # store the raw coordinate string for later parsing

Storing these coordinates to text allows them to be used for mapping and data gathering. Mapping is accomplished with leaflet.js and Mapbox, which I may describe in a later post. For now I will focus on how the neighborhood coordinates help in the data collection process, starting from the stored boundary strings (see the sketch below).
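
As a minimal sketch of that parsing step (parse_coords, coord_string, and the coords dictionary are illustrative names for this post, not code from the deployed app), the captured string can be split into parallel longitude and latitude lists:

import re

def parse_coords(coord_string):
    # Pull every "lon, lat" pair out of the captured boundary string
    # and split them into parallel longitude and latitude lists
    pairs = re.findall(r"(-?\d+\.\d+),\s*(-?\d+\.\d+)", coord_string)
    lons = [float(lon) for lon, lat in pairs]
    lats = [float(lat) for lon, lat in pairs]
    return (lons, lats)

coords = {'santa-monica': parse_coords(coord_string)}

This (longitudes, latitudes) shape is what the plotting and sampling code below expects.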

For the neighborhood ranker, I needed neighborhood-level data in categories like house price. Unfortunately, existing real estate APIs (e.g., Zillow) don’t support neighborhood-level queries, nor would they be likely to use exactly the same neighborhood definitions.

The Zillow API, which I used for real estate data, only supports single-house queries based on street address. So how do you go from single Zillow queries to an average house price for a neighborhood? A natural first instinct might be to collect information on every house in the neighborhood and compute the average.

However, perhaps it isn’t necessary to get every value. Consider surveys like the census and the monthly employment situation survey: they gather important information about a population without querying every member of it. Instead, they use statistics to estimate the values and to quantify the uncertainty in those estimates.

Thus, getting every house in the neighborhood is unnecessary. But how many are necessary? We estimate the standard error of the mean using sample information:

SE_mean = s / sqrt(n)

where s is the sample standard deviation and n is the number of samples.

So if the distribution of prices in a neighborhood is relatively tight, you don’t need many samples at all to get a good estimate of the population mean.

In general, I found that only about five samples per neighborhood were needed for a good estimate, with an estimated error of about 10-20% of the sample average. This is sufficient to rank the neighborhoods, since their average prices vary from about $300,000 to $5,000,000, a spread of roughly 1500%.
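
As a quick illustration of the arithmetic (the prices below are made up for this example, not actual Zillow values):

import math

# Five hypothetical prices from one neighborhood (illustrative numbers only)
prices = [650000, 900000, 1200000, 800000, 1400000]
n = len(prices)
mean = sum(prices) / float(n)
s = math.sqrt(sum((p - mean) ** 2 for p in prices) / (n - 1))  # sample std dev
se = s / math.sqrt(n)  # standard error of the mean

print "mean = $%.0f, SE = $%.0f (%.0f%% of mean)" % (mean, se, 100 * se / mean)
# mean = $990000, SE = $136382 (14% of mean)

An error of about 14% of the mean is more than tight enough to separate neighborhoods whose averages differ by a factor of two or more.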

But how do you generate random samples within a neighborhood? Zillow doesn’t support that directly. One approach that helps is to first generate a random coordinate pair within the neighborhood. I use the following:

import random
from shapely.geometry import Point

def get_random_point_in_polygon(poly):
    # Rejection sampling: draw points uniformly from the bounding box
    # until one falls inside the polygon itself
    (minx, miny, maxx, maxy) = poly.bounds
    while True:
        p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
        if poly.contains(p):
            return p

The poly argument is a Polygon object from the Shapely library, which defines the neighborhood boundary. The function returns a randomly generated Point object that lies inside the polygon. We can plot the neighborhood and look at the coverage we get with, say, 50 points.

import matplotlib.pyplot as plt
from shapely.geometry import Polygon

# coords maps neighborhood names to (longitude list, latitude list) pairs
xy = coords['santa-monica']

xylist = zip(*xy)  # pair up the longitude and latitude lists as (lon, lat) tuples
p = Polygon(xylist)
points = [get_random_point_in_polygon(p) for i in range(50)]

ax = plt.subplot()
ax.plot(*xy)  # draw the neighborhood outline
for point in points:
    ax.plot(point.x, point.y, marker='o')

plt.show()

This is the result:

[Figure: 50 random points in the Santa Monica neighborhood.]

There is one more step. Zillow accepts API queries by street address, not coordinates. You can get an address from coordinates using a ‘reverse geocoding’ API; I used Google’s, but others are available. With the address in hand, the Zillow API can be called and the price information extracted. A similar method works with the Yelp API to estimate restaurant density.
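
A minimal sketch of that pipeline, under some assumptions: the API keys are placeholders, the helper names are mine rather than code from the deployed app, and the Zestimate is pulled from Zillow’s XML response with a regular expression instead of a proper parser:

import json
import re
import urllib
import urllib2

GOOGLE_KEY = "YOUR_GOOGLE_KEY"  # placeholder credentials
ZWSID = "YOUR_ZWS_ID"

def address_from_point(point):
    # Reverse-geocode a Shapely Point (x = longitude, y = latitude)
    # using the Google Geocoding API
    url = ("https://maps.googleapis.com/maps/api/geocode/json?latlng=%f,%f&key=%s"
           % (point.y, point.x, GOOGLE_KEY))
    data = json.load(urllib2.urlopen(url))
    return data['results'][0]['formatted_address']

def zestimate_for_point(point):
    # e.g. "123 Ocean Ave, Santa Monica, CA 90401, USA"
    street, citystatezip = address_from_point(point).split(',', 1)
    params = urllib.urlencode({'zws-id': ZWSID,
                               'address': street,
                               'citystatezip': citystatezip.strip()})
    xml = urllib2.urlopen(
        "http://www.zillow.com/webservice/GetSearchResults.htm?" + params).read()
    m = re.search(r"<amount[^>]*>(\d+)</amount>", xml)
    return int(m.group(1)) if m else None

Averaging zestimate_for_point over the random points from the previous step gives the neighborhood’s estimated average price.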

On a technical note, this method may not give a perfectly random sample from a neighborhood. A random point is assigned to the nearest address during the reverse geocoding step, so addresses in high-density areas are less likely to be chosen. If price correlates with density within a neighborhood, this could yield biased samples.
