Collecting Neighborhood Data

Since deploying my latest web application, a Los Angeles Neighborhood Ranker, I’ve wanted to explain the process of gathering the data. The first step was to decide which neighborhoods to use, what they’re called, and how they’re defined geographically.

My first stop was the LA Times Mapping LA project, which has a mapping feature for many Los Angeles neighborhoods. Each neighborhood has an individual page where it is plotted on a map. By inspecting the page’s source code, you can find the coordinates of the neighborhood’s boundaries:

(From http://maps.latimes.com/neighborhoods/neighborhood/santa-monica/)

"geometry": { "type": "MultiPolygon", "coordinates": 
[ [ [ [ -118.483981, 34.041635 ], [ -118.483766, 34.041430 ]...

This information can be collected and stored for later use by web scraping: a Python library like urllib2 opens the page, and a regular expression finds the coordinates. The following shows the essential steps:

import urllib2
import re

### Code here to define fullurl
webpage = urllib2.urlopen(fullurl)
for line in webpage:
  d = re.search(r"geometry.+(\[ \[ \[ \[.+\] \] \] \])", line)
  if d:
    coord_string = d.group(1)  ### store d.group(1) for this neighborhood

Storing these coordinates as text allows them to be used in mapping and data gathering. Mapping is accomplished with leaflet.js and Mapbox, which I’ll describe later. For now I will talk about how the neighborhood coordinates help in the data collection process.
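Incidentally, the captured coordinate string is itself valid JSON, so it can be read back into plotting-friendly lists. Here is a minimal sketch; the load_coords helper and the (xs, ys) layout are my own convention for this post, not something the app dictates:

import json

def load_coords(coord_string):
    ### coord_string is the MultiPolygon coordinate text captured by the regex above
    rings = json.loads(coord_string)   # list of polygons, each a list of rings
    ring = rings[0][0]                 # outer ring of the first polygon
    xs = [pt[0] for pt in ring]        # longitudes
    ys = [pt[1] for pt in ring]        # latitudes
    return (xs, ys)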

For the neighborhood ranker, I needed neighborhood-level data in categories like house price. Unfortunately, the existing real estate APIs (e.g. Zillow) don’t provide neighborhood-level data, and even if they did, they would be unlikely to use exactly the same neighborhood definitions.

The Zillow API, which I used for real estate data, only supports single-house queries based on street address. So how do you go from single Zillow queries to an average house price for a neighborhood? A natural first instinct might be to collect information on every house in the neighborhood and compute the average.

However, perhaps it isn’t necessary to get every value. Consider surveys like the census and employment situation surveys. These gather important information about populations without getting information from every member of the population. Instead they use knowledge of statistics and error to estimate the values and quantify uncertainty.

Thus, getting every house in the neighborhood is unnecessary. But how many are necessary? We estimate the standard error of the mean using sample information:

SE_mean = s / sqrt(n)

where s is the sample standard deviation and n is the number of samples.

Thus, if the distribution of prices for each neighborhood is relatively tight, you wouldn’t need many samples at all to get a good estimate of population mean.

In general, I found that only about five samples were needed per neighborhood to get a good estimate, with an estimated error of about 10-20% of the sample average. This is sufficient to rank the neighborhoods, since average prices vary from roughly $300,000 to $5,000,000, a range of about 1500%.
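As a quick illustration of the arithmetic (the prices below are made-up numbers, not actual Zillow results), the standard error from a five-house sample can be computed directly:

import math

# Made-up sample of five house prices for one neighborhood
prices = [750000.0, 820000.0, 910000.0, 680000.0, 1040000.0]

n = len(prices)
mean = sum(prices) / n
s = math.sqrt(sum((p - mean) ** 2 for p in prices) / (n - 1))  # sample standard deviation
se = s / math.sqrt(n)                                          # standard error of the mean

print mean, se, se / mean  # mean, SE, and SE as a fraction of the mean

Here the SE comes out to roughly 7-8% of the sample mean, which is plenty of precision for ranking neighborhoods whose averages differ by an order of magnitude.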

But how do you generate random samples within a neighborhood? Zillow doesn’t support that directly. One approach is to first generate a random coordinate pair within the neighborhood. I use the following:

import random
from shapely.geometry import Point

def get_random_point_in_polygon(poly):
  (minx, miny, maxx, maxy) = poly.bounds
  while True:
    p = Point(random.uniform(minx, maxx), random.uniform(miny, maxy))
    if poly.contains(p):
      return p

The poly argument is a Polygon object from the Shapely library, which defines the neighborhood boundary. The function returns a randomly generated Point object that lies inside the Polygon. We can plot the neighborhood and look at the coverage we get with, say, 50 points.

import matplotlib.pyplot as plt
from shapely.geometry import Polygon

xy = coords['santa-monica']   # stored boundary as (x list, y list)

xylist = zip(*xy)             # pair up into (x, y) coordinate tuples
p = Polygon(xylist)
points = [get_random_point_in_polygon(p) for i in range(50)]

ax = plt.subplot()
ax.plot(*xy)                  # draw the neighborhood boundary
for point in points: ax.plot(point.x, point.y, marker='o')

plt.show()

This is the result:

[Figure: 50 random points in the Santa Monica neighborhood.]

There is one more step. Zillow accepts API queries by address, not by coordinates. You can get an address from coordinates by using a ‘geocoding’ API; I used Google’s, but others are available. With the address in hand, the Zillow API can be called and the price information extracted. A similar method works with the Yelp API for restaurant density.
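As a rough illustration of that step, reverse geocoding a random point might look something like the following. This is a minimal sketch assuming the Google Maps Geocoding API; the helper name and the API-key placeholder are mine, not from the app:

import json
import urllib2

GOOGLE_API_KEY = 'YOUR_KEY_HERE'  # placeholder

def address_from_point(point):
    ### Reverse-geocode a Shapely Point (x = longitude, y = latitude) to a street address
    url = ('https://maps.googleapis.com/maps/api/geocode/json'
           '?latlng={0},{1}&key={2}').format(point.y, point.x, GOOGLE_API_KEY)
    response = json.load(urllib2.urlopen(url))
    if response.get('results'):
        return response['results'][0]['formatted_address']
    return None

The returned address can then be fed to the Zillow query.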

On a technical note, this method may not give a completely random sample of a neighborhood’s houses. A random point is assigned to the nearest address during the geocoding step, so individual addresses in high-density areas are less likely to be chosen. If price correlates with density within a neighborhood, this could yield biased samples.


Ranking Neighborhoods in Los Angeles

My latest web application is an interactive, personalized, map-based ranking of neighborhoods in Los Angeles. I’ve spent the last few weeks gathering data points for each of 155 neighborhoods in the Los Angeles area. That could (and soon will) be the topic of its own post.

For now, I wanted to explain the ranking algorithm I use to come up with the rank score.

First, the features I have gathered on each of the neighborhoods:

  • Property Crime
  • Violent Crime
  • Air Quality
  • School Quality
  • Population Density
  • Restaurant Density
  • Price to Buy
  • Price to Rent

How to rank neighborhoods based on these characteristics? How to do so in a way that allows users to choose which features they care about and their relative importance?

First, assign each neighborhood a score within each category. For simplicity, I made each score run from 0 to 100: the best neighborhood in each category scores 100, the worst 0. But how exactly to determine this? Some features are clearly better when low (price, crime) and some are better when high (walkability, school quality). Furthermore, there is the choice of whether to apply other transformations to the data, such as log scaling.

So I defined functions that transform the data in desired ways.

import math
def standard(n):
	return n
def neglog(n):
	return -math.log(n)
def neg(n):
	return -n

Now I can define a dictionary pairing each dataset with the desired transformation function.

 fundic = {
     'aqf':neg,      #air quality
     'hpf':neglog,   #house price
     'rpf':neg,      #rent price
     'pcf':neg,      #property crime
     'pdf':standard, #population density
     'rdf':standard, #restaurant density
     'sqf':standard, #school quality
     'vcf':neg,      #violent crime
     'wsf':standard, #walkscore
 }

Python treats functions as first-class citizens, so we can pull a function out of the dictionary and pass it around like any other variable. Within a loop that iterates over each dataset, we select the function corresponding to the dataset stem of choice.

 for stem in list_of_stems:
     fun = fundic[stem]

Now, while loading the dataset, we apply the function:

 featdic = {}
 for line in datafile:
     feats = line.split(',')
     featdic[ feats[0] ] = fun( float( feats[1] ) )

Here we load the transformed value into a dictionary keyed by the neighborhood name; the name and raw value appear together in the data file, separated by a comma. Now find the max and min.

 vals = featdic.values()
 minval = min(vals)
 maxval = max(vals)

Now to put the data in order and normalize it.

 orderfile = open('name_order.txt','r')
 for line in orderfile:
     name = line.strip()   # strip the newline so the key matches featdic
     rankscore = ( featdic[name] - minval )/( maxval - minval ) * 100.0

So this gives us a rank score normalized to 0 to 100 and transformed by negation, log, or any other transformation we’d like.

Now to address another question: given the user’s rank order of features, how do we generate a composite ranking? There is no perfect answer. Of course, you could give the user control over the exact weighting, but that seems like unnecessary feature bloat. Instead I decided to weight harmonically. The harmonic series goes 1, 1/2, 1/3, 1/4, … This weighting is somewhat arbitrary but feels quite natural: first place is twice as important as second, three times as important as third, and so on. For example, with three selected features the raw weights 1, 1/2, and 1/3 normalize to roughly 55%, 27%, and 18%.

On the server side, I have Flask waiting for the names of the columns to include in the ranking. First, I keep a copy of the order of the names in the database. (In this case it’s just a list of lists in a .py file, but it could live in a SQL database or some other configuration.)

names = ["prop-crime", 
"viol-crime",
"air-qual",
"school-qual",
"pop-dens-h",
"pop-dens-l",
"rest-dens",
"buy-price",
"rent-price",
"walk-score"]

Now I can define a function to get the composite ranking. First get the number of columns.

def getrank(cols):
    n = len(cols)

Load in the names and the list of scores.

    scoremat = scores()   # list of score lists, one per neighborhood
    nameslist = names     # the column-name list defined above

Now calculate the overall normalization factor for the n columns (i.e. the sum 1 + 1/2 + … + 1/n).

    norm = sum( [ 1/float(i+1) for i in range(n)] )

And now get the rank and index of each feature in the list.

    rankinf = [(i+1, nameslist.index(j) ) for i,j in enumerate(cols)]

And finally calculate the composite rank for each neighborhood and return it as a list of strings.

    return [ '{0:.1f}'.format( sum( [ 1/float(i)*sl[j] for i,j in rankinf ] ) / norm )
             for sl in scoremat ]
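
To tie it together, here is a rough sketch of how the Flask side might receive the user’s column choices and call getrank. The route path and the query-parameter name are assumptions for illustration, not necessarily what the app uses:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/rank')
def rank():
    ### e.g. /rank?col=school-qual&col=buy-price&col=viol-crime, in the user's order
    cols = request.args.getlist('col')
    return jsonify(scores=getrank(cols))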