

How to Datamine Zillow

This is a cross-post from my blog, www.indiedevspot.com. The two versions differ slightly: indiedevspot has more content, and the formatting may be nicer there. I also post all of my presentation slide decks at https://indiedevspot.com/category/events/

So on with the article!

 

Hello World,

As many of you may know at this point, I am relocating to South Florida.  The final location is still to be determined, but I will probably be renting around Pompano Beach or Fort Lauderdale while working out of Venture Hive and the Microsoft Fort Lauderdale offices.  So what does this have to do with Zillow?  Well, it has EVERYTHING to do with Zillow.  What I’ve found while searching for homes is that between Realtors, Zillow and Trulia, nobody really has a predictive analytics solution that works for me.  So I decided to take a shot at using AzureML to mash together a few datasets and send me notifications that are more to my liking than the ones I currently get.  Step 1 in this plan is to data mine Zillow.  Luckily, Zillow has an API for that.  Or, if you are feeling particularly frisky, Zillow gets their data from ArcGIS (example for Raleigh).  So let’s get cracking…

 

Tools for the Job

  1. Azure:  There are a huge number of properties to crawl (the property IDs used later in this article run into the tens of millions).  There is no way I’m going to get this done in a timely fashion without a lot of machines at my fingertips.  Add in the fact that I can right click and publish a cloud service with 100 nodes…Hell yeah, that’s what I’m talking about… https://azure.microsoft.com/en-us/
  2. F#:   F# is just a wonderful tool.  The type providers are excellent and seem well suited for this job.  I think it might have been overkill for this sample, but the XML type provider ended up being really useful.  F# is my go-to web scraping tool from here on out. https://fsharp.org/
  3. C#:   Until F# gets better tooling for things like SQL Azure (EF support in C# is way better) and for publishing Cloud Services, I am just stuck with it.  C# simply has better libraries and support.  The good news is that C# is supposedly getting some F# features like pattern matching :).
  4. Visual Studio 2013 Community:  The free version of Visual Studio: https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx All of my future development, however, will be on 2015.  It is now pretty stable for a beta build and has a host of really awesome new features, like performance diagnostics displayed right next to each line of code you step over while debugging.

Project Structure Breakdown

Note:   The project will NOT compile fresh out of the repository.  See AnalyzeZillow.Brains breakdown.

I’m used to writing production grade software, so I always break my projects apart in a way that keeps the code maintainable and extensible.  There are 5 projects in this solution, listed below in dependency order.  That may seem like a lot, but once you get used to breaking problems apart in this fashion, it feels pretty straightforward.  Before looking too deeply here, you might want to download the code from here: https://github.com/drcrook1/ZillowAnalysis

  1. AnalyzeZillow.Core:   This is the core project with the repositories and types.  It has no dependencies on other projects in the solution.  To create this project from scratch in VS 2013 Community, click File -> New Project -> C# Library.
  2. AnalyzeZillow.Brains:   This project requires just the types from the Core project.  Working with F#, C# and web has taught me that although you get a performance gain working with F# types, the types shared across the system should be C# types to keep integration easy.  The Brains project simply provides a function that parses the Zillow API response into C# types.  To create this project from scratch in VS 2013 Community, click File -> New Project -> Other Languages -> F# Library.  You will need to install the NuGet packages for FSharp.Core and FSharp.Data.  Installing the F# Power Tools extension makes life easier as well.
  3. AnalyzeZillow.Host.Role:   This is the project that orchestrates the Brains and the Core repositories: it takes the data retrieved by the Brains and stores it in SQL Azure.  The Role is also responsible for figuring out which instance of the service it is, and for making sure the instances do not retrieve duplicate data.  This project gets generated when you create AnalyzeZillow.Host.
  4. AnalyzeZillow.Host:   This project simply defines the Azure Cloud Service.  It points to the Host.Role and holds the configuration for the number and size of instances.  You can see from the config that we only have 2 instances.  To create this project from scratch, first ensure you have the Azure SDK for VS 2013 installed, then click File -> New Project -> Cloud Service.  Name the cloud service what you want and add a worker role (make sure you edit the name here to Host.Role).
  5. AnalyzeZillow.DB:   This is the SQL project defining the database structure we will be using.  To create this from scratch, click File -> New Project -> Other Languages -> SQL Server Database Project.
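
For reference, the instance count mentioned for AnalyzeZillow.Host lives in the cloud service's ServiceConfiguration .cscfg file.  A minimal sketch of what that file might look like for this solution follows; the role name is assumed to match the worker role described above, and the real file in the repository may carry additional settings.

```xml
<?xml version="1.0" encoding="utf-8"?>
<ServiceConfiguration serviceName="AnalyzeZillow.Host"
    xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceConfiguration">
  <!-- Role name assumed to match the worker role project -->
  <Role name="AnalyzeZillow.Host.Role">
    <!-- Two instances: one crawls even ids, the other odd ids -->
    <Instances count="2" />
  </Role>
</ServiceConfiguration>
```

The instance size is set separately, through the vmsize attribute on the role in ServiceDefinition.csdef.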

Let’s look at some code already!

So now that we have the project structure and dependencies laid out, we can start looking at some code.  I am going to skip the DB project, as that is a visual editor, and I think you can figure it out.  If not, I’m working on a separate article (release pending) where we deep dive into those.

AnalyzeZillow.Core

In this project, there is a folder “SQL”.  The files under it are auto-generated by Entity Framework’s “Code First from database” option, which builds the Entity Framework context and mappings for you from the deployed database.  You can generate these by right clicking SQL, choosing Add -> New Item, going to the Data section and picking ADO.NET Entity Data Model.  Notice the “Context” class: it looks up its connection string by name, which means you will need to replace the connection string in the app.config with your own.  You are probably thinking of the app.config for this project, but NO!  You need to replace the connection string in the app.config in AnalyzeZillow.Host.Role, as that is the project that actually gets deployed.  Core is simply a library project, so you can more or less forget its app.config even exists.

    namespace AnalyzeZillow.Core.SQL
    {
        using System;
        using System.Data.Entity;
        using System.ComponentModel.DataAnnotations.Schema;
        using System.Linq;

        /// <summary>
        /// Entity Framework Generated Code.  You will need to search in the app.config
        /// for ZillowDataContext to find the connection string and replace with your own.
        /// </summary>
        public partial class ZillowDataContext : DbContext
        {
            public ZillowDataContext()
                : base("name=ZillowDataContext")
            {
            }

            public virtual DbSet<Home> Homes { get; set; }

            protected override void OnModelCreating(DbModelBuilder modelBuilder)
            {
                modelBuilder.Entity<Home>()
                    .Property(e => e.State)
                    .IsFixedLength();

                modelBuilder.Entity<Home>()
                    .Property(e => e.Latitude)
                    .HasPrecision(18, 0);

                modelBuilder.Entity<Home>()
                    .Property(e => e.Longitude)
                    .HasPrecision(18, 0);

                modelBuilder.Entity<Home>()
                    .Property(e => e.TaxAssessment)
                    .HasPrecision(18, 0);

                modelBuilder.Entity<Home>()
                    .Property(e => e.NumBathrooms)
                    .HasPrecision(18, 0);
            }
        }
    }

So how about some non-generated code?  Well, there is IZillowDataRepository and ZillowDataRepository.  Using an interface lets me swap out implementations without worry.  I should be using a dependency injection framework too, but oh well, this is quick.  The only method really being used here is “SaveSingle”.

    using AnalyzeZillow.Core.SQL;
    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Threading.Tasks;

    namespace AnalyzeZillow.Core
    {
        public class ZillowDataRepository : IZillowDataRepository
        {
            public ZillowDataContext dbContext { get; set; }

            public ZillowDataRepository()
            {
                dbContext = new ZillowDataContext();
            }

            /// <summary>
            /// Notice that this is NOT IZillowDataRepository.SaveSingle,
            /// Doing so would force a private member here.
            /// </summary>
            /// <param name="home"></param>
            /// <returns></returns>
            public async Task<bool> SaveSingle(Home home)
            {
                try
                {
                    dbContext.Homes.Add(home);
                    await dbContext.SaveChangesAsync();
                    return true;
                }
                catch (Exception)
                {
                    //if something goes wrong, we can't save the changes,
                    //but chances are it is still in the context.
                    //next time we attempt a save, it will still be there
                    //causing another exception.
                    //we must remove the bad home (probably a key duplicate)
                    //otherwise our program will constantly be failing with a growing
                    //in memory list of homes to add
                    dbContext.Homes.Remove(home);
                    return false;
                }
            }

            public async Task<bool> SaveBatch(ICollection<Home> homes)
            {
                foreach (Home home in homes)
                {
                    //Add a bunch of homes here
                    dbContext.Homes.Add(home);
                }
                try
                {
                    //save all of them as a batch
                    //this is a HUGE performance gain.
                    //I've seen upwards of multiples of minutes
                    //performance increase for larger inserts
                    await dbContext.SaveChangesAsync();
                    return true;
                }
                catch (Exception)
                {
                    return false;
                }
            }

            ICollection<SQL.Home> IZillowDataRepository.GetHomes()
            {
                throw new NotImplementedException();
            }

            SQL.Home IZillowDataRepository.GetSingleHome(int zId)
            {
                throw new NotImplementedException();
            }
        }
    }

This repository provides the ability to get and save data.  For now we are JUST saving data, and therefore we do not need implementations for any of the getters.  Entity Framework is really awesome and, in my experience, performs well.  I highly recommend learning it if you don’t know it yet.  It makes dealing with SQL super easy.  This repository is what our Host.Role uses to save the data retrieved by our F# Brains.
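
To make the “swap out implementations” point concrete, here is a minimal sketch of what the interface might look like, together with an in-memory implementation that could stand in for SQL Azure during local testing.  The interface is not shown in this article, so its shape is inferred from the members ZillowDataRepository implements.  The Home class is a stripped-down stand-in for the generated entity, zId is assumed to be the key, and InMemoryZillowDataRepository is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Stand-in for the generated entity; only a couple of fields are modeled.
public class Home
{
    public int zId { get; set; }
    public string City { get; set; }
}

// Inferred from the members ZillowDataRepository implements (assumption).
public interface IZillowDataRepository
{
    Task<bool> SaveSingle(Home home);
    Task<bool> SaveBatch(ICollection<Home> homes);
    ICollection<Home> GetHomes();
    Home GetSingleHome(int zId);
}

// Hypothetical in-memory implementation for tests and local runs.
public class InMemoryZillowDataRepository : IZillowDataRepository
{
    private readonly Dictionary<int, Home> homes = new Dictionary<int, Home>();

    public Task<bool> SaveSingle(Home home)
    {
        // Mirror the database behavior: a duplicate key is a failed save.
        if (homes.ContainsKey(home.zId))
            return Task.FromResult(false);
        homes.Add(home.zId, home);
        return Task.FromResult(true);
    }

    public Task<bool> SaveBatch(ICollection<Home> batch)
    {
        // All-or-nothing, like a single SaveChanges call.
        bool hasDuplicate = batch.Any(h => homes.ContainsKey(h.zId))
            || batch.GroupBy(h => h.zId).Any(g => g.Count() > 1);
        if (hasDuplicate)
            return Task.FromResult(false);
        foreach (Home h in batch)
            homes.Add(h.zId, h);
        return Task.FromResult(true);
    }

    public ICollection<Home> GetHomes()
    {
        return homes.Values.ToList();
    }

    public Home GetSingleHome(int zId)
    {
        Home found;
        return homes.TryGetValue(zId, out found) ? found : null;
    }
}
```

Code that depends only on IZillowDataRepository can then be exercised without a database by handing it the in-memory version.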

That’s really it for the Core, on to the Brains!

AnalyzeZillow.Brains

There are really only two files here that matter: Brains.fs and Script.fsx.  You will probably notice that when you bring the project down it fails to compile.  It fails because I am using the FSharp.Data XML type provider, which fetches the sample URLs at compile time.  Without a valid key, the XML coming back has a single element “Message”, which says “Where the heck is your api key dummy!?”, so the types the rest of the code expects never get generated.  Find the following lines of code at the top of the Brains.fs file and replace {YOURKEY} with your very own Zillow API key.  You will see it works.

    module Brains =
        [<Literal>]
        let zillowBasicSample = "https://www.zillow.com/webservice/GetUpdatedPropertyDetails.htm?zws-id={YOURKEY}&zpid=48749425"
        let zillowBasicUrl = "https://www.zillow.com/webservice/GetUpdatedPropertyDetails.htm?zws-id={YOURKEY}&zpid="
        [<Literal>]
        let zillowDeepSample = "https://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id={YOURKEY}&address=2114+Bigelow+Ave&citystatezip=Seattle%2C+WA"
        let zillowDeepsUrl = "https://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id={YOURKEY}"
        type ZillowBasic = XmlProvider<zillowBasicSample>
        type ZillowDeep = XmlProvider<zillowDeepSample>

The type providers are really awesome, and there are type providers for all sorts of different things.  This one builds the types “ZillowBasic” and “ZillowDeep” from the XML returned by the sample URLs.  The other two URLs are the base URLs, kept separate so that formulating a request reads better.  You will notice that what I do here is a terrible habit (rats, you caught me).  I should pull the XML ahead of time, save it in another file, and pass that file to the provider as the sample.  This would actually prevent the compilation problems as well.  Oh well, in this instance, just get your very own API key, put it here, and the code should compile fine.

OK, so now that the project compiles, let’s talk about each of the files, Brains.fs and Script.fsx.  Script.fsx is not actually compiled; it is a script file that you can use for data exploration and testing.  It is super awesome to have something like this in the .NET runtime!  Since F# is compatible with C#, you can test out C# stuff like this too (but you have to do it from F#).  Highlight the code you want, press ALT+Enter, and it will execute in the F# Interactive window.  This is how I figured out which Zillow calls returned the data I wanted, and how I composed the second, more complicated request and tested it in real time.  Brains.fs is simply the compiled form of what that experimentation led me to.

    namespace AnalyzeZillow.Brains

    open FSharp.Data
    open AnalyzeZillow.Core.SQL
    open System.Xml.Linq

    module Brains =
        [<Literal>]
        let zillowBasicSample = "https://www.zillow.com/webservice/GetUpdatedPropertyDetails.htm?zws-id={YOURKEY}&zpid=48749425"
        let zillowBasicUrl = "https://www.zillow.com/webservice/GetUpdatedPropertyDetails.htm?zws-id={YOURKEY}&zpid="
        [<Literal>]
        let zillowDeepSample = "https://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id={YOURKEY}&address=2114+Bigelow+Ave&citystatezip=Seattle%2C+WA"
        let zillowDeepsUrl = "https://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id={YOURKEY}"
        type ZillowBasic = XmlProvider<zillowBasicSample>
        type ZillowDeep = XmlProvider<zillowDeepSample>

        // Takes only the property id: the API key is already baked into the
        // URLs above, and the worker role calls this as GetHome(i).
        let GetHome (id:int) =
            let idString = id.ToString()
            // First request: look up the address for this property id
            let zBasic = ZillowBasic.Load(zillowBasicUrl + idString)
            let address = zBasic.Response.Address.Street.Replace(" ", "+")
            let citystatezip = zBasic.Response.Address.City + "%2C+" + zBasic.Response.Address.State
            // Second request: deep search by address for the full details
            let deepUrl = zillowDeepsUrl + "&address=" + address + "&citystatezip=" + citystatezip
            let zFull = ZillowDeep.Load(deepUrl)
            let home = new Home()
            home.City <- zFull.Response.Result.Address.City
            home.FIPSCounty <- zFull.Response.Result.FipScounty
            home.HomeSize <- zFull.Response.Result.FinishedSqFt
            home.HomeType <- zFull.Response.Result.UseCode
            home.Latitude <- zFull.Response.Result.Address.Latitude
            home.Longitude <- zFull.Response.Result.Address.Longitude
            home.NumBedrooms <- zFull.Response.Result.Bedrooms
            home.State <- zFull.Response.Result.Address.State
            home.Street <- zFull.Response.Result.Address.Street
            home.TaxAssesmentYear <- zFull.Response.Result.TaxAssessmentYear
            home.TaxAssessment <- zFull.Response.Result.TaxAssessment
            home.YearBuild <- zFull.Response.Result.YearBuilt
            home.zId <- id
            // The fields below are frequently missing, so each group gets
            // its own fallback instead of failing the whole home.
            try
                home.NumBathrooms <- zFull.Response.Result.Bathrooms
            with
                | _ -> home.NumBathrooms <- 0.1m
            try
                home.LotSize <- zFull.Response.Result.LotSizeSqFt
            with
                | _ -> home.LotSize <- 0
            try
                home.ZillowEstimate <- float(zFull.Response.Result.Zestimate.Amount.Value)
                home.ZillowHighEstimate <- float(zFull.Response.Result.Zestimate.ValuationRange.High.Value)
                home.ZillowLowEstimate <- float(zFull.Response.Result.Zestimate.ValuationRange.Low.Value)
            with
                | _ ->
                    home.ZillowEstimate <- float(0)
                    home.ZillowHighEstimate <- float(0)
                    home.ZillowLowEstimate <- float(0)
            home

        // Unused placeholder
        let GetData (something:int) =
            let data = 1
            0
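
GetHome builds the deep-search query string by hand, replacing spaces with “+” in the street only.  A city with a space in its name (Pompano Beach, say) goes into the URL unencoded.  As a hedged alternative, the sketch below, written in C# like the rest of the solution, percent-encodes both values with Uri.EscapeDataString.  ZillowUrls and DeepSearch are hypothetical names, and the sketch assumes the service accepts %20 for spaces the same way it accepts “+”.

```csharp
using System;

public static class ZillowUrls
{
    // Same base URL as Brains.fs; {YOURKEY} is still a placeholder.
    private const string DeepSearchBase =
        "https://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id={YOURKEY}";

    // Builds the deep-search URL, percent-encoding both free-text values.
    public static string DeepSearch(string street, string city, string state)
    {
        string address = Uri.EscapeDataString(street);
        string cityStateZip = Uri.EscapeDataString(city + ", " + state);
        return DeepSearchBase
            + "&address=" + address
            + "&citystatezip=" + cityStateZip;
    }
}
```

The same Uri.EscapeDataString call is available from F#, so the idea carries over to Brains.fs directly.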

Now I admit, this is not really “functional” coding, but I really just wanted access to the REPL and the type providers F# offers.  There was no real need for me to put together a series of composable functions; that would have been just an academic exercise.  I will end up using more F# features during my exploration and analysis after the collection is done.  You will notice a fair number of try/catch blocks.  One of the biggest things you will find is that data very rarely comes back the way it is promised.  Even Zillow has issues.

Some properties have the same IDs, some properties are missing estimates, and for some properties they just didn’t bother to put in any data at all.  So the try/catch blocks go independently around the various properties that I noticed had a substantial number of issues, so that I can hopefully still get a substantial amount of data back.  Even doing this, it appears there are issues with approximately 80% of the data I parsed through, so some extra effort needs to go into checking that out.  Some of the results simply said “I have no data for this, sorry”.  That was actually a fairly large portion of the 80% that had issues (I have no statistics on what exactly was wrong with the 80%).
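
One way to get those missing statistics would be to route every risky field read through a small helper that applies the fallback and counts the failure per field.  This is only a sketch of the idea in C#; FieldReader and its members are hypothetical and are not part of the repository.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: reads a value with a fallback and tallies failures.
public class FieldReader
{
    private readonly Dictionary<string, int> failures = new Dictionary<string, int>();

    // Number of failed reads recorded so far, keyed by field name.
    public IDictionary<string, int> Failures
    {
        get { return failures; }
    }

    // Runs the getter; on any exception, counts it and returns the fallback.
    public T Read<T>(string field, Func<T> getter, T fallback)
    {
        try
        {
            return getter();
        }
        catch (Exception)
        {
            int count;
            failures.TryGetValue(field, out count);
            failures[field] = count + 1;
            return fallback;
        }
    }
}
```

For example, reader.Read("Bathrooms", () => decimal.Parse(rawValue), 0.1m) would return 0.1m for unparseable input and leave a tally in reader.Failures that can be logged once the crawl has run for a while.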

AnalyzeZillow.Host.Role

For this cloud service, I planned to use only 2 instances, so I only needed to split the workload into even requests and odd requests.  I simply determine which instance is running the code (0 or 1) and set it querying Zillow indefinitely, starting at an even or an odd property ID and incrementing the ID by 2 after each request.  Notice there is a fair amount of generated code.  I just left it there; our code really doesn’t need most of that mess, but I hate deleting generated code sometimes, because it’s tough to tell what it is really doing.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using System.Net;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.WindowsAzure;
    using Microsoft.WindowsAzure.Diagnostics;
    using Microsoft.WindowsAzure.ServiceRuntime;
    using Microsoft.WindowsAzure.Storage;
    using AnalyzeZillow.Core.SQL;
    using AnalyzeZillow.Core;

    namespace AnalyzeZillow.Host.Role
    {
        public class WorkerRole : RoleEntryPoint
        {
            //Generated Code
            private readonly CancellationTokenSource cancellationTokenSource = new CancellationTokenSource();
            private readonly ManualResetEvent runCompleteEvent = new ManualResetEvent(false);

            public override void Run()
            {
                Trace.TraceInformation("AnalyzeZillow.Host.Role is running");
                Trace.TraceInformation("Working");
                //DO Work
                var instanceIdSplit = RoleEnvironment.CurrentRoleInstance.Id.Split('_');
                //last token in the id is the integer id of the role _00 - _99
                var roleNumberString = instanceIdSplit[instanceIdSplit.Length - 1];
                int roleNumber;
                int.TryParse(roleNumberString, out roleNumber);
                //82300179 <- Home in Pompano Beach
                //instance 0 takes the even ids, the other instance the odd ids
                int startId = (roleNumber == 0) ? 82300178 : 82300179;
                try
                {
                    //Block here: if Run returns, Azure recycles the role instance.
                    DoWork(startId, this.cancellationTokenSource.Token).Wait();
                }
                finally
                {
                    //OnStop waits on this event, so always signal it.
                    this.runCompleteEvent.Set();
                }
            }

            public static async Task DoWork(int i, CancellationToken token)
            {
                ZillowDataRepository zRepo = new ZillowDataRepository();
                //keep going until OnStop cancels the token
                while (!token.IsCancellationRequested)
                {
                    try
                    {
                        Home h = Brains.Brains.GetHome(i);
                        await zRepo.SaveSingle(h);
                    }
                    catch (Exception)
                    {
                        //skip ids that fail to download or parse
                    }
                    i += 2;
                }
            }

            //Generated Code
            public override bool OnStart()
            {
                // Set the maximum number of concurrent connections
                ServicePointManager.DefaultConnectionLimit = 12;

                // For information on handling configuration changes
                // see the MSDN topic at https://go.microsoft.com/fwlink/?LinkId=166357.

                bool result = base.OnStart();

                Trace.TraceInformation("AnalyzeZillow.Host.Role has been started");

                return result;
            }

            //Generated Code
            public override void OnStop()
            {
                Trace.TraceInformation("AnalyzeZillow.Host.Role is stopping");

                this.cancellationTokenSource.Cancel();
                this.runCompleteEvent.WaitOne();

                base.OnStop();

                Trace.TraceInformation("AnalyzeZillow.Host.Role has stopped");
            }
        }
    }
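
The even/odd split is the two-instance case of a more general stride: with N instances, instance k starts at the base ID plus k and steps by N.  The sketch below shows that arithmetic in isolation, so it runs without the Azure SDK.  WorkPartition and its methods are hypothetical names, and the instance ID format (an integer after the last underscore) is taken from the comment in the worker role code rather than from any official documentation.

```csharp
using System;

public static class WorkPartition
{
    // Parses the trailing integer from a role instance id such as "..._IN_1".
    public static int ParseInstanceIndex(string instanceId)
    {
        string[] tokens = instanceId.Split('_');
        int index;
        if (!int.TryParse(tokens[tokens.Length - 1], out index))
            throw new FormatException("Unexpected instance id: " + instanceId);
        return index;
    }

    // Returns the id that a given instance requests on a given step.
    // Every id from baseId upward is covered by exactly one instance.
    public static int IdForStep(int baseId, int instanceIndex, int instanceCount, int step)
    {
        return baseId + instanceIndex + step * instanceCount;
    }
}
```

With a base ID of 82300178 and two instances, instance 0 requests 82300178, 82300180, and so on, while instance 1 requests 82300179, 82300181, and so on, which matches the hard-coded behavior above.  Inside the role, the instance count could be read from RoleEnvironment.CurrentRoleInstance.Role.Instances.Count rather than assumed.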

Also, remember, this is the project where we need to add the database connection string.  Its name must match, case-sensitively, the name used by the context in our AnalyzeZillow.Core project (ZillowDataContext).  The connection string needs to be set in this project’s app.config.
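
As a rough sketch, the relevant part of the Host.Role app.config might look like the fragment below.  The entry name comes from the “name=ZillowDataContext” string in the generated context; the server, database, user and password values are placeholders to replace with your own, and the exact connection string options for your SQL Azure database may differ.

```xml
<configuration>
  <connectionStrings>
    <!-- The name must match the one passed to the DbContext base constructor -->
    <add name="ZillowDataContext"
         connectionString="Server=tcp:{YOURSERVER}.database.windows.net,1433;Database={YOURDATABASE};User ID={YOURUSER};Password={YOURPASSWORD};Encrypt=True;"
         providerName="System.Data.SqlClient" />
  </connectionStrings>
</configuration>
```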

Debugging the project

As the project has 2 cloud service instances, you will need to use the full emulator.  To select it, right click AnalyzeZillow.Host and choose Properties.  In the properties, click the “Web” tab on the left.  This changes the main view, where you can select the option “Use Full Emulator”.  The full emulator will simulate multiple instances of the cloud service.


Close Visual Studio and start it again in admin mode.  Right click AnalyzeZillow.Host -> Debug -> Start new instance.  Verify that everything works: the role starts, requests go out to Zillow, and rows show up in the database.

Publishing the solution

This is super easy.  Right click AnalyzeZillow.Host -> Publish -> New Cloud Service.  Enter in your credentials, hit OK and you are off!

Summary

We covered Datamining Zillow through their api using Azure Cloud Services, SQL Azure, C#, F# and Visual Studio!  I hope this was useful to you.  I use these same techniques for pretty much all data mining activities.  The complexity of this is a bit lower than most, which makes it a great starter project.