Home Sale Forecasting#

Run on Google Colab View source on GitHub Download notebook


Introduction#

In this tutorial, we present how to create and train a machine learning model for forecasting the sale price of a home using the built-in Forecasting AI Engines in EvaDB. EvaDB makes it easy to do forecasting queries over data in your database using its built-in Forecasting engines.

Prerequisites#

To follow along, you will need to set up a local instance of EvaDB via pip.

Connect to EvaDB#

After installing EvaDB, use the following Python code to establish a connection and obtain a cursor for running EvaQL queries.

import evadb
cursor = evadb.connect().cursor()

Connect EvaDB to PostgreSQL Database Server#

We will assume that you have a PostgreSQL database running locally that contains the data needed for analysis. Follow these instructions to install PostgreSQL.

Note

If find it challenging to install the PostgreSQL database on your machine, here is an alternative for quick prototyping.

You can use an embedded SQLite database. If you go with the sqlite database, alter the SQL commands in this tutorial to use the sqlite engine and the evadb.db SQLite database file as explained in the SQLite page.

EvaDB lets you connect to your favorite databases, data warehouses, data lakes, etc., via the CREATE DATABASE statement. In this query, we connect EvaDB to an existing PostgreSQL server:

CREATE DATABASE postgres_data
WITH ENGINE = 'postgres',
PARAMETERS = {
    "user": "eva",
    "password": "password",
    "host": "localhost",
    "port": "5432",
    "database": "evadb"
}

We will assume that the input data is loaded into a PostgreSQL database. To load the home sales dataset into your database, see the complete home sale forecasting notebook on Colab.

Preview the Home Sale Price Data#

We use the House Property Sales Time Series dataset from Kaggle in this tutorial. The dataset (raw_sales.csv) contains five columns: postcode, price, bedrooms, datesold, and propertytype.

SELECT *
FROM postgres_data.home_sales
LIMIT 3;

This query presents a subset of the dataset in the home_sales table:

+---------------------+------------------+---------------------+---------------------+-------------------------+
| home_sales.postcode | home_sales.price | home_sales.bedrooms | home_sales.datesold | home_sales.propertytype |
|---------------------|------------------|---------------------|---------------------|-------------------------|
|                2607 |           525000 |                   4 |          2007-02-07 |                   house |
|                2906 |           290000 |                   3 |          2007-02-27 |                   house |
|                2905 |           328000 |                   3 |          2007-03-07 |                   house |
+---------------------+------------------+---------------------+---------------------+-------------------------+

Train a Forecasting Model#

Let’s next train a time-series forecasting model over the home_sales table using EvaDB’s CREATE FUNCTION statement. In particular, we are interested in forecasting the price of homes with three bedrooms in the 2607 postcode.

CREATE FUNCTION IF NOT EXISTS HomeSaleForecast FROM
(
    SELECT propertytype, datesold, price
    FROM postgres_data.home_sales
    WHERE bedrooms = 3 AND postcode = 2607
)
TYPE Forecasting
PREDICT 'price'
TIME 'datesold'
ID 'propertytype'
FREQUENCY 'W';

This query returns the trained model:

+----------------------------------------------+
| Function HomeSaleForecast successfully added |
+----------------------------------------------+

Note

The Time Series Forecasting page lists all the configurable parameters for the forecasting model.

In the home_sales dataset, we have two different types of properties – houses and units, and price gap between them is large. To get better forecasts, we specify the propertytype column as the ID of the time series data. This denotes the identifier for the series and allows EvaDB to forecast the prices of houses and units independently.

Forecast using the Trained Model#

Next we use the trained HomeSaleForecast model to predict the home sale price for next 3 months. The model takes the horizon as input during prediction. The horizon denotes the steps in time over which we want to forecast in the future. In this query, the horizon is 3 months.

SELECT HomeSaleForecast(3);

The query returns the forecasted prices of the properties:

+-------------------------------+---------------------------+------------------------+
| homesaleforecast.propertytype | homesaleforecast.datesold | homesaleforecast.price |
+-------------------------------+---------------------------+------------------------+
|                         house |                2019-07-21 |                 766572 |
|                         house |                2019-07-28 |                 766572 |
|                         house |                2019-08-04 |                 766572 |
|                          unit |                2018-12-23 |                 417229 |
|                          unit |                2018-12-30 |                 409601 |
|                          unit |                2019-01-06 |                 402112 |
+-------------------------------+---------------------------+------------------------+

We may use ORDER BY to find out which month in the following year has the lowest price.

SELECT *
FROM (SELECT HomeSaleForecast(12)) AS HomeSale
ORDER BY price
LIMIT 1;

Here is the query’s output:

+-----------------------+-------------------+----------------+
| HomeSale.propertytype | HomeSale.datesold | HomeSale.price |
+-----------------------+-------------------+----------------+
|                  unit |        2019-03-10 |         340584 |
+-----------------------|-------------------|----------------|

What’s Next?#

👋 If you are excited about our vision of bringing AI inside databases, consider:

Language Models (🦙) and Databases

Language Models (🦙) and Databases#