Hedonic Pricing Model: Single-Family Residential Homes, Moreno Valley, 2020-2023
Prepared by Michael Hesford, Spring 2023
EXECUTIVE SUMMARY
This report is based on three years of home sales data from the 92557 ZIP Code in Moreno Valley, Southern California. Using a hedonic pricing model, I estimate a series of regressions that successively add explanatory variables and consider factors such as the unemployment rate, the mortgage rate and crime. For the final model, 50% of homes fall within a prediction error of +/- 3.6%.
I. INTRODUCTION
This study examines the determinants of sales prices of homes in a part of Moreno Valley, California over the three-year period beginning April 15, 2020 and ending April 14, 2023. The study area is bounded on the south by California State Route 60 (SR 60), on the west by Morton Road and on the east by Lasselle Street. The northern end is bounded by the hills and undeveloped areas of the city. This area corresponds to the 92557 ZIP Code and includes the neighborhoods of Hidden Springs, Sunnymead Ranch, Sunnymead and Box Springs.
Real estate agents, developers and homebuyers frequently refer to a home's price per square foot. In the area studied, the mean and median price per square foot were $278 and $267, respectively. However, the distribution of this measure is positively skewed, ranging from a low of $101 to a high of $555. Relying on this measure implies that home size is the only factor influencing a home's price. Economic intuition immediately suggests other factors such as lot size, number of bedrooms and bathrooms, interest rates, unemployment, pool, views, property tax rates, neighborhoods, annual demand shifts and seasonal buying behavior (e.g., most families do not move during the school year). Therefore, while perhaps convenient, price per square foot is a misleading statistic on which agents and buyers should not rely.
Accordingly, I use linear regression to develop a hedonic pricing model. A hedonic pricing model identifies price factors under the assumption that an item's price is determined by internal characteristics of the good being sold, and also by external factors affecting it. The hedonic pricing model is commonplace when working with real estate data and the housing market. This model is appropriate for determining the value of a house and its various attributes, and it also allows controlling for neighborhoods and property crime rates in the 92557 ZIP Code.
II. MODELS AND DATA
Geographic Scope
To develop the hedonic pricing model, three years of sales data were obtained from a title company in Southern California. The data are limited to real estate transactions in the 92557 ZIP Code of Moreno Valley.
Models
The following four hedonic pricing models are estimated:
- Price[i] = b0 + b1 Home Size[i] + e[i]
- Price[i] = b0 + b1 Home Size[i] + D[j] Quarter[j] + D[k] Year[k] + e[i]
- Price[i] = b0 + b1 Home Size[i] + b2 Bedrooms[i] + b3 Bathrooms[i] + b4 Pool[i] + b5 Lot Size[i] + D[m] Neighborhood[m] + D[j] Quarter[j] + D[k] Year[k] + e[i]
- Price[i] = b0 + b1 Home Size[i] + b2 Bedrooms[i] + b3 Bathrooms[i] + b4 Pool[i] + b5 Lot Size[i] + b6 Unemployment[i] + b7 Mortgage Rate[i] + D[m] Neighborhood[m] + D[j] Quarter[j] + D[k] Year[k] + e[i]
The first model has a single explanatory variable: square footage. This is similar to the naive model used by agents and homebuyers, except that a linear regression model enables fitting of an intercept. In the second model, seasonality and annual shifts in demand are captured by dummy variables for each quarter and year. The term D[j] Quarter[j] is shorthand for a sum over three dummy variables, where the summation is indexed from j = 2 to j = 4, the first quarter being the baseline, or reference, level. Similarly, the summation for the term containing Year goes over the range k = 1 to k = 3. This also creates three dummy variables that represent the years 2021 to 2023, with the reference year being 2020. The third model adds attributes of the home and its neighborhood. Finally, the fourth model adds macroeconomic variables, unemployment and the average mortgage interest rate, that are likely to have an impact on home sales and, therefore, home prices. The subscript i ranges from 1 to N and represents each home sale in the sample data. These models are estimated using ordinary least squares (OLS) regression.
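The dummy coding described above can be illustrated with a minimal Python sketch. It assumes that Quarter means the calendar quarter of the sale date; the function name and the column labels (Q2, Y2021, and so on) are hypothetical and not taken from the report's own code.

```python
from datetime import date

def quarter_year_dummies(sale_date):
    """Return quarter and year dummies for one sale.

    Q1 and 2020 are the reference levels, so neither gets a column.
    Assumption: 'Quarter' is the calendar quarter of the sale date.
    """
    quarter = (sale_date.month - 1) // 3 + 1
    row = {f"Q{j}": int(quarter == j) for j in (2, 3, 4)}
    row.update({f"Y{y}": int(sale_date.year == y) for y in (2021, 2022, 2023)})
    return row

# A sale in August 2021 falls in Q3 of 2021.
example = quarter_year_dummies(date(2021, 8, 15))
# A sale in May 2020 is in Q2 of the reference year, so all year dummies are 0.
reference_year = quarter_year_dummies(date(2020, 5, 1))
print(example)
print(reference_year)
```

A sale in the first quarter of 2020 would have all six dummies equal to zero, which is why its predicted price comes from the intercept and the remaining terms alone.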
Dependent Variable
Sales Price
The variable Sales Price (Price) is measured using the closing price of the home. [1]
Inflation
Since the data set obtained is of home sales over several years, it is important that monetary values be put in constant dollars. Accordingly, inflation data was obtained from the Bureau of Labor Statistics (BLS) website. [2] These data enable sales prices to be controlled for inflation during these years, making comparisons possible. [3] Inflation (inflation) is an index of the U.S. city average series of prices for all goods and services purchased for consumption by urban households. The index represents changes in prices between the month of sale and April 2023. I then created the variable PriceAdjusted as follows:
PriceAdjusted = Price x inflation
PriceAdjusted is the dependent variable used in each of the four regression models.
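As a rough illustration of the adjustment, the sketch below assumes that the inflation factor is the ratio of the April 2023 index level to the index level in the month of sale. The index values used are made-up round numbers, not BLS data, and the function names are hypothetical.

```python
def inflation_factor(cpi_sale_month, cpi_reference_month):
    """Ratio that restates a sale-month dollar amount in reference-month dollars.

    Assumption: the reference month is April 2023, as in the text.
    """
    return cpi_reference_month / cpi_sale_month

def price_adjusted(price, inflation):
    """PriceAdjusted = Price x inflation."""
    return price * inflation

# Hypothetical index levels, for illustration only (not actual CPI values).
factor = inflation_factor(cpi_sale_month=260.0, cpi_reference_month=300.0)
adjusted = price_adjusted(450_000, factor)
print(f"factor = {factor:.4f}, adjusted price = ${adjusted:,.0f}")
```

A home sold in the reference month itself has a factor of 1, so its adjusted price equals its closing price.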
Explanatory Variables
Home Size
Home size (HomeSize) is the area of a home, as measured in square feet.
Bedrooms
The number of bedrooms (Bedrooms) is the total number of bedrooms in a home.
Bathrooms
The number of bathrooms (Bathrooms) is the total number of bathrooms within a home. A complete bathroom (i.e., toilet, sink, bathtub and shower) is 1, while a bathroom with only a toilet and sink would be a "half bath" and have Bathrooms = 0.5. One quarter (0.25) and three-quarters bathrooms are also possible. Accordingly, this measure has values from 1 to 4, inclusive, in increments of 0.25, except for 1.25 bathrooms.
Pool
A pool in good condition is, ceteris paribus, likely to increase the value of a home. I include this as a factor influencing sales price. I create a dummy variable, Pool, that indicates whether a home sold has a pool. The data set does not specify whether pools were above-ground or in-ground, but all pools in a sample of homes viewed on satellite imagery were in-ground. Accordingly, I did not attempt to classify each of the pools.
Lot Size
Lot size is the area of the land as measured in square feet. A scatter plot of sales price versus lot size (not provided) shows, on average, a non-linear relationship between price and lot size. Initially, price is increasing linearly in lot size until approximately 10,000 square feet, but begins to increase at a decreasing rate until approximately 18,500 square feet. After this maximum, Price begins to drop slightly, although there are relatively few homes (n = 84) with lots greater than 18,500 square feet in size.
To minimize the effect of the significant right skew in the distribution of lot size, I transformed lot size using the natural logarithm of lot size, as follows:
LogLotSize = ln(LotSize)
The transformed variable will reduce, and perhaps eliminate, highly influential observations that would result from extreme values (i.e., high hat values) in x-space (i.e., among independent variables).
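The compression effect of the transformation can be seen in a short sketch. The lot sizes below are hypothetical values chosen to span the range discussed in the text; they are not drawn from the data set.

```python
import math

# Hypothetical lot sizes in square feet, for illustration only.
lot_sizes = [5_000, 7_000, 10_000, 18_500, 35_000]
log_lot_sizes = [math.log(size) for size in lot_sizes]  # LogLotSize = ln(LotSize)

raw_ratio = max(lot_sizes) / min(lot_sizes)            # largest lot relative to smallest
log_spread = max(log_lot_sizes) - min(log_lot_sizes)   # same gap on the log scale

for size, log_size in zip(lot_sizes, log_lot_sizes):
    print(f"{size:>7,} sq. ft. -> {log_size:.3f}")
print(f"raw ratio = {raw_ratio:.1f}, log spread = {log_spread:.3f}")
```

On the raw scale the largest lot is seven times the smallest, while on the log scale the same gap is ln(7), or a little under 2 units, which pulls extreme lots much closer to the bulk of the data.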
Age Group
Older homes, ceteris paribus, should sell for less. They are more likely to be in need of major repairs, have floor plans not suited to today's consumer preferences, and are not likely to have modern conveniences. But adding a continuous variable, such as age, is not likely to be sensible. Just because a home is older by one year does not imply price drops linearly. Accordingly, I constructed three age categories and added the age-classification as a factor variable in the regression model.
Figure 1 shows the distribution of homes by the year in which they were built. The three age categories are older, mid-range, and newer construction. More specifically, these were defined as homes built prior to 1980 (older homes), from 1980 through 2000 (mid-age homes), and from 2001 to the present (newer construction). To assess the impact of this classification scheme on the four models, I plotted the relationship between home size and sales price for each of the three categories. To these individual plots I fitted a simple linear regression line. A combined plot, shown in Figure 2, makes clear that the slopes of the relationship between home size and sales price are similar for the three age groups, but there are vertical shifts in pricing across them, most notably with the "newer construction" group. The median inflation-adjusted price is monotonically decreasing in age, from $590,846 for newer construction to $517,963 for mid-age homes and $488,724 for older homes. [4]
To capture the shift in sales price for the three groups, I created indicator variables for the mid-age and newer construction groups. Also called a dummy variable, the indicator variable for the mid-age group (Mid) takes on a value of 1 when an observation (i.e., a home sold) has a YearBuilt value between 1980 and 2000, inclusive, and 0 otherwise. Similarly, the dummy variable for homes in the newer construction group (New) takes on a value of 1 when YearBuilt is greater than 2000, and is otherwise 0. A significant coefficient on a dummy variable means that the regression model's intercept is shifted. Since the slope of the regression line is constrained to be the same for both conditions (i.e., 0 or 1, or the presence or absence of the category), the response function consists of parallel lines.
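The classification rule above can be written as a small Python sketch. The cutoffs come from the text; the function name is hypothetical.

```python
def age_dummies(year_built):
    """Return the Mid and New indicator variables for one home.

    Older homes (built before 1980) are the reference group, so both
    indicators are 0 for them.
    """
    mid = int(1980 <= year_built <= 2000)
    new = int(year_built > 2000)
    return {"Mid": mid, "New": new}

for year in (1975, 1980, 2000, 2001):
    print(year, age_dummies(year))
```

Because the two indicators can never both equal 1, each home belongs to exactly one of the three groups, and the coefficients on Mid and New measure vertical shifts relative to older homes.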
Neighborhoods
The data were further classified into five neighborhood regions: Box Springs, Sunnymead, Sunnymead Ranch, Hidden Springs and "Other" (mostly the older areas on the east side of the region, and also homes near the SR 60 freeway). Figure 3 is a map with neighborhoods outlined and points indicating the location of homes sold during the study period. In Panel A, points are also colored based on the neighborhood classification. For the regression models that controlled for neighborhood, the base or reference level is "Other". Panel B shows the age categories of the homes sold by location. The oldest homes are mostly in a large cluster in the eastern half of Sunnymead and along the SR 60 freeway. Homes in the Mid group (mostly from the mid-1980s to the early 1990s) expanded into all other areas of the ZIP Code, filling out Sunnymead and points west through to Box Springs and north into the Sunnymead Ranch and Hidden Springs neighborhoods. The newest homes (New) mostly filled out subdivisions in Sunnymead Ranch. Table 1 below provides descriptive summary statistics by neighborhood.
FIGURE 3
Panel A: Location of Home Sales within the ZIP Code
Panel B: Home Locations by Age Grouping
TABLE 1
Descriptive Statistics by Neighborhood
| | Other | Sunnymead | Sunnymead Ranch | Box Springs | Hidden Springs |
|---|---|---|---|---|---|
| Price | $477,007 | $438,343 | $519,371 | $493,920 | $490,041 |
| Home Size (sq. ft.) | 1,728 | 1,482 | 2,096 | 1,887 | 1,868 |
| Bedrooms | 3.43 | 3.16 | 3.65 | 3.78 | 3.56 |
| Bathrooms | 2.20 | 2.02 | 2.45 | 2.49 | 2.45 |
| Lot Size (sq. ft.) | 9,488 | 7,328 | 8,788 | 8,705 | 6,898 |
| Age (years) | 36.4 | 38.5 | 23.3 | 30.6 | 30.1 |
| Pool | 23.9% | 15.0% | 19.2% | 20.8% | 18.5% |
Dummy variables were created to capture price differences in each neighborhood. With five neighborhoods, four dummy variables were created (BoxSprings, HiddenSprings, Sunnymead and SunnymeadRanch). This leaves "Other" as the reference neighborhood. As with the age-group dummies, each neighborhood dummy shifts the predicted selling price relative to homes in the "Other" neighborhood.
Unemployment
Unemployment data (Unemployment) for Moreno Valley was obtained from YCharts. [5] Data were available from April 2020 through February 2023. Since the sales data extended through mid-April 2023, the last two months of unemployment were imputed using the trend from the prior months. Figure 4 shows the unemployment rate over the months of this study.
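The report does not state how the trend was computed. One possible approach, shown here purely as an assumption, is to fit a least-squares line through the most recent observed months and extend it forward. The function name, the six-month window and the rates in the example are all hypothetical.

```python
def extrapolate_linear(values, steps, window=6):
    """Extend a series by fitting a least-squares line to its last `window` points.

    Assumption: a linear trend over the recent months; the source does not
    specify the actual imputation method.
    """
    recent = values[-window:]
    n = len(recent)
    xs = list(range(n))
    x_mean = sum(xs) / n
    y_mean = sum(recent) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, recent))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Predict at positions n, n + 1, ... just past the last observed month.
    return [intercept + slope * (n - 1 + h) for h in range(1, steps + 1)]

# Hypothetical monthly unemployment rates (percent), not the YCharts data.
observed = [5.0, 4.8, 4.6, 4.4, 4.2, 4.0]
imputed = extrapolate_linear(observed, steps=2)
print([round(v, 2) for v in imputed])
```

With only two imputed months out of thirty-six, the choice of extrapolation method is unlikely to matter much, but it should be documented so the series can be reproduced.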
Mortgage Interest Rate
Finally, to assess the impact of mortgage interest rates (MortgageRate) on home sales prices, I obtained the average 30-year fixed mortgage rate from the St. Louis Federal Reserve Economic Data (FRED) website. [6] The line chart shown in Figure 5 depicts the interest rate over time.
Sample Selection
The data set provided had 2,261 real estate transactions. See Table 2 below. First, the data set was filtered to include only single-family residential transactions, leaving 2,079 transactions. Surprisingly, 293 homes had missing values for sales price, and those transactions were dropped. Similarly, ten homes with missing values for bedrooms were removed. Then, to remove likely outliers, nine homes that sold for over $1,000,000 were dropped. To reduce the potential for high-leverage observations, 38 homes with lot sizes greater than 35,000 square feet were filtered out. Lastly, 54 homes that sold for $250,000 or less were not representative of the larger population of homes; they were deemed outliers and removed as well. The resulting sample consists of 1,675 transactions.
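The filtering steps can be checked with a short tally. The counts are taken directly from the paragraph above; the labels are paraphrased, and the order follows the prose rather than any particular code.

```python
# Counts taken from the sample-selection description in the text.
total_transactions = 2_261
exclusions = [
    ("not single-family residential", 182),
    ("missing sales price", 293),
    ("missing bedrooms", 10),
    ("sold for over $1,000,000", 9),
    ("lot size over 35,000 sq. ft.", 38),
    ("sold for $250,000 or less", 54),
]

remaining = total_transactions
for reason, dropped in exclusions:
    remaining -= dropped
    print(f"{reason:<32} -{dropped:>4}  remaining {remaining:>6,}")

final_sample = remaining
```

The running total reaches 2,079 after the first step and 1,675 after the last, matching the figures reported in the text.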
TABLE 2
| | | |
| :--- | ---: | ---: |
| Total Real Estate Transactions, 2020 - 2023 | | 2,261 |
| Remove Commercial, Multi-Family Residential, Vacant, Agricultural Properties, etc. | | 182 |
| | | ----- |
| Total Single-Family Residential Homes | | 2,079 |
| Remove Homes With | | |
| Missing Data on Price | 293 | |
| Price Above
III. RESULTS
Descriptive Statistics
Panels A and B of Table 3 provide descriptive statistics on all variables. Panel A covers the continuous variables. The dependent variable, PriceAdjusted, exhibits significant variation. Its distribution, shown in Figure 6, appears to be close to a normal distribution but with a modest positive skew. Panel B consists of frequency tables for the five categorical variables. Panel C contains the Pearson's r and Spearman's rho correlations for all continuous variables.
TABLE 3
Descriptive Statistics and Correlation Matrix
Panel A: Continuous Measures
Panel B: Qualitative Variables
| Year | N | % |
|---|---|---|
| 2020 (9 months) | 408 | 24.4 |
| 2021 | 681 | 40.7 |
| 2022 | 472 | 28.2 |
| 2023 (3.5 months) | 114 | 6.8 |
| Total | 1,675 | 100.0 |

| Quarter | N | % |
|---|---|---|
| Q1 | 400 | 23.9 |
| Q2 | 374 | 22.3 |
| Q3 | 462 | 27.6 |
| Q4 | 439 | 26.2 |
| Total | 1,675 | 100.0 |

| Neighborhood | N | % |
|---|---|---|
| Box Springs | 132 | 7.9 |
| Hidden Springs | 141 | 8.4 |
| Sunnymead | 343 | 20.5 |
| Sunnymead Ranch | 527 | 31.5 |
| Other | 532 | 31.8 |
| Total | 1,675 | 100.0 |

| Age Group | N | % |
|---|---|---|
| Old | 204 | 12.2 |
| Mid | 1,212 | 72.4 |
| New | 259 | 15.5 |
| Total | 1,675 | 100.0 |

| Pool | N | % |
|---|---|---|
| No | 1,333 | 79.6 |
| Pool | 342 | 20.4 |
| Total | 1,675 | 100.0 |
Panel C: Correlation Matrix
| Variable | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1. Price | - | 0.96 | 0.56 | 0.37 | 0.36 | 0.31 | 0.31 | -0.58 | 0.40 | -0.25 |
| 2. PriceAdjusted | 0.96 | - | 0.67 | 0.43 | 0.43 | 0.37 | 0.37 | -0.35 | 0.18 | -0.37 |
| 3. HomeSize | 0.58 | 0.68 | - | 0.62 | 0.69 | 0.32 | 0.32 | 0.07 | -0.03 | -0.53 |
| 4. Bedrooms | 0.38 | 0.44 | 0.62 | - | 0.48 | 0.19 | 0.19 | 0.02 | 0.00 | -0.23 |
| 5. Bathrooms | 0.39 | 0.45 | 0.71 | 0.52 | - | -0.02 | -0.02 | 0.03 | -0.01 | -0.43 |
| 6. Lot Size | 0.29 | 0.34 | 0.28 | 0.12 | 0.03 | - | 1.00 | 0.01 | -0.01 | -0.09 |
| 7. LogLotSize | 0.34 | 0.39 | 0.34 | 0.19 | 0.05 | 0.94 | - | 0.01 | -0.01 | -0.09 |
| 8. Unemployment | -0.56 | -0.34 | 0.06 | 0.03 | 0.03 | 0.02 | 0.01 | - | -0.73 | -0.21 |
| 9. MortgageRate | 0.36 | 0.12 | -0.03 | -0.01 | -0.02 | -0.02 | -0.01 | -0.68 | - | 0.19 |
| 10. Age | -0.25 | -0.34 | -0.43 | -0.13 | -0.25 | -0.07 | -0.08 | -0.14 | 0.16 | - |
Notes. Coefficients below (above) the diagonal are Pearson's r (Spearman's rho) correlations. Statistically significant correlations (p < .05, two-tailed) are shown in bold. N = 1,675 observations.
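Panel C reports a Spearman correlation of 1.00 but a Pearson correlation of 0.94 between Lot Size and LogLotSize. The sketch below illustrates why: Spearman's rho depends only on ranks, which a monotone transformation such as the logarithm leaves unchanged, while Pearson's r measures linear association. The lot sizes are hypothetical, and the helper functions are a plain-Python illustration rather than the routines used in the report.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical lot sizes with one large lot, for illustration only.
lot = [5_000, 6_500, 7_200, 9_000, 12_000, 35_000]
log_lot = [math.log(v) for v in lot]

rho = spearman(lot, log_lot)
r = pearson(lot, log_lot)
print(f"Spearman rho = {rho:.2f}, Pearson r = {r:.2f}")
```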
Model Results
Base Model (1)
The coefficient on home size is $110.83 per square foot and is significant at p < .01. The intercept is $324,125 and is also significant (p < .01). The intercept might be interpreted as the price of a lot (i.e., the sales price of a property with a home size of zero), but this is beyond the relevant range of the data since the smallest home is 740 square feet. Adjusted R-squared for this model is 0.46, meaning that just under half of the variance in sales price is explained by home size. See Table 3. Checking the model's assumptions, a plot of studentized residuals vs. fitted values suggests a modest degree of heteroskedasticity and some outliers, while a histogram of the residuals reveals a non-normal distribution. While it may be possible to address some of these problems, the relatively low R-squared suggests moving on to the more complete models.
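To make the base model concrete, the sketch below plugs the reported intercept and slope into the fitted line and computes the implied price per square foot at a few home sizes. The function name and the chosen home sizes are illustrative assumptions; only the two coefficients come from the text.

```python
INTERCEPT = 324_125.0     # reported intercept, in inflation-adjusted dollars
SLOPE_PER_SQFT = 110.83   # reported coefficient on home size

def predict_price(home_size_sqft):
    """Fitted inflation-adjusted price from the base model (Model 1)."""
    return INTERCEPT + SLOPE_PER_SQFT * home_size_sqft

for size in (1_000, 2_000, 3_000):
    price = predict_price(size)
    print(f"{size:>5,} sq. ft.: ${price:>10,.0f}  (${price / size:,.0f} per sq. ft.)")

price_2000 = predict_price(2_000)
```

Because of the large intercept, the implied price per square foot falls as home size rises, which is one reason a single price-per-square-foot figure describes homes of different sizes poorly.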
summ(m1)