
Market share helps companies demonstrate how prevalent their sales are in a given market. It’s a good thought exercise in sampling and data completeness. Some businesses can calculate their market share with pinpoint accuracy if they collect data from enough sources (paid or open source). But when your industry is indirectly tied to another market, your share is harder to calculate. The basic equation for market share is (your sales / all sales). The resulting fraction represents your percent of the pie and showcases your competitiveness in a given market space.
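As a quick worked example of that equation (the numbers here are invented, not from the analysis below):

```python
# Hypothetical figures for one market -- illustration only.
our_sales = 1_250      # qualifying sales we completed
total_sales = 48_000   # estimated total sales in the market

market_share = our_sales / total_sales
print(f"{market_share:.2%}")  # -> 2.60%
```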

This analysis demonstrates combining multiple sources of data to externally validate the unknown total sales in the market space. It was completed for a company adjacent to the real estate market. There’s no complete dataset of all home sales everywhere, since some home sales never make it into a national or county database (cash purchases, for-sale-by-owner transactions, etc.). I used a combination of Zillow data and data from FRED to normalize home sales by Metropolitan Statistical Area (MSA). MSAs help standardize geographic areas, especially when sales can come from a major city and its surrounding towns. This gave us an inferred total sales number to compare our adjacent-industry impact against.

Much of the information has been anonymized, and I deliberately left some parts out to maintain that anonymity.

Data Analysis

Import Data

Publicly available housing data for each MSA from Zillow was mapped to a FRED table of county names per MSA, so both could be aligned with our sales data.

import pandas as pd
import json
import datetime as dt

# house inventory by MSA
df_zill_msa = pd.read_csv("data/zillow/home_sales_2022-2024-01-08.csv")
df_zill_msalong = pd.read_csv(
    "data/zillow/home_sales_long_2022-2024-01-08.csv")

df_zill_source = pd.read_csv(
    "https://files.zillowstatic.com/research/public_csvs/sales_count_now/Metro_sales_count_now_uc_sfrcondo_month.csv"
)

# Data from SALES_PLATFORM
# -------------
df_BRAND1_main = pd.read_csv("DATA_FILE.csv")
df_BRAND1_LOC2 = pd.read_csv("DATA_FILE.csv")
df_BRAND1_LOC3 = pd.read_csv("DATA_FILE.csv")
df_BRAND2_LOC3 = pd.read_csv("DATA_FILE.csv")
df_BRAND2_LOC4 = pd.read_csv("DATA_FILE.csv")
df_BRAND2_LOC4_1 = pd.read_csv("DATA_FILE.csv")
df_BRAND3_LOC4 = pd.read_csv("DATA_FILE.csv")
df_BRAND3_LOC5 = pd.read_csv("DATA_FILE.csv")
df_BRAND3_LOC6 = pd.read_csv("DATA_FILE.csv")
df_BRAND3_LOC7 = pd.read_csv("DATA_FILE.csv")
df_BRAND4 = pd.read_csv("DATA_FILE.csv")

# Sale Order Types
# ----------------------
with open("data/qual_insp_dict.json", "r") as f:
    qual_insp_dict = json.load(f)

# MSA County Mapping
# ------------------
with open("data/msa_county_mapping.json", "r") as f:
    msa_county_dict = json.load(f)
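The mapping file itself isn’t published here, but its assumed shape is an MSA name keyed to the list of counties that roll up to it (the values below are invented for illustration):

```python
# Assumed structure of msa_county_mapping.json -- example values only.
msa_county_dict_example = {
    "austin": ["Travis County", "Williamson County", "Hays County"],
    "denver": ["Denver County", "Arapahoe County", "Jefferson County"],
}

# Each MSA key maps to the counties whose sales roll up to it.
for msa, counties in msa_county_dict_example.items():
    print(msa, "->", len(counties), "counties")
```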

Cleaning Sales Data

brand_dict = {}
brand_dict["BRAND1"] = qual_insp_dict["BRAND1"]
brand_dict["BRAND3"] = (
    qual_insp_dict["BRAND3 (LOC5)"]
    + qual_insp_dict["BRAND3 (LOC8)"]
    + qual_insp_dict["LOC6"]
    + qual_insp_dict["LOC7"]
)
brand_dict["BRAND2"] = (
    qual_insp_dict["BRAND2 (LOC4)"]
    + qual_insp_dict["BRAND2 (LOC3)"]
    + qual_insp_dict["LOC4_1"]
)
brand_dict["BRAND4"] = qual_insp_dict["BRAND4"]

# combine the SALES_PLATFORM dataframes and add the SALES_PLATFORM name as a column
df_BRAND1_raw = (
    pd.concat([df_BRAND1_main, df_BRAND1_LOC2, df_BRAND1_LOC3],
              keys=["BRAND1_LOC1", "BRAND1_LOC2", "BRAND1_LOC3_1"])
    .reset_index(level=0)
    .rename(columns={"level_0": "SALES_PLATFORM"})
)

df_BRAND3_raw = (
    pd.concat(
        [df_BRAND3_LOC4, df_BRAND3_LOC5, df_BRAND3_LOC6, df_BRAND3_LOC7],
        keys=["BRAND3_LOC4", "BRAND3_LOC5", "BRAND3_LOC6", "BRAND3_LOC7"],
    )
    .reset_index(level=0)
    .rename(columns={"level_0": "SALES_PLATFORM"})
)

df_BRAND2_raw = (
    pd.concat([df_BRAND2_LOC3, df_BRAND2_LOC4, df_BRAND2_LOC4_1],
              keys=["BRAND2_LOC3", "BRAND2_LOC4", "BRAND2_LOC4_1"])
    .reset_index(level=0)
    .rename(columns={"level_0": "SALES_PLATFORM"})
)

df_BRAND4["SALES_PLATFORM"] = "BRAND4_LOC3"
df_BRAND4_raw = df_BRAND4

I then created a function to clean our sales data according to what we considered qualifying sales based on features like warranties, type of sale, and payment received.

# fn for filtering out non-qualifying sales
def get_qualifying_sales(df, qual_insp_dict, brand):
    start_row_count = df.shape[0]
    # Convert to datetime
    df["OrderCompletedDate"] = pd.to_datetime(df["OrderCompletedDate"])
    # Filter by order type, sale date, and paid sales
    df = df[df["OrderTypeID"].isin(qual_insp_dict[brand])]
    df = df[
        (df["OrderCompletedDate"].notnull())
        & (df["OrderCompletedDate"] >= pd.to_datetime("2018-01-01"))
        & (df["OrderCompletedDate"] < pd.to_datetime("2024-01-01"))
    ]
    df = df[df["OrderTotalFee"].notnull() & (df["OrderTotalFee"] > 0)]
    # Add year, quarter, month, state, county, and zipcode columns
    df["year"] = df["OrderCompletedDate"].dt.year
    df["year_quarter"] = (
        df["year"].astype(str) + "_Q" +
        df["OrderCompletedDate"].dt.quarter.astype(str)
    )
    df["month"] = df["OrderCompletedDate"].dt.month
    df["county"] = df["OrderCounty"].str.replace(
        " County", "").str.strip().str.lower()
    df["state"] = df["OrderState"].str.strip()
    df["zipcode"] = df["OrderPostal"].str.strip().str[:5]
    # keep select columns
    cols_keep = [
        "SALES_PLATFORM",
        "year",
        "year_quarter",
        "month",
        "state",
        "county",
        "zipcode",
        "OrderId",
    ]
    df = df[cols_keep]
    end_row_count = df.shape[0]
    rows_removed = start_row_count - end_row_count
    # print out the results
    print(f"{start_row_count} initial rows in {brand} sales dataframe.")
    print(f"{end_row_count} final rows in {brand} sales dataframe.")
    print(f"Removed {rows_removed} rows from {brand} sales dataframe.\n")
    return df

df_BRAND1 = get_qualifying_sales(df_BRAND1_raw, brand_dict, "BRAND1")
df_BRAND3 = get_qualifying_sales(df_BRAND3_raw, brand_dict, "BRAND3")
df_BRAND2 = get_qualifying_sales(df_BRAND2_raw, brand_dict, "BRAND2")
df_BRAND4 = get_qualifying_sales(df_BRAND4_raw, brand_dict, "BRAND4")

df_list = [df_BRAND1, df_BRAND3, df_BRAND2, df_BRAND4]

# combine the dataframes into a single dataframe
df = pd.concat(df_list, keys=["BRAND1", "BRAND3", "BRAND2", "BRAND4"])
df = df.reset_index(level=0).rename(columns={"level_0": "Brand"})

After we had confirmed the correct sub-sample of sales was complete, I combined them into a single dataframe, added the MSA column, and cleaned up the columns.

# Use the msa_county_dict to create a new column named "msa"
# Reverse the dictionary
reverse_dict = {}
for k, v in msa_county_dict.items():
    for i in v:
        i_low = i.lower()
        reverse_dict[i_low] = k

# Map the reversed dictionary to the 'county' column
df["msa"] = df["county"].map(reverse_dict)

df_orders = (
    df.groupby(
        [
            "Brand",
            "SALES_PLATFORM",
            "msa",
            "county",
            "year",
            "year_quarter",
            "month",
        ]
    )
    .agg({"OrderId": "count"})
    .reset_index(drop=False)
    .rename(columns={"OrderId": "order_count"})
    .sort_values(["year", "year_quarter", "month"])
)

# reorder columns based on list
cols = ["year", "year_quarter", "month", "Brand",
        "SALES_PLATFORM", "msa", "county", "order_count"]

df_orders = df_orders[cols]
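One caveat worth noting: counties missing from the reverse mapping come back as NaN from `.map()`, and groupby drops NaN keys by default, so those orders would silently vanish from the counts. A minimal sketch of the check on toy data (names invented):

```python
import pandas as pd

# Toy stand-ins for the real sales frame and reversed MSA mapping.
df_demo = pd.DataFrame({"county": ["travis", "denver", "unknownville"]})
reverse_demo = {"travis": "austin", "denver": "denver"}

df_demo["msa"] = df_demo["county"].map(reverse_demo)

# Surface any counties that didn't map to an MSA before aggregating.
unmapped = df_demo.loc[df_demo["msa"].isna(), "county"].unique()
print(sorted(unmapped))  # -> ['unknownville']
```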

Importing Zillow Data

The Zillow dataset included many MSAs we didn’t operate in, so the next step was cleaning out those unnecessary to the analysis. Finally, to merge the two datasets, I reformatted the MSA data and cleaned up the column names.

regionname_list = [
    # region names removed for website publication
]

# Filter for only the regions of interest
# .copy() avoids SettingWithCopyWarning on the column assignments below
df_zill = df_zill_source[df_zill_source["RegionName"].isin(regionname_list)].copy()
df_zill["msa"] = df_zill["RegionName"].str.split(",").str[0].str.lower()
df_zill["RegionState"] = df_zill["RegionName"].str.split(
    ",").str[1].str.strip()

# convert the dataframe from wide to long
df_zill_pivot = df_zill.melt(
    id_vars=["RegionName", "msa", "RegionState"],
    value_vars=df_zill.columns[df_zill.columns.str.startswith("20")],
    var_name="date",
    value_name="home_sales",
)

# convert the date column to datetime and extract year and month
df_zill_pivot["date"] = pd.to_datetime(df_zill_pivot["date"])
df_zill_pivot = df_zill_pivot[df_zill_pivot["date"] >=
                              pd.to_datetime("2018-01-01")].reset_index(drop=True)
df_zill_pivot["year"] = df_zill_pivot["date"].dt.year
df_zill_pivot["month"] = df_zill_pivot["date"].dt.month

# rearrange columns
zill_cols = [
    "date",
    "year",
    "month",
    "RegionName",
    "RegionState",
    "msa",
    "home_sales",
]


df_zill_pivot = df_zill_pivot[zill_cols]

Market Share Calculation

The market share calculation, once all the data is cleaned, is simple division if everything is set up properly. The hard part was getting the data to this point across brands, market locations, and data sources. The final portion of the analysis supplied our team with an accurate representation of qualifying sales across 11 different locations for 4 of our biggest brands.

# group df by year, month, and msa
df_msa = (
    df_orders.groupby(["msa", "year", "year_quarter", "month"])
    .agg({"order_count": "sum"})
    .reset_index(drop=False)
    .sort_values(["year", "year_quarter", "month"])
)

merged_df = pd.merge(
    df_msa,
    df_zill_pivot,
    how="outer",
    on=["year", "month", "msa"],
    validate="1:1",
)

merged_df["market_share"] = merged_df["order_count"] / merged_df["home_sales"]
merged_col_order = [
    "date",
    "year",
    "year_quarter",
    "month",
    "msa",
    "RegionState",
    "order_count",
    "home_sales",
    "market_share",
]
merged_df = merged_df[merged_col_order]
merged_df.head()

# export final data
merged_df.to_csv(
    f"output/market_share_launchpad_clean_{dt.date.today()}.csv", index=False
)
