BigQuery (ML) for Eco-Friendly Logistics Optimization

Ahmad Maulana Malik Fattah
12 min read · Oct 1, 2023
Photo by Marcin Jozwiak on Unsplash

On September 23, I had the opportunity to attend an enlightening workshop hosted by 1000StartUpDigital in collaboration with BliBli. The workshop’s central theme revolved around harnessing the power of big data for eco-friendly logistics strategies, drawing inspiration from BliBli’s impressive track record of data-driven decision-making.

During the workshop, one particularly intriguing session caught my attention — a challenge session where participants were tasked with solving cases related to optimizing product warehousing based on customer location. In this article, I am excited to share my solution to this problem, which I crafted using the capabilities of Google BigQuery, BigQuery ML, and Looker Studio.

Note: You can follow along without inserting your credit card information on Google Cloud Platform (GCP). The queries in this article will not exceed BigQuery's free usage limits, and Looker Studio is free to use.

Background

E-commerce has become an integral part of daily life. It’s undeniable that the process of shipping products from warehouses to customers is an essential aspect of this ecosystem. With multiple warehouses in play, it’s not uncommon for a product to be dispatched from a warehouse far from the customer, rather than the nearest one.

From a business perspective, this inefficiency adds to logistics costs. Moreover, from an environmental standpoint, the greater the distance between the warehouse and the customer, the larger the carbon footprint generated in the process.

Given this situation, as data people, let's tackle the challenge!

Sneak Peek into the Data

The dataset for this case can be accessed via Google Drive. It contains information about product shipments, including warehouse and customer locations, over a span of 2 years. Here is the description of the dataset.

Column description of the dataset

Working with this dataset, we have 2 primary objectives:

  1. Optimize warehouse selection: The first objective involves determining the ideal warehouse for each customer, operating under the assumption that all warehouses possess unlimited stock.
  2. Stock Recommendations: The second objective revolves around leveraging the optimized data to generate recommendations regarding the optimal stock levels each warehouse should maintain for every product in the upcoming month.

To kickstart our exploration, the initial step involves loading the data into a BigQuery table. Given that the dataset is stored on Google Drive, we can effortlessly import it using the Drive URI. Alternatively, if you prefer, you may opt for the “Upload” method. In either case, it is important to enable schema auto-detection and set the “Header rows to skip” option to 1.

Create a BigQuery table for the shipping data
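
If you prefer scripting the load instead of clicking through the console wizard, a minimal sketch using the LOAD DATA statement could look like the one below. Keep in mind the assumptions: LOAD DATA reads from Cloud Storage rather than Google Drive, so this presumes you have first copied the CSV to a bucket (the gs:// path is hypothetical), and the schema should be auto-detected when no column list is supplied.

LOAD DATA OVERWRITE `ammfat-data-engineering-101.playground.shipping_data`
FROM FILES (
format = 'CSV'
, uris = ['gs://your-bucket/shipping_data.csv'] -- hypothetical staging path in Cloud Storage
, skip_leading_rows = 1 -- equivalent to "Header rows to skip" = 1
);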

In this context, I have defined the table name as shipping_data, within the playground dataset, located in my ammfat-data-… GCP project. Here is a preview of what the data will look like within BigQuery.

SELECT *
FROM `ammfat-data-engineering-101.playground.shipping_data`
LIMIT 5
Preview of the data in BigQuery

Now, let’s delve a bit deeper into the dataset.

SELECT 
MIN(order_date) AS first_date
, MAX(order_date) AS last_date
, COUNT(DISTINCT DATE_TRUNC(order_date, MONTH)) AS total_month
, COUNT(*) AS total_shipment
, COUNT(DISTINCT warehouse_uuid) AS total_unique_warehouses
, COUNT(DISTINCT product_uuid) AS total_unique_products
FROM `ammfat-data-engineering-101.playground.shipping_data`
Understanding the shipping data

This simple query provides us with key insights:

  1. The dataset spans 24 months, covering the period from January 1, 2021, to December 31, 2022.
  2. Within this timeframe, there have been over 30,000 shipments.
  3. A total of 6 warehouses serve as the sources of delivery.
  4. Among all the shipments, we’ve identified 4 distinct products.

Let’s dive a bit deeper to determine which products are associated with each warehouse.

SELECT
warehouse_uuid
, STRING_AGG(DISTINCT product_uuid, ', ' ORDER BY product_uuid) AS products
FROM `ammfat-data-engineering-101.playground.shipping_data`
GROUP BY 1
ORDER BY 1
Unique product per warehouse

The results indicate that each warehouse stocks the same set of products.

With these initial insights in hand, we’re now prepared to construct queries that align with our primary objectives!

Time for the Optimization

Now, we proceed to optimize our logistics strategy. Our first step is to calculate the current, unoptimized shipping distance between the warehouses and customers. This information will serve as a baseline for our improvements.

CREATE OR REPLACE VIEW
`ammfat-data-engineering-101.playground.shipping_distance_unoptimized`
AS (
SELECT
user_uuid
, warehouse_uuid
, SQRT(
POW(destination_x - warehouse_x, 2) + POW(destination_y - warehouse_y, 2)
) AS distance
FROM `ammfat-data-engineering-101.playground.shipping_data`
);

SELECT *
FROM `ammfat-data-engineering-101.playground.shipping_distance_unoptimized`;

Let’s save this as a view, which we will later use for comparison before and after optimization.

Afterwards, we aim to determine the optimized distance — the nearest warehouse for each customer. This involves a multi-step process:

  1. Map each customer to every warehouse;
  2. Then, calculate the distance between each customer and warehouse;
  3. Assign a “rank” to each distance, ordered by proximity;
  4. Lastly, select only the distances with the top rank, representing the nearest warehouse.

The following query represents each of these steps.

CREATE OR REPLACE TABLE
`ammfat-data-engineering-101.playground.shipping_data_optimized`
AS (
WITH
cte_warehouse AS (
SELECT DISTINCT
warehouse_uuid
, warehouse_x
, warehouse_y
FROM
`ammfat-data-engineering-101.playground.shipping_data`
)
, cte_mapping AS (
SELECT
u.order_date
, u.user_uuid
, u.product_uuid
, w.warehouse_uuid
, w.warehouse_x
, w.warehouse_y
, u.destination_x
, u.destination_y
FROM
`ammfat-data-engineering-101.playground.shipping_data` u
CROSS JOIN
cte_warehouse w
)
, cte_distance AS (
SELECT
*
, SQRT(
POW(destination_x - warehouse_x, 2) + POW(destination_y - warehouse_y, 2)
) AS distance
FROM
cte_mapping
)
SELECT
*
, RANK() OVER(PARTITION BY user_uuid ORDER BY distance) AS distance_rank
FROM cte_distance
QUALIFY distance_rank = 1
);

SELECT *
FROM `ammfat-data-engineering-101.playground.shipping_data_optimized`;

In this process, we employ a CROSS JOIN to effectively map customers to warehouses in step #1. Additionally, we utilize QUALIFY to combine steps #3 and #4, ranking the warehouse-customer distances and filtering for the closest one — all within the same query stage.

If you're wondering why we select order_date and product_uuid in the cte_mapping stage, it's because we will need them when building our forecasting model. Keep following, ya!

Phew, that was quite a bit to process, wasn’t it? Well, for the final steps pertaining to this objective, let’s compare the distances before and after optimization.

SELECT 
un.user_uuid
, un.warehouse_uuid
, un.distance
, op.warehouse_uuid AS op_warehouse_uuid
, op.distance AS op_distance
FROM
`ammfat-data-engineering-101.playground.shipping_distance_unoptimized` un
JOIN
`ammfat-data-engineering-101.playground.shipping_data_optimized` op
USING (user_uuid)
ORDER BY 1
Compare distance on before-after optimization

We can observe changes in the warehouse assignments for customers after optimization. For instance, transitions from WH-C to WH-A, WH-E to WH-F, and more have occurred.
However, you may wonder why there is no change in row #1. This is because, for the customer in row #1, WH-A was already the nearest warehouse, so there was no need for alteration before or after optimization.
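
To put a rough number on the overall gain, a small aggregate comparison over the view and the optimized table can help. This is a hedged sketch; note that rank ties, if any exist, would add a few extra rows on the optimized side.

SELECT
'unoptimized' AS stage
, ROUND(AVG(distance), 2) AS avg_distance
, ROUND(SUM(distance), 2) AS total_distance
FROM `ammfat-data-engineering-101.playground.shipping_distance_unoptimized`
UNION ALL
SELECT
'optimized' AS stage
, ROUND(AVG(distance), 2) AS avg_distance
, ROUND(SUM(distance), 2) AS total_distance
FROM `ammfat-data-engineering-101.playground.shipping_data_optimized`
ORDER BY stage DESC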

Great! With this optimization, we've pinpointed the nearest warehouses for the customers.

Objective #1: Checked.

Build Forecasting Model

The next objective is to provide recommendations on the required stock for each warehouse in the upcoming month, which is January 2023. To achieve this, we’ll need to create a forecasting model based on the data available until December 2022.

Prepare the Data

To begin, let’s initiate a query to extract warehouse stock data from our optimized dataset. As we intend to construct a forecasting model using this data, we will store it as a table.

CREATE OR REPLACE TABLE
`ammfat-data-engineering-101.playground.warehouse_stock`
AS (
SELECT
order_date
, warehouse_uuid
, product_uuid
, COUNT(*) AS stock
FROM `ammfat-data-engineering-101.playground.shipping_data_optimized`
GROUP BY 1,2,3
ORDER BY 1,2,3
);

SELECT *
FROM `ammfat-data-engineering-101.playground.warehouse_stock`;
Build warehouse stock table

After executing the second query, let’s take a quick visual tour of the data. Here, we can explore the data with Looker Studio.

Explore with Looker Studio

Let’s build a simple visualization with the following elements:

  1. Time-series chart to visualize stock trends for each product.
    Dimension: order_date
    Breakdown dimension: product_uuid
    Metric: stock (SUM)
    Breakdown dimension sort: product_uuid (Ascending)
  2. Drop-down list to filter data by the warehouses.
    Control field: warehouse_uuid
    Order: Dimension (Ascending)
  3. Drop-down list to filter data by the products.
    Control field: product_uuid
    Order: Dimension (Ascending)
Sneak peek of warehouse stock trend

The above image illustrates the stock trend for product P-1 in warehouse WH-A. Take your time to explore and analyze the trends for other products and warehouses.

Now, the ML Part!

Okay, so far everything is going smoothly. Now, let's dive into the eye-catching machine learning part.

BigQuery ML offers a way to build an ML model by using SQL syntax (specifically GoogleSQL). Some of the built-in models available include linear regression, logistic regression, K-means clustering, PCA, and time series forecasting (the one we need).

Without further ado, here’s the magic.

CREATE OR REPLACE MODEL
`ammfat-data-engineering-101.playground.forecast_warehouse_stock`
OPTIONS (
MODEL_TYPE = 'ARIMA_PLUS'
, TIME_SERIES_TIMESTAMP_COL = 'order_date'
, TIME_SERIES_DATA_COL = 'stock'
, TIME_SERIES_ID_COL = ['warehouse_uuid', 'product_uuid']
, DATA_FREQUENCY = 'WEEKLY'
) AS
SELECT
DATE_TRUNC(order_date, WEEK) AS order_date
, warehouse_uuid
, product_uuid
, SUM(stock) AS stock
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
WHERE order_date < "2022-09-01"
GROUP BY 1,2,3

In the query above, we define a model called forecast_warehouse_stock. Since we aggregate the warehouse_stock data to a weekly grain in the SELECT statement, we set the DATA_FREQUENCY to align with this pattern. In the TIME_SERIES_ID_COL, we specify that the time series should be separated for each combination of warehouse and product. This results in what BigQuery calls ‘multi-series forecasting’. Think of it as building a separate forecasting model for each warehouse and product, but BigQuery streamlines the process using the TIME_SERIES_ID_COL parameter.

Note that we truncate the date to a weekly basis using DATE_TRUNC(…, WEEK). The reason is that, since we are predicting stock levels for the next month, we need to capture sub-monthly stock patterns. Using daily data could be an option, but it would create a much finer-grained dataset with many time points, which would increase model training time. Hence, we opt for a weekly basis as a good compromise between capturing patterns and managing resources.

Another point to be aware of is that, even though we have data spanning from January 2021 to December 2022, we select only the stock data before September 2022 in the WHERE clause. This is called a ‘train-test split’, which means we ‘train’ the model using data from January 2021 to August 2022 and reserve the rest for ‘testing’ our model’s performance.
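
As a quick sanity check on the split (a hedged helper, not part of the original workshop flow), we can summarize how much data falls on each side:

SELECT
IF(order_date < "2022-09-01", 'train', 'test') AS split
, MIN(order_date) AS first_date
, MAX(order_date) AS last_date
, COUNT(DISTINCT DATE_TRUNC(order_date, WEEK)) AS weekly_points
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
GROUP BY 1
ORDER BY 1 DESC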

It may take a minute to train the model. Once it’s done, click the “Go to model” button to see the model’s performance. To assess the model, we can go to the “Evaluation” tab or use a query.

SELECT *
FROM ML.ARIMA_EVALUATE(
MODEL `ammfat-data-engineering-101.playground.forecast_warehouse_stock`
)
ARIMA evaluation of the model

With ARIMA_EVALUATE, we obtain evaluation metrics tailored for time series cases, such as the Akaike Information Criterion (AIC).
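
If you are curious about the fitted terms themselves, BigQuery ML also provides ML.ARIMA_COEFFICIENTS, which returns the AR and MA coefficients (plus the intercept/drift term) for each warehouse-product series. A minimal sketch:

SELECT *
FROM ML.ARIMA_COEFFICIENTS(
MODEL `ammfat-data-engineering-101.playground.forecast_warehouse_stock`
)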

Alternatively, we can use regression metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root MSE (RMSE). These metrics can be calculated using our ‘test’ data, starting from September 2022.

SELECT *
FROM ML.EVALUATE(
MODEL `ammfat-data-engineering-101.playground.forecast_warehouse_stock`
, (
SELECT
DATE_TRUNC(order_date, WEEK) AS order_date
, warehouse_uuid
, product_uuid
, SUM(stock) AS stock
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
WHERE order_date >= "2022-09-01"
GROUP BY 1,2,3
)
)
Evaluate the model with regression metrics

One of the metrics shown is MAE, which measures the difference (error) between each predicted value and actual value, turns it into a positive number (absolute), and then averages those errors (mean). The same applies to the MSE metric, except it squares the differences instead of taking their absolute values. Meanwhile, RMSE takes the square root of MSE to provide a metric that is on a similar scale to the actual data. Lower values of these metrics indicate better model performance.

For example, in warehouse WH-A for product P-1, an MAE score of 6.xx means that the model’s predictions for stock levels are off by approximately 6 units. Similarly, in warehouse WH-B for product P-2, the model’s predictions for stock levels are off by approximately 14 units.
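
If you would like to reproduce these numbers by hand, the hedged sketch below computes MAE and RMSE per series by joining the model's forecasts to the held-out actuals. The horizon of 20 weeks is an assumption chosen to cover September through December 2022, and the join relies on the forecast timestamps aligning with the weekly-truncated dates.

SELECT
f.warehouse_uuid
, f.product_uuid
, AVG(ABS(a.stock - f.forecast_value)) AS mae
, SQRT(AVG(POW(a.stock - f.forecast_value, 2))) AS rmse
FROM ML.FORECAST(
MODEL `ammfat-data-engineering-101.playground.forecast_warehouse_stock`
, STRUCT(20 AS horizon, 0.8 AS confidence_level)
) f
JOIN (
SELECT
DATE_TRUNC(order_date, WEEK) AS order_date
, warehouse_uuid
, product_uuid
, SUM(stock) AS stock
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
WHERE order_date >= "2022-09-01"
GROUP BY 1,2,3
) a
ON DATE(f.forecast_timestamp) = a.order_date
AND f.warehouse_uuid = a.warehouse_uuid
AND f.product_uuid = a.product_uuid
GROUP BY 1,2
ORDER BY 1,2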

Moving ahead with the ML part, let's put our model to work and discover the optimal stock levels for January 2023. However, it's also interesting to see how the model forecasts stock levels for other future months. So, let's expand our forecasting horizon.

CREATE OR REPLACE TABLE
`ammfat-data-engineering-101.playground.warehouse_stock_forecast`
AS (
SELECT
DATE(forecast_timestamp) AS order_date
, warehouse_uuid
, product_uuid
, NULL AS stock_history
, forecast_value AS stock_forecast
, prediction_interval_lower_bound
, prediction_interval_upper_bound
FROM
ML.FORECAST(
MODEL `ammfat-data-engineering-101.playground.forecast_warehouse_stock`
, STRUCT(32 AS horizon, 0.8 AS confidence_level) -- 32 weeks = 4 weeks * 8 months
)
WHERE DATE(forecast_timestamp) >= "2022-09-01"
ORDER BY 1,2,3
);

CREATE OR REPLACE VIEW
`ammfat-data-engineering-101.playground.warehouse_stock_forecast_view`
AS (
SELECT
DATE_TRUNC(order_date, WEEK) AS order_date
, warehouse_uuid
, product_uuid
, SUM(stock) AS stock_history
, NULL AS stock_forecast
, NULL AS prediction_interval_lower_bound
, NULL AS prediction_interval_upper_bound
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
GROUP BY 1,2,3
UNION ALL
SELECT
order_date
, warehouse_uuid
, product_uuid
, stock_history
, stock_forecast
, prediction_interval_lower_bound
, prediction_interval_upper_bound
FROM `ammfat-data-engineering-101.playground.warehouse_stock_forecast`
);

SELECT *
FROM `ammfat-data-engineering-101.playground.warehouse_stock_forecast_view`;
Build a view of the forecast stock data

Excellent!

Next, let’s visualize the data. To do this, instead of using “Explore with Looker Studio” within BigQuery, we’ll use Looker Studio directly.

Present the Insights

Go to the Looker Studio dashboard, and ensure that the account is the same as the one used in BigQuery. Next, create a new dashboard by choosing “Blank Report”.

Choose BigQuery as the data source and add the warehouse_stock_forecast_view that we just created. When prompted, click "Add to report."

Now, setup the dashboard with the following elements:

  1. A text box containing the title of the dashboard.
  2. Drop-down list to filter data by the warehouses.
    Control field: warehouse_uuid
    Metric: stock_history (SUM)
    Order: Dimension (Ascending)
  3. Drop-down list to filter data by the products.
    Control field: product_uuid
    Metric: stock_history (SUM)
    Order: Dimension (Ascending)
  4. Time-series chart to visualize stock trends in both history and forecast.
    Setup Tab:
    Dimension: order_date (Year Month)
    Metric:
    - stock_history (SUM)
    - stock_forecast (SUM)
    - prediction_interval_upper_bound (SUM)
    - prediction_interval_lower_bound (SUM)
    Style Tab:
    General > Missing Data: Line Breaks
Dashboard for stock trends and forecasting

Now, we have a dashboard that displays the stock trend forecast for products, warehouses, and time periods. For example, examine the stock trend forecast for product P-1 in warehouse WH-A. In this case, we will notice a similar pattern between stock_history and stock_forecast from September 2022 to December 2022.

Keep in mind that these patterns may vary for different combinations of warehouses and products. For instance, warehouse WH-F may have a much higher forecasted stock from September 2022 to December 2022 compared to the actual data.

Differences in the trends of actual and forecasted value

Note that the ML model is trained on a weekly basis. Therefore, when visualizing the forecast results for testing data on a monthly basis (September 2022 to December 2022), we may observe a gap between the actual and predicted values that could be higher than the MAE score produced in ML.EVALUATE().

And there it is!

With this dashboard, we can provide stakeholders with valuable insights into their warehouse stock. For instance, in January 2023, we may recommend that warehouse WH-A prepares around 80 units of stock for product P-1.

Stock recommendation for January 2023
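
If you prefer to pull the same recommendation as a plain table rather than reading it off the chart, a short query over the forecast table gives the January 2023 totals per warehouse and product. This is just a sketch; rounding the upper bound could serve as a more conservative safety-stock figure.

SELECT
warehouse_uuid
, product_uuid
, ROUND(SUM(stock_forecast)) AS recommended_stock
, ROUND(SUM(prediction_interval_upper_bound)) AS recommended_stock_upper
FROM `ammfat-data-engineering-101.playground.warehouse_stock_forecast`
WHERE DATE_TRUNC(order_date, MONTH) = "2023-01-01"
GROUP BY 1,2
ORDER BY 1,2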

Well done!

Objective #2: Checked.

In this exploration, we harnessed the power of data-driven solutions for a more sustainable logistics strategy. Through a series of practical steps, we demonstrated how to load and analyze shipping data using Google BigQuery, BigQuery ML, and Looker Studio. Along the way, we tackled the challenge of finding the optimal warehouse for each customer, a step that will reduce business costs while ensuring a smaller carbon footprint in product delivery. Furthermore, we also forecasted stock levels for the future, taking into account the optimal warehouse locations for each product.

As a future development, we could fine-tune the forecasting model to enhance its accuracy, especially when handling monthly data aggregations. This refinement could lead to even more precise predictions and better-informed stock management decisions. Moreover, by adding the shipping_data table as a data source, the dashboard could also present the top customers and highlight overall product order percentages, providing a more comprehensive view of the logistics operations.
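
As a rough illustration of what such fine-tuning could look like, the sketch below retrains the model with a few extra ARIMA_PLUS options. None of these values come from the workshop; the holiday region and option choices are assumptions to experiment with, not a recommended configuration.

CREATE OR REPLACE MODEL
`ammfat-data-engineering-101.playground.forecast_warehouse_stock_tuned`
OPTIONS (
MODEL_TYPE = 'ARIMA_PLUS'
, TIME_SERIES_TIMESTAMP_COL = 'order_date'
, TIME_SERIES_DATA_COL = 'stock'
, TIME_SERIES_ID_COL = ['warehouse_uuid', 'product_uuid']
, DATA_FREQUENCY = 'WEEKLY'
, HOLIDAY_REGION = 'ID' -- assumption: account for Indonesian holidays
, CLEAN_SPIKES_AND_DIPS = TRUE -- smooth anomalies before fitting
, ADJUST_STEP_CHANGES = TRUE -- handle sudden level shifts
, HORIZON = 32 -- match the forecasting window used earlier
) AS
SELECT
DATE_TRUNC(order_date, WEEK) AS order_date
, warehouse_uuid
, product_uuid
, SUM(stock) AS stock
FROM `ammfat-data-engineering-101.playground.warehouse_stock`
WHERE order_date < "2022-09-01"
GROUP BY 1,2,3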

I hope you enjoyed following this article. Let me know what ideas come to your mind about providing data solutions in sustainable business operations!

Further Reading

  • Single time-series forecasting from Google Analytics data [GCP Docs]
  • End-to-end user journey for each model [GCP Docs]
  • How to Choose a Forecasting Model [Gosia Komor]
