Apr 14, 2020
Often businesses have lots of data that just sits there. It may be a history of orders, inventory movements or bank transactions. They say data is king, but that's only half of the truth: actionable insight is what makes the difference between a data lake and a data swamp.
To get that kind of insight, we need tools to work with the data. In this blog post we are going to create a proof of concept using Amazon Forecast. Our goal is to figure out whether even relatively small amounts of historical data can be used to make useful forecasts or draw conclusions.
Amazon Forecast is a fully managed service that uses machine learning to deliver highly accurate forecasts.
Amazon Forecast supports the following dataset domains:
We are going to check the Amazon Forecast using two types of datasets:
Let's imagine we need to be able to forecast air quality in Vilnius, Lithuania’s capital. We are going to use datasets from the Open AQ Platform API, which provides individual measurements for the last 90 days across different countries, cities and locations. First, we used a dataset for the period between January 9th 2020 and March 27th 2020, and compared the 2-week forecast with actual values. After that, we used a dataset for the period between January 9th 2020 and April 8th 2020 to create a forecast for the next two weeks. If it's April 2020 now, the outcome of this experiment is still surrounded by the mystery of tomorrow. If you are reading this blog post in May 2020 or later, you can confirm that predicting the future before it arrives is pretty challenging.
To import data into Amazon Forecast, we upload it to an S3 bucket in CSV format and start creating our dataset. Our air quality data contains more than 3,300 records over the given period. Then we choose the “Forecasting domain”.
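Before uploading, the raw measurements have to be flattened into the three-column CSV layout (timestamp, item_id, target_value) that the CUSTOM domain's target time series expects. A minimal sketch, assuming hypothetical Open AQ-style records (the field names and values here are illustrative, not the real Vilnius data):

```python
import csv
import io

# Hypothetical Open AQ-style measurements: one record per parameter reading.
measurements = [
    {"date": "2020-01-09", "parameter": "pm10", "value": 21.4},
    {"date": "2020-01-09", "parameter": "so2",  "value": 3.2},
    {"date": "2020-01-10", "parameter": "pm10", "value": 19.8},
]

def to_forecast_csv(records):
    """Flatten measurements into the timestamp,item_id,target_value
    layout required by the CUSTOM domain's target time series."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for r in records:
        writer.writerow([r["date"], r["parameter"], r["value"]])
    return buf.getvalue()

print(to_forecast_csv(measurements))
```

Each air quality parameter (pm10, so2, co) becomes its own `item_id`, so one CSV carries several time series at once.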
For each dataset group that you create, you associate a dataset domain and a dataset type. A dataset domain defines a forecasting use case.
For our needs, we will use the CUSTOM domain because none of the other domains fit our use case. Then we create the dataset and an import job to load data from the S3 bucket; for “Frequency of your data” we select day, then set the data schema and timestamp format. Importing our 3,300 records took 15 minutes; depending on the amount of data, it may take longer.
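The console steps above can also be expressed as API request payloads. The sketch below builds the parameter dictionaries you would pass to boto3's `create_dataset` and `create_dataset_import_job` calls; every name, ARN and bucket path is a placeholder, not a real resource:

```python
# Payload for forecast_client.create_dataset(**dataset_params).
dataset_params = {
    "DatasetName": "vilnius_air_quality",
    "Domain": "CUSTOM",
    "DatasetType": "TARGET_TIME_SERIES",
    "DataFrequency": "D",  # "Frequency of your data": daily
    "Schema": {
        "Attributes": [
            {"AttributeName": "timestamp",    "AttributeType": "timestamp"},
            {"AttributeName": "item_id",      "AttributeType": "string"},
            {"AttributeName": "target_value", "AttributeType": "float"},
        ]
    },
}

# Payload for forecast_client.create_dataset_import_job(**import_params).
import_params = {
    "DatasetImportJobName": "air_quality_import",
    # Placeholder ARNs and S3 path:
    "DatasetArn": "arn:aws:forecast:eu-west-1:123456789012:dataset/vilnius_air_quality",
    "DataSource": {"S3Config": {
        "Path": "s3://example-bucket/air-quality.csv",
        "RoleArn": "arn:aws:iam::123456789012:role/ForecastS3Access",
    }},
    "TimestampFormat": "yyyy-MM-dd",
}
```

The import role must grant Amazon Forecast read access to the bucket, which the console sets up for you but the API does not.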
After importing our data, we train a “Predictor” which will make forecasts for us.
We set the “Forecast horizon” to 14 because we want forecasts for the next 14 days. We can choose one of the five algorithms manually or select the AutoML option, which tells Amazon Forecast to evaluate all algorithms and pick the best one for your dataset, though training the “Predictor” then takes longer. As we want Amazon Forecast to choose the right algorithm for our dataset, we select AutoML.
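These predictor settings map directly onto the API. A minimal sketch of the payload for boto3's `create_predictor` call, with a placeholder dataset group ARN:

```python
# Payload for forecast_client.create_predictor(**predictor_params).
predictor_params = {
    "PredictorName": "air_quality_predictor",
    "ForecastHorizon": 14,   # forecast the next 14 days
    "PerformAutoML": True,   # let Forecast evaluate every algorithm
    "InputDataConfig": {
        # Placeholder ARN:
        "DatasetGroupArn": "arn:aws:forecast:eu-west-1:123456789012:dataset-group/vilnius_air_quality",
    },
    "FeaturizationConfig": {"ForecastFrequency": "D"},
}
```

To pick an algorithm manually instead, you would drop `PerformAutoML` and pass the algorithm's ARN via an `AlgorithmArn` key.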
After training the “Predictor”, we can see that the AutoML feature has chosen the NPTS algorithm for us. To create forecasts we select the Predictor, a name, and the quantiles, which default to .10, .50 and .90. These are the quantiles at which probabilistic forecasts are generated; accepted values are .01 to .99 (in increments of .01) and 'mean'.
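The forecast-creation step looks like this as an API payload, a sketch for boto3's `create_forecast` call with the default quantiles and a placeholder predictor ARN:

```python
# Payload for forecast_client.create_forecast(**forecast_params);
# individual series are then read back with the separate
# "forecastquery" client's query_forecast call.
forecast_params = {
    "ForecastName": "air_quality_14d",
    # Placeholder ARN:
    "PredictorArn": "arn:aws:forecast:eu-west-1:123456789012:predictor/air_quality_predictor",
    "ForecastTypes": ["0.10", "0.50", "0.90"],
}
```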
Comparing the period from March 26, 2020 to April 8, 2020 against real data, the .90 quantile suits future predictions best. Even though .90 is the most precise, we see huge differences in certain measurements, while others are almost 100% exact. Select the best algorithm for your solution and set the quantiles that suit you best. By calculating prediction quantiles, the model shows how much uncertainty is associated with each forecast. You can add up to 5 quantile values, but do not forget about pricing.
We’ve also compared results across the other algorithms at quantile .90, because it best suits our forecasting type. In this example, we compare forecasts with real results by day and compute the average percent difference. The table covers 3 parameters (co, pm10 and so2).
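The "average percent difference" used in the comparison can be computed as a mean absolute percent difference between the p90 forecast and the observed daily values. A small sketch with illustrative numbers (not the real Vilnius measurements):

```python
# Mean absolute percent difference between forecast and actuals.
def avg_percent_diff(actual, forecast):
    diffs = [abs(a - f) / a * 100 for a, f in zip(actual, forecast)]
    return sum(diffs) / len(diffs)

# Illustrative daily pm10 values, not real data.
actual_pm10   = [20.0, 25.0, 18.0]
forecast_pm10 = [22.0, 24.0, 27.0]

print(round(avg_percent_diff(actual_pm10, forecast_pm10), 1))  # → 21.3
```

One day being far off (18 vs 27, a 50% miss) dominates the average, which is exactly the "huge differences in certain measurements" effect described above.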
| Algorithm | Data Point | Percent Variance |
| --- | --- | --- |
| Any algorithm | SO2 | > 100% |
* SO2 is formed when fuel containing sulfur, such as coal and oil, is burned, creating air pollution. Because of the current world situation there is less oil being burned, while our dataset includes the period when the SO2 parameter was very high (January 2020 to March 15, 2020).
From the table above, we can conclude that:
A cash flow forecast is a projection of how much money a business expects to receive and spend over a given timeframe. It helps with planning not only the amounts but also the dates of incomes and expenses. With this information, you are in a much better position to plan, for instance, when to start producing new goods.
Cash flow forecast is one of the main tools for businesses. Understanding the cash flow is essential for making strategic business decisions.
In this example we want to forecast the cash flow of a bank account. We use the period from 01-01-2019 to 31-12-2019 to create a dataset and forecast the first 3 months of 2020.
Before implementing that, we check how well it works: we create a dataset for the period from 01-01-2019 to 30-09-2019 and compare the forecast with real data for 01-10-2019 to 31-12-2019.
The METRICS domain is the most suitable for cash flow. Data importing took 10 minutes, less than the air quality import, because our dataset consists of only 700 transactions.
We set the “Forecast horizon” to 90 because we want forecasts for the next 90 days. We weren’t able to use the DeepAR+ algorithm, which supports hyperparameter optimization (HPO), or the Non-Parametric Time Series (NPTS) algorithm to train a “Predictor” on our dataset.
Interestingly, AutoML chose NPTS, which we couldn’t select manually. Training each model took 20-25 minutes; AutoML took about 30 minutes.
We tried all the existing algorithms, and the best option for us was AutoML. The Prophet, ETS and ARIMA algorithms were not suitable for us and their forecasts were not accurate; the results are in the table below.
For a p-number forecast, the true value is expected to be lower than the predicted value number% of the time. For example, p50 can be useful when you have middling demand for particular goods, while p90 is better for critical goods that must be sold anyway or for orders of large amounts of commodities. If you are still unsure what a quantile is and which value to choose for your forecast, please read the Amazon whitepaper.
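The quantile promise can be checked numerically: across many days, the actual value should land below the p90 prediction about 90% of the time. A toy illustration with synthetic numbers:

```python
# Synthetic actuals and a flat p90 forecast line.
actuals = [10, 12, 9, 15, 11, 30, 8, 13, 10, 12]
p90     = [14, 14, 14, 14, 14, 14, 14, 14, 14, 14]

# Fraction of days the actual value stayed at or under the p90 line.
covered = sum(a <= p for a, p in zip(actuals, p90)) / len(actuals)
print(covered)  # → 0.8
```

Here only 80% of the days fall under the p90 line, so this hypothetical forecast would be slightly too low for a p90; a well-calibrated one would cover about 90%.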
Algorithm testing conclusion:
Diagram for one of the metrics for the period from Oct-02-2019 to Oct-06-2019.
Although the AutoML feature (NPTS algorithm) gave the best results compared to the other algorithms, we are not going to make future predictions using the current dataset, because the accuracy of this forecast makes it completely useless.
Forecasting accuracy depends on the data you import into the service: the more random the data and the smaller the dataset, the worse the forecast can be. We’ve already seen this in this article: in the air quality example we had little randomness and 3,300 records, so accuracy was high, whereas in cash flow forecasting we had 700 transactions and numerous irregular operations, so we received numerous divergences.
You can view all API calls on your account, including calls to Amazon Forecast, using the CloudTrail service, which is enabled by default when you create an account. When activity occurs in the service, a record is created in CloudTrail's Event history. The following example shows CloudTrail log entries. Detailed event information is also available, such as userAgent, sourceIPAddress, sessionContext, etc.
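An Event history query narrowed to Amazon Forecast calls can be sketched as the payload below, which you would pass to boto3's CloudTrail `lookup_events` call; the time window is a placeholder:

```python
from datetime import datetime

# Payload for cloudtrail_client.lookup_events(**lookup_params):
# filter Event history down to calls made to Amazon Forecast.
lookup_params = {
    "LookupAttributes": [
        {"AttributeKey": "EventSource",
         "AttributeValue": "forecast.amazonaws.com"},
    ],
    "StartTime": datetime(2020, 4, 1),   # placeholder window
    "EndTime": datetime(2020, 4, 14),
}
```

Each returned event carries the detailed fields mentioned above (userAgent, sourceIPAddress, sessionContext) inside its `CloudTrailEvent` JSON.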
There are DataSet, Predictor, Import Job and other metrics that can be viewed using CloudWatch. The following example shows the TimeSeriesForecastsGenerated metric, the number of unique time series forecasts generated for each quantile across all predictors in the account, billed per 1,000 forecasts.
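Pulling that metric programmatically can be sketched as a `get_metric_statistics` payload for boto3's CloudWatch client. The namespace and time window here are assumptions, so check them against your own CloudWatch console:

```python
from datetime import datetime

# Payload for cloudwatch_client.get_metric_statistics(**metric_params).
# Namespace is an assumption; verify it in your CloudWatch console.
metric_params = {
    "Namespace": "AWS/Forecast",
    "MetricName": "TimeSeriesForecastsGenerated",
    "StartTime": datetime(2020, 4, 1),   # placeholder window
    "EndTime": datetime(2020, 4, 14),
    "Period": 86400,                     # one data point per day
    "Statistics": ["Sum"],
}
```

Summing the daily data points gives the count that feeds the per-1,000 billing described above.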
Metrics can also be shown as numeric values; the following metric is the time taken for training, inference, and metrics for a specific predictor. From here we can compare manual algorithm selection with AutoML by the time taken for the air quality and cash flow predictions.
In this article, we've shown how to build your own forecasts using the Amazon Forecast service without any knowledge of machine learning, actuarial science or probability theory. You can select the algorithms and forecast types that suit you best. Keep in mind that the more data you have in your dataset, the more accurate the forecast you receive; that's why we got inaccurate cash flow predictions for our bank account.
Cash flow forecasting was not accurate because we had insufficient data in our dataset and many transactions were unpredictable. Air quality forecasting returned great results on some metrics, while results on others leave much to be desired. It's very important to remember that a forecast is nothing more than an educated guess; you should never rely on it completely when making strategic decisions.