Hands on forecasting the time series with limited cycles

The ultimate blame-shifting move

Forecasting a time series that does not yet cover a couple of complete cycles is tricky, because we cannot confidently infer trend and seasonality from so little data. The situation is quite common, though, especially when the thing being forecast is at an early stage and has only just begun generating data. For example, a shop has been in business for only 14 months, and your job is to predict the revenue of the 15th month. So the question pops up: how can we include trend and seasonality in our model?
We believe the key is to blend qualitative and quantitative approaches, using expert knowledge together with historical data to pin down the trend for a specific month. With this in mind, we tried out a few methods to deal with the issue.

Semi-automatically learn the seasonality: Rolling and Copying

The idea is to move the existing data forward into the future: we copy a whole year of data onto the following year, which has not happened yet. We can also multiply the copied values by a manually chosen factor to adjust for the trend. By doing this, we can include the real month number as a feature without impairing the performance of the model.
However, this method can be problematic. Take revenue forecasting: if we roll a shop's revenue forward, the result may be far from the truth, because we are simply assuming that the trend scales linearly from year to year, while real revenue may not follow that assumption. Moreover, the further ahead we predict, the riskier it gets.
Another problem is that several factors are at play. Some shops show a different pattern in their early stage; if we apply rolling and copying to every shop, we just replicate that initial data, which may be totally wrong. We could remove the initial stage by setting a threshold, but how do we find that threshold? What's more, such a pattern may keep affecting the revenue later on. So, currently, we have no fix for this problem.
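
The rolling-and-copying step can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function name roll_and_copy, the (date, value) layout, and the factor of 1.1 are all hypothetical, and the factor is picked by hand rather than learned.

```python
from datetime import date


def roll_and_copy(history, horizon, factor=1.0):
    """Extend a monthly series by copying the value from 12 months earlier.

    history: list of (date, value) pairs, one per month, oldest first.
    horizon: number of future months to generate.
    factor:  hand-picked year-over-year multiplier (an assumption).
    """
    if len(history) < 12:
        raise ValueError("need at least one full year of data")
    series = list(history)
    for _ in range(horizon):
        last_date = series[-1][0]
        # Advance one calendar month, rolling over into the next year.
        year = last_date.year + last_date.month // 12
        month = last_date.month % 12 + 1
        # The same calendar month one year earlier sits 12 positions back.
        # Beyond 12 months ahead this picks up an already copied value,
        # so the factor compounds.
        value_last_year = series[-12][1]
        series.append((date(year, month, 1), value_last_year * factor))
    return series[len(history):]


# Hypothetical shop with 14 months of revenue, Jan 2023 to Feb 2024.
history = [(date(2023 + i // 12, i % 12 + 1, 1), 100.0 + i) for i in range(14)]
future = roll_and_copy(history, horizon=2, factor=1.1)
```

Here the first generated month is March 2024, built from March 2023 times the factor. The single multiplier is exactly the linearity assumption criticised above, so the sketch inherits that weakness.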

Manually pick the ratio of the next month

Instead of learning the seasonality, we can reduce the complexity of the problem: rather than guessing the whole trend, we make one careful decision about next month's revenue. This is more reasonable because the expert looks at only one month at a time. More importantly, the next month is highly correlated with the current month, and we have plenty of information to support a more robust guess.

So, how do we bring the ratio into the model?

The most straightforward answer is to multiply the prediction by the factor directly. In most cases, however, the values are not uniformly distributed, so a single multiplier does not suit all of them. We are therefore better off using quantiles to express the trend. More specifically, we use quantile regression to find the quantile that best predicts the current month, and then the expert manually tunes that quantile. Take the revenue case: to predict January revenue, we use December as the validation set to train a regressor and choose a quantile. The expert may know that December revenue is much lower than November's, so this first quantile may be low, while January revenue usually skyrockets because of the coming Spring Festival. So if we raise the quantile, we are more likely to get close to the truth. Compared with linear multiplication, this method is much safer because it takes the distribution into account.
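
To make the quantile idea concrete, here is a minimal sketch. It swaps the quantile regressor for something much simpler: each shop's forecast is just an empirical quantile of its own history, with no features. The function names, the data, and the expert's shift of 0.5 are all assumptions made for illustration.

```python
def empirical_quantile(values, q):
    """Linearly interpolated q-quantile (0 <= q <= 1) of a non-empty list."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def fit_quantile_level(histories, actuals, grid):
    """Return the level in grid with the lowest total absolute error
    against the validation-month actuals."""
    def total_error(q):
        return sum(abs(empirical_quantile(h, q) - a)
                   for h, a in zip(histories, actuals))
    return min(grid, key=total_error)


def expert_adjusted_level(level, shift):
    """Apply the expert's shift and keep the result inside [0, 1]."""
    return max(0.0, min(1.0, level + shift))


# Hypothetical data: 11 months (Jan to Nov) of revenue for two shops.
histories = [
    [100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200],
    [50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150],
]
december_actuals = [120, 70]  # validation month, known to be weak

grid = [i / 10 for i in range(1, 10)]
dec_level = fit_quantile_level(histories, december_actuals, grid)
# The expert expects a strong January, so the level is raised by hand.
jan_level = expert_adjusted_level(dec_level, shift=0.5)
jan_forecast = [empirical_quantile(h, jan_level) for h in histories]
```

Because the expert moves a quantile level rather than a multiplier, the forecast can never leave the range each shop has actually produced, which is one way to read the claim that this approach is safer than linear multiplication.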

The next question: how do we quantify the ratio?

Usually, we can refer to the historical data. We take the historical ratio, that is, the median January revenue divided by the median December revenue, as our anchor, and then adjust it slightly according to the expert's personal experience. The underlying intuition is that the December median shifts by this ratio, which is mainly controlled by the joint effect of seasonality and the market trend.
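
The anchor ratio described here is a one-line computation. The sketch below assumes revenue is stored per shop as a dictionary keyed by month label; the function name, the layout, and the numbers are hypothetical.

```python
from statistics import median


def anchor_ratio(revenue_by_shop, month_from, month_to):
    """Median revenue in month_to divided by median revenue in month_from,
    using only shops that have data for both months."""
    shops = [r for r in revenue_by_shop.values()
             if month_from in r and month_to in r]
    if not shops:
        raise ValueError("no shop has data for both months")
    return (median(r[month_to] for r in shops)
            / median(r[month_from] for r in shops))


# Hypothetical revenue for last December and January.
revenue_by_shop = {
    "shop_a": {"Dec": 100, "Jan": 150},
    "shop_b": {"Dec": 200, "Jan": 300},
    "shop_c": {"Dec": 300, "Jan": 390},
    "shop_d": {"Dec": 250},  # opened too late, no January on record
}
anchor = anchor_ratio(revenue_by_shop, "Dec", "Jan")  # 300 / 200
```

The expert would then nudge this anchor up or down before using it. How large that nudge should be is a judgement call that the code cannot supply.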

Is this method reasonable?

It's hard to say. Looking closely, it is hard to prove that the method is entirely right, because we cannot adjust the quantile for every shop based only on last month's median. Some shops' revenue swings violently, while others' barely moves.

Can we improve it?

Although adjusting the quantile makes more sense than linear multiplication, we can go a step further and work at a finer granularity: cluster the shops by their features (comparing the result with a clustering on revenue alone, which can be treated as the ground truth), then calculate the ratio for each cluster. We can even consider building different models for different clusters, such as high-income, medium-income and low-income shops. We believe these methods are more reasonable, and they also reduce the risk involved in picking the ratio.
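
A per-cluster ratio could look like the sketch below. To stay short it does not run a clustering algorithm on shop features; it simply splits shops into three income tiers by their revenue in one month, which is a stand-in for the clustering step. All names and numbers are hypothetical.

```python
from statistics import median


def income_tiers(revenue_by_shop, month):
    """Split shops into low / medium / high tiers by revenue in month."""
    ranked = sorted(revenue_by_shop, key=lambda s: revenue_by_shop[s][month])
    n = len(ranked)
    return {
        "low": ranked[: n // 3],
        "medium": ranked[n // 3: 2 * n // 3],
        "high": ranked[2 * n // 3:],
    }


def ratio_per_tier(revenue_by_shop, tiers, month_from, month_to):
    """Median month_to revenue over median month_from revenue, per tier."""
    ratios = {}
    for name, shops in tiers.items():
        if not shops:
            continue  # skip empty tiers when there are very few shops
        ratios[name] = (
            median(revenue_by_shop[s][month_to] for s in shops)
            / median(revenue_by_shop[s][month_from] for s in shops)
        )
    return ratios


# Hypothetical revenue for six shops.
revenue_by_shop = {
    "a": {"Dec": 10, "Jan": 20},
    "b": {"Dec": 20, "Jan": 40},
    "c": {"Dec": 100, "Jan": 150},
    "d": {"Dec": 120, "Jan": 180},
    "e": {"Dec": 1000, "Jan": 1100},
    "f": {"Dec": 1200, "Jan": 1320},
}
tiers = income_tiers(revenue_by_shop, "Dec")
ratios = ratio_per_tier(revenue_by_shop, tiers, "Dec", "Jan")
```

In this toy data the small shops double their revenue while the large ones grow by about 10%, so a single global ratio would misrepresent both groups. With only two shops per tier, each ratio also rests on very little data.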

Conclusion

We've discussed two methods for fitting models to short time series. Both rely on guided statistical analysis that incorporates human judgement, which helps reduce the bias hidden in the data. In our experience the second one is the more promising, because it simplifies the problem and models at a finer granularity. Note, however, that its cluster-based variant can suffer from overfitting because each cluster is small. So there is a trade-off here, and it takes a lot of experiments to find the best number of clusters.

