| | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| YearMonth | 03-Jan | 03-Feb | 03-Mar | 03-Apr | 03-May | 03-Jun |
| Sales | 155 | 173 | 204 | 219 | 223 | 208 |
The dataset contains monthly truck sales from 2003 to 2014: the number of trucks sold in each month.
| | N obs. | Mean | Std. dev. | Min | 1st quartile | Median | 3rd quartile | Max |
|---|---|---|---|---|---|---|---|---|
| Value | 144.0000 | 428.7292 | 188.6330 | 152.0000 | 273.5000 | 406.0000 | 560.2500 | 958.0000 |
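The summary table above can be reproduced with a few base R calls. The sketch below is only an illustration: the truck sales data are not included here, so the built-in `AirPassengers` series (also 144 monthly observations) is used as a stand-in, and the name `summary_stats` is hypothetical.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
x <- as.numeric(AirPassengers)

# Same statistics as in the table: count, mean, standard deviation, quartiles, extremes.
summary_stats <- c(
  n      = length(x),
  mean   = mean(x),
  sd     = sd(x),
  min    = min(x),
  q1     = unname(quantile(x, 0.25)),
  median = median(x),
  q3     = unname(quantile(x, 0.75)),
  max    = max(x)
)
print(round(summary_stats, 4))
```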
Growth trend: The number of trucks sold rose steadily from 2003 to 2014, indicating a growing market.
Seasonality: Sales show clear seasonality, with regular rises and falls in particular months, such as December or January.
Variability: The gap between the highest and lowest monthly sales within a year was markedly wider in 2010-2014 than in 2003-2006.
Local declines: Periodic dips in sales (e.g. 2009-2010) may be due to external factors such as the financial crisis.
Maximum sales: The highest sales were recorded at the end of 2014, consistent with the continued growth of the market.
Seasonality of sales: The highest average sales occur in the summer (May-August), suggesting that summer is the most active time in the industry.
Best months: July and August lead in terms of average sales, which may reflect companies buying ahead of the end of the third quarter.
Lower sales: The lowest results occur in January and February, which can be linked to the New Year period and lower investment activity.
Gradual growth: From March to May, sales rise steadily, peak in the summer, and then fall in September and October, indicating a seasonal slowdown after the holiday season.
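Average sales per calendar month, as discussed above, can be computed by grouping the series by its position in the yearly cycle. This is a minimal sketch on the stand-in `AirPassengers` series (an assumption, since the truck data are not included here); `monthly_mean` and `best_month` are hypothetical names.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
sales_ts <- AirPassengers

# cycle() gives the month index (1..12) of every observation.
month_index  <- as.integer(cycle(sales_ts))
monthly_mean <- tapply(as.numeric(sales_ts), month_index, mean)
names(monthly_mean) <- month.abb

print(round(monthly_mean, 1))

# Month with the highest average level.
best_month <- names(which.max(monthly_mean))
print(best_month)
```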
Median sales: The median (the line in the middle of the box) is about 500 trucks, which means that half of the months had higher sales and half had lower sales.
Interquartile range (IQR): The middle 50% of the data lies between about 400 and 600 trucks sold, indicating relatively little variability in most months.
No outliers: The chart shows no outlier points, suggesting that sales are fairly uniform, with no extreme low or high values.
Range of values: The whiskers extend from a minimum of around 250 trucks to a maximum of around 750 trucks per month.
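The quantities read off a box plot (whisker ends, hinges, median, outliers) can also be obtained numerically. The sketch below uses `boxplot.stats` on the stand-in `AirPassengers` series; the stand-in and the variable names are assumptions for illustration only.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
x <- as.numeric(AirPassengers)

bp <- boxplot.stats(x)

# bp$stats: lower whisker, lower hinge, median, upper hinge, upper whisker.
print(bp$stats)

# bp$out: points beyond the whiskers (empty when there are no outliers).
print(bp$out)

# Interquartile range, the height of the box.
iqr_value <- IQR(x)
print(iqr_value)
```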
Constant sales growth: Total annual truck sales grew almost without interruption from 2003 to 2014, indicating a developing market.
Slower growth from 2009 to 2010: Growth slowed in 2009-2010. Since 2011, sales growth has clearly accelerated, culminating in 2014, the best year in the period under review.
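Yearly totals and year-on-year growth rates of the kind described above can be derived from a monthly series as follows. This is a sketch on the stand-in `AirPassengers` series (an assumption); `yearly_total` and `yoy_growth_pct` are hypothetical names.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
sales_ts <- AirPassengers

# Collapse the monthly series to one value per year by summing.
yearly_total <- aggregate(sales_ts, nfrequency = 1, FUN = sum)

# Year-on-year growth in percent.
y <- as.numeric(yearly_total)
yoy_growth_pct <- 100 * diff(y) / head(y, -1)

print(yearly_total)
print(round(yoy_growth_pct, 1))
```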
Seasonality of sales: Each year, sales rise in the first half of the year, peak in the summer months (June-July), and then fall in the second half, reaching their lowest values in November and December.
Stable pattern across years: The overall shape of the curves is very similar for all years, indicating a clear and repeatable seasonal pattern.
Growth in sales over the years: The higher lines for 2013-2014 indicate an overall increase in sales compared with the beginning of the period (2003-2006). The increase is particularly evident in the summer months.
The graph shows the ACF and PACF. The ACF decays slowly, which is typical of a non-stationary series with a trend and gives little evidence of a dominant MA component, while the sharp drop in the PACF suggests a stronger influence of the AR component.
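The autocorrelation values behind such a plot can be inspected directly. The sketch below computes the ACF and PACF without plotting, using the stand-in `AirPassengers` series (an assumption); the 36-lag horizon is an arbitrary illustrative choice.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
sales_ts <- AirPassengers

# plot = FALSE returns the values instead of drawing the correlogram.
acf_vals  <- acf(sales_ts,  lag.max = 36, plot = FALSE)
pacf_vals <- pacf(sales_ts, lag.max = 36, plot = FALSE)

# The ACF starts at lag 0 (always 1); the PACF starts at lag 1.
print(round(as.numeric(acf_vals$acf)[1:13], 3))
print(round(as.numeric(pacf_vals$acf)[1:12], 3))
```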
The graph shows the decomposition of the time series using the STL method. The seasonal component confirms the presence of regular patterns, the trend indicates long-term growth, and the residual component contains random fluctuations. The data can be described taking into account both trend and seasonality.
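An STL decomposition like the one described above can be sketched with `stl` from base R. The stand-in `AirPassengers` series and the choice `s.window = "periodic"` are assumptions; the original analysis may have used different settings.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
sales_ts <- AirPassengers

# s.window = "periodic" forces an identical seasonal pattern in every year.
fit_stl <- stl(sales_ts, s.window = "periodic")

# Columns: seasonal, trend, remainder. They add up to the original series.
components <- fit_stl$time.series
print(head(round(components, 2)))
```

Because STL is additive, summing the three columns row by row recovers the observed values.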
The previous graph shows that the time series has an upward trend and seasonality, which means that it is not stationary. We therefore need to difference it to make it stationary. We apply first-order differencing, visualize the result, and check it with appropriate tests.
The graph suggests that first-order differencing made the time series stationary. We now verify this with the appropriate tests.
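First-order differencing replaces each value with its change from the previous month. A minimal sketch on the stand-in `AirPassengers` series (an assumption; `diff_ts` is a hypothetical name):

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
sales_ts <- AirPassengers

# First-order difference: y[t] - y[t-1]; the result is one observation shorter.
diff_ts <- diff(sales_ts, differences = 1)

print(length(sales_ts))
print(length(diff_ts))
print(head(diff_ts))
```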
| | Test name | Statistic | p-value |
|---|---|---|---|
| Dickey-Fuller | Augmented Dickey-Fuller | -15.1206382 | 0.01 |
| KPSS Level | Kwiatkowski-Phillips-Schmidt-Shin | 0.0124762 | 0.10 |
| Dickey-Fuller Z(alpha) | Phillips-Perron | -110.1168101 | 0.01 |
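All three tests support stationarity of the differenced series: the Augmented Dickey-Fuller and Phillips-Perron tests reject the null hypothesis of a unit root (p = 0.01), and the KPSS test does not reject the null hypothesis of level stationarity (p = 0.10).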
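One of the three tests, Phillips-Perron, is available in base R as `PP.test`. The sketch below applies it to the differenced stand-in `AirPassengers` series (an assumption). The ADF and KPSS tests are not part of base R; the `tseries` package provides `adf.test` and `kpss.test`, although it is only an assumption that the table above was produced with it.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
diff_ts <- diff(AirPassengers, differences = 1)

# Null hypothesis: the series has a unit root (is non-stationary).
pp_result <- PP.test(diff_ts)
print(pp_result)

# A small p-value rejects the unit root in favour of stationarity.
reject_unit_root <- pp_result$p.value < 0.05
print(reject_unit_root)
```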
```r
model_holtwinters <- HoltWinters(train_ts, seasonal = "multiplicative")
```
The parameter seasonal = "multiplicative" models seasonality whose
amplitude changes in proportion to the level of the series, which suits
data with an increasing trend and intensifying seasonality.
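To show how the fitted model is used for forecasting, here is a hedged end-to-end sketch. The train/test split dates and the stand-in `AirPassengers` series are assumptions; only the `HoltWinters` call itself mirrors the original code.

```r
# Stand-in data and split: assumptions, not the original truck sales data.
sales_ts <- AirPassengers
train_ts <- window(sales_ts, end = c(1958, 12))
test_ts  <- window(sales_ts, start = c(1959, 1))

# Multiplicative seasonality, as in the original call.
model_holtwinters <- HoltWinters(train_ts, seasonal = "multiplicative")

# Forecast as many months ahead as there are test observations.
hw_forecast <- predict(model_holtwinters, n.ahead = length(test_ts))

# Compare the forecast with the held-out values.
print(round(cbind(actual = test_ts, forecast = hw_forecast[, "fit"]), 1))
```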
The forecast captures the seasonality and trend seen in the training data, which suggests that the method has picked up the main patterns. Comparison with the test data shows that the model works well, with only minor deviations.
| | Test name | Statistic | p-value |
|---|---|---|---|
| W | Shapiro-Wilk | 0.9791780 | 0.0597338 |
| A | Anderson-Darling | 0.7282517 | 0.0561049 |
| BP | Breusch-Pagan | 10.7081965 | 0.0300468 |
| GQ | Goldfeld-Quandt | 1.3427149 | 0.1387731 |
The tests do not reject normality of the errors: both p-values are above 0.05, although only marginally (0.060 and 0.056). The evidence on homogeneity of variance is mixed: the Breusch-Pagan test rejects the null hypothesis of constant variance (p = 0.030), while the Goldfeld-Quandt test does not (p = 0.139).
The graph shows that the best polynomial degree is 24, since it yields the lowest AIC.
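Choosing the polynomial degree by AIC amounts to fitting one regression per candidate degree and keeping the one with the smallest AIC. The sketch below illustrates the idea on the stand-in `AirPassengers` series with a time index as the only regressor; the candidate range 1 to 12 and all variable names are assumptions, and the original model may have included other terms.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
train_ts <- window(AirPassengers, end = c(1958, 12))
y        <- as.numeric(train_ts)
time_idx <- seq_along(y)

# Fit an orthogonal polynomial trend of each candidate degree and record its AIC.
degrees <- 1:12
aic_by_degree <- sapply(degrees, function(d) AIC(lm(y ~ poly(time_idx, d))))

# The preferred degree is the one with the lowest AIC.
best_degree <- degrees[which.min(aic_by_degree)]
print(round(aic_by_degree, 1))
print(best_degree)
```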
The forecast reproduces the test data reasonably well, accounting for both seasonality and the overall trend of the data. However, compared to the Holt-Winters method, polynomial regression performs slightly worse in terms of forecast accuracy, especially for more complex seasonal patterns.
The diagnostic charts indicate that the polynomial regression model may have some problems. The “Residuals vs Fitted” graph shows curvature, suggesting that the model has not fully captured the structure of the data. The “Normal Q-Q” graph shows that the residuals are not perfectly normally distributed, especially at the ends, indicating a possible deviation from the assumption of normality. “Scale-Location” shows an increasing variance of the residuals, which may suggest a heteroskedasticity problem. The “Residuals vs Leverage” chart identifies several points of high influence (high leverage) that may significantly affect the model fit.
| | Test name | Statistic | p-value |
|---|---|---|---|
| W | Shapiro-Wilk | 0.9887849 | 0.3004680 |
| A | Anderson-Darling | 0.2782367 | 0.6454426 |
| BP | Breusch-Pagan | 18.4647172 | 0.0000173 |
| GQ | Goldfeld-Quandt | 3.5644317 | 0.0000001 |
| DW | Durbin-Watson | 0.7288279 | 0.0000000 |
| LM test | Breusch-Godfrey | 58.3443941 | 0.0000000 |
The test results do not reject normality of the polynomial regression residuals. The assumption of homogeneity of variance is violated, as both tests have p-values below 0.05. The autocorrelation tests (Durbin-Watson and Breusch-Godfrey) show that the residuals are serially correlated, which violates the independence assumption of the regression and means the model leaves part of the temporal structure unexplained.
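The Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared successive residual differences divided by the sum of squared residuals, with values near 2 indicating no lag-1 autocorrelation. The sketch below does this for a cubic trend fitted to the stand-in `AirPassengers` series (both assumptions) and adds a Ljung-Box test and a Shapiro-Wilk test from base R; the Breusch-Pagan, Goldfeld-Quandt, and Breusch-Godfrey tests are not in base R and are omitted here.

```r
# Stand-in data and model: assumptions, not the original regression.
y        <- as.numeric(window(AirPassengers, end = c(1958, 12)))
time_idx <- seq_along(y)
fit      <- lm(y ~ poly(time_idx, 3))
e        <- residuals(fit)

# Durbin-Watson statistic: about 2 means no lag-1 autocorrelation,
# values well below 2 mean positive autocorrelation.
dw_stat <- sum(diff(e)^2) / sum(e^2)
print(dw_stat)

# Ljung-Box test: null hypothesis of no autocorrelation up to lag 12.
lb_test <- Box.test(e, lag = 12, type = "Ljung-Box")
print(lb_test$p.value)

# Shapiro-Wilk test: null hypothesis of normally distributed residuals.
sw_test <- shapiro.test(e)
print(sw_test$p.value)
```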
```r
library(forecast)  # provides auto.arima()

sarima <- auto.arima(train_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
summary(sarima)
## Series: train_ts
## ARIMA(1,1,0)(2,1,0)[12]
##
## Coefficients:
##           ar1     sar1    sar2
##       -0.2851  -0.0826  0.2318
## s.e.   0.0918   0.0919  0.1049
##
## sigma^2 = 287.9:  log likelihood = -505.03
## AIC=1018.05   AICc=1018.4   BIC=1029.17
##
## Training set error measures:
##                     ME     RMSE      MAE      MPE    MAPE      MASE        ACF1
## Training set 0.8079424 15.90494 11.86291 0.045109 2.96353 0.2546599 0.007296978
```
This code snippet uses the auto.arima function from the
forecast package to select a SARIMA model automatically.
The selected model, ARIMA(1,1,0)(2,1,0)[12], combines a non-seasonal
autoregressive term \(AR\) with two seasonal autoregressive terms \(SAR\)
and applies first-order non-seasonal and seasonal differencing to handle
the pronounced trend and seasonality. The ar1 and sar2 coefficients are
more than two standard errors from zero, while sar1 is not, so the
estimated effects of ordinary and seasonal lags are moderate at most. The
model fits the training data well, as shown by the low error values
(MAPE of about 2.96%) and the near-zero lag-1 autocorrelation of the
residuals (\(ACF1 = 0.0073\)). Based on the estimated parameters, the
SARIMA model can be written in the form:
\[(1 + 0.2851 B)(1 + 0.0826 B^{12} - 0.2318 B^{24})(1 - B)(1 - B^{12}) Y_t = \epsilon_t\].
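A model with the same orders can also be fitted with `arima` from base R, which makes the link between the printed coefficients and the equation explicit: R writes the AR polynomial as \(1 - \phi B\), so a negative `ar1` appears with a plus sign in the equation. The sketch below uses the stand-in `AirPassengers` series and `method = "ML"`, both of which are assumptions; its coefficients will differ from those reported above.

```r
# Stand-in data: AirPassengers replaces the truck sales series (assumption).
train_ts <- window(AirPassengers, end = c(1958, 12))

# ARIMA(1,1,0)(2,1,0)[12]: same orders as the model selected by auto.arima.
sarima_fit <- arima(
  train_ts,
  order    = c(1, 1, 0),
  seasonal = list(order = c(2, 1, 0), period = 12),
  method   = "ML"
)

# Coefficients in the order ar1, sar1, sar2.
print(round(coef(sarima_fit), 4))

# Point forecasts for the next 24 months.
sarima_forecast <- predict(sarima_fit, n.ahead = 24)$pred
print(round(sarima_forecast, 1))
```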
The graph shows a time series forecast made using the SARIMA model. The forecast reproduces the test data well, taking into account both trend and seasonality, which shows the effectiveness of the model in analyzing this type of data.
| | Test name | Statistic | p-value |
|---|---|---|---|
| W | Shapiro-Wilk | 0.9826246 | 0.0902160 |
| A | Anderson-Darling | 0.8190532 | 0.0334795 |
| BP | Breusch-Pagan | 9.4122285 | 0.0021554 |
| GQ | Goldfeld-Quandt | 1.5518511 | 0.0405796 |
The evidence on normality of the residuals is mixed: the Shapiro-Wilk test does not reject it (p = 0.090), but the Anderson-Darling test does (p = 0.033). The assumption of homogeneity of variance is violated, since both tests give p-values below 0.05.
| Model | R2 | MAE | MSE | RMSE | MAPE |
|---|---|---|---|---|---|
| Holt-Winters | 0.9913795 | 15.63479 | 585.4071 | 24.19519 | 2.196681 |
| Polynomial Regression | 0.8752309 | 31.94093 | 1817.4034 | 42.63102 | 4.264551 |
| SARIMA | 0.9909566 | 25.11870 | 859.8484 | 29.32317 | 3.655697 |
A comparison of the three models (Holt-Winters, polynomial regression, SARIMA) shows that the Holt-Winters model performed best on every measure in the table. The SARIMA model performed slightly worse, with higher errors and a marginally lower \(R^2\), but still clearly outperformed polynomial regression. Polynomial regression was the least effective, with the highest error values, which may be due to the difficulty of capturing complex seasonal patterns in the data.
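The measures in the comparison table can be computed from the actual and forecast values of the test period. The helper below is a hypothetical sketch: the name `forecast_metrics` is not from the original analysis, \(R^2\) is taken here as one minus the ratio of residual to total sum of squares (the original may have used a different definition), and MAPE is expressed in percent. The input vectors are toy numbers chosen only to make the arithmetic easy to check.

```r
# Hypothetical helper: accuracy measures for a forecast against actual values.
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  c(
    R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    MAE  = mean(abs(err)),
    MSE  = mean(err^2),
    RMSE = sqrt(mean(err^2)),
    MAPE = 100 * mean(abs(err / actual))
  )
}

# Toy values for illustration only (not from the truck sales data).
actual    <- c(100, 200, 300, 400)
predicted <- c(110, 190, 310, 390)

metrics <- forecast_metrics(actual, predicted)
print(round(metrics, 4))
```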
Holt-Winters:
The Holt-Winters model is the best model for forecasting truck sales
because it has achieved the best results in key metrics. It is able to
reproduce both trend and seasonality in the data well, allowing it to
accurately forecast future values.
Polynomial regression:
Polynomial regression was the least effective model because it had the
highest error values and lowest \(R^2\). It may have difficulty capturing
complex seasonal patterns in the data.
SARIMA:
The SARIMA model produced results with slightly higher errors than
Holt-Winters, but better than polynomial regression.
Both Holt-Winters and SARIMA are therefore effective models for time series of this kind. Which of the two performs better depends on the data, but on this dataset both give good results.