Data

           1       2       3       4       5       6
YearMonth  03-Jan  03-Feb  03-Mar  03-Apr  03-May  03-Jun
Sales      155     173     204     219     223     208

The dataset contains monthly truck sales from January 2003 to December 2014 (144 observations); each value is the number of trucks sold in that month. The table above shows the first six months.
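As a minimal sketch, the six values shown in the table can be stored as a monthly time series in base R. The object names `sales_head` and `q1_2003` are illustrative and do not come from the original analysis; the full 144-observation series would be built the same way.

```r
# First six monthly values from the table above (January-June 2003).
sales_head <- ts(c(155, 173, 204, 219, 223, 208),
                 start = c(2003, 1), frequency = 12)
print(sales_head)

# window() extracts a sub-period, here the first quarter of 2003.
q1_2003 <- window(sales_head, end = c(2003, 3))
print(q1_2003)
```

Setting `frequency = 12` is what lets later steps (seasonal decomposition, Holt-Winters, SARIMA) recognise the yearly cycle.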

Basic statistics

       N obs.    Mean      Std. dev.  Min       1st quartile  Median    3rd quartile  Max
Value  144.0000  428.7292  188.6330   152.0000  273.5000      406.0000  560.2500      958.0000
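A summary like the one above can be computed with a few base R functions. The helper name `describe` is hypothetical, and the input here is only the six values shown in the Data section, so the numbers differ from the full-series table.

```r
# Hypothetical helper returning the same columns as the table above.
describe <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  c(N = length(x), Mean = mean(x), SD = sd(x), Min = min(x),
    Q1 = q[1], Median = median(x), Q3 = q[2], Max = max(x))
}

# Only the first six months are available in this excerpt.
x <- c(155, 173, 204, 219, 223, 208)
res <- describe(x)
print(round(res, 4))
```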

Data visualization

Basic charts

Growth trend: The number of trucks sold has steadily increased from 2003 to 2014, indicating that the market is growing.
Seasonality: Sales show clear seasonality with regular ups and downs in certain months, such as December or January.
Variability: The difference between the highest and lowest sales during the year increased significantly between 2010 and 2014 compared to the 2003-2006 period.
Local declines: Periodic declines in sales (e.g. 2009-2010) may be due to external factors such as the financial crisis.
Maximum sales: The highest sales were recorded at the end of 2014, confirming the dynamic development of the market.

Seasonality of sales: The highest average sales occur during the summer (May-August), suggesting that summer is the most active time in the industry.
Best months: July and August lead in terms of average sales, which may be due to companies preparing before the end of the third quarter.
Lower sales: The lowest results occur in January and February, which can be linked to the New Year period and lower investment activity.
Steady growth: From March to May, sales rise steadily, peak in the summer, and then fall in September and October, indicating a seasonal slowdown after the vacation period.

Median sales: The median (the line in the middle of the box) is about 406 trucks according to the basic statistics table, which means that half the months had higher sales and the other half had lower sales.
Interquartile range (IQR): The middle 50% of monthly values lie between about 274 and 560 trucks sold, a spread of roughly 290 trucks.
No outliers: There are no outlier points on the chart, so no month has an extremely low or high value relative to the rest of the series.
Range of values: The summary statistics give a minimum of 152 and a maximum of 958 trucks per month.
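The absence of outliers can be checked by hand from the quartiles in the basic statistics table. The sketch below assumes the box plot uses the conventional 1.5 x IQR whisker rule (the default in R's `boxplot`); the variable names are illustrative.

```r
# Values taken from the basic statistics table.
q1 <- 273.5
q3 <- 560.25
x_min <- 152
x_max <- 958

iqr <- q3 - q1                    # interquartile range
lower_fence <- q1 - 1.5 * iqr     # points below this would be flagged
upper_fence <- q3 + 1.5 * iqr     # points above this would be flagged

no_outliers <- x_min >= lower_fence && x_max <= upper_fence
print(c(iqr = iqr, lower = lower_fence, upper = upper_fence))
print(no_outliers)
```

Both the minimum and the maximum fall inside the fences, which agrees with a chart that shows no outlier points.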

Steady sales growth: Total annual truck sales grew almost without interruption year over year, indicating the market's development from 2003 to 2014.
Slowdown in 2009-2010, acceleration from 2011: Growth slowed in 2009-2010; from 2011 it clearly accelerated, culminating in 2014, the best year in the period under review.

Seasonality of sales: Each year, sales rise in the first half of the year, reaching a peak in the summer months (June-July), and then fall in the second half of the year, reaching their lowest values in November and December.
Stable pattern across years: The overall shape of the curves is very similar for all years, indicating a clear and repeatable pattern of seasonality.
Growth in sales over the years: The higher lines for 2013-2014 indicate an overall increase in sales in those years compared to the beginning of the period (2003-2006). The increase is particularly evident during the summer months.

Time series analysis

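The graph shows the ACF and PACF of the series. The ACF declines slowly, which is typical of a trending, non-stationary series and points to the need for differencing rather than to a strong MA component, while the sharp cut-off in the PACF suggests a greater influence of the AR component.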
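A minimal sketch of how such correlograms can be computed in base R. The truck series itself is not available in this text, so R's built-in `AirPassengers` series (also monthly, 144 observations, with trend and seasonality) is used purely as a stand-in; the numbers it prints are not the truck-sales results.

```r
# Stand-in data: NOT the truck series from this report.
x <- AirPassengers

acf_raw  <- acf(x, lag.max = 36, plot = FALSE)
pacf_raw <- pacf(x, lag.max = 36, plot = FALSE)

# acf() reports lag 0 first; pacf() starts at lag 1.
print(round(acf_raw$acf[2:4], 3))
print(round(pacf_raw$acf[1:3], 3))

# Approximate 95% band used to judge which lags are significant.
band <- 1.96 / sqrt(length(x))
print(round(band, 3))
```

Calling `acf(x)` and `pacf(x)` without `plot = FALSE` draws the charts themselves.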

The graph shows the decomposition of the time series using the STL method. The seasonal component confirms the presence of regular patterns, the trend indicates long-term growth, and the residual component contains random fluctuations. The data can be described taking into account both trend and seasonality.
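An STL decomposition of this kind can be sketched with `stl` from base R. The built-in `AirPassengers` series stands in for the truck data, and the log transform and `s.window = "periodic"` are assumptions made here; the source does not say which settings were used.

```r
# Stand-in data: NOT the truck series from this report.
x <- AirPassengers

# STL is additive, so a log transform is assumed here to handle
# seasonal swings that grow with the level of the series.
fit <- stl(log(x), s.window = "periodic")
components <- fit$time.series    # columns: seasonal, trend, remainder
print(head(components))

# The three components add back up to the (log) series.
recon <- rowSums(components)
print(max(abs(recon - as.numeric(log(x)))))
```

`plot(fit)` produces the familiar four-panel chart (data, seasonal, trend, remainder).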

Stationarity of the time series

The previous graph shows that the time series has an upward trend and seasonality, which means that it is not stationary. To make it stationary, we apply first-order differencing, plot the result, and then check it with appropriate tests.

The graph shows that first-order differencing has removed the trend, and the series now looks stationary. We confirm this with the tests below.

                        Test name                          Statistic     p-value
Dickey-Fuller           Augmented Dickey-Fuller            -15.1206382   0.01
KPSS Level              Kwiatkowski-Phillips-Schmidt-Shin  0.0124762     0.10
Dickey-Fuller Z(alpha)  Phillips-Perron                    -110.1168101  0.01

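All three tests support the stationarity of the differenced series: the Augmented Dickey-Fuller and Phillips-Perron tests reject the null hypothesis of a unit root (p = 0.01), and the KPSS test does not reject the null hypothesis of level stationarity (p = 0.10).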
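The differencing step and one of the tests can be sketched in base R. `PP.test` from the stats package is used here as a stand-in; the labels in the table suggest the original analysis used the tseries package (`adf.test`, `kpss.test`, `pp.test`), but that is an assumption. `AirPassengers` again replaces the unavailable truck series.

```r
# Stand-in data: NOT the truck series from this report.
x <- AirPassengers

# First-order differencing: dx[t] = x[t] - x[t - 1].
dx <- diff(x, differences = 1)

# Phillips-Perron unit-root test (null hypothesis: unit root).
pp <- PP.test(dx)
print(pp)

# A small p-value rejects the unit root, i.e. supports stationarity.
reject_unit_root <- pp$p.value < 0.05
```

Note the opposite direction of the KPSS test: its null hypothesis is stationarity, so there a large p-value is the result that supports a stationary series.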

Holt-Winters

# Fit Holt-Winters exponential smoothing with multiplicative seasonality
model_holtwinters <- HoltWinters(train_ts, seasonal = "multiplicative")

The argument seasonal = "multiplicative" specifies a seasonal component whose amplitude changes in proportion to the level of the series, which suits data with an increasing trend and growing seasonal swings.

The fitted values follow the trend and seasonality of the training data closely, which indicates that the method has captured the main patterns. The forecast also tracks the test data well, with only minor deviations.
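A minimal end-to-end sketch of this step, under stated assumptions: the built-in `AirPassengers` series replaces the truck data, and the last 24 months are held out as the test set (the source does not state the actual split). Only base R is used.

```r
# Stand-in data and an ASSUMED train/test split (last 24 months held out).
x <- AirPassengers
train_ts <- window(x, end = c(1958, 12))
test_ts  <- window(x, start = c(1959, 1))

model_holtwinters <- HoltWinters(train_ts, seasonal = "multiplicative")

# Forecast as many months ahead as there are test observations.
fc <- predict(model_holtwinters, n.ahead = length(test_ts))
fc_vals <- as.numeric(fc)
actual  <- as.numeric(test_ts)

# Mean absolute percentage error on the held-out months.
mape <- mean(abs((actual - fc_vals) / actual)) * 100
print(round(mape, 2))
```

`plot(model_holtwinters, fc)` overlays the fitted values and the forecast on the training series.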

Checking the assumptions

    Test name         Statistic   p-value
W   Shapiro-Wilk      0.9791780   0.0597338
A   Anderson-Darling  0.7282517   0.0561049
BP  Breusch-Pagan     10.7081965  0.0300468
GQ  Goldfeld-Quandt   1.3427149   0.1387731

The tests do not reject normality of the errors: both p-values are above 0.05, although only marginally (0.060 and 0.056). The evidence on homogeneity of variance is mixed: the Breusch-Pagan test rejects the null hypothesis of constant variance (p = 0.030), whereas the Goldfeld-Quandt test does not (p = 0.139).
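Residual checks of this kind can be sketched in base R. `shapiro.test` is the same Shapiro-Wilk test as in the table; the variance check below is a plain F test comparing the two halves of the residual series, which is only a simplified stand-in for the Goldfeld-Quandt test (available as `gqtest` in the lmtest package, just as `bptest` is there and `ad.test` is in nortest). `AirPassengers` and the split are assumptions, as before.

```r
# Stand-in data and ASSUMED split; not the truck series.
train_ts <- window(AirPassengers, end = c(1958, 12))
model <- HoltWinters(train_ts, seasonal = "multiplicative")
res <- as.numeric(residuals(model))

# Normality: null hypothesis is that the residuals are normal.
sw <- shapiro.test(res)

# Simplified variance check: F test of second half vs first half.
half <- floor(length(res) / 2)
vt <- var.test(res[(half + 1):length(res)], res[1:half])

print(c(shapiro_p = sw$p.value, variance_p = vt$p.value))
```

In both cases a p-value below 0.05 is evidence against the assumption being tested.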

Polynomial regression

The graph shows that the lowest AIC is obtained for a polynomial of degree 24, so this degree was selected.
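Degree selection by AIC can be sketched as a simple loop. This example fits a pure polynomial in time to the stand-in `AirPassengers` training window and searches degrees 1 to 10; the original model specification and search range are not given in the source, so both are assumptions.

```r
# Stand-in data: NOT the truck series from this report.
y <- as.numeric(window(AirPassengers, end = c(1958, 12)))
t_idx <- seq_along(y)

# Fit one model per degree and record its AIC.
degrees <- 1:10
aic <- sapply(degrees, function(d) AIC(lm(y ~ poly(t_idx, d))))

best_degree <- degrees[which.min(aic)]
print(data.frame(degree = degrees, AIC = round(aic, 2)))
print(best_degree)
```

AIC rewards in-sample fit and penalises only the number of parameters, so a very high degree can still extrapolate poorly beyond the training range; comparing forecasts on the test set is a useful complementary check.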

The forecast reproduces the test data reasonably well, accounting for both seasonality and the overall trend of the data. However, compared to the Holt-Winters method, polynomial regression performs slightly worse in terms of forecast accuracy, especially for more complex seasonal patterns.

Diagnostic charts

The diagnostic charts indicate that the polynomial regression model may have some problems. The “Residuals vs Fitted” graph shows curvature, suggesting that the model has not fully captured the structure of the data. The “Normal Q-Q” graph shows that the residuals are not perfectly normally distributed, especially at the ends, indicating a possible deviation from the assumption of normality. “Scale-Location” shows an increasing variance of the residuals, which may suggest a heteroskedasticity problem. The “Residuals vs Leverage” chart identifies several points of high influence (high leverage) that may significantly affect the model fit.

Checking the assumptions

         Test name         Statistic   p-value
W        Shapiro-Wilk      0.9887849   0.3004680
A        Anderson-Darling  0.2782367   0.6454426
BP       Breusch-Pagan     18.4647172  0.0000173
GQ       Goldfeld-Quandt   3.5644317   0.0000001
DW       Durbin-Watson     0.7288279   0.0000000
LM test  Breusch-Godfrey   58.3443941  0.0000000

The test results do not reject normality of the residuals of the polynomial regression model (both p-values are well above 0.05). The assumption of homogeneity of variance is violated, as both tests have p-values below 0.05. The Durbin-Watson and Breusch-Godfrey tests indicate strongly autocorrelated residuals, which violates the independence assumption of the regression and suggests that the model has not captured all of the temporal structure in the series.
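To make two of these statistics concrete, the sketch below computes the Durbin-Watson statistic and a studentized (Koenker) Breusch-Pagan statistic by hand in base R. The quadratic trend model and the `AirPassengers` data are stand-ins chosen for illustration, not the model from this report. A Durbin-Watson value near 2 means no lag-1 autocorrelation; since DW is roughly 2(1 - r1), the reported 0.729 corresponds to a lag-1 residual correlation of about 0.64.

```r
# Stand-in data and model: NOT the report's polynomial regression.
y <- as.numeric(AirPassengers)
t_idx <- seq_along(y)
fit <- lm(y ~ poly(t_idx, 2))
e <- residuals(fit)

# Durbin-Watson: sum of squared successive differences over sum of squares.
dw <- sum(diff(e)^2) / sum(e^2)

# Implied lag-1 correlation for the value reported in the table above.
r1_doc <- 1 - 0.7288279 / 2

# Studentized Breusch-Pagan: n * R^2 from regressing e^2 on the regressors.
aux <- lm(I(e^2) ~ poly(t_idx, 2))
bp_stat <- length(e) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

print(c(DW = dw, r1_doc = r1_doc, BP = bp_stat, BP_p = bp_p))
```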

SARIMA

sarima <- auto.arima(train_ts, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
summary(sarima)
## Series: train_ts 
## ARIMA(1,1,0)(2,1,0)[12] 
## 
## Coefficients:
##           ar1     sar1    sar2
##       -0.2851  -0.0826  0.2318
## s.e.   0.0918   0.0919  0.1049
## 
## sigma^2 = 287.9:  log likelihood = -505.03
## AIC=1018.05   AICc=1018.4   BIC=1029.17
## 
## Training set error measures:
##                     ME     RMSE      MAE      MPE    MAPE      MASE        ACF1
## Training set 0.8079424 15.90494 11.86291 0.045109 2.96353 0.2546599 0.007296978

The code above uses the auto.arima function from the forecast package to select a SARIMA model automatically. The selected model, ARIMA(1,1,0)(2,1,0)[12], includes one non-seasonal autoregressive (AR) term and two seasonal autoregressive (SAR) terms, with first-order regular and seasonal differencing to handle the pronounced trend and seasonality. The coefficients indicate a moderate effect of the ordinary and seasonal lags on the series, although sar1 (-0.0826, s.e. 0.0919) is small relative to its standard error. The model fits the training data well, as shown by the low error values (training MAPE of about 2.96%) and the low lag-1 autocorrelation of the residuals (\(ACF1 = 0.0073\)). Based on the estimated parameters, the model can be written in the form:

\[(1 + 0.2851 B)(1 + 0.0826 B^{12} - 0.2318 B^{24})(1 - B)(1 - B^{12}) Y_t = \epsilon_t\]

where \(B\) is the backshift operator, \(B^k Y_t = Y_{t-k}\).
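The operator form of the model can be expanded numerically to see which past values enter the one-step recursion. The sketch below multiplies the four polynomials in \(B\) using the coefficients reported by auto.arima; the helper `poly_mul` and the variable names are illustrative.

```r
# Multiply two polynomials in B given as coefficient vectors
# (element k + 1 holds the coefficient of B^k).
poly_mul <- function(a, b) {
  out <- numeric(length(a) + length(b) - 1)
  for (i in seq_along(a)) {
    idx <- i + seq_along(b) - 1
    out[idx] <- out[idx] + a[i] * b
  }
  out
}

ar  <- c(1, 0.2851)                                    # 1 + 0.2851 B
sar <- c(1, rep(0, 11), 0.0826, rep(0, 11), -0.2318)   # 1 + 0.0826 B^12 - 0.2318 B^24
d1  <- c(1, -1)                                        # 1 - B
d12 <- c(1, rep(0, 11), -1)                            # 1 - B^12

lhs <- poly_mul(poly_mul(ar, sar), poly_mul(d1, d12))

# Rearranged: Y_t = sum over k >= 1 of weights[k] * Y_{t-k} + eps_t
weights <- -lhs[-1]
nonzero <- which(weights != 0)
print(data.frame(lag = nonzero, weight = round(weights[nonzero], 4)))
```

The expanded polynomial reaches back 38 months, and its coefficients sum to zero because of the \((1 - B)\) factor.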

The graph shows a time series forecast made using the SARIMA model. The forecast reproduces the test data well, taking into account both trend and seasonality, which shows the effectiveness of the model in analyzing this type of data.

Checking the assumptions

    Test name         Statistic  p-value
W   Shapiro-Wilk      0.9826246  0.0902160
A   Anderson-Darling  0.8190532  0.0334795
BP  Breusch-Pagan     9.4122285  0.0021554
GQ  Goldfeld-Quandt   1.5518511  0.0405796

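The normality results are mixed: the Shapiro-Wilk test does not reject normality (p = 0.090), while the Anderson-Darling test does (p = 0.033), so the assumption of normal residuals is questionable. The assumption of homogeneity of variance is violated, because both the Breusch-Pagan (p = 0.002) and Goldfeld-Quandt (p = 0.041) tests reject it at the 0.05 level.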

Model comparison

Model                  R2         MAE       MSE        RMSE      MAPE
Holt-Winters           0.9913795  15.63479  585.4071   24.19519  2.196681
Polynomial Regression  0.8752309  31.94093  1817.4034  42.63102  4.264551
SARIMA                 0.9909566  25.11870  859.8484   29.32317  3.655697

Comparing the three models (Holt-Winters, Polynomial Regression, SARIMA) shows that Holt-Winters performed best on every measure, with the highest \(R^2\) and the lowest errors (MAPE of about 2.20%). The SARIMA model came second, with higher errors (MAPE of about 3.66%) and a marginally lower \(R^2\), but still outperformed Polynomial Regression. Polynomial regression was the least effective, with the highest error values (MAPE of about 4.26%), which may be due to the difficulty of capturing complex seasonal patterns in the data.
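The measures in the table can be computed with a small helper. The sketch below is illustrative: the function name `forecast_metrics` is hypothetical, the three-point input is a toy example rather than the report's data, and \(R^2\) is taken as 1 - SSE/SST, which is an assumption about how the original figure was defined.

```r
# Hypothetical helper; R2 defined as 1 - SSE/SST (an assumption).
forecast_metrics <- function(actual, predicted) {
  err <- actual - predicted
  mse <- mean(err^2)
  c(R2   = 1 - sum(err^2) / sum((actual - mean(actual))^2),
    MAE  = mean(abs(err)),
    MSE  = mse,
    RMSE = sqrt(mse),
    MAPE = mean(abs(err / actual)) * 100)
}

# Toy example, not data from the report.
actual    <- c(100, 200, 300)
predicted <- c(110, 190, 300)
m <- forecast_metrics(actual, predicted)
print(round(m, 4))

# Consistency check on the table: RMSE is the square root of MSE.
print(sqrt(c(585.4071, 1817.4034, 859.8484)))
```

The last line reproduces the RMSE column of the table from its MSE column, which confirms that the two columns are internally consistent.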

Conclusions:

Holt-Winters:
The Holt-Winters model is the best model for forecasting truck sales because it achieved the best results on all reported metrics. It reproduces both the trend and the seasonality in the data well, which allows it to forecast future values accurately.

Polynomial regression:
Polynomial regression was the least effective model because it had the highest error values and lowest \(R^2\). It may have difficulty capturing complex seasonal patterns in the data.

SARIMA:
The SARIMA model produced slightly higher errors than Holt-Winters, but lower errors than polynomial regression.

Overall, Holt-Winters and SARIMA are both well suited to this series. Which of the two performs better can depend on the data, but here both fit the data closely, with Holt-Winters ahead on every metric.