
Join the conversation on Facebook

Join the conversation on Twitter

Subscribe to the YouTube Channel

Join the conversation on LinkedIn

View our projects on GitHub

Share via email
Analyze the Time series datasets (Forecast SQL)
12/06/2018
You will learn
 Understand the basics about Time Series analysis
 Which statistics can help you better understand the structure of the dataset
 Based on the statistical assessment, identify what algorithm options are available
Most of the content for this steps has been extracted from the Wikipedia article on Time series
First, here is quick definition of a time series:
A time series is a series of indexed data points using a time order.
Most commonly, a time series is a sequence taken at successive equally spaced points in time.
Time series are used in statistics, signal processing, pattern recognition, econometrics, mathematical finance, weather forecasting, earthquake prediction, electroencephalography, control engineering, astronomy, communications engineering, and largely in any domain of applied science and engineering which involves temporal measurements.
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data.
Time series forecasting is the use of a model to predict future values based on previously observed values.
Time series data have usually a natural temporal ordering.
In addition, time series models will often make use of the natural oneway ordering of time so that values for a given period will be expressed as deriving in some way from past values, rather than from future values (see time reversibility).
Time series analysis can be applied to realvalue, continuous data, discrete numeric data, or discrete symbolic data (i.e. sequences of characters, such as letters and words in the English language).
Time series are very frequently plotted via line charts.
Cash Flows
The Cash Flows file (CashFlows.txt
) presents daily measures of cash flows from January 2, 1998 to September, 30 1998. Each observation is characterized by 25 variables described in the following table.
Variable  Description  Example of values 

Date  Day, month and year of the readings  A date 
Cash 
Cash flow  A numerical value with n decimals 
BeforeLastMonday LastMonday BeforeLastTuesday LastTuesday BeforeLastWednesday LastWednesday BeforeLastThursday LastThursday BeforeLastFriday LastFriday 
Boolean variables that indicate if the information is true or false  1 if the information is true. 
Last5WDays Last4WDays 
Boolean variables that indicate if the date is in the 5 or 4 last working days of the month  1 if the information is true. 
LastWMonth BeforeLastWMonth 
Boolean variables that indicate if the information is true or false  1 if the information is true. 
WorkingDaysIndices ReverseWorkingDaysIndices 
Indices or reverse indices of the working days  An integer value 
MondayMonthInd TuesdayMonthInd WednesdayMonthInd ThursdayMonthInd FridayMonthInd 
Indices of the week days in the month  An integer value 
Last5WDaysInd Last4WDaysInd 
Indices of the 5 or 4 last working days of the month  An integer value 
Los Angeles Ozone
The Los Angeles Ozone file (R_ozonela.txt
) presents monthly averages of hourly ozone (O3) readings in downtown Los Angeles from 1955 to 1972.
Each observation is characterized by 2 variables described in the following table:
Variable  Description  Example of values 

Time 
Month and year of the readings  A date 
R_ozonela 
Average of the hourly readings for the month  A numerical value 
“Lag 1 And Cycles” & “Trend And Cyclic” with and without White Noise (Wn
)
These files can be used to observe and analyze the impact of specific signal phenomenon.
Each observation is characterized by 2 variables described in the following table:
Variable  Description  Example of values 

TIME 
The date of the readings  A date 
Signal 
the signal value  A numerical value 
Connect to the HXE tenant using the ML_USER
user credentials and execute the following SQL statement to check the number of rows:
select 'cashflow' as "TABLE", count(1) as "COUNT" FROM forecast_cashflow
union all
select 'ozone' as "TABLE", count(1) as "COUNT" FROM forecast_ozone
union all
select 'lag_1_and_cycles' as "TABLE", count(1) as "COUNT" FROM forecast_lag_1_and_cycles
union all
select 'lag_1_and_cycles_and_wn' as "TABLE", count(1) as "COUNT" FROM forecast_lag_1_and_cycles_and_wn
union all
select 'trend_and_cyclic' as "TABLE", count(1) as "COUNT" FROM forecast_trend_and_cyclic
union all
select 'trend_and_cyclic_and_wn' as "TABLE", count(1) as "COUNT" FROM forecast_trend_and_cyclic_and_wn
union all
select 'trend_and_cyclic_and_4wn' as "TABLE", count(1) as "COUNT" FROM forecast_trend_and_cyclic_and_4wn
The result should be:
Table name  Row count 

cashflow 
272 
ozone 
204 
lag_1_and_cycles 
499 
lag_1_and_cycles_and_wn 
499 
trend_and_cyclic 
500 
trend_and_cyclic_and_wn 
500 
trend_and_cyclic_and_4wn 
500 
As stated earlier, the Cash Flow dataset presents daily measures of cash flows from January 2, 1998 to September, 30 1998. Each observation is characterized by 25 variables described in the following table.
Visualize the data
Let’s have a look at the data using the following SQL:
select cashdate, cash from forecast_cashflow order by cashdate asc;
And here is the result in a graph:
As you can visually notice, the signal includes:
 steep peaks
 a repeating pattern but with irregular gaps/intervals
 the signal trend tends to slightly decline then rise at the end
Also, you can notice that the peaks happens at certain intervals, and the data include some kind of trend that is slightly going down then rising.
Dates & intervals
As the cash flow value is provided for certain dates, let’s have a look at statistics using the following SQL:
select 'max' as indicator, to_varchar(max(cashdate)) as value
from forecast_cashflow union all
select 'min' , to_varchar(min(cashdate))
from forecast_cashflow union all
select 'delta days' , to_varchar(days_between(min(cashdate), max(cashdate)))
from forecast_cashflow union all
select 'count' , to_varchar(count(1))
from forecast_cashflow
indicator  value 

max  20020131 
min  20010102 
delta days  394 
count  271 
As you can notice, you have 272 data points spread across 394 days. This implies that data is not available on a daily basis.
This may have an impact on the way some algorithms work.
Now let’s check the interval distribution using the following SQL:
select interval, count(1) as count
from (
select days_between (lag(cashdate) over (order by cashdate asc), cashdate) as interval
from forecast_cashflow
order by cashdate asc
)
where interval is not null
group by interval;
The result should be:
interval  count 

1  211 
3  52 
4  2 
2  4 
5  1 
6  1 
Most data points are provided on a daily basis when others have a:
 2 days interval most likely caused by a bank holiday during the week
 3 days interval most likely because of weekends
 4, 5 or 6 days interval most likely because of a bank holiday next to a weekend or other special events
Generic statistics
Now, let’s have a look at some generic statistical elements using the following SQL:
select 'max' as indicator , round(max(cash)) as value from forecast_cashflow union all
select 'min' , round(min(cash)) from forecast_cashflow union all
select 'delta min/max' , round(max(cash)  min(cash)) from forecast_cashflow union all
select 'avg' , round(avg(cash)) from forecast_cashflow union all
select 'median' , round(median(cash)) from forecast_cashflow union all
select 'stddev' , round(stddev(cash)) from forecast_cashflow
The result should be:
indicator  value 

max 
24659 
min 
1579 
delta min/max 
23079 
avg 
5361 
median 
4434 
stddev 
3594 
As you can notice the average and median values are not in the same range of values which may imply a skewed data distribution.
Here is a graph which can help you visualize the cash flow value in ascending order:
As you can notice, at the end of the curve, a small set of data point have really high values compared to the rest.
Data Distribution
Now let’s have a look at the data distribution using the NTILE function that will partition the dataset into a number of groups based on the value order.
This usually helps finding issues with the first and/or the last groups (outliers).
The following SQL will partition the data into 10 groups and get the same generic statistics as before but for each group:
with data as (
select ntile(10) over (order by cash asc) as tile, cash
from forecast_cashflow
where cash is not null
)
select tile
, round(max(cash)) as max
, round(min(cash)) as min
, round(max(cash)  min(cash)) as "delta min/max"
, round(avg(cash)) as avg
, round(median(cash)) as median
, round(abs(avg(cash)  median(cash))) as "delta avg/median"
, round(stddev(cash)) as stddev
from data
group by tile
The result should be:
tile  max  min  delta  avg 
median  delta  stddev 

1  2957  1580  1378  2535  2665  130  382 
2  3513  2976  538  3248  3281  33  183 
3  3874  3521  353  3695  3695  0  114 
4  4116  3888  228  4023  4039  16  65 
5  4435  4123  312  4281  4269  12  110 
6  4832  4438  394  4611  4577  33  127 
7  5364  4879  485  5147  5125  22  144 
8  5953  5365  588  5701  5703  2  179 
9  7284  5995  1288  6600  6577  24  404 
10  24659  7542  17117  13891  11464  2427  6059 
As you can notice the first and last groups both have the delta between min and max but also between average and median higher than any other groups.
The last groups most likely represent all the peaks that you saw earlier.
Provide an answer to the question below then click on Validate.
As stated earlier, the Los Angeles Ozone dataset presents monthly averages of hourly ozone (O3) readings in downtown Los Angeles from 1955 to 1972.
Each observation is characterized by 2 variables, a time and an average of the hourly ozone readings for the month.
Visualize the data
Let’s have a look at the data using the following SQL:
select time, reading from forecast_ozone order by time asc;
And here is the result in a graph:
As you can visually notice, the data includes:
 an irregular sine pattern
 the oldest data range looks larger than the later data points
 the signal trend tends to slightly decline
Dates & intervals
As the ozone reading value is provided for a certain date, let’s have a look at date values using the following SQL:
select 'max' as indicator, to_varchar(max(time)) as value
from forecast_ozone union all
select 'min' , to_varchar(min(time))
from forecast_ozone union all
select 'delta days' , to_varchar(days_between(min(time), max(time)))
from forecast_ozone union all
select 'count' , to_varchar(count(1))
from forecast_ozone
indicator  value 

max  19711228 
min  19550128 
delta months  203 
count  204 
As you can notice, you have 204 data points spread across 16 years. This implies that data is available on a monthly basis.
Now let’s check the date value interval distribution using the following SQL:
select interval, count(1) as count
from (
select days_between (lag(time) over (order by time asc), time) as interval
from forecast_ozone
order by time asc
)
where interval is not null
group by interval
The result should be:
interval  count 

31  118 
28  13 
30  68 
29  4 
The fact that every month don’t have the same duration may impact certain algorithms leveraging the date information in the model.
Generic statistics
Now, let’s have a look at some additional statistical elements using the following SQL:
select 'max' as indicator , round(max(reading)) as value from forecast_ozone union all
select 'min' , round(min(reading)) from forecast_ozone union all
select 'delta min/max' , round(max(reading)  min(reading)) from forecast_ozone union all
select 'avg' , round(avg(reading)) from forecast_ozone union all
select 'median' , round(median(reading)) from forecast_ozone union all
select 'stddev' , round(stddev(reading)) from forecast_ozone
The result should be:
indicator  value 

max 
8.13 
min 
1.17 
delta min/max 
6.96 
avg 
3.72 
median 
3.67 
stddev 
1.41 
As you can notice the average and median values are in the same range of values.
Here is a graph which can help you visualize the ozone reading value in ascending order:
Data Distribution
Now let’s have a look at the data distribution using the NTILE function.
The following SQL will partition the data into 10 groups and get the same generic statistics as before but for each group:
with data as (
select ntile(10) over (order by reading asc) as tile, reading
from forecast_ozone
where reading is not null
)
select tile
, round(max(reading), 2) as max
, round(min(reading), 2) as min
, round(max(reading)  min(reading), 2) as "delta min/max"
, round(avg(reading), 2) as avg
, round(median(reading), 2) as median
, round(abs(avg(reading)  median(reading)), 2) as "delta avg/median"
, round(stddev(reading), 2) as stddev
from data
group by tile
The result should be:
tile  max  min  delta  avg 
median  delta  stddev 

1  1.92  1.17  0.75  1.62  1.71  0.09  0.23 
2  2.42  1.94  0.48  2.2  2.25  0.05  0.15 
3  2.81  2.42  0.39  2.59  2.58  0.01  0.13 
4  3.29  2.81  0.48  3.05  3.06  0.01  0.15 
5  3.71  3.31  0.4  3.48  3.44  0.04  0.13 
6  4.13  3.71  0.42  3.9  3.86  0.04  0.15 
7  4.5  4.17  0.33  4.35  4.35  0  0.12 
8  4.88  4.52  0.36  4.73  4.76  0.03  0.13 
9  5.48  4.88  0.6  5.22  5.27  0.05  0.2 
10  8.13  5.5  2.63  6.33  6  0.33  0.88 
As you can notice, the last group have both the delta between min and max but also between average and median higher than any other groups.
The last groups most likely represent some peaks that you saw earlier in the graph.
Provide an answer to the question below then click on Validate.
As stated earlier, the Lag 1 And Cycles with or without White Noise has been built to analyze certain phenomenon in the data.
Each observation is characterized by 2 variables, a time and a signal value.
In this step, you will analyze the data with and without White Noise at the same time.
Visualize the data
Let’s have a look at the data using the following SQL:
select
l1cnn.time, l1cnn.signal as signal , l1cwn.signal as signal_wn, l1cnn.signal  l1cwn.signal as delta
from
forecast_lag_1_and_cycles l1cnn
join forecast_lag_1_and_cycles_and_wn l1cwn
on l1cnn.time = l1cwn.time
And here is the result in a graph:
As you can visually notice:
 the data set without white noise (in blue) is following a sine wave with some small irregularities at the end
 the data set with white noise (in red) tend to follow a sine wave too but with mush stronger irregularities
 the delta (in green) represent the White Noise between the two data
Dates & intervals
As the reading value is provided for certain dates, let’s have a look at date values using the following SQL:
select 'max' as indicator, to_varchar(max(time)) as value
from forecast_lag_1_and_cycles union all
select 'min' , to_varchar(min(time))
from forecast_lag_1_and_cycles union all
select 'delta days' , to_varchar(days_between(min(time), max(time)))
from forecast_lag_1_and_cycles union all
select 'count' , to_varchar(count(1))
from forecast_lag_1_and_cycles
indicator  value 

max  20020514 
min  20010101 
delta days  498 
count  499 
As you can notice, you have 499 data points spread across 498 days. This implies that data is available on a daily basis.
The same analysis is applicable to the dataset with white noise.
Generic statistics
Now, let’s have a look at some additional statistical elements using the following SQL:
with data as (
select l1cnn.signal as value_nn, l1cwn.signal as value_wn
from forecast_lag_1_and_cycles l1cnn join forecast_lag_1_and_cycles_and_wn l1cwn on l1cnn.time = l1cwn.time
)
select 'max' as indicator , round(max(value_nn), 2) as value_nn
, round(max(value_wn), 2) as value_wn from data union all
select 'min' , round(min(value_nn), 2)
, round(min(value_wn), 2) from data union all
select 'delta min/max' , round(max(value_nn)  min(value_nn), 2)
, round(max(value_wn)  min(value_wn), 2) from data union all
select 'avg' , round(avg(value_nn), 2)
, round(avg(value_wn), 2) from data union all
select 'median' , round(median(value_nn), 2)
, round(median(value_wn), 2) from data union all
select 'stddev' , round(stddev(value_nn), 2)
, round(stddev(value_wn), 2) from data
The result should be:
indicator  value without White Noise  value with White Noise 

max 
6.08  13.95 
min 
7.73  21.68 
delta min/max 
13.8  35.63 
avg 
0.41  3.22 
median 
0.3  2.91 
stddev 
3.93  7.87 
As you can notice the average and median values are in the same range of values for both datasets.
Here is a graph that can help you visualize the signal values in ascending order:
Data Distribution
Now let’s have a look at the data distribution using the NTILE function.
The following SQL will partition the data into 10 groups and get the same generic statistics as before but for each group:
with data as (
select ntile(10) over (order by signal asc) as tile, signal
from forecast_lag_1_and_cycles
where signal is not null
)
select tile
, round(max(signal), 2) as max
, round(min(signal), 2) as min
, round(max(signal)  min(signal), 2) as "delta min/max"
, round(avg(signal), 2) as avg
, round(median(signal), 2) as median
, round(abs(avg(signal)  median(signal)), 2) as "delta avg/median"
, round(stddev(signal), 2) as stddev
from data
group by tile
The result should be:
tile  max  min  delta  avg 
median  delta  stddev 

1  5.59  7.73  2.14  6.44  6.39  0.05  0.73 
2  4.57  5.58  1.01  5.07  5.04  0.03  0.31 
3  3.55  4.54  0.99  4.07  4.07  0.01  0.3 
4  1.94  3.52  1.58  2.81  2.87  0.06  0.46 
5  0.3  1.84  1.54  1.07  1.08  0.01  0.49 
6  1.51  0.28  1.79  0.67  0.71  0.04  0.55 
7  2.7  1.55  1.15  2.08  2.07  0.01  0.33 
8  3.81  2.75  1.06  3.31  3.31  0  0.35 
9  4.65  3.84  0.81  4.14  4.12  0.02  0.24 
10  6.08  4.66  1.42  5.33  5.34  0.01  0.36 
As you can notice, the deltas between the average and median are in the same range of value across all the tiles.
Now let’s do it for the data set with white noise.
with data as (
select ntile(10) over (order by signal asc) as tile, signal
from forecast_lag_1_and_cycles_and_wn
where signal is not null
)
select tile
, round(max(signal), 2) as max
, round(min(signal), 2) as min
, round(max(signal)  min(signal), 2) as "delta min/max"
, round(avg(signal), 2) as avg
, round(median(signal), 2) as median
, round(abs(avg(signal)  median(signal)), 2) as "delta avg/median"
, round(stddev(signal), 2) as stddev
from data
group by tile
Provide an answer to the question below then click on Validate.
As stated earlier, just like the Lag 1 And Cycles, Trend and Cyclic has been built to analyze certain phenomenon in the data.
Each observation is characterized by 2 variables, a time and a signal value.
In this step, you will analyze the data with and without White Noise at the same time.
Visualize the data
Let’s have a look at the data using the following SQL:
select
tcnn.time
, tcnn.signal as signal
, tcwn.signal as signal_wn
, tc4n.signal as signal_4n
, tcnn.signal  tcwn.signal as delta_wn
, tcnn.signal  tc4n.signal as delta_4n
from
forecast_trend_and_cyclic tcnn
join forecast_trend_and_cyclic_and_wn tcwn on tcnn.time = tcwn.time
join forecast_trend_and_cyclic_and_4wn tc4n on tcnn.time = tc4n.time
And here is the result in a graph:
As you can visually notice:
 the data set without white noise (in blue) is following a sine wave
 the data set with white noise (in red) tend to follow a sine wave too but with some irregularities
 the data set with 4 time white noise (in yellow) tend to follow a sine wave too but with stronger irregularities
 they all have a positive (increasing) trend
Dates & intervals
As the ozone reading value is provided for a certain date, let’s have a look at date values using the following SQL:
select 'max' as indicator, to_varchar(max(time)) as value
from forecast_trend_and_cyclic union all
select 'min' , to_varchar(min(time))
from forecast_trend_and_cyclic union all
select 'delta days' , to_varchar(days_between(min(time), max(time)))
from forecast_trend_and_cyclic union all
select 'count' , to_varchar(count(1))
from forecast_trend_and_cyclic
indicator  value 

max  20020515 
min  20010101 
delta days  499 
count  500 
As you can notice, you have 500 data points spread across 499 days. This implies that data is available on a daily basis.
The same analysis is applicable to the dataset with white noise.
Generic statistics
Now, let’s have a look at some additional statistical elements using the following SQL:
with data as (
select l1cnn.signal as value_nn, l1cwn.signal as value_wn
from forecast_lag_1_and_cycles l1cnn join forecast_lag_1_and_cycles_and_wn l1cwn on l1cnn.time = l1cwn.time
)
select 'max' as indicator , round(max(value_nn), 2) as value_nn
, round(max(value_wn), 2) as value_wn from data union all
select 'min' , round(min(value_nn), 2)
, round(min(value_wn), 2) from data union all
select 'delta min/max' , round(max(value_nn)  min(value_nn), 2)
, round(max(value_wn)  min(value_wn), 2) from data union all
select 'avg' , round(avg(value_nn), 2)
, round(avg(value_wn), 2) from data union all
select 'median' , round(median(value_nn), 2)
, round(median(value_wn), 2) from data union all
select 'stddev' , round(stddev(value_nn), 2)
, round(stddev(value_wn), 2) from data
The result should be:
indicator  value without White Noise  value with White Noise  value with 4 x White Noise 

max 
159.84  161.12  167.49 
min 
0.83  0.54  3.28 
delta min/max 
159.01  161.66  170.76 
avg 
75.47  75.48  75.5 
median 
81.76  80.61  78.96 
stddev 
43.21  43.33  43.78 
As you can notice the average and median values are all in the same range of values for each datasets.
Here is a graph which can help you visualize the signal values in ascending order:
Data Distribution
Now let’s have a look at the data distribution using the NTILE function.
The following SQL will partition the data into 8 groups and get the same generic statistics as before but for each group:
with data as (
select ntile(8) over (order by signal asc) as tile, signal
from forecast_trend_and_cyclic
where signal is not null
)
select tile
, round(max(signal), 2) as max
, round(min(signal), 2) as min
, round(max(signal)  min(signal), 2) as "delta min/max"
, round(avg(signal), 2) as avg
, round(median(signal), 2) as median
, round(abs(avg(signal)  median(signal)), 2) as "delta avg/median"
, round(stddev(signal), 2) as stddev
from data
group by tile
The reason you are using 8 tiles here is because the signal has 4 waves. Each wave is made of a declining and rising part, so 8 parts in total. And our goal here is to assess if these parts are more or less with the same shape.
The result should be:
tile  max  min  delta  avg 
median  delta  stddev 

1  18.14  0.83  17.31  13.69  15.33  1.64  4.56 
2  41.71  18.27  23.43  24.68  20.52  4.17  7.06 
3  54.4  42.49  11.91  51.03  51.35  0.32  2.73 
4  82.79  54.55  28.24  63.37  58.84  4.53  9.25 
5  90.61  83.45  7.16  87.7  87.51  0.19  1.69 
6  121.59  90.64  30.95  101.74  98.68  3.05  10.63 
7  126.5  121.6  4.9  123.78  123.61  0.16  1.62 
8  159.84  126.64  33.2  140.18  138.62  1.56  11.66 
As you can notice, the deltas between the average and median are either really small or large. The small values correspond to slowly rising phases whereas the larger values relates to steeper phases of the curve.
Now let’s do it for the data set with white noise.
with data as (
select ntile(8) over (order by signal asc) as tile, signal
from forecast_trend_and_cyclic_and_wn
where signal is not null
)
select tile
, round(max(signal), 2) as max
, round(min(signal), 2) as min
, round(max(signal)  min(signal), 2) as "delta min/max"
, round(avg(signal), 2) as avg
, round(median(signal), 2) as median
, round(abs(avg(signal)  median(signal)), 2) as "delta avg/median"
, round(stddev(signal), 2) as stddev
from data
group by tile
Provide an answer to the question below then click on Validate.
Based on this series of elements, you have found out that these datasets :
 may include some peaks or deeps that might be outliers data
 looking at peaks or the overall graph, you can easily notice the presence of seasonal or cyclic effect
 a trend in the data can easily be spotted in a graph compared to raw data
These findings can drive the way you will use one algorithm or another.
Off course this analysis is not complete, but is provided here to help you understand the importance of this activity.
Note If you are using Jupyter Notebook, you can download the following notebooks to run most of the SQL statement listed in the tutorial: