Time Series Analysis and Forecasting

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values. While regression analysis is often employed in such a way as to test theories that the current values of one or more independent time series affect the current value of another time series, this type of analysis of time series is not called “time series analysis”, which focuses on comparing values of a single time series or multiple dependent time series at different points in time.

R has extensive facilities for analyzing time series data: creation of a time series, seasonal decompostion, modeling with exponential and ARIMA models, and forecasting with the forecast package.

In order to analyse time series, the data first has to be read into R and then plotted. We can read data into R using the scan() function, which assumes that the data for successive time points is in a simple text file with one column.

For example, the file http://robjhyndman.com/tsdldata/misc/kings.dat contains data on the age of death of successive kings of England, starting with William the Conqueror (original source: Hipel and McLeod, 1994).

The first three lines contain comments on the data, which we want to ignore when we read the data into R. We can do this with the "skip" parameter of the scan() function, which specifies how many lines at the top of the file to ignore. To read the file into R, ignoring the first three lines, we type:

kings <- scan("http://robjhyndman.com/tsdldata/misc/kings.dat", skip=3)

The age of death of 42 successive kings of England has been read into the variable ‘kings’.

The next step is to store the data in a time series object in R, so that you can use R’s many functions for analysing time series data. To store the data in a time series object, we use the ts() function in R. For example, to store the data in the variable ‘kings’ as a time series object in R, we type:

kingstimeseries <- ts(kings)

The result we get:

Time Series:
Start = 1
End = 42
Frequency = 1
[1] 60 43 67 50 56 42 50 65 68 43 65 34 47 34 49 41 13 35 53 56 16 43 69 59 48 59 86 55 68 51
[31] 33 49 67 77 81 67 71 81 68 70 77 56

We can plot the time series of the age of death of the 42 kings by executing the following command:

plot.ts(kingstimeseries)
We can see from the time plot that this time series could probably be described using an additive model, since the random fluctuations in the data are roughly constant in size over time.

If the time series data has been collected at regular intervals of less than one year, you can specify the number of times that data was collected per year by using the "frequency" parameter in the ts() function. For monthly time series data you set frequency=12, while for quarterly time series data you set frequency=4.

You can also specify the first year that the data was collected, and the first interval in that year by using the ‘start’ parameter in the ts() function. For example, if the first data point corresponds to the second quarter of 1986, we would set start=c(1986,2).

I will demonstrate this using a data set of the number of births per month in New York City, from January 1946 to December 1959. This data is available in the file http://robjhyndman.com/tsdldata/data/nybirths.dat. We can read the data into R, and store it as a time series object, by typing:

births <- scan("http://robjhyndman.com/tsdldata/data/nybirths.dat")

birthstimeseries <- ts(births, frequency=12, start=c(1946,1))

To plot the data we type the following command:

plot.ts(birthstimeseries)
We can see from this time series that there seems to be seasonal variation in the number of births per month: there is a peak every summer, and a trough every winter. Again, it seems that this time series could probably be described using an additive model, as the seasonal fluctuations are roughly constant in size over time and do not seem to depend on the level of the time series, and the random fluctuations also seem to be roughly constant in size over time.

The file http://robjhyndman.com/tsdldata/data/fancy.dat contains monthly sales for a souvenir shop at a beach resort town in Queensland, Australia, for January 1987-December 1993 (original data from Wheelwright and Hyndman, 1998). We can read the data into R by typing:

souvenir <- scan("http://robjhyndman.com/tsdldata/data/fancy.dat")

Read 84 items

souvenirtimeseries <- ts(souvenir, frequency=12, start=c(1987,1))

To plot the data we type the following command:

plot.ts(souvenirtimeseries)
It appears that an additive model is not appropriate for describing this time series, since the size of the seasonal fluctuations and random fluctuations seem to increase with the level of the time series. We transform the time series in order to get a transformed time series that can be described using an additive model. For example, we can transform the time series by calculating the natural log of the original data:

logsouvenirtimeseries <- log(souvenirtimeseries)

plot.ts(logsouvenirtimeseries)
Here we can see that the size of the seasonal fluctuations and random fluctuations in the log-transformed time series seem to be roughly constant over time, and do not depend on the level of the time series. Thus, the log-transformed time series can probably be described using an additive model.

Forecasts using Exponential Smoothing

Exponential smoothing can be used to make short-term forecasts for time series data.

The simple exponential smoothing method provides a way of estimating the level at the current time point. Smoothing is controlled by the parameter alpha, which determines the weight placed on the most recent observation when estimating the current level. The value of alpha lies between 0 and 1; values of alpha close to 0 mean that little weight is placed on the most recent observations when making forecasts of future values.
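The update behind simple exponential smoothing can be sketched in a few lines of base R. This is an illustrative toy, not the HoltWinters() implementation; the function name ses_level and the sample values are made up for the example:

```r
# Simple exponential smoothing sketch:
#   level_t = alpha * x_t + (1 - alpha) * level_{t-1}
# (ses_level is a hypothetical helper, not the forecast package's ses())
ses_level <- function(x, alpha) {
  level <- x[1]                      # initialise the level at the first observation
  for (t in 2:length(x)) {
    level <- alpha * x[t] + (1 - alpha) * level
  }
  level                              # the final level is the one-step-ahead forecast
}

ses_level(c(10, 12, 11, 13), alpha = 0.5)   # with alpha = 0.5 this gives 12
```

With alpha near 0 the level barely moves from its starting value; with alpha near 1 it tracks the latest observation closely.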

For example, the file http://robjhyndman.com/tsdldata/hurst/precip1.dat contains total annual rainfall in inches for London, from 1813-1912 (original data from Hipel and McLeod, 1994). We can read the data into R and plot it by typing:

rain <- scan("http://robjhyndman.com/tsdldata/hurst/precip1.dat", skip=1)

Read 100 items

rainseries <- ts(rain, start=c(1813))

plot.ts(rainseries)
To make forecasts using simple exponential smoothing in R, we can fit a simple exponential smoothing predictive model using the “HoltWinters()” function in R. To use HoltWinters() for simple exponential smoothing, we need to set the parameters beta=FALSE and gamma=FALSE in the HoltWinters() function (the beta and gamma parameters are used for Holt’s exponential smoothing, or Holt-Winters exponential smoothing, as described below).

The HoltWinters() function returns a list variable that contains several named elements.

For example, to use simple exponential smoothing to make forecasts for the time series of annual rainfall in London, we type:

rainseriesforecasts <- HoltWinters(rainseries, beta=FALSE, gamma=FALSE)

Smoothing parameters:
alpha: 0.02412151
beta : FALSE
gamma: FALSE

Coefficients:
a 24.67819

The output of HoltWinters() tells us that the estimated value of the alpha parameter is about 0.024. This is very close to zero, telling us that the forecasts are based on both recent and less recent observations (although somewhat more weight is placed on recent observations).

By default, HoltWinters() just makes forecasts for the same time period covered by our original time series. In this case, our original time series included rainfall for London from 1813-1912, so the forecasts are also for 1813-1912.

In the example above, we have stored the output of the HoltWinters() function in the list variable "rainseriesforecasts". The forecasts made by HoltWinters() are stored in a named element of this list variable called "fitted", so we can get their values by typing:

rainseriesforecasts$fitted
We can plot the original time series against the forecasts by typing:

plot(rainseriesforecasts)
The plot shows the original time series in black, and the forecasts as a red line. The time series of forecasts is much smoother than the time series of the original data here.

K-Means and KNN Algorithms in R

Machine learning is a branch of computer science that studies the design of algorithms that can learn. Typical machine learning tasks are concept learning, function learning or "predictive modeling", clustering, and finding predictive patterns.

These tasks are learned from available data, for example data observed through experience or instruction. The hope is that incorporating this experience into the tasks will eventually improve the learning, ultimately to the point where it becomes automatic and humans no longer need to interfere.

In this blog I will use R to work with two machine learning algorithms: KNN (k-nearest neighbours) and K-means.

R has a wide variety of functions for cluster analysis. In my project I will test the K-means clustering algorithm and the KNN classification algorithm, using RStudio.

In R’s partitioning approach, observations are divided into K groups and reshuffled to form the most cohesive clusters possible according to a given criterion. The most common partitioning method is the K-means cluster analysis. Conceptually, the K-means algorithm:

  1. Selects K centroids (K rows chosen at random)
  2. Assigns each data point to its closest centroid
  3. Recalculates the centroids as the average of all data points in a cluster (i.e., the centroids are p-length mean vectors, where p is the number of variables)
  4. Assigns data points to their closest centroids
  5. Continues steps 3 and 4 until the observations are not reassigned or the maximum number of iterations (R uses 10 as a default) is reached.
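The five steps above can be sketched in base R. This is a toy, Lloyd-style implementation for illustration only (R's kmeans() actually uses the more efficient Hartigan-Wong algorithm); simple_kmeans is a hypothetical name and empty clusters are not handled:

```r
# Minimal K-means sketch following the five steps in the text.
simple_kmeans <- function(X, k, max_iter = 10) {
  # step 1: select k centroids (k rows chosen at random)
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in 1:max_iter) {
    # steps 2 and 4: assign each data point to its closest centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k, drop = FALSE]
    new_assignment <- apply(d, 1, which.min)
    # step 5: stop when no observation is reassigned
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # step 3: recalculate centroids as per-cluster mean vectors
    for (j in 1:k)
      centers[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
  }
  list(cluster = assignment, centers = centers)
}
```

On two well-separated groups of points, this loop converges to the obvious two-cluster partition within a few iterations.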

Implementation details for this approach can vary. R uses an efficient algorithm by Hartigan and Wong (1979) that partitions the observations into k groups such that the sum of squares of the observations to their assigned cluster centers is a minimum.

The KNN or k-nearest neighbours algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labelled instances. The distance (usually Euclidean distance) between the stored data and the new instance serves as the similarity measure, and this similarity is used to perform predictive modelling. Predictive modelling is either classification, assigning a label or a class to the new instance, or regression, assigning a value to the new instance.
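The idea can be sketched in base R. This is a toy version of what the class package's knn() does efficiently; knn_predict is a hypothetical name introduced for the example:

```r
# Minimal k-nearest-neighbours classifier sketch.
knn_predict <- function(train, labels, newpoint, k = 3) {
  # Euclidean distance from the new instance to every stored instance
  d <- sqrt(rowSums(sweep(train, 2, newpoint)^2))
  nearest <- order(d)[1:k]           # indices of the k closest training rows
  votes <- table(labels[nearest])    # majority vote among their labels
  names(votes)[which.max(votes)]     # classification: the winning label
}

# e.g. two labelled groups; a point near the "a" group is classified "a"
train  <- matrix(c(0, 0,  0, 1,  5, 5,  5, 6), ncol = 2, byrow = TRUE)
labels <- c("a", "a", "b", "b")
knn_predict(train, labels, c(0, 0.5), k = 3)   # returns "a"
```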


In order to apply machine learning algorithms I took my data from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling

The data represents users’ knowledge level, based on the following attributes:

STG (The degree of study time for goal object materials) (input value)
SCG (The degree of repetition number of user for goal object materials) (input value)
STR (The degree of study time of user for related objects with goal object) (input value)
LPR (The exam performance of user for related objects with goal object) (input value)
PEG (The exam performance of user for goal objects) (input value)
UNS (The knowledge level of user) (target value)
Very Low: 24
Low: 83
Middle: 88
High: 63

Data contains 258 observations of 6 variables.

K-means algorithm in R

In order to test the K-means algorithm on the selected dataset, the following packages are installed: "NbClust", "flexclust", "cluster".

Read data in R:

grades = read.csv("C:/Users/Natalia/Desktop/CA3_Darren/grades.csv", header = TRUE)

Transform the values in the target column to numeric:

grades_trans <- transform(grades, UNS = as.numeric(UNS))

Since the variables vary in range, they are standardized prior to clustering:

df_grades<- scale(grades_trans)

The number of clusters is determined using the wssplot() and NbClust() functions. A plot of the total within-groups sum of squares against the number of clusters in a K-means solution can be used: a bend in the graph can suggest the appropriate number of clusters. The graph can be produced by the following function:

wssplot <- function(df_grades, nc=15, seed=1234){
  wss <- (nrow(df_grades)-1)*sum(apply(df_grades, 2, var))
  for (i in 2:nc){
    set.seed(seed)
    wss[i] <- sum(kmeans(df_grades, centers=i)$withinss)}
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")}



Criteria provided by the NbClust package suggest a 2-cluster solution:

nc <- NbClust(df_grades, min.nc=2, max.nc=15, method="kmeans")

***** Conclusion *****

According to the majority rule, the best number of clusters is 2.

Number of clusters:  0  2  3  4  5  6  8  9 10 11 12 14 15
Number of criteria:  2  8  2  2  1  1  2  1  2  1  1  2  1

A final cluster solution is obtained with the kmeans() function and the cluster centroids are printed. The criteria votes above can be visualised with a bar plot:

barplot(table(nc$Best.n[1,]), xlab="Number of Clusters",
        ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")


I tested 2-cluster and 3-cluster solutions; the 3-cluster solution provides more accurate results.

fit.km <- kmeans(df_grades, 3, nstart=25)

fit.km$size

[1] 102  93  63



aggregate(grades_trans, by=list(cluster=fit.km$cluster), mean)


A cross-tabulation of UNS (target variable) and cluster membership is given by the following command:

ct.km <- table(grades_trans$UNS, fit.km$cluster)

     1  2  3
  1  0  0 63
  2 60 23  0
  3 21 67  0
  4 21  3  0

The adjusted Rand index provides a measure of the agreement between two partitions, adjusted for chance. It ranges from -1 (no agreement) to 1 (perfect agreement).
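The adjusted Rand index can be computed directly from a cross-tabulation such as the one above. For illustration, here is a base-R sketch of the formula (adjusted_rand is a hypothetical helper; in practice the flexclust package's randIndex() does this):

```r
# Adjusted Rand index from a contingency table of two partitions:
#   ARI = (sum_ij C(n_ij, 2) - E) / ((a + b)/2 - E),  E = a*b / C(n, 2)
adjusted_rand <- function(tab) {
  n   <- sum(tab)
  nij <- sum(choose(tab, 2))            # pairs agreeing within cells
  a   <- sum(choose(rowSums(tab), 2))   # pairs together in partition 1
  b   <- sum(choose(colSums(tab), 2))   # pairs together in partition 2
  expected <- a * b / choose(n, 2)      # chance-level agreement
  (nij - expected) / ((a + b) / 2 - expected)
}

adjusted_rand(table(c(1, 1, 2, 2), c(1, 1, 2, 2)))   # identical partitions give 1
```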




Agreement between the grades target variable and the cluster solution can be quantified with the randIndex() function from the flexclust package installed earlier:

randIndex(ct.km)

The adjusted Rand index here is 0.50, so there is still room to improve the model.

KNN algorithm in R

K-Nearest Neighbors (KNN) Classification: the "knn" method from the "class" package can be used for KNN modeling.

In order to apply the KNN algorithm in R on the selected dataset, the following packages are installed: "gmodels", "ggvis", "class".

Scatter plots can give an idea if there is any correlation between two variables.

grades %>% ggvis(~PEG,~LPR, fill = ~UNS) %>% layer_points()


A quick look at the UNS attribute through the table() function tells you that the division of the UNS is 63-83-88-24:

table(grades$UNS)

    High      Low   Middle very_low
      63       83       88       24

To check the percentage division of the UNS attribute, a command for a table of proportions can be executed:

round(prop.table(table(grades$UNS)) * 100, digits = 1)

    High      Low   Middle very_low
    24.4     32.2     34.1      9.3

Normalization makes it easier for the KNN algorithm to learn. For this purpose the following commands are executed:

normalize <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}

grades_norm <- as.data.frame(lapply(grades[1:5], normalize))

The data is divided into a 70% training sample and a 30% testing sample with the following commands:

ind <- sample(2, nrow(grades), replace=TRUE, prob=c(0.7, 0.3))
grades.training <- grades[ind==1, 1:5]
grades.test <- grades[ind==2, 1:5]
grades.trainLabels <- grades[ind==1, 6]
grades.testLabels <- grades[ind==2, 6]

Next, the prediction model is applied:

grades_pred <- knn(train = grades.training, test = grades.test,
                   cl = grades.trainLabels, k=3)

The CrossTable() command is executed to analyze the results:

CrossTable(x = grades.testLabels, y = grades_pred, prop.chisq=FALSE)


From this table I can derive the number of correct and incorrect predictions, and I can see that some instances were misclassified. I conclude that further testing is necessary to improve the model.

Conclusions and recommendations

In this project I applied two machine learning algorithms: K-means for clustering and KNN for classification. Both algorithms were tested in RStudio. Cluster and prediction analysis are broad topics, and R has some of the most comprehensive facilities currently available for applying these methodologies.

Regarding the accuracy results of both models, I came to the conclusion that a bigger dataset and further testing would be the way to improve them.


  1. T. Kahraman, Sagiroglu S., Colak I. (2013) Developing intuitive knowledge classifier and modelling of users’ domain dependent data in web, Knowledge Based Systems

K-means Clustering (from “R in Action”). Available at: http://www.r-statistics.com/2013/08/k-means-clustering-from-r-in-action/ (Accessed: 29 March 2016)

Machine Learning in R for beginners. Available at: https://www.datacamp.com/community/tutorials/machine-learning-in-r/ (Accessed: 28 March 2016)

Analysis of Variance (ANOVA) in R

This blog post is about analysis of the Airbnb dataset and the model applied to it, ANOVA.

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data.

To apply analysis of variance to the data, the aov() function can be used in R, followed by the summary() method to produce the usual analysis of variance table.
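As a minimal illustration of this aov() + summary() workflow, using PlantGrowth (a small dataset shipped with base R: plant weights under a control and two treatment conditions) rather than the Airbnb data:

```r
# One-way ANOVA: does mean plant weight differ between the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)   # prints the ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
```

The group factor has three levels, so its row in the table has 2 degrees of freedom; the p-value in Pr(>F) is what is checked against the significance level.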

The Airbnb dataset is the real-world data that comes from Airbnb company’s kaggle competition https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings/data. It contains a list of users along with their demographics, web session records, and some summary statistics.

In order to build the ANOVA model, the data is first cleaned in RStudio.

Airbnb dataset used in this report contains 213,451 records. Each record has 16 properties:

  • id
  • date account created
  • timestamp first active
  • date first booking
  • gender
  • age
  • signup method
  • signup flow
  • language
  • affiliate channel
  • affiliate provider
  • first affiliate tracked
  • signup app
  • first device type
  • first browser
  • country destination

The dataset was first loaded into RapidMiner to make some exploratory analyses. I can see that there are a lot of missing values in the data set.

Users’ language: most users speak English – 206,314. This is not surprising, since Airbnb is a company located in US and its customers are mostly Americans.

Users' age: unreal figures, such as a maximum age of 2014, are displayed, which leads to the conclusion that this data has to be cleaned.

Users’ gender: there are a lot of missing values for gender. 95,688 of the users didn’t input their gender (unknown), 282 input it as other.

Users’ country of destination: most people ended up booking nothing which is indicated as NDF – 124,543. Among the users who have booked in the Airbnb, US is the most popular choice.

Users’ first browser: Chrome is the most popular choice – 63,845, but as first device type Mac Desktop shows the highest number – 89,600.

Cleaning Data

Before I apply the model, I clean the data first (delete all the missing values and unnecessary symbols, set age group to be between 18 and 90 years old.) The importance of having clean and reliable data in any statistical analysis cannot be stressed enough. Often, in real-world applications the analyst may get mesmerised by the complexity or beauty of the method being applied, while the data itself may be unreliable and lead to results which suggest courses of action without a sound basis. Below is the script in R applied to clean the data.

To delete all NA values the following command was executed:

airbnb = na.omit(airbnb)

The number of observations is now 68,171 of 16 variables.

The structure and data are viewed and examined:

str(airbnb)

The following command deletes the unnecessary values, in this case the 'NDF' rows from the countries column (although deleting the NA values earlier removed the NDF rows as well):

airbnb_no_NDF <- airbnb[airbnb$country_destination != 'NDF', ]

Some unnecessary columns are deleted from the data set:

airbnb_no_NDF <- airbnb_no_NDF[-1]
airbnb_no_NDF<- airbnb_no_NDF[-2]

Before ANOVA is applied, a new column country_num is created, encoding the countries as numeric values, which will be treated as the dependent variable in ANOVA.

airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "US"] <- 1
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "FR"] <- 2
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "CA"] <- 3
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "GB"] <- 4
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "ES"] <- 5
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "IT"] <- 6
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "PT"] <- 7
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "DE"] <- 8
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "NL"] <- 9
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "AU"] <- 10
airbnb_no_NDF$country_num[airbnb_no_NDF$country_destination == "other"] <- 11

The summary command for the age column was executed; it is obvious that this column has to be cleaned, removing unreal values such as an age of 2014.

summary (airbnb_no_NDF$age)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   28.00   33.00   47.86   42.00 2014.00

From the age column, all values outside the range 18 to 90 are deleted (only ages between 18 and 90 are kept):

airbnb2 <- subset(airbnb_no_NDF, age > 18 & age < 90)

The mean age before and after cleaning is compared:

summary(airbnb_no_NDF$age)

summary(airbnb2$age)

It drops from about 48 to about 36.

To remove unnecessary characters, such as the hyphens in "-unknown-", the following command was executed:

airbnb_test <- as.data.frame(sapply(airbnb2, gsub, pattern="-", replacement=""))

Categorical variables were converted to numeric by column name:

airbnb_test <- transform(airbnb_test, country_num = as.numeric(country_num))

To check that the column was converted to a numeric value, the following command was executed:

is.numeric(airbnb_test$country_num)

[1] TRUE




After the data was cleaned, I want to run an analysis of variance, defining the null hypothesis as there being no difference in distribution between countries of destination and the other variables, and the alternative hypothesis as there being a difference. I assume a normal distribution in order to apply this model.

airbnb_aov <- aov(country_num ~ age + gender + signup_method + language + affiliate_channel +
affiliate_provider + date_first_booking + signup_app + first_device_type + first_browser, data=airbnb_test)

When I run the summary command for my ANOVA model I receive the following results:


                      Df Sum Sq Mean Sq F value   Pr(>F)
age                   70   1990   28.43   5.981  < 2e-16 ***
gender                 3     71   23.63   4.970 0.001895 **
signup_method          2     10    4.92   1.036 0.354888
language              22    489   22.21   4.671 2.17e-12 ***
affiliate_channel      7    968  138.27  29.087  < 2e-16 ***
affiliate_provider    15    198   13.21   2.779 0.000252 ***
date_first_booking  1937  13289    6.86   1.443  < 2e-16 ***
signup_app             3    232   77.31  16.264 1.46e-10 ***
first_device_type      8    197   24.58   5.171 1.79e-06 ***
first_browser         34    208    6.12   1.288 0.121541
Residuals          64308 305707    4.75
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the significance levels, I can see that age, language, affiliate_channel, affiliate_provider, date_first_booking, signup_app and first_device_type are the most significant variables.

Since the p-values are less than the 0.05 significance level, I reject H0 and accept H1: there is a difference in distribution between the country of destination and the other variables, with some having a greater impact than others.

Knowing that most of the countries of destination in the Airbnb dataset are the US, I create a new data frame excluding the US, and run ANOVA excluding the insignificant variables from the first ANOVA, to see if the result differs from the first model.

airbnb_no_US <- airbnb_test[airbnb_test$country_destination != 'US', ]


The new data set contains 19,290 observations of 16 variables. I want to check how excluding the US affected the language column:

table(airbnb_no_US$language)

By excluding the US there is a lot less English in the new data frame, the count going down from 64,701 to 18,776.


airbnb_aov2 <- aov(country_num ~ age + gender + language + affiliate_channel +
                   affiliate_provider + date_first_booking + signup_app + first_device_type,
                   data=airbnb_no_US)



                      Df Sum Sq Mean Sq F value   Pr(>F)
age                   70    790  11.291   2.106 2.05e-07 ***
gender                 3     80  26.727   4.986 0.001859 **
language              20    493  24.651   4.599 3.62e-11 ***
affiliate_channel      7    141  20.128   3.755 0.000451 ***
affiliate_provider    14    147  10.527   1.964 0.016661 *
date_first_booking  1703  11229   6.593   1.230 1.41e-09 ***
signup_app             3     20   6.533   1.219 0.301122
first_device_type      8    188  23.542   4.392 2.56e-05 ***
Residuals          17461  93601   5.361
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I can see that signup_app becomes insignificant in this model, and affiliate_provider is less significant than in the first model.

I write the two new data sets created in the process to csv files, since I might want them for further analysis:

write.csv(airbnb_test, file = "test_users.csv")
write.csv(airbnb_no_US, file = "test_users_2.csv")

Conclusion: Key Findings and Recommendations

The conclusion I can draw from the results of ANOVA applied to the two data frames, including and excluding the US, is that most of the variables have a significant impact on the country of destination. My recommendation would be to run further analyses applying different models, different data frames and different data-cleaning approaches, then compare the results and choose the model that best fits the Airbnb dataset.

References and Further Readings

Discussion Paper: An Introduction to Data Cleaning with R https://cran.r-project.org/doc/contrib/de_Jonge+van_der_Loo-Introduction_to_data_cleaning_with_R.pdf

R-Bloggers: A Data Cleaning Examples http://www.r-bloggers.com/a-data-cleaning-example/

Quick R: Anova http://www.statmethods.net/stats/anova.html

R Tutorial: Analyses of Variance http://www.r-tutor.com/elementary-statistics/analysis-variance

Anscombe’s Quartet

The proverb “A picture is worth 1000 words” is one you have probably heard more than once. A picture can also be worth 1000 data points.

In 1973, the statistician Francis Anscombe demonstrated the importance of graphing data before analyzing it, and the effect of outliers on statistical properties.[1] Anscombe's quartet shows how four sets of data with identical simple summary statistics can vary considerably when graphed. Each dataset consists of eleven (x, y) pairs, as follows:

      I               II              III              IV
   x      y        x      y        x      y        x      y
10.0   8.04     10.0   9.14     10.0   7.46      8.0   6.58
 8.0   6.95      8.0   8.14      8.0   6.77      8.0   5.76
13.0   7.58     13.0   8.74     13.0  12.74      8.0   7.71
 9.0   8.81      9.0   8.77      9.0   7.11      8.0   8.84
11.0   8.33     11.0   9.26     11.0   7.81      8.0   8.47
14.0   9.96     14.0   8.10     14.0   8.84      8.0   7.04
 6.0   7.24      6.0   6.13      6.0   6.08      8.0   5.25
 4.0   4.26      4.0   3.10      4.0   5.39     19.0  12.50
12.0  10.84     12.0   9.13     12.0   8.15      8.0   5.56
 7.0   4.82      7.0   7.26      7.0   6.42      8.0   7.91
 5.0   5.68      5.0   4.74      5.0   5.73      8.0   6.89

We can see that all the summary statistics are close to identical:

  • The average x value is 9 for each dataset
  • The average y value is 7.50 for each dataset
  • The variance for x is 11 and the variance for y is 4.12
  • The correlation between x and y is 0.816 for each dataset
  • A linear regression (line of best fit) for each dataset follows the equation y = 0.5x + 3

So far these four datasets appear to be pretty similar. But when we plot these four data sets on an x/y coordinate plane, we get the following results:

The first scatter plot (top left) shows what appears to be a simple linear relationship between two correlated variables that follow the assumption of normality.

The second graph (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear, and the Pearson correlation  coefficient is not relevant (a more general regression and the corresponding coefficient of determination would be more appropriate).

In the third graph (bottom left) the relationship is linear, but the regression line is thrown off by a single outlier, which exerts enough influence to alter the line and lower the correlation coefficient from 1 to 0.816 (a robust regression would have been called for).

Finally, the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.

Pearson correlation is a standardized measure of covariance ("co-variation"). The standardized values vary between -1 and +1, where +1 indicates a perfect positive (linear) relationship, -1 a perfect negative (linear) relationship, and 0 stands for no correlation at all. Whether a correlation is considered "weak" or "strong" depends on the study and the data, but usually a value above 0.3 to 0.5 is expected. If the data contains outliers that are a necessary part of the data (not random errors), the Pearson correlation should not be used. In such cases, Spearman's rank correlation can be used: a similar measure to the Pearson correlation, but one that is not influenced by outliers.
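A small base-R illustration of this point, using made-up data with one extreme outlier in an otherwise perfectly monotone relationship:

```r
x <- 1:10
y <- c(1:9, 100)                 # one extreme outlier in the final y value

cor(x, y, method = "pearson")    # pulled well below 1 by the single outlier
cor(x, y, method = "spearman")   # exactly 1: the ranks are still perfectly monotone
```

Because Spearman's correlation works on ranks, the outlier's magnitude is irrelevant; only its position in the ordering matters.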

The correlation coefficient is easily calculated by any statistical package, but the results are not meaningful unless there is a linear relationship between the variables. The distributions of the variables should also be approximately symmetrical for correlation to be meaningful. For example, the presence of outliers severely distorts the correlation coefficient, as seen in example 3 above.

We can see now how misleading correlation coefficients (or any other single statistics) can be. Therefore, graphical explorations are absolutely essential.

In order to validate Anscombe's work, I will multiply all the data points from the original data set by two to create a new set of data points:

      I               II              III              IV
   x      y        x      y        x      y        x      y
  20  16.08       20  18.28       20  14.92       16  13.16
  16  13.90       16  16.28       16  13.54       16  11.52
  26  15.16       26  17.48       26  25.48       16  15.42
  18  17.62       18  17.54       18  14.22       16  17.68
  22  16.66       22  18.52       22  15.62       16  16.94
  28  19.92       28  16.20       28  17.68       16  14.08
  12  14.48       12  12.26       12  12.16       16  10.50
   8   8.52        8   6.20        8  10.78       38  25.00
  24  21.68       24  18.26       24  16.30       16  11.12
  14   9.64       14  14.52       14  12.84       16  15.82
  10  11.36       10   9.48       10  11.46       16  13.78


We can see that all the summary statistics are close to identical again:

  • The average x value is 18 for each dataset
  • The average y value is 15 for each dataset
  • The variance for x is 44 and the variance for y is 16.5

But when we plot the new data we get the same result as above: four different graphs, with a dissimilar shape and distribution of data in each case.

Using R we can also calculate some statistical properties of the quartet. I will reshape the data set, which is available in R as anscombe:

anscombe2 <- with(anscombe, data.frame(
  x     = c(x1, x2, x3, x4),
  y     = c(y1, y2, y3, y4),
  group = gl(4, nrow(anscombe))
))

Here gl() is used to autogenerate the factor levels. We now have four sets of x-y data, from which we can easily calculate summary statistics using ddply() from the plyr package. In this case we calculate the mean and standard deviation of y and the correlation between x and y, and run a linear regression.

(stats <- ddply(anscombe2, .(group), summarize,
  mean = mean(y),
  std_dev = sd(y),
  correlation = cor(x, y),
  lm_intercept = lm(y ~ x)$coefficients[1],
  lm_x_effect = lm(y ~ x)$coefficients[2]
))
  group     mean  std_dev correlation lm_intercept lm_x_effect
1     1 7.500909 2.031568   0.8164205     3.000091   0.5000909
2     2 7.500909 2.031657   0.8162365     3.000909   0.5000000
3     3 7.500000 2.030424   0.8162867     3.002455   0.4997273
4     4 7.500909 2.030579   0.8165214     3.001727   0.4999091


Each of the statistics is almost identical between the groups. But let's take a look at the visualisation, using ggplot2:

(p <- ggplot(anscombe2, aes(x, y)) +
  geom_point() +
  facet_wrap(~ group))



Anscombe’s Quartet is often used to illustrate the importance of looking at a set of data graphically before analysing it under a particular assumed type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.

Visualizing our data allows us to revisit our summary statistics and recontextualize them as needed. For example, Dataset II from Anscombe’s Quartet demonstrates a strong relationship between x and y; it just doesn’t appear to be linear. So a linear regression was the wrong tool to use there, and we can try other regressions. Eventually, we’ll be able to revise this into a model that does a good job of describing our data and has a high degree of predictive power for future observations.
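As a concrete illustration, refitting Dataset II with a quadratic term recovers the curved relationship almost perfectly (a sketch using the built-in anscombe data):

```r
# Linear fit vs quadratic fit on Anscombe's Dataset II
fit_lin  <- lm(y2 ~ x2, data = anscombe)
fit_quad <- lm(y2 ~ poly(x2, 2), data = anscombe)

summary(fit_lin)$r.squared    # ~0.666: the linear model misses the curvature
summary(fit_quad)$r.squared   # very close to 1: the quadratic captures it
```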

Truly effective data analysis should consist of both numerical statistics and clear, clean visualizations. Anscombe closes his 1973 paper with this call to action: “The user is not showered with graphical displays. He can get them only with trouble, cunning and a fighting spirit.”


Multiple Linear Model

In this post I will build a model to predict current milk production from a set of measured variables.

The source csv file can be found at the following link:


Samples are taken once a month during milking. The period during which a cow gives milk is called lactation. The number of lactations is the number of times a cow has calved or given milk. The recommended management practice is to have the cow produce milk for about 305 days and then allow a 60-day rest period before the next lactation begins.

First I read the milk_production csv file into RStudio:

milk_prod <- read.csv('milk_production.csv', header = TRUE)

To check the structure of the data set I run the following command:

str(milk_prod)
This data set includes:

'data.frame': 199 obs. of 7 variables:
 $ CurrentMilk: int 45 86 50 42 61 93 91 90 53 84 ...
 $ Previous   : int 45 86 50 42 61 93 91 90 53 84 ...
 $ Fat        : num 5.5 4.4 6.5 7.4 3.8 4.2 2.9 4.7 2.5 4.3 ...
 $ Protein    : num 8.9 4.1 4 4.1 3.8 3 2.6 2.9 3.5 3.3 ...
 $ Days       : int 21 25 25 25 33 45 46 46 46 50 ...
 $ Lactation  : int 5 4 7 2 2 3 2 5 2 7 ...
 $ I79        : int 0 0 0 0 0 0 0 0 0 0 ...

To create multiple scatter plots of all pairs of variables I execute the following command:

pairs(milk_prod)

There appear to be some negative and positive linear relationships in the data. To look at the correlation matrix I execute the following command, rounding the numbers to two decimal places:

round( cor(milk_prod), 2)

            CurrentMilk Previous   Fat Protein  Days Lactation   I79
CurrentMilk        1.00     0.74 -0.12   -0.35 -0.47      0.12 -0.28
Previous           0.74     1.00 -0.16   -0.31 -0.32      0.09 -0.02
Fat               -0.12    -0.16  1.00    0.40 -0.06      0.10 -0.07
Protein           -0.35    -0.31  0.40    1.00  0.06      0.19 -0.10
Days              -0.47    -0.32 -0.06    0.06  1.00     -0.10  0.62
Lactation          0.12     0.09  0.10    0.19 -0.10      1.00 -0.22
I79               -0.28    -0.02 -0.07   -0.10  0.62     -0.22  1.00

For the model I’ll start with all the variables and then remove those that aren’t contributing much, with CurrentMilk as the dependent variable.

milk_prod.fit1 <- lm( CurrentMilk ~ Previous + Fat + Protein + Days + Lactation + I79, data=milk_prod)

Now I want to inspect my model:

summary(milk_prod.fit1)

Call:
lm(formula = CurrentMilk ~ Previous + Fat + Protein + Days +
    Lactation + I79, data = milk_prod)

Residuals:
    Min      1Q  Median      3Q     Max
-40.795  -6.047  -0.778   6.371  42.980

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  51.03037    7.30322   6.987 4.48e-11 ***
Previous      0.69846    0.05203  13.423  < 2e-16 ***
Fat           0.79788    0.95244   0.838 0.403224
Protein      -6.41555    1.60920  -3.987 9.51e-05 ***
Days         -0.03165    0.01548  -2.045 0.042231 *
Lactation     0.52677    0.54881   0.960 0.338347
I79         -10.29764    2.84711  -3.617 0.000381 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.67 on 192 degrees of freedom
Multiple R-squared: 0.6639, Adjusted R-squared: 0.6534
F-statistic: 63.2 on 6 and 192 DF, p-value: < 2.2e-16

We can see that the higher a variable’s p-value, the less that variable contributes to the model. Now I will build another model, removing the variables that aren’t significant:

milk_prod.fit2 <- lm( CurrentMilk ~ Days + Previous + Protein + I79, data=milk_prod)

summary(milk_prod.fit2)

Call:
lm(formula = CurrentMilk ~ Days + Previous + Protein + I79, data = milk_prod)

Residuals:
    Min      1Q  Median      3Q     Max
-41.548  -6.116  -0.509   6.164  46.098

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  52.74841    6.98285   7.554 1.62e-12 ***
Days         -0.03176    0.01531  -2.074 0.039399 *
Previous      0.70353    0.05099  13.798  < 2e-16 ***
Protein      -5.61924    1.46765  -3.829 0.000174 ***
I79         -10.77753    2.77815  -3.879 0.000143 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.66 on 194 degrees of freedom
Multiple R-squared: 0.6609, Adjusted R-squared: 0.6539
F-statistic: 94.54 on 4 and 194 DF, p-value: < 2.2e-16
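A more formal way to compare the two nested models is a partial F-test with anova(). The milk data are not reproduced here, so this sketch uses simulated data with one relevant and one irrelevant predictor:

```r
# Partial F-test between a reduced and a full nested model
set.seed(1)
n <- 199
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 2 + 3 * d$x1 + rnorm(n)           # x2 plays no role in y

full    <- lm(y ~ x1 + x2, data = d)
reduced <- lm(y ~ x1, data = d)
anova(reduced, full)                      # tests whether dropping x2 loses fit
```

A large p-value in the anova() table means the extra variable adds nothing significant, supporting the reduced model, exactly the reasoning used above for Fat and Lactation.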

What should be kept in or left out of a regression model has been investigated for many years, and there are a number of standard approaches. R has a built-in function, step(), which performs stepwise selection (using the AIC criterion) for us:

final.milk_prod <- step( milk_prod.fit1 )

The result is identical to the result for model 2:

summary(final.milk_prod)

Call:
lm(formula = CurrentMilk ~ Days + Previous + Protein + I79, data = milk_prod)

Residuals:
    Min      1Q  Median      3Q     Max
-41.548  -6.116  -0.509   6.164  46.098

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  52.74841    6.98285   7.554 1.62e-12 ***
Days         -0.03176    0.01531  -2.074 0.039399 *
Previous      0.70353    0.05099  13.798  < 2e-16 ***
Protein      -5.61924    1.46765  -3.829 0.000174 ***
I79         -10.77753    2.77815  -3.879 0.000143 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.66 on 194 degrees of freedom
Multiple R-squared: 0.6609, Adjusted R-squared: 0.6539
F-statistic: 94.54 on 4 and 194 DF, p-value: < 2.2e-16

Next I check whether the residuals are normally distributed with a normal Q-Q plot; ideally all the points should fall along the reference line, or as close to it as possible.

Model 1:

qqnorm( milk_prod.fit1$residuals, ylab = "Residuals" )
qqline( milk_prod.fit1$residuals, col = 'red', lty = 2 )


Model 2:
qqnorm( milk_prod.fit2$residuals, ylab = "Residuals" )
qqline( milk_prod.fit2$residuals, col = 'red', lty = 2 )


I can see that in both models not all the points seem to fall along the line.
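A numeric complement to the Q-Q plot is the Shapiro-Wilk normality test, shapiro.test(), applied to the model residuals. Since the milk data are not bundled here, this sketch contrasts normal residuals with clearly skewed ones:

```r
# Shapiro-Wilk: small p-values indicate departure from normality
set.seed(42)
normal_res <- rnorm(200)
skewed_res <- rexp(200) - 1               # strongly right-skewed

shapiro.test(normal_res)$p.value   # typically large: consistent with normality
shapiro.test(skewed_res)$p.value   # tiny: normality clearly rejected
```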

My conclusions are that the residuals show some departure from a normal distribution and we need to revise the model further.

Despite the relatively high adjusted R-squared value, which indicates that the model explains about 65% of the variability of the response data around its mean, the residual plots suggest there is some bias in our model.