Semester 2 Midsem Project

Multivariate Analysis on Foreign Tourist Visits to India

project
statistics
Authors

Aman Das [BS2206]

Raj Pratap Singh [BS2219]

Shreyansh Mukhopadhyay [BS2147]

Published

May 7, 2023

Introduction

Aim

Our Aim is to do conduct an exploratory data analysis on Number of Foreign Tourist Arrivals in India over a period of 2000-2019 by attempting to obtain suitable predictors and identify causal relations through fitting the data to Multiple Linear Regression Models (MLR).

Variables

Along with the response variable Number of Foreign Tourist Arrivals in India, we have 14 other covariates as our predictor variables.

description = as.data.frame(t(read.csv(file.path(".","data","FOREIGNTOURISTSeries.csv"), fileEncoding = 'UTF-8-BOM')[c(1, 2, 5, 6),]))
names(description) = description[1,]
description = description[-1,]
kable(description)
Variable Region Unit Source
indfortour Foreign Tourist Arrivals: Total India Person Ministry of Tourism
intpop Population: Total World Person World Bank
intfortour International Tourism: Number of Arrivals World Person World Bank
hotelrev NAS 2011-2012: Gross Value Added: 2011-2012p: Services: Trade, Repair, Hotel and Restaurant: Hotels and Restaurants India INR mn Ministry of Statistics and Programme Implementation
indpop Population India Person mn Ministry of Statistics and Programme Implementation
indexport Exports: Annual: USD India USD mn Reserve Bank of India
indimport Imports: Annual: USD India USD mn Reserve Bank of India
indforinv Foreign Investment Inflows India USD mn Reserve Bank of India
indforres Foreign Exchange Reserve: Annual India INR mn Reserve Bank of India
indpastra Passenger Traffic: International: To India India Person Directorate General of Civil Aviation
stattourexp All States: Capital Expenditure: Outlay: Developmental: Economic Services: General: Tourism India INR mn Reserve Bank of India
indtourexp Union Budget: Expenditure: Ministry of Tourism India INR mn Ministry of Finance
indcrim IPC Crime: Incidence Rate India % National Crime Records Bureau
inddep Indian National Departures India Person Ministry of Tourism
forbankprof Foreign Banks: Operating Profit India INR mn Reserve Bank of India

Data

d = read.csv(file.path(".","data","tourist_data.csv"), colClasses = "numeric", fileEncoding = 'UTF-8-BOM')

for (x in colnames(d)) {
  d[, x ] = as.numeric(d[, x ])
}
d = data.frame(d)
kable(d)
year indfortour intpop intfortour hotelrev indpop indexport indimport indforinv indforres indpastra stattourexp indtourexp indcrim inddep forbankprof
2000 2638813 6144322697 1332379079 364920.6 1001 36822.4 49670.7 5181 1659130 5886514 712.50 1323.4 176.7 3117974 27536.70
2001 2537282 6226339538 1299004508 390347.7 1019 44560.3 50536.5 5860 1972040 6018591 524.70 1500.8 172.3 3579052 31051.40
2002 2384364 6308092739 1319581840 420988.5 1040 43826.7 51413.3 8151 2640360 5830289 991.50 1746.3 169.5 4123163 35130.00
2003 2726214 6389383352 1300066793 445024.2 1056 52719.4 61412.1 6014 3614700 6373862 961.70 3018.9 160.7 4161575 37284.20
2004 3457477 6470821068 1436450364 483368.6 1072 63842.6 78149.1 15699 4901290 7040851 1245.70 4057.2 168.8 4955244 49874.50
2005 3918610 6552571570 1501611341 538520.8 1089 83535.9 111517.4 13103 6191160 8362204 2310.60 4991.2 165.3 4962309 45773.01
2006 4447167 6634935638 1683029832 616003.4 1106 103090.5 149165.7 16261 6763870 9736531 3325.30 8002.7 167.7 6073303 66584.40
2007 5081504 6717641730 1739388600 689938.8 1122 126414.1 185735.2 14640 8682220 13124178 4305.40 8343.2 175.1 7339375 96194.68
2008 5282603 6801408360 1750746690 759225.4 1138 162904.2 251439.2 43325 12379650 14116845 4973.60 9906.9 181.5 6989719 140474.18
2009 5167699 6885490816 1687198488 726308.3 1154 185295.0 303696.3 8311 12838650 14116845 6631.40 10294.1 181.4 6665865 200984.74
2010 5775692 6969631901 1755720046 724247.9 1170 178751.4 288372.9 50361 12596650 15700668 7904.50 10029.7 187.6 8280194 163011.68
2011 6309222 7053533350 1792688016 844334.2 1186 251136.2 369769.1 41597 13610130 17348027 7870.90 10545.3 192.2 9835221 163136.00
2012 6577745 7140895722 1881196825 899008.3 1220 305963.9 489319.5 39032 15061390 18882428 11885.10 11140.9 196.7 10412854 185734.43
2013 6967601 7229184551 1952484358 929550.0 1235 300400.6 490736.6 46708 15884200 19802245 13513.60 8369.1 215.5 10775570 204316.87
2014 7679099 7317508753 2012330419 925650.0 1251 314415.7 450213.6 26385 18283800 21170370 16532.80 8864.4 229.2 11484956 227366.28
2015 8027133 7404910892 2071785910 982012.8 1267 310352.0 448033.4 73458 21376400 22438105 20350.40 9433.1 234.2 12843838 252987.71
2016 8804411 7491934113 2140443306 1113046.0 1283 262291.1 381007.8 31890 23787400 24431931 21113.70 8782.5 233.6 12524188 245262.83
2017 10035803 7578157615 2248870724 1210921.9 1299 275852.4 384357.0 43224 23982000 26996263 23185.60 16311.9 237.7 13379452 266088.51
2018 10557976 7661776338 2339399881 1321908.5 1314 303526.2 465581.0 52401 27608500 30117721 23721.82 17657.0 236.7 14292672 242375.45
2019 10930355 7742681934 2403074088 1445436.2 1327 330078.1 514078.4 30095 28558815 31630713 26263.08 20906.2 241.1 15760664 267275.83

Elementary Analysis

Time Series

As the observations are over time, we observe the general trend in the variables through time series plots.

plots = lapply(colnames(d)[2:ncol(d)], function(.x) ggplot(d, aes(year, as.numeric(unlist(d[.x]))))+
                 geom_line()+labs(y=.x)+mytheme+
                 theme(axis.title.x = element_blank())
               )
require(gridExtra)
grid.arrange(grobs = plots, ncol = 3)

As we can see, most covariates including indfortour tend to increase with time.

Box Plots

We try to detect outliers through box plots.

plots = lapply(colnames(d)[2:ncol(d)], function(.x) ggplot(d, aes(year, as.numeric(unlist(d[.x]))))+
                 geom_boxplot(fill="#ffffff40")+coord_flip()+labs(y=.x)+mytheme+
                 theme(axis.text.y=element_blank(),
                       axis.title.y = element_blank(),
                       axis.text.x = element_blank())
               )
require(gridExtra)
grid.arrange(grobs=plots, ncol = 3)

There is only one potential outlier in indtourexp. But previous time series plot shows it is not an anomaly.

Correlation

Computed correlations and partial correlations with respect to indfortour.

pcorr = c()
corr = c()
for (x in c(3:ncol(d))) {
  v = c(3:ncol(d))
  v = v[! v %in% c(x)]
  pcorr = append(pcorr, partial.r(d, c(2,x) , v )[1,2] )
  corr = append(corr, cor(d[,2], d[,x]) )
}

corrtable = data.frame(
  row.names = "Variable",
  Variable = colnames(d)[3:ncol(d)],
  correl = corr,
  partcorr = pcorr
)
kable(
  corrtable,
  col.names = c(
    "Correlation",
    "Partial Correlation"
  )
)
Correlation Partial Correlation
intpop 0.9859420 0.5073311
intfortour 0.9927588 0.7349934
hotelrev 0.9926960 0.0968559
indpop 0.9815706 -0.3146587
indexport 0.9118314 0.5312616
indimport 0.8928863 -0.6215878
indforinv 0.6969831 0.1700049
indforres 0.9910628 -0.3355004
indpastra 0.9955932 0.6821046
stattourexp 0.9776340 0.2881789
indtourexp 0.9120386 0.2853752
indcrim 0.9409528 0.1842352
inddep 0.9856463 -0.5989536
forbankprof 0.9402061 -0.5676603

Correlation Matrix

cor(d)
                 year indfortour    intpop intfortour  hotelrev    indpop indexport indimport indforinv indforres indpastra stattourexp indtourexp   indcrim    inddep forbankprof
year        1.0000000  0.9845696 0.9999087  0.9834454 0.9792756 0.9991224 0.9528133 0.9378670 0.7401516 0.9875388 0.9854118   0.9630997  0.8899158 0.9268277 0.9885726   0.9686062
indfortour  0.9845696  1.0000000 0.9859420  0.9927588 0.9926960 0.9815706 0.9118314 0.8928863 0.6969831 0.9910628 0.9955932   0.9776340  0.9120386 0.9409528 0.9856463   0.9402061
intpop      0.9999087  0.9859420 1.0000000  0.9839476 0.9802495 0.9991834 0.9515490 0.9359304 0.7383361 0.9887937 0.9867042   0.9665202  0.8882334 0.9311207 0.9895588   0.9683597
intfortour  0.9834454  0.9927588 0.9839476  1.0000000 0.9873411 0.9811150 0.9195015 0.9048397 0.7109258 0.9835434 0.9879517   0.9627482  0.9122705 0.9254252 0.9820989   0.9402730
hotelrev    0.9792756  0.9926960 0.9802495  0.9873411 1.0000000 0.9755694 0.9080370 0.8960069 0.6725237 0.9880833 0.9935675   0.9657596  0.9306003 0.9181263 0.9809357   0.9305096
indpop      0.9991224  0.9815706 0.9991834  0.9811150 0.9755694 1.0000000 0.9573082 0.9417886 0.7414410 0.9847639 0.9827566   0.9650981  0.8770030 0.9311448 0.9886827   0.9679262
indexport   0.9528133  0.9118314 0.9515490  0.9195015 0.9080370 0.9573082 1.0000000 0.9943942 0.7653203 0.9189625 0.9288788   0.8976107  0.8060634 0.8922261 0.9483567   0.9533926
indimport   0.9378670  0.8928863 0.9359304  0.9048397 0.8960069 0.9417886 0.9943942 1.0000000 0.7578143 0.9022249 0.9148673   0.8717561  0.8095449 0.8615457 0.9293601   0.9424746
indforinv   0.7401516  0.6969831 0.7383361  0.7109258 0.6725237 0.7414410 0.7653203 0.7578143 1.0000000 0.7118332 0.7060397   0.6785585  0.5993293 0.6859846 0.7409505   0.7352888
indforres   0.9875388  0.9910628 0.9887937  0.9835434 0.9880833 0.9847639 0.9189625 0.9022249 0.7118332 1.0000000 0.9926676   0.9789349  0.8949415 0.9431069 0.9813113   0.9599994
indpastra   0.9854118  0.9955932 0.9867042  0.9879517 0.9935675 0.9827566 0.9288788 0.9148673 0.7060397 0.9926676 1.0000000   0.9779379  0.9089153 0.9455448 0.9890865   0.9510034
stattourexp 0.9630997  0.9776340 0.9665202  0.9627482 0.9657596 0.9650981 0.8976107 0.8717561 0.6785585 0.9789349 0.9779379   1.0000000  0.8335011 0.9794994 0.9758391   0.9338204
indtourexp  0.8899158  0.9120386 0.8882334  0.9122705 0.9306003 0.8770030 0.8060634 0.8095449 0.5993293 0.8949415 0.9089153   0.8335011  1.0000000 0.7460892 0.8774698   0.8381739
indcrim     0.9268277  0.9409528 0.9311207  0.9254252 0.9181263 0.9311448 0.8922261 0.8615457 0.6859846 0.9431069 0.9455448   0.9794994  0.7460892 1.0000000 0.9481170   0.9212193
inddep      0.9885726  0.9856463 0.9895588  0.9820989 0.9809357 0.9886827 0.9483567 0.9293601 0.7409505 0.9813113 0.9890865   0.9758391  0.8774698 0.9481170 1.0000000   0.9462154
forbankprof 0.9686062  0.9402061 0.9683597  0.9402730 0.9305096 0.9679262 0.9533926 0.9424746 0.7352888 0.9599994 0.9510034   0.9338204  0.8381739 0.9212193 0.9462154   1.0000000
  • As we can clearly see almost all of the factors are correlated with each other. Thus, are originally assumption that they must be mutually independent is wrong.
  • The response variable indfortour is significantly related to all predictor variables which means there are some latent relation within the predictors variables.
  • Computing Partial Correlation Matrix throws singular matrix error.

Multiple Linear Regression

Ordinary Least Squares Model

Initial Model using all the predictors.

model1 = lm(indfortour~intpop+intfortour+hotelrev+
              indpop+indexport+indimport+indforinv+indpastra+
              stattourexp+indtourexp+indcrim+inddep+forbankprof,
            d)
summary(model1)

Call:
lm(formula = indfortour ~ intpop + intfortour + hotelrev + indpop + 
    indexport + indimport + indforinv + indpastra + stattourexp + 
    indtourexp + indcrim + inddep + forbankprof, data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-182841  -48171   -6602   52282  151519 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept) -2.039e+07  8.547e+06  -2.386   0.0544 .
intpop       4.056e-03  3.724e-03   1.089   0.3179  
intfortour   2.174e-03  9.116e-04   2.384   0.0544 .
hotelrev    -5.486e-01  1.713e+00  -0.320   0.7596  
indpop      -7.266e+03  1.735e+04  -0.419   0.6899  
indexport    7.726e+00  6.200e+00   1.246   0.2592  
indimport   -5.014e+00  3.026e+00  -1.657   0.1487  
indforinv   -1.807e-01  3.380e+00  -0.053   0.9591  
indpastra    1.675e-01  8.354e-02   2.005   0.0918 .
stattourexp  1.564e+01  5.681e+01   0.275   0.7923  
indtourexp   5.544e+01  4.078e+01   1.360   0.2228  
indcrim      1.275e+04  1.465e+04   0.871   0.4174  
inddep      -2.022e-01  1.323e-01  -1.528   0.1773  
forbankprof -5.789e+00  2.844e+00  -2.035   0.0880 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 155800 on 6 degrees of freedom
Multiple R-squared:  0.999, Adjusted R-squared:  0.9967 
F-statistic: 442.6 on 13 and 6 DF,  p-value: 7.774e-08
  • There are no significant covariates.
  • Six coefficients are negative along with the intercept, and seven coefficients are positive.
  • The model is not statistically significant

Model Plot

p0 <- d %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = indfortour, color = "Original"), size = 1) +
  geom_line(aes(y = predict(model1), color = "Fitted"), size = 1) +
  scale_colour_manual("", 
                      breaks = c("Original", "Fitted"),
                      values = c("#cc241d", "#689d6a")) +
  mytheme +
  labs(subtitle = "Time series", color="Prediction")
p0

Predictor Selection

We exhaustively fit all subsets of the predictors, and use the model with highest adjusted R² for each subset size.

subsetfit = regsubsets(indfortour~., data=d[2:ncol(d)], method="exhaustive", nvmax = NULL)
summary_subsetfit = summary(subsetfit)
summary_subsetfit$which
   (Intercept) intpop intfortour hotelrev indpop indexport indimport indforinv indforres indpastra stattourexp indtourexp indcrim inddep forbankprof
1         TRUE  FALSE      FALSE    FALSE  FALSE     FALSE     FALSE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE  FALSE       FALSE
2         TRUE  FALSE       TRUE    FALSE  FALSE     FALSE     FALSE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE  FALSE       FALSE
3         TRUE  FALSE       TRUE    FALSE  FALSE     FALSE      TRUE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE  FALSE       FALSE
4         TRUE   TRUE       TRUE    FALSE  FALSE     FALSE      TRUE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE  FALSE       FALSE
5         TRUE   TRUE       TRUE    FALSE  FALSE      TRUE      TRUE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE  FALSE       FALSE
6         TRUE  FALSE       TRUE    FALSE   TRUE      TRUE      TRUE     FALSE     FALSE      TRUE       FALSE       TRUE   FALSE  FALSE       FALSE
7         TRUE   TRUE       TRUE    FALSE  FALSE      TRUE      TRUE     FALSE     FALSE      TRUE       FALSE      FALSE   FALSE   TRUE        TRUE
8         TRUE   TRUE       TRUE    FALSE  FALSE     FALSE      TRUE     FALSE     FALSE      TRUE       FALSE       TRUE    TRUE   TRUE        TRUE
9         TRUE   TRUE       TRUE    FALSE  FALSE      TRUE      TRUE     FALSE     FALSE      TRUE       FALSE       TRUE    TRUE   TRUE        TRUE
10        TRUE   TRUE       TRUE    FALSE  FALSE      TRUE      TRUE     FALSE      TRUE      TRUE       FALSE       TRUE    TRUE   TRUE        TRUE
11        TRUE   TRUE       TRUE    FALSE   TRUE      TRUE      TRUE     FALSE      TRUE      TRUE       FALSE       TRUE    TRUE   TRUE        TRUE
12        TRUE   TRUE       TRUE    FALSE   TRUE      TRUE      TRUE     FALSE      TRUE      TRUE        TRUE       TRUE    TRUE   TRUE        TRUE
13        TRUE   TRUE       TRUE    FALSE   TRUE      TRUE      TRUE      TRUE      TRUE      TRUE        TRUE       TRUE    TRUE   TRUE        TRUE
14        TRUE   TRUE       TRUE     TRUE   TRUE      TRUE      TRUE      TRUE      TRUE      TRUE        TRUE       TRUE    TRUE   TRUE        TRUE
best_subsetsize = which.max(summary_subsetfit$adjr2)

Selected Model with highest adjusted R²

We find that highest adjusted R² is attained using 9 predictors.

model2 <- lm(indfortour~intpop+intfortour+indexport+indimport+
               indpastra+indtourexp+indcrim+inddep+forbankprof,
             d)
summary(model2)

Call:
lm(formula = indfortour ~ intpop + intfortour + indexport + indimport + 
    indpastra + indtourexp + indcrim + inddep + forbankprof, 
    data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-212002  -44342    7828   58392  140355 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.838e+07  4.993e+06  -3.681  0.00424 **
intpop       2.476e-03  7.607e-04   3.255  0.00865 **
intfortour   1.967e-03  6.276e-04   3.133  0.01063 * 
indexport    7.394e+00  4.331e+00   1.707  0.11860   
indimport   -5.243e+00  2.297e+00  -2.282  0.04563 * 
indpastra    1.611e-01  5.208e-02   3.093  0.01138 * 
indtourexp   5.928e+01  2.779e+01   2.133  0.05868 . 
indcrim      1.551e+04  7.577e+03   2.047  0.06784 . 
inddep      -1.892e-01  9.298e-02  -2.035  0.06919 . 
forbankprof -5.159e+00  2.060e+00  -2.504  0.03123 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 123800 on 10 degrees of freedom
Multiple R-squared:  0.9989,    Adjusted R-squared:  0.9979 
F-statistic:  1013 on 9 and 10 DF,  p-value: 1.345e-13
  • The model has five significant variables, and 4 non-significant variables.

Correlation Matrix

cor(d[,c("indfortour", "intpop", "intfortour", "indexport", "indimport",
          "indpastra", "indtourexp", "indcrim", "inddep", "forbankprof")])
            indfortour    intpop intfortour indexport indimport indpastra indtourexp   indcrim    inddep forbankprof
indfortour   1.0000000 0.9859420  0.9927588 0.9118314 0.8928863 0.9955932  0.9120386 0.9409528 0.9856463   0.9402061
intpop       0.9859420 1.0000000  0.9839476 0.9515490 0.9359304 0.9867042  0.8882334 0.9311207 0.9895588   0.9683597
intfortour   0.9927588 0.9839476  1.0000000 0.9195015 0.9048397 0.9879517  0.9122705 0.9254252 0.9820989   0.9402730
indexport    0.9118314 0.9515490  0.9195015 1.0000000 0.9943942 0.9288788  0.8060634 0.8922261 0.9483567   0.9533926
indimport    0.8928863 0.9359304  0.9048397 0.9943942 1.0000000 0.9148673  0.8095449 0.8615457 0.9293601   0.9424746
indpastra    0.9955932 0.9867042  0.9879517 0.9288788 0.9148673 1.0000000  0.9089153 0.9455448 0.9890865   0.9510034
indtourexp   0.9120386 0.8882334  0.9122705 0.8060634 0.8095449 0.9089153  1.0000000 0.7460892 0.8774698   0.8381739
indcrim      0.9409528 0.9311207  0.9254252 0.8922261 0.8615457 0.9455448  0.7460892 1.0000000 0.9481170   0.9212193
inddep       0.9856463 0.9895588  0.9820989 0.9483567 0.9293601 0.9890865  0.8774698 0.9481170 1.0000000   0.9462154
forbankprof  0.9402061 0.9683597  0.9402730 0.9533926 0.9424746 0.9510034  0.8381739 0.9212193 0.9462154   1.0000000
  • Computing Partial Correlation Matrix throws singular matrix error.

Model Plot

p1 <- d %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = indfortour, color = "Original"), size = 1) +
  geom_line(aes(y = predict(model2), color = "Fitted"), size = 1) +
  scale_colour_manual("", 
                      breaks = c("Original", "Fitted"),
                      values = c("#cc241d", "#689d6a")) +
  mytheme +
  labs(subtitle = "Time series", color="Prediction")
p1

Selected Model with best p-value

We find that best p-value significance is attained using 4 predictors.

model3 <- lm(indfortour~intpop+intfortour+indimport+indpastra,
             d)
summary(model3)

Call:
lm(formula = indfortour ~ intpop + intfortour + indimport + indpastra, 
    data = d)

Residuals:
    Min      1Q  Median      3Q     Max 
-250619  -84196    3056   91404  225989 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.430e+06  2.960e+06  -3.186 0.006141 ** 
intpop       1.260e-03  5.316e-04   2.370 0.031643 *  
intfortour   2.334e-03  6.890e-04   3.387 0.004066 ** 
indimport   -2.504e+00  5.778e-01  -4.333 0.000591 ***
indpastra    2.003e-01  3.103e-02   6.456 1.08e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 148000 on 15 degrees of freedom
Multiple R-squared:  0.9977,    Adjusted R-squared:  0.997 
F-statistic:  1592 on 4 and 15 DF,  p-value: < 2.2e-16
  • All the predictors are significant in the model.
  • The R² value is very high using a small subset of the predictors.

Correlation Matrix

cor(d[,c("indfortour", "intpop", "intfortour", "indimport",
          "indpastra")])
           indfortour    intpop intfortour indimport indpastra
indfortour  1.0000000 0.9859420  0.9927588 0.8928863 0.9955932
intpop      0.9859420 1.0000000  0.9839476 0.9359304 0.9867042
intfortour  0.9927588 0.9839476  1.0000000 0.9048397 0.9879517
indimport   0.8928863 0.9359304  0.9048397 1.0000000 0.9148673
indpastra   0.9955932 0.9867042  0.9879517 0.9148673 1.0000000

Partial Correlation Matrix

pcor(d[,c("indfortour", "intpop", "intfortour", "indimport",
          "indpastra")])
$estimate
           indfortour      intpop  intfortour  indimport  indpastra
indfortour  1.0000000  0.52190757  0.65828074 -0.7455799  0.8575415
intpop      0.5219076  1.00000000 -0.07920208  0.6945475 -0.2517725
intfortour  0.6582807 -0.07920208  1.00000000  0.3859728 -0.3414137
indimport  -0.7455799  0.69454749  0.38597285  1.0000000  0.6393728
indpastra   0.8575415 -0.25177252 -0.34141372  0.6393728  1.0000000

$p.value
             indfortour      intpop  intfortour    indimport    indpastra
indfortour 0.000000e+00 0.031642971 0.004066218 0.0005911881 1.082203e-05
intpop     3.164297e-02 0.000000000 0.762530572 0.0019751471 3.296398e-01
intfortour 4.066218e-03 0.762530572 0.000000000 0.1259667167 1.798611e-01
indimport  5.911881e-04 0.001975147 0.125966717 0.0000000000 5.717951e-03
indpastra  1.082203e-05 0.329639772 0.179861132 0.0057179510 0.000000e+00

$statistic
           indfortour    intpop intfortour indimport indpastra
indfortour   0.000000  2.369677   3.386827 -4.333055  6.456354
intpop       2.369677  0.000000  -0.307715  3.738943 -1.007568
intfortour   3.386827 -0.307715   0.000000  1.620434 -1.406821
indimport   -4.333055  3.738943   1.620434  0.000000  3.220562
indpastra    6.456354 -1.007568  -1.406821  3.220562  0.000000

$n
[1] 20

$gp
[1] 3

$method
[1] "pearson"

Model Plot

p2 <- d %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = indfortour, color = "Original"), size = 1) +
  geom_line(aes(y = predict(model3), color = "Fitted"), size = 1) +
  scale_colour_manual("", 
                      breaks = c("Original", "Fitted"),
                      values = c("#cc241d", "#689d6a")) +
  mytheme +
  labs(subtitle = "Time series", color="Prediction")
p2

Quantile Regression Model

We use 5 selected predictors for the quantile regression model for the median.

model4 = rq(indfortour~indexport+indpop+indtourexp+indcrim+indimport,
            tau = 0.5, data = d)
model4
Call:
rq(formula = indfortour ~ indexport + indpop + indtourexp + indcrim + 
    indimport, tau = 0.5, data = d)

Coefficients:
  (Intercept)     indexport        indpop    indtourexp       indcrim 
-1.860771e+07 -2.055811e+00  1.569840e+04  1.483933e+02  2.960738e+04 
    indimport 
-1.664753e+00 

Degrees of freedom: 20 total; 14 residual

Simulation

We want to test our models in the situation that two predictors are absent - indtourexp and indcrim as these are very important variables, but difficult to accurately measure.

Model 1 is the model involving all predictors. Model 2 is the model with highest adjusted R².

data = read.csv(file.path(".","data","tourist_data_simulation_purpose.csv"), colClasses = "numeric", fileEncoding = 'UTF-8-BOM')

simulateindfortour = function(n) {
  r = 1000
  res = c(data[n,1], data[n,2], NA, NA, NA,NA)
  
  #using model1
  
  newdata <- data.frame(data[n,3:16])
  indfortour <- predict(model1, newdata)
  res[3] = mean(indfortour) #predicted value using model1
  indtourexpsim <- recdf(data$indtourexp, r)
  indcrimsim <- recdf(data$indcrim, r)

  for (i in 1:r) 
  {
    newrow <- c (data[n,3],data[n,4],data[n,5],data[n,6],data[n,7],data[n,8],data[n,9],data[n,10],data[n,11],data[n,12],indtourexpsim[i],indcrimsim[i],data[n,15],data[n,16])
    newdata <- rbind(newdata, newrow)
  }

  newdata <- newdata[-1,]
  indfortour <- predict(model1, newdata)
  res[4] = mean(indfortour)#predicted value using model1 with missing predictors
  
  #using model2
  
  newdata <- data.frame(data[n,3:16])
  indfortour <- predict(model2, newdata)
  res[5] = mean(indfortour) #predicted value using model2
  indtourexpsim <- recdf(data$indtourexp, r)
  indcrimsim <- recdf(data$indcrim, r)
  
  for (i in 1:r) 
  {
    newrow <- c (data[n,3],data[n,4],data[n,5],data[n,6],data[n,7],data[n,8],data[n,9],data[n,10],data[n,11],data[n,12],indtourexpsim[i],indcrimsim[i],data[n,15],data[n,16])
    newdata <- rbind(newdata, newrow)
  }
  
  indfortour <- predict(model2, newdata)
  res[6] = mean(indfortour)#predicted value using model2 with missing predictors
  return(res)
}

r = 1000

# data point1

simtable = simulateindfortour(1)

# data point2
  
simtable = rbind(simtable, simulateindfortour(2))

# data point3
  
simtable = rbind(simtable, simulateindfortour(7))

# data point4

simtable = rbind(simtable, simulateindfortour(15))

simtable = data.frame(simtable)
row.names(simtable) <- NULL

simtable %>%
    kable("html", col.names = c(
    "Year",
    "Original Value",
    "Model 1 prediction",
    "Model 1 prediction after simulation",
    "Model 2 prediction",
    "Model 2 prediction after simulation"
    )
  ) %>% column_spec(1:6, width = "6cm")
Year Original Value Model 1 prediction Model 1 prediction after simulation Model 2 prediction Model 2 prediction after simulation
2000 2638813 2520774 3312408 2498458 3401112
2001 2537282 2550759 3350263 2546721 3506358
2006 4447167 4500597 5012900 4487057 5108487
2014 7679099 7810983 7509445 7802343 7431145

Conclusion

  • We can see that in our model there are 14 covariates, of which 4 namely, intpop, intfortour, indimport and indpastra are most significant.
  • There is dependance between the variables.
  • The R² and adjusted R² are very high meaning our model fits indfortour well. They explain more than 99% of the variation in indfortour.
  • If the data for indtourexp and indcrim are not available, our model using all the predictors is better than the model with highest adjusted R².
  • Note that due to the nature of our analysis, we could not consider Tourists from Different Nations separately. Factors like relative GDP per Capita, Proximity between the Nations, and Flight Prices would certainly affect the amount of Tourists arriving from country to country.