Exploratory Data Analysis - Corruption and Parking Violations

knitr::opts_chunk$set(fig.height=4, fig.align = "center")
getwd()
setwd("~/MIDS/W203/Lab 1")


load("Corrupt.Rdata")

library(car)
library(knitr)
library(doBy)
library(corrplot)

## corrplot 0.84 loaded

##summary(FMcorrupt)

Cultural norms or legal constraints?

We were hired by the World Bank to study the parking behavior of United Nations diplomats as a way to analyze the effect of cultural norms and legal constraints on controlling corruption. Diplomatic immunity protected UN officials from parking enforcement repercussions until 2002, when enforcement authorities were given the ability to confiscate the diplomatic plates of violators. Therefore, before 2002 behavior was constrained by cultural norms alone, and after 2002 both cultural norms and the threat of legal action could have had an impact on behavior.

Data

We have been given a dataset for a selection of UN diplomatic missions, Corrupt.Rdata. The dependent (or target) variable in this data is named violations. We are given the labels of some of the variables, listed below, and are told that the rest of the variables should be self-explanatory.
• corruption: Country corruption index, 1998
• violations: Unpaid New York City parking violations
• trade: Total trade with the United States (1998 US$)

Objective

The World Bank would like to know what relationship, if any, there is between corruption and parking violations both pre and post 2002, and whether there are any other relevant explanatory variables.

Description of the Data

The data set we are working with includes 364 observations of 28 variables. There are two observations for each unique value of the wbcode and country variables. Within each of these observation pairs, only 3 variables differ: prepost (a binary categorical variable), violations, and fines. It is important to note that violations (the dependent variable) is almost collinear with fines (as discussed in the following investigation), so the relevant changes between the observations in a pair are narrowed to prepost and violations. For the sake of the assigned task, we assume that the prepost variable relates to the periods before and after 2002, when the enforcement action took place.

Scope of Analysis

Given that only the dependent variable changes (and none of the other variables change with the exception of fines as discussed), our analysis of the effects of the enforcement action on behavior is essentially limited to analyzing (1) changes in the violations variable relative to the key independent variable corruption (in both pre and post enforcement periods), (2) correlations between the dependent variable and other independent variables (in both pre and post enforcement periods), besides corruption, and (3) correlations between corruption and the independent variables themselves (regardless of pre or post period) to identify potential confounding effects.

A full summary of the variables in the dataset is below:

Changes to the Original Dataset

Since violations is the target variable, we chose to exclude 66 records with an NA value in the violations column. This brings the total number of records to 298 (i.e. 149 observation pairs). NA values remain in the gov_wage_gdp, pctmuslim, majoritymuslim, trade, cars_total, cars_personal, cars_mission, ecaid, milaid, region, totaid, and distUNplz variables. As these are not the main variables of investigation, and because R handles NA values reasonably well in its analysis functions, these records will remain in the dataset. However, special attention is required for the gov_wage_gdp variable, which still has 114 NAs after the records with NA violations are removed.

#1. 
df <- FMcorrupt[complete.cases(FMcorrupt[, 3]), ]

Table: NAs in Changed Dataset

# Count the NAs in each variable, using only the rows where violations is not NA.
na_count <- sapply(FMcorrupt[!is.na(FMcorrupt$violations), ], function(y) sum(is.na(y)))
# Keep only the variables that still contain NAs and present them as a table.
na_count <- data.frame(Variable = names(na_count), NAs = na_count)
na_count <- na_count[na_count$NAs != 0, ]
rownames(na_count) <- NULL
na_count

##          Variable NAs
## 1    gov_wage_gdp 114
## 2       pctmuslim   4
## 3  majoritymuslim   4
## 4           trade   4
## 5      cars_total  20
## 6   cars_personal  20
## 7    cars_mission  20
## 8           ecaid   4
## 9          milaid   4
## 10         region   2
## 11         totaid   4
## 12      distUNplz   6

Looking at the data, the majoritymuslim values that are -1 line up with a pctmuslim value of 0. Therefore, we assigned those values a 0 instead of a -1.

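# Inspect columns 9 and 10 for the rows with a negative majoritymuslim value
test1 <- df[which(df$majoritymuslim < 0), c(9,10)]
# Recode -1 as 0; which() skips the NA entries
df$majoritymuslim[which(df$majoritymuslim == -1)] <- 0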

Records with no violations value included those that pertain to nominal economic groupings as coded by the World Bank. For example, observation #337 has wbcode = ‘WLD’, which is “World” according to World Bank coding. Because we omitted observations that did not carry a violations value, these economic groupings, which have no diplomatic representation as countries, were dropped from the analysis.

# Summary of the filtered data (df is created under #1. above)
summary(df)
#3.
test2 <- df[df$pop1998 == max(df$pop1998), ]  # record(s) with the largest 1998 population

Besides cleaning up records that we did not want to include in our analysis or that we felt needed adjustment, we also wanted to restructure the table to have one row for each country, with individual columns for the pre 2002 and post 2002 fines and violations. We also wanted to add new columns for the differences in violations and fines, which are more easily calculated in this new structure. We used the merge function to ensure that the post data was matched to the correct wbcode.

#4.
df_pre <- df[df$prepost == "pre", ]
df_post <- df[df$prepost == "pos", c(1,3,4)]
colnames(df_pre)[3:4] <- c("pre_violations", "pre_fines")
colnames(df_post) <- c("wbcode", "post_violations", "post_fines")

FMcorrupt_new <- merge(df_pre, df_post, by = "wbcode")

FMcorrupt_new$dif_violations = FMcorrupt_new$pre_violations-FMcorrupt_new$post_violations
FMcorrupt_new$dif_fines = FMcorrupt_new$pre_fines-FMcorrupt_new$post_fines
FMcorrupt_new = FMcorrupt_new[, c(1,20,3,4,29,30,31,32,5:19,21:28)]          #dropped prepost here, put ordered corruption 2nd and reordered pre/post violations to follow
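
The reshaping step can be illustrated on a small made-up table. The sketch below is not the lab data: the data frame `toy` and its values are hypothetical, and only the column names mirror the ones used above.

```r
# Hypothetical long-format data: two rows per country, one per period.
toy <- data.frame(
  wbcode     = c("AAA", "AAA", "BBB", "BBB"),
  prepost    = c("pre", "pos", "pre", "pos"),
  violations = c(10, 2, 0, 1),
  fines      = c(500, 100, 0, 50),
  stringsAsFactors = FALSE
)

# Split by period; keep only the key and the columns that change.
toy_pre  <- toy[toy$prepost == "pre", c("wbcode", "violations", "fines")]
toy_post <- toy[toy$prepost == "pos", c("wbcode", "violations", "fines")]
names(toy_pre)[2:3]  <- c("pre_violations", "pre_fines")
names(toy_post)[2:3] <- c("post_violations", "post_fines")

# merge() matches rows on wbcode, so the row order of the inputs does not matter.
toy_wide <- merge(toy_pre, toy_post, by = "wbcode")

# Pre minus post: a positive difference means fewer violations in the post period.
toy_wide$dif_violations <- toy_wide$pre_violations - toy_wide$post_violations
toy_wide
```

Matching on the key rather than relying on row position is what protects against a country's post values being attached to the wrong row.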

Univariate Analysis

A simple plot of the total amount of fines against the number of violations shows a fairly strong linear relationship, suggesting that either parameter could be chosen as the dependent variable in the analysis. However, if the average fine is calculated, it varies by country and ranges from roughly $48 to $56. While this difference doesn’t appear to be particularly large, the amount of an individual fine could vary by severity of the offense or change over time, and so for the purposes of this analysis we will choose to focus on the number of violations as the dependent variable.

# plot(FMcorrupt_new$pre_violations,FMcorrupt_new$pre_fines,xlab = 'Number of violations',
#     ylab = 'Total fines', main = 'Relationship between fines and violations')
# Average fine per violation for each country (pre 2002)
b = FMcorrupt_new$pre_fines/FMcorrupt_new$pre_violations
min(b, na.rm = TRUE)
max(b, na.rm = TRUE)
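
The average fine is a simple element-wise ratio, but a country with zero violations and zero fines yields 0/0, which R evaluates to NaN. A minimal sketch with made-up numbers (the vectors below are hypothetical, not the lab data) shows why the `na.rm` argument matters:

```r
# Hypothetical totals for three countries; the second has no violations.
pre_fines      <- c(500, 0, 1100)
pre_violations <- c(10, 0, 20)

# Element-wise ratio; 0/0 evaluates to NaN.
avg_fine <- pre_fines / pre_violations

# na.rm = TRUE drops NaN as well as NA before taking the range.
fine_range <- range(avg_fine, na.rm = TRUE)
fine_range
```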

Our research question is focused on the relationship between violations and corruption, so these are clearly key variables. Summary statistics for the number of violations in the pre 2002 period are:

summary(FMcorrupt_new$pre_violations)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   17.22   51.65  198.07  189.59 3392.96

There is a positive skew, with a few countries whose number of violations is much higher than the average. This skew is visible both before and after 2002.

par(mfrow = c(1,2))
hist(FMcorrupt_new$pre_violations, breaks = "FD", xlab = "Violations", main = "Histogram of violations pre 2002")

hist(FMcorrupt_new$post_violations, breaks = "FD", xlab = "Violations", main = "Histogram of violations post 2002")

Compared to pre 2002, there is a substantial overall decrease in violations post 2002. When examining the change in violations, we use the convention that a positive number indicates a decrease in the number of violations post 2002. Ten countries experienced an increase in violations after 2002, though a very modest one. For example, Azerbaijan, which showed the largest increase, had no violations before 2002 and only about 5 afterwards. Countries whose violations increased post 2002 typically had very few violations pre 2002 to begin with.

summary(FMcorrupt_new$dif_violations)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   -4.906   14.481   51.324  194.383  185.588 3385.111

FMcorrupt_new[FMcorrupt_new$dif_violations < 0, c("country","pre_violations","post_violations")]

##          country pre_violations post_violations
## 6      AUSTRALIA      0.0000000       0.3270609
## 8     AZERBAIJAN      0.0000000       4.9059138
## 12  BURKINA-FASO      0.0000000       0.9811828
## 37       DENMARK      0.0000000       0.3270609
## 56        GREECE      0.0000000       2.2894266
## 65       IRELAND      0.0000000       0.6541219
## 67        ISRAEL      0.0000000       1.3082438
## 71         JAPAN      0.0000000       0.6541219
## 102  NETHERLANDS      0.4051054       1.6353047
## 106         OMAN      0.0000000       1.3082438

If we examine “corruption”, we find it is left skewed as per its histogram:

hist(FMcorrupt_new$corruption, breaks = "FD", xlab = "Corruption Index", main = "Histogram of Corruption Index")

summary(FMcorrupt_new$corruption)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -2.58299 -0.41515  0.32696  0.01364  0.72025  1.58281

Because corruption is an index, we cannot assume that its values meet the requirements for an iid variable. To get a closer approximation of the true relationship between violations and corruption, further transformation of the corruption variable is likely warranted. However, any such transformation would require domain knowledge of the methodology used to derive the corruption index.

Other observations

Region

The region variable is categorical, taking an integer value between 1 and 7. No key is provided, but grouping countries by region code suggests that 1 = Caribbean, 2 = South America, 3 = Europe, 4 = Asia, 5 = Australasia, 6 = Africa, 7 = Middle East. For example, region category 7 lists countries in the Middle East.

unique(FMcorrupt_new$region)

## [1]  6  3  7  2  4  5  1 NA

FMcorrupt_new[FMcorrupt_new$region == 7, 'country']

##  [1] ""             "BAHRAIN"      "CYPRUS"       "EGYPT"       
##  [5] "IRAN"         "ISRAEL"       "JORDAN"       "KUWAIT"      
##  [9] "LEBANON"      "OMAN"         "PAKISTAN"     "SAUDI ARABIA"
## [13] "SYRIA"        ""             "YEMEN"        NA

The distribution of military aid (milaid) is heavily skewed. The median value is 0.2, but the mean is 35 and the max is 3,120. There are three countries in the Middle East (Israel, Egypt and Jordan) that absorb over 96% of milaid, so to see the relationships clearly, the data must be plotted using a logarithmic scale. This is noted only in case further analysis involving this variable is required, given that Egypt also has the highest number of violations pre 2002.

options(scipen = 3)
boxplot(milaid~region, data = FMcorrupt_new, main = 'Milaid by Region', xlab = 'Region', ylab = 'milaid (log scale)', ylim = c(0.01,10000), log = "y")
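
To see why a logarithmic axis helps with a variable like this, consider a made-up aid vector in which three recipients dominate the total. The values below are hypothetical and only meant to mimic the shape described above.

```r
# Hypothetical aid amounts: most are small, three are very large.
aid <- c(0.1, 0.2, 0.2, 0.5, 1, 2, 800, 1300, 3000)

# Share of the total taken by the three largest values, in percent.
top3_share <- sum(sort(aid, decreasing = TRUE)[1:3]) / sum(aid) * 100
round(top3_share, 1)

# A mean far above the median is a quick sign of positive skew.
mean(aid)
median(aid)

# On a log10 scale the values spread out much more evenly.
# (Zeros would need to be dropped or offset first, since log10(0) is -Inf.)
round(log10(aid), 2)
```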

Analysis of Key Relationships

In this section we will be looking for any potential relationships between parking violations and corruption. Part I looks at the changes in the violations variable relative to the key independent variable corruption (in both pre and post enforcement periods). Part II examines correlations between the dependent variable and other independent variables (in both pre and post enforcement periods), besides corruption. Part III looks at correlations between corruption and the independent variables themselves (regardless of pre or post period) to identify potential confounding effects.

First, we start from the general, overall correlation between each of the variables using the corrplot package in R:

Chart: Overall Correlation between Variables

M <- cor(FMcorrupt_new[,c(2:8,10:29,31)], use = "complete.obs")
corrplot(M, order = "hclust", method = "square", tl.col ='black')
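
One detail of the `cor()` call is worth keeping in mind: `use = "complete.obs"` drops every row that has an NA in any of the selected columns, so a sparsely populated variable such as gov_wage_gdp reduces the sample for all pairs of variables. The sketch below uses a made-up data frame (`toy` and its columns are hypothetical) to show how this can differ from `use = "pairwise.complete.obs"`. It is an illustration only, not a recomputation of the chart.

```r
# Hypothetical data: z is missing for half of the rows.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2, 1, 4, 3, 6, 5),
  z = c(1, NA, 2, NA, 3, NA)
)

# Listwise deletion: only rows 1, 3 and 5 are used for every pair.
cor_complete <- cor(toy, use = "complete.obs")

# Pairwise deletion: x and y are compared on all six rows.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_complete["x", "y"]
cor_pairwise["x", "y"]
```

The two x-y correlations differ even though x and y have no missing values themselves, so a correlation computed from two columns alone can differ from the value read off a larger matrix.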

Several key features emerge and are analyzed in depth below. They can be summarized as follows:
* Violations are only mildly correlated with corruption overall.
* Violations are more strongly correlated with aid (i.e. totaid, ecaid), particularly in the pre 2002 period.
* Violations are modestly correlated with other variables (i.e. region, r_middleeast, pctmuslim) in both the pre and post periods.
* Corruption is strongly correlated with gdppcus1998 (i.e. gdp per capita).
* Corruption is strongly correlated with variables that are modestly correlated with violations.

Part I. Analysis of Relationship Between Dependent and Key Independent Variable

From Chart: Overall Correlation between Variables, we observe a positive, yet relatively moderate relationship between violations (in pre and post 2002 periods) and corruption. With the scatterplotMatrix function from the car package in R, we further analyze the relationship between pre violations, post violations, and corruption, including their relative distributions.

scatterplotMatrix( ~ corruption + pre_violations + post_violations, data = FMcorrupt_new, diagonal = 'histogram')

The scatter plot matrix shows a positive relationship between corruption and violations (cor = 0.1153411). The relationship is positive but gently sloped both before and after 2002. It is interesting to note that the number of violations decreased dramatically in the post 2002 period, while the positive relationship between corruption and violations appears to remain. Furthermore, in the post 2002 period, the correlation between corruption and parking violations increased slightly, by 0.05. This further suggests the presence of a modest enforcement effect. Again, it is important to note that the corruption variable does not change between the pre and post 2002 periods. We cannot ascertain whether corruption changed between the two periods, so we do not know if changes in corruption are related to changes in violations.

In order to further evaluate the relationship between parking violations and corruption, the data is subdivided into 10 bins equally spaced along the range of observed corruption values. Ten bins of width 0.42 were chosen in order to examine the relationship over the entire range of corruption values. On the x axis of the plots, the center point of each bin is displayed.

par(mfrow = c(1,2))
binned_corruption = seq(min(FMcorrupt_new$corruption), max(FMcorrupt_new$corruption), length.out = 11)
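# include.lowest = TRUE keeps the country with the lowest corruption value in the first bin (otherwise it becomes NA)
sliced = cut(FMcorrupt_new$corruption, breaks = binned_corruption, labels = round(binned_corruption[1:10] + ((max(binned_corruption) - min(binned_corruption))/20), digits = 2), include.lowest = TRUE)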
plot(sliced, FMcorrupt_new$pre_violations, main = 'Violations\n by Corruption (pre 2002)', xlab = 'Corruption (bins = 10)', ylab = 'Number of Parking Violations')
plot(sliced, FMcorrupt_new$post_violations, main = 'Violations\n by Corruption (post 2002)', xlab = 'Corruption (bins = 10)', ylab = 'Number of Violations' )
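
The binning can be sketched on a handful of made-up index values. The vector `idx` is hypothetical. The point is how `seq()` and `cut()` produce equal-width bins labelled by their midpoints, and that `cut()` leaves the smallest value unbinned (NA) unless `include.lowest = TRUE` is set.

```r
# Hypothetical index values, not the lab data.
idx <- c(-2.0, -1.2, -0.4, 0.1, 0.6, 1.0, 2.0)
n_bins <- 4

# n_bins + 1 equally spaced break points from the minimum to the maximum.
breaks <- seq(min(idx), max(idx), length.out = n_bins + 1)
width  <- diff(breaks)[1]

# Label each bin by its midpoint.
mids <- round(breaks[1:n_bins] + width / 2, digits = 2)

# Default: intervals are open on the left, so the minimum falls outside.
bins_default <- cut(idx, breaks = breaks, labels = mids)

# include.lowest = TRUE closes the first interval on the left.
bins_closed <- cut(idx, breaks = breaks, labels = mids, include.lowest = TRUE)

table(bins_closed, useNA = "ifany")
```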

The box plot of violations (pre 2002) by corruption shows that most of the countries that are more corrupt have a higher number of violations. However, for the countries below the 75th percentile of violations there does not appear to be a positive relationship between corruption and violations. Post 2002, for countries below the 75th percentile there still does not appear to be a relationship between corruption and violations. However, as in the pre 2002 period, the countries with the most violations do tend to have higher corruption scores. In order to further evaluate the relationship between corruption and parking violations, the mean and median number of parking violations by corruption are examined below.

violations_means_pre = by(FMcorrupt_new$pre_violations, sliced, mean,  na.rm = T)
violations_means_post = by(FMcorrupt_new$post_violations, sliced, mean, na.rm = T)

par(mfrow = c(1,2))
plot(y = as.single(violations_means_pre), x = sort(unique(sliced)), type = "p", main = 'Mean violations\n by corruption pre 2002', xlab = 'Corruption', ylab = 'Mean Violations')
plot(y = as.single(violations_means_post), x = sort(unique(sliced)), type = "p", main = 'Mean violations\n by corruption post 2002', xlab = 'Corruption (bins = 10)', ylab = 'Mean Violations')

The chart from pre 2002 does appear to show a positive relationship, but that is not mirrored in the post 2002 chart. However, prior to 2002 there are several outliers that could be having a strong effect.

In order to account for the potential effect of the outliers, the median number of parking violations is examined here:

par(mfrow = c(1,2))
violations_median_pre = by(FMcorrupt_new$pre_violations, sliced, median, na.rm = T)
violations_median_post = by(FMcorrupt_new$post_violations, sliced, median, na.rm = T)

plot(y = as.single(violations_median_pre), x = sort(unique(sliced)), type = "p", main = 'Median violations \nby corruption pre 2002', xlab = 'Corruption', ylab = 'Median Parking Violations')
plot(y = as.single(violations_median_post), x = sort(unique(sliced)), type = "p", main = 'Median violations \nby corruption post 2002', xlab = 'Corruption', ylab = 'Median parking Violations')

The relationship between corruption and parking violations, by median values, appears to be weak. This is consistent with the median being resistant to outliers whereas the mean is not. With medians, the pre 2002 pattern matches the post 2002 pattern more closely than it does with means.
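
The difference between the mean and median charts comes down to how each statistic reacts to extreme values. A small sketch with made-up counts (the vector `v` is hypothetical):

```r
# Hypothetical violation counts for one bin.
v <- c(5, 12, 20, 31, 44)

# The same bin with one extreme country added.
v_outlier <- c(v, 3000)

mean(v)            # 22.4
median(v)          # 20
mean(v_outlier)    # pulled far above most of the values
median(v_outlier)  # 25.5, barely moves
```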

The final examination will split countries into three groups by dividing the range of the corruption variable into three groups. The groups will be those with low corruption levels, those with intermediate corruption levels and those with high corruption levels. The purpose of this transformation is to evaluate whether or not countries with low levels of corruption have fewer parking violations than those with high corruption.
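
The code for this split is not shown here. A minimal sketch of one way to do it, assuming three equal-width bins over the range of the index, could look like the following. The vectors `idx` and `viol` are made-up stand-ins for corruption and pre_violations, and the group labels are illustrative.

```r
# Hypothetical stand-ins for corruption and pre_violations.
idx  <- c(-2.0, -1.6, -0.9, -0.2, 0.3, 0.5, 0.9, 1.0)
viol <- c(   3,    0,   40,   25, 120,  60, 300,  90)

# Three equal-width groups across the observed range of the index.
breaks <- seq(min(idx), max(idx), length.out = 4)
level3 <- cut(idx, breaks = breaks,
              labels = c("low", "intermediate", "high"),
              include.lowest = TRUE)

# Median violations per group (less sensitive to outliers than the mean).
group_median <- tapply(viol, level3, median)
group_median
```

A call such as `boxplot(viol ~ level3)` would then draw one box per group.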

The smaller number of bins seems to indicate a much stronger relationship between parking violations and a country’s corruption index. Based on the above charts it appears that countries with low levels of corruption tend to have relatively few parking violations and the countries with intermediate and high levels of corruption tend to have more parking violations. However, the earlier analysis of the same data did not give such a clear cut picture. This indicates a much more complicated relationship that will require evaluating more variables.

Part II Correlations between Violations and Other Independent Variables

Having found a positive but modest correlation between violations and corruption, we next ask whether there are other correlations worth noting, particularly any that are stronger than the correlation between the dependent variable and the key independent variable. This is important because we want to identify other variables that might also influence the dependent variable, and thus have potential explanatory power for variation that was not captured in the univariate analysis above.

Table: Correlation Between Violations and Other Independent Variables (pre - post 2002)

vcorr<- data.frame(cor(FMcorrupt_new[,c(2:8,10:29,31)], use = 'complete.obs')[c(2,4),])
vcorr <- data.frame(t(vcorr))
vcorr <- vcorr[order(-abs(vcorr$pre_violations)),][c(5,6,9:28),]
kable(vcorr)

	pre_violations	post_violations
milaid	0.7768808	0.0585809
totaid	0.6692981	0.0278133
ecaid	0.3958826	-0.0094551
region	0.3398326	0.2706332
r_middleeast	0.3333974	0.0874204
pctmuslim	0.3322054	0.2710909
majoritymuslim	0.2990529	0.2363251
r_europe	-0.2289199	-0.1305464
gdppcus1998	-0.2070978	-0.1508573
cars_total	0.1959811	0.2084319
staff	0.1909950	0.0963176
gov_wage_gdp	0.1798621	0.0221087
corruption	0.1772369	0.2125595
cars_personal	0.1715371	0.2068946
pop1998	0.1515753	0.1501342
r_africa	0.1339295	0.2312132
spouse	0.1336007	0.1033493
cars_mission	0.1133905	0.0999396
r_southamerica	-0.0708971	-0.1523234
trade	-0.0416691	-0.0698595
distUNplz	-0.0211363	-0.0396609
r_asia	-0.0162077	0.0419485

Strikingly, we see that aid (i.e. milaid, totaid, and ecaid) has a much stronger correlation with violations in the pre 2002 period than corruption does. This is perhaps counterintuitive: one might expect aid to act as a carrot, creating a dependency that a country would not want to damage, so that an aid dependent country would have a lower number of violations. However, observing these results, we might anticipate that the relationship between aid and the other independent variables that correlate strongly with violations in the pre 2002 period (such as region, r_middleeast, pctmuslim, and majoritymuslim) is also strong. To verify this, we create a scatterplot matrix of the variables with the strongest correlations with violations, i.e. absolute value of correlation > .25.

Chart: Correlation Between Violations and Other Independent Variables (>.25)

scatterplotMatrix( ~ pre_violations + post_violations + milaid + totaid + ecaid + region + r_middleeast + pctmuslim + majoritymuslim, data = FMcorrupt_new, diagonal = 'histogram', smoother = F)

We find that our expectations are accurate. Aid is strongly correlated with the other variables that correlate strongly with violations pre 2002. This means that any apparent relationship between these other independent variables and violations has to take into account that variation in aid (i.e. milaid, totaid, and ecaid) might depend on these other variables, and vice versa.

Additionally, we find that countries with a higher percentage of Muslims have about the same number of violations as countries with a lower percentage, up to about 1000 violations. Above 1000, we see several countries with a higher percentage of Muslims and a considerably higher number of violations. Therefore, we want to see whether the relationship with the dependent variable is resistant to outliers. We split the sample into majority Muslim and non-majority Muslim countries using the “majoritymuslim” variable and compare the two groups on the dependent variable. Visualizing this comparison by boxplot gives Chart: Violations by Muslim Minority and Majority Countries.

Chart: Violations by Muslim Minority and Majority Countries

par(mfrow = c(1,2))
mjr_mus = subset(FMcorrupt_new, majoritymuslim == 1)  # == compares; a single = would not filter any rows
min_mus = subset(FMcorrupt_new, majoritymuslim < 1)
boxplot(min_mus$pre_violations, mjr_mus$pre_violations, names = c("Minority", "Majority"), main = "Violations by Muslim Minority\n and Majority Countries (pre 2002)", ylab ='Number of Violations')
boxplot(min_mus$post_violations, mjr_mus$post_violations, names = c("Minority", "Majority"), main = "Violations by Muslim Minority\n and Majority Countries (post 2002)", ylab ='Number of Violations')
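
When splitting on a 0/1 indicator with `subset()`, the condition has to be a comparison written with `==`. A single `=` is read as a named argument rather than a filter, so no rows are dropped. The sketch below shows the difference on a made-up data frame (`toy` and its values are hypothetical).

```r
# Hypothetical data: a 0/1 indicator and a count per country.
toy <- data.frame(
  majoritymuslim = c(1, 0, 0, 1, 0),
  pre_violations = c(40, 10, 25, 300, 5)
)

# Comparison with ==: keeps only the rows where the indicator equals 1.
grp_major <- subset(toy, majoritymuslim == 1)
grp_minor <- subset(toy, majoritymuslim < 1)

# A single = is treated as a named argument, so every row is returned.
grp_all <- subset(toy, majoritymuslim = 1)

nrow(grp_major)  # 2
nrow(grp_minor)  # 3
nrow(grp_all)    # 5

# Share of the first group among all classified countries, in percent.
share_major <- round(nrow(grp_major) / (nrow(grp_major) + nrow(grp_minor)) * 100, 2)
share_major      # 40
```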

We find that in both the pre and post 2002 periods, the majority Muslim group has a marginally larger number of violations. We also find three extreme outliers in the majority Muslim group in the pre 2002 period, and these outliers disappear in the post 2002 period. As majority Muslim countries make up `r round(dim(mjr_mus)[1]/(dim(mjr_mus)[1]+dim(min_mus)[1])*100, 2)` percent of the countries across the two groups, the relationship with violations may be of less practical significance anyway. For now, we leave this as a finding that is worth considering in further research.

Part III Relationship between Corruption and Other Independent Variables

Our analysis now turns to examining the relationship between the key independent variable, corruption, and the other independent variables. Table: Correlation Between Corruption and Other Independent Variables lists these correlations, ordered by absolute value (as we are concerned with the magnitude of the correlation).

Table: Correlation Between Corruption and Other Independent Variables

crptcorr<- data.frame(cor(FMcorrupt_new[,c(2:8,10:29,31)], use = 'complete.obs')[1,])
colnames(crptcorr) <- "corr"
crptcorr <- data.frame(row.names(crptcorr), crptcorr$corr)
colnames(crptcorr) <- c("Variable", "Corr")
crptcorr <- crptcorr[2:28,]
crptcorr <- crptcorr[order(-abs(crptcorr$Corr)),]
kable(crptcorr, row.names = F)

Variable	Corr
gdppcus1998	-0.9094152
r_europe	-0.5246797
trade	-0.3850610
r_africa	0.3541182
region	0.3423100
pctmuslim	0.3213838
majoritymuslim	0.2732485
cars_personal	-0.2536773
gov_wage_gdp	0.2390442
r_asia	0.2314615
staff	-0.2309461
ecaid	0.2158125
post_violations	0.2125595
post_fines	0.2075819
pre_violations	0.1772369
pre_fines	0.1763915
dif_violations	0.1755310
dif_fines	0.1736190
cars_mission	0.1534692
totaid	0.1440510
distUNplz	0.1356020
spouse	-0.1344571
r_southamerica	0.1330117
r_middleeast	0.0993256
pop1998	0.0518788
cars_total	-0.0461003
milaid	0.0346287

Thus, we find that corruption is far from being most strongly correlated with violations. Instead, we find a much stronger correlation with gdppcus1998 (i.e. gdp per capita), cor = -0.9094152. Additionally, given the practical associations among the other variables that also show high correlation (i.e. r_europe, trade, r_africa, region, pctmuslim, majoritymuslim, etc.), it is important to note that gdp per capita is likely to share a large part of its variation with these other variables, and vice versa. For example, Europe has many of the world’s richest countries, and Africa has many of the world’s poorest countries. Additionally, gdp is directly linked to trade: poorer countries have less trade overall than Europe, so we would expect countries from Africa to have relatively less trade with the U.S. Using the scatterplotMatrix function in R, we plot the strongest correlations (i.e. absolute value of correlation > .25) from Table: Correlation Between Corruption and Other Independent Variables in order to see if we can visualize these relationships:

Chart: Correlation Between Corruption and Other Independent Variables (>.25)

scatterplotMatrix( ~ corruption + gdppcus1998 + r_europe + trade + r_africa +  region + pctmuslim, data = FMcorrupt_new, diagonal = 'histogram', smoother = F)

We see in the above chart that corruption is indeed strongly correlated with gdp per capita, and that gdp per capita is strongly correlated with the other variables (r_europe, trade, r_africa, region, pctmuslim, majoritymuslim), as discussed.

Further, the negative correlation between corruption and 1998 GDP per capita (gdppcus1998) can be visualized in the graph below.

Chart: Correlation between GDP per Capita and Corruption

scatterplot(jitter(FMcorrupt_new$corruption,2),jitter(FMcorrupt_new$gdppcus1998,2), pch = 17, cex = 1,
            lty = "solid", ylab = 'GDP Per Capita (USD)', xlab = 'Corruption Index',
            main = 'Correlation between GDP per Capita and Corruption', smoother = FALSE,
            boxplot = FALSE)
text(FMcorrupt_new$corruption,FMcorrupt_new$gdppcus1998, labels=FMcorrupt_new$wbcode, cex= 0.7, lwd = 3, pos = 4)

We see a cluster of countries in the bottom right of the plot, in an area between 0 and 1 on the corruption index, and then more dispersion as income increases. Given the strong observed correlation with corruption, we would like to know whether gdppcus1998 might be a confounding variable. This is explored, along with other potential confounding variables, in the following section.

Confounding Variables

There are two types of confounding variables we are concerned with. The first type are those in our dataset, and the second type are those that are not in the dataset but that we would want to consider if we had the data. To address the first type, we saw that the relationship between violations and corruption was relatively modest. Instead, we observed that the violations variable (pre 2002) is most strongly correlated with the aid variables (i.e. milaid, totaid, ecaid), and that corruption was most strongly correlated with gdppcus1998. We also saw that the aid variables were weakly associated with corruption and that gdppcus1998 was weakly associated with violations. But we have to ask ourselves: is it possible that another variable in our data has a significant effect on both aid and gdppcus1998 at the same time? A quick glance at Chart: Overall Correlation between Variables suggests that the relationship between milaid and gdppcus1998 is relatively weak. We confirm this by calculation, cor = 0.1030945.

Among the variables that are not in our dataset, other confounding variables could possibly include:
* Diplomatic staff - Since violations are committed by individual staff, it makes sense to have a stronger understanding of staff as the main subject of observation. It is plausible that staff pick up certain behaviors while abroad, rather than from their home countries. We would want data, for example, on how long a staff person has been in NY. How long have the staff been abroad? How long have they been in other countries with low or high corruption scores? How concerned are staff with the implications of parking violations via their home offices? What is the rank of the staff? A more granular picture would also serve to strengthen the unit of observation, which is currently spread over countries, diplomatic missions, and diplomatic staff.
* Parking enforcement - While the corruption variable may be an accurate measure of abuse of power by public officials, we do not know whether the index directly relates to parking violations. It could be that in countries with high levels of corruption, enforcement of parking violations is strict, as in the case of security states. Conversely, countries with low levels of corruption may have lax enforcement of parking regulations while regulating things like grand corruption more strictly. Therefore, it could be the case that the effect of corruption on violations is actually lower in our dataset than it might be otherwise. Data on parking enforcement would therefore be important to have, to ensure it is not a confounding variable.
* Exchange relationship with U.S. - As it is possible that treatment of U.S. laws is a reflection of respect or disrespect for the U.S., it would be important to understand the type of relationship that the U.S. has with each country. For example, does the U.S. have a mutually beneficial trade relationship with the other country? What is the balance of trade in terms of types of goods and services? Is the relationship primarily defined by trade or security? What is the number of diaspora per capita by country residing in the U.S. and vice versa?

Conclusion

Our initial task was to look at the relationship between corruption and parking violations of UN diplomats, but the analysis we have done shows nothing clear or strong. There is some relationship, but it is more complicated than a linear correlation. The increase in correlation after legal enforcement began seems to indicate that, for some countries, cultural norms as operationalized by the corruption index could in fact be stronger than the threat of legal penalties. However, some of our potential confounding variables, such as the percentage of the country that is Muslim and region, showed even stronger relationships with both the dependent variable and the key independent variable. Also, several other variables in the data set showed skews and outliers that could be explored further. A larger multivariate analysis is needed before any concrete conclusions can be drawn about the impact of cultural norms vs. legal penalties as they pertain to UN parking tickets.

EDA - Corruption and Parking Violations

Silas Everett, Nicholas Conidas, Simon Hodgkinson, and Hannah Morgan