I have always been interested in aviation. I am fascinated by planes and their complexity, and by how aviation has developed into one of the safest methods of transportation through the study of accidents since 1903, when the Wright brothers flew the first powered airplane.
Aviation draws on engineering, mathematics, physics, chemistry, and architecture, among other disciplines, to move people and objects from a point A to a point B. As someone who studies mathematics, I find that aviation offers a clear application of the knowledge I have gained throughout my studies.
Commercial aviation is one of the safest modes of travel, and the safety of flying is backed by statistics. The National Transportation Safety Board (NTSB) is the independent U.S. government agency that investigates aviation accidents, determines their probable cause, and makes recommendations to airlines, manufacturers, and the Federal Aviation Administration (FAA). I have watched many documentaries, movies, and television series that cover aviation crashes. One of my favorite shows is Mayday: Air Crash Investigation, by National Geographic, which covers many past (pre-2000) and more recent (2001 to present) aviation accidents. The show has taught me a good deal about how airplanes operate.
I plan on someday buying an airplane for my personal use. My choice would be the Icon A5, a light-sport aircraft (LSA).
I collected the data used in this project from FiveThirtyEight, which in turn collected and cleaned the data from the Aviation Safety Network. In addition, I found corroborating data on Data.gov, collected by the NTSB, covering Part 121 (commercial aviation) operations in the United States.
The FiveThirtyEight data covers the 30-year period from 1985 to 2014, and the NTSB data covers 1983 to 2014.
The data from FiveThirtyEight contains the following categories: airline, available seat kilometers flown per week, incidents from 1985 to 1999, fatal accidents from 1985 to 1999, fatalities from 1985 to 1999, incidents from 2000 to 2014, fatal accidents from 2000 to 2014, and fatalities from 2000 to 2014.
The NTSB data concerns U.S. carriers operating under Part 121. It contains the following categories: year, accidents (all, fatal), fatalities (total, aboard), flight hours, miles flown, accidents per 100,000 flight hours (all, fatal), accidents per 100,000 miles flown (all, fatal), and accidents per 100,000 departures (all, fatal).
The bar graph above shows the number of incidents per airline from 1985 to 1999. Each bar corresponds to one airline, in alphabetical order from Aer Lingus and Aeroflot through Xiamen Airlines. The y-axis gives the number of incidents that each airline had during the time period. Notice that Aeroflot had the largest number of incidents during the time frame, with over 70 incidents.
The histogram above shows the fatal accidents from 1985 to 1999, that is, the accidents which resulted in at least one fatality. The x-axis corresponds to the number of fatal accidents per airline and the y-axis corresponds to the frequency, meaning the number of airlines. For example, the bar over the interval from 0 to 2 has a frequency of over 30, meaning that more than 30 airlines had at most 2 fatal accidents. Notice how the graph is right skewed, as the majority of airlines had between 0 and 4 fatal accidents while a few had many more. For a right-skewed data set like this one, we expect the mean to be larger than the median.
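To make the reading of such a histogram concrete, here is a minimal Python sketch that bins per-airline counts into intervals of width 2 and compares the mean with the median. The values in `fatal_accidents` are made-up illustrative numbers, not the FiveThirtyEight data, and the bin width of 2 is an assumption based on the interval mentioned above.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical fatal-accident counts per airline (illustrative only).
fatal_accidents = [0, 0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 5, 7, 14]

BIN_WIDTH = 2  # assumed bin width


def bin_counts(values, width):
    """Count how many values fall in each half-open bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


histogram = bin_counts(fatal_accidents, BIN_WIDTH)
for start, count in histogram.items():
    print(f"[{start}, {start + BIN_WIDTH}): {count} airlines")

# In a right-skewed sample the long right tail pulls the mean above the median.
sample_mean = mean(fatal_accidents)
sample_median = median(fatal_accidents)
print(f"mean = {sample_mean:.2f}, median = {sample_median}")
```

In this toy sample most airlines sit in the lowest bins, and the single airline with 14 fatal accidents is enough to pull the mean above the median.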
The histogram above shows the number of fatalities per airline. The x-axis corresponds to the total fatalities of an airline for the time frame, while the y-axis shows the number of airlines with the given number of fatalities. Looking at the histogram, we see that over 35 airlines had between 0 and 100 fatalities. Notice from the graph that fewer than 10 airlines had over 400 fatalities. The graph is right skewed, as the values are more spread out on the right side of the median.
The bar graph shown displays the aviation incidents from 2000 to 2014. Each bar corresponds to an airline, and the bars are arranged alphabetically from left to right starting with Aer Lingus. Notice that unlike the bar graph for the period '85-'99, this bar graph's range does not extend past 30. This points to a substantial drop in overall incidents involving aircraft.
From the FiveThirtyEight data set, I calculated the mean, the median, the variance, and the standard deviation of the fatalities columns for the periods 1985 to 1999 and 2000 to 2014.
The results for the fatalities from 1985 to 1999 are given below:
Mean: 112.4107
Median: 48.5
Variance: 21518.28
Standard Deviation: 146.6911
The results for the fatalities from 2000 to 2014 are given below:
Mean: 55.51786
Median: 0
Variance: 12394.98
Standard Deviation: 111.3328
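For readers who want to reproduce this kind of summary, the following is a minimal Python sketch using only the standard library. The `fatalities` list is a made-up example, not the FiveThirtyEight column, and the sketch assumes the sample (n - 1) convention for the variance and standard deviation, which is consistent with the reported standard deviations being the square roots of the reported variances.

```python
from statistics import mean, median, stdev, variance


def summarize(values):
    """Return the mean, median, sample variance, and sample standard deviation."""
    return {
        "mean": mean(values),
        "median": median(values),
        "variance": variance(values),  # n - 1 denominator
        "sd": stdev(values),           # square root of the sample variance
    }


# Hypothetical per-airline fatality totals (illustrative only).
fatalities = [0, 0, 0, 12, 48, 49, 110, 250, 530]

summary = summarize(fatalities)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Applying `summarize` to each fatalities column of the real data set would give the four statistics for each period.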
From these calculations, we can observe that the mean for 1985 to 1999 is slightly more than double the mean for 2000 to 2014. In other words, across two periods of equal length, the mean number of fatalities per airline fell by approximately 50.61%, since (112.4107 - 55.51786) / 112.4107 ≈ 0.5061. Although the data does not show the specific policy and design changes made to aircraft and to the aviation system, a drop of this size in such a relatively short time suggests that meaningful changes were made.
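The comparison of the two means can be checked directly from the reported values. This short Python sketch recomputes the ratio of the means and the relative decrease; the function name `relative_decrease` is only an illustrative helper.

```python
mean_1985_1999 = 112.4107  # reported mean fatalities per airline, 1985 to 1999
mean_2000_2014 = 55.51786  # reported mean fatalities per airline, 2000 to 2014


def relative_decrease(old, new):
    """Fractional decrease when going from old to new."""
    return (old - new) / old


ratio = mean_1985_1999 / mean_2000_2014
drop = relative_decrease(mean_1985_1999, mean_2000_2014)

print(f"ratio of means: {ratio:.3f}")
print(f"relative decrease: {drop:.2%}")
```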
Looking at the median, we see that for 1985 to 1999 it was 48.5, which means that half of the airlines had more than 48 fatalities during that 15-year period. In contrast, the median for the 15-year period from 2000 to 2014 was 0, which means that at least half of the airlines did not suffer a single loss of life during this time frame.
Shifting our focus to the standard deviations, we can observe that the data values for 2000 to 2014 are more tightly clustered around the mean than those for 1985 to 1999 (a standard deviation of 111.33 versus 146.69). As a rough rule of thumb, an airline picked at random from the 2000 to 2014 data would typically have no more than about 167 fatalities, which is one standard deviation above the mean (55.52 + 111.33 ≈ 166.85). Because the distribution is strongly right skewed, this is a guide rather than a guarantee.
Using the NTSB data, we can make observations about the safety of flying on airlines that operate under Part 121 of the Federal Aviation Regulations (14 CFR Part 121). Part 121 covers scheduled and non-scheduled airline services. Since March 20, 1997, aircraft with 10 or more seats used in scheduled passenger service have been operated under 14 CFR 121.
The NTSB data is broken up into various columns and ranges from 1983 to 2014. I will primarily focus on the flight hours per year and the accidents per year (fatal and nonfatal). Flight hours, miles, and departures are compiled by the Federal Aviation Administration.
The following scatterplot shows some interesting results.
On this scatterplot, we can observe that as the flight hours increase along the x-axis, the number of aviation accidents in a given year tends to increase as well. The correlation coefficient between the two variables is 0.5056792 (about 0.51), which indicates a moderate positive correlation. Although it may not be obvious at first glance, the relationship appears roughly linear: one could draw a straight line rising from left to right such that the points on the scatterplot fall reasonably near it.
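As an illustration of how such a correlation coefficient is obtained, here is a minimal Python sketch of the Pearson correlation. The `flight_hours` and `accidents` lists are hypothetical values chosen for the example, not the NTSB figures.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical yearly values (illustrative only).
flight_hours = [8.0e6, 1.0e7, 1.2e7, 1.5e7, 1.8e7]
accidents = [22, 30, 25, 38, 33]

r = pearson_r(flight_hours, accidents)
print(f"r = {r:.3f}")
```

A value near 1 or -1 indicates a strong linear relationship, while a value around 0.5, as in the NTSB data, indicates a moderate one.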
I calculated the mean and the standard deviation of the accidents and fatalities columns for the period 1983 to 2014.
The results for the accidents from 1983 to 2014 in the United States:
Mean: 32.4375
Standard Deviation: 10.74916
The results for the fatalities from 1983 to 2014 in the United States:
Mean: 96.53125
Standard Deviation: 152.8679
From these calculations, we can see that on average roughly 97 people per year died in aviation accidents during this 32-year period. It is important to mention that illegal acts such as suicide, sabotage, and terrorism are included in the totals for accidents and fatalities. These acts occurred in 5 of the years: 1986, 1987, 1988, 1994, and 2001.
For these two columns, fatalities and accidents, I calculated 95% confidence intervals for the mean. The results are given below:
Confidence interval for the mean of accidents: 32.44 ± 3.88, that is, (28.56, 36.32)
Confidence interval for the mean of fatalities: 96.53 ± 55.11, that is, (41.42, 151.64)
Notice how wide the confidence interval for the mean of fatalities is: its margin of error is more than half of the mean itself! The main reason is that the standard deviation is very high, in large part because of illegal acts such as the attacks of September 11.
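The margins of error above are consistent with a t-based interval computed from 32 yearly observations (1983 through 2014). The Python sketch below works under that assumption and takes the two-sided 95% critical value for 31 degrees of freedom, 2.0395, from a standard t table rather than computing it.

```python
from math import sqrt

N_YEARS = 32        # assumed: one observation per year, 1983 through 2014
T_CRIT_95 = 2.0395  # assumed table value: two-sided 95% t critical value, 31 df


def t_interval(sample_mean, sample_sd, n, t_crit):
    """Return (lower, upper, margin) for a t-based confidence interval of the mean."""
    margin = t_crit * sample_sd / sqrt(n)
    return sample_mean - margin, sample_mean + margin, margin


acc_low, acc_high, acc_margin = t_interval(32.4375, 10.74916, N_YEARS, T_CRIT_95)
fat_low, fat_high, fat_margin = t_interval(96.53125, 152.8679, N_YEARS, T_CRIT_95)

print(f"accidents:  ({acc_low:.2f}, {acc_high:.2f}), margin {acc_margin:.2f}")
print(f"fatalities: ({fat_low:.2f}, {fat_high:.2f}), margin {fat_margin:.2f}")
```

Because the margin is proportional to the standard deviation, the much larger spread of the fatalities column translates directly into a much wider interval.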
Using the NTSB data on scheduled and unscheduled air service, we can create linear regression models of the relationship between the number of aviation accidents and the number of flight hours, miles flown, and departures per year. I fitted 2 linear models, one named LMP4 and the other named LMP5. Here, LMP stands for Linear Model Project.
The LMP5 model's response variable is the number of all accidents (fatal and nonfatal) per year, and its predictor variable is the number of flight hours per year. The fitted function is f(x)=0.000001454*x+10.61 or, in scientific notation, f(x)=1.454e-06*x+10.61
The LMP4 model has the same response variable as LMP5, but it has 3 predictor variables: the number of flight hours (x), miles flown (y), and number of departures (z). The LMP4 model is given by g(x,y,z)=-2.140e-05*x+3.367e-08*y+1.960e-05*z-36.28
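To show how the two fitted functions are used for prediction, here is a small Python sketch that evaluates f and g with the reported coefficients. The input values are hypothetical, and the sketch assumes the predictors are expressed as raw yearly totals (hours, miles, departures); if the data set stores them in other units, the inputs would need to be scaled accordingly.

```python
def lmp5(flight_hours):
    """LMP5: predicted accidents per year from flight hours alone."""
    return 1.454e-06 * flight_hours + 10.61


def lmp4(flight_hours, miles_flown, departures):
    """LMP4: predicted accidents per year from three predictors."""
    return (
        -2.140e-05 * flight_hours
        + 3.367e-08 * miles_flown
        + 1.960e-05 * departures
        - 36.28
    )


# Hypothetical yearly totals (illustrative only, not taken from the NTSB data).
hours = 15_000_000
miles = 6_000_000_000
departures = 9_000_000

pred_lmp5 = lmp5(hours)
pred_lmp4 = lmp4(hours, miles, departures)
print(f"LMP5 prediction: {pred_lmp5:.2f} accidents")
print(f"LMP4 prediction: {pred_lmp4:.2f} accidents")
```

Note that the negative coefficient on flight hours in LMP4 does not mean that more hours reduce accidents; the three predictors grow together, so each coefficient only describes the effect of one variable with the others held fixed.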
The image above shows a scatterplot of the number of accidents per year against the number of flight hours per year. Notice that as the number of flight hours increases, so does the number of accidents.
Here we can see that as the number of miles flown increases, so does the number of accidents. This makes sense, as it is rare to have accidents on the tarmac.
When plotted as a histogram, the residuals of the LMP4 model form a roughly normal distribution that is slightly right skewed. The fact that this distribution looks approximately normal means that the distances of the observations from the fitted model can be described reasonably well by a normal probability distribution.
I calculated the value of r squared for the LMP4 model to be 0.7458948.
The r squared value of roughly 74.59% indicates that the model explains about three quarters of the variation in yearly accidents, so the differences between the fitted values and the observed values are relatively small.
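For reference, r squared can be computed from the observed and fitted values as 1 minus the ratio of the residual sum of squares to the total sum of squares. The Python sketch below illustrates this with hypothetical `observed` and `fitted` lists, not with the actual LMP4 output.

```python
def r_squared(observed, fitted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed and fitted accident counts (illustrative only).
observed = [22, 30, 25, 38, 33]
fitted = [24, 27, 28, 35, 34]

r2 = r_squared(observed, fitted)
print(f"r squared = {r2:.4f}")
```

The closer the fitted values are to the observed ones, the smaller the residual sum of squares and the closer r squared is to 1.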
The image on the left plots the fitted values of the LMP4 model against its residuals. The residuals appear to be randomly scattered around 0, with no clear pattern among the points on the graph. The values furthest away from the brown line are potential outliers.
The histogram above shows how the residuals of the LMP5 model are spread out. This model used only 1 predictor variable. Notice how, aside from an outlier, the residuals seem to be spread out fairly uniformly.
The scatterplot above shows the fitted values compared to the residuals of the LMP5 model. We can observe that the values are much further away from the 0 line than those of the LMP4 model. This is consistent with the r squared value of the LMP5 model, which is 0.2557114 or roughly 25.57%. This smaller r squared value implies that the model is not very good at predicting accidents based on flight hours alone.
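As a quick consistency check, in a linear regression with a single predictor the r squared value equals the square of the correlation coefficient between the predictor and the response. The Python sketch below verifies this with the two values reported in this project.

```python
correlation = 0.5056792     # reported correlation of flight hours and accidents
r_squared_lmp5 = 0.2557114  # reported r squared of the LMP5 model

squared_correlation = correlation ** 2
difference = abs(squared_correlation - r_squared_lmp5)

print(f"correlation squared: {squared_correlation:.7f}")
print(f"difference from reported r squared: {difference:.1e}")
```

The two numbers agree up to rounding, which supports the reported r squared for LMP5.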
Using the NTSB data, we can split the observations into two periods, 1984 to 2000 and 2001 to 2014. Was there a significant improvement in aviation safety in the new century?
Here the null hypothesis is h_0: mean of data from 1984-2000 = mean of data from 2001-2014
The alternative hypothesis is h_1: mean of data from 1984-2000 ≠ mean of data from 2001-2014
There were 546 accidents from 1984 to 2000, with a mean of 31.12 and a standard deviation of 12.53.
There were 525 accidents from 2001 to 2014, with a mean of 33.5 and a standard deviation of 8.6.
Our test statistic is -8.421242, and the null distribution is a t-distribution with r degrees of freedom, where r is the integer part of the degrees-of-freedom formula, a lengthy fraction that I evaluated with R to get 1044.325; thus r=1044. Using a significance level of 0.01 for a two-sided test, the critical values are t = -2.576 and t = 2.576. The p-value is 0.007957228.
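The description of the degrees of freedom as the integer part of a long fraction reads like the Welch-Satterthwaite approximation used in a two-sample t-test with unequal variances. Assuming that is the test intended, the Python sketch below shows how the statistic and the degrees of freedom are computed. Since the sample sizes behind the reported statistic are not stated here, the inputs are hypothetical and the output is not meant to reproduce the numbers above.

```python
from math import floor, sqrt


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1 ** 2 / n1  # squared standard error of the first sample mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second sample mean
    t_stat = (mean1 - mean2) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t_stat, df


# Hypothetical summary statistics (illustrative only).
t_stat, df = welch_t(30.0, 10.0, 25, 36.0, 8.0, 20)
r = floor(df)  # integer part of the degrees of freedom

print(f"t = {t_stat:.4f}, df = {df:.3f}, r = {r}")
```

The null hypothesis is rejected at a given significance level when the absolute value of the statistic exceeds the corresponding critical value of the t-distribution with r degrees of freedom.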
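In conclusion, we reject the null hypothesis because the test statistic falls inside the critical region: its absolute value exceeds the critical value of 2.576, and equivalently the p-value is below our significance level of 0.01 (0.005 in each tail). Rejecting the null hypothesis means that the mean number of accidents differed significantly between the two periods, which I interpret as a significant change in aviation safety in the new century! Because this test compares raw yearly counts, comparing the accident rates per 100,000 flight hours would be the natural way to confirm the direction of that change.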