$\begingroup$

I'm analyzing a data set as a final project. This is a categorical data analysis course, so I will be focusing on a logistic regression analysis. I scraped the data set together from flight data and weather data. The data are for flights coming into ORD. I want to model flight delay (whether a flight is delayed or not). Here is a summary of the variables involved:

    > summary(flight_data)
      DAY_OF_MONTH    DAY_OF_WEEK     AIRLINE_ID     CRS_DEP_TIME      DEP_DELAY          TAXI_OUT         TAXI_IN
     Min.   : 1.00   Min.   :1.000   19977  :7838   Min.   : 333.0   Min.   :-21.000   Min.   :  1.00   Min.   :  1.000
     1st Qu.: 6.00   1st Qu.:2.000   19805  :6082   1st Qu.: 510.0   1st Qu.: -4.000   1st Qu.: 10.00   1st Qu.:  5.000
     Median :17.00   Median :4.000   20398  :3974   Median : 761.0   Median : -1.000   Median : 12.00   Median :  7.000
     Mean   :15.17   Mean   :4.058   19386  : 576   Mean   : 760.2   Mean   :  9.072   Mean   : 15.23   Mean   :  8.009
     3rd Qu.:24.00   3rd Qu.:6.000   19790  : 553   3rd Qu.: 985.0   3rd Qu.:  7.000   3rd Qu.: 17.00   3rd Qu.:  9.000
     Max.   :30.00   Max.   :7.000   20355  : 453   Max.   :1439.0   Max.   :931.000   Max.   :301.00   Max.   :179.000
                                     (Other): 866

      CRS_ARR_TIME       AIR_TIME        DISTANCE            Weather.Type     Wind.Speed        Wind.Dir           region
     Min.   :   1.0   Min.   : 11.0   Min.   :  67.0   DRIZZLE     :  219   Min.   : 0.000   Min.   : 1.00   midwest  :5856
     1st Qu.: 625.0   1st Qu.: 55.0   1st Qu.: 334.0   FOG         :   35   1st Qu.: 6.000   1st Qu.: 8.00   northeast:4487
     Median : 865.0   Median : 99.0   Median : 647.0   MIST        :  569   Median : 8.000   Median :20.00   south    :5833
     Mean   : 849.9   Mean   :105.8   Mean   : 762.1   None        :17823   Mean   : 8.462   Mean   :19.16   west     :4166
     3rd Qu.:1075.0   3rd Qu.:130.0   3rd Qu.: 925.0   RAIN        : 1420   3rd Qu.:11.000   3rd Qu.:29.75
     Max.   :1438.0   Max.   :486.0   Max.   :4244.0   THUNDERSTORM:  276   Max.   :25.000   Max.   :37.00

I have a few questions:

  1. I have categorical variables, as well as quantitative variables. Clearly many of the variables are on different scales (time, degrees, etc). Should I standardize my variables? Does it matter? Do I do this to just the quantitative ones?

  2. We learned a bit about GAMs in our class. I would like to check whether some of my variables appear to be linear in the log-odds of arrival delay (if not, that would justify use of a GAM). Is the following code an appropriate approach to this question?

Windspeed:

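    # Empirical log-odds of delay at each distinct value of column 13
    # (assumes column 9 holds the 0/1 delay indicator)
    logodds <- NULL
    x <- NULL
    for (i in unique(flight_data[, 13])) {
      idx <- which(flight_data[, 13] == i)
      delayed <- sum(flight_data[idx, 9])
      x <- c(x, i)
      # odds within this group: delayed / not delayed
      # (denominator is the group size, not nrow(flight_data))
      logodds <- c(logodds, log(delayed / (length(idx) - delayed)))
    }
    plot(logodds ~ x)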

[Figure: plot of logodds against x]

$\endgroup$

1 Answer

$\begingroup$

1) In general, standardizing the quantitative variables is not necessary, though there are situations where it may be useful. See "When conducting multiple regression, when should you center your predictor variables & when should you standardize them?".
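To make the point about standardizing concrete, here is a minimal sketch on simulated data. The variable names (`wind_speed`, `distance`, `delayed`) and the coefficients used to simulate are hypothetical stand-ins, not the asker's data. For an ordinary (unpenalized) logistic regression, rescaling a predictor should change only the scale of its coefficient, not the fitted probabilities.

```r
# Simulated stand-in for the flight data; all names and values are hypothetical
set.seed(1)
n <- 500
wind_speed <- runif(n, 0, 25)      # small numeric range
distance   <- runif(n, 67, 4244)   # much larger numeric range
delayed    <- rbinom(n, 1, plogis(-2 + 0.08 * wind_speed + 0.0002 * distance))

# Model on the raw scales
fit_raw <- glm(delayed ~ wind_speed + distance, family = binomial)

# Same model with both predictors standardized (mean 0, sd 1)
z_wind <- as.numeric(scale(wind_speed))
z_dist <- as.numeric(scale(distance))
fit_std <- glm(delayed ~ z_wind + z_dist, family = binomial)

# Largest difference between the two sets of fitted probabilities
max_diff <- max(abs(fitted(fit_raw) - fitted(fit_std)))

# The standardized slope is the raw slope multiplied by sd(x)
slope_ratio <- unname(coef(fit_std)["z_wind"] / coef(fit_raw)["wind_speed"])

print(max_diff)
print(c(slope_ratio = slope_ratio, sd_wind = sd(wind_speed)))
```

The standardized coefficient reads as the change in log-odds per standard deviation rather than per original unit, so the choice is mainly about which interpretation is more convenient. Categorical predictors enter as indicator variables and are normally left alone.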

2) With regard to checking the linearity of your predictors, it doesn't make sense to me to check each predictor on its own before including it with the other predictors in the full model. An apparent non-linear relationship in a univariate logistic regression can disappear once other variables are included in the model, especially when the predictors are correlated. After including all desired terms in the model, you could use partial residual plots to check the functional form of each predictor. For other options or recommendations, you might want to check out "Diagnostics for logistic regression?".
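As a rough illustration of the partial residual idea, the sketch below fits a logistic regression with two predictors to simulated data and plots the partial residuals for one of them. The data, the variable names and the quadratic effect built into the simulation are assumptions made for the example. Only base R is used: `residuals()` on a `glm` fit accepts `type = "partial"` and returns one column per model term.

```r
# Simulated data; names and the quadratic wind effect are hypothetical
set.seed(2)
n <- 1000
wind_speed <- runif(n, 0, 25)
air_time   <- runif(n, 11, 486)
eta     <- -1 + 0.02 * (wind_speed - 12)^2 - 0.002 * air_time
delayed <- rbinom(n, 1, plogis(eta))

# Full model with linear terms only
fit <- glm(delayed ~ wind_speed + air_time, family = binomial)

# Matrix of partial residuals, one column per term in the model
pres <- residuals(fit, type = "partial")

# Partial residuals for wind_speed with a lowess smoother on top;
# clear curvature in the smoother suggests a linear term is not enough
plot(wind_speed, pres[, "wind_speed"],
     xlab = "wind_speed", ylab = "partial residual")
lines(lowess(wind_speed, pres[, "wind_speed"]), col = "red", lwd = 2)
```

With a binary response the raw points fall into two bands, so the smoother is the part worth reading. `termplot(fit, partial.resid = TRUE, smooth = panel.smooth)` draws a similar picture for every term in one call.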

With regard to whether or not to use a GAM, the decision should not be predicated on whether or not the relationships appear to be linear, but on what you hope to get out of the model. A multiple logistic regression would provide interpretable parameter estimates, allowing you to give an interesting summary of how the variables are related to flight delays. A GAM may be able to provide a better fit to the data (overfit?), but the potential gain in predictive accuracy comes at the cost of interpretability of the model. The following post is related and may help you see the divide between parameter estimation and predictive accuracy: "The Two Cultures: statistics vs. machine learning?"
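The trade-off can be seen in a small sketch, again on simulated data with hypothetical names. It assumes the `mgcv` package is available (it ships with standard R installations) and uses `gam()` with a smooth term `s()`. This is one possible way to set up the comparison, not a recommendation for the asker's data.

```r
library(mgcv)  # assumed to be installed; provides gam() and s()

# Simulated data; names and the quadratic effect are hypothetical
set.seed(3)
n <- 1000
wind_speed <- runif(n, 0, 25)
eta     <- -1 + 0.02 * (wind_speed - 12)^2
delayed <- rbinom(n, 1, plogis(eta))
dat <- data.frame(delayed, wind_speed)

# Logistic regression: a single slope, read as a log odds ratio per unit
fit_glm <- glm(delayed ~ wind_speed, family = binomial, data = dat)

# GAM: a smooth function of wind speed, with smoothness chosen by REML
fit_gam <- gam(delayed ~ s(wind_speed), family = binomial,
               data = dat, method = "REML")

# The GLM effect is one number
print(coef(fit_glm)["wind_speed"])

# The GAM effect is a curve; its effective degrees of freedom (edf)
# summarize how far it departs from a straight line
gam_edf <- summary(fit_gam)$s.table[1, "edf"]
print(gam_edf)

# Compare the fits, then inspect the smooth itself on a plot
print(AIC(fit_glm, fit_gam))
plot(fit_gam, shade = TRUE)
```

The GLM slope can be reported in a sentence, while the smooth has to be described from its plot, which is the interpretability cost mentioned above.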

$\endgroup$
