Flight delays are a significant challenge for airlines, leading to customer dissatisfaction and operational inefficiencies. This analysis aims to identify the key factors behind flight delays and develop a predictive model that helps minimize disruptions.
Data Collection
The dataset includes critical information such as departure and arrival times, weather conditions, airline details (carrier, flight route), distance between airports, and the delay time in minutes. Data was collected from public aviation databases and cleaned to ensure completeness.
Data Preprocessing
To prepare the data for modeling, several preprocessing steps were followed:
- Handling Missing Values: Missing weather data and incomplete flight records were either removed or imputed using average values.
- Outliers: Flights delayed by more than 12 hours were treated as outliers and excluded.
- Feature Engineering: Departure hour and day of the week were derived from the departure time and used alongside distance traveled and weather conditions; these features were essential to model performance.
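The preprocessing steps above can be sketched in pandas. This is a minimal illustration rather than the pipeline actually used: the column names (`dep_time`, `wind_speed`, `delay_minutes`) and the sample rows are hypothetical, and mean imputation is shown for a single numeric weather column only.

```python
import pandas as pd

def preprocess(flights: pd.DataFrame) -> pd.DataFrame:
    """Clean a flights table. All column names are hypothetical."""
    # Remove records missing fields that cannot sensibly be imputed.
    df = flights.dropna(subset=["dep_time", "delay_minutes"])
    # Impute missing numeric weather readings with the column mean.
    df = df.assign(wind_speed=df["wind_speed"].fillna(df["wind_speed"].mean()))
    # Exclude outliers: delays longer than 12 hours (720 minutes).
    df = df[df["delay_minutes"] <= 12 * 60]
    # Derive time-based features from the departure timestamp.
    dep = pd.to_datetime(df["dep_time"])
    df = df.assign(dep_hour=dep.dt.hour, day_of_week=dep.dt.dayofweek)  # Monday=0
    return df.reset_index(drop=True)

# Made-up rows, for illustration only.
sample = pd.DataFrame({
    "dep_time": ["2023-01-02 08:15", "2023-01-02 14:30", None, "2023-01-03 17:45"],
    "wind_speed": [10.0, None, 5.0, 20.0],
    "delay_minutes": [5, 30, 12, 900],
})
clean = preprocess(sample)
print(clean)
```

Here the row without a departure time is dropped, the missing wind speed is filled with the mean of the remaining readings, and the 900-minute delay is excluded as an outlier.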
Exploratory Data Analysis
During EDA, several interesting patterns emerged:
- Flight Delays by Time of Day: Afternoon flights between 12 PM and 6 PM experienced the most frequent delays, while early morning flights had fewer delays.
- Weather Impact: Flights during storms or heavy rain were significantly more likely to experience delays.
- Correlation Analysis: Weather conditions and distance traveled showed the strongest correlations with flight delays.
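Comparisons like these can be computed with a few grouped aggregations. The sketch below uses made-up rows and hypothetical column names; the 15-minute cutoff for calling a flight "delayed" is an assumption, since the report does not state the threshold it used.

```python
import pandas as pd

# Toy rows for illustration only; not the report's data.
flights = pd.DataFrame({
    "dep_hour": [6, 7, 13, 15, 17, 20, 14, 9],
    "weather": ["clear", "clear", "storm", "rain", "clear", "clear", "storm", "clear"],
    "distance_km": [300, 1200, 800, 2500, 450, 950, 1800, 600],
    "delay_minutes": [0, 5, 45, 30, 20, 10, 60, 0],
})

# Assumed definition: a flight counts as delayed at 15 minutes or more.
flights["delayed"] = flights["delay_minutes"] >= 15

# Bucket departure hours into periods; intervals are left-closed, e.g. [12, 18).
flights["period"] = pd.cut(
    flights["dep_hour"],
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)

# Share of delayed flights per period and per weather condition.
rate_by_period = flights.groupby("period", observed=True)["delayed"].mean()
rate_by_weather = flights.groupby("weather")["delayed"].mean()

# Pearson correlation between distance and delay length.
distance_corr = flights["distance_km"].corr(flights["delay_minutes"])

print(rate_by_period, rate_by_weather, distance_corr, sep="\n\n")
```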
Model Development
Several machine learning models were developed and tuned:
- Logistic Regression: This was used as the baseline model to classify delayed vs. non-delayed flights.
- Random Forest: To capture non-linear relationships in the data, Random Forest was applied.
- Gradient Boosting (XGBoost): To further improve predictive performance, XGBoost was implemented; it builds trees sequentially, with each new tree fitted to correct the errors of the ensemble built so far.
- Hyperparameter Tuning: Using GridSearchCV, model parameters like tree depth and the number of estimators were optimized for both Random Forest and XGBoost.
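A baseline plus a tuned tree ensemble could look roughly like the following scikit-learn sketch. The data is synthetic and generated inside the example, the feature set and parameter grid are assumptions, and only Random Forest is tuned here. XGBoost's `XGBClassifier` follows the scikit-learn estimator interface, so it could be passed to `GridSearchCV` in the same way.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight data (assumption, not the real dataset).
rng = np.random.default_rng(0)
n = 400
dep_hour = rng.integers(0, 24, n)
bad_weather = rng.integers(0, 2, n)
distance = rng.uniform(100, 3000, n)
afternoon = (dep_hour >= 12) & (dep_hour < 18)
prob_delay = 1 / (1 + np.exp(-(1.5 * bad_weather + 1.0 * afternoon - 1.0)))
y = (rng.random(n) < prob_delay).astype(int)  # 1 = delayed, 0 = not delayed
X = np.column_stack([dep_hour, bad_weather, distance])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Baseline: logistic regression on standardized features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

# Grid search over tree depth and number of estimators, scored by AUC-ROC.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, 6], "n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

print("baseline accuracy:", baseline.score(X_test, y_test))
print("best parameters:", search.best_params_)
```

Scoring the search by AUC-ROC rather than accuracy is a design choice that makes model selection less sensitive to class imbalance between delayed and non-delayed flights.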
Model Evaluation
- Accuracy: 85%
- AUC-ROC: 0.90, indicating a strong ability to distinguish between delayed and non-delayed flights.
- Confusion Matrix: The model showed a slight bias toward predicting non-delays, producing more false negatives (delayed flights predicted as on time) than false positives.
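For reference, the three metrics can be computed with `sklearn.metrics` as sketched below. The labels and scores are invented for illustration and do not reproduce the figures reported above; the 0.5 decision threshold is likewise an assumption. In this toy example false negatives outnumber false positives, which is the pattern expected from a classifier that leans toward predicting non-delays.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Invented example: 1 = delayed, 0 = not delayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05, 0.15]

# Turn predicted probabilities into class labels (assumed threshold of 0.5).
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

accuracy = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # uses the scores, not thresholded labels

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print(f"accuracy={accuracy:.2f} auc={auc:.3f}")
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
```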
Conclusion & Recommendations
The analysis indicates that weather conditions and departure time of day were the most significant predictors of delays. Airlines should consider using predictive models to improve schedule management, especially during adverse weather. An early warning system driven by weather forecasts could also help mitigate delays.