Advanced Multimodal Chatbot for Emotion Recognition
The rapid advancements in artificial intelligence (AI) have enabled machines to interact with humans in more natural, emotionally aware ways. In this project, we developed a multimodal chatbot that recognizes and responds to emotions from text and voice inputs.
Introduction
Traditional chatbots lack emotional intelligence, making interactions feel mechanical. Our chatbot bridges this gap by integrating text-based natural language processing (NLP) with voice-based audio analysis. This system detects user emotions and provides personalized responses, enhancing user engagement in applications such as virtual assistance, healthcare, and education.
System Design and Architecture
The chatbot's architecture comprises three main components:
- Emotion Detection Pipeline: Text and voice inputs are analyzed to detect emotions using machine learning models.
- Recommendation Engine: Personalized suggestions (music, videos, or events) are provided based on the detected emotion.
- Interactive User Interface: A Flask-based web application supports text and voice inputs.
Datasets and Preprocessing
The chatbot was trained using the following publicly available datasets:
- RAVDESS: Multimodal emotional speech recordings.
- CREMA-D: Emotional speech data with diverse speakers.
- TESS: Female speaker emotional recordings.
- SAVEE: Male speaker emotional speech data.
Text data was preprocessed through tokenization, stopword removal, and feature extraction using TF-IDF. Audio data was normalized and segmented into 2.5-second clips, ensuring consistency across datasets.
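The preprocessing can be sketched roughly as follows, assuming scikit-learn's TfidfVectorizer for the text side and Librosa for the audio side; the 2.5-second clip length follows the description above, while the sampling rate, feature cap, and function names are illustrative choices.

```python
import librosa
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Text preprocessing: lowercasing and stopword removal happen inside the
# vectorizer; TF-IDF turns each utterance into a sparse feature vector.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", max_features=5000)
X_text = vectorizer.fit_transform(["i feel great today", "this is so frustrating"])  # placeholder samples

# --- Audio preprocessing: load, peak-normalize, and cut into 2.5-second clips.
def segment_audio(path, clip_seconds=2.5, sr=22050):
    signal, _ = librosa.load(path, sr=sr)               # resample to a common rate
    signal = signal / (np.max(np.abs(signal)) + 1e-9)   # peak normalization
    clip_len = int(clip_seconds * sr)
    return [signal[i:i + clip_len]
            for i in range(0, len(signal) - clip_len + 1, clip_len)]
```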
Methodology
The methodology involves building two primary pipelines for emotion recognition:
- Text Emotion Recognition: Input text undergoes cleaning (stopword removal, lowercasing) and transformation into numerical features using TF-IDF. These features are fed into a Random Forest Classifier trained to predict emotions.
- Voice Emotion Recognition: Audio inputs are processed using Librosa to extract features such as MFCCs (Mel-Frequency Cepstral Coefficients), Chroma STFT, and Mel Spectrograms. The features are fed into a Convolutional Neural Network (CNN) with the following structure (a sketch of both classifiers follows this list):
- Two Conv1D layers for feature extraction and pattern recognition.
- MaxPooling layers for reducing dimensionality.
- Fully connected Dense layers for classification, using a softmax activation function for multi-class prediction.
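A minimal sketch of the two classifiers, assuming scikit-learn for the Random Forest, Librosa for feature extraction, and Keras (TensorFlow) for the CNN. Filter counts, kernel sizes, the number of MFCC coefficients, and the class count are illustrative choices rather than values reported above.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier
from tensorflow.keras import layers, models

# Text branch: TF-IDF features (from the preprocessing step) -> Random Forest.
text_clf = RandomForestClassifier(n_estimators=200, random_state=42)
# text_clf.fit(X_text_train, y_text_train)

# Voice branch, step 1: per-clip Librosa features (MFCC, Chroma STFT, Mel
# Spectrogram), averaged over time and concatenated into a single vector.
def extract_features(clip, sr=22050):
    mfcc = np.mean(librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=40).T, axis=0)
    chroma = np.mean(librosa.feature.chroma_stft(y=clip, sr=sr).T, axis=0)
    mel = np.mean(librosa.feature.melspectrogram(y=clip, sr=sr).T, axis=0)
    return np.concatenate([mfcc, chroma, mel])

# Voice branch, step 2: 1-D CNN over the concatenated feature vector.
def build_cnn(input_len, num_classes):
    model = models.Sequential([
        layers.Input(shape=(input_len, 1)),
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),                 # reduce dimensionality
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.MaxPooling1D(pool_size=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),             # fully connected layer
        layers.Dense(num_classes, activation="softmax"),  # multi-class prediction
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```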
To improve model performance, audio data was augmented using the following techniques (illustrated in the sketch after the list):
- Noise Addition: Injecting random white noise into the audio signal.
- Pitch Shifting: Changing the pitch of the audio by +/- 2 semitones.
- Time Stretching: Speeding up or slowing down the audio playback.
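These augmentations map directly onto NumPy and Librosa calls, as in the sketch below; the ±2-semitone shift follows the description above, while the noise level and stretch rate are illustrative defaults.

```python
import numpy as np
import librosa

def add_noise(signal, noise_level=0.005):
    """Inject random white noise into the audio signal."""
    return signal + noise_level * np.random.randn(len(signal))

def pitch_shift(signal, sr=22050, n_steps=2):
    """Shift the pitch by +/- n_steps semitones."""
    return librosa.effects.pitch_shift(y=signal, sr=sr, n_steps=n_steps)

def time_stretch(signal, rate=1.1):
    """Speed up (rate > 1) or slow down (rate < 1) playback."""
    return librosa.effects.time_stretch(y=signal, rate=rate)
```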
Code Workflow
The following steps outline how the code functions end to end (a condensed Flask sketch follows the list):
- Input: The user inputs text or voice through the web interface.
- Text Emotion Detection: The input text is passed through the TF-IDF vectorizer, and its emotion is predicted using the Random Forest model.
- Voice Emotion Detection: Audio inputs are saved temporarily, processed using Librosa to extract features, and passed through the CNN model.
- Combining Results: If both text and voice inputs are provided, the system integrates predictions using an ensemble approach to determine the final emotion.
- Content Recommendation: Based on the detected emotion, the Flask backend fetches relevant suggestions (music, videos, events) via APIs like Spotify, YouTube, or Eventbrite.
- Output: The detected emotion and personalized recommendations are displayed in the user interface.
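A condensed Flask sketch of this workflow is shown below. It assumes the helpers defined earlier (`vectorizer`, `text_clf`, `extract_features`, and a trained CNN in `cnn_model`) plus a hypothetical `recommend_content` wrapper around the Spotify/YouTube/Eventbrite calls; it also assumes both models output probabilities over the same ordered label set, so the ensemble here is a simple probability average.

```python
import numpy as np
import librosa
from flask import Flask, request, jsonify

app = Flask(__name__)
EMOTIONS = ["angry", "fear", "happy", "neutral", "sad"]  # illustrative; must match training order

@app.route("/predict", methods=["POST"])
def predict():
    text = request.form.get("text")
    audio_file = request.files.get("audio")
    probs = []

    # Text branch: TF-IDF features -> Random Forest class probabilities.
    if text:
        probs.append(text_clf.predict_proba(vectorizer.transform([text]))[0])

    # Voice branch: save the upload, extract Librosa features, run the CNN.
    # (For brevity the full recording is treated as a single clip here.)
    if audio_file:
        path = "/tmp/upload.wav"
        audio_file.save(path)
        signal, sr = librosa.load(path, sr=22050)
        feats = extract_features(signal, sr).reshape(1, -1, 1)
        probs.append(cnn_model.predict(feats)[0])

    # Ensemble: average the per-class probabilities from the available branches.
    emotion = EMOTIONS[int(np.argmax(np.mean(probs, axis=0)))]

    # recommend_content is a hypothetical wrapper around the external API calls.
    return jsonify({"emotion": emotion, "recommendations": recommend_content(emotion)})
```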
Results
Text-Based Emotion Recognition: Achieved an accuracy of 85% using the Random Forest model. Precision and recall metrics were balanced across emotions.
Voice-Based Emotion Recognition: The CNN model achieved 80% accuracy. Acoustically similar emotions, such as fear and anger, were occasionally misclassified.
Multimodal Integration: Combining predictions from text and voice analysis improved accuracy to 90%, demonstrating the robustness of the multimodal approach.
Applications
This chatbot has wide-ranging applications across various industries:
- Healthcare: Monitoring patients' emotional well-being in therapy sessions.
- Customer Service: Enhancing customer satisfaction with adaptive emotional responses.
- Education: Delivering personalized learning resources based on students' emotional states.
Future Work
Future enhancements for the system include:
- Real-time processing for low-latency emotion detection.
- Incorporating multilingual support to cater to a diverse audience.
- Expanding to video-based emotion recognition for a more comprehensive understanding of users' emotions.
Conclusion
This multimodal chatbot successfully integrates text and voice analysis for emotion recognition. By pairing accurate emotion detection with personalized recommendations, it demonstrates strong potential for real-world applications.