This is a grant proposal for imbalanced data classification. Done at Virginia Tech.
Abstract: In machine learning, document retrieval is typically framed as a data classification problem. Document classification is the task of assigning a predefined category to a document. When the goal is to retrieve rare documents, the task becomes an imbalanced data problem, which is common in real-world applications. The imbalanced class problem refers to highly skewed class distributions in which the minority classes have only a small number of instances. Traditional machine learning methods usually fail to provide a generic approach to this problem, and misclassifying rare documents may incur extreme costs.
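To illustrate why skewed class distributions are hard to handle, consider a hypothetical collection of 1,000 documents in which only 10 are rare. The sketch below scores a trivial classifier that always predicts the majority class. The counts, labels, and function names are illustrative assumptions, not data or code from this project.

```python
# Hypothetical collection: 990 common documents (label 0), 10 rare ones (label 1).
labels = [0] * 990 + [1] * 10

# A trivial baseline that always predicts the majority class.
predictions = [0] * len(labels)


def accuracy(y_true, y_pred):
    """Fraction of all documents labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of the positive (rare) documents that were retrieved."""
    preds_for_positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    if not preds_for_positives:
        return 0.0
    return sum(p == positive for p in preds_for_positives) / len(preds_for_positives)


majority_accuracy = accuracy(labels, predictions)  # 990 of 1000 correct
minority_recall = recall(labels, predictions)      # 0 of 10 rare documents found
```

Under these assumed counts the baseline is 99% accurate yet retrieves none of the rare documents, which is why accuracy alone is a poor measure for this kind of task.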
To address this problem, this project develops a data-driven approach to document retrieval using deep learning techniques and creates a model that is robust to the imbalanced nature of the data. The project aims to answer the following real-world questions: (1) How can imbalanced data be handled in document retrieval? (2) What is an effective data-driven approach that avoids artificial oversampling and does not alter the data distribution? (3) How can anomalous patterns be detected in documents belonging to the minority class? (4) How can the proposed approach be extended to other applications, and what are the benefits? (5) How can the model be employed in a deployed real-time system?
Our proposed approach jointly optimizes the model parameters and the cost model using a novel triplet loss function combined with an attention mechanism. The loss function uses the attention mechanism to focus on anomalous patterns in the minority data, and it is designed to capture inter- and intra-class features simultaneously. The proposed method strikes a balance between data-level methods (oversampling and under-sampling) and cost-sensitive algorithms by leveraging the attention mechanism to observe the characteristics of the minority data (rare documents) more closely. Because the cost function operates on the extracted features rather than on the data itself, the method can also be extended to other data types.
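For readers unfamiliar with triplet losses, the sketch below shows the standard triplet margin loss, max(0, d(a, p) - d(a, n) + margin), computed on feature vectors. It is not the novel loss proposed here: the attention component and the joint cost optimization are not shown, and the function names, the margin value, and the example vectors are all assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on extracted features.

    Pulls the anchor toward a same-class sample (intra-class term) and
    pushes it away from an other-class sample (inter-class term).
    """
    intra = euclidean(anchor, positive)
    inter = euclidean(anchor, negative)
    return max(0.0, intra - inter + margin)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]        # same class as the anchor, distance 1
easy_negative = [3.0, 4.0]   # other class, distance 5
hard_negative = [0.0, 1.5]   # other class, distance 1.5

easy_loss = triplet_margin_loss(anchor, positive, easy_negative)  # 0.0
hard_loss = triplet_margin_loss(anchor, positive, hard_negative)  # 0.5
```

The loss is zero once the other-class sample is farther from the anchor than the same-class sample by at least the margin, so only triplets that violate this condition contribute to training. Because the inputs are feature vectors, the same computation applies regardless of whether the features come from text, images, or audio.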
The proposed method not only improves imbalanced data classification but also introduces a new way to learn deep feature representations, and it can be extended to many other tasks and application areas such as computer vision and speech recognition. Furthermore, the proposed algorithm works for both binary and multi-class classification problems. Owing to the design of the triplet loss function, the model can be used in a deployed system for both classification and verification.
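One way a distance-based embedding could serve both purposes is sketched below: classification by nearest class centroid, and verification by thresholding the distance between two embeddings. The proposal does not specify this mechanism; the centroid values, the threshold, and all function names are hypothetical.

```python
import math


def distance(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def classify(embedding, centroids):
    """Classification: return the label whose centroid is nearest."""
    return min(centroids, key=lambda label: distance(embedding, centroids[label]))


def verify(embedding_a, embedding_b, threshold):
    """Verification: True if the two embeddings are within the threshold."""
    return distance(embedding_a, embedding_b) <= threshold


# Hypothetical class centroids in a 2-D embedding space.
centroids = {"common": [0.0, 0.0], "rare": [4.0, 4.0]}
query = [3.5, 4.2]

predicted_label = classify(query, centroids)              # "rare"
matches_rare = verify(query, [4.0, 4.0], threshold=1.0)   # True
matches_common = verify(query, [0.0, 0.0], threshold=1.0) # False
```

In a deployed setting the threshold would have to be tuned on held-out data, since it controls the trade-off between false accepts and false rejects.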