Lifelong Representation Learning
Our research focuses on unsupervised (or self-supervised) domain representation learning, because relying on strong supervision from humans is typically not intelligent enough. Even though deep learning (DL) generalizes well when given large amounts of data, it still suffers (and may always suffer) from the domain problem in two respects: (1) the data distribution shifts in an end task, and certain types of data have never been seen before; (2) the majority dominates the representation, so domain-specific representations are squeezed out, only to be challenged later by a domain task.
Most NLP tasks are defined on formal writing such as Wikipedia articles, while informal texts, although available in huge amounts, are largely ignored. As a result, representations learned from formal texts are challenged by domains with informal texts (which may differ in writing style or be rich in opinions). Consequently, such representations are not general enough.
At the word level, we show that such a change of domain makes word embeddings trained on Wikipedia unsuitable for reviews.
For example, some well-known word embeddings are trained on web pages from around the world, which may include a certain number of reviews. But reviews are still a small fraction of all web pages, and 300-dimensional embeddings may not have enough room to retain those minor details. We demonstrated that even GloVe is not perfect on review-based tasks.
What is even more interesting is that the domain problem is word-by-word, not domain-by-domain. For general stop words, we believe the representation learned from the natural distribution of web pages wins, because those words are used in every domain and GloVe provides a generality that a domain corpus may not. For domain-specific words, we do not want distributions from irrelevant domains to bias them. For example, in a Laptop domain we may want bright to be closely associated with screen or keyboard, but not with sun. Thus, the ideal case is to combine the general and domain representations, and we show a simple way to do that.
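The text above does not say how the two representations are combined. As a minimal sketch, assume the simplest option: concatenating a general-purpose vector with a domain vector for each word. The vectors, the dictionary names (GENERAL, DOMAIN) and the zero-padding rule for words missing from the domain corpus are all hypothetical and chosen only for illustration.

```python
import math

# Toy, made-up 2-d vectors for illustration only; real general-purpose
# (e.g. GloVe) and domain embeddings would be loaded from trained models.
GENERAL = {"bright": [0.9, 0.1], "sun": [0.8, 0.2], "screen": [0.1, 0.9]}
# Hypothetical laptop-domain embeddings; "sun" never occurs in this corpus.
DOMAIN = {"bright": [0.2, 0.8], "screen": [0.3, 0.9]}
DOMAIN_DIM = 2


def combine(word, general=GENERAL, domain=DOMAIN, domain_dim=DOMAIN_DIM):
    """Concatenate general and domain vectors; zero-pad words unseen in the domain."""
    return general[word] + domain.get(word, [0.0] * domain_dim)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


if __name__ == "__main__":
    for other in ("screen", "sun"):
        before = cosine(GENERAL["bright"], GENERAL[other])
        after = cosine(combine("bright"), combine(other))
        print(f"bright vs {other}: general-only={before:.3f}, combined={after:.3f}")
```

With these toy numbers, adding the domain half pulls bright toward screen and away from sun, while a word that only the general embedding knows (sun) still keeps a usable vector.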
Similar issues arise in contextualized representation learning. Given the complexity of contextualized representation learning, we propose adding more intermediate stages of training (between pre-training and fine-tuning) for contextualized word representations.
Via an ablation study, we show that a language model (BERT) pre-trained on Wikipedia and BookCorpus cannot handle 3 review-based tasks well. We introduce a post-training stage to adapt BERT to a specific domain (the more specific the better). The idea is simple: many training examples, such as "the [MASK] is bright", do not contain much learnable knowledge on their own. But if we constrain the domain to laptops, you can probably guess that [MASK] is screen.
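The masked-word intuition can be illustrated with a toy count-based estimate. This is not BERT and not the post-training procedure itself: it only counts which words fill the masked slot in two tiny, made-up corpora, to show how restricting the corpus to one domain sharpens the distribution over the masked word. The corpora, the template string and the function name are hypothetical.

```python
from collections import Counter

# Made-up sentences for illustration only.
GENERAL_CORPUS = [
    "the sun is bright",
    "the future is bright",
    "the screen is bright",
    "the moon is bright",
]
LAPTOP_CORPUS = [
    "the screen is bright",
    "the screen is bright",
    "the keyboard is bright",
    "the battery is weak",
]
TEMPLATE = "the [MASK] is bright"


def mask_fill_distribution(corpus, template):
    """Estimate P(word | template) by counting sentences that match the template."""
    left, right = template.split("[MASK]")
    left_toks, right_toks = left.split(), right.split()
    slot = len(left_toks)  # index of the masked position
    counts = Counter()
    for sentence in corpus:
        toks = sentence.lower().split()
        if len(toks) != slot + 1 + len(right_toks):
            continue
        if toks[:slot] == left_toks and toks[slot + 1:] == right_toks:
            counts[toks[slot]] += 1
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


if __name__ == "__main__":
    print("general:", mask_fill_distribution(GENERAL_CORPUS, TEMPLATE))
    print("laptop: ", mask_fill_distribution(LAPTOP_CORPUS, TEMPLATE))
```

In the general toy corpus every candidate is equally likely, so the example teaches little; in the laptop corpus most of the probability mass falls on screen, which is the kind of domain knowledge a post-training stage is meant to capture.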
General unsupervised learning (such as BERT) may lead to representations that are general across a wide spectrum of tasks but less specific to an end task. To address this, we further add a pre-tuning stage for a specific task. This matters more in contextualized representation learning, because whether certain contexts are important is very task-specific. We show that limited supervised training data for an end task is not enough to learn a sufficient task representation, and that methods to bridge the gap are needed.