Task and Domain Representation Learning

Our research focuses on unsupervised (or self-supervised) domain representation learning, since relying on strong human supervision falls short of what we expect from AI. Even though deep learning (DL) generalizes well when given large amounts of data, it still suffers (and may always suffer) from the domain problem, which we view from two perspectives: (1) the data distribution shifts in an end task, so certain types of data have never been seen before; (2) the majority dominates the representation, so domain-specific representations are squeezed out and are later challenged by a domain task.

For each perspective, we focus on two levels of domain representation, driven by two major breakthroughs in deep learning research for NLP (word embeddings and contextualized representation learning).

Out-of-Distribution Domain Representation Learning

Although most NLP tasks are defined on formal writing such as Wikipedia articles, informal texts exist in huge amounts yet are largely ignored. As a result, representations learned from formal texts are challenged by domains with informal texts (which may differ in writing style or contain rich opinions). Consequently, such representations are not general enough.

Word Level

On the word level, we show that this change of domain makes word embeddings trained on Wikipedia unsuitable for reviews.

Context Level

Something similar happens in contextualized representation learning. Via an ablation study, we show that a language model (BERT) pre-trained on Wikipedia and BookCorpus cannot handle three review-based tasks well.

Domain Representation Learning for Minor Data

Domain issues also arise when the domain data is present in the training corpus but is not in the majority. This happens either because the minor domain is challenged by the end task, or because the representation has a fixed size, so minor data must make room for major data.

For example, on the word level, some well-known word embeddings are trained on web pages from across the world, in which a certain number of reviews may be present. But reviews are still a small fraction of all web pages, and 300-dimensional embeddings may not have enough room to store those minor details. We demonstrated that even GloVe is not perfect on review-based tasks.

What is even more interesting is that the domain problem is word-by-word, not domain-by-domain. For general stop words, we believe the representation from the natural distribution of web pages wins, since those words are used in every domain and GloVe provides a generality that a domain corpus may not. For domain-specific words, we do not want distributions from irrelevant domains to bias them. For example, in a Laptop domain we may want "bright" to be closely associated with "screen" or "keyboard" but not with "sun". Thus, the ideal case is to combine the two, and we show a simple way to do that.
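
As an illustration of what combining the two could look like, here is a minimal sketch that assumes the combination is a per-word concatenation of a general-purpose vector and a domain-specific vector. The text above does not specify the method, so this is only one plausible reading. The toy vectors, their dimensions, and the names general_emb, domain_emb, lookup, and combine are all hypothetical.

```python
import numpy as np

# Toy, made-up vectors for illustration only
# (these are NOT real GloVe or domain-trained embeddings).
general_emb = {
    "the": np.array([0.1, 0.2, 0.3]),
    "bright": np.array([0.9, 0.1, 0.0]),
}
domain_emb = {
    "bright": np.array([0.2, 0.8]),
    "screen": np.array([0.3, 0.7]),
}


def lookup(table, word, dim):
    """Return the word's vector, or a zero vector if the word is missing."""
    return table.get(word, np.zeros(dim))


def combine(word, general=general_emb, domain=domain_emb, g_dim=3, d_dim=2):
    """Concatenate the general and domain vectors for one word (assumed scheme)."""
    return np.concatenate(
        [lookup(general, word, g_dim), lookup(domain, word, d_dim)]
    )


# "bright" gets both a general part and a Laptop-domain part.
vec = combine("bright")
```

Under this assumed scheme, a downstream model sees both views of every word and can learn which part to rely on: the general part for a stop word such as "the", the domain part for a word such as "bright".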

Task Representation Learning

From a more general perspective, the domain is just one aspect of an end task. Since we focus on unsupervised learning to avoid heavy human annotation effort for a specific task, the resulting representations may be more general across a wide spectrum of tasks but less specific to any one end task. This matters more in contextualized representation learning, because whether a given context is important is very task-specific. We show that a small amount of supervised training data for an end task is not enough to learn a sufficient task representation, and that methods to bridge the gap are needed.