

Deep learning (DL) has achieved significant improvements over the past few years, with back-propagation serving as the core driving force that automatically learns features or representations from multiple parameterized layers. This finally bridges the gap between raw inputs (pixels in CV, character sequences in NLP) and the output of an ML task. As such, parameter-intensive DL models can consume far more data than traditional ML models, allowing for data-intensive generalization and, as a result, better performance.

Learning with less human effort

Looking forward, there is no free lunch for an ML model, and DL is not perfect. Since AI aims to free humans from intensive labor, we naturally place constraints (needs) on an ML model: neither intensive data annotation nor per-task architecture design is desirable. This rules out strongly supervised learning with tens of thousands of labeled examples, as well as architecture-rich but parameter-poor models. In the end, we are looking for simple, general architectures with a huge number of parameters that unlabeled data can fill in. This leads to unsupervised (or rather self-supervised, as no learning is perfectly unsupervised) representation learning, where training signals are discovered from the input itself.
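The idea of discovering training signals from the input itself can be sketched with a toy masked-token pretext task; the function and names below are illustrative, not any particular library's API:

```python
# Minimal sketch of a self-supervised training signal: masked-token
# prediction. The (input, target) pairs are derived from raw text alone,
# with no human annotation. All names here are hypothetical.

import random

def make_masked_pairs(tokens, mask_rate=0.3, mask_token="[MASK]", seed=0):
    """Derive (masked_sequence, targets) from an unlabeled token sequence."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok  # the model must recover the original token
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_masked_pairs(tokens)
# `targets` maps masked positions back to the hidden tokens: labels for free.
```

A model trained to fill in `targets` from `masked` learns representations without a single annotated example, which is the sense in which the supervision comes from the data itself.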

Statistical Generalization

Even so, the generalization a DL model achieves is still biased by the statistics of its intensive training data. By statistics, we mean that the majority wins in representation learning. But in the real world, the long tail of many specifics determines performance. What is even worse, the trained agent faces a dynamic world, where data that seldom or never appeared before may come to dominate the distribution later. Two directions we focus on are task representation learning and open-world learning.

In contrast, the i.i.d. assumption behind an ML model leads us into a frequent mistake in research. We drop the timestamp far too readily in testing, and 99% of existing datasets lose the temporal dependency among examples. By temporal dependency, I mean a USB 3.0 device in training but a USB 2.0 device in testing, or an iPhone case in training but an iPhone in testing of a recommender system. Randomly shuffling the available data (to force the test distribution to resemble the training distribution) and splitting off 20% for testing is totally wrong. This may explain why a model with good test performance performs poorly under human evaluation or real-world deployment. Obviously, a model trained on such a time-distorted dataset can learn easy post-hoc facts rather than recovering the hidden causal mechanism that generates data in the real world.
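The contrast between the flawed random split and a time-respecting split can be made concrete; the dataset and field names below are hypothetical:

```python
# Sketch contrasting a random 80/20 split with a temporal split.
# The `examples` records and their "timestamp" field are made up for
# illustration.

import random

def random_split(examples, test_frac=0.2, seed=0):
    """The flawed approach: shuffle, then hold out 20%. Test examples can
    predate training examples, leaking future information backwards."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def temporal_split(examples, test_frac=0.2):
    """Sort by timestamp and hold out the most recent 20%, so the model is
    always evaluated on data from after its training window."""
    ordered = sorted(examples, key=lambda ex: ex["timestamp"])
    cut = int(len(ordered) * (1 - test_frac))
    return ordered[:cut], ordered[cut:]

examples = [{"timestamp": t, "item": f"item-{t}"} for t in range(100)]
train, test = temporal_split(examples)
# Every test timestamp is strictly later than every training timestamp,
# so nothing from the "future" leaks into training.
```

With `random_split`, by contrast, there is no such guarantee: the iPhone case and the iPhone can land on either side of the cut, regardless of which came first.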