Tuesday, February 25, 2020

A little bit of research and thought on Inductive bias


I stumbled upon a new term, inductive bias, while reading an article about Transfer Learning [1], and found it quite interesting. Its definition is as follows.

The inductive bias (also known as learning bias) of a learning algorithm is the set of assumptions that the learner uses to predict outputs given inputs that it has not encountered [2].

Simply put, inductive bias comes into play whenever we try to generalize a model to unseen observations. Training a model ultimately comes down to generalizing from a sample, and to do that we need a set of assumptions about the outside world; otherwise, it would be almost impossible to predict unseen data.

For example, we implicitly assume that the distribution of the population is similar to that of the sample, so that our predictions won't have much error. In other words, an observation in the population should come from the same conditional distribution as the sample used to train the model. This is arguably a foundation of statistics and one example of inductive bias.
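To make this concrete, here is a minimal sketch (using scikit-learn, my choice of tool here, not something from the referenced articles) of what happens when the assumption holds and when it breaks: a classifier trained on one sample keeps its accuracy on fresh data from the same distribution, but degrades once the population shifts.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Two Gaussian classes centered at -1 and +1, optionally shifted."""
    x0 = rng.normal(-1.0 + shift, 1.0, size=(n, 1))
    x1 = rng.normal(+1.0 + shift, 1.0, size=(n, 1))
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X_train, y_train = sample(500)
clf = LogisticRegression().fit(X_train, y_train)

X_same, y_same = sample(500)               # fresh data, same distribution
X_shift, y_shift = sample(500, shift=2.0)  # the population has moved

print("accuracy, same distribution   :", clf.score(X_same, y_same))
print("accuracy, shifted distribution:", clf.score(X_shift, y_shift))
```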

Another interesting example of inductive bias is Occam's razor, which states that simpler hypotheses are more likely to be true [2]. It, too, posits a specific characteristic of the outside world: the target function (hypothesis) should be simple in order to generalize well.
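As a quick illustration of this bias, here is a small sketch (plain numpy; the degrees, sample sizes, and noise level are arbitrary choices of mine) where a simple degree-1 polynomial typically generalizes better on noisy linear data than a far more flexible one, even though the flexible model fits the training points more closely.

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = 1.5 * x_train + rng.normal(0, 0.2, size=30)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = 1.5 * x_test + rng.normal(0, 0.2, size=200)

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, deg=degree)  # least-squares fit
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: test MSE = {test_mse:.4f}")  # simpler usually wins
```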

Linear regression is another good example of inductive bias. It assumes that "the relationship between the attributes x and the output y is linear. The goal is to minimize the sum of squared errors" [4].
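For instance, the sketch below (plain numpy, with made-up synthetic data) encodes exactly that bias: it posits y ≈ Xw and picks the w that minimizes the sum of squared errors, here via the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # bias column + one attribute
true_w = np.array([0.5, 2.0])
y = X @ true_w + rng.normal(0, 0.1, size=n)

# Normal equations: w = (X^T X)^{-1} X^T y minimizes ||y - Xw||^2.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated weights:", w_hat)  # close to [0.5, 2.0]
```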

From these readings, I would argue that a stronger inductive bias lets us achieve the same or comparable performance with less data, as long as the bias reflects the outside world well. From this point of view, it seems natural that black-box models such as neural networks require much more data than models with stronger biases, for example statistical models that assume a particular probability distribution or stochastic process [5].

The benefit of an effective inductive bias is clear: it leads to significant gains in data efficiency [6]. With the help of an inductive bias, a model can narrow down the hypothesis space it explores, reducing the required resources and training data. In settings where obtaining labeled data is very hard or prohibitively costly, this reduction is all the more valuable.

The next question is how we inject inductive bias into our models. There seem to be two ways [5]:

1) If there is symmetry in the input space, exploit it.
If your problem has an invariance, such as translation invariance in CNNs, the model can be designed to exploit it. The hard part is discovering the invariances in your own domain problem and using them properly, which is done through careful exploratory data analysis and feature engineering. Sharing parameters, such as the kernels in a CNN, is another way to exploit symmetry in the input space [7]; the sketch below illustrates this.
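As a rough illustration, the following sketch (plain numpy; the kernel and image are toy values I made up) slides a single shared 3x3 kernel over an image and checks that shifting the input simply shifts the response, which is the translation property CNNs bake in.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: one shared kernel slid over every position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
kernel = rng.normal(size=(3, 3))  # the SAME weights are reused everywhere

image = np.zeros((8, 8))
image[2, 2] = 1.0                 # a lone "feature" at position (2, 2)
shifted = np.roll(image, shift=(1, 1), axis=(0, 1))  # same feature at (3, 3)

# Shifting the input shifts the response by the same amount (the feature
# sits away from the borders, so no wrap-around effects interfere here).
out, out_shifted = conv2d(image, kernel), conv2d(shifted, kernel)
print(np.allclose(np.roll(out, shift=(1, 1), axis=(0, 1)), out_shifted))  # True
```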

2) When you know the generative process, exploit it.
I think this approach is very common nowadays. Given a set of initial settings, we produce estimates and compare them with the actual observations. If there is an error, the learning algorithm iteratively reduces it, producing closer and closer estimates based on the generative process we assume, until the estimate is sufficiently good.
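Here is a minimal sketch of that loop (plain numpy; the exponential-decay process, learning rate, and iteration count are all assumptions of mine for illustration): we posit y = A * exp(-k * t) + noise, produce estimates from the current (A, k), compare them with the observations, and let gradient descent shrink the error.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 4.0, 200)
y = 2.0 * np.exp(-1.5 * t) + rng.normal(0, 0.02, size=t.size)  # observations

A, k = 1.0, 1.0  # initial settings
lr = 0.5         # learning rate
for _ in range(3000):
    y_hat = A * np.exp(-k * t)  # estimates from the assumed process
    error = y_hat - y           # compare with the actual observations
    A -= lr * np.mean(2.0 * error * np.exp(-k * t))             # dMSE/dA
    k -= lr * np.mean(2.0 * error * (-A * t) * np.exp(-k * t))  # dMSE/dk

print(f"A = {A:.3f}, k = {k:.3f}")  # should approach the true 2.0 and 1.5
```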

I don't know much about the research in these areas, but I imagine that if we could inject knowledge from a domain expert into a neural network, we might get better results from our models. Or we could build a general framework in which we design a whole new model that conforms to the inductive bias derived from our domain knowledge about the data-generating process, probably based on deep neural networks.

Beyond machine learning, this principle applies to other domains as well. For example, when I design a data structure, I often make several assumptions about the input data; without them, it is almost impossible to design an efficient structure that maximizes performance. Those assumptions are another example of inductive bias. Moreover, the fact that we have to assume something, and therefore cannot be best in every case, has something in common with the No Free Lunch theorem [8].
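A direct-address table is a tiny sketch of such an assumption (plain Python; the capacity bound is a made-up example): if we assume keys are small non-negative integers, a plain array beats a general-purpose hash map, but it fails the moment the assumption does.

```python
class DirectAddressTable:
    """Maps integer keys in [0, capacity) to values in O(1), no hashing."""

    def __init__(self, capacity):
        self.slots = [None] * capacity

    def put(self, key, value):
        self.slots[key] = value  # assumes 0 <= key < capacity

    def get(self, key):
        return self.slots[key]

table = DirectAddressTable(capacity=1000)
table.put(42, "answer")
print(table.get(42))        # 'answer'
# table.put(10**9, "boom")  # assumption violated -> IndexError
```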

Last but not least, when we take advantage of an inductive bias, we should be very careful that the bias continues to hold after deployment. The situation may change so that the bias no longer applies. It is therefore necessary to monitor the ever-changing trend in some way, for example by watching for errors drifting in a certain direction.
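One simple way to do this, sketched below in plain Python (the window size and threshold are arbitrary placeholders, not a prescription), is to track a rolling mean of prediction errors and raise a flag when it strays from zero.

```python
from collections import deque

class DriftMonitor:
    def __init__(self, window=100, threshold=0.5):
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, prediction, actual):
        self.errors.append(prediction - actual)
        mean_error = sum(self.errors) / len(self.errors)
        # A rolling mean far from zero suggests errors drifting in one direction.
        return abs(mean_error) > self.threshold

monitor = DriftMonitor(window=100, threshold=0.5)
for prediction, actual in [(1.0, 0.9), (2.0, 1.2), (3.0, 1.8)]:
    if monitor.observe(prediction, actual):
        print("warning: errors are drifting, the bias may no longer hold")
```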

References