Nat Commun. 2017 Jul 26;8(1):138. doi: 10.1038/s41467-017-00181-8.

Efficient probabilistic inference in generic neural networks trained with non-probabilistic feedback

A Emin Orhan et al. Nat Commun.

Abstract

Animals perform near-optimal probabilistic inference in a wide range of psychophysical tasks. Probabilistic inference requires trial-to-trial representation of the uncertainties associated with task variables and subsequent use of this representation. Previous work has implemented such computations using neural networks with hand-crafted and task-dependent operations. We show that generic neural networks trained with a simple error-based learning rule perform near-optimal probabilistic inference in nine common psychophysical tasks. In a probabilistic categorization task, error-based learning in a generic network simultaneously explains a monkey's learning curve and the evolution of qualitative aspects of its choice behavior. In all tasks, the number of neurons required for a given level of performance grows sublinearly with the input population size, a substantial improvement on previous implementations of probabilistic inference. The trained networks develop a novel sparsity-based probabilistic population code. Our results suggest that probabilistic inference emerges naturally in generic neural networks trained with error-based learning rules.

Behavioural tasks often require inferring probability distributions over task-specific variables. Here, the authors demonstrate that generic neural networks can be trained with a simple error-based learning rule to perform such probabilistic computations efficiently, without any need for task-specific operations.
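To make the training scheme concrete, the following is a minimal sketch of the recipe described above: Poisson input populations with trial-varying gains feed a generic one-hidden-layer network trained by plain stochastic gradient descent on squared error (the "simple error-based learning rule"). All names, sizes, and hyperparameters here are illustrative choices of ours; the paper's actual settings are given in its Methods.

```python
import numpy as np

rng = np.random.default_rng(0)

def poisson_population(s, gain, n=20, prange=(-10, 10), width=2.0):
    """Poisson population code: Gaussian tuning curves scaled by a gain."""
    prefs = np.linspace(*prange, n)
    rates = gain * np.exp(-0.5 * ((s - prefs) / width) ** 2)
    return rng.poisson(rates).astype(float)

def make_trial():
    """Two-cue combination: both populations encode the same stimulus s."""
    s = rng.uniform(-5, 5)
    g1, g2 = rng.uniform(0.5, 3.0, size=2)      # reliability varies per trial
    x = np.concatenate([poisson_population(s, g1), poisson_population(s, g2)])
    return x, s

# Generic one-hidden-layer network, trained online with SGD on squared error.
d, h, lr = 40, 100, 1e-4
W1 = rng.normal(0, 1 / np.sqrt(d), (h, d)); b1 = np.zeros(h)
w2 = rng.normal(0, 1 / np.sqrt(h), h); b2 = 0.0

for _ in range(100_000):
    x, s = make_trial()
    a = np.maximum(W1 @ x + b1, 0.0)    # rectified-linear hidden layer
    y = w2 @ a + b2                     # linear output (continuous task)
    err = y - s                         # scalar, non-probabilistic feedback
    w2 -= lr * err * a                  # backpropagated gradient steps
    b2 -= lr * err
    da = err * w2 * (a > 0)
    W1 -= lr * np.outer(da, x)
    b1 -= lr * da
```

After training, the network's output tracks the posterior mean of s, with more reliable (higher-gain) cues weighted more heavily, even though the feedback never contained explicit uncertainty information.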


Conflict of interest statement

The authors declare no competing financial interests.

Figures

Fig. 1
General task set-up and network architectures. a General task set-up. Input populations encode possibly distinct stimuli, s_i, with Poisson noise. The amount of noise was controlled through multiplicative gain variables, g_i, which varied from trial to trial. b Training and test conditions. In the "all g" condition, the networks were trained on all possible gain combinations (represented by the blue tiles), whereas in the "restricted g" condition, they were trained on a small subset of all possible gain combinations. In both conditions, the networks were then tested on all possible gain combinations. c–i Network architectures used for the seven main tasks. Different colors represent different types of units. For tasks with continuous output variables (c–e), linear output units were used, whereas for tasks with categorical output variables (f–i), sigmoidal output units were used.
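As a rough illustration of the training conditions in panel b, the sketch below draws gain combinations either from the full grid ("all g") or from a small subset ("restricted g"); the gain levels and the choice of subset are placeholders, not the paper's values.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
gains = [0.5, 1.0, 1.5, 2.0, 2.5]                  # allowed per-input gain levels
all_g = list(itertools.product(gains, repeat=2))   # every (g1, g2) tile in the grid
restricted_g = [(g, g) for g in gains]             # e.g. matched gains only

def sample_gains(condition):
    """Draw one (g1, g2) pair for a training trial."""
    pool = all_g if condition == "all g" else restricted_g
    return pool[rng.integers(len(pool))]

# Testing always covers the full grid, in both conditions:
test_conditions = all_g
```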
Fig. 2
Performance of well-trained networks in the main tasks. a Optimal estimates vs. the network outputs in the "all g" conditions of the continuous tasks. Error bars represent standard deviations over trials. b Posterior probability of a given choice vs. the network output in the "all g" conditions of the categorical tasks. Error bars represent standard deviations over trials. For the SD task, where the output was four-dimensional, only the marginal posterior along the first dimension is shown for clarity. c Performance in continuous tasks. d Performance in categorical tasks. Blue bars show the performance of non-probabilistic heuristic models that do not take uncertainty into account. Note that optimal performance in the CT task does not require taking uncertainty into account (see "Methods" section). Magenta bars show the performance of the hand-crafted networks for categorical tasks reported in earlier works. The asterisk in the VS task indicates that the information loss value reported in ref. should be taken as a lower bound on the actual information loss, since the authors of that paper were not able to build a single network that solved the full visual search task. In c, d, error bars (gray) represent means and standard errors over 10 independent runs of the simulations. CC cue combination, CT coordinate transformation, KF Kalman filtering, BC binary categorization, SD stimulus demixing, CI causal inference, VS visual search. Categorical tasks are labeled in green, continuous tasks in orange.
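For the continuous tasks, a plausible reading of the fractional RMSE measure is the network's excess RMSE over the optimal estimator, expressed as a percentage of the optimal RMSE; the paper's exact definition is in its Methods, so treat the sketch below as an assumption.

```python
import numpy as np

def fractional_rmse(net_est, opt_est, s_true):
    """Excess RMSE of the network over the optimal (posterior-mean)
    estimator, as a percentage of the optimal RMSE (definition assumed)."""
    rmse_net = np.sqrt(np.mean((np.asarray(net_est) - s_true) ** 2))
    rmse_opt = np.sqrt(np.mean((np.asarray(opt_est) - s_true) ** 2))
    return 100.0 * (rmse_net - rmse_opt) / rmse_opt
```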
Fig. 3
Modular tasks for testing the encoding of posterior uncertainty. a Schematic diagram illustrating the design of the modular tasks. In stage 1, a single-hidden-layer network is trained on either cue combination (CC) or coordinate transformation (CT). In stage 2, the hidden layer of the trained network is plugged into another network, and the combined network is trained on a three-input CC task with the parameters of the first network held fixed. b Optimal estimates vs. network outputs for the three-input CC task. Error bars represent standard deviations over trials. CT + CC denotes the case where the first network was trained on the coordinate transformation task; CC + CC denotes the case where the first network was trained on the two-input CC task. c Fractional RMSEs. Error bars represent standard errors over 10 independent runs of the simulations. The combined networks perform the three-input CC task near-optimally, suggesting that the initial networks encode posterior uncertainty for the first two inputs.
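A minimal sketch of the stage-2 wiring, assuming rectified-linear layers: the stage-1 hidden layer is evaluated with frozen parameters, and only the new combiner parameters would receive gradient updates during stage-2 training. All names and shapes are illustrative.

```python
import numpy as np

def combined_forward(x12, x3, W1, b1, V, c, w_out, b_out):
    """Stage-2 forward pass for the modular three-input CC task.

    (W1, b1): stage-1 network, frozen after stage-1 training.
    (V, c, w_out, b_out): new combiner layers, the only trained parameters.
    x12: concatenated responses of input populations 1 and 2.
    x3:  responses of the third input population.
    """
    h1 = np.maximum(W1 @ x12 + b1, 0.0)   # frozen stage-1 hidden layer
    h2 = np.maximum(V @ np.concatenate([h1, x3]) + c, 0.0)
    return w_out @ h2 + b_out             # linear output: estimate of s
```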
Fig. 4
Alternative representations of sensory reliability and generalization to unseen gain conditions. a Generalization capacity of neural networks in the cue combination task: the weight assigned to cue 1 in cue-conflict trials as a function of the ratio of the input gains, g_1/g_2. In cue-conflict trials, s_1 was first drawn uniformly from an interval of length l; s_2 was then generated as s_1 + Δ, where Δ was one of −2l, −3l/2, −l, −l/2, l/2, l, 3l/2, and 2l. The cue weight assigned by the network was calculated through the equation ŝ = w ŝ_{1,opt} + (1 − w) ŝ_{2,opt}, where ŝ is the network output and ŝ_{1,opt} and ŝ_{2,opt} are the optimal estimates of s_1 and s_2. Note that relatively high gains are used in this particular example to make the optimal combination rule approximately linear; in the low-gain conditions used in the main simulations, the optimal combination rule is no longer linear. The network experienced only 50 training examples in the "restricted g" condition with non-conflicting cues, hence it did not see any of the conditions shown here during training. Error bars represent standard deviations over 1000 simulated trials. b Tuning functions for motion direction as reported in refs. , and for speed as reported in ref. , where the stimulus contrast or coherence variable c does not act multiplicatively on the mean firing rates. c Fractional RMSEs in CC tasks with the tuning functions shown in b. The networks perform near-optimal probabilistic inference in both cases, demonstrating the robustness of our approach to variations in the specific form in which stimulus reliability is encoded in the input populations.
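Given the network outputs and the optimal single-cue estimates on the conflict trials, the weight w in the equation above can be recovered by least squares; the sketch below is our own illustration of that regression, not the paper's code.

```python
import numpy as np

def empirical_cue_weight(s_hat, s1_opt, s2_opt):
    """Fit w in  s_hat = w*s1_opt + (1-w)*s2_opt  by least squares.

    Rearranging gives (s_hat - s2_opt) = w * (s1_opt - s2_opt), a
    one-parameter regression through the origin across conflict trials.
    """
    x = np.asarray(s1_opt) - np.asarray(s2_opt)
    y = np.asarray(s_hat) - np.asarray(s2_opt)
    return float(np.dot(x, y) / np.dot(x, x))
```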
Fig. 5
The mechanism underlying the sparsity-based representation of posterior uncertainty. a The variability index T_var (plotted in log units) as a function of the parameters μ, μ_b, and σ_b for the constraint surface corresponding to K = 2. The optimal parameter values within the shown range are represented by the black star. The green and magenta dots roughly correspond to the parameter statistics in the trained coordinate transformation and causal inference networks, respectively. b For the parameter combination corresponding to the green dot, the mean input μ is negative and the mean bias μ_b is positive. Increasing the gain g thus shifts the input distribution to the left and widens it: compare the black and red lines for input distributions with a small and a large gain, respectively. The means of the input distributions are indicated by small dots underneath. c This, in turn, decreases the probability of non-zero responses, but d increases the mean of the non-zero responses; hence, e the mean response of the hidden units, being the product of the two, stays approximately constant as the gain is varied. In the bottom panels of b–e, the results are also shown for a different parameter combination, represented by the magenta dot in a. This parameter combination roughly characterizes the trained networks in the causal inference task. f For a network trained in the coordinate transformation task, distributions of the input-to-hidden layer weights, W_ij; g the biases of the hidden units, b_i; h scatter plot of the mean input 〈r_in〉 vs. the mean hidden unit activity 〈r_hid〉; i scatter plot of the mean input vs. the kurtosis of hidden unit activity, κ(r_hid). j–m Similar to f–i, but for a network trained in the causal inference task.
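The decomposition in panels c–e is easy to reproduce numerically for a single rectified-linear unit, assuming a Gaussian net input whose mean and variance both scale with the gain (our simplification of the mean-field picture; the parameter values are illustrative, not the paper's fitted statistics).

```python
import numpy as np

rng = np.random.default_rng(2)

def relu_gain_stats(g, mu=-1.0, mu_b=1.0, sigma=1.0, sigma_b=0.5, n=1_000_000):
    """Decompose a ReLU unit's mean response at input gain g into
    P(response > 0) * E[response | response > 0].

    Net input ~ Normal(g*mu + mu_b, g*sigma^2 + sigma_b^2): with mu < 0
    and mu_b > 0, raising g shifts the distribution left and widens it.
    """
    z = rng.normal(g * mu + mu_b, np.sqrt(g * sigma**2 + sigma_b**2), n)
    r = np.maximum(z, 0.0)
    p_nz = np.mean(r > 0)                 # panel c: fewer non-zero responses
    mean_nz = r[r > 0].mean()             # panel d: larger non-zero responses
    return p_nz, mean_nz, p_nz * mean_nz  # panel e: their product

for g in (1.0, 2.0, 4.0):
    print(g, relu_gain_stats(g))
```

At the optimal parameters (the black star in a), the two opposing effects nearly cancel, so the product stays roughly gain-invariant while the sparsity of the code changes with gain.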
Fig. 6
Encoding of posterior uncertainty and parameter statistics in the trained networks. a Trial-by-trial correlation between mean hidden unit response and mean input response; b trial-by-trial correlation between the sparsity (kurtosis) of hidden layer activity and mean input response; c mean input-to-hidden unit weight; d mean bias of the hidden units; e standard deviation of hidden unit biases. Parameter statistics are reported in units of the standard deviation of the input-to-hidden layer weights, σ_{W_ij}, to make them consistent with the mean-field analysis, where all variables are measured in units of the standard deviation of the input. CC cue combination, CT coordinate transformation, KF Kalman filtering, BC binary categorization, SD stimulus demixing, CI causal inference, VS visual search. Error bars represent means and standard errors over 10 independent runs of the simulations.
Fig. 7
One-dimensional tuning functions of 10 representative hidden units at two input gains, a in a network trained in the coordinate transformation task and b in a network trained in the causal inference task. In both a and b, the first row shows the tuning functions with respect to s_1 (averaged over s_2) and the second row shows the tuning functions with respect to s_2. Increasing the gain sharpens the tuning curves in the coordinate transformation task, whereas it has a more multiplicative effect in the causal inference task.
Fig. 8
Architectural constraints on near-optimal probabilistic inference. a Schematic diagram of a random feedforward network, where the input-to-hidden layer weights were set randomly and fixed. b Performance of random feedforward networks. The fractional RMSEs in the CC task are over 100%. The y axis is cut off at 100%. Random feedforward networks perform substantially worse than their fully trained counterparts. c Schematic diagram of a recurrent EI network that obeys Dale’s law: inhibitory connections are represented by the red arrows and excitatory connections by the blue arrows; inhibitory neurons are represented by lighter colors, excitatory neurons by darker colors. Both input and hidden layers are divided into excitatory-inhibitory subpopulations at a 4-to-1 ratio. Inputs to the network were presented over 10 time steps. The network’s estimate was obtained from its output at the final time point. d Performance of recurrent EI networks. Introducing more biological realism does not substantially reduce the performance. Error bars in b and d represent standard errors over 10 independent runs of the simulations
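A minimal sketch of how the Dale's-law sign constraints can be imposed: non-negative weight magnitudes are multiplied by a fixed sign determined by the presynaptic unit's type. The 4-to-1 excitatory-to-inhibitory split follows the caption; sizes and initialization are our choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n_in, n_hid, steps = 50, 100, 10

def dale_signs(n, frac_exc=0.8):
    """+1 for excitatory, -1 for inhibitory presynaptic units (4-to-1)."""
    signs = np.ones(n)
    signs[int(frac_exc * n):] = -1.0
    return signs

# Dale's law: each column (presynaptic unit) carries a single fixed sign.
W_rec = np.abs(rng.normal(0, 1/np.sqrt(n_hid), (n_hid, n_hid))) * dale_signs(n_hid)
W_in  = np.abs(rng.normal(0, 1/np.sqrt(n_in),  (n_hid, n_in)))  * dale_signs(n_in)
b = np.zeros(n_hid)

def run(x_seq):
    """Present the input over `steps` time steps; read out at the end."""
    r = np.zeros(n_hid)
    for x in x_seq:
        r = np.maximum(W_rec @ r + W_in @ x + b, 0.0)
    return r

r_final = run([rng.poisson(2.0, n_in).astype(float) for _ in range(steps)])
```

During training, the constraint can be maintained by clipping the weight magnitudes at zero after each gradient step.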
Fig. 9
Error-based learning in generic neural networks accounts for a monkey subject’s performance, but not human subjects’ performance in a probabilistic binary categorization task involving arbitrary categories. a Structure of the two categories. Vertical lines indicate the optimal decision boundaries at low and high noise, σ. b Dependence of the decision boundary on sensory noise, σ, for the optimal model (OPT) and three suboptimal heuristic models. c Cumulative accuracy of monkey L (red) compared with the cumulative accuracy of a neural network trained with the same set of stimuli (blue). Cumulative accuracy at trial t is defined as the accuracy in all trials up to and including t. The network was trained fully on-line. The input noise in the network was matched to the sensory noise estimated for the monkey and the learning rate was optimized to match the monkey’s learning curve (see “Methods” section). Dotted lines indicate the 95% binomial confidence intervals. d The overbar shows the winning models for the monkey’s behavioral data throughout the course of the experiment according to the AIC metric. The QUAD model initially provides the best account of the data. The LIN model becomes the best model after a certain point during training. The area plot below shows the fractions of winning models, as measured by their AIC scores, over 50 neural network subjects trained with the same set of input noise and learning rate parameters as the one shown in c. Similar to the behavioral data, early on in the training, QUAD is the most likely model to win; LIN becomes the most likely model as training progresses. OPT-P is equivalent to an OPT model where the prior probabilities of the categories are allowed to be different from 0.5 (in the experiment, both categories were equally likely). e Average performance of six human subjects in the main experiment of ref. (red), average performance of 30 neural networks whose input noise was set to the mean sensory noise estimated for the human subjects (blue). Error bars represent standard errors across subjects. Human and monkey behavioral data in c and e were obtained from ref.
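The two quantities used to score the models in panels c and d are straightforward to compute; a short sketch using the standard AIC formula, AIC = 2k − 2 ln L (the model-fitting details beyond this are in the paper's Methods):

```python
import numpy as np

def cumulative_accuracy(correct):
    """Accuracy over all trials up to and including trial t (panel c)."""
    c = np.asarray(correct, dtype=float)
    return np.cumsum(c) / np.arange(1, len(c) + 1)

def aic(log_likelihood, n_params):
    """Akaike information criterion; the model with the lowest AIC wins."""
    return 2 * n_params - 2 * log_likelihood
```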
Fig. 10
Low computational complexity of standard psychophysical tasks and the efficiency of generic networks. a For each number of input units, d, the minimum number of hidden units required to reach within 10% of optimal performance (15% for visual search), n*, is estimated (shown by the arrows). An example is shown here for the causal inference task with d = 20 and d = 220 input units, respectively. b n* plotted as a function of the total number of input units, d, in different tasks. Solid lines show linear fits. In these simulations, the number of training trials for each task was set to the maximum number of training trials shown in Supplementary Fig. 1. The growth of n* with d is sublinear in every case.
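Panel a's estimate of n* amounts to a sweep over hidden-layer sizes. A schematic version is below; `frac_rmse_for` is a hypothetical callback that trains a network of the given size and returns its fractional RMSE, standing in for the paper's full training pipeline.

```python
def min_hidden_units(candidate_sizes, frac_rmse_for, tol=10.0):
    """Smallest hidden-layer size whose error is within `tol` percent of
    optimal (10% for most tasks, 15% for visual search). Returns None if
    no candidate size qualifies."""
    for n_hid in sorted(candidate_sizes):
        if frac_rmse_for(n_hid) <= tol:
            return n_hid
    return None
```

Plotting n* against the input population size d then gives the sublinear scaling shown in panel b.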

References

    1. Battaglia PW, Jacobs RA, Aslin RN. Bayesian integration of visual and auditory signals for spatial localization. JOSA A. 2003;20:1391–1397. doi: 10.1364/JOSAA.20.001391.
    2. Ernst MO, Banks MS. Humans integrate visual and haptic information in a statistically optimal fashion. Nature. 2002;415:429–433. doi: 10.1038/415429a.
    3. Hillis JM, Watt SJ, Landy MS, Banks MS. Slant from texture and disparity cues: optimal cue combination. J. Vis. 2004;4:967–992.
    4. Körding K, et al. Causal inference in multisensory perception. PLoS ONE. 2007;2:e943. doi: 10.1371/journal.pone.0000943.
    5. Merfeld DM, Zupan L, Peterka RJ. Humans use internal models to estimate gravity and linear acceleration. Nature. 1999;398:615–618. doi: 10.1038/19303.
