Assumptions.* We consider a fully-connected feedforward network, $f_0(\cdot)$, with a single layer, $N_0$, of $K_0$ linear maps followed by ReLU nonlinearities. In the following, we use the term *feature* for the output of the first layer after the nonlinearities. We split the weights of $f_0(\cdot)$ into two sets: (i) the *target parameters*, $\mathbf{t}_0$, and (ii) the *non-target parameters*, $\mathbf{p}_0$. The total number of non-zero weights in the network is $L_0 = N_0 \times K_0$. We denote by $L_{non-target}$ the size of the non-target parameters, that is, $L_{non-target} = L_0 - K_0$. For all $L_{non-target} \geq 0$, we denote the non-target parameters by $\mathbf{p}_{0,L_{non-target}} \in \mathbb{R}^{N_0}$.

*Learning.* We learn the target parameters, $\mathbf{t}_0$, to achieve the optimal feature representation of the example class, $y$, using a loss function, $\mathcal{L}_0 (\cdot)$, and an accuracy, $\mathcal{A}_0$. We then define the *target feature*, $\mathbf{z}_0 = \mathbf{t}_0 f_0(x)$, as the combination of the target parameters and their contribution to the overall feature representation.

*Uncertainty Quantification.* We consider examples from a mixture of classes, $\mathcal{D}_u$. Each example belongs to a class, $y_u$, with some probability, and we do not know which class it belongs to. We therefore model the uncertainty in the assignment of classes in the mixture. In our model, the target parameters do not change, $\mathbf{t}_0 = \mathbf{t}$, and their corresponding feature is $\mathbf{z}_0 = \mathbf{t} f_0(x)$. For each training example $x \sim \mathcal{D}$, we define $y_u(x)$ as the assigned class and introduce a weight, $\omega_u(x)$, assigned to the class $y_u(x)$, with $y_u(x) \sim \text{Categorical}(\omega_u(x), [0,1])$. The weight represents the fraction of a given class, $y_u(x)$, in the mixture of classes, $\mathcal{D}_u$, and the weights sum to one over the classes: $\sum_{u=1}^{U}\omega_u(x) = 1$. 
We denote by $\omega_u(\cdot)$ the *distribution over classes*. Given the mixture distribution of the examples, we define the *multi-label label function* for the target parameters as $F(\mathbf{t}; \mathcal{D}_u) = \sum_{x \in \mathcal{D}_u}\omega_u(x) f_0(x)$. For a specific example $x \sim \mathcal{D}$, this reduces to $F(\mathbf{t}; \mathcal{D}_u) = \omega_u(x) f_0(x)$. The *generalized feature*, $\mathbf{g}_0$, which is the input to the second layer, is defined as the weighted average of the feature over all classes: $\mathbf{g}_0 = \sum_{u = 1}^U F(\mathbf{t};\mathcal{D}_u) = \sum_{u = 1}^U \omega_u(x) f_0(x)$. We define the non-target feature, $\mathbf{g}_{non-target}$, as $\mathbf{g}_{non-target} = \mathbf{g}_0 - F(\mathbf{t};\mathcal{D}_y)f_0(x)$, where $\mathbf{g}_0$ and $x$ are the inputs to the second and first layer, respectively, and $F(\mathbf{t};\mathcal{D}_y) = \omega_y(x) f_0(x)$ is the label function for the target parameters corresponding to the true label, with $\omega_y(\cdot) = \delta(\cdot)$ the delta function. It is easy to show that $f(\mathbf{g}_{non-target};\mathcal{D}_u)$ is a sufficient statistic for estimating $F(\mathbf{t};\mathcal{D}_u)$. Given a learning task, we then consider learning $\mathcal{L}_0 (\cdot)$ with a loss function, $\mathcal{L}_1 (\cdot)$, for the target parameters.

Next, we discuss the properties of the proposed neural network architecture. We state the following proposition (proof in the Appendix):

\[Prop:2\] Let $z_0 \in \mathbb{R}^{N_0}$ be the output of a single linear unit (also called a feature) with ReLU. Then, given any input $x \in \mathbb{R}^d$ with $||x||_2 > 0$, we have $z_0 > x$ for any $z_0 > 0$, and $z_0 < x$ for any $z_0 < 0$.

Our approach is to learn the non-target parameters of the network, $\mathbf{p}_0$, so that its output, $f_0(\cdot)$, is the closest of all possible functions to a given function, $\phi$. 
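As a minimal sketch of the generalized feature defined above, the following Python snippet evaluates $F(\mathbf{t};\mathcal{D}_u) = \omega_u(x) f_0(x)$ for one example and sums it over classes to obtain $\mathbf{g}_0$. The weight matrix `W`, the input `x`, the class weights `omega`, and all function names are hypothetical values chosen for illustration; they do not come from the text.

```python
def relu_layer(weights, x):
    """f_0(x): one fully-connected layer followed by ReLU (toy version)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def label_function(omega_u, feature):
    """F(t; D_u) for a single example: omega_u(x) * f_0(x)."""
    return [omega_u * f for f in feature]


def generalized_feature(omega, feature):
    """g_0 = sum over classes u of omega_u(x) * f_0(x)."""
    g = [0.0] * len(feature)
    for omega_u in omega:
        for k, value in enumerate(label_function(omega_u, feature)):
            g[k] += value
    return g


# Hypothetical toy values (assumptions, not taken from the paper).
W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 0.0]]  # three linear maps, input dimension 2
x = [2.0, 1.0]
omega = [0.5, 0.25, 0.25]                   # class weights, summing to one

feature = relu_layer(W, x)
g0 = generalized_feature(omega, feature)
```

Because $f_0(x)$ does not depend on the class index $u$ in this definition, the sketch returns `g0` equal to `feature` whenever the weights sum to one.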
We state the following proposition (proof in the Appendix):

\[Prop:3\] Let $x \in \mathbb{R}^{N_0}$ be a vector of features of dimension $N_0$, let $z_0 \in \mathbb{R}$ be a scalar value, and let $\mathcal{H}$ be an arbitrary convex set. Then for any $c_0 \in \mathbb{R}$, there exists $z = \phi(x,z_0) \in \mathcal{H}$ such that $z_0 = c_0$.

Theorem \[Theorem\] follows from the above proposition.

\[Theorem\] Consider the output of a linear layer with ReLU. Then, given any non-zero input, $z_0 > 0$ (the case $z_0 < 0$ follows from Proposition \[Prop:3\]), we have $\phi(x,z_0) = z_0 + b_0$, where $b_0$ is any scalar value. In other words, there is an infinite number of functions that map $x$ to $z_0$.

Propositions \[Prop:1\] and \[Prop:3\] allow us to state the following proposition:

\[Prop:4\] Given a linear layer with ReLU as the output of a neural network and an $\ell_2$ norm as the distance function between the outputs $f_0(\cdot)$ and $\phi(\cdot)$, for any scalar $b_0$ there is an infinite number of linear functions, $f_0(\cdot)$, that map any given feature vector, $x \in \mathbb{R}^{N_0}$, to $z_0 = b_0 + x$.

Since the set of all possible mappings from an input feature vector to any $z_0 > 0$ is infinite, we cannot learn a linear function $f_0(\cdot)$ directly. Proposition \[Prop:4\] justifies learning the non-target parameters of the network, $\mathbf{p}_0$, to map the features to within a certain distance, $\epsilon$, of any output. Given $f_0(\cdot)$ and $\phi(\cdot)$, we then learn the target parameters, $\mathbf{t}$, to minimize $\mathcal{L} = || \mathbf{t}f_0(x) - \phi(x) ||_2$, where $x$ is the example. If we let $\mathbf{t} = \mathbf{t}_0$ and $\epsilon = 0$, then $\mathcal{L}_0 = ||\mathbf{t}_0 f_0(x) - \phi(x)||_2$ and $F(\mathbf{t};\mathcal{D}_u) = F(\mathbf{t}_0;\mathcal{D}_u)$. Therefore, we obtain the best target feature, $\mathbf{z}_0$, by learning $\mathbf{t}_0$ to minimize $\mathcal{L}_0$. 
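The objective $\mathcal{L} = || \mathbf{t}f_0(x) - \phi(x) ||_2$ can be illustrated with a small Python sketch. It assumes, purely for illustration, that $\mathbf{t}$ acts as a single scalar multiplier on the feature vector, in which case the minimizer has the closed least-squares form $t^\star = \langle f_0(x), \phi(x)\rangle / \langle f_0(x), f_0(x)\rangle$. The function names and the toy vectors are hypothetical and are not part of the method described in the text.

```python
import math


def l2_loss(t, feature, target):
    """L = || t * f_0(x) - phi(x) ||_2 for a scalar t (illustrative assumption)."""
    return math.sqrt(sum((t * f - p) ** 2 for f, p in zip(feature, target)))


def best_scalar_t(feature, target):
    """Closed-form least-squares minimizer of l2_loss over a scalar t."""
    numerator = sum(f * p for f, p in zip(feature, target))
    denominator = sum(f * f for f in feature)
    if denominator == 0.0:
        raise ValueError("feature vector is all zeros; t is not identifiable")
    return numerator / denominator


# Hypothetical toy values (assumptions, not taken from the paper).
feature = [1.0, 2.0, 0.0]  # stands in for f_0(x)
target = [2.0, 4.0, 0.0]   # stands in for phi(x)

t_star = best_scalar_t(feature, target)
loss_at_optimum = l2_loss(t_star, feature, target)
loss_at_one = l2_loss(1.0, feature, target)
```

In this toy case the target is an exact multiple of the feature, so the loss vanishes at `t_star`; for a general $\phi(x)$ the residual is non-zero and corresponds to the distance $\epsilon$ discussed above.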
In the following section we present a training algorithm for this optimization problem.

Training Algorithm {#sec:training}
==================

In order to train our model, we need to know how $f_0(\cdot)$ is represented as a combination of $\mathbf{t}_0$ and the non-target parameters of the network. Assume that $x \sim \mathcal{D}_1$, the training set corresponding to class $1$. We denote the true class of $x$ by $y = 1$. We represent $f_0(x)$ as a linear combination of the target parameters, $\mathbf{t}_0$, and an approximation of the non-target parameters, $b$:

$$\begin{aligned} f_0(x) &= c_0\mathbf{t}_0 + b\\