#separator:tab
#html:true
#notetype column:1
#deck column:2
#tags column:5
Basic	Part III Notes::SLIP	<b>Definition:</b> Cumulant-Generating Function	Natural logarithm of the moment-generating function: <br>\[K(t) = \log {\mathbb{E}} [e^{tX}]\]<br> <br>Generates cumulants: \(\kappa_n\). Note that:<br> \(\kappa_1\) is the mean<br> \(\kappa_2\) is the variance<br> \(\kappa_3\) is the skewness<br> But then it gets weirder.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Normal Linear Model	\[Y_i = x_i{^{\mathrm{T}}} \beta + \varepsilon_i\] <br>with \(\varepsilon_i \sim {\mathrm{N}}(0, \sigma^2)\) independent<br> <br>Alternatively written \(Y = X\beta + \varepsilon\)<br> <br>Note our assumptions: \(Y_i\) are independent and normally distributed; linear in \(X\).<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	MLE for the slope of a normal linear model	\[\hat\beta = (X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}} Y\] <br>(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The two MLEs for the variance \(\sigma^2\) of a normal linear model	\[\tilde\sigma^2 = \frac{1}{n} ||Y - X\hat \beta||^2_2\] or if we'd rather have an unbiased \(\sigma^2\), \[\hat\sigma = \frac{1}{n - p} ||Y - X\hat \beta||^2_2\]<br> <br>(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Exponential Dispersion Family	\[ \{ {\mathbb{P}}_{\theta,\phi}: \theta \in \Theta, \phi \in \Phi \} \] <br>\(\Theta \subset {\mathbb{R}}\), \(\Phi \subset (0, \infty)\), fulfilling that, for each \({\mathbb{P}}_{\theta, \phi}\), we have (\(\forall y \in {\mathbb{R}}\)): <br>\[ f(y; \theta, \phi) = f_0(y,\phi) \exp \left\{  \frac{y\theta - K(\theta)}{\phi} \right\} \] <br>where<br> \(K: \Theta \to {\mathbb{R}}\) is called the cumulant function (it's just here to normalize), and<br> \(\{ f_0(\cdot, \phi) : \phi \in \Phi \}\) is some family of density functions (that gets 'tilted' by the exponential portion).<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the equation to solve for the cumulant function \(K(\theta)\) for exponential diffusion families?	\[K(\theta) = \phi \log \int_{-\infty}^\infty e^{y\theta/\phi}f_0(y,\phi)\,dy\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the cumulant-generating function for an exponential diffusion family?	\[K(t) = \log {\mathbb{E}} [e^{ty}] = \frac{K(\phi t + \theta) - K(\theta)}{\phi}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the mean and variance of the exponential distribution family in terms of the cumulant function \(K(\theta)\)? What conclusions can we draw about \(\theta\)?	\[ {\mathbb{E}}_{\theta, \phi}[y] = K'(\theta) \] \[ {\mathrm{Var}}_{\theta, \phi}[y] = \phi K''(\theta) \]<br> <br>(this can be computed simply by taking the first or second derivative of the cumulant-generating function and plugging in \(t=0\))<br> <br>If we assume our variance is nonzero, and since \(\phi\) must be positive, it must be the case then that \(K''(\theta) &gt; 0\). Thus \(K'(\theta)\) must be strictly increasing and therefore invertible. So, considering the mean equation above, it makes sense to reparameterize everything in terms of \(\mu := K'(\theta)\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What properties do we care about for an exponential dispersion family?	We need a natural parameter \(\theta\),<br> a dispersion parameter \(\phi\),<br> a cumulant function \(K(\theta)\) which gives us \(\mu =  K'(\theta)\),<br> a mean parameter space \(\mathcal{M} = { \mu(\theta : \theta \in \Theta)}\),<br> a mean function \(\theta \to \mu(\theta) = (K')^{-1}(\mu)\),<br> and a variance function \(V(\mu) = K'' \circ \theta(\mu)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Give the properties of a normal distribution \(\mathrm{N}(\mu, \sigma^2)\) (\(\mu \in \mathbb{R}, \sigma &gt; 0\)) interpreted as an exponential dispersion family	\[\theta = \mu\] \[\phi = \sigma^2\] \[K(\theta) = \theta^2/2\] \[\mathcal{M} = \mathbb{R}\] \[\mu(\theta) = \theta\] \[V(\mu) = 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Give the properties of a binomial distribution \(\frac{1}{m}\mathrm{Bin}(m, p)\) (\(m \in \mathbb{N}, p \in (0, 1)\)) interpreted as an exponential dispersion family	\[\theta = \mathrm{logit}(p) = \log(p / (1-p))\] \[\phi = 1/m\] \[K(\theta) = \log(1 - p) - \log(2) = \log(e^\theta + 1) - \log(2)\] \[\mathcal{M} = (0,1)\] \[\mu(\theta) = \mathrm{expit}(\theta)\] \[V(\mu) = \mu(1-\mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Linear Model	Independent observations \((x_i, Y_i)\) follow a generalized linear model if, for some exponential dispersion family \(\{{\mathbb{P}}_{\mu, \phi}: \mu \in {\mathcal{M}}, \phi \in \Phi\}\):<br> <br>1. \(Y_i \sim {\mathbb{P}}_{\mu_i, \phi_i}\) with \(\mu_i \in {\mathcal{M}}\) and \(\phi_i = \phi a\) with known \(a_1...a_n &gt; 0\) and possibly unknown \(\phi&gt;0\),<br> <br>2. \(g(\mu_i) = x_i{^{\mathrm{T}}} \beta\) for some unknown \(\beta \in {\mathbb{R}}^p\) with a strictly monotonic twice differentiable known 'link function' \(g: {\mathcal{M}} \to {\mathbb{R}}\).<br> (linearity after applying the link function)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Residual Squared Error	\[RSE = \sqrt{ \frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n - p} }\] where \(p\) is the number of parameters in our regression (denominator is the number of degrees of freedom).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(R^2\) (Coefficient of determination)	\[R^2 = 1 - \frac{RSS}{TSS} = 1- \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}\] <br>\(RSS\) is the residual sum of squares, and \(TSS\) is the total sum of squares. <br>Proportion of the variation predictable from our model. We want this to be large!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the Logit function?	\[{\mathrm{logit}}(x) = \log \left( \frac{x}{1-x}\right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the Expit function?	\[{\mathrm{expit}}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log Likelihood of a Generalized Linear Model<br>(given \(Y = (Y_1,...,Y_n){^{\mathrm{T}}}\), \(\beta\).)	First define \(\theta_i = (K'){^{-1}} (g{^{-1}} (x_i {^{\mathrm{T}}} \beta))\). We have \[\begin{align*} {\mathcal{L}}(\beta, \phi) &= \sum_{i=1}^n \log f(Y_i; \theta_i, \phi_i)\\\\ &= {\sum_{i=1}^n} \log f_0(Y_i, \phi a_i) + {\sum_{i=1}^n} \frac{1}{\phi a_i}\{Y_i \theta_i - K(\theta_i)\} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> If an MLE exists for the Generalized Linear Model with the canonical link, it must be unique.	If an MLE \[\hat \beta = {\underset{\beta \in {\mathbb{R}}^p}{\mathrm{argmax}}\,\,} {\mathcal{L}} (\beta, \phi)\] exists, then it must satisfy the score equation:  \[\frac{\partial{\mathcal{L}}(\hat \beta, \phi)}{\partial \beta} = 0.\] Note that by our assumptions, we have \(\theta_i = x_i{^{\mathrm{T}}} \beta\), yielding: \[ \frac{\partial^2{\mathcal{L}} (\beta, \phi)}{\partial\beta\partial\beta{^{\mathrm{T}}}} = - {\sum_{i=1}^n} \frac{x_i x_i{^{\mathrm{T}}}}{\phi a_i} K''(x_i{^{\mathrm{T}}} \beta)\] Since \(K'' &gt; 0\) and \(X\) is assumed to have full rank, the Hessian must be negative definite, and so if it exists, the MLE must be unique.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Pearson Chi-Squared Statistic for \(\phi\)	\[\hat \phi = \frac{\chi^2}{n-p}\] \[\chi^2 = {\sum_{i=1}^n} \frac{(Y_i - K'(x_i{^{\mathrm{T}}} \hat \beta))^2}{a_i V(K'(x_i{^{\mathrm{T}}} \hat \beta))}\] <br> This comes from \({\mathrm{Var}}(Y_i) = a_i \phi V(\mu_i) \implies \phi\) plugged into the weighted OLS unbiased variance MLE \[\hat \sigma^2 = \frac{\sum_iw_i(Y_i - \hat\mu)^2}{n-p}\]<br> <br> This is a consistent estimator for \(\phi\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain how to derive Newton-Raphson	Suppose we want to solve \(f(x) = 0\). If we start at some point \(x\), we can simply step downhill a bunch of times to get \(x'\) with smaller and smaller \(f(x')\) via a Taylor approximation: <br>\[0 = f(x') \approx f(x) + f'(x)(x' - x)\] <br>Rearranging yields <br>\[\begin{align*} 0 &= f(x) + f'(x)(x' - x)\\\\ -f'(x)(x' - x) &= f(x)\\\\ x' &= x - \frac{f(x)}{f'(x)} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain briefly what Fisher scoring is, in the context of solving a general linear model numerically	Similar to Newton-Raphson, but replaces the Jacobian \(J(\beta)\) with the relevant Fisher information: \[I(\beta) = {\mathbb{E}}(J(\beta))\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Column Space	Interpreting some \(m \times n\) matrix \(A\) as \(n\) column vectors, what do they span? \[c_1v_1 + ...+c_nv_n\] In other words, all possible products \(Ax\) for any \(x \in {\mathbb{R}}^n\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the orthogonal complement of the column space?	The orthogonal complement of the column space is the left null space: \[v \in {\mathrm{col}}(X)^\perp \iff v{^{\mathrm{T}}} X = 0{^{\mathrm{T}}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> PDF of the Multivariate Normal Distribution	\[(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp \left( -\frac{1}{2} (x - \mu){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu) \right)\] where \(\Sigma\) is the covariance matrix between the components.<br> <br> Note that when the components are independent, we get just a diagonal \(\Sigma\) with \(\Sigma_{ii} = \sigma_i^2\), and this simplifies to just a product of univariate PDFs.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}Xv\)	\[X\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} X\)	\[X{^{\mathrm{T}}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv\)	\[\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv = (X + X{^{\mathrm{T}}})v\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> {\mathrm{Var}}{(Xv)} = {\mathrm{Cov}}{(Xv)}	\[{\mathrm{Var}}(Xv) = X \Sigma X{^{\mathrm{T}}}\] where \(\Sigma\) is the covariance matrix of the random vector \(v\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Covariance Properties	1. \(\forall a, \, {\mathrm{Cov}}(X, a) =0\)<br> 2. \(\forall a, b, \,{\mathrm{Cov}}(X + a, Y + b) = {\mathrm{Cov}}(X, Y)\)<br> 3. \(\forall a, b, \,{\mathrm{Cov}}(aX, bY) = ab{\mathrm{Cov}}(X, Y)\) 4. \({\mathrm{Cov}}(X +Y, Z + W) = {\mathrm{Cov}}(X, Z) + {\mathrm{Cov}}(X,W) + {\mathrm{Cov}} (Y, Z) + {\mathrm{Cov}}(Y, W)\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Taylor Expansion<br>(For \(f\) around \(x_0\) for nearby \(x\))	\[\begin{align*} f(x)&\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{1}{n!} f^{(n)}(x_0)(x - x_0)^n \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Maclaurin Expansion of \(e^x\)	\[\begin{align*} e^x &= 1 + x + \frac{1}{2}x^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{x^n}{n!} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Wilkes' Theorem	The likelihood ratio test (LRT) statistic \[ D = -2\log \left( \frac{\text{Likelihood for null model}}{\text{Likelihood for alternative model}} \right) \] is asymptotically distributed with the chi-squared distribution with \(\dim \Theta - \dim \Theta_0\): \[D \sim \chi^2_{\dim \Theta - \dim \Theta_0}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Proof: Solve for the confidence interval around some parameter \(j\) of the estimated slope \(\hat \beta\)	Note that we know  \[I(\beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] Because \(\hat \beta \overset{p}{\to} \beta\), we can apply the continuous mapping theorem and Slutsky's Theorem to get: \[I(\hat \beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] This yields the interval  \[C_\alpha := \left[ \hat \beta_j \pm z_{\alpha/2} \sqrt{[i(\hat \beta){^{-1}}]_{jj}} \right]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Confidence Interval for GLMs given a new datapoint<br>(Denote the new datapoint \(x^* \in {\mathbb{R}}^p\), we want to find a confidence interval for \({\mathbb{E}}(Y | X = x^*) = g{^{-1}} (x^{*T} \beta)\).)	\[C_\alpha := g{^{-1}}\left[ x^{*T}\hat \beta \pm z_{\alpha/2} \sqrt{x^{*T}i(\hat \beta){^{-1}} x^{*}} \right]\]<br> <br> This comes from the fact that \[\hat \beta \sim N(\beta,\, i(\hat \beta){^{-1}})\] so \[x{^{\mathrm{T}}} \hat \beta \sim N(x{^{\mathrm{T}}}\beta,\,  x{^{\mathrm{T}}} i(\hat \beta){^{-1}})\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Mean parameterization \(\tilde l(\mu, \phi)\)	\[\begin{align*} \tilde l(\vec \mu, \phi) &= \log \tilde f(y, \vec \mu, \phi)\\\\ &= \log f(y, \theta(\vec \mu), \phi)\\\\ &= \sum_{i=1}^n \log f_0(y_i, \phi) + \sum_{i=1}^n \frac{1}{a_i \phi} (y_i\theta(\mu_i) - K(\theta(\mu_i)) \end{align*}\] Also known as the 'saturated' parameterization.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Deviance of a GLM	\[D(\hat\mu) := -2 \phi (\tilde l(\hat \mu, \phi) - \tilde l (Y, \phi))\] where \(\tilde l\) is the mean parameterization of the log likelihood (where we can provide a separate \(\mu_i\) for each sample \(i\)). Note then we have <br>\[ D(\hat \mu) = -2 \phi \log \left( \frac{\sup_{\mu \in {\mathbb{R}}^n: \mu_i = g{^{-1}}(x_i{^{\mathrm{T}}} \beta ),\, \beta \in {\mathbb{R}}^p} \tilde f(Y; \mu, \phi)}{\sup_{\mu \in {\mathbb{R}}^n} \tilde f(Y; \mu, \phi) }  \right) \] Here, the top model is the saturated model and the bottom model is the normal restricted model. So \(D(\hat \mu)\) is the likelihood ratio test statistic for \(H_0\): GLM, \(H_1\): saturated model.<br> <br>Also note that deviance is independent of \(\phi\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Dispersion-scaled Pearson Residuals	\[e_i = \frac{Y_i - \hat \mu_i}{{\mathrm{Var}}(Y_i)} = \frac{Y_i - \hat\mu_i}{\sqrt{\hat\phi a_i V(\hat \mu_i)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Deviance Residuals	\[d_i = {\mathrm{sgn}}(Y_i - \hat \mu_i) \sqrt{D_i(\hat\mu)}\] where \(D_i(\mu)\) is the \(i\)th summand of the deviance	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the relationship between dispersion-scaled Pearson residuals and deviance residuals?	When \(Y_i \approx \hat \mu_i\), we have \(d_i = \hat \phi^{1/2} e_i + O(|Y_i - \hat \mu_i|^{3/2})\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function for the EDF of a normal random variable.	\[K(\theta) = \frac{1}{2}\theta^2\] <br>Note that this implies \[\mu = \frac{d}{d\theta}K(\theta) = \theta\] and \[\sigma^2 = \phi \frac{d^2}{d\theta^2} K(\theta) = \phi\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the base function \(f_0(y;\phi)\) of the normal EDF	\[f_0(y;\phi) = \frac{1}{\sqrt{2\pi\phi}} \exp\left(-\frac{y^2}{2\phi}\right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\theta\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[\theta = {\mathrm{logit}}(p) = \frac{p}{1-p}\] and so \[p = {\mathrm{expit}}(\theta) = \frac{e^\theta}{1+e^\theta}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\phi\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[\phi = \frac{1}{n}\] and so \[n = \frac{1}{\phi}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function of the EDF for \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[K(\theta) = \log(1 + e^\theta) - \log(2)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the normalizing \(f_0(y;\phi)\) for the EDF of \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[{n\choose{yn}} 2^{-n}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Classifier	A function \(h: {\mathbb{R}}^p \to [K]\), typically of the form \[h(x) = h(x; (X_i, Y_i)_{i \in [n]})\] <br>where \(p\) is the dimension of the classified points, \(K\) is the number of classification groups, and \(n\) is the number of data points.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Misclassification Loss	\[L(y, y') = {\mathbb{1}}_{\{ y \neq y'\}}\] <br>(really obvious, just 1 when you've got the wrong classification or 0 if not)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Prediction Error of a Classifier	\[R(h) =  {\mathbb{E}}[L(Y^*, h(X^*))] = {\mathbb{P}}[Y^* \neq h(X^*)]\] <br>where \((X^*, Y^*) \in (X^K, Y^K)\) are test points<br> <br> We want to find classifiers \(h\) that minimize this!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Conditional Class Probabilities	\[p_k(x) = {\mathbb{P}}(Y^* = k {\,|\,} X^* = x)\] <br>where \(k\) is some classification in \([K]\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bayes Classier	The minimal prediction error over all classifiers is called the Bayes risk. It is attained by the Bayes classifier \[h^*(x) = {\underset{k \in [K]}{\mathrm{argmax}}\,\,}{p^*_k(x)}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Core idea behind decision trees	Break covariate space down into rectangular regions \(R_1, R_2, ..., R_M\), each with some contributing factor \(c_m\). When \(x\) falls within each, that region contributes its \(c_m\) to a final total: \[x \mapsto \sum_{m=1}^M c_m {\mathbb{1}}_{x \in R_m}\] The hard part is finding the regions and coefficients that work!<br> <br>Idea: Start with the entire space in one big rectangle, then split it into two smaller pieces, staying axis-aligned: \[R_l(j, t) = \{ x\in R: x_j \leq t \}\] \[R_r(j, t) = \{ x \in R: x_j &gt; t \}\] where \(j \in [p]\) are the split variables and \(t \in {\mathbb{R}}\) are the split points (we choose only one variable, and one location along that variable, to split over at a time)<br> <br>Find the best split variable and split point to minimize a cost function \(Q(j, t)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain the general algorithm for a regression tree	Given data \((x_i, y_i)_{i \in [n]}\) in \({\mathbb{R}}^p \times {\mathbb{R}}\), continually split any region so as long as it has at least 2 data points.<br> <br>Let \(T_j\) be the set of midpoints between adjacent points in a region \(x_{ij} \in {\mathbb{R}}\). Compute the cost-minimizing split to decide where to split: \[(\hat j_R, \hat t_R) \in {\underset{j \in [p],\, t \in T_j}{\mathrm{argmin}}\,\,} Q(j,t)\] where \(Q(j,t)\) is the loss after each split.<br> <br>Then perform the split and continue! <br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Regression decision tree loss	It's just the best case loss of the left rectangle (best choice of \(c_l\)) plus the best case loss of the right rectangle (best choice of \(c_r\)) minus the current loss of the rectangle (using \(c\)): \[\begin{align*} Q_{reg}(j,t) &=& \min_{c_l \in {\mathbb{R}}} \sum_{i : x_i \in R_l(j,t)} (y_i-c_l)^2 &=& \sum_{i : x_i \in R_l(j,t)} (y_i-c(R_l(j,t))^2\\\\ &&+ \min_{c_r \in {\mathbb{R}}} \sum_{i : x_i \in R_r(j,t)} (y_i-c_r)^2 &&+ \sum_{i : x_i \in R_r(j,t)} (y_i-c(R_r(j,t))^2\\\\ &&- \min_{c \in {\mathbb{R}}} \sum_{i: x_i \in {\mathbb{R}}} (y_i-c)^2 &&+ \sum_{i : x_i \in R} (y_i-c(R))^2 \end{align*}\] where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.<br> <br>Note that this can never be positive! Splitting will never make things worse (because we could just have two splits with the same old height \(c\)).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Final regression function of a regression tree	\[\hat f_{CART}(x) = \sum_{m=1}^M \hat c_m {\mathbb{1}}_{x \in R_m}\] with \(\hat c_m = c(R_m)\), where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.<br> <br>\(R_m\) are called the leaves of the decision tree!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Heuristic for when a regression tree is done growing	Each region should contain no more than 5 points	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Runtime of growing a regression tree (when optimized)	O(n)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Define a unitary matrix	A matrix \(A\) where \(A^* = A{^{-1}}\),<br> where \(A^*\) is the complex conjugate<br> <br>For real matricies this is the same as being orthogonal	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Define an orthogonal matrix, what do they represent?	A matrix \(A\) where \(A{^{\mathrm{T}}} = A{^{-1}}\)<br> <br>it represents rotations and reflections (but not stretching)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Thin Singular Value Decomposition	Given \(X \in {\mathbb{R}}^{n\times p}\), the decomposition  \[X = U D V{^{\mathrm{T}}}\] where \(U \in {\mathbb{R}}^{n\times p}\) is a unitary matrix,<br> \(D \in  {\mathbb{R}}^{p\times p}\) is a sorted diagonal matrix \((D_{11} \geq D_{22} \geq ...)\),<br> and \(V \in {\mathbb{R}}^{p \times p}\) is also unitary.<br> <br>This can be thought of as a process where a space is rotated/reflected, stretched on its axes, then rotated/reflected again.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Principal Component	Given some \(X \in {\mathbb{R}}^{n\times p}\) row-centered (\(X^T\)1 = 0), PCA finds a sequence of \(p\) unit vectors \(v_j\), where each \(v_j\) minimizes the sample variance \({\mathrm{Var}}{(Xv_j)} = \frac{1}{n}v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) while staying orthogonal to all previous \(v_i\)s.  <br> <br>We call \(Xv_j\) the principal components.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State how a thin singular value decomposition yields the principal components	Given a singular value decomposition \(UDV\), substituting into the sample variance \(v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) and optimizing tells us that \(v_j = V_j\) (\(j\)th column of V) and so for the principal component, we have:  \[Xv_j = XV_j = D_{jj}U_j\] where \(U_j\) is the \(j\)th row of \(U\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	In practice, how many principal components do we use?	Take the smallest \(J \leq P\) such that \[\frac{\sum_{j=1}^J D_{jj}^2}{\sum_{j=1}^p D_{jj}^2} \geq 0.8\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Goal of K-Means Clustering	Find a 'good enough' labelling minimizing <br>\[ L(c_1...c_k) = \sum_{k=1}^K \frac{1}{2 |C_k|} \sum_{i \in C_k} \sum_{j \in C_k} ||x_i - x_j||^2_2\] <br>(minimize within-cluster variation)<br> But this is NP-hard! Need a heuristic solution.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Centroid of a cluster	\[\bar X_k = \frac{1}{|C_k|} \sum_{i\in C_k} X_i\] (center of mass)<br> <br>Note that we can represent the within-cluster variance loss in terms of this: \[L(c_1...c_k) = \sum_{k=1}^K \sum_{i \in C_k} ||x_i - \bar x_k||^2_2\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> K-Means Clustering Algorithm	1. Initialize a \(K\)-partition \(C_1...C_k\) randomly.<br> <br>2. Until convergence:<br> <br>2-1. Compute the centroid of each cluster \(\hat x_i\)<br> <br>2-2. Construct a new partition \(\tilde C_1...\tilde C_k\) by assigning each point to the closest centroid.<br> <br>Note that this only finds a local minimum, not always the global minimum!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	K-Means best practices (how do we choose K?)	To choose K, try a bunch and see which ones offer the most explainability. (don't just increase to minimize loss; \(K = n\) always does that)<br> <br>Good clusterings should be stable with repeated runs, or with the addition or removal of a few data points.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	The two MLEs for the variance of a normal linear model		SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain how Newton-Raphson can be used to solve an MLE numerically	We want to find the \(\theta\) for which \(\frac{dl(\vec\theta;x)}{d\vec\theta} = 0\). So, use Newton-Raphson with the Hessian <br>\[\frac{d^2l(\vec\theta;x)}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}\] except we want the maximum, not the minimum, so make sure to use the negation instead!<br> <br>\[\hat \theta^{(n+1)} = \hat \theta^{(n)} - \left( \frac{d^2l}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}(\hat\theta^{(n)}) \right){^{-1}}\frac{dl}{d\vec\theta}(\hat\theta^{(n)})\]<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When we inspect GLM residuals (Pearson or Deviance), what are we looking for?	Should both be approximately normal with mean zero, variance 1.<br> <br>Note this is based on small-dispersion asymptotics: \[\phi_i^{-1/2} (y_i - \mu_i) \overset{d}{\to} N(0, V(\mu_i))\] as \(\phi_i \to 0\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we need to be careful of when using dispersion-scaled residuals in R?	They aren't normalized by \(\hat \phi ^{-1/2}\), they use: \[\frac{y_i - \hat \mu_i}{\sqrt{a_i V(\mu_i)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Sigmoid Function	\[\sigma(x) = \frac{e^x}{(1 + e^x)}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> ReLU Function	\[\sigma(x) = \max(0, x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Softmax Function	\[g_k(z) = \frac{e^{z_k}}{\sum_{r=1}^K e^{z_k}}\] for \(k \in [K]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Universal Approximation Theorem for Neural Networks	Any (borel-measurable) function can be approximated to a prescribed precision by a single hidden layer neural network with any nonlinear activation function.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Cross-Entropy Loss	\[L(\theta) = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K Y_{ik} \log P_k^{(\theta)}(X_i)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Gradient Descent for Neural Networks	\[\theta^{(t + 1)} = \theta^{(t)} - \alpha_t \nabla_\theta \mathrm{Err}_\mathrm{tr}(\theta^{(t)})\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The Continuous Mapping Theorem	Given a function \(g: X \to Y\) that is continuous except at points of measure zero, \(g\) preserves convergence almost surely, in probability, and in distribution. \[X_n \overset{a.s.}{\to} X \implies g(X)_n \overset{a.s.}{\to} g(X)\] \[X_n \overset{p}{\to} X \implies g(X)_n \overset{p}{\to} g(X)\] \[X_n \overset{d}{\to} X \implies g(X)_n \overset{d}{\to} g(X)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Akaike Information Criterion (AIC)	\[AIC(M) = 2 \dim \Theta-2 \log f(z; \hat\theta)\] Smaller is better!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bayesian Information Criterion (BIC)	\[BIC(M) = (\log n) \dim \Theta-2 \log f(z; \hat \theta)\] where \(n\) is the number of data points.<br> Smaller is better!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When should you use AIC versus BIC?	Use AIC in predictive applications, use BIC in explanatory applications (since it selects for simpler models; extra weight on the \(\dim \Theta\) term)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Prediction Error \(\mathrm{Err}\) (Risk)	\[\mathrm{Err} = {\mathbb{E}} [ L(Y^* , \hat f(X^*))]\] where \((X^*, Y^*)\) is a random vector, the test data (space of all possible outcomes) and \(\hat f\) is an arbitrary regression function fit using training data.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> In-Sample Error \(\mathrm{Err}_\mathrm{in}\)	\[\mathrm{Err}_\mathrm{in} = \frac{1}{n} \sum_{i=1}^{n} {\mathbb{E}} [ L(Y^*_i , \hat f(x_i))]\] (this is for a fixed design scenario, where the \(x_i\)s are fixed but the outcome is not. \(\hat f\) is fit using specific draws \((x_i, y_i)\) but evaluated against any possible set of \(Y^*_i\) (still at fixed \(x_i\)).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Training Error \(\mathrm{Err}_\mathrm{tr}\)	\[\mathrm{Err}_\mathrm{tr} = \frac{1}{n} \sum_{i=1}^{n}  L(y_i , \hat f(x_i))\] The evaluation of \(\hat f\) against the training data itself.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Show that we have the relation:<br> \[\mathrm{Err}_\mathrm{in} = {\mathbb{E}}(\hat{ \mathrm{Err}_\mathrm{tr}}) + \frac{2}{n} \sum_{i=1}^n {\mathrm{Cov}}(\hat f (x_i), Y_i)\] for L2 loss.	Proof sketch:<br> Just expand the squared terms. You'll need to use that \[{\mathbb{E}}[Y_i^* \hat f(X_i)] = {\mathbb{E}}[Y_i^*]{\mathbb{E}}[\hat f(X_i)] = {\mathbb{E}}[Y_i]{\mathbb{E}}[\hat f(X_i)]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bias-Variance Tradeoff (of the prediction error)	If we're trying to model  \[Y_i = f(X_i) + {\varepsilon}_i\] where \(f: {\mathbb{R}}^p \to {\mathbb{R}}\), \({\mathbb{E}}({\varepsilon}_i) = 0\), \({\mathrm{Var}}({\varepsilon}_i) = \sigma^2 &gt; 0\), then we have the following decomposition of prediction error: \[\mathrm{Err} = \sigma^2 + {\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*) {\,|\,} X^*)] + {\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*) {\,|\,} X^*)]\] <br>Or for the in-sample error: \[\mathrm{Err}_\mathrm{in} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*_i) {\,|\,} X^*_i)] + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*_i) {\,|\,} X^*_i)]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the variance function of the normal EDF	\[V(\mu) = \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the variance function for the binomial EDF	\[V(\mu) = \mu (1 - \mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function for the Poisson EDF	\[K(\theta) = e^\theta - 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\theta\) for the EDF representing \(\textrm{Poisson}(\lambda)\)	\[\theta = \log (\mu)\] \[\mu = e^\theta\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\phi\) for the EDF representing \(\textrm{Poisson}(\lambda)\)	It's always one: \[\phi = 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance function for the Poisson EDF	\[V(\mu) = e^{\log \mu} = \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State three reasons that we might observe overdispersion in our model	1. Missing critical covariates<br> 2. You're using the wrong distribution<br> 3. There is correlation between your datapoints	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Intuitively, what does the negative binomial distribution model?	In a binary experiment with probability \(p\) of failure, the number of failures before \(r\) successes is reached.<br> <br>NOTE: Other places use the convention that \(p\) is the probability of success, not failure. This course uses the 'probability of failure' convention.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Negative Binomial Distribution PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{P}}(X=k) = {k + r - 1 \choose  k} p^k (1-p)^r\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Mean of the Negative Binomial PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{E}}(X) = \frac{rp}{1-p}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance of the Negative Binomial PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathrm{Var}}(X) = \frac{rp}{(1-p)^2}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance Function of the Negative Binomial PDF	\[V(\mu) = \mu + \frac{\mu^2}{r}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the relationship between the Negative Binomial distribution and the Poisson distribution?	As \(r \to \infty\),  \[\mathrm{NB}\left(r, \frac{\mu}{\mu + r}\right) \overset{d}{\to} \mathrm{Pois}(\mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we need to remember when using the negative binomial in R?	The default link function is the same as for the Poisson RV: \[g(\mu) = \log \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can we test between \(H_0\): Poisson model and \(H_1\): negative binomial model?	Use the modified framework: \[2 \left( l_1(\hat \beta, \hat r; Y) - l_0(\hat \beta; Y) \right) \overset{d}{\to} \frac{1}{2} \delta_0 + \frac{1}{2} \chi_1^2\] as \(n \to \infty\), where \(\delta_0\) is the point mass at zero.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Quasi-Likelihood Model	\((x_i, y_i)\) follow a quasi-likelihood model if they are independent and satisfy: \[{\mathbb{E}}[y_i] = \mu_i(\beta) = g^{-1}(x_i^T\beta)\] \[{\mathrm{Var}}[y_i] = V_i(\mu_i(\beta))\] where we fix \(V_i\) and \(g\), and we try and fit \(\beta\).<br> <br>This effectively means that we can just fix some link function and some variance function, without having to specify anything else about our model.<br> Every GLM fits this format.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Quasi-Score	\[U(\beta) = \sum_{i=1}^n \frac{y_i - \mu_i(\beta)}{V_i(\mu_i(\beta))} \mu_i'(\beta) x_i\] where \[\mu_i'(\beta) = (g{^{-1}})'(x_i{^{\mathrm{T}}} \beta)\]<br> <br>Note that this fulfills the score identity \({\mathbb{E}}[U(\beta)] = 0\), and the Fisher information can still be calculated the standard way, \(-{\mathbb{E}}[\frac{du}{d\mu(\beta)}]\)<br> <br>We can estimate \(\beta\) by solving the estimating equation \(u(\hat \beta) = 0\) (called the Quasi-Likelihood Estimator)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Quasi-Likelihood Estimator Distribution	\[(X{^{\mathrm{T}}} W X)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] where \(W\) is the diagonal matrix with entries \[W_{ii} = \frac{\mu_i'(\hat \beta)^2}{V_i(\mu_i(\hat \beta))}\] where \(\hat \beta\) is the Quasi-Likelihood Estimator.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Linear Mixed Effects Model	Given observations \((x_{ij}, z_{ij}, Y_{ij})\), with \(i \in [n]\) groups and \(j \in [d]\) individuals per group, <br>\[Y_i = X_i\beta + Z_iu_i + {\varepsilon}_i\] where \(X_i \in {\mathbb{R}}^{d \times p}\),<br> where \(Z_i \in {\mathbb{R}}^{d \times q}\),<br> \(u_i \overset{iid}{\sim} N(0, \Sigma_u)\) are random effects (\(\Sigma_u\) is positive semi definite in \({\mathbb{R}}^{q\times q}\)),<br> \({\varepsilon}_i \overset{iid}{\sim} N(0, \sigma^2 I_d)\) (independent of \(u_i\)),<br> \(\beta \in {\mathbb{R}}^p\) are fixed effects<br> <br>\(Y_i\) represents all the observations from group \(i\).<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Marginal Formulation of Mixed Effects Models	Just compute expectation and variance of \(Y_i\): \[V_i := Z_i \Sigma_u Z_i{^{\mathrm{T}}} + \sigma^2 I_d\] \[Y_i \sim N(X_i\beta,\, V_i)\]<br> <br>Let \(Y = (Y_1^T ... Y_n^T)\), similar for \(X\) and \(Z\), and let \(V = diag(V_1,...,V_n)\). Taking the log-likelihood of the random multivariate normal pdf yields: \[l(\beta, \sigma^2, \Sigma_u) = -\frac{1}{2} nd\log(2\pi) - \frac{1}{2}\log(\det(V)) - \frac{1}{2} (Y - X\beta)^T V{^{-1}} (Y - X\beta)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	MLE for \(\beta\) in a linear mixed effects model	\[\hat \beta = (X{^{\mathrm{T}}} V{^{-1}} X){^{-1}} X{^{\mathrm{T}}} V{^{-1}} Y\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Restricted Maximum Likelihood Estimation (ReML) for Linear Mixed Effects Models	We want a matrix \(L\) such that \(LX = 0\). Use \(L = I - P = I - X(X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}}\), then \(LY\) simplifies nicely: \[LY = LX\beta + L[Z_iu_i] + \varepsilon \sim N_r(0, LVL{^{\mathrm{T}}})\] So just estimate \(\sigma^2\), \(\Sigma_u\) using that as the MLE.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Linear Mixed Effects Models (GLMMs)	Assume the normal setup for linear mixed effects models, but let \(u_i\) have an arbitrary distribution with unknown parameter \(\alpha\): \(u_i \overset{d}{\sim} {\mathbb{P}}_\alpha\). Assumes \[g({\mathbb{E}}[Y_{ij} | u_i]) = x_{ij}{^{\mathrm{T}}} \beta + z_{ij}{^{\mathrm{T}}} u_i\] <br>Estimate \((\alpha, \beta, \phi)\) by MLE: \[f(y; \alpha, \beta, \phi) = \int f(y {\,|\,} u; \beta, \phi) f(u;\alpha)du\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Hierarchial Form of Linear Mixed Effects Models	\[u_i \sim N(0, \Sigma_u)\] \[Y_i {\,|\,} u_i \sim N(X_i\beta + Z_iu_i,\, \sigma^2 I_d)\]<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we mean by \(X_{(i)}(x)\) and a corresponding \(Y_{(i)}\)?	It's just a permutation of the data, sorting by distance to a point \(x\): \[||X_{(1)}(x) - x|| \leq ||X_{(2)}(x) - x|| \leq ... \leq ||X_{(n)}(x) - x||\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(L\)-Nearest Neighbors Classifier	We have the conditional class probabilities: \[\hat p_k(x) = \frac{1}{L}\sum_{i=1}^L {\mathbb{1}}_{Y_{(i)} = k}\] Then we just use the plug-in estimation for the classifier: \[h^{LNN}(x) := {\underset{k}{\mathrm{argmax}}\,\,}\hat p_k(x)\]<br> <br>(majority vote among \(L\) nearest neighbors)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Statement of the bounds on the 1NN Classifier Risk bounds theorem	Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.<br> <br>Then the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> 1NN Classifier Risk<br> Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.<br> Show that the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\]	First show that \({\mathbb{P}}(h^{1NN}(x) = 1 {\,|\,} X_1 ... X_n) = p_1(X_{(1)}(x))\), and likewise for \(k=2\). <br> <br>Use this to show \(R(h^{1NN}) = {\mathbb{E}}[p_2(X_{(1)}(X^*))p_1(X^*) + p_1(X_{(1)}(X^*))p_2(X^*)\)<br> <br>Make the approximation that \(p_1(X_{(1)} (X^*)) \approx p_1(X^*)\) by continuity. <br> <br>Plug into \(\lim_{n \to \infty} R(h^{1NN})\), recognize that \(2 {\mathbb{E}}[p_1(X^* )(1-p_1(X^* ))] \leq 2R(h^*) \)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bootstrap Aggregation (Baggining)	Given some low-bias classifier \(h\):<br> 1. For \(b = 1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT from observations. Use this to compute the classifier \(h^{(b)}\).<br> 2. Create a new compound classifier by majority voting among all \(b\)s: \[h^{bag}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \frac{1}{B}\sum_{b=1}^B {\mathbb{1}}_{h^{(b)} (x) = k}\] or for regression, just an average: \[\hat f^{bag}(x) = \frac{1}{B}\sum_{b=1}^B f^{(b)} (x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Benefit of bagging	Won't change the bias, but can reduce variance (because bootstrapped samples are more independent).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> m-out-of-n Bagging<br>(and its benefits)	Instead of drawing \(n\) samples with replacement, draw \(m &lt; n\) samples (still with replacement).<br> <br>This increases independence among bagged samples, further decreasing the variance.<br> <br>This yields a consistent estimator when \(B, m \to \infty\) with \(m / n \to 0\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Gini Index	\[G(R) := \sum_{k=1}^K \hat p_k(R)(1 - \hat p_k(R))\] where \[\hat p_k (R) := \frac{1}{N(R)}\sum_{i: X_i \in R} I_{Y_i = k}\] (fraction of points in \(R\) that correspond to class \(k\))<br> <br>Measures the impurity of a region \(R\): we want regions with a lot of points with the same label. \\(pick a label randomly from the pool of labels we know the points have; what's the misclassification rate of just picking randomly from those labels, to label each point?)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Classification Tree Loss	For classification trees, we use the loss: \[Q(j,t) := \frac{N(R_l(j,t))}{N(R)}G(R_l(j,t)) + \frac{N(R_r(j,t))}{N(R)}G(R_r(j,t)) - G(R)\] where \(G(R)\) is the Gini index.<br> <br>(Reduction in impurity that splitting gains, weighted by the fraction of points that each region takes)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Final classification function of a classification tree	\[h^{CART}(x) := \sum_{m=1}^M \hat c_m I_{x \in R_m}\] where \[\hat c_m = {\underset{k}{\mathrm{argmax}}\,\,} \hat p_k(R_m)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bias-Variance Tradeoff for CART	Shallow trees: high bias, low variance<br> Deep trees: low bias, high variance	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Random Forest Procedure	Similar to bagging:<br> 1. For \(b=1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT. At each split, pick the best out of some random selection \(m_{try}\) of the \(p\) variables. Continue until it has grown maximally.<br> 2. Aggregate all \(B\) tree functions. For regression, just take the average of applying each tree function; for classification, just take the majority vote.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance of a regression random forest	\[{\mathrm{Var}}[f^{RF} (x)] = \rho \sigma^2 +\frac{1 - \rho}{B} \sigma^2\] <br>where for \(b \neq b'\), \(\sigma^2 = {\mathrm{Var}}[f^{(b)}(x)]\) and \(\rho\) is the correlation between \(f^{(b)}(x)\) and \(f^{(b')}(x)\).<br> <br>(split correlation into diagonal and non-diagonal bits in the sum)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Out-of-bag Error (OOB)	Just the average difference between \(Y_i\) and an aggregated forest prediction \(\hat Y_i\). Estimates the prediction error. <br> <br>Note this approaches leave-one-out cross validation error as \(B \to \infty\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Cook's Distance, Intuitively	A datapoint \((x,y)\)'s cook's distance is how much it would affect the final fit if it were to be removed.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The Parametric Bootstrap	Suppose we know the family of distributions relevant, \({\mathbb{P}}_\theta\), but not the particular \({\mathbb{P}}_{\theta_0}\) with some property \(\psi(\theta_0)\). The idea is to estimate \(\theta_0\) with an estimator \(\hat \theta\), then simulate a bunch of new trials to understand the distribution of \(\psi\) with \(\hat \psi\):<br> 1. For \(b = 1...B\), sample \(Y^{(b)} = (Y_1^{(b)}, ...,Y_n^{(b)}) \overset{iid}{\sim} {\mathbb{P}}_{\hat \theta}\) from the estimated \(\hat \theta\) derived from \(Y\).<br> Use this to compute \(\hat \theta^{(b)} = \hat \theta(Y^{(b)})\) and plug in to get \(\hat \psi^{(b)} = \psi (\hat \theta^{(b)})\).<br> 2. Approximate \(\hat \psi\) with the empirical distribution: \[\hat {\mathbb{P}}^{(b)} = \frac{1}{B} \sum_{i=1}^B \delta_{ \hat \psi^{(b)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Percentile Bootstrap Confidence Interval	In the parametric bootstrap, simulate \(B\) trials, order by their outcome \(\hat \psi^{(b)}\), the interval \( [\hat \psi^{0.025 B}, \hat \psi^{0.975 B} ] \) is an approximate two-tailed 95 percent confidence interval.<br> <br>More formally, we denote this as \([\hat q_{\alpha/2}, \hat q_{1 - \alpha/2}]\)<br> <br>Note that you can use the likelihood ratio test statistic as \(\psi\), like: \[\psi (\theta) = 2 \{ l(\beta, \sigma^2, \Sigma_u) - l(\beta, \sigma^2, 0) \}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When is the percentile bootstrap confidence interval most valid?	When we assume \(\hat \psi \approx N(\psi_0, \sigma^2_\psi)\) and \(\hat \psi ^* \approx N(\hat \psi, \hat \sigma^2_\psi)\) conditional on \(Y\) (where \(\hat \sigma^2_\psi \approx \sigma^2_\psi\))	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Basic Bootstrap Confidence Interval<br>(and its assumptions)	\[[2 \hat \psi - \hat q_{1 - \alpha/2}, 2 \hat \psi - \hat q_{\alpha/2}]\] Relies on the assumption that \(\hat \psi^* - \hat \psi\) under \({\mathbb{P}}^*\) is similar to \(\hat \psi - \psi_0\), so that we have: \[\begin{align*} {\mathbb{P}}(2 \hat \psi - \hat q_{1 - \alpha/2} \leq \psi_0 \leq 2 \hat \psi - \hat q_{\alpha/2}) &\leq {\mathbb{P}}( \hat q_{1 - \alpha/2} - \psi \leq \hat \psi - \psi_0 \leq \hat q_{\alpha/2} - \psi )\\\\ & \approx {\mathbb{P}}^*( \hat q_{1 - \alpha/2} - \psi \leq \hat\psi^* - \hat \psi \leq \hat q_{\alpha/2} - \psi )\\\\ &= 1 - \alpha \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Probability as an Expectation	For all events \(A\), \[{\mathbb{P}}(A) = {\mathbb{E}}[{\mathbb{1}}(A)]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> Over all classifiers, the Bayes classifier has the minimal prediction error: \[\underset{h \in H}{\min} R(h) = R(h^*) = {\mathbb{E}}[1 - P_{h^*(X^*)}(X^*)] \] where \(H\) is the set of all classifiers.	Proof sketch:<br> Express the classification risk as an expectation. <br> Use the tower property to fix some \(X^* = x\), and split into a sum over possible values of \(Y^*\). <br> Note that \({\mathbb{1}}(h(x) \neq k)\) is now deterministic, so isolate an expectation on \(Y^* = k\). <br> Note that this is equivalent to the statement of the conditional class probability \(p_k(x)\). <br> Clearly the resulting statement is minimized when \(h(x) = p_k(x)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Plug-in Classifier	To construct a classifier \(\hat h(x)\), define an estimator \(\hat p_k(x)\), then use the plug-in formula \[\hat h(x) = {\underset{k}{\mathrm{argmin}}\,\,} \hat p_k(x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Linear and nonlinear classifiers	A classifier is linear if its decision boundaries are all subsets of hyperplanes in \({\mathbb{R}}^p\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Class Densities and Class Prior Probabilities for LDA	Class densities: \(f_k: X^* {\,|\,} (Y^* = k)\)<br> Class prior probabilities: \(\pi_k = {\mathbb{P}}(Y^* = k)\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Conditional class probability in terms of class densities and class prior probabilities (for LDA)	\[p_k = \frac{f_k(x) \pi_k }{\sum_{j=1}^K f_j(x) \pi_j}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Assumptions for Linear Discriminant Analysis	Assume each class density  \(f_k = X^* {\,|\,} (Y^* = k)\) is normally distributed around its mean: \[f_k \sim N_p(\mu_k, \Sigma)\] Need to solve for all \(\pi_i\), \(\mu_i\), \(i \in [p]\) and \(\Sigma\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Discriminant Functions	In linear discriminant analysis, we assume each class density is normal, and so plugging into the conditional class probability, throwing out common factors, and taking a log, we get: \[h^*_k(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \delta_k(x)\] \[\delta_k(x) := -\frac{1}{2} (x - \mu_k){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu_k) + \log(\pi_k)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Decision Boundaries of LDA Bayes	The subset \(\{ x \in {\mathbb{R}}^p: p_k(x) = p_l(x) \}\), where: \[\log \left( \frac{p_k(x)}{p_l(x)}\right) = x^T \Sigma{^{-1}} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l){^{\mathrm{T}}} \Sigma{^{-1}} (\mu_k - \mu_l) + \log \frac{\pi _k}{\pi_l}\]<br> <br>\(\Sigma{^{-1}} (\mu_k - \mu_l)\) are the normal vectors to the hyperplanes seperating the fitted classes	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Assumed values of each parameter \(\hat \pi_k\), \(\hat \mu_k\), and \(\hat \Sigma\) in \(h^{LDA}\)	\(\hat \pi_k= \frac{N_k}{n}\) (where \(N_k\) is the number of observations of class \(k\))<br> \(\hat \mu_k = \frac{1}{N_k} \sum_{i:Y_i = k}X_i\) (centriod of class \(k\))<br> \(\hat \Sigma = \frac{1}{n - K} \sum_{k=1}^K \sum_{i:Y_i=k}(X_i - \hat \mu_k) (X_i - \hat \mu_k){^{\mathrm{T}}}\) (pooled sample covariance matrix)<br> <br>Used in \(h^{LDA}(x)\): \[h^{LDA}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \hat \delta_k(x)\] \[\hat \delta_k(x) := -\frac{1}{2} (x - \hat\mu_k){^{\mathrm{T}}} \hat\Sigma{^{-1}} (x - \hat \mu_k) + \log(\hat\pi_k)\] This is an MLE, and is therefore consistent.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Logistic Classifier	Suppose we have independent observations \(\tilde Y_i \in \{ 1, ..., K \}\) over \(K\) classes, so each \(Y_i \in \{0,1\}^k\), \(Y_{ij} = {\mathbb{1}}_{\{\hat Y_{i = j}\}}\). We have: \[Y_i {\,|\,} (X_i = x_i) \overset{ind}{\sim} \mathrm{Multinomial}(1; p_1,...,p_K(x_i))\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Conditional Class Probabilities for the Logistic Classifier	\[p_k(x_i) = \frac{e^{x_i^T \beta_k}}{1 + \sum_{j=1}^{K-1} e^{x_i^T \beta_j}}\] Note that we set \(\beta_K := 0\) as a baseline.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Logistic Classifier Log Likelihood	\[L(\beta_1,...,\beta_{K-1}) = \log \left( \prod_{i=1}^n \prod_{k=1}^K p_k(x_i)^{Y_{ik}} \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When should you use LDA versus the logisitic classifier?	LDA yields smaller variance because of the Gaussian assumption, however the logistic classifier is more robust if Gaussianity does not hold	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Support Vector Machine	A two-class classifier (\(K = 2\)) with points labeled \(Y_i \in \{ 1, -1\}\).<br> We seek a hyperplane \(f(x)\), with \(f(x) &gt; 0\) for class 1 and \(f(x) &lt; 0\) for class 2 (class -1).<br> We find the hyperplane fit which maximizes the margin (closest distance from points on each side to the separating hyperplane)<br> <br>\[h^{SVC}(x) = {\mathrm{sgn}} (\hat \alpha + x{^{\mathrm{T}}} \hat \beta)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Hyperplane and Separating Hyperplane	\[f(x) = \alpha + x^T\beta\] It's called separating if \[Y_i f(X_i) &gt; 0\] for all \(i \in [n]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Optimal Separating Hyperplane for SVMs	Maximize \(M\) (the margin) over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(M \geq 0\), subject to \[\frac{Y_i(\alpha + X_i{^{\mathrm{T}}} \beta)}{||\beta||_2} \geq M\] for all \(i \in [n]\). <br> Equivalently, we can MINIMIZE \(||\beta||_2\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1\] for all \(i \in [n]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Softened Separating Hyperplane Optimization for SVMs	Minimize \[||\beta||_2^2 + C\sum_{i=1}^n \xi_i\] over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(\xi \in {\mathbb{R}}^n\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1 - \xi_i\] \[\xi_i \geq 0\] for \(i \in [n]\). We call the \(\xi_i\) slack variables:<br> If \(\xi_i = 0\), \(X_i\) is on the right side of the boundary and beyond the margin.<br> If \(0 &lt; \xi_i  &lt; 1\), \(X_i\) is within the boundary but on the wrong side of the margin.<br> If \(\xi &gt; 1\), then \(X_i\) is on the wrong side of the boundary.<br> <br>A smaller \(C\) means a tighter margin with more misclassifications.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Alternative formulation of the soft SVM	\[(\hat \alpha, \hat \beta) \in {\underset{\alpha \in {\mathbb{R}}, \beta \in {\mathbb{R}}^p}{\mathrm{argmin}}\,\,} \left( \sum_{i=1}^n (1 - Y_i(\alpha + X_i{^{\mathrm{T}}} \beta))_+ + \lambda||\beta||^2_2 \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Laplace Kernel	\[k(x,y) = \exp(-\gamma ||x - y||_2)\] Radial like the Gaussian kernel, but with a slower decay for far away points.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log-Odds	Log of the odds as we know it: like '3 to 1 in our favor'.<br> For an arbitrary probability \(p\), the logit gives its log-odds. <br> For classification problems, it's always given by \[\log \left( \frac{p_k(x)}{p_j(x)} \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log-Odds of the Logistic Classifier	\[\log \frac{p_k(x_i)}{p_K(x_i)} = x_i^T \beta_k\] with \(\beta_K = 0\) as the baseline.<br> Extending this, we have: \[\log \left(\frac{p_k(x_i)}{p_l(x_i)} \right) = x_i{^{\mathrm{T}}} (\beta_k - \beta_l)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Model Complexity and the Bias-Variance Decomposition	Complex models have low bias, but high variance.<br> Simple models have high bias, but low variance.	SLIP PartIIINotes