#separator:tab
#html:true
#notetype column:1
#deck column:2
#tags column:5
Basic	Part III Notes::MSM	<b>Definition:</b> Bernstein's Condition	A random variable X fulfills the Bernstein condition with parameters \(\sigma, b \geq 0\) if: <br>\[ {\mathbb{E}}(|X - {\mathbb{E}} X|^k) \leq \frac{1}{2} k! \sigma^2 b^{k-2} \] <br>for \(k = 2, 3, ...\)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Reproducing Kernel Hilbert Space (RKHS)	A Hilbert space of functions \(f: X \to \mathbb{R}\) such that, for all \(x \in X\),\[f(x) = \left&lt; f, K_x \right&gt;\] <br> Note: this condition is equivalent to saying that the point-evaluation functional  \[L_x = H \to \mathbb{R},\, L_x(f) = f(x)\] is continuous for every \(x \in X\). <br>So in other words \(L_x \in H^*\) <br>This is equivalent to the condition above by the Riesz Representation Theorem	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Reproducing kernel (of an RKHS)	By definition, fixing some \(x \in X\), there exists some \(K_x\) such that for any \(f \in H,\, f(x) = \left&lt; f, K_x \right&gt;\).  <br>If we plug in some \(y\) to \(K_x\), we then get \(K_x(y) = \left&lt;K_x, K_y \right&gt;\) <br>So our reproducing kernel is the function where we take some \(x, y\) pair and perform that process: \[K(x,y) = \langle K_x, K_y \rangle\] \(K: X \times X \to \mathbb{R}\)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Topological Ordering	A permutation \(\pi\) of some set of nodes \([n], n\in {\mathbb{N}}\), is a topological ordering if:<br> \[\forall a,b \in [n],\, b \in \mathrm{de}(a) \implies \pi(a) &lt; \pi(b)\] (Higher nodes in the DAG are sorted to the front)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Multiple Testing	When we have a sequence of null hypotheses \(H_1...H_m\) where we have found associated p-values \(p_1...p_m\).<br> <br>We let \(I_0\) represent the indices of the true nulls where no relationship exists, with \(m_0 = |I_0|\).<br> <br>It's not as simple as just rejecting the small-enough p-values due to the jellybeans effect!	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	What does \(I_0\) represent in multiple testing?	A set of hypothesis indices representing true nulls that should not be rejected.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Family-Wise Error Rate (FWER)	\[{\mathbb{P}} ( N \geq 1)\] where \(N\) is the number of false rejections.<br> <br>For instance the FWER of the naiive approach to multiple testing is \[ 1 - (1 - \alpha)^m \]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Bonferroni Correction	Reject \(H_i\) if \(p_i \leq \frac{\alpha}{m}\) for desired \(\alpha\) level.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Intersection Hypothesis	When we want to test some subset of our multiple hypotheses \(I \subseteq [m]\) are all true: \(H_I\).<br> <br>We can reject it by showing that any of the hypotheses can be rejected!	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Local Test<br>(and its critical requirement)	Given some intersection hypothesis \(H_I\), a test \(\phi_I\) that is 1 if we reject \(H_I\) or zero otherwise. We require that the error rate of \(\phi_I\) be \(\leq \alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Closed Testing Procedure	Suppose you have a local test \(\phi_I\) for testing intersection hypotheses. Reject \(H_I\) iff all superset local tests \(\phi_J,\, I \subseteq J\) are rejected (test everything above \(I\)).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holm's Procedure	Closed testing with the Bonferroni test as the local procedure: \[ \phi_I(x) = \begin{cases}            1 & \underset{i \in I}{\min} (p_i) \leq \frac{\alpha}{|I|}  \\\\           0 & \text{otherwise}        \end{cases} \]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holm's Procedure as a Stepdown Procedure	Let \(p_{(i)}\) represent the \(i\)th smallest p-value for respective hypothesis \(H_{(i)}\).<br> <br>Step 1: if \(p_{(1)} \leq \frac{\alpha}{m}\), reject \(H_{(1)}\) and continue; otherwise accept \(H_{(1)}...H_{(m)}\) <br>... <br>Step i: if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), reject \(H_{(i)}\) and continue; otherwise accept \(H_{(i)}...H_{(m)}\) <br>... <br>Step m: if \(p_{(m)} \leq \alpha\), reject \(H_{(m)}\) and finish; otherwise accept only \(H_{(m)}\)<br> <br>So, to recap, iterate over p-values starting at the smallest, rejecting and continuing if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\) or ceasing otherwise.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	How good is Holm's procedure?	Uniformly more powerful than Bonferroni correction	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> False Discovery Rate (FDR)	\[ {\mathbb{E}} \left( \frac{N}{R \lor 1} \right) \] where \(N\) is the number of false rejections, \(R\) is the total number of rejections, and \(\lor\) is the max function (so we never divide by zero).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Benjamin-Hochberg Procedure	Order hypotheses by their p-values, from smallest to largest. Find the largest index underneath the slope \(p_{(i)} \leq \frac{\alpha i}{m}\): that index and everything below it is rejected, but everything above \(i\) is accepted. The FDR is guaranteed to be kept at \(\frac{\alpha m_0}{m} \leq \alpha\)<br> <br>Visual idea: find the largest index point under the line with slope \(\frac{\alpha}{m}\), that is your cutoff point.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> The reproducing kernel of an RKHS is well-defined.	Given any two reproducing kernels \(k_x, h_x \in X \times X \to {\mathbb{R}}\), show that \[||k_x - h_x||^2_H = 0\] by representing as an inner product and simplifying: <br>\[\begin{align*} ||k_x - h_x||^2_H &= \left&lt; k_x - h_x, k_x - h_x \right&gt;\\\\ &= \left&lt; k_x, k_x - h_x \right&gt; - \left&lt; h_x, k_x - h_x \right&gt;\\\\ &= (k_x - h_x)(x) - (k_x - h_x)(x)\\\\ & \text{by the reproducing property}\\\\ &= 0 \end{align*}\]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Every DAG has a topological ordering.	First, prove by contradiction that there must be a node with no parents, called the source node (or else we must have a cycle, breaking our DAG assumption). <br>Next, build the graph up inductively from a graph with one node.  <br>Base case: Clearly any graph with one node is a topological order. <br>Inductive step: We assume that ANY graph with \(n-1\) nodes, \(G_{n-1}\), has a topological order. Thus, given a graph \(G_n\) nodes, apply the lemma we proved above to get the parent node, and remove it to get a new graph \(G_{n-1}\). By the inductive step we have a topological order \(\pi_{n-1}\). Construct \(\pi_{n}\) by shifting every node in \(\pi_{n-1}\) down one, and putting the parent node at the top at 1. We're done.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Bonferroni Correction yields a FWER \(\leq \alpha\)	Just apply Markov's inequality and chug: <br>\[\begin{align*} {\mathbb{P}} (N \geq 1) &\leq {\mathbb{E}}(N)\\\\ & \text{ by the Markov inequality}\\\\ &\leq {\mathbb{E}}\left[ \sum_{i \in I_0} {\mathbb{1}}_{\{ p_i \leq \alpha / m \}}\right]\\\\ &\leq \sum_{i \in I_0} {\mathbb{P}}\left(p_i \leq \frac{\alpha}{m}\right)\\\\ &\leq \frac{\alpha m_0}{m}\\\\ &\leq \alpha \end{align*}\]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> When all hypotheses are tested using closed testing, a false rejection will be made with at most probability \(\alpha\)	False rejections can only be made if our intersection hypothesis is entirely composed of indices in \(I_0\). But then including upwards, we are guaranteed to check \(\phi_{I_0}\), which only happens with probability at most \(\alpha\).<br> <br>It follows that the FWER for any particular \(H_I\) has at most level \(\alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Benjamin-Hochberg keeps FDR \(\leq \alpha\)	We must assume that all \(p_i,\, i \in I_0\) are independent from all the other p-values.<br> <br>Begin by breaking down R then N into their possible cases, and express as a disjoint probability in two terms.<br> <br>Simplifying the event in the probability term: break down the condition \(R = r\) into what it means for \(r\) to be the max value. Define a new set \(p_{-i}\) where we've removed the \(i\)th p-value (shifting the indices to the left), so that they are independent of \(p_i\). Remark these two terms are equivalent themselves to a reduced Benjamin-Hochberg trial with a modified cutoff line \(\frac{\alpha}{m}\).<br> <br>Plugging back in the rewritten event, seperate the probability block and simplify to get \(\leq \frac{\alpha m_0}{m} \leq \alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Conditional Independence<br>(Two equivalent definitions for \(X {\perp\!\!\!\perp} Y {\,|\,} Z\), for random vectors \(X, Y, Z\).)	1. \({\mathbb{P}}(X \in A,\, Y \in B {\,|\,} Z) = {\mathbb{P}}(X \in A{\,|\,} Z){\mathbb{P}}(Y \in B{\,|\,} Z)\)<br> <br>2. \({\mathbb{P}}(X \in A {\,|\,} Y, Z) = {\mathbb{P}}(X \in A {\,|\,} Z)\)<br> (knowing \(Y\) doesn't give us any new information about \(X\))<br> <br>note that here \(X \in A\) is best read as '\(X\) lands in \(A\)'	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Weak Union rule for Conditional Independence	\[ X {\perp\!\!\!\perp} Y, Z \implies  \begin{cases}    X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\   X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \] We can move elements from the right-hand side to the conditioning side. Also works with some other conditioning \(W\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Contraction rule for Conditional Independence	\[ \begin{rcases*} X {\perp\!\!\!\perp} Z\\\\ X {\perp\!\!\!\perp} Y {\,|\,} Z \end{rcases*} \implies X {\perp\!\!\!\perp} Y, Z \] We can apply known independencies to the conditioning side. Also works with some other conditioning \(W\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Intersection rule for Conditional Independence	\[ \begin{cases}    X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\   X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \implies X {\perp\!\!\!\perp} Y, Z \] Converse of the weak union rule. ONLY ALLOWED when \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	When are we allowed to use the intersection rule for conditional independence?	When \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Conditional Independence Graph (CIG)	Given a distribution \(P\) on \({\mathbb{R}}^p\), an undirected graph \(G = ([p], E)\) where, if \(Z \sim P\), \[ (j,k), (k,j) \in E \iff Z_j \not\mathclap{{\perp\!\!\!\perp}}\mkern12mu Z_k {\,|\,} Z_{-jk} \] where \(Z_{-jk}\) is a subvector of \(Z\) without indices \(j\) and \(k\).<br> <br>Thus components are connected if there exists some dependence between them (even given all other information)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Law of Total Expectation<br>(AKA the tower property)	\[{\mathbb{E}}({\mathbb{E}}(X|Y)) = {\mathbb{E}}(X)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Chain Rule of Probability	\[\begin{align*} {\mathbb{P}}(X,Y) &= {\mathbb{P}}(Y){\mathbb{P}}(X|Y)\\\\ &= {\mathbb{P}}(X){\mathbb{P}}(Y|X) \end{align*}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Slutsky's Theorem	Let \(X_n, Y_n\) be sequences of random elements. If both<br> \(X_n \overset{d}{\to} X\)<br> \(Y_n \overset{p}{\to} c\)<br> for some constant \(c\), then<br> <br>1. \(X_n + Y_n \overset{d}{\to} X + c\)<br> 2. \(X_nY_n \overset{d}{\to} Xc\)<br> 3. \(X_n / Y_n \overset{d}{\to} X /c\) given \(c \neq 0\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Undirected Graph Separation<br>(Given an undirected graph \(G = ([p], E)\) and disjoint sets \(A, B, S \subset [p]\), S separates \(A\) and \(B\) if...)	Every path from \(A\) to \(B\) contains a node in \(S\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Global Markov Undirected Graphs<br>(A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to an undirected graph \(G = ([p], E)\) when...)	Taking any \(A, B, S \subset [p]\) with \(S\) separating \(A\) and \(B\), we have that for any \(Z \sim P\), \[ Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Objective of Conditional Independence Testing	Given \(n\) iid copies \((x,y,z),\,...,\,(x_n,y_n,z_n) \in {\mathbb{R}} \times {\mathbb{R}} \times {\mathbb{R}}^p\) collected into \((X, Y, Z) \in {\mathbb{R}}^n \times {\mathbb{R}}^n \times {\mathbb{R}}^{n \times p}\), we wish to test the null hypothesis \(X {\perp\!\!\!\perp} Y {\,|\,} Z\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Intuitively, how can we think about \({\mathbb{E}}(X {\,|\,} Z)\), and how could we solve for such a thing?	\({\mathbb{E}}(X {\,|\,} Z)\) represents our 'best guess' for \(X\) given some information \(Z\).<br> <br>We can solve for \({\mathbb{E}}(X {\,|\,} Z)\) by finding an \(f\) that minimizes \[{\mathbb{E}}\left[(X - f(Z))^2\right]\] (this is called the L2 projection theorem, and it links together statistical regression and the conditional expectation)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Generalized Covariance Measure (GCM)<br>(Assuming we have data \((X_i, Y_i, Z_i)\) and fitted regression functions \(\hat f(z)\) and \(\hat g(z)\)...)	The test statisitc  \[T := \sqrt{n} \frac{\tau_N}{\tau_D}\] where \[\begin{align*} \tau_N &:= \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\xi_i\\\\ \tau_D^2 &:= \frac{1}{n} \sum_{i=1}^n   \hat\epsilon^2_i\hat\xi^2_i - \tau^2_N \end{align*}\] where we've defined \(\hat \epsilon_i := X_i -\hat f(Z_i)\) and \(\hat \xi := Y_i - \hat g(Z_i)\)<br> <br>Explanation:<br> Our null hypothesis is that \[{\mathbb{E}}\left[ (X - {\mathbb{E}}(X{\,|\,} Z)) (Y- {\mathbb{E}}(Y{\,|\,} Z))\right]=0,\] so we just need to quantify deviance from that. \(\hat f\) and \(\hat g\) approximately represent \({\mathbb{E}}(X{\,|\,} Z)\) and \({\mathbb{E}}(Y{\,|\,} Z)\) respectively, and so we define our average deviance among samples as \(\tau_N\).<br> The CLT tells us \(\sqrt{n}\tau_N\) should be Gaussian (mean-zero under the null, iid per \(i\)), but we need to normalize for variance if we want a standard normal distribution. So we define \(\tau_N\) to do this, using \({\mathrm{Var}}(X) = {\mathbb{E}}(X^2) - {\mathbb{E}}(X)^2\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> The Generalized Covariance Measure is asymptotically normal	Assuming there exists \(c &gt; 0\) such that \({\mathrm{Var}}(\epsilon {\,|\,} Z), {\mathrm{Var}}(\xi {\,|\,} Z) &lt; c\) almost surely,<br> \({\mathrm{Var}}(\epsilon\xi) &gt; 0\),<br> and  \[{\mathcal{E}}_f \to 0,\, {\mathcal{E}}_g \to 0,\, {\mathcal{E}}_f {\mathcal{E}}_g = o(n^{-1}),\] given \({\mathcal{E}}_f := {\mathbb{E}}\left[ \frac{1}{n} \sum_{i=1}^n (f(Z_i) - \hat f(Z_i))^2 \right]\) and likewise for \({\mathcal{E}}_f\), then \[T \sim N(0, 1)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	What is the benefit of using the Benjamin-Hochberg Procedure?	It's guaranteed to keep the FDR beneath \(\alpha\): \[\frac{\alpha m_0}{m} \leq \alpha\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Vector generalization of the mean squared error (MSE) for an estimator \(\hat \theta\) of \(\theta\), and what it simplifies down to	\[{\mathbb{E}}[(\hat \theta - \theta) (\hat \theta - \theta )^T ] = {\mathrm{Var}}(\hat \theta) + Bias(\hat \theta) Bias(\hat \theta)^T = {\mathrm{Var}}(\hat \theta) + {\mathbb{E}}[\hat \theta - \theta ]{\mathbb{E}}[\hat \theta - \theta ]^T\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Equation optimized for Ridge Regression	\[ (\hat \mu_\lambda^R, \hat \beta_\lambda^R) = \underset{(\mu, \beta) \in {\mathbb{R}} \times {\mathbb{R}}^p}{argmin}(||Y - \mu 1 - X \beta||^2_2 + \lambda ||\beta||^2_2) \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Solution for \(\hat \beta_\lambda^R\) in Ridge Regression	\[(X^TX + \lambda I){^{-1}} X^T Y\]	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about the differences between OLS solutions \(\beta^{OLS}\) and ridge regression solutions \(\beta^R\)?	As \(\lambda \to 0\), the difference in MSEs between \(\beta^{OLS}\) and \(\beta^{R}_\lambda\) is positive definite: \[{\mathbb{E}}[(\beta^{OLS} - \beta_0)(\beta^{OLS} - \beta_0)^T] - {\mathbb{E}}[(\beta^{R} - \beta_0)(\beta^{R} - \beta_0)^T] \succeq 0\]<br> <br>Show this by first computing the bias and variance of \(\hat \beta ^R_\lambda\).	MSM PartIIINotes
Basic	Part III Notes::MSM	In MSM, what does the notation \(X_j\) mean?	The \(j\)th column of \(X\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What does SVD tell about when ridge regression works best?	\(\lambda\) shrinks Y most in the small principal components of X. So if most of the signal is in the large principal components, we won't lose much.	MSM PartIIINotes
Basic	Part III Notes::MSM	Show how the kernel trick leads to an alternate form of \(\hat \beta_\lambda^R\)	Note first that for \(X \in {\mathbb{R}}^{n \times p}\), \[(X^TX + \lambda I_n) X^T = X^T(XX^T + \lambda I_p)\] Multiply by inverses on each side and then multiplying by Y on the right yields \[(X^TX + \lambda I_n){^{-1}} X^T Y = X^T(XX^T + \lambda I_p){^{-1}} Y = \hat \beta_\lambda^R\] <br>Note that the fitted values \(X\hat \beta_\lambda^R\) only depend on \(XX{^{\mathrm{T}}}\) then!	MSM PartIIINotes
Basic	Part III Notes::MSM	Kernel Form of Ridge Regression	\[X\hat \beta_\lambda^R = K(K + \lambda I){^{-1}} Y\] where \(K = X X{^{\mathrm{T}}}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	Feature Map	A function  \(\phi:X \to H\) from X to some Hilbert space \(H\).<br> <br>Note that this gives us a similarity measure \(k: X \times X \to {\mathbb{R}}\): \[k(x, x') = \langle \phi(x), \phi(x') \rangle\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Positive Definite Kernel	A symmetric map \(k: {\mathcal{X}} \times {\mathcal{X}} \to {\mathbb{R}}\) such that the matrix \(K\) with entries  \[K_{ij} = k(x_i, x_j)\] is positive semi-definite.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Cauchy-Schwarz for Kernels	\[k(x, x')^2 \leq k(x,x)k(x',x')\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Linear Kernel	\(k(x,x') := x{^{\mathrm{T}}} x'\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Polynomial Kernel	\(k(x,x') := (1 + x{^{\mathrm{T}}} x')^d\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Gaussian Kernel	\(k(x,x') := \exp\left(-\frac{||x - x'||_2^2}{2h^2}\right)\) \(h&gt;0\) is called the bandwidth.<br> <br>Note that this must be infinite-dimensional because we have a term \(e^{xx'}\), which Taylor expands to an infinite series of polynomial terms.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> First-Order Sobolev Kernel	\[k(x, x') = \min(x, x')\]<br> This must be a kernel because it can be represented as \[ k(x,x') = \int_0^1 {\mathbb{1}}_{[0, x]}(u){\mathbb{1}}_{[0, x']}(u) = \langle {\mathbb{1}}_{[0, x]}, {\mathbb{1}}_{[0, x']} \rangle_{L_2([0,1])} \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Second-Order Sobolev Kernel	\[ k(x, x') = \int_0^{\min(x,x')} (x - u)(x' - u)du \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Jaccard Similarity	Similarity based on what features are shared among \(1...p\). \(h(x,x')\) is defined as 1 when \(x \cup x' = \emptyset\), and otherwise \[h(x,x') = \frac{|x \cap x'|}{|x \cup x'|}\] where \(|.|\) is the set size.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Spectral Theorem	If \(X\) is symmetric, we have a decomposition  \[X = P{^{\mathrm{T}}} D P\] where \(D\) is diagonal and \(P\) is orthogonal. The rows of \(P\) are the eigenvectors of X. <br> <br>Furthermore, if \(X\) is positive definite, every diagonal term in \(D\) is positive.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> For every kernel \(k\) on FINITE \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\]	Because \({\mathcal{X}}\) is finite and \(k\) is a kernel, we have a positive definite \(K \in {\mathbb{R}}^{n \times n}\) that contains all information about the space. Because \(K\) is positive definite, we have the decomposition \(K = P{^{\mathrm{T}}} DP\) with orthogonal \(P\) and positive diagonal \(D\). Take \({\mathcal{H}}\) to be \({\mathbb{R}}^n\) and the feature map \(\phi: x_i \mapsto (D^{1/2} P_i) \)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> For every kernel \(k\) in a general \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\]<br> <br>First specify the \({\mathcal{H}}\) and \(\phi(x)\) used.	Take \({\mathcal{H}}\) to be the linear span of the functions \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] for some \(n \in {\mathbb{N}}\), \(x_i \in {\mathcal{X}}\), and \(a_i \in {\mathbb{R}}\) with the inner product defined as \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] This is clearly well-defined since there's no dependence on the expansion of \(f\) or \(g\). So our feature map is  \[\phi(x) = k(\cdot , x)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Show that \({\mathcal{H}}\) defined as the space of functions of the form \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] with inner product \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] and \[\phi(x) = k(\cdot , x)\] is a valid kernel and inner product.	First show that we indeed have  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\] Clearly symmetric and linear, but is it positive definite?<br> Note first that \(\langle f, g\rangle \geq 0\) by positive definiteness of the kernel. So then for functions \(f_1...f_n\), coefficients \(\gamma_1...\gamma_n\), and working from the PSD requirement, we have: \[\sum_{i,j} \gamma_i \langle f_i, f_j \rangle \gamma_j = \left\langle \sum_i f_i \gamma_i, \sum_j f_j \gamma_j \right\rangle \geq 0\] Thus \(\langle\cdot , \cdot \rangle\) is a kernel. The last part to show is that \(\langle f, f\rangle = 0 \iff f = 0\). The backwards direction is obvious, but for the forwards direction, first note that \[\langle k(\cdot,x), f\rangle = \sum_i a_i k(x_i, x) = f(x)\] So, we have: \[\langle f, f\rangle = 0 \implies f(x) = \langle k(\cdot,x), f\rangle \leq \langle k(\cdot,x), k(\cdot,x)\rangle\langle f, f\rangle = 0\] where the \(\leq \) step is given by Cauchy-Schwarz for kernels.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Show the representer theorem:<br> Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of<br> \[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]<br> is given by<br> \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]<br> with \(\hat \alpha\) minimizing:<br> \[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\]	Let \(U\) be the linear span of \(k(\cdot, x_i), i = 1...n\). Linear subspaces of Hilbert spaces are closed; and closed subspaces of Hilbert spaces mean that any function \(f = u + v\): \(u \in U\) and \(v \in U^\perp\).<br> Show first that \(f(x_i) = u(x_i)\) by means of the definition of RKHS, and that \(J(||f||^2_{\mathcal{H}}) \geq J(||u||^2_{\mathcal{H}})\). Reason then that \(Q_1\) is minimized when \(v = 0\).<br> Show then that we can express \(||u||^2_{\mathcal{H}}\) as \(\alpha^T K \alpha\) and \(f(x_i)\) as \(K \alpha\); thus \(Q_1(f) = Q_2(\alpha)\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Functional Generalization of Ridge Regression	Find the \(\hat f_\lambda\) minimizing \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	How can we simplify \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\] with the Representer Theorem?	Using \(c(...) = \sum_{i=1}^n (Y_i - f(x_i))^2\) and \(J(||f||^2_{\mathcal{H}}) = \lambda||f||^2_{\mathcal{H}}\), we have that the above is equivalent to minimizing \[||Y - K\alpha||^2_2 + \lambda \alpha{^{\mathrm{T}}} K \alpha\]<br> <br>Note that this gives us a way to predict an unknown point: \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i k(x, x_i)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about orthogonal matrices in the L2 norm?	Because an orthogonal matrix never scales anything, we have: \[||UX||^2_2 = ||X|^2_2 = ||XU||^2_2\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Trick to pull orthogonal matrices out of an inverse like \((UDU{^{\mathrm{T}}} + \lambda I){^{-1}}\)	Note that \(I = U{^{\mathrm{T}}} U = UU{^{\mathrm{T}}}\), so we have: \[\begin{align*} (UDU{^{\mathrm{T}}} + \lambda I){^{-1}} &= (UDU{^{\mathrm{T}}} + U\lambda IU{^{\mathrm{T}}}){^{-1}}\\\\& = (U(D + \lambda I)U{^{\mathrm{T}}}){^{-1}}\\\\& = U^{-T} (D+\lambda I){^{-1}} U{^{-1}}\\\\& = U(D + \lambda I)U{^{\mathrm{T}}} \end{align*}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	AM-GM	The arithmetic mean is always at least as large as the geometric mean: for all \(a, b \in {\mathbb{R}}\), \[\frac{a+b}{2} \geq \sqrt{ab}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	How can we bound a term like \(\frac{ab}{(a+b)^2}\)?	Use AM-GM: \[\frac{a+b}{2} \geq \sqrt{ab}\] \[a+b \geq 2\sqrt{ab}\] \[(a+b)^2 \geq 4ab\] \[\frac{ab}{(a+b)^2} \leq \frac{ab}{4ab} = \frac{1}{4}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Prove that the mean squared prediction error (MSPE) may be bounded  above in the following way: \[\begin{align*} \frac{1}{n}\mathbb{E}\left\{\sum_{i=1}^n \{f^0(x_i) - \hat{f}_\lambda(x_i)\}^2\right\}  &\leq \frac{\sigma^2}{n}\sum_{i=1}^n \frac{d_i^2}{(d_i+\lambda)^2} + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n} \\\\ &\leq \frac{\sigma^2}{n}\frac{1}{\lambda}\sum_{i=1}^n \min(d_i/4, \lambda) + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n}. \end{align*}\]	Proof sketch:<br> By the representer theorem, we have that \[[\hat f_\lambda^R(x_i)] = K(K+\lambda I){^{-1}} Y\] and \[[\hat f_\lambda^0(x_i)] = K\alpha\] Use the eigendecomposition \(K = UDU{^{\mathrm{T}}}\), \(U\) orthogonal. <br> Let \(\Theta = D U{^{\mathrm{T}}} \alpha\), note that \(Y = K \alpha + {\varepsilon}\). Split into deterministic and stochastic parts (and swap them, stochastic is the first term in the final representation)<br> In the deterministic term, use the \(D{^{-1}}\) trick to introduce \(\alpha{^{\mathrm{T}}} K \alpha = ||f^0||^2_{\mathcal{H}}\); get the \(1/4\) from bounded the fraction with AM-GM. <br> In the stochastic bit, use the trace trick to get the \(\epsilon\)s together, where you can bring the expectation in to get \(\sigma^2\).<br> The final equality falls out of AM-GM again.<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Eigenfunction	Given a random variable \(X\) taking values in \({\mathcal{X}}\), we say a nonzero function \(e \in {\mathcal{H}}\) is an eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) if \[\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\] in other words, the \({\mathbb{E}}[k(x,X)...]\) part is a linear operator acting upon \(e\): <br>\[(Te)(x) = \int k(x,x') e(x')dP(x')\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Mercer's Theorem	Given a random variable \(X\) taking values in \({\mathcal{X}}\) with a nonzero \(e \in {\mathcal{H}}\) eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) (so that \(\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\) is fulfilled), and under some mild regularity conditions (like \({\mathbb{E}}(k(X,X)) &lt; \infty\)), then three things must hold true:<br> <br>1. The set of positive eigenvalues is at most countable.<br> <br>2. The subspace spanned by each eigenfunction has a finite dimension (known as the multiplicity of the eigenvalue)<br> <br>3. The eigenvalues are orthonormal in the sense that  \[{\mathbb{E}}[ e_j(X) e_k(X)] = {\mathbb{1}}_{j=k}\] See the decomposition of kernels.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Bochner's Theorem	Let \(k: {\mathbb{R}}^p \times {\mathbb{R}}^p \to {\mathbb{R}}\) be a continuous kernel. Then \(k\) is shift-invariant iff there exists some \(c &gt;0\) and distribution \(F\) on \({\mathbb{R}}^p\) such that when \(W \sim F\),  <br>\[k(x,x') = c {\mathbb{E}}[ e^{i(x-x'){^{\mathrm{T}}} W}] = c {\mathbb{E}}[ \cos ((x - x') {^{\mathrm{T}}} W)]\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Mean-Squared Prediction Error (MSPE)	\[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2]\] Measures the average difference between true values and predicted values.	MSM PartIIINotes
Basic	Part III Notes::MSM	Why do we expect that Lasso regression could be useful?	Let \(S\) be the set \(\{ k : \beta_k^0 \neq 0 \}\) (the set of parameters that are nonzero). Oftentimes this set is smaller than \(p\), the total number of parameters. When this is the case, OLS could stand to be a lot better: consider the MSPE: \[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2] = ... = \frac{\sigma^2p}{n}\] If we just dropped each of the \(p\)s that aren't in \(S\), we could significantly reduce the error.	MSM PartIIINotes
Basic	Part III Notes::MSM	Function minimized in Lasso regression	Find the \((\hat \mu^L, \hat \beta ^L)\) minimizing \[  \frac{1}{2n}||Y - X \beta||_2^2 + \lambda || \beta ||_1 \] <br>We generally assume that \(X\) is already mean-centered, giving us the loss we actually seek to minimize: \[Q_\lambda(\beta) := \frac{1}{2n}||Y - X\beta||_2^2 + \lambda ||\beta||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Given the minimizer \(\hat \beta ^L_\lambda\) in lasso regression, how can we express this as a conditional optimization problem?	\[||Y - X \beta||^2_2\, \text{ subject to } ||\beta||_1 \leq ||\hat \beta ^L_\lambda||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holder's Inequality	Let \(p, q \in [1, \infty]\) with \(\frac{1}{p} + \frac{1}{q} = 1\). Then we have, for any functions \(f\), \(g\), \[||fg||_1 \leq ||f||_p||g||_q\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Prediction Error of the Lasso (slow rate):<br> Let \(\hat{\beta}\) be any Lasso solution when \[\lambda = A\sigma\sqrt{\frac{\log(p)}{n}}.\] with probability at least \(1 - 2p^{-(A^2/2-1)}\). Then: \[\frac{1}{n}\|X(\beta^0 - \hat{\beta})\|_2^2 \leq 4A\sigma\sqrt{\frac{\log(p)}{n}}\|\beta^0\|_1.\]	Proof sketch:<br> Start from the fact that we know \(\beta^L\) can do no better than the true mean \(\beta^0\): \[\frac{1}{2n} ||Y - X \hat \beta||_2^2 + \lambda ||\hat \beta||_1 \leq \frac{1}{2n} ||Y - X \beta^0||_2^2 + \lambda ||\beta^0||_1\] Substitute in \(Y = X\beta^0 + {\varepsilon}\), match the left side. <br> We can put an implicit norm around \({\varepsilon}{^{\mathrm{T}}} X (\hat \beta - \beta^0)\), use Holder's to get the one norm we need paired with an infinity norm around \(X{^{\mathrm{T}}} {\varepsilon}\)<br> In order to make everything work out nicely, we need the assumption that \(\frac{1}{n}||X{^{\mathrm{T}}} {\varepsilon}||_\infty \leq \lambda\). Call this event \(\Omega\), we assume that it is true with probability at least \(1 - 2p^{-(A^2/2-1)}\).<br> Then just use the triangle inequality and multiply both sides by two.<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Subgaussian RV	We say a random variable \(X\) is subgaussian with parameter \(\sigma\) if: \[{\mathbb{E}}[e^{t(X - {\mathbb{E}}[X])}] \leq e^{t^2\sigma^2/2}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Subgaussian Tail Bound	Given \(X\) subgaussian with parameter \(\sigma\), we have: \[{\mathbb{P}}(X - {\mathbb{E}}[X] \geq t) \leq e^{-t^2 / (2\sigma^2)}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Hoeffding's Lemma	If \(X\) takes values in \([a,b]\), then \(X\) is subgaussian with parameter \((b-a)/2\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> What must be true about sequences of independent subgaussian RVs?	If independent subgaussian RVs \(X_1...X_n\) have parameters \(\sigma_i\) respectively, then the RV \(\gamma{^{\mathrm{T}}} X\) is subgaussian with parameters \(\left( \sum_i \gamma_i^2 \sigma_i^2 \right)^{1/2}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Suppose \((\varepsilon_i)_{i=1}^n\) are independent, mean-zero and sub-Gaussian with common parameter \(\sigma\). Note that this includes \(\varepsilon \sim N_n(0, \sigma^2 I)\). Let \(\lambda = A\sigma\sqrt{\log(p)/n}\). Then \[\mathbb{P}(\|X^\top \varepsilon\|_\infty/n \leq \lambda) \geq 1 - 2p^{-(A^2/2-1)}\]	Proof sketch:<br> Start from \[...\leq \sum_{i=1}^p {\mathbb{P}}(\frac{1}{n} |{\varepsilon}^T X_i| &gt; \lambda)\] Split into two events, one where \({\varepsilon}^T X_i\) is positive, and one where it's negative. Both are subgaussian with parameter \(\sigma/\sqrt{n}\), so use the upper bound. Then just plug in \(\lambda\).<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> For \(\lambda \geq 0\), any two lasso regression solutions \(\hat \beta_\lambda^{(1)}\) and \(\hat \beta_\lambda^{(2)}\) must have the same fitted values: \(X \hat \beta_\lambda^{(1)} = X \hat \beta_\lambda^{(2)}\)	First note that, by the convexity of \(||\cdot||^2_2\) with \(t = \frac{1}{2}\),  \[||Y - \frac{1}{2}X\beta^{(1)}_\lambda - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \leq ||Y - \frac{1}{2}X\beta^{(1)}_\lambda||^2_2 + ||Y - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \] and by strict convexity, equality is achieved iff \(X\beta^{(1)} = X\beta^{(2)}\).<br> Show that \[c^* \leq Q_\lambda(\frac{1}{2}X\beta^{(1)} + \frac{1}{2}X\beta^{(2)}) \leq \frac{1}{2}Q_\lambda(X\beta^{(1)}) + \frac{1}{2}Q_\lambda(X\beta^{(2)}) = c^*\] applying convexity of the L1 and L2 norms, and so equality is achieved.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Coordinate Descent	Best used to optimize functions of the form  \[f(x) = g(x) + \sum_{i=1}^n h_j(x_j)\] Start with some initial guess \(x^{(0)}\). Optimize on one component at a time, updating as you go along each cycle.	MSM PartIIINotes
Basic	Part III Notes::MSM	When will coordinate descent converge to a minimizer?	When \(A_0 = \{x : f(x) \leq f(x^{(0)}) \}\) is a compact set (with \(x^{(0)}\) the starting guess)	MSM PartIIINotes
Basic	Part III Notes::MSM	What three properties must inner products have?	1. Conjugate Symmetry: \(\langle x, y \rangle = \overline{ \langle y, x \rangle }\)<br> 2. Linearity in the first argument: \(\langle ax + by, z\rangle = a\langle x, y\rangle + b\langle x,y \rangle\)<br> 3. Positive Definiteness: \(\langle x, x \rangle \geq 0,\, \text{equality iff } x = 0 \)<br> <br> for \(x, y \in \mathcal{H},\, a, b \in \mathbb{K}\)<br> Note that 1 and 2 together makes the inner product a sesquilinear form.	MSM PartIIINotes
Basic	Part III Notes::MSM	What does a matrix being positive definite tell us about its eigenvalues?	The eigenvalues are positive (and real).	MSM PartIIINotes
Basic	Part III Notes::MSM	What is the definition of a matrix being positive-definite?	\(A\) IS SYMMETRIC, and for any nonzero real column vector \(x\),<br> \(x^TAx &gt; 0\)<br> <br>Note this implies  \[\sum_{i,j} x_i A_{ij} x_j &gt; 0\]<br> <br>Semidefinite just means it can also be zero.	MSM PartIIINotes
Basic	Part III Notes::MSM	If some matrix \(A\) has full column rank, what does that tell us about \(A^TA\)?	\(A^TA\) is positive definite	MSM PartIIINotes
Basic	Part III Notes::MSM	What are the three kernel closure properties?	1. Closure over addition and nonnegative scalars: \(a_1k_1 + a_2k_2\)<br> 2. Closure over sequences: \(\lim_{k \to \infty} k_m\)<br> 3. Closure over products: \(k_1k_2\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State the Representer Theorem	Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of<br> \[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]<br> is given by<br> \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]<br> with \(\hat \alpha\) minimizing:<br> \[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	State the KKT conditions for the Lasso	\[\hat v = \frac{1}{\lambda n}X^T(Y - X \beta^L_\lambda)\]<br> where \(\hat v\) satisfies \(||\hat v||_\infty \leq 1\) and \(\hat v_{\hat S} = \mathrm{sgn}(\hat \beta^L_{\lambda,\,\hat S})\) where \(\hat S = \{ j:\, \beta^L_{\lambda,\, j} \neq 0 \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State Markov's Inequality	If \(X\) is a nonnegative random variable and \(a &gt; 0\), then we have:<br> \(\mathrm{P}(X \geq a) \leq \frac{\mathrm{E}(X)}{a}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State Chebyshev's Inequality	\(\mathbb{P}(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}\)<br> Note that this is useless when \(k \leq 1\), since all probabilities are \(\leq 1\)<br> <br> Equivalently we have<br> \(\mathbb{P}(|X-\mu| \geq a) \leq \frac{{\mathrm{Var}}(X)}{a^2}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State the definition of function convexity.	\(f((1-t)x + ty) \leq (1-t)f(x) + tf(y)\)<br> for all \(x, y \in C\)<br> given convex space \(C\), \(f: C \to \mathbb{R}^n\)	MSM PartIIINotes
Basic	Part III Notes::MSM	Write the Lagrangian function as used in the Lagrangian Method to minimize a function \(f(x)\) subject to \(g(x) = 0\)<br> (\(f: C \to \mathbb{R}\) (\(C \subseteq \mathbb{R}^d\) convex), \(g: \mathbb{R}^d \to \mathbb{R}^d\))	\(L(x, \theta) = f(x) + \theta^Tg(x)\)<br> Find minima in \(x, \theta\)<br> <br> We're looking for points where the tangents to the contours of \(f\) and \(g\) are parallel (or where \(f\) alone is minimized and \(\theta = 0\))	MSM PartIIINotes
Basic	Part III Notes::MSM	Given a convex function \(f: C \to \mathbb{R}\), when is a vector \(v \in \mathbb{R}^d\) a subgradient of \(f\) at \(x_0\)?	\(\forall x \in C,\, f(x) - f(x_0) \geq v^T(x-x_0)\)<br> <br> So in one dimension we have:<br> \(\forall x \in C,\, f(x)-f(x_0) \geq c(x-x_0)\)<br> given \(c \in \mathbb{R}^n\)<br> any \(c\) such that our function stays above the line drawn out from \(x\) with slope \(c\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What is a subdifferential of a function \(f\) at \(x\), and how is it denoted?	The set of all subgradients of \(f\) at \(x\).<br> It's denoted \(\partial f(x)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For the interior of a convex region, if \(f\) is differentiable, what are the subdifferentials of \(f\)?	Just the gradient of \(f\) at any point \(x\):<br> \(\partial f(x) = \{ \nabla f(x) \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What are the simplifying properties of subdifferentials?	1. Subdifferentials scale how you'd expect:<br> \(\partial(\alpha f)(x) = \{ \alpha v,\, v\in \partial f(x) \} \)<br> 2. When adding functions, we consider all permutations:<br> \(\partial(f_1 + f_2)(x) = \{ v_1 + v_2,\, v_1\in \partial f_1(x),\, v_2\in \partial f_2(x) \} \)<br> 3. Subdifferentials ignore constant adds and respect linearity:<br> \(h(x) = f(Ax + b) \implies \partial h(x) = \{ A{^{\mathrm{T}}} g : g \in \partial f(Ax+b) \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For a convex function \(f\), how must any minimizer relate to subdifferentials?	We have<br> \[x^* \in \underset{x \in \mathbb{R}^d}{\mathrm{argmin}}\, f(x) \iff 0 \in \partial f(x^*)\]<br> <br> (both are equivalent to saying \(f(y) \leq f(x^*),\, \forall y\))	MSM PartIIINotes
Basic	Part III Notes::MSM	What is the subdifferential of the L1 norm \(\partial ||x||_1\)?	\[\partial ||x||_1 = \{ v \in \mathbb{R}^d : ||v||_\infty \leq 1,\, v_A = \mathrm{sgn}(x_A)\}\]<br> <br> where \(A \in \{ j : x_j \neq 0 \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For \(x \in \mathbb{R}^d\), state an equivalent form of \(\partial ||x||_1\).	\[\partial x ||x || _1 = \{ v \in \mathbb{R}^d,\, ||v||_\infty \leq 1 \text{ and } v_A = \mathrm{sgn}(x_A)\}\]<br> where \(A = \{ j:\, x_j \neq 0\}\)<br> <br> In other words, \(v\) where the magnitude of all components of \(v\) do not exceed 1, \(v\) is positive in the components where \(x\) is positive and negative where the components of \(x\) is negative.	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about the fitted values of Lasso minimizers?	They are unique<br> (though the minimizing \(\beta^L_\lambda\) may not be unique)	MSM PartIIINotes
Basic	Part III Notes::MSM	What do the sets \(S\) and \(N\) denote in the context of Lasso regression?	\(S := \{ k: \, \beta_k^0 \neq 0 \}\) is the set of nonzero terms of the true minimizer \(\beta^0\)<br> \(N:= (1, 2, ..., p) \setminus S\) is the set of zero terms of the true minimizer \(\beta^0\)<br> <br> So for instance \(X_N\) is the matrix of ignored predictor columns.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Sum form of positive-semi-definiteness	A matrix \(X\) is positive semi-definite iff \[\sum_{i,j}a_iX_{ij} a_j \geq 0\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Kernel from a Feature Map	Given some feature map \(\phi:{\mathcal{X}} \to {\mathcal{H}}\), the following is a kernel: \[k(x,y) = \langle \phi(x), \phi(y) \rangle \]<br> <br>This is because \[\sum_{i,j} a_i k(x_i, x_j)a_j = \sum_{i,j} a_i \langle \phi(x_i), \phi(x_j) \rangle a_j = \langle \sum_i a_i \phi(x_i), \sum_i a_i \phi(x_i) \rangle \geq 0\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Set Convexity	A set \(C\) is convex if, for all \(x, y \in C\), \(t \in [0,1]\), \[tx + (1-t)y \in C\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Gamma Integral	\[\Gamma(z) = \int_{0}^\infty u^{z-1}e^{-u}du\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> How to compute \(\Gamma(n)\)	Know that  \[\Gamma(n) = (n - 1)!\] \[\Gamma(1/2) = \sqrt{\pi}\] \[\Gamma(z+1) = z\Gamma(z)\] for \(n \in {\mathbb{N}}\), \(z \in {\mathbb{R}}\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Laplace Transform of a Power	\[t^{-\alpha} = \frac{1}{\Gamma(z)} \int_0^\infty v^{z-1}e^{-tz}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Basic Inequality for the Lasso	\[\frac{1}{2n}||Y - X \hat \beta||_2^2 + \lambda || \hat \beta ||_1 \leq \frac{1}{2n}||Y - X \beta^0||_2^2 + \lambda || \beta^0 ||_1\] where \(\hat \beta\) is the true optimized solution of the Lasso loss equation.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Let \(\lambda &gt; 0\) and define \(\Delta = X_N^\top X_S (X_S^\top X_S)^{-1} \text{sgn}(\beta_S^0)\). If \(\|\Delta\|_\infty \leq 1\) and for \(k \in S\), \[ |\beta_k^0| &gt; \lambda |\text{sgn}(\beta_S^0)^\top [\{\tfrac{1}{n} X_S^\top X_S\}^{-1}]_k|, \] then there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\). As a partial converse, if there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\), then \(\|\Delta\|_\infty \leq 1\).	Proof sketch:<br> Plug into the KKT conditions for the Lasso. Use the following decomposition: \[ X{^{\mathrm{T}}} X = \left[\begin{array}{cc}   X_S{^{\mathrm{T}}} X_S & X_S{^{\mathrm{T}}} X_N \\\\   X_N{^{\mathrm{T}}} X_S & X_N{^{\mathrm{T}}} X_N \end{array}\right] \] Also note that if \(\mathrm{sgn}(\beta^0) = \mathrm{sgn}(\hat \beta)\), then \(\hat \beta_N = 0\).<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	Characteristic Function Bound of Bernstein's Inequality	Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{E}}(e^{\alpha (W_i - \mu)}) \leq \exp\left( \frac{\alpha^2 \sigma^2}{2(1 - b|\alpha|)} \right) \] for \(|\alpha| &lt; 1/b\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Probability bound of Bernstein's Inequality	Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{P}}\left(\frac{1}{n} \sum_{i=1}^n W_i - \mu \geq t\right) \leq \exp\left( -\frac{nt^2}{2(\sigma^2 + bt)} \right) \] for \(t &gt; 0\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Product of Subgaussian RVs	Suppose \(W, Z\) be mean zero and subgaussian with parameters \(\sigma_W\) and \(\sigma_Z\) respectively. Then \(WZ\) fulfills Bernstein's condition with parameter \((8 \sigma_W \sigma_Z, 4 \sigma_W \sigma_Z)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> If a distribution \(P\) on \({\mathbb{R}}^n\) has a density with respect to the product measure which is everywhere positive, then its conditional independence graph is global Markov	Proof sketch: Use backwards induction on the size of the separating set.<br> Base case: \(A\) has one element, \(B\) has one element, everything else is in \(S\) (case \(p - 2\)). Fulfills global Markov by independence graph definition.<br> Inductive step: Grow \(A\) and \(B\) to include all 'independent' points. Either \(A\) or \(B\) must have at least two points; WLOG it's \(A\). Call one of the points \(j\), and the rest of \(A\) \(A_-\). By the inductive hypothesis we have: \[Z_{A_-} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup \{ j\}}\] \[Z_{j} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup A_-}\] So, applying the intersection rule (and using our assumption), we have \[Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\] as required.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Markov Blanket	In a regression problem where \(X\) is a set of predictors and \(Y\) is a response, a Markov blanket \(X_S\) is a subset of \(X\) that fulfills \[Y {\perp\!\!\!\perp} X_{S^c} {\,|\,} X_S\] (so \(X_S\) contains all predictors relevant to predicting \(Y\)). <br> <br>There's guaranteed to be a unique minimal blanket (take the weak union of every individually dependent covariate)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Structural Causal Models (SCMs)	A tuple \({\mathcal{S}} = (P_j, h_j, Q_j)_{j \in [p]}\) where:<br> \(P_j\) is the set of 'dependencies' for node \(j\): in other words, \(\mathrm{Pa}(j)\).<br> \(h_j: {\mathbb{R}}^{|P_j|} \times {\mathbb{R}} \to {\mathbb{R}}\): A 'decision function' for node \(j\), which takes in every defined dependency and a source of randomness, then makes a decision as to node \(j\)'s value.<br> \(Q_j\): A probability distribution over \({\mathbb{R}}\), the source of randomness for each node.<br> Provided alongside it is the full random vector \(Z_j := h_j(Z_{P_j}, {\varepsilon}_j)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Interventions and Perfect Interventions to a Structural Causal Model (SCM)	Replace a \(h_j(Z_{P_j}, {\varepsilon}_j)\) with a new \(\tilde h_j(Z_{P_j}, {\varepsilon}_j)\), hoping to achieve some downstream result. If \(\tilde h_j\) is just some constant function, we say it is a perfect intervention.	MSM PartIIINotes
Basic	Part III Notes::MSM	How do we denote an intervention in an SCM?	\[...|\mathrm{do}(Z_j = \cdot)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Collider in a path in a DAG	Given some DAG with an undirected path \(j_1,..., j_m\) with length \(m \geq 3\), we say \(j_l\) is a collider (relative to the path) if we have \[j_{l-1} \to j_l \leftarrow j_{l + 1}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Blocked path in a DAG	A path \(j_1,..., j_m\) is blocked by a set of nodes \(S\) if there exists a node \(v\) on the path such that either:<br> \(v\) IS NOT a collider, and \(v \in S\), or<br> \(v\) IS a collider, and neither \(v\) nor any of its descendants are in \(S\).<br> <br>So in other words, a path is not blocked if non-colliders are not in \(S\) and colliders are either in \(S\), or have descendants in \(S\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> d-Seperation in a DAG<br>(and how it's denoted)	Given disjoint subsets \(A, B\) in a graph \(G\), we say \(A\) and \(B\) are d-separated by a subset \(S\) if every path between \(A\) and \(B\) is blocked by \(S\). We denote this as: \[A {\perp\!\!\!\perp}_G B {\,|\,} S\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Global Markov Directed (Causal) Graphs	A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to a DAG \(G\) on \(p\) vertices if, whenever \(Z \sim P\), and \(A, B, S\) are disjoint sets of vertices,  \[A {\perp\!\!\!\perp}_G B {\,|\,} S \implies Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\]<br> <br>Note that distributions generated from an SCM are always global Markov with respect the SCM DAG.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Let \(p \geq 2\) and \(Z \sim P\). If \(Z_j \not {\perp\!\!\!\perp} Z_k {\,|\,} Z_{-jk}\) for \(j, k \in [p]\), then any causal DAG generating \(P\) must have either \(j \to k\), \(k \to j\), or \(j \to l \leftarrow k\) for some \(l\).	Proof sketch:<br> By definition, \(j\) and \(k\) can not be d-seperated by \(S := [p] \setminus \{ j, k\}\). The only possible way to avoid being blocked by every other point is to either have a path directly between \(j\) and \(k\), or to have \(j \to l \leftarrow k\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> If nodes \(j\) and \(k\) in a DAG with \(p\) vertices are not adjacent and \(\pi\) is a topological order on \([p]\) with \(\pi(j) &lt; \pi (k)\), then they are d-separated by \(\mathrm{Pa(k)}\).	Proof sketch:<br> Consider a path \(j = j_1,...,j_m=k\). We can't have \(k \to j\) by the topological order assumption. We can't have \(j_{m-1} \to k\) because then \(j_{m-1} \in \mathrm{Pa}(k)\). We could try to have a V-shaped path  \[j = j_1 \to ... \to j_{l-1} \to j_l \leftarrow ... \leftarrow j_m = k\] (must exist due to topological ordering). But if we take the maximal \(j_l\), we can see that \(j_l\) can't have a descendant in \(\mathrm{Pa}(k)\) because we'd have a cycle in a DAG. \(j\) and \(k\) must be d-separated.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Fundamental result of Gaussian causal models	\[Z_A {\,|\,} Z_B = z_B \sim N_{|A|}(\mu_A + \Sigma_{A,B} \Sigma_{B,B}{^{-1}} (z_B - \mu_B), \Sigma_{A, A} - \Sigma_{A,B} \Sigma_{B,B}{^{-1}} \Sigma_{B,A})\]<br> <br>For nodewise regression, specialize this to \(A = \{k\}\) and \(B = A^c\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Graphical Lasso	Minimizes \[-\log \det (\Omega) + \mathrm{tr}(S\Omega) + \lambda \sum_{j,k}|\Omega_{jk}|\] where \(\Omega\) is the precision matrix \(\Sigma{^{-1}}\), and \(S := \frac{1}{n} \sum_{i=1}^n x_ix_i{^{\mathrm{T}}} - \hat X \hat X{^{\mathrm{T}}}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Weak Law of Large Numbers	If \(W_1, W_2, ...\) are iid real-valued random variables and \({\mathbb{E}}(W_i) = \mu &lt; \infty\), then as \(n \to \infty\), \[\frac{1}{n}\sum_{i=1}^n W_i \overset{p}{\to} \mu\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Mercer's Theorem decomposition for kernels	\[k(x,y) = \sum_{j \in J} \mu_j e_j(x) e_j(y)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Square Root Lasso	Minimizes \[\frac{1}{\sqrt{n}}||Y-\bar Y1 - X\beta||_2 + \gamma||\beta||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Group Lasso Penalty	\[\lambda \sum_{j=1}^q m_j ||\beta_{G_j}||_2\] where we have group partitions \(G_1...G_j\) and we usually define \(m_k\)	MSM PartIIINotes