#separator:tab
#html:true
#notetype column:1
#deck column:2
#tags column:5
Basic	Part III Notes::MSM	<b>Definition:</b> Bernstein's Condition	A random variable X fulfills the Bernstein condition with parameters \(\sigma, b \geq 0\) if: <br>\[ {\mathbb{E}}(|X - {\mathbb{E}} X|^k) \leq \frac{1}{2} k! \sigma^2 b^{k-2} \] <br>for \(k = 2, 3, ...\)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Reproducing Kernel Hilbert Space (RKHS)	A Hilbert space of functions \(f: X \to \mathbb{R}\) such that, for all \(x \in X\),\[f(x) = \left&lt; f, K_x \right&gt;\] <br> Note: this condition is equivalent to saying that the point-evaluation functional  \[L_x = H \to \mathbb{R},\, L_x(f) = f(x)\] is continuous for every \(x \in X\). <br>So in other words \(L_x \in H^*\) <br>This is equivalent to the condition above by the Riesz Representation Theorem	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Reproducing kernel (of an RKHS)	By definition, fixing some \(x \in X\), there exists some \(K_x\) such that for any \(f \in H,\, f(x) = \left&lt; f, K_x \right&gt;\).  <br>If we plug in some \(y\) to \(K_x\), we then get \(K_x(y) = \left&lt;K_x, K_y \right&gt;\) <br>So our reproducing kernel is the function where we take some \(x, y\) pair and perform that process: \[K(x,y) = \langle K_x, K_y \rangle\] \(K: X \times X \to \mathbb{R}\)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Topological Ordering	A permutation \(\pi\) of some set of nodes \([n], n\in {\mathbb{N}}\), is a topological ordering if:<br> \[\forall a,b \in [n],\, b \in \mathrm{de}(a) \implies \pi(a) &lt; \pi(b)\] (Higher nodes in the DAG are sorted to the front)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Multiple Testing	When we have a sequence of null hypotheses \(H_1...H_m\) where we have found associated p-values \(p_1...p_m\).<br> <br>We let \(I_0\) represent the indices of the true nulls where no relationship exists, with \(m_0 = |I_0|\).<br> <br>It's not as simple as just rejecting the small-enough p-values due to the jellybeans effect!	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	What does \(I_0\) represent in multiple testing?	A set of hypothesis indices representing true nulls that should not be rejected.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Family-Wise Error Rate (FWER)	\[{\mathbb{P}} ( N \geq 1)\] where \(N\) is the number of false rejections.<br> <br>For instance the FWER of the naiive approach to multiple testing is \[ 1 - (1 - \alpha)^m \]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Bonferroni Correction	Reject \(H_i\) if \(p_i \leq \frac{\alpha}{m}\) for desired \(\alpha\) level.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Intersection Hypothesis	When we want to test some subset of our multiple hypotheses \(I \subseteq [m]\) are all true: \(H_I\).<br> <br>We can reject it by showing that any of the hypotheses can be rejected!	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Local Test<br>(and its critical requirement)	Given some intersection hypothesis \(H_I\), a test \(\phi_I\) that is 1 if we reject \(H_I\) or zero otherwise. We require that the error rate of \(\phi_I\) be \(\leq \alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Closed Testing Procedure	Suppose you have a local test \(\phi_I\) for testing intersection hypotheses. Reject \(H_I\) iff all superset local tests \(\phi_J,\, I \subseteq J\) are rejected (test everything above \(I\)).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holm's Procedure	Closed testing with the Bonferroni test as the local procedure: \[ \phi_I(x) = \begin{cases}            1 & \underset{i \in I}{\min} (p_i) \leq \frac{\alpha}{|I|}  \\\\           0 & \text{otherwise}        \end{cases} \]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holm's Procedure as a Stepdown Procedure	Let \(p_{(i)}\) represent the \(i\)th smallest p-value for respective hypothesis \(H_{(i)}\).<br> <br>Step 1: if \(p_{(1)} \leq \frac{\alpha}{m}\), reject \(H_{(1)}\) and continue; otherwise accept \(H_{(1)}...H_{(m)}\) <br>... <br>Step i: if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), reject \(H_{(i)}\) and continue; otherwise accept \(H_{(i)}...H_{(m)}\) <br>... <br>Step m: if \(p_{(m)} \leq \alpha\), reject \(H_{(m)}\) and finish; otherwise accept only \(H_{(m)}\)<br> <br>So, to recap, iterate over p-values starting at the smallest, rejecting and continuing if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\) or ceasing otherwise.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	How good is Holm's procedure?	Uniformly more powerful than Bonferroni correction	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> False Discovery Rate (FDR)	\[ {\mathbb{E}} \left( \frac{N}{R \lor 1} \right) \] where \(N\) is the number of false rejections, \(R\) is the total number of rejections, and \(\lor\) is the max function (so we never divide by zero).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Benjamin-Hochberg Procedure	Order hypotheses by their p-values, from smallest to largest. Find the largest index underneath the slope \(p_{(i)} \leq \frac{\alpha i}{m}\): that index and everything below it is rejected, but everything above \(i\) is accepted. The FDR is guaranteed to be kept at \(\frac{\alpha m_0}{m} \leq \alpha\)<br> <br>Visual idea: find the largest index point under the line with slope \(\frac{\alpha}{m}\), that is your cutoff point.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> The reproducing kernel of an RKHS is well-defined.	Given any two reproducing kernels \(k_x, h_x \in X \times X \to {\mathbb{R}}\), show that \[||k_x - h_x||^2_H = 0\] by representing as an inner product and simplifying: <br>\[\begin{align*} ||k_x - h_x||^2_H &= \left&lt; k_x - h_x, k_x - h_x \right&gt;\\\\ &= \left&lt; k_x, k_x - h_x \right&gt; - \left&lt; h_x, k_x - h_x \right&gt;\\\\ &= (k_x - h_x)(x) - (k_x - h_x)(x)\\\\ & \text{by the reproducing property}\\\\ &= 0 \end{align*}\]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Every DAG has a topological ordering.	First, prove by contradiction that there must be a node with no parents, called the source node (or else we must have a cycle, breaking our DAG assumption). <br>Next, build the graph up inductively from a graph with one node.  <br>Base case: Clearly any graph with one node is a topological order. <br>Inductive step: We assume that ANY graph with \(n-1\) nodes, \(G_{n-1}\), has a topological order. Thus, given a graph \(G_n\) nodes, apply the lemma we proved above to get the parent node, and remove it to get a new graph \(G_{n-1}\). By the inductive step we have a topological order \(\pi_{n-1}\). Construct \(\pi_{n}\) by shifting every node in \(\pi_{n-1}\) down one, and putting the parent node at the top at 1. We're done.	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Bonferroni Correction yields a FWER \(\leq \alpha\)	Just apply Markov's inequality and chug: <br>\[\begin{align*} {\mathbb{P}} (N \geq 1) &\leq {\mathbb{E}}(N)\\\\ & \text{ by the Markov inequality}\\\\ &\leq {\mathbb{E}}\left[ \sum_{i \in I_0} {\mathbb{1}}_{\{ p_i \leq \alpha / m \}}\right]\\\\ &\leq \sum_{i \in I_0} {\mathbb{P}}\left(p_i \leq \frac{\alpha}{m}\right)\\\\ &\leq \frac{\alpha m_0}{m}\\\\ &\leq \alpha \end{align*}\]	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> When all hypotheses are tested using closed testing, a false rejection will be made with at most probability \(\alpha\)	False rejections can only be made if our intersection hypothesis is entirely composed of indices in \(I_0\). But then including upwards, we are guaranteed to check \(\phi_{I_0}\), which only happens with probability at most \(\alpha\).<br> <br>It follows that the FWER for any particular \(H_I\) has at most level \(\alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Benjamin-Hochberg keeps FDR \(\leq \alpha\)	We must assume that all \(p_i,\, i \in I_0\) are independent from all the other p-values.<br> <br>Begin by breaking down R then N into their possible cases, and express as a disjoint probability in two terms.<br> <br>Simplifying the event in the probability term: break down the condition \(R = r\) into what it means for \(r\) to be the max value. Define a new set \(p_{-i}\) where we've removed the \(i\)th p-value (shifting the indices to the left), so that they are independent of \(p_i\). Remark these two terms are equivalent themselves to a reduced Benjamin-Hochberg trial with a modified cutoff line \(\frac{\alpha}{m}\).<br> <br>Plugging back in the rewritten event, seperate the probability block and simplify to get \(\leq \frac{\alpha m_0}{m} \leq \alpha\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Conditional Independence<br>(Two equivalent definitions for \(X {\perp\!\!\!\perp} Y {\,|\,} Z\), for random vectors \(X, Y, Z\).)	1. \({\mathbb{P}}(X \in A,\, Y \in B {\,|\,} Z) = {\mathbb{P}}(X \in A{\,|\,} Z){\mathbb{P}}(Y \in B{\,|\,} Z)\)<br> <br>2. \({\mathbb{P}}(X \in A {\,|\,} Y, Z) = {\mathbb{P}}(X \in A {\,|\,} Z)\)<br> (knowing \(Y\) doesn't give us any new information about \(X\))<br> <br>note that here \(X \in A\) is best read as '\(X\) lands in \(A\)'	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Weak Union rule for Conditional Independence	\[ X {\perp\!\!\!\perp} Y, Z \implies  \begin{cases}    X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\   X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \] We can move elements from the right-hand side to the conditioning side. Also works with some other conditioning \(W\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Contraction rule for Conditional Independence	\[ \begin{rcases*} X {\perp\!\!\!\perp} Z\\\\ X {\perp\!\!\!\perp} Y {\,|\,} Z \end{rcases*} \implies X {\perp\!\!\!\perp} Y, Z \] We can apply known independencies to the conditioning side. Also works with some other conditioning \(W\).	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Intersection rule for Conditional Independence	\[ \begin{cases}    X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\   X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \implies X {\perp\!\!\!\perp} Y, Z \] Converse of the weak union rule. ONLY ALLOWED when \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	When are we allowed to use the intersection rule for conditional independence?	When \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Conditional Independence Graph (CIG)	Given a distribution \(P\) on \({\mathbb{R}}^p\), an undirected graph \(G = ([p], E)\) where, if \(Z \sim P\), \[ (j,k), (k,j) \in E \iff Z_j \not\mathclap{{\perp\!\!\!\perp}}\mkern12mu Z_k {\,|\,} Z_{-jk} \] where \(Z_{-jk}\) is a subvector of \(Z\) without indices \(j\) and \(k\).<br> <br>Thus components are connected if there exists some dependence between them (even given all other information)	Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Censored Random Variable	\[X := \min{(T, C)}\] \[V := \begin{cases}    1 & T \leq C\\\\   0 & T &gt; C \end{cases} \] Where \(T\) is the time to event and \(C\) is the time to censoring. \(V\) is called the visibility.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Survivor Function<br>(Also its two properties)	\[F(t) = {\mathbb{P}}(T &gt; t) = \int_t^\infty f(t)\,dt\] <br>1. Decreasing in \(t\) <br> 2. When continuous, \(F(0) = 1\)<br> <br>NOTE: Our professor denotes the survivor function as \(F(t)\), but the usual convention is to write \(S(t)\).	MedicalStats PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Law of Total Expectation<br>(AKA the tower property)	\[{\mathbb{E}}({\mathbb{E}}(X|Y)) = {\mathbb{E}}(X)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Chain Rule of Probability	\[\begin{align*} {\mathbb{P}}(X,Y) &= {\mathbb{P}}(Y){\mathbb{P}}(X|Y)\\\\ &= {\mathbb{P}}(X){\mathbb{P}}(Y|X) \end{align*}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Slutsky's Theorem	Let \(X_n, Y_n\) be sequences of random elements. If both<br> \(X_n \overset{d}{\to} X\)<br> \(Y_n \overset{p}{\to} c\)<br> for some constant \(c\), then<br> <br>1. \(X_n + Y_n \overset{d}{\to} X + c\)<br> 2. \(X_nY_n \overset{d}{\to} Xc\)<br> 3. \(X_n / Y_n \overset{d}{\to} X /c\) given \(c \neq 0\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Undirected Graph Separation<br>(Given an undirected graph \(G = ([p], E)\) and disjoint sets \(A, B, S \subset [p]\), S separates \(A\) and \(B\) if...)	Every path from \(A\) to \(B\) contains a node in \(S\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Global Markov Undirected Graphs<br>(A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to an undirected graph \(G = ([p], E)\) when...)	Taking any \(A, B, S \subset [p]\) with \(S\) separating \(A\) and \(B\), we have that for any \(Z \sim P\), \[ Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Objective of Conditional Independence Testing	Given \(n\) iid copies \((x,y,z),\,...,\,(x_n,y_n,z_n) \in {\mathbb{R}} \times {\mathbb{R}} \times {\mathbb{R}}^p\) collected into \((X, Y, Z) \in {\mathbb{R}}^n \times {\mathbb{R}}^n \times {\mathbb{R}}^{n \times p}\), we wish to test the null hypothesis \(X {\perp\!\!\!\perp} Y {\,|\,} Z\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Intuitively, how can we think about \({\mathbb{E}}(X {\,|\,} Z)\), and how could we solve for such a thing?	\({\mathbb{E}}(X {\,|\,} Z)\) represents our 'best guess' for \(X\) given some information \(Z\).<br> <br>We can solve for \({\mathbb{E}}(X {\,|\,} Z)\) by finding an \(f\) that minimizes \[{\mathbb{E}}\left[(X - f(Z))^2\right]\] (this is called the L2 projection theorem, and it links together statistical regression and the conditional expectation)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Generalized Covariance Measure (GCM)<br>(Assuming we have data \((X_i, Y_i, Z_i)\) and fitted regression functions \(\hat f(z)\) and \(\hat g(z)\)...)	The test statisitc  \[T := \sqrt{n} \frac{\tau_N}{\tau_D}\] where \[\begin{align*} \tau_N &:= \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\xi_i\\\\ \tau_D^2 &:= \frac{1}{n} \sum_{i=1}^n   \hat\epsilon^2_i\hat\xi^2_i - \tau^2_N \end{align*}\] where we've defined \(\hat \epsilon_i := X_i -\hat f(Z_i)\) and \(\hat \xi := Y_i - \hat g(Z_i)\)<br> <br>Explanation:<br> Our null hypothesis is that \[{\mathbb{E}}\left[ (X - {\mathbb{E}}(X{\,|\,} Z)) (Y- {\mathbb{E}}(Y{\,|\,} Z))\right]=0,\] so we just need to quantify deviance from that. \(\hat f\) and \(\hat g\) approximately represent \({\mathbb{E}}(X{\,|\,} Z)\) and \({\mathbb{E}}(Y{\,|\,} Z)\) respectively, and so we define our average deviance among samples as \(\tau_N\).<br> The CLT tells us \(\sqrt{n}\tau_N\) should be Gaussian (mean-zero under the null, iid per \(i\)), but we need to normalize for variance if we want a standard normal distribution. So we define \(\tau_N\) to do this, using \({\mathrm{Var}}(X) = {\mathbb{E}}(X^2) - {\mathbb{E}}(X)^2\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> The Generalized Covariance Measure is asymptotically normal	Assuming there exists \(c &gt; 0\) such that \({\mathrm{Var}}(\epsilon {\,|\,} Z), {\mathrm{Var}}(\xi {\,|\,} Z) &lt; c\) almost surely,<br> \({\mathrm{Var}}(\epsilon\xi) &gt; 0\),<br> and  \[{\mathcal{E}}_f \to 0,\, {\mathcal{E}}_g \to 0,\, {\mathcal{E}}_f {\mathcal{E}}_g = o(n^{-1}),\] given \({\mathcal{E}}_f := {\mathbb{E}}\left[ \frac{1}{n} \sum_{i=1}^n (f(Z_i) - \hat f(Z_i))^2 \right]\) and likewise for \({\mathcal{E}}_f\), then \[T \sim N(0, 1)\]	MSM PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Kaplan-Meier Estimate, Form 2	\[ \hat F(t) = \prod_{j:a_j \leq t} \left(1 - \frac{d_j}{r_j}\right)\] <br>with \(d_j  = |\{ i: x_i = a_i \land v_i = 1 \}|\) (non-censured death times at event time \(a_i\))<br> and \(r_j = |\{ i : x_i \geq a_i \}|\) (all events at and after \(t = a_i\); number at risk)<br> <br> Note this is based on the erroneous assumptions that \(a_i...a_m\) are fixed constants and that \(T\) is discrete only with events at \(a_j\). Though it ends up being computationally equivalent to method 1!<br> \(\frac{d_j}{r_j}\) represents the probability of any event occurring at \(a_i\) (of the number of people at risk, how many were known to have perished in the time interval?)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> The hypothesis tested in the Log Rank Test	Null hypothesis: \(F_j^{(0)} = F_j^{(1)}\) for all \(j \in [m]\)<br> Alternate hypothesis: \(F_j^{(0)} \not= F_j^{(1)}\) for any \(j \in [m]\)<br> <br> where \(F_j^{(k)}\) represents the survivor function over \((a_{j}, a_{j+1}]\) for the \(k\)th group in \(k \in \{0, 1\}\). (assume the \(a_j\) are all the event times of the two groups put together)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> The Log-Rank Test	\[\frac{Z}{S} \underset{approx}{\sim} {\mathrm{N}}(0, 1)\] <br>with <br>\[Z = \sum_j Z_j = \sum_j d_j^{(0)} - d_j\frac{r_j^{(0)}}{r_j}\] Z is the number of observed (uncensored) events minus the number of expected events were \(H_0\) true.<br>  <br> NONEXAMINABLE: \[S^2 = \sum_j \frac{d_j (r_j - d_j)r_j^{(0)}r_j^{(1)}}{r_j^{2}(r_j-1)}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	What's a good rule of thumb for when the Log-Rank test will be powerful?	The curves are well-seperated and don't cross	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Relative Risk	\[\begin{align*} RR &= \frac{\text{sum of observed / sum of expected (in group 0)}}{\text{sum of observed / sum of expected (in group 1)}}\\\\ &= \frac{\sum_j d_j^{(0)} / \sum_j d_j \frac{r_j^{(0)}}{r_j} }{\sum_j d_j^{(1)} / \sum_j d_j \frac{r_j^{(1)}}{r_j}} \end{align*}\] <br>Events are \(RR\times\) more likely in group 0 than group 1.	MedicalStats PartIIINotes
Basic	Part III Notes::LogicAndComputability	What do we mean by \(\Gamma \vdash \varphi\)?	In the context \(\Gamma\), we have a proof of \(\rho\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What do we mean by \(\Gamma_1 A \vdash B\)?	Adding A to the context \(\Gamma\), we have a proof of B.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\land\) - I)	\[ \frac{\Gamma \vdash A \quad \Gamma\vdash B}{\Gamma \vdash A \land B } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\lor\) - I)	\[ \frac{\Gamma \vdash A}{\Gamma \vdash A \lor B } \text{  and  } \frac{\Gamma \vdash B}{\Gamma \vdash A \lor B} \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\land\) - E)	\[ \frac{\Gamma \vdash A \land B}{\Gamma \vdash A} \text{  and  } \frac{\Gamma \vdash A \land B}{\Gamma \vdash B} \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\lor\) - E)	\[ \frac{ \Gamma \vdash A \lor B \quad \Gamma_1 A \vdash C \quad \Gamma_1 B \vdash C }{ \Gamma \vdash C } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\bot\) - E)	For all \(A\), \[ \frac{ \Gamma \vdash \bot }{ \Gamma \vdash A } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\rightarrow\) - I)	\[ \frac{ \Gamma_1 A \vdash B }{ \Gamma \vdash A \rightarrow B } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\rightarrow\) - E)	\[ \frac{ \Gamma \vdash A \rightarrow B \quad \Gamma \vdash A }{ \Gamma \vdash B } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(?x)	For all \(A\), \[ \frac{ \cdot }{ \Gamma_1 A \vdash A } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(LEM)	For all \(A\), \[ \frac{ \cdot }{ \Gamma \vdash A \lor (A \rightarrow \bot) } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> IQC	First-order intuitionistic logic. Includes the axioms for quantifiers (that's what the Q stands for)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What do we mean by \(\varphi [x := t]\) for some term t?	Whatever \(\varphi\) is, with all instances of \(x\) replaced with \(t\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\exists\)-I)	For all terms \(t\), \[ \frac{ \Gamma \vdash \varphi[x := t] }{ \Gamma \vdash \exists x. \varphi(x) } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\exists\)-E)	\[ \frac{ \Gamma \vdash \exists x.\varphi \quad \Gamma_1\varphi \vdash \psi }{ \Gamma \vdash \psi } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\forall\)-I)	For all \(x\) not free in \(\Gamma\), \[ \frac{ \Gamma \vdash \varphi }{ \Gamma \vdash \forall x.\varphi } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	(\(\forall\)-E)	For all terms \(t\), \[ \frac{ \Gamma \vdash \forall x.\varphi }{ \Gamma \vdash \varphi [x := t] } \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What does \({\mathcal{U}}\) and \(V\) denote in the simply-typed \(\lambda\)-calculus?	\({\mathcal{U}}\) is a countable set of primitive types. <br> <br>\(V\) is an infinite set of variables.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What does \(\Pi\) represent in the simply-typed \(\lambda\)-calculus? Define its grammar	\(\Pi\) is the set of types in \(\lambda\)-calculus, denoted \(\lambda_\Pi (\rightarrow)\): \[\begin{align*} \Pi :=&\,\,\, {\mathcal{U}}\\\\ &|\, \Pi \to \Pi \end{align*}\] <br>(also sometimes called the implicational fragment of \(\lambda_\Pi\))	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What does \(\lambda_\Pi\) represent in the simply-typed \(\lambda\)-calculus? Define its grammar and label it	\(\lambda_\Pi\) is the set of terms in simply-typed \(\lambda\)-calculus: \[\begin{align*} \Pi :=&\,\,\, V\\\\ &|\, \lambda V : \Pi . \lambda_\Pi & (\lambda-\text{abstraction})\\\\ &|\, \lambda_\Pi \lambda_\Pi & (\lambda-\text{application}) \end{align*}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Cumulant-Generating Function	Natural logarithm of the moment-generating function: <br>\[K(t) = \log {\mathbb{E}} [e^{tX}]\]<br> <br>Generates cumulants: \(\kappa_n\). Note that:<br> \(\kappa_1\) is the mean<br> \(\kappa_2\) is the variance<br> \(\kappa_3\) is the skewness<br> But then it gets weirder.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Normal Linear Model	\[Y_i = x_i{^{\mathrm{T}}} \beta + \varepsilon_i\] <br>with \(\varepsilon_i \sim {\mathrm{N}}(0, \sigma^2)\) independent<br> <br>Alternatively written \(Y = X\beta + \varepsilon\)<br> <br>Note our assumptions: \(Y_i\) are independent and normally distributed; linear in \(X\).<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	MLE for the slope of a normal linear model	\[\hat\beta = (X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}} Y\] <br>(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The two MLEs for the variance \(\sigma^2\) of a normal linear model	\[\tilde\sigma^2 = \frac{1}{n} ||Y - X\hat \beta||^2_2\] or if we'd rather have an unbiased \(\sigma^2\), \[\hat\sigma = \frac{1}{n - p} ||Y - X\hat \beta||^2_2\]<br> <br>(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Exponential Dispersion Family	\[ \{ {\mathbb{P}}_{\theta,\phi}: \theta \in \Theta, \phi \in \Phi \} \] <br>\(\Theta \subset {\mathbb{R}}\), \(\Phi \subset (0, \infty)\), fulfilling that, for each \({\mathbb{P}}_{\theta, \phi}\), we have (\(\forall y \in {\mathbb{R}}\)): <br>\[ f(y; \theta, \phi) = f_0(y,\phi) \exp \left\{  \frac{y\theta - K(\theta)}{\phi} \right\} \] <br>where<br> \(K: \Theta \to {\mathbb{R}}\) is called the cumulant function (it's just here to normalize), and<br> \(\{ f_0(\cdot, \phi) : \phi \in \Phi \}\) is some family of density functions (that gets 'tilted' by the exponential portion).<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the equation to solve for the cumulant function \(K(\theta)\) for exponential diffusion families?	\[K(\theta) = \phi \log \int_{-\infty}^\infty e^{y\theta/\phi}f_0(y,\phi)\,dy\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the cumulant-generating function for an exponential diffusion family?	\[K(t) = \log {\mathbb{E}} [e^{ty}] = \frac{K(\phi t + \theta) - K(\theta)}{\phi}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the mean and variance of the exponential distribution family in terms of the cumulant function \(K(\theta)\)? What conclusions can we draw about \(\theta\)?	\[ {\mathbb{E}}_{\theta, \phi}[y] = K'(\theta) \] \[ {\mathrm{Var}}_{\theta, \phi}[y] = \phi K''(\theta) \]<br> <br>(this can be computed simply by taking the first or second derivative of the cumulant-generating function and plugging in \(t=0\))<br> <br>If we assume our variance is nonzero, and since \(\phi\) must be positive, it must be the case then that \(K''(\theta) &gt; 0\). Thus \(K'(\theta)\) must be strictly increasing and therefore invertible. So, considering the mean equation above, it makes sense to reparameterize everything in terms of \(\mu := K'(\theta)\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What properties do we care about for an exponential dispersion family?	We need a natural parameter \(\theta\),<br> a dispersion parameter \(\phi\),<br> a cumulant function \(K(\theta)\) which gives us \(\mu =  K'(\theta)\),<br> a mean parameter space \(\mathcal{M} = { \mu(\theta : \theta \in \Theta)}\),<br> a mean function \(\theta \to \mu(\theta) = (K')^{-1}(\mu)\),<br> and a variance function \(V(\mu) = K'' \circ \theta(\mu)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Give the properties of a normal distribution \(\mathrm{N}(\mu, \sigma^2)\) (\(\mu \in \mathbb{R}, \sigma &gt; 0\)) interpreted as an exponential dispersion family	\[\theta = \mu\] \[\phi = \sigma^2\] \[K(\theta) = \theta^2/2\] \[\mathcal{M} = \mathbb{R}\] \[\mu(\theta) = \theta\] \[V(\mu) = 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Give the properties of a binomial distribution \(\frac{1}{m}\mathrm{Bin}(m, p)\) (\(m \in \mathbb{N}, p \in (0, 1)\)) interpreted as an exponential dispersion family	\[\theta = \mathrm{logit}(p) = \log(p / (1-p))\] \[\phi = 1/m\] \[K(\theta) = \log(1 - p) - \log(2) = \log(e^\theta + 1) - \log(2)\] \[\mathcal{M} = (0,1)\] \[\mu(\theta) = \mathrm{expit}(\theta)\] \[V(\mu) = \mu(1-\mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Linear Model	Independent observations \((x_i, Y_i)\) follow a generalized linear model if, for some exponential dispersion family \(\{{\mathbb{P}}_{\mu, \phi}: \mu \in {\mathcal{M}}, \phi \in \Phi\}\):<br> <br>1. \(Y_i \sim {\mathbb{P}}_{\mu_i, \phi_i}\) with \(\mu_i \in {\mathcal{M}}\) and \(\phi_i = \phi a\) with known \(a_1...a_n &gt; 0\) and possibly unknown \(\phi&gt;0\),<br> <br>2. \(g(\mu_i) = x_i{^{\mathrm{T}}} \beta\) for some unknown \(\beta \in {\mathbb{R}}^p\) with a strictly monotonic twice differentiable known 'link function' \(g: {\mathcal{M}} \to {\mathbb{R}}\).<br> (linearity after applying the link function)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Residual Squared Error	\[RSE = \sqrt{ \frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n - p} }\] where \(p\) is the number of parameters in our regression (denominator is the number of degrees of freedom).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(R^2\) (Coefficient of determination)	\[R^2 = 1 - \frac{RSS}{TSS} = 1- \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}\] <br>\(RSS\) is the residual sum of squares, and \(TSS\) is the total sum of squares. <br>Proportion of the variation predictable from our model. We want this to be large!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the Logit function?	\[{\mathrm{logit}}(x) = \log \left( \frac{x}{1-x}\right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the Expit function?	\[{\mathrm{expit}}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log Likelihood of a Generalized Linear Model<br>(given \(Y = (Y_1,...,Y_n){^{\mathrm{T}}}\), \(\beta\).)	First define \(\theta_i = (K'){^{-1}} (g{^{-1}} (x_i {^{\mathrm{T}}} \beta))\). We have \[\begin{align*} {\mathcal{L}}(\beta, \phi) &= \sum_{i=1}^n \log f(Y_i; \theta_i, \phi_i)\\\\ &= {\sum_{i=1}^n} \log f_0(Y_i, \phi a_i) + {\sum_{i=1}^n} \frac{1}{\phi a_i}\{Y_i \theta_i - K(\theta_i)\} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> If an MLE exists for the Generalized Linear Model with the canonical link, it must be unique.	If an MLE \[\hat \beta = {\underset{\beta \in {\mathbb{R}}^p}{\mathrm{argmax}}\,\,} {\mathcal{L}} (\beta, \phi)\] exists, then it must satisfy the score equation:  \[\frac{\partial{\mathcal{L}}(\hat \beta, \phi)}{\partial \beta} = 0.\] Note that by our assumptions, we have \(\theta_i = x_i{^{\mathrm{T}}} \beta\), yielding: \[ \frac{\partial^2{\mathcal{L}} (\beta, \phi)}{\partial\beta\partial\beta{^{\mathrm{T}}}} = - {\sum_{i=1}^n} \frac{x_i x_i{^{\mathrm{T}}}}{\phi a_i} K''(x_i{^{\mathrm{T}}} \beta)\] Since \(K'' &gt; 0\) and \(X\) is assumed to have full rank, the Hessian must be negative definite, and so if it exists, the MLE must be unique.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Pearson Chi-Squared Statistic for \(\phi\)	\[\hat \phi = \frac{\chi^2}{n-p}\] \[\chi^2 = {\sum_{i=1}^n} \frac{(Y_i - K'(x_i{^{\mathrm{T}}} \hat \beta))^2}{a_i V(K'(x_i{^{\mathrm{T}}} \hat \beta))}\] <br> This comes from \({\mathrm{Var}}(Y_i) = a_i \phi V(\mu_i) \implies \phi\) plugged into the weighted OLS unbiased variance MLE \[\hat \sigma^2 = \frac{\sum_iw_i(Y_i - \hat\mu)^2}{n-p}\]<br> <br> This is a consistent estimator for \(\phi\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain how to derive Newton-Raphson	Suppose we want to solve \(f(x) = 0\). If we start at some point \(x\), we can simply step downhill a bunch of times to get \(x'\) with smaller and smaller \(f(x')\) via a Taylor approximation: <br>\[0 = f(x') \approx f(x) + f'(x)(x' - x)\] <br>Rearranging yields <br>\[\begin{align*} 0 &= f(x) + f'(x)(x' - x)\\\\ -f'(x)(x' - x) &= f(x)\\\\ x' &= x - \frac{f(x)}{f'(x)} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain briefly what Fisher scoring is, in the context of solving a general linear model numerically	Similar to Newton-Raphson, but replaces the Jacobian \(J(\beta)\) with the relevant Fisher information: \[I(\beta) = {\mathbb{E}}(J(\beta))\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Column Space	Interpreting some \(m \times n\) matrix \(A\) as \(n\) column vectors, what do they span? \[c_1v_1 + ...+c_nv_n\] In other words, all possible products \(Ax\) for any \(x \in {\mathbb{R}}^n\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the orthogonal complement of the column space?	The orthogonal complement of the column space is the left null space: \[v \in {\mathrm{col}}(X)^\perp \iff v{^{\mathrm{T}}} X = 0{^{\mathrm{T}}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> PDF of the Multivariate Normal Distribution	\[(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp \left( -\frac{1}{2} (x - \mu){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu) \right)\] where \(\Sigma\) is the covariance matrix between the components.<br> <br> Note that when the components are independent, we get just a diagonal \(\Sigma\) with \(\Sigma_{ii} = \sigma_i^2\), and this simplifies to just a product of univariate PDFs.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}Xv\)	\[X\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} X\)	\[X{^{\mathrm{T}}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv\)	\[\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv = (X + X{^{\mathrm{T}}})v\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> {\mathrm{Var}}{(Xv)} = {\mathrm{Cov}}{(Xv)}	\[{\mathrm{Var}}(Xv) = X \Sigma X{^{\mathrm{T}}}\] where \(\Sigma\) is the covariance matrix of the random vector \(v\).	SLIP PartIIINotes
Basic	Part III Notes::Astrostatistics	MGF of the Univariate Normal RV	\[\phi_X(t) = e^{i\mu t - \frac{1}{2} \sigma^2t^2}\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Easiest way to find the distribution of the sum of two random variables	Use the property that the MGF of a sum of RVs is the product of their MGFs: \[\phi_{X+Y}(t) = \phi_X(t) * \phi_Y(t)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Variance of the sample mean	\[{\mathrm{Var}}(\hat \mu) = \frac{\sigma^2}{n}\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Fisher Information Matrix	\[\bigl[\mathcal{I}(\theta)\bigr]_{i, j} =   \operatorname{E}\left[\left.     \left(\frac{\partial}{\partial\theta_i} \log f(X;\theta)\right)     \left(\frac{\partial}{\partial\theta_j} \log f(X;\theta)\right)   \,\, \right| \,\,\theta\right]\] Or, equivalently, given some regularity conditions, \[ \bigl[\mathcal{I}(\theta) \bigr]_{i, j} =   -\operatorname{E}\left[\left.     \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(X;\theta)    \,\, \right| \,\, \theta\right]\,\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Univariate Unbiased Cramer-Rao Lower Bound	Given an unbiased estimator \(T\) for \(\theta\), we have: \[{\mathrm{Var}}(\hat \theta) \geq \frac{1}{I(\theta)}\] (this is achieved by unbiased MLEs)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Multivariate Cramer-Rao Lower Bound	For any unbiased estimator \(\vec T\) of \(\vec \theta\), \[ {\mathrm{Cov}} (\vec T) - I(\theta){^{-1}} \] must be positive-semi-definite. In particular, this means that \[{\mathrm{Var}}(T_i) \geq [I(\vec\theta){^{-1}}]_{ii}\] for any index \(i\).	Astrostatistics PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Covariance Properties	1. \(\forall a, \, {\mathrm{Cov}}(X, a) =0\)<br> 2. \(\forall a, b, \,{\mathrm{Cov}}(X + a, Y + b) = {\mathrm{Cov}}(X, Y)\)<br> 3. \(\forall a, b, \,{\mathrm{Cov}}(aX, bY) = ab{\mathrm{Cov}}(X, Y)\) 4. \({\mathrm{Cov}}(X +Y, Z + W) = {\mathrm{Cov}}(X, Z) + {\mathrm{Cov}}(X,W) + {\mathrm{Cov}} (Y, Z) + {\mathrm{Cov}}(Y, W)\)	SLIP PartIIINotes
Basic	Part III Notes::LogicAndComputability	If we know \(\Gamma \vdash \varphi\), what do we know about \(\Gamma_1 \psi\) and \(\Gamma [p:= \psi]\)?	By Lemma 1.1.4, we must have \(\Gamma_1 \psi \vdash \varphi\)<br> <br>Furthermore, if we have a primitive proposition \(p\) and any proposition \(\psi\), then \(\Gamma [p:= \psi] \vdash \varphi[p:=\psi]\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Context (in the \(\lambda\)-calculus)	A set of pairs of the form \(x : \sigma\)<br> (where \(x \in V\) is a variable and \(\sigma \in \Pi\) is a type)<br> <br> Note that we write \(\Gamma_1 x : \sigma\) for \(\Gamma \cup \{x : \sigma\}\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Range of \(\Gamma\)	Denoted \(|\Gamma|\), the set of all types that appear in \(\Gamma\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Typability Relation<br>(basic definition)	\[\Vdash\, \subseteq C \times \lambda_\Pi \times \Pi\] where \(C\) is the set of all contexts. Links every term in a context to a type. Must fulfill three properties (recursing on any term in \(\lambda_\Pi\)). <br>This type system is denoted \(\lambda_\Pi (\to)\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\alpha\)-equivalency	Two \(\lambda\)-terms are \(\alpha\)-equivalent if they differ only in the name of the bound variables.<br> <br>e.g. \(\lambda x : {\mathbb{Z}} . x^2 \equiv_\alpha \lambda y : {\mathbb{Z}}.y^2\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What does it mean for a \(\lambda\)-term to be closed?	It has no free variables.	LogicAndComputability PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Taylor Expansion<br>(For \(f\) around \(x_0\) for nearby \(x\))	\[\begin{align*} f(x)&\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{1}{n!} f^{(n)}(x_0)(x - x_0)^n \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Maclaurin Expansion of \(e^x\)	\[\begin{align*} e^x &= 1 + x + \frac{1}{2}x^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{x^n}{n!} \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Wilkes' Theorem	The likelihood ratio test (LRT) statistic \[ D = -2\log \left( \frac{\text{Likelihood for null model}}{\text{Likelihood for alternative model}} \right) \] is asymptotically distributed with the chi-squared distribution with \(\dim \Theta - \dim \Theta_0\): \[D \sim \chi^2_{\dim \Theta - \dim \Theta_0}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Proof: Solve for the confidence interval around some parameter \(j\) of the estimated slope \(\hat \beta\)	Note that we know  \[I(\beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] Because \(\hat \beta \overset{p}{\to} \beta\), we can apply the continuous mapping theorem and Slutsky's Theorem to get: \[I(\hat \beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] This yields the interval  \[C_\alpha := \left[ \hat \beta_j \pm z_{\alpha/2} \sqrt{[i(\hat \beta){^{-1}}]_{jj}} \right]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Confidence Interval for GLMs given a new datapoint<br>(Denote the new datapoint \(x^* \in {\mathbb{R}}^p\), we want to find a confidence interval for \({\mathbb{E}}(Y | X = x^*) = g{^{-1}} (x^{*T} \beta)\).)	\[C_\alpha := g{^{-1}}\left[ x^{*T}\hat \beta \pm z_{\alpha/2} \sqrt{x^{*T}i(\hat \beta){^{-1}} x^{*}} \right]\]<br> <br> This comes from the fact that \[\hat \beta \sim N(\beta,\, i(\hat \beta){^{-1}})\] so \[x{^{\mathrm{T}}} \hat \beta \sim N(x{^{\mathrm{T}}}\beta,\,  x{^{\mathrm{T}}} i(\hat \beta){^{-1}})\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Mean parameterization \(\tilde l(\mu, \phi)\)	\[\begin{align*} \tilde l(\vec \mu, \phi) &= \log \tilde f(y, \vec \mu, \phi)\\\\ &= \log f(y, \theta(\vec \mu), \phi)\\\\ &= \sum_{i=1}^n \log f_0(y_i, \phi) + \sum_{i=1}^n \frac{1}{a_i \phi} (y_i\theta(\mu_i) - K(\theta(\mu_i)) \end{align*}\] Also known as the 'saturated' parameterization.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Deviance of a GLM	\[D(\hat\mu) := -2 \phi (\tilde l(\hat \mu, \phi) - \tilde l (Y, \phi))\] where \(\tilde l\) is the mean parameterization of the log likelihood (where we can provide a separate \(\mu_i\) for each sample \(i\)). Note then we have <br>\[ D(\hat \mu) = -2 \phi \log \left( \frac{\sup_{\mu \in {\mathbb{R}}^n: \mu_i = g{^{-1}}(x_i{^{\mathrm{T}}} \beta ),\, \beta \in {\mathbb{R}}^p} \tilde f(Y; \mu, \phi)}{\sup_{\mu \in {\mathbb{R}}^n} \tilde f(Y; \mu, \phi) }  \right) \] Here, the top model is the saturated model and the bottom model is the normal restricted model. So \(D(\hat \mu)\) is the likelihood ratio test statistic for \(H_0\): GLM, \(H_1\): saturated model.<br> <br>Also note that deviance is independent of \(\phi\)!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Dispersion-scaled Pearson Residuals	\[e_i = \frac{Y_i - \hat \mu_i}{{\mathrm{Var}}(Y_i)} = \frac{Y_i - \hat\mu_i}{\sqrt{\hat\phi a_i V(\hat \mu_i)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Deviance Residuals	\[d_i = {\mathrm{sgn}}(Y_i - \hat \mu_i) \sqrt{D_i(\hat\mu)}\] where \(D_i(\mu)\) is the \(i\)th summand of the deviance	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the relationship between dispersion-scaled Pearson residuals and deviance residuals?	When \(Y_i \approx \hat \mu_i\), we have \(d_i = \hat \phi^{1/2} e_i + O(|Y_i - \hat \mu_i|^{3/2})\)	SLIP PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Typing Rule for Variables	1. Type of standalone variables: \[\Gamma_1 x : \tau \Vdash x : \tau\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Typing Rule for \(\lambda\)-abstractions	2. Type of \(\lambda\)-abstractions:<br> \[\Gamma_1 x: \sigma \Vdash M : \tau \implies \Gamma \Vdash (\lambda x : \sigma . M): \sigma \to \tau\]<br> (assuming \(x\) does not occur in \(\Gamma\), \(M \in \lambda_\Pi\))<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Typing Rule for \(\lambda\)-applications	3. Type of \(\lambda\)-applications: \[(\Gamma \Vdash M : \sigma \to \tau)\land(\Gamma \Vdash N : \sigma) \implies \Gamma \Vdash (MN): \tau\] (\(M,N \in \lambda_\Pi\))<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Substitution in \(\lambda_\Pi\)<br>(Inductive definition)	Given \(\lambda\)-terms M, N and a variable \(x\), we define the substitution \(M[x := N]\) by pattern matching inductively:<br> <br>Case 1: M is a variable equal to x: \[x[x:=N] := N\] <br>Case 2: M is a variable not equal to x: \[y[x := N] := y\] <br>Case 3: M is a \(\lambda\)-abstraction: \[(\lambda a:\sigma. A)[x := N] := (\lambda a: \sigma.A[x := N])\] (assuming \(x \neq y\) and \(y\) not free in N)<br> <br>Case 4: M is a \(\lambda\)-application: \[(PQ)[x := N] := (P[x := N])(Q[x := N])\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\beta\)-Reduction<br>(Name the two sides of the reduction)	\[(\lambda x : \sigma.P)Q \rightarrow_\beta P[x:= Q]\] <br>The left side is called the \(\beta\)-redex and the right side is called the \(\beta\)-contraction.<br> <br>We also write \(M \twoheadrightarrow_\beta N\) if M \(\beta\)-reduces to N after a finite number of reductions.<br> <br>Note that this is preserved under \(\lambda\)-abstraction and \(\lambda\)-application: if \(P \to_\beta P'\), \[\lambda x : \sigma . P \to_\beta \lambda x : \sigma . P'\] \[PZ \to_\beta P'Z\] \[ZP \to_\beta ZP'\] (we can choose to evaluate in any order; we'll end up in the same place either way)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\beta\)-Equivalence	The smallest equivalence relation containing \(\to_\beta\), denoted \(\equiv_\beta\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\eta\)-reduction	Simplify pointless \(\lambda\)-abstractions: \[\lambda x:\sigma . (Px) \to_\eta P\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\beta\)-Normal Form	M is in \(\beta\)-NF if there is no N such that \(M \to_\beta N\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	When we write a \(\lambda\)-term like \(ABC\), what do we actually mean?	\[(AB)C\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Free Variables Lemma<br>(Given \(\Gamma \Vdash M : \sigma\),)	then we must have:<br> <br>1. For all \(\Gamma \subseteq \Gamma'\), \(\Gamma' \Vdash M : \sigma\).<br> <br>2. The free variables of M occur in \(\Gamma\)<br> <br>3. There exists some \(\Gamma' \subseteq \Gamma\) consisting of exactly the free variables of M, preserving \(\Gamma' \Vdash M : \sigma\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Generation Lemma	Just read each type inference rule in reverse:<br> <br>1. If \(\Gamma \Vdash x:\sigma\), \[x : \sigma \in \Gamma\] <br>2. If \(\Gamma \Vdash (\lambda x : M) : \sigma\), then there exists types \(\tau, \rho \in \Pi\) such that \[\Gamma_1 x : \tau \Vdash M : \rho \quad \land \quad \sigma = (\tau \to \rho)\] <br>3. If \(\Gamma \Vdash (MN) : \sigma\), then there exists types \(\tau \in \Pi\) such that \[\Gamma \Vdash M : \tau \to \sigma\quad \land  \quad \Gamma \Vdash N : \tau\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Substitution Lemma	1. If \(\Gamma \Vdash M : \sigma\) and \(\alpha\) is some type variable, then: \[\Gamma[\alpha := \tau] \Vdash M:\sigma[\alpha := \tau]\] (how substitution behaves for types in a context)<br> <br>2. If \(\Gamma_1 x : \tau \Vdash M : \sigma\) and \(\Gamma \Vdash N : \tau\), then \[\Gamma \Vdash M[x := N] : \sigma\] (how substitution behaves for terms in a context)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Subject Reduction	If \(\Gamma \Vdash M : \sigma\) and \(M \to_\beta N\), then  \[\Gamma \Vdash N : \sigma\] (\(\beta\)-reduction preserves types)<br> <br>Proven by induction with the generation and substitution lemmas.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Theorem:</b> Church-Rosser Theorem for \(\lambda_\Pi\)	Supposing \(\Gamma \Vdash M : \sigma\), if \(M \twoheadrightarrow_\beta N_1\) and \(M \twoheadrightarrow_\beta N_2\), then there exists some \(L \in \lambda_\Pi\) such that \[N_1 \twoheadrightarrow_\beta L \quad \land \quad N_2 \twoheadrightarrow_\beta L\] with \(\Gamma \Vdash L . \sigma\).<br> <br>Also called the confluence property or the diamond property.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Corollary to the Church-Rosser Theorem for \(\lambda_\Pi\)	If \(M \in \lambda_\Pi\) admits a \(\beta\)-NF, then it is unique.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Give an example of a \(\lambda\) term that is impossible to type	\[\lambda x.xx\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The height function	A recursively defined map on types \(h: \Pi \to {\mathbb{N}}\) mapping:<br> <br>1. \(v \in V\) to 0<br> 2. \(\sigma \to \tau\) to \(1 + \max(h(\sigma), h(\tau))\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Theorem:</b> The Weak Normalization Theorem for \(\lambda_\Pi(\to)\)	Let \(\Gamma \Vdash M : \sigma\). There exists a finite reduction path  \[M := M_0 \to_\beta M_1 \to_\beta ...\to_\beta M_n\] where \(M_n\) is a \(\beta\)-NF.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Theorem:</b> The Strong Normalization Theorem for \(\lambda_\Pi(\to)\)	Let \(\Gamma \Vdash M : \sigma\). There exists no infinite reduction chain \[M := M_0 \to_\beta M_1 \to_\beta ...\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Curry-Howard for IPC(\(\to\))	Let \(\Gamma\) be a context and \(\varphi\) be a proposition (type). Then:<br> <br>1. If \(\Gamma \Vdash M : \varphi\), then \[|\Gamma| = \left\{ \tau \in \Pi : (x : \tau) \in \Gamma \text{ for some variable } x \right\} \vdash_{IPC} \varphi\]<br> <br>2. If \(\Gamma \vdash_{IPC} \varphi\), then there exists some \(M \in \lambda_\Pi(\rightarrow)\) such that \[\left\{ (x_\psi : \psi) \,|\, \psi \in \Gamma \right\} \Vdash M : \varphi\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::Astrostatistics	Definition of CDF	\[F(x) = P(X\leq x) = \int_{-\infty}^x f(x)\,dx\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Characteristic Function (Statistics)	\[\varphi_X(t) = {\mathbb{E}}(e^{itX})\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> The Delta Method	Just a second-order Taylor approximation with a statistics coat of paint: \[{\mathbb{E}}[g(X)] \approx g({\mathbb{E}} [X]) + \frac{1}{2} g''({\mathbb{E}}[X]) {\mathrm{Var}}(X)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Univariate General Cramer-Rao Lower Bound	Given any estimator \(T\) for \(\theta\), we have: \[{\mathrm{Var}}(\hat \theta) \geq  \frac{\left(1 + \frac{d}{d\theta} {\mathrm{Bias}}(\hat \theta)\right)^2}{I(\theta)}\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	How do you solve a problem with selection effects?	Use Bayes' Theorem conditioning on the selection function (usually a step function), should end up with a normal CDF normalizing factor	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	State the Frequentist model of statistics (in the context of a regression)	We have some true variables we want to compute, but we can only view data that is confounded with some source of error (following a distribution)<br> <br> So the underlying (latent) variables we seek to understand are deterministic, there's just a bunch of stochastic error complicating matters.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	State the Bayesian model of statistics (in the context of a regression)	Interpret probability as the degree of certainty in an event rather than its long-run chance. We start with some guess as to how we think the parameter is distributed, and use the data to update that prior into a more accurate posterior. <br> <br> The parameters are themselves random variables! We estimate their distribution, not just a single value.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Monte Carlo Integration<br>(for integrating over some interval I)	\[\hat I = \frac{1}{m} \sum_{i=1}^m f(\theta_i)\] where \(f(\theta)\) is some data processing to estimate some statistic (mean, variance, etc)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Monte Carlo integrator for the posterior mean	\[f(\theta) = \theta\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Monte Carlo integrator for the posterior variance	\[f(\theta) = (\theta - {\mathbb{E}}[\theta {\,|\,} D ])^2\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Monte Carlo integrator in an interval \([a,b]\)	\(f(\theta) = I_{[a, b]}(\theta)\)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Variance of a Monte Carlo Integrator	\[{\mathrm{Var}}(\hat I) = \frac{1}{m} {\mathrm{Var}}[f(\theta)]\] Note that this is independent of the dimension of \(\theta\)! <br>The Monte Carlo error is just the square root of this (the standard deviation)<br> <br>Note that this can also be used to estimate the sample variance: \[\hat {\mathrm{Var}}(\{ f(\theta_i) \}) = \frac{1}{m-1} \sum_{i=1}^m(f(\theta_i) - \hat I)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Direct Sampling method for estimating a posterior distribution	In the case where a posterior distribution can be broken down into the product of named distributions, simply draw from one distribution, feed into the next, etc, and do this to get a bunch of samples. <br> <br> Note that it's easy to get the marginals this way: just ignore all but one value!	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Kernel Density Estimation	Creates a smooth histogram from a bunch of samples \(\theta_i\): <br>\[\hat P(\theta {\,|\,} D) = \frac{1}{m} \sum_{i=1}^m  N(\theta {\,|\,} D_i, b_w^2)\] <br>where \(b_w\) is the bandwidth.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Silverman's Rule of Thumb	\[b_w = \left( \frac{4 \hat \sigma^5}{3m} \right)^{1/5}\] <br>where \(\sigma\) is the estimated sample standard deviation, \(\hat \sigma^2 = \hat {\mathrm{Var}}(\{ \theta_i \})\)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> The Metropolis Algorithm	1. Pick some \(\mu_0\)<br> 2. Choose a new \(\mu_{\text{prop}} \sim {\mathrm{N}}(\mu_{\text{prev}}, \tau^2)\)<br> Note: the jump distribution has to be symmetric: \(J(a|b) = J(b|a)\)<br> 3. Accept with probability \(\min(1, r),\,\, r = \frac{{\mathbb{P}}(\mu_{\text{prop}}{\,|\,}\vec y)}{{\mathbb{P}}(\mu_{\text{prev}}{\,|\,}\vec y)}\)<br> 4. Repeat 2 and 3 until you have enough samples.<br> <br> This works for higher dimensions as well, just draw from a multivariate Gaussian jump distribution.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	State two post-processing strategies to properly format MCMC results	1. Get rid of burn-in: throw out about the first 20 percent of your samples<br> 2. Thinning: Throw out every other sample	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Stationary Distribution	Given some distribution \(P\) and some random process kernel \(T\): \[\sum_x P(x)T(x \to x') = P(x')\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Detailed Balance<br>(and what it implies)	Given some distribution \(P\) and some random process kernel \(T\): \[P(x)T(x \to x') = P(x')T(x'\to x)\] This implies that \(P\) is a stationary distribution: just sum both sides over \(x\)!	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Metropolis-Hastings	1. Pick some \(\mu_0\)<br> 2. Choose a new \(\mu_{\text{prop}} \sim J(\mu_{\text{prev}}, \tau^2)\)<br> Note: the jump distribution need not be symmetric: \(J(a|b) \neq J(b|a)\)<br> 3. Accept with probability \(\min(1, r),\,\, r = \frac{{\mathbb{P}}(\mu_{\text{prop}}{\,|\,}\vec y) / J(\mu_{\text{prop}}{\,|\,} \mu_\text{prev})}{{\mathbb{P}}(\mu_{\text{prev}}{\,|\,}\vec y)/ J(\mu_{\text{prev}}{\,|\,} \mu_\text{prop})}\)<br> 4. Repeat 2 and 3 until you have enough samples.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Gibbs Sampling	Assumes that we have marginal distributions along every parameter alone. <br>1. Pick some \(\vec \mu_0\)<br> 2. Run one Gibbs cycle: <br> for each \(i \in [d]\),  set \(\mu_{next,i} \sim {\mathbb{P}}(\mu_{prev, i}{\,|\,} \mu_{prev, -i}, y)\)<br> 4. Repeat 2 until you have enough samples.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Metropolis-Within-Gibbs	If we can't express the conditional distributions necessary to run Gibbs sampling, estimate them by attempting to take a Metropolis step	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	How do we choose a good jump function when tuning the Metropolis algorithm?	We can use Laplace approximation to show that the best choice is \(c^2A{^{-1}}\), where \(A\) is the Hessian of the log-posterior and \(c \approx \frac{2.4}{\sqrt{\dim \theta}}\)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	What acceptance rate should we aim for when using the Metropolis algorithm?	44 percent in one dimension, 23 percent in dimensions greater than 5	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	How can we compare the convergence of two different MCMC methods?	Check that your Gelman-Rubin ratio is about 1.<br> Check the autocorrelation timescale, compare effective sample sizes.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Explain mixed samplers and parameter blocking	Mixed samplers describe the process of choosing different means of sampling different parameters of the function we want to sample from.<br> <br> Parameter blocking refers to updating some parameters together (in blocks), especially ones that are highly correlated (to prevent zig-zagging during Gibbs, for instance).	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Binary Relation<br>(and when it is said to contain another relation)	Given some set \(X\), \(X \times X\) is said to be a relation on \(X\). It contains another relation \(Y\) if for every \((a,b) \in Y\), \((a,b) \in X\).<br> <br>It's an equivalence relation when it is reflexive, symmetric, and transitive.<br> <br>Note that any intersection of equivalence relations always is an equivalence relation.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The Three Laws of Lattices	A lattice L is a set equipped with a a binary operators \(\wedge\) and \(\vee\) such that both operations obey: <br> <br>- commutativity, and<br> - associativity <br>Plus together they must follow the absorption laws. <br>Note together these imply \(a \vee a = a\) and \(a \wedge a = a\) (this is called being idempotent)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Distributive Lattice	A lattice that fulfills distributivity: \[a \wedge (b \vee  c) = (a \wedge b) \vee (a \wedge c)\] Note this is true iff \[a \vee (b \wedge  c) = (a \vee b) \wedge (a \vee c)\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Bounded Lattice	A lattice where there exists \(\bot, \top \in L\) such that: \[a \vee \bot = a,\] \[a \wedge \top = a\] Note that these imply: \[a \wedge \bot = \bot\] \[a \vee \top = \top\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Complemented Lattice	A bounded lattice such that, for all \(a \in L\), there exists some \(a^* \in L\) such that: \[a \wedge a^* = \bot,\] \[a \vee a^* = \top\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Boolean Lattice	A lattice that is distributive, bounded, and complemented.<br> <br>(Note that the structure this creates makes the complement function \(*\) equivalent to \(\lnot\) since we must have \(a \lor \lnot a = \top\) and \(a \land \lnot a = \bot\). This is also where \(\wedge\) and \(\vee\) come from! \(\lor\) prefers \(\top\) and \(\land\) prefers \(\bot\))	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Induced Order from a Lattice	We say \(a \leq b\) when \(a \wedge b = a\) and equivalently \(a \vee b = b\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Induced Order of Bounded Lattices	For any bounded lattice, for the induced order, we have: \[a \wedge b = \inf\{a, b\},\] \[a \vee b = \sup\{a, b\}\] <br>Conversely, for any partial order, finite meets and joins form a lattice as above.<br> <br>Note then that for any total order, we have \(\wedge \equiv \min\) and \(\vee \equiv \max\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Draw the free distributive lattice on {x, y}	\begin{tikzpicture}[   node/.style={circle, draw, fill=white, inner sep=2pt, minimum size=6pt},   label/.style={font=\small},   >=stealth ] <br>% Nodes \node[node] (bot)  at (0, 0)  {}; \node[node] (xay)  at (0, 1.5) {}; \node[node] (x)    at (-1.5, 3) {}; \node[node] (y)    at (1.5, 3)  {}; \node[node] (xvy)  at (0, 4.5) {}; \node[node] (top)  at (0, 6)   {}; <br>% Edges \draw (bot) -- (xay); \draw (xay) -- (x); \draw (xay) -- (y); \draw (x)   -- (xvy); \draw (y)   -- (xvy); \draw (xvy) -- (top); <br>% Labels \node[label, below] at (bot.south)  {\(\bot\)}; \node[label, left]  at (xay.west)   {\(x \wedge y\)}; \node[label, left]  at (x.west)     {\(x\)}; \node[label, right] at (y.east)     {\(y\)}; \node[label, left]  at (xvy.west)   {\(x \vee y\)}; \node[label, above] at (top.north)  {\(\top\)}; <br>\end{tikzpicture}	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Valuation	A function \(v:L \to \{0, 1\}\) assigning truth values to statements in a consistent manner (respecting logical connectives)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Heyting Algebra	A bounded lattice \(H\) equipped with an extra binary relation \(\Rightarrow: H \times H \to H\) such that \[a \wedge b \leq c \iff a \leq (b \Rightarrow c)\]<br> <br>(So \(a \Rightarrow b\) can be thought of the weakest thing you can assume that, when combined with \(b\), gets you to \(c\))	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Heyting Homomorphism	A function that preserves \(\wedge\), \(\vee\), and \(\Rightarrow\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Show that every Boolean algebra is a Heyting algebra	We can construct our own \(\Rightarrow\) function: \[a \Rightarrow b := (\lnot a) \lor b\]<br> <br>An example of this are the power sets of a set.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Show that every topology on a set X is a Heyting algebra	We construct a \(\Rightarrow\) function: \[U \Rightarrow V := {\mathrm{Int}}(U^c \cup V)\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> H-Valuation	Given some language \(L\) with a set of primitive propositions \(P\), an H-valuation \(v: P \to H\) is extended to \(L\) recursively by:<br> <br>1. \(v(\bot) = \bot\)<br> 2. \(v(A\land B) = v(A) \land v(B)\)<br> 3. \(v(A\lor B) = v(A) \lor v(B)\)<br> 4. \(v(A \to B) = v(A) \Rightarrow v(B)\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> H-Valid	A proposition \(A \in L\) is H-valid if, for all H-valuations \(v\), \(v(A) = \top\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> H-Consequence	A proposition \(A \in L\) is an H-consequence of a finite set of propositions \(\Gamma \subseteq L\) if, for all valuations \(v\), \(v(\bigwedge \Gamma) \leq v(A)\).<br> <br>We denote this by \(\Gamma \vDash_H A\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Lemma:</b> Soundness of Heyting Semantics	If H is a Heyting algebra and v is an H-valuation, we always have: \[\Gamma \vdash_{IPC} A \implies \Gamma_{H,v} \vDash A\]<br> <br>Proven over the structure of all IPC rules.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Example:</b> Show that the LEM is not intuitionistically valid	By the soundness of Heyting semantics, we need only to find an example of a Heyting algebra with a valuation where \(v(p \lor \lnot p) \neq \top\) for some \(p \in L\).<br> <br>Consider the topology \(\{ \emptyset, \{ 1\}, \{1,2\} \}\) (called the Sierpinski space). Take some primitive proposition \(p\), and define our valuation so that \(v(p) = \{1\}\). Then we have<br> <br>\[v(\lnot p) = v(p) \Rightarrow \emptyset = {\mathrm{Int}} (\{1,2\} \backslash \{ 1\}) = \emptyset\] <br>and thus <br>\[v(p \lor \lnot p) = v(p) \lor v(\lnot p) = \{ 1\} \neq \top = \{ 1, 2\}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Pierce's Law	\[((p \to q) \to p) \to p\] This is a tautology in classical logic, but not intuitionistically valid.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Example:</b> Show that Pierce's Law is not intuitionistically valid.	By the soundness of Heyting semantics, we need only to find an example of a Heyting algebra with a valuation where \(v(p \lor \lnot p) \neq \top\) for some \(p \in L\).<br> <br>Take the usual topology on \({\mathbb{R}}^2\) with the valuation:<br> <br>\[p \longmapsto {\mathbb{R}}^2 \backslash \{ 0, 0\}\] \[q \longmapsto \emptyset\] <br>Plugging in, you get \(((p \to q) \to p) \to p = {\mathbb{R}}^2 \backslash \{0,0\} \neq \top = {\mathbb{R}}^2\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Lindenbaum-Tarski Algebra	A Lindenbaum-Tarski Algebra \(F^Q(T)\) over an L-theory T is an algebra over equivalence classes of propositions where \(\varphi \sim \psi\) \[T, \varphi \vdash_Q \psi \text{  and  } T, \psi \vdash_Q \varphi\] (equivalence class over \(\varphi \iff \psi\))<br> <br>If \(\bowtie\) is any logical connective in \(T\), we set \([\varphi] \bowtie [\psi]\) where \([\varphi \bowtie \psi]\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Show that the LT Algebra of any theory relative to IPC \(\backslash \to\) (IPC without implication) is a distributive lattice.	Commutativity and associativity are trivial, just show the two absorption laws and a distributivity law by IPC rules: \[([A] \lor [B])\land [A] = [(A \lor B) \land A] = [A]\] \[([A] \land [B])\lor [A] = [(A \land B) \lor A] = [A]\] \[[A] \land ([B] \lor [C]) = [A \land (B \lor C)] = [(A  \land B) \lor (A \land C)] = ([A]  \land [B]) \lor ([A] \land [C]) \]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Lemma:</b> Show that the LT algebra of any theory relative to IPC is a Heyting Algebra	We showed that IPC \(\backslash \to\) is a distributive lattice, so we just need to show that IPC has \(\Rightarrow\) and is bounded. Use \(\to\) as \(\Rightarrow\), \(\bot\) as \(\bot\), and \(\bot \to \bot\) as \(\top\). So just show that:<br> <br>Assuming \([A] \land [B] \leq [C]\), we have \([A] \leq ([B] \Rightarrow [C])\)<br> <br>Assuming \([A] \leq ([B] \Rightarrow [C])\), we have \([A] \land [B] \leq C\)<br> <br>Both directions needed to make \(\bot \leq A\) for all \(A\)<br> <br>Both directions needed to make \(A \leq \top\) for all \(A\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Theorem:</b> Completeness of Heyting semantics	A formula is provable in IPC if it is H-valid for every Heyting algebra H (and only if by soundness)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Principal Up-set<br>(of some poset \(S\))	\[a\uparrow\, := \{s \in S: a \leq s\}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Terminal Segment<br>(of a poset \(S\))	A subset \(U \subseteq S\) where \(a \uparrow\, \subseteq U\) for every \(a \in U\)<br> <br> <br> Note that this isn't the same as a filter: for any lattice with incomparable elements, we can escape the terminal segment with finite meets!	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> If \(S\) is a poset, \(T(S) := \{ U \subseteq S : U \text{ is a terminal segment} \}\) is a Heyting algebra	Order by set inclusion, the fact that this is a bounded lattice and closed is taken as trivial. The key step is using \[U \Rightarrow V := \{ s \in S: (s \uparrow) \cap U \subseteq V \}\] <br>Must show that this is a terminal segment and both directions of the Heyting Algebra definition.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Kripke Model	Let P be some set of primitive propositions. A triple \((S, \leq, \Vdash)\) where \((S, \leq)\) is a poset, with S called the set of worlds and \(\leq\) the accessibility relation; and \(\Vdash \subseteq S \times P\) is the forcing relation.<br> It must fulfill the Persistence Property. <br>Note that for \(X \subseteq S\) we say \(X \Vdash \varphi\) if all worlds in \(x \in X\) force \(\varphi\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Extended Forcing Relation of the Kripke Model	When \(p\) is not a primitive proposition, we define the forcing relation inductively as follows:<br> <br>1. There is no world that believes in \(\bot\): \(\not \exists s,\, s \Vdash \bot\)<br> <br>2. We write \(s \Vdash \varphi \land \psi\) iff \(s \Vdash \varphi\) and \(s \Vdash \psi\)<br> <br>3. We write \(s \Vdash \varphi \lor \psi\) iff \(s \Vdash \varphi\) or \(s \Vdash \psi\)<br> <br>4. We write \(s \Vdash (\varphi \to \psi)\) iff, for every more knowledgeable world \(s \leq s'\), \(s' \Vdash \varphi\) implies \(s' \Vdash \psi\)<br> (worlds are humble: they believe only in things that they believe more knowledgeable worlds believe in)<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What are the conditions for negation in a Kripke model?	\[s \Vdash \lnot \varphi \iff \forall s' \geq s,\,\, s' \not \Vdash \varphi\] \(\lnot \varphi\) holds iff no more knowledgeable worlds believe in it	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What are the conditions for double negation in a Kripke model?	\[s \Vdash \lnot \lnot \varphi \iff \forall s' \geq s,\,\, \exists s'' \geq s',\,\, s'' \Vdash \varphi\] In all more knowledgeable worlds, there exists some even more knowledgeable world that believes in \(\varphi\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Filter	A filter \(F\) on a lattice \(L\) is a subset \(F \subseteq L\) with the following properties:<br> <br>1. \(F \neq \emptyset\)<br> 2. F is a terminal segment of \(L\): ie if \(x \leq y\) and \(x \in F\), then \(y \in F\).<br> 3. \(F\) is closed under finite meets.<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Prime Filter	A filter is prime if \(x \vee y \in F \implies x \in F \text{ or } y \in F\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> The Hypergeometric Distribution	The probability of \(k\) successes given \(n\) draws, where we draw without replacement from \(N\) objects with \(K\) that we want. <br>\[\mathrm{Hypergeometric}(N, K, n)\] <br>(probability distribution over \(k\))	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Expectation of the hypergeometric distribution	\[{\mathbb{E}}(\mathrm{Hypergeometric}(N, K, n)) = n \frac{K}{N}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MSM	What is the benefit of using the Benjamin-Hochberg Procedure?	It's guaranteed to keep the FDR beneath \(\alpha\): \[\frac{\alpha m_0}{m} \leq \alpha\]	MSM PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function for the EDF of a normal random variable.	\[K(\theta) = \frac{1}{2}\theta^2\] <br>Note that this implies \[\mu = \frac{d}{d\theta}K(\theta) = \theta\] and \[\sigma^2 = \phi \frac{d^2}{d\theta^2} K(\theta) = \phi\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the base function \(f_0(y;\phi)\) of the normal EDF	\[f_0(y;\phi) = \frac{1}{\sqrt{2\pi\phi}} \exp\left(-\frac{y^2}{2\phi}\right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\theta\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[\theta = {\mathrm{logit}}(p) = \frac{p}{1-p}\] and so \[p = {\mathrm{expit}}(\theta) = \frac{e^\theta}{1+e^\theta}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\phi\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[\phi = \frac{1}{n}\] and so \[n = \frac{1}{\phi}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function of the EDF for \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[K(\theta) = \log(1 + e^\theta) - \log(2)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the normalizing \(f_0(y;\phi)\) for the EDF of \(\frac{1}{n}\mathrm{Bin}(n,p)\)	\[{n\choose{yn}} 2^{-n}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Classifier	A function \(h: {\mathbb{R}}^p \to [K]\), typically of the form \[h(x) = h(x; (X_i, Y_i)_{i \in [n]})\] <br>where \(p\) is the dimension of the classified points, \(K\) is the number of classification groups, and \(n\) is the number of data points.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Misclassification Loss	\[L(y, y') = {\mathbb{1}}_{\{ y \neq y'\}}\] <br>(really obvious, just 1 when you've got the wrong classification or 0 if not)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Prediction Error of a Classifier	\[R(h) =  {\mathbb{E}}[L(Y^*, h(X^*))] = {\mathbb{P}}[Y^* \neq h(X^*)]\] <br>where \((X^*, Y^*) \in (X^K, Y^K)\) are test points<br> <br> We want to find classifiers \(h\) that minimize this!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Conditional Class Probabilities	\[p_k(x) = {\mathbb{P}}(Y^* = k {\,|\,} X^* = x)\] <br>where \(k\) is some classification in \([K]\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bayes Classier	The minimal prediction error over all classifiers is called the Bayes risk. It is attained by the Bayes classifier \[h^*(x) = {\underset{k \in [K]}{\mathrm{argmax}}\,\,}{p^*_k(x)}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Core idea behind decision trees	Break covariate space down into rectangular regions \(R_1, R_2, ..., R_M\), each with some contributing factor \(c_m\). When \(x\) falls within each, that region contributes its \(c_m\) to a final total: \[x \mapsto \sum_{m=1}^M c_m {\mathbb{1}}_{x \in R_m}\] The hard part is finding the regions and coefficients that work!<br> <br>Idea: Start with the entire space in one big rectangle, then split it into two smaller pieces, staying axis-aligned: \[R_l(j, t) = \{ x\in R: x_j \leq t \}\] \[R_r(j, t) = \{ x \in R: x_j &gt; t \}\] where \(j \in [p]\) are the split variables and \(t \in {\mathbb{R}}\) are the split points (we choose only one variable, and one location along that variable, to split over at a time)<br> <br>Find the best split variable and split point to minimize a cost function \(Q(j, t)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain the general algorithm for a regression tree	Given data \((x_i, y_i)_{i \in [n]}\) in \({\mathbb{R}}^p \times {\mathbb{R}}\), continually split any region so as long as it has at least 2 data points.<br> <br>Let \(T_j\) be the set of midpoints between adjacent points in a region \(x_{ij} \in {\mathbb{R}}\). Compute the cost-minimizing split to decide where to split: \[(\hat j_R, \hat t_R) \in {\underset{j \in [p],\, t \in T_j}{\mathrm{argmin}}\,\,} Q(j,t)\] where \(Q(j,t)\) is the loss after each split.<br> <br>Then perform the split and continue! <br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Regression decision tree loss	It's just the best case loss of the left rectangle (best choice of \(c_l\)) plus the best case loss of the right rectangle (best choice of \(c_r\)) minus the current loss of the rectangle (using \(c\)): \[\begin{align*} Q_{reg}(j,t) &=& \min_{c_l \in {\mathbb{R}}} \sum_{i : x_i \in R_l(j,t)} (y_i-c_l)^2 &=& \sum_{i : x_i \in R_l(j,t)} (y_i-c(R_l(j,t))^2\\\\ &&+ \min_{c_r \in {\mathbb{R}}} \sum_{i : x_i \in R_r(j,t)} (y_i-c_r)^2 &&+ \sum_{i : x_i \in R_r(j,t)} (y_i-c(R_r(j,t))^2\\\\ &&- \min_{c \in {\mathbb{R}}} \sum_{i: x_i \in {\mathbb{R}}} (y_i-c)^2 &&+ \sum_{i : x_i \in R} (y_i-c(R))^2 \end{align*}\] where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.<br> <br>Note that this can never be positive! Splitting will never make things worse (because we could just have two splits with the same old height \(c\)).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Final regression function of a regression tree	\[\hat f_{CART}(x) = \sum_{m=1}^M \hat c_m {\mathbb{1}}_{x \in R_m}\] with \(\hat c_m = c(R_m)\), where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.<br> <br>\(R_m\) are called the leaves of the decision tree!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Heuristic for when a regression tree is done growing	Each region should contain no more than 5 points	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Runtime of growing a regression tree (when optimized)	O(n)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Define a unitary matrix	A matrix \(A\) where \(A^* = A{^{-1}}\),<br> where \(A^*\) is the complex conjugate<br> <br>For real matricies this is the same as being orthogonal	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Define an orthogonal matrix, what do they represent?	A matrix \(A\) where \(A{^{\mathrm{T}}} = A{^{-1}}\)<br> <br>it represents rotations and reflections (but not stretching)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Thin Singular Value Decomposition	Given \(X \in {\mathbb{R}}^{n\times p}\), the decomposition  \[X = U D V{^{\mathrm{T}}}\] where \(U \in {\mathbb{R}}^{n\times p}\) is a unitary matrix,<br> \(D \in  {\mathbb{R}}^{p\times p}\) is a sorted diagonal matrix \((D_{11} \geq D_{22} \geq ...)\),<br> and \(V \in {\mathbb{R}}^{p \times p}\) is also unitary.<br> <br>This can be thought of as a process where a space is rotated/reflected, stretched on its axes, then rotated/reflected again.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Principal Component	Given some \(X \in {\mathbb{R}}^{n\times p}\) row-centered (\(X^T\)1 = 0), PCA finds a sequence of \(p\) unit vectors \(v_j\), where each \(v_j\) minimizes the sample variance \({\mathrm{Var}}{(Xv_j)} = \frac{1}{n}v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) while staying orthogonal to all previous \(v_i\)s.  <br> <br>We call \(Xv_j\) the principal components.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State how a thin singular value decomposition yields the principal components	Given a singular value decomposition \(UDV\), substituting into the sample variance \(v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) and optimizing tells us that \(v_j = V_j\) (\(j\)th column of V) and so for the principal component, we have:  \[Xv_j = XV_j = D_{jj}U_j\] where \(U_j\) is the \(j\)th row of \(U\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	In practice, how many principal components do we use?	Take the smallest \(J \leq P\) such that \[\frac{\sum_{j=1}^J D_{jj}^2}{\sum_{j=1}^p D_{jj}^2} \geq 0.8\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Goal of K-Means Clustering	Find a 'good enough' labelling minimizing <br>\[ L(c_1...c_k) = \sum_{k=1}^K \frac{1}{2 |C_k|} \sum_{i \in C_k} \sum_{j \in C_k} ||x_i - x_j||^2_2\] <br>(minimize within-cluster variation)<br> But this is NP-hard! Need a heuristic solution.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Centroid of a cluster	\[\bar X_k = \frac{1}{|C_k|} \sum_{i\in C_k} X_i\] (center of mass)<br> <br>Note that we can represent the within-cluster variance loss in terms of this: \[L(c_1...c_k) = \sum_{k=1}^K \sum_{i \in C_k} ||x_i - \bar x_k||^2_2\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> K-Means Clustering Algorithm	1. Initialize a \(K\)-partition \(C_1...C_k\) randomly.<br> <br>2. Until convergence:<br> <br>2-1. Compute the centroid of each cluster \(\hat x_i\)<br> <br>2-2. Construct a new partition \(\tilde C_1...\tilde C_k\) by assigning each point to the closest centroid.<br> <br>Note that this only finds a local minimum, not always the global minimum!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	K-Means best practices (how do we choose K?)	To choose K, try a bunch and see which ones offer the most explainability. (don't just increase to minimize loss; \(K = n\) always does that)<br> <br>Good clusterings should be stable with repeated runs, or with the addition or removal of a few data points.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	The two MLEs for the variance of a normal linear model		SLIP PartIIINotes
Basic	Part III Notes::SLIP	Explain how Newton-Raphson can be used to solve an MLE numerically	We want to find the \(\theta\) for which \(\frac{dl(\vec\theta;x)}{d\vec\theta} = 0\). So, use Newton-Raphson with the Hessian <br>\[\frac{d^2l(\vec\theta;x)}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}\] except we want the maximum, not the minimum, so make sure to use the negation instead!<br> <br>\[\hat \theta^{(n+1)} = \hat \theta^{(n)} - \left( \frac{d^2l}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}(\hat\theta^{(n)}) \right){^{-1}}\frac{dl}{d\vec\theta}(\hat\theta^{(n)})\]<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When we inspect GLM residuals (Pearson or Deviance), what are we looking for?	Should both be approximately normal with mean zero, variance 1.<br> <br>Note this is based on small-dispersion asymptotics: \[\phi_i^{-1/2} (y_i - \mu_i) \overset{d}{\to} N(0, V(\mu_i))\] as \(\phi_i \to 0\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we need to be careful of when using dispersion-scaled residuals in R?	They aren't normalized by \(\hat \phi ^{-1/2}\), they use: \[\frac{y_i - \hat \mu_i}{\sqrt{a_i V(\mu_i)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Sigmoid Function	\[\sigma(x) = \frac{e^x}{(1 + e^x)}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> ReLU Function	\[\sigma(x) = \max(0, x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Softmax Function	\[g_k(z) = \frac{e^{z_k}}{\sum_{r=1}^K e^{z_k}}\] for \(k \in [K]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Universal Approximation Theorem for Neural Networks	Any (borel-measurable) function can be approximated to a prescribed precision by a single hidden layer neural network with any nonlinear activation function.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Cross-Entropy Loss	\[L(\theta) = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K Y_{ik} \log P_k^{(\theta)}(X_i)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Gradient Descent for Neural Networks	\[\theta^{(t + 1)} = \theta^{(t)} - \alpha_t \nabla_\theta \mathrm{Err}_\mathrm{tr}(\theta^{(t)})\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The Continuous Mapping Theorem	Given a function \(g: X \to Y\) that is continuous except at points of measure zero, \(g\) preserves convergence almost surely, in probability, and in distribution. \[X_n \overset{a.s.}{\to} X \implies g(X)_n \overset{a.s.}{\to} g(X)\] \[X_n \overset{p}{\to} X \implies g(X)_n \overset{p}{\to} g(X)\] \[X_n \overset{d}{\to} X \implies g(X)_n \overset{d}{\to} g(X)\]	SLIP PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Absorption Laws of Lattices	\[a \vee (a \wedge b) = a,\] \[a \wedge (a \vee b) = a\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Akaike Information Criterion (AIC)	\[AIC(M) = 2 \dim \Theta-2 \log f(z; \hat\theta)\] Smaller is better!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bayesian Information Criterion (BIC)	\[BIC(M) = (\log n) \dim \Theta-2 \log f(z; \hat \theta)\] where \(n\) is the number of data points.<br> Smaller is better!	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When should you use AIC versus BIC?	Use AIC in predictive applications, use BIC in explanatory applications (since it selects for simpler models; extra weight on the \(\dim \Theta\) term)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Prediction Error \(\mathrm{Err}\) (Risk)	\[\mathrm{Err} = {\mathbb{E}} [ L(Y^* , \hat f(X^*))]\] where \((X^*, Y^*)\) is a random vector, the test data (space of all possible outcomes) and \(\hat f\) is an arbitrary regression function fit using training data.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> In-Sample Error \(\mathrm{Err}_\mathrm{in}\)	\[\mathrm{Err}_\mathrm{in} = \frac{1}{n} \sum_{i=1}^{n} {\mathbb{E}} [ L(Y^*_i , \hat f(x_i))]\] (this is for a fixed design scenario, where the \(x_i\)s are fixed but the outcome is not. \(\hat f\) is fit using specific draws \((x_i, y_i)\) but evaluated against any possible set of \(Y^*_i\) (still at fixed \(x_i\)).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Training Error \(\mathrm{Err}_\mathrm{tr}\)	\[\mathrm{Err}_\mathrm{tr} = \frac{1}{n} \sum_{i=1}^{n}  L(y_i , \hat f(x_i))\] The evaluation of \(\hat f\) against the training data itself.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Show that we have the relation:<br> \[\mathrm{Err}_\mathrm{in} = {\mathbb{E}}(\hat{ \mathrm{Err}_\mathrm{tr}}) + \frac{2}{n} \sum_{i=1}^n {\mathrm{Cov}}(\hat f (x_i), Y_i)\] for L2 loss.	Proof sketch:<br> Just expand the squared terms. You'll need to use that \[{\mathbb{E}}[Y_i^* \hat f(X_i)] = {\mathbb{E}}[Y_i^*]{\mathbb{E}}[\hat f(X_i)] = {\mathbb{E}}[Y_i]{\mathbb{E}}[\hat f(X_i)]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bias-Variance Tradeoff (of the prediction error)	If we're trying to model  \[Y_i = f(X_i) + {\varepsilon}_i\] where \(f: {\mathbb{R}}^p \to {\mathbb{R}}\), \({\mathbb{E}}({\varepsilon}_i) = 0\), \({\mathrm{Var}}({\varepsilon}_i) = \sigma^2 &gt; 0\), then we have the following decomposition of prediction error: \[\mathrm{Err} = \sigma^2 + {\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*) {\,|\,} X^*)] + {\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*) {\,|\,} X^*)]\] <br>Or for the in-sample error: \[\mathrm{Err}_\mathrm{in} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*_i) {\,|\,} X^*_i)] + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*_i) {\,|\,} X^*_i)]\]	SLIP PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Accelerated-Life Families	A family of distributions generated by \(F(\lambda t)\) for some scaling \(\lambda &gt; 0\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Proportional-Hazards Families	A family of distributions generated by \(F(t)^k\) for some shape parameter \(k &gt;0\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Survivor Function of the Weibull Distribution	\[F(t) = e^{-(\theta t)^k}\] where \(\theta &gt;0\) is the shape parameter and \(\theta &gt; 0\) is the scale parameter. <br>Note that this reduces to the exponential RV when \(k = 1\). <br>Weibull distributions are both an accelerated life and proportional hazards family!	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	State the contribution to the likelihood for an observed event versus a censored event.	When \(v_i = 1\) (uncensored), the contribution is the density \(f(x_i, \theta)\). <br> When \(v_i = 0\) (censored), the contribution is \({\mathbb{P}}(x_i &lt; T_i) = F(x_i, \theta)\) (where \(F\) is the survivor function). <br>So the total likelihood is \[L(\theta) = \prod_{v_i = 1}f(x_i, \theta)\prod_{v_i = 0}F(x_i, \theta)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	PDF of the exponential distribution	\[f(t; \theta) = \theta e^{-\theta t}\] given \(\theta &gt; 0\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Survivor Function of the exponential distribution	\[F(t; \theta) = e^{-\theta t}\] given \(\theta &gt;0\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Requirements for a nonparametric survival function \(\tilde F(t)\)	1. Bounded between zero and one: \(0 \leq \tilde F(t) \leq 1\)<br> 2. Decreasing: where \(a \leq b\), \(\tilde F(a) \geq \tilde F(b)\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Contributions to a non-parametric survivor function likelihood	Where uncensored, we have: \[{\mathbb{P}}(T = x_i) = {\mathbb{P}}(T \geq x_i) - {\mathbb{P}}(T &gt; x_i) = F(x_i^-) - F(x_i) = F_i^- - F_i\] where the minus subscript represents the limit from the left.<br> <br>Where censored, we have: \[{\mathbb{P}}(T &gt; x_i) = F(x_i) = F_i\] <br>So for each event the likelihood contribution is \[L(\theta) = \prod_{v_i = 1}(F_i^- - F_i) \prod_{v_i = 0} F_i\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How can we simplify a non-parametric survivor function likelihood?	We can reduce the number of parameters by our assumptions: \[F_1^- = 1\] \[F_n = 0\] Also the function should only change between \(F_i^-\) and \(F_i\) terms! (we're trying to build a step function; piecewise linear)<br> We should end up with a polynomial we can maximimize easily.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Non-parametric likelihood contributions of left-censored observations and interval-censored observations	Left-censored: \(1 - F(x_i) = 1 - F_i\)<br> Interval-censored on an interval \((a, b)\): \(F(a) - F(b)\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Hazard Function (raw form)	\[h(t) = \lim_{\Delta \to 0} \frac{{\mathbb{P}}(t &lt; T &lt; t + \Delta {\,|\,} T &gt; t)}{\Delta}\] You can simplify this to the form generally used using Bayes' Theorem and recognizing the definition of derivative: \[h(t) = \frac{f(t)}{F(t)}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Hazard Function (practical form)	\[h(t) = \frac{f(t)}{F(t)}\] where \(F(t)\) is the survivor function.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Integrated Hazard	\[H(t) = \int_0^t h(t') dt'\] Models the accumulation of hazard over time.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Survivor function as a function of integrated hazard	\[F(t) = e^{-H(t)}\] or equivalently \[H(t) = -\log(F(t))\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Counting Process	A function \(N : {\mathbb{R}} \to {\mathbb{N}}\) that counts how many times an event has occurred. It always fulfills:<br> \[N(0) = 0\] Always increasing<br> Always right-continuous (increments at an event time, not right after it)<br>	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Survival Analysis as a Counting Process	In survival analysis, we only care about the first time an event happens, so \(N(t) \in \{0, 1\}\).<br> <br>Putting it in our usual notation, we have: \[N(t) = I\{ X \leq t, V = 1 \}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> History (of a stochastic process)	\({\mathcal{H}}_t\) denotes knowledge of everything that has happened in a stochastic process up to and including \(t\)<br> <br>\({\mathcal{H}}_{t-}\) denotes knowledge up to but not including \(t\).<br> <br>For our purposes, \({\mathcal{H}}_{t-}\) is completely captured by membership to the risk set at \(t\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	What do we mean by \(dN(t)\)?	It's called Stieltjes Notation: \[dN(t) = N(t + dt) - N(t)\] It can only take the values \(0\) or \(1\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Intensity \(\lambda(t)\)	\[\lambda_i(t) = Y_i(t)h_i(t)\] <br>where \(Y_i(t) = I\{X_i \geq t\}\) (1 when person \(i\) is still in the risk set). <br>Let \(\Lambda\) be the integrated intensity, so equivalently we have <br>\[d\Lambda_i(t) = \sum_i Y_i(t)dH_i(t)\] <br>Define \(\Lambda_+ = \sum_i \Lambda_i\) and \(Y_+ = \sum_i Y_i\), and assume \(H_i(t) = H(t)\) for all \(i\) (could also assume proportionality instead) <br>\[d\Lambda_+(t) = Y_+(t) dH(t)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Show that \(dN(t) - d\Lambda(t)\) is a martingale	Work from the definition of the derivative of the integrated intensity: \[\begin{align*} {\mathbb{P}}(dN(t) = 1 {\,|\,} {\mathcal{H}}_{t-}) &= d\Lambda(t)\\\\ {\mathbb{E}}[dN(t) {\,|\,} {\mathcal{H}}_{t-}] &= d\Lambda(t)\\\\ {\mathbb{E}}[dN(t) - d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}] &= 0 \end{align*}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Given the per-individual martingale identity \[{\mathbb{E}}[dN(t) - d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}]= 0,\] show that the population derivative of integrated intensity equals the derivative of the population counting process at all \(t\).	Define the population counting process and integrated hazard: \[N_+(t) = \sum_i N_i(t)\] \[\Lambda_+(t) = \sum_i \Lambda_i(t)\] By our assumption, we have: \[\begin{align*} {\mathbb{E}}[dN(t)] - {\mathbb{E}}[ d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}]&= 0\\\\ {\mathbb{E}}[dN(t)] &= {\mathbb{E}}[ d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}] \end{align*}\] So by the method of moments, we have: \[dN_+(t) = d\Lambda_+(t)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Use \(dN_+(t) = d\Lambda_+(t)\) and \(d\Lambda_+(t) = Y_+(t) dH(t)\) to build the Nelson-Aalen Estimator (find an estimator \(d\hat H(t)\))	Substituting in and isolating \(d\hat H(t)\), we get: \[d\hat H(t) = \frac{dN_+(s)}{Y_+(s)}\] Integrate both sides: \[\int_0^t d\hat H(s)ds = \hat H(t) = \int_0^t \frac{dN_+(s)}{Y_+(s)} ds\] But \(N_+(t)\) is just a step function that jumps at each failure time, so this integral is just a sum of 'snapshots' at each \(t_j \leq t\): \[\hat H(t) = \sum_{t_j \leq t} \frac{1}{Y_+(t_j)}\] Sum of snapshots of hazard increments.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Nelson-Aalen Estimator	Given a set of events \(\{a_1, ...,a_j,..., a_n\}\) with no ties, we can estimate the cumulative hazard \(H(t)\) as \[\hat H(t) = \sum_{a_j \leq t} \frac{1}{Y_+(a_j)}\] where \(Y_+(t)\) is the number of people at risk at \(t\) (this is exactly the \(r_j\) used in Kaplan-Meier).<br> <br>Note that we can estimate the survivor function using Nelson-Aalen: \[\hat F(t) = \exp(-\hat H(t))\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Semiparametric Modeling	Model the bits we care about explaining (\(\beta\)) with a parametric method, and model the bits we don't care so much about (\(\psi\)) with a nonparametric method.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Proportional Hazards Modeling	Semiparametric method for modeling hazard. Assume that everyone has hazards in the same proportional hazards family: \[h_2(t) = Kh_1(t),\, K&gt; 0\] Express \(K\) with the explainable parameters \(\beta\), and express \(h_0(t)\) with the nonparametric \(\psi\): \[h(t, z^i, \beta, \psi) = \phi(z^i, \beta)h_0(t, \psi)\] <br>Note that this is a strong assumption!	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Partial Likelihood	An expression of likelihood that neglects some of the data in order to solve for only some of the model parameters.<br> <br>In Cox regression, it was shown that this is mathematically valid!	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> (Partial) Likelihood in Proportional Hazards Modeling	\[L(\beta) = \prod_{i\,:\, v_i = 1} \frac{\phi(z^i, \beta)}{\sum_{i'\in R_i} \phi(z^{i'}, \beta)}\] where \(R_i\) is the risk set at \(t_i\). Note that \(h_0(x_i, \psi)\) gets cancelled out in the fraction.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Cox Regression	Proportional hazards modeling with the following form for \(\phi(z, \beta)\): \[\phi(z, \beta) = e^{\beta^T z}\]<br> <br>This yields \[L(\beta) = \prod_{i\,:\, v_i = 1} \frac{e^{\beta^T z^i}}{\sum_{i'\in R_i} e^{\beta^T z^{i'}}}\] It's easy to take the log-likelihood then: \[\log L(\beta) = \sum_i v_i \left( \beta^Tz^i - \log{\sum_{i' \in R_i} e^{\beta^T z^{i'}}} \right)\] And a derivative with respect to \(\beta\): \[ \frac{d}{d\beta} \log L(\beta) = \sum_i v_i \left( z^i - \sum_{i' \in R_i}\frac{z^{i'}e^{\beta^T z^{i'}}}{ e^{\beta^T z^{i'}}} \right) \] Use Newton-Raphson from here.	MedicalStats PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Partial Recursive Function	The smallest class of functions \({\mathbb{N}}^k \to {\mathbb{N}}\) that contains the three atomic functions, and is closed under composition, primitive recursion, and minimization.<br> <br>For the class of primitive recursive functions, ignore minimization.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The three atomic partial recursive function building blocks	1. Projections: Pick one number out of a list ( \[\Pi_i^k: {\mathbb{N}}^k \to {\mathbb{N}}\] \[\Pi_i^k: (n_1,...,n_k) \mapsto n_i\] 2. Successor: Add one \[S^+ : {\mathbb{N}} \to {\mathbb{N}}\] \[S^+: n \to n + 1\] 3. Zero: Any number to zero \[Z: {\mathbb{N}} \to {\mathbb{N}}\] \[Z : n \mapsto 0\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Partial Recursive Function Closure under Composition	Given \(g: {\mathbb{N}}^k \to {\mathbb{N}}\), and for each \(i \in [k]\) we have a function \(h_i: {\mathbb{N}}^m \to {\mathbb{N}}\), then there exists a function \(f : {\mathbb{N}}^m \to {\mathbb{N}}\) given by <br>\[f(\bar n) := g(h_1(\bar n), ..., h_k(\bar n))\] <br>Note that \(\bar n\) refers to a tuple of natural numbers. <br> Sort of like a weird inverse map-reduce, except we're mapping over a list of functions by applying \(\bar n\) to all of them.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Partial Recursion Function Closure under Primitive Recursion	Given a base function \(g: {\mathbb{N}}^m \to {\mathbb{N}}\) and step function \(h: {\mathbb{N}}^{m+2} \to {\mathbb{N}}\), there exists a function \(f: {\mathbb{N}}^{m+1} \to {\mathbb{N}}\) defined by: \[\begin{align*} f(0, \bar n) &:= g(\bar n)\\\\ f(k+1, \bar n) &:= h(f(k, \bar n), k, \bar n) \end{align*}\] <br>The first parameter is the fuel, the second is the surrounding context (constant through the process)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Partial Recursive Function Closure under Minimization	Given \(g : {\mathbb{N}}^{m + 1} \to {\mathbb{N}}\), there exists a function \(f: {\mathbb{N}}^m \to {\mathbb{N}}\) that maps \(\bar n\) to the smallest \(k\) such that \(g(k, \bar n)=0\).<br> <br>Note that it may not exist!<br> This is sometimes denoted \(f(\bar n) = \mu k. g(k,\bar n) = 0\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Total Recursive Function	A partial recursive function for which every every input maps to some outputs.<br> <br>This just means it always terminates: there's no input that will search forever. All primitive recursive functions are total.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Godel Numbering	Let \(L\) be a language. A Godel Numbering is an injection \(L \hookrightarrow {\mathbb{N}}\) with three properties:<br> <br>1. It's computable: there exists an algorithm you can follow to find the Godel number of any expression in finite time<br> <br>2. Its image is a recursive subset of \({\mathbb{N}}\) (membership is decidable).<br> <br>3. Its inverse (where defined) is also computable.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursively Enumerable Set	\(X \subseteq {\mathbb{N}}\) is recursively enumerable (or recognizable, or semi-decidable) if any of the following are true:<br> <br>1. \(X\) is the image of some partial recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)<br> (dovetail computations and generate a big list)<br> <br>2. \(X\) is the image of some total recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)<br> (just run on all numbers sequentially)<br> <br>3. \(X\) is the domain of some partial recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)<br> (same as 1, but mark the input on the list, not the output)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursive Set	A set \(X \subseteq {\mathbb{N}}\) is recursive if both \(X\) and its complement \({\mathbb{N}} \,\backslash\, X\) are recursively enumerable.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursive Language	There exists an algorithm to check whether a string of symbols is an L-formula<br> <br>(this is generally trivial: just checking if a statement makes sense)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursive L-Theory	An L-Theory \(T\) is recursive if membership in \(T\) is decidable (we can check if a sentence is assumed or not)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Decidable L-Theory	An L-Theory \(T\) is decidable if there is an algorithm for deciding \(T \Vdash \varphi\) for any \(\varphi\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Craig's Trick	Every first-order theory with a recursively-enumerable set of axioms admits a recursive axiomatization<br> <br>(if your list of axioms is recognizable, it can be reconfigured into a decidable one)<br> <br>Proof:<br> There exists a totally recursive \(f\) with \(f(n) = \phi_i\) for \(n \in {\mathbb{N}}\). <br> Map each \(\phi_i\) to a logically equivalent \(\bigwedge_{j=1}^i \phi_i\).<br> The algorithm to check if a statement \(A\) is an axiom is thus:<br> First, check if \(A\) is equivalent to \(f(1)\) (can do this in finite time due to finite sentence length). If it is, return yes, otherwise continue.<br> Second, find any decompositions \(A = \bigwedge_j^i A_j\). Then just check axiom \(i\) for equivalence.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Notation for the Godel numbering of an expression	\(\lceil x \rceil\) maps \(x\) to its Godel numbering<br> \(\lfloor x \rfloor\) maps a Godel numbering \(x\) to the statement in represents	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> The set of total recursive functions (or their respective Godel numberings) is not recursively enumerable	Proof:<br> By definition, there exists some function \(f\) such that, for every total function \(h\), there exists some \(n \in {\mathbb{N}}\) such that \(h = \lfloor f(n) \rfloor\). <br> In particular, this includes \(g(x) = \lfloor f(x) \rfloor (x) + 1\).<br> Show a contradiction from there.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The Language of Arithmetic \(\mathcal{L}_{PA}\)	The first-order language with signature \((0, 1, +, \cdot, &lt;)\) where \(0, 1\) are constants, \(+, \cdot \) are functions with arity two, and \(&lt;\) is a relation with arity two.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The Base Theory of Arithmetic \(PA^-\)	An \(\mathcal{L}_{PA}\)-theory fulfilling:<br> <br>1. \(+, \cdot\) are commutative and associative, and have identity elements 0 and 1.<br> (\(+, \cdot\) individually)<br> <br>2. \(\cdot\) distributes over \(+\).<br> (distributivity)<br> <br>3. \(&lt;\) is a linear ordering compatible with \(+\) and \(\cdot\).<br> (\(&lt;\) is a compatible linear ordering)<br> <br>4. \(\forall x, y,\, (x &lt; y \implies \exists z,\, x + z = y)\).<br> (Quantifier definition of \(&lt;\))<br> <br>5. \(0 &lt; 1\) and \(\forall x,\, (x &gt; 0 \implies x \geq 1)\).<br> (Zero is less than 1, \(\leq\) base definition)<br> <br>6. \(\forall x,\, x \geq 0\).<br> (Everything is greater than or equal to zero)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The Theory of Peano Arithmetic (PA)	\(PA^-\) with the extra set of axioms, the scheme of induction:<br> For each \(\mathcal{L}_{PA}\)-formula \(\varphi(x, \bar y)\), we have the axiom \[I\varphi := \forall \bar y,\, (\varphi(0, \bar y) \land \forall x,\, (\varphi (x, \bar y) \to \varphi(x + 1, \bar y)) \to \forall x,\, \varphi(x, \bar y)\]<br> <br>(proving the base case \(n = 0\) and proving the inductive case \(n \implies n+1\) is enough to prove something for all \(n \in {\mathbb{N}}\))	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Delta-0 Formula (\(\Delta_0\))	A formula whose quantifiers (\(\forall\), \(\exists\)) are bounded: for some fixed \(n \in {\mathbb{N}}\), either \[\forall x,\, x&lt;n,\, \varphi(x)\] or \[\exists x,\, x&lt;n,\, \varphi(x)\] for \(n\) not free in \(\varphi(x)\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Strict Order Axioms	1. Irreflexivity: \(\forall x,\, x \not &lt; x\)<br> 2. Asymmetry: \(\forall a,b,\, a &lt; b \implies b \not &lt; a\)<br> 3. Transitivity: \(\forall a, b, c,\, (a &lt; b) \land (b &lt; c) \implies a &lt; c\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Pi-1 Formula and Sigma-1 Formula (\(\Pi_1\), \(\Sigma_1\))	A formula \(\varphi(\bar x)\) is a \(\Sigma_1\)-formula if it is provably equivalent to a there-exists statement with a \(\Delta_0\)-formula body, ie, there exists a \(\Delta_0\) \(\psi(\bar x, \bar y)\) such that \[PA \vdash \left(\varphi(\bar x) \iff \exists \bar y.\, \psi(\bar x, \bar y) \right)\] Similarly \(\varphi(\bar x)\) is a \(\Pi_1\)-formula if it is provably equivalent to a for-all statement with a \(\Delta_0\)-formula body, ie, there exists a \(\Delta_0\) \(\psi(\bar x, \bar y)\) such that \[PA \vdash \left(\varphi(\bar x) \iff \forall \bar y.\, \psi(\bar x, \bar y) \right)\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	State a total recursive bijection from \({\mathbb{N}}^2 \to {\mathbb{N}}\), denoted \(\langle x, y \rangle\)	\[ \langle x, y \rangle := \frac{(x+y)(x + y + 1)}{2} + y \] This is called the Cantor Pairing Function.<br> <br>Intuition: it's just diagonalization. Count up the triangular number underneath the square we want the number of, then add how many squares into the next triangle we are.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Godel's Beta Function	There exists a function \(\beta(x, y)\) derivable in PA such that for any sequence \(x\), there is a code \(u\) such that \(\beta(u, i)\) = \(x_i\).<br> <br>(Decodes the sequence and gives us the \(i\)th element. This is the projection function...)<br> <br>Note that with Cantor's pairing function, we can reduce this to just one coded number.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Godel's Lemma	Let \(M \Vdash PA\), \(n \in {\mathbb{N}}\), and \(x_0,...,x_{n-1} \in M\).<br> Then there exists a \(u \in M\) such that \(M \Vdash ((u)_i = x_i)\) for all \(i &lt; n\).<br> <br>If M believes in PA, M can prove that Godel's beta function works as intended.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Let \(f: {\mathbb{N}}^k \to {\mathbb{N}}\) be a partial function.<br> \(f\) is (partial) recursive iff there is a \(\Sigma_1\)-formula \(\theta(\bar x, y)\) such that  \[y = f(\bar x) \iff {\mathbb{N}} \Vdash \theta(\bar x, y)\] <br>(ie the graph of \(f\) is \(\Sigma_1\)-definable in \({\mathbb{N}}\))	First we show that if the graph of \(f\) is \(\Sigma_1\)-definable, then \(f\) must be partial recursive. We can just implement a minimization procedure then run the outer there-exists clause, checking every case exhaustively (computing results in \({\mathbb{N}}\) as normal).<br> Constructing a \(\Sigma_1\) statement from a partial recursive \(f\) is more complicated. The base functions are simple. For the rest:<br> Composition: show this is equivalent to a there-exists over each subfunction output, with \(f\) applied to all of them together making the required full output.<br> Primitive Recursion: absolutely requires Godel's beta-function so that you can get an existence statement over the sequence length (we don't know how many times we'll need to run through the loop). <br> A similar argument holds for minimization.	LogicAndComputability PartIIINotes
Basic	Part III Notes::SLIP	State the variance function of the normal EDF	\[V(\mu) = \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the variance function for the binomial EDF	\[V(\mu) = \mu (1 - \mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State the cumulant function for the Poisson EDF	\[K(\theta) = e^\theta - 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\theta\) for the EDF representing \(\textrm{Poisson}(\lambda)\)	\[\theta = \log (\mu)\] \[\mu = e^\theta\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can you find the \(\phi\) for the EDF representing \(\textrm{Poisson}(\lambda)\)	It's always one: \[\phi = 1\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance function for the Poisson EDF	\[V(\mu) = e^{\log \mu} = \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	State three reasons that we might observe overdispersion in our model	1. Missing critical covariates<br> 2. You're using the wrong distribution<br> 3. There is correlation between your datapoints	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Intuitively, what does the negative binomial distribution model?	In a binary experiment with probability \(p\) of failure, the number of failures before \(r\) successes is reached.<br> <br>NOTE: Other places use the convention that \(p\) is the probability of success, not failure. This course uses the 'probability of failure' convention.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Negative Binomial Distribution PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{P}}(X=k) = {k + r - 1 \choose  k} p^k (1-p)^r\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Mean of the Negative Binomial PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{E}}(X) = \frac{rp}{1-p}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance of the Negative Binomial PDF	If \(X \sim \mathrm{NB}(r, p)\): \[{\mathrm{Var}}(X) = \frac{rp}{(1-p)^2}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance Function of the Negative Binomial PDF	\[V(\mu) = \mu + \frac{\mu^2}{r}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What is the relationship between the Negative Binomial distribution and the Poisson distribution?	As \(r \to \infty\),  \[\mathrm{NB}\left(r, \frac{\mu}{\mu + r}\right) \overset{d}{\to} \mathrm{Pois}(\mu)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we need to remember when using the negative binomial in R?	The default link function is the same as for the Poisson RV: \[g(\mu) = \log \mu\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	How can we test between \(H_0\): Poisson model and \(H_1\): negative binomial model?	Use the modified framework: \[2 \left( l_1(\hat \beta, \hat r; Y) - l_0(\hat \beta; Y) \right) \overset{d}{\to} \frac{1}{2} \delta_0 + \frac{1}{2} \chi_1^2\] as \(n \to \infty\), where \(\delta_0\) is the point mass at zero.	SLIP PartIIINotes
Basic	Part III Notes::LogicAndComputability	What is meant when we write \(T(S)\) where \(S\) is a poset?	The set of terminal segments of \(S\): \[\{ U \subseteq S : U \text{ is a terminal segment} \}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What is a simple way to make a Kripke model?	Take a mapping ('valuation') \(v\) from the set of primitive propositions \(P\) to the set of terminal segments \(T(S)\) (known to be a Heyting algebra).<br> <br>This induces a Kripke model by setting \[s \Vdash p \iff s \in v(p)\] for \(s \in S\), \(p \in P\).<br> It's clear to see that this fulfills the persistence property: \[s \Vdash p \implies s \in v(p) \implies (s\uparrow) \subseteq  v(p) \implies s' \in v(p) \implies s' \Vdash p\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Proper Filter	A filter is called 'proper' when it isn't just the whole set \(X\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Extension of Filters	If \(F\) is a proper filter and \(x \not \in F\), then there exists a prime filter extending \(F\) that still avoids containing \(x\) (by Zorn's Lemma)	LogicAndComputability PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Quasi-Likelihood Model	\((x_i, y_i)\) follow a quasi-likelihood model if they are independent and satisfy: \[{\mathbb{E}}[y_i] = \mu_i(\beta) = g^{-1}(x_i^T\beta)\] \[{\mathrm{Var}}[y_i] = V_i(\mu_i(\beta))\] where we fix \(V_i\) and \(g\), and we try and fit \(\beta\).<br> <br>This effectively means that we can just fix some link function and some variance function, without having to specify anything else about our model.<br> Every GLM fits this format.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Quasi-Score	\[U(\beta) = \sum_{i=1}^n \frac{y_i - \mu_i(\beta)}{V_i(\mu_i(\beta))} \mu_i'(\beta) x_i\] where \[\mu_i'(\beta) = (g{^{-1}})'(x_i{^{\mathrm{T}}} \beta)\]<br> <br>Note that this fulfills the score identity \({\mathbb{E}}[U(\beta)] = 0\), and the Fisher information can still be calculated the standard way, \(-{\mathbb{E}}[\frac{du}{d\mu(\beta)}]\)<br> <br>We can estimate \(\beta\) by solving the estimating equation \(u(\hat \beta) = 0\) (called the Quasi-Likelihood Estimator)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Quasi-Likelihood Estimator Distribution	\[(X{^{\mathrm{T}}} W X)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] where \(W\) is the diagonal matrix with entries \[W_{ii} = \frac{\mu_i'(\hat \beta)^2}{V_i(\mu_i(\hat \beta))}\] where \(\hat \beta\) is the Quasi-Likelihood Estimator.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Linear Mixed Effects Model	Given observations \((x_{ij}, z_{ij}, Y_{ij})\), with \(i \in [n]\) groups and \(j \in [d]\) individuals per group, <br>\[Y_i = X_i\beta + Z_iu_i + {\varepsilon}_i\] where \(X_i \in {\mathbb{R}}^{d \times p}\),<br> where \(Z_i \in {\mathbb{R}}^{d \times q}\),<br> \(u_i \overset{iid}{\sim} N(0, \Sigma_u)\) are random effects (\(\Sigma_u\) is positive semi definite in \({\mathbb{R}}^{q\times q}\)),<br> \({\varepsilon}_i \overset{iid}{\sim} N(0, \sigma^2 I_d)\) (independent of \(u_i\)),<br> \(\beta \in {\mathbb{R}}^p\) are fixed effects<br> <br>\(Y_i\) represents all the observations from group \(i\).<br>	SLIP PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Precision	The inverse of variance, often \(\sigma^{-2}\) or \(\Sigma^{-1}\)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Posterior of two multivariate normals, one determining the mean of the other<br>((same as proportional product of two normal pdfs))	Given \(N(\vec \mu_1, \Sigma_1)\) and \(N(\vec \mu_2, \Sigma_2)\), the product of their pdfs is a new \(N(\mu^*, \Sigma^*)\) with \[\Sigma^* = \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right){^{-1}}\] \[\mu^* = \Sigma^* \left(\Sigma_1^{-1}\vec\mu_1 + \Sigma_2^{-1}\vec\mu_2\right)\] For the posterior result, just use \(\mu_2 := A\).<br> <br>The precisions just sum. The mean is the precision-weighted average.	Astrostatistics PartIIINotes
Basic	Part III Notes::MedicalStats	The two kinds of ties in a Nelson-Aalen model, and how we can account of them	1. Tie by lack of precision<br> Suppose in the dataset ordered 1,2,3,4, 2 and 3 tied. Then we can account for this by just adding the two possibilities: \[...\left( \frac{\phi_2 }{\phi_2 + \phi_3 +\phi_4}\frac{\phi_3}{\phi_3 + \phi_4} + \frac{\phi_3 }{\phi_2 + \phi_3 +\phi_4}\frac{\phi_2}{\phi_2 + \phi_4} \right)...\] We can approximate this with \[...\left( \frac{\phi_2\phi_3 }{(\phi_2 + \phi_3 +\phi_4)(\frac{1}{2}\phi_2 + \frac{1}{2}\phi_3 + \phi_4)} \right)...\] 2. Genuine Tie<br> Same example, use the following: \[...\left( \frac{\phi_2\phi_3 }{\phi_2\phi_3 + \phi_2\phi_4 + \phi_3\phi_4} \right)...\]	MedicalStats PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Chinese Remainder Theorem	Take integers a list of \(k\) integers \(n_1,...,n_k\), all greater than one. Assume they are pairwise coprime: no pair shares any divisor besides 1. no Let \(N\) be their product, \(n_1...n_k\). Specify a list of \(k\) possible remainders for each \(n_k\), \(0 \leq a_k &lt; n_k\).<br> <br>Then there exists only one possible solution \(0 \leq x &lt; N\) such that the remainder of \(x\) with each \(n_i\) is \(a_i\).<br> <br>Another way to state it: with the same assumptions above, the system \[\begin{align*} x \equiv a_1\mod n_1\\\\ ...\\\\ x \equiv a_k\mod n_k \end{align*}\] has only one solution \(x\) mod \(N\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	How do we encode numbers for use in Godel's Beta Function?	Suppose we want to code the numbers \( x_0, ..., x_{n-1}\). Let \(m = \max (0, x_0, ..., x_{n-1})!\). Define pairwise coprime \(b_1 :=m + 1,\, b_2:= 2m + 1, ..., b_n:= nm + 1\). Then our tuple becomes coded as the unique \(x\) guaranteed by the Chinese Remainder Theorem such that the remainder by \(b_i\) is \(x_i\). Thus we have \(u = \langle a, m \rangle\) <br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Estimating Baseline Hazard	\[ \hat H_0(t) = \sum_{i : x_i \leq t,\, v_i = 1} \frac{1}{\sum_{i' \in R_i}\hat \phi(z_{'i}, \beta)} \]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Many Baseline Hazards Modeling	Suppose we had a treatment trail where individuals in the US and the UK are separately given the same medication. We expect the treatment effect to be the same, but the baseline risk to be different. <br> <br>To model this: we say there are \(L\) strata, with the function \(q(i)\) mapping the \(i\)th individual to their strata group. Then just use the custom risk set: \[R_i := \{ i' :\, x_{i'} \geq x_i \land q(i') = q(i) \}\] For an event occurring to individual \(i\), we only consider other individuals in the same strata (with known events occurring after that individual's events). <br> <br>Estimate \(\beta\) by taking the product of each stratum's partial likelihood. Estimate the baseline hazard separately.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Matched Pair Analysis Setup	Smallest possible stratification. We have two treatments \(k \in \{ 0,1\}\) and \(L\) pairs where one is exposed to treatment, and the other is not. Using our partial hazard split per individual: <br>\[h_{k, l}(t) = \phi_k h_l(t)\] <br>Clearly we can't estimate each individual \(h_l(t)\), but we can estimate the \(\phi_k\) shared among each pair. <br> <br>Assume we have the baseline effect \(\phi_0 = 1\), and the treatment effect \(\phi_1 = e^\beta\) (or combined \(\phi_k=e^{k \beta}\). \(\beta\) is called the log hazard ratio between treatments.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Matched Pair Analysis Likelihood Contributions	Ignore pairs with ties: nothing useful we can learn from a tie. There's only two pieces of information we need from each pair \(l\): \[k(l) = I\{X_{1,l} &lt; X_{0,l} \}\] Just the index of whichever treatment fails first. \[v(l) = V_{k(l), l}\] 0 if it was actually a censoring, not a failure; 1 if not.<br> This yields the full likelihood contribution: \[ \left(\frac{e^{k(l)\beta}}{1 + e^\beta} \right)^{v(l)} \] This ends up just being equivalent to a binomial likelihood!<br> <br>Note that there's no likelihood contribution when:<br> - there's a tie<br> - one individual is censored before the other has an event<br> - both individuals are censored	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	When we graph log(survival) and log(log(survival)) over time, what are we graphing?	log(survival): Integrated hazard, slope is hazard<br> log(log(survival)): We'll see a constant vertical difference between two curves if their hazards are proportional.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Probability Integral Transform for Hazard	Take the random variable \(T\) with integrated hazard \(H(t)\). Then the random variable \(U := H(T)\) is a \(Exp(1)\) distribution. <br>Proof: \[\begin{align*} {\mathbb{P}}(u &gt; U) &= {\mathbb{P}}(u &gt; H(T))\\\\ &= {\mathbb{P}}(H^{-1}(u) &gt; T) \\\\ &= F(H^{-1}(T))\\\\  &= e^{-H(H^{-1}(u))} \\\\ &= e^{-u}  \end{align*}\] <br>We can use this to quantify how accurate our models are!	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Cox-Snell Residual	Fit the model to construct \(y_i = \hat H_i(x_i)\). By the PIT for Hazard, we expect \(y_i \sim Exp(1)\). <br> <br>Problem: we still may have censored individuals. So we can use a method like Kaplan-Meier to instead build a nonparametric representation of \(\hat F_{y_i}\).<br>  <br>Graphically, \(\hat F_y(y)\) should look like a straight line against \(y\), with a slope of -1.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Trick for accounting for smaller-than-expected \(y_i\) mean in computing the Cox-Snell Residual	Add one to the \(y_i\)s for each censored individual: \[y_{i}' := y_i + (1-v_i)\] <br>This comes from the memoryless property of the exponential: \[{\mathbb{E}}[Y | Y + c] = c + 1\] This strategy is called mean imputation.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Martingale Residual	\[y_{i}'' = 1-y_{i}' = v_i - \hat H_i(x_i)\] We know this must have mean zero: \({\mathbb{E}}[y_{i}''] = 0\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How can we use Martingale plots as a means of designing an explanatory model?	1. Fit a model without using explanatory variables (\(z\)).<br> 2. Plot the Martingale residuals \((y_i'')\) against \(z\): where the graph is above 0, we see more events than expected. For instance if you see a log plot, that implies you should probably use Cox regression.<br> 3. Fit a fixed dataset and see if the means have been corrected towards zero.	MedicalStats PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Marginal Formulation of Mixed Effects Models	Just compute expectation and variance of \(Y_i\): \[V_i := Z_i \Sigma_u Z_i{^{\mathrm{T}}} + \sigma^2 I_d\] \[Y_i \sim N(X_i\beta,\, V_i)\]<br> <br>Let \(Y = (Y_1^T ... Y_n^T)\), similar for \(X\) and \(Z\), and let \(V = diag(V_1,...,V_n)\). Taking the log-likelihood of the random multivariate normal pdf yields: \[l(\beta, \sigma^2, \Sigma_u) = -\frac{1}{2} nd\log(2\pi) - \frac{1}{2}\log(\det(V)) - \frac{1}{2} (Y - X\beta)^T V{^{-1}} (Y - X\beta)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	MLE for \(\beta\) in a linear mixed effects model	\[\hat \beta = (X{^{\mathrm{T}}} V{^{-1}} X){^{-1}} X{^{\mathrm{T}}} V{^{-1}} Y\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Restricted Maximum Likelihood Estimation (ReML) for Linear Mixed Effects Models	We want a matrix \(L\) such that \(LX = 0\). Use \(L = I - P = I - X(X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}}\), then \(LY\) simplifies nicely: \[LY = LX\beta + L[Z_iu_i] + \varepsilon \sim N_r(0, LVL{^{\mathrm{T}}})\] So just estimate \(\sigma^2\), \(\Sigma_u\) using that as the MLE.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Generalized Linear Mixed Effects Models (GLMMs)	Assume the normal setup for linear mixed effects models, but let \(u_i\) have an arbitrary distribution with unknown parameter \(\alpha\): \(u_i \overset{d}{\sim} {\mathbb{P}}_\alpha\). Assumes \[g({\mathbb{E}}[Y_{ij} | u_i]) = x_{ij}{^{\mathrm{T}}} \beta + z_{ij}{^{\mathrm{T}}} u_i\] <br>Estimate \((\alpha, \beta, \phi)\) by MLE: \[f(y; \alpha, \beta, \phi) = \int f(y {\,|\,} u; \beta, \phi) f(u;\alpha)du\]	SLIP PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> The Persistence Property for Kripke Models	\[(s \Vdash p) \land (s \leq s') \implies s' \Vdash p\] Worlds are ordered by the knowledge they believe.<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	How do we actually implement Godel's beta function given an encoded number?	\[ \beta(u, i) := a \% (m(i+1) + 1) \] where \((a, m)\) are the unique numbers such that \(u = \langle a, m \rangle\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Frailty	Each individual has some frailty \(u\) that describes how susceptible they are to risk. We denote the population risk \(\bar F(t)\): <br>\[\bar F = {\mathbb{E}}_u[F(t{\,|\,} U = u)]\] <br>Note however that this doesn't imply that the population hazard \(\bar h(t)\) is \({\mathbb{E}}_u[h(t{\,|\,} u)]\): \[\bar h(t) = \frac{\bar f(t {\,|\,} u)}{\bar F(t {\,|\,} u)}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Population Hazard under Proportional Frailty	\[ h(t{\,|\,} U=u) = uh_0(t)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Gamma Distribution PDF	\[Gamma(x; p, \lambda) = \frac{\lambda^p x^{p-1}e^{-\lambda x}}{\Gamma(p)}\] \(p\) is called the shape or index parameter, \(\lambda\) is the rate parameter.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Mean of the Gamma Distribution	\[\frac{p}{\lambda}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Variance of the Gamma Distribution	\[\frac{p}{\lambda^2}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Laplace Transform of the Gamma Distribution	With \(Gamma(x; p, \lambda)\): \[\left(\frac{p}{1 + p} \right)^{\lambda}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Population Hazard under Gamma-distributed Frailty	\[\bar h(t) = \frac{p h_0(t)}{p + H_0(t)}\] <br>A lower \(p = \lambda = \psi\) means a higher variance among individuals: they either die off quickly or stick around.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Cox Regression with Frailty	\[h_{z}(t {\,|\,} U = u) = u e^{\beta z}h_0(t)\] <br>You can use this to substitute into population hazard: <br>\[\bar h(t) = \frac{p e^{\beta t} h_0(t)}{p + e^{\beta z} H_0(t)}\] <br>Note then the population hazard ratio will start at \(e^\beta\) at \(t = 0\), but then converge to 1 as \(t \to \infty\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	What is the central problem with frailty in practice? How can we resolve this?	There's not enough information in \((X_i, V_i)\) pairs alone to assess \(g\) and \(H_0\) separately. Three resolutions:<br> <br>1. Be aware that we can't tell apart a homogeneous population with decreasing hazard from a population varying in frailty but with constant hazard over time.<br> <br>2. Measure as many explanatory variables as possible<br> <br>3. Use a model where frailty goes into the error term (accelerated frailty): \[F(t{\,|\,} u) = F_0(ue^{\beta z}t)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	The two approaches to competing risks modeling	1. Two different times to events, \(T_A\) and \(T_b\), we only observe one.<br> <br>2. One time-to-event variable \(T\), and at event time, choose between event type \(A\) or \(B\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	What do \(F(t)\), \(\tilde f_A(t)\), and \(G_A(t)\) represent in competing hazards modeling, and how do we compute them?	\(F(t)\) is just the probability of any event before \(t\), so  \[F(t) = \exp(-H_A(t) - H_B(t))\] <br>\(\tilde f_A(t)\) is the event density at \(t\) (not a proper distribution): \[\tilde f_A(t) = h_A(t)F(t)\] <br>\(G_A(t)\) is the probability of a the specific event \(A\) occurring before or at \(t\), so  \[G_A(t) = \int_0^t \tilde f_A(t') dt'\]<br> <br>Note we can also compute \(\tilde F_A(t) = \exp(-H_A(t))\), the survivor function for \(A\) if the event \(B\) entirely disappeared.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Aalen-Johansen Estimator	\[\hat G_A(t) = \sum_{j: a_j \leq t} \left[ \prod_{j'=1}^{j' = j-1} (1 - \Delta \hat H_{j'}) \right] \Delta \hat H_j^A\] <br>where  \[\Delta \hat H_j = \frac{1}{r_j}\] \[\Delta \hat H^A_j = \frac{1}{r_j}{\mathbb{1}}_{E_j = A}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Net Survival Setup	Want to consider excess deaths (E) versus the chance of a background event that has nothing to do with the disease we're studying (B). <br>\[h^i(t) = h^i_B(t) + h_E^i(t)\] <br>We want an unbiased estimator for  \[d\bar H_E(t) = \frac{\sum_i F_E^i(t) dH_E^i(t)}{\sum_i F_E^i(t)}\] Incremental excess integrated population hazard<br> <br>(Note that we assume \(h^i_B(t)\) is known from government data)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Pohar-Perme Estimator	\[d\tilde H_E(t) = \frac{\sum_i dN_i(t) / F_B^i(t)}{\sum_i Y_i(t) / F_B^i(t)} - \frac{\sum_i Y_i(t) dH^i_B(t) / F_B^i(t)}{\sum_i Y_i(t) / F_B^i(t)}\]	MedicalStats PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Hierarchial Form of Linear Mixed Effects Models	\[u_i \sim N(0, \Sigma_u)\] \[Y_i {\,|\,} u_i \sim N(X_i\beta + Z_iu_i,\, \sigma^2 I_d)\]<br>	SLIP PartIIINotes
Basic	Part III Notes::SLIP	What do we mean by \(X_{(i)}(x)\) and a corresponding \(Y_{(i)}\)?	It's just a permutation of the data, sorting by distance to a point \(x\): \[||X_{(1)}(x) - x|| \leq ||X_{(2)}(x) - x|| \leq ... \leq ||X_{(n)}(x) - x||\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> \(L\)-Nearest Neighbors Classifier	We have the conditional class probabilities: \[\hat p_k(x) = \frac{1}{L}\sum_{i=1}^L {\mathbb{1}}_{Y_{(i)} = k}\] Then we just use the plug-in estimation for the classifier: \[h^{LNN}(x) := {\underset{k}{\mathrm{argmax}}\,\,}\hat p_k(x)\]<br> <br>(majority vote among \(L\) nearest neighbors)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Statement of the bounds on the 1NN Classifier Risk bounds theorem	Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.<br> <br>Then the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> 1NN Classifier Risk<br> Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.<br> Show that the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\]	First show that \({\mathbb{P}}(h^{1NN}(x) = 1 {\,|\,} X_1 ... X_n) = p_1(X_{(1)}(x))\), and likewise for \(k=2\). <br> <br>Use this to show \(R(h^{1NN}) = {\mathbb{E}}[p_2(X_{(1)}(X^*))p_1(X^*) + p_1(X_{(1)}(X^*))p_2(X^*)\)<br> <br>Make the approximation that \(p_1(X_{(1)} (X^*)) \approx p_1(X^*)\) by continuity. <br> <br>Plug into \(\lim_{n \to \infty} R(h^{1NN})\), recognize that \(2 {\mathbb{E}}[p_1(X^* )(1-p_1(X^* ))] \leq 2R(h^*) \)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bootstrap Aggregation (Baggining)	Given some low-bias classifier \(h\):<br> 1. For \(b = 1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT from observations. Use this to compute the classifier \(h^{(b)}\).<br> 2. Create a new compound classifier by majority voting among all \(b\)s: \[h^{bag}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \frac{1}{B}\sum_{b=1}^B {\mathbb{1}}_{h^{(b)} (x) = k}\] or for regression, just an average: \[\hat f^{bag}(x) = \frac{1}{B}\sum_{b=1}^B f^{(b)} (x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Benefit of bagging	Won't change the bias, but can reduce variance (because bootstrapped samples are more independent).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> m-out-of-n Bagging<br>(and its benefits)	Instead of drawing \(n\) samples with replacement, draw \(m &lt; n\) samples (still with replacement).<br> <br>This increases independence among bagged samples, further decreasing the variance.<br> <br>This yields a consistent estimator when \(B, m \to \infty\) with \(m / n \to 0\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Gini Index	\[G(R) := \sum_{k=1}^K \hat p_k(R)(1 - \hat p_k(R))\] where \[\hat p_k (R) := \frac{1}{N(R)}\sum_{i: X_i \in R} I_{Y_i = k}\] (fraction of points in \(R\) that correspond to class \(k\))<br> <br>Measures the impurity of a region \(R\): we want regions with a lot of points with the same label. \\(pick a label randomly from the pool of labels we know the points have; what's the misclassification rate of just picking randomly from those labels, to label each point?)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Classification Tree Loss	For classification trees, we use the loss: \[Q(j,t) := \frac{N(R_l(j,t))}{N(R)}G(R_l(j,t)) + \frac{N(R_r(j,t))}{N(R)}G(R_r(j,t)) - G(R)\] where \(G(R)\) is the Gini index.<br> <br>(Reduction in impurity that splitting gains, weighted by the fraction of points that each region takes)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Final classification function of a classification tree	\[h^{CART}(x) := \sum_{m=1}^M \hat c_m I_{x \in R_m}\] where \[\hat c_m = {\underset{k}{\mathrm{argmax}}\,\,} \hat p_k(R_m)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Bias-Variance Tradeoff for CART	Shallow trees: high bias, low variance<br> Deep trees: low bias, high variance	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Random Forest Procedure	Similar to bagging:<br> 1. For \(b=1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT. At each split, pick the best out of some random selection \(m_{try}\) of the \(p\) variables. Continue until it has grown maximally.<br> 2. Aggregate all \(B\) tree functions. For regression, just take the average of applying each tree function; for classification, just take the majority vote.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Variance of a regression random forest	\[{\mathrm{Var}}[f^{RF} (x)] = \rho \sigma^2 +\frac{1 - \rho}{B} \sigma^2\] <br>where for \(b \neq b'\), \(\sigma^2 = {\mathrm{Var}}[f^{(b)}(x)]\) and \(\rho\) is the correlation between \(f^{(b)}(x)\) and \(f^{(b')}(x)\).<br> <br>(split correlation into diagonal and non-diagonal bits in the sum)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Out-of-bag Error (OOB)	Just the average difference between \(Y_i\) and an aggregated forest prediction \(\hat Y_i\). Estimates the prediction error. <br> <br>Note this approaches leave-one-out cross validation error as \(B \to \infty\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Cook's Distance, Intuitively	A datapoint \((x,y)\)'s cook's distance is how much it would affect the final fit if it were to be removed.	SLIP PartIIINotes
Basic	Part III Notes::MSM	Vector generalization of the mean squared error (MSE) for an estimator \(\hat \theta\) of \(\theta\), and what it simplifies down to	\[{\mathbb{E}}[(\hat \theta - \theta) (\hat \theta - \theta )^T ] = {\mathrm{Var}}(\hat \theta) + Bias(\hat \theta) Bias(\hat \theta)^T = {\mathrm{Var}}(\hat \theta) + {\mathbb{E}}[\hat \theta - \theta ]{\mathbb{E}}[\hat \theta - \theta ]^T\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Equation optimized for Ridge Regression	\[ (\hat \mu_\lambda^R, \hat \beta_\lambda^R) = \underset{(\mu, \beta) \in {\mathbb{R}} \times {\mathbb{R}}^p}{argmin}(||Y - \mu 1 - X \beta||^2_2 + \lambda ||\beta||^2_2) \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Solution for \(\hat \beta_\lambda^R\) in Ridge Regression	\[(X^TX + \lambda I){^{-1}} X^T Y\]	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about the differences between OLS solutions \(\beta^{OLS}\) and ridge regression solutions \(\beta^R\)?	As \(\lambda \to 0\), the difference in MSEs between \(\beta^{OLS}\) and \(\beta^{R}_\lambda\) is positive definite: \[{\mathbb{E}}[(\beta^{OLS} - \beta_0)(\beta^{OLS} - \beta_0)^T] - {\mathbb{E}}[(\beta^{R} - \beta_0)(\beta^{R} - \beta_0)^T] \succeq 0\]<br> <br>Show this by first computing the bias and variance of \(\hat \beta ^R_\lambda\).	MSM PartIIINotes
Basic	Part III Notes::MSM	In MSM, what does the notation \(X_j\) mean?	The \(j\)th column of \(X\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What does SVD tell about when ridge regression works best?	\(\lambda\) shrinks Y most in the small principal components of X. So if most of the signal is in the large principal components, we won't lose much.	MSM PartIIINotes
Basic	Part III Notes::MSM	Show how the kernel trick leads to an alternate form of \(\hat \beta_\lambda^R\)	Note first that for \(X \in {\mathbb{R}}^{n \times p}\), \[(X^TX + \lambda I_n) X^T = X^T(XX^T + \lambda I_p)\] Multiply by inverses on each side and then multiplying by Y on the right yields \[(X^TX + \lambda I_n){^{-1}} X^T Y = X^T(XX^T + \lambda I_p){^{-1}} Y = \hat \beta_\lambda^R\] <br>Note that the fitted values \(X\hat \beta_\lambda^R\) only depend on \(XX{^{\mathrm{T}}}\) then!	MSM PartIIINotes
Basic	Part III Notes::MSM	Kernel Form of Ridge Regression	\[X\hat \beta_\lambda^R = K(K + \lambda I){^{-1}} Y\] where \(K = X X{^{\mathrm{T}}}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	Feature Map	A function  \(\phi:X \to H\) from X to some Hilbert space \(H\).<br> <br>Note that this gives us a similarity measure \(k: X \times X \to {\mathbb{R}}\): \[k(x, x') = \langle \phi(x), \phi(x') \rangle\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Positive Definite Kernel	A symmetric map \(k: {\mathcal{X}} \times {\mathcal{X}} \to {\mathbb{R}}\) such that the matrix \(K\) with entries  \[K_{ij} = k(x_i, x_j)\] is positive semi-definite.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Cauchy-Schwarz for Kernels	\[k(x, x')^2 \leq k(x,x)k(x',x')\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Linear Kernel	\(k(x,x') := x{^{\mathrm{T}}} x'\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Polynomial Kernel	\(k(x,x') := (1 + x{^{\mathrm{T}}} x')^d\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Gaussian Kernel	\(k(x,x') := \exp\left(-\frac{||x - x'||_2^2}{2h^2}\right)\) \(h&gt;0\) is called the bandwidth.<br> <br>Note that this must be infinite-dimensional because we have a term \(e^{xx'}\), which Taylor expands to an infinite series of polynomial terms.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> First-Order Sobolev Kernel	\[k(x, x') = \min(x, x')\]<br> This must be a kernel because it can be represented as \[ k(x,x') = \int_0^1 {\mathbb{1}}_{[0, x]}(u){\mathbb{1}}_{[0, x']}(u) = \langle {\mathbb{1}}_{[0, x]}, {\mathbb{1}}_{[0, x']} \rangle_{L_2([0,1])} \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Second-Order Sobolev Kernel	\[ k(x, x') = \int_0^{\min(x,x')} (x - u)(x' - u)du \]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Jaccard Similarity	Similarity based on what features are shared among \(1...p\). \(h(x,x')\) is defined as 1 when \(x \cup x' = \emptyset\), and otherwise \[h(x,x') = \frac{|x \cap x'|}{|x \cup x'|}\] where \(|.|\) is the set size.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Spectral Theorem	If \(X\) is symmetric, we have a decomposition  \[X = P{^{\mathrm{T}}} D P\] where \(D\) is diagonal and \(P\) is orthogonal. The rows of \(P\) are the eigenvectors of X. <br> <br>Furthermore, if \(X\) is positive definite, every diagonal term in \(D\) is positive.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> For every kernel \(k\) on FINITE \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\]	Because \({\mathcal{X}}\) is finite and \(k\) is a kernel, we have a positive definite \(K \in {\mathbb{R}}^{n \times n}\) that contains all information about the space. Because \(K\) is positive definite, we have the decomposition \(K = P{^{\mathrm{T}}} DP\) with orthogonal \(P\) and positive diagonal \(D\). Take \({\mathcal{H}}\) to be \({\mathbb{R}}^n\) and the feature map \(\phi: x_i \mapsto (D^{1/2} P_i) \)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> For every kernel \(k\) in a general \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\]<br> <br>First specify the \({\mathcal{H}}\) and \(\phi(x)\) used.	Take \({\mathcal{H}}\) to be the linear span of the functions \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] for some \(n \in {\mathbb{N}}\), \(x_i \in {\mathcal{X}}\), and \(a_i \in {\mathbb{R}}\) with the inner product defined as \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] This is clearly well-defined since there's no dependence on the expansion of \(f\) or \(g\). So our feature map is  \[\phi(x) = k(\cdot , x)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Show that \({\mathcal{H}}\) defined as the space of functions of the form \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] with inner product \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] and \[\phi(x) = k(\cdot , x)\] is a valid kernel and inner product.	First show that we indeed have  \[k(x,x') = \langle\phi(x), \phi(x')\rangle\] Clearly symmetric and linear, but is it positive definite?<br> Note first that \(\langle f, g\rangle \geq 0\) by positive definiteness of the kernel. So then for functions \(f_1...f_n\), coefficients \(\gamma_1...\gamma_n\), and working from the PSD requirement, we have: \[\sum_{i,j} \gamma_i \langle f_i, f_j \rangle \gamma_j = \left\langle \sum_i f_i \gamma_i, \sum_j f_j \gamma_j \right\rangle \geq 0\] Thus \(\langle\cdot , \cdot \rangle\) is a kernel. The last part to show is that \(\langle f, f\rangle = 0 \iff f = 0\). The backwards direction is obvious, but for the forwards direction, first note that \[\langle k(\cdot,x), f\rangle = \sum_i a_i k(x_i, x) = f(x)\] So, we have: \[\langle f, f\rangle = 0 \implies f(x) = \langle k(\cdot,x), f\rangle \leq \langle k(\cdot,x), k(\cdot,x)\rangle\langle f, f\rangle = 0\] where the \(\leq \) step is given by Cauchy-Schwarz for kernels.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Show the representer theorem:<br> Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of<br> \[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]<br> is given by<br> \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]<br> with \(\hat \alpha\) minimizing:<br> \[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\]	Let \(U\) be the linear span of \(k(\cdot, x_i), i = 1...n\). Linear subspaces of Hilbert spaces are closed; and closed subspaces of Hilbert spaces mean that any function \(f = u + v\): \(u \in U\) and \(v \in U^\perp\).<br> Show first that \(f(x_i) = u(x_i)\) by means of the definition of RKHS, and that \(J(||f||^2_{\mathcal{H}}) \geq J(||u||^2_{\mathcal{H}})\). Reason then that \(Q_1\) is minimized when \(v = 0\).<br> Show then that we can express \(||u||^2_{\mathcal{H}}\) as \(\alpha^T K \alpha\) and \(f(x_i)\) as \(K \alpha\); thus \(Q_1(f) = Q_2(\alpha)\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Functional Generalization of Ridge Regression	Find the \(\hat f_\lambda\) minimizing \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	How can we simplify \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\] with the Representer Theorem?	Using \(c(...) = \sum_{i=1}^n (Y_i - f(x_i))^2\) and \(J(||f||^2_{\mathcal{H}}) = \lambda||f||^2_{\mathcal{H}}\), we have that the above is equivalent to minimizing \[||Y - K\alpha||^2_2 + \lambda \alpha{^{\mathrm{T}}} K \alpha\]<br> <br>Note that this gives us a way to predict an unknown point: \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i k(x, x_i)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about orthogonal matrices in the L2 norm?	Because an orthogonal matrix never scales anything, we have: \[||UX||^2_2 = ||X|^2_2 = ||XU||^2_2\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Trick to pull orthogonal matrices out of an inverse like \((UDU{^{\mathrm{T}}} + \lambda I){^{-1}}\)	Note that \(I = U{^{\mathrm{T}}} U = UU{^{\mathrm{T}}}\), so we have: \[\begin{align*} (UDU{^{\mathrm{T}}} + \lambda I){^{-1}} &= (UDU{^{\mathrm{T}}} + U\lambda IU{^{\mathrm{T}}}){^{-1}}\\\\& = (U(D + \lambda I)U{^{\mathrm{T}}}){^{-1}}\\\\& = U^{-T} (D+\lambda I){^{-1}} U{^{-1}}\\\\& = U(D + \lambda I)U{^{\mathrm{T}}} \end{align*}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	AM-GM	The arithmetic mean is always at least as large as the geometric mean: for all \(a, b \in {\mathbb{R}}\), \[\frac{a+b}{2} \geq \sqrt{ab}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	How can we bound a term like \(\frac{ab}{(a+b)^2}\)?	Use AM-GM: \[\frac{a+b}{2} \geq \sqrt{ab}\] \[a+b \geq 2\sqrt{ab}\] \[(a+b)^2 \geq 4ab\] \[\frac{ab}{(a+b)^2} \leq \frac{ab}{4ab} = \frac{1}{4}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Prove that the mean squared prediction error (MSPE) may be bounded  above in the following way: \[\begin{align*} \frac{1}{n}\mathbb{E}\left\{\sum_{i=1}^n \{f^0(x_i) - \hat{f}_\lambda(x_i)\}^2\right\}  &\leq \frac{\sigma^2}{n}\sum_{i=1}^n \frac{d_i^2}{(d_i+\lambda)^2} + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n} \\\\ &\leq \frac{\sigma^2}{n}\frac{1}{\lambda}\sum_{i=1}^n \min(d_i/4, \lambda) + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n}. \end{align*}\]	Proof sketch:<br> By the representer theorem, we have that \[[\hat f_\lambda^R(x_i)] = K(K+\lambda I){^{-1}} Y\] and \[[\hat f_\lambda^0(x_i)] = K\alpha\] Use the eigendecomposition \(K = UDU{^{\mathrm{T}}}\), \(U\) orthogonal. <br> Let \(\Theta = D U{^{\mathrm{T}}} \alpha\), note that \(Y = K \alpha + {\varepsilon}\). Split into deterministic and stochastic parts (and swap them, stochastic is the first term in the final representation)<br> In the deterministic term, use the \(D{^{-1}}\) trick to introduce \(\alpha{^{\mathrm{T}}} K \alpha = ||f^0||^2_{\mathcal{H}}\); get the \(1/4\) from bounded the fraction with AM-GM. <br> In the stochastic bit, use the trace trick to get the \(\epsilon\)s together, where you can bring the expectation in to get \(\sigma^2\).<br> The final equality falls out of AM-GM again.<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Eigenfunction	Given a random variable \(X\) taking values in \({\mathcal{X}}\), we say a nonzero function \(e \in {\mathcal{H}}\) is an eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) if \[\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\] in other words, the \({\mathbb{E}}[k(x,X)...]\) part is a linear operator acting upon \(e\): <br>\[(Te)(x) = \int k(x,x') e(x')dP(x')\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Mercer's Theorem	Given a random variable \(X\) taking values in \({\mathcal{X}}\) with a nonzero \(e \in {\mathcal{H}}\) eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) (so that \(\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\) is fulfilled), and under some mild regularity conditions (like \({\mathbb{E}}(k(X,X)) &lt; \infty\)), then three things must hold true:<br> <br>1. The set of positive eigenvalues is at most countable.<br> <br>2. The subspace spanned by each eigenfunction has a finite dimension (known as the multiplicity of the eigenvalue)<br> <br>3. The eigenvalues are orthonormal in the sense that  \[{\mathbb{E}}[ e_j(X) e_k(X)] = {\mathbb{1}}_{j=k}\] See the decomposition of kernels.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Bochner's Theorem	Let \(k: {\mathbb{R}}^p \times {\mathbb{R}}^p \to {\mathbb{R}}\) be a continuous kernel. Then \(k\) is shift-invariant iff there exists some \(c &gt;0\) and distribution \(F\) on \({\mathbb{R}}^p\) such that when \(W \sim F\),  <br>\[k(x,x') = c {\mathbb{E}}[ e^{i(x-x'){^{\mathrm{T}}} W}] = c {\mathbb{E}}[ \cos ((x - x') {^{\mathrm{T}}} W)]\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Mean-Squared Prediction Error (MSPE)	\[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2]\] Measures the average difference between true values and predicted values.	MSM PartIIINotes
Basic	Part III Notes::MSM	Why do we expect that Lasso regression could be useful?	Let \(S\) be the set \(\{ k : \beta_k^0 \neq 0 \}\) (the set of parameters that are nonzero). Oftentimes this set is smaller than \(p\), the total number of parameters. When this is the case, OLS could stand to be a lot better: consider the MSPE: \[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2] = ... = \frac{\sigma^2p}{n}\] If we just dropped each of the \(p\)s that aren't in \(S\), we could significantly reduce the error.	MSM PartIIINotes
Basic	Part III Notes::MSM	Function minimized in Lasso regression	Find the \((\hat \mu^L, \hat \beta ^L)\) minimizing \[  \frac{1}{2n}||Y - X \beta||_2^2 + \lambda || \beta ||_1 \] <br>We generally assume that \(X\) is already mean-centered, giving us the loss we actually seek to minimize: \[Q_\lambda(\beta) := \frac{1}{2n}||Y - X\beta||_2^2 + \lambda ||\beta||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	Given the minimizer \(\hat \beta ^L_\lambda\) in lasso regression, how can we express this as a conditional optimization problem?	\[||Y - X \beta||^2_2\, \text{ subject to } ||\beta||_1 \leq ||\hat \beta ^L_\lambda||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Holder's Inequality	Let \(p, q \in [1, \infty]\) with \(\frac{1}{p} + \frac{1}{q} = 1\). Then we have, for any functions \(f\), \(g\), \[||fg||_1 \leq ||f||_p||g||_q\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Prediction Error of the Lasso (slow rate):<br> Let \(\hat{\beta}\) be any Lasso solution when \[\lambda = A\sigma\sqrt{\frac{\log(p)}{n}}.\] with probability at least \(1 - 2p^{-(A^2/2-1)}\). Then: \[\frac{1}{n}\|X(\beta^0 - \hat{\beta})\|_2^2 \leq 4A\sigma\sqrt{\frac{\log(p)}{n}}\|\beta^0\|_1.\]	Proof sketch:<br> Start from the fact that we know \(\beta^L\) can do no better than the true mean \(\beta^0\): \[\frac{1}{2n} ||Y - X \hat \beta||_2^2 + \lambda ||\hat \beta||_1 \leq \frac{1}{2n} ||Y - X \beta^0||_2^2 + \lambda ||\beta^0||_1\] Substitute in \(Y = X\beta^0 + {\varepsilon}\), match the left side. <br> We can put an implicit norm around \({\varepsilon}{^{\mathrm{T}}} X (\hat \beta - \beta^0)\), use Holder's to get the one norm we need paired with an infinity norm around \(X{^{\mathrm{T}}} {\varepsilon}\)<br> In order to make everything work out nicely, we need the assumption that \(\frac{1}{n}||X{^{\mathrm{T}}} {\varepsilon}||_\infty \leq \lambda\). Call this event \(\Omega\), we assume that it is true with probability at least \(1 - 2p^{-(A^2/2-1)}\).<br> Then just use the triangle inequality and multiply both sides by two.<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Subgaussian RV	We say a random variable \(X\) is subgaussian with parameter \(\sigma\) if: \[{\mathbb{E}}[e^{t(X - {\mathbb{E}}[X])}] \leq e^{t^2\sigma^2/2}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Subgaussian Tail Bound	Given \(X\) subgaussian with parameter \(\sigma\), we have: \[{\mathbb{P}}(X - {\mathbb{E}}[X] \geq t) \leq e^{-t^2 / (2\sigma^2)}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Hoeffding's Lemma	If \(X\) takes values in \([a,b]\), then \(X\) is subgaussian with parameter \((b-a)/2\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> What must be true about sequences of independent subgaussian RVs?	If independent subgaussian RVs \(X_1...X_n\) have parameters \(\sigma_i\) respectively, then the RV \(\gamma{^{\mathrm{T}}} X\) is subgaussian with parameters \(\left( \sum_i \gamma_i^2 \sigma_i^2 \right)^{1/2}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Suppose \((\varepsilon_i)_{i=1}^n\) are independent, mean-zero and sub-Gaussian with common parameter \(\sigma\). Note that this includes \(\varepsilon \sim N_n(0, \sigma^2 I)\). Let \(\lambda = A\sigma\sqrt{\log(p)/n}\). Then \[\mathbb{P}(\|X^\top \varepsilon\|_\infty/n \leq \lambda) \geq 1 - 2p^{-(A^2/2-1)}\]	Proof sketch:<br> Start from \[...\leq \sum_{i=1}^p {\mathbb{P}}(\frac{1}{n} |{\varepsilon}^T X_i| &gt; \lambda)\] Split into two events, one where \({\varepsilon}^T X_i\) is positive, and one where it's negative. Both are subgaussian with parameter \(\sigma/\sqrt{n}\), so use the upper bound. Then just plug in \(\lambda\).<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> For \(\lambda \geq 0\), any two lasso regression solutions \(\hat \beta_\lambda^{(1)}\) and \(\hat \beta_\lambda^{(2)}\) must have the same fitted values: \(X \hat \beta_\lambda^{(1)} = X \hat \beta_\lambda^{(2)}\)	First note that, by the convexity of \(||\cdot||^2_2\) with \(t = \frac{1}{2}\),  \[||Y - \frac{1}{2}X\beta^{(1)}_\lambda - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \leq ||Y - \frac{1}{2}X\beta^{(1)}_\lambda||^2_2 + ||Y - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \] and by strict convexity, equality is achieved iff \(X\beta^{(1)} = X\beta^{(2)}\).<br> Show that \[c^* \leq Q_\lambda(\frac{1}{2}X\beta^{(1)} + \frac{1}{2}X\beta^{(2)}) \leq \frac{1}{2}Q_\lambda(X\beta^{(1)}) + \frac{1}{2}Q_\lambda(X\beta^{(2)}) = c^*\] applying convexity of the L1 and L2 norms, and so equality is achieved.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Coordinate Descent	Best used to optimize functions of the form  \[f(x) = g(x) + \sum_{i=1}^n h_j(x_j)\] Start with some initial guess \(x^{(0)}\). Optimize on one component at a time, updating as you go along each cycle.	MSM PartIIINotes
Basic	Part III Notes::MSM	When will coordinate descent converge to a minimizer?	When \(A_0 = \{x : f(x) \leq f(x^{(0)}) \}\) is a compact set (with \(x^{(0)}\) the starting guess)	MSM PartIIINotes
Basic	Part III Notes::MSM	What three properties must inner products have?	1. Conjugate Symmetry: \(\langle x, y \rangle = \overline{ \langle y, x \rangle }\)<br> 2. Linearity in the first argument: \(\langle ax + by, z\rangle = a\langle x, y\rangle + b\langle x,y \rangle\)<br> 3. Positive Definiteness: \(\langle x, x \rangle \geq 0,\, \text{equality iff } x = 0 \)<br> <br> for \(x, y \in \mathcal{H},\, a, b \in \mathbb{K}\)<br> Note that 1 and 2 together makes the inner product a sesquilinear form.	MSM PartIIINotes
Basic	Part III Notes::MSM	What does a matrix being positive definite tell us about its eigenvalues?	The eigenvalues are positive (and real).	MSM PartIIINotes
Basic	Part III Notes::MSM	What is the definition of a matrix being positive-definite?	\(A\) IS SYMMETRIC, and for any nonzero real column vector \(x\),<br> \(x^TAx &gt; 0\)<br> <br>Note this implies  \[\sum_{i,j} x_i A_{ij} x_j &gt; 0\]<br> <br>Semidefinite just means it can also be zero.	MSM PartIIINotes
Basic	Part III Notes::MSM	If some matrix \(A\) has full column rank, what does that tell us about \(A^TA\)?	\(A^TA\) is positive definite	MSM PartIIINotes
Basic	Part III Notes::MSM	What are the three kernel closure properties?	1. Closure over addition and nonnegative scalars: \(a_1k_1 + a_2k_2\)<br> 2. Closure over sequences: \(\lim_{k \to \infty} k_m\)<br> 3. Closure over products: \(k_1k_2\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State the Representer Theorem	Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of<br> \[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]<br> is given by<br> \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]<br> with \(\hat \alpha\) minimizing:<br> \[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	State the KKT conditions for the Lasso	\[\hat v = \frac{1}{\lambda n}X^T(Y - X \beta^L_\lambda)\]<br> where \(\hat v\) satisfies \(||\hat v||_\infty \leq 1\) and \(\hat v_{\hat S} = \mathrm{sgn}(\hat \beta^L_{\lambda,\,\hat S})\) where \(\hat S = \{ j:\, \beta^L_{\lambda,\, j} \neq 0 \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State Markov's Inequality	If \(X\) is a nonnegative random variable and \(a &gt; 0\), then we have:<br> \(\mathrm{P}(X \geq a) \leq \frac{\mathrm{E}(X)}{a}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State Chebyshev's Inequality	\(\mathbb{P}(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}\)<br> Note that this is useless when \(k \leq 1\), since all probabilities are \(\leq 1\)<br> <br> Equivalently we have<br> \(\mathbb{P}(|X-\mu| \geq a) \leq \frac{{\mathrm{Var}}(X)}{a^2}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	State the definition of function convexity.	\(f((1-t)x + ty) \leq (1-t)f(x) + tf(y)\)<br> for all \(x, y \in C\)<br> given convex space \(C\), \(f: C \to \mathbb{R}^n\)	MSM PartIIINotes
Basic	Part III Notes::MSM	Write the Lagrangian function as used in the Lagrangian Method to minimize a function \(f(x)\) subject to \(g(x) = 0\)<br> (\(f: C \to \mathbb{R}\) (\(C \subseteq \mathbb{R}^d\) convex), \(g: \mathbb{R}^d \to \mathbb{R}^d\))	\(L(x, \theta) = f(x) + \theta^Tg(x)\)<br> Find minima in \(x, \theta\)<br> <br> We're looking for points where the tangents to the contours of \(f\) and \(g\) are parallel (or where \(f\) alone is minimized and \(\theta = 0\))	MSM PartIIINotes
Basic	Part III Notes::MSM	Given a convex function \(f: C \to \mathbb{R}\), when is a vector \(v \in \mathbb{R}^d\) a subgradient of \(f\) at \(x_0\)?	\(\forall x \in C,\, f(x) - f(x_0) \geq v^T(x-x_0)\)<br> <br> So in one dimension we have:<br> \(\forall x \in C,\, f(x)-f(x_0) \geq c(x-x_0)\)<br> given \(c \in \mathbb{R}^n\)<br> any \(c\) such that our function stays above the line drawn out from \(x\) with slope \(c\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What is a subdifferential of a function \(f\) at \(x\), and how is it denoted?	The set of all subgradients of \(f\) at \(x\).<br> It's denoted \(\partial f(x)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For the interior of a convex region, if \(f\) is differentiable, what are the subdifferentials of \(f\)?	Just the gradient of \(f\) at any point \(x\):<br> \(\partial f(x) = \{ \nabla f(x) \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	What are the simplifying properties of subdifferentials?	1. Subdifferentials scale how you'd expect:<br> \(\partial(\alpha f)(x) = \{ \alpha v,\, v\in \partial f(x) \} \)<br> 2. When adding functions, we consider all permutations:<br> \(\partial(f_1 + f_2)(x) = \{ v_1 + v_2,\, v_1\in \partial f_1(x),\, v_2\in \partial f_2(x) \} \)<br> 3. Subdifferentials ignore constant adds and respect linearity:<br> \(h(x) = f(Ax + b) \implies \partial h(x) = \{ A{^{\mathrm{T}}} g : g \in \partial f(Ax+b) \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For a convex function \(f\), how must any minimizer relate to subdifferentials?	We have<br> \[x^* \in \underset{x \in \mathbb{R}^d}{\mathrm{argmin}}\, f(x) \iff 0 \in \partial f(x^*)\]<br> <br> (both are equivalent to saying \(f(y) \leq f(x^*),\, \forall y\))	MSM PartIIINotes
Basic	Part III Notes::MSM	What is the subdifferential of the L1 norm \(\partial ||x||_1\)?	\[\partial ||x||_1 = \{ v \in \mathbb{R}^d : ||v||_\infty \leq 1,\, v_A = \mathrm{sgn}(x_A)\}\]<br> <br> where \(A \in \{ j : x_j \neq 0 \}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	For \(x \in \mathbb{R}^d\), state an equivalent form of \(\partial ||x||_1\).	\[\partial x ||x || _1 = \{ v \in \mathbb{R}^d,\, ||v||_\infty \leq 1 \text{ and } v_A = \mathrm{sgn}(x_A)\}\]<br> where \(A = \{ j:\, x_j \neq 0\}\)<br> <br> In other words, \(v\) where the magnitude of all components of \(v\) do not exceed 1, \(v\) is positive in the components where \(x\) is positive and negative where the components of \(x\) is negative.	MSM PartIIINotes
Basic	Part III Notes::MSM	What do we know about the fitted values of Lasso minimizers?	They are unique<br> (though the minimizing \(\beta^L_\lambda\) may not be unique)	MSM PartIIINotes
Basic	Part III Notes::MSM	What do the sets \(S\) and \(N\) denote in the context of Lasso regression?	\(S := \{ k: \, \beta_k^0 \neq 0 \}\) is the set of nonzero terms of the true minimizer \(\beta^0\)<br> \(N:= (1, 2, ..., p) \setminus S\) is the set of zero terms of the true minimizer \(\beta^0\)<br> <br> So for instance \(X_N\) is the matrix of ignored predictor columns.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Sum form of positive-semi-definiteness	A matrix \(X\) is positive semi-definite iff \[\sum_{i,j}a_iX_{ij} a_j \geq 0\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Kernel from a Feature Map	Given some feature map \(\phi:{\mathcal{X}} \to {\mathcal{H}}\), the following is a kernel: \[k(x,y) = \langle \phi(x), \phi(y) \rangle \]<br> <br>This is because \[\sum_{i,j} a_i k(x_i, x_j)a_j = \sum_{i,j} a_i \langle \phi(x_i), \phi(x_j) \rangle a_j = \langle \sum_i a_i \phi(x_i), \sum_i a_i \phi(x_i) \rangle \geq 0\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Set Convexity	A set \(C\) is convex if, for all \(x, y \in C\), \(t \in [0,1]\), \[tx + (1-t)y \in C\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Gamma Integral	\[\Gamma(z) = \int_{0}^\infty u^{z-1}e^{-u}du\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> How to compute \(\Gamma(n)\)	Know that  \[\Gamma(n) = (n - 1)!\] \[\Gamma(1/2) = \sqrt{\pi}\] \[\Gamma(z+1) = z\Gamma(z)\] for \(n \in {\mathbb{N}}\), \(z \in {\mathbb{R}}\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Laplace Transform of a Power	\[t^{-\alpha} = \frac{1}{\Gamma(z)} \int_0^\infty v^{z-1}e^{-tz}\]	MSM PartIIINotes
Basic	Part III Notes::MedicalStats	Causation versus association	Causation is the language of intervention. Association is the language of probability.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Confounding	When some event is a common cause of two variables. The two variables may appear correlated, but don't cause each other.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Collider Bias / Berkson's Paradox	Celebrity example: You may notice that ugly celebrities tend to be talented, and that untalented celebrities tend to be attractive. But this is only because they wouldn't be a celebrity if they weren't either.<br> <br>When two different events cause the same outcome, and that outcome makes observation easier, then those two events may seem erroneously negatively correlated.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Reverse Causation	If symptoms are consistently treated, treatment may appear to 'cause' the illness.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How can we measure causality?	1. Perform controlled experiments<br> 2. Perform randomized experiments<br> 3. Account for confounding in analyses<br> Use instrumental variables<br> 4. Assume that some part of the data-generating process behaves like randomization (eg mendelian randomization)<br> This is called a natural experiment	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Negative Controls	To check your work, check causality of obviously uncorrelated phenomena	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Difference-in-Differences	Sudden divergence of trendlines in two populations: there must be some reason for the divergence!	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Regression Continuity	Check that measured data that should be smooth is actually smooth. Jump points are suspicious.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Immortal Time Bias	Must consider the age requirement of some property: nobel prize winners are older than non-nobel prize winners.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Intention-to-treat	Who will be included in the study must be set-in-stone before time zero (can't disclude people on the fly)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Case-control studies	Case studies paired with a control group. Not optimal, but sometimes necessary for rare diseases.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Cohort Studies	Look at a (snapshot) population and observe what effects it over time.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Rubin's Dictum	For objective causal inference, design trumps analysis.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Transition Rate, and when it is a Markov process and time homogeneous	\[q_{rs}(t; {\mathcal{F}}_t)=\lim_{\delta t \to 0} \frac{{\mathbb{P}}(X(t + \delta t) = s {\,|\,} X(t) = r,\, {\mathcal{F}}_t)}{\delta t}\] \(X(t)\) represents the state the individual is in at time \(t\), \({\mathcal{F}}_t\) is all information about the process prior to \(t\).<br> <br>It's a Markov process when \(q_{rs}\) has no dependence on \({\mathcal{F}}_t\), and it's time-homogeneous when \(q_{rs}\) has no dependence on time (time probabilities over an interval only depend on the duration of the interval).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Transition Intensity Matrix \(Q\)	\[Q_{r,s} := q_{r,s} \text{ for } r \neq s\] \[Q_{r,r} = -\sum_{s \neq r} q_{rs}\] So rows of \(Q\) sum to zero.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Sojourn Time	Distribution of time spent in state \(r\) before moving<br> Has an exponential distribution with rate \[\lambda = \sum_{s,\, s \neq r} q_{rs} = -q_{rr}\]<br> <br>So the average time in a state before switching is given by  <br>\[\frac{1}{\lambda} = \frac{1}{\sum_{s,\, s \neq r} q_{rs}} = \frac{1}{-q_{rr}}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Probability that next state is \(s\) given current state is \(r\) in a continuous Markov process	\[\frac{q_{rs}}{\sum_{j,\, j\neq r} q_{rj}} = \frac{-q_{rs}}{q_{rr}}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Transition Probability Matrix \(P(t)\)	\[P_{rs}(t) = {\mathbb{P}}(\text{State $s$ at time $t$} {\,|\,} \text{State $r$ at time $0$})\] Solution to the forward Kolmogorov equations: \[\frac{dP(t)}{dt} = P(t)Q(t)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How to solve for the transition probability matrix given \(P(0) = I\), \(Q\) is time-homogeneous	\[P(t) = \exp (tQ) = \sum_{n=0}^\infty \frac{t_n}{n!}Q^n\] Use an eigendecomposition for \(Q = UDU\inv\), \[= \exp(tQ) = U\exp (Dt) U\inv\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Expected time matrix \(T(t)\) (in a continuous time Markov process)	Expected total length of time spent in state \(s\), starting from state \(r\): \[T_{rs}(t) = {\mathbb{E}}\left[ \int_0^t I_{X(u) = s} du \right] = \int_0^t P_{rs}(u)du\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Expected number of visits to a state (in a continuous-time Markov process)	\[\sum_{i\neq s}\int_0^t P_{ri}(u) q_{is}du = \sum_{i\neq s} T_{ri}(t)q_{is}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How to find the probability of ever visiting a state (in a continuous-time Markov process)	Create a new problem by zeroing out the \(s\)th row of Q (a trap state), then just compute the new \(P^*(t) = \exp(tQ^*)\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Likelihood contribution of discrete data in a continuous-time Markov process	Let \(a\) be \(x_{i0}\) (state before), and let \(b\) be \(x_{i1}\) (state after), then \[{\mathbb{P}}(X_{i1} = b {\,|\,} a) = P_{ab}(t_{i1} - t_{i0} {\,|\,} \theta)\] Take the product of these all to get the full likelihood.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Likelihood contribution of a known time interruption (like death) in a continuous time Markov process	Problem: we don't know what state the individual was in prior to death. So we have to allow for every possibility: \[\sum_{s \neq d} p_{rs}(t_{i,n_i} - t_{i, n_{i-1}})q_{sd}\] where \(d\) is the death state, and \(r\) is the last known pre-death state \(x_{i,n_{i-1}}\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Likelihood contribution of a censored individual in a continuous time Markov process	Problem: We don't know what disease state the individual is in at the end of the time period, only that they never died (or else we would have known) \[\sum_{s \neq d} P_{r,s}(t_{i, n_i} - t_{i,n_{i-1}}) = 1 - P_{r,d}(t_{i, n_i} - t_{i,n_{i-1}})\] where \(d\) is the death state, and \(r\) is the last known pre-death state \(x_{i,n_i-1}\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Net Survival Integral	\[F_{net} = \exp \int_0^t  d\bar H_E(s) ds\] What fraction of patients would survive to time \(t\) if it were the only possible cause of death?	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	How can we make up for left-truncated data?	Divide by \(F(S_i)\) in the likelihood contributions (if the individual had had the event before \(S_i\), they would not have been included in the dataset)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	Accounting for piecewise hazard rates	Divide time into intervals; use left-truncation on all but the first region and right-censor any observation after the cutoff time. <br>\[\hat \theta_{interval} = \frac{\text{number of events in interval}}{\text{sum of all time at risk in interval}}\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Best way to account for left-truncated data in practice	Kaplan-meier accounts for this already; just make sure that left-truncated individuals aren't considered as part of the risk set until their introduction.<br> <br>This gives the same estimator as the likelihood and piecewise approaches.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Period Survival	We want to characterize survival rates in some restricted period, say 2024. Denote the interval of interest \((a, b]\).<br> <br>Convert calendar time (time \(u\), \(y_i\) diagnosis date, \(z_i\) event observation date, \(w_i\) event or censored) into individual's time (\(s_i\) truncation time, \(x_i\) event/censoring time, \(v_i\) event or censored)<br> <br>Only include individuals that spend some amount of time in the interval pre-event. Left-truncate time before interval cutoff; censor events that occur after the time cutoff.	MedicalStats PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Basic Inequality for the Lasso	\[\frac{1}{2n}||Y - X \hat \beta||_2^2 + \lambda || \hat \beta ||_1 \leq \frac{1}{2n}||Y - X \beta^0||_2^2 + \lambda || \beta^0 ||_1\] where \(\hat \beta\) is the true optimized solution of the Lasso loss equation.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> Let \(\lambda &gt; 0\) and define \(\Delta = X_N^\top X_S (X_S^\top X_S)^{-1} \text{sgn}(\beta_S^0)\). If \(\|\Delta\|_\infty \leq 1\) and for \(k \in S\), \[ |\beta_k^0| &gt; \lambda |\text{sgn}(\beta_S^0)^\top [\{\tfrac{1}{n} X_S^\top X_S\}^{-1}]_k|, \] then there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\). As a partial converse, if there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\), then \(\|\Delta\|_\infty \leq 1\).	Proof sketch:<br> Plug into the KKT conditions for the Lasso. Use the following decomposition: \[ X{^{\mathrm{T}}} X = \left[\begin{array}{cc}   X_S{^{\mathrm{T}}} X_S & X_S{^{\mathrm{T}}} X_N \\\\   X_N{^{\mathrm{T}}} X_S & X_N{^{\mathrm{T}}} X_N \end{array}\right] \] Also note that if \(\mathrm{sgn}(\beta^0) = \mathrm{sgn}(\hat \beta)\), then \(\hat \beta_N = 0\).<br>	MSM PartIIINotes
Basic	Part III Notes::MSM	Characteristic Function Bound of Bernstein's Inequality	Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{E}}(e^{\alpha (W_i - \mu)}) \leq \exp\left( \frac{\alpha^2 \sigma^2}{2(1 - b|\alpha|)} \right) \] for \(|\alpha| &lt; 1/b\).	MSM PartIIINotes
Basic	Part III Notes::MSM	Probability bound of Bernstein's Inequality	Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{P}}\left(\frac{1}{n} \sum_{i=1}^n W_i - \mu \geq t\right) \leq \exp\left( -\frac{nt^2}{2(\sigma^2 + bt)} \right) \] for \(t &gt; 0\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Product of Subgaussian RVs	Suppose \(W, Z\) be mean zero and subgaussian with parameters \(\sigma_W\) and \(\sigma_Z\) respectively. Then \(WZ\) fulfills Bernstein's condition with parameter \((8 \sigma_W \sigma_Z, 4 \sigma_W \sigma_Z)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Theorem:</b> If a distribution \(P\) on \({\mathbb{R}}^n\) has a density with respect to the product measure which is everywhere positive, then its conditional independence graph is global Markov	Proof sketch: Use backwards induction on the size of the separating set.<br> Base case: \(A\) has one element, \(B\) has one element, everything else is in \(S\) (case \(p - 2\)). Fulfills global Markov by independence graph definition.<br> Inductive step: Grow \(A\) and \(B\) to include all 'independent' points. Either \(A\) or \(B\) must have at least two points; WLOG it's \(A\). Call one of the points \(j\), and the rest of \(A\) \(A_-\). By the inductive hypothesis we have: \[Z_{A_-} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup \{ j\}}\] \[Z_{j} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup A_-}\] So, applying the intersection rule (and using our assumption), we have \[Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\] as required.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Markov Blanket	In a regression problem where \(X\) is a set of predictors and \(Y\) is a response, a Markov blanket \(X_S\) is a subset of \(X\) that fulfills \[Y {\perp\!\!\!\perp} X_{S^c} {\,|\,} X_S\] (so \(X_S\) contains all predictors relevant to predicting \(Y\)). <br> <br>There's guaranteed to be a unique minimal blanket (take the weak union of every individually dependent covariate)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Structural Causal Models (SCMs)	A tuple \({\mathcal{S}} = (P_j, h_j, Q_j)_{j \in [p]}\) where:<br> \(P_j\) is the set of 'dependencies' for node \(j\): in other words, \(\mathrm{Pa}(j)\).<br> \(h_j: {\mathbb{R}}^{|P_j|} \times {\mathbb{R}} \to {\mathbb{R}}\): A 'decision function' for node \(j\), which takes in every defined dependency and a source of randomness, then makes a decision as to node \(j\)'s value.<br> \(Q_j\): A probability distribution over \({\mathbb{R}}\), the source of randomness for each node.<br> Provided alongside it is the full random vector \(Z_j := h_j(Z_{P_j}, {\varepsilon}_j)\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Interventions and Perfect Interventions to a Structural Causal Model (SCM)	Replace a \(h_j(Z_{P_j}, {\varepsilon}_j)\) with a new \(\tilde h_j(Z_{P_j}, {\varepsilon}_j)\), hoping to achieve some downstream result. If \(\tilde h_j\) is just some constant function, we say it is a perfect intervention.	MSM PartIIINotes
Basic	Part III Notes::MSM	How do we denote an intervention in an SCM?	\[...|\mathrm{do}(Z_j = \cdot)\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Collider in a path in a DAG	Given some DAG with an undirected path \(j_1,..., j_m\) with length \(m \geq 3\), we say \(j_l\) is a collider (relative to the path) if we have \[j_{l-1} \to j_l \leftarrow j_{l + 1}\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Blocked path in a DAG	A path \(j_1,..., j_m\) is blocked by a set of nodes \(S\) if there exists a node \(v\) on the path such that either:<br> \(v\) IS NOT a collider, and \(v \in S\), or<br> \(v\) IS a collider, and neither \(v\) nor any of its descendants are in \(S\).<br> <br>So in other words, a path is not blocked if non-colliders are not in \(S\) and colliders are either in \(S\), or have descendants in \(S\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> d-Seperation in a DAG<br>(and how it's denoted)	Given disjoint subsets \(A, B\) in a graph \(G\), we say \(A\) and \(B\) are d-separated by a subset \(S\) if every path between \(A\) and \(B\) is blocked by \(S\). We denote this as: \[A {\perp\!\!\!\perp}_G B {\,|\,} S\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Global Markov Directed (Causal) Graphs	A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to a DAG \(G\) on \(p\) vertices if, whenever \(Z \sim P\), and \(A, B, S\) are disjoint sets of vertices,  \[A {\perp\!\!\!\perp}_G B {\,|\,} S \implies Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\]<br> <br>Note that distributions generated from an SCM are always global Markov with respect the SCM DAG.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> Let \(p \geq 2\) and \(Z \sim P\). If \(Z_j \not {\perp\!\!\!\perp} Z_k {\,|\,} Z_{-jk}\) for \(j, k \in [p]\), then any causal DAG generating \(P\) must have either \(j \to k\), \(k \to j\), or \(j \to l \leftarrow k\) for some \(l\).	Proof sketch:<br> By definition, \(j\) and \(k\) can not be d-seperated by \(S := [p] \setminus \{ j, k\}\). The only possible way to avoid being blocked by every other point is to either have a path directly between \(j\) and \(k\), or to have \(j \to l \leftarrow k\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Proposition:</b> If nodes \(j\) and \(k\) in a DAG with \(p\) vertices are not adjacent and \(\pi\) is a topological order on \([p]\) with \(\pi(j) &lt; \pi (k)\), then they are d-separated by \(\mathrm{Pa(k)}\).	Proof sketch:<br> Consider a path \(j = j_1,...,j_m=k\). We can't have \(k \to j\) by the topological order assumption. We can't have \(j_{m-1} \to k\) because then \(j_{m-1} \in \mathrm{Pa}(k)\). We could try to have a V-shaped path  \[j = j_1 \to ... \to j_{l-1} \to j_l \leftarrow ... \leftarrow j_m = k\] (must exist due to topological ordering). But if we take the maximal \(j_l\), we can see that \(j_l\) can't have a descendant in \(\mathrm{Pa}(k)\) because we'd have a cycle in a DAG. \(j\) and \(k\) must be d-separated.	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Fundamental result of Gaussian causal models	\[Z_A {\,|\,} Z_B = z_B \sim N_{|A|}(\mu_A + \Sigma_{A,B} \Sigma_{B,B}{^{-1}} (z_B - \mu_B), \Sigma_{A, A} - \Sigma_{A,B} \Sigma_{B,B}{^{-1}} \Sigma_{B,A})\]<br> <br>For nodewise regression, specialize this to \(A = \{k\}\) and \(B = A^c\).	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Graphical Lasso	Minimizes \[-\log \det (\Omega) + \mathrm{tr}(S\Omega) + \lambda \sum_{j,k}|\Omega_{jk}|\] where \(\Omega\) is the precision matrix \(\Sigma{^{-1}}\), and \(S := \frac{1}{n} \sum_{i=1}^n x_ix_i{^{\mathrm{T}}} - \hat X \hat X{^{\mathrm{T}}}\)	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Weak Law of Large Numbers	If \(W_1, W_2, ...\) are iid real-valued random variables and \({\mathbb{E}}(W_i) = \mu &lt; \infty\), then as \(n \to \infty\), \[\frac{1}{n}\sum_{i=1}^n W_i \overset{p}{\to} \mu\]	MSM PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Assumptions made in the SIR model	\(X_i(t)\) maps the \(i\)th individual to the state they are in at time \(t\).<br> \(S(t), I(t), R(t)\) are the number of people susceptible, infected, and recovered, respectively. ie, \[S(t) = \sum_{i=1}^N I_{X_i(t) = S}\] \(N\) is the total population, assume it is fixed, so that for all \(t\), \[N = S(t) + I(t) + R(t)\] The rate of moving from \(S\) to \(I\) is denoted \(\lambda(t)\), is usually equal to \(\beta I(t)\).<br> The rate of moving from \(I\) to \(R\) is denoted \(\gamma(t)\), usually just a constant \(\gamma\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Gillespie Algorithm	Continuous time discrete individual algorithm:<br> Two events:<br> someone is infected (\((s, i) \mapsto (s-1, i+1)\))<br> or someone recovers (\((s, i) \mapsto (s, i-1)\)).<br> Draw next time: \(\sim \exp (\beta is + \gamma i)\),<br> At that next time, pick one of the two events (like a binomial): infected w.p. \(\frac{\beta is}{\beta is + \gamma i}\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Chain-Binomial Model	Discrete time discrete individual algorithm, sort of like a probabilistic ODE:<br> Take time steps of size \(\delta: [t, t + \delta)\)<br> \(S_{t + \delta} = S_t - B_t\)<br> \(I_{t + \delta} = I_t + B_t - C_t\)<br> \(R_{t + \delta} = R_t + C_t\)<br> where \(B_t, C_t\) are binomial RVs: \[B_t \sim \mathrm{Binom}(S_t, 1 - \exp (-\beta I_t \delta)) \] \[C_t \sim \mathrm{Binom}(I_t, 1 - \exp (-\gamma \delta)) \] (where the exponential comes from \(\geq 1\) event in a Poisson RV)<br> We can simplify \(B_t\) if we assume \(\beta \delta\) is small: \[B_t \sim \mathrm{Binom}(S_t, \beta I_t \delta)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Reed-Frost Model	Given state \((S_t, I_t)\), individuals have a latent (infected) time of 1, and at the end of their latent period, they try to infect every susceptible individual with probability \(p\). So \[I_{t+1} \sim \mathrm{Binomial}(S_t, 1 - (1-p)^{I_t})\] Useful for small populations like households.	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Greenwood Formulation	Chain binomial model where instead of the infection rate being proportional to the number of infected individuals, it's just \(p\) if any infected individual exists: \[I_{t + 1} \sim \mathrm{Binomial} (S_t, p)\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Continuous Deterministic SIR Model	Assume we start with \(N\) susceptible individuals and 1 infected individual. System of ODEs:<br> \(\frac{d}{dt}S(t) = -\lambda(t) S(t)\)<br> \(\frac{d}{dt} I(t) = \lambda(t) S(t) - \gamma I(t)\)<br> \(\frac{d}{dt}R(t) = \gamma I(t)\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Threshold result for the SIR ODE system	We need \(\frac{dI}{dt}(0) &gt; 0\), so we must have \(S(0) = N &gt; \gamma / \beta\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> The reproductive number \(R_0\)	We need \(\frac{dI}{dt}(0) &gt; 0\), so we must have \(S(0) = N &gt; \gamma / \beta\)<br> <br>Define \(R_0 = N \frac{\beta}{\gamma}\): number of secondary infections caused by one infected individual in a fully susceptible population (epidemic will only take off if at time zero, \(R_0 &gt; 1\).<br> <br>Related also to \(R_e(t)\), just the \(R_0\) with the current \(S(t)\) in place of \(N\).	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Prevalence of an infection	\[\pi(t) = I(t) / N\]	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Delayed Incubation Period Equation	\[\mu(t) = \int_0^t h(u) f(t - u) du\] where \(\mu(t)\) is the rate of newly observed cases at \(t\), \(h(t)\) is the rate of infections at \(t\), and \(f\) is a probability distribution for the incubation length of the disease. Note this is a convolution!<br> <br>Note this simplifies to the following in discrete time: \[\mu_k = \sum_{i=1}^k h_i f_{k-i}\]	MedicalStats PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> The Parametric Bootstrap	Suppose we know the family of distributions relevant, \({\mathbb{P}}_\theta\), but not the particular \({\mathbb{P}}_{\theta_0}\) with some property \(\psi(\theta_0)\). The idea is to estimate \(\theta_0\) with an estimator \(\hat \theta\), then simulate a bunch of new trials to understand the distribution of \(\psi\) with \(\hat \psi\):<br> 1. For \(b = 1...B\), sample \(Y^{(b)} = (Y_1^{(b)}, ...,Y_n^{(b)}) \overset{iid}{\sim} {\mathbb{P}}_{\hat \theta}\) from the estimated \(\hat \theta\) derived from \(Y\).<br> Use this to compute \(\hat \theta^{(b)} = \hat \theta(Y^{(b)})\) and plug in to get \(\hat \psi^{(b)} = \psi (\hat \theta^{(b)})\).<br> 2. Approximate \(\hat \psi\) with the empirical distribution: \[\hat {\mathbb{P}}^{(b)} = \frac{1}{B} \sum_{i=1}^B \delta_{ \hat \psi^{(b)}}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Percentile Bootstrap Confidence Interval	In the parametric bootstrap, simulate \(B\) trials, order by their outcome \(\hat \psi^{(b)}\), the interval \( [\hat \psi^{0.025 B}, \hat \psi^{0.975 B} ] \) is an approximate two-tailed 95 percent confidence interval.<br> <br>More formally, we denote this as \([\hat q_{\alpha/2}, \hat q_{1 - \alpha/2}]\)<br> <br>Note that you can use the likelihood ratio test statistic as \(\psi\), like: \[\psi (\theta) = 2 \{ l(\beta, \sigma^2, \Sigma_u) - l(\beta, \sigma^2, 0) \}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When is the percentile bootstrap confidence interval most valid?	When we assume \(\hat \psi \approx N(\psi_0, \sigma^2_\psi)\) and \(\hat \psi ^* \approx N(\hat \psi, \hat \sigma^2_\psi)\) conditional on \(Y\) (where \(\hat \sigma^2_\psi \approx \sigma^2_\psi\))	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Basic Bootstrap Confidence Interval<br>(and its assumptions)	\[[2 \hat \psi - \hat q_{1 - \alpha/2}, 2 \hat \psi - \hat q_{\alpha/2}]\] Relies on the assumption that \(\hat \psi^* - \hat \psi\) under \({\mathbb{P}}^*\) is similar to \(\hat \psi - \psi_0\), so that we have: \[\begin{align*} {\mathbb{P}}(2 \hat \psi - \hat q_{1 - \alpha/2} \leq \psi_0 \leq 2 \hat \psi - \hat q_{\alpha/2}) &\leq {\mathbb{P}}( \hat q_{1 - \alpha/2} - \psi \leq \hat \psi - \psi_0 \leq \hat q_{\alpha/2} - \psi )\\\\ & \approx {\mathbb{P}}^*( \hat q_{1 - \alpha/2} - \psi \leq \hat\psi^* - \hat \psi \leq \hat q_{\alpha/2} - \psi )\\\\ &= 1 - \alpha \end{align*}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Probability as an Expectation	For all events \(A\), \[{\mathbb{P}}(A) = {\mathbb{E}}[{\mathbb{1}}(A)]\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Proposition:</b> Over all classifiers, the Bayes classifier has the minimal prediction error: \[\underset{h \in H}{\min} R(h) = R(h^*) = {\mathbb{E}}[1 - P_{h^*(X^*)}(X^*)] \] where \(H\) is the set of all classifiers.	Proof sketch:<br> Express the classification risk as an expectation. <br> Use the tower property to fix some \(X^* = x\), and split into a sum over possible values of \(Y^*\). <br> Note that \({\mathbb{1}}(h(x) \neq k)\) is now deterministic, so isolate an expectation on \(Y^* = k\). <br> Note that this is equivalent to the statement of the conditional class probability \(p_k(x)\). <br> Clearly the resulting statement is minimized when \(h(x) = p_k(x)\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Plug-in Classifier	To construct a classifier \(\hat h(x)\), define an estimator \(\hat p_k(x)\), then use the plug-in formula \[\hat h(x) = {\underset{k}{\mathrm{argmin}}\,\,} \hat p_k(x)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Linear and nonlinear classifiers	A classifier is linear if its decision boundaries are all subsets of hyperplanes in \({\mathbb{R}}^p\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Class Densities and Class Prior Probabilities for LDA	Class densities: \(f_k: X^* {\,|\,} (Y^* = k)\)<br> Class prior probabilities: \(\pi_k = {\mathbb{P}}(Y^* = k)\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	Conditional class probability in terms of class densities and class prior probabilities (for LDA)	\[p_k = \frac{f_k(x) \pi_k }{\sum_{j=1}^K f_j(x) \pi_j}\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Assumptions for Linear Discriminant Analysis	Assume each class density  \(f_k = X^* {\,|\,} (Y^* = k)\) is normally distributed around its mean: \[f_k \sim N_p(\mu_k, \Sigma)\] Need to solve for all \(\pi_i\), \(\mu_i\), \(i \in [p]\) and \(\Sigma\).	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Discriminant Functions	In linear discriminant analysis, we assume each class density is normal, and so plugging into the conditional class probability, throwing out common factors, and taking a log, we get: \[h^*_k(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \delta_k(x)\] \[\delta_k(x) := -\frac{1}{2} (x - \mu_k){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu_k) + \log(\pi_k)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Decision Boundaries of LDA Bayes	The subset \(\{ x \in {\mathbb{R}}^p: p_k(x) = p_l(x) \}\), where: \[\log \left( \frac{p_k(x)}{p_l(x)}\right) = x^T \Sigma{^{-1}} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l){^{\mathrm{T}}} \Sigma{^{-1}} (\mu_k - \mu_l) + \log \frac{\pi _k}{\pi_l}\]<br> <br>\(\Sigma{^{-1}} (\mu_k - \mu_l)\) are the normal vectors to the hyperplanes seperating the fitted classes	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Assumed values of each parameter \(\hat \pi_k\), \(\hat \mu_k\), and \(\hat \Sigma\) in \(h^{LDA}\)	\(\hat \pi_k= \frac{N_k}{n}\) (where \(N_k\) is the number of observations of class \(k\))<br> \(\hat \mu_k = \frac{1}{N_k} \sum_{i:Y_i = k}X_i\) (centriod of class \(k\))<br> \(\hat \Sigma = \frac{1}{n - K} \sum_{k=1}^K \sum_{i:Y_i=k}(X_i - \hat \mu_k) (X_i - \hat \mu_k){^{\mathrm{T}}}\) (pooled sample covariance matrix)<br> <br>Used in \(h^{LDA}(x)\): \[h^{LDA}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \hat \delta_k(x)\] \[\hat \delta_k(x) := -\frac{1}{2} (x - \hat\mu_k){^{\mathrm{T}}} \hat\Sigma{^{-1}} (x - \hat \mu_k) + \log(\hat\pi_k)\] This is an MLE, and is therefore consistent.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Logistic Classifier	Suppose we have independent observations \(\tilde Y_i \in \{ 1, ..., K \}\) over \(K\) classes, so each \(Y_i \in \{0,1\}^k\), \(Y_{ij} = {\mathbb{1}}_{\{\hat Y_{i = j}\}}\). We have: \[Y_i {\,|\,} (X_i = x_i) \overset{ind}{\sim} \mathrm{Multinomial}(1; p_1,...,p_K(x_i))\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Conditional Class Probabilities for the Logistic Classifier	\[p_k(x_i) = \frac{e^{x_i^T \beta_k}}{1 + \sum_{j=1}^{K-1} e^{x_i^T \beta_j}}\] Note that we set \(\beta_K := 0\) as a baseline.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Logistic Classifier Log Likelihood	\[L(\beta_1,...,\beta_{K-1}) = \log \left( \prod_{i=1}^n \prod_{k=1}^K p_k(x_i)^{Y_{ik}} \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	When should you use LDA versus the logisitic classifier?	LDA yields smaller variance because of the Gaussian assumption, however the logistic classifier is more robust if Gaussianity does not hold	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Support Vector Machine	A two-class classifier (\(K = 2\)) with points labeled \(Y_i \in \{ 1, -1\}\).<br> We seek a hyperplane \(f(x)\), with \(f(x) &gt; 0\) for class 1 and \(f(x) &lt; 0\) for class 2 (class -1).<br> We find the hyperplane fit which maximizes the margin (closest distance from points on each side to the separating hyperplane)<br> <br>\[h^{SVC}(x) = {\mathrm{sgn}} (\hat \alpha + x{^{\mathrm{T}}} \hat \beta)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Hyperplane and Separating Hyperplane	\[f(x) = \alpha + x^T\beta\] It's called separating if \[Y_i f(X_i) &gt; 0\] for all \(i \in [n]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Optimal Separating Hyperplane for SVMs	Maximize \(M\) (the margin) over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(M \geq 0\), subject to \[\frac{Y_i(\alpha + X_i{^{\mathrm{T}}} \beta)}{||\beta||_2} \geq M\] for all \(i \in [n]\). <br> Equivalently, we can MINIMIZE \(||\beta||_2\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1\] for all \(i \in [n]\)	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Softened Separating Hyperplane Optimization for SVMs	Minimize \[||\beta||_2^2 + C\sum_{i=1}^n \xi_i\] over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(\xi \in {\mathbb{R}}^n\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1 - \xi_i\] \[\xi_i \geq 0\] for \(i \in [n]\). We call the \(\xi_i\) slack variables:<br> If \(\xi_i = 0\), \(X_i\) is on the right side of the boundary and beyond the margin.<br> If \(0 &lt; \xi_i  &lt; 1\), \(X_i\) is within the boundary but on the wrong side of the margin.<br> If \(\xi &gt; 1\), then \(X_i\) is on the wrong side of the boundary.<br> <br>A smaller \(C\) means a tighter margin with more misclassifications.	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Alternative formulation of the soft SVM	\[(\hat \alpha, \hat \beta) \in {\underset{\alpha \in {\mathbb{R}}, \beta \in {\mathbb{R}}^p}{\mathrm{argmin}}\,\,} \left( \sum_{i=1}^n (1 - Y_i(\alpha + X_i{^{\mathrm{T}}} \beta))_+ + \lambda||\beta||^2_2 \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Laplace Kernel	\[k(x,y) = \exp(-\gamma ||x - y||_2)\] Radial like the Gaussian kernel, but with a slower decay for far away points.	SLIP PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Shrinkage Estimator	The idea is to bias the individual estimates towards the population mean to reduce the final MSE.<br> This is equivalent to just using Hierarchical Bayes with \(\tau^2\), where you estimate \(\tau^2\) given data first, then consider that for your other estimators.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Posterior Odds Ratio<br>(and the Bayes factor)	\[ \frac{P(M_1 {\,|\,} D)}{P(M_2 {\,|\,} D)} = \frac{P(D {\,|\,} M_1)}{P(D {\,|\,} M_2)} \frac{P(M_1)}{P(M_2)} \] \(M_1\) and \(M_2\) are our models. The first multiplicand is the Bayes factor, the second multiplicand is the prior odds of one model versus the other.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Evidence	\[P(D {\,|\,} M_1) = \int_\Theta P(D {\,|\,} \theta, M_1)P(\theta {\,|\,} M_1) d\theta\] also called the marginal likelihood.<br> <br>Note that \(P(\theta {\,|\,} M_1)\) needs to be a proper prior!	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> The Jefferies Scale	Let \(\alpha\) be a placeholder for the natural log of the Bayes factor.<br> \(\alpha &lt; 1\): Terrible<br> \(1 &lt; \alpha &lt; 2.5\): Weak but significant<br> \(2.5 &lt; \alpha &lt; 5\): Significant<br> \(5 &lt; \alpha\): Decisive	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Harmonic Mean Estimator	\[P(D) \approx \left[ \frac{1}{M} \sum_{i=1}^M P(D {\,|\,} \theta_i){^{-1}}  \right]{^{-1}}\] This is unstable in practice	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Savage-Dickey Ratio	When we are comparing nested models, it's possible to simplify the Bayes factor to \[ \frac{P(\psi {\,|\,} D, M_1)}{P(\psi {\,|\,} M_1)} \] where \(\psi = 0\).	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Bayesian Model Averaging	Bayesian model comparison between many models, not just two candidates.	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	\(PA \vdash \varphi\) versus \(PA\vDash\varphi\) versus \({\mathbb{N}} \vDash \varphi\)	\(PA \vdash \varphi\) means that you can prove \(\varphi\) from axioms in \(PA\),<br> \(PA \vDash \varphi\) means that \(\varphi\) holds in every model of \(PA\) (even nonstandard ones),<br> \({\mathbb{N}} \vDash \varphi\) means that \(\varphi\) holds in the standard model specifically.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\Sigma_1\)-definable Set	A set \(A \subseteq {\mathbb{R}}^n\) is \(\Sigma_1\)-definable if there exists a \(\Sigma_1\)-formula \(\varphi\) such that \[\bar n \in A \iff {\mathbb{N}} \vDash \varphi(\bar n) \] We need to construct a \(\Sigma_1\) that is true if and only if \(\bar n \in A\).<br> For instance, \(A\) is the set of perfect squares, then we could use \(\varphi(n) = \exists x,\, (n = xx)\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Initial Segment	A subset of a poset that is downward closed (the reverse of a terminal segment)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> \({\mathbb{N}}\) forms an initial segment of every model \(M\) of \(PA^-\)	Map \(0 \in M\) to \(0 \in {\mathbb{N}}\), \(S(0) \in M\) to \(1 \in {\mathbb{N}}\), and onwards.<br> It's an initial segment because  \[PA^- \vdash \forall x. (x \leq k \to x = 0 \lor x = 1 \lor ...\lor x = k)\] for all standard \(k \in {\mathbb{N}}\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Function Representation in a Theory	Let \(f: {\mathbb{N}}^l \to {\mathbb{N}}\) be a total function with \(T\) being any \(L_{PA}\) theory extending \(PA^-\). \(f\) is represented in \(T\) if there exists an \(L_{PA}\) formula \(\theta(x_1,...,x_l,y)\) such that, for all \(\bar n \in {\mathbb{N}}^l\),<br> 1) \[T \vdash \exists !y.\theta(\hat n, y)\] 2) \[k = f(\bar n) \implies T\vdash \theta(\bar n, k)\] If \(\theta\) can be made a \(\Sigma_1\), we say \(f\) is \(\Sigma_1\)-represented in \(T\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Subset Representation in a Theory	A subset \(S \subseteq {\mathbb{N}}^l\) is represented in an \(L_{PA}\) theory extending \(PA^-\) if there exists an \(L_{PA}\) formula \(\theta(x_1,...,x_l)\) such that for all \(\bar n \in {\mathbb{N}}^l\):<br> 1) \[\bar n \in S \implies T \vdash \theta(\bar n)\] 2) \[n \not \in S \implies T \vdash \neg \theta(\bar n)\] If \(\theta\) can be made a \(\Sigma_1\), we say \(S\) is \(\Sigma_1\)-represented in \(T\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Every total recursive function \(f: {\mathbb{N}}^m \to {\mathbb{N}}\) is \(\Sigma_1\)-represented in \(PA^-\)	We apply the previous result for \({\mathbb{N}}\) to get a \(\Sigma_1\) \(\exists \bar z. \phi(\bar x, y, \bar z)\) and equivalently \(\exists z. \phi(\bar x, y, z)\) by coding. <br> By example sheet 4 question 2, any \(\Sigma_1\)-sentence true in \({\mathbb{N}}\) is true in any model of \(PA^-\). So we have the second part of the representation definition, but we need the first part. That's tricky because we don't yet have uniqueness for the \(\phi\) statement above (we could pick a \(z\) larger than the smallest minimizable).<br> So just write a predicate that forbids smaller solutions: \[\psi(\bar x, y, z) := \phi(\bar x, y, z) \land \forall u,\, u \leq (y + z). \forall v, v \leq (y + z). (u+v &lt; y + z \to \neg \phi(\hat x, u, v))\]<br> <br>Corollary:<br> Every recursive set \(A \subseteq {\mathbb{N}}^m\) is \(\Sigma_1\)-represented in \(PA^-\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	What do we call an \(L_{PA}\) formula with one free variable?	An unary predicate	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> State the Diagonalization Lemma	Let \(T\) be an \(L_{PA}\) theory in which every total recursive function is \(\Sigma_1\)-represented, and let \(\theta(x)\) be an \(L_{PA}\) formula with one free variable. Then there exists some sentence \(G\) of \(L_{PA}\) such that \[T \vdash (G \iff \theta(\lceil G \rceil))\] <br>(So for any numeric property, there's a sentence \(G\) that provably states that its own Godel number has that property)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Diag	\(\mathrm{diag}(n):\)<br> If \(n = \lceil \sigma(x)\rceil\), return \(\lceil \forall y. (y = n \to \sigma(y)) \rceil\)<br> Else return 0.<br> <br>We're just rewriting the sentence \(\sigma(x)\) coded by \(n\) to include \(n\) itself.<br> diag is the coded function that plugs a number into the conditional it represents (if possible).	LogicAndComputability PartIIINotes
Basic	Part III Notes::MSM	Mercer's Theorem decomposition for kernels	\[k(x,y) = \sum_{j \in J} \mu_j e_j(x) e_j(y)\]	MSM PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Marginalization	\[P(x) = \int P(x,y)dy = \int P(x{\,|\,} y) P(y) dy = {\mathbb{E}}_{y}[P(x {\,|\,} y)] \]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Split Multivariate Gaussian Marginals	Suppose we have a partioned multivariate Gaussian like  \[\boldsymbol{f} = \begin{pmatrix} \boldsymbol{U} \\\\ \boldsymbol{V} \end{pmatrix} \sim N\left( \begin{bmatrix} \boldsymbol{\mu}_U \\\\ \boldsymbol{\mu}_V \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_U & \boldsymbol{\Sigma}_{UV} \\\\ \boldsymbol{\Sigma}_{VU} & \boldsymbol{\Sigma}_V \end{bmatrix} \right)\] Then its marginals are exactly what we'd expect: \[P(U) = \int P(U,V)dV = N(U {\,|\,} \mu_U, \Sigma_U)\] <br>\[P(V) = \int P(U,V)dU = N(V {\,|\,} \mu_V, \Sigma_V)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Split Multivariate Gaussian Conditional Probabilities	\[ U {\,|\,} V \sim N({\mathbb{E}}[U {\,|\,} V], {\mathrm{Var}}[U {\,|\,} V]) \] <br>\[{\mathbb{E}} [U {\,|\,} V] = \mu_U + \Sigma_{UV} \Sigma_V{^{-1}} (V - \mu_V)\] <br>\[{\mathrm{Var}}[U {\,|\,} V] = \Sigma_U - \Sigma_{UV} \Sigma_V{^{-1}} \Sigma_{VU}\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	How can we construct a joint probability from \[V \sim N(V_0, \Sigma_V)\] and \[U {\,|\,} V \sim N(U_0 + XV, \Sigma_{U {\,|\,} V})\]	\[\begin{pmatrix} \boldsymbol{U} \\\\ \boldsymbol{V} \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{U}_0 + \boldsymbol{X}\boldsymbol{V}_0 \\\\ \boldsymbol{V}_0 \end{pmatrix}, \begin{pmatrix} \boldsymbol{X}\boldsymbol{\Sigma}_V \boldsymbol{X}^T + \boldsymbol{\Sigma}_{U|V} & \boldsymbol{X}\boldsymbol{\Sigma}_V \\\\ \boldsymbol{\Sigma}_V \boldsymbol{X}^T & \boldsymbol{\Sigma}_V \end{pmatrix} \right)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> MLE and Unbiased Estimator for Normal RV \(\sigma\)	MLE: \[\hat \sigma^2 = \frac{(x_i-\hat \mu)^2}{N}\] Unbiased Estimator: \[\hat \sigma^2 = \frac{(x_i-\hat \mu)^2}{N-1}\] where \(\hat \mu = \bar x\).	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Gaussian Process	Time-dependent Gaussian distribution \[f(t) \sim GP(m(t), K_{A, \tau^2}(t, t'))\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Squared Exponential Kernel	\[k(t,t') = A^2 \exp(-|t-t'|^2 / \tau^2)\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Stationary Gaussain Process	\[K(t,t') = K(t + a, t' + a)\] for all \(a\)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> How to draw probabilistic graphical models	Open circle / square nodes: latent parameter, unobserved data<br> Shaded nodes: Something we condition on (ie data)<br> Filled dot: Known constant<br> Plate: iid replications of what's inside	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Occam Factor<br>(and how to derive it)	Define \(g(\theta) = P(\theta {\,|\,} D, M)P(D {\,|\,} M) = P(D {\,|\,} \theta, M)P(\theta {\,|\,} M)\). <br> Find a MAP estimate \(\theta_0 = {\underset{\Theta}{\mathrm{argmax}}\,\,} \ln(g(\theta))\), use a second order Taylor expansion for \(\ln(g(\theta))\) then exponentiate to get: \[g(\theta) \approx g(\theta) \exp(-\frac{1}{2} (\theta - \theta_0){^{\mathrm{T}}} A (\theta - \theta_0))\] where \(A\) is the Hessian of the log of \(g\) at the MAP<br>  From there see that \[\int g(\theta)d\theta = \int P(\theta {\,|\,} D, M)P(D {\,|\,} M) d\theta  = P(D{\,|\,} M) \int P(\theta {\,|\,} D, M)d\theta = P(D {\,|\,} M)\] So we have: \[P(D|M) \approx g(\theta_{MAP}) \det(A / 2\pi)^{-1/2} =  P(D {\,|\,} \theta_0) P(\theta_0) \det(A / 2\pi)^{-1/2}\] \(P(\theta_0) \det(A / 2\pi)^{-1/2}\) is called the Occam factor.	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Fixed Point Theorem	There exists a closed \(\lambda\)-term \(Y\) such that for all \(F\), \[F(YF) \equiv_\beta YF\]<br> <br>\(Y\) is called a fixed-point function.	LogicAndComputability PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Laplace Approximation for the Evidence	Find a MAP estimate \(\theta_0 = \underset{\Theta}{\mathrm{argmax}}\,\, \ln(P^*(\theta \,|\, D))\), use a second order Taylor expansion for \(\ln(P^*(\theta \,|\, D))\) then exponentiate to get: \[P^*(\theta \,|\, D) \approx P^*(\theta_0 \,|\, D) \exp(-\frac{1}{2} (\theta - \theta_0)^{\mathrm{T}} A (\theta - \theta_0))\] where \(A\) is the Hessian at the MAP<br>	Astrostatistics PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Population Survivor under Gamma-distributed Frailty	Assume the density of \(u\) is given by \(g(u)\), with \({\mathbb{E}}[g(u)] = 1\). Then we have: <br>\[{\mathbb{E}}_u[F(t)] = {\mathbb{E}}_u[e^{-H(t{\,|\,} u)}] = \int_0^\infty \exp (-u H_0(t))g(u) du\] Recognizing the Laplace transform and plugging in \(u \sim Gamma(p, \lambda)\): \[\bar F(t) = \left( \frac{p}{p+H_0(t)} \right)^\lambda\] <br>Usually we want the expected value of \(g(u)\) to be one, so we normally use \(p = \lambda\), denoted \(\psi\).	MedicalStats PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Inverse of a 2x2 Matrix	For a \(2 \times 2\) matrix: \[A = \begin{pmatrix} a & b \\\\ c & d \end{pmatrix}\] its inverse is: \[A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\\\ -c & a \end{pmatrix}\] <br>To invert, swap the entries on the diagonal, negate the off-diagonal, and divide everything by the determinant.	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	Goal of Nested Sampling	We need to be able to evaluate the likelihood \(L(\theta) : = P(D {\,|\,} \theta, M)\) and the prior \(\pi(\theta) := P(\theta {\,|\,} M)\), we want to approximate the integral \[Z = \int P(D {\,|\,} \theta, M) P(\theta | M) d \theta = \int L(\theta)\pi(\theta) d\theta\]<br> <br>(remember the prior is diffuse, and the likelihood is peaked, which makes this problem difficult)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Nested Sampling	1. Take \(N_{live}\) points and their likelihoods, start with \(Z = 0\).<br> 2. Kill the point with the smallest likelihood, call it \(L_i^*\) for step \(i\)<br> 3. (trivial) Compute the shrinkage factor \(t_i = \frac{X_i}{X_{i-1}} \approx e^{-1/N_{live}}\) (approximate with the mean of the beta-distribution \(\mathrm{Beta(N_{live}, 1)}\))<br> 4. Accumulate evidence: \[\Delta Z = L_i^*(X_{i-1} - X_i) = L_i^* (1 - t_i) X_{i-1}\] 5. Sample new point with likelihood greater than \(L_i^*\)<br> 6. Repeat steps 2-5 until convergence<br> 7. Add information about remaining points to the evidence: \[\Delta Z  = \bar L X_{end} = \bar L e^{-m/N_{live}} \] where \(\bar L\) is the average likelihood of remaining points, and \(m\) is the number of steps taken so far.<br> 8. Compute the weighted posterior samples: \[w_i = \frac{L_i^*(1-t_i)X_{i-1}}{Z}\]	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Variational Inference	Idea: approximate the evidence with a distribution that is 'close' (measured by Kullback-Leibler divergence)	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Kullback-Leibler Divergence	\[KL(q(\theta) {\,||\,}  p(\theta) ) = - \int q(\theta) \ln \left( \frac{p(\theta)}{q(\theta)} \right)d\theta\] Nonnegative and zero only when \(p(\theta) = q(\theta)\), but NOT symmetric and so not a metric!	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> ELBO<br>(and how to derive it)	Assume we want to minimize \(KL(q(\theta) {\,||\,} P(\theta {\,|\,} D))\). Break down the KL divergence to extract the \(p(D)\) piece, yielding \[KL(q(\theta) {\,||\,} P(\theta {\,|\,} D)) + {\mathbb{E}}_q[\ln(q(\theta)) - \ln (P(D {\,|\,} \theta)P(\theta))] = \ln(P(D))\] Call the expectation bit the Evidence (Log) Lower Bound (ELBO). If we want to minimize the KL divergence, we have to minimize it (can't ever do better than \(\ln(P(D))\))	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Axiom of Extensionality	Two sets are equal if they have the same elements.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Axiom of Separation	We can define sets as we normally do: \[A = \{ x \in S : \varphi(x) \}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Axiom of Pairing	If you have two sets, there exists a set that contains those two sets: \[\{ A, B \}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Axiom of Choice	Suppose you have a set \(X\) of nonempty sets. Then there exists a function \(f\) that maps each \(x \in X\) to one of its elements.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Statement of Diaconescu's Theorem	We can deduce the Law of Excluded Middle (\(\forall \varphi,\,\varphi \lor \neg \varphi \)) from the Axion of Choice in a minimal intuitionistic setting.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Proposition:</b> Diaconescu's Theorem	Want to show \(\forall \varphi,\,\varphi \lor \neg \varphi \); fix \(\varphi\).<br> Use the axiom of seperation to get: \[A =  \{ x \in \{ 0, 1 \} : \varphi  \lor (x = 0)  \}\] \[B =  \{ x \in \{ 0, 1 \} : \varphi  \lor (x = 1)  \}\] Use the Axiom of Pairing to get \(\{ A, B \}\). Because \(0 \in A\) and \(1 \in B\), \(A\) and \(B\) are inhabited, so we have a choice function  \[f: \{ A, B \} \to A \cup B\] with \(f(A) \in A\) and \(f(B) \in B\).<br> So we have a proof of \[(f(A) = 0 \lor \varphi) \land (f(B) = 1 \lor \varphi)\] By intuistic logic, we can split into four cases:  \[f(A) =0,\, f(B) = 0\] \(f(B) = 0\) implies \(\varphi\) above. \[f(A) = 0,\, f(B) = 1\] Hard case, see below. \[f(A) = 1, f(B) = 0\] \(f(A) = 1\) implies \(\varphi\) above. \[f(A) = 1, f(B) = 1\] \(f(A) = 1\) implies \(\varphi\) above.<br> <br>Hard case: \(f(A) = 0,\, f(B) = 1\). Prove \(\neg \varphi = \varphi \to \bot\) by the following:<br> Assume \(\varphi\).<br> By the Axiom of Extensionality, \(A = B\)<br> But \(f(A)\) and \(f(B)\) don't have the same value! \[0 = f(A) = f(B) = 1\] Contradiction, we have \(\neg \varphi\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> \(\beta\)-redex	A term we can simplify into a \(\beta\)-contractum: \[(\lambda x: \sigma . p) Q \to_\beta P[x := Q]\] Read this as 'function application candidate'	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Weak Normalization Theorem \(m\) function	\[m(M) := (0,0) \, \text{ if $M$ is in beta-normal form}\] \[m(M) := (h(M), \mathrm{redex}(M))\] <br>where \(h(M)\) is the greatest height of a redex in \(M\), and \(\mathrm{redex}(M)\) is the number of occurances of redexes in \(M\) of maximal height.<br> <br>(the height of a redex is the height of the type of the function being applied)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Grammar of the Full Simply-Typed Lambda Calculus Types	\[\begin{align*} \lambda_\Pi &:\, V & \text{Variable}\\\\ &|\, \lambda V: \Pi . \lambda_\Pi & \text{$\lambda$-Abstraction}\\\\ &|\, \lambda_\Pi\lambda_\Pi & \text{$\lambda$-Application}\\\\ &|\, \left&lt; \Pi, \Pi \right&gt; & \text{Product Constructor}\\\\ &|\, \Pi_1(\lambda_\Pi)\, | \, \Pi_2(\lambda_\Pi) & \text{Projection Types / Product Eliminators}\\\\ &|\, \iota_1(\lambda_\Pi)\, | \, \iota_2(\lambda_\Pi) & \text{Left and Right Coproduct Constructors}\\\\ &|\, \mathrm{Case}(\lambda_\Pi; V.\lambda_\Pi; V. \lambda_\Pi) & \text{Coproduct Eliminator}\\\\ &|\, * & \text{Terminal Type Constructor}\\\\ &|\, !_\Pi \lambda_\Pi & \text{Explosion Type} \end{align*}\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Explosion Typing Rule	\[\frac{\Gamma \Vdash M : 0}{\Gamma \Vdash !_\varphi M : \varphi}\] for every type \(\varphi \in \Pi\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Case Beta-Reduction	\[\mathrm{Case}(\iota_1(M); x.K; y.L) )\to_\beta  K[x := M]\] \[\mathrm{Case}(\iota_2(M); x.K; y.L) )\to_\beta  L[y := M]\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Extra BHK interpretation \(\eta\)-reductions	\[\left&lt; \pi_1 M, \pi_2 M \right&gt; \to_\eta M\] Also if  \(\Gamma \Vdash M : 1\) we have \(M \to_\eta *\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Risk Set	\[R = \{ j : x_j \geq a_j \}\] Note \(r_j = |R|\)<br> Note that this includes everyone that hasn't been censored yet right before \(t_j\)	MedicalStats PartIIINotes
Basic	Part III Notes::MedicalStats	<b>Definition:</b> Event Set / Death Set	\[D = \{ i : x_i = a_i \land v_i = 1 \}\] Note \(d_j = |D|\) The number of individuals with (uncensored) events exactly at \(t_i\)	MedicalStats PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log-Odds	Log of the odds as we know it: like '3 to 1 in our favor'.<br> For an arbitrary probability \(p\), the logit gives its log-odds. <br> For classification problems, it's always given by \[\log \left( \frac{p_k(x)}{p_j(x)} \right)\]	SLIP PartIIINotes
Basic	Part III Notes::SLIP	<b>Definition:</b> Log-Odds of the Logistic Classifier	\[\log \frac{p_k(x_i)}{p_K(x_i)} = x_i^T \beta_k\] with \(\beta_K = 0\) as the baseline.<br> Extending this, we have: \[\log \left(\frac{p_k(x_i)}{p_l(x_i)} \right) = x_i{^{\mathrm{T}}} (\beta_k - \beta_l)\]	SLIP PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> The Apparent Magnitude Equation	\[m = M + \mu\] where \(m\) is the true apparent magnitude, \(M\) is the absolute magnitude, and \(\mu\) is the distance modulus \(\mu = 25 + 5 \log_{10} (d)\).	Astrostatistics PartIIINotes
Basic	Part III Notes::Astrostatistics	<b>Definition:</b> Parallax Equation	\[\frac{\omega}{\mathrm{arcsec}} = \frac{\mathrm{parsec}}{d}\] where \(\omega\) is the true parallax angle and \(d\) is the distance to the star.<br> (note this uses small angle approximations, is not valid for large \(\omega\))	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> G	\[G := \forall y. (y = n \to \psi(y))\] where  \[\psi (x) := \forall z. (\delta(x, z) \to \theta(z))\] (\(\delta(x, z)\) is the \(\Sigma_1\)-representation of diag)<br> and \(n = \lceil \psi (x)\rceil\)<br>	LogicAndComputability PartIIINotes
Basic	Part III Notes::MedicalStats	MLE for the log-likelihood for the exponential survivor function	\[\hat \theta = \frac{V_+}{X_+}\] where \(V_+\) is the total number of (non-censored) events, and \(X_+\) is the total time spent at risk.	MedicalStats PartIIINotes
Basic	Part III Notes::SLIP	Model Complexity and the Bias-Variance Decomposition	Complex models have low bias, but high variance.<br> Simple models have high bias, but low variance.	SLIP PartIIINotes
Basic	Part III Notes::Astrostatistics	Goal of ELBO	We want to find a \(q\) that minimizes  \[KL(q(\theta) {\,||\,} P(\theta {\,|\,} D))\] (approximate our evidence with a 'close' \(q\) distribution)	Astrostatistics PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Crude Incompleteness	Let \(T\) be a recursive set of (Godel numbers of) \(L_{PA}\) sentences that is consistent (never contains both \(\phi\) and \(\neg \phi\)) and contains all the \(\Sigma_1\) and \(\Pi_1\) sentences provable in \(PA^-\).<br> Then there exists a \(\Pi_1\) sentence \(\tau\) such that \(\tau \not\in T\) and \(\neg \tau \not \in T\) (\(T\) can't prove \(\tau\) or its negation).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Godel-Rosser Theorem	Let \(T\) be a consistent \(L_{PA}\) theory extending \(PA^-\) and admitting a recursively enumerable axiomatization. <br> Then \(T\) is \(\Pi_1\) incomplete: there exists a \(\Pi_1\) sentence \(\tau\) for which \(T \not \vdash \tau\) and \(T \not \vdash \neg \tau\)<br> <br>Assume that \(T\) is complete to get a contradiction. Use Craig's trick to get a recursive axiomatization, do exhaustive proof search from axioms (paired with completeness) to show that \(S\) is recursive. But then we can apply Crude Incompleteness to get a contradiction.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursive (and countable) \(L_{PA}\)-structures	A (countable) \(L_{PA}\)-structure \({\mathcal{M}}\) is recursive if and only if there exist recursive functions \(\oplus: {\mathbb{N}}^2 \to {\mathbb{N}}\), \(\otimes: {\mathbb{N}}^2 \to {\mathbb{N}}\), a binary recursive relation \(\preceq \subseteq {\mathbb{N}}^2\), and natural numbers \(n_0, n_1 \in {\mathbb{N}}\) such that  \[ {\mathcal{M}} \cong ({\mathbb{N}}, \oplus, \otimes, \preceq, n_0, n_1)\]	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Prime Function	The \(\Sigma_1\) formula \(\mathrm{pr}(x,y)\) is true only when \(y\) is the \(x\)th prime number.  \[PA \vdash \forall x,\exists! y, \mathrm{pr}(x,y)\] So if \({\mathbb{N}}\) believes \(y\) is the \(x\)th prime number, all models of \(PA\) do. <br> We denote the \(k\)th prime \(\pi_k\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Overspill Lemma	Let \({\mathcal{M}}\) be a nonstandard model of \(PA\), \(\phi(x,\bar y)\) be an \(L_{PA}\) formula, and \(\bar a \in {\mathcal{M}}\).<br> If \({\mathcal{M}} \vDash \phi(n, \bar a)\) for all standard natural numbers \(n\), then there exists a nonstandard \(e \in M\) such that \({\mathcal{M}} \vDash \forall x \leq e. \phi(x,\bar a)\)	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	Proof of Overspill	Suppose there exists some function \(\mathrm{st}(x,\bar a)\), that, given some helper data \(\bar a\), is true only when \(x\) is a standard natural number. <br> But clearly we can apply 'st' inductively from zero: \[M \vDash \mathrm{st}(0, \bar a) \land \forall x. \mathrm{st}(x, \bar a) \to \mathrm{st}(x+1,\bar a)\] and remember by the inductive axiom of \(PA\), we must then have \(\forall x, \mathrm{st}(x,\bar a)\)! But that defeats the whole point: every element of a nonstandard model can't be standard!	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Canonical Coding	A subset \(S \subseteq {\mathbb{N}}\) is canonically coded in a model \({\mathcal{M}}\) of \(PA\) if there exists some element \(c \in {\mathcal{M}}\) such that \[S = \{ n \in {\mathbb{N}}: \exists y. (\pi_{n^*} \times y) = c \}\] where \(n^*\) is the standard natural number \(n\) in the model.<br> <br>We proved in class that for any nonstandard model \({\mathcal{M}}\) of \(PA\), there exists a non-recursive set \(S\) which is canonically coded in \({\mathcal{M}}\).	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Recursive Inseparability	Subsets \(A\), \(B\) of \({\mathbb{N}}\) are recursively inseparable if they are disjoint and there is no recursive subset \(C\) of \({\mathbb{N}}\) such that  \(B \cap C = \emptyset\) and \(A \subseteq C\).<br> <br>We proved in class that they must exist.	LogicAndComputability PartIIINotes
Basic	Part III Notes::LogicAndComputability	<b>Definition:</b> Tennenbaum's Theorem	Let \({\mathcal{M}} = (M, \oplus, \otimes, \preceq, n_0, n_1)\) be a countable nonstandard model of \(PA\). Then \(\oplus\) is not recursive.	LogicAndComputability PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> The Square Root Lasso	Minimizes \[\frac{1}{\sqrt{n}}||Y-\bar Y1 - X\beta||_2 + \gamma||\beta||_1\]	MSM PartIIINotes
Basic	Part III Notes::MSM	<b>Definition:</b> Group Lasso Penalty	\[\lambda \sum_{j=1}^q m_j ||\beta_{G_j}||_2\] where we have group partitions \(G_1...G_j\) and we usually define \(m_k\)	MSM PartIIINotes