#separator:tab
#html:true
#notetype column:1
#deck column:2
#tags column:5
Basic Part III Notes::MSM Definition: Bernstein's Condition A random variable X fulfills the Bernstein condition with parameters \(\sigma, b \geq 0\) if:
\[ {\mathbb{E}}(|X - {\mathbb{E}} X|^k) \leq \frac{1}{2} k! \sigma^2 b^{k-2} \]
for \(k = 2, 3, ...\) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Reproducing Kernel Hilbert Space (RKHS) A Hilbert space of functions \(f: X \to \mathbb{R}\) such that, for all \(x \in X\),\[f(x) = \left< f, K_x \right>\]
Note: this condition is equivalent to saying that the point-evaluation functional \[L_x = H \to \mathbb{R},\, L_x(f) = f(x)\] is continuous for every \(x \in X\).
So in other words \(L_x \in H^*\)
This is equivalent to the condition above by the Riesz Representation Theorem Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Reproducing kernel (of an RKHS) By definition, fixing some \(x \in X\), there exists some \(K_x\) such that for any \(f \in H,\, f(x) = \left< f, K_x \right>\).
If we plug in some \(y\) to \(K_x\), we then get \(K_x(y) = \left<K_x, K_y \right>\)
So our reproducing kernel is the function where we take some \(x, y\) pair and perform that process: \[K(x,y) = \langle K_x, K_y \rangle\] \(K: X \times X \to \mathbb{R}\) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Topological Ordering A permutation \(\pi\) of some set of nodes \([n], n\in {\mathbb{N}}\), is a topological ordering if:
\[\forall a,b \in [n],\, b \in \mathrm{de}(a) \implies \pi(a) < \pi(b)\] (Higher nodes in the DAG are sorted to the front) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Multiple Testing When we have a sequence of null hypotheses \(H_1...H_m\) where we have found associated p-values \(p_1...p_m\).
We let \(I_0\) represent the indices of the true nulls where no relationship exists, with \(m_0 = |I_0|\).
It's not as simple as just rejecting the small-enough p-values due to the jellybeans effect! Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM What does \(I_0\) represent in multiple testing? A set of hypothesis indices representing true nulls that should not be rejected. Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Family-Wise Error Rate (FWER) \[{\mathbb{P}} ( N \geq 1)\] where \(N\) is the number of false rejections.
For instance the FWER of the naiive approach to multiple testing is \[ 1 - (1 - \alpha)^m \] Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Bonferroni Correction Reject \(H_i\) if \(p_i \leq \frac{\alpha}{m}\) for desired \(\alpha\) level. Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Intersection Hypothesis When we want to test some subset of our multiple hypotheses \(I \subseteq [m]\) are all true: \(H_I\).
We can reject it by showing that any of the hypotheses can be rejected! Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Local Test
(and its critical requirement) Given some intersection hypothesis \(H_I\), a test \(\phi_I\) that is 1 if we reject \(H_I\) or zero otherwise. We require that the error rate of \(\phi_I\) be \(\leq \alpha\). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Closed Testing Procedure Suppose you have a local test \(\phi_I\) for testing intersection hypotheses. Reject \(H_I\) iff all superset local tests \(\phi_J,\, I \subseteq J\) are rejected (test everything above \(I\)). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Holm's Procedure Closed testing with the Bonferroni test as the local procedure: \[ \phi_I(x) = \begin{cases} 1 & \underset{i \in I}{\min} (p_i) \leq \frac{\alpha}{|I|} \\\\ 0 & \text{otherwise} \end{cases} \] Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Holm's Procedure as a Stepdown Procedure Let \(p_{(i)}\) represent the \(i\)th smallest p-value for respective hypothesis \(H_{(i)}\).
Step 1: if \(p_{(1)} \leq \frac{\alpha}{m}\), reject \(H_{(1)}\) and continue; otherwise accept \(H_{(1)}...H_{(m)}\)
...
Step i: if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\), reject \(H_{(i)}\) and continue; otherwise accept \(H_{(i)}...H_{(m)}\)
...
Step m: if \(p_{(m)} \leq \alpha\), reject \(H_{(m)}\) and finish; otherwise accept only \(H_{(m)}\)
So, to recap, iterate over p-values starting at the smallest, rejecting and continuing if \(p_{(i)} \leq \frac{\alpha}{m - i + 1}\) or ceasing otherwise. Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM How good is Holm's procedure? Uniformly more powerful than Bonferroni correction Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: False Discovery Rate (FDR) \[ {\mathbb{E}} \left( \frac{N}{R \lor 1} \right) \] where \(N\) is the number of false rejections, \(R\) is the total number of rejections, and \(\lor\) is the max function (so we never divide by zero). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Benjamin-Hochberg Procedure Order hypotheses by their p-values, from smallest to largest. Find the largest index underneath the slope \(p_{(i)} \leq \frac{\alpha i}{m}\): that index and everything below it is rejected, but everything above \(i\) is accepted. The FDR is guaranteed to be kept at \(\frac{\alpha m_0}{m} \leq \alpha\)
Visual idea: find the largest index point under the line with slope \(\frac{\alpha}{m}\), that is your cutoff point. Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Proposition: The reproducing kernel of an RKHS is well-defined. Given any two reproducing kernels \(k_x, h_x \in X \times X \to {\mathbb{R}}\), show that \[||k_x - h_x||^2_H = 0\] by representing as an inner product and simplifying:
\[\begin{align*} ||k_x - h_x||^2_H &= \left< k_x - h_x, k_x - h_x \right>\\\\ &= \left< k_x, k_x - h_x \right> - \left< h_x, k_x - h_x \right>\\\\ &= (k_x - h_x)(x) - (k_x - h_x)(x)\\\\ & \text{by the reproducing property}\\\\ &= 0 \end{align*}\] Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Every DAG has a topological ordering. First, prove by contradiction that there must be a node with no parents, called the source node (or else we must have a cycle, breaking our DAG assumption).
Next, build the graph up inductively from a graph with one node.
Base case: Clearly any graph with one node is a topological order.
Inductive step: We assume that ANY graph with \(n-1\) nodes, \(G_{n-1}\), has a topological order. Thus, given a graph \(G_n\) nodes, apply the lemma we proved above to get the parent node, and remove it to get a new graph \(G_{n-1}\). By the inductive step we have a topological order \(\pi_{n-1}\). Construct \(\pi_{n}\) by shifting every node in \(\pi_{n-1}\) down one, and putting the parent node at the top at 1. We're done. Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Bonferroni Correction yields a FWER \(\leq \alpha\) Just apply Markov's inequality and chug:
\[\begin{align*} {\mathbb{P}} (N \geq 1) &\leq {\mathbb{E}}(N)\\\\ & \text{ by the Markov inequality}\\\\ &\leq {\mathbb{E}}\left[ \sum_{i \in I_0} {\mathbb{1}}_{\{ p_i \leq \alpha / m \}}\right]\\\\ &\leq \sum_{i \in I_0} {\mathbb{P}}\left(p_i \leq \frac{\alpha}{m}\right)\\\\ &\leq \frac{\alpha m_0}{m}\\\\ &\leq \alpha \end{align*}\] Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Proposition: When all hypotheses are tested using closed testing, a false rejection will be made with at most probability \(\alpha\) False rejections can only be made if our intersection hypothesis is entirely composed of indices in \(I_0\). But then including upwards, we are guaranteed to check \(\phi_{I_0}\), which only happens with probability at most \(\alpha\).
It follows that the FWER for any particular \(H_I\) has at most level \(\alpha\). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Benjamin-Hochberg keeps FDR \(\leq \alpha\) We must assume that all \(p_i,\, i \in I_0\) are independent from all the other p-values.
Begin by breaking down R then N into their possible cases, and express as a disjoint probability in two terms.
Simplifying the event in the probability term: break down the condition \(R = r\) into what it means for \(r\) to be the max value. Define a new set \(p_{-i}\) where we've removed the \(i\)th p-value (shifting the indices to the left), so that they are independent of \(p_i\). Remark these two terms are equivalent themselves to a reduced Benjamin-Hochberg trial with a modified cutoff line \(\frac{\alpha}{m}\).
Plugging back in the rewritten event, seperate the probability block and simplify to get \(\leq \frac{\alpha m_0}{m} \leq \alpha\). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Conditional Independence
(Two equivalent definitions for \(X {\perp\!\!\!\perp} Y {\,|\,} Z\), for random vectors \(X, Y, Z\).) 1. \({\mathbb{P}}(X \in A,\, Y \in B {\,|\,} Z) = {\mathbb{P}}(X \in A{\,|\,} Z){\mathbb{P}}(Y \in B{\,|\,} Z)\)
2. \({\mathbb{P}}(X \in A {\,|\,} Y, Z) = {\mathbb{P}}(X \in A {\,|\,} Z)\)
(knowing \(Y\) doesn't give us any new information about \(X\))
note that here \(X \in A\) is best read as '\(X\) lands in \(A\)' Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Weak Union rule for Conditional Independence \[ X {\perp\!\!\!\perp} Y, Z \implies \begin{cases} X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\ X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \] We can move elements from the right-hand side to the conditioning side. Also works with some other conditioning \(W\). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Contraction rule for Conditional Independence \[ \begin{rcases*} X {\perp\!\!\!\perp} Z\\\\ X {\perp\!\!\!\perp} Y {\,|\,} Z \end{rcases*} \implies X {\perp\!\!\!\perp} Y, Z \] We can apply known independencies to the conditioning side. Also works with some other conditioning \(W\). Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Intersection rule for Conditional Independence \[ \begin{cases} X {\perp\!\!\!\perp} Y {\,|\,} Z\\\\ X {\perp\!\!\!\perp} Z {\,|\,} Y \end{cases} \implies X {\perp\!\!\!\perp} Y, Z \] Converse of the weak union rule. ONLY ALLOWED when \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM When are we allowed to use the intersection rule for conditional independence? When \(X, Y, Z, W\) have a joint density, and \(f_{Y,Z,W}\) is everywhere positive (need to ensure \(Y \neq Z\) basically) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MSM Definition: Conditional Independence Graph (CIG) Given a distribution \(P\) on \({\mathbb{R}}^p\), an undirected graph \(G = ([p], E)\) where, if \(Z \sim P\), \[ (j,k), (k,j) \in E \iff Z_j \not\mathclap{{\perp\!\!\!\perp}}\mkern12mu Z_k {\,|\,} Z_{-jk} \] where \(Z_{-jk}\) is a subvector of \(Z\) without indices \(j\) and \(k\).
Thus components are connected if there exists some dependence between them (even given all other information) Modern-Statistical-Methods-Notes MSM PartIIINotes
Basic Part III Notes::MedicalStats Definition: Censored Random Variable \[X := \min{(T, C)}\] \[V := \begin{cases} 1 & T \leq C\\\\ 0 & T > C \end{cases} \] Where \(T\) is the time to event and \(C\) is the time to censoring. \(V\) is called the visibility. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Survivor Function
(Also its two properties) \[F(t) = {\mathbb{P}}(T > t) = \int_t^\infty f(t)\,dt\]
1. Decreasing in \(t\)
2. When continuous, \(F(0) = 1\)
NOTE: Our professor denotes the survivor function as \(F(t)\), but the usual convention is to write \(S(t)\). MedicalStats PartIIINotes
Basic Part III Notes::MSM Definition: Law of Total Expectation
(AKA the tower property) \[{\mathbb{E}}({\mathbb{E}}(X|Y)) = {\mathbb{E}}(X)\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Chain Rule of Probability \[\begin{align*} {\mathbb{P}}(X,Y) &= {\mathbb{P}}(Y){\mathbb{P}}(X|Y)\\\\ &= {\mathbb{P}}(X){\mathbb{P}}(Y|X) \end{align*}\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Slutsky's Theorem Let \(X_n, Y_n\) be sequences of random elements. If both
\(X_n \overset{d}{\to} X\)
\(Y_n \overset{p}{\to} c\)
for some constant \(c\), then
1. \(X_n + Y_n \overset{d}{\to} X + c\)
2. \(X_nY_n \overset{d}{\to} Xc\)
3. \(X_n / Y_n \overset{d}{\to} X /c\) given \(c \neq 0\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: Undirected Graph Separation
(Given an undirected graph \(G = ([p], E)\) and disjoint sets \(A, B, S \subset [p]\), S separates \(A\) and \(B\) if...) Every path from \(A\) to \(B\) contains a node in \(S\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: Global Markov Undirected Graphs
(A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to an undirected graph \(G = ([p], E)\) when...) Taking any \(A, B, S \subset [p]\) with \(S\) separating \(A\) and \(B\), we have that for any \(Z \sim P\), \[ Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S \] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Objective of Conditional Independence Testing Given \(n\) iid copies \((x,y,z),\,...,\,(x_n,y_n,z_n) \in {\mathbb{R}} \times {\mathbb{R}} \times {\mathbb{R}}^p\) collected into \((X, Y, Z) \in {\mathbb{R}}^n \times {\mathbb{R}}^n \times {\mathbb{R}}^{n \times p}\), we wish to test the null hypothesis \(X {\perp\!\!\!\perp} Y {\,|\,} Z\). MSM PartIIINotes
Basic Part III Notes::MSM Intuitively, how can we think about \({\mathbb{E}}(X {\,|\,} Z)\), and how could we solve for such a thing? \({\mathbb{E}}(X {\,|\,} Z)\) represents our 'best guess' for \(X\) given some information \(Z\).
We can solve for \({\mathbb{E}}(X {\,|\,} Z)\) by finding an \(f\) that minimizes \[{\mathbb{E}}\left[(X - f(Z))^2\right]\] (this is called the L2 projection theorem, and it links together statistical regression and the conditional expectation) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Generalized Covariance Measure (GCM)
(Assuming we have data \((X_i, Y_i, Z_i)\) and fitted regression functions \(\hat f(z)\) and \(\hat g(z)\)...) The test statisitc \[T := \sqrt{n} \frac{\tau_N}{\tau_D}\] where \[\begin{align*} \tau_N &:= \frac{1}{n}\sum_{i=1}^n \hat\epsilon_i\hat\xi_i\\\\ \tau_D^2 &:= \frac{1}{n} \sum_{i=1}^n \hat\epsilon^2_i\hat\xi^2_i - \tau^2_N \end{align*}\] where we've defined \(\hat \epsilon_i := X_i -\hat f(Z_i)\) and \(\hat \xi := Y_i - \hat g(Z_i)\)
Explanation:
Our null hypothesis is that \[{\mathbb{E}}\left[ (X - {\mathbb{E}}(X{\,|\,} Z)) (Y- {\mathbb{E}}(Y{\,|\,} Z))\right]=0,\] so we just need to quantify deviance from that. \(\hat f\) and \(\hat g\) approximately represent \({\mathbb{E}}(X{\,|\,} Z)\) and \({\mathbb{E}}(Y{\,|\,} Z)\) respectively, and so we define our average deviance among samples as \(\tau_N\).
The CLT tells us \(\sqrt{n}\tau_N\) should be Gaussian (mean-zero under the null, iid per \(i\)), but we need to normalize for variance if we want a standard normal distribution. So we define \(\tau_N\) to do this, using \({\mathrm{Var}}(X) = {\mathbb{E}}(X^2) - {\mathbb{E}}(X)^2\). MSM PartIIINotes
Basic Part III Notes::MSM Theorem: The Generalized Covariance Measure is asymptotically normal Assuming there exists \(c > 0\) such that \({\mathrm{Var}}(\epsilon {\,|\,} Z), {\mathrm{Var}}(\xi {\,|\,} Z) < c\) almost surely,
\({\mathrm{Var}}(\epsilon\xi) > 0\),
and \[{\mathcal{E}}_f \to 0,\, {\mathcal{E}}_g \to 0,\, {\mathcal{E}}_f {\mathcal{E}}_g = o(n^{-1}),\] given \({\mathcal{E}}_f := {\mathbb{E}}\left[ \frac{1}{n} \sum_{i=1}^n (f(Z_i) - \hat f(Z_i))^2 \right]\) and likewise for \({\mathcal{E}}_f\), then \[T \sim N(0, 1)\] MSM PartIIINotes
Basic Part III Notes::MedicalStats Definition: Kaplan-Meier Estimate, Form 2 \[ \hat F(t) = \prod_{j:a_j \leq t} \left(1 - \frac{d_j}{r_j}\right)\]
with \(d_j = |\{ i: x_i = a_i \land v_i = 1 \}|\) (non-censured death times at event time \(a_i\))
and \(r_j = |\{ i : x_i \geq a_i \}|\) (all events at and after \(t = a_i\); number at risk)
Note this is based on the erroneous assumptions that \(a_i...a_m\) are fixed constants and that \(T\) is discrete only with events at \(a_j\). Though it ends up being computationally equivalent to method 1!
\(\frac{d_j}{r_j}\) represents the probability of any event occurring at \(a_i\) (of the number of people at risk, how many were known to have perished in the time interval?) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: The hypothesis tested in the Log Rank Test Null hypothesis: \(F_j^{(0)} = F_j^{(1)}\) for all \(j \in [m]\)
Alternate hypothesis: \(F_j^{(0)} \not= F_j^{(1)}\) for any \(j \in [m]\)
where \(F_j^{(k)}\) represents the survivor function over \((a_{j}, a_{j+1}]\) for the \(k\)th group in \(k \in \{0, 1\}\). (assume the \(a_j\) are all the event times of the two groups put together) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: The Log-Rank Test \[\frac{Z}{S} \underset{approx}{\sim} {\mathrm{N}}(0, 1)\]
with
\[Z = \sum_j Z_j = \sum_j d_j^{(0)} - d_j\frac{r_j^{(0)}}{r_j}\] Z is the number of observed (uncensored) events minus the number of expected events were \(H_0\) true.
NONEXAMINABLE: \[S^2 = \sum_j \frac{d_j (r_j - d_j)r_j^{(0)}r_j^{(1)}}{r_j^{2}(r_j-1)}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats What's a good rule of thumb for when the Log-Rank test will be powerful? The curves are well-seperated and don't cross MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Relative Risk \[\begin{align*} RR &= \frac{\text{sum of observed / sum of expected (in group 0)}}{\text{sum of observed / sum of expected (in group 1)}}\\\\ &= \frac{\sum_j d_j^{(0)} / \sum_j d_j \frac{r_j^{(0)}}{r_j} }{\sum_j d_j^{(1)} / \sum_j d_j \frac{r_j^{(1)}}{r_j}} \end{align*}\]
Events are \(RR\times\) more likely in group 0 than group 1. MedicalStats PartIIINotes
Basic Part III Notes::LogicAndComputability What do we mean by \(\Gamma \vdash \varphi\)? In the context \(\Gamma\), we have a proof of \(\rho\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What do we mean by \(\Gamma_1 A \vdash B\)? Adding A to the context \(\Gamma\), we have a proof of B. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\land\) - I) \[ \frac{\Gamma \vdash A \quad \Gamma\vdash B}{\Gamma \vdash A \land B } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\lor\) - I) \[ \frac{\Gamma \vdash A}{\Gamma \vdash A \lor B } \text{ and } \frac{\Gamma \vdash B}{\Gamma \vdash A \lor B} \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\land\) - E) \[ \frac{\Gamma \vdash A \land B}{\Gamma \vdash A} \text{ and } \frac{\Gamma \vdash A \land B}{\Gamma \vdash B} \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\lor\) - E) \[ \frac{ \Gamma \vdash A \lor B \quad \Gamma_1 A \vdash C \quad \Gamma_1 B \vdash C }{ \Gamma \vdash C } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\bot\) - E) For all \(A\), \[ \frac{ \Gamma \vdash \bot }{ \Gamma \vdash A } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\rightarrow\) - I) \[ \frac{ \Gamma_1 A \vdash B }{ \Gamma \vdash A \rightarrow B } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\rightarrow\) - E) \[ \frac{ \Gamma \vdash A \rightarrow B \quad \Gamma \vdash A }{ \Gamma \vdash B } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (?x) For all \(A\), \[ \frac{ \cdot }{ \Gamma_1 A \vdash A } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (LEM) For all \(A\), \[ \frac{ \cdot }{ \Gamma \vdash A \lor (A \rightarrow \bot) } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: IQC First-order intuitionistic logic. Includes the axioms for quantifiers (that's what the Q stands for) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What do we mean by \(\varphi [x := t]\) for some term t? Whatever \(\varphi\) is, with all instances of \(x\) replaced with \(t\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\exists\)-I) For all terms \(t\), \[ \frac{ \Gamma \vdash \varphi[x := t] }{ \Gamma \vdash \exists x. \varphi(x) } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\exists\)-E) \[ \frac{ \Gamma \vdash \exists x.\varphi \quad \Gamma_1\varphi \vdash \psi }{ \Gamma \vdash \psi } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\forall\)-I) For all \(x\) not free in \(\Gamma\), \[ \frac{ \Gamma \vdash \varphi }{ \Gamma \vdash \forall x.\varphi } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability (\(\forall\)-E) For all terms \(t\), \[ \frac{ \Gamma \vdash \forall x.\varphi }{ \Gamma \vdash \varphi [x := t] } \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What does \({\mathcal{U}}\) and \(V\) denote in the simply-typed \(\lambda\)-calculus? \({\mathcal{U}}\) is a countable set of primitive types.
\(V\) is an infinite set of variables. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What does \(\Pi\) represent in the simply-typed \(\lambda\)-calculus? Define its grammar \(\Pi\) is the set of types in \(\lambda\)-calculus, denoted \(\lambda_\Pi (\rightarrow)\): \[\begin{align*} \Pi :=&\,\,\, {\mathcal{U}}\\\\ &|\, \Pi \to \Pi \end{align*}\]
(also sometimes called the implicational fragment of \(\lambda_\Pi\)) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What does \(\lambda_\Pi\) represent in the simply-typed \(\lambda\)-calculus? Define its grammar and label it \(\lambda_\Pi\) is the set of terms in simply-typed \(\lambda\)-calculus: \[\begin{align*} \Pi :=&\,\,\, V\\\\ &|\, \lambda V : \Pi . \lambda_\Pi & (\lambda-\text{abstraction})\\\\ &|\, \lambda_\Pi \lambda_\Pi & (\lambda-\text{application}) \end{align*}\] LogicAndComputability PartIIINotes
Basic Part III Notes::SLIP Definition: Cumulant-Generating Function Natural logarithm of the moment-generating function:
\[K(t) = \log {\mathbb{E}} [e^{tX}]\]
Generates cumulants: \(\kappa_n\). Note that:
\(\kappa_1\) is the mean
\(\kappa_2\) is the variance
\(\kappa_3\) is the skewness
But then it gets weirder. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Normal Linear Model \[Y_i = x_i{^{\mathrm{T}}} \beta + \varepsilon_i\]
with \(\varepsilon_i \sim {\mathrm{N}}(0, \sigma^2)\) independent
Alternatively written \(Y = X\beta + \varepsilon\)
Note our assumptions: \(Y_i\) are independent and normally distributed; linear in \(X\).
SLIP PartIIINotes
Basic Part III Notes::SLIP MLE for the slope of a normal linear model \[\hat\beta = (X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}} Y\]
(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: The two MLEs for the variance \(\sigma^2\) of a normal linear model \[\tilde\sigma^2 = \frac{1}{n} ||Y - X\hat \beta||^2_2\] or if we'd rather have an unbiased \(\sigma^2\), \[\hat\sigma = \frac{1}{n - p} ||Y - X\hat \beta||^2_2\]
(By the Gauss-Markov Theorem, \(\hat \beta\) and \(\hat \sigma^2\) is the best linear unbiased estimator!) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Exponential Dispersion Family \[ \{ {\mathbb{P}}_{\theta,\phi}: \theta \in \Theta, \phi \in \Phi \} \]
\(\Theta \subset {\mathbb{R}}\), \(\Phi \subset (0, \infty)\), fulfilling that, for each \({\mathbb{P}}_{\theta, \phi}\), we have (\(\forall y \in {\mathbb{R}}\)):
\[ f(y; \theta, \phi) = f_0(y,\phi) \exp \left\{ \frac{y\theta - K(\theta)}{\phi} \right\} \]
where
\(K: \Theta \to {\mathbb{R}}\) is called the cumulant function (it's just here to normalize), and
\(\{ f_0(\cdot, \phi) : \phi \in \Phi \}\) is some family of density functions (that gets 'tilted' by the exponential portion).
SLIP PartIIINotes
Basic Part III Notes::SLIP What is the equation to solve for the cumulant function \(K(\theta)\) for exponential diffusion families? \[K(\theta) = \phi \log \int_{-\infty}^\infty e^{y\theta/\phi}f_0(y,\phi)\,dy\] SLIP PartIIINotes
Basic Part III Notes::SLIP What is the cumulant-generating function for an exponential diffusion family? \[K(t) = \log {\mathbb{E}} [e^{ty}] = \frac{K(\phi t + \theta) - K(\theta)}{\phi}\] SLIP PartIIINotes
Basic Part III Notes::SLIP What is the mean and variance of the exponential distribution family in terms of the cumulant function \(K(\theta)\)? What conclusions can we draw about \(\theta\)? \[ {\mathbb{E}}_{\theta, \phi}[y] = K'(\theta) \] \[ {\mathrm{Var}}_{\theta, \phi}[y] = \phi K''(\theta) \]
(this can be computed simply by taking the first or second derivative of the cumulant-generating function and plugging in \(t=0\))
If we assume our variance is nonzero, and since \(\phi\) must be positive, it must be the case then that \(K''(\theta) > 0\). Thus \(K'(\theta)\) must be strictly increasing and therefore invertible. So, considering the mean equation above, it makes sense to reparameterize everything in terms of \(\mu := K'(\theta)\)! SLIP PartIIINotes
Basic Part III Notes::SLIP What properties do we care about for an exponential dispersion family? We need a natural parameter \(\theta\),
a dispersion parameter \(\phi\),
a cumulant function \(K(\theta)\) which gives us \(\mu = K'(\theta)\),
a mean parameter space \(\mathcal{M} = { \mu(\theta : \theta \in \Theta)}\),
a mean function \(\theta \to \mu(\theta) = (K')^{-1}(\mu)\),
and a variance function \(V(\mu) = K'' \circ \theta(\mu)\). SLIP PartIIINotes
Basic Part III Notes::SLIP Give the properties of a normal distribution \(\mathrm{N}(\mu, \sigma^2)\) (\(\mu \in \mathbb{R}, \sigma > 0\)) interpreted as an exponential dispersion family \[\theta = \mu\] \[\phi = \sigma^2\] \[K(\theta) = \theta^2/2\] \[\mathcal{M} = \mathbb{R}\] \[\mu(\theta) = \theta\] \[V(\mu) = 1\] SLIP PartIIINotes
Basic Part III Notes::SLIP Give the properties of a binomial distribution \(\frac{1}{m}\mathrm{Bin}(m, p)\) (\(m \in \mathbb{N}, p \in (0, 1)\)) interpreted as an exponential dispersion family \[\theta = \mathrm{logit}(p) = \log(p / (1-p))\] \[\phi = 1/m\] \[K(\theta) = \log(1 - p) - \log(2) = \log(e^\theta + 1) - \log(2)\] \[\mathcal{M} = (0,1)\] \[\mu(\theta) = \mathrm{expit}(\theta)\] \[V(\mu) = \mu(1-\mu)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Generalized Linear Model Independent observations \((x_i, Y_i)\) follow a generalized linear model if, for some exponential dispersion family \(\{{\mathbb{P}}_{\mu, \phi}: \mu \in {\mathcal{M}}, \phi \in \Phi\}\):
1. \(Y_i \sim {\mathbb{P}}_{\mu_i, \phi_i}\) with \(\mu_i \in {\mathcal{M}}\) and \(\phi_i = \phi a\) with known \(a_1...a_n > 0\) and possibly unknown \(\phi>0\),
2. \(g(\mu_i) = x_i{^{\mathrm{T}}} \beta\) for some unknown \(\beta \in {\mathbb{R}}^p\) with a strictly monotonic twice differentiable known 'link function' \(g: {\mathcal{M}} \to {\mathbb{R}}\).
(linearity after applying the link function) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Residual Squared Error \[RSE = \sqrt{ \frac{\sum_{i=1}^n (y_i-\hat y_i)^2}{n - p} }\] where \(p\) is the number of parameters in our regression (denominator is the number of degrees of freedom). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: \(R^2\) (Coefficient of determination) \[R^2 = 1 - \frac{RSS}{TSS} = 1- \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2}\]
\(RSS\) is the residual sum of squares, and \(TSS\) is the total sum of squares.
Proportion of the variation predictable from our model. We want this to be large! SLIP PartIIINotes
Basic Part III Notes::SLIP What is the Logit function? \[{\mathrm{logit}}(x) = \log \left( \frac{x}{1-x}\right)\] SLIP PartIIINotes
Basic Part III Notes::SLIP What is the Expit function? \[{\mathrm{expit}}(x) = \frac{1}{1 + e^{-x}} = \frac{e^x}{1 + e^x}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Log Likelihood of a Generalized Linear Model
(given \(Y = (Y_1,...,Y_n){^{\mathrm{T}}}\), \(\beta\).) First define \(\theta_i = (K'){^{-1}} (g{^{-1}} (x_i {^{\mathrm{T}}} \beta))\). We have \[\begin{align*} {\mathcal{L}}(\beta, \phi) &= \sum_{i=1}^n \log f(Y_i; \theta_i, \phi_i)\\\\ &= {\sum_{i=1}^n} \log f_0(Y_i, \phi a_i) + {\sum_{i=1}^n} \frac{1}{\phi a_i}\{Y_i \theta_i - K(\theta_i)\} \end{align*}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Proposition: If an MLE exists for the Generalized Linear Model with the canonical link, it must be unique. If an MLE \[\hat \beta = {\underset{\beta \in {\mathbb{R}}^p}{\mathrm{argmax}}\,\,} {\mathcal{L}} (\beta, \phi)\] exists, then it must satisfy the score equation: \[\frac{\partial{\mathcal{L}}(\hat \beta, \phi)}{\partial \beta} = 0.\] Note that by our assumptions, we have \(\theta_i = x_i{^{\mathrm{T}}} \beta\), yielding: \[ \frac{\partial^2{\mathcal{L}} (\beta, \phi)}{\partial\beta\partial\beta{^{\mathrm{T}}}} = - {\sum_{i=1}^n} \frac{x_i x_i{^{\mathrm{T}}}}{\phi a_i} K''(x_i{^{\mathrm{T}}} \beta)\] Since \(K'' > 0\) and \(X\) is assumed to have full rank, the Hessian must be negative definite, and so if it exists, the MLE must be unique. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Generalized Pearson Chi-Squared Statistic for \(\phi\) \[\hat \phi = \frac{\chi^2}{n-p}\] \[\chi^2 = {\sum_{i=1}^n} \frac{(Y_i - K'(x_i{^{\mathrm{T}}} \hat \beta))^2}{a_i V(K'(x_i{^{\mathrm{T}}} \hat \beta))}\]
This comes from \({\mathrm{Var}}(Y_i) = a_i \phi V(\mu_i) \implies \phi\) plugged into the weighted OLS unbiased variance MLE \[\hat \sigma^2 = \frac{\sum_iw_i(Y_i - \hat\mu)^2}{n-p}\]
This is a consistent estimator for \(\phi\)! SLIP PartIIINotes
Basic Part III Notes::SLIP Explain how to derive Newton-Raphson Suppose we want to solve \(f(x) = 0\). If we start at some point \(x\), we can simply step downhill a bunch of times to get \(x'\) with smaller and smaller \(f(x')\) via a Taylor approximation:
\[0 = f(x') \approx f(x) + f'(x)(x' - x)\]
Rearranging yields
\[\begin{align*} 0 &= f(x) + f'(x)(x' - x)\\\\ -f'(x)(x' - x) &= f(x)\\\\ x' &= x - \frac{f(x)}{f'(x)} \end{align*}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Explain briefly what Fisher scoring is, in the context of solving a general linear model numerically Similar to Newton-Raphson, but replaces the Jacobian \(J(\beta)\) with the relevant Fisher information: \[I(\beta) = {\mathbb{E}}(J(\beta))\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Column Space Interpreting some \(m \times n\) matrix \(A\) as \(n\) column vectors, what do they span? \[c_1v_1 + ...+c_nv_n\] In other words, all possible products \(Ax\) for any \(x \in {\mathbb{R}}^n\). SLIP PartIIINotes
Basic Part III Notes::SLIP What is the orthogonal complement of the column space? The orthogonal complement of the column space is the left null space: \[v \in {\mathrm{col}}(X)^\perp \iff v{^{\mathrm{T}}} X = 0{^{\mathrm{T}}}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: PDF of the Multivariate Normal Distribution \[(2\pi)^{-n/2} |\Sigma|^{-1/2} \exp \left( -\frac{1}{2} (x - \mu){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu) \right)\] where \(\Sigma\) is the covariance matrix between the components.
Note that when the components are independent, we get just a diagonal \(\Sigma\) with \(\Sigma_{ii} = \sigma_i^2\), and this simplifies to just a product of univariate PDFs. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: \(\frac{\partial}{\partial v}Xv\) \[X\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} X\) \[X{^{\mathrm{T}}}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: \(\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv\) \[\frac{\partial}{\partial v}v{^{\mathrm{T}}} Xv = (X + X{^{\mathrm{T}}})v\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: {\mathrm{Var}}{(Xv)} = {\mathrm{Cov}}{(Xv)} \[{\mathrm{Var}}(Xv) = X \Sigma X{^{\mathrm{T}}}\] where \(\Sigma\) is the covariance matrix of the random vector \(v\). SLIP PartIIINotes
Basic Part III Notes::Astrostatistics MGF of the Univariate Normal RV \[\phi_X(t) = e^{i\mu t - \frac{1}{2} \sigma^2t^2}\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Easiest way to find the distribution of the sum of two random variables Use the property that the MGF of a sum of RVs is the product of their MGFs: \[\phi_{X+Y}(t) = \phi_X(t) * \phi_Y(t)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Variance of the sample mean \[{\mathrm{Var}}(\hat \mu) = \frac{\sigma^2}{n}\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Fisher Information Matrix \[\bigl[\mathcal{I}(\theta)\bigr]_{i, j} = \operatorname{E}\left[\left. \left(\frac{\partial}{\partial\theta_i} \log f(X;\theta)\right) \left(\frac{\partial}{\partial\theta_j} \log f(X;\theta)\right) \,\, \right| \,\,\theta\right]\] Or, equivalently, given some regularity conditions, \[ \bigl[\mathcal{I}(\theta) \bigr]_{i, j} = -\operatorname{E}\left[\left. \frac{\partial^2}{\partial\theta_i\, \partial\theta_j} \log f(X;\theta) \,\, \right| \,\, \theta\right]\,\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Univariate Unbiased Cramer-Rao Lower Bound Given an unbiased estimator \(T\) for \(\theta\), we have: \[{\mathrm{Var}}(\hat \theta) \geq \frac{1}{I(\theta)}\] (this is achieved by unbiased MLEs) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Multivariate Cramer-Rao Lower Bound For any unbiased estimator \(\vec T\) of \(\vec \theta\), \[ {\mathrm{Cov}} (\vec T) - I(\theta){^{-1}} \] must be positive-semi-definite. In particular, this means that \[{\mathrm{Var}}(T_i) \geq [I(\vec\theta){^{-1}}]_{ii}\] for any index \(i\). Astrostatistics PartIIINotes
Basic Part III Notes::SLIP Definition: Covariance Properties 1. \(\forall a, \, {\mathrm{Cov}}(X, a) =0\)
2. \(\forall a, b, \,{\mathrm{Cov}}(X + a, Y + b) = {\mathrm{Cov}}(X, Y)\)
3. \(\forall a, b, \,{\mathrm{Cov}}(aX, bY) = ab{\mathrm{Cov}}(X, Y)\) 4. \({\mathrm{Cov}}(X +Y, Z + W) = {\mathrm{Cov}}(X, Z) + {\mathrm{Cov}}(X,W) + {\mathrm{Cov}} (Y, Z) + {\mathrm{Cov}}(Y, W)\) SLIP PartIIINotes
Basic Part III Notes::LogicAndComputability If we know \(\Gamma \vdash \varphi\), what do we know about \(\Gamma_1 \psi\) and \(\Gamma [p:= \psi]\)? By Lemma 1.1.4, we must have \(\Gamma_1 \psi \vdash \varphi\)
Furthermore, if we have a primitive proposition \(p\) and any proposition \(\psi\), then \(\Gamma [p:= \psi] \vdash \varphi[p:=\psi]\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Context (in the \(\lambda\)-calculus) A set of pairs of the form \(x : \sigma\)
(where \(x \in V\) is a variable and \(\sigma \in \Pi\) is a type)
Note that we write \(\Gamma_1 x : \sigma\) for \(\Gamma \cup \{x : \sigma\}\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Range of \(\Gamma\) Denoted \(|\Gamma|\), the set of all types that appear in \(\Gamma\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Typability Relation
(basic definition) \[\Vdash\, \subseteq C \times \lambda_\Pi \times \Pi\] where \(C\) is the set of all contexts. Links every term in a context to a type. Must fulfill three properties (recursing on any term in \(\lambda_\Pi\)).
This type system is denoted \(\lambda_\Pi (\to)\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\alpha\)-equivalency Two \(\lambda\)-terms are \(\alpha\)-equivalent if they differ only in the name of the bound variables.
e.g. \(\lambda x : {\mathbb{Z}} . x^2 \equiv_\alpha \lambda y : {\mathbb{Z}}.y^2\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What does it mean for a \(\lambda\)-term to be closed? It has no free variables. LogicAndComputability PartIIINotes
Basic Part III Notes::SLIP Definition: Taylor Expansion
(For \(f\) around \(x_0\) for nearby \(x\)) \[\begin{align*} f(x)&\approx f(x_0) + f'(x_0)(x-x_0) + \frac{1}{2}f''(x_0)(x-x_0)^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{1}{n!} f^{(n)}(x_0)(x - x_0)^n \end{align*}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Maclaurin Expansion of \(e^x\) \[\begin{align*} e^x &= 1 + x + \frac{1}{2}x^2 + ...\\\\ &= \sum_{n=0}^\infty \frac{x^n}{n!} \end{align*}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Wilkes' Theorem The likelihood ratio test (LRT) statistic \[ D = -2\log \left( \frac{\text{Likelihood for null model}}{\text{Likelihood for alternative model}} \right) \] is asymptotically distributed with the chi-squared distribution with \(\dim \Theta - \dim \Theta_0\): \[D \sim \chi^2_{\dim \Theta - \dim \Theta_0}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Proof: Solve for the confidence interval around some parameter \(j\) of the estimated slope \(\hat \beta\) Note that we know \[I(\beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] Because \(\hat \beta \overset{p}{\to} \beta\), we can apply the continuous mapping theorem and Slutsky's Theorem to get: \[I(\hat \beta)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] This yields the interval \[C_\alpha := \left[ \hat \beta_j \pm z_{\alpha/2} \sqrt{[i(\hat \beta){^{-1}}]_{jj}} \right]\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Confidence Interval for GLMs given a new datapoint
(Denote the new datapoint \(x^* \in {\mathbb{R}}^p\), we want to find a confidence interval for \({\mathbb{E}}(Y | X = x^*) = g{^{-1}} (x^{*T} \beta)\).) \[C_\alpha := g{^{-1}}\left[ x^{*T}\hat \beta \pm z_{\alpha/2} \sqrt{x^{*T}i(\hat \beta){^{-1}} x^{*}} \right]\]
This comes from the fact that \[\hat \beta \sim N(\beta,\, i(\hat \beta){^{-1}})\] so \[x{^{\mathrm{T}}} \hat \beta \sim N(x{^{\mathrm{T}}}\beta,\, x{^{\mathrm{T}}} i(\hat \beta){^{-1}})\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Mean parameterization \(\tilde l(\mu, \phi)\) \[\begin{align*} \tilde l(\vec \mu, \phi) &= \log \tilde f(y, \vec \mu, \phi)\\\\ &= \log f(y, \theta(\vec \mu), \phi)\\\\ &= \sum_{i=1}^n \log f_0(y_i, \phi) + \sum_{i=1}^n \frac{1}{a_i \phi} (y_i\theta(\mu_i) - K(\theta(\mu_i)) \end{align*}\] Also known as the 'saturated' parameterization. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Deviance of a GLM \[D(\hat\mu) := -2 \phi (\tilde l(\hat \mu, \phi) - \tilde l (Y, \phi))\] where \(\tilde l\) is the mean parameterization of the log likelihood (where we can provide a separate \(\mu_i\) for each sample \(i\)). Note then we have
\[ D(\hat \mu) = -2 \phi \log \left( \frac{\sup_{\mu \in {\mathbb{R}}^n: \mu_i = g{^{-1}}(x_i{^{\mathrm{T}}} \beta ),\, \beta \in {\mathbb{R}}^p} \tilde f(Y; \mu, \phi)}{\sup_{\mu \in {\mathbb{R}}^n} \tilde f(Y; \mu, \phi) } \right) \] Here, the top model is the saturated model and the bottom model is the normal restricted model. So \(D(\hat \mu)\) is the likelihood ratio test statistic for \(H_0\): GLM, \(H_1\): saturated model.
Also note that deviance is independent of \(\phi\)! SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Dispersion-scaled Pearson Residuals \[e_i = \frac{Y_i - \hat \mu_i}{{\mathrm{Var}}(Y_i)} = \frac{Y_i - \hat\mu_i}{\sqrt{\hat\phi a_i V(\hat \mu_i)}}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Deviance Residuals \[d_i = {\mathrm{sgn}}(Y_i - \hat \mu_i) \sqrt{D_i(\hat\mu)}\] where \(D_i(\mu)\) is the \(i\)th summand of the deviance SLIP PartIIINotes
Basic Part III Notes::SLIP What is the relationship between dispersion-scaled Pearson residuals and deviance residuals? When \(Y_i \approx \hat \mu_i\), we have \(d_i = \hat \phi^{1/2} e_i + O(|Y_i - \hat \mu_i|^{3/2})\) SLIP PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Typing Rule for Variables 1. Type of standalone variables: \[\Gamma_1 x : \tau \Vdash x : \tau\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Typing Rule for \(\lambda\)-abstractions 2. Type of \(\lambda\)-abstractions:
\[\Gamma_1 x: \sigma \Vdash M : \tau \implies \Gamma \Vdash (\lambda x : \sigma . M): \sigma \to \tau\]
(assuming \(x\) does not occur in \(\Gamma\), \(M \in \lambda_\Pi\))
LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Typing Rule for \(\lambda\)-applications 3. Type of \(\lambda\)-applications: \[(\Gamma \Vdash M : \sigma \to \tau)\land(\Gamma \Vdash N : \sigma) \implies \Gamma \Vdash (MN): \tau\] (\(M,N \in \lambda_\Pi\))
LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Substitution in \(\lambda_\Pi\)
(Inductive definition) Given \(\lambda\)-terms M, N and a variable \(x\), we define the substitution \(M[x := N]\) by pattern matching inductively:
Case 1: M is a variable equal to x: \[x[x:=N] := N\]
Case 2: M is a variable not equal to x: \[y[x := N] := y\]
Case 3: M is a \(\lambda\)-abstraction: \[(\lambda a:\sigma. A)[x := N] := (\lambda a: \sigma.A[x := N])\] (assuming \(x \neq y\) and \(y\) not free in N)
Case 4: M is a \(\lambda\)-application: \[(PQ)[x := N] := (P[x := N])(Q[x := N])\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\beta\)-Reduction
(Name the two sides of the reduction) \[(\lambda x : \sigma.P)Q \rightarrow_\beta P[x:= Q]\]
The left side is called the \(\beta\)-redex and the right side is called the \(\beta\)-contraction.
We also write \(M \twoheadrightarrow_\beta N\) if M \(\beta\)-reduces to N after a finite number of reductions.
Note that this is preserved under \(\lambda\)-abstraction and \(\lambda\)-application: if \(P \to_\beta P'\), \[\lambda x : \sigma . P \to_\beta \lambda x : \sigma . P'\] \[PZ \to_\beta P'Z\] \[ZP \to_\beta ZP'\] (we can choose to evaluate in any order; we'll end up in the same place either way) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\beta\)-Equivalence The smallest equivalence relation containing \(\to_\beta\), denoted \(\equiv_\beta\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\eta\)-reduction Simplify pointless \(\lambda\)-abstractions: \[\lambda x:\sigma . (Px) \to_\eta P\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\beta\)-Normal Form M is in \(\beta\)-NF if there is no N such that \(M \to_\beta N\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability When we write a \(\lambda\)-term like \(ABC\), what do we actually mean? \[(AB)C\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Free Variables Lemma
(Given \(\Gamma \Vdash M : \sigma\),) then we must have:
1. For all \(\Gamma \subseteq \Gamma'\), \(\Gamma' \Vdash M : \sigma\).
2. The free variables of M occur in \(\Gamma\)
3. There exists some \(\Gamma' \subseteq \Gamma\) consisting of exactly the free variables of M, preserving \(\Gamma' \Vdash M : \sigma\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Generation Lemma Just read each type inference rule in reverse:
1. If \(\Gamma \Vdash x:\sigma\), \[x : \sigma \in \Gamma\]
2. If \(\Gamma \Vdash (\lambda x : M) : \sigma\), then there exists types \(\tau, \rho \in \Pi\) such that \[\Gamma_1 x : \tau \Vdash M : \rho \quad \land \quad \sigma = (\tau \to \rho)\]
3. If \(\Gamma \Vdash (MN) : \sigma\), then there exists types \(\tau \in \Pi\) such that \[\Gamma \Vdash M : \tau \to \sigma\quad \land \quad \Gamma \Vdash N : \tau\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Substitution Lemma 1. If \(\Gamma \Vdash M : \sigma\) and \(\alpha\) is some type variable, then: \[\Gamma[\alpha := \tau] \Vdash M:\sigma[\alpha := \tau]\] (how substitution behaves for types in a context)
2. If \(\Gamma_1 x : \tau \Vdash M : \sigma\) and \(\Gamma \Vdash N : \tau\), then \[\Gamma \Vdash M[x := N] : \sigma\] (how substitution behaves for terms in a context) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Subject Reduction If \(\Gamma \Vdash M : \sigma\) and \(M \to_\beta N\), then \[\Gamma \Vdash N : \sigma\] (\(\beta\)-reduction preserves types)
Proven by induction with the generation and substitution lemmas. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Theorem: Church-Rosser Theorem for \(\lambda_\Pi\) Supposing \(\Gamma \Vdash M : \sigma\), if \(M \twoheadrightarrow_\beta N_1\) and \(M \twoheadrightarrow_\beta N_2\), then there exists some \(L \in \lambda_\Pi\) such that \[N_1 \twoheadrightarrow_\beta L \quad \land \quad N_2 \twoheadrightarrow_\beta L\] with \(\Gamma \Vdash L . \sigma\).
Also called the confluence property or the diamond property. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Corollary to the Church-Rosser Theorem for \(\lambda_\Pi\) If \(M \in \lambda_\Pi\) admits a \(\beta\)-NF, then it is unique. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Give an example of a \(\lambda\) term that is impossible to type \[\lambda x.xx\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The height function A recursively defined map on types \(h: \Pi \to {\mathbb{N}}\) mapping:
1. \(v \in V\) to 0
2. \(\sigma \to \tau\) to \(1 + \max(h(\sigma), h(\tau))\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Theorem: The Weak Normalization Theorem for \(\lambda_\Pi(\to)\) Let \(\Gamma \Vdash M : \sigma\). There exists a finite reduction path \[M := M_0 \to_\beta M_1 \to_\beta ...\to_\beta M_n\] where \(M_n\) is a \(\beta\)-NF. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Theorem: The Strong Normalization Theorem for \(\lambda_\Pi(\to)\) Let \(\Gamma \Vdash M : \sigma\). There exists no infinite reduction chain \[M := M_0 \to_\beta M_1 \to_\beta ...\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Curry-Howard for IPC(\(\to\)) Let \(\Gamma\) be a context and \(\varphi\) be a proposition (type). Then:
1. If \(\Gamma \Vdash M : \varphi\), then \[|\Gamma| = \left\{ \tau \in \Pi : (x : \tau) \in \Gamma \text{ for some variable } x \right\} \vdash_{IPC} \varphi\]
2. If \(\Gamma \vdash_{IPC} \varphi\), then there exists some \(M \in \lambda_\Pi(\rightarrow)\) such that \[\left\{ (x_\psi : \psi) \,|\, \psi \in \Gamma \right\} \Vdash M : \varphi\] LogicAndComputability PartIIINotes
Basic Part III Notes::Astrostatistics Definition of CDF \[F(x) = P(X\leq x) = \int_{-\infty}^x f(x)\,dx\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Characteristic Function (Statistics) \[\varphi_X(t) = {\mathbb{E}}(e^{itX})\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: The Delta Method Just a second-order Taylor approximation with a statistics coat of paint: \[{\mathbb{E}}[g(X)] \approx g({\mathbb{E}} [X]) + \frac{1}{2} g''({\mathbb{E}}[X]) {\mathrm{Var}}(X)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Univariate General Cramer-Rao Lower Bound Given any estimator \(T\) for \(\theta\), we have: \[{\mathrm{Var}}(\hat \theta) \geq \frac{\left(1 + \frac{d}{d\theta} {\mathrm{Bias}}(\hat \theta)\right)^2}{I(\theta)}\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics How do you solve a problem with selection effects? Use Bayes' Theorem conditioning on the selection function (usually a step function), should end up with a normal CDF normalizing factor Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics State the Frequentist model of statistics (in the context of a regression) We have some true variables we want to compute, but we can only view data that is confounded with some source of error (following a distribution)
So the underlying (latent) variables we seek to understand are deterministic, there's just a bunch of stochastic error complicating matters. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics State the Bayesian model of statistics (in the context of a regression) Interpret probability as the degree of certainty in an event rather than its long-run chance. We start with some guess as to how we think the parameter is distributed, and use the data to update that prior into a more accurate posterior.
The parameters are themselves random variables! We estimate their distribution, not just a single value. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Monte Carlo Integration
(for integrating over some interval I) \[\hat I = \frac{1}{m} \sum_{i=1}^m f(\theta_i)\] where \(f(\theta)\) is some data processing to estimate some statistic (mean, variance, etc) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Monte Carlo integrator for the posterior mean \[f(\theta) = \theta\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Monte Carlo integrator for the posterior variance \[f(\theta) = (\theta - {\mathbb{E}}[\theta {\,|\,} D ])^2\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Monte Carlo integrator in an interval \([a,b]\) \(f(\theta) = I_{[a, b]}(\theta)\) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Variance of a Monte Carlo Integrator \[{\mathrm{Var}}(\hat I) = \frac{1}{m} {\mathrm{Var}}[f(\theta)]\] Note that this is independent of the dimension of \(\theta\)!
The Monte Carlo error is just the square root of this (the standard deviation)
Note that this can also be used to estimate the sample variance: \[\hat {\mathrm{Var}}(\{ f(\theta_i) \}) = \frac{1}{m-1} \sum_{i=1}^m(f(\theta_i) - \hat I)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Direct Sampling method for estimating a posterior distribution In the case where a posterior distribution can be broken down into the product of named distributions, simply draw from one distribution, feed into the next, etc, and do this to get a bunch of samples.
Note that it's easy to get the marginals this way: just ignore all but one value! Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Kernel Density Estimation Creates a smooth histogram from a bunch of samples \(\theta_i\):
\[\hat P(\theta {\,|\,} D) = \frac{1}{m} \sum_{i=1}^m N(\theta {\,|\,} D_i, b_w^2)\]
where \(b_w\) is the bandwidth. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Silverman's Rule of Thumb \[b_w = \left( \frac{4 \hat \sigma^5}{3m} \right)^{1/5}\]
where \(\sigma\) is the estimated sample standard deviation, \(\hat \sigma^2 = \hat {\mathrm{Var}}(\{ \theta_i \})\) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: The Metropolis Algorithm 1. Pick some \(\mu_0\)
2. Choose a new \(\mu_{\text{prop}} \sim {\mathrm{N}}(\mu_{\text{prev}}, \tau^2)\)
Note: the jump distribution has to be symmetric: \(J(a|b) = J(b|a)\)
3. Accept with probability \(\min(1, r),\,\, r = \frac{{\mathbb{P}}(\mu_{\text{prop}}{\,|\,}\vec y)}{{\mathbb{P}}(\mu_{\text{prev}}{\,|\,}\vec y)}\)
4. Repeat 2 and 3 until you have enough samples.
This works for higher dimensions as well, just draw from a multivariate Gaussian jump distribution. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics State two post-processing strategies to properly format MCMC results 1. Get rid of burn-in: throw out about the first 20 percent of your samples
2. Thinning: Throw out every other sample Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Stationary Distribution Given some distribution \(P\) and some random process kernel \(T\): \[\sum_x P(x)T(x \to x') = P(x')\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Detailed Balance
(and what it implies) Given some distribution \(P\) and some random process kernel \(T\): \[P(x)T(x \to x') = P(x')T(x'\to x)\] This implies that \(P\) is a stationary distribution: just sum both sides over \(x\)! Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Metropolis-Hastings 1. Pick some \(\mu_0\)
2. Choose a new \(\mu_{\text{prop}} \sim J(\mu_{\text{prev}}, \tau^2)\)
Note: the jump distribution need not be symmetric: \(J(a|b) \neq J(b|a)\)
3. Accept with probability \(\min(1, r),\,\, r = \frac{{\mathbb{P}}(\mu_{\text{prop}}{\,|\,}\vec y) / J(\mu_{\text{prop}}{\,|\,} \mu_\text{prev})}{{\mathbb{P}}(\mu_{\text{prev}}{\,|\,}\vec y)/ J(\mu_{\text{prev}}{\,|\,} \mu_\text{prop})}\)
4. Repeat 2 and 3 until you have enough samples. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Gibbs Sampling Assumes that we have marginal distributions along every parameter alone.
1. Pick some \(\vec \mu_0\)
2. Run one Gibbs cycle:
for each \(i \in [d]\), set \(\mu_{next,i} \sim {\mathbb{P}}(\mu_{prev, i}{\,|\,} \mu_{prev, -i}, y)\)
4. Repeat 2 until you have enough samples. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Metropolis-Within-Gibbs If we can't express the conditional distributions necessary to run Gibbs sampling, estimate them by attempting to take a Metropolis step Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics How do we choose a good jump function when tuning the Metropolis algorithm? We can use Laplace approximation to show that the best choice is \(c^2A{^{-1}}\), where \(A\) is the Hessian of the log-posterior and \(c \approx \frac{2.4}{\sqrt{\dim \theta}}\) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics What acceptance rate should we aim for when using the Metropolis algorithm? 44 percent in one dimension, 23 percent in dimensions greater than 5 Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics How can we compare the convergence of two different MCMC methods? Check that your Gelman-Rubin ratio is about 1.
Check the autocorrelation timescale, compare effective sample sizes. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Explain mixed samplers and parameter blocking Mixed samplers describe the process of choosing different means of sampling different parameters of the function we want to sample from.
Parameter blocking refers to updating some parameters together (in blocks), especially ones that are highly correlated (to prevent zig-zagging during Gibbs, for instance). Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Binary Relation
(and when it is said to contain another relation) Given some set \(X\), \(X \times X\) is said to be a relation on \(X\). It contains another relation \(Y\) if for every \((a,b) \in Y\), \((a,b) \in X\).
It's an equivalence relation when it is reflexive, symmetric, and transitive.
Note that any intersection of equivalence relations always is an equivalence relation. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The Three Laws of Lattices A lattice L is a set equipped with a a binary operators \(\wedge\) and \(\vee\) such that both operations obey:
- commutativity, and
- associativity
Plus together they must follow the absorption laws.
Note together these imply \(a \vee a = a\) and \(a \wedge a = a\) (this is called being idempotent) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Distributive Lattice A lattice that fulfills distributivity: \[a \wedge (b \vee c) = (a \wedge b) \vee (a \wedge c)\] Note this is true iff \[a \vee (b \wedge c) = (a \vee b) \wedge (a \vee c)\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Bounded Lattice A lattice where there exists \(\bot, \top \in L\) such that: \[a \vee \bot = a,\] \[a \wedge \top = a\] Note that these imply: \[a \wedge \bot = \bot\] \[a \vee \top = \top\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Complemented Lattice A bounded lattice such that, for all \(a \in L\), there exists some \(a^* \in L\) such that: \[a \wedge a^* = \bot,\] \[a \vee a^* = \top\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Boolean Lattice A lattice that is distributive, bounded, and complemented.
(Note that the structure this creates makes the complement function \(*\) equivalent to \(\lnot\) since we must have \(a \lor \lnot a = \top\) and \(a \land \lnot a = \bot\). This is also where \(\wedge\) and \(\vee\) come from! \(\lor\) prefers \(\top\) and \(\land\) prefers \(\bot\)) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Induced Order from a Lattice We say \(a \leq b\) when \(a \wedge b = a\) and equivalently \(a \vee b = b\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Induced Order of Bounded Lattices For any bounded lattice, for the induced order, we have: \[a \wedge b = \inf\{a, b\},\] \[a \vee b = \sup\{a, b\}\]
Conversely, for any partial order, finite meets and joins form a lattice as above.
Note then that for any total order, we have \(\wedge \equiv \min\) and \(\vee \equiv \max\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Draw the free distributive lattice on {x, y} \begin{tikzpicture}[ node/.style={circle, draw, fill=white, inner sep=2pt, minimum size=6pt}, label/.style={font=\small}, >=stealth ]
% Nodes \node[node] (bot) at (0, 0) {}; \node[node] (xay) at (0, 1.5) {}; \node[node] (x) at (-1.5, 3) {}; \node[node] (y) at (1.5, 3) {}; \node[node] (xvy) at (0, 4.5) {}; \node[node] (top) at (0, 6) {};
% Edges \draw (bot) -- (xay); \draw (xay) -- (x); \draw (xay) -- (y); \draw (x) -- (xvy); \draw (y) -- (xvy); \draw (xvy) -- (top);
% Labels \node[label, below] at (bot.south) {\(\bot\)}; \node[label, left] at (xay.west) {\(x \wedge y\)}; \node[label, left] at (x.west) {\(x\)}; \node[label, right] at (y.east) {\(y\)}; \node[label, left] at (xvy.west) {\(x \vee y\)}; \node[label, above] at (top.north) {\(\top\)};
\end{tikzpicture} LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Valuation A function \(v:L \to \{0, 1\}\) assigning truth values to statements in a consistent manner (respecting logical connectives) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Heyting Algebra A bounded lattice \(H\) equipped with an extra binary relation \(\Rightarrow: H \times H \to H\) such that \[a \wedge b \leq c \iff a \leq (b \Rightarrow c)\]
(So \(a \Rightarrow b\) can be thought of the weakest thing you can assume that, when combined with \(b\), gets you to \(c\)) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Heyting Homomorphism A function that preserves \(\wedge\), \(\vee\), and \(\Rightarrow\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Show that every Boolean algebra is a Heyting algebra We can construct our own \(\Rightarrow\) function: \[a \Rightarrow b := (\lnot a) \lor b\]
An example of this are the power sets of a set. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Show that every topology on a set X is a Heyting algebra We construct a \(\Rightarrow\) function: \[U \Rightarrow V := {\mathrm{Int}}(U^c \cup V)\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: H-Valuation Given some language \(L\) with a set of primitive propositions \(P\), an H-valuation \(v: P \to H\) is extended to \(L\) recursively by:
1. \(v(\bot) = \bot\)
2. \(v(A\land B) = v(A) \land v(B)\)
3. \(v(A\lor B) = v(A) \lor v(B)\)
4. \(v(A \to B) = v(A) \Rightarrow v(B)\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: H-Valid A proposition \(A \in L\) is H-valid if, for all H-valuations \(v\), \(v(A) = \top\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: H-Consequence A proposition \(A \in L\) is an H-consequence of a finite set of propositions \(\Gamma \subseteq L\) if, for all valuations \(v\), \(v(\bigwedge \Gamma) \leq v(A)\).
We denote this by \(\Gamma \vDash_H A\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Lemma: Soundness of Heyting Semantics If H is a Heyting algebra and v is an H-valuation, we always have: \[\Gamma \vdash_{IPC} A \implies \Gamma_{H,v} \vDash A\]
Proven over the structure of all IPC rules. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Example: Show that the LEM is not intuitionistically valid By the soundness of Heyting semantics, we need only to find an example of a Heyting algebra with a valuation where \(v(p \lor \lnot p) \neq \top\) for some \(p \in L\).
Consider the topology \(\{ \emptyset, \{ 1\}, \{1,2\} \}\) (called the Sierpinski space). Take some primitive proposition \(p\), and define our valuation so that \(v(p) = \{1\}\). Then we have
\[v(\lnot p) = v(p) \Rightarrow \emptyset = {\mathrm{Int}} (\{1,2\} \backslash \{ 1\}) = \emptyset\]
and thus
\[v(p \lor \lnot p) = v(p) \lor v(\lnot p) = \{ 1\} \neq \top = \{ 1, 2\}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Pierce's Law \[((p \to q) \to p) \to p\] This is a tautology in classical logic, but not intuitionistically valid. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Example: Show that Pierce's Law is not intuitionistically valid. By the soundness of Heyting semantics, we need only to find an example of a Heyting algebra with a valuation where \(v(p \lor \lnot p) \neq \top\) for some \(p \in L\).
Take the usual topology on \({\mathbb{R}}^2\) with the valuation:
\[p \longmapsto {\mathbb{R}}^2 \backslash \{ 0, 0\}\] \[q \longmapsto \emptyset\]
Plugging in, you get \(((p \to q) \to p) \to p = {\mathbb{R}}^2 \backslash \{0,0\} \neq \top = {\mathbb{R}}^2\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Lindenbaum-Tarski Algebra A Lindenbaum-Tarski Algebra \(F^Q(T)\) over an L-theory T is an algebra over equivalence classes of propositions where \(\varphi \sim \psi\) \[T, \varphi \vdash_Q \psi \text{ and } T, \psi \vdash_Q \varphi\] (equivalence class over \(\varphi \iff \psi\))
If \(\bowtie\) is any logical connective in \(T\), we set \([\varphi] \bowtie [\psi]\) where \([\varphi \bowtie \psi]\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Show that the LT Algebra of any theory relative to IPC \(\backslash \to\) (IPC without implication) is a distributive lattice. Commutativity and associativity are trivial, just show the two absorption laws and a distributivity law by IPC rules: \[([A] \lor [B])\land [A] = [(A \lor B) \land A] = [A]\] \[([A] \land [B])\lor [A] = [(A \land B) \lor A] = [A]\] \[[A] \land ([B] \lor [C]) = [A \land (B \lor C)] = [(A \land B) \lor (A \land C)] = ([A] \land [B]) \lor ([A] \land [C]) \] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Lemma: Show that the LT algebra of any theory relative to IPC is a Heyting Algebra We showed that IPC \(\backslash \to\) is a distributive lattice, so we just need to show that IPC has \(\Rightarrow\) and is bounded. Use \(\to\) as \(\Rightarrow\), \(\bot\) as \(\bot\), and \(\bot \to \bot\) as \(\top\). So just show that:
Assuming \([A] \land [B] \leq [C]\), we have \([A] \leq ([B] \Rightarrow [C])\)
Assuming \([A] \leq ([B] \Rightarrow [C])\), we have \([A] \land [B] \leq C\)
Both directions needed to make \(\bot \leq A\) for all \(A\)
Both directions needed to make \(A \leq \top\) for all \(A\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Theorem: Completeness of Heyting semantics A formula is provable in IPC if it is H-valid for every Heyting algebra H (and only if by soundness) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Principal Up-set
(of some poset \(S\)) \[a\uparrow\, := \{s \in S: a \leq s\}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Terminal Segment
(of a poset \(S\)) A subset \(U \subseteq S\) where \(a \uparrow\, \subseteq U\) for every \(a \in U\)
Note that this isn't the same as a filter: for any lattice with incomparable elements, we can escape the terminal segment with finite meets! LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: If \(S\) is a poset, \(T(S) := \{ U \subseteq S : U \text{ is a terminal segment} \}\) is a Heyting algebra Order by set inclusion, the fact that this is a bounded lattice and closed is taken as trivial. The key step is using \[U \Rightarrow V := \{ s \in S: (s \uparrow) \cap U \subseteq V \}\]
Must show that this is a terminal segment and both directions of the Heyting Algebra definition. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Kripke Model Let P be some set of primitive propositions. A triple \((S, \leq, \Vdash)\) where \((S, \leq)\) is a poset, with S called the set of worlds and \(\leq\) the accessibility relation; and \(\Vdash \subseteq S \times P\) is the forcing relation.
It must fulfill the Persistence Property.
Note that for \(X \subseteq S\) we say \(X \Vdash \varphi\) if all worlds in \(x \in X\) force \(\varphi\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Extended Forcing Relation of the Kripke Model When \(p\) is not a primitive proposition, we define the forcing relation inductively as follows:
1. There is no world that believes in \(\bot\): \(\not \exists s,\, s \Vdash \bot\)
2. We write \(s \Vdash \varphi \land \psi\) iff \(s \Vdash \varphi\) and \(s \Vdash \psi\)
3. We write \(s \Vdash \varphi \lor \psi\) iff \(s \Vdash \varphi\) or \(s \Vdash \psi\)
4. We write \(s \Vdash (\varphi \to \psi)\) iff, for every more knowledgeable world \(s \leq s'\), \(s' \Vdash \varphi\) implies \(s' \Vdash \psi\)
(worlds are humble: they believe only in things that they believe more knowledgeable worlds believe in)
LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What are the conditions for negation in a Kripke model? \[s \Vdash \lnot \varphi \iff \forall s' \geq s,\,\, s' \not \Vdash \varphi\] \(\lnot \varphi\) holds iff no more knowledgeable worlds believe in it LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What are the conditions for double negation in a Kripke model? \[s \Vdash \lnot \lnot \varphi \iff \forall s' \geq s,\,\, \exists s'' \geq s',\,\, s'' \Vdash \varphi\] In all more knowledgeable worlds, there exists some even more knowledgeable world that believes in \(\varphi\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Filter A filter \(F\) on a lattice \(L\) is a subset \(F \subseteq L\) with the following properties:
1. \(F \neq \emptyset\)
2. F is a terminal segment of \(L\): ie if \(x \leq y\) and \(x \in F\), then \(y \in F\).
3. \(F\) is closed under finite meets.
LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Prime Filter A filter is prime if \(x \vee y \in F \implies x \in F \text{ or } y \in F\). LogicAndComputability PartIIINotes
Basic Part III Notes::MedicalStats Definition: The Hypergeometric Distribution The probability of \(k\) successes given \(n\) draws, where we draw without replacement from \(N\) objects with \(K\) that we want.
\[\mathrm{Hypergeometric}(N, K, n)\]
(probability distribution over \(k\)) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Expectation of the hypergeometric distribution \[{\mathbb{E}}(\mathrm{Hypergeometric}(N, K, n)) = n \frac{K}{N}\] MedicalStats PartIIINotes
Basic Part III Notes::MSM What is the benefit of using the Benjamin-Hochberg Procedure? It's guaranteed to keep the FDR beneath \(\alpha\): \[\frac{\alpha m_0}{m} \leq \alpha\] MSM PartIIINotes
Basic Part III Notes::SLIP State the cumulant function for the EDF of a normal random variable. \[K(\theta) = \frac{1}{2}\theta^2\]
Note that this implies \[\mu = \frac{d}{d\theta}K(\theta) = \theta\] and \[\sigma^2 = \phi \frac{d^2}{d\theta^2} K(\theta) = \phi\] SLIP PartIIINotes
Basic Part III Notes::SLIP State the base function \(f_0(y;\phi)\) of the normal EDF \[f_0(y;\phi) = \frac{1}{\sqrt{2\pi\phi}} \exp\left(-\frac{y^2}{2\phi}\right)\] SLIP PartIIINotes
Basic Part III Notes::SLIP How can you find the \(\theta\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\) \[\theta = {\mathrm{logit}}(p) = \frac{p}{1-p}\] and so \[p = {\mathrm{expit}}(\theta) = \frac{e^\theta}{1+e^\theta}\] SLIP PartIIINotes
Basic Part III Notes::SLIP How can you find the \(\phi\) for the EDF representing \(\frac{1}{n}\mathrm{Bin}(n,p)\) \[\phi = \frac{1}{n}\] and so \[n = \frac{1}{\phi}\] SLIP PartIIINotes
Basic Part III Notes::SLIP State the cumulant function of the EDF for \(\frac{1}{n}\mathrm{Bin}(n,p)\) \[K(\theta) = \log(1 + e^\theta) - \log(2)\] SLIP PartIIINotes
Basic Part III Notes::SLIP State the normalizing \(f_0(y;\phi)\) for the EDF of \(\frac{1}{n}\mathrm{Bin}(n,p)\) \[{n\choose{yn}} 2^{-n}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Classifier A function \(h: {\mathbb{R}}^p \to [K]\), typically of the form \[h(x) = h(x; (X_i, Y_i)_{i \in [n]})\]
where \(p\) is the dimension of the classified points, \(K\) is the number of classification groups, and \(n\) is the number of data points. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Misclassification Loss \[L(y, y') = {\mathbb{1}}_{\{ y \neq y'\}}\]
(really obvious, just 1 when you've got the wrong classification or 0 if not) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Prediction Error of a Classifier \[R(h) = {\mathbb{E}}[L(Y^*, h(X^*))] = {\mathbb{P}}[Y^* \neq h(X^*)]\]
where \((X^*, Y^*) \in (X^K, Y^K)\) are test points
We want to find classifiers \(h\) that minimize this! SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Conditional Class Probabilities \[p_k(x) = {\mathbb{P}}(Y^* = k {\,|\,} X^* = x)\]
where \(k\) is some classification in \([K]\). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Bayes Classier The minimal prediction error over all classifiers is called the Bayes risk. It is attained by the Bayes classifier \[h^*(x) = {\underset{k \in [K]}{\mathrm{argmax}}\,\,}{p^*_k(x)}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Core idea behind decision trees Break covariate space down into rectangular regions \(R_1, R_2, ..., R_M\), each with some contributing factor \(c_m\). When \(x\) falls within each, that region contributes its \(c_m\) to a final total: \[x \mapsto \sum_{m=1}^M c_m {\mathbb{1}}_{x \in R_m}\] The hard part is finding the regions and coefficients that work!
Idea: Start with the entire space in one big rectangle, then split it into two smaller pieces, staying axis-aligned: \[R_l(j, t) = \{ x\in R: x_j \leq t \}\] \[R_r(j, t) = \{ x \in R: x_j > t \}\] where \(j \in [p]\) are the split variables and \(t \in {\mathbb{R}}\) are the split points (we choose only one variable, and one location along that variable, to split over at a time)
Find the best split variable and split point to minimize a cost function \(Q(j, t)\). SLIP PartIIINotes
Basic Part III Notes::SLIP Explain the general algorithm for a regression tree Given data \((x_i, y_i)_{i \in [n]}\) in \({\mathbb{R}}^p \times {\mathbb{R}}\), continually split any region so as long as it has at least 2 data points.
Let \(T_j\) be the set of midpoints between adjacent points in a region \(x_{ij} \in {\mathbb{R}}\). Compute the cost-minimizing split to decide where to split: \[(\hat j_R, \hat t_R) \in {\underset{j \in [p],\, t \in T_j}{\mathrm{argmin}}\,\,} Q(j,t)\] where \(Q(j,t)\) is the loss after each split.
Then perform the split and continue!
SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Regression decision tree loss It's just the best case loss of the left rectangle (best choice of \(c_l\)) plus the best case loss of the right rectangle (best choice of \(c_r\)) minus the current loss of the rectangle (using \(c\)): \[\begin{align*} Q_{reg}(j,t) &=& \min_{c_l \in {\mathbb{R}}} \sum_{i : x_i \in R_l(j,t)} (y_i-c_l)^2 &=& \sum_{i : x_i \in R_l(j,t)} (y_i-c(R_l(j,t))^2\\\\ &&+ \min_{c_r \in {\mathbb{R}}} \sum_{i : x_i \in R_r(j,t)} (y_i-c_r)^2 &&+ \sum_{i : x_i \in R_r(j,t)} (y_i-c(R_r(j,t))^2\\\\ &&- \min_{c \in {\mathbb{R}}} \sum_{i: x_i \in {\mathbb{R}}} (y_i-c)^2 &&+ \sum_{i : x_i \in R} (y_i-c(R))^2 \end{align*}\] where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.
Note that this can never be positive! Splitting will never make things worse (because we could just have two splits with the same old height \(c\)). SLIP PartIIINotes
Basic Part III Notes::SLIP Final regression function of a regression tree \[\hat f_{CART}(x) = \sum_{m=1}^M \hat c_m {\mathbb{1}}_{x \in R_m}\] with \(\hat c_m = c(R_m)\), where \[c(R) = \frac{1}{N(R)}\sum_{i: x_i \in {\mathbb{R}}} y_i\] where \(N(R)\) is the number of points in a particular region.
\(R_m\) are called the leaves of the decision tree! SLIP PartIIINotes
Basic Part III Notes::SLIP Heuristic for when a regression tree is done growing Each region should contain no more than 5 points SLIP PartIIINotes
Basic Part III Notes::SLIP Runtime of growing a regression tree (when optimized) O(n) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Define a unitary matrix A matrix \(A\) where \(A^* = A{^{-1}}\),
where \(A^*\) is the complex conjugate
For real matricies this is the same as being orthogonal SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Define an orthogonal matrix, what do they represent? A matrix \(A\) where \(A{^{\mathrm{T}}} = A{^{-1}}\)
it represents rotations and reflections (but not stretching) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Thin Singular Value Decomposition Given \(X \in {\mathbb{R}}^{n\times p}\), the decomposition \[X = U D V{^{\mathrm{T}}}\] where \(U \in {\mathbb{R}}^{n\times p}\) is a unitary matrix,
\(D \in {\mathbb{R}}^{p\times p}\) is a sorted diagonal matrix \((D_{11} \geq D_{22} \geq ...)\),
and \(V \in {\mathbb{R}}^{p \times p}\) is also unitary.
This can be thought of as a process where a space is rotated/reflected, stretched on its axes, then rotated/reflected again. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Principal Component Given some \(X \in {\mathbb{R}}^{n\times p}\) row-centered (\(X^T\)1 = 0), PCA finds a sequence of \(p\) unit vectors \(v_j\), where each \(v_j\) minimizes the sample variance \({\mathrm{Var}}{(Xv_j)} = \frac{1}{n}v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) while staying orthogonal to all previous \(v_i\)s.
We call \(Xv_j\) the principal components. SLIP PartIIINotes
Basic Part III Notes::SLIP State how a thin singular value decomposition yields the principal components Given a singular value decomposition \(UDV\), substituting into the sample variance \(v{^{\mathrm{T}}} X{^{\mathrm{T}}} X v\) and optimizing tells us that \(v_j = V_j\) (\(j\)th column of V) and so for the principal component, we have: \[Xv_j = XV_j = D_{jj}U_j\] where \(U_j\) is the \(j\)th row of \(U\). SLIP PartIIINotes
Basic Part III Notes::SLIP In practice, how many principal components do we use? Take the smallest \(J \leq P\) such that \[\frac{\sum_{j=1}^J D_{jj}^2}{\sum_{j=1}^p D_{jj}^2} \geq 0.8\] SLIP PartIIINotes
Basic Part III Notes::SLIP Goal of K-Means Clustering Find a 'good enough' labelling minimizing
\[ L(c_1...c_k) = \sum_{k=1}^K \frac{1}{2 |C_k|} \sum_{i \in C_k} \sum_{j \in C_k} ||x_i - x_j||^2_2\]
(minimize within-cluster variation)
But this is NP-hard! Need a heuristic solution. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Centroid of a cluster \[\bar X_k = \frac{1}{|C_k|} \sum_{i\in C_k} X_i\] (center of mass)
Note that we can represent the within-cluster variance loss in terms of this: \[L(c_1...c_k) = \sum_{k=1}^K \sum_{i \in C_k} ||x_i - \bar x_k||^2_2\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: K-Means Clustering Algorithm 1. Initialize a \(K\)-partition \(C_1...C_k\) randomly.
2. Until convergence:
2-1. Compute the centroid of each cluster \(\hat x_i\)
2-2. Construct a new partition \(\tilde C_1...\tilde C_k\) by assigning each point to the closest centroid.
Note that this only finds a local minimum, not always the global minimum! SLIP PartIIINotes
Basic Part III Notes::SLIP K-Means best practices (how do we choose K?) To choose K, try a bunch and see which ones offer the most explainability. (don't just increase to minimize loss; \(K = n\) always does that)
Good clusterings should be stable with repeated runs, or with the addition or removal of a few data points. SLIP PartIIINotes
Basic Part III Notes::SLIP The two MLEs for the variance of a normal linear model SLIP PartIIINotes
Basic Part III Notes::SLIP Explain how Newton-Raphson can be used to solve an MLE numerically We want to find the \(\theta\) for which \(\frac{dl(\vec\theta;x)}{d\vec\theta} = 0\). So, use Newton-Raphson with the Hessian
\[\frac{d^2l(\vec\theta;x)}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}\] except we want the maximum, not the minimum, so make sure to use the negation instead!
\[\hat \theta^{(n+1)} = \hat \theta^{(n)} - \left( \frac{d^2l}{d\vec\theta d\vec\theta{^{\mathrm{T}}}}(\hat\theta^{(n)}) \right){^{-1}}\frac{dl}{d\vec\theta}(\hat\theta^{(n)})\]
SLIP PartIIINotes
Basic Part III Notes::SLIP When we inspect GLM residuals (Pearson or Deviance), what are we looking for? Should both be approximately normal with mean zero, variance 1.
Note this is based on small-dispersion asymptotics: \[\phi_i^{-1/2} (y_i - \mu_i) \overset{d}{\to} N(0, V(\mu_i))\] as \(\phi_i \to 0\). SLIP PartIIINotes
Basic Part III Notes::SLIP What do we need to be careful of when using dispersion-scaled residuals in R? They aren't normalized by \(\hat \phi ^{-1/2}\), they use: \[\frac{y_i - \hat \mu_i}{\sqrt{a_i V(\mu_i)}}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Sigmoid Function \[\sigma(x) = \frac{e^x}{(1 + e^x)}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: ReLU Function \[\sigma(x) = \max(0, x)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Softmax Function \[g_k(z) = \frac{e^{z_k}}{\sum_{r=1}^K e^{z_k}}\] for \(k \in [K]\) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Universal Approximation Theorem for Neural Networks Any (borel-measurable) function can be approximated to a prescribed precision by a single hidden layer neural network with any nonlinear activation function. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Cross-Entropy Loss \[L(\theta) = -\frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K Y_{ik} \log P_k^{(\theta)}(X_i)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Gradient Descent for Neural Networks \[\theta^{(t + 1)} = \theta^{(t)} - \alpha_t \nabla_\theta \mathrm{Err}_\mathrm{tr}(\theta^{(t)})\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: The Continuous Mapping Theorem Given a function \(g: X \to Y\) that is continuous except at points of measure zero, \(g\) preserves convergence almost surely, in probability, and in distribution. \[X_n \overset{a.s.}{\to} X \implies g(X)_n \overset{a.s.}{\to} g(X)\] \[X_n \overset{p}{\to} X \implies g(X)_n \overset{p}{\to} g(X)\] \[X_n \overset{d}{\to} X \implies g(X)_n \overset{d}{\to} g(X)\] SLIP PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Absorption Laws of Lattices \[a \vee (a \wedge b) = a,\] \[a \wedge (a \vee b) = a\] LogicAndComputability PartIIINotes
Basic Part III Notes::SLIP Definition: Akaike Information Criterion (AIC) \[AIC(M) = 2 \dim \Theta-2 \log f(z; \hat\theta)\] Smaller is better! SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Bayesian Information Criterion (BIC) \[BIC(M) = (\log n) \dim \Theta-2 \log f(z; \hat \theta)\] where \(n\) is the number of data points.
Smaller is better! SLIP PartIIINotes
Basic Part III Notes::SLIP When should you use AIC versus BIC? Use AIC in predictive applications, use BIC in explanatory applications (since it selects for simpler models; extra weight on the \(\dim \Theta\) term) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Prediction Error \(\mathrm{Err}\) (Risk) \[\mathrm{Err} = {\mathbb{E}} [ L(Y^* , \hat f(X^*))]\] where \((X^*, Y^*)\) is a random vector, the test data (space of all possible outcomes) and \(\hat f\) is an arbitrary regression function fit using training data. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: In-Sample Error \(\mathrm{Err}_\mathrm{in}\) \[\mathrm{Err}_\mathrm{in} = \frac{1}{n} \sum_{i=1}^{n} {\mathbb{E}} [ L(Y^*_i , \hat f(x_i))]\] (this is for a fixed design scenario, where the \(x_i\)s are fixed but the outcome is not. \(\hat f\) is fit using specific draws \((x_i, y_i)\) but evaluated against any possible set of \(Y^*_i\) (still at fixed \(x_i\)). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Training Error \(\mathrm{Err}_\mathrm{tr}\) \[\mathrm{Err}_\mathrm{tr} = \frac{1}{n} \sum_{i=1}^{n} L(y_i , \hat f(x_i))\] The evaluation of \(\hat f\) against the training data itself. SLIP PartIIINotes
Basic Part III Notes::SLIP Show that we have the relation:
\[\mathrm{Err}_\mathrm{in} = {\mathbb{E}}(\hat{ \mathrm{Err}_\mathrm{tr}}) + \frac{2}{n} \sum_{i=1}^n {\mathrm{Cov}}(\hat f (x_i), Y_i)\] for L2 loss. Proof sketch:
Just expand the squared terms. You'll need to use that \[{\mathbb{E}}[Y_i^* \hat f(X_i)] = {\mathbb{E}}[Y_i^*]{\mathbb{E}}[\hat f(X_i)] = {\mathbb{E}}[Y_i]{\mathbb{E}}[\hat f(X_i)]\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Bias-Variance Tradeoff (of the prediction error) If we're trying to model \[Y_i = f(X_i) + {\varepsilon}_i\] where \(f: {\mathbb{R}}^p \to {\mathbb{R}}\), \({\mathbb{E}}({\varepsilon}_i) = 0\), \({\mathrm{Var}}({\varepsilon}_i) = \sigma^2 > 0\), then we have the following decomposition of prediction error: \[\mathrm{Err} = \sigma^2 + {\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*) {\,|\,} X^*)] + {\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*) {\,|\,} X^*)]\]
Or for the in-sample error: \[\mathrm{Err}_\mathrm{in} = \sigma^2 + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[ \mathrm{Bias}^2(\hat f(X^*_i) {\,|\,} X^*_i)] + \frac{1}{n}\sum_{i=1}^n{\mathbb{E}}[{\mathrm{Var}} (\hat f (X^*_i) {\,|\,} X^*_i)]\] SLIP PartIIINotes
Basic Part III Notes::MedicalStats Definition: Accelerated-Life Families A family of distributions generated by \(F(\lambda t)\) for some scaling \(\lambda > 0\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Proportional-Hazards Families A family of distributions generated by \(F(t)^k\) for some shape parameter \(k >0\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Survivor Function of the Weibull Distribution \[F(t) = e^{-(\theta t)^k}\] where \(\theta >0\) is the shape parameter and \(\theta > 0\) is the scale parameter.
Note that this reduces to the exponential RV when \(k = 1\).
Weibull distributions are both an accelerated life and proportional hazards family! MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats State the contribution to the likelihood for an observed event versus a censored event. When \(v_i = 1\) (uncensored), the contribution is the density \(f(x_i, \theta)\).
When \(v_i = 0\) (censored), the contribution is \({\mathbb{P}}(x_i < T_i) = F(x_i, \theta)\) (where \(F\) is the survivor function).
So the total likelihood is \[L(\theta) = \prod_{v_i = 1}f(x_i, \theta)\prod_{v_i = 0}F(x_i, \theta)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats PDF of the exponential distribution \[f(t; \theta) = \theta e^{-\theta t}\] given \(\theta > 0\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Survivor Function of the exponential distribution \[F(t; \theta) = e^{-\theta t}\] given \(\theta >0\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Requirements for a nonparametric survival function \(\tilde F(t)\) 1. Bounded between zero and one: \(0 \leq \tilde F(t) \leq 1\)
2. Decreasing: where \(a \leq b\), \(\tilde F(a) \geq \tilde F(b)\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Contributions to a non-parametric survivor function likelihood Where uncensored, we have: \[{\mathbb{P}}(T = x_i) = {\mathbb{P}}(T \geq x_i) - {\mathbb{P}}(T > x_i) = F(x_i^-) - F(x_i) = F_i^- - F_i\] where the minus subscript represents the limit from the left.
Where censored, we have: \[{\mathbb{P}}(T > x_i) = F(x_i) = F_i\]
So for each event the likelihood contribution is \[L(\theta) = \prod_{v_i = 1}(F_i^- - F_i) \prod_{v_i = 0} F_i\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How can we simplify a non-parametric survivor function likelihood? We can reduce the number of parameters by our assumptions: \[F_1^- = 1\] \[F_n = 0\] Also the function should only change between \(F_i^-\) and \(F_i\) terms! (we're trying to build a step function; piecewise linear)
We should end up with a polynomial we can maximimize easily. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Non-parametric likelihood contributions of left-censored observations and interval-censored observations Left-censored: \(1 - F(x_i) = 1 - F_i\)
Interval-censored on an interval \((a, b)\): \(F(a) - F(b)\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Hazard Function (raw form) \[h(t) = \lim_{\Delta \to 0} \frac{{\mathbb{P}}(t < T < t + \Delta {\,|\,} T > t)}{\Delta}\] You can simplify this to the form generally used using Bayes' Theorem and recognizing the definition of derivative: \[h(t) = \frac{f(t)}{F(t)}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Hazard Function (practical form) \[h(t) = \frac{f(t)}{F(t)}\] where \(F(t)\) is the survivor function. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Integrated Hazard \[H(t) = \int_0^t h(t') dt'\] Models the accumulation of hazard over time. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Survivor function as a function of integrated hazard \[F(t) = e^{-H(t)}\] or equivalently \[H(t) = -\log(F(t))\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Counting Process A function \(N : {\mathbb{R}} \to {\mathbb{N}}\) that counts how many times an event has occurred. It always fulfills:
\[N(0) = 0\] Always increasing
Always right-continuous (increments at an event time, not right after it)
MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Survival Analysis as a Counting Process In survival analysis, we only care about the first time an event happens, so \(N(t) \in \{0, 1\}\).
Putting it in our usual notation, we have: \[N(t) = I\{ X \leq t, V = 1 \}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: History (of a stochastic process) \({\mathcal{H}}_t\) denotes knowledge of everything that has happened in a stochastic process up to and including \(t\)
\({\mathcal{H}}_{t-}\) denotes knowledge up to but not including \(t\).
For our purposes, \({\mathcal{H}}_{t-}\) is completely captured by membership to the risk set at \(t\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats What do we mean by \(dN(t)\)? It's called Stieltjes Notation: \[dN(t) = N(t + dt) - N(t)\] It can only take the values \(0\) or \(1\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Intensity \(\lambda(t)\) \[\lambda_i(t) = Y_i(t)h_i(t)\]
where \(Y_i(t) = I\{X_i \geq t\}\) (1 when person \(i\) is still in the risk set).
Let \(\Lambda\) be the integrated intensity, so equivalently we have
\[d\Lambda_i(t) = \sum_i Y_i(t)dH_i(t)\]
Define \(\Lambda_+ = \sum_i \Lambda_i\) and \(Y_+ = \sum_i Y_i\), and assume \(H_i(t) = H(t)\) for all \(i\) (could also assume proportionality instead)
\[d\Lambda_+(t) = Y_+(t) dH(t)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Show that \(dN(t) - d\Lambda(t)\) is a martingale Work from the definition of the derivative of the integrated intensity: \[\begin{align*} {\mathbb{P}}(dN(t) = 1 {\,|\,} {\mathcal{H}}_{t-}) &= d\Lambda(t)\\\\ {\mathbb{E}}[dN(t) {\,|\,} {\mathcal{H}}_{t-}] &= d\Lambda(t)\\\\ {\mathbb{E}}[dN(t) - d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}] &= 0 \end{align*}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Given the per-individual martingale identity \[{\mathbb{E}}[dN(t) - d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}]= 0,\] show that the population derivative of integrated intensity equals the derivative of the population counting process at all \(t\). Define the population counting process and integrated hazard: \[N_+(t) = \sum_i N_i(t)\] \[\Lambda_+(t) = \sum_i \Lambda_i(t)\] By our assumption, we have: \[\begin{align*} {\mathbb{E}}[dN(t)] - {\mathbb{E}}[ d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}]&= 0\\\\ {\mathbb{E}}[dN(t)] &= {\mathbb{E}}[ d\Lambda(t) {\,|\,} {\mathcal{H}}_{t-}] \end{align*}\] So by the method of moments, we have: \[dN_+(t) = d\Lambda_+(t)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Use \(dN_+(t) = d\Lambda_+(t)\) and \(d\Lambda_+(t) = Y_+(t) dH(t)\) to build the Nelson-Aalen Estimator (find an estimator \(d\hat H(t)\)) Substituting in and isolating \(d\hat H(t)\), we get: \[d\hat H(t) = \frac{dN_+(s)}{Y_+(s)}\] Integrate both sides: \[\int_0^t d\hat H(s)ds = \hat H(t) = \int_0^t \frac{dN_+(s)}{Y_+(s)} ds\] But \(N_+(t)\) is just a step function that jumps at each failure time, so this integral is just a sum of 'snapshots' at each \(t_j \leq t\): \[\hat H(t) = \sum_{t_j \leq t} \frac{1}{Y_+(t_j)}\] Sum of snapshots of hazard increments. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Nelson-Aalen Estimator Given a set of events \(\{a_1, ...,a_j,..., a_n\}\) with no ties, we can estimate the cumulative hazard \(H(t)\) as \[\hat H(t) = \sum_{a_j \leq t} \frac{1}{Y_+(a_j)}\] where \(Y_+(t)\) is the number of people at risk at \(t\) (this is exactly the \(r_j\) used in Kaplan-Meier).
Note that we can estimate the survivor function using Nelson-Aalen: \[\hat F(t) = \exp(-\hat H(t))\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Semiparametric Modeling Model the bits we care about explaining (\(\beta\)) with a parametric method, and model the bits we don't care so much about (\(\psi\)) with a nonparametric method. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Proportional Hazards Modeling Semiparametric method for modeling hazard. Assume that everyone has hazards in the same proportional hazards family: \[h_2(t) = Kh_1(t),\, K> 0\] Express \(K\) with the explainable parameters \(\beta\), and express \(h_0(t)\) with the nonparametric \(\psi\): \[h(t, z^i, \beta, \psi) = \phi(z^i, \beta)h_0(t, \psi)\]
Note that this is a strong assumption! MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Partial Likelihood An expression of likelihood that neglects some of the data in order to solve for only some of the model parameters.
In Cox regression, it was shown that this is mathematically valid! MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: (Partial) Likelihood in Proportional Hazards Modeling \[L(\beta) = \prod_{i\,:\, v_i = 1} \frac{\phi(z^i, \beta)}{\sum_{i'\in R_i} \phi(z^{i'}, \beta)}\] where \(R_i\) is the risk set at \(t_i\). Note that \(h_0(x_i, \psi)\) gets cancelled out in the fraction. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Cox Regression Proportional hazards modeling with the following form for \(\phi(z, \beta)\): \[\phi(z, \beta) = e^{\beta^T z}\]
This yields \[L(\beta) = \prod_{i\,:\, v_i = 1} \frac{e^{\beta^T z^i}}{\sum_{i'\in R_i} e^{\beta^T z^{i'}}}\] It's easy to take the log-likelihood then: \[\log L(\beta) = \sum_i v_i \left( \beta^Tz^i - \log{\sum_{i' \in R_i} e^{\beta^T z^{i'}}} \right)\] And a derivative with respect to \(\beta\): \[ \frac{d}{d\beta} \log L(\beta) = \sum_i v_i \left( z^i - \sum_{i' \in R_i}\frac{z^{i'}e^{\beta^T z^{i'}}}{ e^{\beta^T z^{i'}}} \right) \] Use Newton-Raphson from here. MedicalStats PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Partial Recursive Function The smallest class of functions \({\mathbb{N}}^k \to {\mathbb{N}}\) that contains the three atomic functions, and is closed under composition, primitive recursion, and minimization.
For the class of primitive recursive functions, ignore minimization. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The three atomic partial recursive function building blocks 1. Projections: Pick one number out of a list ( \[\Pi_i^k: {\mathbb{N}}^k \to {\mathbb{N}}\] \[\Pi_i^k: (n_1,...,n_k) \mapsto n_i\] 2. Successor: Add one \[S^+ : {\mathbb{N}} \to {\mathbb{N}}\] \[S^+: n \to n + 1\] 3. Zero: Any number to zero \[Z: {\mathbb{N}} \to {\mathbb{N}}\] \[Z : n \mapsto 0\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Partial Recursive Function Closure under Composition Given \(g: {\mathbb{N}}^k \to {\mathbb{N}}\), and for each \(i \in [k]\) we have a function \(h_i: {\mathbb{N}}^m \to {\mathbb{N}}\), then there exists a function \(f : {\mathbb{N}}^m \to {\mathbb{N}}\) given by
\[f(\bar n) := g(h_1(\bar n), ..., h_k(\bar n))\]
Note that \(\bar n\) refers to a tuple of natural numbers.
Sort of like a weird inverse map-reduce, except we're mapping over a list of functions by applying \(\bar n\) to all of them. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Partial Recursion Function Closure under Primitive Recursion Given a base function \(g: {\mathbb{N}}^m \to {\mathbb{N}}\) and step function \(h: {\mathbb{N}}^{m+2} \to {\mathbb{N}}\), there exists a function \(f: {\mathbb{N}}^{m+1} \to {\mathbb{N}}\) defined by: \[\begin{align*} f(0, \bar n) &:= g(\bar n)\\\\ f(k+1, \bar n) &:= h(f(k, \bar n), k, \bar n) \end{align*}\]
The first parameter is the fuel, the second is the surrounding context (constant through the process) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Partial Recursive Function Closure under Minimization Given \(g : {\mathbb{N}}^{m + 1} \to {\mathbb{N}}\), there exists a function \(f: {\mathbb{N}}^m \to {\mathbb{N}}\) that maps \(\bar n\) to the smallest \(k\) such that \(g(k, \bar n)=0\).
Note that it may not exist!
This is sometimes denoted \(f(\bar n) = \mu k. g(k,\bar n) = 0\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Total Recursive Function A partial recursive function for which every every input maps to some outputs.
This just means it always terminates: there's no input that will search forever. All primitive recursive functions are total. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Godel Numbering Let \(L\) be a language. A Godel Numbering is an injection \(L \hookrightarrow {\mathbb{N}}\) with three properties:
1. It's computable: there exists an algorithm you can follow to find the Godel number of any expression in finite time
2. Its image is a recursive subset of \({\mathbb{N}}\) (membership is decidable).
3. Its inverse (where defined) is also computable. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursively Enumerable Set \(X \subseteq {\mathbb{N}}\) is recursively enumerable (or recognizable, or semi-decidable) if any of the following are true:
1. \(X\) is the image of some partial recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)
(dovetail computations and generate a big list)
2. \(X\) is the image of some total recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)
(just run on all numbers sequentially)
3. \(X\) is the domain of some partial recursive function \(f: {\mathbb{N}} \to {\mathbb{N}}\)
(same as 1, but mark the input on the list, not the output) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursive Set A set \(X \subseteq {\mathbb{N}}\) is recursive if both \(X\) and its complement \({\mathbb{N}} \,\backslash\, X\) are recursively enumerable. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursive Language There exists an algorithm to check whether a string of symbols is an L-formula
(this is generally trivial: just checking if a statement makes sense) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursive L-Theory An L-Theory \(T\) is recursive if membership in \(T\) is decidable (we can check if a sentence is assumed or not) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Decidable L-Theory An L-Theory \(T\) is decidable if there is an algorithm for deciding \(T \Vdash \varphi\) for any \(\varphi\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Craig's Trick Every first-order theory with a recursively-enumerable set of axioms admits a recursive axiomatization
(if your list of axioms is recognizable, it can be reconfigured into a decidable one)
Proof:
There exists a totally recursive \(f\) with \(f(n) = \phi_i\) for \(n \in {\mathbb{N}}\).
Map each \(\phi_i\) to a logically equivalent \(\bigwedge_{j=1}^i \phi_i\).
The algorithm to check if a statement \(A\) is an axiom is thus:
First, check if \(A\) is equivalent to \(f(1)\) (can do this in finite time due to finite sentence length). If it is, return yes, otherwise continue.
Second, find any decompositions \(A = \bigwedge_j^i A_j\). Then just check axiom \(i\) for equivalence. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Notation for the Godel numbering of an expression \(\lceil x \rceil\) maps \(x\) to its Godel numbering
\(\lfloor x \rfloor\) maps a Godel numbering \(x\) to the statement in represents LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: The set of total recursive functions (or their respective Godel numberings) is not recursively enumerable Proof:
By definition, there exists some function \(f\) such that, for every total function \(h\), there exists some \(n \in {\mathbb{N}}\) such that \(h = \lfloor f(n) \rfloor\).
In particular, this includes \(g(x) = \lfloor f(x) \rfloor (x) + 1\).
Show a contradiction from there. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The Language of Arithmetic \(\mathcal{L}_{PA}\) The first-order language with signature \((0, 1, +, \cdot, <)\) where \(0, 1\) are constants, \(+, \cdot \) are functions with arity two, and \(<\) is a relation with arity two. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The Base Theory of Arithmetic \(PA^-\) An \(\mathcal{L}_{PA}\)-theory fulfilling:
1. \(+, \cdot\) are commutative and associative, and have identity elements 0 and 1.
(\(+, \cdot\) individually)
2. \(\cdot\) distributes over \(+\).
(distributivity)
3. \(<\) is a linear ordering compatible with \(+\) and \(\cdot\).
(\(<\) is a compatible linear ordering)
4. \(\forall x, y,\, (x < y \implies \exists z,\, x + z = y)\).
(Quantifier definition of \(<\))
5. \(0 < 1\) and \(\forall x,\, (x > 0 \implies x \geq 1)\).
(Zero is less than 1, \(\leq\) base definition)
6. \(\forall x,\, x \geq 0\).
(Everything is greater than or equal to zero) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The Theory of Peano Arithmetic (PA) \(PA^-\) with the extra set of axioms, the scheme of induction:
For each \(\mathcal{L}_{PA}\)-formula \(\varphi(x, \bar y)\), we have the axiom \[I\varphi := \forall \bar y,\, (\varphi(0, \bar y) \land \forall x,\, (\varphi (x, \bar y) \to \varphi(x + 1, \bar y)) \to \forall x,\, \varphi(x, \bar y)\]
(proving the base case \(n = 0\) and proving the inductive case \(n \implies n+1\) is enough to prove something for all \(n \in {\mathbb{N}}\)) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Delta-0 Formula (\(\Delta_0\)) A formula whose quantifiers (\(\forall\), \(\exists\)) are bounded: for some fixed \(n \in {\mathbb{N}}\), either \[\forall x,\, x<n,\, \varphi(x)\] or \[\exists x,\, x<n,\, \varphi(x)\] for \(n\) not free in \(\varphi(x)\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Strict Order Axioms 1. Irreflexivity: \(\forall x,\, x \not < x\)
2. Asymmetry: \(\forall a,b,\, a < b \implies b \not < a\)
3. Transitivity: \(\forall a, b, c,\, (a < b) \land (b < c) \implies a < c\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Pi-1 Formula and Sigma-1 Formula (\(\Pi_1\), \(\Sigma_1\)) A formula \(\varphi(\bar x)\) is a \(\Sigma_1\)-formula if it is provably equivalent to a there-exists statement with a \(\Delta_0\)-formula body, ie, there exists a \(\Delta_0\) \(\psi(\bar x, \bar y)\) such that \[PA \vdash \left(\varphi(\bar x) \iff \exists \bar y.\, \psi(\bar x, \bar y) \right)\] Similarly \(\varphi(\bar x)\) is a \(\Pi_1\)-formula if it is provably equivalent to a for-all statement with a \(\Delta_0\)-formula body, ie, there exists a \(\Delta_0\) \(\psi(\bar x, \bar y)\) such that \[PA \vdash \left(\varphi(\bar x) \iff \forall \bar y.\, \psi(\bar x, \bar y) \right)\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability State a total recursive bijection from \({\mathbb{N}}^2 \to {\mathbb{N}}\), denoted \(\langle x, y \rangle\) \[ \langle x, y \rangle := \frac{(x+y)(x + y + 1)}{2} + y \] This is called the Cantor Pairing Function.
Intuition: it's just diagonalization. Count up the triangular number underneath the square we want the number of, then add how many squares into the next triangle we are. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Godel's Beta Function There exists a function \(\beta(x, y)\) derivable in PA such that for any sequence \(x\), there is a code \(u\) such that \(\beta(u, i)\) = \(x_i\).
(Decodes the sequence and gives us the \(i\)th element. This is the projection function...)
Note that with Cantor's pairing function, we can reduce this to just one coded number. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Godel's Lemma Let \(M \Vdash PA\), \(n \in {\mathbb{N}}\), and \(x_0,...,x_{n-1} \in M\).
Then there exists a \(u \in M\) such that \(M \Vdash ((u)_i = x_i)\) for all \(i < n\).
If M believes in PA, M can prove that Godel's beta function works as intended. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Let \(f: {\mathbb{N}}^k \to {\mathbb{N}}\) be a partial function.
\(f\) is (partial) recursive iff there is a \(\Sigma_1\)-formula \(\theta(\bar x, y)\) such that \[y = f(\bar x) \iff {\mathbb{N}} \Vdash \theta(\bar x, y)\]
(ie the graph of \(f\) is \(\Sigma_1\)-definable in \({\mathbb{N}}\)) First we show that if the graph of \(f\) is \(\Sigma_1\)-definable, then \(f\) must be partial recursive. We can just implement a minimization procedure then run the outer there-exists clause, checking every case exhaustively (computing results in \({\mathbb{N}}\) as normal).
Constructing a \(\Sigma_1\) statement from a partial recursive \(f\) is more complicated. The base functions are simple. For the rest:
Composition: show this is equivalent to a there-exists over each subfunction output, with \(f\) applied to all of them together making the required full output.
Primitive Recursion: absolutely requires Godel's beta-function so that you can get an existence statement over the sequence length (we don't know how many times we'll need to run through the loop).
A similar argument holds for minimization. LogicAndComputability PartIIINotes
Basic Part III Notes::SLIP State the variance function of the normal EDF \[V(\mu) = \mu\] SLIP PartIIINotes
Basic Part III Notes::SLIP State the variance function for the binomial EDF \[V(\mu) = \mu (1 - \mu)\] SLIP PartIIINotes
Basic Part III Notes::SLIP State the cumulant function for the Poisson EDF \[K(\theta) = e^\theta - 1\] SLIP PartIIINotes
Basic Part III Notes::SLIP How can you find the \(\theta\) for the EDF representing \(\textrm{Poisson}(\lambda)\) \[\theta = \log (\mu)\] \[\mu = e^\theta\] SLIP PartIIINotes
Basic Part III Notes::SLIP How can you find the \(\phi\) for the EDF representing \(\textrm{Poisson}(\lambda)\) It's always one: \[\phi = 1\] SLIP PartIIINotes
Basic Part III Notes::SLIP Variance function for the Poisson EDF \[V(\mu) = e^{\log \mu} = \mu\] SLIP PartIIINotes
Basic Part III Notes::SLIP State three reasons that we might observe overdispersion in our model 1. Missing critical covariates
2. You're using the wrong distribution
3. There is correlation between your datapoints SLIP PartIIINotes
Basic Part III Notes::SLIP Intuitively, what does the negative binomial distribution model? In a binary experiment with probability \(p\) of failure, the number of failures before \(r\) successes is reached.
NOTE: Other places use the convention that \(p\) is the probability of success, not failure. This course uses the 'probability of failure' convention. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Negative Binomial Distribution PDF If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{P}}(X=k) = {k + r - 1 \choose k} p^k (1-p)^r\] SLIP PartIIINotes
Basic Part III Notes::SLIP Mean of the Negative Binomial PDF If \(X \sim \mathrm{NB}(r, p)\): \[{\mathbb{E}}(X) = \frac{rp}{1-p}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Variance of the Negative Binomial PDF If \(X \sim \mathrm{NB}(r, p)\): \[{\mathrm{Var}}(X) = \frac{rp}{(1-p)^2}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Variance Function of the Negative Binomial PDF \[V(\mu) = \mu + \frac{\mu^2}{r}\] SLIP PartIIINotes
Basic Part III Notes::SLIP What is the relationship between the Negative Binomial distribution and the Poisson distribution? As \(r \to \infty\), \[\mathrm{NB}\left(r, \frac{\mu}{\mu + r}\right) \overset{d}{\to} \mathrm{Pois}(\mu)\] SLIP PartIIINotes
Basic Part III Notes::SLIP What do we need to remember when using the negative binomial in R? The default link function is the same as for the Poisson RV: \[g(\mu) = \log \mu\] SLIP PartIIINotes
Basic Part III Notes::SLIP How can we test between \(H_0\): Poisson model and \(H_1\): negative binomial model? Use the modified framework: \[2 \left( l_1(\hat \beta, \hat r; Y) - l_0(\hat \beta; Y) \right) \overset{d}{\to} \frac{1}{2} \delta_0 + \frac{1}{2} \chi_1^2\] as \(n \to \infty\), where \(\delta_0\) is the point mass at zero. SLIP PartIIINotes
Basic Part III Notes::LogicAndComputability What is meant when we write \(T(S)\) where \(S\) is a poset? The set of terminal segments of \(S\): \[\{ U \subseteq S : U \text{ is a terminal segment} \}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What is a simple way to make a Kripke model? Take a mapping ('valuation') \(v\) from the set of primitive propositions \(P\) to the set of terminal segments \(T(S)\) (known to be a Heyting algebra).
This induces a Kripke model by setting \[s \Vdash p \iff s \in v(p)\] for \(s \in S\), \(p \in P\).
It's clear to see that this fulfills the persistence property: \[s \Vdash p \implies s \in v(p) \implies (s\uparrow) \subseteq v(p) \implies s' \in v(p) \implies s' \Vdash p\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proper Filter A filter is called 'proper' when it isn't just the whole set \(X\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Extension of Filters If \(F\) is a proper filter and \(x \not \in F\), then there exists a prime filter extending \(F\) that still avoids containing \(x\) (by Zorn's Lemma) LogicAndComputability PartIIINotes
Basic Part III Notes::SLIP Definition: Quasi-Likelihood Model \((x_i, y_i)\) follow a quasi-likelihood model if they are independent and satisfy: \[{\mathbb{E}}[y_i] = \mu_i(\beta) = g^{-1}(x_i^T\beta)\] \[{\mathrm{Var}}[y_i] = V_i(\mu_i(\beta))\] where we fix \(V_i\) and \(g\), and we try and fit \(\beta\).
This effectively means that we can just fix some link function and some variance function, without having to specify anything else about our model.
Every GLM fits this format. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Quasi-Score \[U(\beta) = \sum_{i=1}^n \frac{y_i - \mu_i(\beta)}{V_i(\mu_i(\beta))} \mu_i'(\beta) x_i\] where \[\mu_i'(\beta) = (g{^{-1}})'(x_i{^{\mathrm{T}}} \beta)\]
Note that this fulfills the score identity \({\mathbb{E}}[U(\beta)] = 0\), and the Fisher information can still be calculated the standard way, \(-{\mathbb{E}}[\frac{du}{d\mu(\beta)}]\)
We can estimate \(\beta\) by solving the estimating equation \(u(\hat \beta) = 0\) (called the Quasi-Likelihood Estimator) SLIP PartIIINotes
Basic Part III Notes::SLIP Quasi-Likelihood Estimator Distribution \[(X{^{\mathrm{T}}} W X)^{1/2}(\hat \beta - \beta) \overset{d}{\to} N(0, I_p)\] where \(W\) is the diagonal matrix with entries \[W_{ii} = \frac{\mu_i'(\hat \beta)^2}{V_i(\mu_i(\hat \beta))}\] where \(\hat \beta\) is the Quasi-Likelihood Estimator. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Linear Mixed Effects Model Given observations \((x_{ij}, z_{ij}, Y_{ij})\), with \(i \in [n]\) groups and \(j \in [d]\) individuals per group,
\[Y_i = X_i\beta + Z_iu_i + {\varepsilon}_i\] where \(X_i \in {\mathbb{R}}^{d \times p}\),
where \(Z_i \in {\mathbb{R}}^{d \times q}\),
\(u_i \overset{iid}{\sim} N(0, \Sigma_u)\) are random effects (\(\Sigma_u\) is positive semi definite in \({\mathbb{R}}^{q\times q}\)),
\({\varepsilon}_i \overset{iid}{\sim} N(0, \sigma^2 I_d)\) (independent of \(u_i\)),
\(\beta \in {\mathbb{R}}^p\) are fixed effects
\(Y_i\) represents all the observations from group \(i\).
SLIP PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Precision The inverse of variance, often \(\sigma^{-2}\) or \(\Sigma^{-1}\) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Posterior of two multivariate normals, one determining the mean of the other
((same as proportional product of two normal pdfs)) Given \(N(\vec \mu_1, \Sigma_1)\) and \(N(\vec \mu_2, \Sigma_2)\), the product of their pdfs is a new \(N(\mu^*, \Sigma^*)\) with \[\Sigma^* = \left(\Sigma_1^{-1} + \Sigma_2^{-1}\right){^{-1}}\] \[\mu^* = \Sigma^* \left(\Sigma_1^{-1}\vec\mu_1 + \Sigma_2^{-1}\vec\mu_2\right)\] For the posterior result, just use \(\mu_2 := A\).
The precisions just sum. The mean is the precision-weighted average. Astrostatistics PartIIINotes
Basic Part III Notes::MedicalStats The two kinds of ties in a Nelson-Aalen model, and how we can account of them 1. Tie by lack of precision
Suppose in the dataset ordered 1,2,3,4, 2 and 3 tied. Then we can account for this by just adding the two possibilities: \[...\left( \frac{\phi_2 }{\phi_2 + \phi_3 +\phi_4}\frac{\phi_3}{\phi_3 + \phi_4} + \frac{\phi_3 }{\phi_2 + \phi_3 +\phi_4}\frac{\phi_2}{\phi_2 + \phi_4} \right)...\] We can approximate this with \[...\left( \frac{\phi_2\phi_3 }{(\phi_2 + \phi_3 +\phi_4)(\frac{1}{2}\phi_2 + \frac{1}{2}\phi_3 + \phi_4)} \right)...\] 2. Genuine Tie
Same example, use the following: \[...\left( \frac{\phi_2\phi_3 }{\phi_2\phi_3 + \phi_2\phi_4 + \phi_3\phi_4} \right)...\] MedicalStats PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Chinese Remainder Theorem Take integers a list of \(k\) integers \(n_1,...,n_k\), all greater than one. Assume they are pairwise coprime: no pair shares any divisor besides 1. no Let \(N\) be their product, \(n_1...n_k\). Specify a list of \(k\) possible remainders for each \(n_k\), \(0 \leq a_k < n_k\).
Then there exists only one possible solution \(0 \leq x < N\) such that the remainder of \(x\) with each \(n_i\) is \(a_i\).
Another way to state it: with the same assumptions above, the system \[\begin{align*} x \equiv a_1\mod n_1\\\\ ...\\\\ x \equiv a_k\mod n_k \end{align*}\] has only one solution \(x\) mod \(N\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability How do we encode numbers for use in Godel's Beta Function? Suppose we want to code the numbers \( x_0, ..., x_{n-1}\). Let \(m = \max (0, x_0, ..., x_{n-1})!\). Define pairwise coprime \(b_1 :=m + 1,\, b_2:= 2m + 1, ..., b_n:= nm + 1\). Then our tuple becomes coded as the unique \(x\) guaranteed by the Chinese Remainder Theorem such that the remainder by \(b_i\) is \(x_i\). Thus we have \(u = \langle a, m \rangle\)
LogicAndComputability PartIIINotes
Basic Part III Notes::MedicalStats Definition: Estimating Baseline Hazard \[ \hat H_0(t) = \sum_{i : x_i \leq t,\, v_i = 1} \frac{1}{\sum_{i' \in R_i}\hat \phi(z_{'i}, \beta)} \] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Many Baseline Hazards Modeling Suppose we had a treatment trail where individuals in the US and the UK are separately given the same medication. We expect the treatment effect to be the same, but the baseline risk to be different.
To model this: we say there are \(L\) strata, with the function \(q(i)\) mapping the \(i\)th individual to their strata group. Then just use the custom risk set: \[R_i := \{ i' :\, x_{i'} \geq x_i \land q(i') = q(i) \}\] For an event occurring to individual \(i\), we only consider other individuals in the same strata (with known events occurring after that individual's events).
Estimate \(\beta\) by taking the product of each stratum's partial likelihood. Estimate the baseline hazard separately. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Matched Pair Analysis Setup Smallest possible stratification. We have two treatments \(k \in \{ 0,1\}\) and \(L\) pairs where one is exposed to treatment, and the other is not. Using our partial hazard split per individual:
\[h_{k, l}(t) = \phi_k h_l(t)\]
Clearly we can't estimate each individual \(h_l(t)\), but we can estimate the \(\phi_k\) shared among each pair.
Assume we have the baseline effect \(\phi_0 = 1\), and the treatment effect \(\phi_1 = e^\beta\) (or combined \(\phi_k=e^{k \beta}\). \(\beta\) is called the log hazard ratio between treatments. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Matched Pair Analysis Likelihood Contributions Ignore pairs with ties: nothing useful we can learn from a tie. There's only two pieces of information we need from each pair \(l\): \[k(l) = I\{X_{1,l} < X_{0,l} \}\] Just the index of whichever treatment fails first. \[v(l) = V_{k(l), l}\] 0 if it was actually a censoring, not a failure; 1 if not.
This yields the full likelihood contribution: \[ \left(\frac{e^{k(l)\beta}}{1 + e^\beta} \right)^{v(l)} \] This ends up just being equivalent to a binomial likelihood!
Note that there's no likelihood contribution when:
- there's a tie
- one individual is censored before the other has an event
- both individuals are censored MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats When we graph log(survival) and log(log(survival)) over time, what are we graphing? log(survival): Integrated hazard, slope is hazard
log(log(survival)): We'll see a constant vertical difference between two curves if their hazards are proportional. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Probability Integral Transform for Hazard Take the random variable \(T\) with integrated hazard \(H(t)\). Then the random variable \(U := H(T)\) is a \(Exp(1)\) distribution.
Proof: \[\begin{align*} {\mathbb{P}}(u > U) &= {\mathbb{P}}(u > H(T))\\\\ &= {\mathbb{P}}(H^{-1}(u) > T) \\\\ &= F(H^{-1}(T))\\\\ &= e^{-H(H^{-1}(u))} \\\\ &= e^{-u} \end{align*}\]
We can use this to quantify how accurate our models are! MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Cox-Snell Residual Fit the model to construct \(y_i = \hat H_i(x_i)\). By the PIT for Hazard, we expect \(y_i \sim Exp(1)\).
Problem: we still may have censored individuals. So we can use a method like Kaplan-Meier to instead build a nonparametric representation of \(\hat F_{y_i}\).
Graphically, \(\hat F_y(y)\) should look like a straight line against \(y\), with a slope of -1. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Trick for accounting for smaller-than-expected \(y_i\) mean in computing the Cox-Snell Residual Add one to the \(y_i\)s for each censored individual: \[y_{i}' := y_i + (1-v_i)\]
This comes from the memoryless property of the exponential: \[{\mathbb{E}}[Y | Y + c] = c + 1\] This strategy is called mean imputation. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Martingale Residual \[y_{i}'' = 1-y_{i}' = v_i - \hat H_i(x_i)\] We know this must have mean zero: \({\mathbb{E}}[y_{i}''] = 0\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How can we use Martingale plots as a means of designing an explanatory model? 1. Fit a model without using explanatory variables (\(z\)).
2. Plot the Martingale residuals \((y_i'')\) against \(z\): where the graph is above 0, we see more events than expected. For instance if you see a log plot, that implies you should probably use Cox regression.
3. Fit a fixed dataset and see if the means have been corrected towards zero. MedicalStats PartIIINotes
Basic Part III Notes::SLIP Definition: Marginal Formulation of Mixed Effects Models Just compute expectation and variance of \(Y_i\): \[V_i := Z_i \Sigma_u Z_i{^{\mathrm{T}}} + \sigma^2 I_d\] \[Y_i \sim N(X_i\beta,\, V_i)\]
Let \(Y = (Y_1^T ... Y_n^T)\), similar for \(X\) and \(Z\), and let \(V = diag(V_1,...,V_n)\). Taking the log-likelihood of the random multivariate normal pdf yields: \[l(\beta, \sigma^2, \Sigma_u) = -\frac{1}{2} nd\log(2\pi) - \frac{1}{2}\log(\det(V)) - \frac{1}{2} (Y - X\beta)^T V{^{-1}} (Y - X\beta)\] SLIP PartIIINotes
Basic Part III Notes::SLIP MLE for \(\beta\) in a linear mixed effects model \[\hat \beta = (X{^{\mathrm{T}}} V{^{-1}} X){^{-1}} X{^{\mathrm{T}}} V{^{-1}} Y\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Restricted Maximum Likelihood Estimation (ReML) for Linear Mixed Effects Models We want a matrix \(L\) such that \(LX = 0\). Use \(L = I - P = I - X(X{^{\mathrm{T}}} X){^{-1}} X{^{\mathrm{T}}}\), then \(LY\) simplifies nicely: \[LY = LX\beta + L[Z_iu_i] + \varepsilon \sim N_r(0, LVL{^{\mathrm{T}}})\] So just estimate \(\sigma^2\), \(\Sigma_u\) using that as the MLE. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Generalized Linear Mixed Effects Models (GLMMs) Assume the normal setup for linear mixed effects models, but let \(u_i\) have an arbitrary distribution with unknown parameter \(\alpha\): \(u_i \overset{d}{\sim} {\mathbb{P}}_\alpha\). Assumes \[g({\mathbb{E}}[Y_{ij} | u_i]) = x_{ij}{^{\mathrm{T}}} \beta + z_{ij}{^{\mathrm{T}}} u_i\]
Estimate \((\alpha, \beta, \phi)\) by MLE: \[f(y; \alpha, \beta, \phi) = \int f(y {\,|\,} u; \beta, \phi) f(u;\alpha)du\] SLIP PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: The Persistence Property for Kripke Models \[(s \Vdash p) \land (s \leq s') \implies s' \Vdash p\] Worlds are ordered by the knowledge they believe.
LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability How do we actually implement Godel's beta function given an encoded number? \[ \beta(u, i) := a \% (m(i+1) + 1) \] where \((a, m)\) are the unique numbers such that \(u = \langle a, m \rangle\). LogicAndComputability PartIIINotes
Basic Part III Notes::MedicalStats Definition: Frailty Each individual has some frailty \(u\) that describes how susceptible they are to risk. We denote the population risk \(\bar F(t)\):
\[\bar F = {\mathbb{E}}_u[F(t{\,|\,} U = u)]\]
Note however that this doesn't imply that the population hazard \(\bar h(t)\) is \({\mathbb{E}}_u[h(t{\,|\,} u)]\): \[\bar h(t) = \frac{\bar f(t {\,|\,} u)}{\bar F(t {\,|\,} u)}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Population Hazard under Proportional Frailty \[ h(t{\,|\,} U=u) = uh_0(t)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Gamma Distribution PDF \[Gamma(x; p, \lambda) = \frac{\lambda^p x^{p-1}e^{-\lambda x}}{\Gamma(p)}\] \(p\) is called the shape or index parameter, \(\lambda\) is the rate parameter. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Mean of the Gamma Distribution \[\frac{p}{\lambda}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Variance of the Gamma Distribution \[\frac{p}{\lambda^2}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Laplace Transform of the Gamma Distribution With \(Gamma(x; p, \lambda)\): \[\left(\frac{p}{1 + p} \right)^{\lambda}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Population Hazard under Gamma-distributed Frailty \[\bar h(t) = \frac{p h_0(t)}{p + H_0(t)}\]
A lower \(p = \lambda = \psi\) means a higher variance among individuals: they either die off quickly or stick around. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Cox Regression with Frailty \[h_{z}(t {\,|\,} U = u) = u e^{\beta z}h_0(t)\]
You can use this to substitute into population hazard:
\[\bar h(t) = \frac{p e^{\beta t} h_0(t)}{p + e^{\beta z} H_0(t)}\]
Note then the population hazard ratio will start at \(e^\beta\) at \(t = 0\), but then converge to 1 as \(t \to \infty\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats What is the central problem with frailty in practice? How can we resolve this? There's not enough information in \((X_i, V_i)\) pairs alone to assess \(g\) and \(H_0\) separately. Three resolutions:
1. Be aware that we can't tell apart a homogeneous population with decreasing hazard from a population varying in frailty but with constant hazard over time.
2. Measure as many explanatory variables as possible
3. Use a model where frailty goes into the error term (accelerated frailty): \[F(t{\,|\,} u) = F_0(ue^{\beta z}t)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats The two approaches to competing risks modeling 1. Two different times to events, \(T_A\) and \(T_b\), we only observe one.
2. One time-to-event variable \(T\), and at event time, choose between event type \(A\) or \(B\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats What do \(F(t)\), \(\tilde f_A(t)\), and \(G_A(t)\) represent in competing hazards modeling, and how do we compute them? \(F(t)\) is just the probability of any event before \(t\), so \[F(t) = \exp(-H_A(t) - H_B(t))\]
\(\tilde f_A(t)\) is the event density at \(t\) (not a proper distribution): \[\tilde f_A(t) = h_A(t)F(t)\]
\(G_A(t)\) is the probability of a the specific event \(A\) occurring before or at \(t\), so \[G_A(t) = \int_0^t \tilde f_A(t') dt'\]
Note we can also compute \(\tilde F_A(t) = \exp(-H_A(t))\), the survivor function for \(A\) if the event \(B\) entirely disappeared. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Aalen-Johansen Estimator \[\hat G_A(t) = \sum_{j: a_j \leq t} \left[ \prod_{j'=1}^{j' = j-1} (1 - \Delta \hat H_{j'}) \right] \Delta \hat H_j^A\]
where \[\Delta \hat H_j = \frac{1}{r_j}\] \[\Delta \hat H^A_j = \frac{1}{r_j}{\mathbb{1}}_{E_j = A}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Net Survival Setup Want to consider excess deaths (E) versus the chance of a background event that has nothing to do with the disease we're studying (B).
\[h^i(t) = h^i_B(t) + h_E^i(t)\]
We want an unbiased estimator for \[d\bar H_E(t) = \frac{\sum_i F_E^i(t) dH_E^i(t)}{\sum_i F_E^i(t)}\] Incremental excess integrated population hazard
(Note that we assume \(h^i_B(t)\) is known from government data) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Pohar-Perme Estimator \[d\tilde H_E(t) = \frac{\sum_i dN_i(t) / F_B^i(t)}{\sum_i Y_i(t) / F_B^i(t)} - \frac{\sum_i Y_i(t) dH^i_B(t) / F_B^i(t)}{\sum_i Y_i(t) / F_B^i(t)}\] MedicalStats PartIIINotes
Basic Part III Notes::SLIP Definition: Hierarchial Form of Linear Mixed Effects Models \[u_i \sim N(0, \Sigma_u)\] \[Y_i {\,|\,} u_i \sim N(X_i\beta + Z_iu_i,\, \sigma^2 I_d)\]
SLIP PartIIINotes
Basic Part III Notes::SLIP What do we mean by \(X_{(i)}(x)\) and a corresponding \(Y_{(i)}\)? It's just a permutation of the data, sorting by distance to a point \(x\): \[||X_{(1)}(x) - x|| \leq ||X_{(2)}(x) - x|| \leq ... \leq ||X_{(n)}(x) - x||\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: \(L\)-Nearest Neighbors Classifier We have the conditional class probabilities: \[\hat p_k(x) = \frac{1}{L}\sum_{i=1}^L {\mathbb{1}}_{Y_{(i)} = k}\] Then we just use the plug-in estimation for the classifier: \[h^{LNN}(x) := {\underset{k}{\mathrm{argmax}}\,\,}\hat p_k(x)\]
(majority vote among \(L\) nearest neighbors) SLIP PartIIINotes
Basic Part III Notes::SLIP Statement of the bounds on the 1NN Classifier Risk bounds theorem Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.
Then the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Proposition: 1NN Classifier Risk
Suppose we're doing binary classification (\(k = 2\)) and all point distances are distinct (\(||x_{(j)} - x||\) unique). Assume the true conditional class probability \(p_1(x) = {\mathbb{P}}(Y = 1 {\,|\,} X = 1)\) is continuous, and \(X\) has density bounded away from zero on its support.
Show that the misclassification risk \(R(h^*)\) fulfills \[R(h^*) \leq \lim_{n \to \infty} R(h^{1NN}) \leq 2R(h^*)\] First show that \({\mathbb{P}}(h^{1NN}(x) = 1 {\,|\,} X_1 ... X_n) = p_1(X_{(1)}(x))\), and likewise for \(k=2\).
Use this to show \(R(h^{1NN}) = {\mathbb{E}}[p_2(X_{(1)}(X^*))p_1(X^*) + p_1(X_{(1)}(X^*))p_2(X^*)\)
Make the approximation that \(p_1(X_{(1)} (X^*)) \approx p_1(X^*)\) by continuity.
Plug into \(\lim_{n \to \infty} R(h^{1NN})\), recognize that \(2 {\mathbb{E}}[p_1(X^* )(1-p_1(X^* ))] \leq 2R(h^*) \) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Bootstrap Aggregation (Baggining) Given some low-bias classifier \(h\):
1. For \(b = 1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT from observations. Use this to compute the classifier \(h^{(b)}\).
2. Create a new compound classifier by majority voting among all \(b\)s: \[h^{bag}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \frac{1}{B}\sum_{b=1}^B {\mathbb{1}}_{h^{(b)} (x) = k}\] or for regression, just an average: \[\hat f^{bag}(x) = \frac{1}{B}\sum_{b=1}^B f^{(b)} (x)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Benefit of bagging Won't change the bias, but can reduce variance (because bootstrapped samples are more independent). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: m-out-of-n Bagging
(and its benefits) Instead of drawing \(n\) samples with replacement, draw \(m < n\) samples (still with replacement).
This increases independence among bagged samples, further decreasing the variance.
This yields a consistent estimator when \(B, m \to \infty\) with \(m / n \to 0\). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Gini Index \[G(R) := \sum_{k=1}^K \hat p_k(R)(1 - \hat p_k(R))\] where \[\hat p_k (R) := \frac{1}{N(R)}\sum_{i: X_i \in R} I_{Y_i = k}\] (fraction of points in \(R\) that correspond to class \(k\))
Measures the impurity of a region \(R\): we want regions with a lot of points with the same label. \\(pick a label randomly from the pool of labels we know the points have; what's the misclassification rate of just picking randomly from those labels, to label each point?) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Classification Tree Loss For classification trees, we use the loss: \[Q(j,t) := \frac{N(R_l(j,t))}{N(R)}G(R_l(j,t)) + \frac{N(R_r(j,t))}{N(R)}G(R_r(j,t)) - G(R)\] where \(G(R)\) is the Gini index.
(Reduction in impurity that splitting gains, weighted by the fraction of points that each region takes) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Final classification function of a classification tree \[h^{CART}(x) := \sum_{m=1}^M \hat c_m I_{x \in R_m}\] where \[\hat c_m = {\underset{k}{\mathrm{argmax}}\,\,} \hat p_k(R_m)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Bias-Variance Tradeoff for CART Shallow trees: high bias, low variance
Deep trees: low bias, high variance SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Random Forest Procedure Similar to bagging:
1. For \(b=1...B\), draw \(n\) samples \((X_i^{(b)}, Y_i^{(b)})\) WITH REPLACEMENT. At each split, pick the best out of some random selection \(m_{try}\) of the \(p\) variables. Continue until it has grown maximally.
2. Aggregate all \(B\) tree functions. For regression, just take the average of applying each tree function; for classification, just take the majority vote. SLIP PartIIINotes
Basic Part III Notes::SLIP Variance of a regression random forest \[{\mathrm{Var}}[f^{RF} (x)] = \rho \sigma^2 +\frac{1 - \rho}{B} \sigma^2\]
where for \(b \neq b'\), \(\sigma^2 = {\mathrm{Var}}[f^{(b)}(x)]\) and \(\rho\) is the correlation between \(f^{(b)}(x)\) and \(f^{(b')}(x)\).
(split correlation into diagonal and non-diagonal bits in the sum) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Out-of-bag Error (OOB) Just the average difference between \(Y_i\) and an aggregated forest prediction \(\hat Y_i\). Estimates the prediction error.
Note this approaches leave-one-out cross validation error as \(B \to \infty\). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Cook's Distance, Intuitively A datapoint \((x,y)\)'s cook's distance is how much it would affect the final fit if it were to be removed. SLIP PartIIINotes
Basic Part III Notes::MSM Vector generalization of the mean squared error (MSE) for an estimator \(\hat \theta\) of \(\theta\), and what it simplifies down to \[{\mathbb{E}}[(\hat \theta - \theta) (\hat \theta - \theta )^T ] = {\mathrm{Var}}(\hat \theta) + Bias(\hat \theta) Bias(\hat \theta)^T = {\mathrm{Var}}(\hat \theta) + {\mathbb{E}}[\hat \theta - \theta ]{\mathbb{E}}[\hat \theta - \theta ]^T\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Equation optimized for Ridge Regression \[ (\hat \mu_\lambda^R, \hat \beta_\lambda^R) = \underset{(\mu, \beta) \in {\mathbb{R}} \times {\mathbb{R}}^p}{argmin}(||Y - \mu 1 - X \beta||^2_2 + \lambda ||\beta||^2_2) \] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Solution for \(\hat \beta_\lambda^R\) in Ridge Regression \[(X^TX + \lambda I){^{-1}} X^T Y\] MSM PartIIINotes
Basic Part III Notes::MSM What do we know about the differences between OLS solutions \(\beta^{OLS}\) and ridge regression solutions \(\beta^R\)? As \(\lambda \to 0\), the difference in MSEs between \(\beta^{OLS}\) and \(\beta^{R}_\lambda\) is positive definite: \[{\mathbb{E}}[(\beta^{OLS} - \beta_0)(\beta^{OLS} - \beta_0)^T] - {\mathbb{E}}[(\beta^{R} - \beta_0)(\beta^{R} - \beta_0)^T] \succeq 0\]
Show this by first computing the bias and variance of \(\hat \beta ^R_\lambda\). MSM PartIIINotes
Basic Part III Notes::MSM In MSM, what does the notation \(X_j\) mean? The \(j\)th column of \(X\) MSM PartIIINotes
Basic Part III Notes::MSM What does SVD tell about when ridge regression works best? \(\lambda\) shrinks Y most in the small principal components of X. So if most of the signal is in the large principal components, we won't lose much. MSM PartIIINotes
Basic Part III Notes::MSM Show how the kernel trick leads to an alternate form of \(\hat \beta_\lambda^R\) Note first that for \(X \in {\mathbb{R}}^{n \times p}\), \[(X^TX + \lambda I_n) X^T = X^T(XX^T + \lambda I_p)\] Multiply by inverses on each side and then multiplying by Y on the right yields \[(X^TX + \lambda I_n){^{-1}} X^T Y = X^T(XX^T + \lambda I_p){^{-1}} Y = \hat \beta_\lambda^R\]
Note that the fitted values \(X\hat \beta_\lambda^R\) only depend on \(XX{^{\mathrm{T}}}\) then! MSM PartIIINotes
Basic Part III Notes::MSM Kernel Form of Ridge Regression \[X\hat \beta_\lambda^R = K(K + \lambda I){^{-1}} Y\] where \(K = X X{^{\mathrm{T}}}\) MSM PartIIINotes
Basic Part III Notes::MSM Feature Map A function \(\phi:X \to H\) from X to some Hilbert space \(H\).
Note that this gives us a similarity measure \(k: X \times X \to {\mathbb{R}}\): \[k(x, x') = \langle \phi(x), \phi(x') \rangle\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Positive Definite Kernel A symmetric map \(k: {\mathcal{X}} \times {\mathcal{X}} \to {\mathbb{R}}\) such that the matrix \(K\) with entries \[K_{ij} = k(x_i, x_j)\] is positive semi-definite. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Cauchy-Schwarz for Kernels \[k(x, x')^2 \leq k(x,x)k(x',x')\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Linear Kernel \(k(x,x') := x{^{\mathrm{T}}} x'\) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Polynomial Kernel \(k(x,x') := (1 + x{^{\mathrm{T}}} x')^d\) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Gaussian Kernel \(k(x,x') := \exp\left(-\frac{||x - x'||_2^2}{2h^2}\right)\) \(h>0\) is called the bandwidth.
Note that this must be infinite-dimensional because we have a term \(e^{xx'}\), which Taylor expands to an infinite series of polynomial terms. MSM PartIIINotes
Basic Part III Notes::MSM Definition: First-Order Sobolev Kernel \[k(x, x') = \min(x, x')\]
This must be a kernel because it can be represented as \[ k(x,x') = \int_0^1 {\mathbb{1}}_{[0, x]}(u){\mathbb{1}}_{[0, x']}(u) = \langle {\mathbb{1}}_{[0, x]}, {\mathbb{1}}_{[0, x']} \rangle_{L_2([0,1])} \] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Second-Order Sobolev Kernel \[ k(x, x') = \int_0^{\min(x,x')} (x - u)(x' - u)du \] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Jaccard Similarity Similarity based on what features are shared among \(1...p\). \(h(x,x')\) is defined as 1 when \(x \cup x' = \emptyset\), and otherwise \[h(x,x') = \frac{|x \cap x'|}{|x \cup x'|}\] where \(|.|\) is the set size. MSM PartIIINotes
Basic Part III Notes::MSM Definition: The Spectral Theorem If \(X\) is symmetric, we have a decomposition \[X = P{^{\mathrm{T}}} D P\] where \(D\) is diagonal and \(P\) is orthogonal. The rows of \(P\) are the eigenvectors of X.
Furthermore, if \(X\) is positive definite, every diagonal term in \(D\) is positive. MSM PartIIINotes
Basic Part III Notes::MSM Proposition: For every kernel \(k\) on FINITE \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which \[k(x,x') = \langle\phi(x), \phi(x')\rangle\] Because \({\mathcal{X}}\) is finite and \(k\) is a kernel, we have a positive definite \(K \in {\mathbb{R}}^{n \times n}\) that contains all information about the space. Because \(K\) is positive definite, we have the decomposition \(K = P{^{\mathrm{T}}} DP\) with orthogonal \(P\) and positive diagonal \(D\). Take \({\mathcal{H}}\) to be \({\mathbb{R}}^n\) and the feature map \(\phi: x_i \mapsto (D^{1/2} P_i) \) MSM PartIIINotes
Basic Part III Notes::MSM Proposition: For every kernel \(k\) in a general \({\mathcal{X}}\), there exists some Hilbert space \({\mathcal{H}}\) and feature map \(\phi: {\mathcal{X}} \to {\mathcal{H}}\) for which \[k(x,x') = \langle\phi(x), \phi(x')\rangle\]
First specify the \({\mathcal{H}}\) and \(\phi(x)\) used. Take \({\mathcal{H}}\) to be the linear span of the functions \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] for some \(n \in {\mathbb{N}}\), \(x_i \in {\mathcal{X}}\), and \(a_i \in {\mathbb{R}}\) with the inner product defined as \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] This is clearly well-defined since there's no dependence on the expansion of \(f\) or \(g\). So our feature map is \[\phi(x) = k(\cdot , x)\] MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Show that \({\mathcal{H}}\) defined as the space of functions of the form \[ f(\cdot) = \sum_{i=1}^n a_i k(\cdot, x_i) \] with inner product \[ \langle f, g \rangle = \sum_{i=1}^n \sum_{j=1}^m a_i b_j k(x_i, x'_j) \] and \[\phi(x) = k(\cdot , x)\] is a valid kernel and inner product. First show that we indeed have \[k(x,x') = \langle\phi(x), \phi(x')\rangle\] Clearly symmetric and linear, but is it positive definite?
Note first that \(\langle f, g\rangle \geq 0\) by positive definiteness of the kernel. So then for functions \(f_1...f_n\), coefficients \(\gamma_1...\gamma_n\), and working from the PSD requirement, we have: \[\sum_{i,j} \gamma_i \langle f_i, f_j \rangle \gamma_j = \left\langle \sum_i f_i \gamma_i, \sum_j f_j \gamma_j \right\rangle \geq 0\] Thus \(\langle\cdot , \cdot \rangle\) is a kernel. The last part to show is that \(\langle f, f\rangle = 0 \iff f = 0\). The backwards direction is obvious, but for the forwards direction, first note that \[\langle k(\cdot,x), f\rangle = \sum_i a_i k(x_i, x) = f(x)\] So, we have: \[\langle f, f\rangle = 0 \implies f(x) = \langle k(\cdot,x), f\rangle \leq \langle k(\cdot,x), k(\cdot,x)\rangle\langle f, f\rangle = 0\] where the \(\leq \) step is given by Cauchy-Schwarz for kernels. MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Show the representer theorem:
Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of
\[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]
is given by
\[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]
with \(\hat \alpha\) minimizing:
\[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\] Let \(U\) be the linear span of \(k(\cdot, x_i), i = 1...n\). Linear subspaces of Hilbert spaces are closed; and closed subspaces of Hilbert spaces mean that any function \(f = u + v\): \(u \in U\) and \(v \in U^\perp\).
Show first that \(f(x_i) = u(x_i)\) by means of the definition of RKHS, and that \(J(||f||^2_{\mathcal{H}}) \geq J(||u||^2_{\mathcal{H}})\). Reason then that \(Q_1\) is minimized when \(v = 0\).
Show then that we can express \(||u||^2_{\mathcal{H}}\) as \(\alpha^T K \alpha\) and \(f(x_i)\) as \(K \alpha\); thus \(Q_1(f) = Q_2(\alpha)\). MSM PartIIINotes
Basic Part III Notes::MSM Functional Generalization of Ridge Regression Find the \(\hat f_\lambda\) minimizing \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\] MSM PartIIINotes
Basic Part III Notes::MSM How can we simplify \[\sum_{i=1}^n (Y_i - f(x_i))^2 + \lambda ||f||^2_{\mathcal{H}}\] with the Representer Theorem? Using \(c(...) = \sum_{i=1}^n (Y_i - f(x_i))^2\) and \(J(||f||^2_{\mathcal{H}}) = \lambda||f||^2_{\mathcal{H}}\), we have that the above is equivalent to minimizing \[||Y - K\alpha||^2_2 + \lambda \alpha{^{\mathrm{T}}} K \alpha\]
Note that this gives us a way to predict an unknown point: \[\hat f(x) = \sum_{i=1}^n \hat \alpha_i k(x, x_i)\] MSM PartIIINotes
Basic Part III Notes::MSM What do we know about orthogonal matrices in the L2 norm? Because an orthogonal matrix never scales anything, we have: \[||UX||^2_2 = ||X|^2_2 = ||XU||^2_2\] MSM PartIIINotes
Basic Part III Notes::MSM Trick to pull orthogonal matrices out of an inverse like \((UDU{^{\mathrm{T}}} + \lambda I){^{-1}}\) Note that \(I = U{^{\mathrm{T}}} U = UU{^{\mathrm{T}}}\), so we have: \[\begin{align*} (UDU{^{\mathrm{T}}} + \lambda I){^{-1}} &= (UDU{^{\mathrm{T}}} + U\lambda IU{^{\mathrm{T}}}){^{-1}}\\\\& = (U(D + \lambda I)U{^{\mathrm{T}}}){^{-1}}\\\\& = U^{-T} (D+\lambda I){^{-1}} U{^{-1}}\\\\& = U(D + \lambda I)U{^{\mathrm{T}}} \end{align*}\] MSM PartIIINotes
Basic Part III Notes::MSM AM-GM The arithmetic mean is always at least as large as the geometric mean: for all \(a, b \in {\mathbb{R}}\), \[\frac{a+b}{2} \geq \sqrt{ab}\] MSM PartIIINotes
Basic Part III Notes::MSM How can we bound a term like \(\frac{ab}{(a+b)^2}\)? Use AM-GM: \[\frac{a+b}{2} \geq \sqrt{ab}\] \[a+b \geq 2\sqrt{ab}\] \[(a+b)^2 \geq 4ab\] \[\frac{ab}{(a+b)^2} \leq \frac{ab}{4ab} = \frac{1}{4}\] MSM PartIIINotes
Basic Part III Notes::MSM Theorem: Prove that the mean squared prediction error (MSPE) may be bounded above in the following way: \[\begin{align*} \frac{1}{n}\mathbb{E}\left\{\sum_{i=1}^n \{f^0(x_i) - \hat{f}_\lambda(x_i)\}^2\right\} &\leq \frac{\sigma^2}{n}\sum_{i=1}^n \frac{d_i^2}{(d_i+\lambda)^2} + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n} \\\\ &\leq \frac{\sigma^2}{n}\frac{1}{\lambda}\sum_{i=1}^n \min(d_i/4, \lambda) + \frac{\lambda\|f^0\|^2_\mathcal{H}}{4n}. \end{align*}\] Proof sketch:
By the representer theorem, we have that \[[\hat f_\lambda^R(x_i)] = K(K+\lambda I){^{-1}} Y\] and \[[\hat f_\lambda^0(x_i)] = K\alpha\] Use the eigendecomposition \(K = UDU{^{\mathrm{T}}}\), \(U\) orthogonal.
Let \(\Theta = D U{^{\mathrm{T}}} \alpha\), note that \(Y = K \alpha + {\varepsilon}\). Split into deterministic and stochastic parts (and swap them, stochastic is the first term in the final representation)
In the deterministic term, use the \(D{^{-1}}\) trick to introduce \(\alpha{^{\mathrm{T}}} K \alpha = ||f^0||^2_{\mathcal{H}}\); get the \(1/4\) from bounded the fraction with AM-GM.
In the stochastic bit, use the trace trick to get the \(\epsilon\)s together, where you can bring the expectation in to get \(\sigma^2\).
The final equality falls out of AM-GM again.
MSM PartIIINotes
Basic Part III Notes::MSM Definition: Eigenfunction Given a random variable \(X\) taking values in \({\mathcal{X}}\), we say a nonzero function \(e \in {\mathcal{H}}\) is an eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) if \[\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\] in other words, the \({\mathbb{E}}[k(x,X)...]\) part is a linear operator acting upon \(e\):
\[(Te)(x) = \int k(x,x') e(x')dP(x')\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Mercer's Theorem Given a random variable \(X\) taking values in \({\mathcal{X}}\) with a nonzero \(e \in {\mathcal{H}}\) eigenfunction with eigenvalue \(\mu \in {\mathbb{R}}\) (so that \(\mu e(x) = {\mathbb{E}}[k(x,X) e(X)]\) is fulfilled), and under some mild regularity conditions (like \({\mathbb{E}}(k(X,X)) < \infty\)), then three things must hold true:
1. The set of positive eigenvalues is at most countable.
2. The subspace spanned by each eigenfunction has a finite dimension (known as the multiplicity of the eigenvalue)
3. The eigenvalues are orthonormal in the sense that \[{\mathbb{E}}[ e_j(X) e_k(X)] = {\mathbb{1}}_{j=k}\] See the decomposition of kernels. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Bochner's Theorem Let \(k: {\mathbb{R}}^p \times {\mathbb{R}}^p \to {\mathbb{R}}\) be a continuous kernel. Then \(k\) is shift-invariant iff there exists some \(c >0\) and distribution \(F\) on \({\mathbb{R}}^p\) such that when \(W \sim F\),
\[k(x,x') = c {\mathbb{E}}[ e^{i(x-x'){^{\mathrm{T}}} W}] = c {\mathbb{E}}[ \cos ((x - x') {^{\mathrm{T}}} W)]\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Mean-Squared Prediction Error (MSPE) \[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2]\] Measures the average difference between true values and predicted values. MSM PartIIINotes
Basic Part III Notes::MSM Why do we expect that Lasso regression could be useful? Let \(S\) be the set \(\{ k : \beta_k^0 \neq 0 \}\) (the set of parameters that are nonzero). Oftentimes this set is smaller than \(p\), the total number of parameters. When this is the case, OLS could stand to be a lot better: consider the MSPE: \[\frac{1}{n} {\mathbb{E}}[||X\beta^0 - X\hat \beta||^2_2] = ... = \frac{\sigma^2p}{n}\] If we just dropped each of the \(p\)s that aren't in \(S\), we could significantly reduce the error. MSM PartIIINotes
Basic Part III Notes::MSM Function minimized in Lasso regression Find the \((\hat \mu^L, \hat \beta ^L)\) minimizing \[ \frac{1}{2n}||Y - X \beta||_2^2 + \lambda || \beta ||_1 \]
We generally assume that \(X\) is already mean-centered, giving us the loss we actually seek to minimize: \[Q_\lambda(\beta) := \frac{1}{2n}||Y - X\beta||_2^2 + \lambda ||\beta||_1\] MSM PartIIINotes
Basic Part III Notes::MSM Given the minimizer \(\hat \beta ^L_\lambda\) in lasso regression, how can we express this as a conditional optimization problem? \[||Y - X \beta||^2_2\, \text{ subject to } ||\beta||_1 \leq ||\hat \beta ^L_\lambda||_1\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Holder's Inequality Let \(p, q \in [1, \infty]\) with \(\frac{1}{p} + \frac{1}{q} = 1\). Then we have, for any functions \(f\), \(g\), \[||fg||_1 \leq ||f||_p||g||_q\] MSM PartIIINotes
Basic Part III Notes::MSM Theorem: Prediction Error of the Lasso (slow rate):
Let \(\hat{\beta}\) be any Lasso solution when \[\lambda = A\sigma\sqrt{\frac{\log(p)}{n}}.\] with probability at least \(1 - 2p^{-(A^2/2-1)}\). Then: \[\frac{1}{n}\|X(\beta^0 - \hat{\beta})\|_2^2 \leq 4A\sigma\sqrt{\frac{\log(p)}{n}}\|\beta^0\|_1.\] Proof sketch:
Start from the fact that we know \(\beta^L\) can do no better than the true mean \(\beta^0\): \[\frac{1}{2n} ||Y - X \hat \beta||_2^2 + \lambda ||\hat \beta||_1 \leq \frac{1}{2n} ||Y - X \beta^0||_2^2 + \lambda ||\beta^0||_1\] Substitute in \(Y = X\beta^0 + {\varepsilon}\), match the left side.
We can put an implicit norm around \({\varepsilon}{^{\mathrm{T}}} X (\hat \beta - \beta^0)\), use Holder's to get the one norm we need paired with an infinity norm around \(X{^{\mathrm{T}}} {\varepsilon}\)
In order to make everything work out nicely, we need the assumption that \(\frac{1}{n}||X{^{\mathrm{T}}} {\varepsilon}||_\infty \leq \lambda\). Call this event \(\Omega\), we assume that it is true with probability at least \(1 - 2p^{-(A^2/2-1)}\).
Then just use the triangle inequality and multiply both sides by two.
MSM PartIIINotes
Basic Part III Notes::MSM Definition: Subgaussian RV We say a random variable \(X\) is subgaussian with parameter \(\sigma\) if: \[{\mathbb{E}}[e^{t(X - {\mathbb{E}}[X])}] \leq e^{t^2\sigma^2/2}\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Subgaussian Tail Bound Given \(X\) subgaussian with parameter \(\sigma\), we have: \[{\mathbb{P}}(X - {\mathbb{E}}[X] \geq t) \leq e^{-t^2 / (2\sigma^2)}\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Hoeffding's Lemma If \(X\) takes values in \([a,b]\), then \(X\) is subgaussian with parameter \((b-a)/2\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: What must be true about sequences of independent subgaussian RVs? If independent subgaussian RVs \(X_1...X_n\) have parameters \(\sigma_i\) respectively, then the RV \(\gamma{^{\mathrm{T}}} X\) is subgaussian with parameters \(\left( \sum_i \gamma_i^2 \sigma_i^2 \right)^{1/2}\) MSM PartIIINotes
Basic Part III Notes::MSM Theorem: Suppose \((\varepsilon_i)_{i=1}^n\) are independent, mean-zero and sub-Gaussian with common parameter \(\sigma\). Note that this includes \(\varepsilon \sim N_n(0, \sigma^2 I)\). Let \(\lambda = A\sigma\sqrt{\log(p)/n}\). Then \[\mathbb{P}(\|X^\top \varepsilon\|_\infty/n \leq \lambda) \geq 1 - 2p^{-(A^2/2-1)}\] Proof sketch:
Start from \[...\leq \sum_{i=1}^p {\mathbb{P}}(\frac{1}{n} |{\varepsilon}^T X_i| > \lambda)\] Split into two events, one where \({\varepsilon}^T X_i\) is positive, and one where it's negative. Both are subgaussian with parameter \(\sigma/\sqrt{n}\), so use the upper bound. Then just plug in \(\lambda\).
MSM PartIIINotes
Basic Part III Notes::MSM Theorem: For \(\lambda \geq 0\), any two lasso regression solutions \(\hat \beta_\lambda^{(1)}\) and \(\hat \beta_\lambda^{(2)}\) must have the same fitted values: \(X \hat \beta_\lambda^{(1)} = X \hat \beta_\lambda^{(2)}\) First note that, by the convexity of \(||\cdot||^2_2\) with \(t = \frac{1}{2}\), \[||Y - \frac{1}{2}X\beta^{(1)}_\lambda - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \leq ||Y - \frac{1}{2}X\beta^{(1)}_\lambda||^2_2 + ||Y - \frac{1}{2}X\beta^{(2)}_\lambda ||^2_2 \] and by strict convexity, equality is achieved iff \(X\beta^{(1)} = X\beta^{(2)}\).
Show that \[c^* \leq Q_\lambda(\frac{1}{2}X\beta^{(1)} + \frac{1}{2}X\beta^{(2)}) \leq \frac{1}{2}Q_\lambda(X\beta^{(1)}) + \frac{1}{2}Q_\lambda(X\beta^{(2)}) = c^*\] applying convexity of the L1 and L2 norms, and so equality is achieved. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Coordinate Descent Best used to optimize functions of the form \[f(x) = g(x) + \sum_{i=1}^n h_j(x_j)\] Start with some initial guess \(x^{(0)}\). Optimize on one component at a time, updating as you go along each cycle. MSM PartIIINotes
Basic Part III Notes::MSM When will coordinate descent converge to a minimizer? When \(A_0 = \{x : f(x) \leq f(x^{(0)}) \}\) is a compact set (with \(x^{(0)}\) the starting guess) MSM PartIIINotes
Basic Part III Notes::MSM What three properties must inner products have? 1. Conjugate Symmetry: \(\langle x, y \rangle = \overline{ \langle y, x \rangle }\)
2. Linearity in the first argument: \(\langle ax + by, z\rangle = a\langle x, y\rangle + b\langle x,y \rangle\)
3. Positive Definiteness: \(\langle x, x \rangle \geq 0,\, \text{equality iff } x = 0 \)
for \(x, y \in \mathcal{H},\, a, b \in \mathbb{K}\)
Note that 1 and 2 together makes the inner product a sesquilinear form. MSM PartIIINotes
Basic Part III Notes::MSM What does a matrix being positive definite tell us about its eigenvalues? The eigenvalues are positive (and real). MSM PartIIINotes
Basic Part III Notes::MSM What is the definition of a matrix being positive-definite? \(A\) IS SYMMETRIC, and for any nonzero real column vector \(x\),
\(x^TAx > 0\)
Note this implies \[\sum_{i,j} x_i A_{ij} x_j > 0\]
Semidefinite just means it can also be zero. MSM PartIIINotes
Basic Part III Notes::MSM If some matrix \(A\) has full column rank, what does that tell us about \(A^TA\)? \(A^TA\) is positive definite MSM PartIIINotes
Basic Part III Notes::MSM What are the three kernel closure properties? 1. Closure over addition and nonnegative scalars: \(a_1k_1 + a_2k_2\)
2. Closure over sequences: \(\lim_{k \to \infty} k_m\)
3. Closure over products: \(k_1k_2\) MSM PartIIINotes
Basic Part III Notes::MSM State the Representer Theorem Given some RKHS \(\mathcal{H}\) with representing kernel \(K\) and arbitrary loss function \(c: \mathbb{R}^n \times X^n \times \mathbb{R}^n \to \mathbb{R}\) and strictly increasing \(J: {\mathbb{R}} \to {\mathbb{R}}\), the minimizing \(\hat f \in \mathcal{H}: X \to \mathbb{R}\) of
\[Q_1(f):= c(y, x_1, ..., x_n, f(x_1), ..., f(x_n)) + J(||f||_\mathcal{H}^2)\]
is given by
\[\hat f(x) = \sum_{i=1}^n \hat \alpha_i K(x, x_i)\]
with \(\hat \alpha\) minimizing:
\[Q_2(\alpha) = c(y, x_1, ..., x_n, K\alpha) + J(\alpha ^TK\alpha)\] MSM PartIIINotes
Basic Part III Notes::MSM State the KKT conditions for the Lasso \[\hat v = \frac{1}{\lambda n}X^T(Y - X \beta^L_\lambda)\]
where \(\hat v\) satisfies \(||\hat v||_\infty \leq 1\) and \(\hat v_{\hat S} = \mathrm{sgn}(\hat \beta^L_{\lambda,\,\hat S})\) where \(\hat S = \{ j:\, \beta^L_{\lambda,\, j} \neq 0 \}\) MSM PartIIINotes
Basic Part III Notes::MSM State Markov's Inequality If \(X\) is a nonnegative random variable and \(a > 0\), then we have:
\(\mathrm{P}(X \geq a) \leq \frac{\mathrm{E}(X)}{a}\) MSM PartIIINotes
Basic Part III Notes::MSM State Chebyshev's Inequality \(\mathbb{P}(|X - \mu| \geq k\sigma) \leq \frac{1}{k^2}\)
Note that this is useless when \(k \leq 1\), since all probabilities are \(\leq 1\)
Equivalently we have
\(\mathbb{P}(|X-\mu| \geq a) \leq \frac{{\mathrm{Var}}(X)}{a^2}\) MSM PartIIINotes
Basic Part III Notes::MSM State the definition of function convexity. \(f((1-t)x + ty) \leq (1-t)f(x) + tf(y)\)
for all \(x, y \in C\)
given convex space \(C\), \(f: C \to \mathbb{R}^n\) MSM PartIIINotes
Basic Part III Notes::MSM Write the Lagrangian function as used in the Lagrangian Method to minimize a function \(f(x)\) subject to \(g(x) = 0\)
(\(f: C \to \mathbb{R}\) (\(C \subseteq \mathbb{R}^d\) convex), \(g: \mathbb{R}^d \to \mathbb{R}^d\)) \(L(x, \theta) = f(x) + \theta^Tg(x)\)
Find minima in \(x, \theta\)
We're looking for points where the tangents to the contours of \(f\) and \(g\) are parallel (or where \(f\) alone is minimized and \(\theta = 0\)) MSM PartIIINotes
Basic Part III Notes::MSM Given a convex function \(f: C \to \mathbb{R}\), when is a vector \(v \in \mathbb{R}^d\) a subgradient of \(f\) at \(x_0\)? \(\forall x \in C,\, f(x) - f(x_0) \geq v^T(x-x_0)\)
So in one dimension we have:
\(\forall x \in C,\, f(x)-f(x_0) \geq c(x-x_0)\)
given \(c \in \mathbb{R}^n\)
any \(c\) such that our function stays above the line drawn out from \(x\) with slope \(c\) MSM PartIIINotes
Basic Part III Notes::MSM What is a subdifferential of a function \(f\) at \(x\), and how is it denoted? The set of all subgradients of \(f\) at \(x\).
It's denoted \(\partial f(x)\) MSM PartIIINotes
Basic Part III Notes::MSM For the interior of a convex region, if \(f\) is differentiable, what are the subdifferentials of \(f\)? Just the gradient of \(f\) at any point \(x\):
\(\partial f(x) = \{ \nabla f(x) \}\) MSM PartIIINotes
Basic Part III Notes::MSM What are the simplifying properties of subdifferentials? 1. Subdifferentials scale how you'd expect:
\(\partial(\alpha f)(x) = \{ \alpha v,\, v\in \partial f(x) \} \)
2. When adding functions, we consider all permutations:
\(\partial(f_1 + f_2)(x) = \{ v_1 + v_2,\, v_1\in \partial f_1(x),\, v_2\in \partial f_2(x) \} \)
3. Subdifferentials ignore constant adds and respect linearity:
\(h(x) = f(Ax + b) \implies \partial h(x) = \{ A{^{\mathrm{T}}} g : g \in \partial f(Ax+b) \}\) MSM PartIIINotes
Basic Part III Notes::MSM For a convex function \(f\), how must any minimizer relate to subdifferentials? We have
\[x^* \in \underset{x \in \mathbb{R}^d}{\mathrm{argmin}}\, f(x) \iff 0 \in \partial f(x^*)\]
(both are equivalent to saying \(f(y) \leq f(x^*),\, \forall y\)) MSM PartIIINotes
Basic Part III Notes::MSM What is the subdifferential of the L1 norm \(\partial ||x||_1\)? \[\partial ||x||_1 = \{ v \in \mathbb{R}^d : ||v||_\infty \leq 1,\, v_A = \mathrm{sgn}(x_A)\}\]
where \(A \in \{ j : x_j \neq 0 \}\) MSM PartIIINotes
Basic Part III Notes::MSM For \(x \in \mathbb{R}^d\), state an equivalent form of \(\partial ||x||_1\). \[\partial x ||x || _1 = \{ v \in \mathbb{R}^d,\, ||v||_\infty \leq 1 \text{ and } v_A = \mathrm{sgn}(x_A)\}\]
where \(A = \{ j:\, x_j \neq 0\}\)
In other words, \(v\) where the magnitude of all components of \(v\) do not exceed 1, \(v\) is positive in the components where \(x\) is positive and negative where the components of \(x\) is negative. MSM PartIIINotes
Basic Part III Notes::MSM What do we know about the fitted values of Lasso minimizers? They are unique
(though the minimizing \(\beta^L_\lambda\) may not be unique) MSM PartIIINotes
Basic Part III Notes::MSM What do the sets \(S\) and \(N\) denote in the context of Lasso regression? \(S := \{ k: \, \beta_k^0 \neq 0 \}\) is the set of nonzero terms of the true minimizer \(\beta^0\)
\(N:= (1, 2, ..., p) \setminus S\) is the set of zero terms of the true minimizer \(\beta^0\)
So for instance \(X_N\) is the matrix of ignored predictor columns. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Sum form of positive-semi-definiteness A matrix \(X\) is positive semi-definite iff \[\sum_{i,j}a_iX_{ij} a_j \geq 0\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Kernel from a Feature Map Given some feature map \(\phi:{\mathcal{X}} \to {\mathcal{H}}\), the following is a kernel: \[k(x,y) = \langle \phi(x), \phi(y) \rangle \]
This is because \[\sum_{i,j} a_i k(x_i, x_j)a_j = \sum_{i,j} a_i \langle \phi(x_i), \phi(x_j) \rangle a_j = \langle \sum_i a_i \phi(x_i), \sum_i a_i \phi(x_i) \rangle \geq 0\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Set Convexity A set \(C\) is convex if, for all \(x, y \in C\), \(t \in [0,1]\), \[tx + (1-t)y \in C\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: The Gamma Integral \[\Gamma(z) = \int_{0}^\infty u^{z-1}e^{-u}du\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: How to compute \(\Gamma(n)\) Know that \[\Gamma(n) = (n - 1)!\] \[\Gamma(1/2) = \sqrt{\pi}\] \[\Gamma(z+1) = z\Gamma(z)\] for \(n \in {\mathbb{N}}\), \(z \in {\mathbb{R}}\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: Laplace Transform of a Power \[t^{-\alpha} = \frac{1}{\Gamma(z)} \int_0^\infty v^{z-1}e^{-tz}\] MSM PartIIINotes
Basic Part III Notes::MedicalStats Causation versus association Causation is the language of intervention. Association is the language of probability. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Confounding When some event is a common cause of two variables. The two variables may appear correlated, but don't cause each other. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Collider Bias / Berkson's Paradox Celebrity example: You may notice that ugly celebrities tend to be talented, and that untalented celebrities tend to be attractive. But this is only because they wouldn't be a celebrity if they weren't either.
When two different events cause the same outcome, and that outcome makes observation easier, then those two events may seem erroneously negatively correlated. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Reverse Causation If symptoms are consistently treated, treatment may appear to 'cause' the illness. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How can we measure causality? 1. Perform controlled experiments
2. Perform randomized experiments
3. Account for confounding in analyses
Use instrumental variables
4. Assume that some part of the data-generating process behaves like randomization (eg mendelian randomization)
This is called a natural experiment MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Negative Controls To check your work, check causality of obviously uncorrelated phenomena MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Difference-in-Differences Sudden divergence of trendlines in two populations: there must be some reason for the divergence! MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Regression Continuity Check that measured data that should be smooth is actually smooth. Jump points are suspicious. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Immortal Time Bias Must consider the age requirement of some property: nobel prize winners are older than non-nobel prize winners. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Intention-to-treat Who will be included in the study must be set-in-stone before time zero (can't disclude people on the fly) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Case-control studies Case studies paired with a control group. Not optimal, but sometimes necessary for rare diseases. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Cohort Studies Look at a (snapshot) population and observe what effects it over time. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Rubin's Dictum For objective causal inference, design trumps analysis. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Transition Rate, and when it is a Markov process and time homogeneous \[q_{rs}(t; {\mathcal{F}}_t)=\lim_{\delta t \to 0} \frac{{\mathbb{P}}(X(t + \delta t) = s {\,|\,} X(t) = r,\, {\mathcal{F}}_t)}{\delta t}\] \(X(t)\) represents the state the individual is in at time \(t\), \({\mathcal{F}}_t\) is all information about the process prior to \(t\).
It's a Markov process when \(q_{rs}\) has no dependence on \({\mathcal{F}}_t\), and it's time-homogeneous when \(q_{rs}\) has no dependence on time (time probabilities over an interval only depend on the duration of the interval). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Transition Intensity Matrix \(Q\) \[Q_{r,s} := q_{r,s} \text{ for } r \neq s\] \[Q_{r,r} = -\sum_{s \neq r} q_{rs}\] So rows of \(Q\) sum to zero. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Sojourn Time Distribution of time spent in state \(r\) before moving
Has an exponential distribution with rate \[\lambda = \sum_{s,\, s \neq r} q_{rs} = -q_{rr}\]
So the average time in a state before switching is given by
\[\frac{1}{\lambda} = \frac{1}{\sum_{s,\, s \neq r} q_{rs}} = \frac{1}{-q_{rr}}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Probability that next state is \(s\) given current state is \(r\) in a continuous Markov process \[\frac{q_{rs}}{\sum_{j,\, j\neq r} q_{rj}} = \frac{-q_{rs}}{q_{rr}}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Transition Probability Matrix \(P(t)\) \[P_{rs}(t) = {\mathbb{P}}(\text{State $s$ at time $t$} {\,|\,} \text{State $r$ at time $0$})\] Solution to the forward Kolmogorov equations: \[\frac{dP(t)}{dt} = P(t)Q(t)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How to solve for the transition probability matrix given \(P(0) = I\), \(Q\) is time-homogeneous \[P(t) = \exp (tQ) = \sum_{n=0}^\infty \frac{t_n}{n!}Q^n\] Use an eigendecomposition for \(Q = UDU\inv\), \[= \exp(tQ) = U\exp (Dt) U\inv\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Expected time matrix \(T(t)\) (in a continuous time Markov process) Expected total length of time spent in state \(s\), starting from state \(r\): \[T_{rs}(t) = {\mathbb{E}}\left[ \int_0^t I_{X(u) = s} du \right] = \int_0^t P_{rs}(u)du\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Expected number of visits to a state (in a continuous-time Markov process) \[\sum_{i\neq s}\int_0^t P_{ri}(u) q_{is}du = \sum_{i\neq s} T_{ri}(t)q_{is}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How to find the probability of ever visiting a state (in a continuous-time Markov process) Create a new problem by zeroing out the \(s\)th row of Q (a trap state), then just compute the new \(P^*(t) = \exp(tQ^*)\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Likelihood contribution of discrete data in a continuous-time Markov process Let \(a\) be \(x_{i0}\) (state before), and let \(b\) be \(x_{i1}\) (state after), then \[{\mathbb{P}}(X_{i1} = b {\,|\,} a) = P_{ab}(t_{i1} - t_{i0} {\,|\,} \theta)\] Take the product of these all to get the full likelihood. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Likelihood contribution of a known time interruption (like death) in a continuous time Markov process Problem: we don't know what state the individual was in prior to death. So we have to allow for every possibility: \[\sum_{s \neq d} p_{rs}(t_{i,n_i} - t_{i, n_{i-1}})q_{sd}\] where \(d\) is the death state, and \(r\) is the last known pre-death state \(x_{i,n_{i-1}}\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Likelihood contribution of a censored individual in a continuous time Markov process Problem: We don't know what disease state the individual is in at the end of the time period, only that they never died (or else we would have known) \[\sum_{s \neq d} P_{r,s}(t_{i, n_i} - t_{i,n_{i-1}}) = 1 - P_{r,d}(t_{i, n_i} - t_{i,n_{i-1}})\] where \(d\) is the death state, and \(r\) is the last known pre-death state \(x_{i,n_i-1}\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Net Survival Integral \[F_{net} = \exp \int_0^t d\bar H_E(s) ds\] What fraction of patients would survive to time \(t\) if it were the only possible cause of death? MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats How can we make up for left-truncated data? Divide by \(F(S_i)\) in the likelihood contributions (if the individual had had the event before \(S_i\), they would not have been included in the dataset) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Accounting for piecewise hazard rates Divide time into intervals; use left-truncation on all but the first region and right-censor any observation after the cutoff time.
\[\hat \theta_{interval} = \frac{\text{number of events in interval}}{\text{sum of all time at risk in interval}}\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Best way to account for left-truncated data in practice Kaplan-meier accounts for this already; just make sure that left-truncated individuals aren't considered as part of the risk set until their introduction.
This gives the same estimator as the likelihood and piecewise approaches. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Period Survival We want to characterize survival rates in some restricted period, say 2024. Denote the interval of interest \((a, b]\).
Convert calendar time (time \(u\), \(y_i\) diagnosis date, \(z_i\) event observation date, \(w_i\) event or censored) into individual's time (\(s_i\) truncation time, \(x_i\) event/censoring time, \(v_i\) event or censored)
Only include individuals that spend some amount of time in the interval pre-event. Left-truncate time before interval cutoff; censor events that occur after the time cutoff. MedicalStats PartIIINotes
Basic Part III Notes::MSM Definition: Basic Inequality for the Lasso \[\frac{1}{2n}||Y - X \hat \beta||_2^2 + \lambda || \hat \beta ||_1 \leq \frac{1}{2n}||Y - X \beta^0||_2^2 + \lambda || \beta^0 ||_1\] where \(\hat \beta\) is the true optimized solution of the Lasso loss equation. MSM PartIIINotes
Basic Part III Notes::MSM Theorem: Let \(\lambda > 0\) and define \(\Delta = X_N^\top X_S (X_S^\top X_S)^{-1} \text{sgn}(\beta_S^0)\). If \(\|\Delta\|_\infty \leq 1\) and for \(k \in S\), \[ |\beta_k^0| > \lambda |\text{sgn}(\beta_S^0)^\top [\{\tfrac{1}{n} X_S^\top X_S\}^{-1}]_k|, \] then there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\). As a partial converse, if there exists a Lasso solution \(\hat{\beta}_\lambda^{\text{L}}\) with \(\text{sgn}(\hat{\beta}_\lambda^{\text{L}}) = \text{sgn}(\beta^0)\), then \(\|\Delta\|_\infty \leq 1\). Proof sketch:
Plug into the KKT conditions for the Lasso. Use the following decomposition: \[ X{^{\mathrm{T}}} X = \left[\begin{array}{cc} X_S{^{\mathrm{T}}} X_S & X_S{^{\mathrm{T}}} X_N \\\\ X_N{^{\mathrm{T}}} X_S & X_N{^{\mathrm{T}}} X_N \end{array}\right] \] Also note that if \(\mathrm{sgn}(\beta^0) = \mathrm{sgn}(\hat \beta)\), then \(\hat \beta_N = 0\).
MSM PartIIINotes
Basic Part III Notes::MSM Characteristic Function Bound of Bernstein's Inequality Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{E}}(e^{\alpha (W_i - \mu)}) \leq \exp\left( \frac{\alpha^2 \sigma^2}{2(1 - b|\alpha|)} \right) \] for \(|\alpha| < 1/b\). MSM PartIIINotes
Basic Part III Notes::MSM Probability bound of Bernstein's Inequality Let \(W_1\), \(W_2\), ... be IID random variables, all with mean \(\mu\), and each satisfying Bernstein's condition with parameter \((\sigma, b)\). Then we have \[ {\mathbb{P}}\left(\frac{1}{n} \sum_{i=1}^n W_i - \mu \geq t\right) \leq \exp\left( -\frac{nt^2}{2(\sigma^2 + bt)} \right) \] for \(t > 0\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: Product of Subgaussian RVs Suppose \(W, Z\) be mean zero and subgaussian with parameters \(\sigma_W\) and \(\sigma_Z\) respectively. Then \(WZ\) fulfills Bernstein's condition with parameter \((8 \sigma_W \sigma_Z, 4 \sigma_W \sigma_Z)\) MSM PartIIINotes
Basic Part III Notes::MSM Theorem: If a distribution \(P\) on \({\mathbb{R}}^n\) has a density with respect to the product measure which is everywhere positive, then its conditional independence graph is global Markov Proof sketch: Use backwards induction on the size of the separating set.
Base case: \(A\) has one element, \(B\) has one element, everything else is in \(S\) (case \(p - 2\)). Fulfills global Markov by independence graph definition.
Inductive step: Grow \(A\) and \(B\) to include all 'independent' points. Either \(A\) or \(B\) must have at least two points; WLOG it's \(A\). Call one of the points \(j\), and the rest of \(A\) \(A_-\). By the inductive hypothesis we have: \[Z_{A_-} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup \{ j\}}\] \[Z_{j} {\perp\!\!\!\perp} Z_B {\,|\,} Z_{S \cup A_-}\] So, applying the intersection rule (and using our assumption), we have \[Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\] as required. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Markov Blanket In a regression problem where \(X\) is a set of predictors and \(Y\) is a response, a Markov blanket \(X_S\) is a subset of \(X\) that fulfills \[Y {\perp\!\!\!\perp} X_{S^c} {\,|\,} X_S\] (so \(X_S\) contains all predictors relevant to predicting \(Y\)).
There's guaranteed to be a unique minimal blanket (take the weak union of every individually dependent covariate) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Structural Causal Models (SCMs) A tuple \({\mathcal{S}} = (P_j, h_j, Q_j)_{j \in [p]}\) where:
\(P_j\) is the set of 'dependencies' for node \(j\): in other words, \(\mathrm{Pa}(j)\).
\(h_j: {\mathbb{R}}^{|P_j|} \times {\mathbb{R}} \to {\mathbb{R}}\): A 'decision function' for node \(j\), which takes in every defined dependency and a source of randomness, then makes a decision as to node \(j\)'s value.
\(Q_j\): A probability distribution over \({\mathbb{R}}\), the source of randomness for each node.
Provided alongside it is the full random vector \(Z_j := h_j(Z_{P_j}, {\varepsilon}_j)\) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Interventions and Perfect Interventions to a Structural Causal Model (SCM) Replace a \(h_j(Z_{P_j}, {\varepsilon}_j)\) with a new \(\tilde h_j(Z_{P_j}, {\varepsilon}_j)\), hoping to achieve some downstream result. If \(\tilde h_j\) is just some constant function, we say it is a perfect intervention. MSM PartIIINotes
Basic Part III Notes::MSM How do we denote an intervention in an SCM? \[...|\mathrm{do}(Z_j = \cdot)\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Collider in a path in a DAG Given some DAG with an undirected path \(j_1,..., j_m\) with length \(m \geq 3\), we say \(j_l\) is a collider (relative to the path) if we have \[j_{l-1} \to j_l \leftarrow j_{l + 1}\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Blocked path in a DAG A path \(j_1,..., j_m\) is blocked by a set of nodes \(S\) if there exists a node \(v\) on the path such that either:
\(v\) IS NOT a collider, and \(v \in S\), or
\(v\) IS a collider, and neither \(v\) nor any of its descendants are in \(S\).
So in other words, a path is not blocked if non-colliders are not in \(S\) and colliders are either in \(S\), or have descendants in \(S\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: d-Seperation in a DAG
(and how it's denoted) Given disjoint subsets \(A, B\) in a graph \(G\), we say \(A\) and \(B\) are d-separated by a subset \(S\) if every path between \(A\) and \(B\) is blocked by \(S\). We denote this as: \[A {\perp\!\!\!\perp}_G B {\,|\,} S\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Global Markov Directed (Causal) Graphs A distribution \(P\) on \({\mathbb{R}}^p\) is global Markov with respect to a DAG \(G\) on \(p\) vertices if, whenever \(Z \sim P\), and \(A, B, S\) are disjoint sets of vertices, \[A {\perp\!\!\!\perp}_G B {\,|\,} S \implies Z_A {\perp\!\!\!\perp} Z_B {\,|\,} Z_S\]
Note that distributions generated from an SCM are always global Markov with respect the SCM DAG. MSM PartIIINotes
Basic Part III Notes::MSM Proposition: Let \(p \geq 2\) and \(Z \sim P\). If \(Z_j \not {\perp\!\!\!\perp} Z_k {\,|\,} Z_{-jk}\) for \(j, k \in [p]\), then any causal DAG generating \(P\) must have either \(j \to k\), \(k \to j\), or \(j \to l \leftarrow k\) for some \(l\). Proof sketch:
By definition, \(j\) and \(k\) can not be d-seperated by \(S := [p] \setminus \{ j, k\}\). The only possible way to avoid being blocked by every other point is to either have a path directly between \(j\) and \(k\), or to have \(j \to l \leftarrow k\). MSM PartIIINotes
Basic Part III Notes::MSM Proposition: If nodes \(j\) and \(k\) in a DAG with \(p\) vertices are not adjacent and \(\pi\) is a topological order on \([p]\) with \(\pi(j) < \pi (k)\), then they are d-separated by \(\mathrm{Pa(k)}\). Proof sketch:
Consider a path \(j = j_1,...,j_m=k\). We can't have \(k \to j\) by the topological order assumption. We can't have \(j_{m-1} \to k\) because then \(j_{m-1} \in \mathrm{Pa}(k)\). We could try to have a V-shaped path \[j = j_1 \to ... \to j_{l-1} \to j_l \leftarrow ... \leftarrow j_m = k\] (must exist due to topological ordering). But if we take the maximal \(j_l\), we can see that \(j_l\) can't have a descendant in \(\mathrm{Pa}(k)\) because we'd have a cycle in a DAG. \(j\) and \(k\) must be d-separated. MSM PartIIINotes
Basic Part III Notes::MSM Definition: Fundamental result of Gaussian causal models \[Z_A {\,|\,} Z_B = z_B \sim N_{|A|}(\mu_A + \Sigma_{A,B} \Sigma_{B,B}{^{-1}} (z_B - \mu_B), \Sigma_{A, A} - \Sigma_{A,B} \Sigma_{B,B}{^{-1}} \Sigma_{B,A})\]
For nodewise regression, specialize this to \(A = \{k\}\) and \(B = A^c\). MSM PartIIINotes
Basic Part III Notes::MSM Definition: Graphical Lasso Minimizes \[-\log \det (\Omega) + \mathrm{tr}(S\Omega) + \lambda \sum_{j,k}|\Omega_{jk}|\] where \(\Omega\) is the precision matrix \(\Sigma{^{-1}}\), and \(S := \frac{1}{n} \sum_{i=1}^n x_ix_i{^{\mathrm{T}}} - \hat X \hat X{^{\mathrm{T}}}\) MSM PartIIINotes
Basic Part III Notes::MSM Definition: Weak Law of Large Numbers If \(W_1, W_2, ...\) are iid real-valued random variables and \({\mathbb{E}}(W_i) = \mu < \infty\), then as \(n \to \infty\), \[\frac{1}{n}\sum_{i=1}^n W_i \overset{p}{\to} \mu\] MSM PartIIINotes
Basic Part III Notes::MedicalStats Definition: Assumptions made in the SIR model \(X_i(t)\) maps the \(i\)th individual to the state they are in at time \(t\).
\(S(t), I(t), R(t)\) are the number of people susceptible, infected, and recovered, respectively. ie, \[S(t) = \sum_{i=1}^N I_{X_i(t) = S}\] \(N\) is the total population, assume it is fixed, so that for all \(t\), \[N = S(t) + I(t) + R(t)\] The rate of moving from \(S\) to \(I\) is denoted \(\lambda(t)\), is usually equal to \(\beta I(t)\).
The rate of moving from \(I\) to \(R\) is denoted \(\gamma(t)\), usually just a constant \(\gamma\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Gillespie Algorithm Continuous time discrete individual algorithm:
Two events:
someone is infected (\((s, i) \mapsto (s-1, i+1)\))
or someone recovers (\((s, i) \mapsto (s, i-1)\)).
Draw next time: \(\sim \exp (\beta is + \gamma i)\),
At that next time, pick one of the two events (like a binomial): infected w.p. \(\frac{\beta is}{\beta is + \gamma i}\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Chain-Binomial Model Discrete time discrete individual algorithm, sort of like a probabilistic ODE:
Take time steps of size \(\delta: [t, t + \delta)\)
\(S_{t + \delta} = S_t - B_t\)
\(I_{t + \delta} = I_t + B_t - C_t\)
\(R_{t + \delta} = R_t + C_t\)
where \(B_t, C_t\) are binomial RVs: \[B_t \sim \mathrm{Binom}(S_t, 1 - \exp (-\beta I_t \delta)) \] \[C_t \sim \mathrm{Binom}(I_t, 1 - \exp (-\gamma \delta)) \] (where the exponential comes from \(\geq 1\) event in a Poisson RV)
We can simplify \(B_t\) if we assume \(\beta \delta\) is small: \[B_t \sim \mathrm{Binom}(S_t, \beta I_t \delta)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Reed-Frost Model Given state \((S_t, I_t)\), individuals have a latent (infected) time of 1, and at the end of their latent period, they try to infect every susceptible individual with probability \(p\). So \[I_{t+1} \sim \mathrm{Binomial}(S_t, 1 - (1-p)^{I_t})\] Useful for small populations like households. MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Greenwood Formulation Chain binomial model where instead of the infection rate being proportional to the number of infected individuals, it's just \(p\) if any infected individual exists: \[I_{t + 1} \sim \mathrm{Binomial} (S_t, p)\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Continuous Deterministic SIR Model Assume we start with \(N\) susceptible individuals and 1 infected individual. System of ODEs:
\(\frac{d}{dt}S(t) = -\lambda(t) S(t)\)
\(\frac{d}{dt} I(t) = \lambda(t) S(t) - \gamma I(t)\)
\(\frac{d}{dt}R(t) = \gamma I(t)\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Threshold result for the SIR ODE system We need \(\frac{dI}{dt}(0) > 0\), so we must have \(S(0) = N > \gamma / \beta\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: The reproductive number \(R_0\) We need \(\frac{dI}{dt}(0) > 0\), so we must have \(S(0) = N > \gamma / \beta\)
Define \(R_0 = N \frac{\beta}{\gamma}\): number of secondary infections caused by one infected individual in a fully susceptible population (epidemic will only take off if at time zero, \(R_0 > 1\).
Related also to \(R_e(t)\), just the \(R_0\) with the current \(S(t)\) in place of \(N\). MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Prevalence of an infection \[\pi(t) = I(t) / N\] MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Delayed Incubation Period Equation \[\mu(t) = \int_0^t h(u) f(t - u) du\] where \(\mu(t)\) is the rate of newly observed cases at \(t\), \(h(t)\) is the rate of infections at \(t\), and \(f\) is a probability distribution for the incubation length of the disease. Note this is a convolution!
Note this simplifies to the following in discrete time: \[\mu_k = \sum_{i=1}^k h_i f_{k-i}\] MedicalStats PartIIINotes
Basic Part III Notes::SLIP Definition: The Parametric Bootstrap Suppose we know the family of distributions relevant, \({\mathbb{P}}_\theta\), but not the particular \({\mathbb{P}}_{\theta_0}\) with some property \(\psi(\theta_0)\). The idea is to estimate \(\theta_0\) with an estimator \(\hat \theta\), then simulate a bunch of new trials to understand the distribution of \(\psi\) with \(\hat \psi\):
1. For \(b = 1...B\), sample \(Y^{(b)} = (Y_1^{(b)}, ...,Y_n^{(b)}) \overset{iid}{\sim} {\mathbb{P}}_{\hat \theta}\) from the estimated \(\hat \theta\) derived from \(Y\).
Use this to compute \(\hat \theta^{(b)} = \hat \theta(Y^{(b)})\) and plug in to get \(\hat \psi^{(b)} = \psi (\hat \theta^{(b)})\).
2. Approximate \(\hat \psi\) with the empirical distribution: \[\hat {\mathbb{P}}^{(b)} = \frac{1}{B} \sum_{i=1}^B \delta_{ \hat \psi^{(b)}}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Percentile Bootstrap Confidence Interval In the parametric bootstrap, simulate \(B\) trials, order by their outcome \(\hat \psi^{(b)}\), the interval \( [\hat \psi^{0.025 B}, \hat \psi^{0.975 B} ] \) is an approximate two-tailed 95 percent confidence interval.
More formally, we denote this as \([\hat q_{\alpha/2}, \hat q_{1 - \alpha/2}]\)
Note that you can use the likelihood ratio test statistic as \(\psi\), like: \[\psi (\theta) = 2 \{ l(\beta, \sigma^2, \Sigma_u) - l(\beta, \sigma^2, 0) \}\] SLIP PartIIINotes
Basic Part III Notes::SLIP When is the percentile bootstrap confidence interval most valid? When we assume \(\hat \psi \approx N(\psi_0, \sigma^2_\psi)\) and \(\hat \psi ^* \approx N(\hat \psi, \hat \sigma^2_\psi)\) conditional on \(Y\) (where \(\hat \sigma^2_\psi \approx \sigma^2_\psi\)) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Basic Bootstrap Confidence Interval
(and its assumptions) \[[2 \hat \psi - \hat q_{1 - \alpha/2}, 2 \hat \psi - \hat q_{\alpha/2}]\] Relies on the assumption that \(\hat \psi^* - \hat \psi\) under \({\mathbb{P}}^*\) is similar to \(\hat \psi - \psi_0\), so that we have: \[\begin{align*} {\mathbb{P}}(2 \hat \psi - \hat q_{1 - \alpha/2} \leq \psi_0 \leq 2 \hat \psi - \hat q_{\alpha/2}) &\leq {\mathbb{P}}( \hat q_{1 - \alpha/2} - \psi \leq \hat \psi - \psi_0 \leq \hat q_{\alpha/2} - \psi )\\\\ & \approx {\mathbb{P}}^*( \hat q_{1 - \alpha/2} - \psi \leq \hat\psi^* - \hat \psi \leq \hat q_{\alpha/2} - \psi )\\\\ &= 1 - \alpha \end{align*}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Probability as an Expectation For all events \(A\), \[{\mathbb{P}}(A) = {\mathbb{E}}[{\mathbb{1}}(A)]\] SLIP PartIIINotes
Basic Part III Notes::SLIP Proposition: Over all classifiers, the Bayes classifier has the minimal prediction error: \[\underset{h \in H}{\min} R(h) = R(h^*) = {\mathbb{E}}[1 - P_{h^*(X^*)}(X^*)] \] where \(H\) is the set of all classifiers. Proof sketch:
Express the classification risk as an expectation.
Use the tower property to fix some \(X^* = x\), and split into a sum over possible values of \(Y^*\).
Note that \({\mathbb{1}}(h(x) \neq k)\) is now deterministic, so isolate an expectation on \(Y^* = k\).
Note that this is equivalent to the statement of the conditional class probability \(p_k(x)\).
Clearly the resulting statement is minimized when \(h(x) = p_k(x)\). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Plug-in Classifier To construct a classifier \(\hat h(x)\), define an estimator \(\hat p_k(x)\), then use the plug-in formula \[\hat h(x) = {\underset{k}{\mathrm{argmin}}\,\,} \hat p_k(x)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Linear and nonlinear classifiers A classifier is linear if its decision boundaries are all subsets of hyperplanes in \({\mathbb{R}}^p\) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Class Densities and Class Prior Probabilities for LDA Class densities: \(f_k: X^* {\,|\,} (Y^* = k)\)
Class prior probabilities: \(\pi_k = {\mathbb{P}}(Y^* = k)\) SLIP PartIIINotes
Basic Part III Notes::SLIP Conditional class probability in terms of class densities and class prior probabilities (for LDA) \[p_k = \frac{f_k(x) \pi_k }{\sum_{j=1}^K f_j(x) \pi_j}\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Assumptions for Linear Discriminant Analysis Assume each class density \(f_k = X^* {\,|\,} (Y^* = k)\) is normally distributed around its mean: \[f_k \sim N_p(\mu_k, \Sigma)\] Need to solve for all \(\pi_i\), \(\mu_i\), \(i \in [p]\) and \(\Sigma\). SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Discriminant Functions In linear discriminant analysis, we assume each class density is normal, and so plugging into the conditional class probability, throwing out common factors, and taking a log, we get: \[h^*_k(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \delta_k(x)\] \[\delta_k(x) := -\frac{1}{2} (x - \mu_k){^{\mathrm{T}}} \Sigma{^{-1}} (x - \mu_k) + \log(\pi_k)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Decision Boundaries of LDA Bayes The subset \(\{ x \in {\mathbb{R}}^p: p_k(x) = p_l(x) \}\), where: \[\log \left( \frac{p_k(x)}{p_l(x)}\right) = x^T \Sigma{^{-1}} (\mu_k - \mu_l) - \frac{1}{2} (\mu_k + \mu_l){^{\mathrm{T}}} \Sigma{^{-1}} (\mu_k - \mu_l) + \log \frac{\pi _k}{\pi_l}\]
\(\Sigma{^{-1}} (\mu_k - \mu_l)\) are the normal vectors to the hyperplanes seperating the fitted classes SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Assumed values of each parameter \(\hat \pi_k\), \(\hat \mu_k\), and \(\hat \Sigma\) in \(h^{LDA}\) \(\hat \pi_k= \frac{N_k}{n}\) (where \(N_k\) is the number of observations of class \(k\))
\(\hat \mu_k = \frac{1}{N_k} \sum_{i:Y_i = k}X_i\) (centriod of class \(k\))
\(\hat \Sigma = \frac{1}{n - K} \sum_{k=1}^K \sum_{i:Y_i=k}(X_i - \hat \mu_k) (X_i - \hat \mu_k){^{\mathrm{T}}}\) (pooled sample covariance matrix)
Used in \(h^{LDA}(x)\): \[h^{LDA}(x) \in {\underset{k}{\mathrm{argmax}}\,\,} \hat \delta_k(x)\] \[\hat \delta_k(x) := -\frac{1}{2} (x - \hat\mu_k){^{\mathrm{T}}} \hat\Sigma{^{-1}} (x - \hat \mu_k) + \log(\hat\pi_k)\] This is an MLE, and is therefore consistent. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Logistic Classifier Suppose we have independent observations \(\tilde Y_i \in \{ 1, ..., K \}\) over \(K\) classes, so each \(Y_i \in \{0,1\}^k\), \(Y_{ij} = {\mathbb{1}}_{\{\hat Y_{i = j}\}}\). We have: \[Y_i {\,|\,} (X_i = x_i) \overset{ind}{\sim} \mathrm{Multinomial}(1; p_1,...,p_K(x_i))\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Conditional Class Probabilities for the Logistic Classifier \[p_k(x_i) = \frac{e^{x_i^T \beta_k}}{1 + \sum_{j=1}^{K-1} e^{x_i^T \beta_j}}\] Note that we set \(\beta_K := 0\) as a baseline. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Logistic Classifier Log Likelihood \[L(\beta_1,...,\beta_{K-1}) = \log \left( \prod_{i=1}^n \prod_{k=1}^K p_k(x_i)^{Y_{ik}} \right)\] SLIP PartIIINotes
Basic Part III Notes::SLIP When should you use LDA versus the logisitic classifier? LDA yields smaller variance because of the Gaussian assumption, however the logistic classifier is more robust if Gaussianity does not hold SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Support Vector Machine A two-class classifier (\(K = 2\)) with points labeled \(Y_i \in \{ 1, -1\}\).
We seek a hyperplane \(f(x)\), with \(f(x) > 0\) for class 1 and \(f(x) < 0\) for class 2 (class -1).
We find the hyperplane fit which maximizes the margin (closest distance from points on each side to the separating hyperplane)
\[h^{SVC}(x) = {\mathrm{sgn}} (\hat \alpha + x{^{\mathrm{T}}} \hat \beta)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Hyperplane and Separating Hyperplane \[f(x) = \alpha + x^T\beta\] It's called separating if \[Y_i f(X_i) > 0\] for all \(i \in [n]\) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Optimal Separating Hyperplane for SVMs Maximize \(M\) (the margin) over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(M \geq 0\), subject to \[\frac{Y_i(\alpha + X_i{^{\mathrm{T}}} \beta)}{||\beta||_2} \geq M\] for all \(i \in [n]\).
Equivalently, we can MINIMIZE \(||\beta||_2\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1\] for all \(i \in [n]\) SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Softened Separating Hyperplane Optimization for SVMs Minimize \[||\beta||_2^2 + C\sum_{i=1}^n \xi_i\] over \(\alpha \in {\mathbb{R}}\), \(\beta \in {\mathbb{R}}^p\), \(\xi \in {\mathbb{R}}^n\) subject to \[Y_i(\alpha + X_i{^{\mathrm{T}}} \beta) \geq 1 - \xi_i\] \[\xi_i \geq 0\] for \(i \in [n]\). We call the \(\xi_i\) slack variables:
If \(\xi_i = 0\), \(X_i\) is on the right side of the boundary and beyond the margin.
If \(0 < \xi_i < 1\), \(X_i\) is within the boundary but on the wrong side of the margin.
If \(\xi > 1\), then \(X_i\) is on the wrong side of the boundary.
A smaller \(C\) means a tighter margin with more misclassifications. SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Alternative formulation of the soft SVM \[(\hat \alpha, \hat \beta) \in {\underset{\alpha \in {\mathbb{R}}, \beta \in {\mathbb{R}}^p}{\mathrm{argmin}}\,\,} \left( \sum_{i=1}^n (1 - Y_i(\alpha + X_i{^{\mathrm{T}}} \beta))_+ + \lambda||\beta||^2_2 \right)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Laplace Kernel \[k(x,y) = \exp(-\gamma ||x - y||_2)\] Radial like the Gaussian kernel, but with a slower decay for far away points. SLIP PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Shrinkage Estimator The idea is to bias the individual estimates towards the population mean to reduce the final MSE.
This is equivalent to just using Hierarchical Bayes with \(\tau^2\), where you estimate \(\tau^2\) given data first, then consider that for your other estimators. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Posterior Odds Ratio
(and the Bayes factor) \[ \frac{P(M_1 {\,|\,} D)}{P(M_2 {\,|\,} D)} = \frac{P(D {\,|\,} M_1)}{P(D {\,|\,} M_2)} \frac{P(M_1)}{P(M_2)} \] \(M_1\) and \(M_2\) are our models. The first multiplicand is the Bayes factor, the second multiplicand is the prior odds of one model versus the other. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Evidence \[P(D {\,|\,} M_1) = \int_\Theta P(D {\,|\,} \theta, M_1)P(\theta {\,|\,} M_1) d\theta\] also called the marginal likelihood.
Note that \(P(\theta {\,|\,} M_1)\) needs to be a proper prior! Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: The Jefferies Scale Let \(\alpha\) be a placeholder for the natural log of the Bayes factor.
\(\alpha < 1\): Terrible
\(1 < \alpha < 2.5\): Weak but significant
\(2.5 < \alpha < 5\): Significant
\(5 < \alpha\): Decisive Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Harmonic Mean Estimator \[P(D) \approx \left[ \frac{1}{M} \sum_{i=1}^M P(D {\,|\,} \theta_i){^{-1}} \right]{^{-1}}\] This is unstable in practice Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Savage-Dickey Ratio When we are comparing nested models, it's possible to simplify the Bayes factor to \[ \frac{P(\psi {\,|\,} D, M_1)}{P(\psi {\,|\,} M_1)} \] where \(\psi = 0\). Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Bayesian Model Averaging Bayesian model comparison between many models, not just two candidates. Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability \(PA \vdash \varphi\) versus \(PA\vDash\varphi\) versus \({\mathbb{N}} \vDash \varphi\) \(PA \vdash \varphi\) means that you can prove \(\varphi\) from axioms in \(PA\),
\(PA \vDash \varphi\) means that \(\varphi\) holds in every model of \(PA\) (even nonstandard ones),
\({\mathbb{N}} \vDash \varphi\) means that \(\varphi\) holds in the standard model specifically. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\Sigma_1\)-definable Set A set \(A \subseteq {\mathbb{R}}^n\) is \(\Sigma_1\)-definable if there exists a \(\Sigma_1\)-formula \(\varphi\) such that \[\bar n \in A \iff {\mathbb{N}} \vDash \varphi(\bar n) \] We need to construct a \(\Sigma_1\) that is true if and only if \(\bar n \in A\).
For instance, \(A\) is the set of perfect squares, then we could use \(\varphi(n) = \exists x,\, (n = xx)\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Initial Segment A subset of a poset that is downward closed (the reverse of a terminal segment) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: \({\mathbb{N}}\) forms an initial segment of every model \(M\) of \(PA^-\) Map \(0 \in M\) to \(0 \in {\mathbb{N}}\), \(S(0) \in M\) to \(1 \in {\mathbb{N}}\), and onwards.
It's an initial segment because \[PA^- \vdash \forall x. (x \leq k \to x = 0 \lor x = 1 \lor ...\lor x = k)\] for all standard \(k \in {\mathbb{N}}\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Function Representation in a Theory Let \(f: {\mathbb{N}}^l \to {\mathbb{N}}\) be a total function with \(T\) being any \(L_{PA}\) theory extending \(PA^-\). \(f\) is represented in \(T\) if there exists an \(L_{PA}\) formula \(\theta(x_1,...,x_l,y)\) such that, for all \(\bar n \in {\mathbb{N}}^l\),
1) \[T \vdash \exists !y.\theta(\hat n, y)\] 2) \[k = f(\bar n) \implies T\vdash \theta(\bar n, k)\] If \(\theta\) can be made a \(\Sigma_1\), we say \(f\) is \(\Sigma_1\)-represented in \(T\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Subset Representation in a Theory A subset \(S \subseteq {\mathbb{N}}^l\) is represented in an \(L_{PA}\) theory extending \(PA^-\) if there exists an \(L_{PA}\) formula \(\theta(x_1,...,x_l)\) such that for all \(\bar n \in {\mathbb{N}}^l\):
1) \[\bar n \in S \implies T \vdash \theta(\bar n)\] 2) \[n \not \in S \implies T \vdash \neg \theta(\bar n)\] If \(\theta\) can be made a \(\Sigma_1\), we say \(S\) is \(\Sigma_1\)-represented in \(T\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Every total recursive function \(f: {\mathbb{N}}^m \to {\mathbb{N}}\) is \(\Sigma_1\)-represented in \(PA^-\) We apply the previous result for \({\mathbb{N}}\) to get a \(\Sigma_1\) \(\exists \bar z. \phi(\bar x, y, \bar z)\) and equivalently \(\exists z. \phi(\bar x, y, z)\) by coding.
By example sheet 4 question 2, any \(\Sigma_1\)-sentence true in \({\mathbb{N}}\) is true in any model of \(PA^-\). So we have the second part of the representation definition, but we need the first part. That's tricky because we don't yet have uniqueness for the \(\phi\) statement above (we could pick a \(z\) larger than the smallest minimizable).
So just write a predicate that forbids smaller solutions: \[\psi(\bar x, y, z) := \phi(\bar x, y, z) \land \forall u,\, u \leq (y + z). \forall v, v \leq (y + z). (u+v < y + z \to \neg \phi(\hat x, u, v))\]
Corollary:
Every recursive set \(A \subseteq {\mathbb{N}}^m\) is \(\Sigma_1\)-represented in \(PA^-\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability What do we call an \(L_{PA}\) formula with one free variable? An unary predicate LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: State the Diagonalization Lemma Let \(T\) be an \(L_{PA}\) theory in which every total recursive function is \(\Sigma_1\)-represented, and let \(\theta(x)\) be an \(L_{PA}\) formula with one free variable. Then there exists some sentence \(G\) of \(L_{PA}\) such that \[T \vdash (G \iff \theta(\lceil G \rceil))\]
(So for any numeric property, there's a sentence \(G\) that provably states that its own Godel number has that property) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Diag \(\mathrm{diag}(n):\)
If \(n = \lceil \sigma(x)\rceil\), return \(\lceil \forall y. (y = n \to \sigma(y)) \rceil\)
Else return 0.
We're just rewriting the sentence \(\sigma(x)\) coded by \(n\) to include \(n\) itself.
diag is the coded function that plugs a number into the conditional it represents (if possible). LogicAndComputability PartIIINotes
Basic Part III Notes::MSM Mercer's Theorem decomposition for kernels \[k(x,y) = \sum_{j \in J} \mu_j e_j(x) e_j(y)\] MSM PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Marginalization \[P(x) = \int P(x,y)dy = \int P(x{\,|\,} y) P(y) dy = {\mathbb{E}}_{y}[P(x {\,|\,} y)] \] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Split Multivariate Gaussian Marginals Suppose we have a partioned multivariate Gaussian like \[\boldsymbol{f} = \begin{pmatrix} \boldsymbol{U} \\\\ \boldsymbol{V} \end{pmatrix} \sim N\left( \begin{bmatrix} \boldsymbol{\mu}_U \\\\ \boldsymbol{\mu}_V \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_U & \boldsymbol{\Sigma}_{UV} \\\\ \boldsymbol{\Sigma}_{VU} & \boldsymbol{\Sigma}_V \end{bmatrix} \right)\] Then its marginals are exactly what we'd expect: \[P(U) = \int P(U,V)dV = N(U {\,|\,} \mu_U, \Sigma_U)\]
\[P(V) = \int P(U,V)dU = N(V {\,|\,} \mu_V, \Sigma_V)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Split Multivariate Gaussian Conditional Probabilities \[ U {\,|\,} V \sim N({\mathbb{E}}[U {\,|\,} V], {\mathrm{Var}}[U {\,|\,} V]) \]
\[{\mathbb{E}} [U {\,|\,} V] = \mu_U + \Sigma_{UV} \Sigma_V{^{-1}} (V - \mu_V)\]
\[{\mathrm{Var}}[U {\,|\,} V] = \Sigma_U - \Sigma_{UV} \Sigma_V{^{-1}} \Sigma_{VU}\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics How can we construct a joint probability from \[V \sim N(V_0, \Sigma_V)\] and \[U {\,|\,} V \sim N(U_0 + XV, \Sigma_{U {\,|\,} V})\] \[\begin{pmatrix} \boldsymbol{U} \\\\ \boldsymbol{V} \end{pmatrix} \sim N\left( \begin{pmatrix} \boldsymbol{U}_0 + \boldsymbol{X}\boldsymbol{V}_0 \\\\ \boldsymbol{V}_0 \end{pmatrix}, \begin{pmatrix} \boldsymbol{X}\boldsymbol{\Sigma}_V \boldsymbol{X}^T + \boldsymbol{\Sigma}_{U|V} & \boldsymbol{X}\boldsymbol{\Sigma}_V \\\\ \boldsymbol{\Sigma}_V \boldsymbol{X}^T & \boldsymbol{\Sigma}_V \end{pmatrix} \right)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: MLE and Unbiased Estimator for Normal RV \(\sigma\) MLE: \[\hat \sigma^2 = \frac{(x_i-\hat \mu)^2}{N}\] Unbiased Estimator: \[\hat \sigma^2 = \frac{(x_i-\hat \mu)^2}{N-1}\] where \(\hat \mu = \bar x\). Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Gaussian Process Time-dependent Gaussian distribution \[f(t) \sim GP(m(t), K_{A, \tau^2}(t, t'))\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Squared Exponential Kernel \[k(t,t') = A^2 \exp(-|t-t'|^2 / \tau^2)\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Stationary Gaussain Process \[K(t,t') = K(t + a, t' + a)\] for all \(a\) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: How to draw probabilistic graphical models Open circle / square nodes: latent parameter, unobserved data
Shaded nodes: Something we condition on (ie data)
Filled dot: Known constant
Plate: iid replications of what's inside Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Occam Factor
(and how to derive it) Define \(g(\theta) = P(\theta {\,|\,} D, M)P(D {\,|\,} M) = P(D {\,|\,} \theta, M)P(\theta {\,|\,} M)\).
Find a MAP estimate \(\theta_0 = {\underset{\Theta}{\mathrm{argmax}}\,\,} \ln(g(\theta))\), use a second order Taylor expansion for \(\ln(g(\theta))\) then exponentiate to get: \[g(\theta) \approx g(\theta) \exp(-\frac{1}{2} (\theta - \theta_0){^{\mathrm{T}}} A (\theta - \theta_0))\] where \(A\) is the Hessian of the log of \(g\) at the MAP
From there see that \[\int g(\theta)d\theta = \int P(\theta {\,|\,} D, M)P(D {\,|\,} M) d\theta = P(D{\,|\,} M) \int P(\theta {\,|\,} D, M)d\theta = P(D {\,|\,} M)\] So we have: \[P(D|M) \approx g(\theta_{MAP}) \det(A / 2\pi)^{-1/2} = P(D {\,|\,} \theta_0) P(\theta_0) \det(A / 2\pi)^{-1/2}\] \(P(\theta_0) \det(A / 2\pi)^{-1/2}\) is called the Occam factor. Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Fixed Point Theorem There exists a closed \(\lambda\)-term \(Y\) such that for all \(F\), \[F(YF) \equiv_\beta YF\]
\(Y\) is called a fixed-point function. LogicAndComputability PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Laplace Approximation for the Evidence Find a MAP estimate \(\theta_0 = \underset{\Theta}{\mathrm{argmax}}\,\, \ln(P^*(\theta \,|\, D))\), use a second order Taylor expansion for \(\ln(P^*(\theta \,|\, D))\) then exponentiate to get: \[P^*(\theta \,|\, D) \approx P^*(\theta_0 \,|\, D) \exp(-\frac{1}{2} (\theta - \theta_0)^{\mathrm{T}} A (\theta - \theta_0))\] where \(A\) is the Hessian at the MAP
Astrostatistics PartIIINotes
Basic Part III Notes::MedicalStats Definition: Population Survivor under Gamma-distributed Frailty Assume the density of \(u\) is given by \(g(u)\), with \({\mathbb{E}}[g(u)] = 1\). Then we have:
\[{\mathbb{E}}_u[F(t)] = {\mathbb{E}}_u[e^{-H(t{\,|\,} u)}] = \int_0^\infty \exp (-u H_0(t))g(u) du\] Recognizing the Laplace transform and plugging in \(u \sim Gamma(p, \lambda)\): \[\bar F(t) = \left( \frac{p}{p+H_0(t)} \right)^\lambda\]
Usually we want the expected value of \(g(u)\) to be one, so we normally use \(p = \lambda\), denoted \(\psi\). MedicalStats PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Inverse of a 2x2 Matrix For a \(2 \times 2\) matrix: \[A = \begin{pmatrix} a & b \\\\ c & d \end{pmatrix}\] its inverse is: \[A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b \\\\ -c & a \end{pmatrix}\]
To invert, swap the entries on the diagonal, negate the off-diagonal, and divide everything by the determinant. Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Goal of Nested Sampling We need to be able to evaluate the likelihood \(L(\theta) : = P(D {\,|\,} \theta, M)\) and the prior \(\pi(\theta) := P(\theta {\,|\,} M)\), we want to approximate the integral \[Z = \int P(D {\,|\,} \theta, M) P(\theta | M) d \theta = \int L(\theta)\pi(\theta) d\theta\]
(remember the prior is diffuse, and the likelihood is peaked, which makes this problem difficult) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Nested Sampling 1. Take \(N_{live}\) points and their likelihoods, start with \(Z = 0\).
2. Kill the point with the smallest likelihood, call it \(L_i^*\) for step \(i\)
3. (trivial) Compute the shrinkage factor \(t_i = \frac{X_i}{X_{i-1}} \approx e^{-1/N_{live}}\) (approximate with the mean of the beta-distribution \(\mathrm{Beta(N_{live}, 1)}\))
4. Accumulate evidence: \[\Delta Z = L_i^*(X_{i-1} - X_i) = L_i^* (1 - t_i) X_{i-1}\] 5. Sample new point with likelihood greater than \(L_i^*\)
6. Repeat steps 2-5 until convergence
7. Add information about remaining points to the evidence: \[\Delta Z = \bar L X_{end} = \bar L e^{-m/N_{live}} \] where \(\bar L\) is the average likelihood of remaining points, and \(m\) is the number of steps taken so far.
8. Compute the weighted posterior samples: \[w_i = \frac{L_i^*(1-t_i)X_{i-1}}{Z}\] Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Variational Inference Idea: approximate the evidence with a distribution that is 'close' (measured by Kullback-Leibler divergence) Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Kullback-Leibler Divergence \[KL(q(\theta) {\,||\,} p(\theta) ) = - \int q(\theta) \ln \left( \frac{p(\theta)}{q(\theta)} \right)d\theta\] Nonnegative and zero only when \(p(\theta) = q(\theta)\), but NOT symmetric and so not a metric! Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: ELBO
(and how to derive it) Assume we want to minimize \(KL(q(\theta) {\,||\,} P(\theta {\,|\,} D))\). Break down the KL divergence to extract the \(p(D)\) piece, yielding \[KL(q(\theta) {\,||\,} P(\theta {\,|\,} D)) + {\mathbb{E}}_q[\ln(q(\theta)) - \ln (P(D {\,|\,} \theta)P(\theta))] = \ln(P(D))\] Call the expectation bit the Evidence (Log) Lower Bound (ELBO). If we want to minimize the KL divergence, we have to minimize it (can't ever do better than \(\ln(P(D))\)) Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Axiom of Extensionality Two sets are equal if they have the same elements. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Axiom of Separation We can define sets as we normally do: \[A = \{ x \in S : \varphi(x) \}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Axiom of Pairing If you have two sets, there exists a set that contains those two sets: \[\{ A, B \}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Axiom of Choice Suppose you have a set \(X\) of nonempty sets. Then there exists a function \(f\) that maps each \(x \in X\) to one of its elements. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Statement of Diaconescu's Theorem We can deduce the Law of Excluded Middle (\(\forall \varphi,\,\varphi \lor \neg \varphi \)) from the Axion of Choice in a minimal intuitionistic setting. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proposition: Diaconescu's Theorem Want to show \(\forall \varphi,\,\varphi \lor \neg \varphi \); fix \(\varphi\).
Use the axiom of seperation to get: \[A = \{ x \in \{ 0, 1 \} : \varphi \lor (x = 0) \}\] \[B = \{ x \in \{ 0, 1 \} : \varphi \lor (x = 1) \}\] Use the Axiom of Pairing to get \(\{ A, B \}\). Because \(0 \in A\) and \(1 \in B\), \(A\) and \(B\) are inhabited, so we have a choice function \[f: \{ A, B \} \to A \cup B\] with \(f(A) \in A\) and \(f(B) \in B\).
So we have a proof of \[(f(A) = 0 \lor \varphi) \land (f(B) = 1 \lor \varphi)\] By intuistic logic, we can split into four cases: \[f(A) =0,\, f(B) = 0\] \(f(B) = 0\) implies \(\varphi\) above. \[f(A) = 0,\, f(B) = 1\] Hard case, see below. \[f(A) = 1, f(B) = 0\] \(f(A) = 1\) implies \(\varphi\) above. \[f(A) = 1, f(B) = 1\] \(f(A) = 1\) implies \(\varphi\) above.
Hard case: \(f(A) = 0,\, f(B) = 1\). Prove \(\neg \varphi = \varphi \to \bot\) by the following:
Assume \(\varphi\).
By the Axiom of Extensionality, \(A = B\)
But \(f(A)\) and \(f(B)\) don't have the same value! \[0 = f(A) = f(B) = 1\] Contradiction, we have \(\neg \varphi\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: \(\beta\)-redex A term we can simplify into a \(\beta\)-contractum: \[(\lambda x: \sigma . p) Q \to_\beta P[x := Q]\] Read this as 'function application candidate' LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Weak Normalization Theorem \(m\) function \[m(M) := (0,0) \, \text{ if $M$ is in beta-normal form}\] \[m(M) := (h(M), \mathrm{redex}(M))\]
where \(h(M)\) is the greatest height of a redex in \(M\), and \(\mathrm{redex}(M)\) is the number of occurances of redexes in \(M\) of maximal height.
(the height of a redex is the height of the type of the function being applied) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Grammar of the Full Simply-Typed Lambda Calculus Types \[\begin{align*} \lambda_\Pi &:\, V & \text{Variable}\\\\ &|\, \lambda V: \Pi . \lambda_\Pi & \text{$\lambda$-Abstraction}\\\\ &|\, \lambda_\Pi\lambda_\Pi & \text{$\lambda$-Application}\\\\ &|\, \left< \Pi, \Pi \right> & \text{Product Constructor}\\\\ &|\, \Pi_1(\lambda_\Pi)\, | \, \Pi_2(\lambda_\Pi) & \text{Projection Types / Product Eliminators}\\\\ &|\, \iota_1(\lambda_\Pi)\, | \, \iota_2(\lambda_\Pi) & \text{Left and Right Coproduct Constructors}\\\\ &|\, \mathrm{Case}(\lambda_\Pi; V.\lambda_\Pi; V. \lambda_\Pi) & \text{Coproduct Eliminator}\\\\ &|\, * & \text{Terminal Type Constructor}\\\\ &|\, !_\Pi \lambda_\Pi & \text{Explosion Type} \end{align*}\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Explosion Typing Rule \[\frac{\Gamma \Vdash M : 0}{\Gamma \Vdash !_\varphi M : \varphi}\] for every type \(\varphi \in \Pi\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Case Beta-Reduction \[\mathrm{Case}(\iota_1(M); x.K; y.L) )\to_\beta K[x := M]\] \[\mathrm{Case}(\iota_2(M); x.K; y.L) )\to_\beta L[y := M]\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Extra BHK interpretation \(\eta\)-reductions \[\left< \pi_1 M, \pi_2 M \right> \to_\eta M\] Also if \(\Gamma \Vdash M : 1\) we have \(M \to_\eta *\) LogicAndComputability PartIIINotes
Basic Part III Notes::MedicalStats Definition: Risk Set \[R = \{ j : x_j \geq a_j \}\] Note \(r_j = |R|\)
Note that this includes everyone that hasn't been censored yet right before \(t_j\) MedicalStats PartIIINotes
Basic Part III Notes::MedicalStats Definition: Event Set / Death Set \[D = \{ i : x_i = a_i \land v_i = 1 \}\] Note \(d_j = |D|\) The number of individuals with (uncensored) events exactly at \(t_i\) MedicalStats PartIIINotes
Basic Part III Notes::SLIP Definition: Log-Odds Log of the odds as we know it: like '3 to 1 in our favor'.
For an arbitrary probability \(p\), the logit gives its log-odds.
For classification problems, it's always given by \[\log \left( \frac{p_k(x)}{p_j(x)} \right)\] SLIP PartIIINotes
Basic Part III Notes::SLIP Definition: Log-Odds of the Logistic Classifier \[\log \frac{p_k(x_i)}{p_K(x_i)} = x_i^T \beta_k\] with \(\beta_K = 0\) as the baseline.
Extending this, we have: \[\log \left(\frac{p_k(x_i)}{p_l(x_i)} \right) = x_i{^{\mathrm{T}}} (\beta_k - \beta_l)\] SLIP PartIIINotes
Basic Part III Notes::Astrostatistics Definition: The Apparent Magnitude Equation \[m = M + \mu\] where \(m\) is the true apparent magnitude, \(M\) is the absolute magnitude, and \(\mu\) is the distance modulus \(\mu = 25 + 5 \log_{10} (d)\). Astrostatistics PartIIINotes
Basic Part III Notes::Astrostatistics Definition: Parallax Equation \[\frac{\omega}{\mathrm{arcsec}} = \frac{\mathrm{parsec}}{d}\] where \(\omega\) is the true parallax angle and \(d\) is the distance to the star.
(note this uses small angle approximations, is not valid for large \(\omega\)) Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: G \[G := \forall y. (y = n \to \psi(y))\] where \[\psi (x) := \forall z. (\delta(x, z) \to \theta(z))\] (\(\delta(x, z)\) is the \(\Sigma_1\)-representation of diag)
and \(n = \lceil \psi (x)\rceil\)
LogicAndComputability PartIIINotes
Basic Part III Notes::MedicalStats MLE for the log-likelihood for the exponential survivor function \[\hat \theta = \frac{V_+}{X_+}\] where \(V_+\) is the total number of (non-censored) events, and \(X_+\) is the total time spent at risk. MedicalStats PartIIINotes
Basic Part III Notes::SLIP Model Complexity and the Bias-Variance Decomposition Complex models have low bias, but high variance.
Simple models have high bias, but low variance. SLIP PartIIINotes
Basic Part III Notes::Astrostatistics Goal of ELBO We want to find a \(q\) that minimizes \[KL(q(\theta) {\,||\,} P(\theta {\,|\,} D))\] (approximate our evidence with a 'close' \(q\) distribution) Astrostatistics PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Crude Incompleteness Let \(T\) be a recursive set of (Godel numbers of) \(L_{PA}\) sentences that is consistent (never contains both \(\phi\) and \(\neg \phi\)) and contains all the \(\Sigma_1\) and \(\Pi_1\) sentences provable in \(PA^-\).
Then there exists a \(\Pi_1\) sentence \(\tau\) such that \(\tau \not\in T\) and \(\neg \tau \not \in T\) (\(T\) can't prove \(\tau\) or its negation). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Godel-Rosser Theorem Let \(T\) be a consistent \(L_{PA}\) theory extending \(PA^-\) and admitting a recursively enumerable axiomatization.
Then \(T\) is \(\Pi_1\) incomplete: there exists a \(\Pi_1\) sentence \(\tau\) for which \(T \not \vdash \tau\) and \(T \not \vdash \neg \tau\)
Assume that \(T\) is complete to get a contradiction. Use Craig's trick to get a recursive axiomatization, do exhaustive proof search from axioms (paired with completeness) to show that \(S\) is recursive. But then we can apply Crude Incompleteness to get a contradiction. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursive (and countable) \(L_{PA}\)-structures A (countable) \(L_{PA}\)-structure \({\mathcal{M}}\) is recursive if and only if there exist recursive functions \(\oplus: {\mathbb{N}}^2 \to {\mathbb{N}}\), \(\otimes: {\mathbb{N}}^2 \to {\mathbb{N}}\), a binary recursive relation \(\preceq \subseteq {\mathbb{N}}^2\), and natural numbers \(n_0, n_1 \in {\mathbb{N}}\) such that \[ {\mathcal{M}} \cong ({\mathbb{N}}, \oplus, \otimes, \preceq, n_0, n_1)\] LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Prime Function The \(\Sigma_1\) formula \(\mathrm{pr}(x,y)\) is true only when \(y\) is the \(x\)th prime number. \[PA \vdash \forall x,\exists! y, \mathrm{pr}(x,y)\] So if \({\mathbb{N}}\) believes \(y\) is the \(x\)th prime number, all models of \(PA\) do.
We denote the \(k\)th prime \(\pi_k\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Overspill Lemma Let \({\mathcal{M}}\) be a nonstandard model of \(PA\), \(\phi(x,\bar y)\) be an \(L_{PA}\) formula, and \(\bar a \in {\mathcal{M}}\).
If \({\mathcal{M}} \vDash \phi(n, \bar a)\) for all standard natural numbers \(n\), then there exists a nonstandard \(e \in M\) such that \({\mathcal{M}} \vDash \forall x \leq e. \phi(x,\bar a)\) LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Proof of Overspill Suppose there exists some function \(\mathrm{st}(x,\bar a)\), that, given some helper data \(\bar a\), is true only when \(x\) is a standard natural number.
But clearly we can apply 'st' inductively from zero: \[M \vDash \mathrm{st}(0, \bar a) \land \forall x. \mathrm{st}(x, \bar a) \to \mathrm{st}(x+1,\bar a)\] and remember by the inductive axiom of \(PA\), we must then have \(\forall x, \mathrm{st}(x,\bar a)\)! But that defeats the whole point: every element of a nonstandard model can't be standard! LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Canonical Coding A subset \(S \subseteq {\mathbb{N}}\) is canonically coded in a model \({\mathcal{M}}\) of \(PA\) if there exists some element \(c \in {\mathcal{M}}\) such that \[S = \{ n \in {\mathbb{N}}: \exists y. (\pi_{n^*} \times y) = c \}\] where \(n^*\) is the standard natural number \(n\) in the model.
We proved in class that for any nonstandard model \({\mathcal{M}}\) of \(PA\), there exists a non-recursive set \(S\) which is canonically coded in \({\mathcal{M}}\). LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Recursive Inseparability Subsets \(A\), \(B\) of \({\mathbb{N}}\) are recursively inseparable if they are disjoint and there is no recursive subset \(C\) of \({\mathbb{N}}\) such that \(B \cap C = \emptyset\) and \(A \subseteq C\).
We proved in class that they must exist. LogicAndComputability PartIIINotes
Basic Part III Notes::LogicAndComputability Definition: Tennenbaum's Theorem Let \({\mathcal{M}} = (M, \oplus, \otimes, \preceq, n_0, n_1)\) be a countable nonstandard model of \(PA\). Then \(\oplus\) is not recursive. LogicAndComputability PartIIINotes
Basic Part III Notes::MSM Definition: The Square Root Lasso Minimizes \[\frac{1}{\sqrt{n}}||Y-\bar Y1 - X\beta||_2 + \gamma||\beta||_1\] MSM PartIIINotes
Basic Part III Notes::MSM Definition: Group Lasso Penalty \[\lambda \sum_{j=1}^q m_j ||\beta_{G_j}||_2\] where we have group partitions \(G_1...G_j\) and we usually define \(m_k\) MSM PartIIINotes