Added Exercise

paul-loedige 2021-11-08 14:48:08 +01:00
parent c8927e41fd
commit 48b24aa61f
4 changed files with 975 additions and 0 deletions

exercise.ipynb Normal file

@@ -0,0 +1,975 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# EXERCISE 1 - ML - Grundverfahren WS 21/22\n",
"\n",
"**Exercise 1**: Ge Li ge.li@kit.edu\n",
"\n",
"**Exercise 2 & 3**: Philipp Becker philipp.becker@kit.edu\n",
"## Submission Instructions\n",
"Please follow the instruction from Exercise ZERO!\n",
"\n",
"\n",
"## 1.) Linear Regression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1) Matrix Vector Calculus (1 Point)\n",
"Given the following element-wise expression of a matrix-vector product,\n",
"rewrite it in matrix form:\n",
"\n",
"\\begin{align*}\n",
" g = \\alpha \\sum_i \\sum_j \\sum_k z_k x_{ij} q_i y_{jk}\n",
"\\end{align*}\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2) Derive Ridge Regression Weights (4 Points)\n",
"Derive the optimal solution of weights in Ridge Regression using matrix form, i\n",
".e. $\\boldsymbol{w}= ?$\n",
"\n",
"Hint: You will need derivatives for vectors/matrices. Start\n",
"from the matrix objective for ridge regression as stated here\n",
"\n",
"\\begin{align*}\n",
"L &= (\\boldsymbol{y}-\\boldsymbol{\\Phi} \\boldsymbol{w})^T(\\boldsymbol{y}-\\boldsymbol{\\Phi} \\boldsymbol{w}) + \\lambda \\boldsymbol{w}^T \\boldsymbol{I} \\boldsymbol{w}. \\\\\n",
"\\end{align*}\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Ridge Regression - Code\n",
"Let's first get the data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from typing import Tuple\n",
"\n",
"# Load data\n",
"\n",
"training_data = np.load('training_data.npy')\n",
"test_data = np.load('test_data.npy')\n",
"\n",
"test_data_x = test_data[:, 0]\n",
"test_data_y = test_data[:, 1]\n",
"\n",
"training_data_x = training_data[:, 0]\n",
"training_data_y = training_data[:, 1]\n",
"\n",
"# Visualize data\n",
"plt.plot(test_data_x, test_data_y, 'or')\n",
"plt.plot(training_data_x, training_data_y, 'ob')\n",
"plt.xlabel('x')\n",
"plt.ylabel('y')\n",
"plt.legend([\"test_data\", \"training data\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As in the lecture notebook, we will use polynomial-features here again.\n",
"The following functions will be used for:\n",
"- calculating polynomial features\n",
"- computing the mean and std of the features (training data) as normalizer\n",
"- normalize other data (test) features using the normalizer (mean and std)\n",
"- evaluating the model\n",
"- calculating the Mean Squarred Error for assigning a performance to each\n",
"model. <br><br>\n",
"\n",
"Note we will use the mean and the standard deviation to normalize our features\n",
"according to:\n",
"\\begin{align*}\n",
" \\boldsymbol{\\tilde{\\Phi}} = \\frac{\\boldsymbol{\\Phi}(\\boldsymbol{x}) - \\boldsymbol{\\mu}_{\\Phi}}{\\boldsymbol{\\sigma}_{\\Phi}}, \n",
"\\end{align*}\n",
"where $\\boldsymbol{\\tilde{\\Phi}}$ are the (approximately) normalized features to any input\n",
"$\\boldsymbol{x}$ (not necessarily the training data), $\\boldsymbol{\\mu}_{\\Phi}$ is the mean of the features applied to the training data and $\\boldsymbol{\\sigma}_{\\Phi}$ is the standard deviation of the features applied to the training data for each dimension.<br>\n",
"\n",
"Normalization is a standard technique used in Regression to avoid numerical problems and to obtain better fits for the weight vectors $\\boldsymbol{w}$. Especially when the features transform the inputs to a very high value range, normalization is very useful. In this homework we will use features of degree 10. Since the input range of the data is roughly from -4 to 4 this will lead to very high values for higher order degrees. By normalizing each dimension of the feature matrix, we will map each dimension of the feature matrix applied to the training data to a zero mean unit variance distribution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_polynomial_features(data: np.ndarray,\n",
" degree: int) ->np.ndarray:\n",
" \"\"\"\n",
" Function to create Feature Matrix. Extends the feature matrix according to\n",
" the matrix form discussed in the lectures.\n",
"\n",
" :param data: data points you want to evaluate the polynomials,\n",
" shape: [n_samples] (we have 1-dim data)\n",
" :param degree: degree of your polynomial, shape: scalar\n",
" :return polynomial_features: shape [n_samples x (degree+1)]\n",
" \"\"\"\n",
" polynomial_features = np.ones(data.shape)\n",
" for i in range(degree):\n",
" polynomial_features = np.column_stack((polynomial_features, data ** (i + 1)))\n",
" return polynomial_features\n",
"\n",
"\n",
"def get_mean_std_features(polynomial_features: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:\n",
" \"\"\"\n",
" Function for calculating the mean and standard deviation of the features\n",
" :param polynomial_features: shape: [n_samples x (degree+1)]\n",
" :return mean_feat: mean vector of the features,\n",
" shape:[1 x (degrees+1)]\n",
" :return std_feat: standard deviation (for each dimension in feature matrix),\n",
" shape: [1 x (degrees+1)] \n",
" \"\"\"\n",
" mean_feat = np.mean(polynomial_features, axis=0, keepdims=True)\n",
" mean_feat[:, 0] = 0.0 # we don't want to normalize the bias\n",
" std_feat = np.std(polynomial_features, axis=0, keepdims=True)\n",
" std_feat[:, 0] = 1.0 # we don't want to normalize the bias\n",
" return mean_feat, std_feat\n",
"\n",
"\n",
"def normalize_features(polynomial_features: np.ndarray,\n",
" mean_train_features: np.ndarray,\n",
" std_train_features: np.ndarray) ->np.ndarray:\n",
" \"\"\"\n",
" Normalize features\n",
" :param polynomial_features: features to be normalized,\n",
" shape: [n_samples x (degree+1)]\n",
" :param mean_train_features: mean of the feature matrix of the training set,\n",
" shape: [1 x (degrees+1)]\n",
" :param std_train_features: std of the feature matrix of the training set,\n",
" shape: [1 x (degrees+1)]\n",
" :return norm_feat: normalized features, shape: [n_samples x (degree+1)]\n",
" \"\"\"\n",
"\n",
" # note: features: (n_samples x n_dims),\n",
" # mean_train_features: (1 x n_dims),\n",
" # std_train_features: (1 x n_dims)\n",
" # due to these dimensionalities we can do element-wise operations.\n",
" # By this we normalize each dimension independently\n",
" norm_feat = (polynomial_features - mean_train_features) / std_train_features\n",
" return norm_feat\n",
"\n",
"\n",
"def eval(Phi:np.ndarray, w:np.ndarray)->np.ndarray:\n",
" \"\"\"\n",
" Evaluate the models\n",
"\n",
" :param Phi: Feature matrix, shape: [n_samples x (degree+1)]\n",
" :param w: weight vector, shape: [degree + 1]\n",
" :return : predictions, shape [n_samples] (we have 1-dim data)\n",
" Evaluates your model\n",
" \"\"\"\n",
" return np.dot(Phi, w)\n",
"\n",
"\n",
"def mse(y_target:np.ndarray, y_pred:np.ndarray)->np.ndarray:\n",
" \"\"\"\n",
" :param y_target: the target outputs,\n",
" shape: [n_samples] (here 1-dim data)\n",
" :param y_pred: the predicted outputs,\n",
" shape: [n_samples](we have 1-dim data)\n",
" :return : The Mean Squared Error, shape: scalar\n",
" \"\"\"\n",
" diff = y_target - y_pred\n",
" return np.sum(diff ** 2, axis=0) / y_pred.shape[0]\n"
]
},
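{
"cell_type": "markdown",
"metadata": {},
"source": [
"Optionally (this is not part of the exercise), here is a tiny usage sketch of the helpers above on a made-up input vector `toy_inputs`: after normalizing the polynomial features with their own mean and standard deviation, every dimension except the bias column has (approximately) zero mean and unit standard deviation."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional quick check (illustration only): polynomial features of a tiny made-up input,\n",
"# normalized with their own mean/std, have ~zero mean and unit std per dimension\n",
"# (except the bias column, which is deliberately left untouched).\n",
"toy_inputs = np.array([-1.0, 0.5, 2.0])\n",
"toy_phi = get_polynomial_features(toy_inputs, 3)\n",
"toy_mean, toy_std = get_mean_std_features(toy_phi)\n",
"toy_norm = normalize_features(toy_phi, toy_mean, toy_std)\n",
"print(np.mean(toy_norm, axis=0))  # bias column stays at 1, all others ~0\n",
"print(np.std(toy_norm, axis=0))   # bias column has std 0, all others ~1"
]
},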
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.3) Implement Ridge Regression Weights (2 Point)\n",
"The following function will calculate the weights for ridge regression. Fill in the missing code according to the formula for calculating the weight updates for ridge regression. <br>\n",
"Recall that the formula is given by \n",
"\\begin{align*}\n",
" \\boldsymbol{w} &= (\\boldsymbol{\\Phi} ^T \\boldsymbol{\\Phi} + \\lambda \\boldsymbol{I} )^{-1} \\boldsymbol{\\Phi}^T \\boldsymbol{y},\n",
"\\end{align*}\n",
"where $\\boldsymbol{\\Phi}$ is the feature matrix (the matrix storing the data points applied to the polynomial features).\n",
"Hint: use np.linalg.solve for solving for the linear equation.\n",
"If you got confused because of the normalization described before, don't worry, you do not need to consider it here :)"
]
},
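{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before you fill in the TODO below, here is a tiny, self-contained illustration of the `np.linalg.solve` hint (it is not the exercise solution): for a linear system $A x = b$, calling `np.linalg.solve(A, b)` is generally more accurate and cheaper than forming the inverse explicitly. The matrix `A` and the vector `b` are made up for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustration of the np.linalg.solve hint on a made-up 2x2 system (not the exercise solution).\n",
"A = np.array([[3.0, 1.0], [1.0, 2.0]])\n",
"b = np.array([9.0, 8.0])\n",
"x_solve = np.linalg.solve(A, b)   # preferred: solve the system directly\n",
"x_inv = np.linalg.inv(A) @ b      # same result, but less stable and more expensive\n",
"print(x_solve, x_inv)"
]
},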
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def calc_weights_ridge(Phi:np.ndarray,\n",
" y:np.ndarray,\n",
" ridge_factor:float)->np.ndarray:\n",
" \"\"\"\n",
" :param Phi: Feature Matrix, shape: [n_samples x (degree+1)]\n",
" :param y: Output Values, [n_samples] (we have 1-dim data)\n",
" :param ridge_factor: lambda value, shape: scalar\n",
" :return : The weight vector, calculated according to the equation shown before,\n",
" shape: [degrees +1]\n",
" \"\"\"\n",
" ##################\n",
" ##TODO\n",
" #################"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For demonstrating ridge regression we will pick the polynomial degree of 10. In the lecture notebook we have seen that this model is highly overfitting to the data.\n",
"We will investigate the role of the ridge factor $\\lambda$. For that purpose we first need to calculate the weights for different $\\lambda$ values. <br>\n",
"We will pick $\\lambda = [1e-{6}, 1e-{3}, 1, 3, 5,10,20,30,40,50, 1e2, 1e3, 1e5] $ to see the differences of the values. <br><br>\n",
"\n",
"Practical note. We use here very high values for $\\lambda$ for demonstration\n",
"purposes here. In practice we would not choose a model where we know from\n",
"beginning that it is highly overfitting. When choosing an appropriate model, the value needed for $\\lambda$ automatically will be small (often in the range of $1e^{-6}$ or smaller)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's do it on polynomial degree 10 and see the results\n",
"\n",
"# first we get the mean and the standard deviation of the training feature matrix, which we will use for normalization\n",
"train_features = get_polynomial_features(training_data_x, 10)\n",
"test_features = get_polynomial_features(test_data_x, 10)\n",
"mean_train_feat, std_train_feat = get_mean_std_features(train_features)\n",
"norm_train_features = normalize_features(train_features, mean_train_feat, std_train_feat)\n",
"norm_test_features = normalize_features(test_features, mean_train_feat, std_train_feat)\n",
"\n",
"\n",
"# now we can calculate the normalized features for degree 10\n",
"ridge_factors = [1e-6, 1e-3, 1, 3, 5, 10,20,30,40, 50, 1e2, 1e3, 1e5]\n",
"weights_ridge = []\n",
"\n",
"for lambda_val in ridge_factors:\n",
" weights_ridge.append(calc_weights_ridge(norm_train_features, training_data_y, lambda_val))\n",
"\n",
"# We further have to perform the predictions based on the models we have calculated\n",
"y_training_ridge = []\n",
"y_test_ridge = []\n",
"\n",
"for w in weights_ridge:\n",
" y_training_ridge.append(eval(norm_train_features, w))\n",
" y_test_ridge.append(eval(norm_test_features, w))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are interested in the mean squarred error on the test and the training data. For that purpose we calculate them here and plot the errors for different $\\lambda$ values in log space. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"training_error_ridge = []\n",
"test_error_ridge = []\n",
"\n",
"for i in range(len(y_training_ridge)):\n",
" training_error_ridge.append(mse(training_data_y, y_training_ridge[i]))\n",
" test_error_ridge.append(mse(test_data_y, y_test_ridge[i]))\n",
"\n",
"error_fig_ridge = plt.figure()\n",
"plt.figure(error_fig_ridge.number)\n",
"plt.title(\"Error Plot Ridge Regression\")\n",
"plt.xlabel(\"$\\lambda$\")\n",
"plt.ylabel(\"MSE\")\n",
"x_axis = [\"$1e-{6}$\", \"$1e-{3}$\", \"$1$\", \"$3$\", \"$5$\",\"$10$\",\"$20$\",\"$30$\",\"$40$\",\"$50$\",\n",
" \"$1e2$\", \"$1e3$\", \"$1e5$\"]\n",
"plt.yscale('log')\n",
"plt.plot(x_axis, training_error_ridge, 'b')\n",
"plt.plot(x_axis, test_error_ridge, 'r')\n",
"# let's find the index with the minimum training error\n",
"min_error_idx = np.argmin(test_error_ridge)\n",
"plt.plot(x_axis[min_error_idx], test_error_ridge[min_error_idx], 'xg')\n",
"plt.legend(['Training Error', 'Test Error', 'Min Test Error'])\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Let us visualize the newly fitted model with the optimal lambda value here\n",
"x = np.linspace(-5, 5, 100)\n",
"new_features = get_polynomial_features(x, 10)\n",
"new_norm_feat = normalize_features(new_features, mean_train_feat, std_train_feat)\n",
"y_pred = eval(new_norm_feat, weights_ridge[min_error_idx])\n",
"\n",
"plt.plot()\n",
"plt.plot(test_data_x, test_data_y, 'or')\n",
"plt.plot(training_data_x, training_data_y, 'ob')\n",
"plt.plot(x, y_pred)\n",
"plt.legend([\"test_data\", \"training_data\", \"inference\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"### 1.4) Error Plot (1 Point)\n",
"In the lecture we have seen and analyzed the plot of polynomial degrees \n",
"against the error (slide 47).\n",
"Similarly, now please analyze the relationship between the error and the \n",
"different values of $\\lambda$, as well as the reason behind it.\n",
"\n",
"Hint: Do not forget that we are in log space. Small changes in the y-axis mean high differences in the error values. <br><br>\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Probability Basics and Linear Classification\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First Example (Two Moons)\n",
"\n",
"Let us start by loading a very simple toy dataset, the \"two moons\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from typing import Tuple, Callable\n",
"\n",
"data = dict(np.load(\"two_moons.npz\", allow_pickle=True))\n",
"samples = data[\"samples\"]\n",
"labels = data[\"labels\"]\n",
"\n",
"c0_samples = samples[labels == 0] # class 0: all samples with label 0\n",
"c1_samples = samples[labels == 1] # class 1: all samples with labe 1 \n",
"\n",
"plt.figure(\"Data\")\n",
"plt.scatter(x=c0_samples[:, 0], y=c0_samples[:, 1], label=\"c0\")\n",
"plt.scatter(x=c1_samples[:, 0], y=c1_samples[:, 1], label=\"c1\")\n",
"plt.legend()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us also define some plotting utility"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def draw_2d_gaussian(mu: np.ndarray, sigma: np.ndarray, plt_std: float = 2, *args, **kwargs) -> None:\n",
" (largest_eigval, smallest_eigval), eigvec = np.linalg.eig(sigma)\n",
" phi = -np.arctan2(eigvec[0, 1], eigvec[0, 0])\n",
"\n",
" plt.scatter(mu[0:1], mu[1:2], marker=\"x\", *args, **kwargs)\n",
"\n",
" a = plt_std * np.sqrt(largest_eigval)\n",
" b = plt_std * np.sqrt(smallest_eigval)\n",
"\n",
" ellipse_x_r = a * np.cos(np.linspace(0, 2 * np.pi, num=200))\n",
" ellipse_y_r = b * np.sin(np.linspace(0, 2 * np.pi, num=200))\n",
"\n",
" R = np.array([[np.cos(phi), np.sin(phi)], [-np.sin(phi), np.cos(phi)]])\n",
" r_ellipse = np.array([ellipse_x_r, ellipse_y_r]).T @ R\n",
" plt.plot(mu[0] + r_ellipse[:, 0], mu[1] + r_ellipse[:, 1], *args, **kwargs)\n",
"\n",
"# plot grid for contour plots\n",
"plt_range = np.arange(-1.5, 2.5, 0.01)\n",
"plt_grid = np.stack(np.meshgrid(plt_range, plt_range), axis=-1)\n",
"flat_plt_grid = np.reshape(plt_grid, [-1, 2])\n",
"plt_grid_shape = plt_grid.shape[:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2): Classification using Generative Models (Naive Bayes Classifier)\n",
"\n",
"We first try a generative approach, the Naive Bayes Classifier.\n",
"We model the class conditional distributions $p(\\boldsymbol{x}|c)$ as Gaussians, the class prior $p(c)$ as\n",
"Bernoulli and apply Bayes rule to compute the class posterior $p(c|\\boldsymbol{x})$.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small recap, recall that the density of the Multivariate Normal Distribution is given by\n",
"\n",
"$$ p(\\boldsymbol{x}) = \\mathcal{N}\\left(\\boldsymbol{x} | \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma} \\right) = \\dfrac{1}{\\sqrt{\\det \\left(2 \\pi \\boldsymbol{\\Sigma}\\right)}} \\exp\\left( - \\dfrac{(\\boldsymbol{x}-\\boldsymbol{\\mu})^T \\boldsymbol{\\Sigma}^{-1} (\\boldsymbol{x}-\\boldsymbol{\\mu})}{2}\\right) $$\n",
"\n",
"and we already saw how to implement it in the python introduction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def mvn_pdf(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" Density of the Multivariate Normal Distribution\n",
" :param x: samples, shape: [N x dimension]\n",
" :param mu: mean, shape: [dimension]\n",
" :param sigma: covariance, shape: [dimension x dimension]\n",
" :return p(x) with p(x) = N(mu, sigma) , shape: [N] \n",
" \"\"\"\n",
" norm_term = 1 / np.sqrt(np.linalg.det(2 * np.pi * sigma))\n",
" diff = x - np.atleast_2d(mu)\n",
" exp_term = np.sum(np.linalg.solve(sigma, diff.T).T * diff, axis=-1)\n",
" return norm_term * np.exp(-0.5 * exp_term)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Practical Aspect:** In practice you would never implement it like that, but stay\n",
"in the log-domain. Also for numerically stable implementations of the multivariate normal density the symmetry and\n",
"positive definitness of the covariance should be exploited by working with it's Cholesky decomposition."
]
},
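{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this practical aspect a bit more concrete, here is an optional sketch (not needed for the exercise) of a log-density that works with the Cholesky factor of the covariance instead of explicit determinants and inverses. The function name `mvn_logpdf_chol` and the small consistency check against `mvn_pdf` are purely illustrative."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def mvn_logpdf_chol(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> np.ndarray:\n",
"    # Optional sketch: log N(x | mu, sigma) via the Cholesky factor L with sigma = L @ L.T\n",
"    L = np.linalg.cholesky(sigma)               # lower triangular\n",
"    diff = (x - np.atleast_2d(mu)).T            # shape: [dimension x N]\n",
"    z = np.linalg.solve(L, diff)                # z = L^{-1} (x - mu)\n",
"    maha = np.sum(z ** 2, axis=0)               # squared Mahalanobis distances, shape: [N]\n",
"    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log det(sigma)\n",
"    d = sigma.shape[0]\n",
"    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)\n",
"\n",
"# quick consistency check against mvn_pdf defined above (illustration only)\n",
"check_x = np.array([[0.0, 0.0], [1.0, 1.0]])\n",
"print(np.allclose(np.exp(mvn_logpdf_chol(check_x, np.zeros(2), np.eye(2))),\n",
"                  mvn_pdf(check_x, np.zeros(2), np.eye(2))))"
]
},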
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The maximum likelihood estimator for a Multivariate Normal Distribution is given by\n",
"$$ \\boldsymbol{\\mu} = \\dfrac{1}{N} \\sum_{i}^N \\boldsymbol{x}_i \\quad \\quad \\boldsymbol{\\Sigma} = \\dfrac{1}{N} \\sum_{i}^N (\\boldsymbol{x}_i - \\boldsymbol{\\mu}) (\\boldsymbol{x}_i - \\boldsymbol{\\mu})^T. $$\n",
"\n",
"This time, before we use it, we are going to derive it:\n",
"\n",
"### Exercise 2.1): Derivation of Maximum Likelihood Estimator (5 Points):\n",
"\n",
"Derive the maximum likelihood estimator for Multivariate Normal distributions, given above.\n",
"This derivations involves some matrix calculus.\n",
"Matrix calculus is a bit like programming, you google the stuff you need and then plug it together in the right order.\n",
"Good resources for such rules are the \"matrix cookbook\" (https://www.math.uwaterloo.ca/~hwolkowi/matrixcookbook.pdf) and the Wikipdia article about matrix calculus\n",
"(https://en.wikipedia.org/wiki/Matrix_calculus ). State all rules you use explicitly\n",
"(except the ones given in the hints below). \n",
"\n",
"**Remark** There are different conventions of how to define a gradient (as column-vector or row-vector). This results in different ways to write the Jacobian and thus different, usually transposed, matrix calculus rules:\n",
"- In the lecture we define the gradient as column-vector \n",
"- In the Wikipedia article this convention is referred to as \"Denominator Layout\". It also contains a nice explanation of the different conventions for the gourmets among you ;) \n",
"- The Matrix Cookbook uses the same convention (gradient as column vector)\n",
"- Please also use it here\n",
"\n",
"**Hint** Here are two of those rules that might come in handy\n",
"\n",
"$\\dfrac{\\partial\\log\\det(\\boldsymbol{X})}{\\partial \\boldsymbol{X}} = \\boldsymbol{X}^{-1}$\n",
"\n",
"$\\dfrac{\\partial \\boldsymbol{x}^T\\boldsymbol{A}\\boldsymbol{x}}{\\partial \\boldsymbol{x}} = 2 \\boldsymbol{A}\\boldsymbol{x}$ for symmetric matrices $\\boldsymbol{A}$ (hint hint: covariance matrices are always\n",
"symmetric)\n",
"\n",
"There is one missing to solve the exercise. You need to find it yourself. (Hint hint: Look in the matrix cookbook, chapter 2.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Implementation**\n",
"\n",
"Lets reuse one of the implementations from the zeroth-exercise for that "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def mvn_mle(x: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:\n",
" \"\"\"\n",
" Maximum Likelihood Estimation of parameters for Multivariate Normal Distribution\n",
" :param x: samples shape: [N x dimension]\n",
" :return mean (shape: [dimension]) und covariance (shape: [dimension x dimension]) that maximize likelihood of data.\n",
" \"\"\"\n",
" mean = 1 / x.shape[0] * np.sum(x, axis=0)\n",
" diff = x - mean\n",
" cov = 1 / x.shape[0] * diff.T @ diff\n",
" return mean, cov\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can now use this maximum likelihood estimator to fit generative models to the samples of both classes. Using those models and some basic rules of probability we can obtain the class posterior distribution $p(c|\\boldsymbol{x})$"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 2.2) Generative Classifier (2 Points)\n",
"\n",
"Given a way to fit the class conditional using our Maximum Likelihood estimator, we can implement the generative classifier"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Fit Gaussian Distributions using the maximum likelihood estimator to samples from both classes\n",
"mu_c0, sigma_c0 = mvn_mle(c0_samples)\n",
"mu_c1, sigma_c1 = mvn_mle(c1_samples)\n",
"\n",
"# Prior obtained by \"counting\" samples in each class\n",
"p_c0 = c0_samples.shape[0] / samples.shape[0]\n",
"# LEAVE AS EXERCISE\n",
"p_c1 = # TODO \n",
"\n",
"def compute_posterior(\n",
" samples: np.ndarray,\n",
" p_c0: float, mu_c0: np.ndarray, sigma_c0: np.ndarray,\n",
" p_c1: float, mu_c1: np.ndarray, sigma_c1: np.ndarray) \\\n",
" -> Tuple[np.ndarray, np.ndarray]:\n",
" \"\"\"\n",
" computes the posteroir distribution p(c|x) given samples x, the prior p(c) and the\n",
" class conditional likelihood p(x|c)\n",
" :param samples: samples x to classify, shape: [N x dimension]\n",
" :param p_c0: prior probability of class 0, p(c=0) \n",
" :param mu_c0: mean of class conditional likelihood of class 0, p(x|c=0) shape: [dimension]\n",
" :param sigma_c0: covariance of class conditional likelihood of class 0, p(x|c=0) shape: [dimension x dimension]\n",
" :param p_c1: prior probability of class 1 p(c=1) \n",
" :param mu_c1: mean of class conditional likelihood of class 1 p(x|c=1) shape: [dimension]\n",
" :param sigma_c1: covariance of class conditional likelihood of class 1, p(x|c=1) shape: [dimension x dimension]\n",
" :return two arrays, p(c=0|x) and p(c=1|x), both shape [N]\n",
" \"\"\"\n",
" # TODO: compute class likelihoods \n",
" # TODO: compute normalization using marginalization\n",
" # TODO: compute class posterior using Bayes rule\n",
" p_c0_given_x = \n",
" p_c1_given_x =\n",
" return p_c0_given_x, p_c1_given_x\n",
"\n",
"\n",
"p_c0_given_x, p_c1_given_x = compute_posterior(samples, p_c0, mu_c0, sigma_c0, p_c1, mu_c1, sigma_c1)\n",
"# Prediction\n",
"predicted_labels = np.zeros(labels.shape)\n",
"# break at 0.5 arbitrary\n",
"predicted_labels[p_c0_given_x >= 0.5] = 0.0 # is not strictly necessary since whole array already zero.\n",
"predicted_labels[p_c1_given_x > 0.5] = 1.0\n",
"\n",
"# Evaluate\n",
"acc = (np.count_nonzero(predicted_labels == labels)) / labels.shape[0]\n",
"print(\"Accuracy:\", acc)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Lets look at the class likelihoods"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"plt.title(\"Naive Bayes\")\n",
"plt.scatter(x=samples[labels == 0, 0], y=samples[labels == 0, 1], c=\"blue\")\n",
"draw_2d_gaussian(mu_c0, sigma_c0, c=\"blue\")\n",
"plt.scatter(x=samples[labels == 1, 0], y=samples[labels == 1, 1], c=\"orange\")\n",
"draw_2d_gaussian(mu_c1, sigma_c1, c=\"orange\")\n",
"plt.legend([\"c0\", \"c1\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the final posterior distribution for the case $p(c=1|\\boldsymbol{x})$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"p_c0_given_x, p_c1_given_x = compute_posterior(flat_plt_grid, p_c0, mu_c0, sigma_c0, p_c1, mu_c1, sigma_c1)\n",
"p_c0_given_x = np.reshape(p_c0_given_x, plt_grid_shape)\n",
"p_c1_given_x = np.reshape(p_c1_given_x, plt_grid_shape)\n",
"\n",
"plt.figure(figsize=(10, 4))\n",
"plt.subplot(1, 2, 1)\n",
"plt.contourf(plt_grid[..., 0], plt_grid[..., 1], p_c0_given_x, levels=10)\n",
"plt.colorbar()\n",
"# plot decision boundary \n",
"plt.contour(plt_grid[..., 0], plt_grid[..., 1], p_c0_given_x, levels=[0.0, 0.5], colors=[\"k\", \"k\"])\n",
"\n",
"plt.title(\"p($c_0$ | x)\")\n",
"s0 = plt.scatter(c0_samples[..., 0], c0_samples[..., 1], color=\"blue\")\n",
"s1 = plt.scatter(c1_samples[..., 0], c1_samples[..., 1], color=\"orange\")\n",
"plt.legend([s0, s1], [\"c0\", \"c1\"])\n",
"plt.xlim(-1.5, 2.5)\n",
"\n",
"plt.subplot(1, 2, 2)\n",
"plt.contourf(plt_grid[..., 0], plt_grid[..., 1], p_c1_given_x, levels=10)\n",
"plt.colorbar()\n",
"# plot decision boundary \n",
"plt.contour(plt_grid[..., 0], plt_grid[..., 1], p_c0_given_x, levels=[0.0, 0.5], colors=[\"k\", \"k\"])\n",
"plt.title(\"p($c_1$ | x)\")\n",
"s0 = plt.scatter(c0_samples[..., 0], c0_samples[..., 1], color=\"blue\")\n",
"s1 = plt.scatter(c1_samples[..., 0], c1_samples[..., 1], color=\"orange\")\n",
"plt.legend([s0, s1], [\"c0\", \"c1\"])\n",
"\n",
"plt.xlim(-1.5, 2.5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The color indicates the posterior likelihood for the respective call and the black line indicates the decision boundary. \n",
"We achieve a train accuracy of 87%.\n",
"For such a simple task that is clearly not great, but it nicely illustrates a\n",
"problem with generative approaches:\n",
"They usually depend on quite a lot of assumptions.\n",
"\n",
"### 2.3) Wrong Assumptions? (1 Point):\n",
"Which untrue assumption did we make?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Stochastic and Batch Gradients\n",
"\n",
"In the recap sessions with Prof. Neumann we already saw (or will see) an implementation of a Discriminative Classifier using Logistic Regression. Here we are going to extend this to stochastic and batch gradient descent. \n",
"\n",
"We start by implementing a few helper functions for affine mappings, the sigmoid function, and the negative Bernoulli log-likelihood. - Those are the same as used for the full gradient case."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def affine_features(x: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" implements affine feature function\n",
" :param x: inputs, shape: [N x sample_dim]\n",
" :return inputs with additional bias dimension, shape: [N x feature_dim]\n",
" \"\"\"\n",
" return np.concatenate([x, np.ones((x.shape[0], 1))], axis=-1)\n",
"\n",
"def quad_features(x: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" implements quadratic feature function\n",
" :param x: inputs, shape: [N x sample_dim]\n",
" :return squared features of x, shape: [N x feature_dim]\n",
" \"\"\"\n",
" sq = np.stack([x[:, 0] ** 2, x[:, 1]**2, x[:, 0] * x[:, 1]], axis=-1)\n",
" return np.concatenate([sq, affine_features(x)], axis=-1)\n",
"\n",
"def cubic_features(x: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" implements cubic feature function\n",
" :param x: inputs, shape: [N x sample_dim]\n",
" :return cubic features of x, shape: [N x feature_dim]\n",
" \"\"\"\n",
" cubic = np.stack([x[:, 0]**3, x[:, 0]**2 * x[:, 1], x[:, 0] * x[:, 1]**2, x[:, 1]**3], axis=-1)\n",
" return np.concatenate([cubic, quad_features(x)], axis=-1)\n",
"\n",
"def sigmoid(x: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" the sigmoid function\n",
" :param x: inputs \n",
" :return sigma(x)\n",
" \"\"\"\n",
" return 1 / (1 + np.exp(-x))\n",
"\n",
"def bernoulli_nll(predictions: np.ndarray, labels: np.ndarray, epsilon: float = 1e-12) -> np.ndarray:\n",
" \"\"\"\n",
" :param predictions: output of the classifier, shape: [N]\n",
" :param labels: true labels of the samples, shape: [N]\n",
" :param epsilon: small offset to avoid numerical instabilities (i.e log(0))\n",
" :return negative log-likelihood of the labels given the predictions\n",
" \"\"\"\n",
" return - (labels * np.log(predictions + epsilon) + (1 - labels) * np.log(1 - predictions + epsilon))"
]
},
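{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small optional check (not part of the exercise), the sketch below evaluates the feature functions and the sigmoid on a made-up input to show the resulting feature dimensions; `demo_x` is invented purely for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional illustration: feature dimensions for a single made-up 2-d input.\n",
"demo_x = np.array([[0.5, -1.0]])\n",
"print(affine_features(demo_x).shape)  # (1, 3): x1, x2, bias\n",
"print(quad_features(demo_x).shape)    # (1, 6): squares, cross term, affine features\n",
"print(cubic_features(demo_x).shape)   # (1, 10): cubic terms, quadratic terms, affine features\n",
"print(sigmoid(np.array([-2.0, 0.0, 2.0])))  # squashes values into (0, 1)"
]
},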
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We are also using the same bernoulli objective and its gradient as before"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def objective_bern(weights: np.ndarray, features: np.ndarray, labels: np.ndarray) -> float:\n",
" \"\"\"\n",
" bernoulli log-likelihood objective \n",
" :param weights: current weights to evaluate, shape: [feature_dim]\n",
" :param features: train samples, shape: [N x feature_dim]\n",
" :param labels: class labels corresponding to train samples, shape: [N]\n",
" :return average negative log-likelihood \n",
" \"\"\"\n",
" predictions = sigmoid(features @ weights)\n",
" return np.mean(bernoulli_nll(predictions, labels))\n",
"\n",
"def d_objective_bern(weights: np.ndarray, features: np.ndarray, labels: np.ndarray) -> np.ndarray:\n",
" \"\"\"\n",
" gradient of the bernoulli log-likelihood objective\n",
" :param weights: current weights to evaluate, shape: [feature_dim]\n",
" :param features: train samples, shape: [N x feature_dim]\n",
" :param labels: class labels corresponding to train samples, shape [N]\n",
" \"\"\"\n",
" res = np.expand_dims(sigmoid(features @ weights) - labels, -1)\n",
" grad = features.T @ res / res.shape[0]\n",
" return np.squeeze(grad)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.1) Implementation (3 Points)\n",
"\n",
"Finally, we can implement our batch gradient descent optimizer. When setting the batch_size to 1 it will become a stochastic gradient descent optimizer.\n"
]
},
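{
"cell_type": "markdown",
"metadata": {},
"source": [
"Purely as an illustration of the mechanics (the actual update step is part of the exercise below), the following sketch shows the generic shuffle-and-slice pattern that mini-batch optimizers typically use. The arrays `toy_x` and `toy_y` are made up for this demonstration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Generic shuffle-and-slice mini-batch pattern on made-up toy data (illustration only).\n",
"toy_x = np.arange(12.0).reshape(6, 2)          # 6 toy samples with 2 features each\n",
"toy_y = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])\n",
"toy_batch_size = 2\n",
"\n",
"perm = np.random.permutation(toy_x.shape[0])   # shuffle samples and labels consistently\n",
"toy_x_shuffled, toy_y_shuffled = toy_x[perm], toy_y[perm]\n",
"\n",
"for j in range(toy_x.shape[0] // toy_batch_size):\n",
"    batch_x = toy_x_shuffled[j * toy_batch_size:(j + 1) * toy_batch_size]\n",
"    batch_y = toy_y_shuffled[j * toy_batch_size:(j + 1) * toy_batch_size]\n",
"    print(batch_x.shape, batch_y.shape)        # each batch: (2, 2) and (2,)"
]
},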
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"def minimize_with_sgd(features: np.ndarray, labels: np.ndarray, initial_weights: np.ndarray, schedule: Callable,\n",
" num_iterations: int, batch_size: int):\n",
" \"\"\"\n",
" :param features: all samples, shape: [N x feature_dim] \n",
" :param labels: all labels, shape: [N]\n",
" :param initial_weights: initial weights of the classifier, shape: [feature_dim * K]\n",
" :param schedule: learning rate schedule (a callable function returning the learning rate, given the iteration\n",
" :param num_iterations: number of times to loop over the whole dataset\n",
" :param batch_size: size of each batch, should be between 1 and size of data\n",
" return \"argmin\", \"min\", logging info\n",
" \"\"\"\n",
"\n",
" assert 1 <= batch_size <= features.shape[0]\n",
" # This is a somewhat simplifying assumption but for the exercise its ok\n",
" assert features.shape[0] % batch_size == 0, \"Batch Size does not evenly divide number of samples\"\n",
" batches_per_iter = int(features.shape[0] / batch_size)\n",
"\n",
" # setup\n",
" weights = np.zeros([batches_per_iter * num_iterations + 1, initial_weights.shape[0]])\n",
" loss = np.zeros(batches_per_iter * num_iterations + 1)\n",
" weights[0] = initial_weights\n",
" loss[0]= objective_bern(weights[0], features, labels)\n",
"\n",
" for i in range(num_iterations):\n",
" #--------------------------------------------------\n",
" # TODO: shuffle data\n",
" #--------------------------------------------------\n",
" for j in range(batches_per_iter):\n",
" global_idx = i * batches_per_iter + j\n",
"\n",
" #--------------------------------------------------\n",
" # TODO: do stochastic gradient descent update!\n",
" #--------------------------------------------------\n",
"\n",
"\n",
" # log loss (on all samples, usually you should not use all samples to evaluate after each stochastic\n",
" # update step)\n",
" loss[global_idx + 1] = objective_bern(weights[global_idx + 1], features, labels)\n",
" return weights[-1], loss[-1], (weights, loss)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loss curve is expected to look a bit jerky due to the stochastic nature of stochastic gradient descent.\n",
"If it goes down asymptotically its fine."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Generate Features from Data\n",
"\n",
"# change this to play arround with feature functions\n",
"#feature_fn = affine_features\n",
"#feature_fn = quad_features\n",
"feature_fn = cubic_features\n",
"features = feature_fn(samples)\n",
"\n",
"num_iterations = 25\n",
"\n",
"w_bce, l, l_info = minimize_with_sgd(features, labels, np.zeros(features.shape[1]),\n",
" schedule=(lambda t: 0.25),\n",
" num_iterations=num_iterations,\n",
" batch_size=1)\n",
"print(\"Final loss\", l)\n",
"\n",
"plt.figure()\n",
"plt.title(\"Cross Entropy Loss\")\n",
"plt.grid(\"on\")\n",
"plt.xlabel(\"Update Steps\")\n",
"plt.ylabel(\"Negative Bernoulli Log-Likelihood\")\n",
"plt.semilogy(l_info[1])\n",
"\n",
"plt.figure()\n",
"plt.title(\"Bernoulli LL Solution\")\n",
"pred_grid = np.reshape(sigmoid(feature_fn(flat_plt_grid) @ w_bce), plt_grid_shape)\n",
"\n",
"plt.contourf(plt_grid[..., 0], plt_grid[..., 1], pred_grid, levels=10)\n",
"plt.colorbar()\n",
"#This is just a very hacky way to get a black line at the decision boundary: \n",
"plt.contour(plt_grid[..., 0], plt_grid[..., 1], pred_grid, levels=[0, 0.5], colors=[\"k\"])\n",
"\n",
"s0 = plt.scatter(c0_samples[..., 0], c0_samples[..., 1], color=\"blue\")\n",
"s1 = plt.scatter(c1_samples[..., 0], c1_samples[..., 1], color=\"orange\")\n",
"plt.legend([s0, s1], [\"c0\", \"c1\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3.2) Effect of different Batch Sizes and Number of Iterations (1. Point)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Play around with the batch size and number of iterations and briefly describe your observations about convergence speed and monotonicity of the loss curve."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 1
}

BIN
test_data.npy Normal file

Binary file not shown.

BIN
training_data.npy Normal file

Binary file not shown.

BIN
two_moons.npz Normal file

Binary file not shown.