"Perceptron is a a linear classification model of dichotomy, the input is the egienvector of instance and the output is other category of instance(take +1 and -1). The perceptron corresponds to a separate hyperplane in which the instance is divided into two classes in the input space. The perceptron aims to find the hyperplane. In order to find the hyperplane, the loss function based on misclassification is introduced. The gradient descent method is used to optimize the loss function (optimization).Perceptron learning algorithm is simple and easy to implement. It can be divided into primitive form and dual form. Perceptron prediction is a discriminant model, beacause it uses the peceptron model through learning to predict the new instance. Perceptronsis proposed by Rosenblatt in 1957, are the basis of neural networks and support vector machines.\n",
"It imitate neurons in the biological nervous system, which can receive signals from multiple sources and then convert them into signals that are easy to transmit for output (which in the biological body is represented as electrical signals).\n",
"Psychologist Rosenblatt conceive perception machine, as a simplified mathematical model to explain how neurons in the brain work: it took a set of binary input values (nearby neurons), multiply each input value by a continuous value weight (near each neuron synaptic strength), and setting up a threshold, if the weighted input value is more than the threshold, the output is 1, otherwise 0 (similarly in the neurons discharge process). For perceptrons, most of the input values are either data or outputs from other perceptrons.\n",
"Donald Hebb proposed an unexpected and far-reaching idea that knowledge and learning occur in the brain mainly through the formation and change of synapses between neurons, which is briefly described as Hebb's law:\n",
 When cell A's axons">
"> When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased.\n",
"Perception machine has not completely follow the idea. **However, by the weight of the input value, we can have a very simple and intuitive learning plan: the training set of a given an input/output instance, perception machine should \"learn\" a function: for each example, if the output value is much lower than the instance, then increase the weight of it, otherwise if the value is much higher than the instance, reduce the weight of it.**\n"
"Assume that the input space(eigenvector) is $X \\subseteq R^n$, then the output space is $Y=\\{-1, +1\\}$. Input $x \\in X$ stands for the eigenvector of instance which is correspond to the point in input sapce; Input $y \\in Y$ stancds for the category of instance. The function from input space to output space is:\n",
"This is called perceptron. Among this, parameter $w$ is called weigth vector, $b$ is called bias. $w·x$ represent the dot product of $w$ and $x$. $sign$ is symbol function, which is:\n",
"Perceptron model is linear classification model, it assumes that the space is all linear classification model defined in egienspace, which is the function set ${f|f(x)=w·x+b}$. Liner function $w·x+b=0$ correspond to a hyperplane $S$ in eigen space $Rn$, $w$ is a normal vector of the hyperplane, and B is a truncation of the hyperplane. This hyperpalne divide eigen space into two parts. The points on both sides are positive and negative. Hyperplane S is called the separation hyperplane, as shown in the figure below:\n",
"Assume that the training data is linear separable, the goal of perceptron learing is to get a hyperpalne that can totally split the positve and negative point in the training data, that is to get the parameter w and b. This need a learning strategy, which is define loss function and minimize the loss function.\n",
"A natural selection of the loss function is the total number of misclassified points. However, the loss function obtained in this way is not a continuous differentiable function of parameters W and B, so it is not suitable for optimization. Another choice for the loss function is the sum of the distances from the misclassification point to the classification plane.\n",
"\n",
"Firstly, for any poitn $x_0$ the distance for it to hyperplane is:\n",
"\n",
"\n",
"首先,对于任意一点xo到超平面的距离为\n",
"$$\n",
"$$\n",
"\\frac{1}{||w||} | w \\cdot xo + b |\n",
"\\frac{1}{||w||} | w \\cdot xo + b |\n",
"$$\n",
"$$\n",
"\n",
"\n",
"其次,对于误分类点$(x_i,y_i)$来说 $-y_i(w \\cdot x_i + b) > 0$\n",
"Next, for the misclassified point $(x_i,y_i)$:\n",
"\n",
"$-y_i(w \\cdot x_i + b) > 0$\n",
"\n",
"In this way, assume that the total misclassified point of hyperplane S is set M:\n",
"\n",
"\n",
"这样,假设超平面S的总的误分类点集合为M,那么所有误分类点到S的距离之和为\n",
"$$\n",
"$$\n",
"-\\frac{1}{||w||} \\sum_{x_i \\in M} y_i (w \\cdot x_i + b)\n",
"-\\frac{1}{||w||} \\sum_{x_i \\in M} y_i (w \\cdot x_i + b)\n",
"$$\n",
"$$\n",
"不考虑1/||w||,就得到了感知机学习的损失函数。\n",
"\n",
"\n",
"### 经验风险函数\n",
"Without the consideration of 1/||w||, we can get the loss function of perceptron learning.\n",
"\n",
"### Empirical risk function\n",
"\n",
"Given a dataset $T = \\{(x_1,y_1), (x_2, y_2), ... (x_N, y_N)\\}$(among them $x_i \\in R^n$, $y_i \\in \\{-1, +1\\},i=1,2...N$), the loss function of perceptron $sign(w·x+b)$ is defined as:\n",
"Among them M is the set of misclassified point, and the loss funciton is the [empirical risk function](https://blog.csdn.net/zhzhx1204/article/details/70163099) of perceptron learning.\n"
"Optimization problem: Given a dataset $T = \\{(x_1,y_1), (x_2, y_2), ... (x_N, y_N)\\}$(among them $x_i \\in R^n$, $y_i \\in \\{-1, +1\\},i=1,2...N$), calcualte parameter w,b to make it the solve of loss function(M is the set of misclassified point): \n",
"\n",
"\n",
"$$\n",
"$$\n",
"min_{w,b} L(w, b) = - \\sum_{x_i \\in M} y_i (w \\cdot x_i + b)\n",
"min_{w,b} L(w, b) = - \\sum_{x_i \\in M} y_i (w \\cdot x_i + b)\n",
"Perceptron learnign is driven by misclassified, it specifically use [random gradient descent method](https://blog.csdn.net/zbc1090549839/article/details/38149561). Firstly, randomly choose $w_0$、$b_0$. After that, use gradient descent method to constantly minimize object function, the minimization process is not a gradient descent of all the misclassification points in M all at once, instead, it randomly choose one misclassified point at a time to make it gradient descent.\n",
"\n",
"Assume that misclassified set M is fixed, then the depeth of loss function $L(w,b)$ is:\n",
"In the formula $\\eta$(0 ≤ $ \\eta $ ≤ 1) is step length(In statistics, it is learning rate). TThe greater the step size is, the faster the gradient descends and the more it approaches the minimum point. If the step length is too large, it may cross the minimum point and lead to divergence of the function. If the step size is too small, it may take a long time to reach the minimum.\n",
"Visually explain: when a instance point is being misclassified, adjust w,b to make hyperplane move to the side of misclassified point, so that the distance between misclassified point and hyperplane will be reduced untill pass thorough the point and correctly classify it.\n",
"Neurons are essentially the same as perceptron, only whenm we talk about perceptron, their activation function is step function; While when we talk about neurons, the activation function usually choose sigmoid function or tanh function. As shown in the figure below:\n",
"The way to calculate the output of a neurons and calculate the output of perceptron is the same. Assume that the input of nurons is vector $\\vec{x}$, and weight vector is $\\vec{w}$(bias term is $w_0$), activation function is sigmoid function, then the output of y is:\n",
"\n",
"$$\n",
"$$\n",
"y = sigmod(\\vec{w}^T \\cdot \\vec{x})\n",
"y = sigmod(\\vec{w}^T \\cdot \\vec{x})\n",
"$$\n",
"$$\n",
"\n",
"\n",
"sigmoid函数的定义如下:\n",
"The definitation of sigmoid function is as following:\n",
"$$\n",
"$$\n",
"sigmod(x) = \\frac{1}{1+e^{-x}}\n",
"sigmod(x) = \\frac{1}{1+e^{-x}}\n",
"$$\n",
"$$\n",
"将其带入前面的式子,得到\n",
"\n",
"Put this into the former formula, we obtain:\n",
"\n",
"$$\n",
"$$\n",
"y = \\frac{1}{1+e^{-\\vec{w}^T \\cdot \\vec{x}}}\n",
"y = \\frac{1}{1+e^{-\\vec{w}^T \\cdot \\vec{x}}}\n",
"$$\n",
"$$\n",
"\n",
"\n",
"sigmoid函数是一个非线性函数,值域是(0,1)。函数图像如下图所示\n",
"Sigmoid is a nolinear function, the domain is (0,1). The function of grapgh is shown as following:\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"sigmoid函数的导数是:\n",
"The derivative of sigmoid function is:\n",
"\\begin{eqnarray}\n",
"\\begin{eqnarray}\n",
"y & = & sigmod(x) \\tag{1} \\\\\n",
"y & = & sigmod(x) \\tag{1} \\\\\n",
"y' & = & y(1-y)\n",
"y' & = & y(1-y)\n",
"\\end{eqnarray}\n",
"\\end{eqnarray}\n",
"\n",
"\n",
"We can see that the derivative of sigmoid function is very interesting, it can use sigmoid function itself to represent. In this way, once the value of sigmoid funtion is being calcualted, it is very convenient to calculate the value of its derivative.\n",
"Neural is actually multiple neurons connected according to certain rules. The upper graph shows a fully connected neural networks. By observing the upper graph, we can find the rule of it including:\n",
"* Neurons are laid out in layers. The leftmost layer, called the input layer, receives input data; The rightmost layer is called the output layer, from which we can get the neural network output data. The layers between the input and output layers are called hidden layers because they are not visible to the outside world.\n",
"* Neurons in the same layer do not have connection with each other.\n",
"* All the neurons in Nth layer is connect to all neurons in N-1 layer(this is the meaning of full connected), the output of N-1 layer neurons is the input of N layer's input.\n",
"All the rules defined the construction of fully connected neural networks. In fact, there exist many other kind of construction neural network, such as CNN, RNN, they all have different connect rules.\n",