Xiuchuan Zhang

Personal Website

This is Xiuchuan's personal website.
I plan to post some of my current learning and review notes on it.
If you have any questions or suggestions, you are welcome to comment on my posts.


Machine Learning

Coursera.org Machine Learning Notes
These are notes taken while working through Andrew Ng’s “Machine Learning” course on Coursera.org, kept for future review.


Machine Learning

  • Grew out of work in AI
  • New capability for computers

Examples:

  • Database mining
    Large datasets from the growth of automation and the web
    E.g., web click data, medical records, biology, engineering
  • Applications that cannot be programmed by hand
    E.g., autonomous helicopters, handwriting recognition, most of Natural Language Processing (NLP), Computer Vision
  • Self-customizing programs
    E.g., Amazon and Netflix product recommendations
  • Understanding human learning (brain, real AI)

Introduction

  • Definition
    • Arthur Samuel (1959). Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed
    • Tom Mitchell (1998). Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine Learning Algorithms:

Supervised learning

  • “Right answers” given
  • Regression: Predict continuous valued output
  • Classification: Discrete valued output (0 or 1 or others)

Unsupervised learning

  • Allows us to approach problems with little or no idea what our results should look like.
  • We can derive structure from data where we do not necessarily know the effect of the variables.
  • We can derive this structure by clustering the data based on relationships among the variables in the data.
  • There is no feedback based on the prediction results.

  • E.g., Organize computing clusters, social network analysis, market segmentation, astronomical data analysis

Model representation

  • Notation
    • m = Number of training examples
    • x’s = “input” variable / features
    • y’s = “output” variable / “target” variable
  • Hypothesis
    • Training set → Learning Algorithm → Hypothesis (h)
    • x → h → estimated value of y
    • h: X → Y
    • h maps x’s to y’s
    • Linear regression with one variable, also called univariate linear regression
    • $ h_{\theta}(x) = \theta_{0} + \theta_{1}x $
    • shorthand: $ h(x) $
    • $ \theta_{i} $’s: parameters

Cost function

  • Idea: Choose $ \theta_{0}, \theta_{1} $ so that $ h_{\theta}(x) $ is close to y for our training examples (x,y)
    • $ \underset{ \theta_{0}, \ \theta_{1} }{ \mathrm{minimize} } \ \dfrac{1}{2m} \displaystyle\sum_{i=1}^m ( \hat{y}^{(i)} - y^{(i)} )^2 $
    • where $ \hat{y}^{(i)} = h_{\theta}(x^{(i)}) = \theta_{0} + \theta_{1}x^{(i)} $
  • Cost function: $ J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} )^2 $
    • Octave: J = 1/(2*m) * sum((X*theta - y).^2)
    • also called squared error cost function or squared error function
  • Goal: $ \displaystyle \underset{ \theta_{0}, \ \theta_{1} }{ \mathrm{minimize} } J(\theta_0, \theta_1) $

  • $ J(\theta_0, \theta_1) $ can be visualized with contour plots (contour figures), where each contour line connects parameter pairs with equal cost
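
  A minimal Octave sketch of the squared error cost on toy data (the data and the computeCost name are illustrative assumptions, not from the course):

    computeCost = @(X, y, theta) 1/(2*length(y)) * sum((X*theta - y).^2);
    X = [1 1; 1 2; 1 3];            % design matrix with the x0 = 1 column
    y = [1; 2; 3];
    J = computeCost(X, y, [0; 1])   % theta = [0; 1] fits exactly, so J = 0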

Gradient descent

  • Outline:
    • Start with some $ \theta_{0}, \ \theta_{1} $
    • Keep changing $ \theta_{0}, \ \theta_{1} $ to reduce $ J(\theta_{0}, \theta_{1}) $ until we hopefully end up at a minimum
  • Algorithm
    • repeat until convergence { $ \theta_j := \theta_j - \alpha \displaystyle \frac{\partial}{\partial \theta_j} J(\theta_0, \ \theta_1) \ \ (\mathrm{for} \ j = 0 \ \mathrm{and} \ j = 1 ) $ }
    • α: learning rate: too small → slow convergence; too large → can overshoot the minimum and fail to converge. With a fixed α, gradient descent automatically takes smaller steps as the derivative shrinks near a minimum, so there is no need to decrease α over time
    • Correct simultaneous update
      • temp0 := $ \theta_0 - \alpha \displaystyle \frac{\partial}{\partial \theta_0} J(\theta_0, \ \theta_1) $
      • temp1 := $ \theta_1 - \alpha \displaystyle \frac{\partial}{\partial \theta_1} J(\theta_0, \ \theta_1) $
      • $ \theta_0 := $ temp0
      • $ \theta_1 := $ temp1
    • linear regression:
      • repeat until convergence:
      • $ \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)}) $
      • $ \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x^{(i)}) - y^{(i)})\, x^{(i)}\right) $
      • Octave: theta = theta - alpha/m * (X' * (X*theta - y))
  • “Batch” Gradient Descent:
    • “Batch”: Each step of gradient descent uses all the training examples
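
  A minimal Octave sketch of the batch update loop (toy data; alpha and the iteration count are illustrative settings):

    X = [1 1; 1 2; 1 3];  y = [1; 2; 3];   % toy data, x0 = 1 column included
    theta = zeros(2, 1);  alpha = 0.01;  m = length(y);
    for iter = 1:1500
      theta = theta - alpha/m * (X' * (X*theta - y));  % simultaneous update
    end
    theta   % approaches [0; 1] for this toy data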

Multivariate Linear Regression

Multiple Features (variables)

  • n = number of features
  • $ x^{(i)} \text{ for } i \in \{ 1, \dots, m \} $ = input (features) of the $ i^{th} $ training example
    • $ x^{(i)} = [ x_{1}^{(i)} \ x_{2}^{(i)} \ \dots \ x_{n}^{(i)} ]^{T} \in \mathbb{R}^{n} $
  • $ x_{j}^{(i)} $ = value of the $ j^{th} $ feature in the $ i^{th} $ training example
  • Hypothesis:
    • $ x = [ x_{0} \ x_{1} \ x_{2} \ \dots \ x_{n} ]^{T} \in \mathbb{R}^{n+1} \ \ (x_{0}^{(i)} = 1) $
    • $ \theta = [ \theta_{0} \ \theta_{1} \ \theta_{2} \ \dots \ \theta_{n} ]^{T} \in \mathbb{R}^{n+1} $
    • $ h_{\theta}(x) = \theta_{0}x_{0} + \theta_{1}x_{1} + \theta_{2}x_{2} + \dots + \theta_{n}x_{n} \ \ (x_{0} = 1) $
      $ \ \ \ \ \ = \theta^{T}x $
  • Parameters: $ \theta_{0}, \theta_{1}, \dots , \theta_{n} $
  • Cost function: $ J(\theta) = J(\theta_{0}, \theta_{1}, \dots , \theta_{n}) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m ( h_{\theta}(x^{(i)}) - y^{(i)} )^2 $
  • Gradient descent:
    Repeat {
    $ \ \ \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j} J(\theta) $
    } (simultaneously update for every $ j = 0, \dots, n $ )
  • New algorithm (n ≥ 1):
    Repeat {
    $ \ \ \theta_j := \theta_j - \alpha \dfrac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} $
    } (simultaneously update $ \theta_j $ for $ j = 0, \dots, n $ )

  • Feature Scaling
    • Divide each feature by its range: $ x_j := \dfrac{x_j}{s_j} $, where $ s_j $ is the range (max − min) of feature $ j $,
      to get every feature into approximately a $ -1 \leq x_j \leq 1 $ range
    • Mean normalization: replace $ x_j $ with $ \dfrac{x_j - \mu_j}{s_j} $ (do not apply to $ x_0 = 1 $)
      to get features into approximately a $ -0.5 \leq x_j \leq 0.5 $ range
  • Choosing the learning rate
    • “Debugging”: plot $ J(\theta) $ against the number of iterations;
      with a sufficiently small α, $ J(\theta) $ should decrease after every iteration → convergence
    • If $ J(\theta) $ increases or oscillates → use a smaller α;
      if convergence is slow → use a larger α
  • Creating new features: combine existing features into a single, more informative one
  • Polynomial regression: change the behavior or curve of the hypothesis function by making it a quadratic, cubic, square root, or other function (feature scaling becomes very important here)
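
  A minimal Octave sketch of the mean normalization described above, using the standard deviation as $ s_j $ (one common choice; X here is the raw feature matrix without the x0 column):

    mu = mean(X);                 % per-feature means (1 x n row vector)
    sigma = std(X);               % per-feature standard deviations
    X_norm = (X - mu) ./ sigma;   % subtract mean, divide by spread (broadcast)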

Normal equation

  • Definition: Method to solve for θ analytically
  • Formula: $ \displaystyle\theta = (X^{T} X)^{-1} X^{T} y $
    • Octave: pinv(X'*X)*X'*y
    • No need for feature scaling
    • If n is very large it is slow, since computing $ (X^TX)^{-1} $ costs roughly $ O(n^3) $
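
  A minimal Octave sketch on toy data (values are illustrative):

    X = [1 1; 1 2; 1 3];  y = [2; 4; 6];   % toy data with the x0 = 1 column
    theta = pinv(X' * X) * X' * y          % gives [0; 2] for this data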

Classification

  • Binary classification problem: y = 0 or 1
  • Using linear regression:
    • Map all hypothesis outputs of at least 0.5 to 1 and all below 0.5 to 0
    • $ h_\theta(x) $ can be > 1 or < 0, so it does not work well
  • Logistic Regression: make $ 0 \le h_\theta(x) \le 1 $

Logistic Regression

  • Classification and Representation
    • Sigmoid (logistic) function: $ g(z) = \displaystyle\frac{1}{1+e^{-z}} $
    • Octave: h = 1.0 ./ (1.0 + exp(-z))
    • $ h_\theta(x) = g(\theta^{T}x) = \displaystyle\frac{1}{1+e^{-\theta^{T}x}} $
    • $ h_\theta(x) = $ estimated probability that y = 1 on input x
    • $ h_\theta(x) = P(y=1 \vert x ; \theta) = 1 - P(y=0 \vert x ; \theta) $
  • Decision boundary
    • To get discrete 0 or 1 classification, can translate the output of the hypothesis function as:
      • $ h_\theta(x) \geq 0.5 \rightarrow y = 1 $
      • $ h_\theta(x) < 0.5 \rightarrow y = 0 $
    • $ g(z) \geq 0.5 $ when $ z \geq 0 $, so $ h_\theta(x) = g(\theta^T x) \geq 0.5 $ when $ \theta^T x \geq 0 $
    • From these statements we can say:
      • $ \theta^T x \geq 0 \Rightarrow y = 1 $
      • $ \theta^T x < 0 \Rightarrow y = 0 $
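    A minimal Octave sketch of this decision rule (assumes theta and a design matrix X with the x0 = 1 column are already defined):

      g = @(z) 1.0 ./ (1.0 + exp(-z));   % sigmoid
      p = (g(X * theta) >= 0.5);         % 1 where theta'*x >= 0, else 0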

  • Cost function
    • $ J(\theta) = \displaystyle\dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) $
    • $ \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) $ if y = 1
    • $ \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) $ if y = 0
    • Compress: $ \displaystyle \mathrm{Cost}(h_\theta(x),y) = -y\log(h_\theta(x)) -(1-y)\log(1-h_\theta(x)) $
    • Vectorized implementation:
      $ h = g(X\theta) $
      $ \displaystyle J(\theta) = \frac{1}{m} \cdot (-y^{T}\log(h)-(1-y)^{T}\log(1-h)) $
      Octave: the cost: J = 1/m * (-y' * log(h)-(1-y)'*log(1-h))
      the partial derivatives: grad = 1/m * X'*(h-y)
  • Gradient Descent
    • Repeat {
      general form:
      $ \displaystyle \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) $
      derivative part using calculus:
      $ \displaystyle \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} $
      }
    • Vectorized implementation:
      $\displaystyle \theta := \theta - \frac{\alpha}{m} X^T (g(X\theta) - \vec y) $
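      A minimal Octave sketch of this vectorized update on toy data (alpha and the iteration count are illustrative):

        g = @(z) 1.0 ./ (1.0 + exp(-z));
        X = [1 -2; 1 -1; 1 1; 1 2];  y = [0; 0; 1; 1];   % x0 = 1 column included
        theta = zeros(2, 1);  alpha = 0.1;  m = length(y);
        for iter = 1:5000
          theta = theta - alpha/m * (X' * (g(X*theta) - y));
        end
        g(X*theta) >= 0.5   % predicts [0; 0; 1; 1] on the training set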
  • Optimization algorithm
    • Gradient Descent
    • Conjugate gradient
    • BFGS
    • L-BFGS
  • Code in Octave:
      function [jVal, gradient] = costFunction(theta)
        jVal = [...code to compute J(theta)...];
        gradient = [...code to compute derivative of J(theta)...];
      end
    • Use Octave’s fminunc optimization function:
      options = optimset('GradObj', 'on', 'MaxIter', 100);
      initialTheta = zeros(2,1);
      [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
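      For instance, a runnable toy instance minimizing $ J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2 $ (an assumed example; the optimum is $ \theta = [5; 5] $):

        1;  % script-file marker so Octave allows the function definition below
        function [jVal, gradient] = costFunction(theta)
          jVal = (theta(1) - 5)^2 + (theta(2) - 5)^2;
          gradient = [2*(theta(1) - 5); 2*(theta(2) - 5)];
        end
        options = optimset('GradObj', 'on', 'MaxIter', 100);
        [optTheta, functionVal] = fminunc(@costFunction, zeros(2,1), options)
        % optTheta approaches [5; 5] and functionVal approaches 0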
  • Multiclass Classification: One-vs-all
    • Train a logistic regression classifier $ h_\theta^{(i)}(x) $ for each class i to predict the probability that y = i:
      $ h_\theta^{(i)}(x) = P(y = i \vert x;\theta) \ \ (i = 1, 2, \dots, n) $
    • On a new input x, to make a prediction, pick the class i that maximizes the classifier output:
      prediction = $ \displaystyle\underset{i}{\mathrm{argmax}} \ h_\theta^{(i)}(x) $
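
    A minimal Octave sketch of the prediction step (all_theta is a hypothetical K x (n+1) matrix holding one trained classifier per row; X includes the x0 = 1 column):

      g = @(z) 1.0 ./ (1.0 + exp(-z));
      probs = g(X * all_theta');              % m x K matrix of P(y = i | x; theta)
      [~, predictions] = max(probs, [], 2);   % index of the max along each row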

Overfitting

  • “Underfit” (“high bias”): does not fit the training data very well
  • “Just right”
  • “Overfit” (“high variance”): may fit the training set very well, but fails to generalize to new examples

Regularization

  • Using the cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting.
    $ \displaystyle\min_\theta\, \dfrac{1}{2m}\, \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda \sum_{j=1}^n \theta_j^2 \right] $
  • $ \lambda $ (lambda) is the regularization parameter; it controls the trade-off between fitting the training data well and keeping the parameters small
  • If lambda is too large, it may cause underfitting.

Linear Regression

  • Gradient Descent:
    Repeat {
    $ \displaystyle\theta_0 := \theta_0 - \alpha\, \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} $
    $ \displaystyle\theta_j := \theta_j - \alpha\, \left[ \left( \frac{1}{m}\, \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] $ $ \displaystyle\quad j \in \lbrace 1, 2, \dots, n\rbrace $
    $ \displaystyle\Rightarrow \theta_j := \theta_j \left(1 - \alpha\,\frac{\lambda}{m}\right) - \alpha\, \frac{1}{m}\, \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} $
    }

  • Normal Equation:
    $ \displaystyle\theta = ( X^TX + \lambda \cdot L )^{-1} X^Ty $
    $ \text{where } L = \begin{bmatrix} 0 & & & & \\ & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{bmatrix} $ is an $ (n+1) \times (n+1) $ matrix (a zero in the top-left corner, ones on the rest of the diagonal)
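
  A minimal Octave sketch on toy data (lambda and the data are illustrative):

    X = [1 1; 1 2; 1 3];  y = [2; 4; 6];  lambda = 1;  % x0 = 1 column included
    L = eye(size(X, 2));  L(1, 1) = 0;    % do not regularize theta_0
    theta = pinv(X'*X + lambda*L) * X'*y  % lambda shrinks the fitted slope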

Logistic Regression

  • Cost function:
    • $ J(\theta) = - \displaystyle\frac{1}{m} \sum_{i=1}^m [ y^{(i)}\, \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\, \log (1 - h_\theta(x^{(i)}))] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2 $
    • The sum $ \sum_{j=1}^n \theta_j^2 $ starts at $ j = 1 $, which explicitly excludes the bias term $ \theta_0 $ from the penalty
    • Octave (theta(1), i.e. $ \theta_0 $, must be excluded from the penalty):
      • the cost:
        J = 1/m * (-y' * log(h)-(1-y)'*log(1-h)) + (lambda/(2*m))*sum(theta(2:end).^2)
      • the partial derivatives:
        temp = theta; temp(1) = 0;
        grad = 1/m * X'*(h-y) + (lambda/m)*temp
  • Gradient descent:
    • $ \displaystyle h_\theta(x) = \frac{1}{1+e^{-\theta^{T}x}} $
      Repeat {
      $ \displaystyle\theta_0 := \theta_0 - \alpha\, \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} $
      $ \displaystyle\theta_j := \theta_j (1- \alpha\,\frac{\lambda}{m}) - \alpha\, \frac{1}{m}\, \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})\, x_j^{(i)} $ $ \displaystyle\quad j \in \lbrace 1, 2, \dots, n\rbrace $
      }
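
    A minimal Octave sketch of one regularized descent step (assumes X, y, theta, alpha, and lambda are defined as above):

      g = @(z) 1.0 ./ (1.0 + exp(-z));
      m = length(y);
      h = g(X * theta);
      temp = theta;  temp(1) = 0;   % leave theta_0 unpenalized
      theta = theta - alpha * (1/m * X'*(h - y) + lambda/m * temp);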

Neural Networks

  • Origins: Algorithms that try to mimic the brain

Model representation (NN)

  • Neuron model: Logistic unit
    • Sigmoid (logistic) activation function: $ g(z) = \displaystyle\frac{1}{1+e^{-z}} $
    • Parameters $ \theta $ are also called “weights”
    • The 1st layer is the “input layer”; the last layer is the “output layer”; the middle layers are “hidden layers”
  • Notations
    • $ x_0 = 1 $ is a “bias unit”
    • $ a_i^{(j)} = $ “activation” of unit $ i $ in layer $ j $
    • $ \Theta^{(j)} = $ matrix of weights controlling function mapping from layer $ j $ to layer $ j+1 $

    • If a network has $ s_j $ units in layer $ j $ and $ s_{j+1} $ units in layer $ j + 1 $, then $ \Theta^{(j)} $ will be of dimension $ s_{j+1} \times (s_j + 1) $
  • Forward propagation: vectorized implementation (a minimal Octave sketch follows this list)
    • $ a^{(1)} = x $
    • $ z^{(j)} = \Theta^{(j-1)} a^{(j-1)} $
    • $ a^{(j)} = g(z^{(j)}) $
    • Add $ a_0^{(j)} = 1 $
    • $ z^{(j+1)} = \Theta^{(j)} a^{(j)} $
    • $ h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)}) $
  • Other network architectures
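
  A minimal Octave sketch of forward propagation through one hidden layer (Theta1 and Theta2 are hypothetical, randomly initialized weight matrices):

    g = @(z) 1.0 ./ (1.0 + exp(-z));
    x = [0.5; -1.2];             % one input example (n = 2 features)
    Theta1 = rand(3, 3) - 0.5;   % maps layer 1 (2 units + bias) to layer 2 (3 units)
    Theta2 = rand(1, 4) - 0.5;   % maps layer 2 (3 units + bias) to the output unit
    a1 = [1; x];                 % add the bias unit a0 = 1
    a2 = [1; g(Theta1 * a1)];    % hidden layer activations, plus bias
    h  = g(Theta2 * a2)          % network output h_Theta(x)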

To be continued

  • Others: reinforcement learning, recommender systems

  • Practical advice for applying learning algorithms

Support


Thank you for your support.


Open Alipay or WeChat and scan the QR code to leave a tip.