ilgyu-yi

기본개념 미정리 문서

2018-08-18

KAIST 문인철교수 강의 / Utah Univ cs5350 / Georgia Tech cs7641 / PRML
참고하여 헷갈릴 수 있는 개념 위주로 정리

MLE (Maximum Likelihood Estimation)

MLE의 정확한 의미

observed data의 등장확률을 최대화하는 모수를 찾음

θ^=argmaxθP(Dθ)\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta)

Binomial에서의 MLE (Thumbtack flip의 예)

P(Dθ)=θaH(1θ)aTP(D|\theta) = \theta^{a_H}(1-\theta)^{a_T}

lnP(Dθ)=aHlnθ+aTln(1θ)\ln{P(D|\theta)} = a_H\ln\theta+a_T\ln(1-\theta)

θ^=argmaxθP(Dθ)=argmaxθlnP(Dθ)=argmaxθ{aHlnθ+aTln(1θ)}\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta) = \underset{\theta}{\operatorname{argmax}}\ln{P(D|\theta)} = \underset{\theta}{\operatorname{argmax}}\{a_H\ln\theta+a_T\ln(1-\theta)\}

ddθ{aHlnθ+aTln(1θ)}=0{d\over{d\theta}}\{a_H\ln\theta+a_T\ln(1-\theta)\} = 0

aHθaT1θ=0{a_H\over\theta}-{a_T\over{1-\theta}} = 0

θ^=aHaH+aT\hat\theta = {a_H\over{a_H+a_T}}

Simple Error Bound (Hoeffding's inequality)

P(θ^θϵ)2exp(2Nϵ2)P(|\hat\theta-\theta^*|\geq\epsilon)\leq2\exp{(-2N\epsilon^2)}

θ^\hat\theta은 probability PP, approximately ϵ\epsilon에 대하여 "PAC(Probably Approximate Correct) learning의 결과"

MLE의 단점

MAP (Maximum a Posteriori Estimation)

MAP의 정확한 의미

observed data에 대하여 최대 확률을 가지는 모수를 찾음

θ^=argmaxθP(θD)\hat\theta = \underset{\theta}{\operatorname{argmax}}P(\theta|D)

P(θD)=P(Dθ)P(θ)P(D)P(\theta|D) = {P(D|\theta)P(\theta)\over{P(D)}}

posterior=likelihoodpriornormalizningconstantposterior = {likelihood * prior\over{normalizning\quad{constant}}}

P(Dθ)=θaH(1θ)aTP(D|\theta) = \theta^{a_H}(1-\theta)^{a_T}

P(θ)=?P(\theta) = ?

P(θ)=θα1(1θ)β1B(α,β)P(\theta) = {\theta^{\alpha-1}(1-\theta)^{\beta-1}\over{B(\alpha,\beta)}}

B(α,β)=Γ(α)Γ(β)Γ(α+β)B(\alpha,\beta) = {\Gamma(\alpha)\Gamma(\beta)\over\Gamma(\alpha+\beta)}

Γ(α)=(α1)!\Gamma(\alpha) = (\alpha - 1)!

B(α,β)=constB(\alpha,\beta) = const

P(θD)θaH(1θ)aTθα1(1θ)β1=θaH+α1(1θ)aT+β1P(\theta|D)\propto\theta^{a_H}(1-\theta)^{a_T}\theta^{\alpha-1}(1-\theta)^{\beta-1} = \theta^{a_H+\alpha-1}(1-\theta)^{a_T+\beta-1}

θ^=argmaxθP(θD)\hat\theta = \underset{\theta}{\operatorname{argmax}}P(\theta|D)

θ^=aH+α1aH+aT+α+β2\hat\theta = {a_H+\alpha-1\over{a_H+a_T+\alpha+\beta-2}}

MLE, MAP 되새김질

Distribution

Gaussian Distribution

f(x;μ,σ)=1(2π)1/2σexp{(xμ)2σ22}f(x;\mu,\sigma) = {1\over(2\pi)^{1/2}\sigma}exp{\left\{-{(x-\mu)^2\sigma^{-2}\over2}\right\}}

Multivariable Gaussian Distribution

f(x;μ,Σ)=1(2π)D/2Σ1/2exp{(xμ)TΣ1(xμ)2}f(x;\mu,\Sigma) = {1\over(2\pi)^{D/2}|\Sigma|^{1/2}}exp{\left\{-{(x-\mu)^T\Sigma^{-1}(x-\mu)\over2}\right\}}

Beta Distribution

f(x;α,β)=xα1(1x)β101uα1(1u)β1du=xα1(1x)β1B(α,β)=Γ(α+β)Γ(α)Γ(β)xα1(1x)β1f(x;\alpha,\beta) = {x^{\alpha-1}(1-x)^{\beta-1}\over\int_0^1{u^{\alpha-1}(1-u)^{\beta-1}du}} = {x^{\alpha-1}(1-x)^{\beta-1}\over{B(\alpha,\beta)}} = {\Gamma(\alpha+\beta)\over\Gamma(\alpha)\Gamma(\beta)}x^{\alpha-1}(1-x)^{\beta-1}

Binomial Distribution

f(k;n,p)=(nk)pk(1p)nkf(k;n,p) = {n\choose{k}}p^k(1-p)^{n-k}

(nk)=n!k!(nk)!{n\choose{k}} = {n!\over{k!(n-k)!}}

Multinomial Distribution

f(x1,,xk;n,p1,,pk)=n!x1!xk!p1x1pkxkf(x_1,\cdots,x_k;n,p_1,\cdots,p_k) = {n!\over{x_1!{\cdots}x_k!}}p_1^{x_1}{\cdots}p_k^{x_k}

Gamma function

Γ(x)=0tx1etdt\Gamma(x) = \int_0^\infty{t^{x-1}\over{e^t}}dt

Γ(x)=limn1nx(x+n)nx=limnn!(x1)!(x+n)!nx=limnn!nx(x+n)!(x1)!\Gamma(x) = \underset{n\rightarrow\infty}{\lim}{1{\cdots}n\over{x\cdots(x+n)}}n^x = \underset{n\rightarrow\infty}{\lim}{n!(x-1)!\over{(x+n)!}}n^x = \underset{n\rightarrow\infty}{\lim}{n!n^x\over{(x+n)!}}(x-1)!

limnn!nx(x+n)!=1\underset{n\rightarrow\infty}{\lim}{n!n^x\over{(x+n)!}}=1

(x1)!(x-1)!

Γ(x+1)=0txetdt=[txet]t=0t=(0xtx1etdt)\Gamma(x+1) = \int_0^\infty{t^x\over{e^t}}dt = \left[-t^xe^{-t}\right]_{t=0}^{t=\infty} - \left({\int_0^\infty}-xt^{x-1}e^{-t}dt\right)

=limt(txet)0+(0xtx1etdt)= \underset{t\rightarrow\infty}{\lim}(-t^xe^{-t}) - 0 + \left({\int_0^\infty}xt^{x-1}e^{-t}dt\right)

=x0tx1etdt=xΓ(x)= x{\int_0^\infty}t^{x-1}e^{-t}dt = x\Gamma(x)

Γ(x+1)=xΓ(x)\Gamma(x+1) = x\Gamma(x)

Γ(1)=1,0!=1\Gamma(1) = 1,{\quad}0! = 1

Γ(n+1)=n!,Γ(n)=(n1)!\therefore\Gamma(n+1) = n!,{\quad}\Gamma(n) = (n-1)!

Beta function

B(x,y)=01tx1(1t)y1dt=Γ(x)Γ(y)Γ(x+y)B(x,y) = \int_0^1t^{x-1}(1-t)^{y-1}dt = {\Gamma(x)\Gamma(y)\over\Gamma(x+y)}

B(n,m)=(n1)!(m1)!(n+m2)!B(n,m) = {(n-1)!(m-1)!\over(n+m-2)!}

Rule-Based ML

perfect world

Find-S Algorithm

initialize h to the most specific in H ...[0]
for instance x in D:
    if y of x is positive: ...[1]
        for feature f in O:
            if (f in h) == (f in x): ...[1-1]
                do nothing
            else: ...[1-2]
                (f in h) = (f in h) U (f in x)

...[0]: hypothesis h를 할 수 있는 한 가장 specific하게 초기설정 (어떤 feature로도 만족시킬 수 없는 경우로 시작)

...[1]: Dataset의 instance x에 대하여, x의 레이블값이 참일 경우

...[1-1]: h의 f와 x의 f가 동일한 경우 그대로

...[1-2]: h의 f와 x의 f가 다를 경우 둘 모두를 포함하는 범위로 h의 f를 수정

점점 범위를 넓혀 나가서 모든 dataset을 만족하는 최소집합(most specific)의 Rule을 찾음

Candidate Elimination Algorithm

initialize S to maximally specific h in H ...[0-1]
initialize G to maximally general h in H ...[0-2]
for instance x in D:
    if y of x is positive:
        generalize S as much as needed to cover o in x
        remove any h in G, for which h(o) != y
    if y of x is negative:
        specialize G as much as needed to exclude o in x
        remove any h in S, for which h(p) = y
generate h that satisfies (s in S, g in G, g >= h >= s)

Decision Tree

Entropy

H(X)=iP(xi)I(xi)=iP(xi)lnP(xi)=p(x)lnp(x)dxH(X) = {\sum_i}P(x_i)I(x_i) = -{\sum_i}P(x_i){\ln}P(x_i) = -{\int}p(x){\ln}p(x)dx

H(YX)=xp(x)H(YX=x)H(Y|X) = {\sum_x}p(x)H(Y|X=x)

=xp(x)yp(yx)lnp(yx)= -{\sum_x}p(x){\sum_y}p(y|x){\ln}p(y|x)

=xyp(x,y)lnp(yx)= -\sum_x{\sum_y}p(x,y){\ln}p(y|x)

=x,yp(x,y)lnp(yx)= -\sum_{x,y}p(x,y){\ln}p(y|x)

=x,yp(x,y)ln{p(x,y)p(x)}= -\sum_{x,y}p(x,y){\ln}\left\{p(x,y)\over{p(x)}\right\}

=x,yp(x,y)ln{p(x)p(x,y)}= \sum_{x,y}p(x,y){\ln}\left\{p(x)\over{p(x,y)}\right\}

Information Gain

IG(Y,Ai)=H(Y)H(YAi)IG(Y, A_i) = H(Y) - H(Y|A_i)

Top-down Induction Algorithm

select an open node to split
select a best variable to split
for values of the selected variable:
    sort instances with the value of the selected variable
    put the sorted items under the branch of the value of the variable
    if the sorted items are all in one class:
        close the leaf node of the branch

Linear Regression

Hypothesis

h:f^(x;θ)=θ0+i=1nθixi=i=0nθixi(x0=1)h: \hat{f}(x;\theta) = \theta_0+\sum_{i=1}^n\theta_ix_i = \sum_{i=0}^n\theta_ix_i\quad(x_0=1)

f(x;θ)=i=0nθixi+e=yf(x;\theta) = \sum_{i=0}^n\theta_ix_i+e=y

f:Xθ+e=Yf: X\theta+e = Y

θ^=argminθ(ff^)2=argminθ(YXθ)2=argminθ(YXθ)T(YXθ)\hat\theta = \underset\theta{\operatorname{argmin}}(f-\hat{f})^2 = \underset\theta{\operatorname{argmin}}(Y-X\theta)^2 = \underset\theta{\operatorname{argmin}}(Y-X\theta)^T(Y-X\theta)

=argminθ(YTY2θTXTY+θTXTXΘ)=argminθ(θTXTXθ2θTXTY)= \underset\theta{\operatorname{argmin}}(Y^TY-2\theta^TX^TY+\theta^TX^TX\Theta) = \underset\theta{\operatorname{argmin}}(\theta^TX^TX\theta-2\theta^TX^TY)

θ(θTXTXθ2θTXTY)=0\nabla_\theta(\theta^TX^TX\theta-2\theta^TX^TY) = 0

2XTXθ2XTY=02X^TX\theta-2X^TY = 0

θ=(XTX)1XTY\theta = (X^TX)^{-1}X^TY

(XTX)1XT=X+(MoorePenrosePseudoinverse)(X^TX)^{-1}X^T = X^+(Moore-Penrose\,Pseudoinverse)

θ=X+Y\theta = X^+Y

Naive Bayes Classifier

Optimal Classification

f^=argminfP(f(X)Y)\hat{f} = \underset{f}{\operatorname{argmin}}P(f(X){\neq}Y)

y^=argmaxyθiΘP(yθi)P(θiD)\hat{y} = \underset{y}{\operatorname{argmax}}\sum_{\theta_i{\in}\Theta}P(y|\theta_i)P(\theta_i|D)

y^=argmaxyP(Y=yX=x)=argmaxyP(X=xY=y)P(Y=y)\hat{y} = \underset{y}{\operatorname{argmax}}P(Y=y|X=x) = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)

Bayes Risk - 추후 재정리

Optimal Classification의 한계

y^=argmaxyP(X=xY=y)P(Y=y)\hat{y} = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)

Conditional Independence Assumption

P(X=<x1,,xi>Y=y)iP(Xi=xiY=y)P(X = <x_1, \cdots, x_i>|Y = y) \approx \prod_iP(X_i = x_i|Y = y)

P(x1x2,y)=P(x1y)P(x_1|x_2,y) = P(x_1|y)

Conditional(local) vs Marginal Independence

P(X)=P(XY)P(X) = P(X|Y)

P(x1x2,y)=P(x1y)P(x_1|x_2,y) = P(x_1|y)

Naive Bayes Classifier

y^=argmaxyP(X=xY=y)P(Y=y)\hat{y} = \underset{y}{\operatorname{argmax}}P(X=x|Y=y)P(Y=y)

argmaxyP(Y=y)1idP(Xi=xiY=y)=fNB(x)\approx \underset{y}{\operatorname{argmax}}P(Y=y)\prod_{1{\leq}i{\leq}d}P(X_i = x_i|Y = y) = f_{NB}(x)

Logistic Regression

σ(z)=11+exp(z)\sigma(z) = {1\over{1+\exp(-z)}}

logit(p)=ln(p1p)logit(p) = \ln\left({p\over{1-p}}\right)

z=ln(p1p)=Xθz = \ln\left({p\over{1-p}}\right) = X\theta

hθ(x)=σ(θTx)=11+exp(θTx)h_\theta(x) = \sigma(\theta^Tx) = {1\over{1+\exp(-\theta^Tx)}}

P(yx)=μ(x)y(1μ(x))1yP(y|x) = \mu(x)^y(1-\mu(x))^{1-y}

μ(x)=11+exp(θTx)=11+exp(Xθ)=P(y=1x)\mu(x) = {1\over{1+\exp(-\theta^Tx})} = {1\over{1+\exp(-X\theta})} = P(y=1|x)

Xθ=ln(μ(x)1μ(x))X\theta = \ln\left(\mu(x)\over{1-\mu(x)}\right)

θ^=argmaxθP(Dθ)\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta)

θ^=argmaxθP(Dθ)=argmaxθ1iNP(YiXi;θ)\hat\theta = \underset{\theta}{\operatorname{argmax}}P(D|\theta) = \underset{\theta}{\operatorname{argmax}}{\prod_{1{\leq}i{\leq}N}}P(Y_i|X_i;\theta)

=argmaxθln{1iNP(YiXi;θ)}=argmaxθ1iNln{P(YiXi;θ)}= \underset{\theta}{\operatorname{argmax}}\ln\left\{\prod_{1{\leq}i{\leq}N}P(Y_i|X_i;\theta)\right\} = \underset{\theta}{\operatorname{argmax}}\sum_{1{\leq}i{\leq}N}\ln\left\{P(Y_i|X_i;\theta)\right\}

P(YiXi;θ)=μ(Xi)Yi(1μ(Xi))1YiP(Y_i|X_i;\theta) = \mu(X_i)^{Y_i}(1-\mu(X_i))^{1-Y_i}

lnP(YiXi;θ)=Yilnμ(Xi)+(1Yi)ln(1μ(Xi)){\ln}P(Y_i|X_i;\theta) = Y_i\ln\mu(X_i)+(1-Y_i)\ln(1-\mu(X_i))

=Yi(lnμ(Xi)ln(1μ(Xi))+ln(1μ(Xi))= Y_i(\ln\mu(X_i)-\ln(1-\mu(X_i))+\ln(1-\mu(X_i))

=Yi(lnμ(Xi)1μ(Xi))+ln(1μ(Xi))= Y_i\left(\ln{\mu(X_i)\over{1-\mu(X_i)}}\right)+\ln(1-\mu(X_i))

=YiXiθ+ln(1μ(Xi))= Y_iX_i\theta + \ln(1-\mu(X_i))

=YiXiθln{1+exp(Xiθ)}= Y_iX_i\theta - \ln\{1+\exp(X_i\theta)\}

θj[1iNYiXiθln{1+exp(Xiθ)}]{\partial\over\partial\theta_j}\left[\sum_{1{\leq}i{\leq}N}Y_iX_i\theta - \ln\{1+\exp(X_i\theta)\}\right]

=1iNXi,j(Yiexp(Xiθ)1+exp(Xiθ))= \sum_{1{\leq}i{\leq}N}X_{i,j}\left(Y_i-{\exp(X_i\theta)\over{1+\exp(X_i\theta)}}\right)

=1iNXi,j(YiP(Yi=1Xi;θ))=0= \sum_{1{\leq}i{\leq}N}X_{i,j}(Y_i-P(Y_i=1|X_i;\theta)) = 0

θjt+1=θjt+hcf(θt)θj\theta_j^{t+1} = \theta_j^t+{h\over{c}}{\partial{f(\theta^t)}\over\partial\theta_j}

=θjt+hc1iNXi,j(Yiexp(Xiθt)1+exp(Xiθt))= \theta_j^t+{h\over{c}}{\sum_{1{\leq}i{\leq}N}X_{i,j}\left(Y_i-{\exp(X_i\theta^t)\over{1+\exp(X_i\theta^t)}}\right)}

1iNlnP(YiXi;θ)=1iN{Yilnμ(Xi)+(1Yi)ln(1μ(Xi))}-\sum_{1{\leq}i{\leq}N}{\ln}P(Y_i\|X_i;\theta) = -\sum_{1{\leq}i{\leq}N}\left\{Y_i\ln\mu(X_i)+(1-Y_i)\ln(1-\mu(X_i))\right\}

Generative - Discriminative Pair

Naive Bayes Classifier - Logistic Regression

Softmax Regression(Multinomial Logistic Regression)

σ(z)j=ezjk=1Kezk(forj=1,,K)\sigma(z)_j = {e^{z_j}\over{\sum_{k=1}^Ke^{z_k}}}{\quad}(for\,j=1,\cdots,K)

P(yx)=p1y1(1p1)1y1P(y|x) = p_1^{y_1}(1-p_1)^{1-y_1}

P(yx)=p1y1(1p1)1y1=p1y1p0y0P(y|x) = p_1^{y_1}(1-p_1)^{1-y_1} = p_1^{y_1}p_0^{y_0}

P(yx)=p1y1p2y2p3y3pCyCP(y|x) = p_1^{y_1}p_2^{y_2}p_3^{y_3}{\cdots}p_C^{y_C}

=1jCpjyj= \prod_{1{\leq}j{\leq}C}p_j^{y_j}

1iNlnP(Yi,jXi;θ)=1iN1jCYi,jlnμj(Xi)-\sum_{1{\leq}i{\leq}N}{\ln}P(Y_{i,j}\|X_i;\theta) = -\sum_{1{\leq}i{\leq}N}\sum_{1{\leq}j{\leq}C}Y_{i,j}\ln{\mu_j(X_{i})}

Support Vector Machine

Problem Definition

wTx+b=0w^Tx+b = 0

wTx++b=+δw^Tx_++b = +\delta

wTx+b=δw^Tx_-+b = -\delta

wTx++b+δ:y=+1w^Tx_++b \geq +\delta : y = +1

wTx+bδ:y=1w^Tx_-+b \leq -\delta : y = -1

y(wTx+b)δy(w^Tx+b) \geq \delta

2δw{2\delta\over\lVert{w}\rVert}

y(wTx+b)/δ1y(w^Tx+b)/\delta \geq 1

wTx+b=0,(wTx+b)/δ=0w^Tx+b = 0, \quad (w^Tx+b)/\delta = 0

y(wTx+b)1y(w^Tx+b) \geq 1

y(wTx+b)10y(w^Tx+b)-1 \geq 0

minw\operatorname{min}{\lVert{w}\rVert}

min12w2\operatorname{min}{1\over{2}}\lVert{w}\rVert^2

Soft-margin and Penalization

min12w2+C#error\operatorname{min}{1\over{2}}\lVert{w}\rVert^2+C*\#_{error}

s.t.(wTxi+b)yi1s.t.\quad(w^Tx_i+b)y_i\geq{1}

min12w2+Cinξi\operatorname{min}{1\over{2}}\lVert{w}\rVert^2+C\sum_{i}^n\xi_i

s.t.yi(wTxi+b)1ξi,ξi0s.t.{\quad}y_i(w^Tx_i+b)\geq{1-\xi_i},\quad{\xi_i\geq{0}}

ξi=max(0,{1yi(wTxi+b)})\xi_i = max(0, \{1-y_i(w^Tx_i+b)\})

Langrangian Dual Problem

optimizef(x)optimize\,f(x)

subjecttosubject\,to

gi(x)0g_i(x)\leq{0}

hi(x)=0h_i(x) = 0

f(x)=i=1mμigi(x)+j=1lλjhj(x)\nabla{f(x^*)} = \sum_{i=1}^{m}\mu_i\nabla{g_i}(x^*)+\sum_{j=1}^{l}\lambda_j\nabla{h_j}(x^*)

f(x)=i=1mμigi(x)+j=1lλjhj(x)-\nabla{f(x^*)} = \sum_{i=1}^{m}\mu_i\nabla{g_i}(x^*)+\sum_{j=1}^{l}\lambda_j\nabla{h_j}(x^*)

gi(x)0,fori=1,,mg_i(x^*)\leq{0},{\quad}for\,i=1,\cdots,m

hj(x)=0,forj=1,,lh_j(x^*)=0,{\quad}for\,j=1,\cdots,l

μi0,fori=1,,m\mu_i\geq{0},{\quad}for\,i=1,\cdots,m

μigi(x)=0,fori=1,,m\mu_ig_i(x^*) = 0,{\quad}for\,i=1,\cdots,m

minLP=12w2+Ci=1nξi+i=1nαi{1yi(wTxi+b)ξi}+i=1nβiξi\operatorname{min}L_P = {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i+\sum_{i=1}^n\alpha_i\{1-y_i(w^Tx_i+b)-\xi_i\}+\sum_{i=1}^n-\beta_i\xi_i

=12w2+Ci=1nξii=1nαi{yi(wTxi+b)+ξi1}i=1nβiξi= {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\}-\sum_{i=1}^n\beta_i\xi_i

Lw=0Lb=0Lξ=0{\partial{L}\over{\partial{w}}} = 0\quad{\partial{L}\over{\partial{b}}} = 0\quad{\partial{L}\over{\partial{\xi}}} = 0

w=i=1nαiyixiw = \sum_{i=1}^n\alpha_iy_ix_i

i=1nαiyi=0\sum_{i=1}^n\alpha_iy_i = 0

i=1n(Cαiβi)=0\sum_{i=1}^n(C-\alpha_i-\beta_i) = 0

yi(wTxi+b)1ξi,ξi0y_i(w^Tx_i+b)\geq{1-\xi_i},\quad{\xi_i\geq{0}}

αi0,βi0\alpha_i\geq{0},\quad\beta_i\geq{0}

αi{yi(wTxi+b)+ξi1}=0,βiξi=0\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\} = 0,\quad\beta_i\xi_i = 0

Cαi0C\geq{\alpha_i}\geq{0}

minLP=12w2+Ci=1nξii=1nαi{yi(wTxi+b)+ξi1}i=1nβiξi\operatorname{min}L_P = {1\over{2}}\lVert{w}\rVert^2+C\sum_{i=1}^n\xi_i-\sum_{i=1}^n\alpha_i\{y_i(w^Tx_i+b)+\xi_i-1\}-\sum_{i=1}^n\beta_i\xi_i

maxLD=i=1nαi12i=1nj=1nαiαjyiyjxiTxj\operatorname{max}L_D = \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_jx_i^Tx_j

(i=1nαiyi=0)\left(\sum_{i=1}^n\alpha_iy_i = 0\right)

(αi{yi(wTxi+b)1}=0)\left(\alpha_i\{y_i(w^Tx_i+b)-1\} = 0\right)

(Cαi0)\left(C\geq{\alpha_i}\geq{0}\right)

Kernel Trick

maxLD=i=1nαi12i=1nj=1nαiαjyiyjϕ(xi)Tϕ(xj)\operatorname{max}L_D = \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_j\phi(x_i)^T\phi(x_j)

=i=1nαi12i=1nj=1nαiαjyiyjK(xi,xj)= \sum_{i=1}^n\alpha_i-{1\over{2}}\sum_{i=1}^n\sum_{j=1}^n\alpha_i\alpha_jy_iy_jK(x_i,x_j)

w=i=1nαiyiϕ(xi)w = \sum_{i=1}^n\alpha_iy_i\phi(x_i)

b=yii=1nαiyiϕ(xi)Tϕ(xj)b = y_i-\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x_j)

sign(wϕ(x)+b)=sign(i=1nαiyiϕ(xi)ϕ(x)+yii=1nαiyiϕ(xi)Tϕ(xj))sign(w\phi(x)+b) = sign\left(\sum_{i=1}^n\alpha_iy_i\phi(x_i)\phi(x)+y_i-\sum_{i=1}^n\alpha_iy_i\phi(x_i)^T\phi(x_j)\right)

=sign(i=1nαiyiK(xi,x)+yii=1nαiyiK(xi,xj))= sign\left(\sum_{i=1}^n\alpha_iy_iK(x_i,x)+y_i-\sum_{i=1}^n\alpha_iy_iK(x_i,x_j)\right)

K(x1,x2)=x1Tx2K(x_1,x_2) = x_1^Tx_2

K(x1,x2)=(γ(x1Tx2)+θ)dK(x_1,x_2) = (\gamma(x_1^Tx_2)+\theta)^d

K(x1,x2)=exp(γx1x22)K(x_1,x_2) = \exp(-\gamma\lVert{x_1-x_2}\rVert^2)

K(x1,x2)=tanh(γ(x1Tx2)+θ)K(x_1,x_2) = \tanh(\gamma(x_1^Tx_2)+\theta)

Training / Testing & Regularization

Source of error in two folds

EoutEin+ΩE_{out}\leq{E_{in} + \Omega}

Eout:EstimationerrorE_{out}:{\quad}Estimation\,error

Ein:ErrorfromapproximationE_{in}:{\quad}Error\,from\,approximation

Ω:Errorfromvarianceoftheobservation\Omega:{\quad}Error\,from\,variance\,of\,the\,observation

Bias-Variance Decomposition

(문인철교수 강의보다는 PRML의 식이 더 입맛에 맞아 이쪽의 notation으로 정리)

h(x)=E(tx)=tp(tx)dth(x) = E(t|x) = \int{tp(t|x)}dt

E(L)={y(x)t}2p(x,t)dxdtE(L) = \iint\{y(x)-t\}^2p(x,t)dxdt

={y(x)h(x)}2p(x)dx+{h(x)t}2p(x,t)dxdt= \int\{y(x)-h(x)\}^2p(x)dx+\iint\{h(x)-t\}^2p(x,t)dxdt

{y(x)h(x)}2p(x)dx=ED({y(x;D)h(x)}2)p(x)dx\int\{y(x)-h(x)\}^2p(x)dx = \int{E_D(\{y(x;D)-h(x)\}^2)p(x)dx}

ED({y(x;D)h(x)}2)E_D(\{y(x;D)-h(x)\}^2)

=ED({y(x;D)ED(y(x;D))+ED(y(x;D))h(x)}2)= E_D(\{y(x;D)-E_D(y(x;D))+E_D(y(x;D))-h(x)\}^2)

=ED({y(x;D)ED(y(x;D))}2+{ED(y(x;D))h(x)}2= E_D(\{y(x;D)-E_D(y(x;D))\}^2+\{E_D(y(x;D))-h(x)\}^2

+2{y(x;D)ED(y(x;D))}{ED(y(x;D))h(x)})+2\{y(x;D)-E_D(y(x;D))\}\{E_D(y(x;D))-h(x)\})

ED(y(x;D))h(x)=constE_D(y(x;D))-h(x) = const

ED(y(x;D)ED(y(x;D)))=0E_D(y(x;D)-E_D(y(x;D))) = 0

ED({y(x;D)h(x)}2)p(x)dx\therefore{\int{E_D(\{y(x;D)-h(x)\}^2)p(x)dx}}

={ED(y(x;D))h(x)}2p(x)dx+ED({y(x;D)ED(y(x;D))}2)p(x)dx= \int{\{E_D(y(x;D))-h(x)\}^2p(x)dx}+\int{E_D(\{y(x;D)-E_D(y(x;D))\}^2)p(x)dx}

E(L)={y(x)t}2p(x,t)dxdt\therefore{E(L) = \iint\{y(x)-t\}^2p(x,t)dxdt}

={ED(y(x;D))h(x)}2p(x)dx(bias)2= \int{\{E_D(y(x;D))-h(x)\}^2p(x)dx}\quad(bias)^2

+ED({y(x;D)ED(y(x;D))}2)p(x)dx(variance)+ \int{E_D(\{y(x;D)-E_D(y(x;D))\}^2)p(x)dx}\quad(variance)

+{h(x)t}2p(x,t)dxdt(noise)+ \iint\{h(x)-t\}^2p(x,t)dxdt\quad(noise)

Occam's Razor

Cross Validation

Performance metrics

|||실제정답|

Positive Negative
실험결과 Positive TP(True Positive) FP(False Positive)
Negative FN(False Negative) TN(True Negative)

Fβ=(1+β2)precisionrecallβ2precision+recallF_\beta={(1+\beta^2)*precision * recall\over\beta^2*precision+recall}

Regularization

PRML과 (http://enginius.tistory.com/476)의 게시글 정리, 블로그의 게시글에 추가적으로 볼 내용 있으나, Legendre polynomial / Augmented error파트는 다 읽지 못함 추후에 반드시 공부할 것

EoutEin+ΩE_{out}\leq{E_{in} + \Omega}

E(w)=ED(w)+λEW(w)E(w) = E_D(w)+\lambda{E_W(w)}

E(w)=12n=1N{tnwTϕ(xn)}2+12j=1MwjqE(w) = {1\over{2}}\sum_{n=1}^N\{t_n-w^T\phi(x_n)\}^2+{1\over{2}}\sum_{j=1}^M\lVert{w_j}\rVert^q

Mahalanobis Distance / Singular Value Decomposition (과정외)

DM2=(xμ)TΣ1(xμ)D_M^2 = (x-\mu)^T\Sigma^{-1}(x-\mu)

XD=USVT\overline{X_{D}} = USV^T

Σ=XDTXD=VSUTUSVT=VSSVT\Sigma = \overline{X_{D}}^T\overline{X_{D}} = VSU^TUSV^T = VSSV^T

Σ+=VS+S+VT\Sigma^+ = VS^+S^+V^T

DM2=(xμ)TΣ1(xμ)=(xμ)TVS+S+VT(xμ)D_M^2 = (x-\mu)^T\Sigma^{-1}(x-\mu) = (x-\mu)^TVS^+S^+V^T(x-\mu)

={(xμ)TVS+}{(xμ)TVS+}T= \{(x-\mu)^TVS^+\}\{(x-\mu)^TVS^+\}^T