
Week 11. RNN and LSTM

Neural networks for data that unfolds over time (speech, music, stock prices, sensor signals), and the gate structure of the LSTM that made them practical.

What you will learn this week

  1. Characteristics of sequence data
  2. vanilla RNN and BPTT
  3. The long-term dependency problem
  4. The three gates of the LSTM
  5. GRU: a simpler alternative

1. Sequence data: a world where time flows

The networks covered so far take a fixed-size vector as input. Even the images fed to the W9 CNN have a fixed size of $224 \times 224$. In practice, though, a great deal of data has variable length and an order that matters, such as speech, music, stock prices, and sensor signals.

Two properties make such data hard to handle with fully connected networks or CNNs.

Property 1: variable length. "안녕" ("hi") is a single word, while "오늘 날씨가 참 좋네요" ("the weather is really nice today") is four space-separated words. A fully connected network has a fixed input dimension, so it cannot process both sentences in the same way. Padding with zeros and truncating to a fixed length are workarounds, and truncation in particular loses information.

Property 2: order carries meaning. "나는 학교에 갔다" ("I went to school") and "갔다 학교에 나는" (the same words scrambled) contain exactly the same set of words but do not mean the same thing. If the words are simply pooled into a bag-of-words, the two sentences become indistinguishable. Furthermore, in "나는 어제 도서관에서 친구들과 새로 나온 책을 읽었다" ("Yesterday I read a newly released book with friends at the library"), the cue that determines the final verb (the subject "나는", "I") lies far back in the sequence. The model has to capture long-range dependencies.
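
The bag-of-words point can be checked in a few lines. This is a minimal sketch using only the standard library; the helper name `bag_of_words` is an illustrative assumption, and splitting on whitespace stands in for a real tokenizer.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count words while discarding their order (whitespace tokenization)."""
    return Counter(sentence.split())

a = "나는 학교에 갔다"
b = "갔다 학교에 나는"

# Same multiset of words, so the two bags are identical...
print(bag_of_words(a) == bag_of_words(b))  # True
# ...even though the ordered word sequences differ.
print(a.split() == b.split())              # False
```

Any model that only sees the bag therefore has no way to tell the two sentences apart.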

The network that addresses both requirements is the recurrent neural network (RNN).

2. Vanilla RNN: along the shared axis of time

The core idea of an RNN is just one: "pass a hidden state that summarizes everything seen so far on to the next time step". This hidden state $h_t$ acts as a kind of "memory".

The update rule of the simplest RNN (Elman, 1990):

$$ h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h) $$

$$ y_t = W_y h_t + b_y $$

Here $x_t$ is the input at time step $t$ (for example, the embedding of the current word), $h_t$ is the hidden state at time step $t$ (a summary of the entire past), and $y_t$ is the output at time step $t$. The key point is that every time step shares the same weight matrices $W_h, W_x, W_y$. Whether the sequence has length 10 or 1000, the number of parameters stays the same, which resolves the variable-length problem.

Let us walk through how it runs. Start with $h_0 = 0$. When the first word $x_1$ arrives, $h_1 = \tanh(W_x x_1 + b_h)$. When the second word $x_2$ arrives, $h_2 = \tanh(W_h h_1 + W_x x_2 + b_h)$, and information about the first word is mixed in through $h_1$. Continuing this way, $h_t$ becomes a summary of all of $x_1, \dots, x_t$.
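
The recurrence above can be sketched directly in NumPy. This is an illustrative sketch rather than a reference implementation: the function name `rnn_forward`, the layer sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b_h, W_y, b_y):
    """Run a vanilla (Elman) RNN over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(W_h.shape[0])               # h_0 = 0
    hs, ys = [], []
    for x_t in xs:                           # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b_h)
        hs.append(h)
        ys.append(W_y @ h + b_y)
    return np.array(hs), np.array(ys)

rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 4, 2  # arbitrary illustrative sizes
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b_h = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

short_seq = rng.normal(size=(2, input_dim))   # length 2
long_seq = rng.normal(size=(50, input_dim))   # length 50, same parameters
hs_short, ys_short = rnn_forward(short_seq, W_h, W_x, b_h, W_y, b_y)
hs_long, ys_long = rnn_forward(long_seq, W_h, W_x, b_h, W_y, b_y)
print(hs_short.shape, ys_short.shape)  # (2, 4) (2, 2)
print(hs_long.shape, ys_long.shape)    # (50, 4) (50, 2)
```

The same five parameter arrays handle both sequence lengths, which is the variable-length property described above.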

Another appeal of this structure is that parameter sharing helps generalization. Because the same $W$ is used at every time step, a pattern seen at the start of a sentence is processed the same way when it reappears in the middle. This is the same principle as "the same filter regardless of spatial position" in the W9 CNN, except that the shared dimension is time rather than space.

2.1 BPTT: backpropagating through unrolled time

How is the network trained? If we "unroll" the RNN along time, it becomes in effect a very deep feedforward network with one layer per time step. Applying W8 backpropagation to it unchanged is BPTT (Backpropagation Through Time).

Given a loss $L = \sum_t L_t$, the gradient with respect to $W_h$ is:

$$ \frac{\partial L}{\partial W_h} = \sum_t \sum_{k \le t} \frac{\partial L_t}{\partial h_t} \cdot \left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right) \cdot \frac{\partial h_k}{\partial W_h} $$

The key part is the inner Jacobian product $\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$. It is the protagonist of the next section.

BPTT is memory-hungry. Backpropagation needs the hidden state of every time step to be stored, so for long sequences a common approximation is TBPTT (Truncated BPTT), which propagates gradients back only a limited number of time steps.
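
One common way to realize truncated BPTT in PyTorch is to process the sequence in chunks and detach the hidden state between chunks. The sketch below assumes a toy next-value prediction task on a sine wave; the chunk length, model sizes, and variable names are illustrative choices, not something prescribed by the text.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy data: predict the next value of a long 1-D sine sequence.
seq = torch.sin(torch.linspace(0, 50, 401)).unsqueeze(0).unsqueeze(-1)  # (1, 401, 1)
inputs, targets = seq[:, :-1], seq[:, 1:]                                # (1, 400, 1) each

rnn = nn.RNN(input_size=1, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-2)

chunk_len = 25   # assumed truncation window: gradients go back at most 25 steps
h = None         # nn.RNN treats None as an all-zero initial hidden state
losses = []
for start in range(0, inputs.size(1), chunk_len):
    x = inputs[:, start:start + chunk_len]
    y = targets[:, start:start + chunk_len]
    out, h = rnn(x, h)                       # forward through one chunk, carrying h over
    loss = nn.functional.mse_loss(head(out), y)
    opt.zero_grad()
    loss.backward()                          # backpropagates only within this chunk
    opt.step()
    h = h.detach()                           # cut the graph so earlier chunks are not revisited
    losses.append(loss.item())

print(len(losses))  # 16 chunks of 25 steps
```

The hidden state still carries information forward across chunks; only the backward pass is truncated, which bounds memory by the chunk length instead of the full sequence length.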

🎮 Interactive: visualizing RNN unrolling

See the same cell repeated 5 times along time. Moving the slider to a time step highlights the hidden state at that moment.

3. The long-term dependency problem: why gradients vanish

The RNN equations are elegant, but in practice they have a serious problem. Example sentence: "나는 어제 도서관에서 친구들과 공부하고 새로 나온 소설을 빌려 집에 와서 저녁을 먹고 잠들기 전에 한 시간 동안 그 책을 읽었다." ("Yesterday I studied with friends at the library, borrowed a newly released novel, came home, ate dinner, and read that book for an hour before falling asleep.") The past tense of the final verb "읽었다" ("read") is tied to "어제" ("yesterday") near the very start of the sentence. The two are about 20 words apart.

In theory an RNN can store all past information in $h_t$, but in practice a vanilla RNN often fails to retain information from more than roughly 5 to 10 words back. Why?

Let us look again at the Jacobian product from BPTT:

$$ \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=k+1}^{t} \text{diag}(\tanh'(z_j))\, W_h, \qquad z_j = W_h h_{j-1} + W_x x_j + b_h $$

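The norm of this product grows or shrinks exponentially in $t - k$. There are two cases: when the factors have norm below 1 (for example, when the largest singular value of $W_h$ is below 1, since $|\tanh'| \le 1$), the product decays toward zero and the gradient vanishes; when $W_h$ is large enough that the factors have norm above 1, the product can blow up and the gradient explodes.

This problem is essentially the same phenomenon seen in W8 §2 on vanishing gradients, only occurring along the time axis. In 1994 Bengio et al. analyzed the problem mathematically and reached the negative conclusion that "vanilla RNNs are inherently hard to train on long-term dependencies", after which RNN research stagnated for a while.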
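
The exponential behaviour of the Jacobian product can be checked numerically. The sketch below is an illustration under stated assumptions: the recurrent matrix is a scaled random orthogonal matrix (so its largest singular value is known exactly), a random vector stands in for the input term $W_x x_j + b_h$, and the helper name `jacobian_product_norm` is made up for the example.

```python
import numpy as np

def jacobian_product_norm(W_h, steps, rng):
    """Spectral norm of the product of diag(tanh'(z_j)) @ W_h along a random trajectory."""
    n = W_h.shape[0]
    h = np.zeros(n)
    J = np.eye(n)
    for _ in range(steps):
        z = W_h @ h + rng.normal(size=n)     # random term stands in for W_x x_j + b_h
        h = np.tanh(z)
        J = np.diag(1.0 - h**2) @ W_h @ J    # tanh'(z) = 1 - tanh(z)^2
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
n = 8
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))  # orthogonal: all singular values equal 1
W_small = 0.5 * Q                              # largest singular value is 0.5

# Vanishing case: the norm is bounded by 0.5**steps, so it shrinks exponentially.
norms = {steps: jacobian_product_norm(W_small, steps, rng) for steps in (1, 5, 10, 20)}
for steps, value in norms.items():
    print(steps, value, "bound:", 0.5**steps)

# Exploding case, linearized (tanh' taken as 1): the norm equals 1.5**20, roughly 3325.
explode = np.linalg.norm(np.linalg.matrix_power(1.5 * Q, 20), 2)
print(explode)
```

The second part deliberately ignores the tanh saturation, so it shows the growth that the weight matrix alone would cause; with the nonlinearity in place the actual product can be smaller.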

4. LSTM: an information highway

Surprisingly, the answer had already been proposed at nearly the same time as Bengio's analysis. In 1997 Sepp Hochreiter and Jürgen Schmidhuber published the LSTM, a complete redesign of the RNN cell, in the paper "Long Short-Term Memory". The core idea is one thing: alongside the hidden state, keep a separate information channel called the "cell state $C_t$", and have that channel updated by addition rather than by multiplication.

Unlike repeated multiplication, addition does not make gradients vanish. If $\frac{\partial C_t}{\partial C_{t-1}} \approx 1$ (which holds when the gates are set appropriately), the gradient is preserved almost unchanged even when it travels back hundreds of time steps. This is why the LSTM is called an "information highway": the cell state is the highway, and the gates control what gets on and off it.

An LSTM controls the cell state with three gates (bias terms are omitted here for brevity):

$$ f_t = \sigma(W_f [h_{t-1}, x_t]), \quad i_t = \sigma(W_i [h_{t-1}, x_t]), \quad o_t = \sigma(W_o [h_{t-1}, x_t]) $$

$$ \tilde{C}_t = \tanh(W_C [h_{t-1}, x_t]), \quad C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t $$

$$ h_t = o_t \odot \tanh(C_t) $$

🎮 Interactive: adjusting LSTM gate values

Move the three gate values directly and watch how the cell state changes. With the forget gate at 0 the past is erased completely, and with the input gate at 0 no new information is admitted.

4.1 Understanding the LSTM equations line by line

Let us write out the full LSTM update equations again:

$$ f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) $$

$$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) $$

$$ o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$

$$ \tilde C_t = \tanh(W_C [h_{t-1}, x_t] + b_C) $$

$$ C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t $$

$$ h_t = o_t \odot \tanh(C_t) $$

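$[h_{t-1}, x_t]$ is the concatenation of the two vectors, and $\odot$ is the element-wise product. The meaning of each equation, one sentence each:

  1. $f_t$ (forget gate): decides, per component, how much of the previous cell state $C_{t-1}$ to keep.
  2. $i_t$ (input gate): decides how much of the new candidate to write into the cell state.
  3. $o_t$ (output gate): decides how much of the cell state to expose as the hidden state.
  4. $\tilde C_t$ (candidate): the new information proposed from the current input and the previous hidden state.
  5. $C_t$: the kept part of the old cell state plus the admitted part of the candidate.
  6. $h_t$: the squashed cell state, filtered by the output gate.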
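
The six update equations translate almost line for line into NumPy. This is a minimal sketch for a single (unbatched) time step; the function name `lstm_step`, the parameter dictionary `p`, and the sizes are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM time step; p holds W_f, W_i, W_o, W_C and b_f, b_i, b_o, b_C."""
    v = np.concatenate([h_prev, x_t])              # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ v + p["b_f"])           # forget gate
    i = sigmoid(p["W_i"] @ v + p["b_i"])           # input gate
    o = sigmoid(p["W_o"] @ v + p["b_o"])           # output gate
    C_tilde = np.tanh(p["W_C"] @ v + p["b_C"])     # candidate cell state
    C = f * C_prev + i * C_tilde                   # additive cell-state update
    h = o * np.tanh(C)                             # exposed hidden state
    return h, C

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                       # arbitrary illustrative sizes
p = {}
for gate in ("f", "i", "o", "C"):
    p["W_" + gate] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    p["b_" + gate] = np.zeros(hidden_dim)

h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):        # a length-5 toy sequence
    h, C = lstm_step(x_t, h, C, p)
print(h.shape, C.shape)  # (4,) (4,)
```

Each weight matrix has `hidden_dim + input_dim` columns because it acts on the concatenated vector.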

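4.2 Why the gradient no longer vanishes

The key is that $C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t$ gives $\partial C_t / \partial C_{t-1} = f_t$ along the direct cell-state path, that is, the value of the forget gate. Important observations: when $f_t$ is close to 1, the gradient passes through that step almost unchanged, so it can survive across many time steps; when $f_t$ is close to 0, the gradient is cut off, which corresponds to deliberately forgetting; and $f_t$ is not fixed but is computed from $h_{t-1}$ and $x_t$ with learned weights.

In other words, the LSTM learns from data "how long to remember". This is the decisive difference from the vanilla RNN, which lacked that ability.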
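
A quick arithmetic check shows how strongly the forget-gate value controls the gradient along the cell-state path. The sketch treats the scalar case with a constant forget value, which is a simplifying assumption; the helper name `cell_gradient_factor` is hypothetical.

```python
def cell_gradient_factor(forget_values):
    """Product of forget-gate values along the direct cell-state path (scalar case)."""
    factor = 1.0
    for f in forget_values:
        factor *= f
    return factor

steps = 100
keep_long = cell_gradient_factor([0.99] * steps)   # forget gate stays near 1
keep_short = cell_gradient_factor([0.5] * steps)   # forget gate at 0.5

print(keep_long)    # about 0.366: most of the gradient survives 100 steps
print(keep_short)   # about 7.9e-31: effectively zero
```

A forget gate that stays near 1 lets the gradient survive a hundred steps, while a value of 0.5 wipes it out.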

The forget-gate bias trick: in practice, initializing $b_f$ to 1 or 2 means almost no forgetting happens early in training, which improves gradient flow. It is a small trick, but it can make a noticeable difference to convergence speed. PyTorch's default initialization does not apply this trick, so you have to set it yourself.
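
One way to apply the trick to `nn.LSTM` is sketched below. It relies on PyTorch storing each bias as a vector of length `4 * hidden_size` with the gates ordered input, forget, cell, output, and on PyTorch keeping two bias vectors (`bias_ih` and `bias_hh`) that are added together. The helper name `init_forget_bias` is an assumption for the example, not a PyTorch function.

```python
import torch
import torch.nn as nn

def init_forget_bias(lstm: nn.LSTM, value: float = 1.0) -> None:
    """Set the effective forget-gate bias of every layer to `value`."""
    h = lstm.hidden_size
    with torch.no_grad():
        for name, param in lstm.named_parameters():
            if name.startswith("bias_ih"):
                param[h:2 * h].fill_(value)   # forget-gate slice of (input, forget, cell, output)
            elif name.startswith("bias_hh"):
                param[h:2 * h].zero_()        # so the two biases sum to exactly `value`

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
init_forget_bias(lstm, 1.0)
print(lstm.bias_ih_l0[16:32])  # all ones
```

Zeroing the `bias_hh` slice is a design choice made here so that the effective bias is exactly the requested value; leaving it at its random initial value would also bias the gate toward remembering, only less precisely.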

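5. GRU: the virtue of simplicity

The GRU (Gated Recurrent Unit), proposed by Kyunghyun Cho et al. in 2014, is a simplified variant that reduces the LSTM's three gates to two. The core ideas: it drops the separate cell state and keeps only the hidden state $h_t$; a single update gate $z_t$ takes over the roles of the forget and input gates; and a reset gate $r_t$ controls how much of the previous hidden state enters the candidate $\tilde h_t$.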

$$ z_t = \sigma(W_z [h_{t-1}, x_t]), \quad r_t = \sigma(W_r [h_{t-1}, x_t]) $$

$$ \tilde h_t = \tanh(W [r_t \odot h_{t-1}, x_t]) $$

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t $$
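
The GRU equations can be sketched the same way in NumPy. As in the equations above, no bias terms are used; the function name `gru_step` and the sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step following the equations above (no bias terms, as in the text)."""
    v = np.concatenate([h_prev, x_t])                          # [h_{t-1}, x_t]
    z = sigmoid(W_z @ v)                                       # update gate
    r = sigmoid(W_r @ v)                                       # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))   # candidate hidden state
    return (1.0 - z) * h_prev + z * h_tilde                    # blend old state and candidate

rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4                                   # arbitrary illustrative sizes
W_z, W_r, W = (rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
               for _ in range(3))

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):                    # a length-5 toy sequence
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)  # (4,)
```

Because $h_t$ is an element-wise interpolation between $h_{t-1}$ and $\tilde h_t$, the single gate $z_t$ decides both how much to keep and how much to overwrite.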

A GRU has about 3/4 of the parameters of an LSTM. Performance is similar or slightly different depending on the task; as a general tendency, the GRU tends to have the edge on small datasets (less overfitting), while the LSTM tends to work slightly better on large ones. In practice, try both and pick the one with better validation performance.
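
The 3/4 figure follows from counting weight blocks: the LSTM has four (three gates plus the candidate) and the GRU has three (two gates plus the candidate). The sketch below assumes one weight matrix over $[h_{t-1}, x_t]$ and one bias vector per block; the function name and the sizes 64 and 128 are illustrative.

```python
def gated_rnn_params(input_size, hidden_size, n_blocks):
    """Parameter count when each block has a weight over [h_{t-1}, x_t] plus one bias vector."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

lstm_params = gated_rnn_params(64, 128, n_blocks=4)  # f, i, o gates + candidate
gru_params = gated_rnn_params(64, 128, n_blocks=3)   # z, r gates + candidate
print(lstm_params, gru_params, gru_params / lstm_params)  # 98816 74112 0.75
```

Library implementations may store biases differently (PyTorch keeps two bias vectors per block), but since every block is affected equally the ratio stays at 3/4.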

After the Transformer (W13) appeared, the RNN family receded from the mainstream of natural language processing, but LSTMs and GRUs are still a common default choice in specialized areas such as sensor signal processing, time-series forecasting, and speech recognition. In particular, on edge devices or where real-time processing is required, the $O(n^2)$ memory of the Transformer's self-attention is a burden, so RNNs are often still preferred.

6. Time-series prediction demo

🎮 Interactive: sine-wave prediction

A simple one-cell RNN learns a sine wave. Try adjusting the amplitude and frequency (simulation).

7. Code example (PyTorch)

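import torch.nn as nn

class LSTMNet(nn.Module):
    """LSTM over a sequence, followed by a linear layer on the last time step."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        # batch_first=True: inputs have shape (batch, seq_len, input_size)
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # out: (batch, seq_len, hidden_size); h and c are the final hidden and cell states
        out, (h, c) = self.lstm(x)
        # Predict from the hidden state at the last time step
        return self.fc(out[:, -1, :])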
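
A possible way to use this model is sketched below on a toy sine-wave task, in the spirit of the demo in section 6. The data construction (32 windows of 20 samples, step 0.1), the hidden size, and the optimizer settings are assumptions chosen for illustration; the class is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class LSTMNet(nn.Module):  # same model as above, repeated so the snippet is self-contained
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, (h, c) = self.lstm(x)
        return self.fc(out[:, -1, :])

torch.manual_seed(0)

# Hypothetical data: 32 sliding windows of 20 sine samples, each predicting the next sample.
wave = torch.sin(torch.arange(0, 53, dtype=torch.float32) * 0.1)
windows = torch.stack([wave[i:i + 20] for i in range(32)]).unsqueeze(-1)  # (32, 20, 1)
targets = torch.stack([wave[i + 20] for i in range(32)]).unsqueeze(-1)    # (32, 1)

model = LSTMNet(input_size=1, hidden_size=16, output_size=1)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

pred = model(windows)                           # (32, 1): one prediction per window
loss = nn.functional.mse_loss(pred, targets)
opt.zero_grad()
loss.backward()                                 # a single training step
opt.step()
print(pred.shape)  # torch.Size([32, 1])
```

Repeating the last five lines in a loop would give a basic training procedure; a real experiment would also hold out validation windows.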

📖 Further study