โ† ์ฃผ์ฐจ ๋ชฉ๋ก

Week 12. Seq2Seq and Attention

๋ฒˆ์—ญ๊ธฐ์ฒ˜๋Ÿผ ์‹œํ€€์Šค๋ฅผ ๋ฐ›์•„ ์‹œํ€€์Šค๋ฅผ ๋งŒ๋“œ๋Š” ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ, ๊ทธ๋ฆฌ๊ณ  ๋ชจ๋“  LLM์˜ ์–ด๋จธ๋‹ˆ์ธ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜.

์ด๋ฒˆ ์ฃผ์— ๋ฐฐ์šฐ๋Š” ๊ฒƒ

  1. The sequence-to-sequence problem
  2. The encoder-decoder architecture
  3. Limits of the fixed context vector
  4. The attention mechanism
  5. Beam search decoding

1. Seq2Seq ๋ฌธ์ œ โ€” ์‹œํ€€์Šค๋ฅผ ๋ฐ›์•„ ์‹œํ€€์Šค๋ฅผ ๋‚ด๊ธฐ

The RNNs we covered in W11 RNN/LSTM take a sequence and produce a single output (e.g., sentiment classification: "positive/negative"). In practice, though, far more problems require taking a sequence and producing a sequence: machine translation, summarization, and dialogue, to name a few.

์ด ๋ฌธ์ œ๋“ค์˜ ๊ณตํ†ต ์–ด๋ ค์›€: ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ์˜ ๊ธธ์ด๊ฐ€ ์„œ๋กœ ๋‹ค๋ฅด๊ณ , ๋ฏธ๋ฆฌ ์˜ˆ์ธกํ•  ์ˆ˜๋„ ์—†๋‹ค๋Š” ๊ฒƒ. "์•ˆ๋…•"(1๋‹จ์–ด)์„ "hi"(1๋‹จ์–ด)๋กœ ๋ฒˆ์—ญํ•˜๋Š” ๊ฒฝ์šฐ์™€, 20๋‹จ์–ด ๋ฌธ์žฅ์„ 30๋‹จ์–ด๋กœ ๋ฒˆ์—ญํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๊ฐ™์€ ๋ชจ๋ธ๋กœ ์ฒ˜๋ฆฌ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

ํ•ด๋‹ต์€ ๋‘ ๊ฐœ์˜ RNN์„ ์ด์–ด๋ถ™์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ํ•˜๋‚˜๋Š” ์ž…๋ ฅ์„ ์ฝ์–ด "์˜๋ฏธ ๋ฒกํ„ฐ"๋กœ ์••์ถ•ํ•˜๋Š” ์ธ์ฝ”๋”(encoder), ๋‹ค๋ฅธ ํ•˜๋‚˜๋Š” ๊ทธ ์˜๋ฏธ ๋ฒกํ„ฐ๋กœ๋ถ€ํ„ฐ ์ถœ๋ ฅ์„ ํ•œ ๋‹จ์–ด์”ฉ ์ƒ์„ฑํ•˜๋Š” ๋””์ฝ”๋”(decoder). ์ด๋ฅผ Seq2Seq ๋ชจ๋ธ์ด๋ผ ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

2. ์ธ์ฝ”๋”-๋””์ฝ”๋” ๊ตฌ์กฐ

2014๋…„ Google Brain์˜ Ilya Sutskever, Oriol Vinyals, Quoc Le๊ฐ€ ๋ฐœํ‘œํ•œ ๋…ผ๋ฌธ "Sequence to Sequence Learning with Neural Networks"๋Š” Seq2Seq์˜ ์›ํ˜•์„ ํ™•๋ฆฝํ–ˆ์Šต๋‹ˆ๋‹ค. ๊ฐ™์€ ํ•ด Cho Kyunghyun ๋“ฑ์ด "RNN Encoder-Decoder for Statistical Machine Translation"์—์„œ ์œ ์‚ฌํ•œ ์•„์ด๋””์–ด๋ฅผ ์ œ์•ˆํ–ˆ๊ณ , ์—ฌ๊ธฐ์„œ GRU๋„ ํ•จ๊ป˜ ๋“ฑ์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.

๊ตฌ์กฐ๋Š” ๋‹จ์ˆœํ•ฉ๋‹ˆ๋‹ค:

๋””์ฝ”๋”๋Š” ์‹œ์ž‘ ํ† ํฐ <SOS>์—์„œ ์‹œ์ž‘ํ•ด ์ข…๋ฃŒ ํ† ํฐ <EOS>๋ฅผ ๋‚ผ ๋•Œ๊นŒ์ง€ ๊ณ„์† ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. ์ด ๋•๋ถ„์— ์ถœ๋ ฅ ๊ธธ์ด๊ฐ€ ์ž์œ ๋กญ๊ฒŒ ๊ฒฐ์ •๋ฉ๋‹ˆ๋‹ค.

ํ•™์Šต์€ ์–ด๋–ป๊ฒŒ ํ• ๊นŒ์š”? ๋ณดํ†ต "์ •๋‹ต ์‹œํ€€์Šค"๋ฅผ ์‚ฌ์šฉํ•ด teacher forcing์„ ์”๋‹ˆ๋‹ค โ€” ๋””์ฝ”๋”์˜ ์ด์ „ ์ถœ๋ ฅ ๋Œ€์‹  ์ •๋‹ต ๋‹จ์–ด๋ฅผ ๋‹ค์Œ ์ž…๋ ฅ์œผ๋กœ ๋„ฃ์–ด ํ•™์Šต์„ ์•ˆ์ •ํ™”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ ๋Œ€์‹  ์ถ”๋ก  ์‹œ์—๋Š” ์ž์‹ ์˜ ์ด์ „ ์ถœ๋ ฅ์„ ์“ฐ๊ฒŒ ๋˜๋Š” train-test ๋ถˆ์ผ์น˜๊ฐ€ ์ƒ๊ธฐ๋Š”๋ฐ, ์ด๋ฅผ ์™„ํ™”ํ•˜๋Š” scheduled sampling ๊ฐ™์€ ๊ธฐ๋ฒ•์ด ํ›„์† ์—ฐ๊ตฌ์—์„œ ์ œ์•ˆ๋์Šต๋‹ˆ๋‹ค.

2.1 Why Seq2Seq Mattered

Seq2Seq๊ฐ€ ํ˜๋ช…์ ์ด์—ˆ๋˜ ์ด์œ ๋Š” "๋ชจ๋“  ๊ฒƒ์„ End-to-End๋กœ ํ•™์Šต"ํ•  ์ˆ˜ ์žˆ์—ˆ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค. ๊ทธ ์ด์ „์˜ ๊ธฐ๊ณ„ ๋ฒˆ์—ญ์€ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„(ํ† ํฐํ™”โ†’๊ตฌ๋ฌธ ๋ถ„์„โ†’๋‹จ์–ด ์ •๋ ฌโ†’๊ตฌ ๋ฒˆ์—ญโ†’์žฌ์ •๋ ฌ)๋กœ ๋‚˜๋‰˜์–ด ๊ฐ ๋‹จ๊ณ„๊ฐ€ ๋…๋ฆฝ์ ์œผ๋กœ ํŠœ๋‹๋˜์—ˆ๋Š”๋ฐ, Seq2Seq๋Š” ์ด ๋ชจ๋‘๋ฅผ ํ•˜๋‚˜์˜ ๋„คํŠธ์›Œํฌ๋กœ ๋Œ€์ฒดํ–ˆ์Šต๋‹ˆ๋‹ค. ์„ฑ๋Šฅ๋„ ๊ณง๋ฐ”๋กœ ๋‹น์‹œ ์ตœ๊ณ  ๊ธฐ๋ฒ•์„ ๋”ฐ๋ผ์žก์•˜๊ณ , ์ด๊ฒƒ์ด 2016๋…„ Google Translate์˜ GNMT ์‹œ์Šคํ…œ์œผ๋กœ ์‹ค์ œ ์„œ๋น„์Šค์— ์ ์šฉ๋˜์–ด ํ•˜๋ฃป๋ฐค ์‚ฌ์ด ๋ฒˆ์—ญ ํ’ˆ์งˆ์ด ๊ธ‰์ƒ์Šนํ•œ ์‚ฌ๊ฑด("the Google Translate revolution")์˜ ๋ฐฐ๊ฒฝ์ž…๋‹ˆ๋‹ค.

๐ŸŽฎ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ: ์ธ์ฝ”๋”-๋””์ฝ”๋” ํ๋ฆ„

ํ•œ๊ตญ์–ด 4๋‹จ์–ด๊ฐ€ ์ธ์ฝ”๋”๋กœ ๋“ค์–ด๊ฐ€ ์ปจํ…์ŠคํŠธ๊ฐ€ ๋˜๊ณ , ๋””์ฝ”๋”๊ฐ€ ์˜์–ด 4๋‹จ์–ด๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๊ณผ์ •์„ ์‹œ๊ฐํ™”ํ•ฉ๋‹ˆ๋‹ค. ์Šฌ๋ผ์ด๋”๋กœ ์‹œ์ ์„ ์˜ฎ๊ฒจ๋ณด์„ธ์š”.

3. ๊ณ ์ • ์ปจํ…์ŠคํŠธ ๋ฒกํ„ฐ์˜ ๋ณ‘๋ชฉ

Seq2Seq๋Š” ํ›Œ๋ฅญํ–ˆ์ง€๋งŒ ํ•œ ๊ฐ€์ง€ ๊ทผ๋ณธ์  ๋ฌธ์ œ๊ฐ€ ์žˆ์—ˆ์Šต๋‹ˆ๋‹ค. ์•„๋ฌด๋ฆฌ ๋ณต์žกํ•œ ๋ฌธ์žฅ์ด๋ผ๋„ ๋‹จ ํ•˜๋‚˜์˜ ๊ณ ์ • ํฌ๊ธฐ ๋ฒกํ„ฐ๋กœ ์••์ถ•ํ•ด์•ผ ํ•œ๋‹ค๋Š” ์ ์ž…๋‹ˆ๋‹ค. ๋ฌธ์žฅ์ด ๊ธธ์–ด์งˆ์ˆ˜๋ก ๋” ๋งŽ์€ ์ •๋ณด๊ฐ€ ๊ฐ™์€ ํฌ๊ธฐ์˜ ๋ฒกํ„ฐ์— ์šฑ์—ฌ ๋„ฃ์–ด์ง€๋‹ˆ, ๋””์ฝ”๋”๋Š” ์‹œ์ž‘ ๋ถ€๋ถ„์„ "๊ธฐ์–ต"ํ•  ์—ฌ๋ ฅ์„ ์žƒ์Šต๋‹ˆ๋‹ค.

์‹ค์ œ๋กœ 2014~2015๋…„ ์‹คํ—˜์—์„œ ์ž…๋ ฅ ๋ฌธ์žฅ์ด 20๋‹จ์–ด๋ฅผ ๋„˜์–ด๊ฐ€๋ฉด Seq2Seq์˜ ๋ฒˆ์—ญ ํ’ˆ์งˆ์ด ๊ธ‰๊ฒฉํžˆ ๋–จ์–ด์ง€๋Š” ๊ฒƒ์ด ๊ด€์ฐฐ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์น˜ ์‹œํ—˜ ์ง์ „์— 500์ชฝ ์ฑ…์„ ํ•œ ๋‹จ์–ด("๊ณต๋ถ€")๋กœ ์š”์•ฝํ•ด ์‹œํ—˜์„ ์น˜๋ ค๋Š” ๊ฒƒ๊ณผ ๊ฐ™์€ ์ƒํ™ฉ์ž…๋‹ˆ๋‹ค. ์ด "์ •๋ณด ๋ณ‘๋ชฉ(information bottleneck)"์ด RNN ๋ฒˆ์—ญ๊ธฐ์˜ ๊ฒฐ์ •์  ์•ฝ์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค.

๋ช‡ ๊ฐ€์ง€ ์™„ํ™”์ฑ…์ด ์ œ์‹œ๋˜์—ˆ์Šต๋‹ˆ๋‹ค:

๊ทธ๋Ÿฌ๋‚˜ ์ด ๋ชจ๋‘๋Š” ์ฆ์ƒ ์™„ํ™”์˜€๊ณ , ๊ทผ๋ณธ ์น˜๋ฃŒ๋Š” ์ „ํ˜€ ๋‹ค๋ฅธ ์•„์ด๋””์–ด์—์„œ ๋‚˜์™”์Šต๋‹ˆ๋‹ค: "์• ์ดˆ์— ๋ชจ๋“  ์ •๋ณด๋ฅผ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๋กœ ์••์ถ•ํ•˜์ง€ ๋ง์ž."

4. Attention — Learning to Focus

2014๋…„ 9์›”, Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio๊ฐ€ ICLR์— ํˆฌ๊ณ ํ•œ ๋…ผ๋ฌธ "Neural Machine Translation by Jointly Learning to Align and Translate"๊ฐ€ ์–ดํ…์…˜ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ฒ˜์Œ ์„ ๋ณด์˜€์Šต๋‹ˆ๋‹ค. ์•„์ด๋””์–ด๋Š” ์‹ฌํ”Œํ•˜์ง€๋งŒ ํ˜๋ช…์ ์ž…๋‹ˆ๋‹ค.

์–ดํ…์…˜ ํ•œ ์ค„ ์š”์•ฝ โ€” "๋””์ฝ”๋”๊ฐ€ ๋งค ์ถœ๋ ฅ ์‹œ์ ๋งˆ๋‹ค, ์ธ์ฝ”๋”์˜ ๋ชจ๋“  ์€๋‹‰ ์ƒํƒœ๋ฅผ ์ฐธ๊ณ ํ•˜๋˜, ์ง€๊ธˆ ์ƒ์„ฑํ•˜๋Š” ๋‹จ์–ด์™€ ๊ด€๋ จ ์žˆ๋Š” ์œ„์น˜์—๋งŒ ์ง‘์ค‘ํ•œ๋‹ค."

๊ตฌ์ฒด์ ์œผ๋กœ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค:

  1. ์ธ์ฝ”๋”๋Š” ์ž…๋ ฅ ๋‹จ์–ด๋งˆ๋‹ค ์€๋‹‰ ์ƒํƒœ $h_1, h_2, \dots, h_T$๋ฅผ ๋ชจ๋‘ ์ €์žฅํ•ด๋‘ก๋‹ˆ๋‹ค (๋งˆ์ง€๋ง‰ ๊ฒƒ๋งŒ ์“ฐ๋Š” ๊ฒŒ ์•„๋‹˜).
  2. ๋””์ฝ”๋”๊ฐ€ ์ถœ๋ ฅ ์‹œ์  $t$์—์„œ ์ƒˆ ๋‹จ์–ด๋ฅผ ๋‚ด๋ ค๊ณ  ํ•  ๋•Œ, ์ž์‹ ์˜ ํ˜„์žฌ ์ƒํƒœ $s_t$์™€ ๋ชจ๋“  $h_i$ ์‚ฌ์ด์˜ ์ ์ˆ˜(score)๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.
  3. ์ ์ˆ˜๋ฅผ softmax๋กœ ์ •๊ทœํ™”ํ•ด ์ฃผ์˜ ๊ฐ€์ค‘์น˜ $\alpha_{t,i}$๋ฅผ ์–ป์Šต๋‹ˆ๋‹ค.
  4. ๊ฐ€์ค‘์น˜๋กœ ๊ฐ€์ค‘ ํ‰๊ท ํ•œ ์ปจํ…์ŠคํŠธ ๋ฒกํ„ฐ $c_t$๋ฅผ ๋งŒ๋“ค์–ด ๋””์ฝ”๋”์— ๊ณต๊ธ‰ํ•ฉ๋‹ˆ๋‹ค.

์ˆ˜์‹:

$$ \alpha_{t,i} = \frac{\exp(\text{score}(s_t, h_i))}{\sum_j \exp(\text{score}(s_t, h_j))}, \quad c_t = \sum_i \alpha_{t,i} h_i $$
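
The formula in a few lines of PyTorch, using dot products as the score function for brevity:

import torch

T, H = 5, 8                       # 5 input words, hidden size 8
h = torch.randn(T, H)             # encoder hidden states h_1 .. h_T
s_t = torch.randn(H)              # decoder state at output step t

scores = h @ s_t                  # score(s_t, h_i) for every i, shape (T,)
alpha = torch.softmax(scores, 0)  # attention weights, sum to 1
c_t = alpha @ h                   # context vector: weighted average, shape (H,)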

์ด์ œ ๋””์ฝ”๋”๋Š” ๋งค ์‹œ์ ๋งˆ๋‹ค "์–ด๋””์— ์ง‘์ค‘ํ• ์ง€"๋ฅผ ์Šค์Šค๋กœ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด ๊ฐ€์ค‘์น˜๋ฅผ ์‹œ๊ฐํ™”ํ•˜๋ฉด ๊ธฐ๊ณ„ ๋ฒˆ์—ญ๊ธฐ๊ฐ€ "school"์„ ๋งŒ๋“ค ๋•Œ ํ•œ๊ตญ์–ด "ํ•™๊ต"๋ฅผ ๋ณด๊ณ , "went"๋ฅผ ๋งŒ๋“ค ๋•Œ "๊ฐ”๋‹ค"๋ฅผ ๋ณธ๋‹ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ง ๊ทธ๋Œ€๋กœ "๋ฒˆ์—ญ ์ •๋ ฌ"์ด ์ž๋™์œผ๋กœ ํ•™์Šต๋˜๋Š” ๊ฒƒ.

4.1 ์ ์ˆ˜ ํ•จ์ˆ˜์˜ ๋‘ ๊ณ„์—ด โ€” Bahdanau vs Luong

์ ์ˆ˜ ํ•จ์ˆ˜์˜ ํ˜•ํƒœ์— ๋”ฐ๋ผ ๋‘ ๋Œ€ํ‘œ์  ๋ณ€ํ˜•์ด ์žˆ์Šต๋‹ˆ๋‹ค:

ํ˜„๋Œ€ Transformer(W13)๋Š” Luong์˜ dot-product ๋ฐฉ์‹์„ ํ™•์žฅํ•œ "scaled dot-product attention"์„ ์”๋‹ˆ๋‹ค. ๋‘ ๋ฐฉ์‹์˜ ์‹ค์ฆ์  ์„ฑ๋Šฅ์€ ๋น„์Šทํ•˜์ง€๋งŒ ์†๋„์™€ ๋‹จ์ˆœ์„ฑ์—์„œ dot-product๊ฐ€ ์Šน๋ฆฌํ–ˆ์Šต๋‹ˆ๋‹ค.

4.2 ์–ดํ…์…˜์ด ๋ฐ”๊พผ ๊ฒƒ๋“ค

์–ดํ…์…˜์˜ ๋“ฑ์žฅ์€ ๋”ฅ๋Ÿฌ๋‹ NLP์˜ ์ „ํ™˜์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค. ํšจ๊ณผ๋Š” ์ฆ‰๊ฐ์ ์ด์—ˆ์Šต๋‹ˆ๋‹ค:

๊ทธ๋ฆฌ๊ณ  ๊ฒฐ์ •์ ์œผ๋กœ, 2017๋…„ Google์˜ "Attention Is All You Need" ๋…ผ๋ฌธ์ด "์–ดํ…์…˜๋งŒ ์žˆ์œผ๋ฉด RNN ์ž์ฒด๊ฐ€ ํ•„์š” ์—†๋‹ค"๋Š” ์ฃผ์žฅ์œผ๋กœ Transformer๋ฅผ ์ œ์•ˆํ–ˆ๊ณ , ์ดํ›„ ๋ชจ๋“  LLM์˜ ๊ธฐ๋ฐ˜์ด ๋ฉ๋‹ˆ๋‹ค. ์–ดํ…์…˜์€ ๋‹จ์ˆœํ•œ ๊ฐœ์„ ์„ ๋„˜์–ด ์‹ ๊ฒฝ๋ง ์„ค๊ณ„ ํŒจ๋Ÿฌ๋‹ค์ž„ ์ž์ฒด๋ฅผ ๋ฐ”๊ฟจ์Šต๋‹ˆ๋‹ค.

๐ŸŽฎ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ: ์–ดํ…์…˜ ๊ฐ€์ค‘์น˜ ํžˆํŠธ๋งต

ํ•œโ†’์˜ ๋ฒˆ์—ญ์˜ ๊ฐ€์ƒ ์–ดํ…์…˜ ํ–‰๋ ฌ์ž…๋‹ˆ๋‹ค. ํ–‰์€ ์ถœ๋ ฅ ๋‹จ์–ด, ์—ด์€ ์ž…๋ ฅ ๋‹จ์–ด. ์ง„ํ• ์ˆ˜๋ก ํฐ ๊ฐ€์ค‘์น˜์ž…๋‹ˆ๋‹ค. ์Šฌ๋ผ์ด๋”๋กœ ํŒจํ„ด์„ ๋ฐ”๊ฟ”๋ณด์„ธ์š”.

์™œ ์–ดํ…์…˜์ด ํ˜๋ช…์ด์—ˆ๋‚˜. ๊ฑฐ๋ฆฌ์— ์ƒ๊ด€์—†์ด ๋ชจ๋“  ์œ„์น˜๋ฅผ ์ง์ ‘ ์—ฐ๊ฒฐํ•˜๋ฏ€๋กœ ์žฅ๊ธฐ ์˜์กด์„ฑ์ด ์ž์—ฐ์Šค๋Ÿฝ๊ฒŒ ํ’€๋ฆฝ๋‹ˆ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ๊ฐ€์ค‘์น˜๊ฐ€ ํ•™์Šต ๊ฐ€๋Šฅํ•˜๋ฏ€๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์•Œ์•„์„œ ์–ด๋””๋ฅผ ๋ด์•ผ ํ• ์ง€ ๊ฒฐ์ •ํ•ฉ๋‹ˆ๋‹ค. ์ด ํ†ต์ฐฐ์ด ํŠธ๋žœ์Šคํฌ๋จธ๋กœ ์ด์–ด์ง‘๋‹ˆ๋‹ค.

5. ๋น” ์„œ์น˜(Beam Search)

๋””์ฝ”๋”ฉ ์‹œ ๋งค ์‹œ์  ๊ฐ€์žฅ ํ™•๋ฅ  ๋†’์€ ๋‹จ์–ด ํ•˜๋‚˜๋งŒ ๊ณ ๋ฅด๋Š” ๊ทธ๋ฆฌ๋”” ๋””์ฝ”๋”ฉ์€ ๊ทผ์‹œ์•ˆ์ ์ž…๋‹ˆ๋‹ค. ๋น” ์„œ์น˜๋Š” ์ƒ์œ„ $k$๊ฐœ ํ›„๋ณด๋ฅผ ๋™์‹œ์— ์œ ์ง€ํ•˜๋ฉฐ ์ „์ฒด์ ์œผ๋กœ ์ ์ˆ˜๊ฐ€ ๋†’์€ ์‹œํ€€์Šค๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค. $k=1$์ด๋ฉด ๊ทธ๋ฆฌ๋””, $k$๊ฐ€ ํฌ๋ฉด ๋” ์ข‹์€ ๊ฒฐ๊ณผ์ง€๋งŒ ๋А๋ฆฝ๋‹ˆ๋‹ค.

๐ŸŽฎ ์ธํ„ฐ๋ž™ํ‹ฐ๋ธŒ: ๋น” ์„œ์น˜ ํŠธ๋ฆฌ

๊ฐ ์‹œ์ ์—์„œ ๋น” ํญ๋งŒํผ์˜ ํ›„๋ณด๊ฐ€ ์‚ด์•„๋‚จ๋Š” ๋ชจ์Šต์„ ๋ด…๋‹ˆ๋‹ค. ๋น” ํญ์ด ์ปค์งˆ์ˆ˜๋ก ํŠธ๋ฆฌ๊ฐ€ ๋„“์–ด์ง‘๋‹ˆ๋‹ค.

6. ์ฝ”๋“œ ์˜ˆ์ œ (PyTorch ๊ฐœ๋…)

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab, hid):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        self.gru = nn.GRU(hid, hid, batch_first=True)
    def forward(self, x):
        return self.gru(self.emb(x))   # (out, h)

class Attention(nn.Module):
    def __init__(self, hid):
        super().__init__()
        self.v = nn.Linear(hid*2, 1)
    def forward(self, dec_h, enc_outs):
        # dec_h: (B,H), enc_outs: (B,T,H)
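        # a simplified additive (Bahdanau-style) score: concat [s_t; h_i],
        # then one linear layer (the original inserts a tanh nonlinearity)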
        scores = self.v(torch.cat([dec_h.unsqueeze(1).expand_as(enc_outs), enc_outs], -1))
        alpha = torch.softmax(scores.squeeze(-1), dim=1)
        ctx = (alpha.unsqueeze(-1) * enc_outs).sum(1)
        return ctx, alpha
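
To complete the picture, here is one way a decoder could consume the context vector at each step, reusing the Attention module above. A sketch in the same spirit, not from any specific paper:

class Decoder(nn.Module):
    def __init__(self, vocab, hid):
        super().__init__()
        self.emb = nn.Embedding(vocab, hid)
        self.attn = Attention(hid)
        self.gru = nn.GRU(hid * 2, hid, batch_first=True)  # input: [embedding; context]
        self.out = nn.Linear(hid, vocab)
    def forward(self, y_prev, h, enc_outs):
        # y_prev: (B,) previous token, h: (1,B,H) GRU state, enc_outs: (B,T,H)
        ctx, alpha = self.attn(h[-1], enc_outs)            # attend over encoder outputs
        x = torch.cat([self.emb(y_prev), ctx], -1).unsqueeze(1)   # (B,1,2H)
        out, h = self.gru(x, h)
        return self.out(out.squeeze(1)), h, alpha          # logits, new state, weights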

๐Ÿ“– ๋” ๊นŠ์ด ๊ณต๋ถ€ํ•˜๊ธฐ