{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "
\n", "

Information Extraction

\n", "

13. Transformers 2 [exercises]

\n", "

Jakub Pokrywka (2021)

\n", "
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Attention visualization\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The visualizations below use BertViz: https://github.com/jessevig/bertviz" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "!pip install bertviz" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import AutoTokenizer, AutoModel\n", "from bertviz import model_view, head_view" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TEXT = \"This is a sample input sentence for a transformer model\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL = \"distilbert-base-uncased\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the tokenizer and the model; output_attentions=True makes the model return its attention weights\n", "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n", "model = AutoModel.from_pretrained(MODEL, output_attentions=True)\n", "inputs = tokenizer.encode(TEXT, return_tensors='pt')\n", "outputs = model(inputs)\n", "# The attention weights are the last element of the output (one tensor per layer)\n", "attention = outputs[-1]\n", "# Map token ids back to token strings so the visualization can label them\n", "tokens = tokenizer.convert_ids_to_tokens(inputs[0])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### SELF-ATTENTION MODELS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head_view(attention, tokens)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "model_view(attention, tokens)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ENCODER-DECODER MODELS" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL = \"Helsinki-NLP/opus-mt-en-de\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TEXT_ENCODER = 
\"She sees the small elephant.\"\n", "TEXT_DECODER = \"Sie sieht den kleinen Elefanten.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n", "model = AutoModel.from_pretrained(MODEL, output_attentions=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoder_input_ids = tokenizer(TEXT_ENCODER, return_tensors=\"pt\", add_special_tokens=True).input_ids\n", "decoder_input_ids = tokenizer(TEXT_DECODER, return_tensors=\"pt\", add_special_tokens=True).input_ids\n", "\n", "outputs = model(input_ids=encoder_input_ids, decoder_input_ids=decoder_input_ids)\n", "\n", "encoder_text = tokenizer.convert_ids_to_tokens(encoder_input_ids[0])\n", "decoder_text = tokenizer.convert_ids_to_tokens(decoder_input_ids[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "head_view(\n", " encoder_attention=outputs.encoder_attentions,\n", " decoder_attention=outputs.decoder_attentions,\n", " cross_attention=outputs.cross_attentions,\n", " encoder_tokens=encoder_text,\n", " decoder_tokens=decoder_text\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "model_view(\n", " encoder_attention=outputs.encoder_attentions,\n", " decoder_attention=outputs.decoder_attentions,\n", " cross_attention=outputs.cross_attentions,\n", " encoder_tokens=encoder_text,\n", " decoder_tokens=decoder_text\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise (10 minutes)\n", "\n", "Using the en-fr model, translate any sentence from English into French and inspect the attention weights for that translation." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "MODEL = \"Helsinki-NLP/opus-mt-en-fr\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, 
"outputs": [], "source": [ "TEXT_ENCODER = \"Although I still have fresh memories of my brother the elder Hamlet’s death, and though it was proper to mourn him throughout our kingdom, life still goes on—I think it’s wise to mourn him while also thinking about my own well being.\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from transformers import AutoModelForSeq2SeqLM, AutoTokenizer\n", "\n", "# AutoModelForSeq2SeqLM replaces the deprecated AutoModelWithLMHead for translation models\n", "model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)\n", "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n", "\n", "inputs = tokenizer.encode(TEXT_ENCODER, return_tensors=\"pt\")\n", "# max_length=40 caps the generated sequence, so a long input may yield a truncated translation\n", "outputs = model.generate(inputs, max_length=40, num_beams=4, early_stopping=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Drop special tokens such as <pad> and </s> so that only the translated text is re-tokenized below\n", "TEXT_DECODER = tokenizer.decode(outputs[0], skip_special_tokens=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "TEXT_DECODER" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "tokenizer = AutoTokenizer.from_pretrained(MODEL)\n", "model = AutoModel.from_pretrained(MODEL, output_attentions=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "encoder_input_ids = tokenizer(TEXT_ENCODER, return_tensors=\"pt\", add_special_tokens=True).input_ids\n", "decoder_input_ids = tokenizer(TEXT_DECODER, return_tensors=\"pt\", add_special_tokens=True).input_ids\n", "\n", "outputs = model(input_ids=encoder_input_ids, decoder_input_ids=decoder_input_ids)\n", "\n", "encoder_text = tokenizer.convert_ids_to_tokens(encoder_input_ids[0])\n", "decoder_text = tokenizer.convert_ids_to_tokens(decoder_input_ids[0])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "head_view(\n", " encoder_attention=outputs.encoder_attentions,\n", " decoder_attention=outputs.decoder_attentions,\n", " 
cross_attention=outputs.cross_attentions,\n", " encoder_tokens=encoder_text,\n", " decoder_tokens=decoder_text\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EXAMPLE: GPT3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### HOMEWORK - POLEVAL" ] } ], "metadata": { "author": "Jakub Pokrywka", "email": "kubapok@wmi.amu.edu.pl", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "lang": "pl", "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" }, "subtitle": "13.Transformery 2[ćwiczenia]", "title": "Ekstrakcja informacji", "year": "2021" }, "nbformat": 4, "nbformat_minor": 4 }