KWT-2024/lab/lab_08.ipynb
Marek Susniak e580651f9f Lab vol.2
2024-04-23 23:54:26 +02:00

{"cells":[{"cell_type":"markdown","id":"improved-register","metadata":{"id":"improved-register"},"source":["![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n","<div class=\"alert alert-block alert-info\">\n","<h1> Computer-Aided Translation </h1>\n","<h2> 8. <i>Using machine translation in computer-aided translation</i> [lab]</h2>\n","<h3>Rafał Jaworski (2021)</h3>\n","</div>\n","\n","![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"]},{"cell_type":"markdown","id":"hungarian-davis","metadata":{"id":"hungarian-davis"},"source":["Nowadays, machine translation is an extremely important technique in computer-aided translation. The source text is first translated fully automatically, and then a human translator corrects the result. Machine translation technology is now mature enough to deliver very high-quality output. Scenarios in which human correction amounts to an almost entirely mechanical (sic!) approval of the machine translation output are becoming increasingly common. In today's class we will learn techniques for evaluating machine translation and for checking its usefulness in supporting human translation."]},{"cell_type":"markdown","id":"posted-commons","metadata":{"id":"posted-commons"},"source":["The quality of machine translation can be assessed on two independent dimensions: accuracy and fluency. Fluency is the subjective impression that the text being read is written in natural, understandable language. Machine translation systems based on deep learning with neural networks achieve a high degree of fluency. 
Unfortunately, their accuracy is not always equally high."]},{"cell_type":"markdown","id":"referenced-implement","metadata":{"id":"referenced-implement"},"source":["The accuracy of machine translation is the parameter that is easier to measure. Such measurements show what the actual quality of the machine translation is and how useful it can be in computer-aided translation."]},{"cell_type":"markdown","id":"disturbed-january","metadata":{"id":"disturbed-january"},"source":["The most commonly used technique for evaluating machine translation is the BLEU score. Computing this score requires the machine translation output and a high-quality human reference translation."]},{"cell_type":"markdown","id":"dental-combination","metadata":{"id":"dental-combination"},"source":["### Exercise 1: Implement a program that computes the BLEU score for the corpus in the data folder. Use the BLEU implementation from the nltk package. Additional technical requirement: write the program so that it does not have to unpack the corpus zip file to disk."]},{"cell_type":"code","source":["from google.colab import drive\n","drive.mount('/content/drive')"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"FpVMcKJrs2jm","executionInfo":{"status":"ok","timestamp":1713854799184,"user_tz":-120,"elapsed":19056,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"fc788f33-ab41-4b9d-db93-0c34183cdedf"},"id":"FpVMcKJrs2jm","execution_count":4,"outputs":[{"output_type":"stream","name":"stdout","text":["Mounted at /content/drive\n"]}]},{"cell_type":"code","execution_count":11,"id":"moving-clothing","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"moving-clothing","executionInfo":{"status":"ok","timestamp":1713855713156,"user_tz":-120,"elapsed":260,"user":{"displayName":"Marek 
Susniak","userId":"08092059492340254190"}},"outputId":"81e04954-55a6-4291-f102-335aa6c2487b"},"outputs":[{"output_type":"stream","name":"stderr","text":["[nltk_data] Downloading package punkt to /root/nltk_data...\n","[nltk_data] Package punkt is already up-to-date!\n"]}],"source":["import zipfile\n","import nltk\n","from nltk.translate.bleu_score import corpus_bleu\n","from nltk import word_tokenize\n","nltk.download('punkt')\n","\n","def calculate_bleu(zip_filename, hypothesis_filename, reference_filenames):\n","    # Files are read straight from the archive, so nothing is unpacked to disk.\n","    # Each file is tokenized as a whole and scored as a single segment.\n","    with zipfile.ZipFile(zip_filename, 'r') as zip_file:\n","        with zip_file.open(hypothesis_filename) as file:\n","            hypothesis = file.read().decode('utf-8')\n","            hypotheses = [word_tokenize(hypothesis)]\n","\n","        references = []\n","        for ref_filename in reference_filenames:\n","            with zip_file.open(ref_filename) as file:\n","                reference = file.read().decode('utf-8')\n","                references.append(word_tokenize(reference))\n","\n","    # corpus_bleu expects one list of references per hypothesis.\n","    return corpus_bleu([references], hypotheses)"]},{"cell_type":"code","source":["calculate_bleu(\"./drive/MyDrive/data/en-pl.txt.zip\", \"ELRC-873-Development.en-pl.pl\", [\"ELRC-873-Development.en-pl.pl\"])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"IdXqCxostV3g","executionInfo":{"status":"ok","timestamp":1713855715285,"user_tz":-120,"elapsed":247,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"8bdcb166-676f-4819-95de-14b7993a1c65"},"id":"IdXqCxostV3g","execution_count":12,"outputs":[{"output_type":"execute_result","data":{"text/plain":["1.0"]},"metadata":{},"execution_count":12}]},{"cell_type":"markdown","id":"jewish-ethics","metadata":{"id":"jewish-ethics"},"source":["### Exercise 2: Compute the BLEU score on different fragments of the sample corpus (e.g. on the first 100 sentences, sentences 500-600). 
Does the translation quality in any fragment of the corpus deviate significantly from the average?"]},{"cell_type":"code","execution_count":36,"id":"lasting-rolling","metadata":{"id":"lasting-rolling","executionInfo":{"status":"ok","timestamp":1713856858306,"user_tz":-120,"elapsed":317,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}}},"outputs":[],"source":["import io\n","import zipfile\n","\n","from nltk import word_tokenize\n","from nltk.translate.bleu_score import sentence_bleu\n","\n","def load_sentences(zip_filename, filepath):\n","    # Read the file straight from the zip archive and tokenize it line by line,\n","    # so that each element of the result is one tokenized sentence.\n","    with zipfile.ZipFile(zip_filename, 'r') as zip_file:\n","        with zip_file.open(filepath) as file:\n","            lines = io.TextIOWrapper(file, encoding='utf-8')\n","            sentences = [word_tokenize(line) for line in lines]\n","    return sentences\n","\n","def calculate_segment_bleu(hypotheses, references, start, end):\n","    # Average sentence-level BLEU over the sentences in [start, end).\n","    segment_hypotheses = hypotheses[start:end]\n","    segment_references = references[start:end]\n","    bleu_scores = [sentence_bleu([ref], hyp) for hyp, ref in zip(segment_hypotheses, segment_references)]\n","    return sum(bleu_scores) / len(bleu_scores) if bleu_scores else 0\n","\n","\n","def analyze_bleu(zip_filename, hypothesis_file, reference_file, segments):\n","    hypotheses = load_sentences(zip_filename, hypothesis_file)\n","    references = load_sentences(zip_filename, reference_file)\n","\n","    segment_results = {}\n","    for start, end in segments:\n","        segment_bleu = calculate_segment_bleu(hypotheses, references, start, end)\n","        segment_results[f\"{start}-{end}\"] = segment_bleu\n","\n","    overall_bleu = calculate_segment_bleu(hypotheses, references, 0, len(hypotheses))\n","\n","    return segment_results, overall_bleu"]},{"cell_type":"code","source":["analyze_bleu(\"./drive/MyDrive/data/en-pl.txt.zip\", \"ELRC-873-Development.en-pl.pl\", \"ELRC-873-Development.en-pl.pl\", [(0, 100), (500, 600)])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"w4CygFrWyP2w","executionInfo":{"status":"ok","timestamp":1713856894219,"user_tz":-120,"elapsed":286,"user":{"displayName":"Marek 
Susniak","userId":"08092059492340254190"}},"outputId":"2afa61d5-3b80-4f79-b75a-705a12bc06f0"},"id":"w4CygFrWyP2w","execution_count":null,"outputs":[]},{"cell_type":"markdown","id":"listed-bikini","metadata":{"id":"listed-bikini"},"source":["Another method of evaluating machine translation quality is WER (Word Error Rate). It is defined as follows:\n","\n","$WER = \\frac{S+D+I}{N}=\\frac{S+D+I}{S+D+C}$\n","\n","where:\n"," * S - number of substitutions (of words)\n"," * D - number of deletions\n"," * I - number of insertions\n"," * C - number of correct words\n"," * N - number of words in the reference translation (N=S+D+C)"]},{"cell_type":"markdown","id":"conscious-cookbook","metadata":{"id":"conscious-cookbook"},"source":["This measure is usually used to evaluate automatic speech recognition systems, but in the context of computer-aided translation it can be interpreted as the amount of work a translator must put into correcting the machine translation."]},{"cell_type":"markdown","id":"split-palace","metadata":{"id":"split-palace"},"source":["### Exercise 3: Compute the WER for the sample corpus. 
Use an existing WER implementation."]},{"cell_type":"code","execution_count":40,"id":"occupied-swing","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"occupied-swing","executionInfo":{"status":"ok","timestamp":1713857286747,"user_tz":-120,"elapsed":16228,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"b4a26f57-d8fa-45f6-f34c-e59c20b68c54"},"outputs":[{"output_type":"stream","name":"stdout","text":["Collecting jiwer\n"," Downloading jiwer-3.0.3-py3-none-any.whl (21 kB)\n","Requirement already satisfied: click<9.0.0,>=8.1.3 in /usr/local/lib/python3.10/dist-packages (from jiwer) (8.1.7)\n","Collecting rapidfuzz<4,>=3 (from jiwer)\n"," Downloading rapidfuzz-3.8.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.4 MB)\n","\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.4/3.4 MB\u001b[0m \u001b[31m10.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n","\u001b[?25hInstalling collected packages: rapidfuzz, jiwer\n","Successfully installed jiwer-3.0.3 rapidfuzz-3.8.1\n"]}],"source":["!pip install jiwer\n","from jiwer import wer\n","\n","def calculate_wer(reference, hypothesis):\n","    return wer(reference, hypothesis)"]},{"cell_type":"code","source":["text_1 = load_sentences(\"./drive/MyDrive/data/en-pl.txt.zip\", \"ELRC-873-Development.en-pl.pl\")[0]\n","text_2 = load_sentences(\"./drive/MyDrive/data/en-pl.txt.zip\", \"ELRC-873-Development.en-pl.pl\")[0]\n","\n","# Join the tokens back into one string so that jiwer scores it as a single sentence.\n","calculate_wer(' '.join(text_1), ' '.join(text_2))"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"oeKv9Quj2glp","executionInfo":{"status":"ok","timestamp":1713857370916,"user_tz":-120,"elapsed":247,"user":{"displayName":"Marek 
Susniak","userId":"08092059492340254190"}},"outputId":"02407776-dadc-4500-baf1-ec10ad3c5221"},"id":"oeKv9Quj2glp","execution_count":42,"outputs":[{"output_type":"execute_result","data":{"text/plain":["0.0"]},"metadata":{},"execution_count":42}]},{"cell_type":"markdown","id":"stretch-wound","metadata":{"id":"stretch-wound"},"source":["Besides the measures described above, other measures based on comparing machine translation with human translation can also be used. Let us recall one that we used earlier."]},{"cell_type":"markdown","id":"abstract-wilderness","metadata":{"id":"abstract-wilderness"},"source":["### Exercise 4: Compute the average Levenshtein distance between sentences translated automatically and by a human. Use the implementation from lab 2:"]},{"cell_type":"code","execution_count":45,"id":"immediate-element","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"immediate-element","executionInfo":{"status":"ok","timestamp":1713908868759,"user_tz":-120,"elapsed":8495,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"e93665c3-8130-4b54-9244-baeb5bf2d892"},"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: python-Levenshtein in /usr/local/lib/python3.10/dist-packages (0.25.1)\n","Requirement already satisfied: Levenshtein==0.25.1 in /usr/local/lib/python3.10/dist-packages (from python-Levenshtein) (0.25.1)\n","Requirement already satisfied: rapidfuzz<4.0.0,>=3.8.0 in /usr/local/lib/python3.10/dist-packages (from Levenshtein==0.25.1->python-Levenshtein) (3.8.1)\n"]}],"source":["!pip install python-Levenshtein\n","\n","import Levenshtein\n","\n","def calculate_levenshtein(hypotheses, references):\n","    # Average word-level Levenshtein distance over aligned lists of sentences.\n","    total_distance = 0\n","    count = 0\n","    for hyp, ref in zip(hypotheses, references):\n","        hyp_words = hyp.split()\n","        ref_words = ref.split()\n","        total_distance += Levenshtein.distance(hyp_words, ref_words)\n","        count += 1\n","    return total_distance / 
count if count != 0 else 0"]},{"cell_type":"code","source":["calculate_levenshtein([\"To jest zdanie prawie dobrze przetlumaczone\"], [\"To jest zdanie bardzo dobrze przetlumaczone\"])"],"metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"cPQGtati7U6w","executionInfo":{"status":"ok","timestamp":1713908907879,"user_tz":-120,"elapsed":269,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"54741398-ad90-4690-f554-19f48467061a"},"id":"cPQGtati7U6w","execution_count":46,"outputs":[{"output_type":"execute_result","data":{"text/plain":["1.0"]},"metadata":{},"execution_count":46}]},{"cell_type":"markdown","id":"filled-burton","metadata":{"id":"filled-burton"},"source":["Now let us check one more thing. The sample corpus data also contains the English source text. In theory, a good German translation should contain as many dictionary equivalents of the English source words as possible. Let us run the following experiment:"]},{"cell_type":"markdown","id":"grateful-recruitment","metadata":{"id":"grateful-recruitment"},"source":["### Exercise 5: For each triple of sentences from the sample corpus, perform the following steps:\n"," * Translate each English word into German using the PyDictionary module.\n"," * Check which of the German translations contains more of these translated words - the automatic one or the human one.\n","Then print summary statistics. 
Which translation contains more dictionary translations of the source words?"]},{"cell_type":"code","execution_count":2,"id":"descending-easter","metadata":{"colab":{"base_uri":"https://localhost:8080/"},"id":"descending-easter","executionInfo":{"status":"ok","timestamp":1713909180065,"user_tz":-120,"elapsed":6695,"user":{"displayName":"Marek Susniak","userId":"08092059492340254190"}},"outputId":"9220a8d7-a685-451f-da0f-ba4d2a3af737"},"outputs":[{"output_type":"stream","name":"stdout","text":["Requirement already satisfied: PyDictionary in /usr/local/lib/python3.10/dist-packages (2.0.1)\n","Requirement already satisfied: bs4 in /usr/local/lib/python3.10/dist-packages (from PyDictionary) (0.0.2)\n","Requirement already satisfied: click in /usr/local/lib/python3.10/dist-packages (from PyDictionary) (8.1.7)\n","Requirement already satisfied: goslate in /usr/local/lib/python3.10/dist-packages (from PyDictionary) (1.5.4)\n","Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from PyDictionary) (2.31.0)\n","Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.10/dist-packages (from bs4->PyDictionary) (4.12.3)\n","Requirement already satisfied: futures in /usr/local/lib/python3.10/dist-packages (from goslate->PyDictionary) (3.0.5)\n","Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->PyDictionary) (3.3.2)\n","Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->PyDictionary) (3.7)\n","Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->PyDictionary) (2.0.7)\n","Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->PyDictionary) (2024.2.2)\n","Requirement already satisfied: soupsieve>1.2 in /usr/local/lib/python3.10/dist-packages (from beautifulsoup4->bs4->PyDictionary) (2.5)\n"]}],"source":["!pip 
install PyDictionary\n","\n","from PyDictionary import PyDictionary\n","import re\n","\n","def load_sentences_from_file(filepath):\n","    # Read a plain-text file, one sentence per line; blank lines are kept\n","    # so that the three files stay aligned line by line.\n","    with open(filepath, 'r', encoding='utf-8') as file:\n","        sentences = [line.strip() for line in file]\n","    return sentences\n","\n","def translate_words(words, dictionary, target_language='de'):\n","    # Map each source word to its dictionary translation.\n","    # Assumption: the translator backend expects a language code ('de' for German).\n","    translations = {}\n","    for word in words:\n","        try:\n","            translation = dictionary.translate(word, target_language)\n","            if translation:\n","                translations[word] = translation\n","        except Exception as e:\n","            print(f\"Error translating {word}: {e}\")\n","    return translations\n","\n","def count_matches(translations, sentence):\n","    # Count the dictionary translations of source words that occur in the sentence.\n","    words = set(re.findall(r'\\w+', sentence.lower()))\n","    matches = sum(1 for translation in translations.values() if translation.lower() in words)\n","    return matches\n","\n","def analyze_translations(source_file, mt_file, human_file):\n","    dictionary = PyDictionary()\n","    source_sentences = load_sentences_from_file(source_file)\n","    mt_sentences = load_sentences_from_file(mt_file)\n","    human_sentences = load_sentences_from_file(human_file)\n","\n","    mt_matches_total = 0\n","    human_matches_total = 0\n","\n","    for source, mt, human in zip(source_sentences, mt_sentences, human_sentences):\n","        source_words = re.findall(r'\\w+', source.lower())\n","        translations = translate_words(source_words, dictionary)\n","\n","        mt_matches = count_matches(translations, mt)\n","        human_matches = count_matches(translations, human)\n","\n","        mt_matches_total += mt_matches\n","        human_matches_total += human_matches\n","\n","        print(f\"MT matches: {mt_matches}, Human matches: {human_matches}\")\n","\n","    print(f\"Total MT matches: {mt_matches_total}\")\n","    print(f\"Total Human matches: {human_matches_total}\")"]}],"metadata":{"author":"Rafał Jaworski","email":"rjawor@amu.edu.pl","lang":"pl","subtitle":"8. 
Wykorzystanie tłumaczenia automatycznego we wspomaganiu tłumaczenia","title":"Komputerowe wspomaganie tłumaczenia","year":"2021","kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.10"},"colab":{"provenance":[]}},"nbformat":4,"nbformat_minor":5}