KWT-2024/lab/lab_03.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "coastal-lincoln",
   "metadata": {},
   "source": [
    "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
    "<div class=\"alert alert-block alert-info\">\n",
    "<h1> Computer-Assisted Translation </h1>\n",
    "<h2> 3. <i>Terminology</i> [lab]</h2>\n",
    "<h3>Rafał Jaworski (2021)</h3>\n",
    "</div>\n",
    "\n",
    "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aggregate-listing",
   "metadata": {},
   "source": [
    "In today's class we take a closer look at the dictionaries used in computer-assisted translation. Many electronic dictionaries are available on the market, and a lot of them are ready to use in SDL Trados, memoQ, and other CAT tools. They contain hundreds of thousands or even millions of entries and give the translator instant help."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "israeli-excuse",
   "metadata": {},
   "source": [
    "The problem is that they often lack the right specialized terminology, namely the terminology used by the client who orders the translation. Specialized terms are very common in translated texts for the following reasons:\n",
    "- General-topic texts are translated fairly rarely (nobody translates holiday postcards...)\n",
    "- The same word can have both a general meaning and a highly specialized one (e.g. \"inheritance\" in a legal or a programming context)\n",
    "- The client uses names or words of their own invention, e.g. for marketing purposes."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "reflected-enforcement",
   "metadata": {},
   "source": [
    "Two tasks therefore become nontrivial: finding a specialized term in the source text and providing the correct translation of that term into the target language."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "statutory-florist",
   "metadata": {},
   "source": [
    "Sounds simple? Let's try to perform the second operation by hand."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "danish-anchor",
   "metadata": {},
   "source": [
    "### Exercise 1: Translate the term \"prowadnice szaf metalowych\" into English. Describe which tools you used."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "diverse-sunglasses",
   "metadata": {},
   "source": [
    "Answer: metal cabinet guides or metal cabinet slides. I used two dictionaries and a large language model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "limited-waterproof",
   "metadata": {},
   "source": [
    "In the following exercises, however, we will focus on finding specialized terms in a text. This requires two operations:\n",
    "1. Preparing a specialized dictionary.\n",
    "2. Detecting terminology using the prepared specialized dictionary."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "literary-blues",
   "metadata": {},
   "source": [
    "Let's start with step 2 (since it is simpler). Consider the following text:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 102,
   "id": "loving-prince",
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \" For all Java programmers:\"\n",
    "text += \" This section explains how to compile and run a Swing application from the command line.\"\n",
    "text += \" For information on compiling and running a Swing application using NetBeans IDE,\"\n",
    "text += \" see Running Tutorial Examples in NetBeans IDE. The compilation instructions work for all Swing programs\"\n",
    "text += \" — applets, as well as applications. Here are the steps you need to follow:\"\n",
    "text += \" Install the latest release of the Java SE platform, if you haven't already done so.\"\n",
    "text += \" Create a program that uses Swing components. Compile the program. Run the program.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "extreme-cycling",
   "metadata": {},
   "source": [
    "Suppose we have the following dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 103,
   "id": "bound-auction",
   "metadata": {},
   "outputs": [],
   "source": [
    "dictionary = ['program', 'application', 'applet', 'compile']"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "other-trinidad",
   "metadata": {},
   "source": [
    "### Exercise 2: Write a program that prints the positions of all occurrences of each specialized term. For each term, print a list of pairs (start_position, end_position)."
   ]
  },
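  {
   "cell_type": "code",
   "execution_count": 104,
   "id": "cognitive-cedar",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def terminology_lookup(txt, labels):\n",
    "    results = []\n",
    "\n",
    "    for label in labels:\n",
    "        # Escape the label so regex metacharacters are matched literally\n",
    "        pattern = re.escape(label)\n",
    "        results.append((\n",
    "            label,\n",
    "            # m.end() is exclusive, so subtract 1 to report an inclusive end position\n",
    "            [(m.start(), m.end() - 1) for m in re.finditer(pattern, txt)]\n",
    "        ))\n",
    "\n",
    "    return results"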
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "id": "7cc3ad1f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('program', [(14, 20), (291, 297), (468, 474), (516, 522), (533, 539)]),\n",
       " ('application', [(80, 90), (164, 174), (322, 332)]),\n",
       " ('applet', [(302, 307)]),\n",
       " ('compile', [(56, 62)])]"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "terminology_lookup(text, dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "interior-things",
   "metadata": {},
   "source": [
    "Plain text search has some drawbacks. For example, when searching for the word \"program\" we accidentally caught the word \"programmer\". We also caught the word \"programs\", which is correct, but we reported its position incorrectly: the span covers only \"program\", not the whole word."
   ]
  },
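  {
   "cell_type": "markdown",
   "id": "whole-word-note",
   "metadata": {},
   "source": [
    "A word-boundary pattern is one way to address the first drawback. The sketch below is not part of the original lab: the helper name `lookup_whole_words` and the `sample` string are made up for illustration. It anchors each term with `\\b`, so \"program\" no longer matches inside \"programmers\". Note that it also stops matching \"programs\", so it does not handle inflected forms, and that it reports end positions as exclusive offsets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "whole-word-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical helper, not part of the original lab: match whole words only.\n",
    "def lookup_whole_words(txt, labels):\n",
    "    results = {}\n",
    "    for label in labels:\n",
    "        # \\b anchors the match at word boundaries; re.escape keeps the label literal\n",
    "        pattern = r\"\\b\" + re.escape(label) + r\"\\b\"\n",
    "        # Report (start, end) pairs with an exclusive end offset\n",
    "        results[label] = [(m.start(), m.end()) for m in re.finditer(pattern, txt)]\n",
    "    return results\n",
    "\n",
    "sample = \"Java programmers write programs. Run the program.\"\n",
    "lookup_whole_words(sample, [\"program\"])"
   ]
  },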
  {
   "cell_type": "markdown",
   "id": "aggressive-plane",
   "metadata": {},
   "source": [
    "To deal with these problems, we need natural language processing techniques. Let's try the spaCy package:\n",
    "\n",
    "`pip3 install spacy`\n",
    "\n",
    "and\n",
    "\n",
    "`python3 -m spacy download en_core_web_sm`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "id": "tribal-attention",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    0\n",
      "For for 1\n",
      "all all 5\n",
      "Java Java 9\n",
      "programmers programmer 14\n",
      ": : 25\n",
      "This this 27\n",
      "section section 32\n",
      "explains explain 40\n",
      "how how 49\n",
      "to to 53\n",
      "compile compile 56\n",
      "and and 64\n",
      "run run 68\n",
      "a a 72\n",
      "Swing swing 74\n",
      "application application 80\n",
      "from from 92\n",
      "the the 97\n",
      "command command 101\n",
      "line line 109\n",
      ". . 113\n",
      "For for 115\n",
      "information information 119\n",
      "on on 131\n",
      "compiling compile 134\n",
      "and and 144\n",
      "running run 148\n",
      "a a 156\n",
      "Swing swing 158\n",
      "application application 164\n",
      "using use 176\n",
      "NetBeans NetBeans 182\n",
      "IDE IDE 191\n",
      ", , 194\n",
      "see see 196\n",
      "Running run 200\n",
      "Tutorial Tutorial 208\n",
      "Examples Examples 217\n",
      "in in 226\n",
      "NetBeans NetBeans 229\n",
      "IDE IDE 238\n",
      ". . 241\n",
      "The the 243\n",
      "compilation compilation 247\n",
      "instructions instruction 259\n",
      "work work 272\n",
      "for for 277\n",
      "all all 281\n",
      "Swing Swing 285\n",
      "programs program 291\n",
      "— — 300\n",
      "applets applet 302\n",
      ", , 309\n",
      "as as 311\n",
      "well well 314\n",
      "as as 319\n",
      "applications application 322\n",
      ". . 334\n",
      "Here here 336\n",
      "are be 341\n",
      "the the 345\n",
      "steps step 349\n",
      "you you 355\n",
      "need need 359\n",
      "to to 364\n",
      "follow follow 367\n",
      ": : 373\n",
      "Install install 375\n",
      "the the 383\n",
      "latest late 387\n",
      "release release 394\n",
      "of of 402\n",
      "the the 405\n",
      "Java Java 409\n",
      "SE SE 414\n",
      "platform platform 417\n",
      ", , 425\n",
      "if if 427\n",
      "you you 430\n",
      "have have 434\n",
      "n't not 438\n",
      "already already 442\n",
      "done do 450\n",
      "so so 455\n",
      ". . 457\n",
      "Create create 459\n",
      "a a 466\n",
      "program program 468\n",
      "that that 476\n",
      "uses use 481\n",
      "Swing swing 486\n",
      "components component 492\n",
      ". . 502\n",
      "Compile compile 504\n",
      "the the 512\n",
      "program program 516\n",
      ". . 523\n",
      "Run run 525\n",
      "the the 529\n",
      "program program 533\n",
      ". . 540\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "\n",
    "doc = nlp(text)\n",
    "\n",
    "for token in doc:\n",
    "    print(token, token.lemma_, token.idx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "regional-craft",
   "metadata": {},
   "source": [
    "Success! The text has been split into words (tokenization) and each word has been reduced to its base form (lemmatization)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "toxic-subsection",
   "metadata": {},
   "source": [
    "### Exercise 3: Modify the program from Exercise 2 so that it also returns inflected words. For example, for the word \"program\" it should also find \"programs\", with the positions in the text set for the word \"programs\". Use the token's idx attribute."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "id": "surgical-demonstration",
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy\n",
    "nlp = spacy.load(\"en_core_web_sm\")\n",
    "\n",
    "\n",
    "def terminology_lookup(txt, labels):\n",
    "    result = {}\n",
    "    doc = nlp(txt)\n",
    "\n",
    "    for token in doc:\n",
    "        # Compare lemmas so that inflected forms such as 'programs' match 'program'\n",
    "        if token.lemma_ in labels:\n",
    "            # token.idx is the start offset; the end offset here is exclusive\n",
    "            result.setdefault(token.lemma_, []).append((token.idx, token.idx + len(token)))\n",
    "\n",
    "    return result"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "id": "4772c1b1",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'compile': [(56, 63), (134, 143), (504, 511)],\n",
       " 'application': [(80, 91), (164, 175), (322, 334)],\n",
       " 'program': [(291, 299), (468, 475), (516, 523), (533, 540)],\n",
       " 'applet': [(302, 309)]}"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "terminology_lookup(text, dictionary)"
   ]
  },
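  {
   "cell_type": "markdown",
   "id": "span-check-note",
   "metadata": {},
   "source": [
    "Position pairs like the ones above can be checked by slicing the text. The sketch below is an illustration only: `surface_forms` and `sample` are hypothetical names, and it assumes pairs of the form (start, end) with an exclusive end, as produced by `token.idx + len(token)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "span-check-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical helper, not part of the original lab:\n",
    "# recover the matched words from (start, end) pairs with an exclusive end.\n",
    "def surface_forms(txt, spans):\n",
    "    return [txt[start:end] for start, end in spans]\n",
    "\n",
    "sample = \" all Swing programs\"\n",
    "surface_forms(sample, [(11, 19)])"
   ]
  },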
  {
   "cell_type": "markdown",
   "id": "straight-letter",
   "metadata": {},
   "source": [
    "Now it is time to tackle the problem of preparing a specialized dictionary. To do this, we will write our own terminology extractor. Its input is a text containing specialized terminology; its output is a list of terms."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "nearby-frontier",
   "metadata": {},
   "source": [
    "Let's adopt the following approach: the specialized terms are the nouns that occur most frequently in the text. Let's take the first step:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "harmful-lightning",
   "metadata": {},
   "source": [
    "### Exercise 4: Print all nouns from the text. Use the capabilities of spaCy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "id": "superb-butterfly",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_nouns(text):\n",
    "    doc = nlp(text)\n",
    "    return [token.lemma_ for token in doc if token.pos_ == 'NOUN']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "id": "3c916a3e",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['programmer',\n",
       " 'section',\n",
       " 'swing',\n",
       " 'application',\n",
       " 'command',\n",
       " 'line',\n",
       " 'information',\n",
       " 'swing',\n",
       " 'application',\n",
       " 'compilation',\n",
       " 'instruction',\n",
       " 'program',\n",
       " 'applet',\n",
       " 'application',\n",
       " 'step',\n",
       " 'release',\n",
       " 'platform',\n",
       " 'program',\n",
       " 'swing',\n",
       " 'component',\n",
       " 'program',\n",
       " 'program']"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "get_nouns(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "musical-creator",
   "metadata": {},
   "source": [
    "Now it is time to count the occurrences of each noun. Note that different forms of the same word are counted together as occurrences of that word (e.g. \"program\" and \"programs\"). The most convenient way to count is to use a so-called tally. It is a dictionary whose key is the word in its base form and whose value is the number of occurrences of that word, including inflected forms. An example of a finished tally:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "id": "acting-tolerance",
   "metadata": {},
   "outputs": [],
   "source": [
    "tally = {\"program\" : 4, \"component\" : 1}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "vanilla-estimate",
   "metadata": {},
   "source": [
    "### Exercise 5: Write a program that extracts terminology from a text according to the guidelines above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "id": "eight-redhead",
   "metadata": {},
   "outputs": [],
   "source": [
    "def count_words(words):\n",
    "    word_count = {}\n",
    "    for word in words:\n",
    "        if word in word_count:\n",
    "            word_count[word] += 1\n",
    "        else:\n",
    "            word_count[word] = 1\n",
    "    return word_count\n",
    "\n",
    "def extract_terms(text):\n",
    "    return count_words(get_nouns(text))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "id": "374550d8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'programmer': 1,\n",
       " 'section': 1,\n",
       " 'swing': 3,\n",
       " 'application': 3,\n",
       " 'command': 1,\n",
       " 'line': 1,\n",
       " 'information': 1,\n",
       " 'compilation': 1,\n",
       " 'instruction': 1,\n",
       " 'program': 4,\n",
       " 'applet': 1,\n",
       " 'step': 1,\n",
       " 'release': 1,\n",
       " 'platform': 1,\n",
       " 'component': 1}"
      ]
     },
     "execution_count": 113,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_terms(text)"
   ]
  },
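  {
   "cell_type": "markdown",
   "id": "top-terms-note",
   "metadata": {},
   "source": [
    "The approach adopted earlier treats the most frequent nouns as terms, so a tally is usually followed by a ranking step. A minimal sketch, assuming the lemmas have already been extracted: `top_terms` and `sample_nouns` are hypothetical names, and `collections.Counter` from the standard library stands in for a hand-written counting loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "top-terms-sketch",
   "metadata": {},
   "outputs": [],
   "source": [
    "from collections import Counter\n",
    "\n",
    "# Hypothetical helper, not part of the original lab:\n",
    "# keep only the n most frequent lemmas from an already lemmatized list.\n",
    "def top_terms(lemmas, n=3):\n",
    "    # Counter builds the tally; most_common sorts it by descending count\n",
    "    return Counter(lemmas).most_common(n)\n",
    "\n",
    "sample_nouns = [\"program\", \"swing\", \"application\", \"program\", \"applet\", \"program\", \"swing\"]\n",
    "top_terms(sample_nouns, n=2)"
   ]
  },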
  {
   "cell_type": "markdown",
   "id": "loaded-smell",
   "metadata": {},
   "source": [
    "### Exercise 6: Extend the program above to also extract verbs and adjectives."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "id": "monetary-mambo",
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_verbs(text):\n",
    "    doc = nlp(text)\n",
    "    return [token.lemma_ for token in doc if token.pos_ == 'VERB']\n",
    "\n",
    "def get_adjectives(text):\n",
    "    doc = nlp(text)\n",
    "    return [token.lemma_ for token in doc if token.pos_ == 'ADJ']\n",
    "\n",
    "def extract_terms(text):\n",
    "    return {\n",
    "        \"nouns\": get_nouns(text),\n",
    "        \"verbs\": get_verbs(text),\n",
    "        \"adjectives\": get_adjectives(text)\n",
    "    }"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "id": "95494ac9",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'nouns': ['programmer',\n",
       "  'section',\n",
       "  'swing',\n",
       "  'application',\n",
       "  'command',\n",
       "  'line',\n",
       "  'information',\n",
       "  'swing',\n",
       "  'application',\n",
       "  'compilation',\n",
       "  'instruction',\n",
       "  'program',\n",
       "  'applet',\n",
       "  'application',\n",
       "  'step',\n",
       "  'release',\n",
       "  'platform',\n",
       "  'program',\n",
       "  'swing',\n",
       "  'component',\n",
       "  'program',\n",
       "  'program'],\n",
       " 'verbs': ['explain',\n",
       "  'compile',\n",
       "  'run',\n",
       "  'compile',\n",
       "  'run',\n",
       "  'use',\n",
       "  'see',\n",
       "  'run',\n",
       "  'work',\n",
       "  'need',\n",
       "  'follow',\n",
       "  'install',\n",
       "  'do',\n",
       "  'create',\n",
       "  'use',\n",
       "  'compile',\n",
       "  'run'],\n",
       " 'adjectives': ['late']}"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "extract_terms(text)"
   ]
  }
 ],
 "metadata": {
  "author": "Rafał Jaworski",
  "email": "rjawor@amu.edu.pl",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "lang": "pl",
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.9"
  },
  "subtitle": "3. Terminologia",
  "title": "Komputerowe wspomaganie tłumaczenia",
  "year": "2021"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
upload 2024-04-13 08:20:53 +02:00			`{`
			`"cells": [`
			`{`
			`"cell_type": "markdown",`
			`"id": "coastal-lincoln",`
			`"metadata": {},`
			`"source": [`
			`"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",`
			`"<div class=\"alert alert-block alert-info\">\n",`
			`"<h1> Komputerowe wspomaganie tłumaczenia </h1>\n",`
			`"<h2> 3. <i>Terminologia</i> [laboratoria]</h2> \n",`
			`"<h3>Rafał Jaworski (2021)</h3>\n",`
			`"</div>\n",`
			`"\n",`
			`"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "aggregate-listing",`
			`"metadata": {},`
			`"source": [`
			`"Na dzisiejszych zajęciach zajmiemy się bliżej słownikami używanymi do wspomagania tłumaczenia. Oczywiście na rynku dostępnych jest bardzo wiele słowników w formacie elektronicznym. Wiele z nich jest gotowych do użycia w SDL Trados, memoQ i innych narzędziach CAT. Zawierają one setki tysięcy lub miliony haseł i oferują natychmiastową pomoc tłumaczowi."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "israeli-excuse",`
			`"metadata": {},`
			`"source": [`
			`"Problem jednak w tym, iż często nie zawierają odpowiedniej terminologii specjalistycznej - używanej przez klienta zamawiającego tłumaczenie. Terminy specjalistyczne są bardzo częste w tekstach tłumaczonych ze względu na następujące zjawiska:\n",`
			`"- Teksty o tematyce ogólnej są tłumaczone dość rzadko (nikt nie tłumaczy pocztówek z pozdrowieniami z wakacji...)\n",`
			`"- Te same słowa mogą mieć zarówno znaczenie ogólne, jak i bardzo specjalistyczne (np. \"dziedziczenie\" w kontekście prawnym lub informatycznym)\n",`
			`"- Klient używa nazw lub słów wymyślonych przez siebie, np. na potrzeby marketingowe."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "reflected-enforcement",`
			`"metadata": {},`
			`"source": [`
			`"Nietrywialnymi zadaniami stają się: odnalezienie terminu specjalistycznego w tekście źródłowym oraz podanie prawidłowego tłumaczenia tego terminu na język docelowy"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "statutory-florist",`
			`"metadata": {},`
			`"source": [`
			`"Brzmi prosto? Spróbujmy wykonać ręcznie tę drugą operację."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "danish-anchor",`
			`"metadata": {},`
			`"source": [`
			`"### Ćwiczenie 1: Podaj tłumaczenie terminu \"prowadnice szaf metalowych\" na język angielski. Opisz, z jakich narzędzi skorzystałaś/eś."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "diverse-sunglasses",`
			`"metadata": {},`
			`"source": [`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"Odpowiedź: metal cabinet guides lub metal cabinet slides. Skorzystalem z dwoch slownikow oraz duzego modelu jezykowego."`
upload 2024-04-13 08:20:53 +02:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "limited-waterproof",`
			`"metadata": {},`
			`"source": [`
			`"W dalszych ćwiczeniach skupimy się jednak na odszukaniu terminu specjalistycznego w tekście. W tym celu będą potrzebne dwie operacje:\n",`
			`"1. Przygotowanie słownika specjalistycznego.\n",`
			`"2. Detekcja terminologii przy użyciu przygotowanego słownika specjalistycznego."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "literary-blues",`
			`"metadata": {},`
			`"source": [`
			`"Zajmijmy się najpierw krokiem nr 2 (gdyż jest prostszy). Rozważmy następujący tekst:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 102,`
upload 2024-04-13 08:20:53 +02:00			`"id": "loving-prince",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"text = \" For all Java programmers:\"\n",`
			`"text += \" This section explains how to compile and run a Swing application from the command line.\"\n",`
			`"text += \" For information on compiling and running a Swing application using NetBeans IDE,\"\n",`
			`"text += \" see Running Tutorial Examples in NetBeans IDE. The compilation instructions work for all Swing programs\"\n",`
			`"text += \" — applets, as well as applications. Here are the steps you need to follow:\"\n",`
			`"text += \" Install the latest release of the Java SE platform, if you haven't already done so.\"\n",`
			`"text += \" Create a program that uses Swing components. Compile the program. Run the program.\""`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "extreme-cycling",`
			`"metadata": {},`
			`"source": [`
			`"Załóżmy, że posiadamy następujący słownik:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 103,`
upload 2024-04-13 08:20:53 +02:00			`"id": "bound-auction",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
Laboratoria 13.04.2024 2024-04-15 19:34:20 +02:00			`"dictionary = ['program', 'application', 'applet', 'compile']"`
upload 2024-04-13 08:20:53 +02:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "other-trinidad",`
			`"metadata": {},`
			`"source": [`
			`"### Ćwiczenie 2: Napisz program, który wypisze pozycje wszystkich wystąpień poszczególnych terminów specjalistycznych. Dla każdego terminu należy wypisać listę par (pozycja_startowa, pozycja końcowa)."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 104,`
upload 2024-04-13 08:20:53 +02:00			`"id": "cognitive-cedar",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
Laboratoria 13.04.2024 2024-04-15 19:34:20 +02:00			`"import re\n",`
			`"\n",`
			`"def terminology_lookup(txt, labels):\n",`
			`" results = []\n",`
			`"\n",`
			`" for label in labels:\n",`
			`" results.append((\n",`
			`" label,\n",`
			`" [(m.start(), m.end() - 1) for m in re.finditer(label, txt)]\n",`
			`" ))\n",`
			`"\n",`
			`" return results"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 105,`
Laboratoria 13.04.2024 2024-04-15 19:34:20 +02:00			`"id": "7cc3ad1f",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"[('program', [(14, 20), (291, 297), (468, 474), (516, 522), (533, 539)]),\n",`
			`" ('application', [(80, 90), (164, 174), (322, 332)]),\n",`
			`" ('applet', [(302, 307)]),\n",`
			`" ('compile', [(56, 62)])]"`
			`]`
			`},`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 105,`
Laboratoria 13.04.2024 2024-04-15 19:34:20 +02:00			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"terminology_lookup(text, dictionary)"`
upload 2024-04-13 08:20:53 +02:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "interior-things",`
			`"metadata": {},`
			`"source": [`
			`"Zwykłe wyszukiwanie w tekście ma pewne wady. Na przykład, gdy szukaliśmy słowa \"program\", złapaliśmy przypadkiem słowo \"programmer\". Złapaliśmy także słowo \"programs\", co jest poprawne, ale niepoprawnie podaliśmy jego pozycję w tekście."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "aggressive-plane",`
			`"metadata": {},`
			`"source": [`
			`"Żeby poradzić sobie z tymi problemami, musimy wykorzystać techniki przetwarzania języka naturalnego. Wypróbujmy pakiet spaCy:\n",`
			`"\n",`
			"`pip3 install spacy`\n",
			`"\n",`
			`"oraz\n",`
			`"\n",`
			"`python3 -m spacy download en_core_web_sm`"
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 106,`
upload 2024-04-13 08:20:53 +02:00			`"id": "tribal-attention",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"name": "stdout",`
			`"output_type": "stream",`
			`"text": [`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`" 0\n",`
			`"For for 1\n",`
			`"all all 5\n",`
			`"Java Java 9\n",`
			`"programmers programmer 14\n",`
			`": : 25\n",`
			`"This this 27\n",`
			`"section section 32\n",`
			`"explains explain 40\n",`
			`"how how 49\n",`
			`"to to 53\n",`
			`"compile compile 56\n",`
			`"and and 64\n",`
			`"run run 68\n",`
			`"a a 72\n",`
			`"Swing swing 74\n",`
			`"application application 80\n",`
			`"from from 92\n",`
			`"the the 97\n",`
			`"command command 101\n",`
			`"line line 109\n",`
			`". . 113\n",`
			`"For for 115\n",`
			`"information information 119\n",`
			`"on on 131\n",`
			`"compiling compile 134\n",`
			`"and and 144\n",`
			`"running run 148\n",`
			`"a a 156\n",`
			`"Swing swing 158\n",`
			`"application application 164\n",`
			`"using use 176\n",`
			`"NetBeans NetBeans 182\n",`
			`"IDE IDE 191\n",`
			`", , 194\n",`
			`"see see 196\n",`
			`"Running run 200\n",`
			`"Tutorial Tutorial 208\n",`
			`"Examples Examples 217\n",`
			`"in in 226\n",`
			`"NetBeans NetBeans 229\n",`
			`"IDE IDE 238\n",`
			`". . 241\n",`
			`"The the 243\n",`
			`"compilation compilation 247\n",`
			`"instructions instruction 259\n",`
			`"work work 272\n",`
			`"for for 277\n",`
			`"all all 281\n",`
			`"Swing Swing 285\n",`
			`"programs program 291\n",`
			`"— — 300\n",`
			`"applets applet 302\n",`
			`", , 309\n",`
			`"as as 311\n",`
			`"well well 314\n",`
			`"as as 319\n",`
			`"applications application 322\n",`
			`". . 334\n",`
			`"Here here 336\n",`
			`"are be 341\n",`
			`"the the 345\n",`
			`"steps step 349\n",`
			`"you you 355\n",`
			`"need need 359\n",`
			`"to to 364\n",`
			`"follow follow 367\n",`
			`": : 373\n",`
			`"Install install 375\n",`
			`"the the 383\n",`
			`"latest late 387\n",`
			`"release release 394\n",`
			`"of of 402\n",`
			`"the the 405\n",`
			`"Java Java 409\n",`
			`"SE SE 414\n",`
			`"platform platform 417\n",`
			`", , 425\n",`
			`"if if 427\n",`
			`"you you 430\n",`
			`"have have 434\n",`
			`"n't not 438\n",`
			`"already already 442\n",`
			`"done do 450\n",`
			`"so so 455\n",`
			`". . 457\n",`
			`"Create create 459\n",`
			`"a a 466\n",`
			`"program program 468\n",`
			`"that that 476\n",`
			`"uses use 481\n",`
			`"Swing swing 486\n",`
			`"components component 492\n",`
			`". . 502\n",`
			`"Compile compile 504\n",`
			`"the the 512\n",`
			`"program program 516\n",`
			`". . 523\n",`
			`"Run run 525\n",`
			`"the the 529\n",`
			`"program program 533\n",`
			`". . 540\n"`
upload 2024-04-13 08:20:53 +02:00			`]`
			`}`
			`],`
			`"source": [`
			`"import spacy\n",`
			`"nlp = spacy.load(\"en_core_web_sm\")\n",`
			`"\n",`
			`"doc = nlp(text)\n",`
			`"\n",`
			`"for token in doc:\n",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`" print(token, token.lemma_, token.idx)"`
upload 2024-04-13 08:20:53 +02:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "regional-craft",`
			`"metadata": {},`
			`"source": [`
			`"Sukces! Nastąpił podział tekstu na słowa (tokenizacja) oraz sprowadzenie do formy podstawowej każdego słowa (lematyzacja)."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "toxic-subsection",`
			`"metadata": {},`
			`"source": [`
			`"### Ćwiczenie 3: Zmodyfikuj program z ćwiczenia 2 tak, aby zwracał również odmienione słowa. Na przykład, dla słowa \"program\" powinien znaleźć również \"programs\", ustawiając pozycje w tekście odpowiednio dla słowa \"programs\". Wykorzystaj właściwość idx tokenu."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 107,`
upload 2024-04-13 08:20:53 +02:00			`"id": "surgical-demonstration",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"import spacy\n",`
			`"nlp = spacy.load(\"en_core_web_sm\")\n",`
			`"\n",`
			`"\n",`
			`"def terminology_lookup(txt, labels):\n",`
			`" result = {};\n",`
			`" doc = nlp(txt)\n",`
			`"\n",`
			`" for token in doc:\n",`
			`" if token.lemma_ in labels: \n",`
			`" if token.lemma_ not in result:\n",`
			`" result[token.lemma_] = []\n",`
			`" result[token.lemma_].append((token.idx, token.idx + len(token)))\n",`
			`"\n",`
			`" return result"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 108,`
			`"id": "4772c1b1",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"{'compile': [(56, 63), (134, 143), (504, 511)],\n",`
			`" 'application': [(80, 91), (164, 175), (322, 334)],\n",`
			`" 'program': [(291, 299), (468, 475), (516, 523), (533, 540)],\n",`
			`" 'applet': [(302, 309)]}"`
			`]`
			`},`
			`"execution_count": 108,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"terminology_lookup(text, dictionary)"`
upload 2024-04-13 08:20:53 +02:00			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "straight-letter",`
			`"metadata": {},`
			`"source": [`
			`"Teraz czas zająć się problemem przygotowania słownika specjalistycznego. W tym celu napiszemy nasz własny ekstraktor terminologii. Wejściem do ekstraktora będzie tekst zawierający specjalistyczną terminologię. Wyjściem - lista terminów."`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "nearby-frontier",`
			`"metadata": {},`
			`"source": [`
			`"Przyjmijmy następujące podejście - terminami specjalistycznymi będą najcześćiej występujące rzeczowniki w tekście. Wykonajmy krok pierwszy:"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "harmful-lightning",`
			`"metadata": {},`
			`"source": [`
			`"### Ćwiczenie 4: Wypisz wszystkie rzeczowniki z tekstu. Wykorzystaj możliwości spaCy."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
Laboratoria 13.04.2024 2024-04-15 21:15:24 +02:00			`"execution_count": 109,`
upload 2024-04-13 08:20:53 +02:00			`"id": "superb-butterfly",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def get_nouns(text):\n",`
			`"    doc = nlp(text)\n",`
			`"    return [token.lemma_ for token in doc if token.pos_ == 'NOUN']"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 110,`
			`"id": "3c916a3e",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"['programmer',\n",`
			`" 'section',\n",`
			`" 'swing',\n",`
			`" 'application',\n",`
			`" 'command',\n",`
			`" 'line',\n",`
			`" 'information',\n",`
			`" 'swing',\n",`
			`" 'application',\n",`
			`" 'compilation',\n",`
			`" 'instruction',\n",`
			`" 'program',\n",`
			`" 'applet',\n",`
			`" 'application',\n",`
			`" 'step',\n",`
			`" 'release',\n",`
			`" 'platform',\n",`
			`" 'program',\n",`
			`" 'swing',\n",`
			`" 'component',\n",`
			`" 'program',\n",`
			`" 'program']"`
			`]`
			`},`
			`"execution_count": 110,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"get_nouns(text)"`
			`]`
			`},`
			`{`
			`"cell_type": "markdown",`
			`"id": "musical-creator",`
			`"metadata": {},`
			`"source": [`
			`"Now it is time to count the occurrences of the individual nouns. Note that different forms of the same word are counted together as occurrences of that word (e.g. \"program\" and \"programs\"). The most convenient way to count is to use a so-called tally (in Polish, \"zestawienie\"). This is a dictionary whose key is the base form of a word and whose value is the number of occurrences of that word, including inflected forms. An example of a finished tally:"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 111,`
			`"id": "acting-tolerance",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"tally = {\"program\" : 4, \"component\" : 1}"`
			`]`
			`},`
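A tally of this kind can also be built with the standard library. The following is a minimal sketch, assuming the lemmas have already been extracted into a plain list. The helper name `build_tally` and the sample `lemmas` list are hypothetical and not part of the lab's code.

```python
from collections import Counter


def build_tally(lemmas):
    """Return a dict mapping each lemma to its number of occurrences."""
    # Counter does the counting; dict() converts it to the plain
    # dictionary form used for the tally in this notebook.
    return dict(Counter(lemmas))


# Made-up sample data: lemmas as a lemmatizer might return them.
lemmas = ["program", "component", "program", "program", "program"]
tally = build_tally(lemmas)
print(tally)
```

Because the input consists of lemmas rather than surface forms, inflected variants such as "programs" are already merged with "program" before counting.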
			`{`
			`"cell_type": "markdown",`
			`"id": "vanilla-estimate",`
			`"metadata": {},`
			`"source": [`
			`"### Exercise 5: Write a program that extracts terminology from a text according to the guidelines above."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 112,`
			`"id": "eight-redhead",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def count_words(words):\n",`
			`"    word_count = {}\n",`
			`"    for word in words:\n",`
			`"        if word in word_count:\n",`
			`"            word_count[word] += 1\n",`
			`"        else:\n",`
			`"            word_count[word] = 1\n",`
			`"    return word_count\n",`
			`"\n",`
			`"def extract_terms(text):\n",`
			`"    return count_words(get_nouns(text))"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 113,`
			`"id": "374550d8",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"{'programmer': 1,\n",`
			`" 'section': 1,\n",`
			`" 'swing': 3,\n",`
			`" 'application': 3,\n",`
			`" 'command': 1,\n",`
			`" 'line': 1,\n",`
			`" 'information': 1,\n",`
			`" 'compilation': 1,\n",`
			`" 'instruction': 1,\n",`
			`" 'program': 4,\n",`
			`" 'applet': 1,\n",`
			`" 'step': 1,\n",`
			`" 'release': 1,\n",`
			`" 'platform': 1,\n",`
			`" 'component': 1}"`
			`]`
			`},`
			`"execution_count": 113,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"extract_terms(text)"`
			`]`
			`},`
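Since the stated approach treats the most frequent nouns as the specialized terms, a tally still has to be ranked before it yields a term list. The sketch below shows one possible way to do that. The function name `top_terms` and the cut-off parameters `n` and `min_count` are assumptions made for illustration; the exercise itself does not prescribe any thresholds.

```python
def top_terms(tally, n=3, min_count=2):
    """Return up to n (term, count) pairs, most frequent first."""
    # Keep only the terms that occur at least min_count times.
    frequent = [(term, count) for term, count in tally.items() if count >= min_count]
    # Sort by descending count, then alphabetically so that ties are ordered predictably.
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))
    return frequent[:n]


# Part of the tally produced for the sample text in this notebook.
tally = {"programmer": 1, "swing": 3, "application": 3, "program": 4, "applet": 1}
print(top_terms(tally))
```

Terms that occur only once are dropped here, which is a judgment call: in a short text a single occurrence may still be a genuine term.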
			`{`
			`"cell_type": "markdown",`
			`"id": "loaded-smell",`
			`"metadata": {},`
			`"source": [`
			`"### Exercise 6: Extend the above program to extract verbs and adjectives as well."`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 114,`
			`"id": "monetary-mambo",`
			`"metadata": {},`
			`"outputs": [],`
			`"source": [`
			`"def get_verbs(text):\n",`
			`"    doc = nlp(text)\n",`
			`"    return [token.lemma_ for token in doc if token.pos_ == 'VERB']\n",`
			`"\n",`
			`"def get_adjectives(text):\n",`
			`"    doc = nlp(text)\n",`
			`"    return [token.lemma_ for token in doc if token.pos_ == 'ADJ']\n",`
			`"\n",`
			`"def extract_terms(text):\n",`
			`"    return {\n",`
			`"        \"nouns\": get_nouns(text),\n",`
			`"        \"verbs\": get_verbs(text),\n",`
			`"        \"adjectives\": get_adjectives(text)\n",`
			`"    }"`
			`]`
			`},`
			`{`
			`"cell_type": "code",`
			`"execution_count": 115,`
			`"id": "95494ac9",`
			`"metadata": {},`
			`"outputs": [`
			`{`
			`"data": {`
			`"text/plain": [`
			`"{'nouns': ['programmer',\n",`
			`" 'section',\n",`
			`" 'swing',\n",`
			`" 'application',\n",`
			`" 'command',\n",`
			`" 'line',\n",`
			`" 'information',\n",`
			`" 'swing',\n",`
			`" 'application',\n",`
			`" 'compilation',\n",`
			`" 'instruction',\n",`
			`" 'program',\n",`
			`" 'applet',\n",`
			`" 'application',\n",`
			`" 'step',\n",`
			`" 'release',\n",`
			`" 'platform',\n",`
			`" 'program',\n",`
			`" 'swing',\n",`
			`" 'component',\n",`
			`" 'program',\n",`
			`" 'program'],\n",`
			`" 'verbs': ['explain',\n",`
			`" 'compile',\n",`
			`" 'run',\n",`
			`" 'compile',\n",`
			`" 'run',\n",`
			`" 'use',\n",`
			`" 'see',\n",`
			`" 'run',\n",`
			`" 'work',\n",`
			`" 'need',\n",`
			`" 'follow',\n",`
			`" 'install',\n",`
			`" 'do',\n",`
			`" 'create',\n",`
			`" 'use',\n",`
			`" 'compile',\n",`
			`" 'run'],\n",`
			`" 'adjectives': ['late']}"`
			`]`
			`},`
			`"execution_count": 115,`
			`"metadata": {},`
			`"output_type": "execute_result"`
			`}`
			`],`
			`"source": [`
			`"extract_terms(text)"`
			`]`
			`}`
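The result above lists the lemmas for each part of speech but does not count them. One possible way to combine the part-of-speech split with a tally is sketched below. It assumes the input is a list of `(lemma, pos)` pairs, which with spaCy could be collected as `(token.lemma_, token.pos_)` from a single `nlp(text)` call. The function name `tally_by_pos` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def tally_by_pos(tagged_lemmas, wanted=("NOUN", "VERB", "ADJ")):
    """Count lemmas separately for each wanted part-of-speech tag."""
    tallies = defaultdict(Counter)
    for lemma, pos in tagged_lemmas:
        if pos in wanted:
            tallies[pos][lemma] += 1
    # Convert to plain dictionaries for easier display and comparison.
    return {pos: dict(counter) for pos, counter in tallies.items()}


# Made-up sample of (lemma, part-of-speech) pairs.
tagged = [
    ("program", "NOUN"), ("compile", "VERB"), ("run", "VERB"),
    ("program", "NOUN"), ("late", "ADJ"), ("the", "DET"), ("run", "VERB"),
]
print(tally_by_pos(tagged))
```

Working from one list of tagged lemmas means the text only needs to be processed once, instead of once per part of speech.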
			`],`
			`"metadata": {`
			`"author": "Rafał Jaworski",`
			`"email": "rjawor@amu.edu.pl",`
			`"kernelspec": {`
			`"display_name": "Python 3",`
			`"language": "python",`
			`"name": "python3"`
			`},`
			`"lang": "pl",`
			`"language_info": {`
			`"codemirror_mode": {`
			`"name": "ipython",`
			`"version": 3`
			`},`
			`"file_extension": ".py",`
			`"mimetype": "text/x-python",`
			`"name": "python",`
			`"nbconvert_exporter": "python",`
			`"pygments_lexer": "ipython3",`
			`"version": "3.7.9"`
			`},`
			`"subtitle": "3. Terminologia",`
			`"title": "Komputerowe wspomaganie tłumaczenia",`
			`"year": "2021"`
			`},`
			`"nbformat": 4,`
			`"nbformat_minor": 5`
			`}`