{ "cells": [ { "cell_type": "markdown", "id": "coastal-lincoln", "metadata": {}, "source": [ "![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n", "\n", "<h1>Computer-aided translation</h1>\n", "\n", "<h2>3. Terminology [lab]</h2>\n", "\n", "<h3>Rafał Jaworski (2021)</h3>
\n", "\n", "![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)" ] }, { "cell_type": "markdown", "id": "aggregate-listing", "metadata": {}, "source": [ "```python\n", "import collections\n", "lista1 = [3,4,5,4,4,7,8,7]\n", "lista2 = [3,4,5,4,4,7,8,7]\n", "print((collections.Counter(lista1) + collections.Counter(lista2)).most_common(5))\n", "```\n", "\n", "In today's class we will take a closer look at the dictionaries used in computer-aided translation. There are, of course, many dictionaries available on the market in electronic form. Many of them are ready to use in SDL Trados, memoQ and other CAT tools. They contain hundreds of thousands or even millions of entries and offer the translator instant help." ] }, { "cell_type": "markdown", "id": "israeli-excuse", "metadata": {}, "source": [ "The problem, however, is that they often lack the specialized terminology used by the client who ordered the translation. Specialized terms are very common in translated texts due to the following phenomena:\n", "- Texts on general topics are translated rather rarely (nobody translates holiday postcards...)\n", "- The same words can have both a general and a highly specialized meaning (e.g. \"inheritance\" in a legal or a programming context)\n", "- The client uses names or words of their own invention, e.g. for marketing purposes." ] }, { "cell_type": "markdown", "id": "reflected-enforcement", "metadata": {}, "source": [ "Two tasks thus become non-trivial: finding a specialized term in the source text, and providing the correct translation of that term in the target language." ] }, { "cell_type": "markdown", "id": "statutory-florist", "metadata": {}, "source": [ "Sounds simple? Let us try to perform the second operation by hand." 
] }, { "cell_type": "markdown", "id": "danish-anchor", "metadata": {}, "source": [ "### Exercise 1: Give the English translation of the term \"prowadnice szaf metalowych\". Describe which tools you used." ] }, { "cell_type": "markdown", "id": "diverse-sunglasses", "metadata": {}, "source": [ "### Answer:\n", "- **DeepL:** metal cabinet slides / metal cabinet guides\n", "- **GPT-3.5:** metal cabinet slides / metal wardrobe rails\n", "- **GPT-4:** guides for metal cabinets / metal cabinet guides\n", "- **Google Translate:** metal cabinet guides\n", "- **www.tlumaczangielskopolski.pl:** metal cabinet guides\n" ] }, { "cell_type": "markdown", "id": "limited-waterproof", "metadata": {}, "source": [ "In the following exercises, however, we will focus on finding a specialized term in a text. This requires two operations:\n", "1. Preparing a specialized dictionary.\n", "2. Detecting terminology with the help of that dictionary." ] }, { "cell_type": "markdown", "id": "literary-blues", "metadata": {}, "source": [ "Let us start with step 2, as it is the simpler one. Consider the following text:" ] }, { "cell_type": "code", "execution_count": 70, "id": "loving-prince", "metadata": {}, "outputs": [], "source": [ "text = \" For all Java programmers:\"\n", "text += \" This section explains how to compile and run a Swing application from the command line.\"\n", "text += \" For information on compiling and running a Swing application using NetBeans IDE,\"\n", "text += \" see Running Tutorial Examples in NetBeans IDE. The compilation instructions work for all Swing programs\"\n", "text += \" — applets, as well as applications. Here are the steps you need to follow:\"\n", "text += \" Install the latest release of the Java SE platform, if you haven't already done so.\"\n", "text += \" Create a program that uses Swing components. Compile the program. 
Run the program.\"" ] }, { "cell_type": "markdown", "id": "extreme-cycling", "metadata": {}, "source": [ "Suppose we have the following dictionary:" ] }, { "cell_type": "code", "execution_count": 71, "id": "bound-auction", "metadata": {}, "outputs": [], "source": [ "dictionary = ['program', 'application', 'applet', 'compile']" ] }, { "cell_type": "markdown", "id": "other-trinidad", "metadata": {}, "source": [ "### Exercise 2: Write a program that prints the positions of all occurrences of the individual specialized terms. For each term, print a list of pairs (start_position, end_position)." ] }, { "cell_type": "code", "execution_count": 76, "id": "cognitive-cedar", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'applet': [(302, 308)],\n", " 'application': [(80, 91), (164, 175), (322, 333)],\n", " 'compile': [(56, 63), (504, 511)],\n", " 'program': [(14, 21), (291, 298), (468, 475), (516, 523), (533, 540)]}\n" ] } ], "source": [ "import re\n", "from pprint import pprint\n", "\n", "def terminology_lookup():\n", "    answer = {pattern: [] for pattern in dictionary}\n", "    low_text = text.lower()\n", "    for pattern in dictionary:\n", "        # re.finditer yields all non-overlapping matches with absolute positions\n", "        for match in re.finditer(pattern, low_text):\n", "            answer[pattern].append((match.start(), match.end()))\n", "    return answer\n", "\n", "pprint(terminology_lookup())" ] }, { "cell_type": "markdown", "id": "interior-things", "metadata": {}, "source": [ "Plain text search has its drawbacks. For example, when we searched for the word \"program\", we accidentally caught the word \"programmers\". We also caught the word \"programs\", which is correct, but the reported span covers only the \"program\" prefix rather than the whole inflected word, and the form \"compiling\" was not found at all." 
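, "\n", "Before reaching for NLP tools, let us check what plain regular expressions can and cannot fix here. A minimal, self-contained sketch (the sample sentence below is made up for illustration, it is not the `text` variable from above): a word-boundary anchor removes the false hit inside *programmers*, but it then also misses the inflected form *programs*, which is exactly why we will need lemmatization.\n", "\n", "```python
import re

# Hypothetical sample sentence (not the tutorial text above).
sample = ' For all Java programmers: compile and run Swing programs.'

# Whole-word search: no false hit inside 'programmers',
# but the inflected form 'programs' is missed as well.
whole_word = [(m.start(), m.end()) for m in re.finditer(r'\bprogram\b', sample.lower())]

# Plain substring search: fires inside both 'programmers' and 'programs'.
substring = [(m.start(), m.end()) for m in re.finditer('program', sample.lower())]

print(whole_word)  # []
print(substring)   # [(14, 21), (49, 56)]
```"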
] }, { "cell_type": "markdown", "id": "aggressive-plane", "metadata": {}, "source": [ "To deal with these problems, we need natural language processing techniques. Let us try the spaCy package:\n", "\n", "`pip3 install spacy`\n", "\n", "and\n", "\n", "`python3 -m spacy download en_core_web_sm`" ] }, { "cell_type": "code", "execution_count": 1, "id": "02e1c16f-be37-4a64-a514-8875b393ccb7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Defaulting to user installation because normal site-packages is not writeable\n", "Requirement already satisfied: spacy in /usr/local/lib/python3.9/dist-packages (3.4.1)\n" ] } ], "source": [ "!pip3 install spacy" ] }, { "cell_type": "code", "execution_count": null, "id": "f6d7e9f5-4d6f-49c5-8dea-9957bc6da318", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Collecting en-core-web-sm==3.4.1\n", "Successfully installed en-core-web-sm-3.4.1\n", "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", "You can now load the package via spacy.load('en_core_web_sm')\n" ] } ], "source": [ "!python3 -m spacy download en_core_web_sm" ] }, { "cell_type": "code", "execution_count": 15, "id": "tribal-attention", "metadata": {}, "outputs": [ {
"name": "stdout", "output_type": "stream", "text": [ " \n", "for\n", "all\n", "Java\n", "programmer\n", ":\n", "this\n", "section\n", "explain\n", "how\n", "to\n", "compile\n", "and\n", "run\n", "a\n", "swing\n", "application\n", "from\n", "the\n", "command\n", "line\n", ".\n", "for\n", "information\n", "on\n", "compile\n", "and\n", "run\n", "a\n", "swing\n", "application\n", "use\n", "NetBeans\n", "IDE\n", ",\n", "see\n", "Running\n", "Tutorial\n", "Examples\n", "in\n", "NetBeans\n", "IDE\n", ".\n", "the\n", "compilation\n", "instruction\n", "work\n", "for\n", "all\n", "swing\n", "program\n", "—\n", "applet\n", ",\n", "as\n", "well\n", "as\n", "application\n", ".\n", "here\n", "be\n", "the\n", "step\n", "you\n", "need\n", "to\n", "follow\n", ":\n", "install\n", "the\n", "late\n", "release\n", "of\n", "the\n", "Java\n", "SE\n", "platform\n", ",\n", "if\n", "you\n", "have\n", "not\n", "already\n", "do\n", "so\n", ".\n", "create\n", "a\n", "program\n", "that\n", "use\n", "swing\n", "component\n", ".\n", "compile\n", "the\n", "program\n", ".\n", "run\n", "the\n", "program\n", ".\n" ] } ], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "\n", "doc = nlp(text)\n", "\n", "for token in doc:\n", "    print(token.lemma_)" ] }, { "cell_type": "markdown", "id": "regional-craft", "metadata": {}, "source": [ "Success! The text has been split into words (tokenization) and each word has been reduced to its base form (lemmatization)." ] }, { "cell_type": "markdown", "id": "toxic-subsection", "metadata": {}, "source": [ "### Exercise 3: Modify the program from Exercise 2 so that it also finds inflected words. For example, for the word \"program\" it should also find \"programs\", setting the text positions accordingly for the word \"programs\". Use the token's idx property."
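, "\n", "As a hint, here is a tiny sketch of the relevant spaCy token attributes (assuming the `en_core_web_sm` model is installed): `token.idx` is the character offset of the token within the original string, and `token.lemma_` is its base form.\n", "\n", "```python
import spacy

nlp = spacy.load('en_core_web_sm')

# token.idx = character offset of the token in the original string;
# token.lemma_ = the token's base form.
for token in nlp('Compile the programs.'):
    print(token.idx, token.text, token.lemma_)
```"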
] }, { "cell_type": "code", "execution_count": 20, "id": "surgical-demonstration", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'program': [(14, 25), (291, 299), (468, 475), (516, 523), (533, 540)],\n", " 'application': [(80, 91), (164, 175), (322, 334)],\n", " 'applet': [(302, 309)],\n", " 'compile': [(56, 63), (134, 143), (504, 511)]}" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def terminology_lookup():\n", "    # doc and dictionary come from the cells above\n", "    answer = {pattern: [] for pattern in dictionary}\n", "\n", "    for pattern in dictionary:\n", "        for token in doc:\n", "            if pattern in token.lemma_:\n", "                # token.idx is the character offset of the token in the text;\n", "                # len(token.text) makes the span cover the whole inflected form\n", "                answer[pattern].append((token.idx, token.idx + len(token.text)))\n", "    return answer\n", "\n", "terminology_lookup()" ] }, { "cell_type": "markdown", "id": "straight-letter", "metadata": {}, "source": [ "Now it is time to deal with the problem of preparing the specialized dictionary itself. To this end, we will write our own terminology extractor. Its input will be a text containing specialized terminology; its output, a list of terms." ] }, { "cell_type": "markdown", "id": "nearby-frontier", "metadata": {}, "source": [ "Let us take the following approach: the specialized terms will be the most frequent nouns in the text. Let us carry out the first step:" ] }, { "cell_type": "markdown", "id": "harmful-lightning", "metadata": {}, "source": [ "### Exercise 4: Print all the nouns from the text. Use the capabilities of spaCy."
] }, { "cell_type": "code", "execution_count": 73, "id": "superb-butterfly", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['programmers',\n", " 'section',\n", " 'Swing',\n", " 'application',\n", " 'command',\n", " 'line',\n", " 'information',\n", " 'Swing',\n", " 'application',\n", " 'compilation',\n", " 'instructions',\n", " 'Swing',\n", " 'programs',\n", " 'applets',\n", " 'applications',\n", " 'steps',\n", " 'release',\n", " 'platform',\n", " 'program',\n", " 'Swing',\n", " 'components',\n", " 'program',\n", " 'program']" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import spacy\n", "\n", "def get_nouns(text):\n", "    nlp = spacy.load(\"en_core_web_sm\")\n", "    doc = nlp(text)\n", "    # token.pos_ holds the coarse part-of-speech tag\n", "    return [token.text for token in doc if token.pos_ == \"NOUN\"]\n", "\n", "get_nouns(text)" ] }, { "cell_type": "markdown", "id": "musical-creator", "metadata": {}, "source": [ "Now it is time to count the occurrences of the individual nouns. Note that different forms of the same word are counted together as occurrences of that word (e.g. \"program\" and \"programs\"). The most convenient way to do the counting is a so-called tally: a dictionary whose keys are words in their base form and whose values are the numbers of occurrences of those words, inflected forms included. An example of a ready-made tally:" ] }, { "cell_type": "code", "execution_count": 7, "id": "acting-tolerance", "metadata": {}, "outputs": [], "source": [ "tally = {\"program\" : 4, \"component\" : 1}" ] }, { "cell_type": "markdown", "id": "vanilla-estimate", "metadata": {}, "source": [ "### Exercise 5: Write a terminology extraction program following the guidelines above."
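, "\n", "Such a tally can be built directly with `collections.Counter` over lemmatized words. A minimal sketch, using a hypothetical hand-made lemma list instead of real spaCy output:\n", "\n", "```python
from collections import Counter

# Hypothetical lemma list, e.g. as produced by spaCy's token.lemma_.
lemmas = ['program', 'component', 'program', 'program', 'program']

# Counter counts the occurrences; dict() turns it into the tally structure above.
tally = dict(Counter(lemmas))
print(tally)  # {'program': 4, 'component': 1}
```"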
] }, { "cell_type": "code", "execution_count": 74, "id": "eight-redhead", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'programmer': 1,\n", " 'section': 1,\n", " 'swing': 4,\n", " 'application': 3,\n", " 'command': 1,\n", " 'line': 1,\n", " 'information': 1,\n", " 'compilation': 1,\n", " 'instruction': 1,\n", " 'program': 4,\n", " 'applet': 1,\n", " 'step': 1,\n", " 'release': 1,\n", " 'platform': 1,\n", " 'component': 1}" ] }, "execution_count": 74, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "import spacy\n", "\n", "def extract_terms(text):\n", "    nlp = spacy.load(\"en_core_web_sm\")\n", "    doc = nlp(text)\n", "    # counting lemmas tallies inflected forms together\n", "    nouns = [token.lemma_ for token in doc if token.pos_ == \"NOUN\"]\n", "    return dict(Counter(nouns))\n", "\n", "extract_terms(text)" ] }, { "cell_type": "markdown", "id": "loaded-smell", "metadata": {}, "source": [ "### Exercise 6: Extend the program above to extract verbs and adjectives as well."
] }, { "cell_type": "code", "execution_count": 75, "id": "monetary-mambo", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'adjectives': {'late': 1},\n", " 'nouns': {'applet': 1,\n", "           'application': 3,\n", "           'command': 1,\n", "           'compilation': 1,\n", "           'component': 1,\n", "           'information': 1,\n", "           'instruction': 1,\n", "           'line': 1,\n", "           'platform': 1,\n", "           'program': 4,\n", "           'programmer': 1,\n", "           'release': 1,\n", "           'section': 1,\n", "           'step': 1,\n", "           'swing': 4},\n", " 'verbs': {'compile': 3,\n", "           'create': 1,\n", "           'do': 1,\n", "           'explain': 1,\n", "           'follow': 1,\n", "           'install': 1,\n", "           'need': 1,\n", "           'run': 3,\n", "           'see': 1,\n", "           'use': 2,\n", "           'work': 1}}\n" ] } ], "source": [ "from pprint import pprint\n", "from collections import Counter\n", "import spacy\n", "\n", "def extract_terms(text):\n", "    nlp = spacy.load(\"en_core_web_sm\")\n", "    doc = nlp(text)\n", "\n", "    nouns, verbs, adjectives = [], [], []\n", "    for token in doc:\n", "        if token.pos_ == \"NOUN\":\n", "            nouns.append(token.lemma_)\n", "        elif token.pos_ == \"VERB\":\n", "            verbs.append(token.lemma_)\n", "        elif token.pos_ == \"ADJ\":\n", "            adjectives.append(token.lemma_)\n", "\n", "    # tally each part of speech separately; pprint sorts the keys for display\n", "    return {\"nouns\": dict(Counter(nouns)),\n", "            \"verbs\": dict(Counter(verbs)),\n", "            \"adjectives\": dict(Counter(adjectives))}\n", "\n", "pprint(extract_terms(text))" ] } ], "metadata": { "author": "Rafał Jaworski", "email": "rjawor@amu.edu.pl", "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "lang": "en", 
"language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.2" }, "subtitle": "3. Terminology", "title": "Computer-aided translation", "year": "2021" }, "nbformat": 4, "nbformat_minor": 5 }