forked from bfijalkowski/KWT-2024
755 lines
29 KiB
Plaintext
{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "coastal-lincoln",
|
|
"metadata": {},
|
|
"source": [
|
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
|
"<div class=\"alert alert-block alert-info\">\n",
|
"<h1> Computer-Assisted Translation </h1>\n",
"<h2> 3. <i>Terminology</i> [lab]</h2> \n",
|
|
"<h3>Rafał Jaworski (2021)</h3>\n",
|
|
"</div>\n",
|
|
"\n",
|
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "aggregate-listing",
|
|
"metadata": {},
|
|
"source": [
|
|
"```python\n",
|
|
"import collections\n",
|
|
"lista1 = [3,4,5,4,4,7,8,7]\n",
|
|
"lista2 = [3,4,5,4,4,7,8,7]\n",
|
"print((collections.Counter(lista1) + collections.Counter(lista2)).most_common(5))\n",
|
|
"```\n",
|
|
"\n",
|
"In today's class we take a closer look at the dictionaries used in computer-assisted translation. Many electronic dictionaries are available on the market, and a large number of them are ready to use in SDL Trados, memoQ, and other CAT tools. They contain hundreds of thousands or even millions of entries and give the translator instant help."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "israeli-excuse",
|
|
"metadata": {},
|
|
"source": [
|
"The problem, however, is that they often lack the specialized terminology used by the client who commissions the translation. Specialized terms are very common in translated texts for the following reasons:\n",
"- General-purpose texts are translated fairly rarely (nobody translates holiday postcards...)\n",
"- The same word can have both a general meaning and a highly specialized one (e.g. \"inheritance\" in a legal or a programming context)\n",
"- The client uses names or words of their own invention, e.g. for marketing purposes."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "reflected-enforcement",
|
|
"metadata": {},
|
|
"source": [
|
"Two tasks therefore become non-trivial: finding a specialized term in the source text, and providing the correct translation of that term in the target language."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "statutory-florist",
|
|
"metadata": {},
|
|
"source": [
|
"Sounds simple? Let's try to perform the second operation by hand."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "danish-anchor",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 1: Translate the term \"prowadnice szaf metalowych\" into English. Describe the tools you used."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "diverse-sunglasses",
|
|
"metadata": {},
|
|
"source": [
|
"### Answer:\n",
"- **DeepL:** metal cabinet slides / metal cabinet guides\n",
"- **GPT-3.5 model:** metal cabinet slides / metal wardrobe rails\n",
"- **GPT-4 model:** guides for metal cabinets / metal cabinet guides\n",
"- **Google Translate:** metal cabinet guides\n",
"- **www.tlumaczangielskopolski.pl:** metal cabinet guides\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "limited-waterproof",
|
|
"metadata": {},
|
|
"source": [
|
"In the following exercises, however, we focus on finding specialized terms in a text. This requires two operations:\n",
"1. Preparing a specialized dictionary.\n",
"2. Detecting terminology using the prepared specialized dictionary."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "literary-blues",
|
|
"metadata": {},
|
|
"source": [
|
"Let's start with step 2, since it is the simpler one. Consider the following text:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 70,
|
|
"id": "loving-prince",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"text = \" For all Java programmers:\"\n",
|
|
"text += \" This section explains how to compile and run a Swing application from the command line.\"\n",
|
|
"text += \" For information on compiling and running a Swing application using NetBeans IDE,\"\n",
|
|
"text += \" see Running Tutorial Examples in NetBeans IDE. The compilation instructions work for all Swing programs\"\n",
|
|
"text += \" — applets, as well as applications. Here are the steps you need to follow:\"\n",
|
|
"text += \" Install the latest release of the Java SE platform, if you haven't already done so.\"\n",
|
|
"text += \" Create a program that uses Swing components. Compile the program. Run the program.\""
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "extreme-cycling",
|
|
"metadata": {},
|
|
"source": [
|
"Suppose we have the following dictionary:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 71,
|
|
"id": "bound-auction",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"dictionary = ['program', 'application', 'applet', 'compile']"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "other-trinidad",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 2: Write a program that prints the positions of all occurrences of each specialized term. For each term, print a list of (start_position, end_position) pairs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 76,
|
|
"id": "cognitive-cedar",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
"{'applet': [(302, 308)],\n",
" 'application': [(80, 91), (164, 175), (322, 333)],\n",
" 'compile': [(56, 63), (504, 511)],\n",
" 'program': [(14, 21), (291, 298), (468, 475), (516, 523), (533, 540)]}\n"
]
}
],
"source": [
"import re\n",
"from pprint import pprint\n",
"\n",
"def terminology_lookup():\n",
"    # Map each term to the list of (start, end) spans where it occurs.\n",
"    answer = {pattern: [] for pattern in dictionary}\n",
"    # Lowercase the text so that \"Compile\" matches the term \"compile\".\n",
"    low_text = text.lower()\n",
"    for pattern in dictionary:\n",
"        # re.escape guards against terms containing regex metacharacters;\n",
"        # finditer reports absolute offsets, so no manual bookkeeping is needed.\n",
"        for match in re.finditer(re.escape(pattern), low_text):\n",
"            answer[pattern].append((match.start(), match.end()))\n",
"    pprint(answer)\n",
"\n",
"terminology_lookup()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "interior-things",
|
|
"metadata": {},
|
|
"source": [
|
"Plain text search has some drawbacks. For example, when searching for the word \"program\" we also caught the word \"programmers\" by accident. We also caught the word \"programs\", which is correct, but the position reported for it is wrong, because the span covers only \"program\" and not the whole word."
|
|
]
|
|
},
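{
"cell_type": "markdown",
"id": "word-boundary-sketch",
"metadata": {},
"source": [
"One partial fix that stays within plain regular expressions is to anchor the pattern at word boundaries. The sketch below is only an illustration under assumptions: `sample` is a shortened, hypothetical excerpt of the tutorial text, not the full `text` variable.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# Hypothetical shortened excerpt of the tutorial text, used only for this sketch.\n",
"sample = 'For all Java programmers: the instructions work for all Swing programs. Run the program.'\n",
"\n",
"# \\b matches only at a word boundary, so 'program' inside\n",
"# 'programmers' or 'programs' is not reported.\n",
"spans = [(m.start(), m.end()) for m in re.finditer(r'\\bprogram\\b', sample.lower())]\n",
"print(spans)\n",
"```\n",
"\n",
"With `\\b` the match inside \"programmers\" disappears, but so does the one inside \"programs\", so inflected forms still need a different technique."
]
},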
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "aggressive-plane",
|
|
"metadata": {},
|
|
"source": [
|
"To deal with these problems, we need natural language processing techniques. Let's try the spaCy package:\n",
"\n",
"`pip3 install spacy`\n",
"\n",
"and\n",
"\n",
"`python3 -m spacy download en_core_web_sm`"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "02e1c16f-be37-4a64-a514-8875b393ccb7",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Defaulting to user installation because normal site-packages is not writeable\n",
|
|
"Requirement already satisfied: spacy in /usr/local/lib/python3.9/dist-packages (3.4.1)\n",
|
|
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.9 in /usr/local/lib/python3.9/dist-packages (from spacy) (3.0.10)\n",
|
|
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (1.0.3)\n",
|
|
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (1.0.8)\n",
|
|
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.9/dist-packages (from spacy) (2.0.6)\n",
|
|
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from spacy) (3.0.7)\n",
|
|
"Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (8.1.1)\n",
|
|
"Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /usr/local/lib/python3.9/dist-packages (from spacy) (0.10.1)\n",
|
|
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.9/dist-packages (from spacy) (2.4.4)\n",
|
|
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.9/dist-packages (from spacy) (2.0.8)\n",
|
|
"Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (0.4.2)\n",
|
|
"Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.9/dist-packages (from spacy) (0.6.2)\n",
|
|
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (4.64.1)\n",
|
|
"Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (1.21.6)\n",
|
|
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (2.28.1)\n",
|
|
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4 in /usr/local/lib/python3.9/dist-packages (from spacy) (1.9.2)\n",
|
|
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.9/dist-packages (from spacy) (3.1.2)\n",
|
|
"Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from spacy) (52.0.0)\n",
|
|
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (21.3)\n",
|
|
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.9/dist-packages (from spacy) (3.3.0)\n",
|
|
"Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3/dist-packages (from packaging>=20.0->spacy) (2.4.7)\n",
|
|
"Requirement already satisfied: smart-open<6.0.0,>=5.2.1 in /usr/local/lib/python3.9/dist-packages (from pathy>=0.3.5->spacy) (5.2.1)\n",
|
|
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4->spacy) (4.3.0)\n",
|
|
"Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2.1.1)\n",
|
|
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (3.4)\n",
|
|
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (1.26.12)\n",
|
|
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy) (2022.9.14)\n",
|
|
"Requirement already satisfied: blis<0.10.0,>=0.7.8 in /usr/local/lib/python3.9/dist-packages (from thinc<8.2.0,>=8.1.0->spacy) (0.9.1)\n",
|
|
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.9/dist-packages (from thinc<8.2.0,>=8.1.0->spacy) (0.0.1)\n",
|
|
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.9/dist-packages (from typer<0.5.0,>=0.3.0->spacy) (8.1.3)\n",
|
|
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/dist-packages (from jinja2->spacy) (2.1.1)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"!pip3 install spacy"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "f6d7e9f5-4d6f-49c5-8dea-9957bc6da318",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Defaulting to user installation because normal site-packages is not writeable\n",
|
|
"\u001b[33mDEPRECATION: https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl#egg=en_core_web_sm==3.4.1 contains an egg fragment with a non-PEP 508 name pip 25.0 will enforce this behaviour change. A possible replacement is to use the req @ url syntax, and remove the egg fragment. Discussion can be found at https://github.com/pypa/pip/issues/11617\u001b[0m\u001b[33m\n",
|
|
"\u001b[0mCollecting en-core-web-sm==3.4.1\n",
|
|
" Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)\n",
|
|
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.8/12.8 MB\u001b[0m \u001b[31m45.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m0:01\u001b[0m\n",
|
|
"\u001b[?25hRequirement already satisfied: spacy<3.5.0,>=3.4.0 in /usr/local/lib/python3.9/dist-packages (from en-core-web-sm==3.4.1) (3.4.1)\n",
|
|
"Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.9 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (3.0.10)\n",
|
|
"Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (1.0.3)\n",
|
|
"Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (1.0.8)\n",
|
|
"Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.0.6)\n",
|
|
"Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (3.0.7)\n",
|
|
"Requirement already satisfied: thinc<8.2.0,>=8.1.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (8.1.1)\n",
|
|
"Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (0.10.1)\n",
|
|
"Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.4.4)\n",
|
|
"Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.0.8)\n",
|
|
"Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (0.4.2)\n",
|
|
"Requirement already satisfied: pathy>=0.3.5 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (0.6.2)\n",
|
|
"Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (4.64.1)\n",
|
|
"Requirement already satisfied: numpy>=1.15.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (1.21.6)\n",
|
|
"Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.28.1)\n",
|
|
"Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (1.9.2)\n",
|
|
"Requirement already satisfied: jinja2 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (3.1.2)\n",
|
|
"Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (52.0.0)\n",
|
|
"Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (21.3)\n",
|
|
"Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/lib/python3.9/dist-packages (from spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (3.3.0)\n",
|
|
"Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/lib/python3/dist-packages (from packaging>=20.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.4.7)\n",
|
|
"Requirement already satisfied: smart-open<6.0.0,>=5.2.1 in /usr/local/lib/python3.9/dist-packages (from pathy>=0.3.5->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (5.2.1)\n",
|
|
"Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.9/dist-packages (from pydantic!=1.8,!=1.8.1,<1.10.0,>=1.7.4->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (4.3.0)\n",
|
|
"Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.1.1)\n",
|
|
"Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (3.4)\n",
|
|
"Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (1.26.12)\n",
|
|
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.9/dist-packages (from requests<3.0.0,>=2.13.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2022.9.14)\n",
|
|
"Requirement already satisfied: blis<0.10.0,>=0.7.8 in /usr/local/lib/python3.9/dist-packages (from thinc<8.2.0,>=8.1.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (0.9.1)\n",
|
|
"Requirement already satisfied: confection<1.0.0,>=0.0.1 in /usr/local/lib/python3.9/dist-packages (from thinc<8.2.0,>=8.1.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (0.0.1)\n",
|
|
"Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/lib/python3.9/dist-packages (from typer<0.5.0,>=0.3.0->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (8.1.3)\n",
|
|
"Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.9/dist-packages (from jinja2->spacy<3.5.0,>=3.4.0->en-core-web-sm==3.4.1) (2.1.1)\n",
|
|
"Installing collected packages: en-core-web-sm\n",
|
|
"Successfully installed en-core-web-sm-3.4.1\n",
|
|
"\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n",
|
|
"You can now load the package via spacy.load('en_core_web_sm')\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"!python3 -m spacy download en_core_web_sm"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 15,
|
|
"id": "tribal-attention",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \n",
|
|
"for\n",
|
|
"all\n",
|
|
"Java\n",
|
|
"programmer\n",
|
|
":\n",
|
|
"this\n",
|
|
"section\n",
|
|
"explain\n",
|
|
"how\n",
|
|
"to\n",
|
|
"compile\n",
|
|
"and\n",
|
|
"run\n",
|
|
"a\n",
|
|
"swing\n",
|
|
"application\n",
|
|
"from\n",
|
|
"the\n",
|
|
"command\n",
|
|
"line\n",
|
|
".\n",
|
|
"for\n",
|
|
"information\n",
|
|
"on\n",
|
|
"compile\n",
|
|
"and\n",
|
|
"run\n",
|
|
"a\n",
|
|
"swing\n",
|
|
"application\n",
|
|
"use\n",
|
|
"NetBeans\n",
|
|
"IDE\n",
|
|
",\n",
|
|
"see\n",
|
|
"Running\n",
|
|
"Tutorial\n",
|
|
"Examples\n",
|
|
"in\n",
|
|
"NetBeans\n",
|
|
"IDE\n",
|
|
".\n",
|
|
"the\n",
|
|
"compilation\n",
|
|
"instruction\n",
|
|
"work\n",
|
|
"for\n",
|
|
"all\n",
|
|
"swing\n",
|
|
"program\n",
|
|
"—\n",
|
|
"applet\n",
|
|
",\n",
|
|
"as\n",
|
|
"well\n",
|
|
"as\n",
|
|
"application\n",
|
|
".\n",
|
|
"here\n",
|
|
"be\n",
|
|
"the\n",
|
|
"step\n",
|
|
"you\n",
|
|
"need\n",
|
|
"to\n",
|
|
"follow\n",
|
|
":\n",
|
|
"install\n",
|
|
"the\n",
|
|
"late\n",
|
|
"release\n",
|
|
"of\n",
|
|
"the\n",
|
|
"Java\n",
|
|
"SE\n",
|
|
"platform\n",
|
|
",\n",
|
|
"if\n",
|
|
"you\n",
|
|
"have\n",
|
|
"not\n",
|
|
"already\n",
|
|
"do\n",
|
|
"so\n",
|
|
".\n",
|
|
"create\n",
|
|
"a\n",
|
|
"program\n",
|
|
"that\n",
|
|
"use\n",
|
|
"swing\n",
|
|
"component\n",
|
|
".\n",
|
|
"compile\n",
|
|
"the\n",
|
|
"program\n",
|
|
".\n",
|
|
"run\n",
|
|
"the\n",
|
|
"program\n",
|
|
".\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import spacy\n",
|
|
"nlp = spacy.load(\"en_core_web_sm\")\n",
|
|
"\n",
|
|
"doc = nlp(text)\n",
|
|
"\n",
|
|
"for token in doc:\n",
|
"    print(token.lemma_)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "regional-craft",
|
|
"metadata": {},
|
|
"source": [
|
"Success! The text has been split into words (tokenization) and each word has been reduced to its base form (lemmatization)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "toxic-subsection",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 3: Modify the program from Exercise 2 so that it also returns inflected words. For example, for the word \"program\" it should also find \"programs\", with the positions in the text set to cover the word \"programs\". Use the token's `idx` attribute."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"id": "surgical-demonstration",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
"{'program': [(291, 299), (468, 475), (516, 523), (533, 540)],\n",
" 'application': [(80, 91), (164, 175), (322, 334)],\n",
" 'applet': [(302, 309)],\n",
" 'compile': [(56, 63), (134, 143), (504, 511)]}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def terminology_lookup():\n",
"    # Map each term to the (start, end) spans of tokens whose lemma equals it.\n",
"    answer = {pattern: [] for pattern in dictionary}\n",
"\n",
"    for token in doc:\n",
"        # Compare whole lemmas, so \"programmers\" no longer matches \"program\".\n",
"        lemma = token.lemma_.lower()\n",
"        if lemma in answer:\n",
"            # token.idx is the token's character offset in the original text;\n",
"            # the span covers the surface form, e.g. all of \"programs\".\n",
"            answer[lemma].append((token.idx, token.idx + len(token.text)))\n",
"    return answer\n",
"\n",
"terminology_lookup()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "straight-letter",
|
|
"metadata": {},
|
|
"source": [
|
"Now it is time to tackle the preparation of a specialized dictionary. For this purpose we will write our own terminology extractor. The input of the extractor is a text containing specialized terminology. The output is a list of terms."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "nearby-frontier",
|
|
"metadata": {},
|
|
"source": [
|
"Let's adopt the following approach: the specialized terms are the nouns that occur most frequently in the text. Let's carry out the first step:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "harmful-lightning",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 4: Print all the nouns in the text. Use the capabilities of spaCy."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 73,
|
|
"id": "superb-butterfly",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"['programmers',\n",
|
|
" 'section',\n",
|
|
" 'Swing',\n",
|
|
" 'application',\n",
|
|
" 'command',\n",
|
|
" 'line',\n",
|
|
" 'information',\n",
|
|
" 'Swing',\n",
|
|
" 'application',\n",
|
|
" 'compilation',\n",
|
|
" 'instructions',\n",
|
|
" 'Swing',\n",
|
|
" 'programs',\n",
|
|
" 'applets',\n",
|
|
" 'applications',\n",
|
|
" 'steps',\n",
|
|
" 'release',\n",
|
|
" 'platform',\n",
|
|
" 'program',\n",
|
|
" 'Swing',\n",
|
|
" 'components',\n",
|
|
" 'program',\n",
|
|
" 'program']"
|
|
]
|
|
},
|
|
"execution_count": 73,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
"import spacy\n",
"\n",
"def get_nouns(text):\n",
"    nlp = spacy.load(\"en_core_web_sm\")\n",
"    doc = nlp(text)\n",
"    # token.pos_ holds the coarse part-of-speech tag; keep the surface forms of nouns.\n",
"    nouns = [token.text for token in doc if token.pos_ == \"NOUN\"]\n",
"    return nouns\n",
"\n",
"get_nouns(text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "musical-creator",
|
|
"metadata": {},
|
|
"source": [
|
"Now it is time to count the occurrences of each noun. Note that different forms of the same word are counted together as occurrences of that word (e.g. \"program\" and \"programs\"). The most convenient way to count is to use a so-called tally. It is a dictionary whose key is a word in its base form and whose value is the number of occurrences of that word, including its inflected forms. An example of a finished tally:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "acting-tolerance",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"tally = {\"program\" : 4, \"component\" : 1}"
|
|
]
|
|
},
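{
"cell_type": "markdown",
"id": "tally-counter-sketch",
"metadata": {},
"source": [
"A tally of this shape can be built in one step with `collections.Counter`. The following is a minimal sketch, assuming the base forms are already available; the list `lemmas` is a hypothetical hand-written sample rather than real lemmatizer output.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Hypothetical base forms, standing in for lemmatizer output.\n",
"lemmas = ['program', 'component', 'program', 'applet', 'program']\n",
"\n",
"# Counter is a dict subclass: key = base form, value = number of occurrences.\n",
"counts = Counter(lemmas)\n",
"tally = dict(counts)\n",
"\n",
"# most_common(n) returns the n most frequent (word, count) pairs.\n",
"top = counts.most_common(1)\n",
"print(tally, top)\n",
"```\n",
"\n",
"Ranking with `most_common` fits the approach adopted here, in which the most frequent nouns are treated as the specialized terms."
]
},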
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "vanilla-estimate",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 5: Write a program that extracts terminology from a text according to the guidelines above."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 74,
|
|
"id": "eight-redhead",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/plain": [
|
|
"{'programmer': 1,\n",
|
|
" 'section': 1,\n",
|
|
" 'swing': 4,\n",
|
|
" 'application': 3,\n",
|
|
" 'command': 1,\n",
|
|
" 'line': 1,\n",
|
|
" 'information': 1,\n",
|
|
" 'compilation': 1,\n",
|
|
" 'instruction': 1,\n",
|
|
" 'program': 4,\n",
|
|
" 'applet': 1,\n",
|
|
" 'step': 1,\n",
|
|
" 'release': 1,\n",
|
|
" 'platform': 1,\n",
|
|
" 'component': 1}"
|
|
]
|
|
},
|
|
"execution_count": 74,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
"from collections import Counter\n",
"import spacy\n",
"\n",
"def extract_terms(text):\n",
"    nlp = spacy.load(\"en_core_web_sm\")\n",
"    doc = nlp(text)\n",
"    # Count lemmas so that inflected forms (e.g. \"programs\") add to the base form.\n",
"    nouns = [token.lemma_ for token in doc if token.pos_ == \"NOUN\"]\n",
"    # Counter keeps first-seen order; dict() gives the plain tally shape.\n",
"    tally = dict(Counter(nouns))\n",
"    return tally\n",
"\n",
"extract_terms(text)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "loaded-smell",
|
|
"metadata": {},
|
|
"source": [
|
"### Exercise 6: Extend the program above to also extract verbs and adjectives."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 75,
|
|
"id": "monetary-mambo",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"{'adjectives': {'late': 1},\n",
|
|
" 'nouns': {'applet': 1,\n",
|
|
" 'application': 3,\n",
|
|
" 'command': 1,\n",
|
|
" 'compilation': 1,\n",
|
|
" 'component': 1,\n",
|
|
" 'information': 1,\n",
|
|
" 'instruction': 1,\n",
|
|
" 'line': 1,\n",
|
|
" 'platform': 1,\n",
|
|
" 'program': 4,\n",
|
|
" 'programmer': 1,\n",
|
|
" 'release': 1,\n",
|
|
" 'section': 1,\n",
|
|
" 'step': 1,\n",
|
|
" 'swing': 4},\n",
|
|
" 'verbs': {'compile': 3,\n",
|
|
" 'create': 1,\n",
|
|
" 'do': 1,\n",
|
|
" 'explain': 1,\n",
|
|
" 'follow': 1,\n",
|
|
" 'install': 1,\n",
|
|
" 'need': 1,\n",
|
|
" 'run': 3,\n",
|
|
" 'see': 1,\n",
|
|
" 'use': 2,\n",
|
|
" 'work': 1}}\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"from pprint import pprint\n",
"from collections import Counter\n",
"import spacy\n",
"\n",
"def extract_terms(text):\n",
"    nlp = spacy.load(\"en_core_web_sm\")\n",
"    doc = nlp(text)\n",
"\n",
"    # Map spaCy coarse part-of-speech tags to the sections of the tally.\n",
"    sections = {\"NOUN\": \"nouns\", \"VERB\": \"verbs\", \"ADJ\": \"adjectives\"}\n",
"    counts = {name: Counter() for name in sections.values()}\n",
"\n",
"    for token in doc:\n",
"        if token.pos_ in sections:\n",
"            # Count base forms so that inflected variants are merged.\n",
"            counts[sections[token.pos_]][token.lemma_] += 1\n",
"\n",
"    # Plain dicts keep the printed output free of Counter(...) wrappers.\n",
"    tally = {name: dict(counter) for name, counter in counts.items()}\n",
"    pprint(tally)\n",
"\n",
"extract_terms(text)"
|
|
]
|
|
}
|
|
],
|
|
"metadata": {
|
|
"author": "Rafał Jaworski",
|
|
"email": "rjawor@amu.edu.pl",
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"lang": "pl",
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.9.2"
|
|
},
|
|
"subtitle": "3. Terminologia",
|
|
"title": "Komputerowe wspomaganie tłumaczenia",
|
|
"year": "2021"
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|