moj-2024/wyk/02_Jezyki.ipynb

2024-02-27 21:20:36 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Language modelling</h1>\n",
"<h2> 02. <i>Languages and their statistical laws</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Languages and their statistical laws\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Which statistical distributions do languages follow?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Natural language, or “Pan Tadeusz” in numbers\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us first set up the “infrastructure” for *segmenting* text into units of various kinds.\n",
"We will use generators.\n",
"\n",
"**Question** Why generators rather than lists?\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Księga pierwsza\n",
"\n",
"\n",
"\n",
"Gospodarstwo\n",
"\n",
"Powrót pani"
]
}
],
"source": [
"import requests\n",
"\n",
"url = 'https://wolnelektury.pl/media/book/txt/pan-tadeusz.txt'\n",
"pan_tadeusz = requests.get(url).content.decode('utf-8')\n",
"\n",
"pan_tadeusz[100:150]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Characters\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['K', 's', 'i', 'ę', 'g', 'a', ' ', 'p', 'i', 'e', 'r', 'w', 's', 'z', 'a', '\\r', '\\n', '\\r', '\\n', '\\r', '\\n', '\\r', '\\n', 'G', 'o', 's', 'p', 'o', 'd', 'a', 'r', 's', 't', 'w', 'o', '\\r', '\\n', '\\r', '\\n', 'P', 'o', 'w', 'r', 'ó', 't', ' ', 'p', 'a', 'n', 'i']"
]
}
],
"source": [
"from itertools import islice\n",
"\n",
"def get_characters(t):\n",
"    # A string is already an iterable of characters, so simply delegate to it.\n",
"    yield from t\n",
"\n",
"list(islice(get_characters(pan_tadeusz), 100, 150))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Counter({' ': 63444, 'a': 30979, 'i': 29353, 'e': 25343, 'o': 23050, 'z': 22741, 'n': 15505, 'r': 15328, 's': 15255, 'w': 14625, 'c': 14153, 'y': 13732, 'k': 12362, 'd': 11465, '\\r': 10851, '\\n': 10851, 't': 10757, 'm': 10269, 'ł': 10059, ',': 9130, 'p': 8031, 'u': 7699, 'l': 6677, 'j': 6586, 'b': 5753, 'ę': 5534, 'ą': 4794, 'g': 4775, 'h': 3915, 'ż': 3334, 'ó': 3097, 'ś': 2524, '.': 2380, 'ć': 1956, ';': 1445, 'P': 1265, 'W': 1258, ':': 1152, '!': 1083, 'S': 1045, 'T': 971, 'I': 795, 'N': 793, 'Z': 785, 'J': 729, '—': 720, 'A': 698, 'K': 683, 'ń': 651, 'M': 585, 'B': 567, 'O': 567, 'C': 556, 'D': 552, '«': 540, '»': 538, 'R': 489, '?': 441, 'ź': 414, 'f': 386, 'G': 358, 'L': 316, 'H': 309, 'Ż': 219, 'U': 184, '…': 157, '*': 150, '(': 76, ')': 76, 'Ś': 71, 'F': 47, 'é': 43, '-': 33, 'Ł': 24, 'E': 23, '/': 19, 'Ó': 13, '8': 10, '9': 8, '2': 6, 'v': 5, 'Ź': 4, '1': 4, '3': 3, 'x': 3, 'V': 3, '7': 2, '4': 2, '5': 2, 'q': 2, 'æ': 2, 'à': 1, 'Ć': 1, '6': 1, '0': 1})"
]
}
],
"source": [
"from collections import Counter\n",
"\n",
"c = Counter(get_characters(pan_tadeusz))\n",
"\n",
"c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us write a helper function that returns a **frequency list**.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OrderedDict([(' ', 63444), ('a', 30979), ('i', 29353), ('e', 25343), ('o', 23050), ('z', 22741), ('n', 15505), ('r', 15328)])"
]
}
],
"source": [
"from collections import Counter\n",
"from collections import OrderedDict\n",
"\n",
"def freq_list(g, top=None):\n",
"    # Count every item produced by the generator g.\n",
"    c = Counter(g)\n",
"\n",
"    if top is None:\n",
"        items = c.items()\n",
"    else:\n",
"        items = c.most_common(top)\n",
"\n",
"    # Order the items from the most to the least frequent.\n",
"    return OrderedDict(sorted(items, key=lambda t: -t[1]))\n",
"\n",
"freq_list(get_characters(pan_tadeusz), top=8)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from collections import OrderedDict\n",
"\n",
"def rang_freq_with_labels(name, g, top=None):\n",
"    freq = freq_list(g, top)\n",
"\n",
"    plt.figure(figsize=(12, 3))\n",
"    plt.ylabel('number of occurrences')\n",
"\n",
"    plt.bar(freq.keys(), freq.values())\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"rang_freq_with_labels('pt-chars', get_characters(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Words\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What exactly counts as a word is not obvious. In practice it depends on the choice of **tokenizer**.\n",
"\n",
"Let us assume that a word is an uninterrupted sequence of letters or digits (and asterisks;\n",
"this will shortly make it easier to analyse a certain text…).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Ty', 'co', 'gród', 'zamkowy', 'Nowogródzki', 'ochraniasz', 'z', 'jego', 'wiernym', 'ludem', 'Jak', 'mnie', 'dziecko', 'do', 'zdrowia', 'powróciłaś', 'cudem', 'Gdy', 'od', 'płaczącej', 'matki', 'pod', 'Twoją', 'opiekę', 'Ofiarowany', 'martwą', 'podniosłem', 'powiekę', 'I', 'zaraz']"
]
}
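],
"source": [
"from itertools import islice\n",
"import regex as re\n",
"\n",
"def get_words(t):\n",
"    # \\p{L} (any Unicode letter) needs the third-party regex module; the standard re module does not support it.\n",
"    for m in re.finditer(r'[\\p{L}0-9\\*]+', t):\n",
"        yield m.group(0)\n",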
"\n",
"list(islice(get_words(pan_tadeusz), 100, 130))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let us look at the 20 most frequent words.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('pt-words-20', get_words(pan_tadeusz), top=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zobaczmy pełny obraz, już bez etykiet.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def rang_freq(name, g):\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    plt.plot(range(1, len(freq.values())+1), freq.values())\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"rang_freq('pt-words', get_words(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot clearly spans very different scales. Let us apply a logarithm,\n",
"at first only to the $y$ coordinate.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def rang_log_freq(name, g):\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    plt.plot(range(1, len(freq.values())+1), [log(y) for y in freq.values()])\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"rang_log_freq('pt-words-log', get_words(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question** Why do we see longer and longer “steps”?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hapax legomena\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the previous plot we can read off that about 2/3 of the distinct words occurred\n",
"exactly once. Words that occur once in a given corpus are called\n",
"*hapax legomena* (singular *hapax legomenon*, ἅπαξ λεγόμενον,\n",
"“said once”; in Polish jargon: “hapaks”).\n",
"\n",
"“True” hapax legomena, words that occur only once in the *entire*\n",
"corpus of texts of a given language (e.g. an ancient one), naturally\n",
"cause enormous difficulties in translation. An example is the Greek\n",
"word ἐπιούσιος, an attribute modifying “bread” in the Lord's\n",
"Prayer. This is the only attestation of the word in the whole known corpus\n",
"of Greek (not only in the Bible). In Polish it is rendered as\n",
"“powszedni” (everyday), whereas Russian, for example, settled on\n",
"“насущный” (essential), a meaning opposite to the Polish one!\n",
"\n",
"Hapaxes can cause similar problems for statistical methods\n",
"when processing any corpus.\n",
"\n"
]
},
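{
"cell_type": "markdown",
"metadata": {},
"source": [
"The share of hapax legomena can also be computed directly instead of being read off the plot.\n",
"The cell below is a minimal sketch that is not part of the original lecture: the helper name\n",
"`hapax_share` and the toy sentence are illustrative assumptions. With the definitions from the\n",
"earlier cells it could be applied as `hapax_share(get_words(pan_tadeusz))`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def hapax_share(tokens):\n",
"    # Hypothetical helper: fraction of distinct tokens that occur exactly once.\n",
"    counts = Counter(tokens)\n",
"    if not counts:\n",
"        return 0.0\n",
"    hapaxes = sum(1 for n in counts.values() if n == 1)\n",
"    return hapaxes / len(counts)\n",
"\n",
"# Toy input: 'but', 'not' and 'tulip' occur once, so 3 of the 6 distinct words are hapaxes.\n",
"sample = 'a rose is a rose is a rose but not a tulip'.split()\n",
"hapax_share(sample)"
]
},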
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Log-log plot\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we draw the plot mentioned earlier using a logarithmic scale\n",
"on **both** axes, we obtain a shape close to a straight line.\n",
"\n",
"This property of texts is called **Zipf's law**.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def log_rang_log_freq(name, g):\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    plt.plot([log(x) for x in range(1, len(freq.values())+1)], [log(y) for y in freq.values()])\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"log_rang_log_freq('pt-words-log-log', get_words(pan_tadeusz))"
]
},
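{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zipf's law is usually written as $f(r) \\propto r^{-s}$, where $f(r)$ is the frequency of the word of rank $r$;\n",
"the exponent $s$ is often reported to be close to 1. On a log-log plot this becomes a straight line with slope $-s$.\n",
"The cell below is a rough sketch, not part of the original lecture, that estimates this slope with an ordinary\n",
"least-squares fit. The helper name `zipf_slope` and the toy data are illustrative assumptions, and a plain\n",
"least-squares fit is only one simple way to summarise the trend. With the definitions from the earlier cells\n",
"it could be applied as `zipf_slope(get_words(pan_tadeusz))`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from math import log\n",
"\n",
"def zipf_slope(tokens):\n",
"    # Hypothetical helper: slope of log(frequency) against log(rank).\n",
"    freqs = sorted(Counter(tokens).values(), reverse=True)\n",
"    if len(freqs) < 2:\n",
"        raise ValueError('need at least two distinct tokens')\n",
"\n",
"    xs = [log(rank) for rank in range(1, len(freqs) + 1)]\n",
"    ys = [log(f) for f in freqs]\n",
"    mean_x = sum(xs) / len(xs)\n",
"    mean_y = sum(ys) / len(ys)\n",
"\n",
"    # Ordinary least-squares slope: covariance of x and y divided by variance of x.\n",
"    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))\n",
"    sxx = sum((x - mean_x) ** 2 for x in xs)\n",
"    return sxy / sxx\n",
"\n",
"# Toy data with frequencies 60, 30, 20, 15, i.e. proportional to 1/rank,\n",
"# so the fitted slope should come out very close to -1.\n",
"toy = ['a'] * 60 + ['b'] * 30 + ['c'] * 20 + ['d'] * 15\n",
"zipf_slope(toy)"
]
},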
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The relationship between frequency and length\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A linguistic law related to Zipf's law describes the relationship between\n",
"how often a word is used and its length. In general, the shorter the word, the more frequent it is.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "[base64-encoded PNG omitted: scatter plot of word length (x-axis) against log frequency (y-axis)]",
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
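],
"source": [
"def freq_vs_length(name, g, top=None):\n",
"    # Note: the 'top' parameter is accepted but not used in this function.\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    # x-axis: word length, y-axis: log(frequency); one hollow red marker per distinct word\n",
"    plt.scatter([len(x) for x in freq.keys()], [log(y) for y in freq.values()],\n",
"                facecolors='none', edgecolors='r')\n",
"\n",
"    # The target directory 02_Jezyki/ must already exist.\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"freq_vs_length('pt-lengths', get_words(pan_tadeusz))"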
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"org": null
},
"nbformat": 4,
"nbformat_minor": 1
}