2022-03-06 17:51:23 +01:00
|
|
|
{
|
|
|
|
"cells": [
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
|
|
|
|
"<div class=\"alert alert-block alert-info\">\n",
|
|
|
|
"<h1> Language Modeling</h1>\n",
"<h2> 2. <i>Languages</i> [lecture]</h2>\n",
|
|
|
|
"<h3> Filip Graliński (2022)</h3>\n",
|
|
|
|
"</div>\n",
|
|
|
|
"\n",
|
|
|
|
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"## Languages and their laws\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"Which statistical distributions do languages follow?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"### Natural language, or “Pan Tadeusz” in numbers\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"First, let us prepare the “infrastructure” for *segmenting* text into units of various kinds.\n",
"We will use generators.\n",
"\n",
"**Question** Why generators rather than lists?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
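{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to explore the question above is to compare a generator with a list directly. The sketch below is illustrative only: `squares` is a hypothetical helper that does not appear elsewhere in this notebook, and the sizes reported by `sys.getsizeof` depend on the Python build.\n",
"\n",
"```python\n",
"import sys\n",
"from itertools import islice\n",
"\n",
"def squares(n):\n",
"    # Generator: produces one value at a time instead of building a list.\n",
"    for i in range(n):\n",
"        yield i * i\n",
"\n",
"lazy = squares(10**6)                   # nothing has been computed yet\n",
"eager = [i * i for i in range(10**6)]   # a million numbers held in memory\n",
"\n",
"# The generator object is tiny compared with the fully built list.\n",
"print(sys.getsizeof(lazy) < sys.getsizeof(eager))   # True\n",
"\n",
"# islice takes only the first few items and leaves the rest uncomputed.\n",
"print(list(islice(lazy, 5)))   # [0, 1, 4, 9, 16]\n",
"```\n",
"\n",
"For a long text this suggests that a segmenter written as a generator can feed a `Counter` without first materializing every unit in memory.\n"
]
},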
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"'Księga pierwsza\\r\\n\\r\\n\\r\\n\\r\\nGospodarstwo\\r\\n\\r\\nPowrót pani'"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import requests\n",
|
|
|
|
"\n",
|
|
|
|
"url = 'https://wolnelektury.pl/media/book/txt/pan-tadeusz.txt'\n",
|
|
|
|
"pan_tadeusz = requests.get(url).content.decode('utf-8')\n",
|
|
|
|
"\n",
|
|
|
|
"pan_tadeusz[100:150]"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Characters\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 2,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"['K',\n",
|
|
|
|
" 's',\n",
|
|
|
|
" 'i',\n",
|
|
|
|
" 'ę',\n",
|
|
|
|
" 'g',\n",
|
|
|
|
" 'a',\n",
|
|
|
|
" ' ',\n",
|
|
|
|
" 'p',\n",
|
|
|
|
" 'i',\n",
|
|
|
|
" 'e',\n",
|
|
|
|
" 'r',\n",
|
|
|
|
" 'w',\n",
|
|
|
|
" 's',\n",
|
|
|
|
" 'z',\n",
|
|
|
|
" 'a',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" 'G',\n",
|
|
|
|
" 'o',\n",
|
|
|
|
" 's',\n",
|
|
|
|
" 'p',\n",
|
|
|
|
" 'o',\n",
|
|
|
|
" 'd',\n",
|
|
|
|
" 'a',\n",
|
|
|
|
" 'r',\n",
|
|
|
|
" 's',\n",
|
|
|
|
" 't',\n",
|
|
|
|
" 'w',\n",
|
|
|
|
" 'o',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" '\\r',\n",
|
|
|
|
" '\\n',\n",
|
|
|
|
" 'P',\n",
|
|
|
|
" 'o',\n",
|
|
|
|
" 'w',\n",
|
|
|
|
" 'r',\n",
|
|
|
|
" 'ó',\n",
|
|
|
|
" 't',\n",
|
|
|
|
" ' ',\n",
|
|
|
|
" 'p',\n",
|
|
|
|
" 'a',\n",
|
|
|
|
" 'n',\n",
|
|
|
|
" 'i']"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 2,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"from itertools import islice\n",
|
|
|
|
"\n",
|
|
|
|
"def get_characters(t):\n",
|
|
|
|
"    yield from t\n",
|
|
|
|
"\n",
|
|
|
|
"list(islice(get_characters(pan_tadeusz), 100, 150))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 3,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"Counter({'A': 698,\n",
|
|
|
|
" 'd': 11465,\n",
|
|
|
|
" 'a': 30979,\n",
|
|
|
|
" 'm': 10269,\n",
|
|
|
|
" ' ': 63444,\n",
|
|
|
|
" 'M': 585,\n",
|
|
|
|
" 'i': 29353,\n",
|
|
|
|
" 'c': 14153,\n",
|
|
|
|
" 'k': 12362,\n",
|
|
|
|
" 'e': 25343,\n",
|
|
|
|
" 'w': 14625,\n",
|
|
|
|
" 'z': 22741,\n",
|
|
|
|
" '\\r': 10851,\n",
|
|
|
|
" '\\n': 10851,\n",
|
|
|
|
" 'P': 1265,\n",
|
|
|
|
" 'n': 15505,\n",
|
|
|
|
" 'T': 971,\n",
|
|
|
|
" 'u': 7699,\n",
|
|
|
|
" 's': 15255,\n",
|
|
|
|
" 'y': 13732,\n",
|
|
|
|
" 'l': 6677,\n",
|
|
|
|
" 'o': 23050,\n",
|
|
|
|
" 't': 10757,\n",
|
|
|
|
" 'j': 6586,\n",
|
|
|
|
" 'L': 316,\n",
|
|
|
|
" 'I': 795,\n",
|
|
|
|
" 'S': 1045,\n",
|
|
|
|
" 'B': 567,\n",
|
|
|
|
" 'N': 793,\n",
|
|
|
|
" '9': 8,\n",
|
|
|
|
" '7': 2,\n",
|
|
|
|
" '8': 10,\n",
|
|
|
|
" '-': 33,\n",
|
|
|
|
" '3': 3,\n",
|
|
|
|
" '2': 6,\n",
|
|
|
|
" '4': 2,\n",
|
|
|
|
" '5': 2,\n",
|
|
|
|
" 'K': 683,\n",
|
|
|
|
" 'ę': 5534,\n",
|
|
|
|
" 'g': 4775,\n",
|
|
|
|
" 'p': 8031,\n",
|
|
|
|
" 'r': 15328,\n",
|
|
|
|
" 'G': 358,\n",
|
|
|
|
" 'ó': 3097,\n",
|
|
|
|
" '—': 720,\n",
|
|
|
|
" ',': 9130,\n",
|
|
|
|
" 'ł': 10059,\n",
|
|
|
|
" 'W': 1258,\n",
|
|
|
|
" 'ż': 3334,\n",
|
|
|
|
" 'ś': 2524,\n",
|
|
|
|
" 'ą': 4794,\n",
|
|
|
|
" 'Ż': 219,\n",
|
|
|
|
" 'O': 567,\n",
|
|
|
|
" 'ź': 414,\n",
|
|
|
|
" 'b': 5753,\n",
|
|
|
|
" 'R': 489,\n",
|
|
|
|
" 'E': 23,\n",
|
|
|
|
" '!': 1083,\n",
|
|
|
|
" ':': 1152,\n",
|
|
|
|
" 'ć': 1956,\n",
|
|
|
|
" '.': 2380,\n",
|
|
|
|
" 'D': 552,\n",
|
|
|
|
" 'J': 729,\n",
|
|
|
|
" 'C': 556,\n",
|
|
|
|
" 'h': 3915,\n",
|
|
|
|
" '(': 76,\n",
|
|
|
|
" 'f': 386,\n",
|
|
|
|
" ';': 1445,\n",
|
|
|
|
" 'ń': 651,\n",
|
|
|
|
" ')': 76,\n",
|
|
|
|
" 'Z': 785,\n",
|
|
|
|
" 'Ś': 71,\n",
|
|
|
|
" 'U': 184,\n",
|
|
|
|
" 'F': 47,\n",
|
|
|
|
" 'é': 43,\n",
|
|
|
|
" '?': 441,\n",
|
|
|
|
" '…': 157,\n",
|
|
|
|
" '«': 540,\n",
|
|
|
|
" 'H': 309,\n",
|
|
|
|
" '»': 538,\n",
|
|
|
|
" 'Ó': 13,\n",
|
|
|
|
" 'Ł': 24,\n",
|
|
|
|
" 'x': 3,\n",
|
|
|
|
" 'v': 5,\n",
|
|
|
|
" '*': 150,\n",
|
|
|
|
" 'à': 1,\n",
|
|
|
|
" 'Ź': 4,\n",
|
|
|
|
" 'V': 3,\n",
|
|
|
|
" '/': 19,\n",
|
|
|
|
" 'Ć': 1,\n",
|
|
|
|
" 'q': 2,\n",
|
|
|
|
" '1': 4,\n",
|
|
|
|
" 'æ': 2,\n",
|
|
|
|
" '6': 1,\n",
|
|
|
|
" '0': 1})"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 3,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"from collections import Counter\n",
|
|
|
|
"\n",
|
|
|
|
"c = Counter(get_characters(pan_tadeusz))\n",
|
|
|
|
"\n",
|
|
|
|
"c"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"Let us write a helper function that returns a **frequency list**.\n",
|
|
|
|
"\n",
|
|
|
|
"Counter({' ': 63444, 'a': 30979, 'i': 29353, 'e': 25343, 'o': 23050, 'z': 22741, 'n': 15505, 'r': 15328, 's': 15255, 'w': 14625, 'c': 14153, 'y': 13732, 'k': 12362, 'd': 11465, '\\r': 10851, '\\n': 10851, 't': 10757, 'm': 10269, 'ł': 10059, ',': 9130, 'p': 8031, 'u': 7699, 'l': 6677, 'j': 6586, 'b': 5753, 'ę': 5534, 'ą': 4794, 'g': 4775, 'h': 3915, 'ż': 3334, 'ó': 3097, 'ś': 2524, '.': 2380, 'ć': 1956, ';': 1445, 'P': 1265, 'W': 1258, ':': 1152, '!': 1083, 'S': 1045, 'T': 971, 'I': 795, 'N': 793, 'Z': 785, 'J': 729, '—': 720, 'A': 698, 'K': 683, 'ń': 651, 'M': 585, 'B': 567, 'O': 567, 'C': 556, 'D': 552, '«': 540, '»': 538, 'R': 489, '?': 441, 'ź': 414, 'f': 386, 'G': 358, 'L': 316, 'H': 309, 'Ż': 219, 'U': 184, '…': 157, '\\*': 150, '(': 76, ')': 76, 'Ś': 71, 'F': 47, 'é': 43, '-': 33, 'Ł': 24, 'E': 23, '/': 19, 'Ó': 13, '8': 10, '9': 8, '2': 6, 'v': 5, 'Ź': 4, '1': 4, '3': 3, 'x': 3, 'V': 3, '7': 2, '4': 2, '5': 2, 'q': 2, 'æ': 2, 'à': 1, 'Ć': 1, '6': 1, '0': 1})\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 4,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"OrderedDict([(' ', 63444),\n",
|
|
|
|
" ('a', 30979),\n",
|
|
|
|
" ('i', 29353),\n",
|
|
|
|
" ('e', 25343),\n",
|
|
|
|
" ('o', 23050),\n",
|
|
|
|
" ('z', 22741),\n",
|
|
|
|
" ('n', 15505),\n",
|
|
|
|
" ('r', 15328)])"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 4,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"from collections import Counter\n",
|
|
|
|
"from collections import OrderedDict\n",
|
|
|
|
"\n",
|
|
|
|
"def freq_list(g, top=None):\n",
"    # Count the items, then order them from most to least frequent.\n",
"    c = Counter(g)\n",
"\n",
"    if top is None:\n",
"        items = c.items()\n",
"    else:\n",
"        items = c.most_common(top)\n",
"\n",
"    return OrderedDict(sorted(items, key=lambda t: -t[1]))\n",
|
|
|
|
"\n",
|
|
|
|
"freq_list(get_characters(pan_tadeusz), top=8)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 5,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stderr",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"/tmp/ipykernel_6969/6903746.py:14: UserWarning: Glyph 13 (\r",
|
|
|
|
") missing from current font.\n",
|
|
|
|
" plt.savefig(fname)\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"'02_Jezyki/pt-chars.png'"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 5,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"name": "stderr",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"/usr/lib/python3.10/site-packages/IPython/core/pylabtools.py:151: UserWarning: Glyph 13 (\r",
|
|
|
|
") missing from current font.\n",
|
|
|
|
" fig.canvas.print_figure(bytes_io, **kw)\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAuEAAADCCAYAAADn5xwjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAsGElEQVR4nO3de/wVVb3/8ddbQMULXhBNQf1q0kU9pUJm2U2pI5WFdjSxY2JRpFnasX6G1TnZhdLudtEyMdG8cSiTY2oZaFqRiooB4oUUlTTBO2qo4Of3x1pbhs3s/d3fL9+9v7f38/HYjz2zZtbMmtlrz17z2WtmFBGYmZmZmVnrbNDdBTAzMzMz62/cCDczMzMzazE3ws3MzMzMWsyNcDMzMzOzFnMj3MzMzMysxdwINzMzMzNrsYHdXYDusM0220RbW1t3F8PMzMzM+rBbbrnl0YgYVjatXzbC29ramDt3bncXw8zMzMz6MEn315rm7ihmZmZmZi3mRriZmZmZWYu5EW5mZmZm1mJuhJuZmZmZtZgb4WZmZmZmLdYv747SXdom/3adtCWnvbcbSmJmZmZm3cmRcDMzMzOzFnMj3MzMzMysxdwINzMzMzNrMTfCzczMzMxarKmNcElbSpoh6U5JiyS9SdLWkq6RdE9+36ow/ymSFku6S9JBhfRRkubnaT+UpJy+kaRLc/qNktqauT1mZmZmZl2h2ZHwM4CrI+I1wOuBRcBkYFZEjARm5XEk7Q6MB/YAxgJnShqQl3MWMAkYmV9jc/pE4ImI2A34PnB6k7fHzMzMzGy9Na0RLmkI8DZgKkBEvBARTwLjgGl5tmnAIXl4HHBJRDwfEfcBi4F9JW0PDImIORERwPlVeSrLmgGMqUTJzczMzMx6qmZGwncFlgO/kHSbpHMkbQpsFxEPA+T3bfP8w4EHC/mX5rThebg6fa08EbEKeAoY2pzNMTMzMzPrGs1shA8E9gHOioi9gWfJXU9qKItgR530ennWXbg0SdJcSXOXL19epxhmZmZmZs3VzEb4UmBpRNyYx2eQGuWP5C4m5Pdlhfl3LOQfATyU00eUpK+VR9JAYAvg8bLCRMTZETE6IkYPGzZsPTfNzMzMzKzzmtYIj4h/Ag9KenVOGgPcAcwEJuS0CcDleXgmMD7f8WQX0gWYN+UuKysk7Zf7ex9dlaeyrMOA2bnfuJmZmZlZjzWwycv/NHChpA2Be4GPkBr+0yVNBB4ADgeIiIWSppMa6quA4yNidV7OccB5wGDgqvyCdNHnBZIWkyLg45u8PWZmZmZm662pjfCImAeMLpk0psb8U4ApJelzgT1L0leSG/FmZmZmZr2Fn5hpZmZmZtZiboSbmZmZmbWYG+FmZmZmZi3mRriZmZmZWYu5EW5mZmZm1mJuhJuZmZmZtZgb4WZmZmZmLeZGuJmZmZlZi7kRbmZmZmbWYm6Em5mZmZm1mBvhZmZmZmYt5ka4mZmZmVmLuRFuZmZmZtZiboSbmZmZmbWYG+FmZmZmZi3mRriZmZmZWYs1vREuaYmk+ZLmSZqb07aWdI2ke/L7VoX5T5G0WNJdkg4qpI/Ky1ks6YeSlNM3knRpTr9RUluzt8nMzMzMbH20KhJ+QETsFRGj8/hkYFZEjARm5XEk7Q6MB/YAxgJnShqQ85wFTAJG5tfYnD4ReCIidgO+D5zegu0xMzMzM+u07uqOMg6YloenAYcU0i+JiOcj4j5gMbCvpO2BIRExJyICOL8qT2VZM4AxlSi5mZmZmVlP1IpGeAC/l3SLpEk5bbuIeBggv2+b04cDDxbyLs1pw/NwdfpaeSJiFfAUMLS6EJImSZorae7y5cu7ZMPMzMzMzDpjYAvWsX9EPCRpW+AaSXfWmbcsgh110uvlWTsh4mzgbIDRo0evM93MzMzMrFUaioRL2j9fQHm3pHsl3Sfp3kbyRsRD+X0ZcBmwL/BI7mJCfl+WZ18K7FjIPgJ4KKePKElfK4+kgc
AWwOONlM3MzMzMrDvUbIRLOlhSpcvHVOB7wFuANwCj83tdkjaVtHllGPh3YAEwE5iQZ5sAXJ6HZwLj8x1PdiFdgHlT7rKyQtJ+ub/30VV5Kss6DJid+42bmZmZmfVI9bqj3A38TNLXgKci4qpOLH874LJ8neRA4KKIuFrSzcB0SROBB4DDASJioaTpwB3AKuD4iFidl3UccB4wGLgqvyCdIFwgaTEpAj6+E+U0MzMzM2uZmo3wiLhb0jhgN+BaSd8Gfg08X5jn1noLj4h7gdeXpD8GjKmRZwowpSR9LrBnSfpKciPezMzMzKw3qHthZo5C3yXpjTlpdHEycGCzCmZmZmZm1lc1dHeUiDig2QUxMzMzM+svGr07ynaSpkq6Ko/vnvtzm5mZmZlZBzX6sJ7zgN8BO+Txu4HPNKE8ZmZmZmZ9XqON8G0iYjrwErz8ZMrV9bOYmZmZmVmZRhvhz0oaSn4SpaT9SI+HNzMzMzOzDmr0sfUnkR6K80pJfwaGkR6MY2ZmZmZmHdTo3VFulfR24NWAgLsi4sWmlszMzMzMrI+q2wiXdGBEzJb0gapJr5JERPy6iWUzMzMzM+uT2ouEvx2YDbyvZFqQnqBpZmZmZmYd0N4TM7+c3z/SmuKYmZmZmfV9jT6sZ6ikH0q6VdItks7Id0sxMzMzM7MOavQWhZcAy4H/IN0VZTlwabMKZWZmZmbWlzV6i8KtI+JrhfGvSzqkCeUxMzMzM+vzGo2EXytpvKQN8uuDwG+bWTAzMzMzs76q0Ub4J4CLgBfy6xLgJEkrJD1dL6OkAZJuk3RFHt9a0jWS7snvWxXmPUXSYkl3STqokD5K0vw87YeSlNM3knRpTr9RUluHtt7MzMzMrBs01AiPiM0jYoOIGJhfG+S0zSNiSDvZTwQWFcYnA7MiYiQwK48jaXdgPLAHMBY4U9KAnOcsYBIwMr/G5vSJwBMRsRvwfeD0RrbHzMzMzKw7NRoJR9JWkvaV9LbKK6d/sk6eEcB7gXMKyeOAaXl4GnBIIf2SiHg+Iu4DFgP7StoeGBIRcyIigPOr8lSWNQMYU4mSm5mZmZn1VA1dmCnpY6SI9ghgHrAfMEfSHcBo4MwaWX8AnAxsXkjbLiIeBoiIhyVtm9OHA38tzLc0p72Yh6vTK3kezMtaJekpYCjwaCPbZWZmZmbWHRqNhJ8IvAG4PyIOAPYm3abwC6yJSq9F0sHAsoi4pcF1lEWwo056vTxl5Zkkaa6kucuXL2+wSGZmZmZmXa/RRvjKiFgJ6WLIiLgTeHVEPB0R/6yRZ3/g/ZKWkC7kPFDSL4FHchcT8vuyPP9SYMdC/hHAQzl9REn6WnkkDQS2AB4vK0xEnB0RoyNi9LBhwxrcbDMzMzOzrtdoI3yppC2B3wDXSLqcNQ3hUhFxSkSMiIg20gWXsyPiKGAmMCHPNgG4PA/PBMbnO57sQroA86bcdWWFpP1yf++jq/JUlnVYXkdpJNzMzMzMrKdoqE94RByaB0+VdC0p4nxVJ9d5GjBd0kTgAeDwvI6FkqYDdwCrgOMjYnXOcxxwHjA4r7ey7qnABZIWkyLg4ztZJjMzMzOzlmn0wswLIuLDABHxx0oa8OFG8kfEdcB1efgxYEyN+aYAU0rS5wJ7lqSvJDfizczMzMx6i0a7o+xRHMn37x7V9cUxMzMzM+v76jbC8xMsVwCvk/R0fq0gXUx5eb28ZmZmZmZWrm4jPCK+GRGbA9+OiCH5tXlEDI2IU1pURjMzMzOzPqXR7ihXSNoUQNJRkr4naecmlsvMzMzMrM9qtBF+FvCcpNeTnoB5P+nx8WZmZmZm1kGNNsJX5ftvjwPOiIgzWPtR9GZmZmZm1qCGblFIeljOKcBRwNvy3VEGNa9YZmZmZmZ9V6OR8COA54GJ+TH1w4FvN61UZmZmZmZ9WKOR8MOAX0TEEwAR8QDuE25mZmZm1imNRsJfAdwsabqksZLUzEKZmZmZmfVlDTXCI+JLwEhgKnAMcI
+kb0h6ZRPLZmZmZmbWJzUaCSffHeWf+bUK2AqYIelbTSqbmZmZmVmf1FCfcEknABOAR4FzgP8XES9K2gC4h3TvcDM
|
|
|
|
"text/plain": [
|
|
|
|
"<Figure size 864x216 with 1 Axes>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {
|
|
|
|
"needs_background": "light"
|
|
|
|
},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
"from collections import OrderedDict\n",
|
|
|
|
"\n",
|
|
|
|
"def rang_freq_with_labels(name, g, top=None):\n",
"    freq = freq_list(g, top)\n",
"\n",
"    plt.figure(figsize=(12, 3))\n",
"    plt.ylabel('number of occurrences')\n",
"\n",
"    plt.bar(freq.keys(), freq.values())\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
|
|
|
|
"\n",
|
|
|
|
"rang_freq_with_labels('pt-chars', get_characters(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Words\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"What we mean by a word is not obvious. In practice, it depends on the choice of **tokenizer**.\n",
"\n",
"Let us assume that a word is an uninterrupted sequence of letters or digits (and asterisks;\n",
"this will shortly make it easier to analyze a certain text…).\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
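{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how much the choice of tokenizer matters, the following sketch compares two simple tokenizers on a sample sentence. It is only an illustration under stated assumptions: it uses the standard-library `re` module with `\\w+` (which also accepts underscores), not the pattern adopted in this notebook.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"text = 'Litwo! Ojczyzno moja! ty jesteś jak zdrowie;'\n",
"\n",
"# Whitespace tokenizer: punctuation stays glued to the words.\n",
"print(text.split())\n",
"# ['Litwo!', 'Ojczyzno', 'moja!', 'ty', 'jesteś', 'jak', 'zdrowie;']\n",
"\n",
"# Regex tokenizer: \\w+ keeps runs of letters, digits, and underscores.\n",
"print(re.findall(r'\\w+', text))\n",
"# ['Litwo', 'Ojczyzno', 'moja', 'ty', 'jesteś', 'jak', 'zdrowie']\n",
"```\n",
"\n",
"The two tokenizers yield different word types (`moja!` versus `moja`), so frequency lists built on them differ as well.\n"
]
},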
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 6,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"text/plain": [
|
|
|
|
"['Ty',\n",
|
|
|
|
" 'co',\n",
|
|
|
|
" 'gród',\n",
|
|
|
|
" 'zamkowy',\n",
|
|
|
|
" 'Nowogródzki',\n",
|
|
|
|
" 'ochraniasz',\n",
|
|
|
|
" 'z',\n",
|
|
|
|
" 'jego',\n",
|
|
|
|
" 'wiernym',\n",
|
|
|
|
" 'ludem',\n",
|
|
|
|
" 'Jak',\n",
|
|
|
|
" 'mnie',\n",
|
|
|
|
" 'dziecko',\n",
|
|
|
|
" 'do',\n",
|
|
|
|
" 'zdrowia',\n",
|
|
|
|
" 'powróciłaś',\n",
|
|
|
|
" 'cudem',\n",
|
|
|
|
" 'Gdy',\n",
|
|
|
|
" 'od',\n",
|
|
|
|
" 'płaczącej',\n",
|
|
|
|
" 'matki',\n",
|
|
|
|
" 'pod',\n",
|
|
|
|
" 'Twoją',\n",
|
|
|
|
" 'opiekę',\n",
|
|
|
|
" 'Ofiarowany',\n",
|
|
|
|
" 'martwą',\n",
|
|
|
|
" 'podniosłem',\n",
|
|
|
|
" 'powiekę',\n",
|
|
|
|
" 'I',\n",
|
|
|
|
" 'zaraz']"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"execution_count": 6,
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "execute_result"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"from itertools import islice\n",
|
|
|
|
"import regex as re\n",
|
|
|
|
"\n",
|
|
|
|
"def get_words(t):\n",
"    # A word is a maximal run of letters (\\p{L}), digits, or asterisks.\n",
"    for m in re.finditer(r'[\\p{L}0-9\\*]+', t):\n",
"        yield m.group(0)\n",
|
|
|
|
"\n",
|
|
|
|
"list(islice(get_words(pan_tadeusz), 100, 130))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"Let us look at the 20 most frequent words.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABLAAAAEsCAYAAADTvUpQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAABC3UlEQVR4nO3deViU9f7/8dfIruAEmBCJS7mkgktWbhV43FJJO56yo4b2y6OWueCu2YLmUnYUCk+LHk+YS6aWrSdzSUkOrigWaZqmuQTRQihKgHD//ujy/jqCZsVwz9DzcV1zXd6f+z0zrxmYG3zzuT+3zTAMQwAAAAAAAICLqmZ1AAAAAAAAAOBKaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApXlaHQBVT2lpqb755hsFBATIZrNZHQcAAAAAUMUZhqEzZ84oLCxM1aoxV6cqooGFCvfNN98oPDzc6hgAAAAAgD+ZEydOqE6dOlbHgBPQwEKFCwgIkPTLgaNmzZoWpwEAAAAAVHWnT59WeHi4+f9RVD00sFDhLpw2WLNmTRpYAAAAAIBKwzI2VRcnhgIAAAAAAMCl0cACAAAAAACAS6OBBQAAAAAAAJdGAwsAAAAAAAAujQYWAAAAAAAAXBoNLAAAAAAAALg0GlgAAAAAAABwaTSwAAAAAAAA4NI8rQ4AWKn+lA+sjiBJOvZMr1+tcaesAAAAAABUJGZgAQAAAAAAwKXRwAIAAAAAAIBLo4EFAAAAAAAAl0YDCwAAAAAAAC6NBhYAAAAAAABcGg0sAAAAAAAAuDQaWAAAAAAAAHBpNLAAAAAAAADg0mhgAQAAAAAAwKXRwHITn3zyie6++26FhYXJZrPp7bffvmzt8OHDZbPZlJiY6DBeWFioUaNGqVatWqpRo4Z69+6tkydPOtTk5uYqNjZWdrtddrtdsbGx+umnnyr+BQEAAAAAAFwlGlhu4uzZs2rZsqUWLFhwxbq3335bO3bsUFhYWJl9cXFxWrt2rVauXKnU1FTl5+crJiZGJSUlZs2AAQOUkZGhdevWad26dcrIyFBsbGyFvx4AAAAAAICr5Wl1AFydHj16qEePHlesOXXqlEaOHKmPPvpIvXr1ctiXl5enxYsXa+nSperSpYskadmyZQoPD9fGjRvVvXt3HThwQOvWrdP27dvVtm1bSdKiRYvUvn17HTx4UE2aNHHOi0OVU3/KB1ZHkCQde6bXrxcBAAAAAFweM7CqiNLSUsXGxmrixIlq3rx5mf3p6ekqLi5Wt27dzLGwsDBFREQoLS1NkrRt2zbZ7XazeSVJ7dq1k91uN2vKU1hYqNOnTzvcAAAAAAAAKgoNrCri2Weflaenp0aPHl3u/uzsbHl7eyswMNBhPCQkRNnZ2WZN7dq1y9y3du3aZk155syZY66ZZbfbFR4e/gdeCQAAAAAAgCMaWFVAenq6nn/+eSUnJ8tms/2m+xqG4XCf8u5/ac2lpk6dqry8PPN24sSJ35QBAAAAAADgSmhgVQFbt25VTk6O6tatK09PT3l6eurrr7/W+PHjVb9+fUlSaGioioqKlJub63DfnJwchY
SEmDXffvttmcf/7rvvzJry+Pj4qGbNmg43AAAAAACAikIDqwqIjY3Vp59+qoyMDPMWFhamiRMn6qOPPpIktWnTRl5eXtqwYYN5v6ysLGVmZqpDhw6SpPbt2ysvL087d+40a3bs2KG8vDyzBgAAAAAAoLJxFUI3kZ+fr8OHD5vbR48eVUZGhoKCglS3bl0FBwc71Ht5eSk0NNS8cqDdbteQIUM0fvx4BQcHKygoSBMmTFBkZKR5VcKmTZvqrrvu0tChQ/XKK69IkoYNG6aYmBiuQAgAAAAAACxDA8tN7N69W506dTK3x40bJ0kaPHiwkpOTr+oxEhIS5OnpqX79+qmgoECdO3dWcnKyPDw8zJrly5dr9OjR5tUKe/furQULFlTcCwEAAAAAAPiNaGC5iejoaBmGcdX1x44dKzPm6+urpKQkJSUlXfZ+QUFBWrZs2e+JCAAAAAAA4BSsgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACXRgMLAAAAAAAALo0GFgAAAAAAAFwaDSwAAAAAAAC4NBpYAAAAAAAAcGk0sAAAAAAAAODSaGABAAAAAADApdHAAgAAAAAAgEujgQUAAAAAAACX5ml1AACwSv0pH1gdwXTsmV5WRwAAAAAAl8UMLDfxySef6O6771ZYWJhsNpvefvttc19xcbEmT56syMhI1ahRQ2FhYRo0aJC++eYbh8coLCzUqFGjVKtWLdWoUUO9e/fWyZMnHWpyc3MVGxsru90uu92u2NhY/fTTT5XwCgEAAAAAAMpHA8tNnD17Vi1bttSCBQvK7Dt37pz27NmjJ554Qnv27NFbb72lQ4cOqXfv3g51cXFxWrt2rVauXKnU1FTl5+crJiZGJSUlZs2AAQOUkZGhdevWad26dcrIyFBsbKzTXx8AAAAAAMDlcAqhm+jRo4d69OhR7j673a4NGzY4jCUlJem2227T8ePHVbduXeXl5Wnx4sVaunSpunTpIklatmyZwsPDtXHjRnXv3l0HDhzQunXrtH37drVt21aStGjRIrVv314HDx5UkyZNnPsiAQAAAAAAysEMrCoqLy9PNptN11xzjSQpPT1dxcXF6tatm1kTFhamiIgIpaWlSZK2bdsmu91uNq8kqV27drLb7WYNAAAAAABAZWMGVhX0888/a8qUKRowYIBq1qwpScrOzpa3t7cCAwMdakNCQpSdnW3W1K5du8zj1a5d26wpT2FhoQoLC83t06dPV8TLAAAAAAAAkMQMrCqnuLhYf//731VaWqoXX3zxV+sNw5DNZjO3L/735WouNWfOHHPRd7vdrvDw8N8XHgAAAAAAoBw0sKqQ4uJi9evXT0ePHtWGDRvM2VeSFBoaqqKiIuXm5jrcJycnRyEhIWbNt99+W+Zxv/vuO7OmPFOnTlVeXp55O3HiRAW9IgAAAAAAABpYVcaF5tWXX36pjRs3Kjg42GF/mzZt5OXl5bDYe1ZWljIzM9WhQwdJUvv27ZWXl6edO3eaNTt27FBeXp5ZUx4fHx/VrFnT4QYAAAAAAFBRWAPLTeTn5+vw4cPm9tGjR5WRkaGgoCCFhYXp3nvv1Z49e/T++++rpKTEXLMqKChI3t7estvtGjJkiMaPH6/g4GAFBQVpwoQJioyMNK9K2LRpU911110aOnSoXnnlFUnSsGHDFBMTwxUIAQAAAACAZWhgOdlPP/2knTt3KicnR6WlpQ77Bg0adNWPs3v3bnXq1MncHjdunCRp8ODBio+P17vvvitJatWqlcP9Nm/erOjoaElSQkKCPD091a9fPxUUFKhz585KTk6Wh4eHWb98+XKNHj3avFph7969tWDBgqvOCQAAAAAAUNFoYDnRe++9p4EDB+rs2bMKCAgos1j6b2lgRUdHyzCMy+6/0r4LfH19lZSUpKSkpMvWBA
UFadmyZVedCwAAAAAAwNlYA6sCrVq1Sl9//bW5PX78eD300EM6c+aMfvrpJ+Xm5pq3H3/80cKkAAAAAAAA7oMGVgX
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"rang_freq_with_labels('pt-words-20', get_words(pan_tadeusz), top=20)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"Now let us look at the full picture, this time without labels.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAwSElEQVR4nO3df3QV9Z3/8dclP64hm1zzo7mXWwKme7LWNSm1wYZEt2CBADWkLt9dtLgp3eWruAg0S/AHx+2Kfk+Tyq7ArqlWWb5CQYyn3xXWs7WRUDWaL6AYyAqIqGsKQXMN+r3cJBBvQjLfPyizvSSAQiZ3bub5OGfOyZ15z+QzfrwnLz4znxmXYRiGAAAA4Bijot0AAAAADC8CIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADhMf7QbEsv7+fn388cdKSUmRy+WKdnMAAMAXYBiGOjs75ff7NWqUM8fCCICX4eOPP1Z2dna0mwEAAC5Ba2urxo4dG+1mRAUB8DKkpKRIOvM/UGpqapRbAwAAvoiOjg5lZ2ebf8ediAB4Gc5e9k1NTSUAAgAQY5x8+5YzL3wDAAA4GAEQAADAYQiAAAAADkMABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCoA39Zn+bfly7T796qzXaTQEAACMQAdCG3g106t+bP9a+1hPRbgoAABiBCIA2lBh/plv6+owotwQAAIxEBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMABAAAcBgCoI0ZYhYwAAAYegRAAAAAhyEAAgAAOAwBEAAAwGEIgAAAAA5DAAQAAHAYAqCNGUwCBgAAFiAA2pDLFe0WAACAkYwACAAA4DAEQAAAAIchAAIAADiM7QLga6+9ptmzZ8vv98vlcmnbtm3nrV24cKFcLpfWrl0bsT4cDmvJkiXKzMxUcnKyysrKdOzYsYiaYDCo8vJyeTweeTwelZeX68SJE0N/QgAAADZjuwB48uRJTZgwQTU1NRes27Ztm9544w35/f4B2yoqKrR161bV1taqsbFRXV1dKi0tVV9fn1kzb948NTc3q66uTnV1dWpublZ5efmQn8/lYBIwAACwQny0G3CuWbNmadasWRes+eijj7R48WK99NJLuvnmmyO2hUIhrV+/Xps2bdK0adMkSZs3b1Z2drZ27NihGTNm6NChQ6qrq9Pu3btVWFgoSVq3bp2Kiop0+PBhXX311dac3BfkEtOAAQCAdWw3Angx/f39Ki8v1z333KNrr712wPampib19vaqpKTEXOf3+5WXl6edO3dKknbt2iWPx2OGP0maNGmSPB6PWTOYcDisjo6OiAUAACDWxFwAfOSRRxQfH6+lS5cOuj0QCCgxMVFpaWkR671erwKBgFmTlZU1YN+srCyzZjDV1dXmPYMej0fZ2dmXcSYAAADREVMBsKmpSf/8z/+sDRs2yPUln5ZsGEbEPoPtf27NuVasWKFQKGQura2tX6oNAAAAdhBTAfD1119Xe3u7xo0bp/
j4eMXHx+vIkSOqrKzUVVddJUny+Xzq6elRMBiM2Le9vV1er9es+eSTTwYc//jx42bNYNxut1JTUyMWAACAWBNTAbC8vFxvv/22mpubzcXv9+uee+7RSy+9JEkqKChQQkKC6uvrzf3a2tp04MABFRcXS5KKiooUCoX05ptvmjVvvPGGQqGQWQMAADBS2W4WcFdXlz744APzc0tLi5qbm5Wenq5x48YpIyMjoj4hIUE+n8+cuevxeLRgwQJVVlYqIyND6enpWr58ufLz881Zwddcc41mzpypO+64Q08++aQk6c4771RpaWnUZwBL//0uYIPnwAAAAAvYLgC+9dZbuummm8zPy5YtkyTNnz9fGzZs+ELHWLNmjeLj4zV37lx1d3dr6tSp2rBhg+Li4syaZ555RkuXLjVnC5eVlV302YMAAAAjgcswGGe6VB0dHfJ4PAqFQkN6P+AvGv5LP/vNu/of3xqrR+dOGLLjAgAA6/5+x5KYugcQAAAAl48ACAAA4DAEQAAAAIchANrQ2UdRG+L2TAAAMPQIgAAAAA5DAAQAAHAYAiAAAIDDEAABAAAchgAIAADgMARAGzr7LmAAAAArEADtjKfAAAAACxAAAQAAHIYACAAA4DAEQAAAAIchAAIAADgMARAAAMBhCIA25NKZ58AwCRgAAFiBAAgAAOAwBEAAAACHIQACAAA4DAEQAADAYQiAAAAADkMAtCGXK9otAAAAIxkB0MYMgwfBAACAoUcABAAAcBgCIAAAgMMQAAEAAByGAAgAAOAwBEAAAACHIQDaGHOAAQCAFQiAAAAADkMABAAAcBgCIAAAgMMQAAEAABzGdgHwtdde0+zZs+X3++VyubRt2zZzW29vr+677z7l5+crOTlZfr9fP/zhD/Xxxx9HHCMcDmvJkiXKzMxUcnKyysrKdOzYsYiaYDCo8vJyeTweeTwelZeX68SJE8Nwhhfn4mXAAADAQrYLgCdPntSECRNUU1MzYNupU6e0d+9e/eQnP9HevXv1/PPP67333lNZWVlEXUVFhbZu3ara2lo1Njaqq6tLpaWl6uvrM2vmzZun5uZm1dXVqa6uTs3NzSovL7f8/AAAAKItPtoNONesWbM0a9asQbd5PB7V19dHrHvsscf07W9/W0ePHtW4ceMUCoW0fv16bdq0SdOmTZMkbd68WdnZ2dqxY4dmzJihQ4cOqa6uTrt371ZhYaEkad26dSoqKtLhw4d19dVXW3uSX5DBc2AAAIAFbDcC+GWFQiG5XC5deeWVkqSmpib19vaqpKTErPH7/crLy9POnTslSbt27ZLH4zHDnyRNmjRJHo/HrBlMOBxWR0dHxAIAABBrYjoAfv7557r//vs1b948paamSpICgYASExOVlpYWUev1ehUIBMyarKysAcfLysoyawZTXV1t3jPo8XiUnZ09hGcDAAAwPGI2APb29uq2225Tf3+/Hn/88YvWG4YRMblisIkW59aca8WKFQqFQubS2tp6aY0HAACIopgMgL29vZo7d65aWlpUX19vjv5Jks/nU09Pj4LBYMQ+7e3t8nq9Zs0nn3wy4LjHjx83awbjdruVmpoasQAAAMSamAuAZ8Pf+++/rx07digjIyNie0FBgRISEiImi7S1tenAgQMqLi6WJBUVFSkUCunNN980a9544w2FQiGzJpp4CAwAALCS7WYBd3V16YMPPjA/t7S0qLm5Wenp6fL7/fqLv/gL7d27V//xH/+hvr4+85699PR0JSYmyuPxaMGCBaqsrFRGRobS09O1fPly5efnm7OCr7nmGs2cOVN33HGHnnzySUnSnXfeqdLSUtvMAJYkJgEDAAAr2C4AvvXWW7rpppvMz8uWLZMkzZ8/XytXrtQLL7wgSfrmN78Zsd8rr7yiKVOmSJLWrFmj+Ph4zZ07V93d3Zo6dao2bNiguLg4s/6ZZ57R0qVLzdnCZWVlgz57EAAAYKRxGQZPm7tUHR0d8ng8CoVCQ3o/4P9ubN
HD//GOZk/w67EfXDdkxwUAANb9/Y4lMXcPIAAAAC4PARAAAMBhCIA2dIFHEQIAAFw2AiAAAIDDEABtjPk5AADACgR
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
"from math import log\n",
|
|
|
|
"\n",
|
|
|
|
"def rang_freq(name, g):\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    plt.plot(range(1, len(freq.values())+1), freq.values())\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
|
|
|
|
"\n",
|
|
|
|
"rang_freq('pt-words', get_words(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"The plot clearly spans very different scales. We will apply a logarithm,\n",
"at first only to the y coordinate.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAo10lEQVR4nO3dfZSV5X03+t+eFzY4DqOIvEwYkZgYNaCxaBWNifGFSMQkzcsyqTEkabqWWWBqtWc1NutZmrQV2z718TzHxlaPx2rTiOusqvU8MSZ4VNSjGF+wQTQGowFUECU4w4tsmJn7/MHMxglCVPae+95zfT5r7eXsPfdmfsM1s/j6u+7rukpZlmUBAEAymvIuAACA4SUAAgAkRgAEAEiMAAgAkBgBEAAgMQIgAEBiBEAAgMQIgAAAiREAAQASIwACACRGAAQASIwACACQGAEQACAxAiAAQGIEQACAxAiAAACJEQABABIjAAIAJEYABABIjAAIAJAYARAAIDECIABAYgRAAIDECIAAAIkRAAEAEiMAAgAkRgAEAEiMAAgAkBgBEAAgMQIgAEBiBEAAgMQIgAAAiREAAQASIwACACRGAAQASIwACACQGAEQACAxAiAAQGIEQACAxAiAAACJEQABABIjAAIAJEYABABIjAAIAJAYARAAIDECIABAYgRAAIDECIAAAIkRAAEAEiMAAgAkRgAEAEiMAAgAkBgBEAAgMQIgAEBiBEAAgMQIgAAAiWnJu4BG1t/fH6+88kq0t7dHqVTKuxwA4B3Isiw2bdoUnZ2d0dSUZi9MANwHr7zySnR1deVdBgDwHqxZsyamTJmSdxm5EAD3QXt7e0Ts/AEaO3ZsztUAAO9ET09PdHV1Vf8dT5EAuA8Gp33Hjh0rAAJAg0n59q00J74BABImAAIAJEYABABIjAAIAJAYARAAIDECIABAYgRAAIDECIAAAIkRAAEAEiMAAgAkRgAEAEiMAFhQWZblXQIAMEIJgAX0y3U98elr/r/4xUtv5F0KADACCYAF9I8/+1Usf7k7/u/HX8q7FABgBBIAC+gjXQdERESlty/fQgCAEUkALKBSaed/3QYIANSDAFhApdiZAPsFQACgDgTAAmoa7ACGBAgA1J4AWECDU8DyHwBQDwJgAe2aApYAAYDaEwALqLoIJN8yAIARSgAsoNJAAtQABADqQQAsILcAAgD1lHQAPPTQQ6NUKu32mD9/fq51Da4Cdg8gAFAPLXkXkKfHHnss+vp2nbbx9NNPx5lnnhlf/OIXc6xq1xSwFiAAUA9JB8CDDz54yPMrr7wyDjvssPj4xz+eU0U7lewDCADUUdJTwG+1ffv2+OEPfxjf+MY3dnXgclK9B1D+AwDqIOkO4Fvdcccd8cYbb8TXvva1PV5TqVSiUqlUn/f09NSnGKuAAYA60gEccMMNN8ScOXOis7Nzj9csXLgwOjo6qo+urq661LJrFbAECADUngAYEatWrYp77rknvvnNb+71uksvvTS6u7urjzVr1tSlnpxnoAGAEc4UcETceOONMWHChDj77LP3el25XI5yuTxMVZkCBgDqI/kOYH9/f9x4440xb968aGkpRh4ePAtY/gMA6iH5AHjPPffE6tWr4xvf+EbepVSZAgYA6qkYLa8czZ49O7KCzrUWtCwAoMEl3wEsol0NQAkQAKg9AbCATAEDAPUkABaYKWAAoB4EwAKyChgAqCcBsIhMAQMAdSQAFlhRVycDAI1NACygXWcBAwDUngBYQKWBZcAagABAPQiABeQWQACgngTAAtMABADqQQAsoMGNoC0CAQDqQQAsICeBAAD1JAACACRGACyg6kkgZoABgDoQAAvIFDAAUE8CYIFl1gEDAHUgABaYKWAAoB4EwAIqmQMGAOpIACwwHUAAoB4EwAIa7P+5BxAAqAcBsIB2nQSSbx0AwMgkABZQKdwDCADUjwBYYBqAAEA9CIAFVNp1EyAAQM0JgAVkAh
gAqCcBsMCsAgYA6kEALCCrgAGAehIAC8kkMABQPwJggWkAAgD1IAAW0K4pYBEQAKg9AbCATAADAPUkABaY/h8AUA8CYAGVBuaAzQADAPUgABaQg0AAgHoSAAuo5CZAAKCOkg6AL7/8cnzlK1+Jgw46KPbbb7/4yEc+Ek888UTeZe1iDhgAqIOWvAvIy8aNG+Pkk0+OT3ziE/GTn/wkJkyYEL/+9a/jgAMOyLu0XdvA5FsGADBCJRsA/+7v/i66urrixhtvrL526KGH5lfQW5RsBAMA1FGyU8B33nlnHHfccfHFL34xJkyYEMcee2xcf/31eZc1hBlgAKAekg2AL7zwQlx77bXxwQ9+MH7605/GBRdcEN/+9rfj5ptv3uN7KpVK9PT0DHnURXUKWAIEAGov2Sng/v7+OO644+KKK66IiIhjjz02VqxYEddee2189atffdv3LFy4ML73ve/VvTYTwABAPSXbAZw8eXIcddRRQ1478sgjY/Xq1Xt8z6WXXhrd3d3Vx5o1a+paoylgAKAeku0AnnzyyfHcc88Nee1Xv/pVTJ06dY/vKZfLUS6X612ak0AAgLpKtgP453/+57F06dK44oor4vnnn48f/ehHcd1118X8+fPzLs0UMABQV8kGwOOPPz5uv/32uOWWW2L69Onx13/913H11VfHeeedl3dpVRqAAEA9JDsFHBExd+7cmDt3bt5l7Ka6EbQ5YACgDpLtABaZjaABgHoSAAuoJP8BAHUkABaYGWAAoB4EwAIabAA6CQQAqAcBsIhMAQMAdSQAFpgpYACgHgTAAhpcBSz/AQD1IAAWkFXAAEA9CYAFZiNoAKAeBMAC2rUKGACg9gTAAmpu2hkB+/tFQACg9gTAAhoMgH2mgAGAOhAAC6gaAPsEQACg9gTAAhoMgL2mgAGAOhAAC6ilaeew9JsCBgDqQAAsoOaBUdEBBADqQQAsoOaBDmCfAAgA1IEAWEDNA0eBCIAAQD0IgAXU3CwAAgD1IwAWkA4gAFBPAmAB2QYGAKgnAbCAWppK1Y8dBwcA1JoAWEBNbwmAuoAAQK0JgAU0pANoM2gAoMYEwAJq1gEEAOpIACygtwZAK4EBgFoTAAtocBuYCAEQAKg9AbCAmppKMZgBe/v78y0GABhxBMCCGlwIIv8BALUmABZUU2lwM2gJEACoLQGwoHQAAYB6EQALqqlJBxAAqA8BsKCqHUAbQQMANSYAFlRz086hsRE0AFBrAmBBNQ+MTG+fAAgA1FayAfDyyy+PUqk05DFp0qS8y6pqGegAmgIGAGqtJe8C8vThD3847rnnnurz5ubmHKsZqmmwA2gKGACosaQDYEtLS6G6fm9V7QAKgABAjSU7BRwRsXLlyujs7Ixp06bFl770pXjhhRfyLqmqqXoUnAAIANRWsh3AE044IW6++eY4/PDD49VXX42/+Zu/iZNOOilWrFgRBx100Nu+p1KpRKVSqT7v6empW32DHcA+ARAAqLFkO4Bz5syJz3/+8zFjxow444wz4sc//nFERNx00017fM/ChQujo6Oj+ujq6qpbfc0DLUABEACotWQD4O9qa2uLGTNmxMqVK/d4zaWXXhrd3d3Vx5o1a+pWjwAIANRLslPAv6tSqcSzzz4bp5xyyh6vKZfLUS6Xh6UeARAAqJdkO4B/8Rd/EUuWLIkXX3wxHn300fjCF74QPT09MW/evLxLi4hdAdAiEACg1pLtAL700kvx5S9/OV5//fU4+OCD48QTT4ylS5fG1KlT8y4tInYFQBtBAwC1lmwAXLRoUd4l7FVzSQcQAKiPZKeAi66lefAewP6cKwEARhoBsKB2LQLJuRAAYMQRAAtqcApYBxAAqDUBsKB0AAGAehEAC2pXAJQAAYDaEgALykbQAEC9CIAFZSNoAKBeBMCC0gEEAOpFACyocsvOodne6x5AAKC2BMCCKrc0R0TEtt6+nCsBAEYaAbCgyq07h6ayQwcQAKgtAbCgRusAAgB1Ig
AW1OjWgQCoAwgA1JgAWFCDi0C27dABBABqSwAsqDGjBjuAAiAAUFsCYEHtNxAAt1QEQACgtgTAgtq/3BIREVu39+Z
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
"from math import log\n",
|
|
|
|
"\n",
|
|
|
|
"def rang_log_freq(name, g):\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    plt.plot(range(1, len(freq.values())+1), [log(y) for y in freq.values()])\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
|
|
|
|
"\n",
|
|
|
|
"rang_log_freq('pt-words-log', get_words(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"**Question** Why do we see longer and longer “steps”?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
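{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to investigate this question is to count how many distinct words share each frequency. The sketch below is a hedged illustration on a toy sample; `frequency_spectrum` is a hypothetical helper, not part of the original notebook.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def frequency_spectrum(tokens):\n",
"    # Map each frequency to the number of distinct tokens that have it.\n",
"    return Counter(Counter(tokens).values())\n",
"\n",
"sample = 'a a a b b c d e'.split()\n",
"\n",
"# Three tokens occur once, one occurs twice, one occurs three times.\n",
"print(sorted(frequency_spectrum(sample).items()))   # [(1, 3), (2, 1), (3, 1)]\n",
"```\n",
"\n",
"Applied to `get_words(pan_tadeusz)`, the same function would show how many words share each low frequency; every group of equally frequent words forms one flat step in the plot.\n"
]
},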
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Hapax legomena\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"From the previous plot we can read off that about 2/3 of the distinct words\n",
|
|
|
|
"occurred exactly once. Words that occur once in a given corpus are called\n",
|
|
|
|
"*hapax legomena* (singular: *hapax legomenon*, ἅπαξ λεγόμενον,\n",
|
|
|
|
"“said once”; in Polish jargon: “hapaks”).\n",
|
|
|
|
"\n",
|
|
|
|
"“True” hapax legomena, i.e. words that occur only once in the *entire*\n",
|
|
|
|
"corpus of texts of a given language (e.g. an ancient one), naturally\n",
|
|
|
|
"cause enormous difficulties in translation. An example is the Greek\n",
|
|
|
|
"word ἐπιούσιος, an attribute describing bread in the Lord's Prayer.\n",
|
|
|
|
"It is the only attestation of this word in the entire known corpus of\n",
|
|
|
|
"Greek (not only in the Bible). In Polish it is translated as\n",
|
|
|
|
"“powszedni” (“daily, everyday”), whereas in Russian the accepted equivalent is\n",
|
|
|
|
"“насущный” (“vital, essential”), which has the opposite meaning to the Polish word!\n",
|
|
|
|
"\n",
|
|
|
|
"All in all, hapaxes can cause similar problems for statistical methods\n",
|
|
|
|
"when processing any corpus.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
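The definition of hapax legomena above translates directly into a frequency count. The following is a minimal sketch, assuming the text has already been split into tokens; the helper name `hapax_legomena` and the toy token list are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter


def hapax_legomena(tokens):
    """Return the set of tokens that occur exactly once in the sequence."""
    counts = Counter(tokens)
    return {token for token, count in counts.items() if count == 1}


# Toy corpus used only for illustration (an assumption, not the full poem).
sample_tokens = (
    "litwo ojczyzno moja ty jesteś jak zdrowie ile cię trzeba cenić "
    "ten tylko się dowie kto cię stracił"
).split()

hapaxes = hapax_legomena(sample_tokens)

# Share of distinct words (types) that are hapaxes.
hapax_share = len(hapaxes) / len(set(sample_tokens))
print(sorted(hapaxes))
print(f"{hapax_share:.2f}")
```

On such a tiny sample almost every word is a hapax, so the share is much higher than the roughly 2/3 reported for the whole text.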
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Log-log plot\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"If we draw the plot mentioned earlier using a logarithmic scale\n",
|
|
|
|
"for **both** axes, we get a shape close to a straight line.\n",
|
|
|
|
"\n",
|
|
|
|
"This property of texts is known as **Zipf's law**.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA340lEQVR4nO3deXhU5cH+8fvMJJksZAaSkEBMgLALyI7IDiooWqvVuqLi2mpBRboo2vd1Jz+1tfYtimKtG3WpG6J1w4WALLIrsohsSVjCTiYJZEIy8/sjMBpZAiSZZ2bO93Ndc7VzZgbupqPPneec5zlWIBAICAAAALbhMB0AAAAAoUUBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZiiAAAAANkMBBAAAsBkKIAAAgM1QAAEAAGyGAggAAGAzFEAAAACboQACAADYDAUQAADAZmJMB4hkfr9fW7ZsUXJysizLMh0HAAAch0AgoJKSEmVmZsrhsOdcGAWwDrZs2aLs7GzTMQAAwEkoLCxUVlaW6RhGUADrIDk5WVL1F8jtdhtOAwAAjofX61V2dnZwHLcjCmAdHDrt63a7KYAAAEQYO1++Zc8T3wAAADZGAQQAALAZCiAAAIDNUAABAABshgIIAABgMxRAAAAAm6EAAgAA2AwFEAAAwGYogAAAADZDAQQAALAZCiAAAIDNUAABAABsJsZ0ABzu4++26qPvimoc++ntqn9+82rrqE8k62cHfvrRn98C+6evNU6MU6vUJLVKS1ROWpIykuPlcNj3ptkAAEQTCmAYWl1UoveWbTEdo4b4WEd1IUxNUqu0JOWkJapVapJy0pLUNNl1WCkFAADhiwIYhga1a6pGrqP/XxMI/Oy5ArW8fuKf31Hi08ZdZdq4s0yFe/ar/IBfq4tKtLqo5LA8SXFOtTxYBlv9pBi2SktSalIc5RAAgDBjBQI/rwM4Xl6vVx6PR8XFxXK73abjNJgDVX5t2rNfG3eWacPOMm3c9eN/bt6zX/5jfIPSGsVpVN+Wuq5/KzVJigtdaAAAjsIu4/exUADrgC+Q5KusUuHu6nL402K4cec+bSneH5xtTIh16orTs3XzoNbKbJxgNjQAwNYYvymAdcIX6NjKD1Tp81Xb9fTMtVqxxStJinFYuqjHKbplSGu1TU82nBAAYEeM3xTAOuELdHwCgYC+WrtTk2eu09x1u4LHR3TK0K1D26hHiyYG0wEA7IbxmwJYJ3yBTtzSgj16Jm+dPlmxLXjsjNYpunVoWw1ul8aCEQBAg2P8pgDWCV+gk7d2e4mezVuvd5duVuXBVSSdM926dWgbjezSXE72HAQANBDGbwpgnfAFqrste/fr+a826LUFBdpXUSVJapmaqMv7ZOuSnlnKcMcbTggAiDaM3xTAOuELVH/2lFXo5Xn5enHuBu3Zd0CS5LCkwe2b6tJe2Tq7U7pcMU7DKQEA0YDx2+YFsFWrVsrPzz/s+O9+9zs99dRTtX6eL1D921dRqQ++3aq3Fm3Sgo27g8cbJ8bqwm6ZurR3tjpnurlWEABw0hi/bV4Ad+zYoaqqquDz7777TsOHD9eXX36poUOH1vp5vkANa8POMr21uFBvL96sIm958HjHZsm6tH
e2LuqeqdRGLoMJAQCRiPHb5gXw58aNG6cPPvhAP/zww3HNMPEFCo0qf/U2Mm8uKtSnK7epotIvSXLFOPTSDafrjNaphhMCACIJ47fkMB0gXFRUVGjq1Km64YYbOL0YZpwOS0PaN9Wkq3pqwT1n6cELO6tjs2T5Kv167OPVpuMBABBxKIAHTZs2TXv37tV111131Pf4fD55vd4aD4RW48Q4XduvlV65sa/inA4tKdirxfm7a/8gAAAIogAe9Pzzz2vkyJHKzMw86ntyc3Pl8XiCj+zs7BAmxE81TXbp4p6nSJKmzFpvOA0AAJGFAigpPz9fn332mW666aZjvm/ChAkqLi4OPgoLC0OUEEdy06AcSdKnK7dp/Y5Sw2kAAIgcFEBJL7zwgtLT03X++ecf830ul0tut7vGA+a0TU/WWR3TFQhIz3+1wXQcAAAihu0LoN/v1wsvvKDRo0crJibGdBycoJsHt5YkvbV4k3aV+gynAQAgMti+AH722WcqKCjQDTfcYDoKTkLfnBR1y/LIV+nXK/MP39QbAAAczvYFcMSIEQoEAmrfvr3pKDgJlmUFZwFfnpev8gNVtXwCAADYvgAi8p3buZmymiRod1mF3l6yyXQcAADCHgUQES/G6dCNA6tXBP9z9gZV+bm5DQAAx0IBRFS4rHe23PEx2rCzTJ+t2mY6DgAAYY0CiKiQ5IrR1We0lCQ9x8bQAAAcEwUQUeO6/q0U53RoUf4eLc7fYzoOAABhiwKIqJHujtdFPapv5ffP2cwCAgBwNBRARJWbBlVvCfPxiiLl7yoznAYAgPBEAURUaZ+RrGEdmioQqF4RDAAADkcBRNQ5tDH0m4sLtbuswnAaAADCDwUQUadf61SddopH5Qf8uued5Vpd5DUdCQCAsEIBRNSxLEtjz2wrqfpawHOfnK2Ln56jNxcVan8Ft4oDAMAKBALcNuEkeb1eeTweFRcXy+12m46Dn5m7bqemzs/Xpyu2qfLg3UGS42P0qx6n6Io+LdQpk//PAMCOGL8pgHXCFygy7Cjx6a3Fm/TaggIV7N4XPH7XuR1169A2BpMBAExg/KYA1glfoMji9wc0b/0uvTxvoz5ZsU3Jrhh9fe9ZSoyLMR0NABBCjN9cAwgbcTgsDWibpsmjeqlFSqJKfJX64JutpmMBABByFEDYjsNh6aq+LSRJ//4633AaAABCjwIIW7q0V5ZinZa+2VSs7zYXm44DAEBIUQBhS6mNXDq3S3NJ0r+/LjCcBgCA0KIAwrZGHTwN/N6yzSopP2A4DQAAoUMBhG31zUlRm6ZJ2ldRpWnLtpiOAwBAyFAAYVuWZemqvi0lSa9+XSB2RAIA2AUFELZ2Sc9T5IpxaNVWr5YW7jUdBwCAkKAAwtYaJ8bpF10zJUn/ns9iEACAPVAAYXuH9gT84NstKt7HYhAAQPSjAML2erZorI7NkuWr9OvtJZtMxwEAoMFRAGF7lmVp1BnVi0H+/XU+i0EAAFGPAghIuqh7phLjnFq3o0wLNuw2HQcAgAZFAQQkJcfH6sLu1YtB/u+LH1RZ5TecCACAhkMBBA66YUCOXDEOzVm7S/e8u5xTwQCAqEUBBA5ql5GsSVf1lMOS/rNok/766RrTkQAAaBAUQOAnhnfK0CO/Ok2SNOnLtXp53kazgQAAaAAUQOBnrjy9he48u70k6b7pK/Th8q2GEwEAUL8ogMAR3H5WW43q20KBgDTu9WWat26X6UgAANQbCiBwBJZl6cELu+jczs1UUeXXdS8s0OOfrFapr9J0NAAA6owCCByF02HpySu6a1iHpvJV+vXUl+s09PGZen1Bgar8rBAGAEQuWxfAzZs36+qrr1ZqaqoSExPVvXt3LV682HQshJH4WKf+dV0fPXtNL7VKTdTOUp/ufme5zv+/2Zr5/Xa2igEARKQY0wFM2bNnjwYMGKBhw4bpo48+Unp6utatW6fGjRubjoYwY1mWzuncTMM6pOuV+fn6+2drtLqoRNe9sF
A9WzTWncPba2DbNFmWZToqAADHxQrYdArj7rvv1pw5czR79uyT/jO8Xq88Ho+Ki4vldrvrMR3C2Z6yCk36cq2mzs+
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import matplotlib.pyplot as plt\n",
|
|
|
|
"from math import log\n",
|
|
|
|
"\n",
|
|
|
|
"def log_rang_log_freq(name, g):\n",
|
|
|
|
"    freq = freq_list(g)\n",
|
|
|
|
"\n",
|
|
|
|
"    plt.figure().clear()\n",
|
|
|
|
"    plt.plot([log(x) for x in range(1, len(freq.values())+1)], [log(y) for y in freq.values()])\n",
|
|
|
|
"\n",
|
|
|
|
"    fname = f'02_Jezyki/{name}.png'\n",
|
|
|
|
"\n",
|
|
|
|
"    plt.savefig(fname)\n",
|
|
|
|
"\n",
|
|
|
|
"    return fname\n",
|
|
|
|
"\n",
|
|
|
|
"log_rang_log_freq('pt-words-log-log', get_words(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
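A near-straight line on a log-log plot can also be summarized by a single number: its slope. The sketch below fits that slope with ordinary least squares in pure Python. The function name `zipf_slope` and the synthetic frequencies are assumptions made for illustration; a least-squares fit is one simple choice here, not necessarily the method a careful study would use.

```python
from math import log


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two positive frequencies; ranks start at 1 and are
    assigned after sorting the frequencies in descending order.
    """
    ordered = sorted(frequencies, reverse=True)
    xs = [log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [log(freq) for freq in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic data following f(r) = 1000 / r exactly (illustrative only).
ideal = [1000 / rank for rank in range(1, 101)]
slope = zipf_slope(ideal)
print(round(slope, 3))
```

For real text, one could try `zipf_slope(freq_list(get_words(pan_tadeusz)).values())`, assuming `freq_list` returns a mapping from items to counts, as its use in the plotting functions suggests. A slope in the vicinity of -1 would be consistent with the classic form of Zipf's law.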
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### The relationship between frequency and length\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"A linguistic law related to Zipf's law describes the relationship between\n",
|
|
|
|
"how often a word is used and its length. In general, the shorter the word, the more frequent it is.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAACkPklEQVR4nO2dd3gUZdvFz6bSQugldBDpvYMiUkXBLlbE3hALVuwdFJXXXpFib6AIKKICKoiAgIL0DgIiIAk1pMz3x/nG2U02hXlmU5jzu669Zneezdyz2U3m7F0DlmVZEEIIIYQQviGqsE9ACCGEEEIULBKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz5AAFEIIIYTwGRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz4gp7BMozmRmZmL79u1ISEhAIBAo7NMRQgghRD6wLAv79+9HUlISoqL86QuTADRg+/btqFWrVmGfhhBCCCFcsHXrVtSsWbOwT6NQkAA0ICEhAQA/QGXLli3ksxFCCCFEfkhJSUGtWrX+u477EQlAA+ywb9myZSUAhRBCiGKGn9O3/Bn4FkIIIYTwMRKAQgghhBA+QwJQCCGEEMJnSAAKIYQQQvgMCUAhhBBCCJ8hASiEEEII4TMkAIUQQgghfIYEoBBCCCGEz1Aj6KLG0aPA118D27YBVaoAZ5wBlCpV2Gd17KSlAdOnA+vWAeXLA2edBVSsWNhnJYQQQghIABYtPvgAGD4c+PtvICYGSE+neHr8cWDo0MI+u/wzZQpwww3Ajh1AmTLAoUPATTfxtT3xBODTwdtCCCFEUUFX4qLC558Dl14K9OgB/PknPWjr1gEXXADcfDPw2muFfYb54/vvgXPPBTp0AP74A9i/H9i5E7jnHmDUKGDEiMI+QyGEEML3BCzLsgr7JIorKSkpSExMRHJystks4MxMoFEjoEkTevrGj3dCwIMH06P21VfA1q1AiRKenX9E6NqVHr45c4Do6NC1xx/nbetWoGrVwjk/IYQQvsez63cxRh7AosCCBfT2/fsvcNppwIoVQL16FIHnnAMsXw7s3g3MmFHYZ5o7GzcCv/wC3HprdvEH0JMZCACffVbw5yaEEEKI/1AOYFHgn3+4XbgQmDyZBROBAPfNns3HALBrV6GcXr7Zs4fbE04Iv16+PFChgvM8IYQQQhQK8gAWBcqX5/bcc4Gzz3bEH8CcwBtv5P2SJQv6zI6NWrUY/l2wIPz65s0scKlbt0BPSwghhBChSAAWBQ4c4HbVKraBCSYzk8UUAKuCizJVqwIDBgDPPQfs2xe6ZlnAI48AZcsC551XGGcnhBBCiP9HArAokJnJ7fLl9PhNmQJs2QJ8+y3Qrx/wzTehzyvKPP008xU7d2Yxy5o1wMyZwMCBfPz880Dp0oV9lkIIIYSvkQAsCrRty75/N9
1Eb+BZZwF16lD8bdwI3H03n9e5c+GeZ35o3BiYO5dh3iuvZHVz377Ahg3Ap58CV11V2GcohBBC+B61gTHA0zLyiy8GvvgCOHKEhRI1a7J/3q5dnATSsSMwa5Yn511gbNkCbNoElCsHtGgRmtsohBBCFBJqA6Mq4KJDvXoUf1FRQK9eQLt2bAfz4YecpFEcCydq1+ZNCCGEEEUKeQAN8OwbxOHD9PhdfDFbqLzzjtMs+fLLuT56NPsCVqrk3QsQQgghfIg8gBKARnj2Afr+e6B3b2DZMqB58+zr//zDqSDvvw9ccol7OwCQkQFMnQqMHcu2LBUrcgTdJZcU/TYzQgghhAdIAKoIpGhw+DC3FSqEX7f7BNrPc8vRo5wscvbZ7Md38slAXBxw7bVAp07cJ4QQQojjHl8LwLp16yIQCGS7DR06tGBPxC6QyGnUm72/VSszOw8+yGN99RXw66/Ayy+zxczvv9PLOHiw2fGFEEIIUSzwdQj4n3/+QUZGxn+Ply9fjj59+mDWrFno0aNHnj/vqQt5wAD2AZw7F6hRw9m/dy97A8bHc1ScWw4e5HFvuAEYNSr7+scfAxddxHNo1sy9HSGEEKKIoxCwz6uAK1euHPJ41KhRaNCgAU455ZSCP5lXXwVOOglo2RK4+mqgdWtOBnnrLSAtDZgzx+z4S5cCycksNAnHOecwHPzDDxKAQgghxHGOrwVgMEePHsV7772H4cOHI1AY/epq1+YM3dGjgbffBv79FyhTBrjsMjaCrlfP7Pi2ozc6Ovx6VBTD0P51CAshhBC+QQLw//niiy+wb98+XHHFFTk+JzU1Fampqf89TklJ8fYkqlXjHN1nn2Xvv5IlKcy8oFUrCspPPw1faTx1KpCaysIQIYQQQhzX+LoIJJixY8eif//+SEpKyvE5I0eORGJi4n+3WrVqReZkAgHOy/VK/AFAQgJHsz37LPDjj6FrGzYAt91G8demjXc2hRBCCFEk8XURiM3mzZtRv359TJo0CWeddVaOzwvnAaxVq1bxSSI9dIjFJrNmAX36AB06AOvXA5MmMQQ9axYQKVErhBBCFBFUBCIPIABg3LhxqFKlCs4444xcnxcfH4+yZcuG3IoVpUqx7cuECQz3vv8+sHYtq4J/+03iTwghhPAJvs8BzMzMxLhx4zBkyBDExPjg1xEXx/Fyl19e2GcihBBCiELC9x7A7777Dlu2bMFVV11V2KcihBBCCFEg+MDllTt9+/aF0iCFEEII4Sd87wEUQgghhPAbEoBCCCGEED5DAlAIIYQQwmdIAAohhBBC+AwJQCGEEEIIn+H7KmARIbZuBcaPB9atA8qXBy66COjUiWPuhBBCCFGoyAMovGfUKKBuXWD0aArASZOALl2AgQOBAwcK++yEEEII3yMBKLxlwgRgxAjgnnuAv/4C5s4FNm0CPvsMmDMHuPrqwj5DIYQQwvdIABYljh7lfN7evYHGjYHu3YG33gIOHy7sM8sflgWMHAmcey7w1FNAQgL3R0UB550HvPAC8MknnD8shBBCiEJDArCocOAA0KcPcNllFFJnnEEBdf31QNeuwO7dhX2GebNqFbB6NXDddeHXL7kEKFUKmDKlYM9LCCGEECGoCKSoMHw4sHgx8NNPwEknOfv/+IMewauvBr78svDOLz8cOsRtpUrh10uUAMqUcZ4nhBBCiEJBHsCiwJ49wMSJwP33h4o/AGjZEnj6aeCrr4D16wvn/PLLCScAJUsC334bfn3pUmDXLqBFiwI9LSGEEEKEIgFYFFi4EEhNBS68MPz6oEEMC//0U8Ge17GSmAhcfDHw/PPZ8/wOHwbuuAOoWRMYMKBwzk8IIYQQABQCLhpYVu7rxal33qhRwLx5QPv2DFt37Qps3gy88QawbRswbRoQo4+dEEIIUZjIA1gU6NgRiI9nhWw4PvmEIvDkkwv2vNxQuTJbv9xwA/Duu8AFF7AtTNu2wC+/AKeeWthnKIQQQviegGXl5X4SOZ
GSkoLExEQkJyejbNmyZge77jrgo4+Ar78GunVz9i9bxiKQzp2LfhFIVjIygJQUoHRpIC6usM9GCCGEAODx9buYolh
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"def freq_vs_length(name, g, top=None):\n",
|
|
|
|
"    freq = freq_list(g)\n",
|
|
|
|
"\n",
|
|
|
|
"    plt.figure().clear()\n",
|
|
|
|
"    plt.scatter([len(x) for x in freq.keys()], [log(y) for y in freq.values()],\n",
|
|
|
|
"                facecolors='none', edgecolors='r')\n",
|
|
|
|
"\n",
|
|
|
|
"    fname = f'02_Jezyki/{name}.png'\n",
|
|
|
|
"\n",
|
|
|
|
"    plt.savefig(fname)\n",
|
|
|
|
"\n",
|
|
|
|
"    return fname\n",
|
|
|
|
"\n",
|
|
|
|
"freq_vs_length('pt-lengths', get_words(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
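A scatter plot shows the overall trend between length and frequency, but a per-length average can make it easier to read. The following is a rough sketch, assuming a list of word tokens; `mean_freq_by_length` and the toy token list are hypothetical names and data introduced only for this example.

```python
from collections import Counter, defaultdict


def mean_freq_by_length(tokens):
    """Map each word length to the mean count of the distinct words of that length."""
    counts = Counter(tokens)
    by_length = defaultdict(list)
    for word, count in counts.items():
        by_length[len(word)].append(count)
    return {
        length: sum(values) / len(values)
        for length, values in sorted(by_length.items())
    }


# Toy token list (illustrative only): short function words repeat, long words do not.
sample_tokens = "i w i na w i pan sędzia na i tadeusz w".split()
result = mean_freq_by_length(sample_tokens)
print(result)
```

In this toy sample the mean count falls as the words get longer, which is the pattern the law describes; real corpora are noisier, especially for rare lengths.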
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"### N-grams\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"In language modeling we often consider n-grams, i.e. contiguous subsequences of\n",
|
|
|
|
"length $n$.\n",
|
|
|
|
"\n",
|
|
|
|
"For example, *digrams* (*bigrams*) are sequences of two adjacent units, e.g. letters or words.\n",
|
|
|
|
"\n",
|
|
|
|
"| $n$|$n$-gram|name|\n",
|
|
|
|
"|---|---|---|\n",
|
|
|
|
"| 1|1-gram|unigram|\n",
|
|
|
|
"| 2|2-gram|digram/bigram|\n",
|
|
|
|
"| 3|3-gram|trigram|\n",
|
|
|
|
"| 4|4-gram|tetragram|\n",
|
|
|
|
"| 5|5-gram|pentagram|\n",
|
|
|
|
"\n",
|
|
|
|
"**Question:** What is a 6-gram called?\n",
|
|
|
|
"\n",
|
|
|
|
"As you can see, for symmetry we sometimes speak of unigrams when we operate\n",
|
|
|
|
"simply on individual units rather than on sequences of them.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### N-grams from Pan Tadeusz\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"The statistics that we computed for single letters or words can be repeated for n-grams.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"def ngrams(items, size):\n",
|
|
|
|
"    ngram = []\n",
|
|
|
|
"    for item in items:\n",
|
|
|
|
"        ngram.append(item)\n",
|
|
|
|
"        if len(ngram) == size:\n",
|
|
|
|
"            yield tuple(ngram)\n",
|
|
|
|
"            ngram = ngram[1:]  # slide the window forward by one item\n",
|
|
|
|
"\n",
|
|
|
|
"list(ngrams(\"kotek\", 3))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"Note that we generated all n-grams, including those that partially overlap.\n",
|
|
|
|
"\n",
|
|
|
|
"We should always make sure it is clear whether we mean character n-grams or word n-grams.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
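The difference between character and word n-grams comes down to what is fed into the generator: a string yields characters, a list of tokens yields words. The sketch below illustrates this with a variant based on `collections.deque`, which is a swapped-in technique rather than the notebook's own implementation; the name `sliding_ngrams` and the sample sentence are assumptions for illustration.

```python
from collections import deque


def sliding_ngrams(items, size):
    """Yield overlapping n-grams (as tuples) from any iterable."""
    # A deque with maxlen drops the oldest item automatically on append.
    window = deque(maxlen=size)
    for item in items:
        window.append(item)
        if len(window) == size:
            yield tuple(window)


text = "kto cię stracił ten się dowie"

# Iterating over a string gives characters, so these are character 3-grams.
char_trigrams = list(sliding_ngrams("kotek", 3))

# Splitting first gives words, so these are word 2-grams.
word_bigrams = list(sliding_ngrams(text.split(), 2))

print(char_trigrams)
print(word_bigrams)
```

The bounded deque avoids copying the window on every step, which may matter for large values of `size`; for the small n-grams used here either approach should behave the same.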
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Character 3-grams\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA7x0lEQVR4nO3deXhU5eH28XsySSb7kBUICSFsssmOyCoIoqhULdVqEXFtrbjQtFZwK6gYtWrtqwVFrba40VbFpehPXNj3VWTfE9YQlkxIyCSZmfePwCAFZMnyzMz5fq5rLjgnM/Q212meO895zjk2n8/nEwAAACwjzHQAAAAA1C0KIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFkMBBAAAsBgKIAAAgMVQAAEAACyGAggAAGAxFEAAAACLoQACAABYDAUQAADAYiiAAAAAFhNuOkAw83q92rVrl+Lj42Wz2UzHAQAAZ8Hn86m4uFjp6ekKC7PmXBgFsBp27dqlzMxM0zEAAMB5yM/PV0ZGhukYRlAAqyE+Pl5S1QGUkJBgOA0AADgbLpdLmZmZ/nHciiiA1XDstG9CQgIFEACAIGPl5VvWPPENAABgYRRAAAAAi6EAAgAAWAwFEAAAwGIogAAAABZDAQQAALAYCiAAAIDFUAABAAAshgIIAABgMRRAAAAAi6EAAgAAWAwFEAAAwGLCTQfAyb5es1cfr9ip7ORYNUmJVfbRV2JMhKUfXA0AAGoGBTAALc07qP9+v/uk/c7oiKpCmByj7JQ4NUmJUdOjf8ZHRRhICgAAghEFMAANbtdAybGR2lJYom2FJdpaWKLdRWUqOlKhlfmHtDL/0EmfSYlzKDslRk2SY5WdGqvso382SY5VVIS97v8jAABAwLL5fD6f6RDByuVyyel0qqioSAkJCbX6v3Wk3KPtB0q0dV+Jtu6v+nPb/hJtLSxV4WH3T3423RmlJilVp5ObpsT6S2JmYowiw1kGCgCwlrocvwOVZQtgZWWlxo4dq3fffVd79uxRw4YNdeutt+rRRx9VWNjZlaJAOYCKyyq0rbBUWwoPa1thqbbtL9GWwhJt3XdYrrLK034uzCZlJMb41xhm/6gkpteLlj2M9YYAgNATKOO3SZY9Bfzss8/q1Vdf1T/+8Q+1bdtWS5Ys0W233San06kHHnjAdLxzEh8VoQsznLoww3nCfp/Pp4OlFdr6o1PJP549LC33KO9AqfIOlGrmhn0nfDbSHqbGyTHq3Lie+rRIVa/mKUqKjazL/ywAAFBLLDsDePXVV6t+/fp68803/fuGDh2qmJgYTZ48+az+jWD+DcLn82lfsfuEdYZbC6uK4bb9pSqv9J7wfptNapfuVJ8WKerdIkVdshLlCGdtIQAg+ATz+F1TLDsD2Lt3b7366qvasGGDWrZsqZUrV2rOnDl66aWXTEerEzabTWkJUUpLiNLFTZNP+JrH69PuoiPauPew5m0u1OyNhVq3p1irdhZp1c4iTZixWdERdnVvmqQ+LVLVt0WKmqfFcYsaAACChGUL4EMPPaSioiK1atVKdrtdHo9H48eP10033XTaz7jdbrndxy+4cLlcdRG1ztnDbMpIjFFGYoz6t0qTJBW4yjRnU1
UZnL2xUIWH3Zqxfp9mrK86dVw/waE+LVLVp0WKejVPUUqcw+R/AgAA+AmWPQX8wQcf6MEHH9Sf//xntW3bVitWrNCoUaP04osvasSIEaf8zNixYzVu3LiT9lttCtnn82ndnmLN3rhPszcWatHWA3L/zynjtukJ6t0iRX1bpKpLViK3ogEABAxOAVu4AGZmZmr06NEaOXKkf99TTz2ld955R+vWrTvlZ041A5iZmWnpA0iSyio8WrLtoGZv3KdZGwu1dveJM6NREWG6KDtZrRvGyxFulyM8TJH2MEWGh1X9/djLv8/u/1pUhF1NU2IVxhXJAIAaQgG08Cng0tLSk273Yrfb5fV6T/MJyeFwyOHg1Ob/ioqwq/fRi0PGSNpX7NZc/+nifSoodmvWhn2a9T9XGp+trlmJevPWbnJG87QTAABqgmUL4JAhQzR+/Hg1btxYbdu21fLly/Xiiy/q9ttvNx0t6KXGO3Rtp0a6tlMj+Xw+bSw4rNkbC7Xr0BGVV3qrXh6v3JUelVd65f7xvoqqP4+970BpuZZsP6hfvb5Ak+/ozq1oAACoAZY9BVxcXKzHHntMH3/8sQoKCpSenq6bbrpJjz/+uCIjz65kMIVc+9budmn4mwtVeLhcLevH6Z07uistIcp0LABAEGP8tnABrAkcQHVjU8FhDXtjgfa63MpOidW7d3ZXer1o07EAAEGK8VviQbAIeM3T4vTv3/RURmK0thaW6PpX5ytvf6npWAAABC0KIIJC4+QY/es3PZSdEqudh47o+tfmaVPBYdOxAAAIShRABI30etGa8puL1bJ+nPa63Lpx0vyTbjkDAADOjAKIoJIWH6UPft1DbdMTVHi4XDdOWqDvdxwyHQsAgKBCAUTQSYqN1Ht3XaxOjeup6EiFhr2+UEu2HTAdCwCAoEEBRFByRkdo8h3d1T07ScXuSg1/c5HmbSo0HQsAgKDAbWCqgcvIzTtS7tGvJy/R7I2FirSHqXeLFHXJSlTXrER1yKzHM4gBACdh/KYAVgsHUGBwV3p033vL9dWavSfsj7Db1Dbdqa5ZieraJFFdspKUGs+j/ADA6hi/KYDVwgEUOHw+n77fUaTF2w5o6faDWrL9oPYVu096X/0Eh9Lio5Qa71BqnKPqz6OvpNhIRdhtkmwKs0lhNptsR/9snhbHbCIAhAjGbwpgtXAABS6fz6cdB49oyfYDWrLtoJZuP6j1e4t1vkd7ujNK/++mTuraJKlmgwIA6hzjNwWwWjiAgkvRkQpt31+ifcXu46/Dx/9+oLRcXq9PXp/k9fnk81UVyeKyShW7K2UPs+l3A1vot/2ayx5mM/2fAwA4T4zfUrjpAEBdcUZHqH1GvXP+3GF3pR79eJWmrtil57/aoPlb9usvN3RUWkJUzYcEAKAOcBsY4AziHOH6yy876vnrOyg6wq65m/Zr8F9n67t1BSotr1R5pVdMpAMAggmngKuBKWTr2VRwWPe+t0zr9hSf9LUwmxQdYVfnrERd0jJVfVumqkVanGw2ThcDQCBh/KYAVgsHkDWVVXiUO22t3l2Yp0rvT//fp6EzSle0a6D7Lm2hpNjIOkoIAPgpjN8UwGrhALK2Co9X5ZVeVXp8qvR6Ven16UBJueZuKtSsjYVauGW/3JVeSVXrD38/qKV+dVFjhdtZeQEAJjF+UwCrhQMIP6WswqM5Gwv1/Ffr/aeMWzWI19iftdXFTZMNpwMA62L8pgBWCwcQzkalx6v3F+Xp+a82qOhIhSRpSId0jRncSun1og2nAwDrYfymAFYLBxDOxcGScj3/1Xq9tyhPPl/VBSMj+zfTnX2a8pQRAKhDjN8UwGrhAML5+GFnkcZ9tlqLtx2UJGUkRuuBAS10XadGrA8EgDrA+E0BrBYOIJwvn8+nT1fuUu60ddrjKpMkNU2J1f0DWqhl/XiFhUl2m02ZSTHMDgJADWP8pgBWCwcQqutIuUeTF2zTxBmbdbC04qSvJ8ZE6Nae2RrRM0v1Yr
iNDADUBMZvCmC1cAChphx2V+qtOVv1n2U7VFrukc/nU1mFV4fdlZKk2Ei7rm6fru5Nk9StSZIa1YtWGM8jBoDzwvh
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"log_rang_log_freq('pt-3-char-ngrams-log-log', ngrams(get_characters(pan_tadeusz), 3))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"#### Word 2-grams\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAnlUlEQVR4nO3df3CV9Z3o8c9JAgloiAKSHxI0bL1IQRGCU7GCet3GhV133XJ33V/V2dY7y4xWMeOOojvTrbuV267rMK4Kw66ut/W669yLtXpLd+S2Cna1W8Wg1iLWygICKWA1AawJIef+gQmmRERJzvOc87xeM5lpDufoJxn1vPt9zvf75PL5fD4AAMiMsqQHAACgsAQgAEDGCEAAgIwRgAAAGSMAAQAyRgACAGSMAAQAyBgBCACQMQIQACBjBCAAQMYIQACAjBGAAAAZIwABADJGAAIAZIwABADIGAEIAJAxAhAAIGMEIABAxghAAICMEYAAABkjAAEAMkYAAgBkjAAEAMgYAQgAkDECEAAgYwQgAEDGCEAAgIwRgAAAGSMAAQAyRgACAGSMAAQAyBgBCACQMQIQACBjBCAAQMYIQACAjBGAAAAZIwABADJGAAIAZIwABADIGAEIAJAxAhAAIGMEIABAxghAAICMEYAAABkjAAEAMkYAAgBkjAAEAMgYAQgAkDECEAAgYwQgAEDGCEAAgIwRgAAAGSMAAQAyRgACAGSMAAQAyBgBCACQMRVJD1DMent7Y8eOHVFdXR25XC7pcQCAY5DP52Pv3r3R0NAQZWXZXAsTgMdhx44d0djYmPQYAMAnsG3btpg4cWLSYyRCAB6H6urqiDj0D9CYMWMSngYAOBadnZ3R2NjY/z6eRQLwOPRd9h0zZowABIAik+WPb2XzwjcAQIYJQACAjBGAAAAZIwABADJGAAIAZIwABADIGAEIAJAxAhAAIGMEIABAxghAAICMEYAAABkjAAEAMkYAptBTm3bF1f/zuXjilfakRwEASlBF0gNwpB+98cv4fxt3RUQuWqbVJT0OAFBirACm0MJZp0bEoZXAt/Z1JTwNAFBqBGAKnVFbHWdPrIme3nw89uKOpMcBAEqMAEyphbMmRkTEqhfeTHgSAKDUCMCUumxGQ4woz8VPtnfGpva9SY8DAJQQAZhSY08YGRdPmRARVgEBgKElAFNsYfOhy8DfbtsePQd7E54GACgVAjDFLp4yIU4ePSJ27+2KH76+J+lxAIASIQBTbGRFWfzujIaIiFj1wvaEpwEASkVJBODSpUvj3HPPjerq6pgwYUJcfvnlsWnTpo983dq1a6O5uTmqqqpi8uTJsWLFigJM+/H0XQZ+4pX26HzvQMLTAACloCQCcO3atXHNNdfEj370o1izZk309PRES0tL7N+//0Nfs3nz5liwYEHMnTs32tra4pZbbonrrrsuVq1aVcDJP9pZp9bEGRNOjK6e3lj90s6kxwEASkAun8/nkx5iqO3evTsmTJgQa9eujXnz5g36nJtuuikee+yx2LhxY/9jixYtihdffDGeffbZY/r7dHZ2Rk1NTXR0dMSYMWOGZPbBrFj78/gf33s1zj395Pjfi84ftr8PAGRBod6/06wkVgB/XUdHR0REjB079kOf8+yzz0ZLS8uAxy699NJ4/vnn48CBwS+1dnV1RWdn54CvQrj8nFOjLBfx3H++HVve+vBVTQCAY1FyAZjP56O1tTUuuOCCmD59+oc+r729PWprawc8VltbGz09PbFnz+A7bpcuXRo1NTX9X42NjUM6+4epq6mKz35qfEREPGIzCABwnEouAK+99tp46aWX4l/+5V8+8rm5XG7A931Xw3/98T5LliyJjo6O/q9t27Yd/8DH6L+9vxnkkbY3o7e35K7aAwAFVJH0AEPpy1/+cjz22GOxbt26mDhx4lGfW1dXF+3t7QMe27VrV1RUVMS4ceMGfU1lZWVUVlYO2bwfR8un6+LEyorY9s
tfxXP/+cv4zOTBZwQA+CglsQKYz+fj2muvjUceeSR+8IMfRFNT00e+Zs6cObFmzZoBjz3xxBMxe/bsGDFixHCN+omNGlkeC86qiwiXgQGA41MSAXjNNdfEgw8+GA899FBUV1dHe3t7tLe3x69+9av+5yxZsiSuvPLK/u8XLVoUW7ZsidbW1ti4cWPcf//9cd9998WNN96YxI9wTBbOOrSq+d2Xd8avug8mPA0AUKxKIgCXL18eHR0dcdFFF0V9fX3/18MPP9z/nJ07d8bWrVv7v29qaorVq1fHU089Feecc078zd/8Tdx1112xcOHCJH6EY3Lu6WOjceyo2NfVE0/8tP2jXwAAMIiSPAewUJI4R+jONa/FXd//Wcw9Y3x860ufKcjfEwBKiXMAS2QFMEsWzjo1IiL+/fU90d7xXsLTAADFSAAWmdPGnRDnnn5y9OYjHt1gMwgA8PEJwCL0+fc3g6xa/2a4gg8AfFwCsAj99tn1UVlRFj/btS9+sr0wt6MDAEqHACxCY6pGRMu0Q2cCrnrhzYSnAQCKjQAsUn2bQb6zYXt09/QmPA0AUEwEYJG64FPj45Tqynj73QPx5KZdSY8DABQRAVikKsrL4vdnHloFfMRlYADgYxCARazv1nA/eHVXvL2/O+FpAIBiIQCL2JS66pjWMCYOHMzHYy/uSHocAKBICMAi17cK6DIwAHCsBGCR+91zGqKiLBcvvtkRr+/am/Q4AEAREIBFbvyJlXHRlFMiIuL/rHdrOADgownAEtB3GfjbbW/GwV63hgMAjk4AloD/OnVC1IwaEb/o7Ipnfr4n6XEAgJQTgCWgsqI8fndGQ0RErFpvMwgAcHQCsER8/v1bw/3bK+2x970DCU8DAKSZACwR5zSeFJNPOSHeO9Ab33u5PelxAIAUE4AlIpfL9W8GWeVMQADgKARgCfn9madGLhfxH5t/Gdt++W7S4wAAKSUAS0jDSaPi/N8YFxERj7zgTEAAYHACsMT03xqu7c3I550JCAAcSQCWmN+aXhejR5bHlrfejfVb3k56HAAghQRgiRk9siLmT6+PCJtBAIDBCcAStLD50JmA//fFnfHegYMJTwMApI0ALEHnNY2LU08aFXu7emLNT3+R9DgAQMoIwBJUVpaL3595aBXQZWAA4NcJwBLVd2u4da/tjl2d7yU8DQCQJgKwRE0+5cSYNemk6M1HfGfDjqTHAQBSRACWsM9/4NZwzgQEAPoIwBJ22dkNMbKiLF5t3xuv7OhMehwAICUEYAmrGT0iPje1NiJsBgEADhOAJa7vTMDHNuyIAwd7E54GAEgDAVji5p5xSow/cWS8tb871m7anfQ4AEAKCMASN6K8LH7vHGcCAgCHCcAMWPj+buDvb9wV77zbnfA0AEDSBGAGfLphTJxZVx3dB3vj8Zd2Jj0OAJAwAZgR/635/TMB17sMDABZJwAz4vfOOTXKy3KxYds78fPd+5IeBwBIkADMiFOqK+PC/3JKREQ8YjMIAGSaAMyQz886tBv42y9sj95et4YDgKwSgBnym1NrY0xVRezoeC+efeOtpMcBABIiADOkakR5/M6MhohwJiAAZJkAzJiF718G/reftMf+rp6EpwEAkiAAM2bWpJOjafwJ8W73wfjeT9qTHgcASIAAzJhcLhefn/n+reGcCQgAmSQAM+jy9wPw2TfeijfffjfhaQCAQhOAGdQ4dnScN3lsREQ82rY94WkAgEITgBm1cNb7t4Z7YXvk884EBIAsEYAZNf+s+hg1ojw279kfbdveSXocAKCABGBGnVhZEb81vS4ibAYBgKwRgBnWdxn48Rd3xHsHDiY8DQBQKAIww+b8xrior6mKzvd64vsbdyU9DgBQIAIww8rLcv1Hwjzi1nAAkBkCMOP6LgM/9dru2L23K+FpAIBCqEh6AJL1qQknxozGk+LFbe/El/7nc3HKiZVJj9Qvl8vFH86eGC3T6pIeBQBKigAkrpjdGC9ueydeerMj6V
GOsHFnpwAEgCEmAIk/OrcxTh49IjrfO5D0KP0O9kbc+ujLsf2dX8WefV0xPkUrkwBQ7AQgUVaWi/ln1Sc9xhHu++E
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"log_rang_log_freq('pt-2-word-ngrams-log-log', ngrams(get_words(pan_tadeusz), 2))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"### The mysterious language of the Voynich Manuscript\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
|
"The [Voynich Manuscript](https://pl.wikipedia.org/wiki/Manuskrypt_Wojnicza) is a 15th-century manuscript written in\n",
|
|
|
|
"a mysterious script that remains undeciphered to this day. The manuscript is\n",
|
|
|
|
"one of the greatest puzzles of history (and linguistics).\n",
|
|
|
|
"\n",
|
2022-03-06 17:55:47 +01:00
|
|
|
"![Source](./02_Jezyki/voynich135.jpg)\n",
|
2022-03-06 17:51:23 +01:00
|
|
|
"\n",
|
|
|
|
"Let us examine the statistical properties of the manuscript's text ourselves. We will use\n",
|
|
|
|
"the Vnow transcription, in which the individual characters of the mysterious script\n",
|
|
|
|
"are replaced with Latin letters, digits and an asterisk. How to\n",
|
|
|
|
"transcribe the manuscript remains a matter of debate, but the choice\n",
|
|
|
|
"of one transcription system or another should not dramatically affect\n",
|
|
|
|
"the statistical analysis.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"9 OR 9FAM ZO8 QOAR9 Q*R 8ARAM 29 [O82*]OM OPCC9 OP"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"import requests\n",
|
|
|
|
"\n",
|
|
|
|
"voynich_url = 'http://www.voynich.net/reeds/gillogly/voynich.now'\n",
|
|
|
|
"voynich = requests.get(voynich_url).content.decode('utf-8')\n",
|
|
|
|
"\n",
|
|
|
|
"voynich = re.sub(r'\\{[^\\}]+\\}|^<[^>]+>|[-# ]+', '', voynich, flags=re.MULTILINE)\n",
|
|
|
|
"\n",
|
|
|
|
"voynich = voynich.replace('\\n\\n', '#')\n",
|
|
|
|
"voynich = voynich.replace('\\n', ' ')\n",
|
|
|
|
"voynich = voynich.replace('#', '\\n')\n",
|
|
|
|
"\n",
|
|
|
|
"voynich = voynich.replace('.', ' ')\n",
|
|
|
|
"\n",
|
|
|
|
"voynich[100:150]"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "iVBORw0KGgoAAAANSUhEUgAABLAAAAEsCAYAAADTvUpQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAABLY0lEQVR4nO3deVjU1f///8fEJipMoAKiuJSGu6aWovVGU0Fzac/SSHPLXEnNLMuld7mnZaaVuZBL+jHflmUiZmqa4oJZmuSSkKYQlQhiBgqv3x/9nK8jCDMjOCPdb9c11+Wc13nOeb5wFubJOedlMgzDEAAAAAAAAOCibnF2AgAAAAAAAEBhKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApbk7OwGUPnl5eTp9+rR8fHxkMpmcnQ4AAAAAoJQzDEPnzp1TcHCwbrmFuTqlEQUsFLvTp08rJCTE2WkAAAAAAP5lTp48qapVqzo7DZQAClgodj4+PpL+eePw9fV1cjYAAAAAgNIuMzNTISEhlu+jKH0oYKHYXV426OvrSwELAAAAAHDDsI1N6cXCUAAAAAAAALg0ClgAAAAAAABwaRSwAAAAAAAA4NIoYAEAAAAAAMClUcACAAAAAACAS6OABQAAAAAAAJdGAQsAAAAAAAAujQIWAAAAAAAAXJq7sxMAnKnGmHV29U+e0rmEMgEAAAAAANfCDCwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAcgGTJ0/WXXfdJR8fHwUEBOjBBx/U4cOHrfoYhqEJEyYoODhY3t7eatOmjX788UerPtnZ2Ro6dKgqVqyocuXKqVu3bvr111+t+qSnpysqKkpms1lms1lRUVE6e/asVZ8TJ06oa9euKleunCpWrKhhw4YpJyenRM4dAAAAAACgKBSwXMDWrVs1ePBgxcfHa+PGjbp06ZIiIiJ0/vx5S59p06Zp5syZmjNnjvbs2aOgoCB16NBB586ds/SJjo7WmjVrtGLFCm3fvl1ZWVnq0qWLcnNzLX169Oih/fv3KzY2VrGxsdq/f7+ioqIsx3Nzc9W5c2edP39e27dv14oVK7R69WqNHDnyxvwwAAAAAAAArmIyDMNwdhKw9vvvvysgIEBbt27Vf/7zHxmGoeDgYEVHR+vFF1+U9M9sq8DAQE2dOlXPPvusMjIyVKlSJS1ZskTdu3eXJJ0+fVohISH68ssvFRkZqcTERNWrV0/x8fFq0aKFJCk+Pl5hYWH66aefFBoaqvXr16tLly46efKkgoODJUkrVqxQ7969lZaWJl9f3yLzz8zMlNlsVkZGhk39nanGmHV29U+e0rmEMgEAAAAAOOpm+h4KxzADywVlZGRIkvz9/SVJSUlJSk1NVUREhKWPl5eXwsPDtWPHDklSQkKCLl68aNUnODhYDRo0sPTZuXOnzGazpXglSS1btpTZbLbq06BBA0vxSpIiIyOVnZ2thISEAvPNzs5WZmam1Q0AAAAAAKC4UMByMYZhaMSIEbrnnnvUoEEDSVJqaqokKTAw0KpvYGCg5Vhqaqo8PT3l5+dXaJ
+AgIB8YwYEBFj1uXocPz8/eXp6WvpcbfLkyZY9tcxms0JCQuw9bQAAAAAAgGuigOVihgwZoh9++EEff/xxvmMmk8nqvmEY+dqudnWfgvo70udKL730kjIyMiy3kydPFpoTAAAAAACAPShguZChQ4dq7dq12rx5s6pWrWppDwoKkqR8M6DS0tIss6WCgoKUk5Oj9PT0Qvv89ttv+cb9/fffrfpcPU56erouXryYb2bWZV5eXvL19bW6AQAAAAAAFBcKWC7AMAwNGTJE//vf//T111+rZs2aVsdr1qypoKAgbdy40dKWk5OjrVu3qlWrVpKkZs2aycPDw6pPSkqKDh48aOkTFhamjIwM7d6929Jn165dysjIsOpz8OBBpaSkWPrExcXJy8tLzZo1K/6TBwAAAAAAKIK7sxOANHjwYC1fvlyfffaZfHx8LDOgzGazvL29ZTKZFB0drUmTJql27dqqXbu2Jk2apLJly6pHjx6Wvn379tXIkSNVoUIF+fv7a9SoUWrYsKHat28vSapbt646duyo/v376/3335ckDRgwQF26dFFoaKgkKSIiQvXq1VNUVJSmT5+uM2fOaNSoUerfvz8zqwAAAAAAgFNQwHIB8+bNkyS1adPGqn3RokXq3bu3JGn06NG6cOGCBg0apPT0dLVo0UJxcXHy8fGx9J81a5bc3d31+OOP68KFC2rXrp0WL14sNzc3S59ly5Zp2LBhlqsVduvWTXPmzLEcd3Nz07p16zRo0CC1bt1a3t7e6tGjh2bMmFFCZw8AAAAAAFA4k2EYhrOTQOmSmZkps9msjIwMl5+1VWPMOrv6J0/pXEKZAAAAAAAcdTN9D4Vj2AMLAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACXRgELAAAAAAAALo0CFgAAAAAAAFwaBSwAAAAAAAC4NApYAAAAAAAAcGkUsAAAAAAAAODSKGABAAAAAADApVHAAgAAAAAAgEujgAUAAAAAAACX5u7sBG5mZ8+e1e7du5WWlqa8vDyrY08//bSTsgIAAAAAAChdKGA56PPPP1fPnj11/vx5+fj4yGQyWY6ZTCYKWAAAAAAAAMWEJYQ2+r//+z/98ssvlvsjR45Unz59dO7cOZ09e1bp6emW25kzZ5yYKQAAAAAAQOlCActG5cuXV9u2bbV3715J0qlTpzRs2DCVLVvWyZkBAAAAAACUbhSwbHT//ffrs88+03PPPSdJioyMtBSzAAAAAAAAUHLYA8sODRs21DfffCNJ6ty5s1544QUdOnRIDRs2lIeHh1Xfbt26OSNFAAAAAACAUocClp28vb0lSf3795ckvfbaa/n6mEwm5ebm3tC8AAAAAAAASisKWA7Ky8tzdgoAAAAAAAD/CuyBVQz+/vtvZ6cAAAAAAABQalHAclBubq7++9//qkqVKipfvryOHz8uSXr11Ve1YMECJ2cHAAAAAABQelDActAbb7yhxYsXa9q0afL09LS0N2zYUB9++KFdj/XNN9+oa9euCg4Olslk0qeffmp1vHfv3jKZTFa3li1bWvXJzs7W0KFDVbFiRZUrV07dunXTr7/+atUnPT1dUVFRMpvNMpvNioqK0tmzZ636nDhxQl27dlW5cuVUsWJFDRs2TDk5OXadDwAAAAAAQHGigOWgjz76SB988IF69uwpNzc3S3ujRo30008/2fVY58+fV+PGjTVnzpxr9unYsa
NSUlIsty+//NLqeHR0tNasWaMVK1Zo+/btysrKUpcuXaw2k+/Ro4f279+v2NhYxcbGav/+/YqKirIcz83NVefOnXX
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"rang_freq_with_labels('voy-chars', get_characters(voynich))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"log_rang_log_freq('voy-log-log', get_words(voynich))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"rang_freq_with_labels('voy-words-20', get_words(voynich), top=20)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"log_rang_log_freq('voy-words-log-log', get_words(voynich))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"### The language of DNA\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"The genetic code has properties surprisingly similar to those of natural languages.\n",
"Above all, it is discrete: a genotype is a sequence of symbols drawn from a finite alphabet.\n",
"There are only four basic letters, which stand for the nucleotides that make up a DNA strand:\n",
"a, g, c, t.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
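As a minimal sketch of the point above (a genotype as a string over a finite four-letter alphabet), the snippet below checks that a toy sequence uses only those letters and counts how often each one occurs. The names `ALPHABET` and `is_over_alphabet`, as well as the sample string, are illustrative assumptions and not part of the lecture code.

```python
from collections import Counter

# Assumed alphabet: the four nucleotide letters, written in upper case.
ALPHABET = frozenset("ACGT")


def is_over_alphabet(sequence, alphabet=ALPHABET):
    """Return True if every symbol of the sequence belongs to the alphabet."""
    return set(sequence).issubset(alphabet)


# A made-up toy sequence, used only for illustration.
sample = "TATAACCCTAACCC"

print(is_over_alphabet(sample))       # every symbol is one of A, C, G, T
print(is_over_alphabet("TATANCC"))    # 'N' is outside the alphabet
print(Counter(sample).most_common())  # symbol frequencies, most frequent first
```

Real genome files may contain other symbols (for example `N` for an unknown nucleotide), so a check like this is a cheap sanity test before any counting.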
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"TATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
"import requests\n",
"\n",
"dna_url = 'https://raw.githubusercontent.com/egreen18/NanO_GEM/master/rawGenome.txt'\n",
"dna = requests.get(dna_url).content.decode('utf-8')\n",
"\n",
"# Drop the first line (presumably a header) and join the rest into one string.\n",
"dna = ''.join(dna.split('\\n')[1:])\n",
"# 'N' marks an unknown nucleotide; it is replaced with 'A' here for simplicity.\n",
"dna = dna.replace('N', 'A')\n",
"\n",
"dna[0:100]"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"rang_freq_with_labels('dna-chars', get_characters(dna))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"#### Triplets: the meaningful units of the genotype\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"Nucleotides really are like letters: on their own they carry no\n",
"meaning. Only sequences of three nucleotides, *triplets*, encode one\n",
"of the twenty amino acids (or a STOP signal).\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
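A quick back-of-the-envelope check of why three letters is a natural word length: with four letters there are 4^3 = 64 possible triplets, enough to cover twenty amino acids, whereas pairs would give only 16. The sketch below simply enumerates them; `NUCLEOTIDES` and `all_words` are hypothetical names introduced for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def all_words(length, alphabet=NUCLEOTIDES):
    """Return every string of the given length over the alphabet."""
    return ["".join(letters) for letters in product(alphabet, repeat=length)]


pairs = all_words(2)
triplets = all_words(3)

print(len(pairs))     # 4 ** 2 = 16: too few for twenty amino acids
print(len(triplets))  # 4 ** 3 = 64: enough, with several triplets per amino acid
print(triplets[:4])   # the first few triplets in lexicographic order
```

Because 64 is well above 20, several different triplets can stand for the same amino acid.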
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"genetic_code = {\n",
|
|
|
|
" 'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',\n",
|
|
|
|
" 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',\n",
|
|
|
|
" 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',\n",
|
|
|
|
" 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',\n",
|
|
|
|
" 'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',\n",
|
|
|
|
" 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',\n",
|
|
|
|
" 'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',\n",
|
|
|
|
" 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',\n",
|
|
|
|
" 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',\n",
|
|
|
|
" 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',\n",
|
|
|
|
" 'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',\n",
|
|
|
|
" 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',\n",
|
|
|
|
" 'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',\n",
|
|
|
|
" 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',\n",
|
|
|
|
" 'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',\n",
|
|
|
|
" 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',\n",
|
|
|
|
" }\n",
|
|
|
|
"\n",
|
|
|
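"def get_triplets(t):\n",
"    # Walk through consecutive, non-overlapping triplets and map each one\n",
"    # to its amino-acid letter ('_' marks a STOP triplet).\n",
"    for triplet in re.finditer(r'.{3}', t):\n",
"        yield genetic_code[triplet.group(0)]\n",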
|
|
|
|
"\n",
|
|
|
|
"rang_freq_with_labels('dna-aminos', get_triplets(dna))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"#### “Sentences” in the language of DNA\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"Proteins are built from the amino acids encoded by the triplets.\n",
"The protein-building machinery reads the sequence until it encounters\n",
"a STOP triplet (\\_ above). Such a sequence is a *gene*.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
|
|
|
"image/png": "(base64-encoded PNG plot omitted)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
"def get_genes(triplets):\n",
"    # Group amino acids into genes, cutting at each STOP marker ('_').\n",
"    gene = []\n",
"    for amino in triplets:\n",
"        if amino == '_':\n",
"            yield gene\n",
"            gene = []\n",
"        else:\n",
"            gene.append(amino)\n",
|
|
|
|
"\n",
|
|
|
|
"plt.figure().clear()\n",
|
|
|
|
"plt.hist([len(g) for g in get_genes(get_triplets(dna))], bins=100)\n",
|
|
|
|
"\n",
|
|
|
|
"fname = '02_Jezyki/dna_length.png'\n",
|
|
|
|
"\n",
|
|
|
|
"plt.savefig(fname)\n",
|
|
|
|
"\n",
|
|
|
|
"fname"
|
|
|
|
]
|
|
|
|
},
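To see what the gene-splitting generator does, here is a self-contained toy run. It repeats the same splitting logic under a hypothetical name, `split_at_stop`, and feeds it a hand-made amino-acid string (an assumption for illustration, not real data). Note two consequences of this logic: anything after the last STOP marker is never emitted, and two adjacent STOP markers produce an empty gene.

```python
def split_at_stop(amino_acids, stop="_"):
    """Yield lists of amino acids, cutting the stream at every STOP marker."""
    gene = []
    for amino in amino_acids:
        if amino == stop:
            yield gene
            gene = []
        else:
            gene.append(amino)


# Hand-made input: three STOP markers, two of them adjacent, and a trailing 'K'.
toy = "MKV_LP__K"

genes = list(split_at_stop(toy))
print(genes)                    # [['M', 'K', 'V'], ['L', 'P'], []]
print([len(g) for g in genes])  # [3, 2, 0]
```

Zero-length genes of this kind would land in the first bin of a length histogram, which is worth keeping in mind when reading such a plot.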
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"### Entropy\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"**Entropy** ($E$) is a measure of disorder, uncertainty, and ignorance. The\n",
"greater the entropy, the less we know. The concept originated in\n",
"thermodynamics; many surprising analogies and applications were later found in\n",
"other scientific disciplines.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"#### Entropy in physics\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"In thermodynamics, entropy is a measure of the disorder of physical\n",
"systems, such as containers of gas. For example, imagine\n",
"two containers of gas at different temperatures.\n",
"\n",
"![img](./02_Jezyki/gas-low-entropy.drawio.png)\n",
"\n",
"If we remove the partition between the containers, the temperature evens out\n",
"and the order decreases.\n",
"\n",
"![img](./02_Jezyki/gas-high-entropy.drawio.png)\n",
"\n",
"In other words, the degree of disorder of the system, which is exactly its entropy, increases.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"#### The second law of thermodynamics\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"One of the most fundamental laws of physics, the second law of\n",
"thermodynamics, states that the entropy of an isolated system does not decrease.\n",
"\n",
"**Question**: By writing these lecture notes I have\n",
"*organized* knowledge about the statistical properties of language. Does that\n",
"contradict the second law of thermodynamics?\n",
"\n",
"A consequence of the second law of thermodynamics is the heat death of the Universe\n",
"(see [a visualization of the future of the Universe](https://www.youtube.com/watch?v=uD4izuDMUQA)).\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"#### Entropy in information theory\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
|
|
|
"The concept of entropy was “rediscovered” by Claude Shannon\n",
"when he developed a general theory of information.\n",
"\n",
"Among other things, information theory deals with the problem of encoding messages optimally.\n",
"\n",
"Imagine a source (generator) of random messages drawn from a\n",
"closed set of symbols ($\\Sigma$; it is no accident that we reuse the notation\n",
"from the previous lecture). A sender $N$ wants to transmit a message about the outcome\n",
"of the draw to a receiver $O$ using zeros and ones (bits).\n",
"Information-theoretic entropy can be defined as the average number of\n",
"bits required to transmit a message under the best possible encoding.\n",
"\n",
"![img](./02_Jezyki/communication.drawio.png)\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### Computing entropy: simple examples\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"Suppose the sender wants to tell the receiver the outcome of a fair coin toss.\n",
"The entropy is then, of course, 1 bit: one bit per toss suffices\n",
"(we can encode heads as zero, for example,\n",
"and tails as one).\n",
"\n",
"Now consider a sender who rolls a fair eight-sided die. To transmit\n",
"the outcome, 3 bits are needed (so the entropy of an eight-sided die\n",
"is 3 bits). One possible encoding looks like this:\n",
"\n",
"| Outcome | Encoding |\n",
"|---|---|\n",
"| 1 | 001 |\n",
"| 2 | 010 |\n",
"| 3 | 011 |\n",
"| 4 | 100 |\n",
"| 5 | 101 |\n",
"| 6 | 110 |\n",
"| 7 | 111 |\n",
"| 8 | 000 |\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### Computing entropy: a harder example\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"Suppose that $\\Sigma = \\{A, B, C, D\\}$ and that messages\n",
"are drawn according to the following probability distribution:\n",
"$P(A)=1/2$, $P(B)=1/4$, $P(C)=1/8$, $P(D)=1/8$. What is the entropy in\n",
"this case? One might think it is 2, since 2 bits are enough to\n",
"transmit the outcome of a draw with the following encoding:\n",
"\n",
"| Outcome | Encoding |\n",
"|---|---|\n",
"| A | 00 |\n",
"| B | 01 |\n",
"| C | 10 |\n",
"| D | 11 |\n",
"\n",
"The problem is that this encoding is not actually *optimal*.\n",
"We can reduce the average number of bits needed to\n",
"transmit a random outcome by assigning shorter codes to more frequent outcomes\n",
"and longer codes to rarer ones. Here is such an optimal encoding:\n",
"\n",
"| Outcome | Encoding |\n",
"|---|---|\n",
"| A | 0 |\n",
"| B | 10 |\n",
"| C | 110 |\n",
"| D | 111 |\n",
"\n",
"With this encoding, the average number of bits per message is\n",
"\n",
"$$\\frac{1}{2}\\cdot 1 + \\frac{1}{4}\\cdot 2 + \\frac{1}{8}\\cdot 3 + \\frac{1}{8}\\cdot 3 = 1.75.$$\n",
"\n",
"In other words, the entropy of this source is 1.75 bits.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### The encoding must be unambiguous!\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"One might think that an even shorter encoding could be devised for this non-uniform distribution:\n",
"\n",
"| Outcome | Encoding |\n",
"|---|---|\n",
"| A | 0 |\n",
"| B | 1 |\n",
"| C | 01 |\n",
"| D | 11 |\n",
"\n",
"Unfortunately, this does not work: an encoding must be\n",
"unambiguous not only for a single message but for a whole sequence of messages.\n",
"For example, the bit string 0111 is ambiguous under this encoding (is it ABBB or CD?).\n",
"The encoding given earlier does satisfy this requirement: no code word is a prefix\n",
"of another, so the string 0111 can only be decoded as AD.\n",
|
|
|
|
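"\n",
"As a supplementary check (not part of the original argument), the Kraft-McMillan inequality gives a quick test: a uniquely decodable binary code with code word lengths $\\ell_1,\\ldots,\\ell_n$ must satisfy $\\sum_i 2^{-\\ell_i} \\le 1$. The symbols $\\ell_i$ are notation introduced here only for this check.\n",
"\n",
"$$2^{-1}+2^{-2}+2^{-3}+2^{-3} = 1 \\quad\\text{(code words 0, 10, 110, 111)}$$\n",
"\n",
"$$2^{-1}+2^{-1}+2^{-2}+2^{-2} = \\frac{3}{2} \\quad\\text{(code words 0, 1, 01, 11)}$$\n",
"\n",
"The second sum exceeds 1, so no decoder can recover every message sequence from the shorter code. The first code meets the bound with equality, which suggests that none of its code words can be shortened without losing unique decodability.\n",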
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### The general formula for entropy\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"The previous example suggests, intuitively, that\n",
"the optimal code for an outcome with probability $p$ has length $-\\log_2(p)$, and hence that, in general,\n",
"the entropy of a source with probability distribution $\\{p_1,\\ldots,p_{|\\Sigma|}\\}$ is\n",
"\n",
"$$E = -\\sum_{i=1}^{|\\Sigma|} p_i\\log_2(p_i).$$\n",
"\n",
"Note that this is one of the few cases in science where the natural\n",
"base of the logarithm is 2 rather than… the base of the natural logarithm ($e$).\n",
"\n",
"In principle, entropy can also be measured using the natural logarithm\n",
"($\\ln$); the unit of entropy is then the **nat** instead of the bit.\n",
"This changes little, however, and is less convenient and harder to interpret\n",
"(at least in the context of computer science) than working with bits.\n",
|
|
|
|
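"\n",
"As a quick sanity check (added here as an illustration), plugging the distribution $P(A)=1/2$, $P(B)=1/4$, $P(C)=1/8$, $P(D)=1/8$ from the earlier example into the formula reproduces the average length of the optimal encoding:\n",
"\n",
"$$E = -\\left(\\frac{1}{2}\\log_2\\frac{1}{2} + \\frac{1}{4}\\log_2\\frac{1}{4} + \\frac{1}{8}\\log_2\\frac{1}{8} + \\frac{1}{8}\\log_2\\frac{1}{8}\\right) = \\frac{1}{2}\\cdot 1 + \\frac{1}{4}\\cdot 2 + \\frac{1}{8}\\cdot 3 + \\frac{1}{8}\\cdot 3 = 1.75.$$\n",
"\n",
"Switching units only rescales the result: since $\\ln x = \\ln 2 \\cdot \\log_2 x$, the same source has an entropy of $1.75 \\cdot \\ln 2 \\approx 1.21$ nats.\n",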
"\n",
"**Question** What is the entropy of an ordinary six-sided die? What does\n",
"an optimal encoding of the outcomes of rolling such a die look like?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### Entropy of a Bernoulli trial\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"We already know that the entropy of a fair coin toss is 1 bit. What happens when the coin is biased?\n",
|
|
|
|
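"\n",
"For orientation, the general entropy formula specialized to two outcomes reads as follows, where $p$ denotes the probability of heads (a symbol introduced here for illustration):\n",
"\n",
"$$E(p) = -p\\log_2(p) - (1-p)\\log_2(1-p).$$\n",
"\n",
"For example, a coin that lands heads 90% of the time has $E(0.9) = -0.9\\log_2(0.9) - 0.1\\log_2(0.1) \\approx 0.137 + 0.332 \\approx 0.47$ bits, less than half the entropy of a fair coin.\n",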
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"data": {
"image/png": "(base64-encoded PNG data omitted: plot of entropy against the probability of heads)",
|
|
|
|
"text/plain": [
|
|
|
|
"<matplotlib.figure.Figure>"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
"metadata": {},
|
|
|
|
"output_type": "display_data"
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log2\n",
"import numpy as np\n",
"\n",
"def binomial_entropy(p):\n",
"    # Entropy (in bits) of a single Bernoulli trial with success probability p, 0 < p < 1.\n",
"    return -(p * log2(p) + (1 - p) * log2(1 - p))\n",
"\n",
"# Stay strictly inside (0, 1): log2(0) is undefined.\n",
"x = list(np.arange(0.001, 1, 0.001))\n",
"y = [binomial_entropy(p) for p in x]\n",
"plt.figure().clear()\n",
"plt.xlabel('probability of heads')\n",
"plt.ylabel('entropy [bits]')\n",
"plt.plot(x, y)\n",
"\n",
"fname = '02_Jezyki/binomial-entropy.png'\n",
"\n",
"plt.savefig(fname)\n",
"\n",
"fname"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"**Question** For a rigged coin (e.g. one that always lands heads) the entropy\n",
"is 0. Does this agree with your intuition?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"### Entropy and language\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"A text in a given language can be treated as a sequence of symbols (messages) drawn according to some\n",
"probability distribution. In this sense we can speak of the entropy of a language.\n",
"\n",
"Of course, as always, we have to state clearly what the symbols of the\n",
"language are: letters, words, or some other units.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### Measuring the entropy of a language: a first approximation\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"Suppose we want to measure the entropy of Polish at the character level,\n",
"using “Pan Tadeusz” as a sample. As a first approximation we could\n",
"count the number of distinct characters…\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"95"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"chars_in_pan_tadeusz = len(set(get_characters(pan_tadeusz)))\n",
|
|
|
|
"chars_in_pan_tadeusz"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"… assume a uniform probability distribution over them, and compute the entropy on that basis:\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"6.569855608330948"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"from math import log\n",
|
|
|
|
"\n",
|
|
|
|
"95 * (1/95) * log(95, 2)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### A less wasteful encoding\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"Recall, however, that the distribution of linguistic units is always\n",
"extremely non-uniform! Taking this non-uniform distribution of characters\n",
"into account, we can devise a much more efficient way of encoding the characters that make up “Pan Tadeusz”\n",
"(frequent letters such as “a” and “e” should get short codes, and rare ones such as “ź” longer codes).\n",
"\n",
"Let us compute the entropy under this assumption:\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"4.938605272823633"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
"from collections import Counter\n",
"from math import log\n",
"\n",
"def unigram_entropy(t):\n",
"    # Count how often each unit (here: each character) occurs.\n",
"    counter = Counter(t)\n",
"\n",
"    # Counter.total() requires Python 3.10 or newer.\n",
"    total = counter.total()\n",
"    # Entropy in bits of the empirical (relative-frequency) distribution.\n",
"    return -sum((p := count / total) * log(p, 2) for count in counter.values())\n",
"\n",
"unigram_entropy(get_characters(pan_tadeusz))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"(As we will see in the next lecture, what we used here is a **unigram language model**.)\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### What is the entropy of the Voynich manuscript?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"3.902708104423842"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"unigram_entropy(get_characters(voynich))"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### The true entropy?\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"In reality the entropy is lower still, since text is not\n",
"generated by drawing characters independently from a fixed (multinomial) distribution. There are,\n",
"of course, dependencies between characters; e.g. in Polish, “ń”\n",
"cannot be followed by “a” or “e”. At the word level the dependencies can be\n",
"even more extreme: after the word “przede”, the word “wszystkim” is almost\n",
"certain to follow (“przede wszystkim” means “above all”), so in that context\n",
"“wszystkim” can be encoded using close to 0 (!) bits.\n",
"\n",
"Such dependencies can be taken into account to obtain an even better encoding\n",
"and, consequently, a better estimate of the entropy. (As we will soon\n",
"see, this amounts to using a bigram, trigram, etc. language model.)\n",
|
|
|
|
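"\n",
"One way to make this precise, sketched here with notation that is not in the original ($P(a, b)$ for the probability that symbol $a$ is immediately followed by symbol $b$, and $P(b \\mid a)$ for the probability of $b$ given the preceding $a$), is the conditional entropy of a symbol given its predecessor:\n",
"\n",
"$$E_{\\mathrm{cond}} = -\\sum_{a \\in \\Sigma}\\sum_{b \\in \\Sigma} P(a, b)\\log_2 P(b \\mid a),$$\n",
"\n",
"where terms with $P(a, b) = 0$ are taken to be 0. Conditioning cannot increase entropy, so this quantity is at most the entropy of the symbol distribution on its own. A pair such as “przede wszystkim”, for which $P(b \\mid a)$ is close to 1, contributes almost nothing to the sum because $\\log_2 P(b \\mid a)$ is then close to 0.\n",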
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### Compressed file size as an approximation of entropy\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"The purpose of compression algorithms is, in essence, to find efficient\n",
"ways of encoding data. We can therefore use the size of the compressed file in bits,\n",
"divided by the length of the original text in characters, as a good approximation of the entropy per character.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"3.673019884633768"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
"import zlib\n",
"\n",
"def entropy_by_compression(t):\n",
"    # Compressed size in bits divided by the number of characters (not bytes) in t.\n",
"    compressed = zlib.compress(t.encode('utf-8'))\n",
"    return 8 * len(compressed) / len(t)\n",
"\n",
"entropy_by_compression(pan_tadeusz)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"For comparison, the result for the Voynich manuscript:\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "code",
|
|
|
|
"execution_count": 1,
|
|
|
|
"metadata": {},
|
|
|
|
"outputs": [
|
|
|
|
{
|
|
|
|
"name": "stdout",
|
|
|
|
"output_type": "stream",
|
|
|
|
"text": [
|
|
|
|
"2.942372881355932"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"source": [
|
|
|
|
"entropy_by_compression(voynich)"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"#### The Shannon game\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
},
|
|
|
|
{
|
|
|
|
"cell_type": "markdown",
|
|
|
|
"metadata": {},
|
|
|
|
"source": [
"Another way to estimate the entropy of a text is to use… people. We can ask native speakers\n",
"of a given language to predict successive letters (or words) and estimate the entropy from how well they do.\n",
"\n",
"**Project** Implement a web application for “playing” the Shannon game.\n",
|
|
|
|
"\n"
|
|
|
|
]
|
|
|
|
}
|
|
|
|
],
|
|
|
|
"metadata": {
|
|
|
|
"kernelspec": {
|
|
|
|
"display_name": "Python 3 (ipykernel)",
|
|
|
|
"language": "python",
|
|
|
|
"name": "python3"
|
|
|
|
},
|
|
|
|
"language_info": {
|
|
|
|
"codemirror_mode": {
|
|
|
|
"name": "ipython",
|
|
|
|
"version": 3
|
|
|
|
},
|
|
|
|
"file_extension": ".py",
|
|
|
|
"mimetype": "text/x-python",
|
|
|
|
"name": "python",
|
|
|
|
"nbconvert_exporter": "python",
|
|
|
|
"pygments_lexer": "ipython3",
|
|
|
|
"version": "3.10.2"
|
|
|
|
},
|
|
|
|
"org": null
|
|
|
|
},
|
|
|
|
"nbformat": 4,
|
|
|
|
"nbformat_minor": 1
|
|
|
|
}
|