aitech-moj-2023/wyk/02_Jezyki.ipynb

2022-03-06 17:51:23 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Language Modeling</h1>\n",
"<h2> 2. <i>Languages</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Languages and their laws\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What statistical distributions do languages follow?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Natural language, or “Pan Tadeusz” in numbers\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's set up some “infrastructure” for *segmenting* text into various kinds of units.\n",
"We will use generators.\n",
"\n",
"**Question** Why generators rather than lists?\n",
"\n"
]
},
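{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one possible answer to the question above: a generator produces its items lazily, so only a small generator object lives in memory, whereas a list stores every item at once. The sample string is a stand-in chosen for this illustration (not the text of the poem), and the exact sizes reported by `sys.getsizeof` depend on the Python build.\n",
"\n",
"```python\n",
"import sys\n",
"\n",
"def get_characters(t):\n",
"    # Yield one character at a time instead of building a list.\n",
"    yield from t\n",
"\n",
"text = 'Litwo! Ojczyzno moja!' * 1000  # stand-in text, assumed for this sketch\n",
"\n",
"as_list = list(get_characters(text))  # materializes every character at once\n",
"as_gen = get_characters(text)         # only a small generator object so far\n",
"\n",
"# The list grows with the input; the generator object does not.\n",
"print(sys.getsizeof(as_list), sys.getsizeof(as_gen))\n",
"print(next(as_gen))  # items are produced on demand\n",
"```\n",
"\n",
"For a single book the saving is modest, but the same segmentation functions can then be applied to inputs that would not comfortably fit in memory as a list.\n"
]
},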
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Księga pierwsza\\r\\n\\r\\n\\r\\n\\r\\nGospodarstwo\\r\\n\\r\\nPowrót pani'"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import requests\n",
"\n",
"url = 'https://wolnelektury.pl/media/book/txt/pan-tadeusz.txt'\n",
"pan_tadeusz = requests.get(url).content.decode('utf-8')\n",
"\n",
"pan_tadeusz[100:150]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Characters\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['K',\n",
" 's',\n",
" 'i',\n",
" 'ę',\n",
" 'g',\n",
" 'a',\n",
" ' ',\n",
" 'p',\n",
" 'i',\n",
" 'e',\n",
" 'r',\n",
" 'w',\n",
" 's',\n",
" 'z',\n",
" 'a',\n",
" '\\r',\n",
" '\\n',\n",
" '\\r',\n",
" '\\n',\n",
" '\\r',\n",
" '\\n',\n",
" '\\r',\n",
" '\\n',\n",
" 'G',\n",
" 'o',\n",
" 's',\n",
" 'p',\n",
" 'o',\n",
" 'd',\n",
" 'a',\n",
" 'r',\n",
" 's',\n",
" 't',\n",
" 'w',\n",
" 'o',\n",
" '\\r',\n",
" '\\n',\n",
" '\\r',\n",
" '\\n',\n",
" 'P',\n",
" 'o',\n",
" 'w',\n",
" 'r',\n",
" 'ó',\n",
" 't',\n",
" ' ',\n",
" 'p',\n",
" 'a',\n",
" 'n',\n",
" 'i']"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from itertools import islice\n",
"\n",
"def get_characters(t):\n",
" yield from t\n",
"\n",
"list(islice(get_characters(pan_tadeusz), 100, 150))"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Counter({'A': 698,\n",
" 'd': 11465,\n",
" 'a': 30979,\n",
" 'm': 10269,\n",
" ' ': 63444,\n",
" 'M': 585,\n",
" 'i': 29353,\n",
" 'c': 14153,\n",
" 'k': 12362,\n",
" 'e': 25343,\n",
" 'w': 14625,\n",
" 'z': 22741,\n",
" '\\r': 10851,\n",
" '\\n': 10851,\n",
" 'P': 1265,\n",
" 'n': 15505,\n",
" 'T': 971,\n",
" 'u': 7699,\n",
" 's': 15255,\n",
" 'y': 13732,\n",
" 'l': 6677,\n",
" 'o': 23050,\n",
" 't': 10757,\n",
" 'j': 6586,\n",
" 'L': 316,\n",
" 'I': 795,\n",
" 'S': 1045,\n",
" 'B': 567,\n",
" 'N': 793,\n",
" '9': 8,\n",
" '7': 2,\n",
" '8': 10,\n",
" '-': 33,\n",
" '3': 3,\n",
" '2': 6,\n",
" '4': 2,\n",
" '5': 2,\n",
" 'K': 683,\n",
" 'ę': 5534,\n",
" 'g': 4775,\n",
" 'p': 8031,\n",
" 'r': 15328,\n",
" 'G': 358,\n",
" 'ó': 3097,\n",
" '—': 720,\n",
" ',': 9130,\n",
" 'ł': 10059,\n",
" 'W': 1258,\n",
" 'ż': 3334,\n",
" 'ś': 2524,\n",
" 'ą': 4794,\n",
" 'Ż': 219,\n",
" 'O': 567,\n",
" 'ź': 414,\n",
" 'b': 5753,\n",
" 'R': 489,\n",
" 'E': 23,\n",
" '!': 1083,\n",
" ':': 1152,\n",
" 'ć': 1956,\n",
" '.': 2380,\n",
" 'D': 552,\n",
" 'J': 729,\n",
" 'C': 556,\n",
" 'h': 3915,\n",
" '(': 76,\n",
" 'f': 386,\n",
" ';': 1445,\n",
" 'ń': 651,\n",
" ')': 76,\n",
" 'Z': 785,\n",
" 'Ś': 71,\n",
" 'U': 184,\n",
" 'F': 47,\n",
" 'é': 43,\n",
" '?': 441,\n",
" '…': 157,\n",
" '«': 540,\n",
" 'H': 309,\n",
" '»': 538,\n",
" 'Ó': 13,\n",
" 'Ł': 24,\n",
" 'x': 3,\n",
" 'v': 5,\n",
" '*': 150,\n",
" 'à': 1,\n",
" 'Ź': 4,\n",
" 'V': 3,\n",
" '/': 19,\n",
" 'Ć': 1,\n",
" 'q': 2,\n",
" '1': 4,\n",
" 'æ': 2,\n",
" '6': 1,\n",
" '0': 1})"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter\n",
"\n",
"c = Counter(get_characters(pan_tadeusz))\n",
"\n",
"c"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's write a helper function that returns a **frequency list**.\n",
"\n",
"Counter({' ': 63444, 'a': 30979, 'i': 29353, 'e': 25343, 'o': 23050, 'z': 22741, 'n': 15505, 'r': 15328, 's': 15255, 'w': 14625, 'c': 14153, 'y': 13732, 'k': 12362, 'd': 11465, '\\r': 10851, '\\n': 10851, 't': 10757, 'm': 10269, 'ł': 10059, ',': 9130, 'p': 8031, 'u': 7699, 'l': 6677, 'j': 6586, 'b': 5753, 'ę': 5534, 'ą': 4794, 'g': 4775, 'h': 3915, 'ż': 3334, 'ó': 3097, 'ś': 2524, '.': 2380, 'ć': 1956, ';': 1445, 'P': 1265, 'W': 1258, ':': 1152, '!': 1083, 'S': 1045, 'T': 971, 'I': 795, 'N': 793, 'Z': 785, 'J': 729, '—': 720, 'A': 698, 'K': 683, 'ń': 651, 'M': 585, 'B': 567, 'O': 567, 'C': 556, 'D': 552, '«': 540, '»': 538, 'R': 489, '?': 441, 'ź': 414, 'f': 386, 'G': 358, 'L': 316, 'H': 309, 'Ż': 219, 'U': 184, '…': 157, '\\*': 150, '(': 76, ')': 76, 'Ś': 71, 'F': 47, 'é': 43, '-': 33, 'Ł': 24, 'E': 23, '/': 19, 'Ó': 13, '8': 10, '9': 8, '2': 6, 'v': 5, 'Ź': 4, '1': 4, '3': 3, 'x': 3, 'V': 3, '7': 2, '4': 2, '5': 2, 'q': 2, 'æ': 2, 'à': 1, 'Ć': 1, '6': 1, '0': 1})\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"OrderedDict([(' ', 63444),\n",
" ('a', 30979),\n",
" ('i', 29353),\n",
" ('e', 25343),\n",
" ('o', 23050),\n",
" ('z', 22741),\n",
" ('n', 15505),\n",
" ('r', 15328)])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from collections import Counter, OrderedDict\n",
"\n",
"def freq_list(g, top=None):\n",
"    # Count the items produced by g and return them ordered by decreasing frequency.\n",
"    # top=None keeps every item; otherwise only the `top` most frequent ones are kept.\n",
"    c = Counter(g)\n",
"\n",
"    if top is None:\n",
"        items = c.items()\n",
"    else:\n",
"        items = c.most_common(top)\n",
"\n",
"    return OrderedDict(sorted(items, key=lambda t: -t[1]))\n",
"\n",
"freq_list(get_characters(pan_tadeusz), top=8)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_8747/6903746.py:14: UserWarning: Glyph 13 (\r",
") missing from current font.\n",
" plt.savefig(fname)\n"
]
},
{
"data": {
"text/plain": [
"'02_Jezyki/pt-chars.png'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/lib/python3.10/site-packages/IPython/core/pylabtools.py:151: UserWarning: Glyph 13 (\r",
") missing from current font.\n",
" fig.canvas.print_figure(bytes_io, **kw)\n"
]
},
{
"data": {
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from collections import OrderedDict\n",
"\n",
"def rang_freq_with_labels(name, g, top=None):\n",
" freq = freq_list(g, top)\n",
"\n",
" plt.figure(figsize=(12, 3))\n",
" plt.ylabel('liczba wystąpień')\n",
"\n",
" plt.bar(freq.keys(), freq.values())\n",
"\n",
" fname = f'02_Jezyki/{name}.png'\n",
"\n",
" plt.savefig(fname)\n",
"\n",
" return fname\n",
"\n",
"rang_freq_with_labels('pt-chars', get_characters(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Words\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What exactly counts as a word is not obvious. In practice, it depends on the choice of **tokenizer**.\n",
"\n",
"Let's assume that a word is an unbroken sequence of letters or digits (or asterisks;\n",
"in a moment this will make it easier to analyze a certain text…).\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Ty',\n",
" 'co',\n",
" 'gród',\n",
" 'zamkowy',\n",
" 'Nowogródzki',\n",
" 'ochraniasz',\n",
" 'z',\n",
" 'jego',\n",
" 'wiernym',\n",
" 'ludem',\n",
" 'Jak',\n",
" 'mnie',\n",
" 'dziecko',\n",
" 'do',\n",
" 'zdrowia',\n",
" 'powróciłaś',\n",
" 'cudem',\n",
" 'Gdy',\n",
" 'od',\n",
" 'płaczącej',\n",
" 'matki',\n",
" 'pod',\n",
" 'Twoją',\n",
" 'opiekę',\n",
" 'Ofiarowany',\n",
" 'martwą',\n",
" 'podniosłem',\n",
" 'powiekę',\n",
" 'I',\n",
" 'zaraz']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
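"from itertools import islice\n",
"import regex as re  # third-party `regex` package: supports Unicode classes such as \\p{L}\n",
"\n",
"def get_words(t):\n",
"    # A word is a maximal run of Unicode letters, ASCII digits or asterisks.\n",
"    for m in re.finditer(r'[\\p{L}0-9\\*]+', t):\n",
"        yield m.group(0)\n",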
"\n",
"list(islice(get_words(pan_tadeusz), 100, 130))"
]
},
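{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tokenizer above relies on the third-party `regex` package for `\\p{L}`. As a hedged sketch, roughly the same definition can be approximated with the standard-library `re` module. The name `get_words_stdlib` is hypothetical, and the match is not identical: `[^\\W_]` also accepts non-ASCII digits, which the original pattern does not.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"# [^\\W_] matches any Unicode letter or digit (a word character other than '_');\n",
"# asterisks are allowed explicitly, as in the lecture's definition.\n",
"WORD_RE = re.compile(r'(?:[^\\W_]|\\*)+')\n",
"\n",
"def get_words_stdlib(t):\n",
"    # Hypothetical stdlib-only variant of get_words.\n",
"    for m in WORD_RE.finditer(t):\n",
"        yield m.group(0)\n",
"\n",
"print(list(get_words_stdlib('Litwo! Ojczyzno moja, ty jesteś jak zdrowie; ***')))\n",
"# ['Litwo', 'Ojczyzno', 'moja', 'ty', 'jesteś', 'jak', 'zdrowie', '***']\n",
"```\n"
]
},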
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the 20 most frequent words.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-words-20.png'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('pt-words-20', get_words(pan_tadeusz), top=20)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see the full picture, this time without labels.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-words.png'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def rang_freq(name, g):\n",
" freq = freq_list(g)\n",
"\n",
" plt.figure().clear()\n",
" plt.plot(range(1, len(freq.values())+1), freq.values())\n",
"\n",
" fname = f'02_Jezyki/{name}.png'\n",
"\n",
" plt.savefig(fname)\n",
"\n",
" return fname\n",
"\n",
"rang_freq('pt-words', get_words(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot clearly spans very different scales. Let's apply a logarithm,\n",
"first to the $y$ coordinate only.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-words-log.png'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def rang_log_freq(name, g):\n",
" freq = freq_list(g)\n",
"\n",
" plt.figure().clear()\n",
" plt.plot(range(1, len(freq.values())+1), [log(y) for y in freq.values()])\n",
"\n",
" fname = f'02_Jezyki/{name}.png'\n",
"\n",
" plt.savefig(fname)\n",
"\n",
" return fname\n",
"\n",
"rang_log_freq('pt-words-log', get_words(pan_tadeusz))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Question** Why do we see longer and longer “steps”?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Hapax legomena\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the previous plot we can read off that about 2/3 of the distinct words occurred\n",
"exactly once. Words that occur only once in a given corpus are called\n",
"*hapax legomena* (singular: *hapax legomenon*, ἅπαξ\n",
"λεγόμενον, “said once”; in jargon: “hapax”).\n",
"\n",
"“True” hapax legomena, words that occur only once in the *entire*\n",
"corpus of texts of a given language (e.g. an ancient one), naturally\n",
"cause enormous difficulties in translation. One example is the Greek\n",
"word ἐπιούσιος, an attribute describing the bread in the Lord's Prayer.\n",
"This is the only attestation of the word in the entire known corpus\n",
"of Greek (not just in the Bible). In Polish it is translated as\n",
"„powszedni” (“everyday”), whereas in Russian, for example, the accepted equivalent is\n",
"„насущный”, whose meaning is the opposite of the Polish one!\n",
"\n",
"Hapaxes can cause broadly similar problems for statistical methods\n",
"when processing any corpus.\n",
"\n"
]
},
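{
"cell_type": "markdown",
"metadata": {},
"source": [
"The share of hapaxes can also be computed directly instead of being read off a plot. Below is a minimal sketch; the helper name `hapax_share` and the toy token list are assumptions made for illustration, and the share is taken over distinct words (types), not over all occurrences.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"def hapax_share(tokens):\n",
"    # Fraction of distinct tokens that occur exactly once.\n",
"    counts = Counter(tokens)\n",
"    if not counts:\n",
"        return 0.0\n",
"    hapaxes = [w for w, n in counts.items() if n == 1]\n",
"    return len(hapaxes) / len(counts)\n",
"\n",
"sample = 'a b a c d a b e'.split()  # toy data: c, d and e are hapaxes\n",
"print(hapax_share(sample))  # 3 of the 5 distinct tokens occur once\n",
"```\n",
"\n",
"With the notebook's own tokenizer, the same helper could be called as `hapax_share(get_words(pan_tadeusz))`.\n"
]
},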
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Log-log plot\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If we draw the plot mentioned above using a logarithmic scale\n",
"for **both** axes, we get a shape close to a straight line.\n",
"\n",
"This property of texts is known as **Zipf's law**.\n",
"\n"
]
},
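{
"cell_type": "markdown",
"metadata": {},
"source": [
"In its usual formulation, Zipf's law says that frequency is roughly proportional to $1/r^s$ for rank $r$, so $\\log f \\approx c - s \\log r$, which is a straight line on a log-log plot. The sketch below estimates the slope with an ordinary least-squares fit. The helper name `zipf_slope` and the synthetic token list are assumptions made for this illustration, not part of the lecture code.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"from math import log\n",
"\n",
"def zipf_slope(tokens):\n",
"    # Least-squares slope of log(frequency) against log(rank).\n",
"    # Needs at least two distinct tokens, otherwise the fit is undefined.\n",
"    freqs = sorted(Counter(tokens).values(), reverse=True)\n",
"    xs = [log(r) for r in range(1, len(freqs) + 1)]\n",
"    ys = [log(f) for f in freqs]\n",
"    mx = sum(xs) / len(xs)\n",
"    my = sum(ys) / len(ys)\n",
"    sxx = sum((x - mx) ** 2 for x in xs)\n",
"    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))\n",
"    return sxy / sxx\n",
"\n",
"# Synthetic data: the token of rank r occurs exactly 2520 / r times\n",
"# (2520 is divisible by every r from 1 to 10), i.e. an ideal Zipf curve with s = 1.\n",
"tokens = [f'w{r}' for r in range(1, 11) for _ in range(2520 // r)]\n",
"print(zipf_slope(tokens))  # close to -1 for this idealized input\n",
"```\n",
"\n",
"On real text the fitted slope depends on the corpus and on the tokenizer, and the points usually deviate from the line at the very highest and lowest ranks.\n"
]
},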
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-words-log-log.png'"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAdZ0lEQVR4nO3deXxU1d3H8c9vJnsIgSxsCSGEVRYRSEFAVLAq1AWrtpW2Pu5oa7Xa+lhtfbT60mpba6u1tiJSpa5UxQ1ttSggimDYlVX2kAhh37Oe548EKhjCxGRyb2a+79fLF5k7k+E7Il9Pzr3nHnPOISIi/hXwOoCIiNRNRS0i4nMqahERn1NRi4j4nIpaRMTnYsLxphkZGS43Nzccby0iEpHmzZu31TmXWdtzYSnq3NxcCgoKwvHWIiIRyczWH+s5TX2IiPicilpExOdU1CIiPqeiFhHxORW1iIjPqahFRHxORS0i4nNhuY7663pk2ioqKquOPGh25MNavu+ol2BHvSolIYaOaUlkt04ku3UiKQmxjZBWRKRp+Kqo/zZjNQfKKw8/DtetslMTY+mYlkh2q/+Wd3WRVz9OjvfVvxYRiXK+aqSl94yq9/ccvfHB0eXugJ37yyjccYCNO/ZTuOMAhTW/rtqyh/dXbKG04shRfLuWCXx7QBbfH5RDx7SkemcSEWlMFo4dXvLz811zWULunGPr3rLD5b1xx37mr9/Je8s344DTumfyw8GdGNGzDcFAbRMvIiINZ2bznHP5tT3nqxG1F8yMzJR4MlPi6Z/T+vDxop0HeOGTjbwwdwNXTyqgQ2oCYwfl8L1BHWmTkuBhYhGJNlE/oj6e8soq/rN0M8/MWc+Hn28jJmCc3bsdlwzqyJC8dGKCunBGRBpOI+oGiA0GGN23PaP7tmdNyV6em7OBf84rZOqSYtKS4zi7d1u+1bc9J+elE6vSFpEwOO6I2sx6AC9+6VAecKdz7k/H+p5IGlHX5mB5JdNXbGHqki+Ytmwz+8sqaZ0Uy1m92jG6bzuGdc1QaYtIvdQ1oq7X1IeZBYFNwGDn3DHvnRrpRf1lB8srmbGyhLeWFDNt2Rb2llYwsmcbJl7+Da+jiUgz0phTH2cAq+sq6WiTEBvk7N7tOLt3Ow6WV/KX9z/nz+99zkefb2Vo1wyv44lIBKjvz+eXAM/X9oSZjTOzAjMrKCkpaXiyZighNsj1I7rSPjWB37+z4ivXeIuIfB0hF7WZxQHnA/+s7Xnn3HjnXL5zLj8zs9Ztv6JCQmyQG8/oxoINO5m2bIvXcUQkAtRnRD0amO+c2xyuMJHi4oHZdEpP4sF3VlBVpVG1iDRMfYp6LMeY9pAjxQYD/OzM7iz/Yg9TlxR7HUdEmrmQitrMkoAzgVfCGydynHdiB3q0TeGP76786h0BRUTqIaSids7td86lO+d2hTtQpAgEjJ+d1Z01W/fxyvxNXscRkWZMqzLC6KxebemXncrD01ZRWlF5/G8QEamFijqMzIxbzu7Bpp0HeGHuRq/jiEgzpaIOs1O6ZjC4cxp/fu9zvth10Os4ItIMqajDzMy4dVRPduwvY8gD07hk/Gyem7OBHfvKvI4mIs2EbnPaRNaU7OW1hUW8saiINVv3ERMwzu/Xgd9/p582JBAR3ebUD/IyW3Dzmd256Zvd+KxoN8/OWc/zczdyVu+2jOrT3ut4IuJjmvpoYmZGn6xU7r2gL53Sk/jrjDW6J4iI1ElF7ZFgwLhmeB6LNu7k4zXbvY4jIj6movbQxQOzyWgRx99mrPY6ioj4mIraQwmxQa4Y1pkZK0tYWrTb6zgi4lMqao/9cHAnkuOCPD5To2oRqZ2K2mOpSbGMHZTDm4uL2bh9v9dxRMSHVNQ+cNXwzgQMJnywxusoIuJDKmofaJ+ayJiTsnixYCNLCnWDQhE5koraJ352ZnfSk+P5/oSPmb9hh9dxRMRHVNQ+0aFVIpOvG0JachyXTpjD3LW6tlpEqq
mofSSrVSKTrx1Cu9QELps4l399WqxViyKiovabti0TePHaIeRmJHPdM/MZ85cPeXfpZhW2SBQLdc/EVmb2kpktN7NlZjYk3MGiWUaLeF67fhi/vagvO/eXc82kAkY//AH/LNionWJEolBItzk1s6eBD5xzE8wsDkhyzu081ut1m9PGU1FZxWsLixg/cw0rNu8ho0Uc/3t2D773jRyvo4lII6rrNqfHHVGbWUvgVOBJAOdcWV0lLY0rJhjgooHZ/Oum4Txz1WBy0pK449VP2bTzgNfRRKSJhDL1kQeUAH83swVmNsHMko9+kZmNM7MCMysoKSlp9KDRzsw4pVsGf/7+AAAe142cRKJGKEUdAwwA/uqc6w/sA247+kXOufHOuXznXH5mZmYjx5RDslolcvHAbF74ZCObd2sPRpFoEEpRFwKFzrk5NY9forq4xSM/Oq0rlVWOx2doyblINDhuUTvnvgA2mlmPmkNnAEvDmkrqlJOexAUnZfHc3PWU7Cn1Oo6IhFmo11HfADxrZouBk4DfhC2RhOT6EV0oq6jSpgMiUSCkonbOLayZfz7ROXeBc043o/BYXmYLLh6YzT9mr6dwh26PKhLJtDKxGbvpm90xg4feWel1FBEJIxV1M9ahVSKXD8tlysJN2spLJIKpqJu5H5/WlZYJsdz52qfaIUYkQsV4HUAaJjUpll+dcwK/mrKE0x+czqje7ejRLoVO6Umce2IHggHzOqKINJCKOgJ8N78jw7tl8OQHa5myYBNTlxQDULKnlKuH53mcTkQaKqSbMtWXbsrkrdKKSi59ci4bt+9n5q0jiA1qhkvE7xp0UyZpfuJjgvzotC4U7zrIG4uKvI4jIg2koo5Qp/fIpHvbFoyfuUabDog0cyrqCGVmjDu1C8u/2MMFj33EpNnr2L6vzOtYIvI16GRiBLuwfxa7D5QzuWAjd772Gfe8sZT+Oa3ITInnmuF59M9p7XVEEQmBTiZGiaVFu5myoJBFG3exasseYoMB3rn5VFolxXkdTUSo+2SiRtRRoleHlvTq0AuAJYW7+PZjH3Lna5/x8CUnYaZrrUX8THPUUahvdio3jOzG64uKuO6ZeezaX+51JBGpg4o6St0wsit3nHMC7y3fwq9eXeJ1HBGpg6Y+olQgYFw9PI+SPaVMmLWWTTsPkNUq0etYIlILjaij3KVDOuGcY9LsdV5HEZFjUFFHuezWSYzq046/z1rHZRPn6g58Ij6kohbuOq83lwzqyNy123nwnRVexxGRo4RU1Ga2zsyWmNlCM9MF0hGmbcsE7hnThx8MzuHNxcXa2kvEZ+pzMnGEc25r2JKI5648pTNPfbSOK/7+CQNyWtOtbQtG9GxDl8wWXkcTiWqa+pDDOrRK5J4xfWiVFMu05Zu5d+oyzvjDDB54ezl7DupaaxGvhLSE3MzWAjsABzzunBtfy2vGAeMAcnJyBq5fv76Ro0pTK951gEemreL5uRsBMIN7xvTh0pM7eZxMJPLUtYQ81KLu4JwrMrM2wLvADc65mcd6ve71ETmcc8xevY1Fhbt4fVER2/aW8sEvRhAfE/Q6mkhEafDGAc65oppftwBTgEGNF0/8zMwY2jWDH53ehdtH92TLnlJ+8dJiPlhVQllFldfxRKLCcU8mmlkyEHDO7an5+izgnrAnE98Z3i2DK4bl8uycDby6sIjMlHh6tkvhN9/uS8e0JK/jiUSs4059mFke1aNoqC7255xz99X1PZr6iGx7DpYze/U2XltYxMyVJbRpGc+NZ3Tj/H4ddCc+ka+pwXPU9aWijh6zVm3l9imL2bj9AFef0plfnXOCylrka9DmthI2p3TLYMYtI7hsSCcmzFrLNZPmsWX3Qa9jiUQU3T1PGiwQMO46rzctEmKYOGsd5z06iwv6Z3HBSVmc0L6l1/FEmj1NfUijenfpZh789wpWl+ylosrRs10Kd5/fm8F56V5HE/E1zVFLk9u2t5Q3FxfzxAdrqKh0vPSjIWSmxO
v6a5Fj0By1NLn0FvFcNjSXx34wgJ0Hyjjlt+8z6L5pfLppl9fRRJodFbWE1YnZrXj9J6dwxzknEBcT4KK/fsSUBYW
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
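],
"source": [
"import matplotlib.pyplot as plt\n",
"from math import log\n",
"\n",
"def log_rang_log_freq(name, g):\n",
"    # frequency list of the items produced by generator g\n",
"    # (assumed to be ordered from the most to the least frequent item)\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    # x: log of the rank (1, 2, ...), y: log of the corresponding frequency\n",
"    plt.plot([log(x) for x in range(1, len(freq.values())+1)], [log(y) for y in freq.values()])\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",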
"\n",
"log_rang_log_freq('pt-words-log-log', get_words(pan_tadeusz))"
]
},
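{
"cell_type": "markdown",
"metadata": {},
"source": [
"If Zipf's law holds, the log-log plot above is roughly a straight line, and its slope estimates the Zipf exponent. The cell below is a minimal sketch of such an estimate using ordinary least squares. The helper `zipf_slope` and the toy counts are illustrative assumptions, not part of the lecture code; for counts that follow frequency = C / rank exactly, the slope should come out close to -1.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from math import log\n",
"\n",
"def zipf_slope(counts):\n",
"    # counts: frequencies sorted in descending order (rank 1 first)\n",
"    xs = [log(rank) for rank in range(1, len(counts) + 1)]\n",
"    ys = [log(c) for c in counts]\n",
"    mean_x = sum(xs) / len(xs)\n",
"    mean_y = sum(ys) / len(ys)\n",
"    # least-squares slope of log(frequency) against log(rank)\n",
"    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))\n",
"    var = sum((x - mean_x) ** 2 for x in xs)\n",
"    return cov / var\n",
"\n",
"# hypothetical toy data: frequency = 1200 / rank for ranks 1..50\n",
"zipf_slope([1200 / rank for rank in range(1, 51)])"
]
},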
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Relationship between frequency and length\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A linguistic law related to Zipf's law (the law of abbreviation) describes the relationship between\n",
"how often a word is used and how long it is. In general, the shorter the word, the more frequent it is.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-lengths.png'"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAABQg0lEQVR4nO2dd3gUVffHvzeFhBYIvfemICBFxAIqSBEV7Ng7dn15bdh77w0b+FqwgxVRsSGCUkITUOkQBITQe+r5/fHN/GaS7CZ3dpfdTTyf55lndid7z96d7J45c+4pRkSgKIqixC8JsZ6AoiiKUjqqqBVFUeIcVdSKoihxjipqRVGUOEcVtaIoSpyTdCCE1qlTR1q0aHEgRCuKolRI5syZs1lE6gb62wFR1C1atEBGRsaBEK0oilIhMcasCfY3dX0oiqLEOaqoFUVR4hxV1IqiKHGOKmpFUZQ4J34U9f79wIwZwJIlsZ3Hli3A1KnA6tWxnYeiKEoh8aGo334baNYMuOYaoF8/4Oijgb//ju4cRIDbbwfatAFuuw047DDgtNOAXbuiOw9FUZRixF5Rz5oF3HorcPXVQNu2wIknAh07UklGs7LfmDHA5MnA0qXA9OlAZiZQrRpwww3Rm4OiKEoAYq+oX34ZSE0Ffv0VGDIEaNkS+OILYOVKYP786M3jtdeARx8F6hbGm6emAs88A4wfD+zZE715KIqiFOOAJLz4Yu5cID0d+PZbwBgeO+UUWtVr1gCHHhqdeWzZQveLl/R0IDkZ2L0bqFo1OvNQFEUpRuwt6oICoFIlPt6xA8jOBqpUodujoCB68zjmGOD994semzwZqF8fqFcvevNQFEUpRuwt6s6daU3XqQPs3UsFXakSrdkmTaI3jzvvBI46Cti6FRg8GFiwAHj6aeDNN11LX1EUJQbE3qIeNAjYto2RHocfDvTvDzRuTMu6Z8/ozaNVK2D2bFrzTz0FLF/OC8jgwdGbg6IoSgBib1HPncuQvJ9/ZkhcZiat6tRUYNkyoF07Ozlr1gBPPMGIjQYNGEVy0kn+5tK4MfDII/4/g6IoygEk9hb1ihVUqsuXA9deC4weDfzxB9ClC7BqlZ2MtWuBI45gON1rrwEXXACMHAm8+OKBnbuiKEoUMGV1ITfGtAfwoedQKwB3i8izwcb06NFDrMuc3nMPkJVFBe2wYwfD9BYupJVbFiNHMjrj8cfdY8uWAb17U4lXrmw3F0VRlBhhjJkjIj0C/a1M14eILAHQtVBQIoB1AD6N2Oyuugro3h1o2BA4/3xgwwZg1Cjg7LPtlDQA/PYb/cpe2ralC2TpUlrniqIo5RS/ro9+AFaISNAC175p0IC1NZYvp/vi8suBoUOB55+3l9GoUckaIXv2AOvXM7xOURSlHON3MXE4gPcD/cEYMwLACABoVjxxpCxatwbeesvnVDxccw1w8cW0zLt0YX2OG24Ajj+eFwJFUZRyjLVFbYypBOBkAB8H+ruIvCYiPUSkR926Adt+HTj69QPuv5+hfm3aMMNw/37W71AURSnn+LGoBwOYKyIbD9RkwuKii4BzzmGNkDp1uCmKolQA/CjqsxHE7RE3VKoEdOgQ61koiqJEFCvXhzGmCoDjAXxyYKejKIqiFMdKUYvIXhGpLSI7DvSEYk5eHmOv9+2L9UwURVEAxENmYjwxZgzQvDnQqxdD/m6+GcjNjfWsFEX5lxM/ijojA3jsMaaAb9sW/ff/5BPW+Zg0ifHXixezgt4dd0R/LoqiKB5ir6gLCpjkctppwMaNwI8/MqtwypTozuPZZ5nd6GQxNmoEvPEG8PrrDPVTFEWJEbGvnvfJJ7Sm//jD7aLy/ffAeeexKFNycnTmkZkJHHJI0WNNmvD9t21jiruiKEoMiL1F/fHHwPXXF2111b8/FeOvv0ZvHj160O3hZfZsllvVDi+KosSQ2FvUIkBCgOuFMdHtQn7HHcCAAXTFOB1ebr4ZeOABIDExev
NQFEUpRuwt6tNOA154oWg43JQpwN9/s0hTtDj0UOC772jFDx4MvPIK61lfeGH05qAoihKA2FvUZ5xBl0OnTny8YQMwcSLwwQdu09to0bUr8OGHZb5MURQlmsTeok5IYAPZcePYr7BnT+Cvv1j5TlEURYkDixqgP7p3b26KoihKEeJDUUeC7dsZ8+w0tx0xAujWLdazUhRFCZvYuz4iwZYtwOGHs6P5eecBLVoAJ5zA0D9FUZRyTsWwqJ99FjjySGDsWPdYv37AsGHcopU0oyiKcgCoGBb1998DF1xQ9FjPnkD16sCff/qTtXMnE102bIjc/BRFUcKgYijq9PSSijUnhy6RmjXtZIgwuaV5c+CKK4COHelG2bs34tNVFEXxQ8VwfVx6KTML160DFi7kYuLOnUxisW20+9ZbwPjxHN+kCbB7N3DZZcB//8vkF0VRlBhh5ACkaffo0UMyMjIiLjcoO3YABx0EbNoE1K/PCJDsbMZnn3eenYzDDwfuuw8YONA9lpXFZrkbNjDGW1EU5QBhjJkjIj0C/a1iuD6efx5IS+PWqhUt4qZNgVGjgPx8OxkbN3Kslzp1uBC5c2fk56woimKJbc/EmsaY8caYv4wxfxpj4isz5Z13qJBXrgR++YWZjVddRR+17WLi0UeXDOebMgWoVUur5ymKElNsfdTPAfhGRE43xlQCEF9+gC1bgEsucRcOjWHp1NtuA7ZutZNx551Anz7Arl3AoEHA778DDz9M/3Sg6n6KoihRokwNZIxJA9AHwFgAEJEcEdl+gOflj9q1gc8+o28aYATH88+zqFOtWnYy2rUDZsxgFb877gBmzqTMoUMP0KQVRVHssLGoWwHIAvA/Y0wXAHMA3CAie7wvMsaMADACAJrZRlpEivPPB959lz7m444Dli/nAmPt2lxktKVFCybPKIqixBE29/RJALoBeFlEDgWwB8Co4i8SkddEpIeI9Khbt66/WSxfzuiMBg2oWJ980n4REABuuIFRGZ0704I++GCG1732mhb9VxSl3GOjqP8G8LeIzCx8Ph5U3JFhwwagb18q6NmzGVI3cSJw7bX2MtLSgGnTqOxzcpi08uuvrPehKIpSzrGKozbG/ALgMhFZYoy5F0BVEbk52Ot9xVHffTcXA196yT22cyfdEL//zlA7RVGUCk4k4qivA/CuMeZ3AF0BPByhuQHz55dsEpCWxmazixdH7G0URVHKK1bheSIyH0BATR82rVsDc+awyp1DTg5TuYsnoJTFkiV0eTRoQOWfVDEy5BVF+XcTe0129dVsYtupE9CrF5Cby+JIvXoBbdvaySgooJzx41lMacsWYP9+4JtvmAKuKIpSjom9om7blqneF11E5QrQkp4yxV7GW28BX3xBhZ2Wxpof2dnAmWeymYCiKEo5JvYpd4sXA489Bnz+OWOfN29mZuCll9rLeOopICUFWLYM+PJL4I8/gMsvp+yVKw/c3BVFUaJA7BX1yy8D110HDBhAa7h2beC556hs//jDTsa6dUx6qV2bz41xCzKtWHHg5q4oihIFYq+oMzPpn/aSlMS46rVr7WSkpQE//8zUcYeff2aNDi2opChKOSf2ivqww4CvvqKSXbeO7o8tW4CMDKBLFzsZ551H6/vYY1nj4/rrgTPOYJGmjh3t5/Lnnyzu1LkzcPLJwA8/hPSRFEVRIknsFfWVV9KvXKcOcMghQKNGLJDkpJTbcPPNfO3mzXSl/PQTU8dff90+RG/RIlbPy8tjE4G6delO+eCD0D+boihKBIi9ot60ib7kI46gcjz4YCrsZcvsZaSl0TLPzGTkyJYtQLVqlGXLPfcAVasCS5cCDRuyPGp+PnDjjYwmURRFiRGxD897+WW6Ku66C9i2DahcmV1VWrSgK8Km+t1bb9EiXr+eChoAXniBncl/+81uHj/8wJojn33GxUhHxo038mJia90riqJEmNhb1GvWcNGve3cq53r12AW8bVv+zYZ332WTAEdJA0yAWb
PGPjwvO5slUh0lDdBPnZvrr5KfoihKhIm9om7ZEnjoIVrU27cDq1ezeP+0aUDXrnYy9u+nUj7mGFrjTZsyNrtyZcq
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"def freq_vs_length(name, g, top=None):\n",
"    # note: the `top` parameter is currently unused\n",
"    freq = freq_list(g)\n",
"\n",
"    plt.figure().clear()\n",
"    # x: length of each item, y: log of its frequency\n",
"    plt.scatter([len(x) for x in freq.keys()], [log(y) for y in freq.values()],\n",
"                facecolors='none', edgecolors='r')\n",
"\n",
"    fname = f'02_Jezyki/{name}.png'\n",
"\n",
"    plt.savefig(fname)\n",
"\n",
"    return fname\n",
"\n",
"freq_vs_length('pt-lengths', get_words(pan_tadeusz))"
]
},
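{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scatter plot above shows every word type separately. A complementary view, sketched below under simplifying assumptions, is the average frequency of word types of each length. The helper `mean_freq_by_length`, the whitespace tokenization and the sample sentence are illustrative stand-ins for `get_words` and the full text, so the numbers only illustrate the computation.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter, defaultdict\n",
"\n",
"def mean_freq_by_length(words):\n",
"    # count how often each word type occurs\n",
"    freq = Counter(words)\n",
"    # group the counts by word length\n",
"    by_length = defaultdict(list)\n",
"    for word, count in freq.items():\n",
"        by_length[len(word)].append(count)\n",
"    # average frequency of the word types of each length\n",
"    return {n: sum(c) / len(c) for n, c in sorted(by_length.items())}\n",
"\n",
"# hypothetical toy text, split on whitespace\n",
"sample = 'i a i w i kot i pies a kot w lesie'\n",
"mean_freq_by_length(sample.split())"
]
},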
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### N-grams\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In language modelling one often considers n-grams, i.e. contiguous subsequences of\n",
"length $n$.\n",
"\n",
"For example, *digrams* (*bigrams*) are sequences of two adjacent units, e.g. letters or words.\n",
"\n",
"| $n$|$n$-gram|name|\n",
"|---|---|---|\n",
"| 1|1-gram|unigram|\n",
"| 2|2-gram|digram/bigram|\n",
"| 3|3-gram|trigram|\n",
"| 4|4-gram|tetragram|\n",
"| 5|5-gram|pentagram|\n",
"\n",
"**Question:** What is a 6-gram called?\n",
"\n",
"As you can see, for symmetry we sometimes speak of unigrams when we operate\n",
"simply on the units themselves rather than on their subsequences.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### N-grams from Pan Tadeusz\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The statistics we computed for single letters and words can be repeated for n-grams.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
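],
"source": [
"def ngrams(iterable, size):\n",
"    # sliding window over the input; yields every run of `size` consecutive items\n",
"    ngram = []\n",
"    for item in iterable:\n",
"        ngram.append(item)\n",
"        if len(ngram) == size:\n",
"            yield tuple(ngram)\n",
"            # drop the oldest item so the window moves forward by one\n",
"            ngram = ngram[1:]\n",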
"\n",
"list(ngrams(\"kotek\", 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we counted all n-grams, including partially overlapping ones.\n",
"\n",
"We should always make clear whether we mean character n-grams or word n-grams.\n",
"\n"
]
},
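{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the distinction concrete, the sketch below builds both character bigrams and word bigrams from the same short string. It uses its own sliding-window helper, `sliding_ngrams`, based on `collections.deque` (an assumption of this example, not the lecture's implementation), and plain `str.split` instead of `get_words`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import deque\n",
"\n",
"def sliding_ngrams(items, size):\n",
"    # a deque with maxlen discards the oldest item automatically\n",
"    window = deque(maxlen=size)\n",
"    for item in items:\n",
"        window.append(item)\n",
"        if len(window) == size:\n",
"            yield tuple(window)\n",
"\n",
"text = 'ala ma kota'\n",
"# iterating over a string yields characters, so these are character bigrams\n",
"char_bigrams = list(sliding_ngrams(text, 2))\n",
"# iterating over the split list yields words, so these are word bigrams\n",
"word_bigrams = list(sliding_ngrams(text.split(), 2))\n",
"\n",
"(len(char_bigrams), word_bigrams)"
]
},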
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Character 3-grams\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-3-char-ngrams-log-log.png'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfO0lEQVR4nO3deXxV1b3+8c/3nMyBjCRMSQjzrCJhUKwoIMW5WttK1V/Vtlhr69jbXrVeba9Dbe/11tni3Dq2qP1Z5wEVEEXmMaiICYQpYQqBkJCcs+4fCVyV6QDnZO/kPO/XKy+SnJ3kyQGerKy99l7mnENERPwr4HUAERE5MBW1iIjPqahFRHxORS0i4nMqahERn0uIxSft0KGDKy4ujsWnFhFpk+bOnbvROZe3r8diUtTFxcXMmTMnFp9aRKRNMrPy/T2mqQ8REZ9TUYuI+JyKWkTE51TUIiI+p6IWEfE5FbWIiM+pqEVEfC4m66gP1z3vfk5GSgKFOWkU5qRRkJ1KWpKvIoqItDjftGAo7Jg8bSXb6xu/9v7c9CQKslMpyEmjMLupvHeXeNesVFISgx4lFhFpGREVtZldA/wEcMBi4BLnXF00gwQDxuJbxrNx+y5Wb6mlYstOVm+upaL59aVrqnlr6XoaQl/f6CC/fXLTCDw7lYLsNApzmv/MTqNzVgqJQc3uiEjrdtCiNrOuwJXAAOfcTjP7O3A+8ES0w5gZee2TyWufzLFF2Xs9Hgo7KmvqWL15JxVbav/vzy21zC7bwssL1xL+So8HDDpnplKQnUrP/HYMK86mpFsOBdmpmFm044uIxESkUx8JQKqZNQBpwNrYRdq/YMDonJlK58xUhnfP2evxhlCY9dV1TSPyPSXeNDL/14K1PDNrFQCdM1MYVpzDsOJshnXPoU9+ewIBFbeI+NNBi9o5t8bM/gtYBewE3nLOvfXN48xsEjAJoKioKNo5I5IYDOw5EUnPrz8WCjs+XV/D7LLNzC7bzKwvN/HywqafNxkpCZQU5zCsOIfh3bMZ1DWT5ATNfYuIP9jBNrc1s2zgBeAHwFbgH8AU59xT+/uYkpIS5/e75znnWL15557i/qRsMyurdgCQnBDg6MIsBnTOIDkxQFIwQGLzS1JCgKSg7Xk7Ky2Rk/rmE9SIXESOgJnNdc6V7OuxSKY+xgFfOueqmj/Zi8DxwH6LujUwM4py0yjKTeO7QwsA2LS9ntllW5jTXN4vzK1gVyjMrlCYA/08G9svn7snDqFdsm8W0YhIGxJJs6wCRppZGk1TH2MBfw+XD1Nuu2QmDOrEhEGd9nqsMRSmIeSairsxTEOo6WXq8kpufbWU8x6cySM/KqEgO82D5CLSlh107ZpzbhYwBZhH09K8ADA5xrl8JyEYIDUpSGZqInntk+mSlUq33HQuGdWdxy8expqtO/nO/R8yt3yz11FFpI2JaJGxc+5m51w/59wg59xFzrn6WAdrTU7sk8dLPx9FenICEyfP4qX5FV5HEpE2RFeDREmv/Hb88+ejOLZbFtc8v5A/vbmccPjAJ2pFRCKhoo6i7PQk/nrpCM4fVsj9733Bz5+ex2cbagipsEXkCGiZQpQlJQS449zB9Mpvx+2vlfLG0vWkJgYZ2CWDQV0zGdw1k0FdM+mUkUJGaoKukBSRgzroOurD0RrWUbeE8k07mFu+hUUV1SxZU83StdvY2RDa83hCwMhOTyInLYmstEQSgwF293YwYJwyoCMThxXpqkmROHCgddQq6hYUCjtWVm1n2bptVNXUs3nHLjbv2MWmHbuorm0g5Bxh53AOttU1sLJqB8OKs7nj3KPold/O6/giEkNHesGLREkwYPTu2J7eHdsf9FjnHFPmVnDrq6Wcdvd0fn5yT0q65RAMGAlBo1NGim4uJRInVNQ+ZWZ8r6SQk/rm87t/LeXP73y+1zFdMlMY3j2Hsf07ctrgzrqMXaSN0tRHK/HZhhqqdzbQGHI0hsN8uXEHs7
7czKyVm9m4vZ7e+e245pQ+TBjYSXPaIq2Q5qjbsHDY8dqSdfz5nc9ZUbmd/p0zuGZcb04Z0FHTIiKtiIo6DoTCjn8tXMuf3/mMsk21HFWQyc9G9+TbAztpSkSkFVBRx5HGUJgX56/h/vdWUL6plsKcVIYUZhMwKMpN5wfDCumalep1TBH5BhV1HAqFHW8v28CTM8tYV72TkHOs2bITgON65jKkMJviDukkBo3M1ES6ZqU2v62LVUW8oOV5cSgYsL1u2VqxpZZnP1nF1OVVPPjBF3td2p6cEGBUrw7cO3EI6bq3tohvaEQdp3buClFZU0dDyLG1tmnn9wWrtvLkR+VcOaYX147v63VEkbiiEbXsJTUpSLfc9D1vlxTncM6QAjbXNjB5+krOH15EF81li/jCQSckzayvmS34yss2M7u6BbKJB34zoS/OwdXPLaC6tsHrOCJCZDu8fOqcO8Y5dwwwFKgFXop1MPFGQXYafzzvKOav3sLp907nlpeXMuPzjcRiikxEInOoUx9jgS+cc+WxCCP+cPYxXemSlcqdry/n+dmreWJmGYU5qQzqksn1p/anKFf7Qoq0pEM6mWhmjwHznHP3Heg4nUxsO+oaQry8cC1TSyv58IuNZKYm8sQlw+iVf/AbS4lI5KKyjtrMkoC1wEDn3IZ9PD4JmARQVFQ0tLxcg+62ZnFFNT98+GNq6hsZ2CWDYcU5jO6bx8l9872OJtLqRauozwaucM6NP9ixGlG3XZXb6nhx/hqmfVbF/FVb2dkQ4tbvDOLCkd28jibSqkWrqJ8D3nTOPX6wY1XU8WFXY5jLn5rLu8srGVacTZesVI4tyuaE3h3o0SFdN4USOQRHXNRmlgasBno456oPdryKOn7s3BXi0RkreWvZBqpq6llXXQdA58wUTu6Xz8XHF9MtN43khKDHSUX8Tff6kBazalMt01dU8eGKjbxbWkl9YxiAtKQgvz97EOcNLfA4oYg/qajFE+uqdzLj842sq67jnwvWsHbrTiYOL+L0wZ0pKc7xOp6Ir6ioxXNVNfX85oVFTF1eSXJCgBcuP55BXTO9jiXiGypq8Y2N2+s5694Z7Ao5fnRcNzpmpnDmUV1ITdIctsQ3FbX4yuKKan757DzKNtUCUJCdyol98ji+Zy5nHNXF43Qi3lBRiy/VNYR4b3klT8wsY3bZZsIOzjy6Cz86rpvmsCXuqKjF9xpCYW57tZSnZ5UTdnDOkK5cMKKIIUXZXkcTaREHKmrtuyS+kBgMcMtZA5nxmzF8e2BH3liynnMemMmdbyynriHkdTwRT2lELb60o76RW18t5dlPVhEMGL3z2zGgcwbFHdLp07EdBdlp9MhLJy1Je19I26CpD2m1PlyxkY9XbmJ22WZWbaplbfOVj9B0Ec2xRdmM65/PiB659OvUXpetS6ulrbik1RrVqwOjenXY83ZdQ4ila7exvrqOT77cxFvLNjBjxUYARvfJY0y/fAZ1zWBA50wt+ZM2QyNqadXCYcfiNdU8P2c1by5Zz6YduwDISkvkO8d0pX/n9gzonMmALhkEAxpti39p6kPignOO9dvqWLh6K0/PWsXsss3UNTTda6Q4N42fn9SLCYM7kZGS6HFSkb2pqCUuhcKO1Ztrmb96C/dOXcHKqh1kpSVy9djenHF0Fzq0S/Y6osgeKmqJe8455q3aym2vLmPeqq2kJQW5ZlwfzhtaQHZ6ktfxRFTUIrs51zSnfesrpXxStpmOGcn884pRdM5M9TqaxDkVtcg3OOeYsWIjP3rsExxwVEEWF4wo4ntDC7TETzxxxFcmmlmWmU0xs+VmVmpmx0U3okjLMjO+1TuPl39xAr88uRcNjWF+PWURP3x4Fm8tXU84HP0BjMjhinQrrieB6c65R5p3I09zzm3d3/EaUUtrEw47npu9mjvfWE71zgYyUhIoKc7hF2N6MaQwS6NsibkjmvowswxgIU37JUY0zFBRS2vVEArzyqK1TPtsI68vWUddQ5
hRvXL53VkD6ZXf3ut40oYdaVEfA0wGlgFHA3OBq5xzO75x3CRgEkBRUdHQ8vLyI08u4qGttbt47MMy7nn3cwAKc1I
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('pt-3-char-ngrams-log-log', ngrams(get_characters(pan_tadeusz), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Word 2-grams\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/pt-2-word-ngrams-log-log.png'"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAX7ElEQVR4nO3deXhU9b3H8c93JpONhC0JawgBRBZBBKOCoLbWVqhU7GJrbb2trcVebWtbq60+7a23t3azm4/tbUuVaq3VolKLW60VqkIRDci+aCQgEJawhyX77/6RwGWZkCGZM+fMzPv1PDwmZ04mn3mCH375zlnMOScAQHCF/A4AADg1ihoAAo6iBoCAo6gBIOAoagAIuAwvnrSwsNCVlpZ68dQAkJIWL1680zlXFO0xT4q6tLRU5eXlXjw1AKQkM9vY1mOMPgAg4ChqAAg4ihoAAo6iBoCAo6gBIOAoagAIOIoaAAIuUEV930tv629Lt2jngTq/owBAYHhywktH1DY0aeaCSu051CBJGt4nX5POKNTEoYU6v7SnumQFJioAJJR5ceOAsrIy15EzE5uanVZu2af5FTu1oGKnyjfuUX1jszJCpnElPTTxjEJNGlqgs4u7KxIO1C8DANApZrbYOVcW9bEgFfWJahuaVL5hz9HiXlm1T85JeVkZumBQz9biLtTQXnkyszgkBwB/nKqoAz1PyI6ENWloSxlL0p6D9Vq4fpcWtBb3S2t3SJKK8rM06YxCXTikQJOGFqpvtxw/YwNAXAV6Rd2eTbsP6d/v7NT8il36d8VO7TpYL0kaXNRFk84o1IfH9tfYkh6e5wCAzkra0cfpaG52Wre9Rgsqdmp+xU4tWr9btY1N+syEUt12+TDejAQQaGlR1Cc6WNeoe15Yp4cWblC/bjn6wUdG65Izo17qFQB8d6qiTtlDJ7pkZeiuK8/SE1+coOxISJ+Z+bq+Pmup9rSORwAgWaRsUR9x7sCeevYrF+nLl56hOUur9P5fvKxnl2+VF79JAIAXUr6opZajR279wDDN+dIk9e2Wo5v/vEQ3PrxY2/fX+h0NANqVFkV9xMh+XfXXmy7UHVOG6+W3qnXZz1/WX954l9U1gEBLq6KWpIxwSDdeMkR//+rFGtm3q7755Ap96v5FenfXIb+jAUBUMRe1mYXN7E0ze8bLQIkyqLCLHv3CeN394VFasXmfrvrfBdpRwygEQPCczor6FklrvArih1DI9KkLBmr2TRfqYF2j7nhyBWMQAIETU1GbWbGkKyTd720cfwztna/bJw/XS2t3aFb5Jr/jAMBxYl1R/1LS7ZKa29rBzKabWbmZlVdXV8cjW0Jdf2Gpxg/uqe89vVqbdjOvBhAc7Ra1mU2VtMM5t/hU+znnZjjnypxzZUVFyXcGYChk+unVY2RmuvXxZWpuZgQCIBhiWVFPlHSlmW2Q9JikS83sT56m8klxj1z914dG6vXK3Zq5oNLvOAAgKYaids7d4Zwrds6VSrpG0lzn3Kc9T+aTq88t1mUjeusnL6zT29tr/I4DAOl3HHV7zEw//Mho5WVl6Guzlqqhqc2xPAAkxGkVtXPuX865qV6FCYqi/Cz94MOjtXLLft03t8LvOADSHCvqNkwe1UcfGdtfv55XoWWb9vodB0Aao6hP4btXnqVe+Vn62qylqm1o8jsOgDRFUZ9Ct5yI7vnYGK2vPqg7/7pCB+oa/Y4EIA1R1O2YNLRQN71niGYv2aKLfjxXv335HR2qp7ABJA5FHYPbJw/XUzdP1NnF3fWj59fq4p/M0/2vrmccAiAhUvaeiV4p37Bbv/jnW1pQsUu9u2bpS5cO1bXnlygcMr+jAUhiaXnPRK+UlfbUIzeM12PTx2tgzy76zlMrdc2Mhdqw86Df0QCkKIq6g8YPLtBfbhyvn109Ruu21Wjyva/owQWVXCMEQNxR1J1gZvroucX6x9cu0YTBBbrr6dX65O9f424xAOKKoo6DPt2yNf
Oz5+knHztbq6v2a+p9r2rvoXq/YwFIERR1nJiZPl42QA/fcIH21zbquRXb/I4EIEVQ1HE2pribhhR10VNLt/gdBUCKoKjjzMx01Tn99Xrlbm3Ze9jvOABSAEXtgWnn9JckzVla5XMSAKmAovZASUGuxpV011NvMv4A0HkUtUeuGttf67bXaM3W/X5HAZDkKGqPXDG6r8Ih401FAJ1GUXukIC9LFw8t1NNLqzhbEUCnUNQeumpsf1Xtq9XrG3b7HQVAEqOoPfT+kb2VmxnW3xh/AOgEitpDuZkZuvysPnp2+VbVNXLtagAdQ1F7bNo5/bS/tlH3/H0ds2oAHUJRe+zioUW69oIS3T+/Ujf/eYkO17OyBnB6KGqPhUKmu68apW9fMUJ/X7VN1/z+Ne2oqfU7FoAkQlEngJnphosG67efPldvbavRh+6brz8velcNTc1+RwOQBCjqBLr8rD56/IsT1Ldbju786wpd+rN/6fHyTfLivpUAUgdFnWCj+nfTX2+6UH/47HnqnpOp255Yrp/94y2/YwEIMIraB2am9w7vpTlfmqhrzhugX82r4FhrAG2iqH1kZvretFE6v7SnbntiuZZu2ut3JAABZF7MR8vKylx5eXncnzdV7TpQp2m/XqADdY06s1f+cY/lZoX1zcnDNaJvV5/SAUgEM1vsnCuL9hgr6gAoyMvSzM+ep3NLeigcsuP+rNyyTx/7zb81d+12v2MC8Akr6oDbtq9WN/zxDa2q2q/vTRul68YP9DsSAA+wok5ifbpla9aNE3TpsF76zlMr9buX3/E7EoAEY0WdJBqamvX1Wcv09LIq5WaGlZUR0revGKmPnlvsdzQAcXCqFXVGosOgYyLhkH75iXM0rqS7tuw5rMXv7tHtTy5XYX6WLjmzyO94ADxEUSeRcMh0/cRBkqSa2gZ9/Hev6fMPvqH87ON/jKP6d9MfP3e+zMyPmADijKJOUvnZET10/Xl6YH6lDjf8/xX5Nu0+pHnrqrVma41G9uOQPiAVtFvUZpYt6RVJWa37P+Gc+67XwdC+Xl2zdccHRxy3bdeBOp139z/1/MqtFDWQImI56qNO0qXOuTGSzpE02czGe5oKHVaQl6ULBhXo+ZXb/I4CIE7aLWrX4kDrp5HWP1zuLcCmjO6jih0HdMfs5Zr/9k6/4wDopJiOozazsJktlbRD0ovOuUVR9pluZuVmVl5dXR3nmDgdHxzdVwMLcjV7yRbd8tibOljX6HckAJ0QU1E755qcc+dIKpZ0vpmNirLPDOdcmXOurKiIw8X8VJiXpZdve68enT5euw7W6w8LKv2OBKATTuvMROfcXkn/kjTZizCIr3ElPXTZiN763SvrtfdQvd9xAHRQu0VtZkVm1r314xxJl0la63EuxMmtHzhTB+oa9btX1vsdBUAHxXIcdV9JD5lZWC3FPss594y3sRAvI/p21ZVj+mnm/EqVb9gtM9OXLz1DFw1lPAUki3aL2jm3XNLYBGSBR77xgWHaf7hBdY3NWrutRj99YR1FDSQRzkxMAwN65uoP158vSXpwQaXuenq1/rl6uwYVdZEkmaSSnrnKCHMxRSCIKOo089Fzi3XPC+t0wx+Pv7rhjZcM1h1TRrTxVQD8RFGnmfzsiP5y4wS9U33g6LaHF27UnKVV+ublwxUKcSEnIGgo6jQ0qn83jerf7ejnjU1Otz6+THc/t0a3XT5M2ZGwj+kAnIihJHTZyN7KzAjpgfmVenLJZr/jADgBRQ11y4lo8bcvU/fciOat3eF3HAAnYPQBSS2z6w+d3U9PLN6s+1+NfnJMyExTx/RVr/zsBKcD0htFjaM+OLqv/rRoo77/7Jo299my97C+M3VkAlMBoKhx1IQhBVr935PV0Nwc9fH/eOB1rdi8L8GpADCjxnFyMsPqmh2J+mdMcTetrNqnV9+uVkNT9DIHEH8UNWJWVtpTh+qbdN0Dr2vO0iq/4wBpg6JGzK4Y3VfPfeUi5UTCWlnFCARIFI
oaMQuFTCP7ddWZvfP01vYav+MAaYOixmkb1idfC9/ZpbLvv6hL7pmnbftq/Y4EpDSO+sBp+9ykQcrKCGvv4QY9vax
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('pt-2-word-ngrams-log-log', ngrams(get_words(pan_tadeusz), 2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The mysterious language of the Voynich Manuscript\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [Voynich Manuscript](https://pl.wikipedia.org/wiki/Manuskrypt_Wojnicza) is a 15th-century manuscript written in a\n",
"mysterious script that has not been deciphered to this day. The manuscript is\n",
"one of the greatest riddles of history (and of linguistics).\n",
"\n",
"![Source: Wikimedia Commons](./02_Jezyki/voynich135.jpg)\n",
"\n",
"Let us examine the statistical properties of the manuscript's text ourselves. We will use\n",
"the Vnow transcription, in which the individual characters of the mysterious script\n",
"are replaced with Latin letters, digits and an asterisk. How to\n",
"transcribe the manuscript remains a matter of debate, but the choice\n",
"of one transcription system or another should not substantially affect\n",
"the statistical analysis.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'9 OR 9FAM ZO8 QOAR9 Q*R 8ARAM 29 [O82*]OM OPCC9 OP'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
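],
"source": [
"import re\n",
"import requests\n",
"\n",
"voynich_url = 'http://www.voynich.net/reeds/gillogly/voynich.now'\n",
"voynich = requests.get(voynich_url).content.decode('utf-8')\n",
"\n",
"# remove {...} annotations, line-initial <...> markers, and runs of '-', '#' and spaces\n",
"voynich = re.sub(r'\\{[^\\}]+\\}|^<[^>]+>|[-# ]+', '', voynich, flags=re.MULTILINE)\n",
"\n",
"# join the lines of each paragraph; blank lines separate paragraphs\n",
"voynich = voynich.replace('\\n\\n', '#')\n",
"voynich = voynich.replace('\\n', ' ')\n",
"voynich = voynich.replace('#', '\\n')\n",
"\n",
"# dots separate words in the transcription\n",
"voynich = voynich.replace('.', ' ')\n",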
"\n",
"voynich[100:150]"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_8747/6903746.py:14: UserWarning: Glyph 9 (\t) missing from current font.\n",
" plt.savefig(fname)\n"
]
},
{
"data": {
"text/plain": [
"'02_Jezyki/voy-chars.png'"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/lib/python3.10/site-packages/IPython/core/pylabtools.py:151: UserWarning: Glyph 9 (\t) missing from current font.\n",
" fig.canvas.print_figure(bytes_io, **kw)\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAuEAAADCCAYAAADn5xwjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAiMklEQVR4nO3debhkVXmo8fejW5lkpiFIIw2KGuAqSoskxAliRCECBrRJFFQUNSAglyQQvReMIUG9aEgUDAIyqEAHBwjIYBjEAYFmUGgQQWmgBaUZhHZA7ea7f6x1oDhdw+5zTtU5Xf3+nqeeqr1q7bVXzV+t/e21IzORJEmSNDirTHYHJEmSpJWNQbgkSZI0YAbhkiRJ0oAZhEuSJEkDZhAuSZIkDdj0ye7AZNhwww1z1qxZk90NSZIkDbEbbrjhocyc0e6+lTIInzVrFvPmzZvsbkiSJGmIRcQ9ne4zHUWSJEkaMINwSZIkacAMwiVJkqQBMwiXJEmSBswgXJIkSRowg3BJkiRpwFbKKQony6wjL+p6/4LjdhtQTyRJkjSZHAmXJEmSBswgXJIkSRowg3BJkiRpwPoWhEfEZhFxZUTcHhHzI+LQWr5+RHwzIu6s1+u1rHNURNwVEXdExBtayrePiFvqff8eEVHLV42Ic2v5tRExq1+PR5IkSZoo/RwJXwL878z8Y2BH4KCI2Bo4Erg8M7cCLq/L1PvmANsAuwInRsS02tZJwIHAVvWyay0/AHg0M18AfBr4eB8fjyRJkjQh+haEZ+YDmXljvb0YuB3YFNgDOKNWOwPYs97eAzgnM3+XmXcDdwE7RMQmwNqZeU1mJnDmqHVG2joP2GVklFySJEmaqgaSE17TRF4GXAtsnJkPQAnUgY1qtU2B+1pWW1jLNq23R5c/Y53MXAI8BmzQoQ8HRsS8iJi3aNGiCXhUkiRJ0tj0PQiPiOcAXwEOy8zHu1VtU5Zdyruts2xh5smZOTszZ8+YMaNblyVJkqS+6msQHhHPogTgX8rMr9biX9QUE+r1g7V8IbBZy+ozgftr+cw25c9YJyKmA+sAj0z8I5EkSZImTj9nRwngVOD2zPxUy10XAPvX2/sD57eUz6kznmxBOQDzupqysjgidqxt7jdqnZG29gauqHnjkiRJ0pTVz9PW7wS8A7glIm6uZf8IHAfMjYgDgHuBfQAyc35EzAVuo8ysclBmLq3rfQA4HVgduLheoAT5Z0XEXZQR8Dl9fDySJEnShOhbEJ6Z36F9zjbALh3WORY4tk35PGDbNuVPUIN4SZIkaUXhGTMlSZKkATMIlyRJkgbMIFySJEkaMINwSZIkacAMwiVJkqQBMwiXJEmSBswgXJIkSRowg3BJkiRpwAzCJUmSpAEzCJckSZIGzCBckiRJGjCDcEmSJGnADMIlSZKkATMIlyRJkgbMIFySJEkaMINwSZIkacAMwiVJkqQBMwiXJEmSBswgXJIkSRowg3BJkiRpwAzCJUmSpAEzCJckSZIGbHqTShGxE3AMsHldJ4DMzC371zVJkiRpOHUMwiNid+CmzPwZcCrwIeAGYOmA+iZJkiQNpW4j4T8G/jMiPgY8lpkXD6hPkiRJ0lDrGIRn5o8jYg/gBcCVEfFJ4KvA71rq3Nj/LkqSJEnDpWtOeGYuBe6IiFfWotmtdwM796tjkiRJ0rBqdGBmZr6u3x2RJEmSVhaNpiiMiI0j4tSIuLgubx0RBzRY77SIeDAibm0pOyYifhYRN9fLm1ruOyoi7oqIOyLiDS3l20fELfW+f4+IqOWrRsS5tfzaiJi1HI9dkiRJmhRN5wk/HbgUeG5d/jFwWMP1dm1T/unM3K5evgElsAfmANvUdU6MiGm1/knAgcBW9TLS5gHAo5n5AuDTwMcbPh5JkiRp0jQNwjfMzLnAkwCZuYQGUxVm5tXAIw23sQdwTmb+LjPvBu4CdoiITYC1M/OazEzgTGDPlnXOqLfPA3YZGSWXJEmSpq
qmQfivI2IDysGYRMSOwGPj2O7BEfHDmq6yXi3bFLivpc7CWrZpvT26/Bnr1D8GjwEbtNtgRBwYEfMiYt6iRYvG0XVJkiRpfJoG4YcDFwDPj4jvUkajPzjGbZ4EPB/YDngAOL6WtxvBzi7l3dZZtjDz5MycnZmzZ8yYsVwdliRJkiZS09lRboyI1wAvogS+d2TmH8aywcz8xcjtiPg8cGFdXAhs1lJ1JnB/LZ/Zprx1nYURMR1Yh+bpL5IkSdKk6DoSHhE71+u3AG+mBOEvBP6yli23muM9Yi9gZOaUC4A5dcaTLSgHYF6XmQ8AiyNix5rvvR9wfss6+9fbewNX1LxxSZIkacrqNRL+GuAK4C/b3JeUM2h2FBFnA68FNoyIhcDRwGsjYru6/gLgfQCZOT8i5gK3AUuAg+rJggA+QJlpZXXg4noBOBU4KyLuooyAz+nxeCRJkqRJ1+uMmUfX63eNpfHM3LdN8ald6h8LHNumfB6wbZvyJ4B9xtI3SZIkabI0PVnPBvUkOTdGxA0RcUKdLUWSJEnScmo6O8o5wCLgryi514uAc/vVKUmSJGmYNZodBVg/Mz/WsvzPEbFnH/ojSZIkDb2mQfiVETEHmFuX9wYu6k+XBDDryO5P74LjdhtQTyRJkjTRmqajvA/4MvD7ejkHODwiFkfE4/3qnCRJkjSMmp6sZ61+d0SSJElaWTRNRyEi1qOcQGe1kbLMvDoi/jYzT+xH5yRJkqRh1CgIj4j3AIdSThl/M7AjcE1E3AbMBgzCJUmSpIaa5oQfCrwCuCczXwe8jDJN4T8Ce/ana5IkSdJwapqO8kRmPhERRMSqmfmjiHhRZj4OeGCmJEmStByaBuELI2Jd4OvANyPiUeD+fnVKkiRJGmZNZ0fZq948JiKuBNYBLu5bryRJkqQh1ignPCLOGrmdmd/KzAuA0/rWK0mSJGmINT0wc5vWhYiYBmw/8d2RJEmShl/XIDwijoqIxcBLIuLxelkMPAicP5AeSpIkSUOmaxCemf9az5b5ycxcu17WyswNMvOoAfVRkiRJGipN01EujIg1ASLi7RHxqYjYvI/9kiRJkoZW0yD8JOA3EfFS4O+Be4Az+9YrSZIkaYg1DcKXZGYCewAnZOYJwFr965YkSZI0vJqerGdxRBwFvB14dZ0d5Vn965YkSZI0vJqOhL8N+B1wQGb+HNgU+GTfeiVJkiQNsaYj4XsDX8jMRwEy817MCZckSZLGpOlI+B8B10fE3IjYNSKin52SJEmShlmjIDwzPwJsBZwKvBO4MyL+JSKe38e+SZIkSUOp6Ug4dXaUn9fLEmA94LyI+ESf+iZJkiQNpUY54RFxCLA/8BBwCvB3mfmHiFgFuJMyd7gkSZKkBpoemLkh8JbMvKe1MDOfjIjdJ75bkiRJ0vBqmo4SwAtHTl3fKjNvn9guSZIkScOt6Uj43cC+wL9HxGLg28DVmXl+t5Ui4jRgd+DBzNy2lq0PnAvMAhYAbx2Z+rCeEOgAYClwSGZeWsu3B04HVge+ARyamRkRq1KmStweeBh4W2YuaPiYhsKsIy/qev+C43YbU11JkiT1T9PZUU7LzHcDrwO+COxTr3s5Hdh1VNmRwOWZuRVweV0mIrYG5gDb1HVOrGfmBDgJOJAyQ8tWLW0eADyamS8APg18vMnjkSRJkiZToyA8Ik6JiO9RguHplJP3rNdrvcy8GnhkVPEewBn19hnAni3l52Tm7zLzbuAuYIeI2ARYOzOvqTO0nDlqnZG2zgN2cQ5zSZIkTXVNc8I3AKYBv6QE1Q9l5pIxbnPjzHwAoF5vVMs3Be5rqbewlm1ab48uf8Y6tT+P1b4uIyIOjIh5ETFv0aJFY+y6JEmSNH5N01H2ysxXAp8A1gWujIiF3ddabu1GsLNLebd1li3MPDkzZ2fm7BkzZoyxi5IkSdL4NZ0nfHfgVcCrKWkoV1AOzhyLX0TEJpn5QE01ebCWLwQ2a6k3E7i/ls9sU966zsKImA
6sw7LpL5IkSdKU0jQd5Y3AjcBfZeaLM/NdmXnaGLd5AeXEP9Tr81vK50TEqhGxBeUAzOtqysriiNix5nvvN2qdkbb
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('voy-chars', get_characters(voynich))"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/voy-log-log.png'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcd0lEQVR4nO3deXzV5Z328c/3LNkXloRIwhLZRHYwsio4SitVtFaty9NS2zql9bFUHVttp9NOp+3Y2lZKndqp1O1x7ShaOypQcCkoKBr2fd8iwSQgBAhkvZ8/EqggISckJ7/fybnerxcvknNOjpcHcnHnPvfvvs05h4iI+FfA6wAiInJmKmoREZ9TUYuI+JyKWkTE51TUIiI+F4rGk2ZlZbn8/PxoPLWISLu0dOnSMudc9unui0pR5+fnU1hYGI2nFhFpl8xsZ2P3aepDRMTnVNQiIj6nohYR8TkVtYiIz6moRUR8TkUtIuJzKmoREZ/zVVE/+MZmVhcd9DqGiIiv+KaoD1RU8dz7u/jCHxbx+zc3U1unfbJFRMBHRd0hJYE5d1zMpEHn8Jt5m7jx4XfZvb/C61giIp7zTVFDfVn/183DmXHjMDbuPcSkGQt5vnA3OoVGROKZr4oawMy4Zngec+68mEF5mdwzaxXfenop+49UeR1NRMQTvivq47p1TOHZb4zmB5/rz5sbSrh8xkL+vrHE61giIm3Ot0UNEAwY35zQm7/efhEdU8J89fEP+PFf13C0qtbraCIibcbXRX3cgNwM/vfbF/H1cefy5Ls7mfxfb7PmQy3jE5H4EFFRm1kHM5tlZhvMbL2ZjYl2sFMlhYP8+KoBPH3rKI5U1nLNQ4t46K0tWsYnIu1epCPq3wFznXP9gaHA+uhFOrOL+mYx986LuXzQOfz6bxu1jE9E2r0mi9rMMoDxwKMAzrkq59yBKOc6ow4pCfz+5uH89sahbNx7iM/97m1e0DI+EWmnIhlR9wJKgcfNbLmZPWJmqac+yMymmlmhmRWWlpa2etDT/Pf4wvBuzLnzYgbmZvC9Wau47ellWsYnIu1OJEUdAkYA/+2cGw4cAb5/6oOcczOdcwXOuYLs7NOezxgVn1zG98aGj7h8xkIWbIr+PxQiIm0lkqIuAoqcc0saPp9FfXH7xvFlfC/fPo6OKWFueex9/l3L+ESknWiyqJ1ze4HdZnZew02XAeuimuosDczNPLGM7/9pGZ+ItBORrvqYBjxjZquAYcB9UUvUQlrGJyLtjUVjpURBQYErLCxs9edtrgMVVfzw5TW8tqqYC/M7Mv2GYXTvlOJ1LBGRTzGzpc65gtPdFxNXJp6tTy7j21Bcv4zvqXd3UHqo0utoIiIRa9cj6k8q+riCu59fyZLt+wE4LyedsX06M653FqN6dSI9KexxQhGJZ2caUcdNUQPU1TlWf3iQxVv3sXhrGe9v309lTR3BgDGkWyZje9cX94ieHUkKB72OKyJxREXdiMqaWpbtPMDirWUs2lLGyqKD1NY5EkMBCvI7MrZ3FmN7d2ZwXiahYLueJRIRj6moI3ToWDXvb9/Poi31I+4New8BkJ4YYuS5nejROYWcjCS6pCfSJT2JnIz63zOSQ5iZx+lFJJadqahDbR3Gz9KTwlx2fg6XnZ8DQNnhShZv3ce7W8v4YMfHvLdtH0dOcxFNYihAl4xEctKTyMlIok+XNAbmZjAwL5PczCSVuIi0iEbUzXS4soaS8mOUHKrko/JjlB6qPPFxSXkle8uPsXPfEY4v2+6YEmZgbiYDczMYkJvB0G4dyM/61FYpIhLnNKJuRWmJIdKy0+iVndboY45W1bJ+bzlr95Sz9sODrN1TzuOLdlBVWwfAT64awFfHndtWkUUkxqmooyA5IciIHh0Z0aPjiduqa+vYUnKYB+Zt5GevradfTjpj+2R5mFJEYoWWMrSRcDDA+V0zmHHTcHplpXL7s8t04IGIRERF3cbSEkPM/EoBNXWOqU8t1Q
5/ItIkFbUHzs1K5cGbh7Nhbzn3vLhKJ9OIyBmpqD3yT+d14XuXn8crK/cwc+E2r+OIiI+pqD1024TeXDm4K/fP3aBTaUSkUSpqD5kZv/7iEPrlpDPt2WXsKDvidSQR8SEVtcdSEkL86SsFBALGFQ++zXdfWMmSbfs0by0iJ2gdtQ9075TCC98cw2OLtvPKymJmLS0iv3MK11/QjauH5pGZfPIWrIEAJIeD2ihKJE7oEnKfqaiqYe6avTxfuJv3tu0/42MTggGSE4KkJARJTgiSHK7/uEenVCae34WL+2WTlqh/i0VigXbPi1G79lWwYFMJ1bUn/xnV1jmOVtdSUVXL0aqa+t+razlaVX/b+r3lHKioJiEY4IKeHcnPSqFrZjLnZCSRmRImPSlEz86p5HVI9uj/TEROpb0+YlSPzilMGZPf7K+rqa1j6c6PeWNDCe9t28f8dSWUHT75+LFQwLj/uiFcd0G3VkorItGiom6HQsEAo3p1ZlSvziduq6yppaS8kvJj1Rw8Ws1Db23h7hdWUna4kqnje2krVhEfU1HHicRQ8KQT2C/o2ZG7n1/JL+ZsoOxwJf96xfkqaxGfiqiozWwHcAioBWoam0eR2JEYCvLgTcPpnJrAn97ezrHqOv7j6oEEAiprEb9pzoj6n5xzZVFLIm0uEDB+cvVAkhKCPLxgG5U1tfzy2iEqaxGf0dRHnDMzvj+pf/0I+43NZCSF+bfJA7yOJSKfEGlRO2CemTngYefczFMfYGZTgakAPXr0aL2EEnVmxl0T+1J+tJpH3tlOTkYS3xjfy+tYItIgonXUZpbrnNtjZl2A+cA059zCxh6vddSxqa7OMe255by2uphLzsvmMwNyyO2QTMCMgp4dSdXFMyJR0+J11M65PQ2/l5jZX4CRQKNFLbEpEDCm3ziUnp1TeHVVMT/8y5oT9/Xtksajt1xIj84pZ3gGEYmGJkfUZpYKBJxzhxo+ng/81Dk3t7Gv0Yg69jnn2FZ2hPKj1RQfPMYPXlpNMGBcPTSXvA7JhIJGh5Qw4/tm0zkt0eu4IjGvpSPqHOAvDWtsQ8CzZyppaR/MjN4NJ60PB/qfk869L67ihcLdHPnE8WFmMKJHRz47IIebLuxBZkq4kWcUkbOlvT6kWZxzlB+rwTlH0cdHeX39R7y+/iPWfFhOemKIX1w3mMlDcr2OKRJztCmTRN26PeXc9T8rqKqt4827J+gqR5FmOlNRa0NjaRUDcjO4ZWw+28uOsK643Os4Iu2KilpazeUDcwgGjNmri72OItKuqKil1XROS2R0r068uqqYLSWHvY4j0m6oqKVV3XRhD3buq2Di9AU89s52r+OItAsqamlVVw3N5Y27JzAyvxN/XLCVqpo6ryOJxDwVtbS63tlp3HZJb0oOVTJnjearRVpKRS1RMaFfNr2yUvnNvI1s+uiQ13FEYpqKWqIiEDB+/cUhHK2q44rfvc2ND7/L9rIjXscSiUkqaomaC3p2YvZ3LuIb43uxYe8hvvPccqprNWct0lwqaomqLhlJ3DupP7+8djCrPzzIF/6wiJkLt7J7f4XX0URihi4hlzbzzJKdPPf+LtZ8WH/l4pBumYzp1Zl+OelcOyJPl51LXGvxftQireFLo3rypVE92bWvgtlripm9uphH39lOTZ2j7HAl35zQ2+uIIr6kEbV4qq7OMe3Py3ltVTFfGtWDaZf25ZzMJK9jibQ5jajFtwIBY8aNw8hJT+KJxdt5vnA3Vw3J5Z8v7sWA3Ayv44n4gkbU4hu79lXw2KL6sq6oqiUhGCAzJcyjtxQwpFsHr+OJRJX2o5aYcrCimpeWF1FyqJIXlxbRISXMH798Ab0aTpwRaY9U1BKzFmwq5fZnlnG0upbBeZn8/JpBDMrL9DqWSKvTwQESsyb0y+at717CV8fms2L3Af7t5TXa6EnijopafC87PZEfTR7AA18cyordB+j/oznc+eflHKyo9jqaSJvQqg+JGddd0I2s9ERmry
rmxWVFrNh9gNe+czGpifprLO2bRtQSUyb0y+b+64fw5NdHsnN/Bb+Zt9HrSCJRp6KWmDS2TxZTRvfkicU7WLbrY6/
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('voy-log-log', get_words(voynich))"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/voy-words-20.png'"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAtQAAADCCAYAAABpPVVfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAfDUlEQVR4nO3debwkZX3v8c+XYUdQloEggw4aNIKJqHNxwQXUG1CI4AomGiRETIKKZjGQ5EZzDQnRmJcmBq9c0aBRcRI1okYUR1BiiDgo24BsgjiCMO5oFAV/+aOeI83hLHVOnz4L5/N+vfrV3U9X1fOrp6uqf131VFWqCkmSJEmzs9lCByBJkiQtZSbUkiRJ0hBMqCVJkqQhmFBLkiRJQzChliRJkoZgQi1JkiQNYfOFDmBYu+yyS61evXqhw5AkSdK92EUXXfTNqlo50WdLPqFevXo169evX+gwJEmSdC+W5KuTfWaXD0mSJGkIJtSSJEnSEEyoJUmSpCGYUEuSJElDMKGWJEmShrDkr/KxUFaf+LGR13HDKYeOvA5JkiQNxz3UkiRJ0hBMqCVJkqQhmFBLkiRJQzChliRJkoZgQi1JkiQNwYRakiRJGoIJtSRJkjQEE2pJkiRpCCbUkiRJ0hBMqCVJkqQhmFBLkiRJQxh5Qp3khiSXJbk4yfpWtlOSc5Jc0553HBj+pCTXJrkqycGjjk+SJEkaxnztoT6oqvarqjXt/YnAuqraG1jX3pNkH+AoYF/gEODUJCvmKUZJkiRpxhaqy8fhwBnt9RnAEQPlZ1bV7VV1PXAtsP/8hydJkiT1Mx8JdQGfTHJRkuNa2W5VdTNAe961le8BfG1g3I2t7G6SHJdkfZL1mzZtGmHokiRJ0tQ2n4c6Dqiqm5LsCpyT5MtTDJsJyuoeBVWnAacBrFmz5h6fS5IkSfNl5Huoq+qm9nwr8CG6Lhy3JNkdoD3f2gbfCOw5MPoq4KZRxyhJkiTN1kgT6iTbJdl+7DXwq8DlwFnA0W2wo4EPt9dnAUcl2SrJXsDewIWjjFGSJEkaxqi7fOwGfCjJWF3vraqzk3wBWJvkWOBG4HkAVbUhyVrgCuAO4PiqunPEMUqSJEmzNtKEuqq+AjxigvJvAU+dZJyTgZNHGZckSZI0V7xToiRJkjQEE2pJkiRpCCbUkiRJ0hBMqCVJkqQhmFBLkiRJQzChliRJkoZgQi1JkiQNwYRakiRJGoIJtSRJkjQEE2pJkiRpCCbUkiRJ0hBMqCVJkqQhmFBLkiRJQzChliRJkoaweZ+BkhwAvBZ4YBsnQFXVg0YXmiRJkrT4TZpQJzkM+FJVfR04HXgVcBFw5zzFJkmSJC16U+2hvhp4W5LXAd+rqo/PU0ySJEnSkjFpQl1VVyc5HPhF4NwkbwA+CNw+MMwXRx+iJEmStHhN2Ye6qu4ErkrymFa0ZvBj4CmjCkySJElaCnqdlFhVB822giQrgPXA16vqsCQ7Ae8HVgM3AM+vqu+0YU8CjqXrp/2KqvrEbOuVJEmS5kOvy+Yl2S3J6Uk+3t7vk+TYnnWcAFw58P5EYF1V7Q2sa+9Jsg9wFLAvcAhwakvGJUmSpEWr73Wo/wn4BHD/9v5q4JXTjZRkFXAo8PaB4sOBM9rrM4AjBsrPrKrbq+p64Fpg/57xSZIkSQuib0K9S1WtBX4GUFV30O/yeW8CXj02XrNbVd3cpnMzsGsr3wP42sBwG1vZPSQ5Lsn6JOs3bdrUcxYkSZKkudc3of5hkp3pTkQkyWOB7001QruO9a1VdVHPOjJBWU00YFWdVlVrqmrNypUre05ekiRJmnu9TkoEfh84C3hwks8BK4HnTjPOAcAzkzwD2BrYIck/A7ck2b2qbk6yO3BrG34jsOfA+KuAm3rGJ0mSJC2IXnuo2/Wmnww8HngpsG9VXTrNOCdV1aqqWk13suGnq+qFdIn50W2wo4EPt9dnAUcl2SrJXsDewIUznB9JkiRpXk25hz
rJU6rq00mePe6jhyShqj44izpPAda2q4TcCDwPoKo2JFkLXAHcARzfroMtSZIkLVrTdfl4MvBp4Ncm+Kzo7pw4rao6Dzivvf4W8NRJhjsZOLnPNCVJkqTFYLo7Jb6mPR8zP+FIkiRJS0vfG7vsnOTvk3wxyUVJ3tyu+iFJkiQta30vm3cmsAl4Dt3VPTbR3T5ckiRJWtb6XjZvp6p63cD7v0xyxAjikSRJkpaUvnuoz01yVJLN2uP5wMdGGZgkSZK0FPRNqF8KvBf4SXucCfx+ktuSfH9UwUmSJEmLXa8uH1W1/agDkSRJkpaivn2oSbIj3d0Ltx4rq6rPJvm9qjp1FMFJkiRJi12vhDrJbwMnAKuAi4HHAhckuQJYA5hQS5IkaVnq24f6BOB/AV+tqoOAR9JdOu9PgCNGE5okSZK0+PXt8vHjqvpxEpJsVVVfTvLQqvo+4EmJkiRJWrb6JtQbk9wP+DfgnCTfAW4aVVCSJEnSUtH3Kh/Pai9fm+Rc4L7Ax0cWlSRJkrRE9OpDneTdY6+r6jNVdRbwjpFFJUmSJC0RfU9K3HfwTZIVwKPnPhxJkiRpaZkyoU5yUpLbgF9J8v32uA24FfjwvEQoSZIkLWJTJtRV9dftLolvqKod2mP7qtq5qk6apxglSZKkRavvVT4+mmS7qvphkhcCjwLeXFVfHWFsmsTqEz828jpuOOXQkdchSZJ0b9C3D/Vbgf9O8gjg1cBXgXeNLCpJkiRpieibUN9RVQUcTrdn+s3A9lONkGTrJBcmuSTJhiR/0cp3SnJOkmva844D45yU5NokVyU5eLYzJUmSJM2Xvgn1bUlOAl4IfKxd5WOLaca5HXhKVT0C2A84JMljgROBdVW1N7CuvSfJPsBRdFcUOQQ4tdUjSZIkLVp9E+oj6RLkY6vqG8AewBumGqE6P2hvt2iPsb3cZ7TyM4Aj2uvDgTOr6vaquh64Fti/Z3ySJEnSguibUD8XeGdVnQ9QVTdW1bR9qJOsSHIx3WX2zqmqzwO7VdXNbTo3A7u2wfcAvjYw+sZWJkmSJC1afRPqXwC+kGRtkkOSpM9IVXVnVe0HrAL2T/LwKQafaJo14YDJcUnWJ1m/adOmPqFIkiRJI9Eroa6qPwP2Bk4HXgxck+Svkjy45/jfBc6j6xt9S5LdAdrzrW2wjcCeA6OtAm6aZHqnVdWaqlqzcuXKPiFIkiRJI9F3DzXtKh/faI87gB2Bf03y+omGT7Iyyf3a622ApwFfBs4Cjm6DHc1dd1w8CzgqyVZJ9qJL4C+c6QxJkiRJ86nXjV2SvIIu+f0m8Hbgj6rqp0k2A66huzb1eLsDZ7QrdWwGrK2qjya5AFib5FjgRuB5AFW1Icla4Aq6hP34qrpzuNmTJEmSRqvvnRJ3AZ49/s6IVfWzJIdNNEJVXQo8coLybwFPnWSck4GTe8akBTDquzR6h0ZJkrTU9O3yEeAhSbYb/0FVXTm3IUmSJElLR9+E+nrgBcD6dvfDNyY5fIRxSZIkSUtCry4fVfUO4B1JfgF4PvCHwHFMc/txaa6MuqsJ2N1EkiTNTt+TEt8O7APcApxPd6OXL44wLkmSJGlJ6NvlY2dgBfBd4NvAN6vqjlEFJUmSJC0Vfbt8PAsgycOAg4Fzk6yoqlWjDE5aDOxuIkmSptK3y8dhwBOBJ9Hd0OXTdF0/JEmSpGWt73Wonw58FnhzVU14O3BJkiRpOerb5eP4UQciSZIkLUV9T0qUJEmSNAETakmSJGkIJtSSJEnSEPpe5WNv4K/pbu6y9Vh5VT1oRHFJkiRJS0LfPdTvBN4K3AEcBLwLePeogpIkSZKWir4J9TZVtQ5IVX21ql4LPGV0YUmSJElLQ9/rUP84yWbANUleBnwd2HV0YUmSJElLQ9891K8EtgVeATwaeBFw9IhikiRJkpaMvjd2+QJA20v9iqq6baRRSZIkSUtErz3USdYkuQy4FLgsySVJHj3a0CRJkqTFr2+Xj3cAv1dVq6
tqNXA83ZU/JpVkzyTnJrkyyYYkJ7TynZKck+Sa9rzjwDgnJbk2yVVJDp7lPEmSJEnzpm9CfVtVnT/2pqr+A5iu28c
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('voy-words-20', get_words(voynich), top=20)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/voy-words-log-log.png'"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAWoAAAD4CAYAAADFAawfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAcd0lEQVR4nO3deXzV5Z328c/3LNkXloRIwhLZRHYwsio4SitVtFaty9NS2zql9bFUHVttp9NOp+3Y2lZKndqp1O1x7ShaOypQcCkoKBr2fd8iwSQgBAhkvZ8/EqggISckJ7/fybnerxcvknNOjpcHcnHnPvfvvs05h4iI+FfA6wAiInJmKmoREZ9TUYuI+JyKWkTE51TUIiI+F4rGk2ZlZbn8/PxoPLWISLu0dOnSMudc9unui0pR5+fnU1hYGI2nFhFpl8xsZ2P3aepDRMTnVNQiIj6nohYR8TkVtYiIz6moRUR8TkUtIuJzKmoREZ/zVVE/+MZmVhcd9DqGiIiv+KaoD1RU8dz7u/jCHxbx+zc3U1unfbJFRMBHRd0hJYE5d1zMpEHn8Jt5m7jx4XfZvb/C61giIp7zTVFDfVn/183DmXHjMDbuPcSkGQt5vnA3OoVGROKZr4oawMy4Zngec+68mEF5mdwzaxXfenop+49UeR1NRMQTvivq47p1TOHZb4zmB5/rz5sbSrh8xkL+vrHE61giIm3Ot0UNEAwY35zQm7/efhEdU8J89fEP+PFf13C0qtbraCIibcbXRX3cgNwM/vfbF/H1cefy5Ls7mfxfb7PmQy3jE5H4EFFRm1kHM5tlZhvMbL2ZjYl2sFMlhYP8+KoBPH3rKI5U1nLNQ4t46K0tWsYnIu1epCPq3wFznXP9gaHA+uhFOrOL+mYx986LuXzQOfz6bxu1jE9E2r0mi9rMMoDxwKMAzrkq59yBKOc6ow4pCfz+5uH89sahbNx7iM/97m1e0DI+EWmnIhlR9wJKgcfNbLmZPWJmqac+yMymmlmhmRWWlpa2etDT/Pf4wvBuzLnzYgbmZvC9Wau47ellWsYnIu1OJEUdAkYA/+2cGw4cAb5/6oOcczOdcwXOuYLs7NOezxgVn1zG98aGj7h8xkIWbIr+PxQiIm0lkqIuAoqcc0saPp9FfXH7xvFlfC/fPo6OKWFueex9/l3L+ESknWiyqJ1ze4HdZnZew02XAeuimuosDczNPLGM7/9pGZ+ItBORrvqYBjxjZquAYcB9UUvUQlrGJyLtjUVjpURBQYErLCxs9edtrgMVVfzw5TW8tqqYC/M7Mv2GYXTvlOJ1LBGRTzGzpc65gtPdFxNXJp6tTy7j21Bcv4zvqXd3UHqo0utoIiIRa9cj6k8q+riCu59fyZLt+wE4LyedsX06M653FqN6dSI9KexxQhGJZ2caUcdNUQPU1TlWf3iQxVv3sXhrGe9v309lTR3BgDGkWyZje9cX94ieHUkKB72OKyJxREXdiMqaWpbtPMDirWUs2lLGyqKD1NY5EkMBCvI7MrZ3FmN7d2ZwXiahYLueJRIRj6moI3ToWDXvb9/Poi31I+4New8BkJ4YYuS5nejROYWcjCS6pCfSJT2JnIz63zOSQ5iZx+lFJJadqahDbR3Gz9KTwlx2fg6XnZ8DQNnhShZv3ce7W8v4YMfHvLdtH0dOcxFNYihAl4xEctKTyMlIok+XNAbmZjAwL5PczCSVuIi0iEbUzXS4soaS8mOUHKrko/JjlB6qPPFxSXkle8uPsXPfEY4v2+6YEmZgbiYDczMYkJvB0G4dyM/61FYpIhLnNKJuRWmJIdKy0+iVndboY45W1bJ+bzlr95Sz9sODrN1TzuOLdlBVWwfAT64awFfHndtWkUUkxqmooyA5IciIHh0Z0aPjiduqa+vYUnKYB+Zt5GevradfTjpj+2R5mFJEYoWWMrSRcDDA+V0zmHHTcHplpXL7s8t04IGIRERF3cbSEkPM/EoBNXWOqU8t1Q
5/ItIkFbUHzs1K5cGbh7Nhbzn3vLhKJ9OIyBmpqD3yT+d14XuXn8crK/cwc+E2r+OIiI+pqD1024TeXDm4K/fP3aBTaUSkUSpqD5kZv/7iEPrlpDPt2WXsKDvidSQR8SEVtcdSEkL86SsFBALGFQ++zXdfWMmSbfs0by0iJ2gdtQ9075TCC98cw2OLtvPKymJmLS0iv3MK11/QjauH5pGZfPIWrIEAJIeD2ihKJE7oEnKfqaiqYe6avTxfuJv3tu0/42MTggGSE4KkJARJTgiSHK7/uEenVCae34WL+2WTlqh/i0VigXbPi1G79lWwYFMJ1bUn/xnV1jmOVtdSUVXL0aqa+t+razlaVX/b+r3lHKioJiEY4IKeHcnPSqFrZjLnZCSRmRImPSlEz86p5HVI9uj/TEROpb0+YlSPzilMGZPf7K+rqa1j6c6PeWNDCe9t28f8dSWUHT75+LFQwLj/uiFcd0G3VkorItGiom6HQsEAo3p1ZlSvziduq6yppaS8kvJj1Rw8Ws1Db23h7hdWUna4kqnje2krVhEfU1HHicRQ8KQT2C/o2ZG7n1/JL+ZsoOxwJf96xfkqaxGfiqiozWwHcAioBWoam0eR2JEYCvLgTcPpnJrAn97ezrHqOv7j6oEEAiprEb9pzoj6n5xzZVFLIm0uEDB+cvVAkhKCPLxgG5U1tfzy2iEqaxGf0dRHnDMzvj+pf/0I+43NZCSF+bfJA7yOJSKfEGlRO2CemTngYefczFMfYGZTgakAPXr0aL2EEnVmxl0T+1J+tJpH3tlOTkYS3xjfy+tYItIgonXUZpbrnNtjZl2A+cA059zCxh6vddSxqa7OMe255by2uphLzsvmMwNyyO2QTMCMgp4dSdXFMyJR0+J11M65PQ2/l5jZX4CRQKNFLbEpEDCm3ziUnp1TeHVVMT/8y5oT9/Xtksajt1xIj84pZ3gGEYmGJkfUZpYKBJxzhxo+ng/81Dk3t7Gv0Yg69jnn2FZ2hPKj1RQfPMYPXlpNMGBcPTSXvA7JhIJGh5Qw4/tm0zkt0eu4IjGvpSPqHOAvDWtsQ8CzZyppaR/MjN4NJ60PB/qfk869L67ihcLdHPnE8WFmMKJHRz47IIebLuxBZkq4kWcUkbOlvT6kWZxzlB+rwTlH0cdHeX39R7y+/iPWfFhOemKIX1w3mMlDcr2OKRJztCmTRN26PeXc9T8rqKqt4827J+gqR5FmOlNRa0NjaRUDcjO4ZWw+28uOsK643Os4Iu2KilpazeUDcwgGjNmri72OItKuqKil1XROS2R0r068uqqYLSWHvY4j0m6oqKVV3XRhD3buq2Di9AU89s52r+OItAsqamlVVw3N5Y27JzAyvxN/XLCVqpo6ryOJxDwVtbS63tlp3HZJb0oOVTJnjearRVpKRS1RMaFfNr2yUvnNvI1s+uiQ13FEYpqKWqIiEDB+/cUhHK2q44rfvc2ND7/L9rIjXscSiUkqaomaC3p2YvZ3LuIb43uxYe8hvvPccqprNWct0lwqaomqLhlJ3DupP7+8djCrPzzIF/6wiJkLt7J7f4XX0URihi4hlzbzzJKdPPf+LtZ8WH/l4pBumYzp1Zl+OelcOyJPl51LXGvxftQireFLo3rypVE92bWvgtlripm9uphH39lOTZ2j7HAl35zQ2+uIIr6kEbV4qq7OMe3Py3ltVTFfGtWDaZf25ZzMJK9jibQ5jajFtwIBY8aNw8hJT+KJxdt5vnA3Vw3J5Z8v7sWA3Ayv44n4gkbU4hu79lXw2KL6sq6oqiUhGCAzJcyjtxQwpFsHr+OJRJX2o5aYcrCimpeWF1FyqJIXlxbRISXMH798Ab0aTpwRaY9U1BKzFmwq5fZnlnG0upbBeZn8/JpBDMrL9DqWSKvTwQESsyb0y+at717CV8fms2L3Af7t5TXa6EnijopafC87PZEfTR7AA18cyordB+j/oznc+eflHKyo9jqaSJvQqg+JGddd0I2s9ERmry
rmxWVFrNh9gNe+czGpifprLO2bRtQSUyb0y+b+64fw5NdHsnN/Bb+Zt9HrSCJRp6KWmDS2TxZTRvfkicU7WLbrY6/
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('voy-words-log-log', get_words(voynich))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The language of DNA\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The genetic code has properties surprisingly similar to those of natural languages.\n",
"Above all, it is discrete: a genotype is a sequence of symbols drawn from a finite alphabet.\n",
"There are only four basic letters, one for each nucleotide that makes up a DNA strand:\n",
"a, g, c, t (adenine, guanine, cytosine, thymine).\n",
"\n"
]
},
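{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a small illustration of the point above, the sketch below treats a DNA fragment as a string\n",
"over a four-letter alphabet and counts its symbols. The fragment `toy_dna` and the helper\n",
"`nucleotide_counts` are made up for this example and are not part of the original notebook.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"# Hypothetical toy fragment, not taken from the genome used in this notebook.\n",
"toy_dna = 'TATAACCCTAACCCTAACCC'\n",
"\n",
"def nucleotide_counts(sequence):\n",
"    # Count how often each symbol occurs in a DNA string.\n",
"    return Counter(sequence)\n",
"\n",
"counts = nucleotide_counts(toy_dna)\n",
"# The alphabet is finite: only the four letters A, C, G, T can appear.\n",
"assert set(counts) <= set('ACGT')\n",
"counts.most_common()"
]
},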
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'TATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
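],
"source": [
"import requests\n",
"\n",
"dna_url = 'https://raw.githubusercontent.com/egreen18/NanO_GEM/master/rawGenome.txt'\n",
"dna = requests.get(dna_url).content.decode('utf-8')\n",
"\n",
"# Drop the first line (treated here as a header) and join the rest into one string.\n",
"dna = ''.join(dna.split('\\n')[1:])\n",
"# 'N' marks an unknown nucleotide; mapping it to 'A' is a simplification\n",
"# that keeps the alphabet at four letters but slightly inflates the count of 'A'.\n",
"dna = dna.replace('N', 'A')\n",
"\n",
"# Preview the first 100 nucleotides.\n",
"dna[0:100]"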
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/dna-chars.png'"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAucAAADCCAYAAADq+WxkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAVSUlEQVR4nO3df+xd9X3f8ecLkxDaQILBocxmMQ1OW0BpWlyDlK4ZsIGrsBkaaM2UxWvduKNUIeu0FKJJZGFs0KhlyVRIWXAxdK2xaFZQEkotfix0QoChpMxQYi8EcCDgxA5xN0Fi8t4f9/Md11++/vpgfH3PfJ8P6eic8z7n8/m+r3Vlve9Hn/M5qSokSZIkjd9B405AkiRJ0oDFuSRJktQTFueSJElST1icS5IkST1hcS5JkiT1hMW5JEmS1BMHjzuBPjnqqKNq4cKF405DkiRJB7iHHnro21U1b3rc4nzIwoUL2bBhw7jTkCRJ0gEuyVMzxZ3WIkmSJPWExbkkSZLUExbnkiRJUk9YnEuSJEk9YXEuSZIk9YSrtfTAwku+NO4UNGbfuPID405BkiT1gMW5JH8gTjh/HEpSfzitRZIkSeoJi3NJkiSpJyzOJUmSpJ6wOJckSZJ6wuJckiRJ6gmLc0mSJKknLM4lSZKknrA4lyRJknrC4lySJEnqCYtzSZIkqScsziVJkqSesDiXJEmSesLiXJIkSeqJkRfnSeYk+eskX2znc5OsT7Kp7Y8YuvfSJJuTPJHkrKH4yUkebdc+myQtfkiSm1v8/iQLh9qsaH9jU5IVo/6ckiRJ0hu1P0bOLwYeHzq/BLizqhYBd7ZzkpwALAdOBJYC1ySZ09pcC6wCFrVtaYuvBLZX1fHA1cBVra+5wGXAKcAS4LLhHwGSJElSH420OE+yAPgA8Pmh8DJgTTteA5wzFF9bVS9X1ZPAZmBJkmOAw6vqvqoq4MZpbab6ugU4o42qnwWsr6ptVbUdWM+rBb0kSZLUS6MeOf9PwMeBHw7Fjq6q5wDa/h0tPh94Zui+LS02vx1Pj+/Spqp2Ai8CR87S12skWZVkQ5INW7dufZ0fT5IkSdp3RlacJzkbeKGqHuraZIZYzRLf2za7Bquuq6rFVbV43rx5nRKVJEmSRmGUI+fvA/5pkm8Aa4HTk/wx8HybqkLbv9Du3wIcO9R+AfBsiy+YIb5LmyQHA28Dts3SlyRJktRbIyvOq+rSqlpQVQsZPOh5V1V9CLgNmFo9ZQVwazu+DVjeVmA5jsGDnw+0qS87kpza5pN/eFqbqb7Oa3+jgDuAM5Mc0R4EPbPFJEmSpN46eAx/80pgXZKVwNPA+QBVtTHJOuAxYCdwUVW90tpcCNwAHArc3jaA64GbkmxmMGK+vPW1LcnlwIPtvk9V1bZRfzBJkiTpjdgvxXlV3QPc046/A5yxm/uuAK6YIb4BOGmG+Eu04n6Ga6uB1XubsyRp/1h4yZfGnYLG6BtXfmDcKUi9Mo6Rc0mSpF7wx+Fk6+OPw/3xEiJJkiRJHVicS5IkST1hcS5JkiT1hMW5JEmS1BMW55IkSVJPWJxLkiRJPWFxLkmSJPWExbkkSZLUExbnkiRJUk9YnEuSJEk9cXCXm5K8D/gk8M7WJkBV1Y+PLjVJkiRpsuy2OE9yNvDXVfVN4HrgXwEPAa/sp9wkSZKkiTLbyPnXgD9McjnwYlXdvp9ykiRJkibSbovzqvpakmXA8cDdST4NfAF4eeieh0efoiRJkjQZZp1zXlWvAE8kOaWFFg9fBk4fVWKSJEnSpOn0QGhVnTbqRCRJkqRJ12kpxSRHJ7k+ye3t/IQkK0ebmiRJkjRZuq5zfgNwB/D32vnXgI+NIB9JkiRpYnUtzo+qqnXADwGqaicuqShJkiTtU12L8/+d5EgGD4GS5FTgxZFlJUmSJE2gTg+EAr8N3Aa8K8n/AOYB540sK0mSJGkCdV2t5eEk7wd+AgjwRFX9YKSZSZIkSRNm1u
I8yelVdVeSX5p26d1JqKovjDA3SZIkaaLsaeT8/cBdwD+Z4VoxeGOoJEmSpH1g1gdCq+qytv/VGbZfm61tkrckeSDJV5NsTPLvWnxukvVJNrX9EUNtLk2yOckTSc4aip+c5NF27bNJ0uKHJLm5xe9PsnCozYr2NzYlWbFX/zqSJEnSftT1JURHtqL44SQPJflMW71lNi8Dp1fVTwPvBZa2VV4uAe6sqkXAne2cJCcAy4ETgaXANUnmtL6uBVYBi9q2tMVXAtur6njgauCq1tdc4DLgFGAJcNnwjwBJkiSpj7oupbgW2Ap8kMEqLVuBm2drUAN/107f1LYClgFrWnwNcE47XgasraqXq+pJYDOwJMkxwOFVdV9VFXDjtDZTfd0CnNFG1c8C1lfVtqraDqzn1YJekiRJ6qWuxfncqrq8qp5s278H3r6nRknmJHkEeIFBsXw/cHRVPQfQ9u9ot88HnhlqvqXF5rfj6fFd2rQXI70IHDlLX5IkSVJvdS3O706yPMlBbftl4Et7alRVr1TVe4EFDEbBT5rl9szUxSzxvW2z6x9NViXZkGTD1q1bZ0lPkiRJGq2uxflvAH8CfL9ta4HfTrIjyff21Liqvgvcw2BqyfNtqgpt/0K7bQtw7FCzBcCzLb5ghvgubZIcDLwN2DZLXzPldl1VLa6qxfPmzdvTR5EkSZJGplNxXlWHVdVBVXVw2w5qscOq6vCZ2iSZl+Tt7fhQ4B8Bf8vgTaNTq6esAG5tx7cBy9sKLMcxePDzgTb1ZUeSU9t88g9PazPV13nAXW1e+h3AmUmOaA+CntlikiRJUm91ekMoQCtyFwFvmYpV1VeS/GZVXTNDk2OANW3FlYOAdVX1xST3AeuSrASeBs5vfW1Msg54DNgJXFRVr7S+LgRuAA4Fbm8bwPXATUk2MxgxX9762pbkcuDBdt+nqmpb188qSZIkjUOn4jzJrwMXM5ge8ghwKnBfkseAxcBrivOq+hvgZ2aIfwc4Y6a/U1VXAFfMEN8AvGa+elW9RCvuZ7i2Gli9u88kSZIk9U3XOecXAz8HPFVVpzEourcCn+DVZQ0lSZIkvQFdp7W8VFUvJSHJIVX1t0l+oqq+B+zxgVBJkiRJe9a1ON/SHu78c2B9ku3sZvUTSZIkSXunU3FeVee2w08muZvBkoW3z9JEkiRJ0uvUac55kpumjqvqv1fVbfiwpSRJkrRPdX0g9MThk7Y84sn7Ph1JkiRpcs1anCe5NMkO4D1Jvte2HQze6nnrbG0lSZIkvT6zFudV9R+r6jDg01V1eNsOq6ojq+rS/ZSjJEmSNBG6Tmv5YpIfBUjyoSS/n+SdI8xLkiRJmjhdi/Nrgf+T5KeBjwNPATeOLCtJkiRpAnUtzndWVQHLgM9U1WeAw0aXliRJkjR5ur6EaEeSS4EPAb/QVmt50+jSkiRJkiZP15HzXwFeBlZW1beA+cCnR5aVJEmSNIG6jpyfB/xRVW0HqKqncc65JEmStE91HTn/MeDBJOuSLE2SUSYlSZIkTaJOxXlV/VtgEXA98C+ATUn+Q5J3jTA3SZIkaaJ0HTmnrdbyrbbtBI4AbknyuyPKTZIkSZooneacJ/kosAL4NvB54N9U1Q+SHARsYrD2uSRJkqQ3oOsDoUcBv1RVTw0Hq+qHSc7e92lJkiRJk6frtJYA707yo9MvVNXj+zYlSZIkaTJ1Lc6fBC4ANiR5IMnvJVk2wrwkSZKkidN1tZbVVfVrwGnAHwPnt70kSZKkfaTrA6GfB04AngfuZfBSoodHmJckSZI0cbpOazkSmAN8F9gGfLuqdo4qKUmSJGkSdRo5r6pzAZL8FHAWcHeSOVW1YJTJSZIkSZOk67SWs4F/APwCg5cP3cVgeoskSZKkfaTrtJZfZDDH/INV9ZNV9atVtXq2BkmOTXJ3kseTbExycYvPTbI+yaa2P2KozaVJNid5IslZQ/GTkzzarn02SVr8kCQ3t/j9SRYOtVnR/samJCu6/5
NIkiRJ49F1tZaLqurmqnr2dfS9E/jXVfVTwKnARUlOAC4B7qyqRcCd7Zx2bTlwIrAUuCbJnNbXtcAqYFHblrb4SmB
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('dna-chars', get_characters(dna))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tryplety — znaczące cząstki genotypu\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nucleotides really are like letters: on their own they carry no\n",
"meaning. Only sequences of three nucleotides, *triplets* (codons), encode one\n",
"of the twenty amino acids or a STOP signal.\n",
"\n"
]
},
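{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick count shows why triplets are the smallest workable unit: single nucleotides give 4 symbols\n",
"and pairs give 16, too few for twenty amino acids, while triplets give 64. The sketch below\n",
"enumerates them with `itertools.product`; the names `nucleotides` and `all_triplets` are introduced\n",
"here for illustration only. Since there are more triplets than amino acids, several triplets\n",
"must encode the same amino acid.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from itertools import product\n",
"\n",
"nucleotides = 'ACGT'\n",
"\n",
"# Number of distinct words of length 1, 2 and 3 over a four-letter alphabet.\n",
"for length in (1, 2, 3):\n",
"    print(length, len(nucleotides) ** length)\n",
"\n",
"# Enumerate every possible triplet explicitly.\n",
"all_triplets = [''.join(p) for p in product(nucleotides, repeat=3)]\n",
"len(all_triplets)"
]
},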
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/dna-aminos.png'"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAuEAAADCCAYAAADn5xwjAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAbKElEQVR4nO3df7hdVX3n8ffHoEhVFCFQJGC0pk6BUSqRUm3HCm2JhRnQooZpC23T0lqsWqdVcDqP2Bme4jhKVQotI1Sgo5ChVagWlQEZrYNgUBSDRVL5YQQFBCG0giR+54+9rj1cbm4OMXuf5OT9ep7znLO/Z6/9XTvJPfmedddeO1WFJEmSpOE8btIdkCRJkrY3FuGSJEnSwCzCJUmSpIFZhEuSJEkDswiXJEmSBmYRLkmSJA1sh0l3YBJ22223Wrx48aS7IUmSpCl27bXX3l1VC+d6b7sswhcvXsyqVasm3Q1JkiRNsSS3buw9p6NIkiRJA7MIlyRJkgZmES5JkiQNzCJckiRJGphFuCRJkjSw7XJ1lElZfOJHezv2Lace3tuxJUmStGU5Ei5JkiQNzJHwKdfn6Ds4Ai9JkrQ5HAmXJEmSBmYRLkmSJA3MIlySJEkamHPC1QtXgpEkSdo4R8IlSZKkgfVehCe5Jcn1Sa5LsqrFnp7ksiQ3teddRvY/KcmaJDcmOWwkfmA7zpok70mSFt8xyYUtfnWSxX2fkyRJkvTDGGok/KVVdUBVLW3bJwKXV9US4PK2TZJ9geXAfsAy4IwkC1qbM4HjgSXtsazFVwD3VtVzgNOAtw9wPpIkSdJmm9R0lCOBc9vrc4GjRuIXVNVDVXUzsAY4KMmewM5VdVVVFXDerDYzx7oIOHRmlFySJEnaGg1RhBfwiSTXJjm+xfaoqjsA2vPuLb4X8PWRtmtbbK/2enb8EW2qaj1wH7Dr7E4kOT7JqiSr7rrrri1yYpIkSdLmGGJ1lBdX1e1JdgcuS/KP8+w71wh2zROfr80jA1VnAWcBLF269FHvS5IkSUPpfSS8qm5vz3cCHwIOAr7VppjQnu9su68F9h5pvgi4vcUXzRF/RJskOwBPBe7p41wkSZKkLaHXIjzJk5I8ZeY18IvAl4FLgOPabscBF7fXlwDL24onz6K7APOaNmVlXZKD23zvY2e1mTnW0cAVbd64JEmStFXqezrKHsCH2nWSOwAfqKqPJfkcsDLJCuA24JUAVbU6yUrgBmA9cEJVbWjHeg3wfmAn4NL2ADgbOD/JGroR8OU9n5MkSZL0Q+m1CK+qrwHPnyP+beDQjbQ5BThljvgqYP854g/SinhJkiRpW+AdMyVJkqSBWYRLkiRJA7MIlyRJkgZmES5JkiQNzCJckiRJGphFuCRJkjQwi3BJkiRpYBbhkiRJ0sAswiVJkqSB9X3bemkwi0/8aG/HvuXUw7eanJIkadvnSLgkSZI0MItwSZIkaWAW4ZIkSdLALMIlSZKkgVmES5IkSQOzCJckSZIGZhEuSZIkDcwiXJIkSRrYWDfrSfJi4GTgma1NgKqqZ/fXNUmSJGk6bbQIT3IE8IWq+gZwNvAHwLXAhoH6JkmSJE2l+aajfBX4yyQ/BdxXVZdW1Z1V9e2ZxzgJkixI8oUkH2nbT09yWZKb2vMuI/uelGRNkhuTHDYSPzDJ9e299yRJi++Y5MIWvzrJ4s35Q5AkSZKGtNEivKq+ChwJfAf4ZJJ3JPnpJC+YeYyZ4/XAV0a2TwQur6olwOVtmyT7AsuB/YBlwBlJFrQ2ZwLHA0vaY1mLrwDurarnAKcBbx+zT5IkSdLEzDsnvKo2ADe20XCApaNvA4fM1z7JIuBw4BTgjS18JPBz7fW5wJXAm1v8gqp6CLg5yRrgoCS3ADtX1VXtmOcBRwGXtjYnt2NdBJyeJFVV8/VLkiRJmqSxLsysqpdu5vH/DHgT8JSR2B5VdUc77h1Jdm/xvYDPjuy3tsUebq9nx2fafL0da32S+4
BdgbtndyTJ8XSj6eyzzz6beTqSJEnSD2+sJQqT7JHk7CSXtu19k6zYRJsjgDur6tox+5I5YjVPfL42jw5WnVVVS6tq6cKFC8fskiRJkrTljbtO+PuBjwPPaNtfBd6wiTYvBv5Dm05yAXBIkr8GvpVkT4D2fGfbfy2w90j7RcDtLb5ojvgj2iTZAXgqcM+Y5yRJkiRNxLhF+G5VtRL4PnRTP9jEUoVVdVJVLaqqxXQXXF5RVb8KXAIc13Y7Dri4vb4EWN5WPHkW3QWY17SpK+uSHNxWRTl2VpuZYx3dcjgfXJIkSVu1seaEA/+cZFfaVI8kBwP3bWbOU4GVbTrLbcArAapqdZKVwA3AeuCEdmEowGvoRuN3orsg89IWPxs4v13EeQ9dsS9JkiRt1cYtwt9IN+r8Y0k+AyykG3keS1VdSbcKCm198UM3st8pdCupzI6vAvafI/4grYiXtheLT/xor8e/5dTDez2+JEkaf3WUzyd5CfBcuoshb6yqh3vtmSRJkjSl5i3CkxxSVVckecWst348CVX1tz32TZIkSZpKmxoJfwlwBfDv53ivAItwSZIk6THa1B0z39qef2OY7kiSJEnTb9yb9eya5D1JPp/k2iTvbqulSJIkSXqMxl0n/ALgLuCX6VZFuQu4sK9OSZIkSdNs3CUKn15V/3Vk+78lOaqH/kiSJElTb9yR8E8mWZ7kce3xKqDfxYolSZKkKTVuEf47wAeA77XHBcAbk6xLcn9fnZMkSZKm0bg363lK3x2RJEmSthfjzgknyS7AEuCJM7Gq+lSS36uqM/ronCRJkjSNxirCk/wW8HpgEXAdcDBwVZIbgKWARbgkSZI0pnHnhL8eeCFwa1W9FPhJumUK3wIc1U/XJEmSpOk07nSUB6vqwSQk2bGq/jHJc6vqfsALMyVJkqTHYNwifG2SpwEfBi5Lci9we1+dkiRJkqbZuKujvLy9PDnJJ4GnApf21itJkiRpio01JzzJ+TOvq+r/VtUlwDm99UqSJEmaYuNemLnf6EaSBcCBW747kiRJ0vSbtwhPclKSdcDzktzfHuuAO4GLB+mhJEmSNGXmLcKr6k/b3TLfUVU7t8dTqmrXqjppoD5KkiRJU2Xc6SgfSfIkgCS/muRdSZ45X4MkT0xyTZIvJlmd5G0t/vQklyW5qT3vMtLmpCRrktyY5LCR+IFJrm/vvSdJWnzHJBe2+NVJFj/WPwBJkiRpaOMW4WcC/5Lk+cCbgFuB8zbR5iHgkKp6PnAAsCzJwcCJwOVVtQS4vG2TZF9gOd3882XAGW3u+Uz+44El7bGsxVcA91bVc4DTgLePeT6SJEnSxIxbhK+vqgKOBN5dVe8GnjJfg+o80DYf3x4zxzi3xc/lX++4eSRwQVU9VFU3A2uAg5LsCexcVVe1Ppw3q83MsS4CDp0ZJZckSZK2VuMW4euSnAT8KvDRNkL9+E01SrIgyXV0F3JeVlVXA3tU1R0A7Xn3tvtewNdHmq9tsb3a69nxR7SpqvXAfcCuG+nL8UlWJVl11113bfqMJUmSpJ6MW4S/mm56yYqq+iZd8fuOTTWqqg1VdQCwiG5Ue/95dp9rBLvmic/XZq6+nFVVS6tq6cKFC+fphiRJktSvcYvwo4G/qqpPA1TVbVW1qTnhP1BV3wGupJvL/a02xYT2fGfbbS2w90izRcDtLb5ojvgj2iTZge5OnveM2y9JkiRpEsa6bT3wo8Dnknye7k6ZH2/zszcqyULg4ar6TpKdgJ+nu3DyEuA44NT2PLPe+CXAB5K8C3gG3QWY11TVhiTr2kWdVwPHAu8daXMccBXdF4UrNtUvSZtn8Ykf7e3Yt5x6eG/HliRpazRWEV5Vf5zkvwC/CPwGcHqSlcDZVfVPG2m2J3Bumz/+OGBlVX0kyVXAyiQrgNuAV7Ycq9sxbwDWAydU1YZ2rNcA7wd2Ai5tD4CzgfOTrKEbAV8+/qlL2tpZ+EuSptW4I+FUVSX5JvBNuiJ5F+CiJJ
dV1Zvm2P9LwE/OEf82cOhGcpwCnDJHfBXwqPnkVfUgrYiXJEmSthVjFeFJXkc37eNu4H3AH1XVw0keB9xEt3a4JEm
"text/plain": [
"<Figure size 864x216 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import re\n",
"\n",
"# Standard genetic code: triplet -> one-letter amino acid symbol ('_' is STOP).\n",
"genetic_code = {\n",
"    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',\n",
"    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',\n",
"    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',\n",
"    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',\n",
"    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',\n",
"    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',\n",
"    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',\n",
"    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',\n",
"    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',\n",
"    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',\n",
"    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',\n",
"    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',\n",
"    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',\n",
"    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',\n",
"    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',\n",
"    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',\n",
"    }\n",
"\n",
"def get_triplets(t):\n",
"    # Read non-overlapping triplets from position 0; a trailing 1-2 nucleotides are ignored.\n",
"    for triplet in re.finditer(r'.{3}', t):\n",
"        yield genetic_code[triplet.group(0)]\n",
"\n",
"rang_freq_with_labels('dna-aminos', get_triplets(dna))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### „Zdania” w języku DNA\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Z aminokwasów zakodowanych przez tryplet budowane są białka.\n",
"Maszyneria budująca białka czyta sekwencję aż do napotkania\n",
"trypletu STOP (\\_ powyżej). Taka sekwencja to *gen*.\n",
"\n"
]
},
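{
"cell_type": "markdown",
"metadata": {},
"source": [
"The reading process can be tried on a toy string first. In the sketch below, `toy_aminos` is an\n",
"invented amino-acid sequence and `str.split` stands in for the generator defined in the next cell;\n",
"unlike that generator, `split` also keeps the final fragment that is not terminated by STOP.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical amino-acid sequence; '_' marks a STOP triplet.\n",
"toy_aminos = 'MKT_LLP_GA'\n",
"\n",
"# Split at STOP markers; the last piece ('GA') has no terminating STOP.\n",
"toy_genes = toy_aminos.split('_')\n",
"[len(g) for g in toy_genes]"
]
},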
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'02_Jezyki/dna_length.png'"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAX0AAAD4CAYAAAAAczaOAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAAsTAAALEwEAmpwYAAAV5klEQVR4nO3dbYxc53ne8f8V0pblF9ZStBJYki7pgnVLCbBsESpTN0ZbJRFtp6baQgWNpiJaAWwFpbX7goaqgTr9QEDuS9AIrRSwtiuqdawwjg0RMZSaYOMGBVTLK1k2Rcks15YsbciQGxWpmbpgQuXuh3noHC+Hu7MUd2bF8/8Bg3PmnvPM3ntWuubwmTNzUlVIkvrhRybdgCRpfAx9SeoRQ1+SesTQl6QeMfQlqUdWT7qBxVx33XW1cePGSbchSa8rTz311O9W1dT8+ooP/Y0bNzI9PT3pNiTpdSXJd4fVnd6RpB4x9CWpRwx9SeoRQ1+SesTQl6QeMfQlqUcMfUnqEUNfknrE0JekHhkp9JP8oyRHkzyb5HNJ3pTk2iSHkhxvy2s629+XZCbJsSS3d+q3JDnSHnsgSZbjlzpv454v/eAmSRoh9JOsA/4hsLWqbgJWATuBPcDhqtoMHG73SbKlPX4jsB14MMmq9nQPAbuBze22/bL+NpKkBY06vbMauDrJauDNwAlgB7C/Pb4fuKOt7wAeraqzVfUCMAPcmmQtsKaqnqjBNRof6YyRJI3BoqFfVb8N/BvgJeAk8H+q6svADVV1sm1zEri+DVkHvNx5itlWW9fW59cvkGR3kukk03Nzc0v7jSRJFzXK9M41DI7eNwF/EnhLkp9ZaMiQWi1Qv7BYta+qtlbV1qmpC74ZVJJ0iUaZ3vkJ4IWqmquqPwS+APwF4FSbsqEtT7ftZ4ENnfHrGUwHzbb1+XVJ0piMEvovAduSvLmdbXMb8DxwENjVttkFPNbWDwI7k1yVZBODN2yfbFNAZ5Jsa89zV2eMJGkMFr2ISlV9NcnngaeBc8DXgX3AW4EDSe5m8MJwZ9v+aJIDwHNt+3ur6tX2dPcADwNXA4+3myRpTEa6clZVfQL4xLzyWQZH/cO23wvsHVKfBm5aYo+SpMvET+RKUo8Y+pLUI4a+JPWIoS9JPWLoS1KPGPqS1COGviT1iKEvST1i6EtSjxj6ktQjhr4k9YihL0k9YuhLUo8Y+pLUI4a+JPWIoS9JPTLKhdHfleSZzu17ST6W5Nokh5Icb8trOmPuSzKT5FiS2zv1W5IcaY890C6bKEkak0VDv6qOVdXNVXUzcAvwfeCLwB7gcFVtBg63+yTZAuwEbgS2Aw8mWdWe7iFgN4Pr5m5uj0uSxmSp0zu3Ad+uqu8CO4D9rb4fuKOt7wAeraqzVfUCMAPcmmQtsKaqnqiqAh7pjJEkjcFSQ38n8Lm2fkNVnQRoy+tbfR3wcmfMbKuta+vz65KkMRk59JO8Efgw8KuLbTqkVgvUh/2s3Ummk0zPzc2N2qIkaRFLOdL/APB0VZ1q90+1KRva8nSrzwIbOuPWAydaff2Q+gWqal9Vba2qrVNTU0toUZK0kKWE/kf446kdgIPArra+C3isU9+Z5Kokmxi8YftkmwI6k2RbO2vnrs4YSdIYrB5loyRvBn4S+Hud8v3AgSR3Ay8BdwJU1dEkB4DngHPAvVX1ahtzD/AwcDXweLtJksZkpNCvqu8DPzqv9gqDs3mGbb8X2DukPg3ctPQ2JUmXg5/IlaQeMfQlqUcMfUnqEUNfknrE0JekHjH0JalHDH1J6hFDX5J6xNCXpB4x9CWpRwx9SeoRQ1+SesTQl6QeMfQlqUcMfUnqEUNfknrE0JekHhkp9JO8Pcnnk3wryfNJfizJtUkOJTneltd0tr8vyUySY0lu79RvSXKkPfZAu1auJGlMRj3S/0XgN6rqzwLvBp4H9gCHq2ozcLjdJ8kWYCdwI7AdeDDJqvY8DwG7GVwsfXN7XJ
I0JouGfpI1wPuBTwNU1R9U1e8BO4D9bbP9wB1tfQfwaFWdraoXgBng1iRrgTVV9URVFfBIZ4wkaQxGOdJ/JzAH/KckX0/yqSRvAW6oqpMAbXl9234d8HJn/GyrrWvr8+sXSLI7yXSS6bm5uSX9QpKkixsl9FcD7wUeqqr3AP+XNpVzEcPm6WuB+oXFqn1VtbWqtk5NTY3QoiRpFKOE/iwwW1Vfbfc/z+BF4FSbsqEtT3e239AZvx440errh9QlSWOyaOhX1e8ALyd5VyvdBjwHHAR2tdou4LG2fhDYmeSqJJsYvGH7ZJsCOpNkWztr567OGEnSGKwecbt/AHw2yRuB7wB/h8ELxoEkdwMvAXcCVNXRJAcYvDCcA+6tqlfb89wDPAxcDTzebpKkMRkp9KvqGWDrkIduu8j2e4G9Q+rTwE1L6E+SdBn5iVxJ6hFDX5J6xNCXpB4x9CWpRwx9SeoRQ1+SesTQl6QeMfQlqUcMfUnqEUNfknrE0JekHjH0JalHDH1J6hFDX5J6xNCXpB4x9CWpR0YK/SQvJjmS5Jkk0612bZJDSY635TWd7e9LMpPkWJLbO/Vb2vPMJHmgXTZRkjQmSznS/8tVdXNVnb+C1h7gcFVtBg63+yTZAuwEbgS2Aw8mWdXGPATsZnDd3M3tcUnSmLyW6Z0dwP62vh+4o1N/tKrOVtULwAxwa5K1wJqqeqKqCnikM0aSNAajhn4BX07yVJLdrXZDVZ0EaMvrW30d8HJn7GyrrWvr8+uSpDEZ6cLowPuq6kSS64FDSb61wLbD5ulrgfqFTzB4YdkN8I53vGPEFiVJixnpSL+qTrTlaeCLwK3AqTZlQ1uebpvPAhs6w9cDJ1p9/ZD6sJ+3r6q2VtXWqamp0X8bSdKCFg39JG9J8rbz68BPAc8CB4FdbbNdwGNt/SCwM8lVSTYxeMP2yTYFdCbJtnbWzl2dMZKkMRhleucG4Ivt7MrVwC9X1W8k+RpwIMndwEvAnQBVdTTJAeA54Bxwb1W92p7rHuBh4Grg8XaTJI3JoqFfVd8B3j2k/gpw20XG7AX2DqlPAzctvU1J0uXgJ3IlqUcMfUnqEUNfknrE0JekHjH0JalHDH1J6hFDX5J6xNCXpB4x9CWpRwx9SeoRQ1+SesTQl6QeMfQlqUcMfUnqEUNfknrE0JekHjH0JalHRg79JKuSfD3Jr7f71yY5lOR4W17T2fa+JDNJjiW5vVO/JcmR9tgD7Vq5kqQxWcqR/keB5zv39wCHq2ozcLjdJ8kWYCdwI7AdeDDJqjbmIWA3g4ulb26PS5LGZKTQT7Ie+BDwqU55B7C/re8H7ujUH62qs1X1AjAD3JpkLbCmqp6oqgIe6YyRJI3BqEf6/w74Z8AfdWo3VNVJgLa8vtXXAS93tptttXVtfX79Akl2J5lOMj03Nzdii5KkxSwa+kl+GjhdVU+N+JzD5ulrgfqFxap9VbW1qrZOTU2N+GMlSYtZPcI27wM+nOSDwJuANUn+C3AqydqqOtmmbk637WeBDZ3x64ETrb5+SF2SNCaLHulX1X1Vtb6qNjJ4g/a/VdXPAAeBXW2zXcBjbf0gsDPJVUk2MXjD9sk2BXQmybZ21s5dnTGSpDEY5Uj/Yu4HDiS5G3gJuBOgqo4mOQA8B5wD7q2qV9uYe4CHgauBx9tNkjQmSwr9qvoK8JW2/gpw20W22wvsHVKfBm5aapOSpMvDT+RKUo8Y+pLUI4a+JPWIoS9JPWLoS1KPGPqS1COGviT1yGv5cNbrysY9X/rB+ov3f2iCnUjS5HikL0k9YuhLUo8Y+pLUI4a+JPWIoS9JPWLoS1KPGPqS1COGviT1yCgXRn9TkieTfCPJ0ST/stWvTXIoyfG2vKYz5r4kM0mOJbm9U78lyZH22APtsomSpDEZ5Uj/LPBXqurdwM3A9iTbgD3A4araDBxu90myhcG1dG8EtgMPJlnVnushYDeD6+Zubo9LksZklAujV1X9frv7hnYrYA
ewv9X3A3e09R3Ao1V1tqpeAGaAW5OsBdZU1RNVVcAjnTGSpDEYaU4/yaokzwCngUNV9VXghqo6CdCW17fN1wEvd4b
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
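],
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"def get_genes(triplets):\n",
"    # Collect amino acids until a STOP marker ('_'), then yield them as one gene.\n",
"    # A trailing fragment without a final STOP is not yielded.\n",
"    gene = []\n",
"    for amino in triplets:\n",
"        if amino == '_':\n",
"            yield gene\n",
"            gene = []\n",
"        else:\n",
"            gene.append(amino)\n",
"\n",
"# Histogram of gene lengths, measured in amino acids.\n",
"plt.figure().clear()\n",
"plt.hist([len(g) for g in get_genes(get_triplets(dna))], bins=100)\n",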
"\n",
"fname = '02_Jezyki/dna_length.png'\n",
"\n",
"plt.savefig(fname)\n",
"\n",
"fname"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.2"
},
"org": null
},
"nbformat": 4,
"nbformat_minor": 1
}