moj-2024/wyk/03_Ngramy.ipynb

2024-03-27 07:13:21 +01:00
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Language modeling</h1>\n",
"<h2> 03. <i>N-grams</i> [lecture]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## N-grams\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In language modeling we often work with n-grams, that is, contiguous\n",
"subsequences of length $n$.\n",
"\n",
"For example, *digrams* (*bigrams*) are runs of two adjacent units, e.g. letters or words.\n",
"\n",
"| $n$|$n$-gram|name|\n",
"|---|---|---|\n",
"| 1|1-gram|unigram|\n",
"| 2|2-gram|digram/bigram|\n",
"| 3|3-gram|trigram|\n",
"| 4|4-gram|tetragram|\n",
"| 5|5-gram|pentagram|\n",
"\n",
"**Question:** What is a 6-gram called?\n",
"\n",
"As the table shows, for symmetry we sometimes speak of unigrams when we operate\n",
"simply on individual units rather than on sequences of them.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### N-grams from Pan Tadeusz\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The statistics we computed for single letters and words can be repeated for n-grams.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('k', 'o', 't'), ('o', 't', 'e'), ('t', 'e', 'k')]"
]
}
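],
"source": [
"def ngrams(items, size):\n",
"    # Yield every run of `size` consecutive items as a tuple.\n",
"    ngram = []\n",
"    for item in items:\n",
"        ngram.append(item)\n",
"        if len(ngram) == size:\n",
"            yield tuple(ngram)\n",
"            # Drop the oldest item so the window slides by one position.\n",
"            ngram = ngram[1:]\n",
"\n",
"list(ngrams(\"kotek\", 3))"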
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that we counted all n-grams, including those that partially overlap.\n",
"\n",
"We should always make it clear whether we mean character n-grams or word n-grams.\n",
"\n"
]
},
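{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the distinction concrete, here is a minimal sketch that extracts both character and word n-grams from one short string. The helper `ngrams_of_sequence` and the sample text are made up for this illustration. The helper uses `zip` over shifted slices instead of the generator above, so it assumes a sliceable sequence (a `str` or a `list`) rather than an arbitrary iterator.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"\n",
"def ngrams_of_sequence(seq, size):\n",
"    # Hypothetical helper: zip `size` shifted views of a sliceable sequence.\n",
"    return list(zip(*(seq[i:] for i in range(size))))\n",
"\n",
"\n",
"text = 'kot i kotek'\n",
"char_trigrams = ngrams_of_sequence(text, 3)         # units are characters\n",
"word_bigrams = ngrams_of_sequence(text.split(), 2)  # units are words\n",
"\n",
"print(len(char_trigrams))                       # 9 overlapping character 3-grams\n",
"print(Counter(char_trigrams)[('k', 'o', 't')])  # 2\n",
"print(word_bigrams)                             # [('kot', 'i'), ('i', 'kotek')]"
]
},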
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Character 3-grams\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('pt-3-char-ngrams-log-log', ngrams(get_characters(pan_tadeusz), 3))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Word 2-grams\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('pt-2-word-ngrams-log-log', ngrams(get_words(pan_tadeusz), 2))"
]
},
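{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plotting helper `log_rang_log_freq` is not defined in the cells shown here. As a rough sketch of the computation behind such a plot, and not the actual helper, the function below (its name and the toy input are assumptions) counts the items, sorts the counts in descending order, and returns (log rank, log frequency) pairs. Feeding it the output of `ngrams(...)` instead of the toy string would give the points for an n-gram plot.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"from math import log\n",
"\n",
"\n",
"def log_rank_log_freq_points(items):\n",
"    # Hypothetical stand-in, not the course helper: one point per distinct item.\n",
"    freqs = sorted(Counter(items).values(), reverse=True)\n",
"    return [(log(rank), log(freq)) for rank, freq in enumerate(freqs, start=1)]\n",
"\n",
"\n",
"points = log_rank_log_freq_points('abracadabra')\n",
"print(len(points))  # 5 distinct letters\n",
"print(points[0])    # (0.0, log(5)): the most frequent letter, 'a', occurs 5 times"
]
},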
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The mysterious language of the Voynich manuscript\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The [Voynich manuscript](https://pl.wikipedia.org/wiki/Manuskrypt_Wojnicza) is a 15th-century manuscript written in a\n",
"mysterious script that remains undeciphered to this day. It is\n",
"one of the greatest puzzles of history (and of linguistics).\n",
"\n",
"[Source: https://commons.wikimedia.org/wiki/File:Voynich_Manuscript_(135).jpg](./02_Jezyki/voynich135.jpg)\n",
"\n",
"Let us examine the statistical properties of the manuscript's text ourselves. We will use\n",
"the Vnow transcription, in which the individual characters of the mysterious script\n",
"are replaced with Latin letters, digits, and an asterisk. How\n",
"the manuscript should be transcribed remains a matter of debate, but the choice\n",
"of one transcription system over another should not affect\n",
"the statistical analysis dramatically.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"9 OR 9FAM ZO8 QOAR9 Q*R 8ARAM 29 [O82*]OM OPCC9 OP"
]
}
],
"source": [
"import re\n",
"\n",
"import requests\n",
"\n",
"voynich_url = 'http://www.voynich.net/reeds/gillogly/voynich.now'\n",
"voynich = requests.get(voynich_url).content.decode('utf-8')\n",
"\n",
"# Strip {...} groups, line-initial <...> tags, and runs of '-', '#' and spaces.\n",
"voynich = re.sub(r'\\{[^\\}]+\\}|^<[^>]+>|[-# ]+', '', voynich, flags=re.MULTILINE)\n",
"\n",
"# Join lines separated by a single newline; each blank line becomes one newline.\n",
"voynich = voynich.replace('\\n\\n', '#')\n",
"voynich = voynich.replace('\\n', ' ')\n",
"voynich = voynich.replace('#', '\\n')\n",
"\n",
"# Treat '.' as the word separator.\n",
"voynich = voynich.replace('.', ' ')\n",
"\n",
"voynich[100:150]"
]
},
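{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the cleaning steps do, the sketch below applies the same substitutions as the cell above to a tiny invented sample. The sample lines, including the `<f1r.1>`-style tags and the `{note}` group, are assumptions made up to exercise the regular expression; they are not taken from the transcription file.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# Invented sample: two lines, a blank line, then a third line.\n",
"sample = '<f1r.1>OR.9FAM-{note}\\n<f1r.2>ZO8.QOAR9\\n\\n<f1v.1>8ARAM.29\\n'\n",
"\n",
"cleaned = re.sub(r'\\{[^\\}]+\\}|^<[^>]+>|[-# ]+', '', sample, flags=re.MULTILINE)\n",
"cleaned = cleaned.replace('\\n\\n', '#').replace('\\n', ' ').replace('#', '\\n')\n",
"cleaned = cleaned.replace('.', ' ')\n",
"\n",
"print(repr(cleaned))  # 'OR 9FAM ZO8 QOAR9\\n8ARAM 29 '"
]
},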
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('voy-chars', get_characters(voynich))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('voy-log-log', get_words(voynich))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('voy-words-20', get_words(voynich), top=20)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"log_rang_log_freq('voy-words-log-log', get_words(voynich))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The language of DNA\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The genetic code exhibits properties surprisingly similar to those of natural languages.\n",
"Above all, it is discrete: a genotype is a sequence of symbols from a finite alphabet.\n",
"There are only four basic letters, which represent the nucleotides that make up a DNA strand:\n",
"a, g, c, t.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"TATAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA"
]
}
],
"source": [
"import requests\n",
"\n",
"dna_url = 'https://raw.githubusercontent.com/egreen18/NanO_GEM/master/rawGenome.txt'\n",
"dna = requests.get(dna_url).content.decode('utf-8')\n",
"\n",
"# Drop the first line and join the remaining lines into one string.\n",
"dna = ''.join(dna.split('\\n')[1:])\n",
"# 'N' stands for an unspecified nucleotide; map it to 'A'.\n",
"dna = dna.replace('N', 'A')\n",
"\n",
"dna[0:100]"
]
},
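{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because DNA is a string over a small alphabet, the same n-gram counting applies to it. The sketch below counts overlapping 3-grams in a short fragment copied from the output above. The helper `count_ngrams` is an illustrative name and is not defined anywhere else in these notes.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"\n",
"def count_ngrams(sequence, size):\n",
"    # Illustrative helper: count every overlapping window of `size` symbols.\n",
"    return Counter(sequence[i:i + size] for i in range(len(sequence) - size + 1))\n",
"\n",
"\n",
"fragment = 'TAACCCTAACCC'\n",
"counts = count_ngrams(fragment, 3)\n",
"print(sum(counts.values()))  # 10 windows in a 12-letter fragment\n",
"print(counts['TAA'])         # 2\n",
"print(counts['CCT'])         # 1"
]
},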
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABLAAAAEsCAYAAADTvUpQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAAwBElEQVR4nO3df1iUdb7/8dcEMiLCBLJAY5R26ZIE9gM3JdvFPQVoItt2NjthJCeXbDWJ0GOZm6mb0JqiJz22ZRbmj2NHzVObxUJuaV5GEEmJGtamx18g7oqgZIAw3z827++OjLYWct8yz8d1zXUxn/vNzIu5rnuvfO3nvsfmcrlcAgAAAAAAACzqMrMDAAAAAAAAAOdDgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWBoFFgAAAAAAACyNAgsAAAAAAACWRoEFAAAAAAAAS6PAAgAAAAAAgKVRYAEAAAAAAMDSKLAAAAAAAABgaRRYAAAAAAAAsDQKLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWBoFFgAAAAAAACyNAgsAAAAAAACWRoEFAAAAAAAAS/M1OwC6nra2Nh0+fFiBgYGy2WxmxwEAAAAAdHEul0snTpyQ0+nUZZexV6crosBChzt8+LAiIyPNjgEAAAAA8DIHDhzQlVdeaXYMXAQUWOhwgYGBkv7+PxxBQUEmpwEAAAAAdHUNDQ2KjIw0/j2KrocCCx3uzGWDQUFBFFgAAAAAgE7DbWy6Li4MBQAAAAAAgKVRYAEAAAAAAMDSKLAAAAAAAABgaRRYAAAAAAAAsDQKLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEvzNTsAYKY+j280OwLwg+17ZqTZEQAAAADgoqLAAgB0OspjdAWUxwAAAJ2HSwgBAAAAAABgaRRYAAAAAAAAsDQKLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWBoFVifLy8uTzWZTdna2seZyuTRz5kw5nU75+/tr2LBh2rlzp9vvNTU1adKkSQoNDVVAQIBSU1N18OBBt5m6ujqlp6fL4XDI4XAoPT1dx48fd5vZv3+/Ro0apYCAAIWGhiorK0vNzc1uMzt27FBCQoL8/f3Vu3dvzZ49Wy6Xq0M/BwAAAAAAgH8WBVYnKisr04svvqiBAwe6rc+dO1f5+flavHixysrKFBERocTERJ04ccKYyc7O1oYNG7RmzRpt3bpVJ0+eVEpKilpbW42ZtLQ0VVRUqLCwUIWFhaqoqFB6erpxvLW1VSNHjlRjY6O2bt2qNWvWaP369Zo8ebIx09DQoMTERDmdTpWVlWnRokWaN2+e8vPzL+InAwAAAAAAcG6+ZgfwFidPntSYMWO0dOlSPf3008a6y+XSwoULNX36dN11112SpOXLlys8PFyrV6/W+PHjVV9fr2XLlmnFihW6/fbbJUkrV65UZGSk3n33XSUnJ2v37t0qLCxUSUmJBg8eLElaunSp4uPjVVVVpaioKBUVFWnXrl06cOCAnE6nJGn+/PnKyMjQnDlzFBQUpFWrVumbb75RQUGB7H
a7YmJitGfPHuXn5ysnJ0c2m62TPzkAAAAAAODt2IHVSSZOnKiRI0caBdQZe/fuVU1NjZKSkow1u92uhIQEbdu2TZJUXl6ulpYWtxmn06mYmBhj5sMPP5TD4TDKK0kaMmSIHA6H20xMTIxRXklScnKympqaVF5ebswkJCTIbre7zRw+fFj79u3z+Lc1NTWpoaHB7QEAAAAAANBRKLA6wZo1a/TJJ58oLy+v3bGamhpJUnh4uNt6eHi4caympkZ+fn4KDg4+70xYWFi71w8LC3ObOft9goOD5efnd96ZM8/PzJwtLy/PuO+Ww+FQZGSkxzkAAAAAAIDvgwLrIjtw4IAeeeQRrVy5Ut27dz/n3NmX5rlcru+8XO/sGU/zHTFz5gbu58ozbdo01dfXG48DBw6cNzcAAAAAAMCFoMC6yMrLy1VbW6u4uDj5+vrK19dXmzdv1nPPPSdfX99z7m6qra01jkVERKi5uVl1dXXnnTly5Ei79z969KjbzNnvU1dXp5aWlvPO1NbWSmq/S+wMu92uoKAgtwcAAAAAAEBHocC6yG677Tbt2LFDFRUVxmPQoEEaM2aMKioqdM011ygiIkLFxcXG7zQ3N2vz5s265ZZbJElxcXHq1q2b20x1dbUqKyuNmfj4eNXX16u0tNSY+eijj1RfX+82U1lZqerqamOmqKhIdrtdcXFxxsyWLVvU3NzsNuN0OtWnT5+O/4AAAAAAAAC+A99CeJEFBgYqJibGbS0gIEC9evUy1rOzs5Wbm6v+/furf//+ys3NVY8ePZSWliZJcjgcGjdunCZPnqxevXopJCREU6ZMUWxsrHFT+AEDBmj48OHKzMzUCy+8IEl68MEHlZKSoqioKElSUlKSoqOjlZ6ermeffVbHjh3TlClTlJmZaeyaSktL06xZs5SRkaEnnnhCX3zxhXJzczVjxgy+gRAAAAAAAJiCAssCpk6dqlOnTmnChAmqq6vT4MGDVVRUpMDAQGNmwYIF8vX11ejRo3Xq1CnddtttKigokI+PjzGzatUqZWVlGd9WmJqaqsWLFxvHfXx8tHHjRk2YMEFDhw6Vv7+/0tLSNG/ePGPG4XCouLhYEydO1KBBgxQcHKycnBzl5OR0wicBAAAAAADQns115g7dQAdpaGiQw+FQfX295e+H1efxjWZHAH6wfc+MNDvCBePcQ1dwKZ57AAB0VZfSv0Px/XAPLAAAAAAAAFgalxACAAB4AXY+oitg5yMAeC8KLAAAAAC4SCiP0RVQHsMKuIQQAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpvmYHsKrjx4+rtLRUtbW1amtrczt2//33m5QKAAAAAADA+1BgefDHP/5RY8aMUWNjowIDA2Wz2YxjNpuNAgsAAAAAAKATcQmhpP/5n//R//3f/xnPJ0+erAceeEAnTpzQ8ePHVVdXZzyOHTtmYlIAAAAAAADvQ4ElqWfPnvr5z3+ujz/+WJJ06NAhZWVlqUePHiYnAwAAAAAAAAWWpDvuuENvvPGGfvOb30iSkpOTjTILAAAAAAAA5uIeWN+KjY3Vli1bJEkjR47Uf/zHf2jXrl2KjY1Vt27d3GZTU1PNiAgAAAAAAOCVKLD+gb+/vyQpMzNTkjR79ux2MzabTa2trZ2aCwAAAAAAwJtRYHnQ1tZmdgQAAAAAAAB8i3tgfYdvvvnG7AgAAAAAAABejQLLg9bWVv3ud79T79691bNnT3311VeSpCeffFLLli0zOR0AAAAAAIB3ocDyYM6cOSooKNDcuXPl5+dnrMfGxuqll14yMRkAAAAAAID3ocDy4NVXX9WLL76oMWPGyMfHx1gfOHCgPv/8cxOTAQAAAAAAeB8KLA8OHT
qkfv36tVtva2tTS0uLCYkAAAAAAAC8FwWWB9ddd50++OCDdutr167VjTfeaEIiAAAAAAAA7+VrdgAreuqpp5Senq5
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"rang_freq_with_labels('dna-chars', get_characters(dna))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tryplety — znaczące cząstki genotypu\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Nukleotydy rzeczywiście są jak litery, same w sobie nie niosą\n",
"znaczenia. Dopiero ciągi trzech nukleotydów, *tryplety*, kodują jeden\n",
"z dwudziestu aminokwasów.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABLAAAAEsCAYAAADTvUpQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA9HUlEQVR4nO3de1hU9b7H8c8EgogwgQRE4qU0UtEuVEq2A1NBN4hd9rYTRllutU2JpB7TPBlWonkvOXYxDfOSZWal7gg1MQlJQ6lM0txqooFaIl4yQJzzR4/rNKLuGLG1pPfreeZ5nLW+rPnMqFw+/NYam8PhcAgAAAAAAACwqMvMDgAAAAAAAACcDwUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWBoFFgAAAAAAACyNAgsAAAAAAACWRoEFAAAAAAAAS6PAAgAAAAAAgKVRYAEAAAAAAMDSKLAAAAAAAABgaRRYAAAAAAAAsDQKLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWBoFFgAAAAAAACzN3ewAqH9OnTqlH374QT4+PrLZbGbHAQAAAADUcw6HQ0ePHlVISIguu4y1OvURBRbq3A8//KDQ0FCzYwAAAAAA/mSKi4vVtGlTs2PgIqDAQp3z8fGR9OsnDl9fX5PTAAAAAADquyNHjig0NNT4eRT1DwUW6tzp0wZ9fX0psAAAAAAAfxguY1N/cWIoAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNIosAAAAAAAAGBpFFgAAAAAAACwNAosAAAAAAAAWJq72QEAM7UYucLsCNo9Ic7sCAAAAAAAWBorsAAAAAAAAGBprMACLM7sVWKsEAMAAAAAmI0VWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABL410IAVww3ikRAAAAAHAxsQILAAAAAAAAlkaBBQAAAAAAAEujwLKAtLQ02Ww2p1twcLCx3+FwKC0tTSEhIfLy8lJ0dLS++eYbp2NUVFRo8ODBCggIkLe3txISErR3716nmbKyMiUlJclut8tutyspKUmHDx92mtmzZ4969eolb29vBQQEKCUlRZWVlRftuQMAAAAAAPwnFFgW0a5dO5WUlBi3r7/+2tg3ceJETZ06VRkZGdq4caOCg4PVvXt3HT161JhJTU3V0qVLtWjRIuXm5urYsWOKj49XdXW1MZOYmKjCwkJlZWUpKytLhYWFSkpKMvZXV1crLi5Ox48fV25urhYtWqQlS5Zo2LBhf8yLAAAAAAAAcBZcxN0i3N3dnVZdneZwODR9+nSNHj1a99xzjyRp7ty5CgoK0sKFCzVo0CCVl5dr9uzZmjdvnrp16yZJmj9/vkJDQ7Vq1SrFxsaqqKhIWVlZys/PV8eOHSVJs2bNUmRkpLZt26awsDBlZ2dr69atKi4uVkhIiCRpypQp6tevn8aNGydfX98/6NUAAAAAAAD4f6zAsojvvvtOISEhatmypf7rv/5LO3fulCTt2rVLpaWliomJMWY9PT0VFRWlvLw8SVJBQYGqqqqcZkJCQhQeHm7MrF+/Xna73SivJKlTp06y2+1OM+Hh4UZ5JUmxsbGqqKhQQUHBObNXVFToyJEjTjcAAAAAAIC6QoFlAR07dtSbb76pjz/+WLNmzVJpaaluu+02/fTTTyotLZUkBQUFOX1MUF
CQsa+0tFQeHh7y8/M770xgYGCNxw4MDHSaOfNx/Pz85OHhYcyczfjx443ratntdoWGhtbyFQAAAAAAADg3CiwL6Nmzp+699161b99e3bp104oVKyT9eqrgaTabzeljHA5HjW1nOnPmbPOuzJxp1KhRKi8vN27FxcXnzQUAAAAAAFAbFFgW5O3trfbt2+u7774zrot15gqoAwcOGKulgoODVVlZqbKysvPO7N+/v8ZjHTx40GnmzMcpKytTVVVVjZVZv+Xp6SlfX1+nGwAAAAAAQF2hwLKgiooKFRUV6corr1TLli0VHByslStXGvsrKyu1du1a3XbbbZKkiIgINWjQwGmmpKREW7ZsMWYiIyNVXl6uDRs2GDOff/65ysvLnWa2bNmikpISYyY7O1uenp6KiIi4qM8ZAAAAAADgXHgXQgsYPny4evXqpWbNmunAgQN6/vnndeTIET300EOy2WxKTU1Venq6WrdurdatWys9PV2NGjVSYmKiJMlut6t///4aNmyYmjRpIn9/fw0fPtw4JVGS2rRpox49emjAgAF69dVXJUkDBw5UfHy8wsLCJEkxMTFq27atkpKSNGnSJB06dEjDhw/XgAEDWFUFAAAAAABMQ4FlAXv37tX999+vH3/8UVdccYU6deqk/Px8NW/eXJI0YsQInThxQsnJySorK1PHjh2VnZ0tHx8f4xjTpk2Tu7u7+vTpoxMnTqhr167KzMyUm5ubMbNgwQKlpKQY71aYkJCgjIwMY7+bm5tWrFih5ORkde7cWV5eXkpMTNTkyZP/oFcCAAAAAACgJpvD4XCYHQL1y5EjR2S321VeXm75lVstRq4wO4J2T4g7736zM/6nfNKlkREAAABA/XUp/RwK13ANLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYGgUWAAAAAAAALI0CCwAAAAAAAJZGgQUAAAAAAABLo8ACAAAAAACApVFgAQAAAAAAwNLczQ4AABdbi5ErzI6g3RPizI4AAAAAAJcsCiwAsACzSzYKNgAAAABWximEAAAAAAAAsDQKLAAAAAAAAFgaBRYAAAAAAAAsjQILAAAAAAAAlkaBBQAAAAAAAEujwAIAAAAAAIClUWABAAAAAADA0iiwAAAAAAAAYGkUWAAAAAAAALA0CiwAAAAAAABYmrvZAS5lhw8f1oYNG3TgwAGdOnXKad+DDz5oUioAAAAAAID6hQLLRcuWLVPfvn11/Phx+fj4yGazGftsNhsFFgAAAAAAQB3hFMLf6Z133tH3339v3B82bJgeeeQRHT16VIcPH1ZZWZlxO3TokIlJAQAAAAAA6hcKrN+pcePG6tKli7744gtJ0r59+5SSkqJGjRqZnAwAAAAAAKB+o8D6nf7617/qgw8+0D//+U9JUmxsrFFm1bXx48fLZrMpNTXV2OZwOJSWlqaQkBB5eXkpOjpa33zzjdPHVVRUaPDgwQoICJC3t7cSEhK0d+9ep5mysjIlJSXJbrfLbrcrKSlJhw8fdprZs2ePevXqJW9vbwUEBCglJUWVlZUX5bkCAAAAAAD8J1wDqxbat2+vTz/9VJIUFxen//7v/9bWrVvVvn17NWjQwGk2ISHBpcfYuHGjXnvtNXXo0MFp+8SJEzV16lRlZmbq2muv1fPPP6/u3btr27Zt8vHxkSSlpqZq2bJlWrRokZo0aaJhw4YpPj5eBQUFcnNzkyQlJiZq7969ysrKkiQNHDhQSUlJWrZsmSSpurpacXFxuuKKK5Sbm6uffvpJDz30kBwOh2bMmOHScwIAAAAAALgQFFi15OXlJUkaMGCAJOnZZ5+tMWOz2VRdXV3rYx87dkx9+/bVrFmz9PzzzxvbHQ6Hpk+frtGjR+uee+6RJM2dO1dBQUFauHChBg0apPLycs2ePVvz5s1Tt27dJE
nz589XaGioVq1apdjYWBUVFSkrK0v5+fnq2LGjJGnWrFmKjIzUtm3bFBYWpuzsbG3dulXFxcUKCQmRJE2ZMkX9+vX
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"genetic_code = {\n",
" 'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',\n",
" 'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',\n",
" 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',\n",
" 'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',\n",
" 'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',\n",
" 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',\n",
" 'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',\n",
" 'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',\n",
" 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',\n",
" 'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',\n",
" 'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',\n",
" 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',\n",
" 'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',\n",
" 'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',\n",
" 'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_',\n",
" 'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W',\n",
" }\n",
"\n",
"def get_triplets(t):\n",
" for triplet in re.finditer(r'.{3}', t):\n",
" yield genetic_code[triplet.group(0)]\n",
"\n",
"rang_freq_with_labels('dna-aminos', get_triplets(dna))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### „Zdania” w języku DNA\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Z aminokwasów zakodowanych przez tryplet budowane są białka.\n",
"Maszyneria budująca białka czyta sekwencję aż do napotkania\n",
"trypletu STOP (\\_ powyżej). Taka sekwencja to *gen*.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/YYfK9AAAACXBIWXMAAA9hAAAPYQGoP6dpAAA0B0lEQVR4nO3df3RU9Z3/8deU/DBkk1sCZIZZI6ZuSsFEV0M3JP0BKxBgjanHHsGGneKRAhaFzgrLj3W7RU+bAN2Cu83WokuFIm56vucYt6fQlLDVVDYE0mhaiEDpMfKjZBLqTiZB00kM9/uHh1smiYAKzEw+z8c59xzmc9/3zuc9dzAvP5k7uGzbtgUAAABjfCLaEwAAAMD1RQAEAAAwDAEQAADAMARAAAAAwxAAAQAADEMABAAAMAwBEAAAwDAEQAAAAMMQAAEAAAxDAAQAADAMARAAAMAwBEAAAADDEAABAAAMQwAEAAAwDAEQAADAMARAAAAAwxAAAQAADEMABAAAMAwBEAAAwDAEQAAAAMMQAAEAAAxDAAQAADAMARAAAMAwBEAAAADDEAABAAAMQwAEAAAwDAEQAADAMARAAAAAwxAAAQAADEMABAAAMAwBEAAAwDAEQAAAAMMQAAEAAAxDAAQAADAMARAAAMAwBEAAAADDEAABAAAMQwAEAAAwDAEQAADAMARAAAAAwxAAAQAADEMABAAAMAwBEAAAwDAEQAAAAMMQAAEAAAxDAAQAADAMARAAAMAwBEAAAADDEAABAAAMQwAEAAAwDAEQAADAMARAAAAAwxAAAQAADEMABAAAMAwBEAAAwDAEQAAAAMMkRHsC8ez8+fM6c+aM0tLS5HK5oj0dAABwBWzbVnd3t7xerz7xCTPXwgiAH8OZM2eUlZUV7WkAAICP4NSpU7rxxhujPY2oIAB+DGlpaZLefwOlp6dHeTYAAOBKdHV1KSsry/k5biIC4Mdw4de+6enpBEAAAOKMyR/fMvMX3wAAAAYjAAIAABiGAAgAAGAYAiAAAIBhCIAAAACGIQACAAAYhgAIAABgGAIgAACAYQiAAAAAhiEAAgAAGIYACAAAYBgCIAAAgGEIgAAAAIYhAAIAABgmIdoTwAe7ec2uiMdvrb87SjMBAADDCSuAAAAAhom5APjee+/pn//5n5Wdna2UlBR96lOf0pNPPqnz5887NbZta926dfJ6vUpJSdG0adPU0tIScZ5wOKxly5ZpzJgxSk1NVWlpqU6fPh1REwwG5fP5ZFmWLMuSz+dTZ2fn9WgTAAAgamIuAG7YsEE//OEPVVlZqSNHjmjjxo367ne/q+9///tOzcaNG7Vp0yZVVlaqsbFRHo9HM2fOVHd3t1Pj9/tVXV2tqqoq7du3T+fOnVNJSYn6+/udmrKyMjU3N6umpkY1NTVqbm6Wz+e7rv0CAABcby7btu1oT+JiJSUlcrvd2rp1qzP25S9/WSNHjtSOHTtk27a8Xq/8fr9Wr14t6f3VPrfbrQ0bNmjJkiUKhUIaO3asduzYoXnz5kmSzpw5o6ysLO3evVuzZs3SkSNHNGnSJDU0NKigoECS1NDQoMLCQh09elQTJky47Fy7urpkWZZCoZDS09Ov+mvBZwABALj6rvXP73gQcyuAn//85/U///M/+t3vfidJ+s1vfqN9+/bp7/7u7yRJra2tCgQCKi4udo5JTk7W1KlTVV9fL0lqampSX19fRI3X61Vubq5Ts3//flmW5YQ/SZoyZYosy3JqBgqHw+rq6orYAAAA4k3M3QW8evVqhUIhfeYzn9GIESPU39+v73znO/rKV74iSQoEApIkt9sdcZzb7daJEyecmqSkJI0aNWpQzYXjA4GAMjMzBz1/ZmamUzNQRUWFnnjiiY/XIAAAQJTF3ArgT37yEz3//PN64YUX9Nprr2n79u3613/9V23fvj2izuVyRTy2bXvQ2EADa4aqv9R51q5dq1Ao5GynTp260rYAAABiRs
ytAP7jP/6j1qxZowceeECSlJeXpxMnTqiiokILFiyQx+OR9P4K3rhx45zjOjo6nFVBj8ej3t5eBYPBiFXAjo4OFRUVOTXt7e2Dnv/s2bODVhcvSE5OVnJy8tVpFAAAIEpibgXw3Xff1Sc+ETmtESNGOF8Dk52dLY/Ho9raWmd/b2+v6urqnHCXn5+vxMTEiJq2tjYdPnzYqSksLFQoFNLBgwedmgMHDigUCjk1AAAAw1HMrQDec889+s53vqObbrpJt956q15//XVt2rRJDz30kKT3f23r9/tVXl6unJwc5eTkqLy8XCNHjlRZWZkkybIsLVy4UCtWrNDo0aOVkZGhlStXKi8vTzNmzJAkTZw4UbNnz9aiRYu0ZcsWSdLixYtVUlJyRXcAAwAAxKuYC4Df//739c1vflNLly5VR0eHvF6vlixZon/5l39xalatWqWenh4tXbpUwWBQBQUF2rNnj9LS0pyazZs3KyEhQXPnzlVPT4+mT5+ubdu2acSIEU7Nzp07tXz5cudu4dLSUlVWVl6/ZgEAAKIg5r4HMJ7wPYAAAMQfvgcwBj8DCAAAgGuLAAgAAGAYAiAAAIBhCIAAAACGIQACAAAYhgAIAABgGAIgAACAYQiAAAAAhiEAAgAAGIYACAAAYBgCIAAAgGEIgAAAAIYhAAIAABiGAAgAAGAYAiAAAIBhCIAAAACGIQACAAAYhgAIAABgGAIgAACAYQiAAAAAhiEAAgAAGIYACAAAYBgCIAAAgGEIgAAAAIYhAAIAABiGAAgAAGAYAiAAAIBhCIAAAACGIQACAAAYhgAIAABgGAIgAACAYQiAAAAAhom5AHjzzTfL5XIN2h555BFJkm3bWrdunbxer1JSUjRt2jS1tLREnCMcDmvZsmUaM2aMUlNTVVpaqtOnT0fUBINB+Xw+WZYly7Lk8/nU2dl5vdoEAACImpgLgI2NjWpra3O22tpaSdL9998vSdq4caM2bdqkyspKNTY2yuPxaObMmeru7nbO4ff7VV1draqqKu3bt0/nzp1TSUmJ+vv7nZqysjI1NzerpqZGNTU1am5uls/nu77NAgAARIHLtm072pO4FL/fr5/97Gc6fvy4JMnr9crv92v16tWS3l/tc7vd2rBhg5YsWaJQKKSxY8dqx44dmjdvniTpzJkzysrK0u7duzVr1iwdOXJEkyZNUkNDgwoKCiRJDQ0NKiws1NGjRzVhwoQrmltXV5csy1IoFFJ6evpV7/3mNbsiHr+1/u6r/hwAAJjmWv/8jgcxtwJ4sd7eXj3//PN66KGH5HK51NraqkAgoOLiYqcmOTlZU6dOVX19vSSpqalJfX19ETVer1e5ublOzf79+2VZlhP+JGnKlCmyLMupGUo4HFZXV1fEBgAAEG9iOgC+9NJL6uzs1IMPPihJCgQCkiS32x1R53a7nX2BQEBJSUkaNWrUJWsyMzMHPV9mZqZTM5SKigrnM4OWZSkrK+sj9wYAABAtMR0At27dqjlz5sjr9UaMu1yuiMe2bQ8aG2hgzVD1lzvP2rVrFQqFnO3UqVNX0gYAAEBMidkAeOLECe3du1df+9rXnDGPxyNJg1bpOjo6nFVBj8ej3t5eBYPBS9a0t7cPes6zZ88OWl28WHJystLT0yM2AACAeBOzAfC5555TZmam7r77zzc+ZGdny+PxOHcGS+9/TrCurk5FRUWSpPz8fCUmJkbUtLW16fDhw05NYWGhQqGQDh486NQcOHBAoVDIqQEAABiuEqI9gaGcP39ezz33nBYsWKCEhD9P0eVyye/3q7y8XDk5OcrJyVF5eblGjhypsrIySZJlWVq4cKFWrFih0aNHKyMjQytXrlReXp5mzJghSZo4caJmz56tRYsWacuWLZKkxYsXq6Sk5IrvAAYAAIhXMRkA9+7dq5MnT+qhhx4atG/VqlXq6enR0qVLFQwGVVBQoD179igtLc2p2bx5sxISEjR37lz19PRo+vTp2rZtm0aMGOHU7N
y5U8uXL3fuFi4tLVVlZeW1bw4AACDKYv57AGMZ3wMIAED84XsAY/gzgAAAALg2CIAAAACGIQACAAAYhgAIAABgGAI
"text/plain": [
"<matplotlib.figure.Figure>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"def get_genes(triplets):\n",
" gene = []\n",
" for ammino in triplets:\n",
" if ammino == '_':\n",
" yield gene\n",
" gene = []\n",
" else:\n",
" gene.append(ammino)\n",
"\n",
"plt.figure().clear()\n",
"plt.hist([len(g) for g in get_genes(get_triplets(dna))], bins=100)\n",
"\n",
"fname = '03_Ngramy/dna_length.png'\n",
"\n",
"plt.savefig(fname)\n",
"\n",
"fname"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
},
"org": null
},
"nbformat": 4,
"nbformat_minor": 1
}