{ "cells": [ { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [], "source": [ "import re\n", "from itertools import islice\n", "from collections import Counter\n", "import pandas as pd\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 209, "metadata": {}, "outputs": [], "source": [ "import lzma\n", "from collections import Counter, OrderedDict\n", "import matplotlib.pyplot as plt\n", "from math import log\n", "import re\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [], "source": [ "with open(\"train/in.tsv\", encoding='utf8', mode=\"rt\") as file:\n", "    a = file.readlines()\n", "\n", "a = [line.split(\"\\t\") for line in a]\n", "text = \" \".join([line[-2] + \" \" + line[-1] for line in a])\n", "text = re.sub(r\"\\\\+n\", \" \", text)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "del a" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "19560075" ] }, "execution_count": 199, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(text)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "words = re.findall(r\"\\w+\", text)\n", "bigram_counter = Counter(zip(words, islice(words, 1, None)))\n", "bigram_counter = dict(sorted(bigram_counter.items(), key=lambda item: item[1], reverse=True))\n", "\n", "del words" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "bigram_counter_short = {}\n", "for key, value in bigram_counter.items():\n", "    if value > 5:\n", "        bigram_counter_short[key] = value\n", "\n", "bigram_counter = bigram_counter_short\n", "del bigram_counter_short" ] }, { "cell_type": "code", "execution_count": 201, "metadata": {}, "outputs": [], "source": [ "unigram_counter = Counter(text.split(' '))\n", "unigram_counter = 
unigram_counter.most_common(10_000)\n", "# unigram_counter = dict(sorted(unigram_counter.items(), key=lambda item: item[1], reverse=True))\n", "unigram_counter_list = unigram_counter\n", "unigram_counter = dict(unigram_counter)" ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [], "source": [ "# with open(\"dev-0/in.tsv\", encoding='utf8', mode=\"rt\") as file:\n", "# a = file.readlines()\n", "\n", "# a = [line.split(\"\\t\") for line in a]\n", "# text = \" \".join([line[-2] + \" \" + line[-1] for line in a])" ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Skipping line 654: expected 8 fields, saw 9\n", "Skipping line 2220: expected 8 fields, saw 9\n", "\n" ] } ], "source": [ "# on_bad_lines='warn' replaces the deprecated error_bad_lines=False: malformed rows are skipped with a warning\n", "test_data = pd.read_csv('dev-0/in.tsv', sep='\\t', on_bad_lines='warn', header=None)" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| 0 | 662ed514d56f7bc8743aa6f23794c731 | LINCOLN TELEGRAPH | ChronAm | 1838.834247 | 43.910755 | -69.820862 | rin 11K ui i rsognfd inlriliinnts i>r the town... | Northeasterly hv the head of said .^corn’s\nan... |
| 1 | 0c3ac40edfe6a167ab692fdb9219a93c | THE WYANDOT PIONEER | ChronAm | 1857.691781 | 40.827279 | -83.281309 | ton County feel an interest in. tn great is-\n... | and design,\nand hence, every election, be it ... |
| 2 | b298097f3afd2f8c06b61fa2308ec725 | RICHMOND ENQUIRER | ChronAm | 1847.012329 | 37.538509 | -77.434280 | But at our own doors we have evidence ten\ning... | Democrat\nenlisting lor the Mexican wvir. They... |
| 3 | 1d50cf957a6a9cbbe0ee7773a72a76d4 | RAFTSMAN'S JOURNAL | ChronAm | 1867.541096 | 41.027280 | -78.439188 | The wonderful Flexibility and great comfort\na... | will preserve their perfect aud grace\nful sha... |
| 4 | 5a7297b76de00c7d9e1fb159384238c0 | RICHMOND ENQUIRER | ChronAm | 1826.083562 | 37.538509 | -77.434280 | Illinois.—The Legislature met at Ya:.ualia\non... | to run the line between Arkansas and\nthe’Vhnc... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10397 | 02e9e019df1992daeafe82b041d94aac | WATERBURY EVENING DEMOCRAT | ChronAm | 1888.949454 | 41.558153 | -73.051497 | the Fitzgeralds should perish like a common\nt... | Brian, but there was also a touch\nof self int... |
| 10398 | 74fa28868cbc998d15c242baea4e1faa | RICHMOND ENQUIRER | ChronAm | 1836.012295 | 37.538509 | -77.434280 | herd, so soon as he conveniently can, after th... | Court dotli lurlher adjudge, order, and decree... |
| 10399 | 147be715e90bac01c55969d90254f29e | EVENING CAPITAL | ChronAm | 1907.004110 | 38.978640 | -76.492786 | Drs. James J. Murphy, of Annapo-\nlis, and Tho... | in the matter\nor show any inclination to help... |
| 10400 | 1357f703947d912523ac23540cb99a0f | RAFTSMAN'S JOURNAL | ChronAm | 1868.077869 | 41.027280 | -78.439188 | the soles of the feet spikes or corks are fixe... | \nIn order to prevent \"the giant\" from\nfright... |
| 10401 | 23346293dbc949ee2edc3380db29f33b | THE DEMOCRATIC WHIG | ChronAm | 1843.760274 | 33.495674 | -88.427263 | tion which his opponent had taken, and whilst\... | come criterion, by which to judge\nof a nation... |

10402 rows × 8 columns
\n", "