Compare commits


No commits in common. "13-2" and "master" have entirely different histories.
13-2 ... master

20 changed files with 47572 additions and 7774 deletions

634
09_Zanurzenia_slow.ipynb Normal file

@@ -0,0 +1,634 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Modelowanie języka</h1>\n",
"<h2> 09. <i>Zanurzenia słów (Word2vec)</i> [wykład]</h2> \n",
"<h3> Filip Graliński (2022)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Zanurzenia słów (Word2vec)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W praktyce stosowalność słowosieci okazała się zaskakująco\n",
"ograniczona. Większy przełom w przetwarzaniu języka naturalnego przyniosły\n",
"wielowymiarowe reprezentacje słów, inaczej: zanurzenia słów.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### „Wymiary” słów\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Moglibyśmy zanurzyć (ang. *embed*) w wielowymiarowej przestrzeni, tzn. zdefiniować odwzorowanie\n",
"$E \\colon V \\rightarrow \\mathcal{R}^m$ dla pewnego $m$ i określić taki sposób estymowania\n",
"prawdopodobieństw $P(u|v)$, by dla par $E(v)$ i $E(v')$ oraz $E(u)$ i $E(u')$ znajdujących się w pobliżu\n",
"(według jakiejś metryki odległości, na przykład zwykłej odległości euklidesowej):\n",
"\n",
"$$P(u|v) \\approx P(u'|v').$$\n",
"\n",
"$E(u)$ nazywamy zanurzeniem (embeddingiem) słowa.\n",
"\n"
]
},
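{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of such a mapping $E$ (not part of the original lecture code), assuming PyTorch's `nn.Embedding`; the sizes below are arbitrary example values:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"from torch import nn\n",
"\n",
"# An embedding E: V -> R^m realized as a lookup table (example sizes only)\n",
"example_vocab_size, m = 10, 4\n",
"E = nn.Embedding(example_vocab_size, m)\n",
"\n",
"# A word is represented by its index in V; E maps it to a vector in R^m\n",
"w = torch.tensor([3])\n",
"E(w).shape"
]
},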
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Wymiary określone z góry?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Można by sobie wyobrazić, że $m$ wymiarów mogłoby być z góry\n",
"określonych przez lingwistę. Wymiary te byłyby związane z typowymi\n",
"„osiami” rozpatrywanymi w językoznawstwie, na przykład:\n",
"\n",
"- czy słowo jest wulgarne, pospolite, potoczne, neutralne czy książkowe?\n",
"- czy słowo jest archaiczne, wychodzące z użycia czy jest neologizmem?\n",
"- czy słowo dotyczy kobiet, czy mężczyzn (w sensie rodzaju gramatycznego i/lub\n",
" socjolingwistycznym)?\n",
"- czy słowo jest w liczbie pojedynczej czy mnogiej?\n",
"- czy słowo jest rzeczownikiem czy czasownikiem?\n",
"- czy słowo jest rdzennym słowem czy zapożyczeniem?\n",
"- czy słowo jest nazwą czy słowem pospolitym?\n",
"- czy słowo opisuje konkretną rzecz czy pojęcie abstrakcyjne?\n",
"- …\n",
"\n",
"W praktyce okazało się jednak, że lepiej, żeby komputer uczył się sam\n",
"możliwych wymiarów — z góry określamy tylko $m$ (liczbę wymiarów).\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Bigramowy model języka oparty na zanurzeniach\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zbudujemy teraz najprostszy model język oparty na zanurzeniach. Będzie to właściwie najprostszy\n",
"**neuronowy model języka**, jako że zbudowany model można traktować jako prostą sieć neuronową.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Słownik\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"W typowym neuronowym modelu języka rozmiar słownika musi być z góry\n",
"ograniczony. Zazwyczaj jest to liczba rzędu kilkudziesięciu wyrazów —\n",
"po prostu będziemy rozpatrywać $|V|$ najczęstszych wyrazów, pozostałe zamienimy\n",
"na specjalny token `<unk>` reprezentujący nieznany (*unknown*) wyraz.\n",
"\n",
"Aby utworzyć taki słownik, użyjemy gotowej klasy `Vocab` z pakietu torchtext:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from itertools import islice\n",
"import regex as re\n",
"import sys\n",
"from torchtext.vocab import build_vocab_from_iterator\n",
"import pickle\n",
"import lzma"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1027"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from itertools import islice\n",
"import regex as re\n",
"import sys\n",
"from torchtext.vocab import build_vocab_from_iterator\n",
"import lzma\n",
"\n",
"\n",
"def get_words_from_line(line):\n",
" line = line.rstrip()\n",
" yield '<s>'\n",
" for m in re.finditer(r'[\\p{L}0-9\\*]+|\\p{P}+', line):\n",
" yield m.group(0).lower()\n",
" yield '</s>'\n",
"\n",
"\n",
"def get_word_lines_from_file(file_name):\n",
" with lzma.open(file_name, 'r') as fh:\n",
" for line in fh:\n",
" yield get_words_from_line(line.decode('utf-8'))\n",
"\n",
"vocab_size = 20000\n",
"\n",
"vocab = build_vocab_from_iterator(\n",
" get_word_lines_from_file('train/in.tsv.xz'),\n",
" max_tokens = vocab_size,\n",
" specials = ['<unk>'])\n",
"\n",
"vocab['human']"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['<unk>', '\\\\', 'the', '-\\\\', 'nmighty']"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"vocab.lookup_tokens([0, 1, 2, 10, 12345])"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"with open('vocabulary.pickle', 'wb') as fh:\n",
" pickle.dump(vocab, fh)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Definicja sieci\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Naszą prostą sieć neuronową zaimplementujemy używając frameworku PyTorch.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/jacob/opt/anaconda3/lib/python3.9/site-packages/torch/nn/modules/container.py:217: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.\n",
" input = module(input)\n"
]
},
{
"data": {
"text/plain": [
"tensor(2.9869e-05, grad_fn=<SelectBackward0>)"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from torch import nn\n",
"import torch\n",
"\n",
"embed_size = 100\n",
"\n",
"class SimpleBigramNeuralLanguageModel(nn.Module):\n",
" def __init__(self, vocabulary_size, embedding_size):\n",
" super(SimpleBigramNeuralLanguageModel, self).__init__()\n",
" self.model = nn.Sequential(\n",
" nn.Embedding(vocabulary_size, embedding_size),\n",
" nn.Linear(embedding_size, vocabulary_size),\n",
" nn.Softmax()\n",
" )\n",
"\n",
" def forward(self, x):\n",
" return self.model(x)\n",
"\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size)\n",
"\n",
"vocab.set_default_index(vocab['<unk>'])\n",
"ixs = torch.tensor(vocab.forward(['is']))\n",
"out = model(ixs)\n",
"out[0][vocab['the']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz wyuczmy model. Wpierw tylko potasujmy nasz plik:\n",
"\n",
" shuf < opensubtitlesA.pl.txt > opensubtitlesA.pl.shuf.txt\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"from torch.utils.data import IterableDataset\n",
"import itertools\n",
"\n",
"def look_ahead_iterator(gen):\n",
" prev = None\n",
" for item in gen:\n",
" if prev is not None:\n",
" yield (prev, item)\n",
" prev = item\n",
"\n",
"class Bigrams(IterableDataset):\n",
" def __init__(self, text_file, vocabulary_size):\n",
" self.vocab = build_vocab_from_iterator(\n",
" get_word_lines_from_file(text_file),\n",
" max_tokens = vocabulary_size,\n",
" specials = ['<unk>'])\n",
" self.vocab.set_default_index(self.vocab['<unk>'])\n",
" self.vocabulary_size = vocabulary_size\n",
" self.text_file = text_file\n",
"\n",
" def __iter__(self):\n",
" return look_ahead_iterator(\n",
" (self.vocab[t] for t in itertools.chain.from_iterable(get_word_lines_from_file(self.text_file))))\n",
"\n",
"train_dataset = Bigrams('train/in.tsv.xz', vocab_size)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(43, 0)"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"next(iter(train_dataset))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[tensor([ 2, 5, 51, 3481, 231]), tensor([ 5, 51, 3481, 231, 4])]"
]
}
],
"source": [
"from torch.utils.data import DataLoader\n",
"\n",
"next(iter(DataLoader(train_dataset, batch_size=5)))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"None"
]
}
],
"source": [
"device = 'cpu'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"data = DataLoader(train_dataset, batch_size=5000)\n",
"optimizer = torch.optim.Adam(model.parameters())\n",
"criterion = torch.nn.NLLLoss()\n",
"\n",
"model.train()\n",
"step = 0\n",
"for x, y in data:\n",
" x = x.to(device)\n",
" y = y.to(device)\n",
" optimizer.zero_grad()\n",
" ypredicted = model(x)\n",
" loss = criterion(torch.log(ypredicted), y)\n",
" if step % 100 == 0:\n",
" print(step, loss)\n",
" step += 1\n",
" loss.backward()\n",
" optimizer.step()\n",
"\n",
"torch.save(model.state_dict(), 'model1.bin')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Policzmy najbardziej prawdopodobne kontynuacje dla zadanego słowa:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('ciebie', 73, 0.1580502986907959), ('mnie', 26, 0.15395283699035645), ('<unk>', 0, 0.12862136960029602), ('nas', 83, 0.0410110242664814), ('niego', 172, 0.03281523287296295), ('niej', 245, 0.02104802615940571), ('siebie', 181, 0.020788608118891716), ('którego', 365, 0.019379809498786926), ('was', 162, 0.013852755539119244), ('wszystkich', 235, 0.01381855271756649)]"
]
}
],
"source": [
"device = 'cuda'\n",
"model = SimpleBigramNeuralLanguageModel(vocab_size, embed_size).to(device)\n",
"model.load_state_dict(torch.load('model1.bin'))\n",
"model.eval()\n",
"\n",
"ixs = torch.tensor(vocab.forward(['dla'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Teraz zbadajmy najbardziej podobne zanurzenia dla zadanego słowa:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('.', 3, 0.404473215341568), (',', 4, 0.14222915470600128), ('z', 14, 0.10945753753185272), ('?', 6, 0.09583134204149246), ('w', 10, 0.050338443368673325), ('na', 12, 0.020703863352537155), ('i', 11, 0.016762692481279373), ('<unk>', 0, 0.014571071602404118), ('...', 15, 0.01453721895813942), ('</s>', 1, 0.011769450269639492)]"
]
}
],
"source": [
"vocab = train_dataset.vocab\n",
"ixs = torch.tensor(vocab.forward(['kłopot'])).to(device)\n",
"\n",
"out = model(ixs)\n",
"top = torch.topk(out[0], 10)\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[('poszedł', 1087, 1.0), ('idziesz', 1050, 0.4907470941543579), ('przyjeżdża', 4920, 0.45242372155189514), ('pojechałam', 12784, 0.4342481195926666), ('wrócił', 1023, 0.431664377450943), ('dobrać', 10351, 0.4312002956867218), ('stałeś', 5738, 0.4258835017681122), ('poszła', 1563, 0.41979148983955383), ('trafiłam', 18857, 0.4109022617340088), ('jedzie', 1674, 0.4091658890247345)]"
]
}
],
"source": [
"cos = nn.CosineSimilarity(dim=1, eps=1e-6)\n",
"\n",
"embeddings = model.model[0].weight\n",
"\n",
"vec = embeddings[vocab['poszedł']]\n",
"\n",
"similarities = cos(vec, embeddings)\n",
"\n",
"top = torch.topk(similarities, 10)\n",
"\n",
"top_indices = top.indices.tolist()\n",
"top_probs = top.values.tolist()\n",
"top_words = vocab.lookup_tokens(top_indices)\n",
"list(zip(top_words, top_indices, top_probs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Zapis przy użyciu wzoru matematycznego\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Powyżej zaprogramowaną sieć neuronową można opisać następującym wzorem:\n",
"\n",
"$$\\vec{y} = \\operatorname{softmax}(CE(w_{i-1}),$$\n",
"\n",
"gdzie:\n",
"\n",
"- $w_{i-1}$ to pierwszy wyraz w bigramie (poprzedzający wyraz),\n",
"- $E(w)$ to zanurzenie (embedding) wyrazy $w$ — wektor o rozmiarze $m$,\n",
"- $C$ to macierz o rozmiarze $|V| \\times m$, która rzutuje wektor zanurzenia w wektor o rozmiarze słownika,\n",
"- $\\vec{y}$ to wyjściowy wektor prawdopodobieństw o rozmiarze $|V|$.\n",
"\n"
]
},
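{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sketch (assuming the `model`, `vocab`, and `device` defined above; this cell is not part of the original lecture), we can recompute the formula by hand from the network's layers and compare it with the network's output. Note that `nn.Linear` also adds a bias term $b$, which the formula above omits:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"w = torch.tensor(vocab.forward(['is'])).to(device)\n",
"embedding_layer = model.model[0]  # computes E(w)\n",
"linear_layer = model.model[1]     # multiplication by C (plus the bias b)\n",
"\n",
"manual = torch.softmax(linear_layer(embedding_layer(w)), dim=1)\n",
"torch.allclose(manual, model(w))"
]
},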
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Hiperparametry\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Zauważmy, że nasz model ma dwa hiperparametry:\n",
"\n",
"- $m$ — rozmiar zanurzenia,\n",
"- $|V|$ — rozmiar słownika, jeśli zakładamy, że możemy sterować\n",
" rozmiarem słownika (np. przez obcinanie słownika do zadanej liczby\n",
" najczęstszych wyrazów i zamiany pozostałych na specjalny token, powiedzmy, `<UNK>`.\n",
"\n",
"Oczywiście możemy próbować manipulować wartościami $m$ i $|V|$ w celu\n",
"polepszenia wyników naszego modelu.\n",
"\n",
"**Pytanie**: dlaczego nie ma sensu wartość $m \\approx |V|$ ? dlaczego nie ma sensu wartość $m = 1$?\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Diagram sieci\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jako że mnożenie przez macierz ($C$) oznacza po prostu zastosowanie\n",
"warstwy liniowej, naszą sieć możemy interpretować jako jednowarstwową\n",
"sieć neuronową, co można zilustrować za pomocą następującego diagramu:\n",
"\n",
"![img](./09_Zanurzenia_slow/bigram1.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka\")\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Zanurzenie jako mnożenie przez macierz\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Uzyskanie zanurzenia ($E(w)$) zazwyczaj realizowane jest na zasadzie\n",
"odpytania (<sub>look</sub>-up\\_). Co ciekawe, zanurzenie można intepretować jako\n",
"mnożenie przez macierz zanurzeń (embeddingów) $E$ o rozmiarze $m \\times |V|$ — jeśli słowo będziemy na wejściu kodowali przy użyciu\n",
"wektora z gorącą jedynką (<sub>one</sub>-hot encoding\\_), tzn. słowo $w$ zostanie\n",
"podane na wejściu jako wektor $\\vec{1_V}(w) = [0,\\ldots,0,1,0\\ldots,0]$ o rozmiarze $|V|$\n",
"złożony z samych zer z wyjątkiem jedynki na pozycji odpowiadającej indeksowi wyrazu $w$ w słowniku $V$.\n",
"\n",
"Wówczas wzór przyjmie postać:\n",
"\n",
"$$\\vec{y} = \\operatorname{softmax}(CE\\vec{1_V}(w_{i-1})),$$\n",
"\n",
"gdzie $E$ będzie tym razem macierzą $m \\times |V|$.\n",
"\n",
"**Pytanie**: czy $\\vec{1_V}(w)$ intepretujemy jako wektor wierszowy czy kolumnowy?\n",
"\n",
"W postaci diagramu można tę interpretację zilustrować w następujący sposób:\n",
"\n",
"![img](./09_Zanurzenia_slow/bigram2.drawio.png \"Diagram prostego bigramowego neuronowego modelu języka z wejściem w postaci one-hot\")\n",
"\n"
]
}
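,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a sketch (assuming the `model`, `vocab`, and `vocab_size` defined above; this cell is not part of the original lecture), we can check that an embedding look-up gives the same vector as multiplying by a one-hot vector. PyTorch stores the embedding matrix as $|V| \times m$, i.e., transposed with respect to the $E$ above, so here $\vec{1_V}(w)$ acts as a row vector:\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import torch\n",
"\n",
"# nn.Embedding stores the matrix as |V| x m (the transpose of E above)\n",
"E_matrix = model.model[0].weight\n",
"\n",
"w_ix = vocab['the']\n",
"one_hot = torch.zeros(vocab_size, device=E_matrix.device)\n",
"one_hot[w_ix] = 1.0\n",
"\n",
"# the look-up and the one-hot multiplication give the same vector\n",
"torch.allclose(one_hot @ E_matrix, E_matrix[w_ix])"
]
}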
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
},
"org": null
},
"nbformat": 4,
"nbformat_minor": 1
}

10519
dev-0/out-embed-100.tsv Normal file

File diff suppressed because it is too large

10519
dev-0/out-embed-500.tsv Normal file

File diff suppressed because it is too large

10519
dev-0/out.tsv Normal file

File diff suppressed because it is too large


@@ -1,158 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoTokenizer\n",
"from datasets import load_dataset"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dataset prep"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"model_checkpoint = \"distilroberta-base\""
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"def tokenize_function(examples):\n",
" return tokenizer(examples[\"text\"], max_length=512, truncation=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tokenized_datasets = datasets.map(tokenize_function, batched=True, num_proc=4, remove_columns=[\"text\"])"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Model training"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AutoModelForMaskedLM\n",
"from transformers import Trainer, TrainingArguments"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'AutoModelForMaskedLM' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[11], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m model \u001b[39m=\u001b[39m AutoModelForMaskedLM\u001b[39m.\u001b[39mfrom_pretrained(model_checkpoint)\n",
"\u001b[0;31mNameError\u001b[0m: name 'AutoModelForMaskedLM' is not defined"
]
}
],
"source": [
"model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model_name = model_checkpoint.split(\"/\")[-1]\n",
"training_args = TrainingArguments(\n",
" f\"{model_name}-finetuned-america\",\n",
" evaluation_strategy = \"epoch\",\n",
" learning_rate=2e-5,\n",
" weight_decay=0.01,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=dataset[:len(dataset)*0.8],\n",
" eval_dataset=dataset[len(dataset)*0.8:]\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"trainer.train()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

13
gonito.yaml Normal file

@@ -0,0 +1,13 @@
description: nn, trigram, previous and next
tags:
- neural-network
- trigram
params:
epochs: 1
vocab-size: 20000
batch-size: 10000
embed-size:
- 100
- 500
- 1000
topk: 10


@@ -1,166 +0,0 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"from transformers import pipeline, AutoTokenizer\n",
"import lzma"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"tokenizer = AutoTokenizer.from_pretrained(\"distilroberta-base\", use_fast=True)\n",
"classifier = pipeline(\"fill-mask\", model=\"distilroberta-base-finetuned-america\", tokenizer=tokenizer)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def preprocess_data(X):\n",
" parsed_data = []\n",
"\n",
" for line in X:\n",
" left = line.strip().split('\\t')[6].replace('\\\\n', ' ').split()\n",
" right = line.strip().split('\\t')[7].replace('\\\\n', ' ').split()\n",
" if len(left) + len(right) > 450:\n",
" print('truncating -----------')\n",
" print(f\"before: {' '.join(left)} {' '.join(right)}\")\n",
" text = ' '.join(left[-100:]) + f' <mask> ' + ' '.join(right[:100])\n",
" else:\n",
" text = ' '.join(left) + f' <mask> ' + ' '.join(right)\n",
"\n",
" parsed_data.append(text)\n",
"\n",
" return parsed_data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"with open('test-A/in.tsv', mode='rt', encoding='utf-8') as f:\n",
" X = f.readlines()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"X = preprocess_data(X)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def parse_output(out):\n",
" string = ''\n",
" for i in out:\n",
" string += f\"{i['token_str']}:{i['score']} \"\n",
"\n",
" rest = 1 - sum([i['score'] for i in out])\n",
" string += f\":{rest}\"\n",
"\n",
" return string"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Token indices sequence length is longer than the specified maximum sequence length for this model (576 > 512). Running this sequence through the model will result in indexing errors\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"had no relations nearthetn, they gave him bert. There were many whodid not love a cordial welcome. He was o very hand- him, and more to whom his stern integri- sotuc, fascinating man; but if you looked ty was a reproach, but none had yet utter- long or closely on him there was a fierce ed those falsehoods that aim at dishonor- gleam in his dark eye, a w ithering sneer ing merit. That deeper wound was re- on his lip that caused you distrust his mo- served for the noon of his manhood, when dulated tones ot courtesy and blandness heart and brain and mind, if they be se- of manners. Ilis conversational powers parable, were alike in the full vigor of wore versatile and charming, and he seem- susceptibility. Ilis talents pointed him cd to have traversed tho globe, so thor- out as a fit person to represent his dis- ough and accurate was bis almost univer- trict in the national legislature, and his sal intelligence. It was astonishing w hat integrity was so universally admitted as oceans of wine and punch he could swal- his powers of intellect. He was accord- low without being at all intoxicated. He ingly returned for Congress and took his hunted too, and in short, was accomplish- seat beneath the dome of that capitol that <mask> in ail those things in which poor Robert should garner none but the sage and the was sadly deficient and which his uncle patriot, but beneath which the sordid and so highly prized. At first he courted Rob- unprincipled are too often thrust Ilisre- ertt society, but a change soon came and putation for high, unbending honor and he devoted all his time to the old man. spotless Roman virtue had preceded him Now that he had a companion in his rev- his peers received him with courtesy; but els, the planter gave up to deep drinkfng. it was not long before ho learned to look Robert never came in his sight but to be with contempt, ht took no pains toconceal cursed, and fie often attempted to cane his or qualify, on the occupations to which fine and manly son for supposed or trivial honorable members stooped, or at least offences, and upon Roberts asking him tacitly permitted. Stern and uncompro- to marry o beautiful, but portionless or- mising he stood aloof from the fetters of phan in the neighborhood, he drove him party and denounced with bitter scorn the from his presence and his roof without gross violations of truth and honor, his any provision being made for his present strong sense detected in many public func- or future maintenance.\n",
"There are two machines upon w.h one is apt to look as having ezhaus the inventive genius of the Yankee tion without having accomplished a «juate results. 'We mean Washing H chines and Churns. More timo n money have been spent In attempt! to |>$rfect substantial Improvements theseHwo machines' than on any otl two in all the long list that haveengi ed the attention of inventors sine* patent office had an existence.' 1 pertinacity with which inventors cc tinue their endeavors to mitigate 1 hardships of these two branches household industry, speaks well fori gallantry of this large and respectal class of citizens, ana entitles them the favorable consideration of all 1 housewives of the land. We have c amined a churn patented by Mr. lis of New York city. whicfa for silnplicl of construction is barely surpassed the old-fashioned Dasher churn itsc It is Biinply a neat, plain box, susper, ed In a framo so as to socuipa pend lam or swinging motion. The dast is a piece of nard board, full of hol< and fastened vertically to a staff, ai when put into the churn .ttie staff is t cured to1 a <mask> by a wedge slipped with the hand. The good lady, or wh ever does the churnlug, when .wear* with oth6r labor, can Just Bit down in chair, and by the aid of a stick hook< to the lower edge of the churn box, ci do her churning with very little effoi and get up from what is ordinarily co; sidered a severe task, thoroughly res Mr. Hall claims for his churn l fancy Illusions such as bring butter! five minutes, and. many others of 111 nature, but merely that it is essentiall a labor saving machine; that it does i work as well as any churn ever use that It requires no more labor to opera It than it does to rock a cradle. %The is not a wheel or cog about it, and r auires no more skill to use it than oes to operate the common old fas! ioned dasher churn. We have exau ined the'churn, and are satisfied th» Mr.;Hall claims nothing foe his chur that Is* toot frilly apparent' at a slngl glance. Mr. Hall can be fonnd at U Grant House, where any one wishin to purchase county rights could do a on\n",
"C0N8TAKTIxori.K, Oct. C . .A baiul of brigands yesterday, in spito of the ro- cunt diplomatic action of Germany and France, and of tho efforts of the Porto to suppress brigandage, made a desper¬ ate attempt to wreck and rob a passon- gor train. Tho latter was missing along a desolate portion of tho ilaidar-Pacal- amidt railroad, when tho engineer dis¬ covered that eomuthing was wrong along tho rails ahead of his train. Tho train wan brought to a standstill as soon as possible, and an examination of tho lino showed that the, much dreaded brigands had torn up the rails and so damaged the road bed that had tho train not boon stopped in time an acci¬ dent would have surely happened. Tho brigands, as soon as they saw that their plan had miscarried, instead of attacking tho train, decamped. This is only one of similar outrages upon tho part of Turkish brigands. Mm. lUitller and EugenodoKaymond, sub-managers of a vineyard company, wore captured early in August last by tho brigands of Chief Athanasias, and were <mask> 011 payment of 5,000 Turkish pounds, Later in tho samo month several Italian railroad ollicials were carried off by Chief Mohadisn, and others were murdered. Hero again a ransom of $10,000 had to be paid for tho releaso of the captured railroad ofllcials. On .June I last the same band, that of Athanasius, placed obstructions across tho railroad track near Tcherossdoii, derailed an express train and captured several German tourists, lor their ransom -10,000 was paid. Athaniasus is pictured as being a bri¬ gand of the old school, an oriental Claude Duval practicing the tradition of robber courtesy and building up a huge fortune for himself at tho expense of the Sultan's privy purse, ior tho de¬ mands 0i the ambassadors for compen¬ sation lor brigand outrages have been complied with from that fund. Tho Oriental Railway Company, as tho re¬ sult of the recent brigandage outbreaks, has demanded special guarantees from the Porte if it in to carry on its traflic, all tiio more as it is suggested that the Indian mail should take that route.\n",
"Recent letters from Germany show that tho social as well as the political condition of the empire is far from satisfactory. The1 Germans have taken pride in their freedom rom nervous restlessness and selfish hard ness of other nationalities more given over to active business, and to looking at every-1 thingfromamerepeeuniarystandpoint. No people were more contented in the simple home enjoyments, or satisfied with the do¬ mestic comforts, if we may believe their own accounts, or tho testimony of stran¬ gers who have spent much time with them. N'ow, intelligent representatives of tho lat¬ ter class inform us that tho Germans are peevish, irritable, excited, given to pecun¬ iary speculation, and in many respects, changed much for tho worse. Even the skeptical Dr. Schenkel thinks they need preaching to. He would not enforce the old religious dogmas, but would enforcj the preeminent importance of the princi¬ ple of universal love. There are not wan ting ouier witnesses to the change, but we have cited a sufficient nmber. Among the causes of the deterior¬ ation may be numbered the decay of ac¬ cepted moral principles, following the re¬ pudiation of the dogmas on <mask> they were based; the demoralization that en¬ sued from the national pride inspired bv the conquest of France, also the pecuni¬ ary inflation produced by the French mil¬ liard*, and the disorganizing influence of. the Socialism of late so prevalent Com¬ munistic theories have no doubt been ren¬ dered attractive as offering a possible way of escape from a burdensome military ser- yico, while the growth of skepticism al¬ ready noticed has provided a moral vacuum to be filled by them. * The moststriking proof that all is not well is afforded, however, by the government's change of policy toward the Roman Church. A few days ago we had occasion to notice the fact, previously denied, that the nego¬ tiations between the empire and the Vati¬ can had borne important fruit. The re¬ sult in brief was that while the former would not recall any of its former acts or pay back fines already received, it would permit a resumption of duties on the part of Bishops and priests, and would not ex¬ act the discharge of penulties affixed but not discharged. There may have boon more important concessions not yet made public.\n",
"the Superintendent of the State. War, and Navv- - Department Building. Scaled proposals in duplicate, indorsed \"Pro- posals for Fuel,\" will be received at this ofiiee until 2 p. m on THURSDAY, May 7, 1896, to supplv the State, War, and Navy Department Building with f'ici during the fiscal year ending Jane 30, 1897, a follows 5.000 tons or extra hard white ash furnace coal. 25 tons of white ash stove coai, 100 cords of lncktrr wood, 100 cords or cak wood, and 50 cords ot spruce pine wood. All coal to be or best quality, free from dust or im- purities, ami inspected by a person who shall be designated by the Superintendent, and to be wcighcii upon the government scales in the court yard. All wood to be of the best quality and inspected by a pe'rson who shall be designated by the Superintendent. The hickory and oak wood to be sawed into three pieces nnd measured after it is saweel and delivered. The coal and wood to be de- livered at the State, War, and Navy Build- ing and toreel in the vaults by the party or parties to whom the contract or con- <mask> may be awarded at such time and in such quantities as the eonvenience of the oflice may require. Reserving -- the right to order as much more or as much lessor either coal or word as may be re- quired at the contract price; also the right to reject any or all bids, or to accept anv portion or any bid. The successful blutTer to furnish bond in the sum of S5.000 as a guarantee of the lnithful performance of the contraet. G. W . BA1RD, Chief En- gineer, U. S. N., Superintendent. PROPOSALS FOR MISCELLANEOUS ITEMS OfI ice of the Superintendent of tlie State, War. and Navy Building. Sealed proposals in duplicate, indorsee! \"Propo- sals for Miscellaneous Items.\" will be received at this oflice until 2 p. m . on THURSDAY, May 7, 1896, for mrnishing this office, during the fiscal year ending June30, 1897, with soap, brushes, sponges, paints, oils, towels, crash, nails, screws, etc., etc. Schedules, forms of proposals, and ail necessary information can be ob- tained upon application to G. W. BAIRD, Chief Engineer, U. S .N. . Superintendent. apll.l8 .25,26.my2,6 PHOPOSALS FOR MISCELLA- NEOUS sui'PLIES FOR THE POST-OF F IC - E\n",
"rccognlzed the groat fact that the in­ dustrial classes in this country, and the better sentimrats of onr civilisa­ tion. alike demanded that diplomacy should be exhausted before a resort t» arms was contemplated. And the pro­ ductive industry and peaceful arts which to-day blees oar country are largely the result of this wisdom. The settlement of thi# question by n new system of arbitration has also fixed a potnt in th# woHd'a history and estab­ lished a precedent which it is believed will do more to prevent In th# ffctnre ihe wickedness and waste of war thin any other #vent of tb# eeniory. Again, whan from maay quartern there am# an attempt to complicate ofli, relations and discourage n«gotlations with Spain, In th#bop# that war Woold result fTom the Cuban difficulty, the President was neolat# in the position, that diplomacy sbaald be eAaasted first, and that war sbtfuld bfjthe last «nd a reluctant mart. T%t||laking •nasses felt that the young ilea of onr country would be more usefbit* them- -elves and tbe world, engaged in pro luctlve Industry, tbaajmwa«>% their ives iu camps, and oh mavehe<, to :hase down a ,/#w bloodthirsty Span- tarda; aatftth^lSAtaa that we Want •id a greater bniaath of th# eountry we <mask> aoeaaas brought Into cultiva­ tion, aad naw Industries developed to IIversify and employ profitably tbe nbor of tbe p#op!o, aer# than we wanted Cuba or its population. Upon >hls theory tb# admlalstration ietad, ml b!atory will attaat Ita wisdom. I now turn to matters mora Imma* ttately eaanaetad with oat #<aN gov- irnment, Th# faet thUta di^r wa are •ot ful y recovered from the effects af i financial panic auggeata the probabU- ty that th# public mind will look with nora than ordinary anxiety far iegis- atfva remedle#. It baa tweome a •abit of American thoaght, wkt#b rises almost to the eharacter of a nanla, to believe that, for all tbo pollt- ical ilia to wUeh human economy 1* •elr. either miw statate, or aa amend neat to soma existing atab** w.ll b» to efTeotive eura. Almoe* #v#ry man •as a panacca whloh la hie judgment #111 Improve tha bualnaas proeperltr •f tbe country, and secure the future igalnet tha recurrane# of ttaa| fipaab -ial disasters which have beTetafoiV periodically a*l#ct#d all rnmmsislV 'latlona. In Iowa thaaa dbtarbtinom tave increaaed the former grlevanae» >f egricutture, and have eat la%Riry on ip>to#lnseaiehofaraB»aay. MTeare told by aaa thai Area bnakthg wll duce order ont of financial\n",
"Tlfoc, closli e Ht 78^0. Corn, cash No. 2, 40%^; October «Kh^;o, closing unOKo; November41% a4tj<£c, clod: gat 41X«: December 4t9iai;c.cloa« lug ht 41c; Umy 44%t45c, Closing r; -uy{fl OaM, cash No. J . 3io: October 2So^sW5, cosing at ioKc: November 2&5£o; Doeamber 25Kc: May !WKa29Hc, cloving at 2I&& itye, No. 2 , G«c. Barley, No 2,71a KlHS^ta, No. J . 81 08K. Prlmo Urn' otby scefl S2'j0a222. Moss Pork ?l3 CO: January i\\\\210a1217KC. Closed at 81212& Lord, per 1C0 Its. , 6 8f»c: October 6 80o; November G.17XaG22Xc. closed at C 17%o; .Jaauarv 6 22Xa0 25c, cla»od at 6.22V4c: May C.bfcJ*c, closed at 6.62>(a Bacon, Bhortrlba7l)Cc; shoulders &0Qa&.20o; short dear 4iao7.lCa7.20c. Whisky 8110. Sugar*, cutloat 7o; granulated 6jfc; standard A Q>4c. Butter, markot quiet at l&atto lor creamery; lCallc^for dairy, £ggs VJalsc and quiet. Fbi<. iDtLrmk- Pa.. Oct. 74..Klonr stoady with moderato demand; Ohio and other Western <mask> 8i '.U'a-t 10: do. atrslgtu 81 10n4 2d: winter patent |t 3-iaft 00; Minnesota clear old wheat 84 '^5; do atralghi« 87Vial CO; do patent« 75a4 SO. Wheat auletand weak; No. 2 red on track 82J<o: No 2 red ctobcr 82a82>4o; Novemner 8'.%h}?3c; Dumber 83%aB4o; January 8tJ4iS5c. Corn, spot'dull aud wcak;lulurut sto«dy;No 2 mixed on track &3Ma 63K«: do In gtaln dei ot &Uko: No. 2 mixed October 5la52c: November Nofcio; December 48!*a49c: Jan* uary4M*a49o, Gats, *pot In moderate demand; No 3 mixed 82c; No. 2 mixed81o; No. 8 whlto34o; No. 2 white 34Ka35c: futures iteady; No. 2 white I October 84^aVo; November 8t5ia36c; December I I S5V.iiIlI»?^c: January 8fia£(ft£r. 1 request and steady. Fori, mess 515 50; do prime mens new 31& CO] do family S1Q 00a38 60. Butter i firm P.nd demand Mr ror lro»h good'? creamery cztra ific; western factory Italic, ChMM quiet J and steady.\n",
"arriving at the place, found a young ecuted by order of General Jacksou m the white-man stripped naked, bound to a tree Seminole war of 1817, 18, and believing and his captors preparing to put him to that* the circumstances of her history pre- death. On observing this, Milly instant- seirîed a case of very peculiar interest, I ly went to herfathei, who, as before sta*, mule it a point to obtain from hersell a ted, was she Prophet Francis, and a prin-J statement of her conduct in 1818, when, cipal chief of the nation, and besought; as public history has already recorded,she him to save the prisoners life. This he saved the life of an American citizen, whp declined, saying at the same time, that he was a prisoner in the power of some of had no power to do so. She then turned her tribe. The history states that the to his captors, and begged them to spare white man was about to be burned alive, the life of the white man; but one of but was saved by the interposition of the them who had lost two sisters in the war prophets daughter. Being in the viçim- refused to listen to <mask> supplications in be* ty of the Indian girl, near the mouth oj half of the prisoner, declaring that his the Verdigris river, and being acquainted life should atone for the wrongs which he with a portion of her history, 1 rode sever- had received at the hands of the white al miles to hear her story from herself, people The active humanity of Milly 1 had been informed that she has a claim would not be discouraged. She reasoned to some negro property, now held by the and entreated, telling the vindictive sa- Seminoles; and I first questioned her vnge who was bent on the destruction of relation to her claim, and then directed the prisoner that his death would not re- her mind back to 1818, and told her I had store his sisters to life. Aftern long time heard that she had saved the life ofa white spent in her generous effort, she succeed- man in the war ofthat year. She answer­ ed in rescuing the prisoner from the dread- ed that she had, and immediately gave me ful death to which he had been doomed by a very minute and graphic account of the his cruel captors. The condition on which circumstances.\n",
"In a music store on Third street, be- tween Marion and Columbia, there is an old piano which attracts much attention. The old musical instrument is of the upright style and is in a fair state of preservation, though it is nearly one hun- dred year3 old. It has a keyboard with white keys for the regular notes, and black keys for the sharps and flats, just like the pianos of today. These, when deftly touched, cause the ancient instru- ment to discourse most eloquently. No one could tell its great age by hear- ing it played on. Its tones are still har- monious and tuneful, though, of course, it cannot be compared with tfee best pianos of today, xf hen volume or modula- tion of tone is considered. Its front is ornamented with wooden scrollwork, behind which is a crimson cloth of fine texture. The frame on which the stringa are stretched is of wood, while the frame of the modern piano is of iron. The double row of keys is followed to this day, and the interior construction is much the same as in vogue at present. The fact that the ancient instrument is in such a good state of preservation is a high tribute to the old time piano makers. They built their instruments <mask> last. This is said not to be the case with many of the present piano manufacturers. The superannuated instrument has an interesting history. The Nineteenth Century had counted off but three years when it -- was bought by an English gen- tleman for his family of the makers, J. & J. Hopkinson, of Regent street, Lon- don. It was made in the year 1803 and sold in 1803. It passed as an heirloom from one member of the family to an- other until it came into the possession of a branch that left London for America in the year 1334. The voyage was made in the celebrated ship Robert Lowe. During the voyage a heavy gale was en- countered, and the piano was washed overboard with other things, but was finally fished out of the briny ocean. The family that brought the instru- ment to America settled at Victoria, B. C, and they passed away one by one until only two sisters were left. Finally one of these died and the other became insane with grief. Then it became nec- essary to administer on the estatejf the sisters, and the piano was sold by order of the probate court. The instrument then fell into the hands of a gentleman named Johnson, who resided in Victo- ria.\n",
"\" With Israel fully restored—raised from tho dead, able to see und able to speak, great results will follow. Tho Millenium will bo at once Introduced. Verse 35 Is a picture of It; Jesus going about all the cities and villages, leaching In their syn­ agogues, and preaching the Gospel of tho Kingdom, and healing ALL manner of di­ sease and ALL manner of sickness. \"This is what 1» will bo when the King comes. In the meantime the multltudee are Just like a great lot of sheep distre»»- ed and scattered, not having a shepherd. H» 1» moved with compassion, and Ha puts the remedy Into the hands of HI» disciple». That remedy 1» prayer. The harvest truly I» plenteous, but the labor­ ers are few. Pray ye, therefore, the Lord of the harvest, that He send forth labor­ ers into His harvest,' (V. 38.) “And that I» His plan, for the King­ dom not only, but also for the Church. Is it not true today that the harvest Is plenteous? Lift up your eyes and look on the fields, that they are white already unto harvest,' (John 4; 35 ) An'- are not the laborers few? With all our boasted foreign <mask> movements, we are only trifling with the great task, and ou» program is backward Instead .of forward. And what shall we do about it? Do about It? What can we do about It? Apart from Him we can do nothing. But we can da ail things through Him that strength­ ened! us: and It is our business to do thld thing, it is not something that only <a few can do. We can all pray; and if wig know not what we should pray for aj* we ought, tho Spirit Is here to help ou» infirmity; and Ho will lead us to pray th Lord or harvest, that He send forth Is borers into His harvest. It Is written ■Whosoever shall call upon the name o the Lord shall be saved. How thon chai they call on Him In whom they have believed? and how shall they believe 1 Him whom they have nol heard? and ho shall they heed without a preacher? how shall they preach, except they b< sent? And how shall they be sent, ex cept He. tho Lord of tha harvest, shai send them. Otherwise their going h worse than their remaining at home. Maj God help us to pray!”\n",
"*10.35 a. m .; *12.21, *2.29, *3.29, *5.22, *7.43, *1 p. m . Sundays, *3.13, »9.40, *11.25 a. m. ■8.29, *5.22 . *7.43, *11 ». m. PHILADELPHIA, week days, *3.13 . 5 .55 4.40, *7.16, 7.35. *8.25, 9.00. *9.40, *10.25. 11 .10 a tn.: *12.21, 1.20, *2.29, *3.29, 3.50, • •7.43, 9.15, *U p. m . Sundays. *3.13, 7.35 8.50, *9.40, *11.25, 11.25 a. in.; *3.29, 3.60, *6.22 6.30, *7.43, 9.15, *11 p. m. CHESTER, week days, *3.13 , 6.65, •7.16. 7.35. *8.25, 9.00, *9.40, *10.25. 11 .10 a. m. l.20, *2.29, 3.50, *5.22, 6.30, *7.43, 9.15. *11 p. m Sundays, *3.13, 7.25. 8.50, *9.10, *11.25, U.25 a m.; *3.29, 3.50, *5.22, 6.39, *7.43, 9.15, *11 p. m ATLANTIC CITY, week days, *7.15 a m., *8.25 a. in.. *12.21, *2.29, *3.29, *u.22 p. m Sundays, 7.35 a. m.; *3.29 p. m. CAPE MAY, we-k day», *7.U a. m <mask> •2.29 p. ra. Sundays, 7.35 a. m. BALTIMORE AND week days. *4.13 . 7 .10, *3.49. *11 a. m .; *12.5» •2.07, 3.04, *4.03, *4.67. *6.16. *8.17, *8.53 p. ra Sundays. *4.13, 7.10, »8.19 a. in.; *12.56, *3.07 l.04, *4.57, *8.17 . *8.53 p. in. BALTIMORE AND WAY STATIONS 7.10 a. m.; 3.04 p. m . daily. NEWARK, week days, *4.13 , 7.10, *8.42 •11.00 a. m.; *12.56 . 3.04, *4.03, *4 57, *6.16 7.36, *8.17, 10.46 p. in. Sundays, *4.13, 7.10 •8.49 a. m .; *12.56, 3.04, *4.57, 7.35, *8.17 p. m PITTSBURG, week days, *8.1« p. in Sundays. *4.57 p. m. CHICAGO, dally, *4.57 p m. CHICAGO via CINCINNATI and IN DIANA POLIS, *8.49 a. ra. daily. CINCINNATI AND ST. LOUIS, *12.51 p m. and *8.17 p. m . dally. TOLEDO AND DETROIT, rI dally to Toledo and dally except to Detroit.\n",
"ran ii(i ui'ui vrimiir, nucio uu unu ucoi 1 treating the prohibition question witt 2 his usual eloquence. Mr. Carskadon'; model of a house without nails has at traded much attention. \"Silos? said Mr Carakadon,\" \"the ello has come to stay It doubles the capacity of th< farm. Yes, I'm writting on book on ensilage, in fact nave writ en it. 1 expect to navo it out now in aMe» weeks, and 1 think you will find it a con tribution to the subject I intend it to bei practical guide. I want to say that thl year's Fair surprises and delights me You have made rapid progress since I wa here three years ago. I regard it as om of the finest in the country, as it shouli be with such territory to draw from.\" Mr. James L. Henderson, <mask> Washing ton,Pa.,owuer of the Locust Farm Heri of Holsteina: \"I oughtn't to find fault but your types made our Bitje two year old, when she is six. Velleure is a two year-old. They each took first premium And then the herd was established ii 1871), not 1859. I think that is all, excep that we are having a fine show.\" Colonel J. F . Ch&rlesworth, of St. Clairs ville, said: \"I am working up the layin] of the corner stone of our new cour house, September22. Come up. Wewil have a great big day. The procession wil be something immense, and the town wil be full of people. We will have the K T. Coinmandery, the Patriarchate of G. U Odd Fellows, the Knighta of Pythias am many other Wheeling people, and the Masons from all over tlie country.\n",
"of lluu,11 The London ]Im,the lint' uh tyuarlcrly, Avplelon'i Annual Cyelo- pedta, the infldilt Caitelar nod De L'Arlege, the cxoomnuntcatcd friar Hyacinths, (Uo condemned Janaenlat Duptn, erroneously quoted u \"Rom- lab,\" and liarper'# ilagasine lot He- oember should be added to the I let. The only Oal hollo authorities men- Honed are Cyprian anil Tertnlllan, wlioae lexta are not apeelfledi a pawage from tlio Now York Tablet, whlob la correct, and lb* \"Koman\" blatorlan Hefele, vol. I, p. 7 . If by this latter name Is meant ttie learned Clatliollo < Doctor Helele, or Tubingen, who wrote a \"History ot tbo UooboIIs,\" so far from bis Haying anything to favor the aaeertlon that the Oral eight (Jounolla . were convoked by the euiperora, he completely refute) it. Let Mr. Flaber 1 tell us whom hemeana by tlio \"Koraan\" i Historian Hefele. I The aermon of Ur. Fisher remind) > me of the witty description of a Msg- ( pie's nest by the poet Y darts. There <mask> are 111 It rags and tags and bobtails, i bits and scraps, odds sod ends from all i quarters, I shall oondense the subject I of his borrowed materials and answer 1 very bristly. If any man wants more I Information on any of these subjects whlob upace will not permit we to i develops, let hlrn oomenut under his 1 own name nnd he will get it. { Hour can vis know a general Council 1 since the conditions oj ecumenicity are i undecided? Answer: They are notun. I decided, We can tell general Councils only after they are ended. There have been eighteen ho far. All Gsthollop c admitthis. IfItIssobardtotella I general Conncll how do you explain I thla unanimity t A general Council Is c a historical fact. When the Oounoll > of tlie Vatican Is ended we oan tell I whether It 1b general or not as easily aa > we can know a meeting of Parliament 1 ni> * OM»lnn nf 1 lnnnasae \"\n",
"berg to aald Jlardmau, dated May 27th, 1874, and recorded in Deed Book No. 40, page 267; a deed from Harmon and Martha A. Tricket to tba aald llardman and Mary E. MiUner for two tracts of land, both containing &2U acres, dated June 10th, 1870, and recorded In Deed Book No. 38, pages 9 5.4, and deed from John B. Bherrard and others to aald llardman, dated May 18th, 1872, and ro- cordtd in Deed Book No. 41, pagea 2 and 3; a died from Margaret, Georgo B., James V., Julia A. and 6arah E. Jackson to said George llardman for one acre, dated March I6tb, 1872, and recorded in Deed Book No. 3\\\\ paxei 484-5; a deed from Wm. B. and C. Brown to aald llardman for two acres and 21 perches, dated March 27th, 1874, and recorded in Deed Book No. 41, psgea 4 and 6; a deed from Crrui and Nancy J. Linton to aald Hardman for 12 acres and 2U perches, dated June 17th, 1874, and recorded in Deed Book No. 41, pages 18 and 10; a deed from Bucknerand Bebecca Fairfax tosild Hardman for 1C0 acres, dated August loth, 1872, and recorded In Deed Book No. 41, pane <mask> and 21; and a deed from John K. and Mary E. MiUner U the aald Hardman for four tracts of land, aggre¬ gating 888 acres, dated March 16th, 1874, and re¬ corded in Deed Book No. 41, pages 22, 8 and 4. Tbe whole containing in tbe aggregate at least 850 acres, with all the improvements and appurte¬ nances thereto In any wise belonging, Including the furnace and fixtures, and being tbe same prop¬ erty conveyed to Thomas Y. Canby and George H. Miller, trustees, for the said George Hardman and wife, by mortgage deed dated November lit, 1874, and recorded In Book No. 39, pages 90 and 58, In the office of add County Clerk of aald Preston county, lielug tbe aatna property conveyed to the tald Aimer Evans, Jr., by Hannibal Forbes, 8pe- cial Commiaaioner, aid deed la of record among the land record* of Preston county, West Virginia. Tirmh or HiLK.One-third of the purchase money, or such greater amount thereof aa the pur¬ chaser may elect to pay, cash in hand, tbe residue to two equal yearly p-yments, with interest from day of aale, and the deferred payment! to be se¬ cured by deed of trust on tbe property sold.\n",
"Accommodation,8 00, 8 88,7 ofi, 808, I 45. II 33 a tu,1288,226.345.(25.82*\\\\840.740.to3i!pm. NewYork. I85.256.430,815», H55,s». IoIV. tlMam, #12lit.128“,l:n,31«, a45,510,s17 65b,606,Ri21 70«,7 18,» 12.1030pto. Boston, without change, 10 18 a in, 5 88 p m. « est Chester, via Lamokin, 8 80, 8 08 am. 225,o45nin. Newark Center anfl Intermediate stations. 140am,1264,633pm. Baltimore and intermediate am, 12U6,247,445.81)611m,1203night. Baltimore and Bay Line, 5 28 p m. Baltimore and Washington, 4 48 ,801,811. 1015, 11UO am, 1208, II18,208.428.523,*81«. ~ 58,7«0,830p,m,1249night. Trains for Delaware Division leave for: NewCastle,815,1121ain,250.380,440.8I 883,950pm,1208night. Lewes,815am,487pna. Harrington, Delmar and way staMcns, 8 a m. Harrington and way stations, 2 50 p m. Express for Dover. Harrington and Delmar 18ain,437pm,1201night. Express for Wyoming and Smyrna, 8 53 p Express for Cape Charles, Old Point Com fort and Norfolk. 11 18 a m. 12 01 night. Leave Philadelphia, Broad street for Wll tnington, express. 3 50, 7 20, 7 27, 8 81,0 lu, 10 30 1033,1118am,11235,130.21«,301,3ill,353,401 441,508.+517.530,55»,817,657,740,1116. pm. 12 00 night. Accommodation, 6 25.7 4», 10 38,11 55 a m, 1 32 228,310.408,448,«32,838,10OS.Ill4P.1138pIn Sunday Trains—Leave Wilmington for; Philadelphia, express, 1 56, 2 51, 4 2', 8 50, 8 00 10(«,11 51 a m.l 39,3 05;504,5 <mask> 0«, 7 08, 7 25 9 12 p m. Accommodation, 7 1». 81« a m, 12 1c 145,4U5,530.1030pm. Chester, express, 1 55,4 20. 8 50. H («1,10 (», 11 51. a m,504,558,706,912p m. Accommodation, 700.805am, 12in,145,405,620,725,lo30 pm. New York, express. 155, 2 55,4 20,7 00, 860 1151am. 1210,13o. 31«. 41«. 510. 5»,«(»■ ■*6 21,7 08,10 30 pm. Boston, without change. 5 58 p m. West Chester.via Lamokin, s 05 a m, 5 3o p 111 New Castle, 9 50 p m, 12 08 night. Cape Charles, Old Point Comfort and Nor­ folk, 12 01 night. Middletown, Clayton, I)ov»r. Wyoming, Eel- ton, Harrington, BridgevlUe. Seaiord, Laure and Delmar, 12 01 night. Baltimore and Washington. 4 48, 8 01, 10 P am,1306,523, +603,740,820pm.124« night Baltimore only, 6 00 u in, 12 13 night Leave Philadelphia. Broad street, for Wll mlngton. express, 3 50, 7 30. »10. 11 18 am. 4 41 5 08, « 57,7 40,8 35.1] 16. 1130 p m, 12IW night. Accommodation, 8 35, 10 38 a m, 12 35.2 05.8 838,1003and1138pm. For further Information, passengers are re ferred to the ticket ofllce at the station. ♦Congressional Limited Express train* com posed entirely of Pullman Vestibule Peril, and Dinln ^\n",
"The market for old rails is reported \"easier\" at $23, with an unwillingness on the part of havers to go above $22. In regard to the pig metal market, the Jfcntifacturrrdiscourses as follows: \"Dealers report that transactions foot up about the same as for some weeks past. Price* hare undergone no change what¬ ever, but strictly red-ehort iron continues very firm and in good demand at our quo¬ tations. Most of the red-short iron that comes to this market is made in the Shen- ango valley, where another furnace has blown in, after having gone out for repairs some time during the summer. The fur¬ naces in the eastern part of the State are having a better demand for their product than those in the west, letters from pig manufacturers in the region from Altoona to Reading to dealers here show that most of the furnaces that are in blast areselling their iron as fast as they can make it, while some are sold ahead to April, and at better prices at the furnace than could be obtained here, delivered. This will, of course, be good news to western furnace owners, as it relieves them of a competi¬ tion that <mask> felt to some extent not long ago. These letters state that some of the furnaces are sending iron to South America, the others rinding a market in the Kast. This favorable condition is more especially observable in the Reading dis¬ trict and\"in the Lehigh Vallev. It will, however, dampen the hopes that we may have kindled in the breasts of western furnace men when. we add Uuu the de¬ mand wc have spoken off is mostly for foundrv iron, and that there is an over¬ stock o* some other sorts. But as an offiset to this it should not be forgotten that the Bessemer works are all well supplied with orders, and will require a large quantity of nig. Not only are a goodly number of furnaces west of the Alleghenies now run¬ ning on this grade of iron, but in the Kast furnaoes have within a month chanced from other kinds to this, and some that were out of blast have blown in to make it. The pig iron trade is gradually adjusting itself to the changing conditions in the metallurgical world, and we may hope¬ fully look for a better and more settled state of affairs by and by.\n",
"mighty works. It was common com- plaint that in the days of his greatest victories, men could not find Mr. Moodi when a service was dismissed, or get into his quarters at the hotels; he would give no opportunity for self-- glorification. Paul and Barnabas had hard work to restrain these hero wor- shipers (v. 14), and to convince them who they were and how they had been enabled to accomplish such a wonder- ful miracle (v. 15). Paul was of \"like stature\" with them and would not ac- cept worship as did the Caesars or Herod (12:22, 23). He exhorted the Lystrians to turn from \"these vain things.\" i. e., such idol worship, unto the \"living God\" (see also I Cor. 8:4; I Thess. 1:9). Hitherto God had not miraculously interfered to turn men from their evil ways (v. 16), but left them to their own devices to show their inability to find their way back to him (see Acts 17:30; I Cor. 1:21). Yet God is not \"without witnesses\" (v. 17). The seasons and the natural laws point to God, yet men still re- main blind and ungrateful. Thus by vehement exhortation they <mask> this act of sacrilege. (2) Persecution (vs. 19. 20). The mob is ever fickle, (v. 18). but it did not turn them \"unto the living God\" (v. 15). Conversion is the simple turning from idols (I Thess. 1 -9), a rational thing, but one contrary to the pride of men who de- sire to \"do something\" whereby they may merit or can demand their sal- I vation. Even as Paul had difficulty to turn people aside from idols, so today it is hard to keep men and women from idolatry, not the gross or vulgar idolatry of heathenism, but the re I fined idols of culture, success, power, money and pleasure. To his difficul- ties Paul had the added persecution of the vindictive Iconians and those from Antioch (v. 19). God delivered him from this trial (I Cor. 11:25, 27). All loyal witnesses must expect pereecu- tion from the G(;od-hating world ill Tim. 3:12; John 15:18-20) Some think that this was when Paul was \"caught up into the third heaven\" (II Cor. 12: 2-4). Hils treatment did not stop his testimony, nor separate him from friends vs. 2u, 21). III. The Return (vv. 22-28).\n",
"In a fold of the Kentish hills, surrounded by apple orchards and hop gardens, there stands a humble building whose walls are eloquent of the past, a writer in the London Globe says. It is almost the only one of Its kind left standing-so far as the exterior Is concerned-in its entirety. The adjoining land was granted to one of his knights by Edward I. In 1272, and the most roll- able antiquarian opinion is in' favor of the house having been built shortly after. Our knight, in the matter of building, did not despise the record of the past, for he adopted the Norman method, then dying out, of placing his living rooms on the second floor. This made for safety and the ground floor apartments were simply windowless dun- geons and storerooms. In those days they built for strength, and the walls of Kentish rag are of great thickness, cal- culated to withstand the assaults of any quarrel. some neighbors, while' the turret, which gives ad- mittance by a stone spiral staircase to the living rooms above, is guarded top and bottom by mas- sive oaken doors, and is lighted by oylets through which a rain of arrows could be poured upon in. truders below. The main style of the building is that of the transition from early English to <mask> orated. Oblong in form, it has gables north .and south, and at either end of the long east wall is a square projection. Aseending the stairs we find ourselves In a room of truly noble proportions, occupying the length and breadth of the building, 28 feet by 18%, and lighted by windows east, west, north and south. It is open to the roof, which contains nearly, if not quite, Its original form, and has a fireplace and an \"ambrey\" or cupboard in which cooking and table requisites and alms for the poor were kept. In this \"airs\"or aitre\" the fam- ily lived and worked, and here visitors and better class retainers slept. Here, perhaps, from the beams supportlng the roof hung the store of dried provisions for winter use, and the herbs collected by the squire's dame. It was here in the \"airs\" that, at even, the family gathered round the firelight (candles were expensive luxuries in thoe days) to listen to story of battle or chase. The windows were ura glesed, but glass might be fied in the shutters, the Iron hook for which still remains. Oaken set tIes did duty as seats by day and as restting places at night and meals were served on a board placed on trestles--hence, perhaps, the phranse \"the te tive board. \"\n",
"Captaiu J. W . Plumuicr of Mineral Hill arrived here from Eureka, yesti r d.»y. nud departs for thu east by the train thiit evening, utjr he enjoy the journey, cml iu good time return to bin many friends iu the Silver state. The Grand Jury having discharged its duties most of tlic members arouu tbeir way to their respective homes again. The train last evening took several towanl tho Eastern part of tho county, and the stage for the north this morn- frig tarried away a number. The reir guard, in the person of Colonel J. B Mooro, is about ull of the otgauiz ition now visible to the naked eye. The Republican 'jonfnnls claim that at the last Presidential election, Hayes received a majority of the electoral vote, while tho Democrats as stoutly main- tain that Tildett was justly entitled to tho fiuits of a clearly wou victory. Iu all well regulated rnccs, a dead boat is required to beruu a second time. The Democrats aro perfectly willing to back Mr. Tildeu to buy amount iu the com¬ ing contest aul ilaro <mask> opposition to trot out Mr. Hayes, ltither than mis* such a race they would even give the lat¬ ter tho advantage of the distance to the qu irter-polo. Why is it that wo. never hear the nainoof Rithcrford mentioned iu connection with the uext Presidency ? Auuther attempt wis m.ulo to sup¬ press the Czar of Russia and the baluuce of tho Imperial fauiily.Tucs lay evening. A urine was explod d under the dining . room ol tho Winter Palace at St. Peters¬ burg which tore out the floor for a space of six. by ten feet, killing fivo soldiers an 1 wounding thirty live others. The Salvation of the family was owing to nil accidental d. lay by whicll they were n trifle behind their usual timo at supper. The adage \"fWasJ ra'sts tho head that wears a Crown,\" is peculiarly applica¬ ble to the ease of tho present ruler of Russia. Ho appears to bear n cli irtned life, howovi-r, and may possibly worry through his three score and ten, shnf- tling off his mortal coil like a mere hu¬ man. after ull.\n",
"Importance to put tlie town In a posi- tion to sustain n siege ot some length. Those works were begun on tlie day on which Tomsk fell Into the hands of the Tartars. At the same time ns that last news the grand duke loarned that the emir of Dokbnra aud the allied khans were directing the movement lu person, but what he did not know was that the lieutenant of those barbarous chiefs wns Ivan Ogareff, a Itusslan olll -c e- r whom he himself hnd cashiered. From the first, as has been seen, the Inhabitants of the province of Irkutsk hnd been ordered to abandon' the towns nnd villages. Those who did not seek refuge In the capital were compelled to retire beyond Lake Baikal, to where the Invasion would not likely extend Its ravages. The crops of corn and forage wero requisitioned for the town, aud that last rampart of Itusslan powor lu tho extreme east was prepared to re- sist for some time. Irkutsk, founded In 1011, Is tltttnted nt the confluence of the <mask> nnd the Angara, on the right bank of tho river. Two wooden bridges, built on piles nnd so arranged ns to open the whole width of tho river for tho necessities of navi- gation, joined the town with Its out- skirts which extended nlong the left bank. The outskirts were nbandoned, the bridges destroyed. Tho passage of the Angara, which was very wide at that place, would not have been possi- ble under the lire of the besieged. But the river could le crossed either above or below the town, aud ns a conse- quence Irkutsk wns in dnnger of being attacked on the east side, which no ramuart urotccted. It was, then, lu works of fortification thnt the hands were llrst employed. They worked day nnd night. The grand duke found a spirited population lu supplying that need, and afterward he found them most brave in Its defense. Soldiers, merchants, exiles, pensants. all do voted themselves to the common safety. Eight days before the Tnrtars had appeared on the Augara ramparts of inrt1i linil hi'fin rnUnil.\n",
"Mr. J .T., Reading, Pa: My Dear Sir Your favor of tho 20th Inst., Isjust received in reforenco to tho admission of colored children into tho public schools of our city, nnd contain- ing a copy ofyourremarksatn meeting held by your colored citizens. I will forward your remarks to Washington us requested, and I think you need havo no fears of removal. I um forming no opinion ust now on thd question, but think Mr. Sumner's bill will settlo tho wholo matter. A great deal of my timo tho past season has been occupied in preparing a newaud enlarged cdltlonof \"What I Know about Farmlng,\"a most excellent aud serviceable book, which I think yon ought to havo. (I will send you a copy, postngo prepaid, on receipt of price: Si, 50) As tho season Is advanced and has kept mo lu tho houso n great deal, I have been trying to better the condition of our people by endeavor- ing to make improvements in cooking. For somo years I have found that doughnuts Ho too heavy on my stomach, which my physicians attrlbuto to tho fat In which they arc fried. <mask> tell mo that a doughnut contains about eight times as much fat as Is consistent with a doughnut. To overcomothls dlfllculty,I havo gono to cousldorablo philosophical research. By using only ono eighth of tho usual amount of fat for frying them, Mrs. Greeley assurod mo the doughnuts would burn. By using eight times as much Hour I would havo just eight times as many doughnuts as I wnnted. I therefore determined touso eighttlmes tho usual quantity of sots. Mrs. G. mixed up tho batter in lho bread bowl, nnu Having matlo most exact propor- tions, I put In oilo pint of sots. Tho next morning, on entering tho kitchen, wo found that our batch of douehnuts had risen about ninety degrees abovo our highest expectations, and the tido was still rising. Mrs. G. heated tho lard whilo I tried to stir down tho bat- ter, but all to no use. I poured in somo fat, but it only spritzed and crackled, and I was mortified to find my experi- ment a failuro as tho doughnuts would not stick together. Too much sots in a doughnut is worse than CnrlSchurzln a caucus.\n",
"Commercial and Financial* TX .rr^.n\".-..' Prospects of the Soger Market. Fmm tM Tribunt qf Tuuday. The week opens upon'on easy money market, with rates on call at 5@0 pei sent,\" but with an uneasy feeling In the nlnds of the. people, aa well as witli.tbi Panics and bankers having large wester: iccounts, as to what Is to be thecondltloi jf the money marker as soon as there U i -loader call for money from the west h novo,the crop*. The President of one o >ur strongest banks Inlormed us this morn hg that last week he sent to the wes ibout $60,000 a day; and bankers in Oht sago write blm,'\"If there, is an nctivi novoment in 'wheat, they will be imme Uately in' want of currency. and thi nonoy market'there will be tight; aw her will atooco beobliged to draw out al <mask> fundB east to accommodate rprodaci layers' at home.\" 'One ot the most promt lent bankers of Wall-st. and a gentleman >1 conservative.views la financial matten old to ua this morning, \"Mr. Boutwcll hoi tin hi* power, -with' hit'160,000,000 a [old and his $10,000,000 of currency, t< c«ep thomoney market,: noti.only of thii ilty, butof the -whole' country, in a verj ssy condition this Fall. He has only ti elfalittle more gold, and then Increast ome $5,000,000 or more of hl> purchase! if bonds thu monthTsnd from $5,000,0Q( o$ld,000,000 next month, asthe bnslnes lemondB may requlre. and the banken viil feel confldoneo iir the sitratioo, ant! vould discount business'Diaper with i enseof satety in ttie 1 future.'1..The $3,' 100,000 of. bonds, ajreafly taken: sby! tiu iocrotary above the tumi named In hit idvcrtisement this month;baa prodocec iie best of fueling, and made money east\n"
]
}
],
"source": [
"with open('test-A/out.tsv', mode='wt', encoding='utf-8') as f:\n",
" for line in X:\n",
" try:\n",
" out = classifier(line)\n",
" parsed_out = parse_output(out)\n",
" f.write(f\"{parsed_out}\\n\")\n",
" except:\n",
" print(line)\n",
" f.write(f\":1\\n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}

15
lm0.py Normal file
View File

@ -0,0 +1,15 @@
import sys
import random
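# lm0: trivial baseline -- for every input line, emit one of a few
# hard-coded word distributions, chosen at random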
distribs = [
    'a:0.6 the:0.2 :0.2',
    'the:0.7 a:0.2 :0.1',
    'the:0.9 :0.1',
    'the:0.3 be:0.2 to:0.15 of:0.15 and:0.05 a:0.05 in:0.05 :0.05',
]
for line in sys.stdin:
    # the context fields are parsed but unused; the prediction is random
    ctx = line.split('\t')[6:]
    i = random.randint(0, len(distribs) - 1)
    print(distribs[i])

56
lm1.py Normal file
View File

@ -0,0 +1,56 @@
import sys
import random
from tqdm import tqdm
from collections import defaultdict
import pickle
import os
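# lm1: bigram baseline -- count successor frequencies over the training corpus
# and predict the most frequent successor of the word directly before the gap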
corpus = []
with open('train/in.tsv', 'r') as f:
    print('Reading corpus...')
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        corpus.append(ctx[0] + 'BLANK' + ctx[1])
corpus = ' '.join(corpus)
corpus = corpus.replace('-\n', '')
corpus = corpus.replace('\\n', ' ')
corpus = corpus.replace('\n', ' ')
corpus = corpus.split(' ')
if os.path.exists('distrib.pkl'):
    print('Loading distribution...')
    distrib = pickle.load(open('distrib.pkl', 'rb'))
else:
    print('Generating distribution...')
    distrib = defaultdict(lambda: defaultdict(int))
    for i in tqdm(range(len(corpus) - 1)):
        distrib[corpus[i]][corpus[i+1]] += 1
    with open('distrib.pkl', 'wb') as f:
        print('Saving distribution...')
        # cast to a plain dict -- the outer defaultdict's lambda cannot be pickled
        pickle.dump(dict(distrib), f)
results = []
with open('dev-0/in.tsv', 'r') as f:
    print('Generating output...')
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        last_word = ctx[0].split(' ')[-1]
        try:
            blank_word = max(distrib[last_word], key=distrib[last_word].get)
        except (KeyError, ValueError):
            # last_word unseen in training (or has no recorded successor)
            blank_word = 'NONE'
        results.append(blank_word)
with open('dev-0/out.tsv', 'w') as f:
    print('Writing output...')
    for result in tqdm(results):
        if result == 'NONE':
            f.write('a:0.6 the:0.2 :0.2\n')
        else:
            f.write(f'{result}:0.9 :0.1\n')

83
lmn.py Normal file
View File

@ -0,0 +1,83 @@
from tqdm import tqdm
from numpy import argmax
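# lmn: dense bigram model -- build a |V| x |V| matrix of conditional bigram
# probabilities and predict the argmax of the row of the word before the gap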
def preprocess(corpus):
    corpus = corpus.replace('-\n', '')
    corpus = corpus.replace('\\n', ' ')
    corpus = corpus.replace('\n', ' ')
    corpus = corpus.replace('.', ' EOS')
    return corpus
def generate_freq(tokens):
    tokens_freq = {}
    for token in tqdm(tokens):
        if token not in tokens_freq:
            tokens_freq[token] = 1
        else:
            tokens_freq[token] += 1
    return tokens_freq
def generate_ngrams(tokens, n):
    ngrams = []
    for i in tqdm(range(len(tokens) - n + 1)):
        # tuples (not lists) so the n-grams can be used as dictionary keys
        ngrams.append(tuple(tokens[i:i+n]))
    return ngrams
def generate_distribution(unique_tokens, tokens_freq, bigrams_freq):
    n = len(unique_tokens)
    # row i holds P(token_j | token_i) for every j
    # note: O(|V|^2) memory, workable only for small vocabularies
    distribution = [[0.0] * n for _ in range(n)]
    for i in tqdm(range(n)):
        denominator = tokens_freq[unique_tokens[i]]
        for j in range(n):
            numerator = bigrams_freq.get((unique_tokens[i], unique_tokens[j]), 0)
            distribution[i][j] = numerator / denominator
    return distribution
with open('train/in.tsv', 'r') as f:
    print('Reading corpus...')
    corpus = []
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        corpus.append(ctx[0] + 'BLANK' + ctx[1])
print('Preprocessing corpus...')
corpus = preprocess(' '.join(corpus))
tokens = corpus.split()
# a sorted list (not a set), so tokens have stable indices for the matrix rows
unique_tokens = sorted(set(tokens))
print('Generating tokens frequency...')
tokens_freq = generate_freq(tokens)
print('Generating n-grams...')
bigrams = generate_ngrams(tokens, 2)
print('Generating bigrams frequency...')
bigrams_freq = generate_freq(bigrams)
print('Generating distribution...')
distribution = generate_distribution(unique_tokens, tokens_freq, bigrams_freq)
with open('dev-0/in.tsv', 'r') as f:
    print('Generating output...')
    results = []
    for line in tqdm(f):
        ctx = line.split('\t')[6:]
        last_word = preprocess(ctx[0]).split(' ')[-1]
        try:
            blank_word = unique_tokens[argmax(distribution[unique_tokens.index(last_word)])]
        except ValueError:
            # last_word not in the vocabulary
            blank_word = 'NONE'
        results.append(blank_word)
with open('dev-0/out.tsv', 'w') as f:
    print('Writing output...')
    for result in tqdm(results):
        if result == 'NONE':
            f.write('a:0.6 the:0.2 :0.2\n')
        else:
            f.write(f'{result}:0.9 :0.1\n')

View File

@ -1,36 +0,0 @@
import lzma
import json
def preprocess_train_data(X, y):
    parsed_data = []
    for line, masked in zip(X, y):
        left = line.strip().split('\t')[6].replace('\\n', ' ')
        right = line.strip().split('\t')[7].replace('\\n', ' ')
        masked = masked.strip()
        text = left + f' {masked} ' + right
        parsed_data.append({'text': text})
    return parsed_data
with lzma.open('train/in.tsv.xz', mode='rt', encoding='utf-8') as f:
    X = f.readlines()
with open('train/expected.tsv', mode='rt', encoding='utf-8') as f:
    y = f.readlines()
data = preprocess_train_data(X, y)
data = data[:10000]
train_data = data[:int(len(data) * 0.8)]
val_data = data[int(len(data) * 0.8):]
with open('train/train.json', mode='wt', encoding='utf-8') as f:
    json.dump(train_data, f)
with open('train/val.json', mode='wt', encoding='utf-8') as f:
    json.dump(val_data, f)

153
ripped.py Normal file
View File

@ -0,0 +1,153 @@
import lzma
import matplotlib.pyplot as plt
from math import log
from collections import OrderedDict
from collections import Counter
import regex as re
from itertools import islice
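# ripped.py: bigram heuristic -- score candidate gap fillers by combining the
# bigram probability given the word on the left of the gap with the one given
# the word on the right, and print the top candidates with their scores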
def freq_list(g, top=None):
    c = Counter(g)
    if top is None:
        items = c.items()
    else:
        items = c.most_common(top)
    return OrderedDict(sorted(items, key=lambda t: -t[1]))
def get_words(t):
    for m in re.finditer(r'[\p{L}0-9-\*]+', t):
        yield m.group(0)
def ngrams(iter, size):
    ngram = []
    for item in iter:
        ngram.append(item)
        if len(ngram) == size:
            yield tuple(ngram)
            ngram = ngram[1:]
PREFIX_TRAIN = 'train'
words = []
counter_lines = 0
with lzma.open(f'{PREFIX_TRAIN}/in.tsv.xz', 'r') as train, open(f'{PREFIX_TRAIN}/expected.tsv', 'r') as expected:
    for t_line, e_line in zip(train, expected):
        t_line = t_line.decode("utf-8")
        t_line = t_line.rstrip()
        e_line = e_line.rstrip()
        t_line_splitted_by_tab = t_line.split('\t')
        t_line_cleared = t_line_splitted_by_tab[-2] + ' ' + e_line + ' ' + t_line_splitted_by_tab[-1]
        words += t_line_cleared.split()
        counter_lines += 1
        # cap the amount of training data to keep memory use manageable
        if counter_lines > 90000:
            break
# lzmaFile = lzma.open('dev-0/in.tsv.xz', 'rb')
# content = lzmaFile.read().decode("utf-8")
# words = get_words(trainset)
ngrams_ = ngrams(words, 2)
def create_probabilities_bigrams(w_c, b_c):
    probabilities_bigrams = {}
    for bigram, bigram_amount in b_c.items():
        # skip rare bigrams
        if bigram_amount <= 2:
            continue
        p_word_before = bigram_amount / w_c[bigram[0]]
        p_word_after = bigram_amount / w_c[bigram[1]]
        probabilities_bigrams[bigram] = (p_word_before, p_word_after)
    return probabilities_bigrams
words_c = Counter(words)
word_ = ''
bigram_c = Counter(ngrams_)
# reassign the large intermediates to '' to free memory
ngrams_ = ''
probabilities = create_probabilities_bigrams(words_c, bigram_c)
items = probabilities.items()
probabilities = OrderedDict(sorted(items, key=lambda t: t[1], reverse=True))
items = ''
# sorted_by_freq = freq_list(ngrams)
PREFIX_VALID = 'dev-0'
def count_probabilities(w_b, w_a, probs, w_c, b_c):
    results_before = {}
    results_after = {}
    for bigram, probses in probs.items():
        if len(results_before) > 20 or len(results_after) > 20:
            break
        if w_b == bigram[0]:
            results_before[bigram] = probses[0]
        if w_a == bigram[1]:
            results_after[bigram] = probses[1]
    best_ = {}
    # combine left-context and right-context scores for each candidate
    for bigram, probses in results_before.items():
        for bigram_2, probses_2 in results_after.items():
            best_[bigram[1]] = probses * probses_2
    for bigram, probses in results_after.items():
        for bigram_2, probses_2 in results_before.items():
            if bigram[0] in best_:
                if probses * probses_2 < probses_2:
                    continue
            best_[bigram[0]] = probses * probses_2
    items = best_.items()
    return OrderedDict(sorted(items, key=lambda t: t[1], reverse=True))
with lzma.open(f'{PREFIX_VALID}/in.tsv.xz', 'r') as train:
    for t_line in train:
        t_line = t_line.decode("utf-8")
        t_line = t_line.rstrip()
        t_line = t_line.replace('\\n', ' ')
        t_line_splitted_by_tab = t_line.split('\t')
        words_pre = t_line_splitted_by_tab[-2].split()
        words_po = t_line_splitted_by_tab[-1].split()
        w_pre = words_pre[-1]
        w_po = words_po[0]
        probs_ordered = count_probabilities(w_pre, w_po, probabilities, words_c, bigram_c)
        if len(probs_ordered) == 0:
            print("the:0.5 a:0.3 :0.2")
            continue
        result_string = ''
        counter_ = 0
        for word_, p in probs_ordered.items():
            if counter_ > 4:
                break
            re_ = re.search(r'\p{L}+', word_)
            if re_:
                word_cleared = re_.group(0)
                result_string += f"{word_cleared}:{str(p)} "
            else:
                if result_string == '':
                    result_string = "the:0.5 a:0.3 "
                continue
            counter_ += 1
        result_string += ':0.1'
        print(result_string)

233
run.py Normal file
View File

@ -0,0 +1,233 @@
import lzma
import regex as re
from torchtext.vocab import build_vocab_from_iterator
from torch import nn
import pickle
from os.path import exists
from torch.utils.data import IterableDataset
import itertools
from torch.utils.data import DataLoader
import torch
from matplotlib import pyplot as plt
from tqdm import tqdm
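# run.py: neural gap-filling model -- a trigram network that predicts the
# middle word from the embeddings of the words on either side of the gap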
def get_words_from_line(line):
    line = line.rstrip()
    line = line.split("\t")
    text = line[-2] + " " + line[-1]
    text = re.sub(r"\\\\+n", " ", text)
    text = re.sub('[^A-Za-z ]+', '', text)
    for t in text.split():
        yield t
def get_word_lines_from_file(file_name):
    with lzma.open(file_name, "r") as fh:
        for line in fh:
            yield get_words_from_line(line.decode("utf-8"))
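# slide a window of size 3 over the token stream, yielding
# (left word, middle word, right word) triples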
def look_ahead_iterator(gen):
    first = None
    second = None
    for item in gen:
        if first is not None and second is not None:
            yield (first, second, item)
        first = second
        second = item
class Trigrams(IterableDataset):
    def __init__(self, text_file, vocabulary_size):
        self.vocab = build_vocab_from_iterator(
            get_word_lines_from_file(text_file),
            max_tokens=vocabulary_size,
            specials=["<unk>"],
        )
        self.vocab.set_default_index(self.vocab["<unk>"])
        self.vocabulary_size = vocabulary_size
        self.text_file = text_file
    def __iter__(self):
        return look_ahead_iterator(
            (
                self.vocab[t]
                for t in itertools.chain.from_iterable(
                    get_word_lines_from_file(self.text_file)
                )
            )
        )
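# embed both context words, concatenate the embeddings, and map them through
# one hidden layer to a softmax distribution over the whole vocabulary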
class TrigramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(TrigramModel, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.hidden = nn.Linear(embedding_dim * 2, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)
        self.softmax = nn.Softmax(dim=1)
    def forward(self, x, y):
        x = self.embeddings(x)
        y = self.embeddings(y)
        z = self.hidden(torch.cat([x, y], dim=1))
        z = self.output(z)
        z = self.softmax(z)
        return z
embed_size = 500
vocab_size = 20000
vocab_path = "vocabulary.pickle"
if exists(vocab_path):
    print("Loading vocabulary from file...")
    with open(vocab_path, "rb") as fh:
        vocab = pickle.load(fh)
else:
    print("Building vocabulary...")
    vocab = build_vocab_from_iterator(
        get_word_lines_from_file("train/in.tsv.xz"),
        max_tokens=vocab_size,
        specials=["<unk>"],
    )
    with open(vocab_path, "wb") as fh:
        pickle.dump(vocab, fh)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
dataset_path = 'train/dataset.pickle'
if exists(dataset_path):
    print("Loading dataset from file...")
    with open(dataset_path, "rb") as fh:
        train_dataset = pickle.load(fh)
else:
    print("Building dataset...")
    train_dataset = Trigrams("train/in.tsv.xz", vocab_size)
    with open(dataset_path, "wb") as fh:
        pickle.dump(train_dataset, fh)
print("Building model...")
model = TrigramModel(vocab_size, embed_size, 64).to(device)
data = DataLoader(train_dataset, batch_size=10000)
optimizer = torch.optim.Adam(model.parameters())
criterion = torch.nn.NLLLoss()
print("Training model...")
model.train()
losses = []
step = 0
max_steps = 1000
for x, y, z in tqdm(data):
    x = x.to(device)
    y = y.to(device)
    z = z.to(device)
    optimizer.zero_grad()
    # predict the middle word y from the outer words x and z
    ypredicted = model(x, z)
    loss = criterion(torch.log(ypredicted), y)
    losses.append(loss.item())
    loss.backward()
    optimizer.step()
    step += 1
    if step > max_steps:
        break
plt.plot(losses)
plt.show()
torch.save(model.state_dict(), f"trigram_model-embed_{embed_size}.bin")
vocab_unique = set(train_dataset.vocab.get_stoi().keys())
model.eval()  # switch to inference mode before predicting
output = []
print('Predicting dev...')
with lzma.open("dev-0/in.tsv.xz", encoding='utf8', mode="rt") as file:
for line in tqdm(file):
line = line.split("\t")
first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
first_word = re.sub('[^A-Za-z]+', '', first_word)
next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
nenxt_word = re.sub('[^A-Za-z]+', '', next_word)
if first_word not in vocab_unique:
word = "<unk>"
if next_word not in vocab_unique:
word = "<unk>"
first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
out = model(first_word, next_word)
top = torch.topk(out[0], 10)
top_indices = top.indices.tolist()
top_probs = top.values.tolist()
unk_bonus = 1 - sum(top_probs)
top_words = vocab.lookup_tokens(top_indices)
top_zipped = list(zip(top_words, top_probs))
res = ""
for w, p in top_zipped:
if w == "<unk>":
res += f":{(p + unk_bonus):.4f} "
else:
res += f"{w}:{p:.4f} "
res = res[:-1]
res += "\n"
output.append(res)
with open(f"dev-0/out-embed-{embed_size}.tsv", mode="w") as file:
file.writelines(output)
output = []
print('Predicting test...')
with lzma.open("test-A/in.tsv.xz", encoding='utf8', mode="rt") as file:
for line in tqdm(file):
line = line.split("\t")
first_word = re.sub(r"\\\\+n", " ", line[-2]).split()[-1]
first_word = re.sub('[^A-Za-z]+', '', first_word)
next_word = re.sub(r"\\\\+n", " ", line[-1]).split()[0]
next_word = re.sub('[^A-Za-z]+', '', next_word)
if first_word not in vocab_unique:
word = "<unk>"
if next_word not in vocab_unique:
word = "<unk>"
first_word = torch.tensor(train_dataset.vocab.forward([first_word])).to(device)
next_word = torch.tensor(train_dataset.vocab.forward([next_word])).to(device)
out = model(first_word, next_word)
top = torch.topk(out[0], 10)
top_indices = top.indices.tolist()
top_probs = top.values.tolist()
unk_bonus = 1 - sum(top_probs)
top_words = vocab.lookup_tokens(top_indices)
top_zipped = list(zip(top_words, top_probs))
res = ""
for w, p in top_zipped:
if w == "<unk>":
res += f":{(p + unk_bonus):.4f} "
else:
res += f"{w}:{p:.4f} "
res = res[:-1]
res += "\n"
output.append(res)
with open(f"test-A/out-embed-{embed_size}.tsv", mode="w") as file:
file.writelines(output)

BIN
test-A/in.tsv.xz Normal file

Binary file not shown.

7414
test-A/out-embed-100.tsv Normal file

File diff suppressed because it is too large Load Diff

7414
test-A/out-embed-500.tsv Normal file

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Binary file not shown.

BIN
trigram_model-embed_100.bin Normal file

Binary file not shown.

BIN
trigram_model-embed_500.bin Normal file

Binary file not shown.