{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![Logo 1](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech1.jpg)\n",
"<div class=\"alert alert-block alert-info\">\n",
"<h1> Ekstrakcja informacji </h1>\n",
"<h2> 15. <i>Similarity search</i> [ćwiczenia]</h2> \n",
"<h3> Jakub Pokrywka (2021)</h3>\n",
"</div>\n",
"\n",
"![Logo 2](https://git.wmi.amu.edu.pl/AITech/Szablon/raw/branch/master/Logotyp_AITech2.jpg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://arxiv.org/pdf/1910.10683.pdf"
]
},
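{
"cell_type": "markdown",
"metadata": {},
"source": [
"T5 (the paper linked above) casts every NLP task as text-to-text: the task is named by a plain-text prefix in the input (e.g. `translate English to German:`, `summarize:`) and the answer is generated as plain text. This is why, below, we can turn key-information extraction into the same seq2seq problem simply by choosing an input/output text format."
]
},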
{
"cell_type": "markdown",
"metadata": {},
"source": [
"https://github.com/applicaai/kleister-nda"
]
},
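{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kleister NDA is a key-information extraction challenge over NDA contracts. Each split contains `in.tsv` (one document text per line) and `expected.tsv`, whose lines are space-separated `key=value` pairs, one of which is `jurisdiction`. A minimal parsing sketch with a made-up example line (values are illustrative, not real dataset content):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hypothetical expected.tsv line -- illustrative values only\n",
"example = 'effective_date=2005-07-03 jurisdiction=New_York term=2_years'\n",
"print(dict(field.split('=', 1) for field in example.split(' ')))"
]
},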
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import T5Tokenizer, T5ForConditionalGeneration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = \"translate English to French: My name is Azeem and I live in India\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = \"summarize: Machine learning involves computers discovering how they can perform tasks without being explicitly programmed to do so. It involves computers learning from data provided so that they carry out certain tasks. For simple tasks assigned to computers, it is possible to program algorithms telling the machine how to execute all steps required to solve the problem at hand; on the computer's part, no learning is needed. For more advanced tasks, it can be challenging for a human to manually create the needed algorithms. In practice, it can turn out to be more effective to help the machine develop its own algorithm, rather than having human programmers specify every needed step.\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from transformers import T5Tokenizer, T5ForConditionalGeneration\n",
"\n",
"tokenizer = T5Tokenizer.from_pretrained('t5-small')\n",
"\n",
"model = T5ForConditionalGeneration.from_pretrained('t5-small', return_dict=True,).to('cuda')\n",
"\n",
"\n",
"# You can also use \"translate English to French\" and \"translate English to Romanian\"\n",
"input_ids = tokenizer(text, return_tensors=\"pt\").input_ids.to('cuda') # Batch size 1\n",
"\n",
"outputs = model.generate(input_ids)\n",
"\n",
"decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"\n",
"print(decoded)"
]
},
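{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default `model.generate` decodes greedily with a short maximum output length. A hedged variant with beam search and an explicit length budget (the parameter values are illustrative, not tuned):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# beam-search decoding with an explicit output length budget (illustrative settings)\n",
"outputs = model.generate(input_ids, num_beams=4, max_length=64, early_stopping=True)\n",
"print(tokenizer.decode(outputs[0], skip_special_tokens=True))"
]
},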
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"KLEISTER_PATH = '/media/kuba/ssdsam/Syncthing/Syncthing/przedmioty/2020-02/IE/applica/kleister-nda/'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_exp_f = open(KLEISTER_PATH + 'train/expected.tsv')\n",
"train_exp = []\n",
"for line in train_exp_f:\n",
" line_splitted = line.strip('\\n').split(' ')\n",
" found = False\n",
" for elem in line_splitted:\n",
" if 'jurisdiction=' in elem:\n",
" train_exp.append('jurisdiction: ' + elem.split('=')[1])\n",
" found = True\n",
" break\n",
" if not found:\n",
" train_exp.append('jurisdiction: NONE')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dev_exp_f = open(KLEISTER_PATH + 'dev-0/expected.tsv')\n",
"dev_exp = []\n",
"for line in dev_exp_f:\n",
" line_splitted = line.strip('\\n').split(' ')\n",
" found = False\n",
" for elem in line_splitted:\n",
" if 'jurisdiction=' in elem:\n",
" dev_exp.append('jurisdiction: ' + elem.split('=')[1])\n",
" found = True\n",
" break\n",
" if not found:\n",
" dev_exp.append('jurisdiction: NONE')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_exp"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_in_f = open(KLEISTER_PATH + 'train/in.tsv')\n",
"train_in = []\n",
"for line in train_in_f:\n",
" line = line.rstrip('\\n')\n",
" train_in.append(line)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dev_in_f = open(KLEISTER_PATH + 'dev-0/in.tsv')\n",
"dev_in = []\n",
"for line in dev_in_f:\n",
" line = line.rstrip('\\n')\n",
" dev_in.append(line)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"train_in[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.device"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input = train_in[0]\n",
"\n",
"# You can also use \"translate English to French\" and \"translate English to Romanian\"\n",
"input_ids = tokenizer(input, return_tensors=\"pt\").input_ids[:,:512].to('cuda') # Batch size 1\n",
"\n",
"outputs = model.generate(input_ids)\n",
"\n",
"decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"\n",
"print(decoded)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input_ids = tokenizer('translate English to German: The house is wonderful.', return_tensors='pt').input_ids.to('cuda')\n",
"labels = tokenizer('Das Haus ist wunderbar.', return_tensors='pt').input_ids.to('cuda')\n",
"# the forward function automatically creates the correct decoder_input_ids\n",
"loss = model(input_ids=input_ids, labels=labels).loss"
]
},
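{
"cell_type": "markdown",
"metadata": {},
"source": [
"Passing `labels` makes the model build `decoder_input_ids` internally (by shifting the labels right) and return the average token-level cross-entropy. A quick look at the raw output, reusing `input_ids` and `labels` from the cell above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the output exposes both the loss and the logits it was computed from;\n",
"# logits shape: (batch, target_length, vocab_size)\n",
"out = model(input_ids=input_ids, labels=labels)\n",
"print(out.loss.item(), out.logits.shape)"
]
},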
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"loss"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from transformers import AdamW\n",
"\n",
"optimizer = AdamW(model.parameters(), lr=5e-5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.train()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"for line_in, line_exp in zip(train_in, train_exp):\n",
" input_ids = tokenizer(line_in, return_tensors='pt').input_ids[:,:512].to('cuda')\n",
" labels = tokenizer(line_exp, return_tensors='pt').input_ids.to('cuda')\n",
" # the forward function automatically creates the correct decoder_input_ids\n",
" loss = model(input_ids=input_ids, labels=labels).loss\n",
" loss.backward()\n",
" optimizer.step()\n",
" optimizer.zero_grad()\n",
" print(loss.item())"
]
},
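{
"cell_type": "markdown",
"metadata": {},
"source": [
"The loop above takes one optimizer step per document (batch size 1) and runs a single epoch. A minimal sketch of mini-batch training instead, assuming the standard padding and `-100` label-masking conventions of `transformers` (the batch size is illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"batch_size = 4  # illustrative value\n",
"for i in range(0, len(train_in), batch_size):\n",
"    enc = tokenizer(train_in[i:i + batch_size], return_tensors='pt',\n",
"                    padding=True, truncation=True, max_length=512).to('cuda')\n",
"    labels = tokenizer(train_exp[i:i + batch_size], return_tensors='pt',\n",
"                       padding=True).input_ids.to('cuda')\n",
"    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss\n",
"    loss = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask,\n",
"                 labels=labels).loss\n",
"    loss.backward()\n",
"    optimizer.step()\n",
"    optimizer.zero_grad()"
]
},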
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"model.eval()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input = dev_in[0]\n",
"\n",
"input_ids = tokenizer(input, return_tensors=\"pt\").input_ids[:,:512].to('cuda') # Batch size 1\n",
"\n",
"outputs = model.generate(input_ids)\n",
"\n",
"decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"\n",
"print(decoded)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"dev_exp[0]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"input = dev_in[2]\n",
"\n",
"input_ids = tokenizer(input, return_tensors=\"pt\").input_ids[:,:512].to('cuda') # Batch size 1\n",
"\n",
"outputs = model.generate(input_ids)\n",
"\n",
"decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"\n",
"print(decoded)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"dev_exp[2]"
]
},
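{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of spot-checking single examples, a minimal sketch of exact-match accuracy over the whole dev-0 split (reusing the model and data loaded above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# exact-match accuracy of generated answers against the dev-0 gold labels\n",
"correct = 0\n",
"for line_in, line_exp in zip(dev_in, dev_exp):\n",
"    ids = tokenizer(line_in, return_tensors='pt').input_ids[:, :512].to('cuda')\n",
"    pred = tokenizer.decode(model.generate(ids)[0], skip_special_tokens=True)\n",
"    correct += int(pred == line_exp)\n",
"print(f'exact match: {correct / len(dev_exp):.3f}')"
]
},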
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pytanie:\n",
"- co można poprawić w istniejącym rozwiązaniu?"
]
}
],
"metadata": {
"author": "Jakub Pokrywka",
"email": "kubapok@wmi.amu.edu.pl",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"lang": "pl",
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
},
"subtitle": "15.Similarity search[ćwiczenia]",
"title": "Ekstrakcja informacji",
"year": "2021"
},
"nbformat": 4,
"nbformat_minor": 4
}