
<script src="/static/js/sigma.min.js">
<script src="/static/js/sigma.parsers.json.min.js">

<div id="title" class="step" data-x="0" data-y="0">
   <h1>RetroC challenge
   <p>how to guess the publication year of a text?
   <p class="footnote">Filip Graliński, Rafał Jaworski,<br/>Łukasz Borchmann, Piotr Wierzchoń
   <p class="footnote">DATeCH 2017

<div class="step slide" data-x="0" data-y="1000">
   <h2>Overview
   <ol>
     <li>Raw data used
     <li>The RetroC challenge
     <ul>
        <li><i>jetzt neu!</i> RetroC2
     <li>Results

<div class="step slide" data-x="0" data-y="2000">
   <h2>Raw corpus
   <ul>
     <li>Polish digital libraries (OCRed DjVus/PDFs)
     <li>born-digital material, the (pre-)history of the Polish Internet, manual transcriptions, grassroots digitization efforts, <i>samiskan</i> etc.
     <ul>
        <li>… anything that is timestamped

<div class="step slide" data-x="1000" data-y="2000">
   <h2>Raw corpus in numbers
   <ul>
     <li>3.3M publications
     <li>24M pages
     <li>19.5G words
     <li>98.8G characters
     <li>18th century till today
     <li>mostly Polish
     <ul>
       <li>… (also German if you're interested — 350K publications)

<div class="step slide" data-x="0" data-y="3000">
   <h2>The challenge…

   <div style="font-size: 50%" class="readme">
     ^{readme}

   <p>Available on the <a href="http://gonito.net/retroc">Gonito.net platform</a>, or just:

   <pre>
      git clone git://gonito.net/retroc

<div class="step slide" data-x="1000" data-y="3000">
   <p>Core assumption
   <ul>
     <li>the temporal classifier will be used to date historical texts with no publication date given in metadata

   <h2>Assumptions
   <ul>
     <li>random sample of the whole corpus
     <ul>
       <li>stable random so the corpus can grow
     <li>test set balanced for years (1814-2013)
     <ul>
       <li>but the train set is not
     <li><b>different</b> sources used in the train and test set
     <li>500-word text snippets
     <li>OCR noise kept as it is
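
The "stable random" idea above can be sketched as follows. This is only an illustration under an assumption: the slides do not say how stability is achieved, so the hash-based rule and the names `stable_fraction` and `in_sample` are hypothetical, not the challenge's actual procedure.

```python
import hashlib

def stable_fraction(doc_id):
    """Map a document id to a reproducible number in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def in_sample(doc_id, rate):
    """Decide membership from the id alone, so adding documents never changes earlier decisions."""
    return stable_fraction(doc_id) < rate
```

Because the decision depends only on the document id, re-running the sampling after the corpus grows keeps every previously selected text in the sample.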

<div class="step slide" data-x="2000" data-y="3000">
   <h2>RetroC(2) in numbers

   <p>Number of texts:

   <table class="stats">
     <tr>
       <th>
       <th align="right">train
       <th align="right">dev-0
       <th align="right">dev-1
       <th align="right">test
     <tr>
       <td>RetroC
       <td align="right">40K
       <td align="right">9.9K
       <td align="right">-
       <td align="right">10K
     <tr>
       <td>RetroC2
       <td align="right">107.4K
       <td align="right">20K
       <td align="right">11.5K
       <td align="right">14.2K

   <p>The RetroC2 train set contains more information (the source is given for each entry), but this is <b>not</b> available in the test set.

<div class="step slide" style="height: 800px;" data-x="0" data-y="4000">
   <h2>Current status (RetroC)…

   <div style="font-size: 50%">
     ^{retrocLeaderboard}

<div class="step slide" style="height: 800px;" data-x="1000" data-y="4000">
   <h2>The best solution so far…

   <ul>
     <li>based on Vowpal Wabbit
     <li>features:
     <ul>
       <li>character 1-, 2-, 3- and 4-grams
       <li>tokens cut at 7th letter
     <li>small NN (6 units)
     <li>weights inversely proportional to (the root of) year frequency in the train set
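
A minimal sketch of the features and weights listed above, in plain Python rather than Vowpal Wabbit. The function names are hypothetical, and two readings are assumptions: "cut at 7th letter" is taken to mean keeping the first 7 characters of each whitespace-separated token, and the weight is taken to be 1 / sqrt(number of train texts for that year).

```python
import math
from collections import Counter

def char_ngrams(text, n_min=1, n_max=4):
    """All character n-grams of length n_min..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]

def truncated_tokens(text, max_len=7):
    """Whitespace tokens, each cut to its first max_len characters (assumed reading)."""
    return [tok[:max_len] for tok in text.split()]

def year_weights(train_years):
    """Per-year example weight: 1 / sqrt(count of that year in the train set)."""
    counts = Counter(train_years)
    return {year: 1.0 / math.sqrt(count) for year, count in counts.items()}
```

With these weights a year that is four times as frequent in the train set gets half the weight per example, which softens the imbalance without removing it.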

<div class="step slide" style="height: 800px;" data-x="2000" data-y="4000">
   <h2>Current status (RetroC2)…

   <div style="font-size: 50%">
     ^{retroc2Leaderboard}