gonito/templates/presentation-datech-2017.ha...

145 lines
4.3 KiB
Plaintext
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<script src="/static/js/sigma.min.js">
<script src="/static/js/sigma.parsers.json.min.js">
<script type="text/javascript" src="/static/js/jquery-1.11.3.min.js">
<script type="text/javascript" src="http://code.highcharts.com/highcharts.js">
<div id="title" class="step" data-x="0" data-y="0">
<h1>RetroC challenge
<p>how to guess the publication year of a text?
<table>
<tr>
<td>
<p class="footnote">Filip Graliński, Rafał Jaworski,<br/>Łukasz Borchmann, Piotr Wierzchoń&nbsp;&nbsp;&nbsp;
<td>
<center>
<img src="/static/images/amu-logo.jpg" width="300">
<p class="footnote">DATeCH 2017
<div class="step slide" data-x="0" data-y="1000">
<h2>Overview
<ol>
<li>Raw data used
<li>The RetroC challenge
<ul>
<li><i>jetzt neu!</i> RetrocC2
<li>Results
<div class="step slide" data-x="0" data-y="2000">
<h2>Raw corpus
<ul>
<li>Polish digital libraries (OCRed DjVus/PDFs)
<li>digital-born material, (pre-)history of Polish Internet, manually transcribed, grassroots digitization efforts, <i>samiskan</i> etc.
<ul>
<li>… anything in Polish that is timestamped
<center>
<img src="/static/images/sample1.png">
<div class="step slide" data-x="1000" data-y="2000">
<h2>Raw corpus in numbers
<ul>
<li>3.3M publications
<li>24M pages
<li>19.5G words
<li>98.8G characters
<li>18th century till today
<li>language: Polish
<ul>
<li>… (also German if you're interested — 350K publications)
<div class="graph">
<div id="gcontainerX01" style="width:100%; height:300px;">
<script type="text/javascript" src="/static/js/years-stats-draw.js">
<div class="step slide" data-x="0" data-y="3000">
<h2>The challenge…
<div style="font-size: 50%" class="readme">
^{readme}
<p>Available at <a href="http://gonito.net/retroc">Gonito.net platform</a> or just:
<pre>
git clone git://gonito.net/retroc
<div class="step slide" data-x="1000" data-y="3000">
<p>Core motivation
<ul>
<li>temporal classifier will be used to date historical texts with no publication date given in metadata
<h2>Assumptions
<ul>
<li>random sample of the whole corpus
<ul>
<li>stable random so the corpus can grow
<li>test set balanced for years (1814-2013)
<ul>
<li>but train set — not
<li><b>different</b> sources used in the train and test set
<li>500-word text snippets
<li>OCR noise kept as it is
<div class="step slide" data-x="2000" data-y="3000">
<h2>RetroC(2) in numbers
<p>Number of texts:
<table class="stats">
<tr>
<th>
<th align="right">train
<th align="right">dev-0
<th align="right">dev-1
<th align="right">test
<tr>
<td>RetroC
<td align="right">40K
<td align="right">9.9K
<td align="right">-
<td align="right">10K
<tr>
<td>RetroC2
<td align="right">107.4K
<td align="right">20K
<td align="right">11.5K
<td align="right">14.2K
<p style="padding-top: 30px">RetroC2
<ul>
<li>train set contains more information (source is given for each entry),<br/>but this is <b>not</b> present in the test set
<li>timestamps given as years with possible fractions<br/> (if publication day/month is known)
<div class="step slide" style="height: 800px;" data-x="0" data-y="4000">
<h2>Current status…
<div style="font-size: 50%">
^{retrocLeaderboard}
<div class="step slide" style="height: 800px;" data-x="1000" data-y="4000">
<h2>The best solution so far…
<ul>
<li>based on Vowpal Wabbit
<li>features:
<ul>
<li>character 1-, 2-, 3- and 4-grams
<li>tokens cut at 7th letter
<li>small NN (6 units)
<li>weights inversely proportional to (the root of) year frequency in the train set
<p style="padding-top: 30px">The result to beat:&nbsp;&nbsp;<span style="font-size:64px">RMSE=24.8 years</span>
<div class="step slide" style="height: 800px;" data-x="2000" data-y="4000">
<h2>RetroC2
<div style="font-size: 50%">
^{retroc2Leaderboard}
<div class="step slide" style="height: 800px;" data-x="2000" data-y="5000">
<h2>Thank you!
<p style="padding-top: 40px">This work was partially funded by:
<img width="400" src="/static/images/nprh-logo.jpg">