<script src="/static/js/sigma.min.js">
<script src="/static/js/sigma.parsers.json.min.js">
<script type="text/javascript" src="/static/js/jquery-1.11.3.min.js">
<script type="text/javascript" src="http://code.highcharts.com/highcharts.js">
<div id="title" class="step" data-x="0" data-y="0">
<h1>RetroC challenge
<p>how to guess the publication year of a text?
<table>
<tr>
<td>
<p class="footnote">Filip Graliński, Rafał Jaworski,<br/>Łukasz Borchmann, Piotr Wierzchoń&nbsp;&nbsp;&nbsp;
<td>
<center>
<img src="/static/images/amu-logo.jpg" width="300">
<p class="footnote">DATeCH 2017
<div class="step slide" data-x="0" data-y="1000">
<h2>Overview
<ol>
<li>Raw data used
<li>The RetroC challenge
<ul>
<li><i>brand new!</i> RetroC2
<li>Results
<div class="step slide" data-x="0" data-y="2000">
<h2>Raw corpus
<ul>
<li>Polish digital libraries (OCRed DjVus/PDFs)
<li>born-digital material, the (pre-)history of the Polish Internet, manually transcribed texts, grassroots digitization efforts, <i>samiskan</i>, etc.
<ul>
<li>… anything in Polish that is timestamped
<center>
<img src="/static/images/sample1.png">
<div class="step slide" data-x="1000" data-y="2000">
<h2>Raw corpus in numbers
<ul>
<li>3.3M publications
<li>24M pages
<li>19.5G words
<li>98.8G characters
<li>from the 18th century until today
<li>language: Polish
<ul>
<li>… (also German if you're interested — 350K publications)
<div class="graph">
<div id="gcontainerX01" style="width:100%; height:300px;">
<script type="text/javascript" src="/static/js/years-stats-draw.js">
<div class="step slide" data-x="0" data-y="3000">
<h2>The challenge…
<div style="font-size: 50%" class="readme">
^{readme}
<p>Available on the <a href="http://gonito.net/retroc">Gonito.net platform</a>, or just:
<pre>
git clone git://gonito.net/retroc
<div class="step slide" data-x="1000" data-y="3000">
<p>Core motivation
<ul>
<li>a temporal classifier will be used to date historical texts whose metadata give no publication date
<h2>Assumptions
<ul>
<li>random sample of the whole corpus
<ul>
<li>stable random sampling, so the corpus can grow
<li>test set balanced across years (1814-2013)
<ul>
<li>but the train set is not
<li><b>different</b> sources used in the train and test sets
<li>500-word text snippets
<li>OCR noise kept as it is
<div class="step slide" data-x="2000" data-y="3000">
<h2>RetroC(2) in numbers
<p>Number of texts:
<table class="stats">
<tr>
<th>
<th align="right">train
<th align="right">dev-0
<th align="right">dev-1
<th align="right">test
<tr>
<td>RetroC
<td align="right">40K
<td align="right">9.9K
<td align="right">-
<td align="right">10K
<tr>
<td>RetroC2
<td align="right">107.4K
<td align="right">20K
<td align="right">11.5K
<td align="right">14.2K
<p style="padding-top: 30px">RetroC2
<ul>
<li>the train set contains more information (the source is given for each entry),<br/>but this is <b>not</b> present in the test set
<li>timestamps given as years with possible fractions<br/> (if publication day/month is known)
<div class="step slide" style="height: 800px;" data-x="0" data-y="4000">
<h2>Current status…
<div style="font-size: 50%">
^{retrocLeaderboard}
<div class="step slide" style="height: 800px;" data-x="1000" data-y="4000">
<h2>The best solution so far…
<ul>
<li>based on Vowpal Wabbit
<li>features:
<ul>
<li>character 1-, 2-, 3- and 4-grams
<li>tokens cut at the 7th letter
<li>small neural network (6 units)
<li>weights inversely proportional to (the root of) the year frequency in the train set
<p style="padding-top: 30px">The result to beat:&nbsp;&nbsp;<span style="font-size:64px">RMSE=24.8 years</span>
<div class="step slide" style="height: 800px;" data-x="2000" data-y="4000">
<h2>RetroC2
<div style="font-size: 50%">
^{retroc2Leaderboard}
<div class="step slide" style="height: 800px;" data-x="2000" data-y="5000">
<h2>Thank you!
<p style="padding-top: 40px">This work was partially funded by:
<img width="400" src="/static/images/nprh-logo.jpg">