Metadata-Version: 2.1
Name: tika
Version: 1.23
Summary: Apache Tika Python library
Home-page: http://github.com/chrismattmann/tika-python
Author: Chris Mattmann
Author-email: chris.a.mattmann@jpl.nasa.gov
License: Apache License version 2 ("ALv2")
Download-URL: http://github.com/chrismattmann/tika-python
Description: [![Build Status](https://travis-ci.org/chrismattmann/tika-python.svg?branch=master)](https://travis-ci.org/chrismattmann/tika-python)
[![Coverage Status](https://coveralls.io/repos/github/chrismattmann/tika-python/badge.svg?branch=master)](https://coveralls.io/github/chrismattmann/tika-python?branch=master)

tika-python
===========
A Python port of the [Apache Tika](http://tika.apache.org/) library that makes Tika available using the [Tika REST Server](http://wiki.apache.org/tika/TikaJAXRS).

This makes Apache Tika available as a Python library, installable via Setuptools, Pip and Easy Install.

To use this library, you need to have Java 7+ installed on your system, as tika-python starts up the Tika REST server in the background.

Inspired by [Aptivate Tika](https://github.com/aptivate/python-tika).

Installation (with pip)
-----------------------
1. `pip install tika`

Installation (without pip)
--------------------------
1. `python setup.py build`
2. `python setup.py install`

Airgap Environment Setup
------------------------
To get this working in a disconnected environment, download a tika server file and set the TIKA_SERVER_JAR environment variable to TIKA_SERVER_JAR="file:////tika-server.jar". This tells `python-tika` to "download" this file, move it to `/tmp/tika-server.jar`, and run it as a background process.

This is the only way to run `python-tika` without internet access. Without this set, the default is to check the tika version and pull the latest from Apache every time.

Environment Variables
---------------------
These are read once, when tika/tika.py is initially loaded, and used throughout after that.

1. `TIKA_VERSION` - set to the version string, e.g., 1.12, or default to the current Tika version.
2. `TIKA_SERVER_JAR` - set to the full URL to the remote Tika server jar to download and cache.
3. `TIKA_SERVER_ENDPOINT` - set to the host (local or remote) for the running Tika server jar.
4. `TIKA_CLIENT_ONLY` - if set to True, then `TIKA_SERVER_JAR` is ignored, and the library relies on the value of `TIKA_SERVER_ENDPOINT` and treats Tika like a REST client.
5. `TIKA_TRANSLATOR` - set to the fully qualified class name (defaults to Lingo24) for the Tika translator implementation.
6. `TIKA_SERVER_CLASSPATH` - set to a string (delimited by ':' for each additional path) to prepend to the Tika server jar path.
7. `TIKA_LOG_PATH` - set to a directory with write permissions and the `tika.log` and `tika-server.log` files will be placed in this directory.
8. `TIKA_PATH` - set to a directory with write permissions and the `tika_server.jar` file will be placed in this directory.
9. `TIKA_JAVA` - set the Java runtime name, e.g., `java` or `java9`.
10. `TIKA_STARTUP_SLEEP` - number of seconds (`float`) to wait per check if Tika server is launched at runtime.
11. `TIKA_STARTUP_MAX_RETRY` - number of checks (`int`) to attempt for Tika server startup if launched at runtime.
12. `TIKA_JAVA_ARGS` - set Java runtime arguments, e.g., `-Xmx4g`.

Testing it out
==============

Parser Interface (backwards compat prior to REST)
-------------------------------------------------
```
#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])
```

Parser Interface
----------------------
The parser interface extracts text and metadata using the /rmeta interface. This is one of the better ways to get the internal XHTML content extracted.
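
As a small illustration of consuming the parser's result, the sketch below summarizes the dictionary returned by `parser.from_file`. It relies only on the two keys used in this README (`"metadata"` and `"content"`). The `Content-Type` metadata key, the possibility that `"content"` is `None`, and the helper name `summarize_parsed` are assumptions made for illustration, not documented behaviour.

```python
def summarize_parsed(parsed):
    """Return (content_type, text_length) for a parser result dictionary.

    Hypothetical helper: assumes the "metadata"/"content" keys shown in
    this README; "Content-Type" is an assumed metadata key.
    """
    metadata = parsed.get("metadata") or {}
    # Treat missing or None content as empty text.
    content = parsed.get("content") or ""
    return metadata.get("Content-Type", "unknown"), len(content.strip())


# Stand-in for the result of parser.from_file('/path/to/file'),
# so the sketch runs without a Tika server.
example = {
    "metadata": {"Content-Type": "text/plain"},
    "content": "\nGood evening, Dave\n",
}
print(summarize_parsed(example))
```

In real use, `example` would be replaced by the value returned from `parser.from_file` or `parser.from_buffer`.
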
Note: ![Alert Icon](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon28.png "Alert") The parser interface needs the following environment variable set on the console for printing of the extracted content.
```export PYTHONIOENCODING=utf8```

```
#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

# Optionally, you can pass the Tika server URL along with the call,
# which is useful for multi-instance execution or when Tika is dockerized/linked
parsed = parser.from_file('/path/to/file', 'http://tika:9998/tika')
string_parsed = parser.from_buffer('Good evening, Dave', 'http://tika:9998/tika')
```

Specify Output Format To XHTML
---------------------
The parser interface is optionally able to output the content as XHTML rather than plain text.

Note: ![Alert Icon](https://github.com/adam-p/markdown-here/raw/master/src/common/images/icon28.png "Alert") The parser interface needs the following environment variable set on the console for printing of the extracted content.
```export PYTHONIOENCODING=utf8```

```
#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file', xmlContent=True)
print(parsed["metadata"])
print(parsed["content"])

# Note: This is also available when parsing from the buffer.
```

Unpack Interface
----------------
The unpack interface handles both metadata and text extraction in a single call. Internally, the server returns a tarball of metadata and text entries that the library unpacks, reducing the wire load for extraction.

```
#!/usr/bin/env python
import tika
from tika import unpack
parsed = unpack.from_file('/path/to/file')
```

Detect Interface
----------------------
The detect interface provides an IANA MIME type classification for the provided file.
```
#!/usr/bin/env python
import tika
from tika import detector
print(detector.from_file('/path/to/file'))
```

Config Interface
----------------------
The config interface allows you to inspect the Tika Server environment's configuration, including what parsers, mime types, and detectors the server has been configured with.

```
#!/usr/bin/env python
import tika
from tika import config
print(config.getParsers())
print(config.getMimeTypes())
print(config.getDetectors())
```

Language Detection Interface
---------------------------------
The language detection interface provides a 2-character language code detected from the text in the provided file.

```
#!/usr/bin/env python
from tika import language
print(language.from_file('/path/to/file'))
```

Translate Interface
------------------------
The translate interface translates the text automatically extracted by Tika from the source language to the destination language.

```
#!/usr/bin/env python
from tika import translate
print(translate.from_file('/path/to/spanish', 'es', 'en'))
```

Using a Buffer
--------------
Note you can also use a Parser and Detector .from_buffer(string) method to dynamically parse a string buffer in Python and/or detect its MIME type. This is useful if you've already loaded the content into memory.

Using Client Only Mode
----------------------
You can set Tika to use Client only mode by setting
```python
import tika
tika.TikaClientOnly = True
```

Then you can run any of the methods and it will fully omit the check to see if the service on localhost is running and omit printing the check messages.

Changing the Tika Classpath
---------------------------
You can update the classpath that Tika server uses by setting the classpath as a set of ':' delimited strings.
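
One way to build such a string from a list of directories is sketched here. The helper name `build_classpath` and the example paths are hypothetical and not part of tika-python; the helper simply joins the non-empty entries with ':' as described above.

```python
def build_classpath(paths):
    """Join directory paths into a ':'-delimited classpath string.

    Hypothetical helper, not part of tika-python.
    """
    # Drop empty entries and surrounding whitespace so no stray ':' appears.
    cleaned = [p.strip() for p in paths if p and p.strip()]
    return ":".join(cleaned)


# Placeholder paths for illustration only.
classpath = build_classpath(["/path/to/mime", "", "/path/to/models"])
print(classpath)
```

The resulting string could then be assigned to `tika.tika.TikaServerClasspath` or exported as `TIKA_SERVER_CLASSPATH`.
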
For example, if you want to get Tika-Python working with [GeoTopicParsing](http://wiki.apache.org/tika/GeoTopicParser), you can do the following. Replace the paths below with your own paths, as identified [here](http://wiki.apache.org/tika/GeoTopicParser), and make sure that you have done this:

kill Tika server (if already running):
```bash
ps aux | grep java | grep Tika
kill -9 PID
```

```python
import tika.tika
import os
from tika import parser
home = os.getenv('HOME')
tika.tika.TikaServerClasspath = home + '/git/geotopicparser-utils/mime:'+home+'/git/geotopicparser-utils/models/polar'
parsed = parser.from_file(home + '/git/geotopicparser-utils/geotopics/polar.geot')
print(parsed["metadata"])
```

Customizing the Tika Server Request
---------------------------
You may customize the outgoing HTTP request to Tika server by setting `requestOptions` on the `.from_file` and `.from_buffer` methods (Parser, Unpack, Detect, Config, Language, Translate). It should be a dictionary of arguments that will be passed to the request method. The [request method documentation](https://requests.kennethreitz.org/en/master/api/#requests.request) specifies valid arguments. This will override any defaults except for `url` and `params`/`data`.

```
from tika import parser
parsed = parser.from_file('/path/to/file', requestOptions={'timeout': 120})
```

New Command Line Client Tool
============================
When you install Tika-Python you also get a new command line client tool, `tika-python`, installed in your /path/to/python/bin directory.

The options and help for the command line tool can be seen by typing `tika-python` without any arguments. This will also download a copy of the tika-server jar and start it if you haven't done so already.

```
tika.py [-v] [-o ] [--server ] [--install ] [--port ]