This week I was working on a personal project and at some point I wanted to convert a PDF file to HTML. After doing some research, I found Apache Tika, a toolkit that can detect and extract text and metadata from different file types, such as PDF, PPT, and XLS.
In my case, I decided to use it on the command line, but you can also add it as a dependency in a Gradle or Maven project, among other options. Their website has more details on how to get started.
CLI
To use it on the command line, you need Java installed wherever you run it: your server, your local machine, or a container. Run java -version to check your current version (java --version also works on Java 9 and later).
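If you want to script that check, here is a minimal sketch. The function name check_java is my own, not something Tika or Java provides; it only prints the first line of the version output, or a hint when Java is missing.

```shell
# Print the installed Java version, or a hint if Java is missing.
# check_java is a hypothetical helper name used for illustration.
check_java() {
  if command -v java >/dev/null 2>&1; then
    # `java -version` writes to stderr, so redirect it to stdout
    java -version 2>&1 | head -n 1
  else
    echo "Java not found: install a JRE or JDK first" >&2
    return 1
  fi
}
```

Call check_java before running Tika; it returns a non-zero status when no java executable is on your PATH.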
First, download the JAR file. Here we’re using wget, but any other file-retrieval tool, such as curl, works too.
wget https://archive.apache.org/dist/tika/2.6.0/tika-app-2.6.0.jar
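If you prefer curl, the same download could look roughly like this. The fetch_tika function and the TIKA_URL variable are names I made up for the sketch; the URL is the one used above. It skips the download when the JAR is already in the current directory.

```shell
# Download the Tika app JAR with curl, unless it is already present.
# TIKA_URL and fetch_tika are illustrative names, not part of Tika.
TIKA_URL="https://archive.apache.org/dist/tika/2.6.0/tika-app-2.6.0.jar"

fetch_tika() {
  jar_file="${TIKA_URL##*/}"   # keep only the file name part of the URL
  if [ -f "$jar_file" ]; then
    echo "$jar_file already downloaded"
  else
    # -f: fail on HTTP errors, -L: follow redirects, -o: output file
    curl -fL -o "$jar_file" "$TIKA_URL"
  fi
}
```

Run fetch_tika from the directory where you want to keep the JAR.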
Then run the command with the following structure:
java -jar /[path_to]/tika-app-2.6.0.jar --[output_format] [source_file]
For example:
java -jar /home/alison/dev/project-x/tika-app-2.6.0.jar --html ./static/report.pdf
This command converts your PDF file to HTML and prints the result to stdout. Check their documentation for more options and formats.
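Since the result goes to stdout, you will usually want to redirect it into a file. Below is a small sketch of how that could be wrapped up. The tika_convert function and the TIKA_JAR variable are hypothetical names of mine, and the default JAR path is an assumption you should adjust. The format argument is passed straight through as a Tika option, so html becomes --html and text becomes --text.

```shell
# Convert a document with Tika and write the result to a file
# instead of the terminal.
# TIKA_JAR is an assumed location for the JAR; adjust it to your setup.
TIKA_JAR="${TIKA_JAR:-./tika-app-2.6.0.jar}"

# usage: tika_convert <format> <source_file> <output_file>
# <format> is a Tika output option without the dashes, e.g. html or text
tika_convert() {
  java -jar "$TIKA_JAR" "--$1" "$2" > "$3"
}

# Examples (commented out so sourcing this snippet has no side effects):
# tika_convert html ./static/report.pdf ./static/report.html
# tika_convert text ./static/report.pdf ./static/report.txt
```

Keeping the format as a parameter means the same helper can produce HTML or plain text without duplicating the java -jar line.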