Apache Tika – A must-have tool if you are doing Natural Language Understanding work in Java. You have to prepare your training material from many texts and articles, and Tika helps you extract the text from all kinds of documents, such as HTML, PPT, Word and other office document types, and many others.
“Tika detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).”
Add this dependency to your Maven pom.xml:
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-core</artifactId>
        <version>1.16</version>
    </dependency>
With tika-core alone you can use the core Tika features, such as detecting a document's file type.
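As a rough illustration of file type detection, here is a minimal sketch that uses the `Tika` facade class from tika-core. The class name `TikaDetectSketch` and the sample file names are made up for this example. It shows two kinds of detection: by file name only, and by the first bytes of the content.

```java
import org.apache.tika.Tika;

import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class TikaDetectSketch {

    // The Tika facade is the simplest entry point in tika-core.
    private static final Tika TIKA = new Tika();

    // Guess the media type from the file name alone; no content is read.
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Guess the media type from the leading bytes of the document.
    static String detectByContent(byte[] prefix) {
        return TIKA.detect(prefix);
    }

    public static void main(String[] args) {
        // Sample inputs are assumptions for the demo, not real files.
        System.out.println(detectByName("slides.ppt"));

        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectByContent(pdfHeader));
    }
}
```

Name-based detection is cheap but trusts the extension, so detecting from the content is the safer choice when files come from unknown sources.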
If you also want to extract content, add the parsers dependency as well, plus any others you need.
    <dependency>
        <groupId>org.apache.tika</groupId>
        <artifactId>tika-parsers</artifactId>
        <version>1.16</version>
    </dependency>
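With the parsers on the classpath, text extraction can be sketched with the same `Tika` facade and its `parseToString` method. The class name `TikaExtractSketch` and the inline HTML sample are assumptions for the demo. In practice you would pass the bytes or a stream of a real document. The checked exceptions are wrapped in an unchecked one only to keep the example short.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical class name, used only for this illustration.
class TikaExtractSketch {

    // Detects the document type and returns its plain text content.
    static String extractText(byte[] document) {
        try (InputStream in = new ByteArrayInputStream(document)) {
            return new Tika().parseToString(in);
        } catch (IOException | TikaException e) {
            // Wrapped for brevity; real code may want to handle these separately.
            throw new IllegalStateException("Could not extract text", e);
        }
    }

    public static void main(String[] args) {
        // Inline sample document, an assumption for the demo.
        String html = "<html><body><p>Hello from Tika</p></body></html>";
        String text = extractText(html.getBytes(StandardCharsets.UTF_8));
        System.out.println(text.trim());
    }
}
```

The same call should work for the other formats the parsers module supports, since the facade picks a parser based on the detected type.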
Tika can also run as a RESTful service on a Jetty server, so you can call its API over HTTP. It has a simple GUI too.
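As a sketch of calling the REST service from Java, the example below sends a file to a running Tika server and prints the extracted text. It makes several assumptions: the server is already started and listening on `http://localhost:9998` (the tika-server default port), text extraction is exposed at `PUT /tika`, and the client runs on Java 11 or later so that `java.net.http` is available. The class name `TikaServerClientSketch` is made up for this example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical class name, used only for this illustration.
class TikaServerClientSketch {

    // Builds a PUT request against the /tika endpoint, asking for plain text back.
    static HttpRequest buildExtractRequest(String baseUrl, HttpRequest.BodyPublisher body) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(body)
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // The first argument is the path of the document to send.
        Path file = Path.of(args[0]);

        // Assumed server address; adjust to wherever your Tika server runs.
        HttpRequest request = buildExtractRequest(
                "http://localhost:9998",
                HttpRequest.BodyPublishers.ofFile(file));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body());
    }
}
```

Running Tika as a service keeps the parser dependencies out of your own application, at the cost of one HTTP round trip per document.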