NLU tools – Apache Tika

Apache Tika is a must-have tool if you are doing Natural Language Understanding work in Java. You need to prepare training material from many texts and articles, and Tika helps you extract the text from all kinds of documents, such as HTML, PPT, Word and other office formats, and many others.

“Tika detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).”

Add the Tika core dependency to your Maven project:


With that you can use the Tika core features, such as detecting a document's file type.

If you also want to extract content, you need to add the parsers dependency as well, plus any other modules you need.


It also supports running as a RESTful service on a Jetty server, so you can call the API over HTTP. It has a simple GUI too.
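As a rough sketch of calling the RESTful mode from a script: the snippet below assumes a Tika server was started separately and listens on its default port 9998, and it uses the PUT /tika endpoint to ask for plain text. The function names and the sample file name are made up for illustration.

```python
import urllib.request

# Assumption: a Tika server is already running locally on the default port.
TIKA_URL = "http://localhost:9998"


def build_tika_request(data, base_url=TIKA_URL):
    """Build (but do not send) a PUT request asking Tika for plain text."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path, base_url=TIKA_URL):
    """Send a local file to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        request = build_tika_request(f.read(), base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (needs a running server; "slides.ppt" is a hypothetical file):
#   print(extract_text("slides.ppt"))
```

Building the request is kept separate from sending it, so the URL and headers can be checked without a server.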




How to calculate the Word Error Rate in Python

You can use this package to do it.

Before you can use it, you need to install the SciPy package, which also pulls in NumPy.

Do it like this:

sudo apt-get install python-scipy ## for Python2
sudo apt-get install python3-scipy ## for Python3


After that, unzip WER-in-python-master

and run this command:

$ python wer.py reference.txt hypothesis.txt
REF: This great machine can recognize speech           
HYP: This       machine can wreck     a      nice beach
EVA:      D                 S         S      I    I    
WER: 83.33%


reference.txt and hypothesis.txt can be multi-line text files.
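To see where the number comes from, here is a minimal sketch of the calculation itself. It is not the package's code, only a plain word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The function name is made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Return WER = (S + D + I) / N, where N is the reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # turning i reference words into nothing costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)


wer = word_error_rate(
    "This great machine can recognize speech",
    "This machine can wreck a nice beach",
)
print(f"WER: {wer:.2%}")  # prints "WER: 83.33%"
```

In the example, 1 deletion, 2 substitutions and 2 insertions against 6 reference words give 5/6.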


How to calculate the similarity or distance between two articles?


May 2015

1. Use topic modelling. Project the text content of each article onto the topic dimensions, then calculate the similarity between the resulting topic vectors.

gensim, a topic modelling library for Python


2. Use cosine similarity. The steps are:

(1) Use TF-IDF to find the key words of the two articles.

(2) Combine the two key word sets into one set, and count the frequency of each key word in each article.

(3) Create a frequency vector for each article.

(4) Calculate the cosine similarity of the two vectors. The larger the value, the more similar the two articles are.

TF-IDF (term frequency–inverse document frequency)
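The cosine similarity steps can be sketched roughly as below. This is a simplification: it skips the TF-IDF key word selection (which needs a larger corpus to be meaningful) and treats every word of both articles as a key word. The function names are made up for illustration.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each lower-cased word occurs in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-frequency vectors of two texts."""
    freq_a = term_frequencies(text_a)
    freq_b = term_frequencies(text_b)

    # Merge both word sets, then build one frequency vector per text.
    vocabulary = sorted(set(freq_a) | set(freq_b))
    vec_a = [freq_a[word] for word in vocabulary]
    vec_b = [freq_b[word] for word in vocabulary]

    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


print(cosine_similarity("the cat sat on the mat", "the cat lay on the rug"))
```

Because word counts are never negative, the result lies between 0 (no shared words) and 1 (same word proportions).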



SOLR common issues

By W.Zh   May 2015


Exception in thread "main" java.lang.UnsupportedClassVersionError: org/apache/solr/util/SolrCLI : Unsupported major.minor version 51.0


After unzipping Solr, you try to start it with:


bin/solr start -e cloud -noprompt



The cause: the installed Java version is lower than 7 (class file version 51.0 corresponds to Java 7).

How to fix:
Please see my page:
How to upgrade the Java version on Ubuntu
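To check which Java version a class file needs, you can read its header. The sketch below relies on the class file layout (4 bytes magic number, 2 bytes minor version, 2 bytes major version) and on the rule that the Java release is the major version minus 44, so 51 means Java 7. The function name is made up for illustration.

```python
import struct


def required_java_version(class_bytes):
    """Return the Java release a compiled class needs, from its first 8 bytes."""
    # Big-endian: 4-byte magic number, 2-byte minor, 2-byte major version.
    magic, _minor, major = struct.unpack(">IHH", class_bytes[:8])
    if magic != 0xCAFEBABE:
        raise ValueError("not a Java class file")
    return major - 44  # e.g. major 51 -> Java 7, major 52 -> Java 8


# Header of a class compiled for Java 7 (major version 51 = 0x33).
header = bytes.fromhex("cafebabe00000033")
print(required_java_version(header))  # prints 7
```

For a real file, pass in the first 8 bytes read from the .class file in binary mode.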