NLU tools – Apache Tika

Apache Tika is a must-have tool if you are doing Natural Language Understanding (NLU) work in Java, because you have to prepare your training material from many texts and articles. Tika helps you extract the text from all kinds of documents, such as HTML, PPT, Word and other office document types, and many others.

“Tika detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).”

Add the tika-core dependency (groupId org.apache.tika) to your Maven pom.xml, and you can use the core Tika features, such as detecting a document's file type.

If you also want to extract content, you need to add the parsers module as well (tika-parsers in Tika 1.x), plus other modules as needed.


It also supports running in RESTful service mode with a Jetty server, so you can call its API as a web service. It has a simple GUI too.
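As a rough sketch of how the RESTful mode can be called from a shell: the snippet below assumes a Tika server is already running locally on port 9998 (which I believe is the tika-server default) and that curl is installed. The function names and the TIKA_URL variable are made up for illustration; the /tika and /detect/stream endpoints are the ones I know from the Tika server documentation, so check them against your version.

```shell
# Base URL of a running Tika server (assumption: local server, default port).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper: upload a file with PUT and ask for plain text back.
tika_extract_text() {
  curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika"
}

# Hypothetical helper: ask the server to detect the MIME type of a file.
tika_detect_type() {
  curl -s -T "$1" "$TIKA_URL/detect/stream"
}
```

With the server up, calling tika_extract_text on a document should print the extracted text, and tika_detect_type should print the detected MIME type.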



How to expand the disk size of a CentOS 7 vSphere virtual machine

Here I list the major steps to extend the hard disk of my CentOS 7 VM from 40GB to 100GB by adding an sda3 partition. I referred to some internet links to make it work, but there are also a few tricky parts you need to watch out for.

1. Show the current disk partition info and size:

df -h

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 38G 34G 3.8G 90% /
devtmpfs 2.0G 0 2.0G 0% /dev
tmpfs 2.0G 80K 2.0G 1% /dev/shm
tmpfs 2.0G 8.9M 2.0G 1% /run
tmpfs 2.0G 0 2.0G 0% /sys/fs/cgroup
/dev/sda1 497M 246M 252M 50% /boot


lsblk

NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 40G 0 disk
├─sda1 8:1 0 500M 0 part /boot
└─sda2 8:2 0 39.5G 0 part 
 ├─centos-swap 253:0 0 2G 0 lvm [SWAP]
 └─centos-root 253:1 0 37.5G 0 lvm /
sr0 11:0 1 1024M 0 rom

sudo fdisk -l

Disk /dev/sda: 42.9 GB, 42949672960 bytes, 83886080 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x0001af8a

Device Boot Start End Blocks Id System
/dev/sda1 * 2048 1026047 512000 83 Linux
/dev/sda2 1026048 83886079 41430016 8e Linux LVM

Disk /dev/mapper/centos-swap: 2164 MB, 2164260864 bytes, 4227072 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

Disk /dev/mapper/centos-root: 40.3 GB, 40256929792 bytes, 78626816 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes


hostnamectl

 Static hostname: wayne-v-ctos7
 Icon name: computer-vm
 Chassis: vm
 Machine ID: da93c1884d894932aef5bd13121e7478
 Boot ID: 772e152a486c45e58feb119b02a4c5f7
 Operating System: CentOS Linux 7 (Core)
 CPE OS Name: cpe:/o:centos:centos:7
 Kernel: Linux 3.10.0-229.20.1.el7.x86_64
 Architecture: x86_64

shutdown -h now

2. Then shut down the VM, back up (export) your VM, delete all snapshots of the VM, and then use vSphere Client 5.5 to extend the disk size from 40GB to 100GB. If you do not remove the snapshots, you cannot extend the disk size!

3. Then start the VM and create a new partition to use the new disk blocks:

sudo fdisk -l
sudo fdisk /dev/sda

The goal in fdisk is to create a new primary partition 3 (sda3) and set its type to Linux LVM (8e).
n - create new primary partition 3 using the remaining disk blocks
3 .......
t - change partition 3 to type 8e
w - save and quit

Make sure you understand why I use 8e (Linux LVM) here and what each option does. Type “m” inside fdisk for help.

4. After rebooting the VM, I now have sda3, but it is not used yet. I need to use it to extend the volume group:

df -h
sudo fdisk -l
sudo vgdisplay
sudo vgextend centos /dev/sda3
df -h
sudo vgdisplay
sudo lvextend -L+59999M /dev/centos/root 
sudo resize2fs /dev/centos/root  # only for ext3/ext4 filesystems (the CentOS 5/6 default)
sudo xfs_growfs /dev/centos/root  # for XFS (the CentOS 7 default)
df -h

Filesystem Size Used Avail Use% Mounted on
/dev/mapper/centos-root 97G 35G 62G 36% /
devtmpfs 2.9G 0 2.9G 0% /dev
tmpfs 2.9G 92K 2.9G 1% /dev/shm
tmpfs 2.9G 8.8M 2.9G 1% /run
tmpfs 2.9G 0 2.9G 0% /sys/fs/cgroup
/dev/sda1 497M 246M 252M 50% /boot
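Only one of the two grow commands in step 4 applies to a given system, depending on the filesystem type. The sketch below is a small helper of my own (grow_command is a made-up name, not part of any tool) that prints, without running it, the command matching a filesystem type. It assumes ext2/ext3/ext4 and XFS are the only cases of interest.

```shell
# Print (do not run) the grow command that matches a filesystem type.
# $1 = filesystem type, $2 = logical volume device path.
grow_command() {
  fstype=$1
  device=$2
  case "$fstype" in
    xfs)            echo "xfs_growfs $device" ;;
    ext2|ext3|ext4) echo "resize2fs $device" ;;
    *)              echo "unsupported filesystem type: $fstype" >&2
                    return 1 ;;
  esac
}

grow_command xfs /dev/centos/root
grow_command ext4 /dev/centos/root
```

On a live system the filesystem type of the root volume can be read with “df -T /” or “findmnt -n -o FSTYPE /”.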


*** About this command:

sudo lvextend -L+59999M /dev/centos/root

Here the disk gained 60GB, but I extend the volume by only 59999M (about 58.6GiB, since the M suffix means MiB), because I want to leave a bit of space unallocated rather than use all of it.
How do you get /dev/centos/root? centos is the “VG Name” shown by the vgdisplay command, and root is the logical volume. In “df -h” it appears in this form:
/dev/mapper/centos-root  (VolumeGroup-Volume)
So be careful not to make a mistake here.
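A quick check of the numbers behind the lvextend size, assuming the disk grew by exactly 60GiB and that the M suffix of -L is read as MiB (which is how LVM treats it). The variable names are only for illustration, and the real free space in the volume group is slightly smaller because of LVM metadata.

```shell
added_gib=60                          # 100GB - 40GB added in vSphere
added_mib=$((added_gib * 1024))       # same amount in MiB
extend_mib=59999                      # value passed to lvextend -L+...M
left_mib=$((added_mib - extend_mib))  # roughly what stays unallocated

# Convert the extension size to GiB with one decimal place.
extend_gib=$(awk -v m="$extend_mib" 'BEGIN { printf "%.1f", m / 1024 }')

echo "added: ${added_mib} MiB"
echo "extended by: ${extend_mib} MiB (about ${extend_gib} GiB)"
echo "left unallocated: about ${left_mib} MiB"
```

If you would rather hand all remaining space to the volume, lvextend also accepts “-l +100%FREE” instead of a fixed size.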

For the meaning and details of each step, you can refer to these two links:






How to extract the difference between two text files in Linux

Q: You have two files, file1.txt and file2.txt. You want to get the lines that are in file1 but not in file2. How do you do this with a command in Linux?

A: If your files are already sorted, there is a command that does exactly this:


 comm - compare two sorted files line by line

 comm [OPTION]... FILE1 FILE2


-1 suppress lines unique to FILE1

-2 suppress lines unique to FILE2

-3 suppress lines that appear in both files

So this command will solve your issue:

comm -2 -3 file1.txt file2.txt > fileResult.txt
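To see the command above in action, here is a small self-contained sketch. The sample contents and the temporary directory are made up for the demo; both sample files are already sorted, as comm requires.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
printf 'apple\nbanana\ncherry\n' > "$workdir/file1.txt"
printf 'banana\ndate\n' > "$workdir/file2.txt"

# -2 drops lines unique to file2, -3 drops lines common to both,
# so only the lines unique to file1 remain.
comm -2 -3 "$workdir/file1.txt" "$workdir/file2.txt" > "$workdir/fileResult.txt"

result=$(cat "$workdir/fileResult.txt")
echo "$result"
rm -rf "$workdir"
```

This prints apple and cherry, the two lines that exist only in file1.txt.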


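But what if your files are not sorted?

There is another way to solve this, assuming your lines are short enough to fit. Note that --width sets the total width of the side-by-side output, so each file gets roughly half of it (a bit under 200 characters each with the value 400 used here).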

diff -a --width=400 --suppress-common-lines -y file1.txt file2.txt > fileResult.txt


With this diff command the result is not clean, in fact, because it pads each line with “…..(spaces)….<”, so you still need to search for these extras and remove them.

If you know a better way to do this with diff, please let me know. Thanks.
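Two alternatives for unsorted files, offered as a sketch rather than a definitive answer: sort copies of the files first and reuse comm, or use grep with a pattern file. The sample contents and file names under the temporary directory are made up; LC_ALL=C is set so that sort and comm agree on the ordering.

```shell
workdir=$(mktemp -d)
printf 'cherry\napple\nbanana\n' > "$workdir/file1.txt"
printf 'date\nbanana\n' > "$workdir/file2.txt"

# Option 1: sort copies, then run comm as before (output comes out sorted).
LC_ALL=C sort "$workdir/file1.txt" > "$workdir/file1.sorted"
LC_ALL=C sort "$workdir/file2.txt" > "$workdir/file2.sorted"
by_comm=$(LC_ALL=C comm -2 -3 "$workdir/file1.sorted" "$workdir/file2.sorted")

# Option 2: grep keeps the original order of file1.
# -F fixed strings, -x match whole lines, -v invert, -f read patterns from file2.
by_grep=$(grep -F -x -v -f "$workdir/file2.txt" "$workdir/file1.txt")

printf 'comm after sort:\n%s\n' "$by_comm"
printf 'grep:\n%s\n' "$by_grep"
rm -rf "$workdir"
```

The comm variant returns the lines in sorted order, while the grep variant keeps them in the order they appear in file1.txt, so pick whichever suits the result you need.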