Extracting text structure from PDFs: brainstorming

1. Use an existing tool such as GROBID.


GROBID uses machine learning to extract structure data from scientific papers. It has a demo page at http://cloud.science-miner.com/grobid/ . In my tests it extracted the title and some other fields, but the content can still be mixed with headers and footers. Because it is designed for academic documents, it may have problems with other types of PDF.

2. Borrow ideas from GROBID to build a system adapted to the PDF types that you use.
PDFs with a unified format may give better results, but this will not work for PDFs in general.

3. Convert the PDF to a DOC file, and then use DOC tools to extract the content structure?


Some brainstorming ideas:

1). Use a tool like PDF Clown to extract the position and style information of the text in the PDF.

2). PDFs of the same category share patterns of style and position, so we have a chance of finding the structure of the file from them.

3). Based on the style of the text, it is possible to build a tree-like text structure, but this tree may not match the real chapter tree. This method can help with section levels and titles.

4). How to find the main content text?
Use statistics on the text fonts: the font that accounts for the largest share of the text is normally the main content. Because the main content is styled differently from the other parts, this can give good results.

5). Scan each page from the bottom line up towards the center. If the PDF has many pages and a unified footer format, it is possible to find which style and font the footer uses, and to identify the footer pattern statistically.

6). The same trick used for the footer can find the header, if the pages have one.

7). If we have the position and style information of the text, we may also be able to train a classifier on position and style to find the basic structure of the file.
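Ideas 4) to 6) can be sketched in a few lines of Python. This is only a rough illustration: the record layout (page, distance from the page bottom, font, size, text) and the sample data are assumptions made for the sketch, not the output format of PDF Clown or any other tool.

```python
from collections import Counter

# Hypothetical input, one record per text line:
# (page, y_from_bottom, font, size, text). This layout is an assumption.
LINES = [
    (1, 700, "Times-Bold", 16.0, "Annual Report"),
    (1, 650, "Times", 10.0, "The first paragraph of the body text."),
    (1, 636, "Times", 10.0, "It continues on a second line."),
    (1, 30, "Helvetica", 8.0, "ACME Corp - Page 1"),
    (2, 700, "Times-Bold", 12.0, "1. Overview"),
    (2, 650, "Times", 10.0, "More body text on the second page."),
    (2, 30, "Helvetica", 8.0, "ACME Corp - Page 2"),
    (3, 650, "Times", 10.0, "Body text on the third page."),
    (3, 30, "Helvetica", 8.0, "ACME Corp - Page 3"),
]


def body_style(lines):
    """Return the (font, size) pair that covers the most characters."""
    chars = Counter()
    for _page, _y, font, size, text in lines:
        chars[(font, size)] += len(text)
    return chars.most_common(1)[0][0]


def footer_style(lines, min_share=0.8):
    """Return the style of the lowest line per page, if it repeats.

    The lowest line of every page is collected. If one style accounts for
    at least `min_share` of the pages, it is reported as the footer style;
    otherwise None is returned.
    """
    lowest = {}
    for page, y, font, size, _text in lines:
        if page not in lowest or y < lowest[page][0]:
            lowest[page] = (y, font, size)
    styles = Counter((font, size) for _y, font, size in lowest.values())
    style, count = styles.most_common(1)[0]
    return style if count / len(lowest) >= min_share else None


print(body_style(LINES))
print(footer_style(LINES))
```

Picking the highest line of each page instead of the lowest would give the header variant from idea 6). The 0.8 threshold is arbitrary and would need tuning on real documents.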


Some similar work and papers

A Chinese patent on this:

A CH team's work, called Xed:



An HP paper:

A text classification algorithm that follows a multi-pass sieve framework to automatically classify PDF text snippets (for brevity, texts) into TITLE, ABSTRACT, BODYTEXT, SEMISTRUCTURE, and METADATA categories.
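To make the idea of a multi-pass sieve concrete, here is a minimal sketch of the general pattern: passes run in a fixed order, from the most precise rule to the most general, and each pass only labels snippets that earlier passes left unlabelled. The rules themselves (largest font on page 1, a leading "Abstract", and so on) and the snippet fields are invented for this illustration. They are not the rules from the HP paper.

```python
import re


def pass_title(snippets, labels):
    """Label the largest-font unlabelled snippet on page 1 as TITLE."""
    candidates = [i for i, s in enumerate(snippets)
                  if labels[i] is None and s["page"] == 1]
    if candidates:
        labels[max(candidates, key=lambda i: snippets[i]["size"])] = "TITLE"


def label_where(predicate, label):
    """Build a pass that labels every still-unlabelled matching snippet."""
    def run(snippets, labels):
        for i, s in enumerate(snippets):
            if labels[i] is None and predicate(s["text"]):
                labels[i] = label
    return run


# Ordered from most precise to most general (hypothetical rules).
PASSES = [
    pass_title,
    label_where(lambda t: t.lower().startswith("abstract"), "ABSTRACT"),
    label_where(lambda t: re.search(r"@|\bdoi\b|copyright", t, re.I) is not None,
                "METADATA"),
    label_where(lambda t: re.match(r"(table|figure)\s+\d+", t, re.I) is not None,
                "SEMISTRUCTURE"),
    label_where(lambda t: True, "BODYTEXT"),  # catch-all final pass
]


def classify(snippets):
    """Run every pass in order and return one label per snippet."""
    labels = [None] * len(snippets)
    for run_pass in PASSES:
        run_pass(snippets, labels)
    return labels


SAMPLE = [
    {"text": "Sieve Classification of PDF Text", "size": 18, "page": 1},
    {"text": "alice@example.com", "size": 9, "page": 1},
    {"text": "Abstract. We classify snippets.", "size": 10, "page": 1},
    {"text": "Table 1: Results", "size": 9, "page": 2},
    {"text": "The method works in passes.", "size": 10, "page": 2},
]
print(classify(SAMPLE))
```

The useful property of this layout is that a new rule can be added as one more pass without touching the others, as long as it is placed at the right precision level.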


Tools for PDF text extraction

PDF Tools list

JPedal (commercial software)
https://www.snowtide.com/ (commercial software)
iText (commercial software)

Apache Tika






1. Tool comparison and benchmarks


PDF format

2. How to remove the header and footer from PDF


3. Finding paragraphs in text



4. Find sentences


7. Extracting structure from PDF



How to build and run the CMU Olympus-Ravenclaw dialog system framework – 3

How does the Olympus system work? This is a summary written after reading some of the Olympus code.

1. How is the system started?

Open your SystemRun.bat; it runs this line:

START "" /DConfigurations\%RunTimeConfig% "%OLYMPUS_ROOT%\Agents\Pythia\dist\process_monitor.exe" %RunTimeStartList%.config

For details on the startlist.config file and Pythia's process_monitor.exe (MITRE in the old project), read this page:


Pythia is a Windows process manager that controls the starting and stopping of many processes. Written in Perl and built as process_monitor.exe, it reads the startlist.config file to control each process of the system.

2. How is each module started?

Pythia starts many processes that communicate with each other. Each process acts as one module of the system, such as ASR or TTS, and together they make the whole system work.

TTYRecognitionServer is the module that interfaces with the terminal for audio and keyboard input. Pythia reads the file ttyserver_start.config to start this process, and runs this command as one process:

--start - tells Pythia to start it
--input_line - tells Pythia to open an input box for it on the GUI.

3. What is the HUB?

What each process does, and how the processes communicate with each other, is the important question now. Each process is called a server, and there is a HUB that links all the servers together:


So what is a server?

4. So how do the hub and the servers exchange data?

A .pgm file defines the information for all servers: name, port, rules and so on. The hub reads this file and then links all the servers so they can exchange data.
Rules in programs (like main) tell Galaxy what the Hub should do when it gets a certain message.
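The hub-and-rules idea can be illustrated with a toy model. This is only a sketch of the routing concept: the class, the server names, the operations and the rule conditions below are all hypothetical, and none of it is the Galaxy API or the .pgm file syntax.

```python
class Hub:
    """Toy hub: routes a frame (a dict) between servers according to rules."""

    def __init__(self, servers, rules):
        self.servers = servers  # server name -> {operation name: function}
        self.rules = rules      # ordered list of (condition, server, operation)

    def step(self, frame):
        """Apply the first rule whose condition matches, or return None."""
        for condition, server, operation in self.rules:
            if condition(frame):
                return self.servers[server][operation](frame)
        return None

    def run(self, frame, max_steps=10):
        """Keep routing the frame until no rule fires."""
        for _ in range(max_steps):
            result = self.step(frame)
            if result is None:
                break
            frame = result
        return frame


def parse(frame):
    """Hypothetical parser server: split the input text into words."""
    return {**frame, "words": frame["text"].lower().split()}


def respond(frame):
    """Hypothetical dialog server: build a reply from the parsed words."""
    return {**frame, "reply": "You said %d words" % len(frame["words"])}


SERVERS = {"parser": {"parse": parse}, "dialog": {"respond": respond}}
RULES = [
    (lambda f: "text" in f and "words" not in f, "parser", "parse"),
    (lambda f: "words" in f and "reply" not in f, "dialog", "respond"),
]

hub = Hub(SERVERS, RULES)
print(hub.run({"text": "Hello Olympus"})["reply"])
```

The servers never call each other directly. Every message goes back to the hub, and the rules decide where it goes next, which is the point of the design.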



Now you have the basic structure of the whole system.



5. So how is a task organized in the dialog system (RavenClaw)?

RavenClaw uses a tree to define the relations between tasks, as in this example:


A sub-node under a task is a sub-task; to finish a task, you go through its sub-tasks from left to right. RavenClaw defines a set of macros in C, and the developer needs to use these macros to define the task tree:
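The left-to-right ordering can be shown with a small Python sketch. It is not RavenClaw code and does not use its macros; the `Task` class and the task names are hypothetical, and the sketch ignores anything the real engine does beyond this default ordering.

```python
class Task:
    """A node in a task tree; a node without children is a leaf task."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def execution_order(task):
    """Return the leaf task names in depth-first, left-to-right order."""
    if not task.children:
        return [task.name]
    order = []
    for child in task.children:
        order.extend(execution_order(child))
    return order


# Hypothetical room-booking dialog task.
TREE = Task("RoomBooking", [
    Task("Welcome"),
    Task("GetQuery", [Task("AskDate"), Task("AskTime")]),
    Task("GiveResult"),
])

print(execution_order(TREE))
```

Here the inner nodes only group sub-tasks, and the leaves are the steps that actually run.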

example like this:



How to build and run the CMU Olympus-Ravenclaw dialog system framework – 2

Once Olympus is compiled, it is a platform on which other applications can run, so we can get some examples from this SVN repository:

RavenClaw also has a wiki page with two tutorials here:

So let us look at how to run the applications.

1. Going through the example-systems code, every branches folder contains 2.0, 2.1, 2.5, 2.6 and so on. These folders are the application versions that match each Olympus version. Since the Olympus we compiled is 2.6.1, we only need to look at the 2.6 folders. Only two of the examples have one:

2. Let us try MeetingLine as an example. Build it first:
cd .\MeetingLine\branches\2.6
perl Systembuild.pl

It uses the Olympus code and tools to build this application.

3. After MeetingLine is built, make sure the speakers and microphone are working, and then just run it:
or SystemRun-JavaTTY.bat (This one runs through TTY, no sound I think.)

4. A GUI dashboard should come up, and all modules should be green.

Click “TTYRecognitionServer” and then type init_session into the input box.

The system should greet you by voice. You can then either type into the input box or speak into a connected microphone.

5. Tutorial1 and Tutorial2 can both be run the same way, but they do not have a 2.6 folder under branches.
So I compiled the trunk version against 2.6, and it compiled.
You can try each example with the trunk code to see whether its latest version works with 2.6.

Some notes about Visual Studio versions and the location of MSBuild.exe:
You need this knowledge to debug OlympusBuild.pm.
What the Visual Studio version numbers mean

http://en.wikipedia.org/wiki/Microsoft_Visual_Studio#History – Visual Studio
Visual Studio 6.0 (1998)
Visual Studio .NET (2002) = version 7
Visual Studio .NET 2003 = version 7.1
Visual Studio 2005 = version 8
Visual Studio 2008 = version 9
Visual Studio 2010 = version 10
Visual Studio 2012 = version 11
Visual Studio 2013 = version 12
Visual Studio 2015 = version 14
Visual Studio 2017 = version 15
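When debugging OlympusBuild.pm it helps to have this table as a lookup, since the script works with the internal version number rather than the product year. A small sketch, with the values copied from the list above (the function name is made up for this example):

```python
# Product year -> internal version number, copied from the list above.
VS_INTERNAL_VERSION = {
    2002: "7.0",
    2003: "7.1",
    2005: "8.0",
    2008: "9.0",
    2010: "10.0",
    2012: "11.0",
    2013: "12.0",
    2015: "14.0",
    2017: "15.0",
}


def internal_version(year):
    """Return the internal version string for a Visual Studio product year."""
    try:
        return VS_INTERNAL_VERSION[year]
    except KeyError:
        raise ValueError("unknown Visual Studio year: %r" % (year,))


print(internal_version(2010))
print(internal_version(2012))
```

Note that there is no version 13: Visual Studio 2013 is 12 and Visual Studio 2015 is 14.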


In earlier versions of the .NET Framework, MSBuild was installed with the framework, but it was later moved to ship with Visual Studio or with the BuildTools_Full.exe package.

With the .NET Framework, the path is C:\Windows\Microsoft.NET\Framework[64 or empty]\[framework_version]. With Visual Studio, the path is C:\Program Files (x86)\MSBuild\[version]\Bin for x86 and C:\Program Files (x86)\MSBuild\[version]\Bin\amd64 for x64.

When BuildTools_Full.exe is installed, the path is the same as when MSBuild is installed with Visual Studio.


As of Visual Studio 2013, MSBuild ships with Visual Studio:
C:\Program Files (x86)\MSBuild\14.0\Bin\MSBuild.exe
C:\Program Files (x86)\MSBuild\12.0\Bin\MSBuild.exe


Before that, MSBuild shipped with the .NET Framework, up to version 4.5.1:




On Windows 8.1, mine shows up under:
C:\Program Files (x86)\MSBuild\12.0\Bin
C:\Program Files (x86)\MSBuild\12.0\Bin\amd64


Don’t forget, older MSBuild versions are also updated when newer .NET frameworks are installed. For example, .NET Framework 4.5.1 also updates the .NET Framework 4.0 MSBuild version to 4.0.30319.18408
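The locations above can be checked with a short script. This is a sketch only: it probes just the paths mentioned in these notes (the 14.0 and 12.0 Visual Studio folders, then the v4.0.30319 framework folder) and assumes default install drives. The function names are made up for this example.

```python
import ntpath
import os


def msbuild_candidates(program_files_x86=r"C:\Program Files (x86)",
                       windir=r"C:\Windows"):
    """List the MSBuild.exe locations named in these notes, newest first."""
    paths = [
        ntpath.join(program_files_x86, "MSBuild", version, "Bin", "MSBuild.exe")
        for version in ("14.0", "12.0")
    ]
    # Older MSBuild versions live in the .NET Framework folder.
    paths.append(ntpath.join(windir, "Microsoft.NET", "Framework",
                             "v4.0.30319", "MSBuild.exe"))
    return paths


def find_msbuild(exists=os.path.isfile):
    """Return the first candidate path that exists, or None."""
    for path in msbuild_candidates():
        if exists(path):
            return path
    return None


for candidate in msbuild_candidates():
    print(candidate)
print("found:", find_msbuild())
```

`ntpath` is used instead of `os.path` so the paths are built with backslashes even when the script is read or tested on a non-Windows machine.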


How to build and run the CMU Olympus-Ravenclaw dialog system framework – 1

CMU has developed many speech-related projects, and you can look at them in these two places:
Olympus is one of them.

The Olympus dialog system is a speech system that evolved from the CMU Communicator. The current wiki pages are here:

From the SVN repository and the wiki, Olympus appears to have ended around 2015. According to the SVN history log, most of the Olympus components were written around 2000-2006. Major work on the project stopped in 2010; since then there has been almost no activity at the code level.

The best way to get to know Olympus is to get it running and test it, so here I note down how to make Olympus run.


0. Take some time to read the wiki pages for clues. But if you follow the wiki instructions directly, you will face many difficulties, because the wiki, the code and the environment are all very outdated and mismatched.


1. Set up a Windows 7 computer; try to find an old machine that has Windows 7. Olympus can only run on Windows 7.

2. Download and install TortoiseSVN; for this one you can choose the latest version:

3. Get CMake from https://cmake.org/files/ and use the .exe installer for Windows. I chose version 3.4.0; do not choose the latest version, as it may cause issues.

4. Install ActivePerl from ActiveState; I chose version 5.22.

5. Install Python 2.7

6. Install Java 8 with the NetBeans IDE (it must be the bundle with NetBeans!). You MUST install it in the default folder, because the Olympus scripts later look for NetBeans' Ant only under the default folders:
C:\Program Files…..


7. Install Visual Studio. This is a bit troublesome: I tested VS 2010, and it must have SP1 to work with CMake:


VS 2012 should work better; you will see the reason later.

8. Install Flite TTS. I do not think this step is needed, as Olympus already includes Flite. Maybe only very old versions needed this step.

9. Now all the software is ready, and we need to configure the system environment:

Set these Windows environment variables:
OLYMPUS_ROOT = C:\CMU\olympus\2.6.1
LOGIOS_ROOT = C:\CMU\olympus\2.6.1\Tools\logios

The second one, LOGIOS_ROOT, may no longer be needed.
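A quick way to confirm the two variables are visible to new processes is a small check script. This is a sketch, not part of Olympus, and the function name is made up. The folder check only passes once the code has been checked out, so run it after the checkout step.

```python
import os

REQUIRED = ("OLYMPUS_ROOT", "LOGIOS_ROOT")


def check_env(environ=os.environ, isdir=os.path.isdir):
    """Return a list of problems with the Olympus environment variables."""
    problems = []
    for name in REQUIRED:
        value = environ.get(name)
        if not value:
            problems.append("%s is not set" % name)
        elif not isdir(value):
            problems.append("%s points to a missing folder: %s" % (name, value))
    return problems


for problem in check_env():
    print(problem)
```

An empty result means both variables are set and point to existing folders. Open a new console after changing the variables, since consoles that are already open keep the old values.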

10. Now we can check out the code to C:\CMU\olympus\2.6.1
SVN URL – http://trac.speech.cs.cmu.edu/svn/olympus/tags/2.6.1

I also checked out the examples and tutorials to separate folders.


11. Before starting the build, some Perl modules need to be installed:
ppm install Win32::RunAsAdmin
ppm install Win32::Env

12. Open PowerShell and try to build the source:

cd C:\CMU\olympus\2.6.1

OlympusBuild.bat has a bug in the source; you should change its content to this:
perl Build\OlympusBuild.pl
So I just used OlympusRebuild.bat to build it instead.

12. If you are lucky, your build will succeed. But most likely you will get some errors, as this system is very old and very sensitive to software versions.

Appears successful…

13. As I use VS 2010 (10.0), I need to open the OlympusBuild.pm file and change some text there.

$self->{'BuildSysExe'} = $msbuildpath.'v4.0.30319/'.'MSBuild.exe'; # v 11.0
$vers = '10.0'; # here change 11 to 10 to match my VS version

MSBuild.exe is in the v4.0.30319/ folder for both VS 11 and VS 10, so I need to change $vers in this script to match my VS version 10. This is in fact a bug in this file.

14. OK, now we have Olympus compiled and installed, and the next step is to run some examples to see how it works.