This is a fork of the pdfminer tool, focused on extracting semantic XML from OCR-ed PDFs. It extracts PDF content page by page and marks up words and lines with distinct tags.
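
To make the page/line/word nesting concrete, here is a minimal sketch of what such output could look like. It does not use this fork's API. The tag names (`document`, `page`, `line`, `word`), the `number` attribute, and the `pages_to_xml` helper are all assumptions chosen for illustration, and the input is a hand-written nested list rather than a real PDF. Only Python's standard library is used.

```python
import xml.etree.ElementTree as ET


def pages_to_xml(pages):
    """Serialize pages -> lines -> words into nested XML.

    `pages` is a list of pages; each page is a list of lines;
    each line is a list of word strings. Tag names are hypothetical
    and are not taken from the tool's real output.
    """
    root = ET.Element("document")
    for page_number, lines in enumerate(pages, start=1):
        # One element per page, numbered from 1.
        page_el = ET.SubElement(root, "page", number=str(page_number))
        for line in lines:
            line_el = ET.SubElement(page_el, "line")
            for word in line:
                # Each word gets its own tag inside its line.
                word_el = ET.SubElement(line_el, "word")
                word_el.text = word
    return ET.tostring(root, encoding="unicode")


# Two pages, one line each (made-up sample text).
sample = [[["Hello", "world"]], [["Second", "page"]]]
xml_text = pages_to_xml(sample)
print(xml_text)
```

The real tool may use different element names and is likely to attach further attributes to each element; consult its actual output for the exact schema.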