I'm looking for a PDF library which will allow me to extract the text from a PDF document. I've looked at PyPDF, and this can extract the text from a PDF document very nicely. The problem with this is that if there are tables in the document, the text in the tables is extracted in-line with the rest of the document text. This can be problematic because it produces sections of text that aren't useful and look garbled (for instance, lots of numbers mashed together).
I'm looking for something that's a bit more advanced. I'd like to extract the text from a PDF document, excluding any tables and special formatting. Is there a library out there that does this? Or am I forced to do some post-processing on the output text to get rid of these sections?
2 Answers
You can also take a look at PDFMiner, another PDF parser in Python.
A particular feature of interest in PDFMiner is that you can control how it regroups text parts when extracting them. You do this by specifying the spacing between lines, words, characters, etc. So, maybe by tweaking this you can achieve what you want (that depends on the variability of your documents). PDFMiner can also give you the location of the text on the page, and it can extract data by object ID, among other things. So dig into PDFMiner and be creative!
But your problem is really not an easy one to solve because, in a PDF, the text is not continuous, but made from a lot of small groups of characters positioned absolutely in the page. The focus of PDF is to keep the layout intact. It's not content oriented but presentation oriented.
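To make the point about absolutely positioned character groups concrete, here is a minimal sketch of how such fragments could be regrouped into lines using spacing thresholds. It uses only the standard library and is not PDFMiner's actual algorithm; the function name `group_fragments`, the thresholds `line_tolerance` and `word_gap`, and the fixed `char_width` are all assumptions made up for illustration.

```python
from typing import List, Tuple

# (x, y, text): left edge, baseline height, and content of one positioned fragment.
Fragment = Tuple[float, float, str]


def group_fragments(fragments: List[Fragment],
                    line_tolerance: float = 2.0,
                    word_gap: float = 3.0,
                    char_width: float = 5.0) -> List[str]:
    """Rebuild text lines from absolutely positioned fragments (hypothetical helper)."""
    # PDF y coordinates grow upward, so sort top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Same line if the baseline is close to that of the line's first fragment.
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f[0])
        text = line[0][2]
        for prev, cur in zip(line, line[1:]):
            # Assumed fixed-width glyphs: estimate where the previous fragment ends.
            prev_end = prev[0] + char_width * len(prev[2])
            # Insert a space only when the horizontal gap exceeds word_gap.
            text += (" " if cur[0] - prev_end > word_gap else "") + cur[2]
        result.append(text)
    return result


fragments = [(100, 700, "Hel"), (115, 700.5, "lo"), (130, 700, "world"),
             (100, 680, "Next"), (125, 680, "line")]
print(group_fragments(fragments))               # ['Hello world', 'Next line']
print(group_fragments(fragments, word_gap=10))  # ['Helloworld', 'Nextline']
```

Changing `word_gap` changes where spaces appear, which is why tuning this kind of parameter per document family matters so much.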
Etienne
That's a difficult problem to solve since visually similar PDFs may have a wildly differing structure depending on how they were produced. In the worst case the library would need to basically act like an OCR. On the other hand, the PDF may contain sufficient structure and metadata for easy removal of tables and figures, which the library can be tailored to take advantage of.
I'm pretty sure there are no open source tools which solve your problem for a wide variety of PDFs, but I remember having heard of commercial software claiming to do exactly what you ask for. I'm sure you'll run into them while googling.
akaihola
PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis.
- Webpage: https://euske.github.io/pdfminer/
- Download (PyPI): https://pypi.python.org/pypi/pdfminer/
- Demo WebApp: http://pdf2html.tabesugi.net:8080/
Features
- Written entirely in Python.
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support. (well, almost)
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- Outline (TOC) extraction.
- Tagged contents extraction.
- Automatic layout analysis.
How to Install
Install Python 2.6 or newer. (For Python 3 support have a look at pdfminer.six).
Download the source code.
Unpack it.
Run setup.py:
$ python setup.py install
Do the following test:
$ pdf2txt.py samples/simple1.pdf
For CJK Languages
In order to process CJK languages, do the following before running setup.py install:
On Windows machines which don't have the make command, paste the following commands on a command line prompt:
Command Line Tools
PDFMiner comes with two handy tools: pdf2txt.py and dumppdf.py.
pdf2txt.py
pdf2txt.py extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images, which would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, and writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document which does not have extraction permission.
(For details, refer to the html document.)
dumppdf.py
dumppdf.py dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents (e.g. images).
(For details, refer to the html document.)
API Changes
As of November 2013, a few changes have been made to the PDFMiner API as it existed prior to October 2013. This is the result of code restructuring. Here is a list of the changes:
- PDFDocument class is moved to pdfdocument.py.
- PDFDocument class now takes a PDFParser object as an argument. PDFDocument.set_parser() and PDFParser.set_document() are removed.
- PDFPage class is moved to pdfpage.py
- process_pdf function is implemented as a class method PDFPage.get_pages.
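The restructured API listed above might be used roughly as follows. This is a sketch, not an official example: it assumes a PDFMiner (or pdfminer.six) release that includes these changes, the helper name `iter_text_boxes` is hypothetical, and `PDFPage.create_pages`, `PDFPageAggregator`, `LAParams` and `LTTextBox` are used here as I understand them from the library rather than from the list above.

```python
def iter_text_boxes(path, password=""):
    """Yield (page_number, bbox, text) for each text box in a PDF (hypothetical helper)."""
    # Imports are local so this module still loads where PDFMiner is absent.
    # Module paths follow the API changes listed above.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.converter import PDFPageAggregator
    from pdfminer.layout import LAParams, LTTextBox

    with open(path, "rb") as fp:
        parser = PDFParser(fp)
        # The parser is passed to the constructor; there is no set_parser() call.
        document = PDFDocument(parser, password)
        rsrcmgr = PDFResourceManager()
        # LAParams() with defaults; its spacing parameters can be tuned per document.
        device = PDFPageAggregator(rsrcmgr, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for number, page in enumerate(PDFPage.create_pages(document), start=1):
            interpreter.process_page(page)
            layout = device.get_result()
            for obj in layout:
                # Keep only text boxes; figures, lines and so on are skipped.
                if isinstance(obj, LTTextBox):
                    yield number, obj.bbox, obj.get_text()


if __name__ == "__main__":
    # samples/simple1.pdf is the test file mentioned in the install steps.
    for number, bbox, text in iter_text_boxes("samples/simple1.pdf"):
        print(number, bbox, text.strip())
```

Each yielded bbox is an (x0, y0, x1, y1) tuple in page coordinates, which is the starting point for any attempt to filter out regions such as tables.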
TODO
- Replace STRICT variable with something better.
- Use logging module instead of sys.stderr.
- Proper test cases.
- PEP-8 and PEP-257 conformance.
- Better documentation.
- Crypt stream filter support.
Terms and Conditions
(This is the so-called MIT/X License)
Copyright (c) 2004-2016 Yusuke Shinyama
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the 'Software'), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.