Skip to main content
HomeAbout PythonLearn Python

Reading and Editing PDF’s and Word Documents From Python

This tutorial will allow you to read PDF documents and merge multiple PDF files into one PDF file. It will also show how to read and write word documents from Python.
Feb 2020  · 8 min read

PDF Documents

PDF is a Portable Document Format where it contains texts, images, charts, etc. which is different from plain text files. It is a file that contains the '.pdf.' extension and was invented by Adobe. This type of file is independent of any platforms like software, hardware, and operating systems.

Steps to Install-Package:

  1. You need to install a package named "pypdf2" which can handle the file with '.pdf' extension. install package 1

  2. You can see the 'pypdf2' package is installed and shown below. install package 2

Reading PDF documents and Extracting Data

You will be extracting only the text from the pdf file as PyPDF2 has a limitation when it comes to extracting the rich media content. The logos, pictures, etc. couldn't be extracted from it — the following pdf file needs to be download to work with this tutorial.
Download Pdf file

import pdf

The 'import' statement in the code above gets the PyPDF2 module. You need to use 'open('pdfFileName' , 'openingMode')'where the 'pdfFilename' is 'test.pdf', and the 'openingMode' is 'rb' which is the reading only in binary format.

open pdf

The PyPDF2 has a method as 'PdfFileReader', which takes the newly created object 'pdfFileObject'.You can now access the attribute named 'numPages' from 'pdfFileObject', which gives a total number of the pages.

total number of pages The above output is 1.Since; you can see the pdf file is of only one page.

first page You can use the 'getPage(0)' method inside the pdfReaderObject to get the first page.The result then is stored in the 'firstPageObject' where all the text inside that particular page can be printed out by using the 'extractText()' method.

all text The above code gives all the text from the pdf file. However, the image is not shown in the terminal, which cannot be obtained using pyPDF2.

Merging Multiple PDF Files into one PDF

You will be merging two different pdf files into a single pdf file. The old PDF file is previous that you've worked with, whereas a new PDF file can be downloaded from the following link:
New PDF file.

new pdf

You will be importing the PdfFileMerger module from the PyPDF2 package, which helps to merge the pdf files. The 'path' is specified, which indicates the path for the folder where the file is located. Also, pdf files to merge are included in 'pdf_files' in a list.

merger object You can see the merger object is created using the help of 'PdfFileMerger.' The looping is done for each file in a list where merging is done by passing the path and file to the 'append' method. At last, the final output can be obtained by using 'merger.write()' where the merged content with a new PDF filename is obtained.

merged pdf The above picture indicates a 'merged.pdf,' which consists of the content merged from 'test.pdf' and 'test-1.pdf'.

Word Documents

The Word documents consist of the ".docx" extension at the end of the filename. These documents don't only contain text as in plain text files, but it includes a rich-text document. The rich-text document contains the different structures for the document, which have size, align, color, pictures, font, etc. associated with them.

It would be best if you had an application for working with the Word Documents. The popular application for Windows and Mac Operating systems is Microsoft Word, but it is a paid subscription platform. However, there is a free alternative option like "LibreOffice", which is an application in Linux which comes pre-installed. The applications can be downloaded for Windows and Mac Operating systems. This tutorial will use Microsoft Word in the Windows Operating system.

Steps to Install-Package

  1. You need to install a package named "python-docx" which can handle the word documents of the '.docx' extension. python-docx package

  2. You can see the 'python-docx' package installed and shown below. installed package

You can code along in the interactive shell provided by Python, but it is preferred to use the Text Editor. So, Sublime Text is used for the coding part of this tutorial.

Writing a Word Document

document module imported

You can see above the 'document' module is imported from the 'docx' package in the first line. The code in the second line produces a new word document through the help of the Document object.The filename is saved as 'first.docx' using the 'document.save()'.

Add a heading

add heading

The above code contains a 'Document()'opens a new file and 'document.save('addHeader.docx')' is used to create a newly edited docx file. You can add the heading through method 'add_heading('text,' level=number),' which takes the text as a heading, and the heading level from starts from 0 to 4.

heading levels

The above code gives the output as a newly created 'addedHeader.docx' file where level 0 makes a headline with the horizontal line below the text, whereas the heading level 1 is the main heading. Similarly, the other headings are sub-heading with their's font-sizes in decreasing order.

Add a Paragraph

add a paragraph

The above code contains a 'Document()' which opens a new document file and 'document.save('addParagraph.docx')' is used to create a newly edited docx file. You can add the heading through method 'add_paragraph('text,' style='required_style')' which takes the text and also 'style' is an optional argument where 'List Number' and 'List Bullet' can be used.

newly created file

The above code gives the output as a newly created 'addedParagraph.docx' file where there is a simple paragraph at first line. Similarly, there is a heading, and below it contains an ordered list containing an item with number 1 and 2.Below that there is another heading containing a two item in an unordered bulleted list.

Add a Picture

add a picture

The above code contains a 'Document(),' which creates a new document file, and 'document.save('addPicture.docx')' is used to create a newly edited docx file. You can add the image by using 'add_picture()' which contains the first parameter as 'cat-1.jpeg' is the path for the cat image. The widths and heights are the optional parameters that will be default as '72 dp,' but we have used 'Inches' for our purpose.

added picture

The above code gives the output as a newly created 'addedPicture.docx' file, which contains a cat image where there are 1.25 inches of width and height in the image.

Reading a Word Document

You'll now read a sample word document from Python, and it can be found in:
Download Sample.

reading word document

The first line in the code imports the Document from the 'docx' module, which is used to pass the required document file and to create an object .'obtainText' is a function that receives the file 'fullText.docx.' The looping is done for each paragraph, which is accessed by 'document.paragraphs' and is inserted into an empty list using the 'append' method. Finally, the function returns a list of paragraphs ending with a new line.

plain text

The above output gives the plain text without any style, color, etc. which is not a rich text document.

Congratulations

Congratulation, you've finished reading this tutorial.

If you would like to learn more about importing data in Python, try DataCamp's Introduction to Importing Data in Python course.

Check out our Python Data Structures Tutorial.

You can also look at the following resources to help broaden your knowledge on specific topics.

  1. Documentation of Python Docx
  2. Documentation of PyPDF2

References:
Automate Boring Stuff With Python: Working with Word Documents

Topics

Learn more about Python

Certification available

Course

Introduction to Importing Data in Python

3 hr
278.2K
Learn to import data into Python from various sources, such as Excel, SQL, SAS and right from the web.
See DetailsRight Arrow
Start Course
See MoreRight Arrow
Related

Mastering the Pandas .explode() Method: A Comprehensive Guide

Learn all you need to know about the pandas .explode() method, covering single and multiple columns, handling nested data, and common pitfalls with practical Python code examples.

Adel Nehme

5 min

Python NaN: 4 Ways to Check for Missing Values in Python

Explore 4 ways to detect NaN values in Python, using NumPy and Pandas. Learn key differences between NaN and None to clean and analyze data efficiently.
Adel Nehme's photo

Adel Nehme

5 min

Seaborn Heatmaps: A Guide to Data Visualization

Learn how to create eye-catching Seaborn heatmaps
Joleen Bothma's photo

Joleen Bothma

9 min

Test-Driven Development in Python: A Beginner's Guide

Dive into test-driven development (TDD) with our comprehensive Python tutorial. Learn how to write robust tests before coding with practical examples.
Amina Edmunds's photo

Amina Edmunds

7 min

Exponents in Python: A Comprehensive Guide for Beginners

Master exponents in Python using various methods, from built-in functions to powerful libraries like NumPy, and leverage them in real-world scenarios to gain a deeper understanding.
Satyam Tripathi's photo

Satyam Tripathi

9 min

Python Linked Lists: Tutorial With Examples

Learn everything you need to know about linked lists: when to use them, their types, and implementation in Python.
Natassha Selvaraj's photo

Natassha Selvaraj

9 min

See MoreSee More