Metadata: data about data

In this section, we are going to learn about the pyPdf module, which helps in extracting the metadata from a PDF file. But first, what is metadata? Metadata is data about data: structured information that describes and summarizes the primary data. It holds basic information about your actual data, such as its author or creation date, and helps in finding a particular instance of that data.

Make sure the PDF file from which you want to extract the information is present in your directory.

First, we have to install the pyPdf module, as follows:

pip install pyPdf

Now, we will write a metadata_example.py script and see how to get the metadata information from a PDF file. We are going to write this script in Python 2:

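import pyPdf

def main():
    # Path to the PDF file whose metadata we want to read
    file_name = '/home/student/sample_pdf.pdf'
    # Open the file in binary mode; file() is a Python 2 built-in
    pdfFile = pyPdf.PdfFileReader(file(file_name, 'rb'))
    # getDocumentInfo() returns the document information dictionary
    pdf_data = pdfFile.getDocumentInfo()
    print ("----Metadata of the file----")
    # Print every metadata key together with its value
    for md in pdf_data:
        print (md + ":" + pdf_data[md])

if __name__ == '__main__':
    main()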

Run the script as follows:

student@ubuntu:~$ python metadata_example.py
----Metadata of the file----
/Producer:Acrobat Distiller Command 3.0 for SunOS 4.1.3 and later (SPARC)
/CreationDate:D:19980930143358
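
The /CreationDate value in the output is stored in the PDF date format, D:YYYYMMDDHHmmSS, so D:19980930143358 means 30 September 1998, 14:33:58. The following is a minimal sketch, not part of pyPdf, of how such a string could be turned into a Python datetime. The parse_pdf_date name is a hypothetical helper introduced here, and the sketch assumes that any time zone suffix can be ignored:

```python
import re
from datetime import datetime

# Matches the leading fields of a PDF date string such as "D:19980930143358".
# Only the year is required; every later field is optional.
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
)


def parse_pdf_date(value):
    """Convert a PDF date string to a naive datetime.

    Hypothetical helper: a trailing time zone suffix, if any, is ignored.
    """
    match = _PDF_DATE.match(value.strip())
    if match is None:
        raise ValueError("not a PDF date string: %r" % value)
    year, month, day, hour, minute, second = match.groups()
    return datetime(
        int(year),
        int(month or 1),   # a missing month or day defaults to 1
        int(day or 1),
        int(hour or 0),    # missing time fields default to 0
        int(minute or 0),
        int(second or 0),
    )


if __name__ == "__main__":
    print(parse_pdf_date("D:19980930143358"))  # prints 1998-09-30 14:33:58
```

The function uses only the standard library, so it behaves the same way in Python 2 and Python 3.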

The metadata_example.py script uses the pyPdf module under Python 2. First, we created a file_name variable that stores the path of our PDF. PdfFileReader() then reads the file, which is opened in binary mode, and getDocumentInfo() returns the document information, which we store in the pdf_data variable. Lastly, a for loop prints each metadata key together with its value.
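
pyPdf targets Python 2 and is no longer maintained, so the same steps may need a different package under Python 3. The following is a rough sketch of an equivalent, assuming that the separate pypdf package is installed (pip install pypdf) and using its PdfReader class and metadata property. The read_metadata and format_metadata names are hypothetical helpers introduced here for illustration:

```python
import sys


def format_metadata(info):
    """Return one 'key:value' line per entry of a dict-like metadata object."""
    return ["%s:%s" % (key, info[key]) for key in info]


def read_metadata(path):
    """Read the document information dictionary of the PDF at path.

    Assumption: the third-party pypdf package is installed.
    """
    from pypdf import PdfReader  # imported here so the formatter works without pypdf

    reader = PdfReader(path)
    # metadata is None when the PDF has no document information dictionary
    return reader.metadata or {}


if __name__ == "__main__":
    # Usage: python3 metadata_example3.py /path/to/file.pdf
    if len(sys.argv) > 1:
        print("----Metadata of the file----")
        for line in format_metadata(read_metadata(sys.argv[1])):
            print(line)
```

Keeping the formatting step separate from the reading step means the output format can be checked with an ordinary dictionary, without a PDF file at hand.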
