In this section, we are going to learn about the pyPdf module, which helps in extracting the metadata from a PDF file. But first, what is metadata? Metadata is data about data: structured information that describes and summarizes the primary data. It holds basic information about your actual data and helps in finding a particular instance of it.
First, we have to install the pyPdf module, as follows:
pip install pyPdf
Now, we will write a metadata_example.py script and see how to get the metadata information from a PDF file. We are going to write this script in Python 2:
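import pyPdf

def main():
    # Path of the PDF file whose metadata we want to read
    file_name = '/home/student/sample_pdf.pdf'
    # Open the file in binary mode (file() is a Python 2 built-in)
    pdfFile = pyPdf.PdfFileReader(file(file_name, 'rb'))
    # Dictionary-like object holding the document information
    pdf_data = pdfFile.getDocumentInfo()
    print ("----Metadata of the file----")
    # Print every metadata key together with its value
    for md in pdf_data:
        print (md + ":" + pdf_data[md])

if __name__ == '__main__':
    main()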
Run the script as follows:
student@ubuntu:~$ python metadata_example.py
----Metadata of the file----
/Producer:Acrobat Distiller Command 3.0 for SunOS 4.1.3 and later (SPARC)
/CreationDate:D:19980930143358
In the preceding script, we used the pyPdf module of Python 2. First, we created a file_name variable that stores the path of our PDF. PdfFileReader() reads the file, which we open in binary mode, and getDocumentInfo() returns its document information, which we store in the pdf_data variable. Lastly, we wrote a for loop that prints each metadata key together with its value.
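The /CreationDate value in the output above follows the PDF date format, which starts with D: followed by the year, month, day, hour, minute, and second. As a minimal sketch of how such a value could be turned into a Python datetime object, consider the following. Note that parse_pdf_date is a hypothetical helper name, not part of pyPdf, and it assumes that the full 14-digit timestamp is present; any time zone suffix is simply ignored:

```python
from datetime import datetime

def parse_pdf_date(raw):
    """Convert a PDF date string such as 'D:19980930143358' to a datetime.

    Hypothetical helper for illustration only. It parses the leading
    YYYYMMDDHHmmSS part and ignores any time zone suffix.
    """
    # Drop the optional 'D:' prefix
    value = raw[2:] if raw.startswith("D:") else raw
    # Keep only the 14 digits that make up the date and time
    digits = value[:14]
    return datetime.strptime(digits, "%Y%m%d%H%M%S")

created = parse_pdf_date("D:19980930143358")
print(created.isoformat())
```

In the metadata script, you could pass pdf_data['/CreationDate'] to this function, provided that the key exists in your PDF. Shorter date strings, which the PDF format also allows, would need extra handling.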