Reading parquet files

Apache Parquet is another file format that uses columnar compression for efficient read and write operations. It was designed to be compatible with big data ecosystems such as Hadoop, and it can handle nested data structures and sparsely populated columns. Although parquet and feather share a similar columnar foundation, parquet compresses more aggressively, so a parquet file is typically smaller than its feather equivalent. Parquet chooses an encoding scheme per column based on the column's data type, and applying the most suitable encoding to each column is what makes its compression efficient. Like feather, parquet is a binary file format that works well with all pandas data types and is supported across several languages. Parquet can be used for the long-term storage of data.
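Before reading, we need a parquet file to work with. The following is a minimal sketch of serializing a DataFrame with to_parquet(); the sample data here is assumed for illustration, using the First_Name and Score columns that appear in the reading examples below:

# Hypothetical sample data; any DataFrame would serialize the same way
import pandas as pd

df = pd.DataFrame({
    "First_Name": ["Asha", "Ben", "Carla"],
    "Score": [88, 92, 79],
})
# Writing requires one of the parquet engines (pyarrow here) to be installed
df.to_parquet("sample.parquet", engine="pyarrow")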

The following are some limitations of the parquet file format:

  • While parquet can accept multi-level indices, it requires the index level names to be strings.
  • Python data types such as Period are not supported.
  • Duplicate column names are not supported (a short demonstration follows this list).
  • Categorical objects serialized in a parquet file are deserialized as the object datatype.
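As a quick sketch of the duplicate-column limitation, attempting to serialize a DataFrame with repeated column names fails; the exact exception message varies by engine and version, so the ValueError caught here is an assumption:

# Hypothetical DataFrame with a duplicated column name
import pandas as pd

dup = pd.DataFrame([[1, 2], [3, 4]], columns=["Score", "Score"])
try:
    dup.to_parquet("dup.parquet", engine="pyarrow")
except ValueError as err:
    # The engine rejects duplicate column names at write time
    print(f"Serialization failed: {err}")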

Serialization and deserialization of parquet files in pandas can be performed by either the pyarrow or the fastparquet engine. These two engines have different dependencies, and their behavior is not identical; for example, pyarrow does not support Timedelta.
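If no engine is specified, pandas defaults to engine="auto", which tries pyarrow first and falls back to fastparquet. The sketch below assumes both engines are installed; pinning the engine explicitly makes behavior reproducible across environments:

import pandas as pd

# engine="auto" (the default) resolves to pyarrow if available
df = pd.read_parquet("sample.parquet")
# Pin the engine explicitly when the choice matters
df_fp = pd.read_parquet("sample.parquet", engine="fastparquet")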

Let's read a parquet file using the pyarrow engine:

pd.read_parquet("sample.paraquet",engine='pyarrow')

This results in the following output:

Output of read_parquet

Parquet allows us to select columns when reading a file, which saves time:

pd.read_parquet("sample.paraquet",engine='pyarrow',columns=["First_Name","Score"])

The same works for the fastparquet engine as well:

pd.read_parquet("sample.paraquet",engine='fastparquet')