The PDI MongoDB GridFS Output Step

The BSON document size in MongoDB is limited to 16 MB. To store larger files, or files of different types, you can use GridFS, which splits each file into chunks and stores them as separate documents. In some cases, storing large files in MongoDB may be more efficient than storing them in a filesystem, for example, when the filesystem limits the number of files in a directory, or when you need to access only portions of large files without loading the whole files into memory.
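To make the chunking idea concrete, here is a minimal sketch of the arithmetic behind GridFS. It assumes the default chunk size of 255 KiB (261,120 bytes) documented for current MongoDB drivers; the function names are hypothetical and are not part of any MongoDB API. It only illustrates how a file maps onto chunk documents and why a byte range can be served from a few chunks.

```javascript
// Default GridFS chunk size assumed here: 255 KiB (older drivers used 256 KiB).
const DEFAULT_CHUNK_SIZE = 255 * 1024;

// Number of chunk documents needed to hold a file of `length` bytes.
function chunkCount(length, chunkSize = DEFAULT_CHUNK_SIZE) {
  return Math.ceil(length / chunkSize);
}

// Indexes (the `n` field of chunk documents) of the chunks that cover
// the byte range [start, end) of a file.
function chunksForRange(start, end, chunkSize = DEFAULT_CHUNK_SIZE) {
  if (start < 0 || end <= start) {
    throw new RangeError("expected 0 <= start < end");
  }
  const first = Math.floor(start / chunkSize);
  const last = Math.floor((end - 1) / chunkSize);
  const indexes = [];
  for (let n = first; n <= last; n++) indexes.push(n);
  return indexes;
}

// A 20 MiB file exceeds the 16 MB document limit but fits in 81 chunks.
console.log(chunkCount(20 * 1024 * 1024));
// Reading 1,000 bytes starting at offset 1,000,000 touches a single chunk.
console.log(chunksForRange(1000000, 1001000));
```

Because each chunk is its own document, a reader that needs only part of a file can fetch the chunks covering that range instead of the whole file.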

SPEC INDIA has contributed to the Pentaho community with the MongoDB GridFS Output Step under a GPL license on GitHub at https://github.com/SPECUSA/MongoDBGridfs.

Getting ready

To get ready for this recipe, you will again need to start Spoon, your ETL development environment, and make sure that the MongoDB server is running with the data from the previous chapters.

How to do it…

Perform the following steps to use the MongoDB GridFS Output step:

  1. Let's install the MongoDB GridFS Output step:
    1. From the menu bar of Spoon, select Help and then Marketplace.
    2. A PDI Marketplace popup will show you the list of plugins available for installation. Search for MongoDB in the Detected Plugins field.
    3. Expand the MongoDB GridFS Output Plugin item, as you can see in the following screenshot:
      [Screenshot: the MongoDB GridFS Output Plugin item in the PDI Marketplace]
    4. Click on the Install this plugin button.
    5. Next, click on the OK button in the alert for restarting Spoon.
    6. Restart Spoon.
  2. Let's insert the order.csv file, which is available in the source code of this chapter, into the MongoDB files database:
    1. In Spoon, create a new transformation with the name insert-order.csv-mongodb.ktr.
    2. Select the Design tab in the left-hand-side view.
    3. From the Input category folder, find the Generate Rows step, and drag and drop it into the working area in the right-hand-side view.
    4. Double-click on the step to open the Generate Rows configuration dialog.
    5. Set Step Name to Get order.csv.
    6. Set the Limit field to 1.
    7. In the Fields table, add the filePath field as a String type and set its value to the location of the order.csv source file in your filesystem.
    8. From the Big Data category folder, find the Mongodb GridFS Output step, and drag and drop it into the working area in the right-hand-side view.
    9. Connect the Get order.csv step to the Mongodb GridFS Output step.
    10. Double-click on the step to open the Mongodb GridFS Output configuration dialog.
    11. Set Step Name to Insert order.csv.
    12. Next, set the Database field to files and the GridFS Bucket field to fileBucket.
    13. In the File field, select the filePath option. The configuration should look like what is shown in this screenshot:
      [Screenshot: the Mongodb GridFS Output configuration dialog]
    14. Click on the OK button.
    15. Run the transformation; it should complete successfully. Then, in the MongoDB shell, check that a new database called files exists (show dbs) and switch to it (use files). To check whether the file was inserted, run the following query:
      db.fileBucket.files.find().pretty();
      
    16. The query returns the metadata of the new file. The transformation should look like what is shown here:
    [Screenshot: the finished transformation with the Get order.csv and Insert order.csv steps]
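The check in step 15 can be extended to confirm that the chunks were written as well. The following sketch is meant to be pasted into the MongoDB shell. It assumes the files database and the fileBucket bucket from this recipe, and a shell version that provides countDocuments; summarizeBucket is a hypothetical helper name, not a shell built-in.

```javascript
// Summarize every file stored in a GridFS bucket: name, size in bytes,
// and how many chunk documents back it.
// `database` is a shell database handle, e.g. db.getSiblingDB("files").
function summarizeBucket(database, bucketName) {
  const files = database.getCollection(bucketName + ".files");
  const chunks = database.getCollection(bucketName + ".chunks");
  return files.find().toArray().map(function (file) {
    return {
      filename: file.filename,
      length: file.length,
      chunks: chunks.countDocuments({ files_id: file._id }),
    };
  });
}

// In the MongoDB shell:
// printjson(summarizeBucket(db.getSiblingDB("files"), "fileBucket"));
```

A file whose chunk count is zero while its length is greater than zero would indicate an incomplete upload.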

How it works…

This recipe guides you through inserting a single file into MongoDB GridFS. You can use the same transformation to insert any other file, and as many files as you wish.

Storing entire files in MongoDB isn't a common practice, but in some cases it may be a good option for obtaining scalable storage space through sharding and replication.

Once you understand how GridFS works, a good exercise is to create a transformation that gets the list of all the files available in a particular folder of your filesystem and inserts them into MongoDB.
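Before building that transformation in PDI, it may help to see the same logic outside Spoon. The sketch below is not the plugin's implementation; it is a Node.js approximation under stated assumptions: listFileRows and uploadRows are hypothetical names, the bucket argument is assumed to be a GridFSBucket from the official Node.js driver (npm package mongodb), and the connection URL and folder path in the usage comment are placeholders.

```javascript
const fs = require("fs");
const path = require("path");

// Build one row per regular file in `folder`, mirroring the single
// filePath field that the recipe feeds to the GridFS output step.
function listFileRows(folder) {
  return fs
    .readdirSync(folder, { withFileTypes: true })
    .filter((entry) => entry.isFile())
    .map((entry) => ({ filePath: path.join(folder, entry.name) }))
    .sort((a, b) => a.filePath.localeCompare(b.filePath));
}

// Stream each file into a GridFS bucket, one file at a time.
// `bucket` is assumed to be a GridFSBucket from the "mongodb" package.
async function uploadRows(rows, bucket) {
  for (const row of rows) {
    await new Promise((resolve, reject) => {
      fs.createReadStream(row.filePath)
        .on("error", reject)
        .pipe(bucket.openUploadStream(path.basename(row.filePath)))
        .on("error", reject)
        .on("finish", resolve);
    });
  }
}

// Possible usage inside an async function (URL and folder are placeholders):
// const { MongoClient, GridFSBucket } = require("mongodb");
// const client = await MongoClient.connect("mongodb://localhost:27017");
// const bucket = new GridFSBucket(client.db("files"), { bucketName: "fileBucket" });
// await uploadRows(listFileRows("/path/to/folder"), bucket);
// await client.close();
```

In the PDI version of the exercise, the row-producing half corresponds to a step that lists file names, and the upload half corresponds to the Mongodb GridFS Output step with its File field set to filePath.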
