Chapter 16

What are the benefits of writing tasks rather than using simple scripts?

Scripts are great for simple, one-off jobs. If you have a repetitive task, or a set of tasks that depend on each other, and you need to ensure that none of them runs with a dependency missing or overwrites (or appends to) existing data, then ETL pipelines and tasks are for you. As a bonus, frameworks such as Luigi include a lot of utility code for building pipelines: you won't need to write your own code for writing to S3 or a database, or for parsing command-line arguments.

What is the base element of Luigi jobs?

The base element of Luigi jobs (pipelines) is the Task class. All the business logic of a task needs to be wrapped in the run method. Its output and dependencies are defined within the output and requires methods.

How are DAGs defined in Luigi? What are the benefits of that architecture?

Luigi forms DAGs (pipelines) automatically; there is no need to set them up explicitly. To define a DAG, you run the last task in the pipeline. Luigi checks that task's dependencies and, for any that are not met, checks their dependencies in turn, and so on. Once the queue of tasks to compute is ready, Luigi runs them one by one, starting with the earliest dependency and adding others as their requirements are met.

This allows the pipeline to be flexible and easy to build, one step at a time. If something outside the pipeline needs to depend on it, all it has to do is refer to one of its tasks.
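
The backward walk from the last task can be pictured with a few lines of plain Python. This is a simplified illustration of the idea, not Luigi's actual scheduler; the function name, the dictionary layout, and the task names are all assumptions made for the example.

```python
def resolve_order(last_task, requires, complete):
    """Return the tasks to run, earliest dependency first.

    requires: dict mapping a task name to the names it depends on.
    complete: set of task names whose output already exists.
    """
    order = []
    seen = set()

    def visit(task):
        if task in seen or task in complete:
            return  # already planned, or its output already exists
        seen.add(task)
        for dependency in requires.get(task, []):
            visit(dependency)
        order.append(task)  # appended only after all of its dependencies

    visit(last_task)
    return order


# Hypothetical three-step pipeline: extract -> transform -> load.
pipeline = {"load": ["transform"], "transform": ["extract"], "extract": []}
print(resolve_order("load", pipeline, set()))        # nothing done yet
print(resolve_order("load", pipeline, {"extract"}))  # extract already done
```

Only the last task is named by the caller; everything upstream is discovered from the dependency declarations, and completed steps are skipped.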

How can we parameterize a task?

To parameterize a task, all we need to do is define a class attribute of the luigi.Parameter type, or one of its subclasses. Once set, the parameter can be passed as an argument when the task is instantiated, or on the command line. Parameters can be used to run the task on a specific subset of data, or in a specific mode. For example, you can pass a production flag that directs the dataflow to the production database, or to staging if the flag is not set.

What is the best way to run time-based tasks in bulk?

For time-based jobs, Luigi provides built-in functionality for bulk execution, focused mainly on backfilling. With the built-in RangeDaily utility (or one of the other ranges), you can pass either a start and an end date, or one of those and a number of days to fill. Luigi will automatically spawn and execute the given task for each day in the range. However, there is one caveat: a task can only have one date parameter, which is the one the range will fill in.

How can we schedule a job with Luigi?

Luigi itself does not provide a scheduling mechanism. To schedule a task, use an external tool such as cron. Cron schedules arbitrary commands and is built into macOS and most Linux systems. Windows has similar tools, such as schtasks or PyCron.
