Guidelines for communication

Now that we have covered debugging, monitoring and iterative testing of predictive models, we close with a few notes on communicating results of algorithms to a more general audience.

Translate terms to business values

In this text, we frequently discuss evaluation statistics or coefficients whose interpretations are not immediately obvious, nor is it clear what a difference in their numerical values signifies. What does it mean for a coefficient to be larger or smaller? What does an AUC mean in terms of the customer interactions predicted? In any of these scenarios, it is useful to translate the underlying value into a business metric when explaining its significance to non-technical colleagues: for example, coefficients in a linear model represent the unit change in an outcome (such as revenue) for a 1-unit change in a particular input variable. For transformed variables, it may be useful to relate values such as the log-odds (from logistic regression) to a statement such as doubling the odds of an event. Additionally, as discussed previously, we may need to translate the outcome we predict (such as a cancelation) into a financial amount to make its implication clear. This sort of conversion is useful not only in communicating the impact of a predictive algorithm, but also in clarifying priorities during planning. If the development time for an algorithm (whose cost might be approximated by the salaries of the employees involved) is not offset by the estimated benefit of its performance, this suggests it is not a useful application from a business perspective.
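As a minimal sketch of this kind of translation (the coefficient, revenue, and volume figures below are purely illustrative), we can convert a logistic-regression coefficient into an odds multiplier and a predicted outcome into a dollar amount:

```python
import math

# Hypothetical log-odds coefficient for one extra support ticket (illustrative)
coef = 0.69

# Exponentiating a logistic-regression coefficient gives the multiplicative
# change in the odds of the event per 1-unit change in the input.
odds_multiplier = math.exp(coef)
print(f"Each additional ticket multiplies the cancelation odds by {odds_multiplier:.2f}x")

# Translating predicted cancelations into a financial amount (assumed figures)
monthly_revenue_per_customer = 50.0
predicted_cancelations = 120
revenue_at_risk = predicted_cancelations * monthly_revenue_per_customer
print(f"Estimated monthly revenue at risk: ${revenue_at_risk:,.0f}")
```

Statements such as "each extra ticket roughly doubles the cancelation odds" or "about $6,000 of monthly revenue is at risk" are far easier for non-technical colleagues to act on than the raw coefficient.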

Visualizing results

While not all algorithms we have discussed are amenable to visualization, many have elements that may be plotted for clarity. For example, regression coefficients can be compared using a barplot, and tree models may be represented visually by the branching decision points leading to a particular outcome. Such graphics help to turn inherently mathematical objects into more understandable results as well as provide ongoing insight into the performance of models, as detailed previously.
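As an illustration of the first of these, a coefficient barplot takes only a few lines with matplotlib (the feature names and coefficient values below are made up for the example):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Hypothetical coefficients from a linear model of revenue (illustrative values)
features = ["age", "balance", "tenure"]
coefs = [0.8, -1.2, 2.1]

fig, ax = plt.subplots()
ax.bar(features, coefs)
ax.axhline(0, color="black", linewidth=0.8)  # separate positive and negative effects
ax.set_ylabel("Change in revenue per 1-unit change in input")
ax.set_title("Linear model coefficients")
fig.savefig("coefficients.png")
```

A plot like this conveys at a glance which inputs push the outcome up or down, without requiring the audience to read a table of numbers.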

As a practical example of building such a service, this chapter's case study will walk through the generation of a custom dashboard as an extension of the prediction service we built in Chapter 8, Sharing Models with Prediction Services.

Case Study: building a reporting service

In Chapter 8, Sharing Models with Prediction Services, we created a prediction service that uses MongoDB as a backend database to store model data and predictions. We can use this same database as a source to create a reporting service. Like the separation of concerns between the CherryPy server and the modeling service application that we described in Chapter 8, Sharing Models with Prediction Services, a reporting service can be written without any knowledge of how the information in the database is generated, making it possible to maintain a flexible reporting infrastructure even as the modeling code changes over time. Like the prediction service, our reporting service has a few key components:

  • The server that will receive requests for the output of the reporting service.
  • The reporting application run by the server, which receives requests from the server and routes them to display the correct data.
  • The database from which we retrieve the information required to make a plot.
  • Charting systems that render the plots we are interested in for the end user.

Let us walk through an example of each component, which will illustrate how they fit together.

The report server

Our server code is very similar to the CherryPy server we used in Chapter 8, Sharing Models with Prediction Services.

Note

This example was inspired by the code available at https://github.com/adilmoujahid/DonorsChoose_Visualization.

The only difference is that instead of starting the modelservice application, we use the server to start the reportservice, as you can see in the main method:

>>> if __name__ == "__main__":
…    service = reportservice()
…    run_server(service)

We can test this server by simply running the following on the command line:

python report_server.py

You should see the server begin to log information to the console as we observed previously for the modelserver.

The report application

In the application code, which is also a Flask application like the model service we built in Chapter 8, Sharing Models with Prediction Services, we need a few additional pieces of information that we didn't use previously. The first is a path variable specifying the location of the JavaScript and CSS files that we will need when we construct our charts:

>>> static_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'templates/assets')

We also need to specify where to find the HTML pages containing our charts that we render to the user:

>>> tmpl_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'templates')

When we initialize our application, we will pass both of these as variables to the constructor:

>>> app = Flask(__name__, template_folder=tmpl_dir, static_folder=static_dir)

To return this application when called by the server, we simply return app in the reportservice function:

>>> def reportservice():
…    return app

We now just need to specify the response of the application to requests forwarded by the server. The first is simply to render a page containing our charts:

>>> @app.route("/")
…   def index():
…     return render_template("layouts/hero-thirds/index.html")

The template in this example is taken from https://github.com/keen/dashboards, an open source project that provides reusable templates for generating quick dashboards.

The second route will allow us to retrieve the data we will use to populate the chart. This is not meant to be exposed to the end user (though you would see a text dump of all the JSON documents in our collection if you navigated to this endpoint in your browser); rather, it is used by the client-side JavaScript code to retrieve the information to populate the charts. First, we need to start the MongoDB server in another terminal window using:

> mongod

Next, in our code, we need to specify the MongoDB parameters to use in accessing our data. While we could have passed these as parameters in our URL, for simplicity in this example, we will just hard-code them at the top of the reportservice code to point to the results of bulk scoring the bank dataset we used to train our Spark Logistic Regression Model in Chapter 8, Sharing Models with Prediction Services:

>>> FIELDS = {'score': True, 
…          'value': True, 
…          '_id': False}
… MONGODB_HOST = 'localhost'
… MONGODB_PORT = 27017
… DBS_NAME = 'datasets'
… COLLECTION_NAME = 'bankResults'

Note that we could just as easily have pointed to a remote data source, rather than one running on our machine, by changing the MONGODB_HOST parameter. Recall that when we stored the results of bulk scoring, we saved records with two elements, the score and the original data row. In order to plot our results, we will need to extract the original data row and present it along with the score using the following code:

>>> @app.route("/report_dashboard")
…  def run_report():
…    connection = MongoClient(MONGODB_HOST, MONGODB_PORT)
…    collection = connection[DBS_NAME][COLLECTION_NAME]
…    data = collection.find(projection=FIELDS)
…    records = []
…    for record in data:
…        tmp_record = record['value']
…        tmp_record['score'] = record['score']
…        records.append(tmp_record)
…    records = json.dumps(records, default=json_util.default)
…    connection.close()
…    return records

Now that we have all of our scored records in a single array of json strings, we can plot them using a bit of JavaScript and HTML.
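The flattening step inside run_report can be seen in isolation with stand-in documents (the field names and values below are illustrative): each stored document keeps the original data row under 'value' and the prediction under 'score', and we merge the two into a single flat record for plotting:

```python
import json

# Illustrative stand-ins for documents returned by MongoDB's find()
docs = [
    {"value": {"age": 34, "job": "technician"}, "score": 0.82},
    {"value": {"age": 51, "job": "management"}, "score": 0.17},
]

records = []
for doc in docs:
    flat = dict(doc["value"])     # copy the original data row
    flat["score"] = doc["score"]  # attach the prediction alongside it
    records.append(flat)

# Serialize to the JSON payload the charting code will consume
payload = json.dumps(records)
print(payload)
```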

The visualization layer

The final piece we will need is the client-side JavaScript code used to populate the charts, and some modifications to the index.html file to make use of the charting code. Let us look at each of these in turn.

The chart-generating code is a JavaScript function contained in the file report.js, which you can find under templates/assets/js in the project directory for Chapter 9, Reporting and Testing – Iterating on Analytic Systems. We begin this function by requesting the data we need and waiting for it to be retrieved using the asynchronous d3_queue.queue() function:

>>> d3_queue.queue()     
… .defer(d3.json, "/report_dashboard")
… .await(runReport);

Notice that this URL is the same endpoint that we specified earlier in the report application to retrieve the data from MongoDB. The d3_queue function calls this endpoint and waits for the data to be returned before running the runReport function. While a more extensive discussion is outside the scope of this text, d3_queue is part of the d3 library (https://d3js.org/), a popular visualization framework for the JavaScript language.

Once we have retrieved the data from our database, we need to specify how to plot it using the runReport function. First we will declare the data associated with the function:

>>> function runReport(error, recordsJson) {     
…  var reportData = recordsJson;      
…  var cf = crossfilter(reportData);

Though it will not be apparent until we visually examine the resulting chart, the crossfilter library (http://square.github.io/crossfilter/) allows us to highlight a subset of data in one plot and simultaneously highlight the corresponding data in another plot, even if the dimensions plotted are different. For example, imagine we had a histogram of ages for particular account_ids in our system, and a scatterplot of click-through-rate versus account_id for a particular ad campaign. The Crossfilter function would allow us to select a subset of the scatterplot points using our cursor and, at the same time, filter the histogram to only those ages that correspond to the points we have selected. This kind of filtering is very useful for drilling down on particular sub-segments of data. Next we will generate the dimensions we will use when plotting:

>>>  var ageDim = cf.dimension(function(d) { return d["age"]; });
…  var jobDim = cf.dimension(function(d) { return d["job"]; });
…  var maritalDim = cf.dimension(function(d) { return d["marital"]; });
…  var balanceDim = cf.dimension(function(d) { return d["balance"]; });

Each of these functions takes the input data and returns the requested data field. The dimension contains all the data points in a column and forms the superset from which we will filter when examining subsets of data. Using these dimensions, we construct groups of unique values that we can use, for example, in plotting histograms:

>>>  var ageDimGroup = ageDim.group();
…  var jobDimGroup = jobDim.group();
…  var maritalDimGroup = maritalDim.group();
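The dimension-and-group pattern above can be mimicked in plain Python (with illustrative records) to see what the charts will ultimately count: a dimension selects one field, and its group tallies the unique values of that field:

```python
from collections import Counter

# Illustrative stand-ins for the scored records sent to the browser
records = [
    {"age": 34, "job": "technician"},
    {"age": 51, "job": "management"},
    {"age": 34, "job": "technician"},
]

# A crossfilter dimension selects one field; group() tallies its unique
# values, which is what the row and bar charts ultimately display.
job_counts = Counter(r["job"] for r in records)
print(job_counts["technician"])  # 2
```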

For some of our dimensions, we want to extract values representing the maximum or minimum, which we use in plotting ranges of numerical data:

>>> var minAge = ageDim.bottom(1)[0]["age"];
… var maxAge = ageDim.top(1)[0]["age"];
… var minBalance = balanceDim.bottom(1)[0]["balance"];
… var maxBalance = balanceDim.top(1)[0]["balance"];

Finally, we can specify our chart objects using dc (https://dc-js.github.io/dc.js/), a charting library that uses d3 and crossfilter to create interactive visualizations. The '#' prefix given to each chart constructor specifies the element ID we will use to reference the chart when we insert it into the HTML template later. We construct the charts using the following code:

>>>  var ageChart = dc.barChart("#age-chart"); 	
…  var jobChart = dc.rowChart("#job-chart"); 	
…  var maritalChart = dc.rowChart("#marital-chart"); 	

Next, we specify the dimensions and axes of these charts:

>>>  ageChart
…  .width(750)     
…  .height(210)     
…  .dimension(ageDim)     
…  .group(ageDimGroup)     
…  .x(d3_scale.scaleLinear()
…  .domain([minAge, maxAge]))
…  .xAxis().ticks(4);     	

>>>  jobChart     
…  .width(375)     
…  .height(210)     	
…  .dimension(jobDim)     
…  .group(jobDimGroup)     
…  .xAxis().ticks(4);

We just need a call to render in order to display the result:

>>>  dc.renderAll();

Finally, we need to modify our index.html file in order to display our charts. If you open this file in a text editor, you will notice several places where we have a <div> tag such as:

>>>  <div class="chart-stage">
…       </div>

This is where we need to place our charts using the following IDs that we specified in the preceding JavaScript code:

>>>  <div id="age-chart">
…         </div>

Finally, in order to render the charts, we need to include our javascript code in the <script> arguments at the bottom of the HTML document:

>>> <script type="text/javascript" src="../../assets/js/report.js"></script>

Navigating to the URL to which the CherryPy server points, localhost:5000, should now display the charts like this:

Crossfilter chart highlighting other dimensions for a subset of users in a given age range

The data is drawn from the bank default example we used to train our model service in Chapter 8, Sharing Models with Prediction Services. You can see that by selecting a subset of data points in the age distribution, we highlight the distribution of occupations, bank balances, and education levels for these same users. This kind of visualization is very useful for drill-down diagnosis of problem points (as may be the case, for example, if a subset of data points is poorly classified by a model). Using these few basic ingredients, you can now not only scale model training using the prediction service in Chapter 8, Sharing Models with Prediction Services, but also visualize its behavior for end users using a reporting layer.
